
Matrices in Data Mining


Row percent (R) and column percent (C) for a perfect forecast (Forecast = Y target = 0, 1, 1, 1; 4 cases in total). R is the row percent, the share of the actual (Fact) class; C is the column percent, the share of that forecast:

                    Forecast 0 (1 case)            Forecast 1 (3 cases)
Fact 0 (1 case)     1 correct   R 100%   C 100%    0 false     R 0%     C 0%
Fact 1 (3 cases)    0 false     R 0%     C 0%      3 correct   R 100%   C 100%


A confusion matrix (general definition and the two-class case: FP, FN, TP (recall), TN, AC (accuracy), P (precision)) and the ROC curve. Analyzing data mining algorithms by building curves that represent pairs of components of the confusion matrix. A graph with x = Precision and y = Accuracy, and examples.

The point (0,1) is the perfect classifier: it classifies all positive and all negative cases correctly (FP rate 0, TP rate 1).


The point (0,0) represents a classifier that predicts all cases to be negative, while the point (1,1) corresponds to a classifier that predicts every case to be positive.
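To make the corner points concrete, here is a minimal Python sketch (reusing the 4-record example that appears later in these notes) that computes the (FP rate, TP rate) point for an all-negative and an all-positive classifier:

    def roc_point(actual, predicted):
        # (FP rate, TP rate) for binary labels, 1 = positive.
        tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
        fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
        pos = sum(actual)
        neg = len(actual) - pos
        return fp / neg, tp / pos

    actual = [0, 1, 1, 1]
    print(roc_point(actual, [0, 0, 0, 0]))  # all negative -> (0.0, 0.0)
    print(roc_point(actual, [1, 1, 1, 1]))  # all positive -> (1.0, 1.0)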

ROC curves provide a visual tool for examining the tradeoff provided by a classifier between the number of
◦ correctly identified positive cases, and
◦ incorrectly classified negative cases.


ROC graphs are another way besides confusion matrices to examine the performance of classifiers (Swets, 1988). An ROC graph is a plot with the false positive rate on the X axis and the true positive rate on the Y axis.
The two-class confusion matrix (the entries a, b, c, d are defined later in these notes):

                      Predicted
                  Negative   Positive
Actual Negative      a          b
Actual Positive      c          d


Implement a two-class confusion matrix in Excel, computing FP, FN, TP (recall), TN, AC (accuracy), and P (precision) on simulated or real data of your choice, with a total of 15 data records.
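For reference, a hedged Python sketch of the same computation (the assignment itself asks for Excel); the 15 labels and forecasts below are made up for illustration:

    # 15 simulated records: 1 = positive, 0 = negative.
    actual    = [1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1]
    predicted = [1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1]

    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

    accuracy  = (tp + tn) / len(actual)   # AC
    precision = tp / (tp + fp)            # P
    recall    = tp / (tp + fn)            # TP rate
    print(tp, tn, fp, fn, accuracy, precision, recall)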




Another way of comparing ROC points is the Euclidean distance from the perfect classifier, the point (0,1) on the graph. The Euclidean distance can be replaced by a weighted Euclidean distance if relative misclassification costs are known. The distance is equal to zero only if all cases are classified correctly.
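A minimal sketch of this distance; the weights below are illustrative assumptions standing in for real misclassification costs:

    from math import sqrt

    def distance_to_perfect(fp_rate, tp_rate, w_fp=1.0, w_fn=1.0):
        # Weighted Euclidean distance from the perfect classifier at (0, 1).
        return sqrt(w_fp * fp_rate ** 2 + w_fn * (1.0 - tp_rate) ** 2)

    print(distance_to_perfect(0.0, 1.0))  # 0.0: the perfect classifier
    print(distance_to_perfect(0.1, 0.8))  # smaller is better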



Of the 8 actual cats, the system predicted three to be dogs; of the six actual dogs, it predicted one to be a rabbit and two to be cats. The system cannot distinguish cats from dogs well, but it distinguishes rabbits from the other animals well.



A confusion matrix is a visualization tool typically used in supervised learning (in unsupervised learning it is typically called a matching matrix). Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. One benefit of a confusion matrix is that it is easy to see if the system is confusing two classes (i.e. commonly mislabeling one as another).
◦ See sheet 3 in the file “Weight-Height data.xlsx”

This confusion matrix will be a part of the experimental section of your final report.




Analyzing data mining algorithms by building curves that represent pairs of components of the confusion matrix. A graph with x = Precision and y = Accuracy, and examples. More on ROC: X = FP rate, Y = TP rate.
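To illustrate building such a curve, the sketch below sweeps a threshold over hypothetical classifier scores and collects a (precision, accuracy) pair at each threshold; all data here are made up:

    actual = [0, 0, 1, 0, 1, 1, 1, 0, 1, 1]                       # made-up labels
    scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]  # made-up scores

    for t in (0.25, 0.45, 0.65, 0.85):
        pred = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for p, a in zip(pred, actual) if p == 1 and a == 1)
        fp = sum(1 for p, a in zip(pred, actual) if p == 1 and a == 0)
        correct = sum(1 for p, a in zip(pred, actual) if p == a)
        precision = tp / (tp + fp) if tp + fp else 1.0
        print(f"t={t}: precision {precision:.2f}, accuracy {correct / len(actual):.2f}")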
Example confusion matrix (rows = actual class, columns = predicted class):

            Cat   Dog   Rabbit
Cat          5     3      0
Dog          2     3      1
Rabbit       0     2     11
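A small sketch that reproduces this matrix as a nested dictionary (rows = actual class) and reads off how many instances of each actual class are classified correctly:

    labels = ["Cat", "Dog", "Rabbit"]
    cm = {
        "Cat":    {"Cat": 5, "Dog": 3, "Rabbit": 0},   # 8 actual cats
        "Dog":    {"Cat": 2, "Dog": 3, "Rabbit": 1},   # 6 actual dogs
        "Rabbit": {"Cat": 0, "Dog": 2, "Rabbit": 11},  # 13 actual rabbits
    }
    for actual in labels:
        row_total = sum(cm[actual].values())
        print(f"{actual}: {cm[actual][actual]}/{row_total} correct")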
A 4-record example (X1 is the input, Y target the actual class, Forecast the prediction):

X1         0   0   1   1
Y target   0   1   1   1
Forecast   0   0   1   1
Row and column percents for this forecast (R = row percent, the share of the actual (Fact) class; C = column percent, the share of that forecast):

                    Forecast 0 (2 cases)           Forecast 1 (2 cases)
Fact 0 (1 case)     1 correct   R 100%   C 50%     0 false     R 0%      C 0%
Fact 1 (3 cases)    1 false     R 33.3%  C 50%     2 correct   R 66.7%   C 100%
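The same percents can be recomputed with a short sketch:

    y        = [0, 1, 1, 1]   # Y target (Fact)
    forecast = [0, 0, 1, 1]

    counts = {(fact, fc): 0 for fact in (0, 1) for fc in (0, 1)}
    for fact, fc in zip(y, forecast):
        counts[(fact, fc)] += 1

    for (fact, fc), n in sorted(counts.items()):
        r_pct = 100 * n / y.count(fact)        # R: share of the actual class
        c_pct = 100 * n / forecast.count(fc)   # C: share of that forecast
        print(f"Fact {fact}, Forecast {fc}: {n} case(s), R {r_pct:.1f}%, C {c_pct:.1f}%")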
Points on an ROC curve for a varying classifier parameter a (presumably the discrimination threshold), with FP = false positive rate and TP = true positive rate:

  a      FP     TP
  0      0      0
  0.1    0.1    0.2
  0.9    0.87   0.99
  1      1      1
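Treating these (FP, TP) pairs as points on an ROC curve, a quick sketch estimates the area under the curve with the trapezoid rule:

    points = [(0.0, 0.0), (0.1, 0.2), (0.87, 0.99), (1.0, 1.0)]  # (FP, TP)

    auc = sum((x2 - x1) * (y1 + y2) / 2
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    print(f"AUC is roughly {auc:.3f}")  # about 0.60 for these points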



A non-parametric classifier produces a single ROC point, corresponding to its (FP, TP) pair. The figure shows an example of an ROC graph with two ROC curves, C1 and C2, and two ROC points, P1 and P2.

An ROC curve does not take error costs into account. An ROC graph contains all the information in the confusion matrix, since
◦ FN is the complement of TP, and
◦ TN is the complement of FP.
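In counts: with P actual positives and N actual negatives, FN = P - TP and TN = N - FP, so an ROC point (together with P and N) determines the full matrix. A quick check on the 4-record example from these notes:

    P, N = 3, 1      # actual positives and negatives in the 4-record example
    TP, FP = 2, 0    # counts for the forecast 0, 0, 1, 1
    FN, TN = P - TP, N - FP
    print(FN, TN)    # -> 1 1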




A confusion matrix (Kohavi and Provost, 1998) contains information about actual and predicted classifications done by a classification system. The performance of such systems is commonly evaluated using the data in the matrix. The two-class confusion matrix shown earlier (with entries a, b, c, d) illustrates this. The entries have the following meaning in the context of our study: a is the number of correct predictions that an instance is negative, b is the number of incorrect predictions that an instance is positive, c is the number of incorrect predictions that an instance is negative, and d is the number of correct predictions that an instance is positive.
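From these entries the standard measures follow directly; a short sketch with illustrative counts:

    # a = TN, b = FP, c = FN, d = TP, per the definitions above.
    a, b, c, d = 4, 2, 2, 7   # illustrative counts only

    AC        = (a + d) / (a + b + c + d)   # accuracy
    TP_rate   = d / (c + d)                 # recall
    FP_rate   = b / (a + b)
    precision = d / (b + d)
    print(AC, TP_rate, FP_rate, precision)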