Classifier uncertainty: evidence, potential impact, and probabilistic treatment

Classifiers are often tested on relatively small data sets, which should lead to uncertain performance metrics. Nevertheless, these metrics are usually taken at face value. We present an approach to quantify the uncertainty of classification performance metrics, based on a probability model of the confusion matrix. Application of our approach to classifiers from the scientific literature and a classification competition shows that these uncertainties can be surprisingly large and can limit performance evaluation. In fact, the reported performance of some published classifiers may be misleading. The application of our approach is simple and requires only the confusion matrix; it is agnostic to the underlying classifier. Our method can also be used to estimate the sample size needed to achieve a desired precision of a performance metric.
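As a minimal sketch of the idea, the following code draws the posterior of the accuracy implied by a single confusion matrix. It assumes a flat Dirichlet(1, 1, 1, 1) prior, uses illustrative CM values, and reports a central 95% interval as a simple stand-in for the highest posterior density interval used in the paper.

```python
# Posterior uncertainty of accuracy from one confusion matrix (sketch).
# Flat Dirichlet(1, 1, 1, 1) prior; CM values are illustrative.
import numpy as np

rng = np.random.default_rng(0)

cm = np.array([10, 0, 3, 1])                 # (TP, FN, FP, TN), illustrative
theta = rng.dirichlet(cm + 1, size=100_000)  # posterior draws of CM proportions

acc = theta[:, 0] + theta[:, 3]              # ACC = theta_TP + theta_TN
lo, hi = np.percentile(acc, [2.5, 97.5])
print(f"ACC = {acc.mean():.2f}, 95% interval [{lo:.2f}, {hi:.2f}], length {hi - lo:.2f}")
```

Even with 11 of 14 samples classified correctly, the interval length is on the order of 0.4, illustrating how uncertain metrics from small test sets are.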

performs better than random guessing for this specific case (Fisher, 1922). It remains popular, yet the underlying assumption is usually violated (McElreath, 2018; Gelman, 2003).

A fixed φ and an unspecified marginal on the predicted labels is more common. For instance, in a controlled study, test sets may be curated to include 50% patients suffering from a disease and 50% healthy subjects in a control group. In this example there is no uncertainty in φ; it is fixed at φ = 0.5. If φ in the test set is not deliberately chosen before its compilation, it must be determined from the data set. For small sample sizes, φ is then as uncertain as all other metrics. In the present study, we infer φ from the CM, but our method also copes with fixed φ.

[Table residue: confusion matrices collected from the literature. Recoverable rows (source table; CM entries; N; DOI): Table 2; 5, 0, 3, 0; N = 8 · Table 3; 10, 0, 3, 1; N = 14; 10.1021/ci200579f · Table 5; 6, 0, 7, 1; N = 14; 10.1021/ci020045 · a further row cites 10.1155/2015/485864.]

Here α_0 = ∑_k α_k denotes the total count of the Dirichlet distribution over θ, whose variance is

Var[θ_i] = (α_i/α_0)(1 − α_i/α_0) / (1 + α_0).

The expected value and variance of the entry V_i of a confusion matrix generated by a multinomial distribution with θ drawn from this Dirichlet (a Dirichlet-multinomial model) are

E[V_i] = N α_i/α_0,
Var[V_i] = N (α_i/α_0)(1 − α_i/α_0) (N + α_0)/(1 + α_0).

From this, we can calculate the expected value and variance of the proportion V_i/N:

E[V_i/N] = α_i/α_0,
Var[V_i/N] = (α_i/α_0)(1 − α_i/α_0) (N + α_0)/(N (1 + α_0)).

In Caelen's approach, N ≈ α_0, so (N + α_0)/(N (1 + α_0)) ≈ 2/(1 + α_0) and the variance will be overestimated by approximately a factor of two. Since the variances of the V_i/N are overestimated w.r.t. the θ_i, the same holds for V/N w.r.t. θ and for metrics calculated on V/N and θ, respectively. If N were increased beyond α_0, Var[V_i/N] would converge towards the true variance Var[θ_i].
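The factor-of-two overestimation can be checked numerically. In the sketch below (the α values are illustrative), θ is drawn from a Dirichlet distribution, one confusion matrix of size N = α_0 is generated per draw, and the variance of each proportion V_i/N is compared with that of θ_i.

```python
# Numerical check: with N = alpha_0, the variance of V_i/N under the
# Dirichlet-multinomial model is about (N + alpha_0)/N = 2 times Var[theta_i].
# The alpha values are illustrative.
import numpy as np

rng = np.random.default_rng(1)

alpha = np.array([10.0, 1.0, 3.0, 2.0])  # illustrative Dirichlet counts
N = int(alpha.sum())                     # Caelen's choice: N ~ alpha_0

theta = rng.dirichlet(alpha, size=50_000)             # draws of theta
V = np.array([rng.multinomial(N, p) for p in theta])  # one CM per draw

print((V / N).var(axis=0) / theta.var(axis=0))        # each entry close to 2
```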
For a normal distribution, approximately 95% of the density lies within two standard deviations σ of the mean. Therefore, the length of the 95% highest posterior density interval will be close to 4σ. According to the central limit theorem, beta distributions behave for large sample sizes like normal distributions. The standard deviation σ of a beta distribution is given by

σ = √( αβ / ((α + β)² (α + β + 1)) ), (S10)

where α and β are the counts of observations per class, and the meaning of "class" depends on the studied metric. As discussed in the main text, if one is looking at accuracy (ACC), α denotes correct classifications (TP + TN) and β denotes wrong classifications (FP + FN). In the case of the TPR, α counts the number of true positives (TPs), whereas β counts the number of false negatives (FNs).
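This mapping, together with the 4σ approximation of Equation S10, can be illustrated as follows. The helper function, the use of the raw counts without an explicit prior, and the CM values are assumptions for the sketch, and a central 95% interval stands in for the HPD interval.

```python
# Map a confusion matrix to Beta counts (alpha, beta) for a metric and
# compare 4*sigma (Equation S10) with the length of a central 95% interval.
from scipy.stats import beta

def metric_counts(tp, fn, fp, tn, metric="ACC"):
    """Counts of 'correct' (alpha) and 'wrong' (beta) cases for a metric."""
    if metric == "ACC":          # alpha: TP + TN, beta: FP + FN
        return tp + tn, fp + fn
    if metric == "TPR":          # alpha: TP, beta: FN
        return tp, fn
    raise ValueError(metric)

a, b = metric_counts(tp=40, fn=5, fp=3, tn=2, metric="ACC")  # illustrative CM
sigma = (a * b / ((a + b) ** 2 * (a + b + 1))) ** 0.5        # Equation S10
lo, hi = beta(a, b).ppf([0.025, 0.975])
print(f"4*sigma = {4 * sigma:.3f}, interval length = {hi - lo:.3f}")
```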

To make the dependency on the sample size N explicit, we express α as a · N and β as b · N with fractions a = α/N and b = β/N of the two classes. Substituting into Equation S10 gives

σ = √( (aN)(bN) / ((aN + bN)² (aN + bN + 1)) ). (S13)

Since α + β = N, we know that a + b = 1. Now we can simplify Equation S13 to

σ = √( ab / (N + 1) ). (S14)

For large N, this approximates to σ ≈ √(ab/N). In the main text, we have defined the metric uncertainty (MU) as the length of the 95% highest posterior density interval. Because ab attains its maximum of 1/4 at a = b = 0.5, the upper limit of the MU can be approximated as 4σ ≈ 2/√N. If one cannot reject the possibility that a = b = 0.5, one will need N = 4/MU² samples to obtain the desired MU.
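This yields a simple sample-size rule, sketched below (the function name and the target values are illustrative):

```python
# Worst-case (a = b = 0.5) number of test samples for a target metric
# uncertainty, from MU <= 2 / sqrt(N), i.e. N = 4 / MU^2.
import math

def required_samples(target_mu: float) -> int:
    return math.ceil(4 / target_mu ** 2)

for mu in (0.10, 0.05, 0.01):
    print(f"MU <= {mu:.2f} requires N >= {required_samples(mu)}")
```

For example, pinning a metric down to an uncertainty of 0.05 already requires 1,600 test samples in the worst case.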