Evaluation of binary classifiers

Different fields favor different metrics: in medicine, sensitivity and specificity are often used, while in computer science precision and recall are preferred.

One then evaluates the classifier relative to the gold standard by computing summary statistics of these four numbers (the true positives, false positives, false negatives, and true negatives).

The basic marginal ratio statistics are obtained by dividing the 2×2=4 values in the table by the marginal totals (either rows or columns), yielding 2 auxiliary 2×2 tables, for a total of 8 ratios.
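As an illustrative sketch (the function name and the example counts below are invented for illustration), the eight ratios can be computed in Python directly from the four cells:

```python
def marginal_ratios(tp, fp, fn, tn):
    """Eight basic marginal ratios of a 2x2 confusion matrix.

    Row-wise (actual condition) ratios: TPR, FNR, FPR, TNR.
    Column-wise (test outcome) ratios: PPV, FDR, FOR, NPV.
    """
    pos, neg = tp + fn, fp + tn              # row totals (actual condition)
    pred_pos, pred_neg = tp + fp, fn + tn    # column totals (test outcome)
    return {
        "TPR (sensitivity, recall)": tp / pos,
        "FNR (miss rate)": fn / pos,
        "FPR (fall-out)": fp / neg,
        "TNR (specificity)": tn / neg,
        "PPV (precision)": tp / pred_pos,
        "FDR": fp / pred_pos,
        "FOR": fn / pred_neg,
        "NPV": tn / pred_neg,
    }

# Hypothetical counts: 20 TP, 10 FP, 5 FN, 65 TN
for name, value in marginal_ratios(20, 10, 5, 65).items():
    print(f"{name}: {value:.3f}")
```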

The contingency table and the most common derived ratios are summarized below; the details are discussed in what follows.

Note that the rows correspond to the condition actually being positive or negative (or classified as such by the gold standard), as indicated by the color-coding, and the associated statistics are prevalence-independent. The columns correspond to the test being positive or negative, and the associated statistics are prevalence-dependent.

As with sensitivity, specificity can be viewed as the probability that the test result is negative given that the patient is not sick.

The relationship between sensitivity and specificity, as well as the performance of the classifier, can be visualized and studied using the Receiver Operating Characteristic (ROC) curve.
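The following minimal Python sketch traces the points of a ROC curve by sweeping the decision threshold over a classifier's scores; it assumes real-valued scores and 0/1 labels, and it does not treat tied scores specially:

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by lowering the decision threshold
    one sample at a time; labels are 1 for positive, 0 for negative."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])  # descending score
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

print(roc_points([0.9, 0.8, 0.6, 0.4, 0.2], [1, 1, 0, 1, 0]))
```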

Because hCG can also be produced by a tumor, the specificity of modern pregnancy tests cannot be 100% (because false positives are possible).

Also, because hCG is present in the urine in only very small concentrations after fertilization and during early embryogenesis, the sensitivity of modern pregnancy tests cannot be 100% (because false negatives are possible).

If the prevalence, sensitivity, and specificity are known, the positive predictive value can be obtained from the following identity:

PPV = (sensitivity × prevalence) / (sensitivity × prevalence + (1 − specificity) × (1 − prevalence))

Likewise, if the prevalence, sensitivity, and specificity are known, the negative predictive value can be obtained from the following identity:

NPV = (specificity × (1 − prevalence)) / (specificity × (1 − prevalence) + (1 − sensitivity) × prevalence)
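A minimal sketch of these two identities in Python, with invented numbers describing a rare condition (1% prevalence) and a fairly accurate test:

```python
def predictive_values(prevalence, sensitivity, specificity):
    """Positive and negative predictive values via Bayes' theorem."""
    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    npv = (specificity * (1 - prevalence)) / (
        specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)
    return ppv, npv

# With 1% prevalence, 90% sensitivity and 95% specificity,
# the PPV is only about 0.15 even though the test looks accurate.
print(predictive_values(0.01, 0.90, 0.95))
```

The example illustrates why the column-derived statistics are prevalence-dependent: the same test applied to a rarer condition yields a much lower positive predictive value.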

In addition to the paired metrics, there are also unitary metrics that give a single number to evaluate the test. A natural one is the accuracy or fraction correct (FC), the proportion of all cases that are correctly classified. If the accuracy is not known exactly but must be estimated from data, the accuracies of two classifiers can be compared using a pooled two-proportion z-test for H0: p1 = p2.
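A hedged sketch of such a comparison, assuming the two classifiers are evaluated on independent labelled samples (the counts are invented):

```python
from math import erf, sqrt

def two_proportion_z_test(correct1, n1, correct2, n2):
    """Pooled two-proportion z-test for H0: p1 = p2, applied to the
    accuracies of two classifiers; returns the z statistic and a
    two-sided p-value from the standard normal distribution."""
    p1, p2 = correct1 / n1, correct2 / n2
    p_pool = (correct1 + correct2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Classifier A: 870/1000 correct; classifier B: 845/1000 correct
print(two_proportion_z_test(870, 1000, 845, 1000))
```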

Not used very much is the complementary statistic, the fraction incorrect (FiC): FC + FiC = 1, where FiC = (FP + FN) / (TP + TN + FP + FN), i.e. the sum of the antidiagonal divided by the total population.

Cost-weighted fractions incorrect could compare expected costs of misclassification for different methods.
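For instance, a short sketch with invented counts and an assumed 10:1 cost ratio between false negatives and false positives:

```python
def expected_cost(fp, fn, total, cost_fp, cost_fn):
    """Cost-weighted fraction incorrect: average misclassification cost per case."""
    return (cost_fp * fp + cost_fn * fn) / total

# Two hypothetical methods evaluated on the same 1000 cases.
# Method A makes more false positives, method B more false negatives.
print(expected_cost(fp=40, fn=10, total=1000, cost_fp=1, cost_fn=10))  # 0.14
print(expected_cost(fp=15, fn=25, total=1000, cost_fp=1, cost_fn=10))  # 0.265
```

Under this assumed cost ratio, the method with the higher raw error count can still have the lower expected cost.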

There is a one-parameter family of statistics, with parameter β, which determines the relative weights of precision and recall.

The traditional or balanced F-score (F1 score) is the harmonic mean of precision and recall:

F1 = 2 × (precision × recall) / (precision + recall)

F-scores do not take the true negative rate into account and are therefore more suited to information retrieval and information extraction evaluation, where the true negatives are innumerable.
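A short sketch of the balanced F-score and its one-parameter generalization, the Fβ score, in which β determines the weight given to recall relative to precision (the inputs are illustrative):

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 weights recall more heavily, beta < 1 precision.
    beta = 1 gives the balanced F1 score (the harmonic mean)."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.75, 0.60))             # F1 ~ 0.667
print(f_beta(0.75, 0.60, beta=2.0))   # F2 emphasizes recall, ~ 0.625
```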

[13] Cullerne Bown has distinguished three basic approaches to evaluation:
- Mathematical: such as the Matthews correlation coefficient, in which both kinds of error are axiomatically treated as equally problematic;
- Cost-benefit: in which a currency is adopted (e.g. money or quality-adjusted life years) and values are assigned to errors and successes on the basis of empirical measurement;
- Judgemental: in which a human judgement is made about the relative importance of the two kinds of error; typically this starts by adopting a pair of indicators such as sensitivity and specificity, precision and recall, or positive predictive value and negative predictive value.

In the judgemental case, he has provided a flow chart for determining which pair of indicators should be used when, and consequently how to choose between the Receiver Operating Characteristic and the Precision-Recall Curve.

For such evaluations, a useful single measure is the area under the ROC curve (AUC).
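The AUC can also be computed without drawing the curve, as the probability that a randomly chosen positive instance is scored higher than a randomly chosen negative one (the Mann-Whitney interpretation); a minimal sketch with illustrative scores:

```python
def auc(scores, labels):
    """AUC as the fraction of positive-negative pairs ranked correctly,
    with ties counted as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.6, 0.4, 0.2], [1, 1, 0, 1, 0]))  # ~0.833
```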

Apart from accuracy, binary classifiers can be assessed in many other ways, for example in terms of their speed or cost.

Probabilistic classification models go beyond providing binary outputs and instead produce probability scores for each class.

These models are designed to assess the likelihood or probability of an instance belonging to different classes.

Evaluating such models calls for metrics that take into account the probabilistic nature of the classifier's output and provide a more comprehensive assessment of its effectiveness in assigning accurate probabilities to different classes.

These evaluation metrics aim to capture the degree of calibration, discrimination, and overall accuracy of the probabilistic classifier's predictions.
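Two commonly used metrics of this kind are the Brier score and the logarithmic loss; a minimal sketch of both, with invented predicted probabilities:

```python
from math import log

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def log_loss(probs, labels, eps=1e-15):
    """Negative mean log-likelihood of the observed labels; confident
    mistakes are penalized heavily. eps guards against log(0)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)
        total += -(y * log(p) + (1 - y) * log(1 - p))
    return total / len(labels)

probs, labels = [0.9, 0.7, 0.2, 0.4], [1, 1, 0, 0]
print(brier_score(probs, labels), log_loss(probs, labels))
```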

Ranking is very important for web search engines because readers seldom go past the first page of results, and there are too many documents on the web to manually classify all of them as to whether they should be included or excluded from a given search.

Adding a cutoff at a particular number of results takes ranking into account to some degree.

More sophisticated metrics, such as discounted cumulative gain, take into account each individual ranking, and are more commonly used where this is important.
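A minimal sketch of discounted cumulative gain and its normalized form, assuming graded relevance judgements and the common logarithmic discount (one of several formulations in use):

```python
from math import log2

def dcg(relevances, k=None):
    """Discounted cumulative gain: relevance at rank i (1-indexed)
    is discounted by log2(i + 1)."""
    rels = relevances[:k] if k else relevances
    return sum(rel / log2(i + 1) for i, rel in enumerate(rels, start=1))

def ndcg(relevances, k=None):
    """DCG normalized by the DCG of the ideal (best possible) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal else 0.0

# Graded relevance of the top six returned documents, in ranked order
print(ndcg([3, 2, 3, 0, 1, 2], k=5))
```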

From the confusion matrix, four basic measures can be derived.