McNemar's test

The commonly used parameters to assess a diagnostic test in medical sciences are sensitivity and specificity.

Sensitivity (or recall) is the ability of a test to correctly identify the people with disease.

Thus the null and alternative hypotheses are[1]

$$H_0: p_b = p_c \qquad H_1: p_b \neq p_c$$

Here $p_a$, etc., denote the theoretical probability of occurrences in cells with the corresponding label.

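The McNemar test statistic is:

$$\chi^2 = \frac{(b - c)^2}{b + c}$$

Under the null hypothesis, with a sufficiently large number of discordants (cells b and c), $\chi^2$ has approximately a chi-squared distribution with 1 degree of freedom.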
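As a rough illustration, the statistic and its large-sample P-value can be computed with the Python standard library alone. The function names and the counts b = 15, c = 5 below are hypothetical, chosen only for demonstration. The upper tail of a chi-squared distribution with 1 degree of freedom is obtained from the identity P(X > x) = erfc(√(x/2)).

```python
from math import erfc, sqrt


def mcnemar_chi2(b, c):
    """Asymptotic McNemar statistic from the two discordant cell counts."""
    if b + c == 0:
        raise ValueError("the statistic is undefined when b + c == 0")
    return (b - c) ** 2 / (b + c)


def chi2_1df_sf(x):
    """Upper-tail probability P(X > x) for chi-squared with 1 degree of freedom."""
    return erfc(sqrt(x / 2.0))


# Hypothetical discordant counts, for illustration only.
b, c = 15, 5
statistic = mcnemar_chi2(b, c)
p_value = chi2_1df_sf(statistic)
print(f"chi2 = {statistic:.3f}, p = {p_value:.4f}")
```

Only the discordant counts enter the calculation, so the function does not need the full 2 × 2 table.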

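For b ≥ c, the exact two-sided P-value is:

$$\text{exact-P-value} = 2 \sum_{i=b}^{n} \binom{n}{i} 0.5^{i} (1 - 0.5)^{n-i}$$

which is simply twice the upper tail of the binomial distribution with p = 0.5 and n = b + c; by symmetry, this tail equals the binomial cumulative distribution function evaluated at c.

Edwards[4] proposed the following continuity-corrected version of the McNemar test to approximate the binomial exact P-value:

$$\chi^2 = \frac{(|b - c| - 1)^2}{b + c}$$

The mid-P McNemar test (mid-p binomial test) is calculated by subtracting half the probability of the observed b from the exact one-sided P-value, then doubling the result to obtain the two-sided mid-P-value:[5][6]

$$\text{mid-P-value} = 2 \left( \sum_{i=b}^{n} \binom{n}{i} 0.5^{i} (1 - 0.5)^{n-i} - 0.5 \binom{n}{b} 0.5^{b} (1 - 0.5)^{n-b} \right)$$

This is equivalent to:

$$\text{mid-P-value} = \text{exact-P-value} - \binom{n}{b} 0.5^{b} (1 - 0.5)^{n-b}$$

where the second term is the binomial distribution probability mass function and n = b + c. Binomial distribution functions are readily available in common software packages, so the McNemar mid-P test can easily be calculated.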
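The three variants described here can be sketched in a few lines of standard-library Python. This is a minimal sketch, not a reference implementation: the function names are hypothetical, the counts in the usage lines are illustrative, and the larger of b and c is used as the tail start so that the caller need not order them.

```python
from math import comb


def binom_pmf_half(k, n):
    """P(X = k) for X ~ Binomial(n, p = 0.5)."""
    return comb(n, k) * 0.5 ** n


def upper_tail(b, c):
    """One-sided tail P(X >= max(b, c)) with n = b + c and p = 0.5."""
    n = b + c
    start = max(b, c)
    return sum(binom_pmf_half(i, n) for i in range(start, n + 1))


def exact_p(b, c):
    """Two-sided exact binomial P-value, capped at 1."""
    return min(1.0, 2.0 * upper_tail(b, c))


def mid_p(b, c):
    """Two-sided mid-P value: 2 * (one-sided tail - half the observed pmf)."""
    n = b + c
    return 2.0 * (upper_tail(b, c) - 0.5 * binom_pmf_half(max(b, c), n))


def edwards_chi2(b, c):
    """Continuity-corrected statistic in the form attributed to Edwards."""
    return (abs(b - c) - 1) ** 2 / (b + c)


# Illustrative discordant counts (an assumption, not data from the text).
b, c = 6, 16
print(exact_p(b, c), mid_p(b, c), edwards_chi2(b, c))
```

When b = c, twice the one-sided tail exceeds 1, which is why `exact_p` caps its result. The mid-P value equals exactly 1 in that case.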

The mid-P version was almost as powerful as the asymptotic McNemar test and was not found to exceed the nominal significance level.

There are 314 patients, and they are diagnosed (disease: present or absent) before and after using the drug, which means that each sample can be described using 1 out of 4 combinations.

The test requires the same subjects to be included in the before-and-after measurements (matched pairs).

From the above data, the McNemar test statistic, $\chi^2 = (b - c)^2 / (b + c)$, has the value 21.35, which is extremely unlikely to come from the distribution implied by the null hypothesis (p < 0.001).
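The quoted bound on the P-value can be checked directly from the statistic. The sketch below takes only the value 21.35 from the text and assumes the usual reference distribution, a chi-squared distribution with 1 degree of freedom, whose upper tail is erfc(√(x/2)).

```python
from math import erfc, sqrt

statistic = 21.35  # value quoted in the text

# Upper-tail probability of a chi-squared variable with 1 degree of freedom.
p_value = erfc(sqrt(statistic / 2.0))

print(f"p = {p_value:.2e}")
print("below 0.001:", p_value < 0.001)
```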

Thus the test provides strong evidence to reject the null hypothesis of no treatment effect.

Both McNemar's test and the mid-P version provide stronger evidence for a statistically significant treatment effect in this second example.

An interesting observation when interpreting McNemar's test is that the elements of the main diagonal do not contribute to the decision about whether (in the above example) pre- or post-treatment condition is more favourable.

Thus, the sum b + c can be small and the statistical power of the tests described above can be low even though the number of pairs a + b + c + d is large (see second example above).
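A small numerical check makes the point concrete. In the sketch below, the function name and all cell counts are hypothetical. The two tables share the same discordant cells but differ a thousandfold in their concordant cells, and the statistic is identical.

```python
def mcnemar_chi2_from_table(a, b, c, d):
    """McNemar statistic from a full 2 x 2 table of paired outcomes.

    The concordant cells a and d are accepted for completeness but are
    deliberately unused: only the discordant cells b and c matter.
    """
    return (b - c) ** 2 / (b + c)


# Same discordant cells, very different concordant cells (made-up counts).
small_study = mcnemar_chi2_from_table(a=5, b=8, c=2, d=5)
large_study = mcnemar_chi2_from_table(a=5000, b=8, c=2, d=5000)

print(small_study, large_study)
print("identical:", small_study == large_study)
```

Increasing the total number of pairs therefore adds power only to the extent that it adds discordant pairs.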

These investigators presented the following table: They calculated a chi-squared statistic [...] [they] had made an error in their analysis by ignoring the pairings.[...]