Fisher's exact test

The test assumes that all row and column sums of the contingency table were fixed by design and tends to be conservative and underpowered outside of this setting.[5]

The test is useful for categorical data that result from classifying objects in two different ways; it is used to examine the significance of the association (contingency) between the two kinds of classification.

The p-value from the test is computed as if the margins of the table are fixed, i.e. as if, in the tea-tasting example, Bristol knows the number of cups with each treatment (milk or tea first) and will therefore provide guesses with the correct number in each category.

As pointed out by Fisher, this leads under a null hypothesis of independence to a hypergeometric distribution of the numbers in the cells of the table.
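For instance, in the classic design of the tea-tasting experiment there were eight cups, four prepared each way, and the taster had to name four as "milk first". A minimal Python sketch of the resulting conditional distribution (SciPy is assumed here; the eight-cup design is the historical one rather than something stated above):

    from scipy.stats import hypergeom

    # population of 8 cups, 4 truly "milk first"; the taster picks 4 cups as milk-first
    # hypergeom.pmf(k, M, n, N): k correct picks out of N draws from a population of M with n successes
    p_all_correct = hypergeom.pmf(4, 8, 4, 4)
    print(p_all_correct)   # 1/70, about 0.014: the p-value for a perfect score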

The chi-squared approximation is poor when sample sizes are small, or the data are very unequally distributed among the cells of the table, so that the cell counts predicted under the null hypothesis (the "expected values") are low.

The usual rule for deciding whether the chi-squared approximation is good enough is that the chi-squared test is not suitable when the expected values in any of the cells of a contingency table are below 5, or below 10 when there is only one degree of freedom (this rule is now known to be overly conservative[6]).

In fact, for small, sparse, or unbalanced data, the exact and asymptotic p-values can be quite different and may lead to opposite conclusions concerning the hypothesis of interest.
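As an illustration, the following sketch compares the asymptotic and exact p-values for the small 2 × 2 table used in the worked example later in this article (SciPy is assumed; the numbers in the comments are approximate):

    from scipy.stats import chi2_contingency, fisher_exact

    table = [[1, 9], [11, 3]]

    chi2, p_asymptotic, dof, expected = chi2_contingency(table, correction=False)
    print(expected)       # expected counts 5, 5, 7, 7: all below 10 with one degree of freedom
    print(p_asymptotic)   # Pearson chi-squared p-value, roughly 0.0009

    _, p_exact = fisher_exact(table, alternative="two-sided")
    print(p_exact)        # exact two-sided p-value, roughly 0.0028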

Fisher's exact test becomes difficult to calculate with large samples or well-balanced tables, but fortunately these are exactly the conditions where the chi-squared test is appropriate.

For hand calculations, the test is feasible only in the case of a 2 × 2 contingency table.

However the principle of the test can be extended to the general case of an m × n table,[9][10] and some statistical packages provide a calculation (sometimes using a Monte Carlo method to obtain an approximation) for the more general case.
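One possible Monte Carlo approach, shown here only as an illustrative sketch and not as the algorithm any particular package uses: permute one of the two classifications, which samples tables with the observed margins under the null hypothesis, and count how often a sampled table is at least as improbable as the observed one.

    import numpy as np
    from math import lgamma

    def log_table_prob(t):
        # log probability of an r x c table under independence, conditional on its margins
        t = np.asarray(t)
        n = t.sum()
        logp = sum(lgamma(m + 1) for m in t.sum(axis=1))   # row-total factorials
        logp += sum(lgamma(m + 1) for m in t.sum(axis=0))  # column-total factorials
        logp -= lgamma(n + 1) + sum(lgamma(x + 1) for x in t.ravel())
        return logp

    def fisher_monte_carlo(table, n_sim=100_000, seed=0):
        rng = np.random.default_rng(seed)
        table = np.asarray(table)
        rows = np.repeat(np.arange(table.shape[0]), table.sum(axis=1))
        cols = np.repeat(np.arange(table.shape[1]), table.sum(axis=0))
        observed = log_table_prob(table)
        hits = 0
        for _ in range(n_sim):
            rng.shuffle(cols)                       # permuting one classification keeps both margins fixed
            sim = np.zeros_like(table)
            np.add.at(sim, (rows, cols), 1)
            if log_table_prob(sim) <= observed + 1e-9:   # as improbable as observed, or more so
                hits += 1
        return (hits + 1) / (n_sim + 1)             # add-one correction often used for simulated p-values

    # Example (hypothetical): fisher_monte_carlo([[1, 9], [11, 3]]) should be close to the
    # exact two-sided value of about 0.0028 computed later in the article.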

For example, a sample of teenagers might be divided into male and female on one hand and those who are and are not currently studying for a statistics exam on the other. Write a, b, c and d for the four cell counts (men studying, men not studying, women studying, women not studying) and n = a + b + c + d for the total sample size; in the worked example below the observed counts are a = 1, b = 9, c = 11 and d = 3. On the null hypothesis that men and women are equally likely to be studiers, and taking the marginal totals as given, the exact hypergeometric probability of observing this particular arrangement of the data is

    p = \frac{\binom{a+b}{a}\,\binom{c+d}{c}}{\binom{n}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}.

To count the possibilities behind this formula, we do the following: first select uniformly at random a subset of size a + c from the n individuals to be the studiers; under the null hypothesis every one of the \binom{n}{a+c} such subsets is equally likely, and exactly \binom{a+b}{a}\binom{c+d}{c} of them place a studiers among the a + b men and c studiers among the c + d women.

To put it another way, if we assume that every man and every woman is a studier with the same probability, and that both men and women enter our sample independently of whether or not they are studiers, then this hypergeometric formula gives the conditional probability of observing the values a, b, c, d in the four cells, conditionally on the observed marginals (i.e., assuming the row and column totals shown in the margins of the table are given).

This remains true even if men enter our sample with different probabilities than women.

Even then, were we to calculate the distribution of the cell entries conditional on the marginals, we would obtain the formula above, in which neither the probability of being a studier nor the differing sampling probabilities for men and women appear.
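A quick numerical check of the formula with the counts of the worked example (a minimal Python sketch; math.comb does the counting and SciPy is used only for comparison):

    from math import comb
    from scipy.stats import hypergeom

    a, b, c, d = 1, 9, 11, 3
    n = a + b + c + d

    p_table = comb(a + b, a) * comb(c + d, c) / comb(n, a + c)
    print(p_table)                            # about 0.00135

    # the same value as the hypergeometric pmf: a "successes" in a + c draws from a
    # population of n that contains a + b men
    print(hypergeom.pmf(a, n, a + b, a + c))  # agrees with p_table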

In order to calculate the significance of the observed data, i.e. the total probability of observing data as extreme or more extreme if the null hypothesis is true, we have to calculate the values of p for the observed table and for every table at least as extreme in the same direction, and add them together. In this example there is only one more extreme table, the one with cell counts 0, 10, 12, 2.

For example, in the R statistical computing environment, this value can be obtained as fisher.test(rbind(c(1,9),c(11,3)), alternative="less")$p.value, or in Python, using scipy.stats.fisher_exact(table=[[1,9],[11,3]], alternative="less") (which returns both the sample odds ratio and the p-value).
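The same one-sided value can be reproduced by summing the hypergeometric probabilities directly; a minimal sketch using the margins of the table above:

    from scipy.stats import hypergeom

    n, men, studiers = 24, 10, 12    # total sample, row total (men), column total (studiers)

    # observed table has a = 1 male studier; the only more extreme table in this direction has a = 0
    p_one_sided = sum(hypergeom.pmf(a, n, men, studiers) for a in (0, 1))
    print(p_one_sided)               # about 0.00138, matching the fisher.test / fisher_exact calls above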

This value can be interpreted as a summary of the evidence provided by the observed data, or any more extreme table, concerning the null hypothesis (that there is no difference in the proportions of studiers between men and women): the smaller the p-value, the stronger the evidence against the null hypothesis.

In the example here, the 2-sided p-value is exactly twice the 1-sided value, because the equal column totals make the conditional distribution symmetric; in general, though, the two can differ substantially for tables with small counts, unlike the case with test statistics that have a symmetric sampling distribution.
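One common convention, used by both R's fisher.test and SciPy's fisher_exact, defines the two-sided p-value as the total probability of all tables with the same margins that are no more probable than the observed one; a sketch under that convention:

    from scipy.stats import hypergeom

    n, men, studiers = 24, 10, 12
    probs = [hypergeom.pmf(a, n, men, studiers) for a in range(men + 1)]  # tables indexed by a = 0, ..., 10
    p_obs = probs[1]                                                      # observed table has a = 1

    # sum over every table no more probable than the observed one (small tolerance for rounding)
    p_two_sided = sum(p for p in probs if p <= p_obs * (1 + 1e-7))
    print(p_two_sided)   # about 0.00276, exactly twice the one-sided value for this table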

Fisher's test gives exact p-values, but some authors have argued that it is conservative, i.e. that its actual rejection rate is below the nominal significance level.[4][14][15][16]

The apparent contradiction stems from the combination of a discrete statistic with fixed significance levels.[20][21]

The p-values derived from Fisher's test come from the distribution that conditions on the margin totals.

It is possible to obtain an exact p-value for the 2×2 table when the margins are not held fixed.
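Unconditional exact tests of this kind include Barnard's and Boschloo's tests; recent versions of SciPy expose them alongside Fisher's test, as in the sketch below (using the table from the example above). Their p-values generally differ from Fisher's conditional p-value for the same data.

    from scipy.stats import fisher_exact, barnard_exact, boschloo_exact

    table = [[1, 9], [11, 3]]

    print(fisher_exact(table, alternative="less")[1])        # conditional exact test (Fisher)
    print(barnard_exact(table, alternative="less").pvalue)   # unconditional exact test (Barnard)
    print(boschloo_exact(table, alternative="less").pvalue)  # unconditional exact test (Boschloo)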

Proponents of conditioning argue that the marginal success total is an (almost[18]) ancillary statistic, containing (almost) no information about the tested property.

The act of conditioning on the marginal success rate from a 2×2 table can be shown to ignore some information in the data about the unknown odds ratio.[22]

Whether this lost information is important for inferential purposes is the essence of the controversy.[25]

Most modern statistical packages will calculate the significance of Fisher tests, in some cases even where the chi-squared approximation would also be acceptable.

The actual computations as performed by statistical software packages will as a rule differ from those described above, because numerical difficulties may result from the large values taken by the factorials.

A simple, somewhat better computational approach relies on a gamma function or log-gamma function, but methods for accurate computation of hypergeometric and binomial probabilities remain an active research area.
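A minimal sketch of that idea, evaluating the hypergeometric probability of a 2 × 2 table through the log-gamma function rather than raw factorials (an illustration, not any particular package's implementation):

    from math import lgamma, exp

    def log_factorial(m):
        # log(m!) via the log-gamma function, which stays finite where m! would overflow
        return lgamma(m + 1)

    def hypergeom_prob(a, b, c, d):
        n = a + b + c + d
        log_p = (log_factorial(a + b) + log_factorial(c + d)
                 + log_factorial(a + c) + log_factorial(b + d)
                 - log_factorial(n)
                 - log_factorial(a) - log_factorial(b)
                 - log_factorial(c) - log_factorial(d))
        return exp(log_p)

    print(hypergeom_prob(1, 9, 11, 3))   # about 0.00135, as in the worked example above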

A teapot, a creamer and a teacup full of tea with milk: can a taster tell if the milk went in first?