Tajima's D

Tajima's D is computed as the difference between two measures of genetic diversity: the mean number of pairwise differences and the number of segregating sites, each scaled so that they are expected to be the same in a neutrally evolving population of constant size.[1]

The purpose of Tajima's D test is to distinguish between a DNA sequence evolving randomly ("neutrally") and one evolving under a non-random process, including directional selection or balancing selection, demographic expansion or contraction, genetic hitchhiking, or introgression.

A randomly evolving DNA sequence contains mutations with no effect on the fitness and survival of an organism.

By contrast, a mutation that does affect fitness, for example one that causes prenatal death or severe disease, would be expected to be under selection.

In the population as a whole, the frequency of a neutral mutation fluctuates randomly (i.e. the percentage of individuals in the population with the mutation changes from one generation to the next, and this percentage is equally likely to go up or down) through genetic drift.

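The strength of genetic drift depends on population size. If a population stays at a constant size with a constant mutation rate, it reaches an equilibrium of gene frequencies. This equilibrium has important properties, including the number of segregating sites $S$ and the number of nucleotide differences between pairs of sampled sequences (the pairwise differences).

To standardize the pairwise differences, their mean is used. This is simply the sum of the pairwise differences divided by the number of pairs, and is often symbolized by $\pi$.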
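As a minimal sketch of these two quantities, the snippet below counts segregating sites and averages the pairwise differences for a small aligned sample. The function names and the toy alignment are hypothetical, and the code assumes equal-length sequences with no gaps or missing data.

```python
from itertools import combinations

def segregating_sites(seqs):
    """Count alignment columns in which more than one base occurs."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)

def mean_pairwise_differences(seqs):
    """Sum of pairwise differences divided by the number of pairs."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total / len(pairs)

# Hypothetical toy alignment: four sequences, five sites.
sample = ["ATGCA", "ATGCT", "ACGCT", "ATGCA"]
S = segregating_sites(sample)           # columns 2 and 5 vary
pi = mean_pairwise_differences(sample)  # 7 differences over 6 pairs
print(S, pi)
```

Enumerating every pair is quadratic in the number of sequences, which is fine for illustration; larger data sets are usually handled by working from allele counts per site instead.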

The purpose of Tajima's test is to identify sequences which do not fit the neutral theory model at equilibrium between mutation and genetic drift.

Tajima's statistic compares two standardized quantities in the sampled DNA: the total number of segregating sites (DNA sites that are polymorphic) and the average number of mutations between pairs of sequences in the sample.

If these two numbers only differ by as much as one could reasonably expect by chance, then the null hypothesis of neutrality cannot be rejected.

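Under the neutral theory model, for a population at constant size at equilibrium:

$$E[\pi] = \theta = E\left[\frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}}\right] = 4N\mu$$

for diploid DNA, and

$$E[\pi] = \theta = E\left[\frac{S}{\sum_{i=1}^{n-1} \frac{1}{i}}\right] = 2N\mu$$

for haploid.

In the above formulas, $S$ is the number of segregating sites, $n$ is the number of samples, $N$ is the effective population size, $\mu$ is the mutation rate at the examined genomic locus, and $i$ is the index of summation.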
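The scaling of the number of segregating sites can be illustrated with a short sketch. Dividing the number of segregating sites by the harmonic sum over i = 1, ..., n-1 gives an estimate of theta (commonly called Watterson's estimator) that can be set beside the mean number of pairwise differences. The function names and the input values are hypothetical.

```python
def harmonic_sum(n):
    """Sum of 1/i for i = 1 .. n-1, the scaling factor for segregating sites."""
    return sum(1.0 / i for i in range(1, n))

def theta_from_segregating_sites(S, n):
    """Estimate of theta based on the number of segregating sites."""
    return S / harmonic_sum(n)

# Hypothetical sample: n = 4 sequences, S = 2 segregating sites,
# mean pairwise differences pi = 7/6.
n, S, pi = 4, 2, 7.0 / 6.0
theta_s = theta_from_segregating_sites(S, n)  # 2 / (1 + 1/2 + 1/3)
difference = pi - theta_s                     # expected to be near 0 under neutrality
print(theta_s, difference)
```

Under the neutral equilibrium model both quantities estimate the same parameter, so the difference is expected to be close to zero; how close counts as "close" depends on its sampling variance.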

But selection, demographic fluctuations and other violations of the neutral model (including rate heterogeneity and introgression) will change the expected values of $S$ and $\pi$, with the result that they are no longer expected to be equal.

The difference in the expectations for these two variables (which can be positive or negative) is the crux of Tajima's D test statistic.

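$D$ is calculated by taking the difference between the two estimates of the population genetics parameter $\theta$. This difference is called $d$, and $D$ is obtained by dividing $d$ by the square root of its estimated variance $\hat{V}(d)$ (its standard deviation, by definition):

$$D = \frac{d}{\sqrt{\hat{V}(d)}} = \frac{\hat{k} - S/a_1}{\sqrt{e_1 S + e_2 S(S-1)}}$$

Here $a_1 = \sum_{i=1}^{n-1} \frac{1}{i}$, $e_1$ and $e_2$ are constants that depend only on $n$, and $\hat{k}$ and $S/a_1$ are two estimates of the expected number of single nucleotide polymorphisms (SNPs) between two DNA sequences under the neutral mutation model in a sample size $n$ from an effective population size $N$.

The first estimate, $\hat{k}$ (also written $\pi$), is the average number of SNPs found in the $\binom{n}{2}$ pairwise comparisons of sequences in the sample. The second, $S/a_1$, follows from the expected value of $S$, $E(S) = a_1 M$, where $M = 4N\mu$ is Tajima's symbol for $\theta$.

The lower-case $d$ described above is the difference between these two numbers: the average number of polymorphisms found in pairwise comparison, $\hat{k}$, and the estimate of $M$ obtained from the number of segregating sites. Thus

$$d = \hat{k} - \frac{S}{a_1}$$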
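The full calculation can be sketched as follows. The variance constants (a2, b1, b2, c1, c2, e1, e2) are written here as they are usually given following Tajima's 1989 paper; they do not appear in the text above, so treat them as an assumption to check against the original publication. The function name and inputs are hypothetical.

```python
import math

def tajimas_d(n, S, pi):
    """Tajima's D from sample size n, segregating sites S and mean
    pairwise differences pi. Returns NaN when D is undefined."""
    if n < 4 or S == 0:
        return float("nan")  # variance term is zero for n < 4; no data if S == 0
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    d = pi - S / a1                         # difference of the two estimates
    variance = e1 * S + e2 * S * (S - 1)    # estimated variance of d
    return d / math.sqrt(variance)

# Hypothetical inputs: 4 sequences, 2 segregating sites, pi = 7/6.
D = tajimas_d(4, 2, 7.0 / 6.0)
print(D)
```

Returning NaN rather than raising keeps the function usable in scans over many loci, where windows without any segregating sites are common.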

A negative Tajima's D signifies an excess of low-frequency polymorphisms relative to expectation, which is consistent with population size expansion (e.g., after a bottleneck or a selective sweep).

However, calculating a conventional "p-value" associated with any Tajima's D value that is obtained from a sample is impossible.

Briefly, this is because there is no way to describe the distribution of the statistic that is independent of the true, and unknown, theta parameter (no pivotal quantity exists).

Tajima approximated the distribution of the statistic with a beta distribution with mean zero and variance one. Simulations have shown this approximation to be conservative,[3] and now that computing power is more readily available it is not frequently used.

An alternative approach is for the investigator to perform a grid search over the values of theta that they believe to be plausible based on their knowledge of the organism under study.
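One way such a grid search might look is sketched below: for each candidate theta, samples are simulated under the standard neutral coalescent (constant size, no recombination, infinite sites) and the empirical 2.5% and 97.5% quantiles of D are taken as critical values. The simulator, the grid, the replicate count and all function names are illustrative assumptions, not a published procedure, and the variance constants are the ones usually attributed to Tajima's 1989 paper.

```python
import numpy as np

def tajimas_d_from_counts(n, counts):
    """Tajima's D from the derived-allele count (1..n-1) at each site."""
    S = len(counts)
    if S == 0:
        return np.nan  # undefined without segregating sites
    j = np.asarray(counts, dtype=float)
    pi = np.sum(2.0 * j * (n - j)) / (n * (n - 1))
    i = np.arange(1, n, dtype=float)
    a1, a2 = np.sum(1.0 / i), np.sum(1.0 / i**2)
    c1 = (n + 1) / (3.0 * (n - 1)) - 1.0 / a1
    c2 = (2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
          - (n + 2) / (a1 * n) + a2 / a1**2)
    e1, e2 = c1 / a1, c2 / (a1**2 + a2)
    return (pi - S / a1) / np.sqrt(e1 * S + e2 * S * (S - 1))

def simulate_counts(n, theta, rng):
    """One neutral coalescent replicate; returns derived counts per site."""
    sizes = [1] * n  # number of sampled sequences below each lineage
    counts = []
    for k in range(n, 1, -1):
        t = rng.exponential(2.0 / (k * (k - 1)))     # time with k lineages
        muts = rng.poisson(theta * t / 2.0, size=k)  # mutations per lineage
        for size, m in zip(sizes, muts):
            counts.extend([size] * int(m))
        a, b = rng.choice(k, size=2, replace=False)  # lineages that coalesce
        merged = sizes[a] + sizes[b]
        sizes = [s for idx, s in enumerate(sizes) if idx not in (a, b)]
        sizes.append(merged)
    return counts

def critical_values(n, theta, reps, rng, alpha=0.05):
    """Empirical two-sided critical values of D for one value of theta."""
    ds = np.array([tajimas_d_from_counts(n, simulate_counts(n, theta, rng))
                   for _ in range(reps)])
    ds = ds[~np.isnan(ds)]
    lower, upper = np.quantile(ds, [alpha / 2, 1 - alpha / 2])
    return float(lower), float(upper)

rng = np.random.default_rng(1)
theta_grid = [1.0, 5.0, 10.0]  # hypothetical plausible values
results = {th: critical_values(20, th, 1000, rng) for th in theta_grid}
for th, (lower, upper) in results.items():
    print(f"theta={th}: reject neutrality if D < {lower:.2f} or D > {upper:.2f}")
```

An observed D would then be judged against the critical values for every theta in the grid, with the most conservative pair being the natural choice when theta is uncertain.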

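A very rough rule of thumb is that values of D greater than +2 or less than -2 are likely to be significant. This rule is based on an appeal to asymptotic properties of some statistics, and thus +/- 2 does not actually represent a critical value for a significance test.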

Finally, genome-wide scans of Tajima's D in sliding windows along a chromosomal segment are often performed.

With this approach, those regions that have a value of D that greatly deviates from the bulk of the empirical distribution of all such windows are reported as significant.

This method does not assess significance in the traditional statistical sense, but is quite powerful given a large genomic region, and is unlikely to falsely identify interesting regions of a chromosome if only the greatest outliers are reported.
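A sliding-window scan with empirical outlier reporting could be sketched along these lines. The input layout (SNP positions plus a derived-allele count per SNP), the window and step sizes, the 1% tail cutoff and the randomly generated toy data are all assumptions for illustration; the variance constants are the ones usually attributed to Tajima's 1989 paper.

```python
import numpy as np

def tajimas_d_from_counts(n, counts):
    """Tajima's D from the derived-allele count (1..n-1) at each site."""
    S = len(counts)
    if S == 0:
        return np.nan  # window without segregating sites
    j = np.asarray(counts, dtype=float)
    pi = np.sum(2.0 * j * (n - j)) / (n * (n - 1))
    i = np.arange(1, n, dtype=float)
    a1, a2 = np.sum(1.0 / i), np.sum(1.0 / i**2)
    c1 = (n + 1) / (3.0 * (n - 1)) - 1.0 / a1
    c2 = (2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
          - (n + 2) / (a1 * n) + a2 / a1**2)
    e1, e2 = c1 / a1, c2 / (a1**2 + a2)
    return (pi - S / a1) / np.sqrt(e1 * S + e2 * S * (S - 1))

def window_scan(positions, counts, n, window, step):
    """D in windows [start, start + window) advanced by `step` base pairs."""
    starts = np.arange(0, positions.max() + 1, step)
    ds = np.array([
        tajimas_d_from_counts(
            n, counts[(positions >= s) & (positions < s + window)])
        for s in starts
    ])
    return starts, ds

def empirical_outliers(ds, tail=0.01):
    """Flag windows in the extreme tails of the empirical distribution."""
    lower, upper = np.nanquantile(ds, [tail, 1 - tail])
    return (ds < lower) | (ds > upper)

# Arbitrary toy data (not a realistic frequency spectrum):
# 5000 SNPs on a 1 Mb segment, sample of n = 20 sequences.
rng = np.random.default_rng(0)
n = 20
positions = np.sort(rng.integers(0, 1_000_000, size=5000))
counts = rng.integers(1, n, size=5000)  # derived counts in 1..n-1

starts, ds = window_scan(positions, counts, n, window=50_000, step=10_000)
flagged = empirical_outliers(ds)
print(starts[flagged], ds[flagged])
```

Because the cutoff is a quantile of the observed windows, some windows are always flagged, even in data with no selection at all; that is the sense in which this approach ranks regions rather than testing them.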