Models of DNA evolution

A number of different Markov models of DNA sequence evolution have been proposed.[1]

These substitution models differ in terms of the parameters used to describe the rates at which one nucleotide replaces another during evolution.

These models are phenomenological descriptions of the evolution of DNA as a string of four discrete states.

These Markov models do not explicitly depict the mechanism of mutation nor the action of natural selection.

For example, mutational biases and purifying selection favoring conservative changes are probably both responsible for the relatively high rate of transitions compared to transversions in evolving sequences.

However, the Kimura (K80) model described below only attempts to capture the effect of both forces in a parameter that reflects the relative rate of transitions to transversions.

Evolutionary analyses of sequences are conducted on a wide variety of time scales.

Thus, it is convenient to express these models in terms of the instantaneous rates of change between different states (the Q matrices below).

By expressing models in terms of the instantaneous rates of change, we can avoid estimating a large number of parameters for each branch on a phylogenetic tree (or for each comparison, if the analysis involves many pairwise sequence comparisons).
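As a minimal sketch of this idea, the Python snippet below derives transition probabilities for two different branch lengths from a single rate matrix via P(t) = exp(Qt). The Jukes–Cantor parameterization (all off-diagonal rates equal to μ/4), the function names, and the use of NumPy are assumptions chosen for illustration; the eigendecomposition shortcut relies on this particular Q being symmetric.

```python
import numpy as np

def jc69_rate_matrix(mu=1.0):
    """Jukes-Cantor rate matrix (assumed form: off-diagonals mu/4)."""
    Q = np.full((4, 4), mu / 4.0)
    np.fill_diagonal(Q, -3.0 * mu / 4.0)  # each row sums to zero
    return Q

def transition_probabilities(Q, t):
    """P(t) = exp(Q t), computed by eigendecomposition of a symmetric Q."""
    eigenvalues, V = np.linalg.eigh(Q)
    return V @ np.diag(np.exp(eigenvalues * t)) @ V.T

Q = jc69_rate_matrix()
P_short = transition_probabilities(Q, 0.1)   # short branch
P_long = transition_probabilities(Q, 10.0)   # long branch, near equilibrium
```

The same Q serves every branch; only the scalar t changes, which is why no per-branch transition matrix has to be estimated separately.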

These models are often used for analyzing the evolution of an entire locus by making the simplifying assumption that different sites evolve independently and are identically distributed.

If the primary effect of natural selection on the evolution of the sequences is to constrain some sites, then models of among-site rate-heterogeneity can be used.

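Continuous-time Markov chains have the usual transition matrices which are, in addition, parameterized by time, $t$. Specifically, if $E_1, E_2, E_3, E_4$ are the states, then the transition matrix is $P(t) = (P_{ij}(t))$, where each entry $P_{ij}(t)$ is the probability that state $E_i$ changes to state $E_j$ in time $t$.

Example: We would like to model the substitution process in DNA sequences (as in the Jukes–Cantor, Kimura, and related models).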

(i) In the context of Markov chains, transition is the general term for the change between two states.

Consider a DNA sequence of fixed length m evolving in time by base replacement.

In DNA evolution, under the assumption of a common process for each site, the stationary frequencies $\pi_A, \pi_C, \pi_G, \pi_T$ correspond to the equilibrium base composition.

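A 4 × 4 time-reversible rate matrix (whose rows sum to zero) can be completely determined by 9 numbers: 6 exchangeability terms and 3 stationary frequencies (the fourth frequency is fixed because the frequencies sum to 1).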
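The count of 9 numbers can be made concrete with a short sketch. The snippet below assembles a time-reversible rate matrix from 6 exchangeabilities and 3 free frequencies, using the common convention that the rate from base i to base j is the exchangeability of the pair times the stationary frequency of j. The function name, the base ordering A, C, G, T, and the numeric values are all illustrative assumptions, not estimates from data.

```python
import numpy as np

BASES = "ACGT"  # assumed ordering of the four states

def gtr_rate_matrix(exchangeabilities, free_freqs):
    """Build a time-reversible rate matrix from 6 + 3 numbers.

    exchangeabilities: dict mapping the 6 unordered base pairs
        ("AC", "AG", "AT", "CG", "CT", "GT") to symmetric rates.
    free_freqs: stationary frequencies of A, C and G; the frequency
        of T is implied because all four must sum to 1.
    """
    pi = np.append(np.asarray(free_freqs, dtype=float), 1.0 - sum(free_freqs))
    Q = np.zeros((4, 4))
    for pair, rate in exchangeabilities.items():
        i, j = BASES.index(pair[0]), BASES.index(pair[1])
        Q[i, j] = rate * pi[j]
        Q[j, i] = rate * pi[i]
    # Diagonal entries make each row sum to zero.
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q, pi

# Hypothetical parameter values, chosen only to exercise the function.
rates = {"AC": 1.0, "AG": 4.0, "AT": 0.5, "CG": 0.7, "CT": 5.0, "GT": 1.0}
Q, pi = gtr_rate_matrix(rates, [0.3, 0.2, 0.25])

# Time reversibility (detailed balance): pi_i * Q_ij equals pi_j * Q_ji.
flux = pi[:, None] * Q
```

Because `flux` is symmetric by construction, the resulting process is time-reversible and `pi` is its stationary distribution.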

This raw measurement of divergence provides information about the number of changes that have occurred along the path separating the sequences.

The simple count of differences (the Hamming distance) between sequences will often underestimate the number of substitutions because of multiple hits (see homoplasy).
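To illustrate the size of this underestimate, the sketch below compares the observed proportion of differing sites with the standard Jukes–Cantor corrected distance, d = -(3/4) ln(1 - 4p/3). The choice of the Jukes–Cantor correction, the function names, and the two toy sequences are assumptions made for the example; the sequences are taken to be aligned, ungapped, and of equal length.

```python
import math

def hamming_proportion(seq_a, seq_b):
    """Proportion of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(x != y for x, y in zip(seq_a, seq_b))
    return differences / len(seq_a)

def jc69_distance(p):
    """Expected substitutions per site under Jukes-Cantor, given proportion p."""
    if p >= 0.75:
        raise ValueError("p must be below 0.75 for the correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Toy sequences differing at 3 of 10 sites.
p = hamming_proportion("ACGTACGTAC", "ACGTTCGAAG")
d = jc69_distance(p)
```

The corrected distance `d` exceeds `p` for any non-zero `p`, and the gap widens as the sequences diverge and multiple hits become more likely.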

The value of β can be found by forcing the expected rate of flux of states to 1.

The diagonal entries of the rate-matrix (the Q matrix) represent -1 times the rate of leaving each state.
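Both conventions, the diagonal as the negative rate of leaving a state and the rescaling that sets the expected flux to 1, can be sketched in a few lines. The example uses a Kimura-style matrix with a transition/transversion ratio `kappa` and equal base frequencies purely for illustration; the variable and function names are assumptions, and the scaling factor computed here plays the role of a normalizing constant rather than being tied to any specific symbol in the text.

```python
import numpy as np

kappa = 2.0  # hypothetical transition/transversion rate ratio

# Off-diagonal rates in the assumed order A, C, G, T;
# A<->G and C<->T are transitions, everything else is a transversion.
Q_raw = np.array([
    [0.0,   1.0,   kappa, 1.0],
    [1.0,   0.0,   1.0,   kappa],
    [kappa, 1.0,   0.0,   1.0],
    [1.0,   kappa, 1.0,   0.0],
])

# Each diagonal entry is -1 times the total rate of leaving that state.
np.fill_diagonal(Q_raw, -Q_raw.sum(axis=1))

def normalize_rate_matrix(Q, pi):
    """Rescale Q so that the expected rate of flux, -sum(pi_i * Q_ii), is 1."""
    expected_flux = -np.dot(pi, np.diag(Q))
    return Q / expected_flux

pi_equal = np.full(4, 0.25)
Q_norm = normalize_rate_matrix(Q_raw, pi_equal)
```

After normalization, one unit of time corresponds to one expected substitution per site, so branch lengths can be read directly as expected numbers of changes.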

[5][6][7] One important property of the K81 model is the ability to perform a Hadamard transform, assuming the site patterns were generated on a tree with nucleotides evolving under that model.[8][9][10]

When used in the context of phylogenetics, the Hadamard transform provides an elegant and fully invertible means to calculate expected site pattern frequencies given a set of branch lengths (or vice versa).

With this approach, the relative rates of the different substitution types can vary across branches, and the Hadamard transform can even provide evidence that the data do not fit a tree.

The Hadamard transform can also be combined with a wide variety of methods to accommodate among-sites rate heterogeneity,[11] using continuous distributions rather than the discrete approximations typically used in maximum likelihood phylogenetics[12] (although one must sacrifice the invertibility of the Hadamard transform to use certain among-sites rate heterogeneity distributions[11]).

Rate matrix:

When branch length, ν, is measured in the expected number of changes per site, then:

HKY85, the Hasegawa, Kishino and Yano 1985 model,[14] can be thought of as combining the extensions made in the Kimura (K80) and Felsenstein (F81) models.

Namely, it distinguishes between the rates of transitions and transversions (using the κ parameter), and it allows unequal base frequencies (that is, $\pi_A$, $\pi_C$, $\pi_G$ and $\pi_T$ need not be equal).
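A rate matrix with these two features can be sketched as follows, using the usual HKY85 convention that the rate of change to base j is proportional to the frequency of j, multiplied by κ when the change is a transition. The function name, the base ordering A, C, G, T, and the example parameter values are assumptions for illustration, and the matrix is left unnormalized.

```python
import numpy as np

BASES = "ACGT"  # assumed ordering of the four states
TRANSITIONS = {frozenset("AG"), frozenset("CT")}  # purine and pyrimidine pairs

def hky85_rate_matrix(kappa, freqs):
    """Unnormalized HKY85-style rate matrix for given kappa and base frequencies."""
    pi = np.asarray(freqs, dtype=float)
    Q = np.zeros((4, 4))
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i == j:
                continue
            # Transitions are scaled by kappa; transversions are not.
            scale = kappa if frozenset((x, y)) in TRANSITIONS else 1.0
            Q[i, j] = scale * pi[j]
    np.fill_diagonal(Q, -Q.sum(axis=1))  # rows sum to zero
    return Q

# Hypothetical values: kappa = 2 with unequal base frequencies.
freqs_example = np.array([0.1, 0.2, 0.3, 0.4])
Q_hky = hky85_rate_matrix(2.0, freqs_example)

# With kappa = 1 and equal frequencies, all off-diagonal rates coincide.
Q_equal = hky85_rate_matrix(1.0, [0.25, 0.25, 0.25, 0.25])
```

Setting κ to 1 removes the transition/transversion distinction, and setting all frequencies to 0.25 removes the composition bias, which shows how the model combines the two extensions.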

If we express the branch length, ν, in terms of the expected number of changes per site, then the transition probability for one pair of states can be written in closed form, and the formulas for the other combinations of states can be obtained by substituting in the appropriate base frequencies.

T92, the Tamura 1992 model,[17] is a mathematical method developed to estimate the number of nucleotide substitutions per site between two DNA sequences, by extending Kimura's (1980) two-parameter method to the case where a G+C content bias exists.

This method is useful when there are strong transition-transversion and G+C-content biases, as in the case of Drosophila mitochondrial DNA.
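A minimal sketch of such a distance estimate is given below. It uses the formula as it is commonly stated for this method, d = -h ln(1 - P/h - Q) - (1/2)(1 - h) ln(1 - 2Q) with h = 2θ(1 - θ), where P and Q are the observed proportions of transitional and transversional differences and θ is the G+C content; this formula is not quoted in the text above and should be checked against the original publication. The function names and the toy sequences are assumptions, and sequences are taken to be aligned, ungapped, and written in uppercase A, C, G, T.

```python
import math

PURINES = {"A", "G"}

def transition_transversion_proportions(seq_a, seq_b):
    """Return (P, Q): proportions of transitional and transversional differences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for x, y in zip(seq_a, seq_b):
        if x == y:
            continue
        # A change within purines or within pyrimidines is a transition.
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    n = len(seq_a)
    return transitions / n, transversions / n

def gc_content(*seqs):
    """Pooled G+C proportion over all given sequences."""
    joined = "".join(seqs)
    return sum(base in "GC" for base in joined) / len(joined)

def t92_distance(P, Q, theta):
    """Tamura-style distance from P, Q and G+C content theta (assumed formula)."""
    h = 2.0 * theta * (1.0 - theta)
    if h <= 0.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    arg_ts = 1.0 - P / h - Q
    arg_tv = 1.0 - 2.0 * Q
    if arg_ts <= 0.0 or arg_tv <= 0.0:
        raise ValueError("sequences too divergent for this correction")
    return -h * math.log(arg_ts) - 0.5 * (1.0 - h) * math.log(arg_tv)

# Toy example: one transition (A/G) and one transversion (C/A) in four sites.
P_obs, Q_obs = transition_transversion_proportions("AACC", "GACA")
theta_obs = gc_content("AACC", "GACA")
```

When θ is 0.5, h equals 0.5 and the expression reduces to Kimura's two-parameter distance, which is consistent with the method being an extension of that model to biased G+C content.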