Phylogenetic invariants

Phylogenetic invariants[1] are polynomial relationships between the frequencies of various site patterns in an idealized DNA multiple sequence alignment.

They have received substantial study in the field of biomathematics, and they can be used to choose among phylogenetic tree topologies in an empirical setting.

The primary advantage of phylogenetic invariants relative to other methods of phylogenetic estimation, such as maximum likelihood or Bayesian MCMC analyses, is that invariants can yield information about the tree without requiring the estimation of branch lengths or model parameters.

At this point the number of programs that allow empirical datasets to be analyzed using invariants is limited.

Felsenstein[4] stated it best when he said, "invariants are worth attention, not for what they do for us now, but what they might lead to in the future."

For example, there are 256 possible site patterns for four taxa (fAAAA, fAAAC, fAAAG, ... fTTTT), which can be written as a vector.
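The pattern frequency vector can be illustrated with a short Python sketch. The function name `pattern_frequencies` and the toy alignment are hypothetical, chosen only for illustration; the sketch assumes ungapped DNA and simply skips any column that contains a character other than A, C, G, or T.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
# All 4**4 = 256 site patterns for four taxa, ordered AAAA, AAAC, ..., TTTT.
PATTERNS = ["".join(p) for p in product(BASES, repeat=4)]


def pattern_frequencies(alignment):
    """Return the 256-element vector of observed site-pattern frequencies.

    `alignment` is a list of four equal-length DNA strings, one per taxon.
    Columns that are not made up purely of A, C, G, T are ignored.
    """
    if len(alignment) != 4 or len({len(seq) for seq in alignment}) != 1:
        raise ValueError("expected four sequences of equal length")
    # Each alignment column, read top to bottom, is one site pattern.
    counts = Counter("".join(column) for column in zip(*alignment))
    total = sum(counts[p] for p in PATTERNS)
    if total == 0:
        raise ValueError("no usable alignment columns")
    return [counts[p] / total for p in PATTERNS]


# Toy alignment (invented for illustration): six sites, four taxa.
toy = ["ACGTAC", "ACGTAA", "ACGAAC", "ACGAAT"]
f = pattern_frequencies(toy)
print(len(f), f[PATTERNS.index("AAAA")])
```

In this toy alignment two of the six columns are AAAA, so that entry of the vector is 2/6; most of the other 255 entries are zero, which is typical for short alignments.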

Thus, there should be polynomials involving those frequencies that take on a value of zero if the DNA sequences were generated on a specific tree given a particular substitution model.

When they are computed using the observed pattern frequencies, we will usually find that they are not precisely zero even when the model and tree topology are correct.

Some invariants are straightforward consequences of symmetries in the model of nucleotide substitution and they will take on a value of zero regardless of the underlying tree topology.

For example, if we assume the Jukes-Cantor model of sequence evolution and a four-taxon tree we expect:

This is a simple outgrowth of the fact that base frequencies are constrained to be equal under the Jukes-Cantor model.
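As an illustration (not necessarily the relationship originally displayed here), one relationship of this kind is p(AAAC) - p(AAAG) = 0, because the Jukes-Cantor model treats C and G interchangeably. The sketch below computes exact pattern probabilities on a four-taxon tree ((1,2),(3,4)) and then draws a finite sample, showing that the invariant is zero in expectation but generally not in observed data. The function names, branch lengths, sample size, and random seed are all assumptions made for the example.

```python
import numpy as np

BASES = "ACGT"


def jc_matrix(t):
    """Jukes-Cantor transition matrix for a branch of length t (substitutions per site)."""
    e = np.exp(-4.0 * t / 3.0)
    # Off-diagonal entries are 1/4 - e/4; diagonal entries are 1/4 + 3e/4.
    return np.full((4, 4), 0.25 - 0.25 * e) + np.eye(4) * e


def pattern_tensor(t1, t2, t3, t4, t_int):
    """Exact pattern probabilities p[a, b, c, d] on the tree ((1,2),(3,4)).

    u is the ancestor of taxa 1 and 2, v the ancestor of taxa 3 and 4,
    and t_int is the length of the internal branch joining u and v.
    """
    pi = np.full(4, 0.25)  # uniform base frequencies under Jukes-Cantor
    P1, P2, P3, P4, Pint = (jc_matrix(t) for t in (t1, t2, t3, t4, t_int))
    return np.einsum("u,ua,ub,uv,vc,vd->abcd", pi, P1, P2, Pint, P3, P4)


def idx(pattern):
    """Convert a pattern such as 'AAAC' into a tensor index."""
    return tuple(BASES.index(base) for base in pattern)


# Arbitrary illustrative branch lengths.
p = pattern_tensor(0.1, 0.3, 0.2, 0.4, 0.05)
exact = p[idx("AAAC")] - p[idx("AAAG")]

# Simulate a finite alignment by multinomial sampling of site patterns.
rng = np.random.default_rng(0)
n_sites = 1000
counts = rng.multinomial(n_sites, p.ravel() / p.sum()).reshape(4, 4, 4, 4)
f = counts / n_sites
observed = f[idx("AAAC")] - f[idx("AAAG")]

print("exact invariant:   ", exact)
print("observed invariant:", observed)
```

The exact value is zero up to floating-point rounding, whereas the sampled value depends on the random draw and on the number of sites.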

Symmetry invariants are non-phylogenetic in nature; they take on the expected value of zero regardless of the tree topology.

However, it is possible to determine whether a particular multiple sequence alignment fits the Jukes-Cantor model of evolution (i.e., by testing whether the site patterns of the appropriate types are present in equal numbers).
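One way to make "site patterns of the appropriate types" concrete is to group the 256 patterns into classes that are equivalent under any relabeling of the four nucleotides; under the Jukes-Cantor model every pattern within a class has the same expected frequency. The sketch below does this grouping and computes a Pearson chi-square statistic for equal counts within each class. The choice of statistic and the function names are assumptions made for illustration, not a published test, and the chi-square approximation is poor when expected counts are small.

```python
from collections import defaultdict
from itertools import product

BASES = "ACGT"


def symmetry_class(pattern):
    """Relabel bases by order of first appearance: 'AAAC' and 'GGGT' both give 'xxxy'."""
    labels = {}
    out = []
    for base in pattern:
        if base not in labels:
            labels[base] = "xyzw"[len(labels)]
        out.append(labels[base])
    return "".join(out)


def pattern_classes():
    """Map each symmetry class to the list of four-taxon patterns it contains."""
    classes = defaultdict(list)
    for combo in product(BASES, repeat=4):
        pattern = "".join(combo)
        classes[symmetry_class(pattern)].append(pattern)
    return dict(classes)


def jc_symmetry_statistic(counts):
    """Pearson chi-square statistic for equal counts within each symmetry class.

    `counts` maps four-letter patterns to observed counts (missing means 0).
    Returns (statistic, degrees_of_freedom); classes with no observations
    are skipped.
    """
    stat, dof = 0.0, 0
    for members in pattern_classes().values():
        n = sum(counts.get(p, 0) for p in members)
        if n == 0:
            continue
        expected = n / len(members)  # equal share for each pattern in the class
        stat += sum((counts.get(p, 0) - expected) ** 2 / expected for p in members)
        dof += len(members) - 1
    return stat, dof


print(len(pattern_classes()))
print(jc_symmetry_statistic({"AAAA": 10}))
```

There are 15 such classes for four taxa. A large statistic relative to its degrees of freedom would suggest that the alignment departs from the symmetries that the Jukes-Cantor model implies.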

The ability to perform tests using non-homogeneous models represents a major benefit of the invariants methods relative to the more commonly used maximum likelihood methods for phylogenetic model testing.

This can be used to construct a test based on the following invariant relationship, which holds for the two incorrect trees when sites evolve under the Kimura two-parameter model of sequence evolution:

The advantage of using Lake's invariants relative to maximum likelihood or neighbor joining of Kimura two-parameter distances is that the invariants should hold regardless of the model parameters, branch lengths, or patterns of among-sites rate heterogeneity.

A classic study by John Huelsenbeck and David Hillis[10] found that Lake's invariants converge on the true tree over all of the branch length space they examined when the Kimura two-parameter (K80) model[11] is the underlying model of evolution.

However, they also found that Lake's invariants are very inefficient (large amounts of data are necessary to converge on the correct tree).

"Squangles" (stochastic quartet tangles[14]) are another example of a modern invariants method[15] and it has been implemented in software package that is practical to be used with empirical datasets.

There are three squangles that are useful for differentiating among quartets, which can be denoted as q1(f), q2(f), and q3(f), where f is a 256-element vector containing the site pattern frequency spectrum.

Empirical tests of the squangles method have been limited[15][16] but they appear to be promising.

Another important class of modern invariants methods is based on the use of singular value decomposition (SVD) to examine the rank of matrices corresponding to flattenings of a tensor with the site pattern frequency spectrum.

\[
\mathbf{Flat_{T1}} =
\begin{bmatrix}
p_{AAAA} & p_{AAAC} & p_{AAAG} & p_{AAAT} & p_{AACA} & \cdots & p_{AATT} \\
p_{ACAA} & p_{ACAC} & p_{ACAG} & p_{ACAT} & p_{ACCA} & \cdots & p_{ACTT} \\
p_{AGAA} & p_{AGAC} & p_{AGAG} & p_{AGAT} & p_{AGCA} & \cdots & p_{AGTT} \\
p_{ATAA} & p_{ATAC} & p_{ATAG} & p_{ATAT} & p_{ATCA} & \cdots & p_{ATTT} \\
p_{CAAA} & p_{CAAC} & p_{CAAG} & p_{CAAT} & p_{CACA} & \cdots & p_{CATT} \\
\vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
p_{TTAA} & p_{TTAC} & p_{TTAG} & p_{TTAT} & p_{TTCA} & \cdots & p_{TTTT}
\end{bmatrix}
\]
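The rank idea behind flattenings can be sketched numerically with NumPy. The example below builds exact pattern probabilities under a Jukes-Cantor model on the tree ((1,2),(3,4)), forms the three 16 x 16 flattenings, and measures how far each one is from the nearest matrix of rank at most k using its singular values (the Eckart-Young result). The target rank k = 4, the Jukes-Cantor model, the branch lengths, and the function names are assumptions for illustration; this score is a generic rank-distance measure and is not claimed to be the exact formula used by ErikSVD, Erik+2, or SVDquartets.

```python
import numpy as np


def jc_matrix(t):
    """Jukes-Cantor transition matrix for a branch of length t (substitutions per site)."""
    e = np.exp(-4.0 * t / 3.0)
    return np.full((4, 4), 0.25 - 0.25 * e) + np.eye(4) * e


def pattern_tensor(t1, t2, t3, t4, t_int):
    """Exact pattern probabilities p[a, b, c, d] on the tree ((1,2),(3,4))."""
    pi = np.full(4, 0.25)
    P1, P2, P3, P4, Pint = (jc_matrix(t) for t in (t1, t2, t3, t4, t_int))
    return np.einsum("u,ua,ub,uv,vc,vd->abcd", pi, P1, P2, Pint, P3, P4)


def flattenings(p):
    """Return the three 16 x 16 flattenings of a 4 x 4 x 4 x 4 pattern tensor.

    For the split 12|34, rows are indexed by the states of taxa 1 and 2
    (AA, AC, ..., TT) and columns by the states of taxa 3 and 4.
    """
    return {
        "12|34": p.reshape(16, 16),
        "13|24": p.transpose(0, 2, 1, 3).reshape(16, 16),
        "14|23": p.transpose(0, 3, 1, 2).reshape(16, 16),
    }


def rank_score(flat, k=4):
    """Frobenius distance from `flat` to the nearest matrix of rank <= k."""
    singular_values = np.linalg.svd(flat, compute_uv=False)  # sorted, largest first
    return float(np.sqrt(np.sum(singular_values[k:] ** 2)))


# Arbitrary illustrative branch lengths; the generating split is 12|34.
p = pattern_tensor(0.1, 0.3, 0.2, 0.4, 0.2)
scores = {split: rank_score(flat) for split, flat in flattenings(p).items()}
for split, score in scores.items():
    print(split, score)
```

With exact probabilities the flattening that matches the generating split has a score of essentially zero, while the other two are clearly positive. With observed frequencies from a finite alignment all three scores are positive, and the split with the smallest score would be the preferred one under this criterion.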

However, performance of the original Eriksson method (ErikSVD) was mixed when it was compared to neighbor-joining and the maximum likelihood approach implemented in the PHYLIP program dnaml.

ErikSVD appeared to perform better than dnaml when applied to an empirical mammalian dataset based on an early release of data from the ENCODE project but it underperformed the other phylogenetic methods when it was used with simulated data.

Fernández-Sánchez and Casanellas[18] proposed a normalization (Erik+2) that improved the original ErikSVD method.

ErikSVD is statistically consistent given the general Markov model for sequence evolution (it converges on the true tree as the amount of data increases).

ErikSVD and Erik+2 have been implemented in the software package PAUP* as part of the SVDquartets method.[19]

Scores for the alternative topologies are calculated using the singular values, as shown below:

The singular values provide information about the rank of the flattening matrix; if the sequences were generated on a single tree topology then