Least squares inference in phylogeny

Least squares inference in phylogeny generates a phylogenetic tree based on an observed matrix of pairwise genetic distances and optionally a weight matrix.

over a phylogenetic tree (i.e. the sum of the branch lengths in the path from leaf

It involves searching the discrete space of unrooted binary tree topologies whose size is exponential in the number of leaves.

The evaluation of S for a given topology (which includes the computation of the branch lengths) is a linear least squares problem.

, depending on the knowledge and assumptions about the variances of the observed distances.

When nothing is known about the errors, or if they are assumed to be independently distributed and equal for all observed distances, then all the weights

In the weighted least squares case the errors are assumed to be independent (or their correlations are not known).

Given independent errors, a particular weight should ideally be set to the inverse of the variance of the corresponding distance estimate.

In the Fitch and Margoliash method [1] for instance it is assumed that the variances are proportional to the squared distances.

The ordinary and weighted least squares methods described above assume independent distance estimates.

If the distances are derived from genomic data their estimates covary, because evolutionary events on internal branches (of the true tree) can push several distances up or down at the same time.

The resulting covariances can be taken into account using the method of generalized least squares, i.e. minimizing the following quantity where

are the entries of the inverse of the covariance matrix of the distance estimates.

Finding the tree and branch lengths minimizing the least squares residual is an NP-complete problem.