Representative sequences

When the diversity of the sequences is large, a single representative is often insufficient to efficiently characterize the set.

Another solution is to select the medoids of relative frequency groups.

More specifically, the method consists of sorting the sequences (for example, according to the first principal coordinate of the pairwise dissimilarity matrix), splitting the sorted list into equal-sized groups (called relative frequency groups), and selecting the medoid of each group.
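The three steps above (sort, split, take medoids) can be sketched in plain Python. This is an illustrative sketch only, not the TraMineR implementation: the function names, the use of power iteration to approximate the first principal coordinate, the Hamming dissimilarity, and the toy sequences are all assumptions made for the example.

```python
def first_principal_coordinate(dist, iterations=200):
    """Approximate the first principal coordinate (classical MDS) by power iteration.

    Assumes `dist` is a symmetric square matrix (list of lists) whose dominant
    eigenvalue after double centring is positive.
    """
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double-centred matrix B = -1/2 * J * D^2 * J.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    # Only the ordering of the entries is used below, so no rescaling is needed.
    return v


def medoid(indices, dist):
    """Return the index whose summed dissimilarity to the group is smallest."""
    return min(indices, key=lambda i: sum(dist[i][j] for j in indices))


def relative_frequency_medoids(dist, n_groups):
    """Sort by the first principal coordinate, split into groups, pick medoids."""
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    n = len(dist)
    coord = first_principal_coordinate(dist)
    order = sorted(range(n), key=lambda i: coord[i])
    # Group sizes differ by at most one when n is not a multiple of n_groups.
    bounds = [g * n // n_groups for g in range(n_groups + 1)]
    groups = [order[bounds[g]:bounds[g + 1]] for g in range(n_groups)]
    return [medoid(group, dist) for group in groups if group]


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Toy state sequences (hypothetical data).
sequences = ["AAAB", "AABB", "ABBB", "CCCD", "CCDD", "CDDD"]
dist = [[hamming(a, b) for b in sequences] for a in sequences]
representatives = [sequences[i] for i in relative_frequency_medoids(dist, n_groups=2)]
print(representatives)
```

The sign of the principal coordinate is arbitrary, so the sorted list may come out reversed; the groups are unchanged when the number of sequences is a multiple of the number of groups, and only the order in which the representatives are returned differs.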

[5] The methods for identifying representative sequences described above have been implemented in the R package TraMineR.

Representative sequences are contiguous subsequences (typically 300 residues) from ubiquitous, conserved proteins, such that each orthologous family of representative sequences taken alone gives a distance matrix in close agreement with the consensus matrix.