Tree alignment

Sequences are arranged into a phylogenetic tree, modeling the evolutionary relationships between species or taxa.

[1] In bioinformatics, the basic method of information processing is to contrast the sequence data.

Biologists use it to discover the function, structure, and evolutionary information in biological sequences.

The following analyses are based on the sequence assembly: the phylogenetic analysis, the haplotype comparison, and the prediction of RNA structure.

Therefore, the efficiency of sequence alignment will directly affect the efficacy of solving these problems.

In order to design a rational and efficient sequence alignment, the algorithm derivation becomes an important branch of research in the field of bioinformatics.

[2] Distance method measures the minimum operation number of character insertions, deletions, and substitutions that are required to transform one sequence u to the other sequence v when being operated on a pair of strings.

The calculation of edit distance can be based on dynamic programming, and the equation is in O(|u|×|v|) time, where |u| and |v| are the lengths of u and v.[3] The efficient estimation of edit distance is essential as Distance method is a basic principle in computational biology[4] For functions of hereditary properties "symmetrization" can be used.

Finding an optimal edit distance function is essential for the tree alignment problem.

Tree alignment results in a NP-hard problem, where scoring modes and alphabet sizes are restricted.

However, there is an exponential relationship between its efficiency and the number sequences, which means that when the length of the sequence is very large, the computation time required to get results is enormously long.

However, whatever the degree of multiple-sequence similarity is, the time complexity of star alignment has a proportional relationship with the square of the sequence number and the square of the sequence average length.

Therefore, the challenge of reducing the time complexity to linear is one of the core issues in tree alignment.

Combinatorial optimization is a good strategy to solve MSA problems.

The aim of tree alignment is to find an assigned sequence, which can obtain a maximum score, and get the final matching result from the evolutionary tree and its nodes' assigned sequence.

The Keyword Tree Theory and the Aho-Corasick search algorithm is an efficient approach to solve the pairwise sequence alignment problem.

The aim of combining the keyword tree theory and the Aho-Corasick search algorithm is to solve this kind of problem: for a given long string

(2): Any two edges separated from the same node are to correspond to different letters.

For each leaf node of this K tree, it corresponds to one of the certain patterns of set

Establishing a failure link is the key to improve the time complexity of the Aho-Corasick algorithm.

Because this suffix matches the STRING beginning with the root node (similar to prefix), the

After establishing all failure links in the keyword tree, the Aho-Corasick search algorithm is used to find the locations of all

In MSA, DNA, RNA, and proteins, sequences are usually generated and they are assumed to have an evolutionary relationship.

By comparing generated maps of RNA, DNA, and sequences from evolutionary families, people can assess conservation of proteins and find functional gene domains by comparing differences between evolutionary sequences.

Generally, heuristic algorithms rely on the iterative strategy, which is to say that based on a comparison method, optimizing the results of multiple sequence alignment by the iterative process.

Davie M. proposed using the particle swarm optimization algorithm to solve the multiple sequence alignment problem; Ikeda Takahiro proposed a heuristic algorithm which is based on the A* search algorithm; E. Birney first proposed using the hidden Markov model to solve the multiple sequence alignment problem; and many other biologists use the genetic algorithm to solve it.

[8][9] All these algorithms generally are robust and insensitive to the number of sequences, but they also have shortcomings.

[clarification needed] Roughly, tree alignment graph aims to align trees into a graph and finally synthesize them to develop statistics.

In biology, tree alignment graphs (TAGs) are used to remove the evolutionary conflicts or overlapping taxa from sets of trees and can then be queried to explore uncertainty and conflict.

By integrating methods of aligning, synthesizing and analyzing, the TAG aims to solve the conflicting relationships and partial overlapping taxon sets obtained from a wide range of sequences.

This is a simple Sequence Alignment of Insulin gene between rat, human and chicken. The labeled nucleotides are the different nucleotides with rats I and --- means the missing nucleotides
This figure indicates the growth rate about the exponential time, the polynomial time and the linear time