Alignment-free sequence analysis

[1] The emergence and need for the analysis of different types of data generated through biological research has given rise to the field of bioinformatics.

The size of this sequence data poses challenges on alignment-based algorithms in their assembly, annotation and comparative studies.

[1][10][11][12][13][14][15] The AFproject is an international collaboration to benchmark and compare software tools for alignment-free sequence comparison.

The pair wise distance between two sequences is then calculated Jensen–Shannon (JS) divergence between their respective FFPs.

The distance matrix thus obtained can be used to construct phylogenetic tree using clustering algorithms like neighbor-joining, UPGMA etc.

The normalized frequencies are put a fixed order to form the composition vector (CV) of a given sequence.

The distance matrix thus obtained can be used to construct phylogenetic tree using clustering algorithms like neighbor-joining, UPGMA etc.

Thus the occurrence of each k-mer in a sequence is calculated in the form of RTD, which is then summarised using two statistical parameters mean (μ) and standard deviation (σ).

Thus each sequence is represented in the form of numeric vector of size 2⋅4k containing μ and σ of 4k RTDs.

The distance matrix thus obtained can be used to construct phylogenetic tree using clustering algorithms like neighbor-joining, UPGMA etc.

[23] Note that the pre-defined pattern can be selected by analysis of the Variance of the number of matches,[28] the probability of the first occurrence on several models,[29] or the Pearson correlation coefficient between the expected word frequency and the true alignment distance.

[36] This approach is closely related to the ACS, which calculates the number of substitutions per site between two DNA sequences using the shortest absent substring (termed as shustring).

[37] This approach uses the program kmacs[36] to calculate longest common substrings with up to k mismatches for a pair of DNA sequences.

The phylogenetic distance between the sequences can then be estimated from a local maximum in the length distribution of the k-mismatch common substrings.

[39] This is an extremely fast method that uses the MinHash bottom sketch strategy for estimating the Jaccard index of the multi-sets of

[40] This approach calculates a distance value between two protein sequences based on the decay of the number of

In contrast to MASH, the program is still accurate for low sequencing coverage, so it can be used for genome skimming.

[44] andi estimates phylogenetic distances between genomic sequences based on ungapped local alignments that are flanked by maximal exact word matches.

The gapfree alignments between the exact word matches are then used to estimate phylogenetic distances between genome sequences.

[46] FSWM has been adapted to estimate distances based on unassembled NGS reads, this version of the program is called Read-SpaM.

[47] Prot-SpaM (Proteome-based Spaced-word Matches) is an implementation of the FSWM algorithm for partial or whole proteome sequences.

[48] Multi-SpaM (MultipleSpaced-word Matches) is an approach to genome-based phylogeny reconstruction that extends the FSWM idea to multiple sequence comparison.

Information Theory has provided successful methods for alignment-free sequence analysis and comparison.

The existing applications of information theory include global and local characterization of DNA, RNA and proteins, estimating genome entropy to motif and region classification.

[51] Base–base correlation (BBC) converts the genome sequence into a unique 16-dimensional numeric vector using the following equation, The

IC and PIC were calculated using following formulas, The final vector is obtained as follows: which defines the range of distance between bases.

The distance matrix thus obtained can be used to construct phylogenetic tree using clustering algorithms like neighbor-joining, UPGMA, etc..

The theoretic basis for the Kolmogorov complexity approach was laid by Bennett, Gacs, Li, Vitanyi, and Zurek (1998) by proposing the information distance.

[62] This objection was overruled by the end of that decade when the opposite was found to be the case – that CGR bijectively maps Markov transition is into a fractal, order-free (degree-free) representation.

[64] A number of web apps such as https://github.com/usm/usm.github.com/wiki,[65] are available to demonstrate how to encode and compare arbitrary symbolic sequences in a manner that takes full advantage of modern MapReduce distribution developed for cloud computing.