Longest common subsequence

It is also widely used by revision control systems such as Git for reconciling multiple changes made to a revision-controlled collection of files.

For the general case of an arbitrary number of input sequences, the problem is NP-hard.

[1] When the number of sequences is constant, the problem is solvable in polynomial time by dynamic programming.

[2] For an arbitrary number of input sequences, the dynamic programming approach gives a solution in There exist methods with lower complexity,[3] which often depend on the length of the LCS, the size of the alphabet, or both.

[4] The LCS problem has an optimal substructure: the problem can be broken down into smaller, simpler subproblems, which can, in turn, be broken down into simpler subproblems, and so on, until, finally, the solution becomes trivial.

This allows one to simplify the LCS computation for two sequences ending in the same symbol.

The sequence above is empty; the one to the left contains one element, G. Selecting the longest of these, LCS(R1, C3) is (G).

For LCS(R2, C4), A matches A, which is appended to the upper left cell, giving (GA).

The final result is that the last cell contains all the longest subsequences common to (AGCAT) and (GAC); these are (AC), (GC), and (GA).

The table also shows the longest common subsequences for every possible pair of prefixes.

The actual subsequences are deduced in a "traceback" procedure that follows the arrows backwards, starting from the last cell in the table.

Below is the table for such an analysis, with numbers colored in cells where the length is about to decrease.

, the length of the shortest common supersequence is related to the length of the LCS by[3] The edit distance when only insertion and deletion is allowed (no substitution), or when the cost of the substitution is the double of the cost of an insertion or deletion, is: The function below takes as input sequences X[1..m] and Y[1..n], computes the LCS between X[1..i] and Y[1..j] for all 1 ≤ i ≤ m and 1 ≤ j ≤ n, and stores it in C[i,j].

The table C shown below, which is generated by the function LCSLength, shows the lengths of the longest common subsequences between prefixes of

The highlighted numbers show the path the function backtrack would follow from the bottom right to the top left corner, when reading out an LCS.

The C matrix in the naive algorithm grows quadratically with the lengths of the sequences.

In most real-world cases, especially source code diffs and patches, the beginnings and ends of files rarely change, and almost certainly not both at the same time.

In the best-case scenario, a sequence with no changes, this optimization would eliminate the need for the C matrix.

In the worst-case scenario, a change to the very first and last items in the sequence, only two additional comparisons are performed.

Most of the time taken by the naive algorithm is spent performing comparisons between items in the sequences.

Additionally, the randomized nature of hashes and checksums would guarantee that comparisons would short-circuit faster, as lines of source code will rarely be changed at the beginning.

A cryptographic hash would therefore be far better suited for this optimization, as its entropy is going to be significantly greater than that of a simple checksum.

However, the benefits may not be worth the setup and computational requirements of a cryptographic hash for small sequence lengths.

vector as the dynamic programming approach requires only the current and previous columns of the matrix.

Hirschberg's algorithm allows the construction of the optimal sequence itself in the same quadratic time and linear space bounds.

[8] Chowdhury and Ramachandran devised a quadratic-time linear-space algorithm[9][10] for finding the LCS length along with an optimal sequence which runs faster than Hirschberg's algorithm in practice due to its superior cache performance.

Several algorithms exist that run faster than the presented dynamic programming approach.

[12] For problems with a bounded alphabet size, the Method of Four Russians can be used to reduce the running time of the dynamic programming algorithm by a logarithmic factor.

[13] Beginning with Chvátal & Sankoff (1975),[14] a number of researchers have investigated the behavior of the longest common subsequence length when the two given strings are drawn randomly from the same alphabet.

[16] Simplified mathematical models of the longest common subsequence problem have been shown to be controlled by the Tracy–Widom distribution.