Needleman–Wunsch algorithm

It was one of the first applications of dynamic programming to compare biological sequences.

The algorithm was developed by Saul B. Needleman and Christian D. Wunsch and published in 1970.

[2] It is also sometimes referred to as the optimal matching algorithm and the global alignment technique.

Fill out the rest of the column and row headers as in Figure 1.

For now, the system used by Needleman and Wunsch[1][failed verification] will be used: For the Example above, the score of the alignment would be 0: Start with a zero in the first row, first column (not including the cells containing nucleotides).

For both X and Y, the highest score is zero: The highest candidate score may be reached by two of the neighboring cells: In this case, all directions reaching the highest candidate score must be noted as possible origin cells in the finished diagram in figure 1, e.g. in the cell in row and column 6.

From this path, the sequence is constructed by these rules: Following these rules, the steps for one possible alignment candidate in figure 1 are: The simplest scoring schemes simply give a value for each match, mismatch and indel.

Different scoring systems can be devised for different situations, for example if gaps are considered very bad for your alignment you may use a scoring system that penalises gaps heavily, such as:

More complicated scoring systems attribute values not only for the type of alteration, but also for the letters that are involved.

In order to represent all the possible combinations of letters and their resulting scores a similarity matrix is used.

Hence this represents all possible matches and mismatches (for an alphabet of ACGT).

If implementing the T-T = 4 rule from above the following similarity matrix is produced: Different scoring matrices have been statistically constructed which give weight to different actions appropriate to a particular scenario.

Having weighted scoring matrices is particularly important in protein sequence alignment due to the varying frequency of the different amino acids.

There are two broad families of scoring matrices, each with further alterations for specific scenarios: When aligning sequences there are often gaps (i.e. indels), sometimes large ones.

It uses a linear gap penalty, here called d. For example, if the similarity matrix was then the alignment: with a gap penalty of −5, would have the following score: To find the alignment with the highest score, a two-dimensional array (or matrix) F is allocated.

To compute an alignment that actually gives this score, you start from the bottom right cell, and compare the value with the three possible sources (Match, Insert, and Delete above) to see which it came from.

(In general, more than one choice may have the same value, leading to alternative optimal alignments.)

[3] The original purpose of the algorithm described by Needleman and Wunsch was to find similarities in the amino acid sequences of two proteins.

[1] Needleman and Wunsch describe their algorithm explicitly for the case when the alignment is penalized solely by the matches and mismatches, and gaps have no penalty (d=0).

The corresponding dynamic programming algorithm takes cubic time.

The penalty factor could be a function of the size and/or direction of the gap.

[page 444] A better dynamic programming algorithm with quadratic running time for the same problem (no gap penalty) was introduced later[5] by David Sankoff in 1972.

Similar quadratic-time algorithms were discovered independently by T. K. Vintsyuk[6] in 1968 for speech processing ("time warping"), and by Robert A. Wagner and Michael J. Fischer[7] in 1974 for string matching.

Needleman and Wunsch formulated their problem in terms of maximizing similarity.

Another possibility is to minimize the edit distance between sequences, introduced by Vladimir Levenshtein.

Recent development has focused on improving the time and space cost of the algorithm while maintaining quality.

When images have been rectified, an analogy can be drawn between aligning nucleotide and protein sequences and matching pixels belonging to scan lines, since both tasks aim at establishing optimal correspondence between two strings of characters.

Although in many applications image rectification can be performed, e.g. by camera resectioning or calibration, it is sometimes impossible or impractical since the computational cost of accurate rectification models prohibit their usage in real-time applications.

Moreover, none of these models is suitable when a camera lens displays unexpected distortions, such as those generated by raindrops, weatherproof covers or dust.

Experiments demonstrated that such extension allows dense pixel matching between unrectified or distorted images.