Edit distance

Levenshtein distance operations are the removal, insertion, or substitution of a character in the string.

A more general definition associates non-negative weight functions wins(x), wdel(x) and wsub(x, y) with the operations.

Damerau–Levenshtein distance counts as a single edit a common mistake: transposition of two adjacent characters, formally characterized by an operation that changes uxyv into uyxv.

[3][4] For the task of correcting OCR output, merge and split operations have been used which replace a single character into a pair of them or vice versa.

[1]: 37  Similarly, by only allowing substitutions (again at unit cost), Hamming distance is obtained; this must be restricted to equal-length strings.

A minimal edit script that transforms the former into the latter is: LCS distance (insertions and deletions only) gives a different distance and minimal edit script: for a total cost/distance of 5 operations.

, defined by the recurrence[2] This algorithm can be generalized to handle transpositions by adding another term in the recursive clause's minimization.

Therefore, it is usually computed using a dynamic programming algorithm that is commonly credited to Wagner and Fischer,[7] although it has a history of multiple invention.

When the full dynamic programming table is constructed, its space complexity is also Θ(mn); this can be improved to Θ(min(m,n)) by observing that at any instant, the algorithm only requires two rows (or two columns) in memory.

However, this optimization makes it impossible to read off the minimal series of edit operations.

[8]: 634  A general recursive divide-and-conquer framework for solving such recurrences and extracting an optimal sequence of operations cache-efficiently in space linear in the size of the input is given by Chowdhury, Le, and Ramachandran.

It achieves this by only computing and storing a part of the dynamic programming table around its diagonal.

[3] Further improvements by Landau, Myers, and Schmidt [1] give an O(s2 + max(m,n)) time algorithm.

[11] For a finite alphabet and edit costs which are multiples of each other, the fastest known exact algorithm is of Masek and Paterson[12] having worst case runtime of O(nm/logn).

Edit distance finds applications in computational biology and natural language processing, e.g. the correction of spelling mistakes or OCR errors, and approximate string matching, where the objective is to find matches for short strings in many longer texts, in situations where a small number of differences is to be expected.

[16] Language edit distance has found many diverse applications, such as RNA folding, error correction, and solutions to the Optimum Stack Generation problem.