Approximate string matching

This number is called the edit distance between the string and the pattern.

The usual primitive operations are:[1] These three operations may be generalized as forms of substitution by adding a NULL character (here symbolized by *) wherever a character has been deleted or inserted: Some approximate matchers also treat transposition, in which the positions of two letters in the string are swapped, to be a primitive operation.

Some matchers use a single global unweighted cost, that is, the total number of primitive operations necessary to convert the match to the pattern.

Some matchers permit separate assignments of limits and weights to individual groups in the pattern.

If the field we arrived at was E(0, y1), then T[y1 + 1] ... T[y2] is a substring of T with the minimal edit distance to the pattern P. Computing the E(x, y) array takes O(mn) time with the dynamic programming algorithm, while the backwards-working phase takes O(n + m) time.

Most of them are designed to fit some framework (such as Map-Reduce) to compute concurrently.

Traditionally, approximate string matching algorithms are classified into two categories: online and offline.

Early algorithms for online approximate matching were suggested by Wagner and Fischer[3] and by Sellers.

[2] Both algorithms are based on dynamic programming but solve different problems.

Sellers' algorithm searches approximately for a substring in a text while the algorithm of Wagner and Fischer calculates Levenshtein distance, being appropriate for dictionary fuzzy search only.

The bitap algorithm is the heart of the Unix searching utility agrep.

[4] Although very fast online techniques exist, their performance on large data is disfavored.

Text preprocessing or indexing makes searching dramatically faster.

[7][8] A detailed survey of indexing techniques that allows one to find an arbitrary substring in a text is given by Navarro et al.[7] A computational survey of dictionary methods (i.e., methods that permit finding all dictionary words that approximately match a search pattern) is given by Boytsov.

[9] Common applications of approximate matching include spell checking.

[5] With the availability of large amounts of DNA data, matching of nucleotide sequences has become an important application.

String matching cannot be used for most binary data, such as images and music.