In computer science, the Knuth–Morris–Pratt algorithm (or KMP algorithm) is a string-searching algorithm that searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.
The algorithm was conceived by James H. Morris and independently discovered by Donald Knuth "a few weeks later" from automata theory.
[1] Independently, in 1969, Matiyasevich[4][5] discovered a similar algorithm, coded by a two-dimensional Turing machine, while studying a string-pattern-matching recognition problem over a binary alphabet.
[6] A string-matching algorithm wants to find the starting index m in string S[] that matches the search word W[].
At each position m the algorithm first checks for equality of the first character in the word being searched, i.e. S[m] =?
The algorithm retrieves the character W[i] in the word being searched and checks for equality of the expression S[m+i] =?
If the index m reaches the end of the string then there is no match, in which case the search is said to "fail".
If the strings are uniformly distributed random letters, then the chance that characters match is 1 in 26.
In most cases, the trial check will reject the match at the initial letter.
So if the characters are random, then the expected complexity of searching string S[] of length n is on the order of n comparisons or Θ(n).
The difference is that KMP makes use of previous match information that the straightforward algorithm does not.
KMP maintains its knowledge in the precomputed table and two state variables.
Assuming the prior existence of the table T, the search portion of the Knuth–Morris–Pratt algorithm has complexity O(n), where n is the length of S and the O is big-O notation.
Except for the fixed overhead incurred in entering and exiting the function, all the computations are performed in the while loop.
At the same time, the second branch leaves m + i unchanged, for m gets i - T[i] added to it, and immediately after T[i] gets assigned as the new value of i, hence new_m + new_i = old_m + old_i - T[old_i] + T[old_i] = old_m + old_i.
Upon failure, that is, the word and the text do not match at the positions (W[i] ≠ S[p+i]), the text pointer is kept still, while the word pointer is rolled back a certain amount (i = T[i], where T is the jump table), and we attempt to match W[T[i]] with S[p+i].
The key observation about the nature of a linear search that allows this to happen is that in having checked some segment of the main string against an initial segment of the pattern, we know exactly at which places a new potential match which could continue to the current position could begin prior to the current position.
In other words, we "pre-search" the pattern itself and compile a list of all possible fallback positions that bypass a maximum of hopeless characters while not sacrificing any potential matches in doing so.
We want to be able to look up, for each position in W, the length of the longest possible initial segment of W leading up to (but not including) that position, other than the full segment starting at W[0] that just failed to match; this is how far we have to backtrack in finding the next match.
Since a mismatch at the very start of the pattern is a special case (there is no possibility of backtracking), we set T[0] = -1, as discussed below.
We will see that it follows much the same pattern as the main search, and is efficient for similar reasons.
Continuing to T[3], we first check the proper suffix of length 1, and as in the previous case it fails.
The same logic shows that the longest substring we need to consider has length 1, and as in the previous case it fails since "D" is not a prefix of W. But instead of setting T[4] = 0, we can do better by noting that W[4] = W[0], and also that a look-up of T[4] implies the corresponding S character, S[m+4], was a mismatch and therefore S[m+4] ≠ 'A'.
W[5] itself extends the prefix match begun with W[4], and we can assume that the corresponding character in S, S[m+5] ≠ 'B'.
The principle is that of the overall search: most of the work was already done in getting to the current position, so very little needs to be done in leaving it.
The only minor complication is that the logic which is correct late in the string erroneously gives non-proper substrings at the beginning.
These complexities are the same, no matter how many repetitive patterns are in W or S. A real-time version of KMP can be implemented using a separate failure function table for each character in the alphabet.
Booth's algorithm uses a modified version of the KMP preprocessing function to find the lexicographically minimal string rotation.
He presented them as constructions for a Turing machine with a two-dimensional working memory.