Boyer–Moore string-search algorithm

[2] The original paper contained static tables for computing the pattern shifts without an explanation of how to produce them.

Because the algorithm preprocesses the pattern rather than the text, it is well suited for applications in which the pattern is much shorter than the text or persists across multiple searches.

The Boyer–Moore algorithm uses information gathered during the preprocessing step to skip sections of the text, resulting in a lower constant factor than many other string-search algorithms.

In general, the algorithm runs faster as the pattern length increases.

The key features of the algorithm are that it matches on the tail of the pattern rather than the head, and that it skips along the text in jumps of multiple characters rather than examining every single character in the text.

The Boyer–Moore algorithm searches for occurrences of the pattern P in the text T by performing explicit character comparisons at different alignments.

If the characters do not match, there is no need to continue searching backwards along the text.

If the character in the text is in the pattern, then the pattern is shifted part of the way along the text to line up with the matching character, and the process is repeated.

Jumping along the text to make comparisons rather than checking every character in the text decreases the number of comparisons that have to be made, which is the key to the efficiency of the algorithm.

The strings are matched from the end of P to the start of P. The comparisons continue until either the beginning of P is reached (which means there is a match) or a mismatch occurs, upon which the alignment is shifted forward (to the right) by the maximum value permitted by a number of rules.
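The comparison loop described above can be sketched as follows. This is a minimal illustration, not the original paper's formulation: the names `compare_right_to_left`, `search`, and the `rules` parameter are hypothetical, and each rule is assumed to be a callable that proposes a shift for the current alignment.

```python
def compare_right_to_left(text, pattern, start):
    """Compare pattern against text[start:start+len(pattern)] from the end.

    Returns the pattern index of the first mismatch, or -1 on a full match.
    """
    j = len(pattern) - 1
    while j >= 0 and text[start + j] == pattern[j]:
        j -= 1
    return j


def search(text, pattern, rules=()):
    """Return all alignments at which pattern occurs in text.

    `rules` is a hypothetical sequence of callables rule(text, start, j),
    each proposing a shift; the largest proposal is used, never less than 1.
    """
    matches = []
    start = 0
    while start <= len(text) - len(pattern):
        j = compare_right_to_left(text, pattern, start)
        if j < 0:
            matches.append(start)
        # With no rules supplied this degenerates to a shift of 1 per step.
        start += max([1] + [rule(text, start, j) for rule in rules])
    return matches
```

With an empty `rules` sequence the loop behaves like a naive search that happens to compare from the right; the speed-up comes entirely from the shift rules plugged into it.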

The bad-character rule considers the character in T at which the comparison process failed (assuming such a failure occurred).
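As a rough sketch of the bad-character rule, the code below uses a single table that records the rightmost position of each character in the pattern. This is a common simplification rather than the only possible table layout, and the function names are assumptions made for illustration.

```python
def build_last_occurrence(pattern):
    """Map each character to the index of its rightmost occurrence in pattern."""
    return {ch: i for i, ch in enumerate(pattern)}


def bad_character_shift(last, j, text_char):
    """Shift proposed when pattern[j] mismatches text_char.

    Aligns the rightmost occurrence of text_char in the pattern with the
    mismatched text position; falls back to 1 if that would move backwards.
    """
    return max(1, j - last.get(text_char, -1))


def search_bad_character(text, pattern):
    """Find all occurrences using only the (simplified) bad-character rule."""
    m, n = len(pattern), len(text)
    if m == 0:
        return []
    last = build_last_occurrence(pattern)
    matches = []
    start = 0
    while start <= n - m:
        j = m - 1
        while j >= 0 and text[start + j] == pattern[j]:
            j -= 1
        if j < 0:
            matches.append(start)
            start += 1  # conservative shift after a full match
        else:
            start += bad_character_shift(last, j, text[start + j])
    return matches
```

When the mismatched text character does not occur in the pattern at all, the lookup falls back to -1 and the pattern moves entirely past that character.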

⁠ space, assuming a finite alphabet of length k. The C and Java implementations below have a ⁠

The good-suffix rule requires two tables: one for use in the general case (where a copy t′ is found), and another for use when the general case returns no meaningful result.

Since there are plenty of letters in the pattern that are also not N, we have minimal information here; shifting by 1 is the least interesting result.

That means no part of the good suffix can be useful to us, so we shift by the full pattern length, 8.
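The good-suffix shifts can be illustrated with a brute-force sketch that follows the definition directly instead of building the two tables efficiently. It is written for clarity, not speed, and folds both cases into one result: for each mismatch position it looks for the smallest shift at which the already-matched suffix still lines up with the pattern (or runs off its left end) and the character before it differs. The function name and the eight-letter example pattern ANPANMAN are assumptions chosen to reproduce the shifts of 1 and 8 mentioned above.

```python
def good_suffix_shifts(pattern):
    """Return shifts[j]: the good-suffix shift when the mismatch is at pattern[j].

    Brute-force, roughly cubic in the pattern length; a sketch of the
    definition, not the linear-time preprocessing used in practice.
    """
    m = len(pattern)
    shifts = []
    for j in range(m):  # pattern[j+1:] matched, pattern[j] mismatched
        for s in range(1, m + 1):
            # The matched suffix must agree with the pattern shifted by s,
            # wherever the shifted pattern still overlaps it.
            suffix_ok = all(
                k - s < 0 or pattern[k - s] == pattern[k]
                for k in range(j + 1, m)
            )
            # The character now facing the mismatch must differ from the
            # one that just failed, otherwise it would fail again.
            differs = j - s < 0 or pattern[j - s] != pattern[j]
            if suffix_ok and differs:
                shifts.append(s)
                break  # s == m always qualifies, so the loop always ends here
    return shifts


# Example pattern (an assumption): mismatch at the final N gives shift 1,
# mismatch just before it (suffix "N" matched) gives the full length 8.
print(good_suffix_shifts("ANPANMAN"))
```

For this pattern, a mismatch after matching the suffix "AN" yields a shift of 3, because "AN" reappears three positions to the left preceded by a different character.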

[6] A simple but important optimization of Boyer–Moore was put forth by Zvi Galil in 1979.

[7] As opposed to shifting, the Galil rule deals with speeding up the actual comparisons done at each alignment by skipping sections that are known to match.

Suppose that at an alignment k1, P is compared with T down to character c of T. Then if P is shifted to k2 such that its left end is between c and k1, in the next comparison phase a prefix of P must match the substring T[(k2 - n)..k1].

Thus if the comparisons get down to position k1 of T, an occurrence of P can be recorded without explicitly comparing past k1.

In addition to increasing the efficiency of Boyer–Moore, the Galil rule is required for proving linear-time execution in the worst case.
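The effect of the Galil rule can be sketched in a simplified setting. The code below is an illustration under stated assumptions, not the full algorithm: it uses only a last-occurrence bad-character shift on mismatches, shifts by the pattern's smallest period after a full match, and applies the Galil idea only in that second case, where a prefix of the pattern is then known to match. All function and parameter names are hypothetical, and the comparison counter exists only to make the saving visible.

```python
def smallest_period(pattern):
    """Smallest p >= 1 such that pattern[k] == pattern[k - p] for all k >= p."""
    m = len(pattern)
    for p in range(1, m + 1):
        if all(pattern[k] == pattern[k - p] for k in range(p, m)):
            return p
    return m


def search_counting(text, pattern, use_galil=True):
    """Return (match positions, number of character comparisons)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return [], 0
    last = {ch: i for i, ch in enumerate(pattern)}
    period = smallest_period(pattern)
    matches, comparisons = [], 0
    start = 0
    known = 0  # length of the pattern prefix already known to match
    while start <= n - m:
        j = m - 1
        while j >= known:
            comparisons += 1
            if text[start + j] != pattern[j]:
                break
            j -= 1
        if j < known:
            # Everything down to the known prefix matched: an occurrence.
            matches.append(start)
            start += period
            # After shifting by the period, the first m - period characters
            # overlap text that was just matched, so they can be skipped.
            known = m - period if use_galil else 0
        else:
            start += max(1, j - last.get(text[start + j], -1))
            known = 0
    return matches, comparisons
```

For a text of six identical characters and a pattern of three, `search_counting("aaaaaa", "aaa")` finds the four occurrences with 6 comparisons, whereas passing `use_galil=False` needs 12 because every alignment is re-compared in full.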

A generalized version for dealing with submatches was reported in 1985 as the Apostolico–Giancarlo algorithm.

[8] The Boyer–Moore algorithm as presented in the original paper has a worst-case running time of O(n + m) only if the pattern does not appear in the text.

This was first proved by Knuth, Morris, and Pratt in 1977,[3] followed by Guibas and Odlyzko in 1980[9] with an upper bound of 5n comparisons in the worst case.

Richard Cole gave a proof with an upper bound of 3n comparisons in the worst case in 1991.

[10] When the pattern does occur in the text, the running time of the original algorithm is O(nm) in the worst case.

This is easy to see when both pattern and text consist solely of the same repeated character.

However, inclusion of the Galil rule results in linear runtime across all cases.

The D programming language uses a BoyerMooreFinder for predicate-based matching within ranges as part of the Phobos runtime library.

The Apostolico–Giancarlo algorithm speeds up the process of checking whether a match has occurred at the given alignment by skipping explicit character comparisons.

Storing suffix match lengths requires an additional table equal in size to the text being searched.