FM-index

In computer science, an FM-index is a compressed full-text substring index based on the Burrows–Wheeler transform, with some similarities to the suffix array.

It was created by Paolo Ferragina and Giovanni Manzini,[1] who describe it as an opportunistic data structure as it allows compression of the input text while still permitting fast substring queries.

The query time, as well as the required storage space, has a sublinear complexity with respect to the size of the input data.

[3] A further improvement, the alphabet-friendly FM-index, combines the use of compression boosting and wavelet trees[4] to significantly reduce the space usage for large alphabets.

[5] Using an index is a common strategy to efficiently search a large body of text.

The BWT in itself allows for some compression with, for instance, move to front and Huffman encoding, but the transform has even more uses.

For any row i of the matrix, the character in the last column L[i] precedes the character in the first column F[i] also in T. Finally, if L[i] = T[k], then L[LF(i)] = T[k - 1], and using the equality it is possible to extract a string of T from L. The FM-index itself is a compression of the string L together with C and Occ in some form, as well as information that maps a selection of indices in L to positions in the original string T. The operation count takes a pattern P[1..p] and returns the number of occurrences of that pattern in the original text T. Since the rows of matrix M are sorted, and it contains every suffix of T, the occurrences of pattern P will be next to each other in a single continuous range.

[1] FM index with backtracking has been successfully (>2000 citations) applied to approximate string matching/sequence alignment, See Bowtie http://bowtie-bio.sourceforge.net/index.shtml