Burrows–Wheeler transform

More importantly, the transformation is reversible, without needing to store any additional data except the position of the first original character.

The BWT is thus a "free" method of improving the efficiency of text compression algorithms, costing only some extra computation.

The output of the encoding phase is the last column L = BNN^AA$A after step 3, together with the 0-based index I of the row containing the original string S; in this case, I = 6.
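The encoding phase can be sketched with a naive rotate-and-sort implementation. Note that Python compares strings by code point, under which '$' sorts before the letters, so L and I below differ from the example's table, which sorts '$' last; the procedure is the same.

```python
# A naive sketch of the encoding phase: sort all rotations of s, keep
# the last column L and the row index I of the original string.
# Note: Python compares by code point ('$' < 'A'), so the result
# differs from the article's example table, which sorts '$' after '^'.

def bwt_encode(s: str):
    table = sorted(s[i:] + s[:i] for i in range(len(s)))
    last_column = "".join(row[-1] for row in table)
    return last_column, table.index(s)

L, I = bwt_encode("^BANANA$")
print(L, I)  # ANNB^AA$ 7
```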

To understand why this creates more-easily-compressible data, consider transforming a long English text frequently containing the word "the".

The success of this transform therefore depends on one value having a high probability of occurring before a given sequence; in general, it needs fairly long samples (a few kilobytes at least) of appropriate data (such as text).
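This clustering can be seen even on a toy input. The sketch below (a naive rotate-and-sort transform, with an assumed end marker) shows that the characters preceding each "he" (all of them 't') end up adjacent in the output.

```python
# A toy illustration of the clustering effect, using the naive
# rotate-and-sort transform; real inputs would be much longer.

def bwt(s: str) -> str:
    s += "\x03"  # assumed unique end marker so rotations are distinct
    table = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in table)

text = "the dog and the cat and the bird"
out = bwt(text)
# Every rotation beginning with "he" is preceded by 't', and those
# rotations sort next to each other, so the 't's become a run.
print("ttt" in out)  # True
```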

The remarkable thing about the BWT is not that it generates a more easily encoded output—an ordinary sort would do that—but that it does this reversibly, allowing the original document to be re-generated from the last column data.

Reversing the example above works by repeatedly prepending the transformed string to an (initially empty) table as a new left column and re-sorting the rows; after as many rounds as there are characters, row I of the rebuilt table holds the original string. A number of optimizations can make these algorithms run more efficiently without changing the output.
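Before any such optimizations, the naive insert-and-sort reversal can be sketched as follows. The values of L and I here are computed under Python's code-point ordering, so they differ from the example's table, but the procedure is identical.

```python
# A naive sketch of the reversal: prepend L as a new left column,
# sort the rows, and repeat once per character; the rebuilt table is
# the sorted rotation table, and row I is the original string.

def bwt_decode(L: str, I: int) -> str:
    rows = [""] * len(L)
    for _ in range(len(L)):
        rows = sorted(L[i] + rows[i] for i in range(len(L)))
    return rows[I]

# Values produced by encoding "^BANANA$" under Python's ordering.
original = bwt_decode("ANNB^AA$", 7)
print(original)  # ^BANANA$
```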

In the encoder, each row of the table can be represented by a single pointer into the strings, and the sort performed using the indices.
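A sketch of this idea in Python, sorting rotation start indices and comparing rotations character-by-character on demand rather than materializing n rotated copies of the string (illustrative only; it still performs O(n²) character comparisons in the worst case, and production implementations use suffix-array construction instead):

```python
from functools import cmp_to_key

def bwt_by_index(s: str) -> str:
    """Sort rotation start indices rather than rotated copies of s."""
    n = len(s)

    def compare(i: int, j: int) -> int:
        # Lexicographically compare the rotations starting at i and j.
        for k in range(n):
            a, b = s[(i + k) % n], s[(j + k) % n]
            if a != b:
                return -1 if a < b else 1
        return 0

    order = sorted(range(n), key=cmp_to_key(compare))
    # The last character of the rotation starting at i is s[i - 1].
    return "".join(s[i - 1] for i in order)
```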

A complete description of the algorithms can be found in Burrows and Wheeler's paper, or in a number of online sources.

Increasing the size of the alphabet (by appending the EOF character) makes later compression steps awkward.

For example, the text "^BANANA$" is transformed into "ANNBAA^$" through these steps (the $ character indicates the 'EOF' pointer in the original string).

Using the STX/ETX control codes to mark the start and end of the text, and using s[i:] + s[:i] to construct the ith rotation of s, the forward transform takes the last character of each of the sorted rows. The inverse transform repeatedly inserts r as the left column of the table and sorts the table.
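The transform pair just described can be sketched in Python, assuming "\x02" (STX) and "\x03" (ETX) as the markers:

```python
# A minimal sketch of the transform pair described above, assuming
# "\x02" (STX) and "\x03" (ETX) as the start/end markers.

STX = "\x02"
ETX = "\x03"

def bwt(s: str) -> str:
    """Forward transform: last character of each sorted rotation."""
    assert STX not in s and ETX not in s
    s = STX + s + ETX
    table = sorted(s[i:] + s[:i] for i in range(len(s)))  # ith rotation
    return "".join(row[-1] for row in table)

def ibwt(r: str) -> str:
    """Inverse: repeatedly insert r as the left column and sort."""
    table = [""] * len(r)
    for _ in range(len(r)):
        table = sorted(r[i] + table[i] for i in range(len(r)))
    row = next(row for row in table if row.endswith(ETX))
    return row[1:-1]  # strip the STX/ETX markers
```

Because the ETX marker is unique, the inverse can locate the row holding the original string without a stored index.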

The advent of next-generation sequencing (NGS) techniques at the end of the 2000s has led to another application of the Burrows–Wheeler transformation.

In many experiments, e.g., in ChIP-Seq, the task is now to align these reads to a reference genome, i.e., to the known, nearly complete sequence of the organism in question (which may be up to several billion base pairs long).

A number of alignment programs specialized for this task were published; these initially relied on hashing (e.g., Eland, SOAP,[10] or Maq[11]).

For example, one study[15] showed a compression pipeline based on the application of the Burrows–Wheeler transformation followed by inversion, run-length, and arithmetic encoders; the resulting scheme is referred to as BWIC.

BWIC is shown to outperform the two codecs it was compared against in terms of final compression size of radiography medical images, on the order of 5.1% and 4.1% respectively.

The improvements are achieved by combining BWIC with a pre-BWIC scan of the image in a vertical snake-order fashion.

The BWT has also proved useful for sequence prediction, a common area of study in machine learning and natural-language processing.

In particular, Ktistakis et al.[18] proposed a sequence prediction scheme called SuBSeq that exploits the lossless compression of data of the Burrows–Wheeler transform.

SuBSeq exploits BWT by extracting the FM-index and then performing a series of operations called backwardSearch, forwardSearch, neighbourExpansion, and getConsequents in order to search for predictions given a suffix.
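The operations named above are specific to SuBSeq (described in the cited paper), but the core backward-search step over an FM-index can be illustrated generically. The sketch below is a naive, unoptimized toy: suffix-array construction, the rank function, and the C table would all be precomputed or compressed in a real index.

```python
from bisect import bisect_left

def fm_index(text: str):
    """Build a toy FM-index: the BWT string and the C table."""
    text += "\0"  # unique sentinel, smallest character
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
    bwt = "".join(text[i - 1] for i in sa)
    sorted_text = sorted(text)
    # C[c] = number of characters in text strictly smaller than c
    C = {c: bisect_left(sorted_text, c) for c in set(text)}
    return bwt, C

def rank(bwt: str, c: str, i: int) -> int:
    """Occurrences of c in bwt[:i] (real indexes precompute this)."""
    return bwt[:i].count(c)

def backward_search(bwt: str, C: dict, pattern: str) -> int:
    """Count occurrences of pattern in the indexed text by scanning
    the pattern right-to-left and narrowing a suffix-array interval."""
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + rank(bwt, c, lo)
        hi = C[c] + rank(bwt, c, hi)
        if lo >= hi:
            return 0
    return hi - lo

bwt_str, C = fm_index("abracadabra")
print(backward_search(bwt_str, C, "abra"))  # 2
```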

SuBSeq has been shown to outperform state-of-the-art algorithms for sequence prediction in terms of both training time and accuracy.