Byte pair encoding

Byte pair encoding[1][2] (also known as BPE, or digram coding)[3] is an algorithm, first described in 1994 by Philip Gage, for encoding strings of text into smaller strings by creating and using a translation table.

[5][6][7] The original BPE algorithm operates by iteratively replacing the most common contiguous sequences of characters in a target text with unused 'placeholder' bytes.

The iteration ends when no sequences can be found, leaving the target text effectively compressed.

Decompression can be performed by reversing this process, querying known placeholder terms against their corresponding denoted sequence, using a lookup table.

In the original paper, this lookup table is encoded and stored alongside the compressed text.

Note that new words can always be constructed from final vocabulary tokens and initial-set characters.

In the above example, the output of the BPE is a vocabulary, which can be used to encode any text that is written with the letters "abcd".