LZ77 and LZ78 are the two lossless data compression algorithms published in papers by Abraham Lempel and Jacob Ziv in 1977[1] and 1978.
These two algorithms form the basis for many variations including LZW, LZSS, LZMA and others.
Since LZ77 encodes and decodes from a sliding window over previously seen characters, decompression must always start at the beginning of the input.
Conceptually, LZ78 decompression could allow random access to the input if the entire dictionary were known in advance.
In 2021 Jacob Ziv was awarded the IEEE Medal of Honor for his involvement in their development.
In the second of the two papers that introduced these algorithms, they are analyzed as encoders defined by finite-state machines.
A measure analogous to information entropy is developed for individual sequences (as opposed to probabilistic ensembles).
In this sense an algorithm based on this scheme produces asymptotically optimal encodings.
The larger the sliding window is, the further back the encoder may search for matches when creating references.
The operation is thus equivalent to the statement "copy the data you were given and repetitively paste it until it fits".
As this type of pair repeats a single copy of data multiple times, it can be used to incorporate a flexible and easy form of run-length encoding.
Mirroring the encoding process, since the pattern is repetitive, the read pointer need only trail the write pointer by a fixed distance equal to the run length LR until L characters have been copied to the output in total.
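This overlapping-copy behaviour can be shown with a minimal decoder sketch; the token representation here (literal characters mixed with `(distance, length)` pairs) is an illustrative assumption, not a fixed LZ77 wire format:

```python
def lz77_decode(tokens):
    """Decode a stream of simplified LZ77 tokens.

    Each token is either a literal character or a (distance, length)
    back-reference. When length exceeds distance, the copy overlaps the
    data currently being written: the read pointer trails the write
    pointer by `distance` and re-reads freshly written characters,
    which is exactly what turns a back-reference into run-length
    encoding.
    """
    out = []
    for token in tokens:
        if isinstance(token, str):      # literal character
            out.append(token)
        else:                           # (distance, length) back-reference
            distance, length = token
            start = len(out) - distance
            for i in range(length):     # character-by-character copy handles overlap
                out.append(out[start + i])
    return "".join(out)

# A literal 'a' followed by (distance=1, length=4) expands to a run of five a's.
print(lz77_decode(["a", (1, 4)]))  # → aaaaa
```

The character-by-character copy is essential: a block copy of the referenced region would fail whenever the length exceeds the distance, because part of the source has not been written yet.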
Considering the above, and especially if the compression of data runs is expected to predominate, the window search should begin at the end of the window and proceed backwards. Run patterns, if they exist, will then be found first, allowing the search to terminate: absolutely, once the maximal possible matching sequence length is met; judiciously, once a sufficient length is met; and finally for the simple reason that more recent data may correlate better with the next input.
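The backward search with early termination described above might be sketched as follows; `find_match` and its parameters are hypothetical names, and matches are allowed to run past the window into the lookahead so that run lengths may exceed the distance:

```python
def find_match(window, lookahead, max_len):
    """Search the window backwards (most recent data first) for the
    longest match with the start of the lookahead buffer, stopping
    early once a match of max_len is found, since no longer match is
    possible."""
    combined = window + lookahead   # matching may extend past the window end
    best_dist, best_len = 0, 0
    for dist in range(1, len(window) + 1):      # backwards: smallest distance first
        start = len(window) - dist
        length = 0
        while (length < max_len and length < len(lookahead)
               and combined[start + length] == lookahead[length]):
            length += 1
        if length > best_len:
            best_dist, best_len = dist, length
            if best_len == max_len:             # absolute early termination
                break
    return best_dist, best_len

# Window "a" with lookahead "aaaa": the run is found at distance 1 with
# length 4, i.e. a run length exceeding the distance.
print(find_match("a", "aaaa", 4))  # → (1, 4)
```

Because ties are resolved in favour of the first (most recent) candidate, the search also prefers nearer matches of equal length, which typically cost fewer bits to encode.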
The LZ78 algorithms compress sequential data by building a dictionary of token sequences from the input, and then replacing the second and subsequent occurrences of a sequence in the data stream with a reference to the dictionary entry.
Note how the algorithm is greedy: nothing is added to the table until a token sequence not already in the dictionary is found.
Note also that in this case the output 0A1B0B1$ is longer than the original input, but the compression ratio improves considerably as the dictionary grows, and in binary the indexes need not be represented by any more than the minimum number of bits.
Finally, a dictionary entry for 1$ is created and A$ is output, resulting in A AB B A$, or AABBA once the spaces and EOF marker are removed.
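The walk-through above can be reproduced with a small sketch of an LZ78 encoder and decoder; the tuple representation and helper names are illustrative assumptions, not part of any standard:

```python
def lz78_encode(text):
    """LZ78: greedily extend the current phrase while it is still in the
    dictionary; on the first unseen phrase, emit (index of the longest
    known prefix, new character) and add the phrase to the dictionary.
    Index 0 denotes the empty prefix."""
    dictionary = {}   # phrase -> index
    output = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                    # greedy: keep extending the match
        else:
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[phrase + ch] = len(dictionary) + 1
            phrase = ""
    if phrase:  # input ended mid-phrase; emit it with its prefix index
        output.append((dictionary.get(phrase[:-1], 0), phrase[-1]))
    return output

def lz78_decode(tokens):
    """Rebuild the dictionary on the fly: each token names a prior
    phrase (0 = empty) extended by one character."""
    phrases = [""]
    out = []
    for index, ch in tokens:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)

# The example from the text: AABBA$ encodes to 0A 1B 0B 1$.
print(lz78_encode("AABBA$"))  # → [(0, 'A'), (1, 'B'), (0, 'B'), (1, '$')]
```

Decoding the four tokens yields the phrases A, AB, B, A$ in order, i.e. the "A AB B A$" of the walk-through, which concatenates back to AABBA$.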
BTLZ is an LZ78-based algorithm that was developed for use in real-time communications systems (originally modems) and standardized by CCITT/ITU as V.42bis.