Lossless compression

Compression algorithms are usually effective for human- and machine-readable documents, but they cannot shrink random data that contains no redundancy.

Arithmetic coding achieves compression rates close to the best possible for a particular statistical model, which is given by the information entropy. Huffman coding is simpler and faster but produces poor results for models in which symbol probabilities are close to 1.
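To make that gap concrete, here is a small illustrative calculation (not from the source) for a hypothetical two-symbol source with probabilities 0.99 and 0.01: the entropy is roughly 0.08 bits per symbol, but a Huffman code must spend at least one whole bit per symbol, an overhead that arithmetic coding largely avoids.

```python
import math

# Hypothetical two-symbol source; the probabilities are chosen only for illustration.
p = [0.99, 0.01]

# Shannon entropy: the best achievable average code length under this model.
entropy = -sum(pi * math.log2(pi) for pi in p)

# A Huffman code over two symbols assigns a 1-bit codeword to each symbol,
# so its average length is exactly 1 bit per symbol regardless of the skew.
huffman_avg = 1.0

print(f"entropy        : {entropy:.4f} bits/symbol")                  # about 0.0808
print(f"Huffman average: {huffman_avg:.4f} bits/symbol")
print(f"overhead       : {huffman_avg / entropy:.1f}x the optimum")   # about 12x
```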

Many of the lossless compression techniques used for text also work reasonably well for indexed images.

These techniques take advantage of the specific characteristics of images such as the common phenomenon of contiguous 2-D areas of similar tones.

Because of patents on certain kinds of LZW compression, and in particular licensing practices by patent holder Unisys that many developers considered abusive, some open source proponents encouraged people to avoid using the Graphics Interchange Format (GIF) for compressing still image files in favor of Portable Network Graphics (PNG), which combines the LZ77-based deflate algorithm with a selection of domain-specific prediction filters.
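As a rough sketch of how such a prediction filter works (modelled on PNG's "Sub" filter; the function below is illustrative, not the reference implementation), each byte is replaced by its difference from the corresponding byte one pixel to the left, so smooth areas become long runs of small values that Deflate compresses well.

```python
def sub_filter(row: bytes, bpp: int = 1) -> bytes:
    """PNG-style 'Sub' prediction: store each byte minus the byte one pixel
    (bpp bytes) to its left, modulo 256. Illustrative sketch only."""
    out = bytearray(len(row))
    for i, b in enumerate(row):
        left = row[i - bpp] if i >= bpp else 0
        out[i] = (b - left) & 0xFF
    return bytes(out)

row = bytes(range(50, 90))        # a smooth horizontal gradient: 50, 51, 52, ...
print(list(sub_filter(row))[:8])  # [50, 1, 1, 1, 1, 1, 1, 1] - easy for Deflate
```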

Lossless sound compression algorithms can take advantage of the repeating patterns shown by the wave-like nature of the data‍—‍essentially using autoregressive models to predict the "next" value and encoding the (possibly small) difference between the expected value and the actual data.

This is called delta encoding (from the Greek letter Δ, which in mathematics denotes a difference), but the term is typically only used if both versions are meaningful outside compression and decompression.
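A minimal sketch of the simplest, first-order case follows (the function names are made up for illustration; real codecs such as FLAC use higher-order linear predictors and then entropy-code the residuals):

```python
import math

def delta_encode(samples):
    """Predict that each sample equals the previous one and keep only the
    residual; for smooth, wave-like signals the residuals cluster near zero."""
    prev, residuals = 0, []
    for s in samples:
        residuals.append(s - prev)
        prev = s
    return residuals

def delta_decode(residuals):
    """Exact inverse: add each residual back onto the running prediction."""
    prev, samples = 0, []
    for r in residuals:
        prev += r
        samples.append(prev)
    return samples

wave = [round(1000 * math.sin(i / 10)) for i in range(200)]  # toy "audio" signal
assert delta_decode(delta_encode(wave)) == wave              # lossless round trip
print(max(map(abs, wave)), max(map(abs, delta_encode(wave))))  # 1000 vs about 100
```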

See the list of lossless video codecs.

Cryptosystems often compress data (the "plaintext") before encryption for added security.

When properly implemented, compression greatly increases the unicity distance by removing patterns that might facilitate cryptanalysis.[9]

However, many ordinary lossless compression algorithms produce headers, wrappers, tables, or other predictable output that might instead make cryptanalysis easier.
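The effect on the unicity distance can be made concrete with Shannon's standard formula (given here for reference; it is not derived in this excerpt): for a cipher with key entropy H(K) and plaintext redundancy D bits per character,

```latex
U \approx \frac{H(K)}{D}
```

so compressing the plaintext lowers its per-character redundancy D and thereby raises U, the amount of ciphertext an attacker needs before the key is determined essentially uniquely.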

[11] For eukaryote genomes, the XM algorithm achieves a slightly better compression ratio, though for sequences larger than 100 MB its computational requirements are impractical.

This type of compression is not strictly limited to binary executables, but can also be applied to scripts, such as JavaScript.[15]

The Compression Analysis Tool[16] is a Windows application that enables end users to benchmark the performance characteristics of streaming implementations of LZF4, Deflate, ZLIB, GZIP, BZIP2 and LZMA using their own data.

The fact that no lossless compression algorithm can shrink every possible file is easily proven with elementary mathematics using a counting argument called the pigeonhole principle: there are 2^N distinct files of length N bits, but only 2^N − 1 files that are strictly shorter, so at least one input of length N cannot be mapped to a shorter output.[17][18]

Most practical compression algorithms provide an "escape" facility that can turn off the normal coding for files that would become longer by being encoded.

In theory, only a single additional bit is required to tell the decoder that the normal coding has been turned off for the entire input; however, most encoding algorithms use at least one full byte (and typically more than one) for this purpose.
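A minimal sketch of such an escape mechanism (the one-byte flag values and framing are invented for illustration; real formats define their own):

```python
import os
import zlib

RAW, DEFLATED = 0, 1   # hypothetical method flags for this sketch

def pack(data: bytes) -> bytes:
    """Try Deflate, but store the input verbatim whenever compression would
    make it longer; the leading flag byte is the 'escape' signal."""
    compressed = zlib.compress(data, 9)
    if len(compressed) < len(data):
        return bytes([DEFLATED]) + compressed
    return bytes([RAW]) + data              # worst case: output grows by one byte

def unpack(blob: bytes) -> bytes:
    method, payload = blob[0], blob[1:]
    return zlib.decompress(payload) if method == DEFLATED else payload

noise = os.urandom(1 << 16)                 # random data: nothing to exploit
assert unpack(pack(noise)) == noise
assert len(pack(noise)) == len(noise) + 1   # only the flag byte of overhead
```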

A hypothetical algorithm that could compress every file contradicts fundamental mathematics because, if it existed, it could be applied repeatedly to losslessly reduce any file to length 1.

At the same time, there is no general procedure for deciding whether a given file is incompressible, so any particular file, even one that appears random, may turn out to be significantly compressible, even after accounting for the size of the decompressor.

An example is the digits of the mathematical constant pi, which appear random but can be generated by a very small program.
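As a sketch of that point (a standard unbounded spigot algorithm, after Gibbons; not taken from the source), the following few lines generate as many decimal digits of pi as desired, even though the resulting digit stream looks statistically random:

```python
def pi_digits():
    """Unbounded spigot for the decimal digits of pi; yields 3, 1, 4, 1, 5, 9, ..."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

gen = pi_digits()
print("".join(str(next(gen)) for _ in range(15)))   # 314159265358979
```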

An obvious way of detecting input that will not compress is to apply the raw compression algorithm and test whether its output is smaller than its input.

For example, the zip data format specifies the 'compression method' of 'Stored' for input files that have been copied into the archive verbatim.
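A small sketch of that idea using Python's standard zipfile module (the file names here are hypothetical): each member is trial-compressed, and anything that would not shrink is written with the 'Stored' method instead.

```python
import os
import zlib
import zipfile

def add_member(zf: zipfile.ZipFile, name: str, data: bytes) -> None:
    """Trial-compress the member and fall back to 'Stored' when Deflate
    would not make it smaller."""
    method = (zipfile.ZIP_DEFLATED if len(zlib.compress(data, 9)) < len(data)
              else zipfile.ZIP_STORED)
    zf.writestr(name, data, compress_type=method)

with zipfile.ZipFile("demo.zip", "w") as zf:                    # hypothetical archive
    add_member(zf, "text.txt", b"to be or not to be " * 1000)   # compresses well
    add_member(zf, "noise.bin", os.urandom(4096))               # ends up 'Stored'

with zipfile.ZipFile("demo.zip") as zf:
    for info in zf.infolist():
        kind = "Deflated" if info.compress_type == zipfile.ZIP_DEFLATED else "Stored"
        print(f"{info.filename}: {kind}, {info.file_size} -> {info.compress_size} bytes")
```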

[22] Mark Nelson, in response to claims of "magic" compression algorithms appearing in comp.compression, constructed a 415,241-byte binary file of highly entropic content and issued a public challenge, offering $100 to anyone who could write a program that, together with its input, would be smaller than his provided binary data yet able to reconstitute it without error.