Asymmetric numeral systems

Asymmetric numeral systems (ANS)[1][2] is a family of entropy encoding methods introduced by Jarosław (Jarek) Duda[3] from Jagiellonian University, used in data compression since 2014[4] due to improved performance compared to previous methods.

In the tabled ANS (tANS) variant, this is achieved by constructing a finite-state machine to operate on a large alphabet without using multiplication.

The basic idea is to encode information into a single natural number x. In the standard binary numeral system, a bit s ∈ {0, 1} can be appended to the information stored in x by the rule x' = 2x + s, which is optimal when Pr(0) = Pr(1) = 1/2; ANS generalizes this so that appending a symbol of probability p sends x to a new number x' ≈ x/p, i.e. x' carries the lg(x) bits already stored plus the lg(1/p) bits of the new symbol.

For the encoding rule, the set of natural numbers is split into disjoint subsets corresponding to the different symbols – like into even and odd numbers, but with densities corresponding to the probability distribution of the symbols to encode. Then, to add the information from symbol s to the information already stored in the current number x, we go to x' = C(s, x), the x-th appearance of the subset assigned to s.

There are alternative ways to apply it in practice – direct mathematical formulas for encoding and decoding steps (uABS and rANS variants), or one can put the entire behavior into a table (tANS variant).
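As an illustrative sketch of the formula-based approach (our own illustration, not code from the cited references), the binary uABS coding step can be written directly; here p is the assumed probability of the symbol 1, and exact rational arithmetic is used so that the ceiling and floor operations behave deterministically:

# Sketch of the uABS (uniform binary ANS) coding step, assuming Pr(1) = p.
from fractions import Fraction
from math import ceil, floor

def uabs_encode(x, s, p):
    # Append bit s to state x; the state grows roughly by the factor 1/Pr(s).
    if s == 0:
        return ceil((x + 1) / (1 - p)) - 1
    return floor(x / p)

def uabs_decode(x, p):
    # Recover the most recently encoded bit s and the previous state.
    s = ceil((x + 1) * p) - ceil(x * p)      # equals 0 or 1
    if s == 0:
        return 0, x - ceil(x * p)
    return 1, ceil(x * p)

# Round trip for Pr(1) = 3/10: bits come back in reverse order of encoding.
p, x = Fraction(3, 10), 1
for bit in [0, 1, 1, 0, 1]:
    x = uabs_encode(x, bit, p)
recovered = []
for _ in range(5):
    bit, x = uabs_decode(x, p)
    recovered.append(bit)
print(recovered[::-1])                       # [0, 1, 1, 0, 1]

That decoding returns the symbols in reverse order of encoding is the stack-like behavior discussed further below.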

Suppose a sequence of 1,000 zeros and ones is to be encoded, which would take 1000 bits to store directly. However, if it is somehow known that it contains only 1 one and 999 zeros, it suffices to encode the position of the one, which requires only ⌈lg(1000)⌉ ≈ 10 bits here instead of the original 1000 bits.

Generally, length-n sequences containing pn ones and (1 − p)n zeros, for some probability p ∈ (0, 1), are called combinations. Using Stirling's approximation we get their asymptotic number: (n choose pn) ≈ 2^(n·h(p)) for large n, where h(p) = −p·lg(p) − (1 − p)·lg(1 − p) is the Shannon entropy. Hence approximately n·h(p) bits suffice to store such a sequence.
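As a quick numerical illustration of this approximation (our own check rather than a figure from the references), for n = 1000 and p = 1/4 the exact count is already close to the entropy estimate:

# Compare lg(n choose pn) with n*h(p) for n = 1000, p = 1/4.
from math import comb, log2

def h(p):
    # Binary Shannon entropy in bits per symbol.
    return -p * log2(p) - (1 - p) * log2(1 - p)

n, ones = 1000, 250
exact = log2(comb(n, ones))        # bits needed to index one such combination
estimate = n * h(ones / n)         # Shannon entropy estimate
print(round(exact, 1), round(estimate, 1))   # roughly 806.2 vs 811.3 bits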

For example, ANS could be directly used to enumerate combinations: assign a different natural number to every sequence of symbols having fixed proportions in a nearly optimal way.

In contrast to encoding combinations, this probability distribution usually varies in data compressors.

For this purpose, Shannon entropy can be seen as a weighted average: a symbol of probability p carries lg(1/p) bits of information, so the average is H = Σ_s p_s·lg(1/p_s) bits per symbol; for instance, for probabilities (1/2, 1/4, 1/4) this gives 0.5·1 + 0.25·2 + 0.25·2 = 1.5 bits/symbol. ANS views a natural number x as containing lg(x) bits of information; adding the information of a symbol of probability p increases this content to lg(x) + lg(1/p) = lg(x/p), hence the new number holding both pieces of information should be x' ≈ x/p.

We see that an equivalent method for performing the encoding is to work with a fixed denominator: consider a more general source with k letters, with rational probabilities p_s = f_s/N for positive integers f_s summing to N, and use the standard base-N numeral system, in which appending a digit d ∈ {0, …, N − 1} to the information stored in x gives x' = N·x + d.

The above procedure is optimal for the uniform (symmetric) probability distribution over the N digit values.

ANS generalizes it to make it optimal for any chosen (asymmetric) probability distribution of symbols, Pr(s) = p_s: just as the binary system splits the natural numbers into even and odd numbers, the natural numbers are split into subsets whose densities correspond to the assumed probability distribution, with f_s of every N consecutive values assigned to symbol s, and encoding s sends x to its x-th appearance in its subset.

The range variant (rANS) uses exactly this construction with N = 2^n: intuitively, it divides the set of natural numbers into size-2^n ranges, and splits each of them in an identical way into subranges of proportions given by the assumed probability distribution.
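A minimal single-step rANS sketch under these assumptions (our illustration; f[s] are the integer frequencies, summing to M = 2^n, the range size above, and cdf[s] their cumulative sums): the encoding step sends x to its x-th appearance among the numbers whose residue modulo M falls in the subrange reserved for s.

# Single-step rANS for integer frequencies f[s] summing to M (a power of two),
# with cumulative frequencies cdf[s] = f[0] + ... + f[s-1].
def rans_encode_step(x, s, f, cdf, M):
    # x-th number whose residue mod M lies in [cdf[s], cdf[s] + f[s]).
    return (x // f[s]) * M + cdf[s] + (x % f[s])

def rans_decode_step(x, f, cdf, symbol_of, M):
    # Read the last symbol from x mod M, then restore the previous state.
    r = x % M
    s = symbol_of[r]
    return s, f[s] * (x // M) + r - cdf[s]

# Example: alphabet {0, 1, 2} with probabilities 8/16, 6/16, 2/16.
f, cdf, M = [8, 6, 2], [0, 8, 14], 16
symbol_of = [0] * 8 + [1] * 6 + [2] * 2      # length-M lookup table

x = 16                                       # arbitrary starting state
for s in [2, 0, 1, 1, 0]:
    x = rans_encode_step(x, s, f, cdf, M)
decoded = []
for _ in range(5):
    s, x = rans_decode_step(x, f, cdf, symbol_of, M)
    decoded.append(s)
print(decoded[::-1])                         # [2, 0, 1, 1, 0]

Note that the state x grows by roughly a factor M/f[s] ≈ 1/p_s per symbol, which is why the stream variants below renormalize it.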

To avoid using large-number arithmetic, in practice stream variants are used, which enforce x ∈ [L, b·L) by renormalization (usually L and b are powers of 2).

The encoder sends the least significant bits of x to the bitstream whenever x would otherwise leave this range, and the decoder refills the least significant bits from the bitstream when needed. The tANS variant puts the entire behavior (including renormalization) for x ∈ [L, 2L) into a table, yielding a finite-state machine that avoids the need for multiplication.
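A sketch of such a stream variant (our illustration, building on the single-step rANS above, with byte-wise renormalization, b = 256, and hypothetical helper names): the encoder processes the symbols in reverse so that the decoder can read them back in the original order.

# Streaming rANS sketch with byte-wise renormalization, keeping the state
# in [L, 256*L). Frequencies f[s] sum to M; cdf[s] are cumulative sums.
# L must be a multiple of M.
L = 1 << 16                          # lower bound of the normalized interval

def encode_stream(symbols, f, cdf, M):
    x, out = L, []                   # 'out' collects the renormalization bytes
    for s in reversed(symbols):      # encode backwards: ANS acts as a stack
        while x >= (L // M) * 256 * f[s]:   # emit low bytes so that the next
            out.append(x & 0xFF)            # step keeps x below 256*L
            x >>= 8
        x = (x // f[s]) * M + cdf[s] + (x % f[s])
    return x, out

def decode_stream(x, out, n, f, cdf, symbol_of, M):
    res = []
    for _ in range(n):
        r = x % M
        s = symbol_of[r]
        res.append(s)
        x = f[s] * (x // M) + r - cdf[s]
        while x < L:                 # refill low bytes: back into [L, 256*L)
            x = (x << 8) | out.pop()
    return res

f, cdf, M = [8, 6, 2], [0, 8, 14], 16
symbol_of = [0] * 8 + [1] * 6 + [2] * 2
msg = [0, 1, 2, 1, 0, 0, 2]
x, out = encode_stream(msg, f, cdf, M)
print(decode_stream(x, out, len(msg), f, cdf, symbol_of, M) == msg)   # True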

If the symbols are spread over the states in consecutive blocks whose lengths are powers of 2, the obtained tANS coding is equivalent to a prefix code; for example, the a→0, b→100, c→101, d→11 prefix code would be obtained for tANS with the "aaaabcdd" symbol spread.

As with Huffman coding, modifying the probability distribution of tANS is relatively costly, hence it is mainly used in static situations, usually with some Lempel–Ziv scheme (e.g. ZSTD, LZFSE).

In contrast, rANS is usually used as a faster replacement for range coding (e.g. CRAM, LZNA, Draco[14]).

It requires multiplication, but is more memory efficient and is appropriate for dynamically adapting probability distributions.

Encoding and decoding of ANS are performed in opposite directions, making it act as a stack (last in, first out) for symbols: the last symbol encoded is the first symbol decoded.

For context dependence, such as a Markov model, the encoder needs to use the context from the perspective of the later decoding.

The final state of encoding is required to start decoding, hence it needs to be stored in the compressed file.

This cost can be compensated by storing some information in the initial state of the encoder.

The author of the novel ANS algorithm and its variants tANS and rANS specifically intended his work to be available freely in the public domain, for altruistic reasons.

In 2015, Google published a US and then worldwide patent application for "Mixed boolean-token ans coefficient coding".[24]

At the time, Google had asked Professor Duda to help it with video compression, so it was intimately aware of this domain, having the original author assisting it.

Duda subsequently filed a third-party submission[25] to the US Patent Office seeking a rejection.[26]

In June 2019, Microsoft lodged a patent application called "Features of range asymmetric number system encoding and decoding".

The application was initially rejected by the USPTO. Yet on March 2, 2021, Microsoft gave the USPTO an explanatory filing stating that "The Applicant respectfully disagrees with the rejections", seeking to have the rejection overturned.

Comparison of the concept of arithmetic coding (left) and ANS (right). Both can be seen as generalizations of standard numeral systems, optimal for a uniform probability distribution of digits, into ones optimized for some chosen probability distribution. Arithmetic or range coding corresponds to adding new information in the most significant position, while ANS generalizes adding information in the least significant position. Its coding rule is "x goes to the x-th appearance of the subset of natural numbers corresponding to the currently encoded symbol". In the presented example, the sequence (01111) is encoded into the natural number 18, which is smaller than the 47 obtained by using the standard binary system, thanks to better agreement with the symbol frequencies of the sequence to encode. The advantage of ANS is storing the information in a single natural number, in contrast to the two numbers defining a range.
Simple example of a 4-state ANS automaton for the Pr(a) = 3/4, Pr(b) = 1/4 probability distribution. The symbol b contains −lg(1/4) = 2 bits of information and so it always produces two bits. In contrast, the symbol a contains −lg(3/4) ≈ 0.415 bits of information, hence it sometimes produces one bit (from states 6 and 7) and sometimes no bits (from states 4 and 5), only increasing the state, which acts as a buffer containing a fractional number of bits: lg(x). In practice the number of states is, for example, 2048 for a 256-symbol alphabet (to directly encode bytes).
Example of generation of tANS tables for an alphabet of size m = 3 and L = 16 states, followed by their use for stream decoding. First the probabilities are approximated by fractions with denominator equal to the number of states. Then the symbols are spread over the states in a nearly uniform way; optionally, the details may depend on a cryptographic key for simultaneous encryption. Then the appearances of each symbol are enumerated, starting from a value equal to its number of appearances. Finally, the youngest (least significant) bits are refilled from the stream to return x to the assumed range (renormalization).
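The table construction described in this caption can be sketched as follows (our illustration with made-up frequencies f = {a: 7, b: 6, c: 3} summing to L = 16 and a simple step-based spread; real implementations tune the spread for better compression):

# Sketch of tANS decoding-table construction for L = 16 states and a 3-symbol
# alphabet, followed by a table-driven decoding step (state kept in [L, 2L)).
L = 16
f = {"a": 7, "b": 6, "c": 3}         # approximated probabilities f[s]/L

# Spread the symbols over the L slots in a roughly uniform way.
spread = [None] * L
pos, step = 0, 5                     # step chosen coprime with L
for s, count in f.items():
    for _ in range(count):
        spread[pos] = s
        pos = (pos + step) % L

# Build the decoding table: slot i corresponds to state x = L + i, and the
# appearances of each symbol are enumerated starting from f[s].
next_x = dict(f)
table = [None] * L                   # table[i] = (symbol, state before renormalization)
for i in range(L):
    s = spread[i]
    table[i] = (s, next_x[s])
    next_x[s] += 1
print(table[:4])                     # [('a', 7), ('c', 3), ('b', 6), ('b', 7)]

def decode(state, bits, n):
    # Decode n symbols; 'bits' is a list consumed from its end (one possible convention).
    out = []
    for _ in range(n):
        s, x = table[state - L]
        out.append(s)
        while x < L:                 # renormalization: refill youngest bits
            x = (x << 1) | bits.pop()
        state = x
    return out

A matching encoder would maintain the inverse of this table per symbol; the decode step itself needs only a table lookup, shifts and bit reads, which is what makes tANS multiplication-free.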