Rather than storing the structure of the code tree explicitly, canonical Huffman codes are ordered in such a way that it suffices to store only the lengths of the codewords, which reduces the overhead of the codebook.
For a symbol code such as a Huffman code to be decoded, the decoding algorithm must be given the same model that the encoding algorithm used to compress the source data.
Traversing the tree is also computationally costly, since it requires the algorithm to jump unpredictably through the structure in memory as each bit of the encoded data is read in.
Additionally, because the codes are sequential, the decoding algorithm can be dramatically simplified so that it is computationally efficient.
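One way such a simplified decoder can be sketched: because canonical codes of each length form a contiguous range of integers, the decoder needs only the count of codewords at each length and the symbols in canonical order, with no tree at all. The four-symbol codebook below (B=0, A=10, C=110, D=111) is a hypothetical example, not taken from the text.

```python
def canonical_decode(bits, counts, symbols):
    """Decode a bit string given counts[l-1] = number of codes of length l
    and the symbols listed in canonical order (hypothetical example API)."""
    out = []
    i = 0
    while i < len(bits):
        code = 0    # value of the bits read so far
        first = 0   # smallest canonical code of the current length
        index = 0   # position in `symbols` of that first code
        for count in counts:
            code = (code << 1) | int(bits[i])
            i += 1
            if code - first < count:          # code falls in this length's range
                out.append(symbols[index + code - first])
                break
            index += count
            first = (first + count) << 1      # advance to the next length
        else:
            raise ValueError("invalid bit stream")
    return out

# Hypothetical codebook: B=0, A=10, C=110, D=111
print(canonical_decode("010110111", [1, 1, 2], ["B", "A", "C", "D"]))
# → ['B', 'A', 'C', 'D']
```

Instead of one pointer-chase per bit, the decoder does a comparison and two additions per code length, touching only small sequential tables.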
The bit lengths stay the same, with the code book sorted first by codeword length and then by alphabetical value of the symbol. Each of the existing codes is replaced with a new one of the same length, using the following algorithm:

1. The first symbol in the list is assigned a codeword of the same length as its original codeword, but consisting entirely of zeros.
2. Each subsequent symbol is assigned the next binary number in sequence, ensuring that later codes are always higher in value.
3. When a longer codeword is required, zeros are appended after incrementing until the new codeword reaches the required length.

Following these three rules produces the canonical version of the code book. Another perspective on the canonical codewords is that they are the digits past the radix point (binary point) in a binary representation of a certain series: the codeword for the i-th symbol consists of the first l_i binary digits past the radix point of the sum of 2^(-l_j) over all preceding symbols j < i.
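The radix-point view can be made concrete in a short sketch. The four symbols and lengths below are a hypothetical example, assumed to be already in canonical order (shortest first, ties broken alphabetically):

```python
from fractions import Fraction

# Hypothetical codeword lengths, already in canonical order.
lengths = [("B", 1), ("A", 2), ("C", 3), ("D", 3)]

codes = {}
acc = Fraction(0)           # running sum of 2**(-l_j) over preceding symbols
for sym, l in lengths:
    # The codeword is the first l binary digits of `acc` past the radix
    # point: multiply by 2**l and keep the integer part.
    codes[sym] = format(int(acc * 2**l), f"0{l}b")
    acc += Fraction(1, 2**l)

print(codes)  # → {'B': '0', 'A': '10', 'C': '110', 'D': '111'}
```

The result is the same code book the three incremental rules would produce.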
This perspective is particularly useful in light of Kraft's inequality, which says that the sum above will always be less than or equal to 1 (since the lengths come from a prefix-free code).
This shows that adding one in the algorithm above never overflows, i.e. never produces a codeword longer than intended.
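Both facts can be checked numerically in a small sketch; the lengths below are a hypothetical prefix-free example, sorted in ascending order:

```python
from fractions import Fraction

lengths = [1, 2, 3, 3]   # hypothetical codeword lengths, sorted ascending

# Kraft's inequality: the sum of 2**(-l) is at most 1 for a prefix-free code.
kraft_sum = sum(Fraction(1, 2**l) for l in lengths)
assert kraft_sum <= 1

# Consequence: incrementing and left-shifting to the next length never
# produces a value that needs more than l bits.
code = 0
for prev_l, l in zip(lengths, lengths[1:]):
    code = (code + 1) << (l - prev_l)
    assert code < 2**l   # the codeword still fits in l bits
```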
Each of these descriptions is longer than the one that follows it; the most compact is Method 2. The JPEG File Interchange Format uses Method 2 because at most 162 symbols out of the 8-bit alphabet, which has size 256, will appear in the codebook.
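A sketch of this style of description, modeled on the counts-plus-symbols layout of JPEG's Huffman tables (16 per-length counts followed by the symbols in canonical order); the four-symbol codebook is a hypothetical example:

```python
from collections import Counter

# Hypothetical codebook, listed in canonical order: (symbol, codeword length).
lengths = [("B", 1), ("A", 2), ("C", 3), ("D", 3)]

# Count of codewords at each length; JPEG's tables cover lengths 1..16.
counts = Counter(l for _, l in lengths)
bits = [counts.get(l, 0) for l in range(1, 17)]

# The whole description is the 16 counts plus the symbols in canonical
# order -- far shorter than storing one length per possible 8-bit symbol.
huffval = [sym for sym, _ in lengths]
print(bits, huffval)
```

When only a minority of the 256 possible symbols occur, listing the symbols that do occur is cheaper than listing a (mostly zero) length for every symbol.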
Given a list of symbols sorted by bit length, a canonical Huffman code book can be generated with a simple sequential procedure.
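A minimal sketch of such a generator, assuming the input is a list of (symbol, length) pairs already sorted by length and then by symbol value; the four-symbol input is a hypothetical example:

```python
def canonical_codebook(sorted_lengths):
    """Assign canonical codes to (symbol, length) pairs sorted by length."""
    codebook = {}
    code = 0
    prev_len = sorted_lengths[0][1]
    for sym, length in sorted_lengths:
        code <<= length - prev_len        # append zeros when the length grows
        codebook[sym] = format(code, f"0{length}b")
        code += 1                         # next binary number in sequence
        prev_len = length
    return codebook

print(canonical_codebook([("B", 1), ("A", 2), ("C", 3), ("D", 3)]))
# → {'B': '0', 'A': '10', 'C': '110', 'D': '111'}
```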