One of the main applications is the Rabin–Karp string search algorithm, which uses the rolling hash described below.
Another popular application is the rsync program, which uses a checksum based on Mark Adler's adler-32 as its rolling hash.
Low Bandwidth Network Filesystem (LBFS) uses a Rabin fingerprint as its rolling hash.
FastCDC (Fast Content-Defined Chunking) uses a compute-efficient Gear fingerprint as its rolling hash.
At best, rolling hash values are pairwise independent[1] or strongly universal.
The Rabin–Karp string search algorithm is often explained using a rolling hash function that only uses multiplications and additions:

$H = c_1 a^{k-1} + c_2 a^{k-2} + \cdots + c_{k-1} a^{1} + c_k a^{0}$,

where $a$ is a constant and $c_1, \ldots, c_k$ are the input characters. In order to avoid manipulating huge $H$ values, all arithmetic is done modulo $n$. Removing and adding characters simply involves subtracting or adding the first or last term. Shifting all characters by one position to the left requires multiplying the entire sum $H$ by $a$. Shifting all characters by one position to the right requires dividing the entire sum $H$ by $a$.
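As a concrete illustration, the rolling update can be sketched as follows; this is a minimal example, and the base A, the modulus MOD, and the window length are illustrative choices rather than prescribed values:

```python
A = 256                     # illustrative base
MOD = 1_000_000_007         # illustrative prime modulus

def hash_window(window):
    """H = c1*A^(k-1) + c2*A^(k-2) + ... + ck, computed modulo MOD."""
    h = 0
    for c in window:
        h = (h * A + c) % MOD
    return h

def roll(h, leaving, entering, a_pow):
    """Slide the window one character: drop the oldest term, shift, append."""
    h = (h - leaving * a_pow) % MOD     # remove c1 * A^(k-1)
    return (h * A + entering) % MOD     # multiply by A and add the new character

# Rolling over a byte string with a window of k = 4 bytes.
data = b"hello world"
k = 4
a_pow = pow(A, k - 1, MOD)              # A^(k-1) mod MOD
h = hash_window(data[:k])
for i in range(k, len(data)):
    h = roll(h, data[i - k], data[i], a_pow)
    assert h == hash_window(data[i - k + 1 : i + 1])
```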
The Rabin fingerprint is another hash, which also interprets the input as a polynomial, but over the Galois field GF(2).
It is possible to update a Rabin fingerprint using only the entering and the leaving byte, making it effectively a rolling hash.
The backup software restic uses a Rabin fingerprint for splitting files, with blob size varying between 512 KiB and 8 MiB.[2]
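A rolling Rabin fingerprint can be sketched as follows; the polynomial POLY, its degree, and the window length are placeholders chosen for illustration, and real implementations such as those in LBFS or restic use their own verified irreducible polynomials:

```python
POLY = (1 << 53) | (1 << 6) | (1 << 2) | 1   # placeholder polynomial over GF(2)
DEG = 53                                     # degree of POLY

def gf2_mod(x):
    """Reduce the GF(2) polynomial x modulo POLY."""
    while x.bit_length() > DEG:
        x ^= POLY << (x.bit_length() - 1 - DEG)
    return x

def append_byte(fp, byte):
    """Append one byte: fp(x) * x^8 + byte(x), reduced modulo POLY."""
    return gf2_mod((fp << 8) | byte)

def make_shift_table(window_len):
    """Precompute b * x^(8*(window_len-1)) mod POLY for every byte value b."""
    table = []
    for b in range(256):
        x = b
        for _ in range(window_len - 1):
            x = gf2_mod(x << 8)
        table.append(x)
    return table

def roll(fp, leaving, entering, shift_table):
    """Update the fingerprint using only the leaving and the entering byte."""
    fp ^= shift_table[leaving]          # cancel the leaving byte's contribution
    return append_byte(fp, entering)

# Verify the rolling update against recomputing each window from scratch.
data = b"example data stream"
k = 8
table = make_shift_table(k)
fp = 0
for b in data[:k]:
    fp = append_byte(fp, b)
for i in range(k, len(data)):
    fp = roll(fp, data[i - k], data[i], table)
    check = 0
    for b in data[i - k + 1 : i + 1]:
        check = append_byte(check, b)
    assert fp == check
```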
Hashing by cyclic polynomial[3] (sometimes called Buzhash) is also simple and has the benefit of avoiding multiplications, using barrel shifts instead.
It presumes a hash function $h$ mapping characters to integers in the interval $[0, 2^L)$, for instance a table of random integers indexed by character. Let $s$ be a cyclic binary rotation (or circular shift): it rotates the bits by 1 to the left, pushing the latest bit into the first position. Let $\oplus$ denote the bitwise exclusive or. The hash values are defined as

$H = s^{k-1}(h(c_1)) \oplus s^{k-2}(h(c_2)) \oplus \cdots \oplus s(h(c_{k-1})) \oplus h(c_k)$,

where the multiplications by powers of two can be implemented by binary shifts.
Hashing by cyclic polynomials is strongly universal or pairwise independent: simply keep the first $L-k+1$ bits of the hash value.
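A rolling computation of this hash can be sketched as follows; the word size L, the random byte table playing the role of $h$, and the window length are illustrative assumptions:

```python
import random

L = 32                              # hash width in bits (illustrative)
MASK = (1 << L) - 1
random.seed(0)
H_TABLE = [random.getrandbits(L) for _ in range(256)]   # the function h

def rotate(x, n=1):
    """Cyclic binary rotation of an L-bit word by n positions to the left."""
    n %= L
    return ((x << n) | (x >> (L - n))) & MASK

def hash_window(window):
    """H = s^(k-1)(h(c1)) xor ... xor s(h(c_{k-1})) xor h(ck)."""
    h = 0
    for c in window:
        h = rotate(h) ^ H_TABLE[c]
    return h

def roll(h, leaving, entering, k):
    """Rotate once, cancel the leaving character, mix in the entering one."""
    return rotate(h) ^ rotate(H_TABLE[leaving], k) ^ H_TABLE[entering]

# Rolling over a byte string with a window of k = 5 bytes.
data = b"abracadabra"
k = 5
h = hash_window(data[:k])
for i in range(k, len(data)):
    h = roll(h, data[i - k], data[i], k)
    assert h == hash_window(data[i - k + 1 : i + 1])
```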
One of the interesting use cases of the rolling hash function is that it can create dynamic, content-based chunks of a stream or file.
This is especially useful when it is required to send only the changed chunks of a large file over a network: a simple byte addition at the front of the file would normally cause all fixed-size windows to become updated, while in reality only the first "chunk" has been modified.[4] A simple approach to making dynamic chunks is to calculate a rolling hash, and if the lower N bits of the hash value match an arbitrary pattern (e.g. all zeroes), which happens with probability $2^{-N}$ if the hash has a uniform probability distribution, then that position is chosen to be a chunk boundary.
This approach ensures that unmodified data (more than a window size away from the changes) will have the same boundaries.
Once the boundaries are known, the chunks need to be compared by cryptographic hash value to detect changes.[5]
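A minimal sketch of this boundary-selection scheme is shown below, using a simple polynomial rolling hash over a fixed window; the window length, the number of low bits N, and the hash parameters are illustrative choices rather than values from any particular tool:

```python
def chunk_boundaries(data, window=48, n_bits=13, a=256, mod=(1 << 31) - 1):
    """Return byte offsets where the low n_bits of the rolling hash are zero."""
    boundaries = []
    a_pow = pow(a, window - 1, mod)          # a^(window-1), to drop the oldest byte
    low_mask = (1 << n_bits) - 1
    h = 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * a_pow) % mod   # byte leaving the window
        h = (h * a + byte) % mod                       # byte entering the window
        # With a uniform hash, each position is a boundary with probability 2^-n_bits.
        if i + 1 >= window and h & low_mask == 0:
            boundaries.append(i + 1)
    return boundaries
```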
The backup software Borg uses the Buzhash algorithm with a customizable chunk size range for splitting file streams.[4][6]
Several programs, including gzip (with the --rsyncable option) and rsyncrypto, do content-based slicing based on a specific (unweighted) moving sum:[7] the hash is $H(n) = S(n) \bmod M$, where $S(n) = \sum_{i=n-k+1}^{n} c_i$ is the sum of the $k$ consecutive bytes ending with byte $n$, $c_i$ is byte $i$ of the file, and $k$ and $M$ are fixed constants. Shifting the window by one byte simply involves adding the new character to the sum and subtracting the oldest character (no longer in the window) from the sum.
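A hedged sketch of such a moving-sum slicer follows; WINDOW and MOD stand in for the program-specific constants $k$ and $M$, and the boundary condition (sum divisible by MOD) is likewise only illustrative:

```python
WINDOW = 8192       # stand-in for the window length k
MOD = 4096          # stand-in for the modulus M

def moving_sum_slices(data):
    """Return offsets where the unweighted moving sum meets the boundary condition."""
    boundaries = []
    s = 0
    for i, byte in enumerate(data):
        s += byte                        # new byte enters the window
        if i >= WINDOW:
            s -= data[i - WINDOW]        # oldest byte leaves the window
        if i + 1 >= WINDOW and s % MOD == 0:
            boundaries.append(i + 1)
    return boundaries
```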
However, comparing strings byte by byte introduces heavy computation overhead.
FastCDC[8] proposes a new and efficient content-defined chunking approach. It uses a fast rolling Gear hash algorithm,[9] skips the minimum chunk length, normalizes the chunk-size distribution, and, last but not least, rolls two bytes at a time to speed up the CDC algorithm, which can achieve about 10× higher throughput than the Rabin-based CDC approach. Compared with the traditional Rabin hashing algorithm, it achieves much faster hashing speed.
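The Gear hash update itself is a single shift, table lookup, and addition per byte. The sketch below shows a minimal Gear-based chunker; the random table, mask width, and size limits are illustrative, and it deliberately omits FastCDC's chunk-size normalization and two-byte stepping:

```python
import random

random.seed(1)
GEAR_TABLE = [random.getrandbits(64) for _ in range(256)]   # illustrative random table
WORD_MASK = (1 << 64) - 1
BOUNDARY_MASK = (1 << 13) - 1        # check the low 13 bits (illustrative)

def gear_chunk_lengths(data, min_size=2048, max_size=65536):
    """Yield chunk lengths produced by a simple Gear-based chunker."""
    h = 0
    start = 0
    for i, byte in enumerate(data):
        # Gear update: older bytes gradually shift out of the 64-bit state.
        h = ((h << 1) + GEAR_TABLE[byte]) & WORD_MASK
        length = i - start + 1
        if length < min_size:            # skip the minimum chunk length
            continue
        if (h & BOUNDARY_MASK) == 0 or length >= max_size:
            yield length
            start = i + 1
            h = 0                        # restart the hash for the next chunk
    if start < len(data):
        yield len(data) - start          # final partial chunk
```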
All rolling hash functions can be computed in time linear in the number of characters and updated in constant time when characters are shifted by one position.
In particular, computing the Rabin–Karp rolling hash of a string of length $k$ requires $O(k)$ modular arithmetic operations, and hashing by cyclic polynomials requires $O(k)$ bitwise operations (exclusive ors and circular shifts).