Tabulation hashing

In computer science, tabulation hashing is a method for constructing universal families of hash functions by combining table lookup with exclusive or operations.

It was first studied in the form of Zobrist hashing for computer games; later work by Carter and Wegman extended this method to arbitrary fixed-length keys.

Generalizations of tabulation hashing have also been developed that can handle variable-length keys such as text strings.

More sophisticated but slower variants of tabulation hashing extend the method to higher degrees of independence.

The basic idea is as follows: First, divide the key to be hashed into smaller "blocks" of a chosen length.

Then, create a set of lookup tables, one for each block, and fill them with random values.

Suppose that each key to be hashed consists of p bits. Choose a block size r ≤ p; the choice of block size controls the tradeoff between time and memory usage, so it should be made so that the tables are not too large, e.g., so that the tables fit into the computer's cache memory.[2]

Smaller blocks use less memory but slow down the hash function.

Compute t = ceil(p/r), the number of r-bit blocks needed to represent a key.

Create a two-dimensional 2^r × t array, T, and fill it with random q-bit numbers, where q is the number of bits desired in each hash value.

Then, interpreting the blocks of a key x as r-bit values x_0, x_1, ..., x_{t−1}, use each block value together with its position as indices into T, and combine the results using the exclusive or operation:[1]

h(x) = T[0][x_0] ⊕ T[1][x_1] ⊕ ... ⊕ T[t−1][x_{t−1}]

Note that it is not valid to use the same table (e.g. T[0]) for each x_i, since then the hash function would not be able to distinguish between keys that contain the same blocks x_i in a different order.
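The scheme described above can be sketched in Python as follows; this is a minimal illustration, with the parameter names p, r, q, and t taken from the text:

```python
import random

def make_simple_tabulation_hash(p=32, r=8, q=32, seed=None):
    """Build a simple tabulation hash for p-bit keys, using r-bit blocks
    and q-bit hash values.  Illustrative sketch only."""
    rng = random.Random(seed)
    t = -(-p // r)                 # t = ceil(p / r), the number of blocks
    mask = (1 << r) - 1
    # One table of 2**r random q-bit entries per block position.
    T = [[rng.getrandbits(q) for _ in range(1 << r)] for _ in range(t)]

    def h(x):
        result = 0
        for i in range(t):
            block = (x >> (i * r)) & mask  # extract the i-th r-bit block
            result ^= T[i][block]          # combine with exclusive or
        return result

    return h
```

With the default parameters, each table has 256 entries and each key requires four table lookups; choosing a different seed yields an independently drawn hash function from the family.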

In Zobrist hashing, a random bitstring is generated for each game feature, such as a combination of a chess piece and a square of the chessboard, and a position is hashed by combining the bitstrings of its features with the exclusive or operation.[3]

The resulting hash value can then be used as an index into a transposition table.
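A Zobrist-style hash can be sketched as follows; the encoding of features as (piece, square) index pairs and the table dimensions are assumptions of this illustration:

```python
import random

# Illustrative Zobrist hashing for chess.  The feature encoding below
# (piece and square indices) is an assumption of this sketch.
rng = random.Random(0)
PIECES = 12    # 6 piece types for each of 2 colors
SQUARES = 64
zobrist_table = [[rng.getrandbits(64) for _ in range(SQUARES)]
                 for _ in range(PIECES)]

def zobrist_hash(position):
    """position: iterable of (piece_index, square_index) features."""
    h = 0
    for piece, square in position:
        h ^= zobrist_table[piece][square]
    return h
```

A practical advantage of this scheme is incremental updating: when a piece moves, the hash of the new position is obtained by XORing out the old (piece, square) entry and XORing in the new one, without rehashing the whole board.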

Tabulation hashing in greater generality, for arbitrary binary values, was later rediscovered by Carter & Wegman (1979)[4] and studied in more detail by Pătraşcu & Thorup (2012).

Carter & Wegman (1979) define a randomized scheme for generating hash functions to be universal if, for any two distinct keys, the probability that they collide (that is, that they are mapped to the same value as each other) is at most 1/m, where m is the number of possible hash values.

They defined a stronger property in the subsequent paper Wegman & Carter (1981): a randomized scheme for generating hash functions is k-independent if, for every k-tuple of distinct keys and each possible k-tuple of values, the probability that those keys are mapped to those values is 1/m^k.

Simple tabulation hashing is 3-independent but not 4-independent; k-independence for larger values of k is a stronger property, held by fewer hashing algorithms.

Nevertheless, despite being only 3-independent, tabulation hashing provides the same constant expected-time guarantee for linear probing as 5-independent hashing.[7]

For cuckoo hashing, on the other hand, the best bound known on the failure probability under tabulation hashing is higher: high enough that insertions cannot be guaranteed to take constant expected time.

Siegel (2004) uses the same idea of combining random values from a table with exclusive or operations, together with a more complicated algorithm based on expander graphs for transforming the key bits into table indices, to define hashing schemes that are k-independent for any constant or even logarithmic value of k. However, the number of table lookups needed to compute each hash value using Siegel's variation of tabulation hashing, while constant, is still too large to be practical, and his use of expanders also makes the technique not fully constructive.

Thorup (2013) provides a scheme based on tabulation hashing that reaches high degrees of independence more quickly, in a more constructive way.

He observes that using one round of simple tabulation hashing to expand the input keys to six times their original length, and then a second round of simple tabulation hashing on the expanded keys, results in a hashing scheme whose independence number is exponential in the parameter r, the number of bits per block in the partition of the keys into blocks.
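The two-round construction described above can be sketched as follows; this is an illustration of the structure of the scheme (expand, then hash again), with illustrative parameter choices rather than the constants required by Thorup's analysis:

```python
import random

def make_tabulation(p, r, q, rng):
    """One round of simple tabulation: p-bit keys to q-bit values."""
    t = -(-p // r)             # ceil(p / r) blocks per key
    mask = (1 << r) - 1
    T = [[rng.getrandbits(q) for _ in range(1 << r)] for _ in range(t)]
    def h(x):
        out = 0
        for i in range(t):
            out ^= T[i][(x >> (i * r)) & mask]
        return out
    return h

def make_double_tabulation(p=32, r=8, q=32, seed=0):
    """Sketch of double tabulation: expand each key to six times its
    length with one round of simple tabulation, then hash the expanded
    key with a second round."""
    rng = random.Random(seed)
    expand = make_tabulation(p, r, 6 * p, rng)      # first round: widen
    final = make_tabulation(6 * p, r, q, rng)       # second round: hash
    return lambda x: final(expand(x))
```

The cost is roughly seven rounds' worth of table lookups per key (t for the expansion and 6t for the second round), in exchange for much higher independence than a single round provides.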

Simple tabulation is limited to keys of a fixed length, because a different table of random values needs to be initialized for each position of a block in the keys.

Lemire (2012) studies variations of tabulation hashing suitable for variable-length keys such as character strings.

The general type of hashing scheme studied by Lemire uses a single table T indexed by the value of a block, regardless of its position within the key.

However, the values from this table may be combined by a more complicated function than bitwise exclusive or.
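One concrete scheme of this shape, in the spirit of cyclic-polynomial ("Buzhash") hashing, rotates the running hash value before each exclusive or; this sketch is an illustration of the single-shared-table idea, not necessarily the exact scheme studied by Lemire:

```python
import random

rng = random.Random(0)
# One shared 256-entry table of random 64-bit values, used for every
# byte position (unlike simple tabulation's per-position tables).
T = [rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1

def rotl64(v, s):
    """Rotate a 64-bit value left by s bits (0 < s < 64)."""
    return ((v << s) | (v >> (64 - s))) & MASK64

def cyclic_hash(data: bytes) -> int:
    """Variable-length hash over a byte string.  Rotating the running
    value before each XOR makes the result depend on byte order, which
    a plain XOR of shared-table entries would not."""
    h = 0
    for b in data:
        h = rotl64(h, 1) ^ T[b]
    return h
```

Because the table is indexed only by the block's value, the same tables serve keys of every length, at the cost of a combining step that is slightly more complicated than a bare exclusive or.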

Mixed tabulation was shown in 2016[11] to have strong concentration with regard to k-partitions, which are useful in algorithms for counting distinct elements, such as the classical method of Flajolet and Martin.