This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary.
Many universal families are known (for hashing integers, vectors, strings), and their evaluation is often very efficient.
Usually, the goal of hashing is to obtain a low number of collisions (keys from
A deterministic hash function cannot offer any guarantee in an adversarial setting if
This means that all data keys land in the same bin, making hashing useless.
Sometimes, the definition is relaxed by a constant factor, only requiring collision probability
This concept was introduced by Carter and Wegman[1] in 1977, and has found numerous applications in computer science (see, for example[2]).
Given a family with the uniform distance property, one can produce a pairwise independent or strongly universal hash family by adding a uniformly distributed random constant with values in
is a power of two, we can achieve pairwise independence from an XOR universal hash family by doing an exclusive or with a uniformly distributed random constant.)
Since a shift by a constant is sometimes irrelevant in applications (e.g. hash tables), a careful distinction between the uniform distance property and pairwise independent is sometimes not made.
UMAC and Poly1305-AES and several other message authentication code algorithms are based on universal hashing.
Universality guarantees that the number of repetitions is a geometric random variable.
This section refers to the case of hashing integers that fit in machines words; thus, operations like multiplication, addition, division, etc.
The state of the art for hashing integers is the multiply-shift scheme described by Dietzfelbinger et al. in 1997.
[8] By avoiding modular arithmetic, this method is much easier to implement and also runs significantly faster in practice (usually by at least a factor of four[9]).
In mathematical notation, this is This scheme does not satisfy the uniform difference property and is only
To obtain a truly 'universal' hash function, one can use the multiply-add-shift scheme that picks higher-order bits where
This version of multiply-shift is due to Dietzfelbinger, and was later analyzed more precisely by Woelfel.
[10] This section is concerned with hashing a fixed-length vector of machine words.
is a universal family with the uniform difference property, the following family (dating back to Carter and Wegman[1]) also has the uniform difference property (and hence is universal): If
: It is possible to halve the number of multiplications, which roughly translates to a two-fold speed-up in practice.
The following hash family is universal:[13] If double-precision operations are not available, one can interpret the input as a vector of half-words (
The same scheme can also be used for hashing integers, by interpreting their bits as vectors of bytes.
Experimentally, it was found to run at 0.2 CPU cycle per byte on recent Intel processors for
The space required is the maximal length of the string, but the time to evaluate
As long as zeroes are forbidden in the string, the zero-padding can be ignored when evaluating the hash function without affecting universality.
be a prime and define: Using properties of modular arithmetic, above can be computed without producing large numbers for large strings as follows:[16] This Rabin-Karp rolling hash is based on a linear congruential generator.
Below table shows values chosen to initialize h and a for some of the popular implementations.
is sufficiently large compared to the length of strings hashed, the family is very close to universal (in statistical distance).
To mitigate the computational penalty of modular arithmetic, three tricks are used in practice:[11]