Perfect hash function

Perfect hash functions may be used to implement a lookup table with constant worst-case access time.

For frequently changing S dynamic perfect hash functions may be used at the cost of additional space.

[1] The space requirement to store the perfect hash function is in O(n) where n is the number of keys in the structure.

A perfect hash function with values in a limited range can be used for efficient lookup operations, by placing keys from S (or other associated values) in a lookup table indexed by the output of the function.

[2] With perfect hashing, the associated data can be read or written with a single access to the table.

[4] The lower bound for the representation size depends on m and n. Let m = (1+ε) n and h a perfect hash function.

For minimal perfect hashing, ε = 0, the lower bound is log e ≈ 1.44 bits per element.

[4] A perfect hash function for a specific set S that can be evaluated in constant time, and with values in a small range, can be found by a randomized algorithm in a number of operations that is proportional to the size of S. The original construction of Fredman, Komlós & Szemerédi (1984) uses a two-level scheme to map a set S of n elements to a range of O(n) indices, and then map each index to a range of hash values.

The first level of their construction chooses a large prime p (larger than the size of the universe from which S is drawn), and a parameter k, and maps each element x of S to the index If k is chosen randomly, this step is likely to have collisions, but the number of elements ni that are simultaneously mapped to the same index i is likely to be small.

It uses a second set of linear modular functions, one for each index i, to map each member x of S into the range associated with g(x).

[2] As Fredman, Komlós & Szemerédi (1984) show, there exists a choice of the parameter k such that the sum of the lengths of the ranges for the n different values of g(x) is O(n).

A modified version of this two-level scheme with a larger number of values at the top level can be used to construct a perfect hash function that maps S into a smaller range of length n + o(n).

[4] Finally, to reduce the representation size, the (σ(i))0 ≤ i < r are compressed into a form that still allows the evaluation in O(1).

For example, with m = 1.23n Belazzougui, Botelho & Dietzfelbinger (2009) achieved a representation size between 3.03 bits/key and 1.40 bits/key for their given example set of 10 million entries, with lower values needing a higher computation time.

[4] For perfect hash functions, it is first assumed that the range of h is bounded by n as m = (1+ε) n. With the formula given by Belazzougui, Botelho & Dietzfelbinger (2009) and for a universe

whose size |U| = u tends towards infinity, the space lower bounds is bits/key, minus log(n) bits overall.

[4] A trivial but pervasive example of perfect hashing is implicit in the (virtual) memory address space of a computer.

[6] Using a perfect hash function is best in situations where there is a frequently queried large set, S, which is seldom updated.

A minimal perfect hash function is a perfect hash function that maps n keys to n consecutive integers – usually the numbers from 0 to n − 1 or from 1 to n. A more formal way of expressing this is: Let j and k be elements of some finite set S. Then h is a minimal perfect hash function if and only if h(j) = h(k) implies j = k (injectivity) and there exists an integer a such that the range of h is a..a + |S| − 1.

It has been proven that a general purpose minimal perfect hash scheme requires at least

The changes necessary to accomplish this are minimal, and are underlined in the adapted pseudocode below: A minimal perfect hash function F is order preserving if keys are given in some order a1, a2, ..., an and for any keys aj and ak, j < k implies F(aj) < F(ak).

A simple implementation of order-preserving minimal perfect hash functions with constant access time is to use an (ordinary) perfect hash function to store a lookup table of the positions of each key.

Lookups with this scheme are slower, because multiple locations must be checked, but nevertheless take constant worst-case time.