Linear probing is a scheme in computer programming for resolving collisions in hash tables, data structures for maintaining a collection of key–value pairs and looking up the value associated with a given key.
It was invented in 1954 by Gene Amdahl, Elaine M. McGraw, and Arthur Samuel and first analyzed in 1963 by Donald Knuth.
As Thorup & Zhang (2012) write, "Hash tables are the most commonly used nontrivial data structures, and the most popular implementation on standard hardware uses linear probing, which is both fast and simple."[1] Linear probing can provide high performance because of its good locality of reference, but is more sensitive to the quality of its hash function than some other collision resolution schemes.[2]
Linear probing is a component of open addressing schemes for using a hash table to solve the dictionary problem.
In open addressing solutions to this problem, the data structure is an array T (the hash table) whose cells T[i] (when nonempty) each store a single key–value pair.
Linear probing is a strategy for resolving collisions: when a new key hashes to a cell that is already occupied, the key is placed into the closest following empty cell.
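A minimal sketch of this scheme in Python (the class and method names are illustrative, and Python's built-in hash stands in for the table's hash function): insertion scans forward from the hashed cell to the first empty cell, and a search scans forward until it finds the key or reaches an empty cell.

class LinearProbingTable:
    """Open-addressing hash table using linear probing (illustrative sketch)."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.slots = [None] * capacity          # each nonempty cell T[i] holds one (key, value) pair

    def _probe(self, key):
        """Yield cell indexes h(key), h(key)+1, ..., wrapping around the table."""
        start = hash(key) % self.capacity
        for offset in range(self.capacity):
            yield (start + offset) % self.capacity

    def insert(self, key, value):
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i][0] == key:
                self.slots[i] = (key, value)    # closest following empty cell, or update in place
                return
        raise RuntimeError("table full; a real table would resize before reaching this point")

    def search(self, key):
        for i in self._probe(key):
            if self.slots[i] is None:           # an empty cell means the key is absent
                return None
            if self.slots[i][0] == key:
                return self.slots[i][1]
        return None

Deletion is omitted from this sketch; removing a key from an open-addressing table requires either a deletion marker or shifting later entries backward, so that searches are not cut short by the newly emptied cell.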
When the fraction of occupied cells (the load factor) exceeds a chosen threshold, the table is typically rebuilt in a larger array. Setting this threshold close to zero and using a high growth rate for the table size leads to faster hash table operations but greater memory usage than threshold values close to one and low growth rates.[3][4]
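A resizing policy of this kind can be bolted onto the sketch above (the constant names here are made up; the threshold and growth rate are the tunable quantities discussed in this paragraph):

LOAD_FACTOR_THRESHOLD = 0.5   # grow once more than half of the cells are occupied (tunable)
GROWTH_RATE = 2               # new capacity = old capacity * GROWTH_RATE (tunable)

def maybe_grow(table):
    """Rebuild the table in a larger array once its load factor exceeds the threshold."""
    used = sum(1 for cell in table.slots if cell is not None)
    if used / table.capacity <= LOAD_FACTOR_THRESHOLD:
        return
    old_slots = table.slots
    table.capacity *= GROWTH_RATE
    table.slots = [None] * table.capacity
    for cell in old_slots:
        if cell is not None:
            table.insert(*cell)   # keys take new positions under the larger capacity

Calling maybe_grow after each insertion keeps probe sequences short at the cost of occasional full rebuilds and extra empty cells.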
Linear probing provides good locality of reference, which causes it to require few uncached memory accesses per operation.[3] However, achieving good performance with this method requires a higher-quality hash function than for some other collision resolution schemes.
With a sufficiently random hash function, linear probing supports dictionary operations in constant expected time. In other words, insert, remove and search operations can be implemented in O(1) expected time, as long as the load factor of the hash table is a constant strictly less than one.
The time for an operation whose key hashes into a maximal block of k contiguous occupied cells is proportional to the length k of the block, and the probability that a key hashes into any particular such block is k/N, where N is the size of the table. Therefore, the expected time for an operation can be calculated as the product of these two terms, O(k²/N), summed over all of the maximal blocks of contiguous cells in the table.[11]
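Written out as a formula, restating the reasoning above, with the sum running over the maximal blocks of contiguous occupied cells and k denoting the length of a block:

\operatorname{E}[\text{time per operation}]
  \;=\; \sum_{\text{maximal blocks}} \Pr[\text{key hashes into the block}] \cdot O(k)
  \;=\; \sum_{\text{maximal blocks}} \frac{k}{N} \cdot O(k)
  \;=\; O\!\left( \sum_{\text{maximal blocks}} \frac{k^{2}}{N} \right).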
Linear probing hash tables suffer from a problem known as primary clustering, in which elements tend to group together into long contiguous runs of occupied cells.[10] Primary clustering also affects unsuccessful searches, since, like insertions, they must travel to the end of a run.[10] Some variations of linear probing are able to achieve better bounds for unsuccessful searches and insertions by using techniques that reduce primary clustering.
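The cost of primary clustering for unsuccessful searches can be seen in a small simulation, sketched here under the assumption of uniformly random hash values (the function name and parameters are made up): it fills a table by linear probing and then measures how far a search for an absent key must probe before reaching an empty cell.

import random

def unsuccessful_search_cost(table_size=1 << 16, load_factor=0.9, trials=100_000, seed=1):
    """Fill a table by linear probing with random hash values, then return the average
    number of probes an unsuccessful search makes before reaching an empty cell."""
    rng = random.Random(seed)
    occupied = [False] * table_size
    for _ in range(int(load_factor * table_size)):
        cell = rng.randrange(table_size)        # uniformly random hash value
        while occupied[cell]:                   # probe forward to the closest empty cell
            cell = (cell + 1) % table_size
        occupied[cell] = True
    total_probes = 0
    for _ in range(trials):
        cell = rng.randrange(table_size)        # hash of a key that is not in the table
        probes = 1
        while occupied[cell]:                   # the search must travel to the end of the run
            cell = (cell + 1) % table_size
            probes += 1
        total_probes += probes
    return total_probes / trials

# At load factor 0.9 this reports roughly 50 probes per unsuccessful search, close to the
# (1 + 1/(1 - 0.9)^2)/2 predicted by Knuth's analysis, and far more than the ~10 probes
# expected if occupied cells were scattered independently rather than clustered into runs.
print(unsuccessful_search_cost())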
Hash function families such as tabulation hashing compute the hash function quickly and can be proven to work well with linear probing.
In particular, linear probing has been analyzed from the framework of k-independent hashing, a class of hash functions that are initialized from a small random seed and that are equally likely to map any k-tuple of distinct keys to any k-tuple of indexes.
Simple tabulation hashing, for example, is only 3-independent; nevertheless, linear probing using these hash functions takes constant expected time per operation.[1][20]
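A minimal Python sketch of simple tabulation hashing (the function and parameter names are illustrative): the key is treated as a fixed number of bytes, each byte indexes a table of random words, and the table entries are combined with exclusive or.

import random

def make_tabulation_hash(key_bytes=4, hash_bits=32, seed=None):
    """Build a simple tabulation hash for key_bytes-byte integer keys."""
    rng = random.Random(seed)
    # One lookup table of 256 random hash_bits-bit words per byte position of the key.
    tables = [[rng.getrandbits(hash_bits) for _ in range(256)]
              for _ in range(key_bytes)]

    def h(x):
        value = 0
        for position in range(key_bytes):
            byte = (x >> (8 * position)) & 0xFF   # extract one byte of the key
            value ^= tables[position][byte]       # combine the random table entries with XOR
        return value

    return h

h = make_tabulation_hash(seed=42)
print(h(123456) % 1024)   # reduce the hash to an index in a table of 1024 cells

Each evaluation costs a handful of array look-ups and XOR operations.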
In an experimental comparison, Richter et al. found that the Multiply-Shift family of hash functions (defined as h_z(x) = (z·x mod 2^w) div 2^(w−d), where z is a random odd w-bit multiplier and d is the number of bits in the hash value) gave the highest throughput, while tabulation hashing gave lower throughput.[2] They point out that each table look-up requires several cycles, making it more expensive than simple arithmetic operations.
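For comparison, a sketch of a multiply-shift hash function along the lines of the definition above (the parameter names are illustrative); it needs only one multiplication and one shift per key.

import random

def make_multiply_shift(word_bits=64, hash_bits=20, seed=None):
    """Build a multiply-shift hash mapping word_bits-bit keys to hash_bits-bit values."""
    rng = random.Random(seed)
    z = rng.getrandbits(word_bits) | 1            # random odd multiplier z
    mask = (1 << word_bits) - 1                   # arithmetic modulo 2^word_bits

    def h(x):
        # h_z(x) = (z * x mod 2^w) div 2^(w - d), with w = word_bits and d = hash_bits
        return ((z * x) & mask) >> (word_bits - hash_bits)

    return h

h = make_multiply_shift(seed=42)
print(h(123456))   # an index into a hash table with 2^20 cells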
The idea of an associative array that allows data to be accessed by its value rather than by its address dates back to the mid-1940s in the work of Konrad Zuse and Vannevar Bush,[21] but hash tables were not described until 1953, in an IBM memorandum by Hans Peter Luhn.
According to Knuth, it was first used by Gene Amdahl, Elaine M. McGraw (née Boehme), and Arthur Samuel in 1954, in an assembler program for the IBM 701 computer.[8] The first published description of linear probing is by Peterson (1957),[8] who also credits Samuel, Amdahl, and Boehme but adds that "the system is so natural, that it very likely may have been conceived independently by others either before or since that time".[24]
The first theoretical analysis of linear probing, showing that it takes constant expected time per operation with random hash functions, was given by Knuth.[10] Significant later developments include a more detailed analysis of the probability distribution of the running time,[25][26] and the proof that linear probing runs in constant time per operation with practically usable hash functions rather than with the idealized random functions assumed by earlier analysis.