Cuckoo hashing

The name derives from the behavior of some species of cuckoo, where the cuckoo chick pushes the other eggs or young out of the nest when it hatches in a variation of the behavior referred to as brood parasitism; analogously, inserting a new key into a cuckoo hashing table may push an older key to a different location in the table.

Cuckoo hashing was first described by Rasmus Pagh and Flemming Friche Rodler in a 2001 conference paper.

However, open addressing suffers from collisions, which happen when more than one key is mapped to the same cell.

It is also possible for both hash functions to provide indexes into a single table.

The process continues until an empty position is found to insert the key.

are rehashed with new hash functions and the insertion procedure repeats.

The following is pseudocode for insertion:[1]: 125 On lines 10 and 15, the "cuckoo approach" of kicking other keys which occupy

[1]: 124-125 Insertions succeed in expected constant time,[1] even considering the possibility of having to rebuild the table, as long as the number of keys is kept below half of the capacity of the hash table, i.e., the load factor is below 50%.

One method of proving this uses the theory of random graphs: one may form an undirected graph called the "cuckoo graph" that has a vertex for each hash table location, and an edge for each hashed value, with the endpoints of the edge being the two possible locations of the value.

Then, the greedy insertion algorithm for adding a set of values to a cuckoo hash table succeeds if and only if the cuckoo graph for this set of values is a pseudoforest, a graph with at most one cycle in each of its connected components.

Any vertex-induced subgraph with more edges than vertices corresponds to a set of keys for which there are an insufficient number of slots in the hash table.
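The pseudoforest condition can be checked directly with a union-find structure, as in the following sketch. The function name is_pseudoforest and its argument layout (a slot count plus one (slot_a, slot_b) pair per key) are assumptions made for this example. The idea is that a connected component with at most one cycle has no more edges than vertices, so a second cycle in any component signals more keys than slots.

```python
def is_pseudoforest(num_slots, edges):
    """Return True if every connected component has at most one cycle.

    `edges` holds one (slot_a, slot_b) pair per key: its two candidate slots.
    """
    parent = list(range(num_slots))
    has_cycle = [False] * num_slots   # per-root flag: component already has a cycle

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            # An edge inside one component closes a cycle (a == b is a self-loop).
            if has_cycle[ra]:
                return False          # second cycle: more keys than slots here
            has_cycle[ra] = True
        else:
            # Joining two components that each have a cycle gives two cycles.
            if has_cycle[ra] and has_cycle[rb]:
                return False
            parent[ra] = rb
            has_cycle[rb] = has_cycle[ra] or has_cycle[rb]
    return True
```

For example, three keys whose candidate slots form a triangle on slots 0, 1 and 2 can all be placed, and so can a fourth key with candidate slots 0 and 3; a fifth key restricted to those same four slots cannot.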

With high probability, for load factor less than 1/2 (corresponding to a random graph in which the ratio of the number of edges to the number of vertices is bounded below 1/2), the graph is a pseudoforest and the cuckoo hashing algorithm succeeds in placing all keys.

The same theory also proves that the expected size of a connected component of the cuckoo graph is small, ensuring that each insertion takes constant expected time.

However, also with high probability, a load factor greater than 1/2 will lead to a giant component with two or more cycles, causing the data structure to fail and need to be resized.

A third approach from 2014[6] is to slightly modify the cuckoo hash table with a so-called stash, which makes it possible to use nothing more than 2-independent hash functions.

In practice, cuckoo hashing is about 20–30% slower than linear probing, which is the fastest of the common approaches.[1]

The reason is that cuckoo hashing often causes two cache misses per search, to check the two locations where a key might be stored, while linear probing usually causes only one cache miss per search.

However, because of its worst case guarantees on search time, cuckoo hashing can still be valuable when real-time response rates are required.

The last column illustrates a failed insertion due to a cycle; details are given below.

In the last row of the table we find the same initial situation as at the beginning again.

Several variations of cuckoo hashing have been studied, primarily with the aim of improving its space usage by increasing the load factor that it can tolerate to a number greater than the 50% threshold of the basic algorithm.

Some of these methods can also be used to reduce the failure rate of cuckoo hashing, causing rebuilds of the data structure to be much less frequent.

The stash, in this data structure, is an array of a constant number of keys, used to store keys that cannot successfully be inserted into the main hash table of the structure.

This modification reduces the failure rate of cuckoo hashing to an inverse-polynomial function with an exponent that can be made arbitrarily large by increasing the stash size.
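A stash can be added to a cuckoo table with only a few extra lines, as the following sketch suggests. It assumes a single table indexed by both hash functions; the class name StashedCuckooSet, the fixed salts, and the default stash_size and max_loop values are hypothetical choices for this example. When the displacement loop gives up, the key left without a slot goes into the stash instead of forcing an immediate rebuild, and lookups also scan the stash.

```python
class StashedCuckooSet:
    """Single-table cuckoo set with a small stash (illustrative sketch)."""

    def __init__(self, capacity=16, stash_size=4, max_loop=16):
        self.slots = [None] * capacity    # None marks an empty slot
        self.stash = []                   # overflow area, meant to stay tiny
        self.stash_size = stash_size      # assumed stash budget
        self.max_loop = max_loop          # assumed displacement limit
        self.salts = (0x9E3779B9, 0x85EBCA6B)   # arbitrary fixed salts

    def _positions(self, key):
        n = len(self.slots)
        return tuple(hash((s, key)) % n for s in self.salts)

    def lookup(self, key):
        # Two table probes plus a scan of the constant-size stash.
        in_table = any(self.slots[p] == key for p in self._positions(key))
        return in_table or key in self.stash

    def insert(self, key):
        """Insert key; return False once the stash exceeds its budget,
        which signals that the caller should rebuild the table."""
        if self.lookup(key):
            return True
        pos = self._positions(key)[0]
        for _ in range(self.max_loop):
            key, self.slots[pos] = self.slots[pos], key
            if key is None:
                return True
            a, b = self._positions(key)
            pos = b if pos == a else a    # evicted key moves to its other slot
        # Displacement limit reached: park the homeless key in the stash.
        self.stash.append(key)
        return len(self.stash) <= self.stash_size
```

No key is ever dropped in this sketch: a key that cannot be placed always ends up in the stash, and the return value only reports whether the stash is still within its intended size.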

This data structure forms an approximate set membership data structure with much the same properties as a Bloom filter: it can store the members of a set of keys, and test whether a query key is a member, with some chance of false positives (queries that are incorrectly reported as being part of the set) but no false negatives.

However, it improves on a Bloom filter in multiple respects: its memory usage is smaller by a constant factor, it has better locality of reference, and (unlike Bloom filters) it allows for fast deletion of set elements with no additional storage penalty.
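The structure described above is commonly called a cuckoo filter. The sketch below illustrates the idea under several assumptions that are not taken from the text: 8-bit fingerprints, buckets of four entries, SHA-256 as the hash, a limit of 500 displacements, and a power-of-two bucket count so that the alternative bucket can be computed from the current bucket and the fingerprint alone.

```python
import hashlib
import random


class CuckooFilter:
    """Approximate set membership with deletion (illustrative sketch)."""

    def __init__(self, num_buckets=64, bucket_size=4, max_kicks=500, seed=0):
        # A power-of-two bucket count makes the XOR mapping its own inverse.
        assert num_buckets & (num_buckets - 1) == 0
        self.m = num_buckets
        self.b = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    @staticmethod
    def _h(data):
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _alt(self, index, fp):
        # Other candidate bucket, derived from the fingerprint only.
        return (index ^ self._h(bytes([fp]))) % self.m

    def _candidates(self, item):
        fp = self._h(b"fp" + item.encode()) % 255 + 1   # 8-bit, never zero
        i1 = self._h(item.encode()) % self.m
        return fp, i1, self._alt(i1, fp)

    def insert(self, item):
        fp, i1, i2 = self._candidates(item)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.b:
                self.buckets[i].append(fp)
                return True
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            j = self.rng.randrange(self.b)
            # Evict a random fingerprint and move it to its other bucket.
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._alt(i, fp)
            if len(self.buckets[i]) < self.b:
                self.buckets[i].append(fp)
                return True
        # Filter treated as full. The fingerprint still in hand is dropped
        # here; a real implementation would need to keep it somewhere.
        return False

    def contains(self, item):
        fp, i1, i2 = self._candidates(item)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def delete(self, item):
        fp, i1, i2 = self._candidates(item)
        for i in (i1, i2):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)   # removes one copy only
                return True
        return False
```

Because only fingerprints are stored, two different keys that share a fingerprint and a bucket are indistinguishable, which is the source of false positives. Deletion removes a single stored copy of the fingerprint and should only be called for keys that were actually inserted.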

Kenneth Ross[14] has shown that bucketized versions of cuckoo hashing (variants that use buckets containing more than one key) are faster than conventional methods for large hash tables as well, when space utilization is high.

A survey by Mitzenmacher[7] presents open problems related to cuckoo hashing as of 2009.

Cuckoo hashing is used in TikTok's recommendation system to solve the problem of "embedding table collisions", which can result in reduced model quality.

The TikTok recommendation system "Monolith" takes advantage of cuckoo hashing's collision resolution to prevent different concepts from being mapped to the same vectors.