Primary clustering

In computer programming, primary clustering is a phenomenon that causes performance degradation in linear-probing hash tables.

The phenomenon states that, as elements are added to a linear probing hash table, they have a tendency to cluster together into long runs (i.e., long contiguous regions of the hash table that contain no free slots).

This causes insertions and negative queries to take expected time

Primary clustering has two causes: Another way to understand primary clustering is by examining the standard deviation on the number of items that hash to a given region within the hash table.

The expected number of items that hash into the region is

On the other hand, the standard deviation on the number of such items is

, the number of items that hash into the region will exceed the size

This intuition is often used as the starting point for formal analyses of primary clustering.

[2][3][4] Primary clustering causes performance degradation for both insertions and queries in a linear probing hash table.

Insertions must travel to the end of a run, and therefore take expected time

[1] Negative queries (i.e., queries that are searching for an element that turns out not to be present) must also travel to the end of a run, and thus also take expected time

[1] Positive queries can terminate as soon as they find the element that they are searching for.

As a result, the expected time to query a random element in the hash table is

[1] These bounds also hold for linear probing with lazy deletions (i.e., using tombstones for deletions), as long as the hash table is rebuilt (and the tombstones are cleared out) semi-frequently.

[2] Many textbooks describe the winner-keeps-winning effect (in which the longer a run becomes, the more likely it is to accrue additional elements) as the sole cause of primary clustering.

[5][6][7][8][9][10][11] However, as noted by Knuth,[1] this is not the main cause of primary clustering.

Some textbooks state that the expected time for a positive query is

Some positive queries may have much larger expected running times, however.

Ordered linear probing[13] (often referred to as Robin Hood hashing[14]) is a technique for reducing the effects of primary clustering on queries.

Ordered linear probing sorts the elements within each run by their hash.

This results in both positive and negative queries taking expected time

Graveyard hashing is a variant of ordered linear probing that eliminates the asymptotic effects of primary clustering for all operations.

[2] Graveyard hashing strategically leaves gaps within runs that future insertions can make use of.

These gaps, which can be thought of as tombstones (like those created by lazy deletions), are inserted into the table during semi-regular rebuilds.

The gaps then speed up the insertions that take place until the next semi-regular rebuild occurs.

Every operation in a graveyard hash table takes expected time