[1] Because of the hierarchical nature of the system, re-hashing is an incremental operation (done one bucket at a time, as needed).
This means that time-sensitive applications are less affected by table growth than by standard full-table rehashes.
Practically all modern filesystems use either extendible hashing or B-trees.
In particular, the Global File System, ZFS, and the SpadFS filesystem use extendible hashing.
The first i bits of each string will be used as indices to figure out where they will go in the "directory" (hash table), where i is the smallest number such that the index of every item in the table is unique.
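For illustration, this lookup might be sketched in Python roughly as follows; the six-bit hash width, the helper name, and the sample bit patterns are assumptions made for this sketch, not values fixed by the example.

HASH_WIDTH = 6  # assumed width of the binary number returned by h(k)

def directory_index(hash_value, i):
    # Take the first (most significant) i bits of the hash as the directory index.
    return hash_value >> (HASH_WIDTH - i)

# With i = 1, hashes beginning with 0 and 1 land in slots 0 and 1 respectively.
assert directory_index(0b100100, 1) == 0b1
assert directory_index(0b010110, 2) == 0b01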
The first two keys to be inserted, k1 and k2, can be distinguished by the most significant bit, and would be inserted into the table accordingly.

Now, if k3 were to be hashed to the table, it wouldn't be enough to distinguish all three keys by one bit (because both k3 and k1 have 1 as their leftmost bit). Because comparing the first two most significant bits would give each key a unique location, the directory size is doubled. And so now k1 and k3 have a unique location, being distinguished by the first two leftmost bits.
Because k2 is in the top half of the table, both 00 and 01 point to it because there is no other key to compare to that begins with a 0.
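That state of the directory can be pictured as a small table in Python; the bucket labels B and C and the exact assignment of k1 and k3 to the 10 and 11 slots are illustrative assumptions consistent with the description above.

# Global depth 2 -> 2**2 = 4 directory entries; bucket size is 1.
bucket_A = ["k2"]   # local depth 1: only the leading 0 has been needed so far
bucket_B = ["k1"]   # local depth 2
bucket_C = ["k3"]   # local depth 2

directory = {
    0b00: bucket_A,
    0b01: bucket_A,   # both 0-prefixed slots still share k2's bucket
    0b10: bucket_B,
    0b11: bucket_C,
}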
Now, k4 needs to be inserted, and its hash begins with the two bits 01 (the full hash being 011110); using a 2-bit depth in the directory, this maps from 01 to Bucket A. Bucket A is full (maximum size 1), so it must be split; because more than one directory pointer points to Bucket A, there is no need to increase the directory size.
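A split of this kind, where the bucket's local depth is below the global depth and the directory keeps its size, might be sketched roughly as follows; the function name, the dict-based bucket representation, and the most-significant-bit indexing helper are assumptions of this sketch.

HASH_WIDTH = 6

def msb_index(h, bits):
    # First `bits` (most significant) bits of a HASH_WIDTH-bit hash.
    return h >> (HASH_WIDTH - bits)

def split_bucket(directory, global_depth, bucket, hash_of):
    # `directory` is a list of bucket dicts indexed by the global_depth-bit prefix.
    # The bucket gains one bit of local depth; the directory keeps its size.
    bucket["local_depth"] += 1
    new_bucket = {"local_depth": bucket["local_depth"], "items": []}

    # Re-point the directory slots whose newly significant bit is 1.
    for slot in range(len(directory)):
        if directory[slot] is bucket:
            if (slot >> (global_depth - bucket["local_depth"])) & 1:
                directory[slot] = new_bucket

    # Redistribute the old contents using the longer prefix.
    items, bucket["items"] = bucket["items"], []
    for key in items:
        directory[msb_index(hash_of(key), global_depth)]["items"].append(key)
    return new_bucket

In the example, this is the step at which the 00 and 01 slots stop sharing a single bucket, after which the insertion of k4 is retried.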
What is needed is information about the key size that maps the directory (the global depth) and the key size that has previously mapped the bucket (the local depth), in order to distinguish the two action cases: doubling the directory when a full bucket's local depth equals the global depth, and creating a new bucket and redistributing its entries when the local depth is smaller. Examining the initial case of an extendible hash structure, if each directory entry points to one bucket, then the local depth should be equal to the global depth.
The number of directory entries is equal to 2^(global depth), and the initial number of buckets is equal to 2^(local depth).
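In the smallest case, global depth = local depth = 0 gives 2^0 = 1 directory entry pointing at a single bucket; a minimal sketch of that state, with assumed names:

global_depth = 0
initial_bucket = {"local_depth": 0, "items": []}
directory = [initial_bucket] * (2 ** global_depth)   # 2**0 == 1 entry

assert len(directory) == 2 ** global_depth
assert all(b["local_depth"] <= global_depth for b in directory)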
To emphasize the role of the depth information, the example is pursued logically to the end: after Bucket A is split into A' and D, the directory slot for 01 points to D, which holds k2 (whose hash also begins with 01), so retrying k4 finds D already full.
So Bucket D needs to be split, but a check of its local depth, which is 2, shows it is the same as the global depth, which is also 2, so the directory must be doubled again in order to hold keys of sufficient detail, e.g. 3 bits.
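Doubling the directory in this situation can be sketched as follows for the most-significant-bit indexing used in the worked example (the function name is an assumption); with the least-significant-bit indexing mentioned at the end of this section, the same step reduces to appending a copy of the directory to itself.

def double_directory(directory, global_depth):
    # Each old slot p becomes the two adjacent slots p0 and p1, both still
    # pointing at p's bucket; only the subsequent bucket split re-points one of them.
    new_directory = [directory[slot >> 1] for slot in range(2 * len(directory))]
    return new_directory, global_depth + 1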
After the doubling, Bucket D is still full; its local depth is 2 while the global depth is now 3, so D can be split into buckets D' and E. The contents of D are rehashed and bitmasked using the new global-depth bit count of 3 and land in D', while retrying k4 gives 011, which now points to the new, empty bucket E; so k4 goes in Bucket E.

Below is the extendible hashing algorithm in Python, with the disc block / memory page association, caching and consistency issues removed.
Note that a problem exists if the depth exceeds the bit size of an integer, because then doubling the directory or splitting a bucket will not allow entries to be rehashed to different buckets.
The code uses the least significant bits, which makes it more efficient to expand the table, as the entire directory can be copied as one block (Ramakrishnan & Gehrke (2003)).
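A minimal sketch along those lines is given below; it uses least-significant-bit indexing and fixed-capacity in-memory buckets, and the class names, BUCKET_CAPACITY, and the depth guard are assumptions of this sketch rather than details fixed by the description above.

BUCKET_CAPACITY = 4      # assumed bucket (page) size; the choice is arbitrary here
MAX_DEPTH = 64           # crude guard for the caveat about exceeding the hash's bit size


class Bucket:
    def __init__(self, local_depth=0):
        self.local_depth = local_depth
        self.items = {}                      # key -> value

    def is_full(self):
        return len(self.items) >= BUCKET_CAPACITY


class ExtendibleHash:
    def __init__(self):
        self.global_depth = 0
        self.directory = [Bucket()]          # 2**0 == 1 entry

    def _bucket_for(self, key):
        # Index the directory by the low global_depth bits of the hash.
        return self.directory[hash(key) & ((1 << self.global_depth) - 1)]

    def get(self, key):
        return self._bucket_for(key).items.get(key)

    def put(self, key, value):
        bucket = self._bucket_for(key)
        bucket.items[key] = value
        while bucket.is_full() and bucket.local_depth < MAX_DEPTH:
            if bucket.local_depth == self.global_depth:
                # With LSB indexing, the doubled directory is a copy of the old one.
                self.directory = self.directory * 2
                self.global_depth += 1
            bucket = self._split(bucket, key)

    def _split(self, bucket, key):
        # Split `bucket` (which currently holds `key`) on one extra low-order bit.
        new_bit = 1 << bucket.local_depth
        low = Bucket(bucket.local_depth + 1)
        high = Bucket(bucket.local_depth + 1)
        for k, v in bucket.items.items():
            (high if hash(k) & new_bit else low).items[k] = v

        # Re-point every directory slot that referred to the old bucket: they are
        # exactly the slots sharing the old bucket's low-order bit suffix.
        suffix = hash(key) & (new_bit - 1)
        for slot in range(suffix, len(self.directory), new_bit):
            self.directory[slot] = high if slot & new_bit else low
        return self._bucket_for(key)


# Example use of the sketch:
table = ExtendibleHash()
for n in range(1000):
    table.put(f"key{n}", n)
assert table.get("key42") == 42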