Linear hashing

Linear hashing (LH) is a dynamic data structure which implements a hash table and grows or shrinks one bucket at a time.

The trigger for a reconstruction depends on the flavor of the scheme; it could be an overflow at a bucket or load factor (i.e., the number of records divided by the number of buckets) moving outside of a predetermined range.

spiral storage) distributes records unevenly over the buckets such that buckets with high costs of insertion, deletion, or retrieval are earliest in line for a split.

[5] Linear Hashing has also been made into a scalable distributed data structure, LH*.

[9] LH* itself has been expanded to provide data availability in the presence of failed buckets.

[10] Key based operations (inserts, deletes, updates, reads) in LH and LH* take maximum constant time independent of the number of buckets and hence of records.

For example, in Ellis' implementation, a bucket is a linked list of records.

[2] The file allows the key based CRUD operations create or insert, read, update, and delete as well as a scan operations that scans all records, for example to do a database select operation on a non-key attribute.

[10] Records are stored in buckets whose numbering starts with 0.

[10] The key distinction from schemes such as Fagin's extendible hashing is that as the file expands due to insertions, only one bucket is split at a time, and the order in which buckets are split is already predetermined.

returns the 0-based index of the bucket that contains the record with key

corresponds to the number of rightmost binary digits of the key

This dynamic hash function can be expressed arithmetically as

Note that when the total number of buckets is equal to one,

Instead of using the load factor, this threshold can also be expressed as an occupancy percentage, in which case, the maximum number of records in the hash index equals (occupancy percentage)*(max records per non-overflowed bucket)*(number of buckets).

File contraction occurs in some LH algorithm implementations if a controlled split causes the load factor to sink below a threshold.

In this case, a merge operation would be triggered which would undo the last split, and reset the file state.

The split pointer corresponds to the first bucket that uses the hash function

[10] For example, if numerical records are inserted into the hash index according to their farthest right binary digits, the bucket corresponding with the appended bucket will be split.

When a bucket is split, split pointer and possibly the level are updated according to the following, such that the level is 0 when the linear hashing index only has 1 bucket.

[10]The main contribution of LH* is to allow a client of an LH* file to find the bucket where the record resides even if the client does not know the file state.

Based on their file state, a client calculates the address of a key and sends a request to that bucket.

In a reasonably stable system, that is, if there is only one split or merge going on while the request is processed, it can be shown that there are at most two forwards.

After a forward, the final bucket sends an Image Adjustment Message to the client whose state is now closer to the state of the distributed file.

[10] While forwards are reasonably rare for active clients, their number can be even further reduced by additional information exchange between servers and clients [13] The file state consists of split pointer

Griswold and Townsend [14] discussed the adoption of linear hashing in the Icon language.

They discussed the implementation alternatives of dynamic array algorithm used in linear hashing, and presented performance comparisons using a list of Icon benchmark applications.

Linear hashing is used in the Berkeley database system (BDB), which in turn is used by many software systems, using a C implementation derived from the CACM article and first published on the Usenet in 1988 by Esmond Pitt.