Quotient filter

The more elements are added to the set, the larger the probability of false positives.

As keys are added to or removed from the database, the filter is updated to reflect this.

If the filter returns absence, the key is known not to be in the database without any disk accesses having been performed.

A quotient filter has the usual AMQ operations of insert and query.

The compact hash table underlying a quotient filter was described by Cleary in 1984.

[6] The quotient filter is based on a kind of hash table in which entries contain only a portion of the key plus some additional meta-data bits.

These bits are used to deal with the case when distinct keys happen to hash to the same table entry.

By way of contrast, other types of hash tables that deal with such collisions by linking to overflow areas are not compact because the overhead due to linkage can exceed the storage used to store the key.

[1] In a quotient filter a hash function generates a p-bit fingerprint.

As described below, the insertion algorithm ensures that all fingerprints having the same quotient are stored in contiguous slots.

However a run whose first fingerprint occupies its canonical slot indicates the start of a cluster.

They have the following function: The various combinations have the following meaning: We can test if a quotient filter contains some key, d, as follows.

We would compute hash(e), partition it into its remainder, eR and its quotient eQ, which is 4.

The scan stops at slot 1, which we detect as the cluster's start because it is not empty and not shifted.

The figure shows a quotient filter proceeding through a series of states as elements are added.

This is important because lookups and inserts require locating the start and length of an entire cluster.

A large database, such as Webtable[8] may be composed of smaller sub-tables each of which has an associated filter.

However quotient filters can be efficiently merged within memory without having to re-insert the original keys.

This is particularly important in some log structured storage systems that use the log-structured merge-tree or LSM-tree.

One variation of the LSM-Tree is the Sorted Array Merge Tree or SAMT.

A query on the SAMT is directed at only select Wanna-B-trees as evidenced by their quotient filters.

An essential property of quotient filters is that they can be efficiently merged without having to re-insert the original keys.

Given that for large data sets the Wanna-B-trees may not be in memory, accessing them to retrieve the original keys would incur many I/Os.

By construction the values in a quotient filter are stored in sorted order.

So, by working from left to right, one can reconstruct all the fingerprints and the resulting list of integers will be in sorted order.