Approximate membership query filter

Approximate membership query filters (hereafter, AMQ filters) comprise a group of space-efficient probabilistic data structures that support approximate membership queries.

An approximate membership query answers whether an element is in a set or not with a false positive rate of

AMQ filters have numerous applications, mainly in distributed systems and databases.

The approximate membership query problem is to store information about a set of elements S in a space-efficient way.

The goal is to answer queries about whether an element x is in the set S or not, while constraining false positives to a maximal probability

Dynamic AMQ filters support inserting elements one at a time without rebuilding the data structure.

Increasing the storage space reduces the false positive rate.

[1] Dynamic AMQ filters cannot reach this lower bound.

There are different ways to solve the approximate membership query problem.

The most known data structure are Bloom filters, but there are other data structures that perform better for some false positive rates and space requirements, support additional operations, or have other insertion and lookup times.

To insert an element, all hash functions are calculated and all corresponding bits in the array are set to one.

To reduce the false positive rate, the number of hash functions and

The idea of quotient filters is to hash an element and to split its fingerprint into the

Additional three bits for every slot in the hash table are used to resolve soft collisions (same quotient but different remainders).

After reaching a load threshold the insertion speed of cuckoo filter degrades.

Similar to cuckoo filters, they save fingerprints of the elements in a hash table.

This construction algorithm can fail in a way such that no dynamic insertions are possible without rebuilding the hash table.

The disadvantage of this filter is that the data structure has to be rebuilt if additional elements are added.

The AMQ filter functions as a proxy to the set of keys of a database or remote memory.

Before a presumably slow query to the database or to remote memory is performed, the AMQ filter is used to give an approximate answer as to whether the key is in the database or in remote memory.

The slow query is only performed when the AMQ filter returns true.

Only in the case of a (hopefully rare) false positive is an unnecessary I/O or remote access performed.

The applications are numerous and include package and resource routing, P2P networks, and distributed caching.

[4] AMQ filters are often used as an in-memory data structure to avoid expensive disk accesses.

LSM trees are used in databases like Apache AsterixDB, Bigtable, HBase, LevelDB, SQLite4.

Networks offer a lot of applications for AMQ filters.

Even if the set on the remote server changes the AMQ filter is often not updated right away, but some false positives are tolerated.

In this setting, false negatives can occur if the cache changed in between periodic updates.

AMQ filters can be used to approximate what is stored at each node of the network.

The filter can be filled with ids or keywords of the actual documents of the nodes.