Typically, k is a small constant which depends on the desired false positive rate ε, while m is proportional to k and to the number of elements to be added.
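Under the standard independence approximation these relationships have closed forms: m = −n ln ε / (ln 2)² bits and k = (m/n) ln 2 hash functions for n expected elements. A small sketch (the function name is illustrative):

```python
from math import ceil, log

def optimal_parameters(n, eps):
    """Classic Bloom filter sizing: for n expected elements and target
    false positive rate eps, use m = -n*ln(eps)/(ln 2)^2 bits and
    k = (m/n)*ln 2 hash functions (rounded to the nearest integer)."""
    m = ceil(-n * log(eps) / (log(2) ** 2))
    k = max(1, round(m / n * log(2)))
    return m, k
```

For example, one million elements at a 1% target rate come out to roughly 9.6 bits per element with k = 7 hash functions.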
While risking false positives, Bloom filters have a substantial space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries.
Hash tables gain a space and time advantage if they begin ignoring collisions and store only whether each bucket contains an entry; in this case, they have effectively become Bloom filters with k = 1.
No other constant-space set data structure has this property, but the average access time of sparse hash tables can make them faster in practice than some Bloom filters.
If k = 1, then in order to keep the false positive rate sufficiently low, a small fraction of bits should be set, which means the array must be very large and contain long runs of zeros.
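To make the trade-off concrete: with k = 1 the false positive rate is roughly the fraction of set bits, about n/m, so hitting a target ε requires m ≈ n/ε bits, far more than the optimum. A rough comparison under the standard approximations (helper name is illustrative):

```python
from math import ceil, log

def bits_required(n, eps, k=None):
    """Approximate array size for n elements at false positive rate eps.
    With k = 1, eps is roughly n/m, so m must be about n/eps; at the
    optimal k, the standard formula m = -n*ln(eps)/(ln 2)^2 applies."""
    if k == 1:
        return ceil(n / eps)
    return ceil(-n * log(eps) / log(2) ** 2)

bits_required(1000, 0.01, k=1)  # ~100,000 bits, mostly long runs of zeros
bits_required(1000, 0.01)       # ~9,600 bits at the optimal k
```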
The true probability of a false positive, without assuming independence, is

$$\frac{1}{m^{k(n+1)}} \sum_{i=1}^{m} i^k \, i! \binom{m}{i} \begin{Bmatrix} kn \\ i \end{Bmatrix},$$

where the braces $\{\}$ denote Stirling numbers of the second kind.
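This exact probability can be evaluated directly for small filters and compared against the familiar approximation (1 − e^{−kn/m})^k; a sketch, with illustrative function names:

```python
from functools import lru_cache
from math import comb, exp, factorial

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling numbers of the second kind via the standard recurrence
    S(n, k) = k*S(n-1, k) + S(n-1, k-1)."""
    if n == 0:
        return 1 if k == 0 else 0
    if k == 0:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def exact_false_positive(m, n, k):
    """Exact false positive probability of an m-bit, k-hash Bloom filter
    holding n elements, with no independence assumption."""
    total = sum(i**k * factorial(i) * comb(m, i) * stirling2(k * n, i)
                for i in range(1, m + 1))
    return total / m ** (k * (n + 1))

def approx_false_positive(m, n, k):
    """The familiar approximation (1 - e^{-kn/m})^k."""
    return (1 - exp(-k * n / m)) ** k
```

As a sanity check, with m = 2, n = 1, k = 1 the exact probability is 1/2: one of the two bits is set, and a query on a new key hits it half the time.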
This technique, which was first introduced by Carter et al. in 1978,[26] relies on the fact that compact hash tables can be implemented to use roughly $n \log_2(1/\varepsilon)$ bits of space.
The space-efficient variant relies on using a single hash function that generates for each key a value in the range $[0, n/\varepsilon]$, and then compressing the resulting values (for example with Golomb coding) into space close to $n \log_2(1/\varepsilon)$ bits.
Because of decompression overhead, this variant may be slower than classic Bloom filters, but this may be compensated by the fact that only a single hash function needs to be computed.
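A minimal sketch of the compression idea, using a Rice code (the power-of-two special case of Golomb coding) and blake2b as the single hash; the names and parameter choices here are illustrative assumptions, not the actual design of Putze, Sanders & Singler:

```python
import hashlib
from math import ceil, log2

def rice_encode(sorted_values, r):
    """Rice-encode the gaps between sorted hash values: each gap is a
    unary quotient (gap >> r) followed by r remainder bits."""
    bits, prev = [], 0
    for v in sorted_values:
        gap = v - prev
        prev = v
        q, rem = gap >> r, gap & ((1 << r) - 1)
        bits.append("1" * q + "0" + format(rem, f"0{r}b"))
    return "".join(bits)

def build_compressed_filter(keys, eps):
    """Hash every key into [0, n/eps) with a single hash function and
    store the sorted values compressed; a membership query would
    decompress (or seek within blocks) and search for the query's hash."""
    n = len(keys)
    universe = ceil(n / eps)
    values = sorted({
        int.from_bytes(hashlib.blake2b(k.encode(), digest_size=8).digest(),
                       "big") % universe
        for k in keys})
    # A Rice parameter near log2(mean gap) is close to optimal here.
    r = max(0, round(log2(universe / max(1, len(values)))))
    return rice_encode(values, r), r, universe
```

The encoded size lands in the vicinity of n·log₂(1/ε) bits plus a small per-key overhead, which is why the compressed form approaches the space bound mentioned above.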
[31] A treatment that unifies Bloom filters with other work on random projections, compressive sensing, and locality-sensitive hashing remains to be done (though see Dasgupta et al.[32] for one attempt inspired by neuroscience).
Unlike the typical Bloom filter, elements are hashed to a bit array through deterministic, fast and simple-to-calculate functions.
If a bucket does overflow, the increment and decrement operations must leave it set to the maximum possible value in order to retain the properties of a Bloom filter.
Once the designed capacity of the table is exceeded, the false positive rate will grow rapidly as more keys are inserted.
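A sketch of a counting filter with 4-bit saturating counters illustrates the overflow rule: once a counter reaches its maximum it sticks there, because decrementing a saturated counter could silently lose counts (the class and hash choices are illustrative):

```python
import hashlib

class CountingBloomFilter:
    MAX = 15  # 4-bit counters, the usual choice

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counts = [0] * m

    def _indexes(self, item):
        # k derived hash functions via distinct salts
        for i in range(self.k):
            h = hashlib.blake2b(item.encode(), salt=i.to_bytes(8, "big")).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for i in self._indexes(item):
            self.counts[i] = min(self.counts[i] + 1, self.MAX)

    def remove(self, item):
        for i in self._indexes(item):
            # Saturated buckets must stay at MAX to retain correctness.
            if 0 < self.counts[i] < self.MAX:
                self.counts[i] -= 1

    def __contains__(self, item):
        return all(self.counts[i] > 0 for i in self._indexes(item))
```

Unlike a plain Bloom filter, this structure supports deletion, at the cost of roughly four times the space.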
Bonomi et al. (2006) introduced a data structure based on d-left hashing that is functionally equivalent but uses approximately half as much space as counting Bloom filters.
The space efficient variant by Putze, Sanders & Singler (2007) could also be used to implement counting filters by supporting insertions and deletions.
Rottenstreich, Kanizo & Keslassy (2012) introduced a new general method based on variable increments that significantly improves the false positive probability of counting Bloom filters and their variants, while still supporting deletions.
Kim et al. (2019) showed that the false positive rate of a counting Bloom filter decreases as k grows from 1 to an optimal point, and increases beyond it.
[35] Bloom filters can be organized in distributed data structures to perform fully decentralized computations of aggregate functions.
Decentralized aggregation makes collective measurements locally available in every node of a distributed network without involving a centralized computational entity for this purpose.
One of the main obstacles for a parallel Bloom filter is the organization and communication of the unordered data, which is, in general, distributed evenly over all processing elements (PEs) at initialization or at batch insertions.
[37] For both approaches a "Single Shot" Bloom filter is used, which calculates only one hash and thus flips one bit per element, to reduce the communication volume.
While using more than two repetitions can reduce the communication volume further if the number of duplicates in a set is small, the payoff for the additional complications is low.
Counting Bloom filters can be used to approximate the number of differences between two sets and this approach is described in Agarwal & Trachtenberg (2006).
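One simple way to realize this idea, sketched below, is to build a counting filter for each set with identical parameters and compare counters bucket by bucket; this is an illustration of the principle, not necessarily the exact estimator of Agarwal & Trachtenberg. Elements common to both sets cancel, and each differing element contributes about k to the total discrepancy:

```python
import hashlib

def counting_filter(items, m, k):
    counts = [0] * m
    for item in items:
        for i in range(k):
            h = hashlib.blake2b(item.encode(), salt=i.to_bytes(8, "big")).digest()
            counts[int.from_bytes(h[:8], "big") % m] += 1
    return counts

def estimate_symmetric_difference(a, b, m=4096, k=3):
    """Each element increments k counters, so summing |c_a - c_b| over
    all buckets and dividing by k roughly counts the elements appearing
    in one set but not the other (collisions make it an underestimate)."""
    ca = counting_filter(a, m, k)
    cb = counting_filter(b, m, k)
    return sum(abs(x - y) for x, y in zip(ca, cb)) / k
```

The appeal is that only the two counter arrays need to be exchanged, not the sets themselves.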
[43] Almeida et al. (2007) proposed a variant of Bloom filters that can adapt dynamically to the number of elements stored, while assuring a minimum false positive probability.
However, the main characteristic of spatial Bloom filters (SBFs) is their ability to store multiple sets in a single data structure, which makes them suitable for a number of different application scenarios.
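Almeida et al.'s adaptive variant (published as the scalable Bloom filter) chains plain filters: when the current one reaches its design capacity, a larger filter with a tighter error bound is added, so the compounded false positive rate stays below the overall target. A simplified sketch with illustrative parameters:

```python
import hashlib
from math import ceil, log

class BloomFilter:
    def __init__(self, capacity, eps):
        self.capacity, self.eps, self.count = capacity, eps, 0
        self.m = max(8, ceil(-capacity * log(eps) / log(2) ** 2))
        self.k = max(1, round(self.m / capacity * log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _indexes(self, item):
        for i in range(self.k):
            h = hashlib.blake2b(item.encode(), salt=i.to_bytes(8, "big")).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i // 8] |= 1 << (i % 8)
        self.count += 1

    def __contains__(self, item):
        return all(self.bits[i // 8] >> (i % 8) & 1 for i in self._indexes(item))

class ScalableBloomFilter:
    def __init__(self, initial_capacity=64, eps=0.01, growth=2, tightening=0.5):
        # Stage i gets error eps*(1-r)*r^i, so the geometric sum stays below eps.
        self.growth, self.r = growth, tightening
        self.filters = [BloomFilter(initial_capacity, eps * (1 - tightening))]

    def add(self, item):
        last = self.filters[-1]
        if last.count >= last.capacity:
            last = BloomFilter(last.capacity * self.growth, last.eps * self.r)
            self.filters.append(last)
        last.add(item)

    def __contains__(self, item):
        return any(item in f for f in self.filters)
```

A query must probe every stage, so lookups slow down slightly as the structure grows; the geometric capacity growth keeps the number of stages logarithmic in the total element count.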
In the context of service discovery in a network, each node stores regular and attenuated Bloom filters locally.
Since the patterns don't match, we check the attenuated Bloom filter in order to determine which node should be the next hop.
More advanced filters also encode atom counts, larger substructure features like carboxyl groups, and graph properties like the number of rings.
In hash-based fingerprints, a hash function based on atom and bond properties is used to turn a subgraph into a PRNG seed, and the first output values are used to set bits in the Bloom filter.
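A toy illustration of the mechanism, where crc32 stands in for the real atom/bond-based hash and the fragment strings and bit counts are invented for the example:

```python
import random
import zlib

def add_fragment(fingerprint, fragment, bits_per_fragment=4):
    """Seed a PRNG with a hash of the fragment and let its first few
    outputs choose which bits of the fingerprint to set."""
    rng = random.Random(zlib.crc32(fragment.encode()))
    for _ in range(bits_per_fragment):
        fingerprint[rng.randrange(len(fingerprint))] = 1

# Fingerprint a molecule by enumerating (here: hand-written) substructure paths.
fingerprint = [0] * 2048
for path in ["C-C", "C=O", "C-C-O", "O-H"]:
    add_fragment(fingerprint, path)
```

Similarity between molecules is then computed on these bit vectors, for instance with the Tanimoto coefficient.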
However, it was not until around 1990 that Daylight Chemical Information Systems, Inc. introduced a hash-based method to generate the bits, rather than using a precomputed table.