HyperLogLog is an algorithm for the count-distinct problem, approximating the number of distinct elements in a multiset.[1] Calculating the exact cardinality of the distinct elements of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets.
The HyperLogLog algorithm is able to estimate cardinalities of $>10^9$ with a typical accuracy (standard error) of 2%, using 1.5 kB of memory.[3] In the original paper by Flajolet et al.[1] and in related literature on the count-distinct problem, the term "cardinality" is used to mean the number of distinct elements in a data stream with repeated elements.
This article chooses to use Flajolet's definition for consistency with the sources.
The basis of the HyperLogLog algorithm is the observation that the cardinality of a multiset of uniformly distributed random numbers can be estimated by calculating the maximum number of leading zeros in the binary representation of each number in the set: if the maximum number of leading zeros observed is n, an estimate for the number of distinct elements in the set is $2^n$.[1] In the HyperLogLog algorithm, a hash function is applied to each element in the original multiset to obtain a multiset of uniformly distributed random numbers with the same cardinality as the original multiset.
The cardinality of this randomly distributed set can then be estimated using the algorithm above.
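As a concrete sketch of this single-estimator idea in Python (the truncated SHA-1 digest standing in for the hash function h is an arbitrary choice for the example):

```python
import hashlib

def naive_cardinality_estimate(items) -> int:
    """Estimate the distinct count as 2**n, where n is the maximum number of
    leading zeros observed in a 32-bit hash of any item."""
    max_zeros = 0
    for v in items:
        x = int.from_bytes(hashlib.sha1(v.encode()).digest()[:4], "big")
        leading_zeros = 32 - x.bit_length()  # zeros before the leftmost 1-bit
        max_zeros = max(max_zeros, leading_zeros)
    return 2 ** max_zeros
```

Note that a single run of this estimator can only return a power of two, a first hint of the large variance discussed next.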
The simple estimate of cardinality obtained using the algorithm above has the disadvantage of a large variance.
In the HyperLogLog algorithm, the variance is minimised by splitting the multiset into numerous subsets, calculating the maximum number of leading zeros in the numbers in each of these subsets, and using a harmonic mean to combine these estimates for each subset into an estimate of the cardinality of the whole set.
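The harmonic mean matters here because a single subset with an unluckily long run of leading zeros produces a wildly overestimated subset value; the harmonic mean damps such outliers where the arithmetic mean does not. The numbers below are purely illustrative:

```python
from statistics import harmonic_mean

# Hypothetical per-subset estimates; 160.0 models one subset whose maximum
# leading-zero count happened to be much too large.
estimates = [12.0, 9.0, 11.0, 10.0, 160.0]

print(sum(estimates) / len(estimates))  # arithmetic mean: 40.4, dragged up by the outlier
print(harmonic_mean(estimates))         # harmonic mean: ~12.8, outlier damped
```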
The HyperLogLog has three main operations: add to insert a new element into the set, count to obtain the cardinality of the set, and merge to obtain the union of two sets. Some derived operations can be computed using the inclusion–exclusion principle, such as the cardinality of the intersection or the cardinality of the difference between two HyperLogLogs, by combining the merge and count operations.
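For example, assuming count and merge behave as described below, an intersection estimate reduces to arithmetic on three counts:

```python
def intersection_estimate(count_a: float, count_b: float, count_union: float) -> float:
    """Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|, where count_union
    is the count of the sketch obtained by merging the sketches of A and B."""
    return count_a + count_b - count_union
```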
The data of the HyperLogLog is stored in an array M of m counters (or "registers") that are initialized to 0.
Array M initialized from a multiset S is called the HyperLogLog sketch of S. The add operation consists of computing the hash of the input data v with a hash function h, getting the first b bits (where $b = \log_2(m)$), and adding 1 to them to obtain the address of the register to modify. With the remaining bits, compute $\rho(w)$, the position of the leftmost 1 bit; the new value of the register is the maximum between its current value and $\rho(w)$:

$$x := h(v)$$
$$j := 1 + \langle x_1 x_2 \ldots x_b \rangle_2$$
$$w := x_{b+1} x_{b+2} \ldots$$
$$M[j] := \max(M[j], \rho(w))$$
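A minimal Python sketch of the add operation could look as follows; the register count m = 64 (b = 6) and the truncated SHA-1 hash are illustrative choices only, and the registers are 0-indexed here rather than 1-indexed as above:

```python
import hashlib

B = 6                  # index bits; m = 2**B = 64 registers
M = [0] * (1 << B)     # the sketch: m registers initialized to 0

def rho(w: int, width: int) -> int:
    """1-based position of the leftmost 1-bit in the width-bit string w."""
    return width - w.bit_length() + 1

def add(M: list[int], v: bytes) -> None:
    x = int.from_bytes(hashlib.sha1(v).digest()[:4], "big")  # 32-bit hash of v
    j = x >> (32 - B)                # first b bits select the register
    w = x & ((1 << (32 - B)) - 1)    # the remaining 32 - b bits
    M[j] = max(M[j], rho(w, 32 - B))

for name in [b"alice", b"bob", b"alice"]:   # duplicates do not change the sketch
    add(M, name)
```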
The count algorithm consists in computing the harmonic mean of the m registers, and using a constant $\alpha_m$ to derive an estimate $E$ of the count:

$$Z = \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1}$$
$$\alpha_m = \left( m \int_0^\infty \left( \log_2 \frac{2+u}{1+u} \right)^m \, du \right)^{-1}$$
$$E = \alpha_m m^2 Z$$

The intuition is that, n being the unknown cardinality of M, each subset $M_j$ will have about $n/m$ elements. Then $\max_{x \in M_j} \rho(x)$ should be close to $\log_2(n/m)$. The harmonic mean of the quantities $2^{M[j]}$ is $mZ$, which should be near $n/m$, so $m^2 Z$ should be approximately n. Finally, the constant $\alpha_m$ is introduced to correct a systematic multiplicative bias present in $m^2 Z$ due to hash collisions.

The constant $\alpha_m$ is not simple to calculate, and can be approximated with the formula[1]

$$\alpha_m \approx \begin{cases} 0.673 & m = 16 \\ 0.697 & m = 32 \\ 0.709 & m = 64 \\ \dfrac{0.7213}{1 + 1.079/m} & m \ge 128 \end{cases}$$
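Continuing the Python sketch above, the approximated constant and the raw estimator translate directly:

```python
def alpha(m: int) -> float:
    # Piecewise approximation of the bias-correction constant alpha_m.
    if m == 16:
        return 0.673
    if m == 32:
        return 0.697
    if m == 64:
        return 0.709
    return 0.7213 / (1 + 1.079 / m)

def raw_count(M: list[int]) -> float:
    m = len(M)
    Z = 1.0 / sum(2.0 ** -reg for reg in M)  # inverse of the sum of 2**-M[j]
    return alpha(m) * m * m * Z              # E = alpha_m * m**2 * Z
```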
The HyperLogLog technique, though, is biased for small cardinalities, below a threshold of $\frac{5}{2}m$. The original paper proposes using a different algorithm for small cardinalities known as Linear Counting. In the case where the estimate provided above is below the threshold, $E < \frac{5}{2}m$, the alternative calculation can be used:

1. Let V be the count of registers equal to 0.
2. If $V = 0$, use the standard HyperLogLog estimate E above.
3. Otherwise, use Linear Counting: $E^{\star} = m \log \frac{m}{V}$.

Additionally, for very large cardinalities approaching the limit of the size of the registers ($E > \frac{2^{32}}{30}$ for 32-bit registers), the cardinality can be estimated with:

$$E^{\star} = -2^{32} \log\left(1 - \frac{E}{2^{32}}\right)$$

With the above corrections for the lower and upper bounds, the error can be estimated as $\sigma = 1.04/\sqrt{m}$.
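Both corrections combine with raw_count from the previous sketch (assuming 32-bit hash values, as in the add sketch):

```python
from math import log

def count(M: list[int]) -> float:
    m = len(M)
    E = raw_count(M)
    if E <= 2.5 * m:            # small-range regime: try Linear Counting
        V = M.count(0)          # registers still at their initial value
        if V > 0:
            return m * log(m / V)
    if E > (1 << 32) / 30:      # large-range correction for 32-bit hashes
        return -(1 << 32) * log(1 - E / (1 << 32))
    return E
```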
The merge operation for two HLLs ($\mathit{hll}_1$, $\mathit{hll}_2$) consists in obtaining the maximum for each pair of registers $j = 1, \ldots, m$:

$$\mathit{hll}_{\mathrm{union}}[j] = \max(\mathit{hll}_1[j], \mathit{hll}_2[j])$$
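As a sketch, with registers stored in plain lists as above:

```python
def merge(M1: list[int], M2: list[int]) -> list[int]:
    # Both sketches must have been built with the same m and hash function.
    assert len(M1) == len(M2)
    return [max(r1, r2) for r1, r2 in zip(M1, M2)]
```

Because taking the maximum is commutative and associative, the merged sketch is identical to the sketch that would have been built directly from the union of the two input streams, regardless of insertion order.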
To obtain an estimate with relative error $\varepsilon$, the HyperLogLog sketch needs $O(\varepsilon^{-2} \log\log n + \log n)$ space, where n is the set cardinality and m is the number of registers (each register usually taking less than one byte).
The add operation depends on the size of the output of the hash function.
As this size is fixed, we can consider the running time for the add operation to be $O(1)$. The count and merge operations depend on the number of registers m and have a theoretical cost of $O(m)$. In some implementations (Redis)[7] the number of registers is fixed and the cost is considered to be $O(1)$ in the documentation.
The HyperLogLog++ algorithm proposes several improvements in the HyperLogLog algorithm to reduce memory requirements and increase accuracy in some ranges of cardinalities:[6] a 64-bit hash function is used instead of the 32 bits used in the original paper, which removes the need for the large-range correction; an empirical bias correction is applied for small cardinalities, where switching from Linear Counting to the HyperLogLog estimate introduces bias; and a sparse representation of the registers reduces memory requirements for small cardinalities, to be converted to a dense representation later if the cardinality grows.

When the data arrives in a single stream, the Historic Inverse Probability or martingale estimator[8][9] significantly improves the accuracy of the HLL sketch and uses 36% less memory to achieve a given error level.
This estimator is provably optimal for any duplicate-insensitive approximate distinct-counting sketch on a single stream.
The single stream scenario also leads to variants in the HLL sketch construction.
HLL-TailCut+ uses 45% less memory than the original HLL sketch but at the cost of being dependent on the data insertion order and not being able to merge sketches.