Reservoir sampling

Reservoir sampling is a family of randomized algorithms for choosing a simple random sample, without replacement, of k items from a population of unknown size n in a single pass over the items.

The size of the population n is not known to the algorithm and is typically too large for all n items to fit into main memory.

At any point, the current state of the algorithm must permit extraction of a simple random sample without replacement of size k over the part of the population seen so far.

If we know the total number of items n and can access the items arbitrarily, then the solution is easy: select k distinct indices between 1 and n uniformly at random and keep the elements at those positions.
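
For instance, a minimal Python sketch of this known-n case (the function and list names are illustrative; Python indices are 0-based):

```python
import random

def sample_known_n(items, k):
    """Known-n case: draw k distinct positions uniformly at random and keep those elements."""
    indices = random.sample(range(len(items)), k)  # every k-subset of positions is equally likely
    return [items[i] for i in indices]
```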

For i ≤ k (the base case), Algorithm R returns all inputs, thus providing the basis for a proof by mathematical induction.

For the inductive step, assume that after i ≥ k items each item seen so far is in the reservoir with probability k/i. When item i + 1 arrives, an index between 1 and i + 1 is generated uniformly at random, so the new item enters the reservoir with probability k/(i + 1); once it becomes clear that a replacement will in fact occur, the probability that any particular item currently in the reservoir is the one replaced is 1/k. An item already in the reservoir therefore survives this step with probability 1 − (k/(i + 1))·(1/k) = i/(i + 1), and so each of the first i items remains in the reservoir with probability (k/i)·(i/(i + 1)) = k/(i + 1).

Therefore, we conclude by the principle of mathematical induction that Algorithm R does indeed produce a uniform random sample of the inputs.
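
For reference, a minimal Python sketch of Algorithm R as analyzed above (the input may be any iterable; the function name is illustrative):

```python
import random

def algorithm_r(stream, k):
    """Algorithm R: keep the first k items; item i > k then replaces a uniformly
    chosen reservoir slot with probability k/i."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)
        else:
            j = random.randint(1, i)      # uniform index in 1..i
            if j <= k:
                reservoir[j - 1] = item   # replace a uniformly random slot
    return reservoir
```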

Generating a random number for every item, together with the linear run time this implies, makes the algorithm unnecessarily slow if the input population is large.

Second simplification: it is unnecessary to remember the entire array of the k smallest random values seen so far, but merely w, the largest among them, since that single threshold determines whether a newly drawn value would displace a reservoir item.

[1] At the same time, it is simple to implement efficiently and does not depend on random deviates from exotic or hard-to-compute distributions.
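
These observations lead to the skip-based scheme known in the literature as Algorithm L: conceptually each item carries a uniform random tag and the reservoir holds the k items with the smallest tags, but the number of items to skip before the next replacement is drawn directly rather than tagging every item. A minimal Python sketch, assuming the input is any iterable (helper names are illustrative):

```python
import math
import random
from itertools import islice

def algorithm_l(stream, k):
    """Skip-based reservoir sampling sketch: w tracks the largest of the k smallest
    tags seen so far, and the gap to the next accepted item is drawn directly."""
    it = iter(stream)
    reservoir = list(islice(it, k))            # the first k items fill the reservoir
    if len(reservoir) < k:
        return reservoir                       # the stream had fewer than k items
    w = math.exp(math.log(random.random()) / k)
    while True:
        skip = math.floor(math.log(random.random()) / math.log(1.0 - w))
        nxt = next(islice(it, skip, skip + 1), None)   # advance past `skip` items
        if nxt is None:
            return reservoir                   # the stream ended during the jump
        reservoir[random.randrange(k)] = nxt   # replace a uniformly random slot
        w *= math.exp(math.log(random.random()) / k)
```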

If we associate with each item of the input a uniformly generated random number, the k items with the largest (or, equivalently, smallest) associated values form a simple random sample.[3]

A simple reservoir-sampling algorithm thus maintains the k items with the currently largest associated values in a priority queue.

Compared with Algorithm R, this method does extra work to maintain the priority queue, and it is relevant mainly because it can easily be extended to items with weights.
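
For concreteness, a minimal Python sketch of this tag-and-keep approach (a min-heap of size k holds the retained items, so the smallest retained tag can be evicted quickly; names are illustrative):

```python
import heapq
import random

def reservoir_by_random_tags(stream, k):
    """Tag each item with a uniform random value and keep the k items with the
    largest tags; the heap root is always the smallest retained tag."""
    heap = []                                        # entries are (tag, index, item)
    for i, item in enumerate(stream):
        tag = random.random()
        if len(heap) < k:
            heapq.heappush(heap, (tag, i, item))
        elif tag > heap[0][0]:
            heapq.heapreplace(heap, (tag, i, item))  # evict the item with the smallest tag
    return [item for _, _, item in heap]
```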

The methods presented in the previous sections do not make it possible to obtain inclusion probabilities that are fixed a priori.

Algorithm A-Res assigns each item i a key r_i^(1/w_i), where r_i is a uniform random number in (0, 1) and w_i is the item's weight, and then selects the k items with the largest keys.

Equivalently, a more numerically stable formulation of this algorithm computes the keys as −ln(r_i)/w_i and selects the k items with the smallest keys.[6]
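
A minimal Python sketch of this key-based selection, assuming the stream yields (item, weight) pairs with positive weights and keeping the k largest keys r^(1/w) in a min-heap (the numerically stable variant would keep the k smallest keys −ln(r)/w instead; names are illustrative):

```python
import heapq
import random

def a_res(stream, k):
    """Weighted reservoir sketch: each (item, weight) pair gets the key
    r ** (1/weight) with r uniform in (0, 1); the k largest keys are kept."""
    heap = []                                        # min-heap of (key, index, item)
    for i, (item, weight) in enumerate(stream):
        key = random.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, i, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, i, item))  # evict the item with the smallest key
    return [item for _, _, item in heap]
```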

The following algorithm is a more efficient version of A-Res, also given by Efraimidis and Spirakis:[5] it relies on the same mathematical properties as A-Res, but instead of calculating a key for each item and checking whether that item should be inserted, it calculates an exponential jump to the next item that will be inserted.

This avoids having to create random variates for each item, which may be expensive.[5]
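
A Python sketch of this jump-based variant, following the structure just described (x_w holds how much cumulative weight must still be skipped before the next insertion; the stream yields (item, weight) pairs with positive weights, and names are illustrative):

```python
import heapq
import math
import random

def a_expj(stream, k):
    """Exponential-jump sketch: instead of keying every item, draw how much
    cumulative weight to skip before the next insertion into the reservoir."""
    heap = []                                        # min-heap of (key, index, item)
    x_w = 0.0                                        # weight left to skip before the next insertion
    for i, (item, weight) in enumerate(stream):
        if len(heap) < k:
            key = random.random() ** (1.0 / weight)
            heapq.heappush(heap, (key, i, item))
            if len(heap) == k:
                x_w = math.log(random.random()) / math.log(heap[0][0])
            continue
        x_w -= weight
        if x_w <= 0.0:
            t = heap[0][0] ** weight                 # the new key must exceed the current threshold
            key = random.uniform(t, 1.0) ** (1.0 / weight)
            heapq.heapreplace(heap, (key, i, item))  # evict the current minimum key
            x_w = math.log(random.random()) / math.log(heap[0][0])
    return [item for _, _, item in heap]
```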

A different approach, due to Chao, aims at inclusion probabilities that are fixed a priori (for example, proportional to the item weights). Chao does not specify how to sample the first k elements; he simply assumes we have some other way of picking them in proportion to their weight.

Chao: "Assume that we have a sampling plan of fixed size with respect to S_k at time A; such that its first-order inclusion probability of X_t is π(k; t)".

As with the other algorithms, it is possible to compute a random weight j and subtract the items' probability mass values, skipping them while j > 0; this reduces the number of random numbers that have to be generated.[4]
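
For illustration, a minimal Python sketch of the basic (non-skipping) Chao-style update, under the simplifying assumptions that the first k items are simply kept and that no item is over-weighted (its target inclusion probability never exceeds 1):

```python
import random

def chao_sketch(stream, k):
    """Chao-style update sketch: item i is accepted with probability k * w_i / W_i,
    where W_i is the total weight seen so far; an accepted item overwrites a
    uniformly random reservoir slot. Over-weighted items are not handled."""
    reservoir = []
    total_weight = 0.0
    for item, weight in stream:
        total_weight += weight
        if len(reservoir) < k:
            reservoir.append(item)   # stands in for the unspecified initial sampling plan
        elif random.random() < k * weight / total_weight:
            reservoir[random.randrange(k)] = item
    return reservoir
```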

In machine learning applications, fairness is a critical consideration, especially in scenarios where data streams exhibit class imbalance.

To address this, Nikoloutsopoulos, Titsias, and Koutsopoulos proposed the Kullback-Leibler Reservoir Sampling (KLRS) algorithm as a solution to the challenges of Continual Learning, where models must learn incrementally from a continuous data stream.

The KLRS algorithm was designed to create a flexible policy that matches class percentages in the buffer to a target distribution while employing Reservoir Sampling techniques.

KLRS generalizes earlier methods such as Reservoir Sampling and Class-Balancing Reservoir Sampling; experiments, reported with confidence intervals, demonstrate its broader applicability and improved performance.

The KLRS algorithm operates by maintaining a buffer of fixed size and updating its contents as new data points arrive in the stream.

Suppose we wish to draw k random cards from a deck of cards (i.e., without replacement). A natural approach would be to shuffle the deck and then take the top k cards.

Truncating R to length k, the algorithm is modified accordingly: items beyond the first k can only ever land in one of the first k positions. Since the order of the first k cards is immaterial, the initial shuffling of those positions can be removed and R can simply be initialized to the first k items of the input.
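
A Python sketch of the truncated single-pass shuffle just described; dropping the swap in the build-up phase (the order of the first k slots does not matter) leaves exactly Algorithm R:

```python
import random

def truncated_shuffle_sample(stream, k):
    """Single-pass 'inside-out' shuffle truncated to k slots: item i is assigned a
    uniform random position j in 1..i, but only positions 1..k are stored."""
    r = []
    for i, item in enumerate(stream, start=1):
        j = random.randint(1, i)                     # position of item i in the virtual permutation
        if i <= k:
            r.append(item)                           # place item i at position i ...
            r[i - 1], r[j - 1] = r[j - 1], r[i - 1]  # ... then swap it into position j
        elif j <= k:
            r[j - 1] = item                          # later items only land in the kept slots
    return r
```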

This yields Algorithm R. Reservoir sampling makes the assumption that the desired sample fits into main memory, often implying that k is a constant independent of n. In applications where we would like to select a large subset of the input list (say a third, i.e. k = n/3), other approaches need to be adopted.