These algorithms are designed to operate with limited memory, generally logarithmic in the size of the stream and/or in the maximum value in the stream, and may also have limited processing time per item.
Though streaming algorithms had already been studied by Munro and Paterson[1] as early as 1978, as well as by Philippe Flajolet and G. Nigel Martin in 1982/83,[2] the field of streaming algorithms was first formalized and popularized in a 1996 paper by Noga Alon, Yossi Matias, and Mario Szegedy.[3] For this paper, the authors later won the Gödel Prize in 2005 "for their foundational contribution to streaming algorithms."
There has since been a large body of work centered around data streaming algorithms that spans a diverse spectrum of computer science fields such as theory, databases, networking, and natural language processing.
Semi-streaming algorithms were introduced in 2005 as a relaxation of streaming algorithms for graphs,[4] in which the space allowed is linear in the number of vertices n, but only logarithmic in the number of edges m. This relaxation is still meaningful for dense graphs, and can solve interesting problems (such as connectivity) that are insoluble in o(n) space.
In the data stream model, some or all of the input is represented as a finite sequence of integers (from some finite domain) which is generally not available for random access, but instead arrives one at a time in a "stream".[6] Much of the streaming literature is concerned with computing statistics on frequency distributions that are too large to be stored.[citation needed] In the sliding-window variant of this model, the function of interest is computed over a fixed-size window in the stream; as the stream progresses, the oldest items leave the window as new items arrive.
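For instance, a statistic such as the sum of the items currently in the window can be maintained with constant work per arrival. The following is a minimal Python sketch (the function name, window size, and input are illustrative); note that it stores the window itself, so it uses O(w) space, whereas sublinear-space variants exist for approximate answers.

    from collections import deque

    def windowed_sums(stream, w):
        # Yield the sum of the (up to) w most recent items after each arrival.
        window = deque()
        total = 0
        for x in stream:
            window.append(x)
            total += x
            if len(window) > w:        # expire the oldest item
                total -= window.popleft()
            yield total

    # Sums over a window of size 3.
    print(list(windowed_sums([5, 1, 4, 2, 3], w=3)))  # [5, 6, 10, 7, 9]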
Data stream algorithms only have limited memory available, but they may be able to defer action until a group of points arrives, while online algorithms are required to take action as soon as each point arrives.[8] They also have applications in databases, such as estimating the size of a join.[citation needed]
The kth frequency moment of a stream is defined as Fk = Σ_i (m_i)^k, where m_i is the number of occurrences of element i in the stream. The second moment F2 is useful for computing statistical properties of the data, such as the Gini coefficient of variation.
The seminal paper of Alon, Matias, and Szegedy dealt with the problem of estimating the frequency moments.[citation needed] A direct approach to finding the frequency moments requires maintaining a register m_i for all distinct elements a_i ∈ {1, 2, ..., N}, which requires memory of order at least Ω(N).[3] But the model's space limitations require an algorithm that computes the moments in much less memory, which is achieved by approximating rather than computing exact values.[10]
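For concreteness, the direct (exact) approach can be sketched as follows; the function name and example stream are illustrative, and the Counter plays the role of the registers m_i, which is exactly the Ω(N)-memory cost the streaming algorithms below avoid.

    from collections import Counter

    def frequency_moment(stream, k):
        # Exact k-th frequency moment: Fk = sum over distinct elements of m_i**k.
        # Maintains one register per distinct element, i.e. Omega(N) memory.
        counts = Counter(stream)           # register m_i for each distinct a_i
        return sum(m ** k for m in counts.values())

    stream = [1, 2, 1, 3, 2, 1]
    print(frequency_moment(stream, 0))  # F0 = 3 distinct elements
    print(frequency_moment(stream, 1))  # F1 = 6, the stream length
    print(frequency_moment(stream, 2))  # F2 = 9 + 4 + 1 = 14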
Flajolet et al.[2] introduced a probabilistic method of counting which was inspired by a paper of Robert Morris. In their method, each element x of the stream is hashed to a value y = h(x), and ρ(y) represents the position of the least significant 1-bit in the binary representation of y, with a suitable convention for ρ(0); the algorithm records in a bitmap which positions ρ(y) have occurred.
Let A be a data stream of length M whose cardinality needs to be determined. The algorithm below then determines the approximate cardinality of A: if there are N distinct elements in the data stream, positions of the bitmap well below log(N) are almost certainly set to 1, positions well above log(N) are almost certainly 0, and positions near log(N) form a fringe of 0s and 1s.
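Since the original pseudocode is not reproduced here, the following minimal Python sketch illustrates the Flajolet–Martin procedure under stated assumptions: the hash function is an arbitrary stand-in, a single sketch is used (the paper averages many sketches to reduce variance), and φ ≈ 0.77351 is the bias-correction constant from the paper.

    import hashlib

    def h(x, bits=32):
        # Illustrative uniform hash of x into [0, 2**bits).
        d = hashlib.blake2b(str(x).encode(), digest_size=8).digest()
        return int.from_bytes(d, "big") & ((1 << bits) - 1)

    def rho(y, bits=32):
        # Position of the least significant 1-bit of y; rho(0) := bits by convention.
        return (y & -y).bit_length() - 1 if y else bits

    def fm_estimate(stream):
        # Single Flajolet-Martin sketch estimating the number of distinct elements.
        bitmap = 0
        for x in stream:
            bitmap |= 1 << rho(h(x))   # record that position rho(h(x)) occurred
        r = 0
        while bitmap & (1 << r):       # r = position of the lowest 0-bit
            r += 1
        return 2 ** r / 0.77351        # divide by phi to correct the bias

    print(fm_estimate(list(range(1000)) * 5))  # on the order of 1000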
The previous algorithm describes the first attempt to approximate F0 in the data stream by Flajolet and Martin.
Bar-Yossef et al.[10] introduced the k-minimum value algorithm for determining the number of distinct elements in a data stream. The algorithm keeps only the t smallest hash values seen so far: in a close-to-uniform hash space, they expect at least t elements to have hash values less than O(t/F0), so the t-th smallest hash value can be scaled to estimate F0.
The access time can be reduced if we store the t hash values in a binary tree.
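A minimal Python sketch of the idea follows; t, the hash function, and the final scaling t/v_t are illustrative of the technique rather than the paper's exact estimator, and a heap rather than a binary tree holds the t smallest values.

    import hashlib
    import heapq

    def h01(x):
        # Illustrative hash of x into [0, 1).
        d = hashlib.blake2b(str(x).encode(), digest_size=8).digest()
        return int.from_bytes(d, "big") / 2 ** 64

    def kmv_estimate(stream, t=64):
        # Keep the t smallest distinct hash values; estimate F0 from the largest of them.
        heap = []                      # max-heap via negation, holding the t smallest hashes
        kept = set()                   # the same values, for duplicate detection
        for x in stream:
            v = h01(x)
            if v in kept:
                continue               # repeated element (same hash), already counted
            if len(heap) < t:
                heapq.heappush(heap, -v)
                kept.add(v)
            elif v < -heap[0]:         # v beats the current t-th smallest value
                kept.discard(-heapq.heappushpop(heap, -v))
                kept.add(v)
        if len(heap) < t:
            return float(len(heap))    # fewer than t distinct elements: exact (up to collisions)
        return t / -heap[0]            # t-th smallest hash v_t gives F0 ~ t / v_t

    print(round(kmv_estimate(list(range(10000)) * 3, t=256)))  # near 10000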
Alon et al. estimate Fk by defining random variables that can be computed within the given space and time bounds. Alon et al.[3] simplified this algorithm using a four-wise independent random variable with values mapped to {−1, 1}.
This approach can be refined by using exponentially weighted moving averages and variance for normalization.[14]
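As an illustration of the AMS technique for F2 described above, the following Python sketch uses random ±1 signs; the sign hash is a stand-in for a four-wise independent family, and the number of copies and the plain averaging are illustrative (the paper averages copies and takes a median of the averages).

    import hashlib
    import statistics

    def sign(x, seed):
        # Illustrative +/-1 hash; stand-in for a four-wise independent family.
        d = hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=1).digest()
        return 1 if d[0] & 1 else -1

    def ams_f2(stream, copies=99):
        # Each copy keeps one counter z = sum over the stream of sign(a_i),
        # so z*z has expectation F2 = sum_i m_i**2; memory is O(copies).
        zs = [0] * copies
        for x in stream:
            for j in range(copies):
                zs[j] += sign(x, j)
        return statistics.fmean(z * z for z in zs)  # averaging reduces the variance

    stream = [1, 2, 1, 3, 2, 1]        # frequencies m = (3, 2, 1), so F2 = 14
    print(ams_f2(stream))              # close to 14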
Counting the number of distinct elements in a stream (sometimes called the F0 moment) is another problem that has been well studied.
In 2010, Daniel Kane, Jelani Nelson and David Woodruff found an asymptotically optimal algorithm for this problem.
Another well-studied streaming problem is online learning: learning a model (e.g. a classifier) by a single pass over a training set.
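For example, a linear classifier can be learned in one pass with a perceptron-style update, keeping only the weight vector between items; the learning rate, update rule, and toy data below are illustrative.

    def train_single_pass(stream, dim, lr=0.1):
        # One-pass perceptron: stream yields (features, label) with label in {-1, +1}.
        w = [0.0] * dim                       # the only state kept between items
        for x, y in stream:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if y * score <= 0:                # mistake: nudge w toward the example
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        return w

    # Toy stream: the label is the sign of the first feature.
    data = [((1.0, 0.2), 1), ((-0.8, 0.1), -1), ((0.6, -0.4), 1), ((-1.2, 0.3), -1)]
    print(train_single_pass(iter(data), dim=2))  # weight on the first feature dominates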
Lower bounds have been computed for many of the data streaming problems that have been studied.
By far the most common technique for computing these lower bounds has been the use of communication complexity.