Good–Turing frequency estimation

Good–Turing frequency estimation is a statistical technique for estimating the probability of encountering an object of a hitherto unseen species, given a set of past observations of objects from different species.

Good–Turing frequency estimation was developed by Alan Turing and his assistant I. J. Good as part of their methods used at Bletchley Park for cracking German ciphers for the Enigma machine during World War II.

Turing at first modelled the frequencies as a multinomial distribution, but found it inaccurate.

Good developed smoothing algorithms to improve the estimator's accuracy.

The discovery was recognised as significant when published by Good in 1953,[1] but the calculations were difficult, so it was not used as widely as it might have been.[2] The method even gained some literary fame due to the Robert Harris novel Enigma.

In the 1990s, Geoffrey Sampson worked with William A. Gale of AT&T to create and implement a simplified and easier-to-use variant of the Good–Turing method[3][4] described below.

Various heuristic justifications[5] and a simple combinatorial derivation have been provided.[6] The Good–Turing estimator is largely independent of the distribution of species frequencies.

Suppose that $X$ distinct species have been observed, and let $R_x$ be the number of individuals observed for species $x$. The frequency of frequencies, $N_r$, is the number of species observed exactly $r$ times:

$$N_r = |\{\, x : R_x = r \,\}|.$$

For example, $N_1$ is the number of species for which only one individual was observed. Note that the total number of objects observed, $N$, can be found from

$$N = \sum_{r=1}^{\infty} r N_r.$$

The first step in the calculation is to estimate the probability that a future observed individual (or the next observed individual) is a member of a thus far unseen species. This estimate is:

$$p_0 = \frac{N_1}{N}.$$

The next step is to estimate the probability that the next observed individual is from a species which has been seen $r$ times.

For a single species this estimate is:

$$p_r = \frac{(r+1)\, S(N_{r+1})}{N\, S(N_r)}.$$

Here, the notation $S(\cdot)$ means the smoothed, or adjusted, value of the frequency shown in parentheses. An overview of how to perform this smoothing follows in the next section (see also empirical Bayes method).
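To make the counting concrete, here is a minimal Python sketch (not from the source; the function names are illustrative). It computes the frequency of frequencies $N_r$ and the estimates above, with the raw counts $N_r$ substituted for the smoothed values $S(N_r)$:

```python
from collections import Counter

def frequency_of_frequencies(observations):
    """Count each species, then count how many species share each frequency r."""
    species_counts = Counter(observations)      # R_x: count per species x
    return Counter(species_counts.values())     # N_r: number of species seen r times

def unsmoothed_good_turing(observations):
    """Good-Turing estimates using the raw N_r in place of the smoothed S(N_r)."""
    n_r = frequency_of_frequencies(observations)
    n_total = sum(r * count for r, count in n_r.items())  # N = sum over r of r * N_r
    p0 = n_r[1] / n_total                                 # total mass of unseen species
    # Per-species estimate p_r = (r + 1) N_{r+1} / (N * N_r); a zero N_{r+1}
    # yields 0 here, which is one reason the smoothing step is needed.
    p_r = {r: (r + 1) * n_r[r + 1] / (n_total * n_r[r]) for r in n_r}
    return p0, p_r

# Example: N = 10 observations with N_1 = 3, N_2 = 2, N_3 = 1
obs = ["a", "a", "a", "b", "b", "c", "c", "d", "e", "f"]
p0, p_r = unsmoothed_good_turing(obs)
print(p0)      # N_1 / N = 3/10 = 0.3
print(p_r[1])  # 2 * N_2 / (N * N_1) = 4/30, about 0.133
```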

To estimate the probability that the next observed individual is from any species from this group (i.e., the group of species seen $r$ times) one can use the following formula:

$$\frac{(r+1)\, S(N_{r+1})}{N}.$$

For smoothing the erratic values in $N_r$, a revised quantity $Z_r$ is defined as

$$Z_r = \frac{N_r}{\tfrac{1}{2}(t - q)},$$

where $q$, $r$, and $t$ are three consecutive subscripts with non-zero counts $N_q$, $N_r$, $N_t$. When $r$ is 1, take $q$ to be 0; when $r$ is the index of the last non-zero count, replace the divisor $\tfrac{1}{2}(t - q)$ with $r - q$, so that $Z_r = N_r/(r - q)$.

A simple linear regression is then fitted to the log–log plot.

For small values of $r$ it is reasonable to set $S(N_r) = N_r$ (that is, no smoothing is performed), while for large values of $r$, values of $S(N_r)$ are read off the regression line. An automated procedure can be used to specify at what point the switch from no smoothing to linear smoothing should take place.[9][full citation needed] Code for the method is available in the public domain.
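The following is an independent minimal Python sketch of the smoothing step described above (the $Z_r$ transform plus a least-squares fit on the log–log plot), not the public-domain reference implementation; in particular, the rule for switching between raw and smoothed counts is omitted:

```python
import math

def fit_log_log_line(n_r):
    """Return a smoothed stand-in for N_r from a least-squares fit of
    log Z_r against log r. n_r maps each frequency r to its non-zero count N_r.
    """
    rs = sorted(n_r)
    xs, ys = [], []
    for i, r in enumerate(rs):
        q = rs[i - 1] if i > 0 else 0         # preceding non-zero index (0 when r is first)
        if i + 1 < len(rs):
            t = rs[i + 1]
            z = n_r[r] / (0.5 * (t - q))      # Z_r = N_r / (0.5 * (t - q))
        else:
            z = n_r[r] / (r - q)              # last non-zero count: divisor is r - q
        xs.append(math.log(r))
        ys.append(math.log(z))
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda r: math.exp(intercept + slope * math.log(r))  # S(N_r) off the line

# Usage with the counts from the earlier example: N_1 = 3, N_2 = 2, N_3 = 1
S = fit_log_log_line({1: 3, 2: 2, 3: 1})
print(S(1), S(2), S(4))  # smoothed values, including the otherwise-zero N_4
```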

Many different derivations of the above formula for $p_r$ have been given.[1][6][11][12] One of the simplest ways to motivate the formula is by assuming the next item will behave similarly to the previous item. The overall idea is that we are currently seeing never-seen species at some rate, species seen once at some rate, and so on. Our goal is to estimate just how likely each of these categories is for the next item we will see.

Since we don't assume anything about the underlying probability distribution, it does sound a bit mysterious at first. But it is easy to calculate these probabilities empirically for the previous item we saw, even if we don't remember exactly which item it was: after scrambling the order, the previous item was equally likely to be any of the $N$ observations. The chance that we were seeing it for the $(r+1)$th time is simply the chance that it was one of the items that we have now seen $r+1$ times, namely

$$\frac{(r+1)\, N_{r+1}}{N}.$$

In other words, our chance of seeing an item that had been seen $r$ times before was $(r+1)N_{r+1}/N$. So now we simply assume that this chance will be about the same for the next item we see. Taking $r = 0$ gives the formula for $p_0$ above, and dividing by the number of species in the group, $N_r$, gives the per-species estimate $(r+1)N_{r+1}/(N N_r)$.
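As a hypothetical numerical illustration (the numbers are not from the source): suppose $N = 10$ items have been observed, with $N_1 = 3$ species seen once and $N_2 = 2$ species seen twice. The previous item belonged to a species now seen twice with probability $2N_2/N = 4/10$, in which case that species had been seen once before that observation. Assuming the next item behaves the same way, the probability that it belongs to some species seen exactly once so far is $0.4$, and dividing by $N_1 = 3$ gives a per-species estimate of $2N_2/(N N_1) = 2/15 \approx 0.13$.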

Of course, your actual data will probably be a bit noisy, so you will want to smooth the values first to get a better estimate of how quickly the category counts are growing, and this gives the formula as shown above.

This approach is in the same spirit as deriving the standard Bernoulli estimator by simply asking what the two probabilities were for the previous coin flip (after scrambling the trials seen so far), given only the current result counts, while assuming nothing about the underlying distribution.
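As an illustration of that analogy (again hypothetical numbers): after observing $h$ heads in $n$ flips, a randomly chosen previous flip was heads with probability $h/n$, and assuming the next flip behaves like that scrambled previous one recovers the usual estimate $h/n$ without any assumption about the coin's bias.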