More specifically, the unseen species problem concerns how many new species would be discovered if more samples were taken in an ecosystem.
The study of the unseen species problem began in the early 1940s with Alexander Steven Corbet.
He spent two years in British Malaya trapping butterflies and was curious how many new species he would discover if he spent another two years trapping.
Many different estimation methods have been developed to determine how many new species would be discovered given more samples.
The unseen species problem also applies more broadly, as the estimators can be used to estimate the number of elements of any set that have not yet appeared in samples.
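An example of this is determining how many words William Shakespeare knew based on all of his written works.[1]
The unseen species problem can be stated mathematically as follows: if $n$ independent samples are taken, and then $m = t \cdot n$ further independent samples are taken, the number of unseen species $U$ is the number of species that appear in the additional $m$ samples but in none of the original $n$ samples.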
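As a minimal sketch of the quantity being estimated, the snippet below counts the species that occur in a second batch of samples but not in the first. The function name and the toy species labels are hypothetical and serve only as illustration; in practice the second batch is not available, which is why estimators are needed.

```python
def count_newly_seen(first_sample, second_sample):
    """Number of distinct species in second_sample that never occur in first_sample."""
    return len(set(second_sample) - set(first_sample))


# Hypothetical toy data: one species label per captured individual.
first = ["a", "b", "a", "c", "b", "a"]   # n = 6 observations
second = ["a", "d", "e", "d", "c", "f"]  # m = 6 observations, so t = m / n = 1

u = count_newly_seen(first, second)  # species d, e and f are new, so u == 3
print(u)
```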
In the early 1940s Alexander Steven Corbet spent two years in British Malaya trapping butterflies.
When Corbet returned to the United Kingdom, he approached biostatistician Ronald Fisher and asked how many new species of butterflies he could expect to catch if he went trapping for another two years;[3] in essence, Corbet was asking how many species he observed zero times.
Fisher responded with a simple estimation: for an additional two years of trapping, Corbet could expect to capture 75 new species.
He did this using a simple alternating summation (data provided by Orlitsky[3] in the table from the Example below):
$$U = \sum_{i \ge 1} (-1)^{i+1} \varphi_i = \varphi_1 - \varphi_2 + \varphi_3 - \cdots = 75$$
Here $\varphi_i$ corresponds to the number of species that were observed exactly $i$ times (for example, if there were 74 species of butterflies with 2 observed members throughout the samples, then $\varphi_2 = 74$).
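The following sketch illustrates what the prevalence counts mean and how an alternating sum over them can be evaluated. The helper names are hypothetical. The fifteen prevalence values are those commonly quoted for Corbet's data (species seen 1 through 15 times); they are not reproduced in the text here, so treat them as an assumption.

```python
from collections import Counter


def prevalence_counts(observations):
    """Map i -> number of species observed exactly i times."""
    times_seen = Counter(observations)   # species -> number of captures
    return Counter(times_seen.values())  # i -> phi_i


def alternating_sum(prevalences):
    """phi_1 - phi_2 + phi_3 - ... for a list with prevalences[i - 1] = phi_i."""
    return sum((-1) ** (i + 1) * phi for i, phi in enumerate(prevalences, start=1))


# Hypothetical toy catch: a seen twice, b once, c three times, d once.
toy_phi = prevalence_counts(["a", "a", "b", "c", "c", "c", "d"])

# Assumed values of phi_1 .. phi_15 for Corbet's butterflies (phi_2 = 74 as in the text).
corbet_prevalences = [118, 74, 44, 24, 29, 22, 20, 19, 20, 15, 12, 14, 6, 12, 6]

estimate = alternating_sum(corbet_prevalences)
print(estimate)  # 75, the figure attributed to Fisher
```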
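The Good–Toulmin (GT) estimator was developed by Good and Toulmin in 1953.
With $t$ denoting the ratio of the number of additional samples to the number of original samples, and $\varphi_i$ the number of species observed exactly $i$ times, it estimates the number of unseen species as
$$U^{GT} = -\sum_{i=1}^{\infty} (-t)^i \varphi_i$$
For $t = 1$ this reduces to the summation Fisher used. For $t > 1$, however, the Good–Toulmin estimator fails to capture accurate results, because the coefficients $(-t)^i$ grow geometrically in magnitude with $i$.
To compensate for this, Efron and Thisted showed that a truncated Euler transform can also be a usable estimate:
$$U^{ET} = \sum_{i=1}^{n} h_i^{ET} \varphi_i, \qquad h_i^{ET} = -(-t)^i \, \mathbb{P}(X \ge i), \qquad X \sim \operatorname{Bin}\!\left(k, \tfrac{1}{1+t}\right)$$
where $k$ is the location chosen to truncate the Euler transform.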
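To make the failure for large extension factors concrete, here is a minimal sketch. It assumes the Good–Toulmin estimator has the form $-\sum_i (-t)^i \varphi_i$ and that the Efron–Thisted weights are $-(-t)^i \, \mathbb{P}(\mathrm{Bin}(k, 1/(1+t)) \ge i)$. The prevalence list and the choice $k = 1$ are hypothetical and picked only to keep the arithmetic easy to follow.

```python
from math import comb


def good_toulmin(prevalences, t):
    """U^GT = -sum_i (-t)^i * phi_i, with prevalences[i - 1] = phi_i."""
    return -sum((-t) ** i * phi for i, phi in enumerate(prevalences, start=1))


def binomial_tail(k, p, i):
    """P(X >= i) for X ~ Bin(k, p); zero when i > k."""
    return sum(comb(k, j) * p**j * (1 - p) ** (k - j) for j in range(i, k + 1))


def efron_thisted(prevalences, t, k):
    """Good-Toulmin terms damped by the binomial tail weights P(Bin(k, 1/(1+t)) >= i)."""
    p = 1.0 / (1.0 + t)
    return sum(
        -((-t) ** i) * binomial_tail(k, p, i) * phi
        for i, phi in enumerate(prevalences, start=1)
    )


toy_prevalences = [10, 6, 4, 3, 2, 1]  # hypothetical phi_1 .. phi_6

print(good_toulmin(toy_prevalences, 1))      # 6
print(good_toulmin(toy_prevalences, 3))      # -402: a negative count, clearly unusable
print(efron_thisted(toy_prevalences, 3, 1))  # 7.5: only phi_1 contributes when k = 1
```

The damping weights shrink the high-order terms, which are the ones that make the plain series swing wildly once $t$ exceeds 1.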
Similar to the approach by Efron and Thisted, Alon Orlitsky, Ananda Theertha Suresh, and Yihong Wu developed the smooth Good–Toulmin estimator.
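They realized that the Good–Toulmin estimator failed because of the exponential growth of its coefficients, and not because of its bias.[3]
Therefore, they estimated the number of unseen species by truncating the series after $\ell$ terms:
$$U^{\ell} = -\sum_{i=1}^{\ell} (-t)^i \varphi_i$$
Orlitsky, Suresh, and Wu also noted that for $t > 1$ the sum is dominated by its last term, $(-t)^{\ell} \varphi_{\ell}$, regardless of the value of $\ell$ chosen.[2]
To solve this, they selected a random nonnegative integer $L$, truncated the series at $L$, and averaged over the distribution of $L$:
$$U^{L} = \mathbb{E}_L\!\left[-\sum_{i=1}^{L} (-t)^i \varphi_i\right]$$
This means that the estimator can be written as a linear combination of the prevalences:[2]
$$U^{L} = \sum_{i \ge 1} h_i \varphi_i, \qquad h_i = -(-t)^i \, \mathbb{P}(L \ge i)$$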
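The equivalence between averaging truncated series and taking a weighted linear combination can be checked numerically. The sketch below assumes the truncated series is $-\sum_{i \le \ell} (-t)^i \varphi_i$ and uses an arbitrary, hypothetical distribution for the random truncation point (a small table of probabilities), since no particular distribution is fixed in the text here. Function names and toy prevalences are likewise illustrative.

```python
def truncated_good_toulmin(prevalences, t, ell):
    """-sum_{i=1}^{ell} (-t)^i * phi_i, with prevalences[i - 1] = phi_i."""
    last = min(ell, len(prevalences))
    return -sum((-t) ** i * prevalences[i - 1] for i in range(1, last + 1))


def smoothed_by_averaging(prevalences, t, pmf):
    """Expectation of the truncated estimator over the truncation point L ~ pmf."""
    return sum(p * truncated_good_toulmin(prevalences, t, ell) for ell, p in pmf.items())


def smoothed_as_linear_combination(prevalences, t, pmf):
    """sum_i h_i * phi_i with weights h_i = -(-t)^i * P(L >= i)."""
    def tail(i):
        return sum(p for ell, p in pmf.items() if ell >= i)

    return sum(
        -((-t) ** i) * tail(i) * phi
        for i, phi in enumerate(prevalences, start=1)
    )


toy_prevalences = [10, 6, 4, 3, 2, 1]              # hypothetical phi_1 .. phi_6
truncation_pmf = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}  # hypothetical distribution of L

by_averaging = smoothed_by_averaging(toy_prevalences, 2, truncation_pmf)
by_weights = smoothed_as_linear_combination(toy_prevalences, 2, truncation_pmf)
print(by_averaging, by_weights)  # both are 14 up to floating-point rounding
```

The second form is the more convenient one in practice, because the weights depend only on $t$ and the tail probabilities of the truncation point, not on the data.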
A species discovery curve gives the number of species found in an area as a function of time.[5]
Two common models for a species discovery curve are the logarithmic and the exponential function.
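Using the Good–Toulmin model on the data Corbet provided to Fisher,[3] the number of unseen species is found using
$$U = -\sum_{i \ge 1} (-t)^i \varphi_i$$
where $\varphi_i$ is the number of species observed exactly $i$ times and $t$ is the ratio of additional to original sampling effort.
For $t = 1$, the value that Corbet brought to Fisher, the resulting estimate of $U$ is 75, matching what Fisher found.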
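The relationship between the extension factor and the estimate can be tabulated with a few lines of code. This is a sketch under two assumptions: the Good–Toulmin form $-\sum_i (-t)^i \varphi_i$, and the prevalence values commonly quoted for Corbet's data, which are not listed in the text here. Only values of $t$ up to 1 are used, since the series is unreliable beyond that.

```python
# Assumed values of phi_1 .. phi_15 for Corbet's butterflies.
corbet_prevalences = [118, 74, 44, 24, 29, 22, 20, 19, 20, 15, 12, 14, 6, 12, 6]


def good_toulmin(prevalences, t):
    """U^GT = -sum_i (-t)^i * phi_i, with prevalences[i - 1] = phi_i."""
    return -sum((-t) ** i * phi for i, phi in enumerate(prevalences, start=1))


# Estimated number of new species for several amounts of extra trapping.
curve = [(t, good_toulmin(corbet_prevalences, t)) for t in (0.25, 0.5, 0.75, 1.0)]

for t, u in curve:
    print(f"t = {t:.2f}: estimated new species = {u:.1f}")
# The last row (t = 1.00) evaluates to 75.0.
```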
Based on research of Shakespeare's known works done by Thisted and Efron, there are 884,647 total words.
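The total number of unique words was found to be 31,534.[1]
Applying the Good–Toulmin model with $t = 1$ gives an estimate of how many previously unseen words would appear if an equal volume of new works by Shakespeare were discovered.
Extrapolating to an unlimited amount of new text, Thisted and Efron estimated that at least 35,000 further words would appear, meaning that Shakespeare most likely knew over twice as many words as he actually used in all of his writings.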