In statistics, the hypergeometric distribution is the discrete probability distribution generated by picking colored balls at random from an urn without replacement.
Assume that an opinion poll is conducted by calling random telephone numbers.
The probability distribution of employed versus unemployed respondents in a sample of n respondents can be described as a noncentral hypergeometric distribution.
The description of biased urn models is complicated by the fact that there is more than one noncentral hypergeometric distribution.
Agner Fog (2007, 2008) suggested that the best way to avoid confusion is to use the name Wallenius' noncentral hypergeometric distribution for the distribution of a biased urn model in which a predetermined number of items are drawn one by one in a competitive manner, and to use the name Fisher's noncentral hypergeometric distribution for one in which items are drawn independently of each other, so that the total number of items drawn is known only after the experiment.
The names refer to Kenneth Ted Wallenius and R. A. Fisher, who were the first to describe the respective distributions.
We assume that the probability of taking a particular ball is proportional to its weight.
The physical property that determines the odds may be something other than weight, such as size, slipperiness, or some other factor, but it is convenient to use the word weight for the odds parameter.
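The important fact that distinguishes Wallenius' distribution is that there is competition between the balls.
The probability that a particular ball is taken in a particular draw depends not only on its own weight but also on the total weight of the competing balls remaining in the urn, and the weight of the competing balls depends on the outcomes of all preceding draws.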
A multivariate version of Wallenius' distribution is used if there are more than two different colors.
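The sequential, competitive drawing described above can be followed step by step in a short calculation. The following Python sketch is illustrative only: the function name wallenius_pmf and the parameters m1 (number of red balls), m2 (number of white balls), n (number of draws) and omega (weight of a red ball relative to a white ball) are assumptions introduced here, and the simple recursion is not one of the specialized calculation methods discussed in the literature.

```python
import math


def wallenius_pmf(m1, m2, n, omega):
    """Distribution of the number of red balls among n one-by-one draws.

    Assumed setup: m1 red balls of weight omega, m2 white balls of weight 1,
    and each draw takes a ball with probability proportional to its weight.
    """
    if omega <= 0:
        raise ValueError("omega must be positive")
    if not 0 <= n <= m1 + m2:
        raise ValueError("n must be between 0 and m1 + m2")
    # dist[x] = probability that x red balls have been taken so far
    dist = {0: 1.0}
    for drawn in range(n):
        nxt = {}
        for x, p in dist.items():
            red_left = m1 - x
            white_left = m2 - (drawn - x)
            # The chance of red depends on what earlier draws left in the urn.
            p_red = omega * red_left / (omega * red_left + white_left)
            if red_left > 0:
                nxt[x + 1] = nxt.get(x + 1, 0.0) + p * p_red
            if white_left > 0:
                nxt[x] = nxt.get(x, 0.0) + p * (1.0 - p_red)
        dist = nxt
    return dist


# With omega = 1 there is no bias, so the result should agree with the
# ordinary (central) hypergeometric distribution.
unbiased = wallenius_pmf(3, 4, 3, 1.0)
central = {x: math.comb(3, x) * math.comb(4, 3 - x) / math.comb(7, 3)
           for x in range(4)}
print(unbiased)
print(central)
```

The recursion keeps one probability per possible count of red balls, so its cost grows roughly with n times the number of reachable counts. That is adequate for small urns but is not meant as an efficient method.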
In the Fisher model, the fates of the balls are independent and there is no dependence between draws.
A multivariate version of Fisher's distribution is used if there are more than two colors of balls.
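The independence in the Fisher model can be illustrated numerically. The sketch below assumes that each red ball is taken independently with probability p1 and each white ball with probability p2, and then conditions on the total number taken being n. It compares the result with the commonly stated form of Fisher's noncentral hypergeometric probability, proportional to C(m1, x) C(m2, n − x) ω^x, where ω is the odds ratio. The function names and the values of p1 and p2 are hypothetical choices made for this example.

```python
import math


def binom_pmf(k, size, p):
    """Binomial probability of k successes in `size` independent trials."""
    return math.comb(size, k) * p**k * (1 - p) ** (size - k)


def fisher_pmf_by_conditioning(m1, m2, n, p1, p2):
    """Independent takes (red with prob. p1, white with p2), given total n."""
    lo, hi = max(0, n - m2), min(n, m1)
    joint = {x: binom_pmf(x, m1, p1) * binom_pmf(n - x, m2, p2)
             for x in range(lo, hi + 1)}
    total = sum(joint.values())
    return {x: v / total for x, v in joint.items()}


def fisher_pmf(m1, m2, n, omega):
    """Closed form: proportional to C(m1, x) * C(m2, n - x) * omega**x."""
    lo, hi = max(0, n - m2), min(n, m1)
    terms = {x: math.comb(m1, x) * math.comb(m2, n - x) * omega**x
             for x in range(lo, hi + 1)}
    total = sum(terms.values())
    return {x: v / total for x, v in terms.items()}


p1, p2 = 0.6, 0.2                              # assumed take probabilities
omega = (p1 / (1 - p1)) / (p2 / (1 - p2))      # odds ratio of red vs. white
by_conditioning = fisher_pmf_by_conditioning(5, 7, 4, p1, p2)
closed_form = fisher_pmf(5, 7, 4, omega)
print(by_conditioning)
print(closed_form)
```

Only the odds ratio survives the conditioning step, which is why the two results agree even though p1 and p2 are specified separately.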
Wallenius' and Fisher's distributions are approximately equal when the odds ratio ω is near 1 and n is low compared to the total number of balls, N. The difference between the two distributions becomes larger when the odds ratio is far from 1 and n is near N. The two distributions approximate each other better when they have the same mean than when they have the same odds (ω = 1) (see figures above).
To understand why the two distributions are different, we may consider the following extreme example: An urn contains one red ball with the weight 1000, and a thousand white balls each with the weight 1.
In Wallenius' model, the probability that the red ball is not taken in the first draw is 1000/2000 = 1/2, and the probability that it is not taken in the second draw, given that it was not taken in the first, is 999/1999 ≈ 1/2.
Continuing in this way, we can calculate that the probability of not taking the red ball in n draws is approximately 2^(−n) as long as n is small compared to N. In other words, the probability of not taking a very heavy ball in n draws falls almost exponentially with n in Wallenius' model.
The exponential function arises because the probabilities for each draw are all multiplied together.
This is not the case in Fisher's model, where balls are taken independently, and possibly simultaneously.
The probability of not taking the heavy red ball in Fisher's model is approximately 1/(n + 1).
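The two approximations in this example can be checked numerically. The sketch below assumes the urn described above (one red ball of weight 1000 and 1000 white balls of weight 1). The function names are introduced here for illustration. The Fisher value uses the commonly stated form of Fisher's distribution, in which the probability of x red balls is proportional to C(1, x) C(1000, n − x) ω^x.

```python
def wallenius_miss_red(n, white=1000, red_weight=1000.0):
    """Probability that the red ball survives n one-by-one draws."""
    prob = 1.0
    for i in range(n):
        white_left = white - i          # only white balls have been taken
        prob *= white_left / (white_left + red_weight)
    return prob


def fisher_miss_red(n, white=1000, red_weight=1000.0):
    """Probability of zero red balls in Fisher's model with n balls taken.

    P(X = 0) = C(white, n) / (C(white, n) + red_weight * C(white, n - 1)),
    simplified with C(white, n - 1) / C(white, n) = n / (white - n + 1).
    """
    return 1.0 / (1.0 + red_weight * n / (white - n + 1))


for n in (1, 2, 5, 10):
    print(n,
          wallenius_miss_red(n), 2.0 ** -n,
          fisher_miss_red(n), 1.0 / (n + 1))
```

For small n the first pair of columns should stay close to each other, as should the second pair, while the two models diverge rapidly from one another as n grows.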
The probability of catching a particular fish at a particular moment is proportional to its weight.
The total number of fish that will be caught in this scenario is not known in advance.
You may put the excess fish back into the lake, but this still does not give Wallenius' distribution.
Fish swim into the net randomly in a situation that resembles a Poisson process.
You will stop when the total weight of the fish caught reaches this predetermined limit.
Johnson, N. L.; Kemp, A. W.; Kotz, S. (2005), Univariate Discrete Distributions, Hoboken, New Jersey: Wiley and Sons.
(1983), Generalized Linear Models, London: Chapman and Hall.
Fog, Agner (2008), "Calculation Methods for Wallenius' Noncentral Hypergeometric Distribution", Communications in Statistics - Simulation and Computation, vol.