Samples from the beginning of the chain (the burn-in period) may not accurately represent the desired distribution and are usually discarded.
The algorithm was described by brothers Stuart and Donald Geman in 1984, some eight decades after the death of Gibbs,[1] and became popular in the statistics community for calculating marginal probability distributions, especially the posterior distribution.[2]
In its basic version, Gibbs sampling is a special case of the Metropolis–Hastings algorithm.
Suppose we want to obtain $k$ samples of $\mathbf{X} = (x_1, \ldots, x_n)$ from a joint distribution $p(x_1, \ldots, x_n)$, and denote the $i$th sample by $\mathbf{X}^{(i)} = (x_1^{(i)}, \ldots, x_n^{(i)})$. We proceed iteratively: starting from some initial value $\mathbf{X}^{(0)}$, each component of the next sample is drawn from its conditional distribution given all the other components,
$$x_j^{(i+1)} \sim p\left(x_j \mid x_1^{(i+1)}, \ldots, x_{j-1}^{(i+1)}, x_{j+1}^{(i)}, \ldots, x_n^{(i)}\right),$$
using the most recently updated values of the components. If such sampling is performed, these important facts hold: the samples approximate the joint distribution of all variables; the marginal distribution of any subset of variables can be approximated by simply considering the samples for that subset, ignoring the rest; and the expected value of any variable can be approximated by averaging over all the samples. Furthermore, the conditional distribution of one variable given all others is proportional to the joint distribution:
$$p(x_j \mid x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n) = \frac{p(x_1, \ldots, x_n)}{p(x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n)} \propto p(x_1, \ldots, x_n),$$
since the denominator does not depend on $x_j$ and is therefore the same for all possible values of $x_j$; it forms part of the normalization constant.
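The scheme above can be written as a short generic sampler. The following Python sketch assumes the user supplies the full conditionals through a hypothetical `conditional_sample` callback; that callback, not the sweep itself, is the problem-specific part:

```python
import numpy as np

def gibbs(conditional_sample, x0, n_samples, rng=None):
    """Generic systematic-scan Gibbs sampler (a sketch).

    conditional_sample(j, x, rng) must return a draw of component j from
    p(x_j | x_1, ..., x_{j-1}, x_{j+1}, ..., x_n) at the current values in x.
    """
    rng = rng or np.random.default_rng()
    x = np.array(x0, dtype=float)
    samples = np.empty((n_samples, x.size))
    for i in range(n_samples):
        for j in range(x.size):                    # update each component in turn,
            x[j] = conditional_sample(j, x, rng)   # conditioning on current values
        samples[i] = x                             # one full sweep = one sample
    return samples
```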
Gibbs sampling is commonly used for statistical inference (e.g. determining the best value of a parameter, such as the number of people likely to shop at a particular store on a given day, or the candidate a voter will most likely vote for).
The most likely value of a desired parameter (the mode) could then simply be selected by choosing the sample value that occurs most commonly; this is essentially equivalent to maximum a posteriori estimation of a parameter.
More commonly, however, the expected value (mean or average) of the sampled values is chosen; this is a Bayes estimator that takes advantage of the additional data about the entire distribution that is available from Bayesian sampling, whereas a maximization algorithm such as expectation maximization (EM) is capable of only returning a single point from the distribution.
(If a distribution is multimodal, the expected value may not return a meaningful point, and any of the modes is typically a better choice.)
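As an illustration (a sketch assumed here, not part of the original text), both estimators can be computed directly from an array of one-dimensional posterior draws; the histogram mode is a crude stand-in for the MAP estimate:

```python
import numpy as np

def posterior_mean_and_mode(draws, bins=50):
    """Posterior mean and a crude histogram-based mode from 1-D draws."""
    mean = draws.mean()                      # Bayes estimator under squared loss
    counts, edges = np.histogram(draws, bins=bins)
    k = counts.argmax()                      # most populated bin approximates the mode
    return mean, 0.5 * (edges[k] + edges[k + 1])
```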
For example, when a quantity is modeled as Gaussian with unknown mean and variance, there is no need to construct separate estimators for these parameters. Instead, in such a case there will be variables representing the unknown true mean and true variance, and the determination of sample values for these variables results automatically from the operation of the Gibbs sampler.
For example, probit regression for determining the probability of a given binary (yes/no) choice, with normally distributed priors placed over the regression coefficients, can be implemented with Gibbs sampling because it is possible to add additional variables and take advantage of conjugacy.
Logistic regression, by contrast, lacks this conjugacy; one possibility is to approximate the logistic function with a mixture of normal distributions (typically 7–9 components).
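For the probit case, a commonly used concrete scheme is the Albert–Chib data-augmentation sampler. The following is a sketch under an assumed $N(0, \tau^2 I)$ prior on the coefficients (the function name and prior scale are illustrative, not from the original text):

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(X, y, n_iter=2000, tau2=10.0, seed=0):
    """Albert–Chib data augmentation for Bayesian probit regression (sketch).

    Model: z_i ~ N(x_i' beta, 1) with y_i = 1{z_i > 0}; prior beta ~ N(0, tau2 * I).
    Both full conditionals are exact: z | beta, y is truncated normal, and
    beta | z is Gaussian by conjugacy, so plain Gibbs sampling applies.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    V = np.linalg.inv(X.T @ X + np.eye(p) / tau2)   # posterior covariance of beta | z
    L = np.linalg.cholesky(V)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        mu = X @ beta
        # Latent utilities: truncated to (0, inf) when y=1, (-inf, 0) when y=0;
        # bounds are standardized relative to loc=mu, scale=1.
        a = np.where(y == 1, -mu, -np.inf)
        b = np.where(y == 1, np.inf, -mu)
        z = mu + truncnorm.rvs(a, b, random_state=rng)
        beta = V @ (X.T @ z) + L @ rng.standard_normal(p)  # draw from N(V X'z, V)
        draws[t] = beta
    return draws
```

The returned draws can then be summarized by their posterior mean, as discussed above.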
Suppose that a sample $X$ is taken from a distribution depending on a parameter vector $\theta \in \Theta$ of length $d$, with prior distribution $g(\theta_1, \ldots, \theta_d)$. It may be that $d$ is very large and that numerical integration to find the marginal densities of the $\theta_i$ would be computationally expensive. Then an alternative method of calculating the marginal densities is to create a Markov chain on the space $\Theta$ by repeating these two steps: pick a random index $1 \le j \le d$, then pick a new value for $\theta_j$ according to $g(\theta_1, \ldots, \theta_{j-1}, \cdot\,, \theta_{j+1}, \ldots, \theta_d)$. Writing $z \sim_j x$ to mean that $z$ agrees with $x$ in every coordinate except possibly the $j$th, the transition probability from $x$ to a state $y \sim_j x$ is $p_{xy} = \frac{1}{d}\, g(y) / \sum_{z \sim_j x} g(z)$, and since the normalizing sum is the same for $x$ and $y$,
$$p_{xy}\, g(x) = \frac{1}{d}\,\frac{g(y)\,g(x)}{\sum_{z \sim_j x} g(z)} = p_{yx}\, g(y).$$
Thus the detailed balance equations are satisfied, implying the chain is reversible and it has invariant distribution $g$.
In practice, the index $j$ is usually not chosen at random but cycled through in a fixed order. In general this gives a non-stationary Markov process, but each individual step will still be reversible, and the overall process will still have the desired stationary distribution (as long as the chain can access all states under the fixed ordering).
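As a concrete illustration (an assumed example, not from the original text), the full conditionals of a standard bivariate normal with correlation $\rho$ are themselves normal, so a fixed-order sweep can be written directly:

```python
import numpy as np

def bivariate_normal_gibbs(rho, n_samples=10_000, seed=0):
    """Systematic-scan Gibbs for a standard bivariate normal with correlation rho.

    Each full conditional is N(rho * other, 1 - rho**2), so every step samples
    exactly from the required conditional distribution.
    """
    rng = np.random.default_rng(seed)
    x = y = 0.0
    s = np.sqrt(1.0 - rho**2)
    out = np.empty((n_samples, 2))
    for i in range(n_samples):
        x = rng.normal(rho * y, s)   # x | y
        y = rng.normal(rho * x, s)   # y | x, in fixed order
        out[i] = x, y
    return out
```

As $\rho \to 1$ the conditionals pin each variable to the other and the chain mixes arbitrarily slowly, anticipating the failure mode discussed at the end of this section.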
Let $y$ denote observations generated from the sampling distribution $f(y \mid \theta)$ and let $\pi(\theta)$ be a prior supported on the parameter space $\Theta$. Then one of the central goals of Bayesian statistics is to approximate the posterior density
$$\pi(\theta \mid y) = \frac{f(y \mid \theta)\,\pi(\theta)}{m(y)},$$
where the marginal likelihood $m(y) = \int_{\Theta} f(y \mid \theta)\,\pi(\theta)\, d\theta$ is assumed to be finite for all $y$.
To explain the Gibbs sampler, we additionally assume that the parameter space $\Theta$ decomposes as $\Theta = \prod_{i=1}^{K} \Theta_i = \Theta_1 \times \cdots \times \Theta_K$ with $K > 1$, where $\times$ denotes the Cartesian product. Each component parameter space $\Theta_i$ can be a set of scalar components, subvectors, or matrices.
Note that the Gibbs sampler operates by an iterative Monte Carlo scheme within each cycle, drawing each block $\theta_i$ in turn from its full conditional distribution given the current values of all the other blocks, as shown below.
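In this notation, one cycle of the sampler can be written in the standard form (a reconstruction of the usual scheme, with superscripts indexing cycles):
$$
\begin{aligned}
\theta_1^{(s+1)} &\sim \pi\left(\theta_1 \mid \theta_2^{(s)}, \theta_3^{(s)}, \ldots, \theta_K^{(s)}, y\right),\\
\theta_2^{(s+1)} &\sim \pi\left(\theta_2 \mid \theta_1^{(s+1)}, \theta_3^{(s)}, \ldots, \theta_K^{(s)}, y\right),\\
&\;\,\vdots\\
\theta_K^{(s+1)} &\sim \pi\left(\theta_K \mid \theta_1^{(s+1)}, \ldots, \theta_{K-1}^{(s+1)}, y\right).
\end{aligned}
$$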
The samples $\{\theta^{(s)}\}_{s=1}^{S}$ drawn by the above algorithm form a Markov chain whose invariant distribution is the target density $\pi(\theta \mid y)$.
Numerous variations of the basic Gibbs sampler exist, such as blocked and collapsed Gibbs samplers. The goal of these variations is to reduce the autocorrelation between samples sufficiently to overcome any added computational costs.
In hierarchical Bayesian models with categorical variables, such as latent Dirichlet allocation and various other models used in natural language processing, it is quite common to collapse out the Dirichlet distributions that are typically used as prior distributions over the categorical variables.
This collapsing introduces dependencies among all the categorical variables that depend on a given Dirichlet prior, and the joint distribution of these variables after collapsing is a Dirichlet-multinomial distribution.
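For instance (a sketch assumed here, not taken from the original text), with a symmetric $\operatorname{Dirichlet}(\alpha)$ prior collapsed out, the conditional distribution of a single categorical assignment given all the others reduces to counts plus pseudocounts:

```python
import numpy as np

def collapsed_categorical_draw(z, i, K, alpha, rng):
    """Resample assignment z[i] with its Dirichlet prior integrated out.

    Under a symmetric Dirichlet(alpha) prior, p(z_i = k | z_{-i}) is
    proportional to n_k^{-i} + alpha, where n_k^{-i} counts how many of
    the other assignments currently equal k (the Dirichlet-multinomial rule).
    """
    counts = np.bincount(np.delete(z, i), minlength=K)
    p = counts + alpha
    return rng.choice(K, p=p / p.sum())
```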
For example, collapsing an inverse-gamma-distributed variance out of a network with a single Gaussian child will yield a Student's t-distribution.
(For that matter, collapsing both the mean and variance of a single Gaussian child will still yield a Student's t-distribution, provided both are conjugate, i.e. Gaussian mean, inverse-gamma variance.)
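Concretely, with inverse-gamma shape $a$ and scale $b$, the standard compounding identity behind this statement is
$$\int_0^{\infty} \mathcal{N}\left(x \mid \mu, \sigma^2\right)\, \operatorname{Inv-Gamma}\left(\sigma^2 \mid a, b\right)\, d\sigma^2 = t_{2a}\left(x \mid \mu, \sqrt{b/a}\right),$$
a Student's t-distribution with $2a$ degrees of freedom, location $\mu$, and scale $\sqrt{b/a}$.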
In these cases where compounding produces a well-known distribution, efficient sampling procedures often exist, and using them will often (although not necessarily) be more efficient than not collapsing and instead sampling both prior and child nodes separately.
If the child nodes of the collapsed nodes are continuous, this distribution will generally not be of a known form, and may well be difficult to sample from despite the fact that a closed form can be written, for the same reasons as described above for non-well-known compound distributions.
In fact, the principle involved here is described in fair detail in the article on the Dirichlet-multinomial distribution.
More generally, for any distribution over high-dimensional, real-valued vectors, if two particular elements of the vector are perfectly correlated (or perfectly anti-correlated), those two elements will become stuck, and Gibbs sampling will never be able to change them.
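A toy demonstration of this trap (an illustration assumed here, not part of the original text): place probability $\tfrac{1}{2}$ on each of the two-bit states $(0,0)$ and $(1,1)$ and zero on the mixed states; a Gibbs chain then never leaves whichever of the two states it starts in:

```python
import numpy as np

# Target over 2-bit vectors: p(0,0) = p(1,1) = 1/2, mixed states have probability 0.
p = {(0, 0): 0.5, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.5}

def sweep(state, rng):
    """One systematic-scan Gibbs sweep over both bits."""
    state = list(state)
    for j in (0, 1):
        # Conditional weights for bit j given the other bit's current value.
        w = np.array([p[tuple(state[:j] + [v] + state[j+1:])] for v in (0, 1)])
        state[j] = rng.choice(2, p=w / w.sum())
    return tuple(state)

rng = np.random.default_rng(0)
state = (0, 0)
visited = {state}
for _ in range(10_000):
    state = sweep(state, rng)
    visited.add(state)
print(visited)  # {(0, 0)}: the chain never reaches (1, 1)
```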
For example, consider a distribution over 100-bit vectors in which the zero vector occurs with probability $\tfrac{1}{2}$ and all other vectors are equally probable (each with probability $\tfrac{1}{2(2^{100}-1)}$). To estimate the probability of the zero vector, it would be sufficient to take 100 or 1,000 samples from the true distribution.
But Gibbs sampling will alternate between returning only the zero vector for long periods (on the order of $2^{99}$ in a row) and only nonzero vectors for similarly long periods, so any feasible number of samples would badly misestimate this probability.