In statistics and statistical physics, the Metropolis–Hastings algorithm is a Markov chain Monte Carlo (MCMC) method for obtaining a sequence of random samples from a probability distribution from which direct sampling is difficult.
The resulting sequence can be used to approximate the distribution (e.g. to generate a histogram) or to compute an integral (e.g. an expected value).
Metropolis–Hastings and other MCMC algorithms are generally used for sampling from multi-dimensional distributions, especially when the number of dimensions is high.
The algorithm is named in part for Nicholas Metropolis, the first coauthor of a 1953 paper, entitled Equation of State Calculations by Fast Computing Machines, with Arianna W. Rosenbluth, Marshall Rosenbluth, Augusta H. Teller and Edward Teller.
For many years the algorithm was known simply as the Metropolis algorithm; the 1953 paper treated only symmetric proposal distributions, and in 1970 W. K. Hastings extended it to the more general case.[3] The generalized method was eventually identified by both names, although the first use of the term "Metropolis–Hastings algorithm" is unclear.
Metropolis, who was familiar with the computational aspects of the method, had coined the term "Monte Carlo" in an earlier article with Stanisław Ulam, and led the group in the Theoretical Division that designed and built the MANIAC I computer used in the experiments in 1952.
Shortly before his death, Marshall Rosenbluth attended a 2003 conference at LANL marking the 50th anniversary of the 1953 publication.[4] Further historical clarification is made by Gubernatis in a 2005 journal article[5] recounting the 50th anniversary conference.
Rosenbluth makes it clear that he and his wife Arianna did the work, and that Metropolis played no role in the development other than providing computer time.
This contradicts an account by Edward Teller, who states in his memoirs that the five authors of the 1953 article worked together for "days (and nights)".
By contrast, Rosenbluth's detailed account credits Teller with a crucial but early suggestion to take advantage of statistical mechanics and take ensemble averages instead of following detailed kinematics. This, says Rosenbluth, started him thinking about the generalized Monte Carlo approach – a topic which he says he had discussed often with John von Neumann.
In an oral history recorded shortly before his death,[7] Rosenbluth again credits Teller with posing the original problem, himself with solving it, and Arianna with programming the computer.
At each iteration, the algorithm proposes a candidate for the next sample based on the current sample value. If the proposal distribution is, for example, a Gaussian centered at the current sample $x_t$, then points close to $x_t$ are more likely to be visited next, making the sequence of samples into a Gaussian random walk. A candidate is always accepted when it moves toward a region of higher density $P(x)$, and is accepted only with a probability equal to the density ratio when it moves toward a region of lower density. Thus, we will tend to stay in (and return large numbers of samples from) high-density regions of $P(x)$, while only occasionally visiting low-density regions. Intuitively, this is why the algorithm works and returns samples that follow the desired distribution with density $P(x)$.
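As an illustrative sketch (not code from the original article), the following Python snippet implements this random-walk scheme for a one-dimensional target known only up to a normalizing constant; the bimodal target `f`, the step size, and the function names are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

def f(x):
    """Unnormalized target density: an arbitrary bimodal example,
    proportional to a mixture of two Gaussians."""
    return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

def random_walk_metropolis(f, x0=0.0, n_samples=50_000, step=1.0):
    """Random-walk Metropolis: the Gaussian proposal is symmetric,
    so the acceptance probability needs only the ratio f(x')/f(x)."""
    x = x0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        x_prop = x + step * rng.normal()  # symmetric Gaussian proposal
        # Accept with probability min(1, f(x_prop)/f(x)); on rejection
        # the chain stays at the current state, which is recorded again.
        if rng.uniform() < f(x_prop) / f(x):
            x = x_prop
        samples[i] = x
    return samples

samples = random_walk_metropolis(f)
print("sample mean:", samples.mean())  # close to the mean of the mixture
```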
Compared with an algorithm like adaptive rejection sampling[8] that directly generates independent samples from a distribution, Metropolis–Hastings and other MCMC algorithms have a number of disadvantages: the samples are autocorrelated, so that, even though they follow $P(x)$ in the long run, a set of nearby samples carries less information than the same number of independent samples; and although the chain eventually converges to the desired distribution, the initial samples may follow a very different distribution, especially if the starting point lies in a region of low density, so a burn-in period is typically discarded. On the other hand, most simple rejection sampling methods suffer from the "curse of dimensionality", where the probability of rejection increases exponentially as a function of the number of dimensions. Metropolis–Hastings, along with other MCMC methods, does not have this problem to such a degree, and is thus often the only solution available when the number of dimensions of the distribution to be sampled is high.
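A small numerical sketch (my own illustration, with an arbitrary uniform-box proposal) makes the exponential rejection concrete: rejection sampling of a standard $d$-dimensional Gaussian from a uniform proposal on $[-5, 5]^d$ accepts a fraction of roughly $0.25^d$ of all proposals, so by $d = 20$ essentially nothing is accepted.

```python
import numpy as np

rng = np.random.default_rng(0)

def rejection_acceptance_rate(dim, n_trials=200_000, half_width=5.0):
    """Empirical acceptance rate when rejection-sampling a standard
    dim-dimensional Gaussian from a uniform proposal on [-5, 5]^dim.
    The envelope constant M = sup p(x)/q(x) = p(0)/q(0) grows like 4^dim,
    so the acceptance rate 1/M shrinks like 0.25^dim."""
    log_m = dim * (np.log(2 * half_width) - 0.5 * np.log(2 * np.pi))
    x = rng.uniform(-half_width, half_width, size=(n_trials, dim))
    log_p = -0.5 * np.sum(x * x, axis=1) - 0.5 * dim * np.log(2 * np.pi)
    log_q = -dim * np.log(2 * half_width)
    # Accept with probability p(x) / (M q(x)), computed in log space.
    accepted = np.log(rng.uniform(size=n_trials)) < log_p - log_q - log_m
    return accepted.mean()

for d in (1, 2, 5, 10, 20):
    print(f"d={d:2d}  acceptance rate ≈ {rejection_acceptance_rate(d):.2e}")
```

By contrast, a random-walk Metropolis chain with a well-tuned step size maintains a usable acceptance rate even as the dimension grows.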
In multivariate distributions, the classic Metropolis–Hastings algorithm as described above involves choosing a new multi-dimensional sample point. When the number of dimensions is high, finding a suitable jumping distribution can be difficult, as the individual dimensions may behave in very different ways and the jumping width must be "just right" for all dimensions at once to avoid excessively slow mixing. An alternative approach that often works better in such situations, known as Gibbs sampling, involves choosing a new sample for each dimension separately from the others; similarly, single-component Metropolis–Hastings updates one coordinate at a time with a one-dimensional proposal, as in the sketch below.
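The following sketch of the single-component variant (again my own illustration, not code from the article) updates one coordinate at a time with a one-dimensional Gaussian proposal, here applied to an arbitrary correlated two-dimensional Gaussian target known up to a constant:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_f(x):
    """Unnormalized log-density of an example 2-D Gaussian with
    correlation 0.9 (an arbitrary target for illustration)."""
    cov_inv = np.linalg.inv(np.array([[1.0, 0.9], [0.9, 1.0]]))
    return -0.5 * x @ cov_inv @ x

def single_component_metropolis(log_f, x0, n_sweeps=20_000, step=0.5):
    """One sweep updates each coordinate in turn with a 1-D symmetric
    Gaussian proposal, accepting with probability min(1, f(x')/f(x))."""
    x = np.array(x0, dtype=float)
    dim = x.size
    samples = np.empty((n_sweeps, dim))
    for t in range(n_sweeps):
        for i in range(dim):
            x_prop = x.copy()
            x_prop[i] += step * rng.normal()   # perturb one coordinate only
            if np.log(rng.uniform()) < log_f(x_prop) - log_f(x):
                x = x_prop                     # accept the single-coordinate move
        samples[t] = x
    return samples

samples = single_component_metropolis(log_f, x0=[0.0, 0.0])
print("empirical correlation:", np.corrcoef(samples.T)[0, 1])  # ≈ 0.9
```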
The purpose of the Metropolis–Hastings algorithm is to generate a collection of states according to a desired distribution $P(x)$. To accomplish this, the algorithm uses a Markov process, which asymptotically reaches a unique stationary distribution $\pi(x)$ such that $\pi(x) = P(x)$.

The derivation of the algorithm starts with the condition of detailed balance:

$$P(x' \mid x)\, P(x) = P(x \mid x')\, P(x'),$$

which is re-written as

$$\frac{P(x' \mid x)}{P(x \mid x')} = \frac{P(x')}{P(x)}.$$

The approach is to separate the transition into two sub-steps: the proposal and the acceptance-rejection. The proposal distribution $g(x' \mid x)$ is the conditional probability of proposing a state $x'$ given the current state $x$, and the acceptance probability $A(x', x)$ is the probability of accepting the proposed state $x'$. The transition probability can be written as the product of them:

$$P(x' \mid x) = g(x' \mid x)\, A(x', x).$$

Inserting this relation into the previous equation, we have

$$\frac{A(x', x)}{A(x, x')} = \frac{P(x')}{P(x)} \frac{g(x \mid x')}{g(x' \mid x)}.$$

The next step in the derivation is to choose an acceptance ratio that fulfills the condition above. One common choice is the Metropolis choice:

$$A(x', x) = \min\!\left(1, \frac{P(x')}{P(x)} \frac{g(x \mid x')}{g(x' \mid x)}\right).$$
The Metropolis–Hastings algorithm can thus be written as follows:

1. Initialise: pick an initial state $x_0$ and set $t = 0$.
2. Iterate:
   a. Generate a random candidate state $x'$ according to $g(x' \mid x_t)$;
   b. Calculate the acceptance probability $A(x', x_t) = \min\!\left(1, \frac{P(x')}{P(x_t)} \frac{g(x_t \mid x')}{g(x' \mid x_t)}\right)$;
   c. Accept or reject: generate a uniform random number $u \in [0, 1]$; if $u \le A(x', x_t)$, accept the candidate and set $x_{t+1} = x'$; otherwise, reject it and set $x_{t+1} = x_t$;
   d. Increment: set $t = t + 1$.

Provided that specified conditions are met, the empirical distribution of saved states $x_0, \ldots, x_T$ will approach $P(x)$. The number of iterations $T$ required to effectively estimate $P(x)$ depends on several factors, including the relationship between $P(x)$ and the proposal distribution and the desired accuracy of estimation.[13] For distributions on discrete state spaces, it has to be of the order of the autocorrelation time of the Markov process. It is important to note that it is not clear, in a general problem, which proposal distribution $g(x' \mid x)$ one should use or the number of iterations necessary for proper estimation; both are free parameters of the method, which must be adjusted to the particular problem in hand.
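To make the listing concrete, here is a minimal Python sketch of the general algorithm (my own illustration, with an arbitrary Gamma(3, 1)-shaped target on $(0, \infty)$ and a log-normal proposal, which is asymmetric and therefore requires the correction factor $g(x_t \mid x')/g(x' \mid x_t)$):

```python
import numpy as np

rng = np.random.default_rng(7)

def log_f(x):
    """Unnormalized log target density, proportional to a Gamma(3, 1)
    density x^2 * exp(-x) on x > 0 (an arbitrary example target)."""
    return 2.0 * np.log(x) - x

def metropolis_hastings(log_f, x0=1.0, n_samples=100_000, step=0.5):
    """General Metropolis–Hastings with a log-normal random-walk proposal
    x' = x * exp(step * z), z ~ N(0, 1). The proposal is asymmetric, so the
    acceptance probability includes the Hastings correction
    g(x_t | x') / g(x' | x_t), which for this proposal equals x' / x_t."""
    x = x0
    samples = np.empty(n_samples)
    for t in range(n_samples):
        x_prop = x * np.exp(step * rng.normal())
        log_a = (log_f(x_prop) - log_f(x)        # log a1: target ratio
                 + np.log(x_prop) - np.log(x))   # log a2: proposal ratio
        if np.log(rng.uniform()) < log_a:
            x = x_prop       # accept the candidate
        samples[t] = x       # on rejection, the current state is repeated
    return samples

samples = metropolis_hastings(log_f)
print("sample mean:", samples.mean())  # Gamma(3, 1) has mean 3
```

Working in log space, as above, avoids numerical underflow when the density ratios become very small.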
Suppose that the most recent value sampled is $x_t$. To follow the Metropolis–Hastings algorithm, we next draw a new proposal state $x'$ with probability density $g(x' \mid x_t)$ and calculate a value $a = a_1 a_2$, where $a_1 = P(x')/P(x_t)$ is the probability (e.g., Bayesian posterior) ratio between the proposed sample $x'$ and the previous sample $x_t$, and $a_2 = g(x_t \mid x')/g(x' \mid x_t)$ is the ratio of the proposal density in the two directions (from $x_t$ to $x'$ and conversely); this ratio equals 1 if the proposal density is symmetric. The new state $x_{t+1}$ is then set to $x'$ if $a \ge 1$, and otherwise to $x'$ with probability $a$ or to $x_t$ with probability $1 - a$.
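For illustration with made-up numbers: with a symmetric proposal ($a_2 = 1$), if $P(x') = 0.4$ and $P(x_t) = 0.2$, then $a = a_1 = 2 \ge 1$ and the candidate is accepted outright; if instead $P(x') = 0.05$, then $a = 0.25$ and the candidate is accepted only with probability $0.25$, the chain otherwise remaining at $x_t$.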
The algorithm works best if the proposal density matches the shape of the target distribution $P(x)$, that is $g(x' \mid x_t) \approx P(x')$. If a Gaussian proposal density $g$ is used, its variance parameter $\sigma^2$ has to be tuned during the burn-in period. This is usually done by monitoring the acceptance rate, the fraction of proposed samples accepted within a window of recent iterations. It has been shown theoretically that the ideal acceptance rate is approximately 50% for a one-dimensional Gaussian target, decreasing to approximately 23% for an $N$-dimensional Gaussian target.[15] These guidelines can work well when sampling from sufficiently regular Bayesian posteriors, as they often follow a multivariate normal distribution, as can be established using the Bernstein–von Mises theorem.

If $\sigma^2$ is too small, the chain will mix slowly: the acceptance rate will be high, but successive samples will move around the space slowly, and the chain will converge only slowly to $P(x)$. On the other hand, if $\sigma^2$ is too large, the acceptance rate will be very low because the proposals are likely to land in regions of much lower probability density, so $a_1$ will be very small, and again the chain will converge very slowly. One typically tunes the proposal distribution so that the algorithm accepts on the order of 30% of all samples – in line with the theoretical estimates mentioned in the previous paragraph.
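One simple way to implement this tuning (a sketch under my own assumptions, not a prescription from the article) is to rescale the proposal width during burn-in whenever the recent acceptance rate drifts away from a target of about 0.3, then freeze it for the sampling phase:

```python
import numpy as np

rng = np.random.default_rng(3)

def log_f(x):
    """Unnormalized log-density of a standard Gaussian (example target)."""
    return -0.5 * x * x

def tuned_metropolis(log_f, x0=0.0, n_burn=5_000, n_samples=50_000,
                     target_rate=0.3, window=100):
    """Random-walk Metropolis with step size adapted during burn-in:
    every `window` iterations the step is scaled up if the recent
    acceptance rate exceeds `target_rate` and down otherwise."""
    x, step, accepted = x0, 1.0, 0
    for t in range(1, n_burn + 1):
        x_prop = x + step * rng.normal()
        if np.log(rng.uniform()) < log_f(x_prop) - log_f(x):
            x, accepted = x_prop, accepted + 1
        if t % window == 0:
            step *= 1.1 if accepted / window > target_rate else 0.9
            accepted = 0
    # Sampling phase: the step size is now fixed.
    samples = np.empty(n_samples)
    n_accept = 0
    for t in range(n_samples):
        x_prop = x + step * rng.normal()
        if np.log(rng.uniform()) < log_f(x_prop) - log_f(x):
            x = x_prop
            n_accept += 1
        samples[t] = x
    print(f"tuned step = {step:.2f}, acceptance rate = {n_accept / n_samples:.2f}")
    return samples

samples = tuned_metropolis(log_f)
```

Adapting the proposal only during burn-in keeps the subsequent chain a time-homogeneous Markov process; continuously adapting schemes exist but require more care to preserve the correct stationary distribution.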