Laplace's approximation provides an analytical expression for a posterior probability distribution by fitting a Gaussian distribution with a mean equal to the MAP solution and precision equal to the observed Fisher information.
[1][2] The approximation is justified by the Bernstein–von Mises theorem, which states that, under regularity conditions, the error of the approximation tends to 0 as the number of data points tends to infinity.
[3][4] For example, consider a regression or classification model with data set $\{x_n, y_n\}_{n=1,\ldots,N}$ comprising inputs $x$ and outputs $y$, with (unknown) parameter vector $\theta$ of length $D$. The likelihood is denoted $p(y|x,\theta)$ and the parameter prior $p(\theta)$. Suppose one wants to approximate the joint density of outputs and parameters $p(y,\theta|x)$. Bayes' formula reads

$$p(y,\theta|x) \;=\; p(y|x,\theta)\,p(\theta) \;=\; p(y|x)\,p(\theta|y,x).$$

The joint is equal to the product of the likelihood and the prior and, by Bayes' rule, equal to the product of the marginal likelihood $p(y|x)$ and the posterior $p(\theta|y,x)$. Seen as a function of $\theta$, the joint is an un-normalised density.
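As a purely illustrative sketch (not part of the original text), such an un-normalised log joint could be written in Python for a hypothetical Bayesian logistic regression with an isotropic Gaussian prior; the names `log_joint`, `x`, `y` and `prior_var` are assumptions made for this example.

```python
import numpy as np

def log_joint(theta, x, y, prior_var=1.0):
    """Un-normalised log joint log p(y, theta | x) for a toy Bayesian
    logistic regression (illustrative assumption, not from the article):
    Bernoulli likelihood with an isotropic Gaussian prior on theta."""
    logits = x @ theta                                   # linear predictor f_n = x_n . theta
    # Bernoulli log-likelihood sum_n [y_n f_n - log(1 + exp(f_n))], stable form
    log_lik = np.sum(y * logits - np.logaddexp(0.0, logits))
    # Gaussian prior log-density, including its normalisation constant,
    # so that exp(log_joint) is the full un-normalised joint density
    D = theta.size
    log_prior = -0.5 * theta @ theta / prior_var - 0.5 * D * np.log(2 * np.pi * prior_var)
    return log_lik + log_prior
```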
In Laplace's approximation, we approximate the joint by an un-normalised Gaussian $\tilde q(\theta) = Z\,q(\theta)$, where we use $q$ to denote the approximate density, $\tilde q$ the un-normalised density and $Z$ the normalisation constant of $\tilde q$ (independent of $\theta$). Since the marginal likelihood $p(y|x)$ doesn't depend on the parameter $\theta$ and the posterior $p(\theta|y,x)$ normalises over $\theta$, we can immediately identify them with $Z$ and $q(\theta)$ of our approximation, respectively.
Laplace's approximation is

$$p(y,\theta|x) \;\simeq\; p(y,\hat\theta|x)\,\exp\!\Big(-\tfrac{1}{2}(\theta-\hat\theta)^{\mathsf T} S^{-1}(\theta-\hat\theta)\Big) \;=\; \tilde q(\theta),$$

where we have defined

$$\hat\theta \;=\; \operatorname*{argmax}_{\theta}\, \log p(y,\theta|x), \qquad S^{-1} \;=\; -\,\nabla\nabla \log p(y,\theta|x)\big|_{\theta=\hat\theta},$$

where $\hat\theta$ is the location of a mode of the joint target density, also known as the maximum a posteriori or MAP point, and $S^{-1}$ is the $D\times D$ positive definite matrix of second derivatives of the negative log joint target density at the mode $\theta=\hat\theta$. Thus, the Gaussian approximation matches the value and the log-curvature of the un-normalised target density at the mode. The value of $\hat\theta$ is usually found using a gradient-based method.
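Continuing the illustrative sketch above, the mode can be located with a gradient-based optimiser (here SciPy's BFGS) and $S^{-1}$ estimated by central finite differences of the negative log joint; `laplace_fit` is a hypothetical helper name, and in practice an automatic-differentiation Hessian would usually replace the finite differences.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_fit(log_joint, theta0, args=(), eps=1e-4):
    """Return (theta_hat, S): the MAP point and the covariance of the
    Gaussian approximation, with S^{-1} the Hessian of the negative log
    joint at the mode.  Illustrative sketch, not the article's code."""
    neg = lambda th: -log_joint(th, *args)
    res = minimize(neg, theta0, method="BFGS")           # gradient-based search for the mode
    theta_hat = res.x
    D = theta_hat.size
    # Central-difference estimate of the Hessian of the negative log joint
    H = np.zeros((D, D))
    for i in range(D):
        for j in range(D):
            ei = np.zeros(D); ei[i] = eps
            ej = np.zeros(D); ej[j] = eps
            H[i, j] = (neg(theta_hat + ei + ej) - neg(theta_hat + ei - ej)
                       - neg(theta_hat - ei + ej) + neg(theta_hat - ei - ej)) / (4 * eps ** 2)
    S = np.linalg.inv(H)                                 # covariance of the Gaussian approximation
    return theta_hat, S
```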
In summary, we have

$$q(\theta) \;=\; \mathcal N\big(\theta \,\big|\, \mu=\hat\theta,\ \Sigma=S\big), \qquad \log Z \;\simeq\; \log p(y,\hat\theta|x) + \tfrac{1}{2}\log|S| + \tfrac{D}{2}\log(2\pi),$$

for the approximate posterior over $\theta$ and the approximate log marginal likelihood respectively.
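These two summary quantities could be assembled as below, still as an illustrative sketch building on the hypothetical `log_joint` and `laplace_fit` helpers above: the approximate posterior is the Gaussian $\mathcal N(\hat\theta, S)$ and the approximate log marginal likelihood follows the formula just given. The synthetic data in the usage example are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import multivariate_normal

def laplace_log_marginal(log_joint, theta_hat, S, args=()):
    """Approximate log marginal likelihood:
    log Z = log p(y, theta_hat | x) + 1/2 log|S| + D/2 log(2*pi)."""
    D = theta_hat.size
    _, logdet = np.linalg.slogdet(S)
    return log_joint(theta_hat, *args) + 0.5 * logdet + 0.5 * D * np.log(2 * np.pi)

# Example usage on synthetic data (assumed toy set-up, for illustration only):
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
y = (x @ np.array([1.0, -2.0]) + rng.normal(size=100) > 0).astype(float)
theta_hat, S = laplace_fit(log_joint, np.zeros(2), args=(x, y))
q = multivariate_normal(mean=theta_hat, cov=S)           # approximate posterior q(theta)
log_Z = laplace_log_marginal(log_joint, theta_hat, S, args=(x, y))
```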
The main weaknesses of Laplace's approximation are that it is symmetric around the mode and that it is very local: the entire approximation is derived from properties at a single point of the target density.
Laplace's method is widely used and was pioneered in the context of neural networks by David MacKay,[5] and for Gaussian processes by Williams and Barber.