Evidence lower bound

In variational Bayesian methods, the evidence lower bound (often abbreviated ELBO, also sometimes called the variational lower bound[1] or negative variational free energy) is a useful lower bound on the log-likelihood of some observed data.

The ELBO is useful because it provides a guarantee on the worst-case for the log-likelihood of some distribution (e.g. p_\theta(X)) which models a set of data. The actual log-likelihood may be higher (indicating an even better fit to the distribution) because the ELBO includes a Kullback-Leibler divergence (KL divergence) term which decreases the ELBO due to an internal part of the model being inaccurate despite good fit of the model overall. Thus improving the ELBO score indicates either improving the likelihood of the model p_\theta(X) or the fit of a component internal to the model, or both, and the ELBO score makes a good loss function, e.g., for training a deep neural network to improve both the model overall and the internal component (the internal component being q_\phi(\cdot|x), defined below).

Definition

Let X and Z be random variables, jointly distributed with distribution p_\theta. For example, p_\theta(X) is the marginal distribution of X, and p_\theta(Z \mid X) is the conditional distribution of Z given X. Then, for a sample x \sim p_\theta and any distribution q_\phi, the ELBO is defined as

\mathrm{ELBO}(x) := \mathbb{E}_{z\sim q_\phi(\cdot|x)}\!\left[\ln \frac{p_\theta(x,z)}{q_\phi(z|x)}\right].

It can equivalently be written as

\mathrm{ELBO}(x) = \mathbb{E}_{z\sim q_\phi(\cdot|x)}\big[\ln p_\theta(x,z)\big] + H\big[q_\phi(\cdot|x)\big] = \ln p_\theta(x) - D_{KL}\big(q_\phi(\cdot|x)\,\|\,p_\theta(\cdot|x)\big).

In the first form, H[q_\phi(\cdot|x)] is the entropy of q_\phi, which relates the ELBO to the Helmholtz free energy. In the second form, \ln p_\theta(x) is called the evidence for x, and D_{KL}(q_\phi(\cdot|x)\,\|\,p_\theta(\cdot|x)) is the Kullback-Leibler divergence between q_\phi and p_\theta. Since the KL divergence is non-negative, the ELBO forms a lower bound on the evidence (ELBO inequality):

\ln p_\theta(x) \ge \mathbb{E}_{z\sim q_\phi(\cdot|x)}\!\left[\ln \frac{p_\theta(x,z)}{q_\phi(z|x)}\right].
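As a concrete sanity check of this inequality and of the gap being exactly a KL divergence, the following sketch estimates the ELBO by Monte Carlo in a toy conjugate Gaussian model (the model, the variational parameters m and s, and the use of NumPy/SciPy are illustrative assumptions, not part of the article):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy conjugate model (hypothetical choice for illustration):
#   prior      p(z)   = N(0, 1)
#   likelihood p(x|z) = N(z, 1)
# which gives evidence p(x) = N(0, 2) and posterior p(z|x) = N(x/2, 1/2).
x = 1.3

# An arbitrary variational distribution q(z|x) = N(m, s^2).
m, s = 0.4, 0.9

# Monte Carlo estimate of ELBO(x) = E_{z~q}[ln p(x,z) - ln q(z|x)].
z = rng.normal(m, s, size=200_000)
log_joint = norm.logpdf(z, 0, 1) + norm.logpdf(x, z, 1)   # ln p(z) + ln p(x|z)
elbo = np.mean(log_joint - norm.logpdf(z, m, s))

log_evidence = norm.logpdf(x, 0, np.sqrt(2))              # ln p(x), exact

# Closed-form KL( N(m, s^2) || N(mu, sig^2) ) between the variational
# distribution and the exact posterior N(x/2, 1/2).
mu, sig = x / 2, np.sqrt(0.5)
kl = np.log(sig / s) + (s**2 + (m - mu) ** 2) / (2 * sig**2) - 0.5

print(f"ELBO(x)       ~ {elbo:.4f}")
print(f"ln p(x)       = {log_evidence:.4f}")
print(f"gap           ~ {log_evidence - elbo:.4f}")
print(f"KL(q || post) = {kl:.4f}   # should match the gap")

With this choice of model the evidence and posterior are available in closed form, so the estimated gap can be compared directly against the exact KL term.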

Motivation

Suppose we have an observable random variable X, and we want to find its true distribution p^*.

This would allow us to generate data by sampling, and estimate probabilities of future events.

In general, it is impossible to find p^* exactly, forcing us to search for a good approximation.

That is, we define a sufficiently large parametric family \{p_\theta\}_{\theta\in\Theta} of distributions, then solve for \min_\theta L(p_\theta, p^*) for some loss function L. A convenient way to obtain such a family is to parametrize it through a latent variable: fix a simple prior p(z) over a latent random variable Z, and let a parametrized function f_\theta (such as a deep neural network) map each z to the parameters of a simple conditional distribution p_\theta(x|z) over the observable X. This defines a joint distribution p_\theta(x,z) = p_\theta(x|z)\,p(z) from which sampling is easy: draw z \sim p, then draw x \sim p_\theta(\cdot|z).

In other words, we have a generative model for both the observable and the latent.
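A minimal sketch of such a generative model, assuming a toy random affine "decoder" standing in for a deep neural network (the names f_theta, W, b and all dimensions below are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

# Toy latent-variable generative model:
#   z ~ p(z) = N(0, I)                       simple prior over the latent
#   x ~ p_theta(x | z) = N(f_theta(z), sigma^2 I)   observable given latent
latent_dim, obs_dim, sigma = 2, 5, 0.1

W = rng.normal(size=(obs_dim, latent_dim))
b = rng.normal(size=obs_dim)

def f_theta(z):
    # Stand-in for a deep neural network decoder.
    return W @ z + b

# Ancestral sampling: first the latent, then the observable.
z = rng.normal(size=latent_dim)       # z ~ p(z)
x = rng.normal(f_theta(z), sigma)     # x ~ p_theta(. | z)

print("z =", np.round(z, 3))
print("x =", np.round(x, 3))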

We consider a distribution p_\theta good if p_\theta(X) \approx p^*(X). Since the right side is a distribution over X only, the distribution on the left side must marginalize the latent variable Z away: p_\theta(x) = \int p_\theta(x|z)\,p(z)\,dz. In general, this integral is intractable, forcing us to perform another approximation.

Since p_\theta(x) = \frac{p_\theta(x|z)\,p(z)}{p_\theta(z|x)} (Bayes' Rule), it suffices to find a good approximation of p_\theta(z|x). So we define another parametrized family of distributions q_\phi(\cdot|x) and use it to approximate the posterior p_\theta(\cdot|x). This is a discriminative model for the latent.

The entire situation is summarized as follows: the prior p(z), the likelihood p_\theta(x|z), and the joint p_\theta(x,z) are easy to compute and to sample from, whereas the evidence p_\theta(x) and the posterior p_\theta(z|x) are intractable and can only be approximated, via p_\theta(x) \approx p^*(x) and q_\phi(z|x) \approx p_\theta(z|x) respectively. In Bayesian language, X is the observed evidence and Z is the latent variable, so that p(z) is the prior over Z, p_\theta(x|z) is the likelihood function, and p_\theta(z|x) is the posterior distribution over Z.

Given an observation x, we can infer which z likely gave rise to x by computing the posterior p_\theta(z|x). The usual Bayesian method is to estimate the integral p_\theta(x) = \int p_\theta(x|z)\,p(z)\,dz, then compute p_\theta(z|x) = \frac{p_\theta(x|z)\,p(z)}{p_\theta(x)} by Bayes' rule.

This is expensive to perform in general, but if we can simply find a good approximation q_\phi(z|x) \approx p_\theta(z|x) for most x, then we can infer z from x cheaply. Thus, the search for a good q_\phi is also called amortized inference.

All in all, we have found a problem of variational Bayesian inference.
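A minimal sketch of what amortized inference means in practice, assuming a toy linear "encoder" (the weights W_mu and W_logvar and the diagonal-Gaussian form of q_phi are illustrative assumptions): one shared function maps any observation x directly to the parameters of q_phi(z|x), so no per-observation optimization is required.

import numpy as np

rng = np.random.default_rng(0)
obs_dim, latent_dim = 5, 2

# Toy "encoder" weights standing in for phi (illustrative only).
W_mu = rng.normal(size=(latent_dim, obs_dim))
W_logvar = rng.normal(size=(latent_dim, obs_dim)) * 0.1

def q_phi(x):
    """Amortized inference: map an observation x straight to the
    parameters (mean, variance) of a diagonal-Gaussian q_phi(z|x)."""
    mu = W_mu @ x
    var = np.exp(W_logvar @ x)
    return mu, var

# One shared function handles every x; no per-x optimization loop.
for x in rng.normal(size=(3, obs_dim)):
    mu, var = q_phi(x)
    print("x -> q(z|x): mean", np.round(mu, 2), "var", np.round(var, 2))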

A basic result in variational inference is that minimizing the Kullback–Leibler divergence (KL-divergence) is equivalent to maximizing the log-likelihood:

\mathbb{E}_{x\sim p^*(x)}[\ln p_\theta(x)] = -H(p^*) - D_{KL}\big(p^*(x)\,\|\,p_\theta(x)\big),

where H(p^*) = -\mathbb{E}_{x\sim p^*}[\ln p^*(x)] is the entropy of the true distribution, which does not depend on \theta.

Since we only have access to p^* through samples, we maximize the empirical average \frac{1}{N}\sum_{i}\ln p_\theta(x_i) instead, where N is the number of samples drawn from the true distribution.

The term \ln p_\theta(x_i) = \ln \int p_\theta(x_i|z)\,p(z)\,dz usually has no closed form and must be estimated. The usual way to estimate such integrals is Monte Carlo integration with importance sampling:

p_\theta(x) = \mathbb{E}_{z\sim q_\phi(\cdot|x)}\!\left[\frac{p_\theta(x,z)}{q_\phi(z|x)}\right],

so that \frac{p_\theta(x,z)}{q_\phi(z|x)} with z \sim q_\phi(\cdot|x) is an unbiased estimator of p_\theta(x). Unfortunately, this does not give an unbiased estimator of \ln p_\theta(x), because \ln is nonlinear: no matter how many samples z_1,\dots,z_N \sim q_\phi(\cdot|x) we take, Jensen's inequality gives

\mathbb{E}\!\left[\ln\!\left(\frac{1}{N}\sum_i \frac{p_\theta(x,z_i)}{q_\phi(z_i|x)}\right)\right] \le \ln \mathbb{E}\!\left[\frac{1}{N}\sum_i \frac{p_\theta(x,z_i)}{q_\phi(z_i|x)}\right] = \ln p_\theta(x).

Subtracting the right side, we see that the problem comes down to a biased estimator of zero:

\mathbb{E}\!\left[\ln\!\left(\frac{1}{N}\sum_i \frac{p_\theta(x,z_i)}{q_\phi(z_i|x)\,p_\theta(x)}\right)\right] \le 0.

At this point, we could branch off towards the development of an importance-weighted autoencoder[note 2], but we will instead continue with the simplest case, N = 1:

\ln p_\theta(x) \ge \mathbb{E}_{z\sim q_\phi(\cdot|x)}\!\left[\ln \frac{p_\theta(x,z)}{q_\phi(z|x)}\right].
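For illustration, the following sketch compares the single-sample bound (N = 1) with the K-sample importance-weighted bound in the same toy conjugate Gaussian model used earlier (the model and the deliberately imperfect q are assumptions made only for this example); larger K tightens the bound toward ln p_theta(x):

import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Toy conjugate model (illustrative assumption):
# p(z) = N(0, 1), p(x|z) = N(z, 1), so ln p(x) = ln N(x; 0, 2).
x = 1.3
m, s = 0.4, 0.9          # a deliberately imperfect q(z|x) = N(m, s^2)

def log_weight(z):
    """ln [ p(x, z) / q(z|x) ] for the toy model."""
    return norm.logpdf(z, 0, 1) + norm.logpdf(x, z, 1) - norm.logpdf(z, m, s)

def k_sample_bound(K, repeats=20_000):
    """Monte Carlo estimate of E[ ln (1/K) sum_k p(x,z_k)/q(z_k|x) ].
    K = 1 is the plain ELBO; larger K gives a tighter lower bound
    (the importance-weighted bound)."""
    z = rng.normal(m, s, size=(repeats, K))
    return np.mean(logsumexp(log_weight(z), axis=1) - np.log(K))

print(f"ln p(x) = {norm.logpdf(x, 0, np.sqrt(2)):.4f}")
for K in (1, 5, 50):
    print(f"K = {K:3d}:  bound ~ {k_sample_bound(K):.4f}")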

The tightness of the inequality has a closed form:

\ln p_\theta(x) - \mathbb{E}_{z\sim q_\phi(\cdot|x)}\!\left[\ln \frac{p_\theta(x,z)}{q_\phi(z|x)}\right] = D_{KL}\big(q_\phi(\cdot|x)\,\|\,p_\theta(\cdot|x)\big) \ge 0.

Maximizing the ELBO is therefore equivalent to simultaneously maximizing \ln p_\theta(x) and minimizing D_{KL}(q_\phi(\cdot|x)\,\|\,p_\theta(\cdot|x)). In other words, maximizing the ELBO would simultaneously allow us to obtain an accurate generative model p_\theta(x) \approx p^*(x) and an accurate discriminative model q_\phi(\cdot|x) \approx p_\theta(\cdot|x).
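To illustrate the discriminative half of this statement, the sketch below fixes theta in the toy conjugate Gaussian model (an assumption for illustration) and maximizes a closed-form ELBO over the variational parameters phi = (m, s); the optimum should recover the exact posterior N(x/2, 1/2):

import numpy as np
from scipy.optimize import minimize

# Toy conjugate model: p(z) = N(0, 1), p(x|z) = N(z, 1),
# exact posterior p(z|x) = N(x/2, 1/2). With theta fixed,
# maximizing the ELBO over phi = (m, s) should recover it.
x = 1.3

def neg_elbo(params):
    m, log_s = params
    s2 = np.exp(2 * log_s)
    # Closed-form ELBO for q = N(m, s^2):
    #   E_q[ln p(z)] + E_q[ln p(x|z)] + H(q), additive constants dropped.
    e_log_prior = -0.5 * (m**2 + s2)
    e_log_lik = -0.5 * ((x - m) ** 2 + s2)
    entropy = log_s
    return -(e_log_prior + e_log_lik + entropy)

res = minimize(neg_elbo, x0=[0.0, 0.0], method="Nelder-Mead")
m_opt, s_opt = res.x[0], np.exp(res.x[1])
print(f"optimal q:       N({m_opt:.3f}, {s_opt**2:.3f})")
print(f"exact posterior: N({x / 2:.3f}, {0.5:.3f})")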

Main forms

The ELBO has several equivalent forms, each with a different emphasis.

\mathrm{ELBO}(x) = \mathbb{E}_{z\sim q_\phi(\cdot|x)}\!\left[\ln \frac{p_\theta(x,z)}{q_\phi(z|x)}\right]

This form shows that if we sample z \sim q_\phi(\cdot|x), then \ln\frac{p_\theta(x,z)}{q_\phi(z|x)} is an unbiased estimator of the ELBO.

\mathrm{ELBO}(x) = \ln p_\theta(x) - D_{KL}\big(q_\phi(\cdot|x)\,\|\,p_\theta(\cdot|x)\big)

This form shows that the ELBO is a lower bound on the evidence \ln p_\theta(x), and that maximizing the ELBO with respect to \phi is equivalent to minimizing D_{KL}(q_\phi(\cdot|x)\,\|\,p_\theta(\cdot|x)).

\mathrm{ELBO}(x) = \mathbb{E}_{z\sim q_\phi(\cdot|x)}\big[\ln p_\theta(x|z)\big] - D_{KL}\big(q_\phi(\cdot|x)\,\|\,p(\cdot)\big)

This form shows that maximizing the ELBO simultaneously attempts to keep q_\phi(\cdot|x) close to the prior p(\cdot) and to concentrate q_\phi(\cdot|x) on the z that maximize \ln p_\theta(x|z).

That is, the approximate posterior q_\phi(\cdot|x) balances between staying close to the prior p(\cdot) and moving towards the maximum likelihood \arg\max_z \ln p_\theta(x|z).
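This third form is the one commonly used as a training objective for variational autoencoders. A minimal sketch, assuming a diagonal-Gaussian q_phi(z|x), a standard normal prior, and a caller-supplied log-likelihood function (the function names and the identity "decoder" in the usage example are hypothetical):

import numpy as np

rng = np.random.default_rng(0)

def elbo_third_form(x, mu, logvar, log_lik, n_samples=64):
    """Estimate E_{z~q}[ln p_theta(x|z)] - KL(q(.|x) || p(.)) for a
    diagonal-Gaussian q(z|x) = N(mu, diag(exp(logvar))) and a standard
    normal prior p(z) = N(0, I). `log_lik(x, z)` returns ln p_theta(x|z)."""
    std = np.exp(0.5 * logvar)
    z = mu + std * rng.normal(size=(n_samples, mu.size))     # z ~ q(.|x)
    recon = np.mean([log_lik(x, zk) for zk in z])            # reconstruction term
    # Closed-form KL( N(mu, diag(var)) || N(0, I) ), summed over dimensions.
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    return recon - kl

# Tiny usage example with an assumed Gaussian observation model.
def gaussian_log_lik(x, z, sigma=0.5):
    # ln N(x; z, sigma^2 I): the "decoder" here is just the identity map.
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - z) ** 2 / (2 * sigma**2))

x = np.array([0.2, -1.0])
mu, logvar = np.array([0.1, -0.8]), np.array([-1.0, -1.0])
print(f"ELBO (third form) ~ {elbo_third_form(x, mu, logvar, gaussian_log_lik):.4f}")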

Data-processing inequality

Suppose we take N independent samples from p^* and collect them in the dataset D = \{x_1,\dots,x_N\}, with empirical distribution q_D(x) = \frac{1}{N}\sum_i \delta_{x_i}. Fitting p_\theta(x) to q_D(x) by maximum likelihood amounts to minimizing

D_{KL}\big(q_D(x)\,\|\,p_\theta(x)\big) = -\frac{1}{N}\sum_i \ln p_\theta(x_i) - H(q_D).

Bounding each \ln p_\theta(x_i) from below by its ELBO, the right side is in turn bounded by a divergence on the joint space:

D_{KL}\big(q_D(x)\,\|\,p_\theta(x)\big) \le D_{KL}\big(q_D(x)\,q_\phi(z|x)\,\|\,p_\theta(x,z)\big).

This result can be interpreted as a special case of the data processing inequality.

In this interpretation, maximizing the total ELBO \sum_i \mathrm{ELBO}(x_i) is equivalent to minimizing D_{KL}\big(q_D(x)\,q_\phi(z|x)\,\|\,p_\theta(x,z)\big), which upper-bounds the real quantity of interest D_{KL}\big(q_D(x)\,\|\,p_\theta(x)\big) via the data-processing inequality.

That is, we append a latent space to the observable space, paying the price of a weaker inequality for the sake of more computationally efficient minimization of the KL-divergence.
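The finite-alphabet case of this inequality is easy to verify numerically; the sketch below (with arbitrary random joint distributions standing in for q and p_theta, purely as an assumption for illustration) checks that marginalizing out z never increases the KL divergence:

import numpy as np

rng = np.random.default_rng(1)

def random_joint(nx, nz):
    # A random strictly positive joint distribution over (x, z).
    w = rng.random((nx, nz))
    return w / w.sum()

q_joint = random_joint(4, 3)
p_joint = random_joint(4, 3)

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

kl_joint = kl(q_joint, p_joint)
kl_marginal = kl(q_joint.sum(axis=1), p_joint.sum(axis=1))  # marginalize out z

print(f"KL over (x, z): {kl_joint:.4f}")
print(f"KL over x only: {kl_marginal:.4f}  # always <= the joint KL")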