Variational Bayesian methods are primarily used for two purposes: to provide an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables; and to derive a lower bound for the marginal likelihood (the "evidence") of the observed data, which can be used for model selection. For the former purpose (that of approximating a posterior probability), variational Bayes is an alternative to Monte Carlo sampling methods—particularly Markov chain Monte Carlo methods such as Gibbs sampling—for taking a fully Bayesian approach to statistical inference over complex distributions that are difficult to evaluate directly or sample from.
In particular, whereas Monte Carlo techniques provide a numerical approximation to the exact posterior using a set of samples, variational Bayes provides a locally-optimal, exact analytical solution to an approximation of the posterior.
Variational Bayes can be seen as an extension of the expectation–maximization (EM) algorithm from maximum likelihood (ML) or maximum a posteriori (MAP) estimation of the single most probable value of each parameter to fully Bayesian estimation which computes (an approximation to) the entire posterior distribution of the parameters and latent variables.
As in EM, it finds a set of optimal parameter values, and it has the same alternating structure as does EM, based on a set of interlocked (mutually dependent) equations that cannot be solved analytically.
For many applications, variational Bayes produces solutions of comparable accuracy to Gibbs sampling at greater speed.
In variational inference, the posterior distribution over a set of unobserved variables $\mathbf{Z} = \{Z_1, \dots, Z_n\}$ given some data $\mathbf{X}$ is approximated by a so-called variational distribution $Q(\mathbf{Z})$:

$$P(\mathbf{Z} \mid \mathbf{X}) \approx Q(\mathbf{Z}).$$

The distribution $Q(\mathbf{Z})$ is restricted to belong to a family of distributions of simpler form than $P(\mathbf{Z} \mid \mathbf{X})$ (e.g. a family of Gaussian distributions), selected with the intention of making $Q(\mathbf{Z})$ similar to the true posterior. Inference is then performed by choosing the member of this family that minimizes a measure of dissimilarity between $Q$ and $P$.

The most common type of variational Bayes uses the Kullback–Leibler divergence (KL-divergence) of Q from P as the choice of dissimilarity function. This choice makes the minimization tractable. The KL-divergence is defined as

$$D_{\mathrm{KL}}(Q \parallel P) = \sum_{\mathbf{Z}} Q(\mathbf{Z}) \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z} \mid \mathbf{X})}.$$

Since $P(\mathbf{Z} \mid \mathbf{X}) = P(\mathbf{Z}, \mathbf{X}) / P(\mathbf{X})$ and $\sum_{\mathbf{Z}} Q(\mathbf{Z}) = 1$ because Q is a distribution, we have

$$D_{\mathrm{KL}}(Q \parallel P) = \sum_{\mathbf{Z}} Q(\mathbf{Z}) \left[ \log Q(\mathbf{Z}) - \log P(\mathbf{Z}, \mathbf{X}) \right] + \log P(\mathbf{X})$$

which, according to the definition of expected value (for a discrete random variable), can be written as follows

$$D_{\mathrm{KL}}(Q \parallel P) = \mathbb{E}_{Q} \left[ \log Q(\mathbf{Z}) - \log P(\mathbf{Z}, \mathbf{X}) \right] + \log P(\mathbf{X})$$

which can be rearranged to become

$$\log P(\mathbf{X}) = D_{\mathrm{KL}}(Q \parallel P) - \mathbb{E}_{Q} \left[ \log Q(\mathbf{Z}) - \log P(\mathbf{Z}, \mathbf{X}) \right] = D_{\mathrm{KL}}(Q \parallel P) + \mathcal{L}(Q).$$

As the log-evidence $\log P(\mathbf{X})$ is fixed with respect to Q, maximizing the final term $\mathcal{L}(Q)$ (the evidence lower bound, or ELBO) minimizes the KL divergence of Q from P, and the maximized $\mathcal{L}(Q)$ is also a lower bound on the log-evidence.
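As a quick numerical check of this identity, the following toy sketch (with made-up numbers, not part of the derivation above) verifies that the log-evidence equals the KL divergence plus the ELBO for a discrete latent variable:

```python
import numpy as np

# Toy numerical check of the identity above:
#   log P(X) = D_KL(Q || P(Z|X)) + L(Q),
# where L(Q) = E_Q[log P(Z, X) - log Q(Z)] is the evidence lower bound (ELBO).
# The numbers are arbitrary; Z ranges over 4 discrete states.

rng = np.random.default_rng(0)

p_joint = rng.random(4) * 0.2    # P(Z = z, X = x) for the single observed x
p_x = p_joint.sum()              # evidence P(X = x)
p_post = p_joint / p_x           # exact posterior P(Z | X = x)

q = rng.random(4)
q /= q.sum()                     # an arbitrary variational distribution Q(Z)

kl = np.sum(q * (np.log(q) - np.log(p_post)))      # D_KL(Q || P(Z|X))
elbo = np.sum(q * (np.log(p_joint) - np.log(q)))   # L(Q)

print(np.log(p_x), kl + elbo)    # the two values agree up to floating-point error
```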
By the generalized Pythagorean theorem of Bregman divergence, of which KL-divergence is a special case, it can be shown that:[1][2]

$$D_{\mathrm{KL}}(Q \parallel P) \ge D_{\mathrm{KL}}(Q \parallel Q^{*}) + D_{\mathrm{KL}}(Q^{*} \parallel P), \qquad \forall\, Q \in \mathcal{C},$$

where $\mathcal{C}$ is a convex set and the equality holds if

$$Q = Q^{*} \triangleq \arg\min_{Q \in \mathcal{C}} D_{\mathrm{KL}}(Q \parallel P).$$

In this case, the global minimizer $Q^{*}(\mathbf{Z})$ can be constructed factor by factor, as follows. The variational distribution is usually assumed to factorize over some partition of the latent variables, i.e. for some partition of $\mathbf{Z}$ into $\mathbf{Z}_1, \dots, \mathbf{Z}_M$,

$$Q(\mathbf{Z}) = \prod_{i=1}^{M} q_i(\mathbf{Z}_i \mid \mathbf{X}).$$

It can be shown using the calculus of variations (hence the name "variational Bayes") that the "best" distribution $q_j^{*}$ for each of the factors $q_j$ (in terms of the distribution minimizing the KL-divergence, as described above) satisfies

$$q_j^{*}(\mathbf{Z}_j \mid \mathbf{X}) = \frac{\exp\!\left( \mathbb{E}_{i \neq j} [\ln p(\mathbf{Z}, \mathbf{X})] \right)}{\int \exp\!\left( \mathbb{E}_{i \neq j} [\ln p(\mathbf{Z}, \mathbf{X})] \right) d\mathbf{Z}_j},$$

where $\mathbb{E}_{i \neq j} [\ln p(\mathbf{Z}, \mathbf{X})]$ is the expectation of the logarithm of the joint probability of the data and latent variables, taken with respect to all factors other than $q_j$. In practice, we usually work in terms of logarithms:

$$\ln q_j^{*}(\mathbf{Z}_j \mid \mathbf{X}) = \mathbb{E}_{i \neq j} [\ln p(\mathbf{Z}, \mathbf{X})] + \text{constant}.$$

The constant is related to the normalizing constant (the denominator in the expression above for $q_j^{*}$) and is usually reinstated by inspection, as the rest of the expression can usually be recognized as being a known type of distribution (e.g. Gaussian, gamma, etc.).
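For example (an illustrative case, not tied to any particular model in this article), if the update for a scalar latent variable z works out to be quadratic in z with negative leading coefficient, the factor can be read off as a Gaussian without ever computing the normalizing integral:

$$\ln q^{*}(z) = -\tfrac{a}{2} z^{2} + b z + \text{const} = -\tfrac{a}{2}\left( z - \tfrac{b}{a} \right)^{2} + \text{const} \quad\Longrightarrow\quad q^{*}(z) = \mathcal{N}\!\left( z \,\Big|\, \tfrac{b}{a},\ \tfrac{1}{a} \right), \qquad a > 0,$$

i.e. a Gaussian with mean $b/a$ and variance $1/a$.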
This naturally suggests an iterative algorithm, much like EM (the expectation–maximization algorithm), in which the expectations (and possibly higher moments) of the latent variables are initialized in some fashion (perhaps randomly), and then the parameters of each distribution are computed in turn using the current values of the expectations, after which the expectation of the newly computed distribution is set appropriately according to the computed parameters.
In most cases, the other variables' distributions will be from known families, and the formulas for the relevant expectations can be looked up.
As noted, the parameters of the different factors typically depend on one another in a circular fashion; however, as described above, these dependencies suggest a simple iterative algorithm, which in most cases is guaranteed to converge.
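The update loop just described can be sketched generically as follows. This is a minimal, model-agnostic sketch with hypothetical callables; the actual update rules are model-specific and must be derived as above.

```python
from typing import Any, Callable, Dict

# Minimal, model-agnostic sketch of the iterative scheme described above.
# The caller supplies, for each factor, a rule that recomputes that factor's
# parameters from the current expectations (moments) of the other factors,
# and a rule that turns those parameters back into expectations.
# All names here are illustrative, not from any particular library.

def coordinate_ascent_vb(
    init_moments: Dict[str, Any],
    update_params: Dict[str, Callable[[Dict[str, Any]], Any]],
    moments_from_params: Dict[str, Callable[[Any], Any]],
    n_iters: int = 100,
) -> Dict[str, Any]:
    moments = dict(init_moments)          # e.g. randomly initialized expectations
    params: Dict[str, Any] = {}
    for _ in range(n_iters):
        for name, update in update_params.items():
            params[name] = update(moments)                      # uses the other factors' moments
            moments[name] = moments_from_params[name](params[name])
    return params
```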
Consider a simple non-hierarchical Bayesian model consisting of i.i.d. observations $x_1, \dots, x_N$ from a Gaussian distribution with unknown mean $\mu$ and unknown precision $\tau$ (the reciprocal of the variance), with conjugate priors: a Gaussian prior on $\mu$ conditional on $\tau$, and a gamma prior on $\tau$. (From a theoretical standpoint, precision and variance are equivalent since there is a one-to-one correspondence between the two.)

The joint probability of all variables can be rewritten as

$$p(\mathbf{X}, \mu, \tau) = p(\mathbf{X} \mid \mu, \tau)\, p(\mu \mid \tau)\, p(\tau)$$

where the individual factors are

$$p(\mathbf{X} \mid \mu, \tau) = \prod_{n=1}^{N} \sqrt{\frac{\tau}{2\pi}} \exp\!\left( -\frac{\tau}{2} (x_n - \mu)^2 \right), \qquad p(\mu \mid \tau) = \sqrt{\frac{\lambda_0 \tau}{2\pi}} \exp\!\left( -\frac{\lambda_0 \tau}{2} (\mu - \mu_0)^2 \right), \qquad p(\tau) = \frac{b_0^{a_0}}{\Gamma(a_0)}\, \tau^{a_0 - 1} e^{-b_0 \tau},$$

where $\mu_0, \lambda_0, a_0, b_0$ are fixed hyperparameters. Assume that the variational posterior factorizes as $q(\mu, \tau) = q(\mu)\, q(\tau)$.
The derivation of the optimal factor for $\mu$ proceeds from the general formula above:

$$\ln q_\mu^{*}(\mu) = \mathbb{E}_{\tau}\!\left[ \ln p(\mathbf{X} \mid \mu, \tau) + \ln p(\mu \mid \tau) \right] + \text{const} = -\frac{\mathbb{E}[\tau]}{2} \left\{ \lambda_0 (\mu - \mu_0)^2 + \sum_{n=1}^{N} (x_n - \mu)^2 \right\} + \text{const}.$$

With a certain amount of tedious math (expanding the squares inside of the braces, separating out and grouping the terms involving $\mu$ and $\mu^2$, and completing the square over $\mu$), we can derive the parameters of the Gaussian distribution $q_\mu^{*}(\mu) = \mathcal{N}(\mu \mid \mu_N, \lambda_N^{-1})$:

$$\mu_N = \frac{\lambda_0 \mu_0 + N \bar{x}}{\lambda_0 + N}, \qquad \lambda_N = (\lambda_0 + N)\, \mathbb{E}[\tau],$$

where $\bar{x} = \frac{1}{N}\sum_{n} x_n$. Note that all of the above steps can be shortened by using the formula for the sum of two quadratics.
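One common statement of that identity (for constants $a, c > 0$) is

$$a(x - b)^2 + c(x - d)^2 = (a + c)\left( x - \frac{a b + c d}{a + c} \right)^2 + \frac{a c}{a + c}\,(b - d)^2,$$

so that a sum of two quadratics in x is again a single quadratic in x plus a constant, from which the mean and precision can be read off directly.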
The derivation of the optimal factor for $\tau$ takes more work:

$$\ln q_\tau^{*}(\tau) = \mathbb{E}_{\mu}\!\left[ \ln p(\mathbf{X} \mid \mu, \tau) + \ln p(\mu \mid \tau) \right] + \ln p(\tau) + \text{const} = (a_0 - 1)\ln\tau - b_0\tau + \frac{N+1}{2}\ln\tau - \frac{\tau}{2}\, \mathbb{E}_{\mu}\!\left[ \lambda_0 (\mu - \mu_0)^2 + \sum_{n=1}^{N} (x_n - \mu)^2 \right] + \text{const}.$$

Exponentiating both sides shows that $q_\tau^{*}(\tau)$ is a gamma distribution with parameters

$$a_N = a_0 + \frac{N+1}{2}, \qquad b_N = b_0 + \frac{1}{2}\, \mathbb{E}_{\mu}\!\left[ \lambda_0 (\mu - \mu_0)^2 + \sum_{n=1}^{N} (x_n - \mu)^2 \right].$$

Using the standard moments $\mathbb{E}[\tau] = a_N / b_N$, $\mathbb{E}[\mu] = \mu_N$ and $\mathbb{E}[\mu^2] = \mu_N^2 + 1/\lambda_N$, we can then write the parameter equations as follows, without any expectations:

$$\lambda_N = (\lambda_0 + N)\,\frac{a_N}{b_N}, \qquad b_N = b_0 + \frac{1}{2}\left[ \lambda_0\!\left( \mu_N^2 + \frac{1}{\lambda_N} - 2\mu_0 \mu_N + \mu_0^2 \right) + \sum_{n=1}^{N} x_n^2 - 2\mu_N \sum_{n=1}^{N} x_n + N\!\left( \mu_N^2 + \frac{1}{\lambda_N} \right) \right],$$

with $\mu_N$ and $a_N$ as given above. Note that there are circular dependencies among the formulas for $\lambda_N$ and $b_N$: each depends on moments computed from the other factor's current parameters.
This naturally suggests an EM-like algorithm: compute $\sum_n x_n$ and $\sum_n x_n^2$ once, initialize $\lambda_N$ (equivalently, $\mathbb{E}[\tau]$) to some arbitrary value, and then alternate between recomputing $b_N$ from the current moments of $\mu$ and recomputing $\lambda_N$ from the new $b_N$ (note that $\mu_N$ and $a_N$ never change), until the values converge. We then have values for the hyperparameters of the approximating distributions of the posterior parameters, which we can use to compute any properties we want of the posterior — e.g. its mean and variance, a 95% highest-density region (the smallest interval that includes 95% of the total probability), etc.
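A compact numerical sketch of this iteration, following the update equations above (the function name, default hyperparameter values, and fixed iteration count are illustrative choices, not prescribed by the derivation):

```python
import numpy as np

# Sketch of the EM-like iteration for the Gaussian model with unknown mean (mu)
# and precision (tau), using the update equations derived in this section.

def vb_gaussian(x, mu0=0.0, lambda0=1.0, a0=1.0, b0=1.0, n_iters=50):
    x = np.asarray(x, dtype=float)
    N, s1, s2 = len(x), x.sum(), (x ** 2).sum()

    # These two never change during the iteration.
    mu_N = (lambda0 * mu0 + s1) / (lambda0 + N)
    a_N = a0 + (N + 1) / 2.0

    lambda_N = 1.0                           # arbitrary initialization
    for _ in range(n_iters):
        e_mu = mu_N                          # E[mu]
        e_mu2 = mu_N ** 2 + 1.0 / lambda_N   # E[mu^2]
        b_N = b0 + 0.5 * (
            lambda0 * (e_mu2 - 2 * mu0 * e_mu + mu0 ** 2)
            + s2 - 2 * e_mu * s1 + N * e_mu2
        )
        lambda_N = (lambda0 + N) * a_N / b_N   # uses E[tau] = a_N / b_N
    return mu_N, lambda_N, a_N, b_N

# q(mu) = N(mu_N, 1/lambda_N) and q(tau) = Gamma(a_N, rate=b_N); for example:
mu_N, lambda_N, a_N, b_N = vb_gaussian(np.random.default_rng(1).normal(2.0, 0.5, size=200))
print(mu_N, a_N / b_N)   # approximate posterior means of mu and tau
```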
The above example shows the method by which the variational-Bayesian approximation to a posterior probability density in a given Bayesian network is derived. Due to all of the mathematical manipulations involved, it is easy to lose track of the big picture: the goal throughout is an analytical approximation to the entire posterior, built from mutually dependent factor distributions that are updated iteratively until convergence.

Variational Bayes (VB) is often compared with expectation–maximization (EM). The numerical procedures are quite similar, in that both are alternating iterative procedures that successively converge on optimum parameter values.
The initial steps to derive the respective procedures are also vaguely similar, both starting out with formulas for probability densities and both involving significant amounts of mathematical manipulations.
In the more complex example of a Bayesian Gaussian mixture model, the same machinery applies. Grouping and reading off the terms involving the mixture weights, their optimal factor is recognized as a Dirichlet distribution; grouping and reading off the terms involving each component's mean and precision, the result is a Gaussian-Wishart distribution, with hyperparameters given by the conjugate updates computed from responsibility-weighted counts and sufficient statistics of the data. Finally, notice that these factors require the values of the expectations of the latent component assignments (the responsibilities). Now that we have determined the distributions over which these expectations are taken, we can derive formulas for them; the results are proportional quantities, one per data point and mixture component, which can be converted from proportional to absolute values by normalizing over the mixture components so that they sum to one for each data point.
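The normalization step can be carried out stably in log space; the following is a generic sketch, not tied to the specific expressions of this example:

```python
import numpy as np

# Generic sketch of the normalization step: the variational updates yield the
# responsibilities only up to a constant, typically as unnormalized log values
# log_rho[n, k]; normalizing over the mixture components k (via log-sum-exp for
# numerical stability) converts them to probabilities that sum to 1 per point.

def normalize_responsibilities(log_rho):
    log_rho = np.asarray(log_rho, dtype=float)
    log_norm = np.logaddexp.reduce(log_rho, axis=1, keepdims=True)
    return np.exp(log_rho - log_norm)     # each row now sums to 1

r = normalize_responsibilities([[-1.0, -2.0, -0.5],
                                [-3.0, -0.1, -4.0]])
print(r.sum(axis=1))   # [1. 1.]
```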
Note that the hyperparameters of these factors depend on the responsibilities, and the responsibilities in turn depend on the current hyperparameters. This suggests an iterative procedure that alternates between two steps: an E-like step that computes the responsibilities from the current values of all other parameters, and an M-like step that uses the new responsibilities to recompute the hyperparameters of the other factors. Note that these steps correspond closely with the standard EM algorithm used to derive a maximum likelihood or maximum a posteriori (MAP) solution for the parameters of a Gaussian mixture model.
The responsibilities computed in the E step correspond closely to the posterior probabilities of the latent component assignments given the data, i.e. p(Z | X); likewise, each optimal variational factor turned out to have the same functional form as the corresponding conjugate prior.
This is a general result that holds true for all prior distributions derived from the exponential family.
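A brief sketch of why this holds, using a generic exponential-family parameterization rather than the notation of this article: if the complete-data likelihood is in the exponential family and the prior is conjugate to it, the mean-field update is linear in the same sufficient statistics, so the optimal factor stays in the prior's family with updated hyperparameters:

$$p(x, z \mid \theta) = h(x, z)\, g(\theta)\, e^{\theta^{\mathsf{T}} u(x, z)}, \qquad p(\theta) \propto g(\theta)^{\nu_0}\, e^{\theta^{\mathsf{T}} \chi_0},$$

$$\ln q^{*}(\theta) = \ln p(\theta) + \sum_{n=1}^{N} \mathbb{E}_{q(z_n)}\!\left[ \ln p(x_n, z_n \mid \theta) \right] + \text{const} = (\nu_0 + N) \ln g(\theta) + \theta^{\mathsf{T}}\!\left( \chi_0 + \sum_{n=1}^{N} \mathbb{E}\!\left[ u(x_n, z_n) \right] \right) + \text{const},$$

i.e. $q^{*}(\theta)$ has the same functional form as the prior, with hyperparameters $\nu_0 + N$ and $\chi_0 + \sum_n \mathbb{E}[u(x_n, z_n)]$.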