In Bayesian statistics, the posterior predictive distribution is the distribution of possible unobserved values conditional on the observed values.
Given observed data $\mathbf{X}$ from a model with parameter $\theta$, it may seem tempting to predict a new value $\tilde{x}$ by plugging a single best estimate $\hat{\theta}$ into the sampling distribution, but this ignores uncertainty about $\theta$, and because a source of uncertainty is ignored, the predictive distribution will be too narrow. Put another way, predictions of extreme values of $\tilde{x}$ will have a lower probability than if the uncertainty in the parameters, as given by their posterior distribution, is accounted for.
A posterior predictive distribution accounts for uncertainty about $\theta$: it is obtained by marginalizing the distribution of $\tilde{x}$ given $\theta$ over the posterior distribution of $\theta$ given the observed data $\mathbf{X}$,

$$p(\tilde{x} \mid \mathbf{X}) = \int p(\tilde{x} \mid \theta)\, p(\theta \mid \mathbf{X})\, d\theta .$$

Because it accounts for uncertainty about $\theta$, the posterior predictive distribution will in general be wider than a predictive distribution which plugs in a single best estimate for $\theta$.
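As an illustration, consider a Beta-Bernoulli model, for which the posterior predictive is available in closed form. The following sketch (a minimal example with invented data and hyperparameters, not taken from the text above) compares the plug-in predictive with the posterior predictive, computing the latter both analytically and by Monte Carlo averaging over posterior draws.

```python
import numpy as np
from scipy import stats

# Hypothetical observed coin flips and a Beta(a, b) prior (illustrative values).
x = np.array([1, 1, 0, 1, 0, 1, 1, 1])
a, b = 1.0, 1.0                      # uniform prior on the success probability theta
N, k = len(x), int(x.sum())

# Posterior over theta is Beta(a + k, b + N - k) by conjugacy.
a_post, b_post = a + k, b + N - k

# Plug-in predictive: use a single point estimate (the posterior mean) for theta.
theta_hat = a_post / (a_post + b_post)
p_new_plugin = theta_hat

# Posterior predictive: marginalize over the posterior of theta.
# Analytically, P(x_new = 1 | X) = E[theta | X]; here it is also approximated
# by Monte Carlo to make the marginalization explicit.
theta_draws = stats.beta(a_post, b_post).rvs(size=100_000, random_state=0)
p_new_mc = theta_draws.mean()
print(p_new_plugin, p_new_mc)   # agree for the predictive mean of a single Bernoulli trial

# The difference shows up in the spread: predicting counts out of m new trials,
# the posterior predictive (beta-binomial) is wider than the plug-in binomial.
m = 10
var_plugin = m * theta_hat * (1 - theta_hat)          # binomial variance with theta_hat
var_pp = stats.betabinom(m, a_post, b_post).var()     # beta-binomial variance
print(var_plugin, var_pp)
```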
The prior predictive distribution, by contrast, marginalizes over the prior rather than the posterior. For example, the Student's t-distribution can be defined as the prior predictive distribution of a normal distribution with known mean $\mu$ but unknown variance $\sigma_x^2$, with a conjugate scaled-inverse-chi-squared prior placed on $\sigma_x^2$, with hyperparameters $\nu$ and $\sigma^2$.
The resulting compound distribution, $t_{\nu}(\tilde{x} \mid \mu, \sigma^2)$, is indeed a non-standardized Student's t-distribution, and follows one of the two most common parameterizations of this distribution.
Then, the corresponding posterior predictive distribution would again be Student's t, with the updated hyperparameters $\nu'$ and ${\sigma^2}'$ that appear in the posterior distribution of $\sigma_x^2$ also appearing directly in the posterior predictive distribution, i.e. $t_{\nu'}(\tilde{x} \mid \mu, {\sigma^2}')$.
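A quick numerical check of this compound form is sketched below (with made-up hyperparameter values; the scaled-inverse-chi-squared prior is sampled via the equivalent inverse-gamma form): drawing a variance from the prior, then a normal value with that variance, yields samples whose distribution matches the non-standardized Student's t.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative (made-up) values: known mean mu, prior hyperparameters nu and sigma2.
mu, nu, sigma2 = 0.0, 5.0, 2.0

# Scaled-Inv-chi^2(nu, sigma2) is the same as Inv-Gamma(shape=nu/2, scale=nu*sigma2/2).
alpha, beta = nu / 2.0, nu * sigma2 / 2.0

# Simulate the compound (prior predictive) distribution:
# draw a variance from the prior, then a normal observation with that variance.
var_draws = stats.invgamma(alpha, scale=beta).rvs(size=200_000, random_state=1)
x_draws = rng.normal(loc=mu, scale=np.sqrt(var_draws))

# Compare with the non-standardized Student's t: nu d.o.f., location mu, scale sqrt(sigma2).
grid = np.linspace(-6.0, 6.0, 7)
emp_cdf = np.array([(x_draws <= g).mean() for g in grid])
t_cdf = stats.t(df=nu, loc=mu, scale=np.sqrt(sigma2)).cdf(grid)
print(np.max(np.abs(emp_cdf - t_cdf)))   # small, confirming the Student-t form
```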
In some cases the appropriate compound distribution is defined using a different parameterization from the one that is most natural for the predictive distribution in the problem at hand.
For example, as indicated above, the Student's t-distribution was defined in terms of a scaled-inverse-chi-squared distribution placed on the variance.
However, it is more common to use an inverse gamma distribution as the conjugate prior in this situation.
The two are in fact equivalent except for parameterization; hence, the Student's t-distribution can still be used for either predictive distribution, but the hyperparameters must be reparameterized before being plugged in.
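Concretely, a scaled-inverse-chi-squared distribution with hyperparameters $\nu$ and $\sigma^2$ is the same distribution as an inverse gamma with shape $\alpha = \nu/2$ and scale $\beta = \nu\sigma^2/2$, so inverse-gamma hyperparameters can be translated before being plugged into the Student's t form. A minimal sketch with illustrative values:

```python
import numpy as np
from scipy import stats

# Hypothetical inverse-gamma hyperparameters for the variance (illustrative values).
alpha, beta = 3.0, 4.0

# Equivalent scaled-inverse-chi-squared hyperparameters:
#   nu = 2 * alpha,   sigma2 = beta / alpha
nu, sigma2 = 2.0 * alpha, beta / alpha

# The two parameterizations give the same density for the variance ...
v = np.linspace(0.1, 10.0, 50)
invgamma_pdf = stats.invgamma(alpha, scale=beta).pdf(v)
scaled_inv_chi2_pdf = stats.invgamma(nu / 2.0, scale=nu * sigma2 / 2.0).pdf(v)
assert np.allclose(invgamma_pdf, scaled_inv_chi2_pdf)

# ... so the Student's t predictive can be written with the reparameterized values
# (taking the known mean to be 0 purely for illustration).
predictive = stats.t(df=nu, loc=0.0, scale=np.sqrt(sigma2))
print(predictive.pdf(0.0))
```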
Exponential families have a large number of useful properties. One of these is that all members have conjugate prior distributions. Another is that the probability density function of the compound distribution corresponding to the prior predictive distribution of an exponential-family distribution, marginalized over its conjugate prior, can be determined analytically. Suppose $F(x \mid \boldsymbol\theta)$ is a member of the exponential family, written in terms of its natural parameter $\boldsymbol\eta = \boldsymbol\eta(\boldsymbol\theta)$ as

$$p_F(x \mid \boldsymbol\eta) = h(x)\, g(\boldsymbol\eta)\, e^{\boldsymbol\eta^{\mathsf T} \mathbf{T}(x)},$$

and that $G(\boldsymbol\eta \mid \boldsymbol\chi, \nu)$ is the corresponding conjugate prior,

$$p_G(\boldsymbol\eta \mid \boldsymbol\chi, \nu) = f(\boldsymbol\chi, \nu)\, g(\boldsymbol\eta)^{\nu}\, e^{\boldsymbol\eta^{\mathsf T} \boldsymbol\chi}.$$

Then the prior predictive distribution $H$ (the result of compounding $F$ with $G$) is

$$
\begin{aligned}
p_H(x \mid \boldsymbol\chi, \nu)
&= \int p_F(x \mid \boldsymbol\eta)\, p_G(\boldsymbol\eta \mid \boldsymbol\chi, \nu)\, d\boldsymbol\eta \\
&= h(x)\, f(\boldsymbol\chi, \nu) \int g(\boldsymbol\eta)^{\nu+1}\, e^{\boldsymbol\eta^{\mathsf T} (\boldsymbol\chi + \mathbf{T}(x))}\, d\boldsymbol\eta \\
&= \frac{h(x)\, f(\boldsymbol\chi, \nu)}{f\!\left(\boldsymbol\chi + \mathbf{T}(x),\, \nu + 1\right)}.
\end{aligned}
$$

The last line follows from the previous one by recognizing that the function inside the integral is the density function of a random variable distributed as $G(\boldsymbol\eta \mid \boldsymbol\chi + \mathbf{T}(x), \nu + 1)$, except for the normalizing function $f(\cdot,\cdot)$. Hence the result of the integration will be the reciprocal of the normalizing function. The result is independent of the parametrization chosen for $\boldsymbol\theta$, since none of $\boldsymbol\theta$, $\boldsymbol\eta$ or $g(\boldsymbol\eta)$ appears in it. ($g(\boldsymbol\eta)$ is a function of the parameter and hence will assume different forms depending on the choice of parametrization.)
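As a concrete instance (a standard special case, not spelled out in the text above), take a Bernoulli likelihood with a conjugate beta prior written in natural-parameter form: $\eta = \log\frac{p}{1-p}$, $h(x) = 1$, $T(x) = x$, $g(\eta) = (1 + e^{\eta})^{-1} = 1 - p$, and the conjugate prior on $\eta$ (equivalent to a $\mathrm{Beta}(\chi, \nu - \chi)$ prior on $p$) has normalizer $f(\chi, \nu) = 1 / B(\chi, \nu - \chi)$. Plugging these pieces into the formula gives

$$p_H(x \mid \chi, \nu) = \frac{h(x)\, f(\chi, \nu)}{f(\chi + T(x),\, \nu + 1)} = \frac{B(\chi + x,\; \nu + 1 - \chi - x)}{B(\chi,\; \nu - \chi)}, \qquad x \in \{0, 1\},$$

so that $p_H(1 \mid \chi, \nu) = \chi/\nu$, the prior mean of $p$ under $\mathrm{Beta}(\chi, \nu - \chi)$, as expected.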
The reason the integral is tractable is that it amounts to computing the normalization constant of a density defined by the product of a prior distribution and a likelihood; when the prior is conjugate to the likelihood, this product is, up to normalization, a posterior distribution from a known family, so its normalization constant is available in closed form.
As shown above, the density function of the compound distribution follows a particular form, consisting of the product of the function $h(x)$ that forms part of the density function for $F$ with the quotient of two forms of the normalizing "constant" for $G$, one derived from the prior hyperparameters and the other from the updated (posterior) hyperparameters.
The beta-binomial distribution is a good example of how this process works.
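A minimal sketch of that example (the numbers are illustrative): the beta-binomial probability mass function is exactly the binomial factor $h(x) = \binom{n}{x}$ times the ratio of the prior and posterior beta normalizing constants, and it matches the compound distribution obtained by integrating the binomial likelihood over a beta prior.

```python
import numpy as np
from scipy import stats
from scipy.special import betaln, comb
from scipy.integrate import quad

# Illustrative hyperparameters and trial count.
a, b, n = 2.0, 3.0, 10
x = np.arange(n + 1)

# Compound form: h(x) * f(prior) / f(posterior) == C(n, x) * B(a + x, b + n - x) / B(a, b)
pmf_ratio = comb(n, x) * np.exp(betaln(a + x, b + n - x) - betaln(a, b))

# Same thing via scipy's beta-binomial distribution.
pmf_scipy = stats.betabinom(n, a, b).pmf(x)

# And via direct numerical integration of Binomial(x | n, p) over the Beta(a, b) prior.
pmf_numeric = np.array([
    quad(lambda p, k=k: stats.binom.pmf(k, n, p) * stats.beta.pdf(p, a, b), 0, 1)[0]
    for k in x
])

assert np.allclose(pmf_ratio, pmf_scipy) and np.allclose(pmf_ratio, pmf_numeric, atol=1e-8)
```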
Despite the analytical tractability of such distributions, they are in themselves usually not members of the exponential family. An exponential-family density must separate into multiplicative factors that contain only the variable, only the parameters, or whose logarithm factorizes between variable and parameters; the dependence of the compound density on $f(\boldsymbol\chi + \mathbf{T}(x),\, \nu + 1)$ makes this impossible unless $f(\cdot,\cdot)$ either ignores the corresponding argument entirely or uses it only in the exponent of an expression.
When a conjugate prior is being used, the posterior predictive distribution belongs to the same family as the prior predictive distribution, and is determined simply by plugging the updated hyperparameters for the posterior distribution of the parameter(s) into the formula for the prior predictive distribution.
Using the general form of the posterior update equations for exponential-family distributions (see the appropriate section in the exponential family article), we can write out an explicit formula for the posterior predictive distribution:

$$p(\tilde{x} \mid \mathbf{X}, \boldsymbol\chi, \nu) = p_H\!\left(\tilde{x} \;\middle|\; \boldsymbol\chi + \mathbf{T}(\mathbf{X}),\, \nu + N\right),$$

where

$$\mathbf{T}(\mathbf{X}) = \sum_{i=1}^{N} \mathbf{T}(x_i).$$

This shows that the posterior predictive distribution of a series of observations, in the case where the observations follow an exponential family with the appropriate conjugate prior, has the same probability density as the compound distribution, with parameters as specified above. The observations themselves enter only through the quantity $\mathbf{T}(\mathbf{X})$.
This is termed the sufficient statistic of the observations, because it tells us everything we need to know about the observations in order to compute a posterior or posterior predictive distribution based on them (or, for that matter, anything else based on the likelihood of the observations, such as the marginal likelihood).
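For example, with a gamma-Poisson model (a sketch with made-up data; the gamma is parameterized by shape and rate), the sufficient statistic is just the sum of the counts: the posterior predictive is obtained by adding $\sum_i x_i$ and $N$ to the prior hyperparameters and reading off the resulting negative binomial compound distribution.

```python
import numpy as np
from scipy import stats

# Made-up Poisson counts and a Gamma(shape=chi, rate=nu) prior on the rate lambda.
x = np.array([3, 1, 4, 2, 5])
chi, nu = 2.0, 1.0

# Sufficient statistic of the observations and sample size.
T_X, N = x.sum(), len(x)

# Conjugate update: the posterior is Gamma(chi + T_X, rate = nu + N).
chi_post, nu_post = chi + T_X, nu + N

# Plug the updated hyperparameters into the prior predictive (a negative binomial):
# Gamma(shape=c, rate=r) compounded with Poisson gives NegBin(n=c, p=r/(r+1)).
prior_predictive = stats.nbinom(chi, nu / (nu + 1.0))
posterior_predictive = stats.nbinom(chi_post, nu_post / (nu_post + 1.0))

print(prior_predictive.mean(), posterior_predictive.mean())
# The posterior predictive mean equals (chi + sum(x)) / (nu + N), the posterior mean of lambda.
```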
It is also possible to compound the joint distribution of a fixed number of independent, identically distributed observations with a prior distribution over their shared parameter. In a Bayesian setting, this comes up in various contexts: computing the prior or posterior predictive distribution of multiple new observations, and computing the marginal likelihood of observed data (the denominator in Bayes' law).
When the observations come from an exponential family and the prior is conjugate, the resulting compound distribution remains tractable. It is easy to show, in fact, that the joint compound distribution of a set $\mathbf{X} = \{x_1, \dots, x_N\}$ of $N$ observations is

$$p_H(\mathbf{X} \mid \boldsymbol\chi, \nu) = \left( \prod_{i=1}^{N} h(x_i) \right) \frac{f(\boldsymbol\chi, \nu)}{f\!\left(\boldsymbol\chi + \mathbf{T}(\mathbf{X}),\, \nu + N\right)}.$$

This result and the above result for a single compound distribution extend trivially to the case of a distribution over a vector-valued observation, such as a multivariate Gaussian distribution.
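In the Beta-Bernoulli case, for instance, this joint compound form gives the marginal likelihood in closed form as a ratio of beta functions; the sketch below (with illustrative data) checks it against direct numerical integration.

```python
import numpy as np
from scipy import stats
from scipy.special import betaln
from scipy.integrate import quad

# Illustrative binary observations and Beta(a, b) prior hyperparameters.
x = np.array([1, 0, 1, 1, 0, 1])
a, b = 2.0, 2.0
N, k = len(x), int(x.sum())

# Joint compound (marginal likelihood): here h(x_i) = 1 for each Bernoulli observation,
# so p(X) = B(a + k, b + N - k) / B(a, b).
log_ml_closed = betaln(a + k, b + N - k) - betaln(a, b)

# The same quantity by integrating the likelihood over the prior numerically.
ml_numeric = quad(
    lambda p: p**k * (1 - p)**(N - k) * stats.beta.pdf(p, a, b), 0, 1
)[0]

assert np.isclose(np.exp(log_ml_closed), ml_numeric)
print(np.exp(log_ml_closed))
```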
Compounding is also what happens when a node is "collapsed out" in a collapsed Gibbs sampler. As a result, when a set of independent identically distributed (i.i.d.) nodes all depend on the same prior-distribution node and that node is collapsed out, the conditional distribution of one node given the others (and given the parents of the collapsed-out node, but not conditioning on any of its other children) is the same as the posterior predictive distribution of that node given all the remaining, formerly i.i.d., nodes.
That is, it is generally possible to implement collapsing out of a node simply by attaching all parents of the node directly to all of its children, and replacing the former conditional probability distribution associated with each child by the corresponding posterior predictive distribution for that child conditioned on its parents and on the other formerly i.i.d. nodes that were also children of the removed node.
For an example, for more specific discussion and for some cautions about certain tricky issues, see the Dirichlet-multinomial distribution article.
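As a rough sketch of what this looks like in practice (not the Dirichlet-multinomial setup from that article, but a simpler two-component beta-binomial mixture with invented data), a collapsed Gibbs sampler resamples each assignment from the posterior predictive of its observation given the other observations currently assigned to each component, with the mixture weights likewise collapsed out:

```python
import numpy as np
from scipy.stats import betabinom

rng = np.random.default_rng(0)

# Invented data: each observation is a count of successes out of m trials.
m = 20
x = np.array([2, 3, 1, 4, 17, 18, 16, 15, 3, 19])
N, K = len(x), 2
a, b, alpha = 1.0, 1.0, 1.0        # Beta prior per component, symmetric Dirichlet on weights

z = rng.integers(K, size=N)        # initial component assignments

for sweep in range(200):
    for i in range(N):
        z[i] = -1                  # remove observation i from the component statistics
        probs = np.empty(K)
        for k in range(K):
            members = (z == k)
            n_k = members.sum()
            s_k = x[members].sum()         # successes in component k, excluding i
            f_k = n_k * m - s_k            # failures in component k, excluding i
            # Collapsed weights (Dirichlet-multinomial factor) times the posterior
            # predictive (beta-binomial) of x[i] given the other members of component k.
            probs[k] = (n_k + alpha) * betabinom.pmf(x[i], m, a + s_k, b + f_k)
        z[i] = rng.choice(K, p=probs / probs.sum())

print(z)   # after burn-in, low counts and high counts tend to land in separate components
```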