A conjugate prior is an algebraic convenience, giving a closed-form expression for the posterior; otherwise, numerical integration may be necessary.
The concept, as well as the term "conjugate prior", was introduced by Howard Raiffa and Robert Schlaifer in their work on Bayesian decision theory.[1] A similar concept had been discovered independently by George Alfred Barnard.
For example, consider a random variable which consists of the number of successes s in n Bernoulli trials with unknown probability of success q in [0, 1]. This random variable will follow the binomial distribution, with a probability mass function of the form

    p(s) = \binom{n}{s} q^s (1 - q)^{n - s}.

The usual conjugate prior is the beta distribution with parameters (α, β):

    p(q) = \frac{q^{\alpha - 1} (1 - q)^{\beta - 1}}{B(\alpha, \beta)},

where α and β are chosen to reflect any existing belief or information (α = 1 and β = 1 would give a uniform distribution) and B(α, β) is the Beta function acting as a normalising constant.
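As a concrete illustration, here is a minimal Python sketch that evaluates these two densities; the trial count, success count, and hyperparameter values plugged in are arbitrary choices for the example, not taken from the text above.

    import math

    def binomial_pmf(s, n, q):
        # p(s) = C(n, s) * q**s * (1 - q)**(n - s)
        return math.comb(n, s) * q**s * (1 - q)**(n - s)

    def beta_pdf(q, alpha, beta):
        # Beta function B(alpha, beta) = Gamma(alpha) * Gamma(beta) / Gamma(alpha + beta)
        b = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
        return q**(alpha - 1) * (1 - q)**(beta - 1) / b

    # Arbitrary example values:
    print(binomial_pmf(7, 10, 0.6))  # likelihood of 7 successes in 10 trials at q = 0.6
    print(beta_pdf(0.6, 2.0, 2.0))   # Beta(2, 2) prior density at q = 0.6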
A typical characteristic of conjugate priors is that the dimensionality of the hyperparameters is one greater than that of the parameters of the original distribution.
(See the general article on the exponential family, and also consider the Wishart distribution, conjugate prior of the covariance matrix of a multivariate normal distribution, for an example where a large dimensionality is involved.)
If we sample this random variable and obtain s successes and f = n − s failures, the posterior for q is another beta distribution, with parameters (α + s, β + f). This posterior distribution could then be used as the prior for more samples, with the hyperparameters simply adding each extra piece of information as it comes.
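A minimal sketch of this update rule in Python (the prior hyperparameters and the observed counts below are assumed values chosen only for illustration):

    # Conjugate update for a binomial likelihood with a Beta(alpha, beta) prior:
    # observing s successes and f failures yields a Beta(alpha + s, beta + f) posterior.
    def update_beta(alpha, beta, successes, failures):
        return alpha + successes, beta + failures

    # Assumed example: a Beta(2, 2) prior and 7 successes out of 10 trials.
    alpha_post, beta_post = update_beta(2.0, 2.0, 7, 3)
    print(alpha_post, beta_post)  # 9.0 5.0

Applying update_beta again to a second batch of observations gives the same result as pooling both batches at once, which is what makes the repeated use of the posterior as a new prior consistent.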
It is often useful to think of the hyperparameters of a conjugate prior distribution as corresponding to having observed a certain number of pseudo-observations with properties specified by the parameters. For example, the values α and β of a beta distribution can be thought of as corresponding to α − 1 successes and β − 1 failures if the posterior mode is used to choose an optimal parameter setting, or α successes and β failures if the posterior mean is used to choose an optimal parameter setting.
In general, for nearly all conjugate prior distributions, the hyperparameters can be interpreted in terms of pseudo-observations.
This can help provide intuition behind the often messy update equations and help choose reasonable hyperparameters for a prior.
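Continuing the hypothetical beta example above, the sketch below shows how the posterior mean and mode can be read off as if α and β counted pseudo-successes and pseudo-failures:

    def beta_mean(alpha, beta):
        # alpha "successes" out of alpha + beta pseudo-observations
        return alpha / (alpha + beta)

    def beta_mode(alpha, beta):
        # alpha - 1 successes and beta - 1 failures (valid for alpha, beta > 1)
        return (alpha - 1) / (alpha + beta - 2)

    # Using the assumed posterior Beta(9, 5) from the previous sketch:
    print(beta_mean(9, 5))  # 0.642857...
    print(beta_mode(9, 5))  # 0.666666...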
One can think of conditioning on conjugate priors as defining a kind of (discrete time) dynamical system: from a given set of hyperparameters, incoming data updates these hyperparameters, so one can see the change in hyperparameters as a kind of "time evolution" of the system, corresponding to "learning".
Starting at different points yields different flows over time.
This is again analogous to the dynamical system defined by a linear operator, but note that since different samples lead to different inferences, this is not simply dependent on time but rather on data over time.
For related approaches, see Recursive Bayesian estimation and Data assimilation.
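A rough sketch of this "time evolution" view, again using the hypothetical beta-binomial example and made-up batches of observations:

    # The hyperparameter pair (alpha, beta) acts as the state of a discrete-time
    # dynamical system; each incoming batch of (successes, failures) advances it.
    state = (1.0, 1.0)  # assumed starting prior: Beta(1, 1), i.e. uniform

    batches = [(3, 1), (0, 2), (5, 4)]  # made-up observation batches

    for successes, failures in batches:
        alpha, beta = state
        state = (alpha + successes, beta + failures)
        print(state)  # (4.0, 2.0), then (4.0, 4.0), then (9.0, 8.0)

Starting from a different prior (a different initial state) produces a different trajectory of hyperparameters, as described above.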
Suppose a rental car service operates in your city.
Drivers can drop off and pick up cars anywhere inside the city limits.
Over three days you look at the app and record the number of cars within a short distance of your home address on each day.
If we assume these counts come from a Poisson distribution, we can compute the maximum likelihood estimate of the model's rate parameter λ, which is simply the sample mean of the observed counts. Using this maximum likelihood estimate, we can compute the probability that there will be at least one car available on a given day:

    P(x > 0 \mid \hat{\lambda}) = 1 - P(x = 0 \mid \hat{\lambda}) = 1 - e^{-\hat{\lambda}}.
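A short Python sketch of this plug-in calculation; the daily counts used here are made-up placeholders:

    import math

    counts = [3, 4, 1]  # placeholder daily car counts

    # Maximum likelihood estimate of the Poisson rate is the sample mean.
    lam_mle = sum(counts) / len(counts)

    # Probability of at least one available car under the fitted Poisson model.
    p_at_least_one = 1 - math.exp(-lam_mle)
    print(lam_mle, p_at_least_one)  # about 2.67 and 0.93 for these placeholder counts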
But the data could also have come from another Poisson distribution with a different rate; in fact, there are infinitely many Poisson distributions that could have generated the observed data. With only a few data points, we should be quite uncertain about which exact Poisson distribution generated them. Intuitively, we should instead take a weighted average of the probability of at least one car being available under each possible rate, weighted by how likely each rate is given the observed data; this weighted average is the posterior predictive distribution. Because the gamma distribution is the conjugate prior for the Poisson rate, this average has a closed form once we place a gamma prior on the rate, with hyperparameters chosen to reflect a reasonable prior for the average number of cars.
The resulting posterior predictive estimate is much more conservative than the maximum likelihood plug-in estimate, because it reflects the uncertainty in the model parameters, which the posterior predictive takes into account.
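A sketch of this computation, assuming a gamma prior on the Poisson rate (the prior hyperparameters and the counts are again placeholders): the posterior is Gamma(α + Σx, β + n) in the shape–rate parameterization, and the posterior predictive for a new count is a negative binomial distribution, whose probability of zero gives the chance of finding no car.

    counts = [3, 4, 1]          # placeholder observed daily car counts
    alpha0, beta0 = 2.0, 2.0    # assumed Gamma(shape, rate) prior on the Poisson rate

    # Conjugate update: Gamma(alpha0 + sum(x), beta0 + n) posterior over the rate.
    alpha_post = alpha0 + sum(counts)
    beta_post = beta0 + len(counts)

    # The posterior predictive of a new count is negative binomial; its probability
    # of zero is (beta_post / (beta_post + 1)) ** alpha_post.
    p_zero = (beta_post / (beta_post + 1.0)) ** alpha_post
    print(1 - p_zero)  # about 0.84 here, versus about 0.93 for the plug-in estimate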
In all cases below, the data is assumed to consist of n points x_1, …, x_n (which will be random vectors in the multivariate cases).