Exponential family

This special form is chosen for mathematical convenience, since useful algebraic properties allow expectations and covariances to be calculated by differentiation, and for generality, since exponential families are in a sense very natural sets of distributions to consider.

Sometimes loosely referred to as the exponential family, this class of distributions is distinctive because its members all possess a variety of desirable properties, most importantly the existence of a sufficient statistic.

A casual reader may wish to restrict attention to the first and simplest definition, which corresponds to a single-parameter family of discrete or continuous probability distributions.

Note that in each such case, the parameters that must be fixed are those that set a limit on the range of values that can possibly be observed.

The function A(η) is automatically determined once the other functions have been chosen, since it must assume a form that causes the distribution to be normalized (sum or integrate to one over the entire domain).

However, a form of this sort is a member of a curved exponential family, which allows multiple factorized terms in the exponent.

Any member of that exponential family has a cumulative distribution function F satisfying $dF(x \mid \eta) = \exp(\eta \cdot T(x) - A(\eta))\, dH(x)$. Here H(x) is a Lebesgue–Stieltjes integrator for the reference measure.

The function A is important in its own right, because the mean, variance and other moments of the sufficient statistic T(x) can be derived simply by differentiating A(η).

Exponential families have a large number of properties that make them extremely useful for statistical analysis.

This is why the above cases (e.g. binomial with varying number of trials, Pareto with varying minimum bound) are not exponential families: in each case, the parameter in question affects the support (in particular, it changes the minimum or maximum possible value).

Unlike in the previous examples, the shape parameter does not affect the support; the fact that allowing it to vary makes the Weibull non-exponential is due rather to the particular form of the Weibull's probability density function (k appears in the exponent of an exponent).

As a first example, consider a random variable distributed normally with unknown mean μ and known variance σ².

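The probability density function is then

$$f_\sigma(x; \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$

This is an exponential family which can be written in canonical form by defining

$$h_\sigma(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-x^2/(2\sigma^2)}, \quad T_\sigma(x) = \frac{x}{\sigma}, \quad A_\sigma(\mu) = \frac{\mu^2}{2\sigma^2}, \quad \eta_\sigma(\mu) = \frac{\mu}{\sigma}.$$

As an example of a discrete exponential family, consider the binomial distribution with known number of trials n. The probability mass function for this distribution is

$$f(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x \in \{0, 1, \ldots, n\}.$$

This can equivalently be written as

$$f(x) = \binom{n}{x} \exp\left(x \log\frac{p}{1-p} + n \log(1-p)\right),$$

which shows that the binomial distribution is an exponential family, whose natural parameter is

$$\eta = \log\frac{p}{1-p}.$$

This function of p is known as logit.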
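The two rewritings can be checked numerically. The following is a minimal sketch, assuming the standard canonical choices for the normal family with known variance (η = μ/σ, T(x) = x/σ, A = μ²/(2σ²)) and for the binomial family with known n (η = logit(p), A = n log(1 + e^η)). The function names are illustrative only and do not come from any library.

```python
import math

def normal_pdf(x, mu, sigma):
    # Textbook normal density with mean mu and standard deviation sigma.
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def normal_expfam(x, mu, sigma):
    # Same density written as h(x) * exp(eta * T(x) - A).
    h = math.exp(-x ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
    eta = mu / sigma                  # natural parameter
    t = x / sigma                     # sufficient statistic
    a = mu ** 2 / (2 * sigma ** 2)    # log-partition function
    return h * math.exp(eta * t - a)

def binomial_pmf(x, n, p):
    # Textbook binomial probability mass function.
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomial_expfam(x, n, p):
    # Same pmf written as h(x) * exp(eta * x - A), with h(x) = C(n, x).
    eta = math.log(p / (1 - p))           # logit of p
    a = n * math.log(1 + math.exp(eta))   # equals -n * log(1 - p)
    return math.comb(n, x) * math.exp(eta * x - a)

if __name__ == "__main__":
    print(normal_pdf(1.2, 0.5, 2.0), normal_expfam(1.2, 0.5, 2.0))
    print(binomial_pmf(3, 10, 0.3), binomial_expfam(3, 10, 0.3))
```

Both pairs of printed values should agree up to floating-point rounding.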

This choice is made so that the moments of the sufficient statistics can be calculated easily, simply by differentiating this function.

In standard exponential families, the derivatives of this function correspond to the moments (more technically, the cumulants) of the sufficient statistics, e.g. the mean and variance.
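As an illustration, the first two derivatives of the log-partition function can be approximated by finite differences and compared with the known mean and variance. This is a sketch only, assuming the binomial family with known n, whose log-partition function is A(η) = n log(1 + e^η). The helper names and the step size h are arbitrary choices made for this example.

```python
import math

def log_partition(eta, n):
    # A(eta) for the binomial family with n trials (assumed example family).
    return n * math.log1p(math.exp(eta))

def derivative(f, x, h=1e-4):
    # Central finite-difference approximation of f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    # Central finite-difference approximation of f''(x).
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

n, p = 10, 0.3
eta = math.log(p / (1 - p))  # natural parameter (logit of p)

def A(e):
    return log_partition(e, n)

mean_from_A = derivative(A, eta)        # first cumulant of T(x) = x
var_from_A = second_derivative(A, eta)  # second cumulant of T(x) = x

print(mean_from_A, n * p)
print(var_from_A, n * p * (1 - p))
```

Each printed pair should agree to several decimal places, with the residual difference coming from the finite-difference approximation.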

In general, any non-negative function f(x) that serves as the kernel of a probability distribution (the part encoding all dependence on x) can be made into a proper distribution by normalizing, i.e.

$$p(x) = \frac{1}{Z} f(x),$$

where

$$Z = \int f(x)\, dx$$

(with a sum in place of the integral in the discrete case). The factor Z is sometimes termed the normalizer or partition function, based on an analogy to statistical physics.
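Normalizing a kernel can be sketched numerically. The example below assumes the kernel f(x) = exp(−x²/2), for which Z is known to equal √(2π), and approximates the integral with a plain trapezoid rule on a truncated interval. The interval and step count are arbitrary illustrative choices.

```python
import math

def kernel(x):
    # Unnormalized Gaussian kernel (assumed example).
    return math.exp(-0.5 * x * x)

def integrate(f, lo, hi, steps=20000):
    # Trapezoid rule on [lo, hi] with equally spaced points.
    width = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * width)
    return total * width

# Partition function: the kernel's mass outside [-10, 10] is negligible.
Z = integrate(kernel, -10.0, 10.0)

def density(x):
    # Proper density obtained by dividing the kernel by Z.
    return kernel(x) / Z

print(Z, math.sqrt(2 * math.pi))
print(integrate(density, -10.0, 10.0))
```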

Another way to see this that does not rely on the theory of cumulants is to begin from the fact that the distribution of an exponential family must be normalized, and differentiate.

We illustrate using the simple case of a one-dimensional parameter, but an analogous derivation holds more generally.

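In the one-dimensional case, we have

$$p(x) = h(x) \exp(\eta T(x) - A(\eta)).$$

This must be normalized, so

$$\int h(x) \exp(\eta T(x) - A(\eta))\, dx = 1.$$

Take the derivative of both sides with respect to η:

$$\int \left(T(x) - A'(\eta)\right) h(x) \exp(\eta T(x) - A(\eta))\, dx = \operatorname{E}[T(x)] - A'(\eta) = 0.$$

Therefore,

$$\operatorname{E}[T(x)] = A'(\eta).$$

As an introductory example, consider the gamma distribution, whose density is defined by

$$p(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}.$$

Referring to the above table, we can see that the natural parameter is given by $\eta_1 = \alpha - 1$, $\eta_2 = -\beta$; the reverse substitutions are $\alpha = \eta_1 + 1$, $\beta = -\eta_2$; and the sufficient statistics are $(\ln x, x)$.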

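To compute the variance of x, we just differentiate again. With the gamma log-partition function $A(\eta_1, \eta_2) = \ln\Gamma(\eta_1 + 1) - (\eta_1 + 1)\ln(-\eta_2)$,

$$\operatorname{Var}(x) = \frac{\partial^2 A(\eta_1, \eta_2)}{\partial \eta_2^2} = \frac{\eta_1 + 1}{\eta_2^2} = \frac{\alpha}{\beta^2}.$$

All of these calculations can be done using integration, making use of various properties of the gamma function, but this requires significantly more work.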
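The gamma calculation can be reproduced numerically without any integration. The sketch below assumes the natural parameterization η₁ = α − 1, η₂ = −β with log-partition function A(η₁, η₂) = ln Γ(η₁ + 1) − (η₁ + 1) ln(−η₂), and differentiates A with respect to η₂ by finite differences. The parameter values and step size are arbitrary.

```python
import math

def log_partition(eta1, eta2):
    # Gamma log-partition function in natural parameters (requires eta2 < 0).
    return math.lgamma(eta1 + 1) - (eta1 + 1) * math.log(-eta2)

alpha, beta = 3.0, 2.0
eta1, eta2 = alpha - 1, -beta
h = 1e-4  # finite-difference step

def A(e2):
    # A as a function of eta2 alone; eta2 pairs with the statistic x.
    return log_partition(eta1, e2)

# First derivative in eta2 approximates E[x]; second approximates Var(x).
mean_x = (A(eta2 + h) - A(eta2 - h)) / (2 * h)
var_x = (A(eta2 + h) - 2 * A(eta2) + A(eta2 - h)) / (h * h)

print(mean_x, alpha / beta)
print(var_x, alpha / beta ** 2)
```

Differentiating with respect to η₁ instead would give the expectation of the other sufficient statistic, ln x.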

As another example, consider a real-valued random variable X with density indexed by shape parameter

Even taking derivatives is a bit tricky, as it involves matrix calculus, but the respective identities are listed in that article.

We use the following forms: To differentiate with respect to η1, we need the following matrix calculus identity: Then: The last line uses the fact that V is symmetric, and therefore it is the same when transposed.

Exponential families arise naturally as the answer to the following question: what is the maximum-entropy distribution consistent with given constraints on expected values?

The ordinary definition of entropy for a discrete distribution supported on a set I, namely $S = -\sum_{i \in I} p_i \log p_i$, assumes, though this is seldom pointed out, that dH is chosen to be the counting measure on I.

The probability distribution dF whose entropy with respect to dH is greatest, subject to the conditions that the expected value of Ti be equal to ti, is an exponential family with dH as reference measure and (T1, ..., Tn) as sufficient statistic.
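A small discrete case makes this concrete. The sketch below assumes a support of {1, ..., 6} with the counting measure, a single statistic T(x) = x, and a target mean of 4.5, all chosen only for illustration. It finds the exponential-family member p(x) ∝ exp(ηx) with that mean by bisection, then compares its entropy with that of a nearby distribution having the same mean.

```python
import math

def gibbs(eta, support):
    # Exponential-family distribution p(x) proportional to exp(eta * x).
    weights = [math.exp(eta * x) for x in support]
    z = sum(weights)
    return [w / z for w in weights]

def mean(p, support):
    return sum(pi * x for pi, x in zip(p, support))

def entropy(p):
    # Shannon entropy with respect to the counting measure.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def solve_eta(target, support, lo=-20.0, hi=20.0, iters=200):
    # The mean is increasing in eta, so bisection finds the matching eta.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(gibbs(mid, support), support) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

support = [1, 2, 3, 4, 5, 6]
target = 4.5
eta = solve_eta(target, support)
p = gibbs(eta, support)

# Perturbation that preserves both total probability and the mean:
# the entries sum to 0, and 1*1 - 2*2 + 1*3 = 0.
direction = [1, -2, 1, 0, 0, 0]
q = [pi + 0.01 * d for pi, d in zip(p, direction)]

print(mean(p, support), mean(q, support))
print(entropy(p), entropy(q))
```

The perturbed distribution q satisfies the same constraint but has lower entropy than p, as the maximum-entropy characterization predicts.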

Only if their distribution is one of the exponential family of distributions is there a sufficient statistic T(X1, ..., Xn) whose number of scalar components does not increase as the sample size n increases; the statistic T may be a vector or a single scalar number, but whatever it is, its size will neither grow nor shrink when more data are obtained.
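The fixed size of the sufficient statistic can be illustrated with a short sketch. It assumes independent normal observations with unknown mean and variance, for which (Σx, Σx²) together with the sample size is sufficient. The function names and data are hypothetical.

```python
def sufficient_statistic(samples):
    # Two numbers summarize the sample, however many observations there are.
    return (sum(samples), sum(x * x for x in samples))

def mle_from_statistic(stat, n):
    # Maximum-likelihood mean and variance recovered from the summary alone.
    s1, s2 = stat
    mu = s1 / n
    var = s2 / n - mu * mu
    return mu, var

def mle_from_data(samples):
    # The same estimates computed directly from the raw observations.
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n
    return mu, var

small = [1.0, 2.0, 4.0]
large = [0.1 * i for i in range(1000)]

for data in (small, large):
    stat = sufficient_statistic(data)
    print(len(data), len(stat), mle_from_statistic(stat, len(data)), mle_from_data(data))
```

The statistic has two components for both samples, and the estimates computed from it match those computed from the full data up to rounding.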

It can however be represented by using a mixture density as the prior, here a combination of two beta distributions; this is a form of hyperprior.

An arbitrary likelihood will not belong to an exponential family, and thus in general no conjugate prior exists.

First, assume that the probability of a single observation follows an exponential family, parameterized using its natural parameter: $p(x \mid \eta) = h(x) \exp(\eta \cdot T(x) - A(\eta))$. Then, for data