Maximum entropy probability distribution

In statistics and information theory, a maximum entropy probability distribution has entropy that is at least as great as that of all other members of a specified class of probability distributions.

According to the principle of maximum entropy, if nothing is known about a distribution except that it belongs to a certain class (usually defined in terms of specified properties or measures), then the distribution with the largest entropy should be chosen as the least-informative default.

The motivation is twofold: first, maximizing entropy minimizes the amount of prior information built into the distribution; second, many physical systems tend to move towards maximal entropy configurations over time.

If X is a discrete random variable with distribution Pr(X = xk) = pk, its entropy is defined as H(X) = −Σk pk log pk; if X is a continuous random variable with probability density p(x), its differential entropy is H(X) = −∫ p(x) log p(x) dx, where p log p is taken to be zero wherever p = 0. These are special cases of more general entropy functionals, but in connection with maximum entropy distributions this is the only form needed, because maximizing H(X) will also maximize the more general forms.

Information theorists may prefer to use base 2 in order to express the entropy in bits; mathematicians and physicists often prefer the natural logarithm, resulting in a unit of nats for the entropy.
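A quick numerical check of the base convention (the example pmf below is arbitrary): computing the same entropy with the natural logarithm and with base 2 gives values that differ only by the constant factor ln 2.

```python
import numpy as np

# Entropy of an arbitrary example pmf, in nats (natural log) and in bits (base 2);
# the two results differ only by the constant factor ln 2.
p = np.array([0.5, 0.25, 0.125, 0.125])
H_nats = -(p * np.log(p)).sum()
H_bits = -(p * np.log2(p)).sum()
print(H_nats, H_bits, H_nats / np.log(2))   # ~1.213, 1.75, 1.75
```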

The choice of the measure dx, however, is crucial, even though the typical use of the Lebesgue measure is often defended as a "natural" choice: which measure is chosen determines the entropy and the consequent maximum entropy distribution.

Many statistical distributions of practical interest are those for which the moments or other measurable quantities are constrained to be constants.

The following theorem by Ludwig Boltzmann gives the form of the probability density under these constraints.

In the continuous case, suppose S is a closed subset of the real numbers and we choose to specify n measurable functions f1,...,fn and n numbers a1,...,an. We consider the class C of all real-valued random variables that are supported on S (i.e. whose density is zero outside S) and that satisfy the n moment conditions E[fj(X)] ≥ aj for j = 1,...,n.

If there is a member of C whose density is positive everywhere on S, and if a maximum entropy distribution for C exists, then its probability density has the form

p(x) = exp(λ0·f0(x) + λ1·f1(x) + ... + λn·fn(x)) for all x in S,

where f0 ≡ 1 and the constants λ0, λ1,...,λn solve the associated constrained optimization problem, in which the Karush–Kuhn–Tucker conditions require λj ≥ 0 for the inequality constraints. If the moment conditions are equalities rather than inequalities, this non-negativity condition can be dropped, which makes optimization over the Lagrange multipliers unconstrained.
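As a concrete instance of this form (the support and constraint here are chosen purely for illustration), take S = [0,∞) with the single equality constraint E[X] = m; the theorem then yields the exponential distribution:

```latex
% Theorem form with a single observable f_1(x) = x on S = [0, \infty):
p(x) = \exp(\lambda_0 + \lambda_1 x), \qquad \lambda_1 < 0 \ \text{(required for normalisability)}.
% Normalisation and the mean constraint determine the two constants:
\int_0^\infty e^{\lambda_0 + \lambda_1 x}\,dx = \frac{e^{\lambda_0}}{-\lambda_1} = 1
\ \Rightarrow\ e^{\lambda_0} = -\lambda_1,
\qquad
\operatorname{E}[X] = \frac{1}{-\lambda_1} = m
\ \Rightarrow\ \lambda_1 = -\tfrac{1}{m}.
% Hence p(x) = (1/m)\, e^{-x/m}, the exponential distribution with mean m.
```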

In the discrete case, suppose S = {x1, x2, ...} is a (finite or infinite) discrete subset of the reals, and that we choose to specify n functions f1,...,fn and n numbers a1,...,an. We consider the class C of all discrete random variables X that are supported on S and that satisfy the n moment conditions E[fj(X)] ≥ aj for j = 1,...,n. If there exists a member of class C that assigns positive probability to every element of S, and if a maximum entropy distribution for C exists, then that distribution has the form

Pr(X = xk) = exp(λ0·f0(xk) + λ1·f1(xk) + ... + λn·fn(xk)) for k = 1, 2, ...,

where again f0 ≡ 1 and the constants λ0, λ1,...,λn solve the corresponding constrained optimization problem; as above, for equality constraints the multipliers are unconstrained.
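The multipliers can also be found numerically. Below is a minimal Python sketch for the discrete case with equality constraints; the support, observables and target moments are illustrative, and it uses the standard reformulation (not spelled out above) that for equality constraints the multipliers minimise the convex function log Z(λ) − λ·a, where Z(λ) is the normalising constant of the exponential form.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative discrete maximum entropy problem with equality constraints:
# support {1,...,6}, observables f1(x) = x and f2(x) = x^2, target moments a.
xs = np.arange(1, 7)
F = np.stack([xs, xs**2])          # F[j, k] = f_j(x_k)
a = np.array([3.5, 14.0])          # desired E[f_j(X)] (made-up, feasible values)

def dual(lam):
    # log Z(lambda) - <lambda, a>, the convex dual for equality constraints
    logits = lam @ F
    log_Z = np.log(np.exp(logits - logits.max()).sum()) + logits.max()
    return log_Z - lam @ a

res = minimize(dual, x0=np.zeros(2), method="BFGS")
p = np.exp(res.x @ F)              # Gibbs form p_k ∝ exp(sum_j lambda_j f_j(x_k))
p /= p.sum()
print(p, F @ p)                    # pmf and its moments (should match a closely)
```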

In the case of equality constraints, this theorem is proved with the calculus of variations and Lagrange multipliers.
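A sketch of that variational argument for equality constraints E[fj(X)] = aj, with a multiplier η0 for normalisation and λ1,...,λn for the moment constraints:

```latex
% Lagrangian functional: entropy plus normalisation and moment constraints
J(p) = -\int p(x)\ln p(x)\,dx
       + \eta_0\!\left(\int p(x)\,dx - 1\right)
       + \sum_{j=1}^{n}\lambda_j\!\left(\int f_j(x)\,p(x)\,dx - a_j\right)
% Setting the functional derivative with respect to p(x) to zero:
\frac{\delta J}{\delta p(x)} = -\ln p(x) - 1 + \eta_0 + \sum_{j=1}^{n}\lambda_j f_j(x) = 0
% which gives the exponential form of the theorem:
p(x) = \exp\!\Big(\eta_0 - 1 + \sum_{j=1}^{n}\lambda_j f_j(x)\Big)
     = \exp\!\Big(\sum_{j=0}^{n}\lambda_j f_j(x)\Big),
\qquad \lambda_0 := \eta_0 - 1,\ f_0 \equiv 1 .
```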

To see uniqueness, suppose p and p′ are two distributions satisfying the expectation constraints. For α in (0,1), consider the mixture q = αp + (1−α)p′; it is clear that this distribution satisfies the expectation constraints and furthermore has as support supp(q) = supp(p) ∪ supp(p′). Basic facts about entropy give H(q) ≥ αH(p) + (1−α)H(p′), so a distribution maximising entropy subject to the constraints must have full support and is therefore an interior point of the constraint set; it then suffices to show that the local extreme is unique. Local extremes are characterised by parameter vectors λ via p(x) = exp(⟨λ, f(x)⟩)/C(λ), where C(λ) is the normalising constant.

We now note a series of identities: via the satisfaction of the expectation constraints and utilising gradients / directional derivatives, one has ∇ log C(λ) = Ep[f(X)] = a, and the second directional derivative of log C along any direction u equals the variance of ⟨u, f(X)⟩ under p.

Assuming that no non-trivial linear combination of the observables is almost everywhere (a.e.) constant, this variance is strictly positive unless u = 0, so log C is strictly convex. Two local extremes would then have to share the same gradient value a, which strict convexity rules out unless their parameters coincide; hence the maximum entropy distribution is unique.

Every probability distribution is trivially a maximum entropy probability distribution under the constraint that the distribution has its own entropy: rewriting the density as p(x) = exp(ln p(x)) and comparing with the theorem above, one can take f(x) = ln p(x) as the observable, and the constraint E[f(X)] = −H is exactly the statement that the entropy equals H. Nontrivial examples are distributions that are subject to multiple constraints that are different from the assignment of the entropy; these are often found by starting with the same substitution, ln p(x) → f(x), and observing that f(x) can be separated into parts.

A table of examples of maximum entropy distributions is given in Lisman (1972)[6] and Park & Bera (2009).

More generally, if we are given a subdivision a=a0 < a1 < ... < ak = b of the interval [a,b] and probabilities p1,...,pk that add up to one, then we can consider the class of all continuous distributions such that Pr(aj-1 ≤ X < aj) = pj for j = 1,...,k. The density of the maximum entropy distribution for this class is constant on each of the intervals [aj-1,aj); it looks somewhat like a histogram.
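A minimal Python sketch of this piecewise-constant density (the function name and the example subdivision are illustrative):

```python
import numpy as np

# Maximum entropy density for the class Pr(a_{j-1} <= X < a_j) = p_j:
# constant, equal to p_j / (a_j - a_{j-1}), on each subinterval.
def piecewise_uniform_density(subdivision, probs):
    """Return a vectorised function x -> density of the max-entropy distribution."""
    subdivision = np.asarray(subdivision, dtype=float)
    probs = np.asarray(probs, dtype=float)
    assert np.isclose(probs.sum(), 1.0), "probabilities must sum to one"
    heights = probs / np.diff(subdivision)           # constant density per interval

    def density(x):
        j = np.searchsorted(subdivision, x, side="right") - 1
        inside = (x >= subdivision[0]) & (x < subdivision[-1])
        return np.where(inside, heights[np.clip(j, 0, len(heights) - 1)], 0.0)

    return density

# Example: [0,3] split into [0,1), [1,2), [2,3) with probabilities 0.5, 0.3, 0.2.
f = piecewise_uniform_density([0, 1, 2, 3], [0.5, 0.3, 0.2])
print(f(np.array([0.5, 1.5, 2.5, 3.5])))             # -> [0.5 0.3 0.2 0. ]
```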

The normal distribution N(μ,σ2) has maximum entropy among all real-valued distributions supported on (−∞,∞) with a specified variance σ2. The same is true when both the mean μ and the variance σ2 are specified (the first two moments), since entropy is translation invariant on (−∞,∞).

Therefore, the assumption of normality imposes the minimal prior structural constraint beyond these moments.
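A small numerical illustration (the competing distributions and the unit variance are chosen for the example): among a normal, a Laplace and a uniform distribution all scaled to variance 1, the normal has the largest differential entropy, consistent with the statement above.

```python
import numpy as np

# Closed-form differential entropies of three variance-1 distributions:
#   normal N(0, s^2)            : 0.5 * ln(2*pi*e*s^2)
#   Laplace with scale b        : 1 + ln(2*b),   variance 2*b^2
#   uniform on [-c, c]          : ln(2*c),       variance c^2 / 3
sigma2 = 1.0
h_normal  = 0.5 * np.log(2 * np.pi * np.e * sigma2)
h_laplace = 1 + np.log(2 * np.sqrt(sigma2 / 2))      # b chosen so 2*b**2 == sigma2
h_uniform = np.log(2 * np.sqrt(3 * sigma2))          # c chosen so c**2/3 == sigma2
print(h_normal, h_laplace, h_uniform)                # ~1.419 > ~1.347 > ~1.242
```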

Among all the discrete distributions supported on the set {x1,...,xn} with a specified mean μ, the maximum entropy distribution has the following shape: Pr(X = xk) = C·r^xk for k = 1,...,n, where the positive constants C and r can be determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ.
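A short Python sketch of this result (the support {1,...,6} and the target mean echo the classic dice example and are purely illustrative); since the mean of the family C·r^xk is increasing in r, a one-dimensional root-finder is enough:

```python
import numpy as np
from scipy.optimize import brentq

# Find C and r such that P(X = x_k) = C * r**x_k has the prescribed mean mu.
xs, mu = np.arange(1, 7), 4.5           # support {1,...,6}, illustrative target mean

def mean_given_r(r):
    w = r ** xs                          # unnormalised weights (C cancels in the mean)
    return (xs * w).sum() / w.sum() - mu

r = brentq(mean_given_r, 1e-6, 1e6)      # the mean is monotone increasing in r
C = 1.0 / (r ** xs).sum()
p = C * r ** xs
print(r, p, (xs * p).sum())              # probabilities and a check that the mean is mu
```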

Finally, among all the discrete distributions supported on the infinite set {x1, x2, ...} with mean μ, the maximum entropy distribution has the shape Pr(X = xk) = C·r^xk for k = 1, 2, ..., where again the constants C and r are determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ.
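For instance, when xk = k for k = 1, 2, ..., the constants can be found in closed form and the result is a geometric distribution; a short derivation under that assumption:

```latex
% With P(X = k) = C r^k on k = 1, 2, ..., normalisation and the mean condition give
\sum_{k\ge 1} C r^{k} = \frac{C r}{1-r} = 1 \ \Rightarrow\ C = \frac{1-r}{r},
\qquad
\operatorname{E}[X] = \sum_{k\ge 1} k\, C r^{k} = \frac{C r}{(1-r)^{2}} = \frac{1}{1-r} = \mu
\ \Rightarrow\ r = 1 - \tfrac{1}{\mu}.
% Hence P(X = k) = (1-r)\, r^{k-1} = \tfrac{1}{\mu}\bigl(1 - \tfrac{1}{\mu}\bigr)^{k-1},
% a geometric distribution with success probability 1/\mu.
```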

There exists an upper bound on the entropy of continuous random variables on (−∞,∞) with specified mean, variance, and skew; however, no distribution attains it, and the maximum entropy is only ε-achievable, meaning that the entropy of a distribution satisfying the constraints can be made arbitrarily close to the bound.[9] To see this, start with a normal distribution of the specified mean and variance. To introduce a positive skew, perturb the normal distribution upward by a small amount at a value many σ larger than the mean; the skewness, being proportional to the third moment, is affected far more than the lower-order moments.

The exponential of an odd-order polynomial is unbounded on (−∞,∞) and hence not normalizable, so the bound cannot be attained on the whole real line, but when the support is limited to a bounded or semi-bounded interval the upper entropy bound may be achieved (e.g. if x lies in the interval [0,∞) and λ < 0, the exponential distribution will result).
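A rough numerical check of the perturbation argument above (the grid, bump location and bump size are arbitrary choices): the skewness of the perturbed distribution changes markedly while its mean, variance and entropy move only slightly.

```python
import numpy as np

# Discretise a standard normal, then add a tiny amount of probability far to the
# right of the mean and compare moments and (estimated) differential entropy.
dx = 0.01
x = np.arange(-12, 40, dx)
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi) * dx       # discretised N(0,1) pmf
p /= p.sum()

def stats(q):
    m = (x * q).sum()
    v = ((x - m) ** 2 * q).sum()
    skew = ((x - m) ** 3 * q).sum() / v ** 1.5
    h = -(q[q > 0] * np.log(q[q > 0] / dx)).sum()     # differential entropy estimate
    return m, v, skew, h

q = p.copy()
q[np.argmin(np.abs(x - 20))] += 1e-4                  # small bump at 20 sigma above the mean
q /= q.sum()
print(stats(p))   # ~ (0, 1, 0, 1.419)
print(stats(q))   # mean/variance/entropy shift only slightly; skewness jumps to ~0.75
```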

In the table of examples, each listed distribution maximizes the entropy for a particular set of functional constraints, listed in the third column, together with the constraint that x be included in the support of the probability density, which is listed in the fourth column.[6][7] Several listed examples (Bernoulli, geometric, exponential, Laplace, Pareto) are trivially true, because their associated constraints are equivalent to the assignment of their entropy.

They are included anyway because their constraint is related to a common or easily measured quantity.