Consider the set of all trial probability distributions that would encode the prior data; according to the principle of maximum entropy, the best choice is the distribution in this set with the largest information entropy.
The principle was first expounded by E. T. Jaynes in two papers in 1957,[1][2] where he emphasized a natural correspondence between statistical mechanics and information theory.
Consequently, statistical mechanics should be considered a particular application of a general tool of logical inference and information theory.
In most practical cases, the stated prior data or testable information is given by a set of conserved quantities (average values of some moment functions), associated with the probability distribution in question.
The maximum entropy principle is also needed to guarantee the uniqueness and consistency of probability assignments obtained by different methods, statistical mechanics and logical inference in particular.
The maximum entropy principle makes explicit our freedom in using different forms of prior data.
However, these statements do not imply that thermodynamical systems need not be shown to be ergodic to justify their treatment as a statistical ensemble.
Testable information is a statement about a probability distribution whose truth or falsity is well-defined; for example, "the expectation of the variable x is 2.87" and "p2 + p3 > 0.6" are statements of testable information. Given testable information, the maximum entropy procedure seeks the probability distribution that maximizes information entropy subject to the constraints imposed by that information. This constrained optimization problem is typically solved using the method of Lagrange multipliers.[3] Entropy maximization with no testable information respects only the universal "constraint" that the sum of the probabilities is one; under this constraint alone, the maximum entropy discrete probability distribution is the uniform distribution.[4]
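As a small illustration of this constrained problem, consider a maximum entropy distribution over the faces of a die subject to a prescribed mean (a sketch only; the support 1–6 and the target mean of 4.5 are arbitrary choices, not taken from the cited references). The maximizer has the exponential form p_i ∝ exp(λ·x_i), and the single Lagrange multiplier λ can be found by one-dimensional root-finding:

```python
import numpy as np
from scipy.optimize import brentq

def maxent_with_mean(values, target_mean):
    """Maximum-entropy distribution on `values` subject to E[X] = target_mean.

    The maximizer has the Gibbs form p_i proportional to exp(lam * x_i); the
    single Lagrange multiplier lam is found by one-dimensional root-finding
    on the condition that the resulting mean equals target_mean.
    """
    x = np.asarray(values, dtype=float)

    def mean_error(lam):
        w = np.exp(lam * (x - x.mean()))   # shifted for numerical stability
        p = w / w.sum()
        return p @ x - target_mean

    lam = brentq(mean_error, -50.0, 50.0)  # deliberately wide, ad hoc bracket
    w = np.exp(lam * (x - x.mean()))
    return w / w.sum()

# Illustrative use: a six-sided die constrained to have long-run mean 4.5.
faces = np.arange(1, 7)
p = maxent_with_mean(faces, 4.5)
print(p.round(4), "mean =", round(float(p @ faces), 4))
```

With no mean constraint the multiplier is zero and the same formula reduces to the uniform distribution, consistent with the no-information case above.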
A large amount of literature is now dedicated to the elicitation of maximum entropy priors and to links with channel coding.
Richard Jeffrey's probability kinematics is a special case of maximum entropy inference.[9]
Alternatively, the principle is often invoked for model specification: in this case the observed data itself is assumed to be the testable information.
An example of such a model is logistic regression, which corresponds to the maximum entropy classifier for independent observations.
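This correspondence can be checked numerically. The sketch below (with synthetic data and a plain gradient-ascent fit, both introduced here purely for illustration) fits a binary logistic model and verifies the moment-matching property that characterizes the maximum entropy solution: at the optimum, the feature expectations under the observed labels equal those under the fitted conditional model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: an intercept column plus two features (illustrative only).
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
w_true = np.array([-0.5, 2.0, -1.0])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

# Fit by plain gradient ascent on the mean conditional log-likelihood,
# which is the dual of the maximum entropy problem with
# feature-expectation constraints.
w = np.zeros(3)
step = 1.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ w))          # model probabilities P(y = 1 | x)
    w += step * (X.T @ (y - p)) / len(y)      # gradient of the mean log-likelihood

# Moment matching at the optimum: empirical feature expectations under the
# observed labels coincide with the expectations under the fitted model.
p = 1.0 / (1.0 + np.exp(-X @ w))
print(np.abs(X.T @ (y - p)).max())            # close to zero after convergence
```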
One of the main applications of the maximum entropy principle is in discrete and continuous density estimation.[10][11]
Similar to support vector machine estimators, the maximum entropy principle may require the solution to a quadratic programming problem, and thus provide a sparse mixture model as the optimal density estimator.
One important advantage of the method is its ability to incorporate prior information in the density estimation.[10]
In both cases, there is no closed-form solution, and the computation of the Lagrange multipliers usually requires numerical methods.
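The sketch below illustrates one such numerical computation (the bounded grid and the choice of first- and second-moment constraints are assumptions made for illustration; this is not the quadratic programming formulation of the cited estimators). The Lagrange multipliers are obtained by minimizing the convex dual, i.e. the log-partition function minus the constraint terms:

```python
import numpy as np
from scipy.optimize import minimize

# Discretized support and illustrative moment constraints: E[x] = 0.5, E[x^2] = 1.25.
x = np.linspace(-5.0, 5.0, 1001)
dx = x[1] - x[0]
features = np.vstack([x, x**2])
targets = np.array([0.5, 1.25])

def dual(lam):
    """Convex dual objective: log-partition function minus the constraint terms."""
    logits = lam @ features
    m = logits.max()
    log_z = m + np.log(np.exp(logits - m).sum() * dx)   # log of the Riemann sum
    return log_z - lam @ targets

res = minimize(dual, x0=np.zeros(2), method="BFGS")     # numerical Lagrange multipliers
lam = res.x

density = np.exp(lam @ features - (lam @ features).max())
density /= density.sum() * dx                           # normalize on the grid

# The fitted density matches the prescribed moments (here a near-Gaussian shape).
print(lam, (density * x).sum() * dx, (density * x**2).sum() * dx)
```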
For continuous distributions, the entropy to be maximized takes the form H_c = −∫ p(x) log( p(x) / q(x) ) dx, where q(x), which Jaynes called the "invariant measure", is proportional to the limiting density of discrete points.[11]
The invariant measure function q(x) can be best understood by supposing that x is known to take values only in the bounded interval (a, b), and that no other information is given. In that case the maximum entropy probability density is proportional to q(x) itself, so q(x) plays the role of a prior density encoding the absence of any other relevant information.
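A short variational sketch (standard calculus of variations, shown for illustration and not specific to any of the cited sources) makes this explicit:

```latex
% Maximize  H_c[p] = -\int_a^b p(x)\,\log\frac{p(x)}{q(x)}\,dx
% subject only to the normalization  \int_a^b p(x)\,dx = 1.
% Setting the functional derivative of the Lagrangian to zero:
-\log\frac{p(x)}{q(x)} - 1 + \lambda = 0
\quad\Longrightarrow\quad
p(x) = \frac{q(x)}{\int_a^b q(x')\,dx'},
% so with no information beyond a < x < b, the maximum entropy
% density simply reproduces the invariant measure q(x).
```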
Proponents of the principle of maximum entropy justify its use in assigning probabilities in several ways, including the following two arguments.
The first argument treats information entropy as a measure of uninformativeness: when there is no reason to favour any one of a set of mutually exclusive propositions over the others, the only reasonable probability distribution is uniform, and the information entropy is then equal to its maximum possible value, the logarithm of the number of propositions.
The second argument, a derivation suggested to Jaynes by Graham Wallis, has the advantage of being strictly combinatorial in nature, making no reference to information entropy as a measure of 'uncertainty', 'uninformativeness', or any other imprecisely defined concept.
(For this step to be successful, the information must be a constraint given by an open set in the space of probability measures).
In this derivation the protagonist wishes to assign probabilities to m mutually exclusive propositions and imagines distributing N quanta of probability (each worth 1/N) at random among them, rejecting and repeating any outcome that conflicts with his testable information; an outcome with n_i quanta in proposition i assigns it probability p_i = n_i/N. Rather than actually carry out, and possibly have to repeat, the rather long random experiment, the protagonist decides to simply calculate and use the most probable result. Since the probability of any particular outcome is proportional to its multiplicity W = N! / (n_1! n_2! ⋯ n_m!), he decides to maximize (1/N) log W, which has the same maximizer as W itself. At this point, in order to simplify the expression, the protagonist takes the limit as N → ∞, i.e. as the probability levels pass from grainy discrete values to smooth continuous values. Using Stirling's approximation, he finds (1/N) log W → −Σ_i p_i log p_i = H(p_1, …, p_m). All that remains for the protagonist to do is to maximize entropy under the constraints of his testable information; the most probable assignment is therefore precisely the maximum entropy distribution.
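The Stirling step can be checked numerically. In the sketch below the frequency vector p is an arbitrary choice made for illustration; the normalized log-multiplicity (1/N) log W is seen to approach −Σ p_i log p_i as N grows.

```python
import numpy as np
from scipy.special import gammaln

p = np.array([0.5, 0.3, 0.2])                 # an arbitrary frequency assignment
entropy = -(p * np.log(p)).sum()

for N in (10, 100, 1000, 100000):
    n = np.round(N * p).astype(int)           # n_i quanta in possibility i
    N_used = int(n.sum())                     # keep N consistent after rounding
    log_W = gammaln(N_used + 1) - gammaln(n + 1).sum()   # log( N! / (n_1! ... n_m!) )
    print(N_used, log_W / N_used, "vs entropy", entropy)
```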
Moreover, recent contributions (Lazar 2003; Schennach 2005) show that frequentist relative-entropy-based inference approaches (such as empirical likelihood and exponentially tilted empirical likelihood – see e.g. Owen 2001 and Kitamura 2006) can be combined with prior information to perform Bayesian posterior analysis.[15]
It is, however, possible in principle to solve for a posterior distribution directly from a stated prior distribution using the principle of minimum cross-entropy (the principle of maximum entropy being the special case in which the given prior is a uniform distribution), independently of any Bayesian considerations, by treating the problem formally as a constrained optimisation problem with the entropy functional as the objective function.
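Written as such a constrained optimisation problem, the solution takes a standard exponential-tilting form (sketched below in general notation, with f_k denoting the constraint functions and their prescribed expectations; these symbols are introduced here purely for illustration):

```latex
% Minimize the cross-entropy  D(p\,\|\,q) = \int p(x)\,\log\frac{p(x)}{q(x)}\,dx
% subject to  \int p(x)\,dx = 1  and  \int p(x)\,f_k(x)\,dx = \bar{f}_k ,\; k = 1,\dots,K.
p(x) \;=\; \frac{q(x)\,\exp\!\big(\textstyle\sum_{k} \lambda_k f_k(x)\big)}
                {\int q(x')\,\exp\!\big(\textstyle\sum_{k} \lambda_k f_k(x')\big)\,dx'}
% with the multipliers \lambda_k chosen so that the K expectation constraints hold;
% taking q uniform recovers the maximum entropy solution.
```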
The principle of maximum entropy bears a relation to a key assumption of the kinetic theory of gases known as molecular chaos or Stosszahlansatz.