Kullback–Leibler divergence

[5] Relative entropy is always a non-negative real number, with value 0 if and only if the two distributions in question are identical.

It has diverse applications, both theoretical, such as characterizing the relative (Shannon) entropy in information systems, randomness in continuous time-series, and information gain when comparing statistical models of inference; and practical, such as applied statistics, fluid mechanics, neuroscience, bioinformatics, and machine learning.

The divergence $D_{\text{KL}}(P \parallel Q)$ is then interpreted as the average difference in the number of bits required for encoding samples of P using a code optimized for Q rather than one optimized for P. Note that the roles of P and Q can be reversed in some situations where that is easier to compute, such as with the expectation–maximization (EM) algorithm and evidence lower bound (ELBO) computations.
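
One standard way to state the ELBO connection (the notation here follows common usage rather than the article): for a latent-variable model with data x, latent z, and approximating distribution q,
$$ \log p(x) = \underbrace{\mathbb{E}_{q(z)}\!\left[ \log \frac{p(x, z)}{q(z)} \right]}_{\text{ELBO}} + D_{\text{KL}}\bigl(q(z) \parallel p(z \mid x)\bigr), $$
so maximising the ELBO over q is equivalent to minimising the reverse-direction relative entropy $D_{\text{KL}}(q \parallel p(\cdot \mid x))$.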

[10] Numerous references to earlier uses of the symmetrized divergence and to other statistical distances are given in Kullback (1959).

Whenever $P(x)$ is zero, the contribution of the corresponding term is interpreted as zero because $\lim_{x \to 0^{+}} x \log x = 0$. For distributions P and Q of a continuous random variable, relative entropy is defined to be the integral[14] $D_{\text{KL}}(P \parallel Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)}\, dx$, where p and q denote the probability densities of P and Q.
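
As an illustrative sketch (not from the article; the function and the example distributions are hypothetical), the discrete form of this definition, including the zero-probability convention, can be computed directly:

```python
import numpy as np

def kl_divergence(p, q, base=2.0):
    """Discrete D_KL(P || Q) for probability vectors p and q.

    Terms with p[i] == 0 contribute zero, following the convention
    lim_{x -> 0+} x log x = 0.  If some p[i] > 0 where q[i] == 0,
    the divergence is infinite (P is not absolutely continuous w.r.t. Q).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0
    if np.any(q[support] == 0):
        return np.inf
    terms = p[support] * (np.log(p[support]) - np.log(q[support]))
    return terms.sum() / np.log(base)   # base=2 gives bits, base=np.e gives nats

# Hypothetical example distributions over three outcomes
p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q))          # about 0.085 bits
print(kl_divergence(p, q, np.e))    # the same quantity in nats
```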

This example uses the natural log, with base e and designated ln, to give results in nats (see units of information). In the field of statistics, the Neyman–Pearson lemma states that the most powerful way to distinguish between the two distributions P and Q based on an observation Y (drawn from one of them) is through the log of the ratio of their likelihoods, $\log \frac{p(Y)}{q(Y)}$.
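
Taking the expectation of this log-likelihood ratio under P recovers the relative entropy, which makes the connection explicit:
$$ D_{\text{KL}}(P \parallel Q) = \operatorname{E}_{Y \sim P}\!\left[ \log \frac{p(Y)}{q(Y)} \right]. $$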

The Fisher information metric on a family of probability distributions can be used to determine the natural gradient for information-geometric optimization algorithms.
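
As a brief sketch of why the two are linked (a standard result, stated here for reference): for a smooth parametric family $P_\theta$ with Fisher information matrix $F(\theta)$, the relative entropy between nearby members is, to second order,
$$ D_{\text{KL}}(P_{\theta} \parallel P_{\theta + d\theta}) \approx \tfrac{1}{2}\, d\theta^{\mathsf{T}} F(\theta)\, d\theta, $$
which is why the natural gradient preconditions the ordinary gradient by $F(\theta)^{-1}$.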

[18] Relative entropy satisfies a generalized Pythagorean theorem for exponential families (geometrically interpreted as dually flat manifolds), and this allows one to minimize relative entropy by geometric means, for example by information projection and in maximum likelihood estimation.
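
A minimal statement of the Pythagorean relation, under the usual assumptions: if $p^{*}$ is the information projection of $q$ onto a closed convex set $E$ (i.e. $p^{*}$ minimises $D_{\text{KL}}(p \parallel q)$ over $p \in E$), then for every $p \in E$,
$$ D_{\text{KL}}(p \parallel q) \;\ge\; D_{\text{KL}}(p \parallel p^{*}) + D_{\text{KL}}(p^{*} \parallel q), $$
with equality when $E$ is a linear family.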

[19] Consider a growth-optimizing investor in a fair game with mutually exclusive outcomes (e.g. a “horse race” in which the official odds add up to one).
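
A minimal numeric sketch of this setting, with invented probabilities and odds: the growth-optimal (Kelly) investor bets in proportion to the true probabilities p, and the resulting edge in expected log-growth over simply betting the odds-implied distribution q is exactly $D_{\text{KL}}(p \parallel q)$.

```python
import numpy as np

# Invented "horse race": true win probabilities p, and official odds whose
# implied probabilities q sum to one (so the payout on horse i is 1/q[i]).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

def growth_rate(b, p, q):
    """Expected log-growth per race when betting wealth fractions b on each horse."""
    return np.sum(p * np.log(b / q))

print(growth_rate(p, p, q))        # Kelly bet b = p: growth equals D_KL(p || q) in nats
print(np.sum(p * np.log(p / q)))   # D_KL(p || q), the same number
print(growth_rate(q, p, q))        # betting the odds-implied q: zero growth
```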

Extending this concept, relative entropy can be hypothetically utilised to identify the behaviour of informed investors, if one takes this to be represented by the magnitude of, and deviations away from, the prior expectations of fund flows, for example.

This is the cross-entropy $\mathrm{H}(P, Q)$, the expected number of bits required when using a code based on Q rather than P; the Kullback–Leibler divergence therefore represents the expected number of extra bits that must be transmitted to identify a value x drawn from X, if a code is used corresponding to the probability distribution Q, rather than the "true" distribution P.
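
In symbols, with $\mathrm{H}(P)$ the entropy of P and $\mathrm{H}(P, Q)$ the cross-entropy:
$$ D_{\text{KL}}(P \parallel Q) = \mathrm{H}(P, Q) - \mathrm{H}(P). $$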

If $(P_n)$ is a sequence of distributions such that $D_{\text{KL}}(P_n \parallel Q) \to 0$ as $n \to \infty$, then it is said that $P_n \xrightarrow{D} Q$. Pinsker's inequality entails that convergence in this sense implies convergence in total variation, $P_n \xrightarrow{TV} Q$, where the latter stands for the usual convergence in total variation.
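
For reference, Pinsker's inequality bounds the total variation distance $\delta(P, Q)$ by the relative entropy measured in nats:
$$ \delta(P, Q) \le \sqrt{\tfrac{1}{2}\, D_{\text{KL}}(P \parallel Q)}. $$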

This measure is computed using Kullback–Leibler divergences between the two distributions in a quantized embedding space of a foundation model.

Many of the other quantities of information theory can be interpreted as applications of relative entropy to specific cases.

The self-information, also known as the information content of a signal, random variable, or event, is defined as the negative logarithm of the probability of the given outcome occurring.
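
In symbols, for an outcome x occurring with probability p(x):
$$ \operatorname{I}(x) = -\log p(x). $$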

When applied to a discrete random variable, the self-information can be represented as[citation needed] $\operatorname{I}(m) = D_{\text{KL}}\bigl(\delta_{im} \parallel \{p_i\}\bigr)$, the relative entropy of the probability distribution $P(i)$ from a Kronecker delta representing certainty that $i = m$.

This compresses the data by replacing each fixed-length input symbol with a corresponding unique, variable-length, prefix-free code (e.g.: the events (A, B, C) with probabilities p = (1/2, 1/4, 1/4) can be encoded as the bits (0, 10, 11)).

However, if we use a different probability distribution (q) when creating the entropy encoding scheme, then a larger number of bits will be used (on average) to identify an event from a set of possibilities.

Under this scenario, relative entropy (KL divergence) can be interpreted as the extra number of bits, on average, that are needed (beyond $\mathrm{H}(p)$) to encode the events, because of using q to construct the encoding scheme instead of p.
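
A small numeric sketch of this scenario (the mismatched distribution q is invented), using the distribution p = (1/2, 1/4, 1/4) from the prefix-code example above:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy H(p) in bits: the best achievable average code length."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

def cross_entropy_bits(p, q):
    """H(p, q): average bits per symbol when the code is built for q but symbols follow p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(q))

p = [0.5, 0.25, 0.25]     # true probabilities of (A, B, C), as in the prefix-code example
q = [0.25, 0.5, 0.25]     # invented mismatched distribution used to build the code

h = entropy_bits(p)                 # 1.5 bits per symbol with the optimal code (0, 10, 11)
h_pq = cross_entropy_bits(p, q)     # 1.75 bits per symbol with the mismatched code
print(h_pq - h)                     # 0.25 extra bits = D_KL(p || q)
```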

A common goal in Bayesian experimental design is to maximise the expected relative entropy between the prior and the posterior.

[34] When posteriors are approximated as Gaussian distributions, a design maximising the expected relative entropy is called Bayes d-optimal.
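
For the univariate Gaussian case this approximation relies on, the relative entropy has a closed form (stated here for reference, in nats):
$$ D_{\text{KL}}\bigl(\mathcal{N}(\mu_1, \sigma_1^2) \parallel \mathcal{N}(\mu_2, \sigma_2^2)\bigr) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}. $$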

These two different scales of loss function for uncertainty are both useful, according to how well each reflects the particular circumstances of the problem in question.

In the engineering literature, MDI is sometimes called the Principle of Minimum Cross-Entropy (MCE) or Minxent for short.

Minimising relative entropy from m to p with respect to m is equivalent to minimising the cross-entropy of p and m, since $\mathrm{H}(p, m) = \mathrm{H}(p) + D_{\text{KL}}(p \parallel m)$, which is appropriate if one is trying to choose an adequate approximation to p. However, this is just as often not the task one is trying to achieve.

This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by redefining cross-entropy to be $D_{\text{KL}}(p \parallel m)$, rather than $\mathrm{H}(p, m)$.

Best-guess states (e.g. for atoms in a gas) are inferred by maximizing the average surprisal S (entropy) for a given set of control parameters (like pressure P or volume V).

For instance, the work available in equilibrating a monatomic ideal gas to ambient values of $V_o$ and $T_o$ is $W = T_o \Delta I$, where $\Delta I$ is the relative entropy of the gas's state with respect to the ambient equilibrium state.

The resulting contours of constant relative entropy, shown at right for a mole of argon at standard temperature and pressure, for example put limits on the conversion of hot to cold, as in flame-powered air-conditioning or in the unpowered device for converting boiling water to ice water discussed here.

In the banking and finance industries, this quantity is referred to as the Population Stability Index (PSI), and is used to assess distributional shifts in model features through time.
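
A hedged sketch of how PSI is commonly computed in practice (the binning and the example proportions here are illustrative): the observed and expected bin proportions are compared through a symmetrised Kullback–Leibler sum.

```python
import numpy as np

def population_stability_index(expected, actual, eps=1e-6):
    """PSI between two binned distributions (bin proportions summing to 1).

    PSI = sum_i (actual_i - expected_i) * ln(actual_i / expected_i),
    i.e. the symmetrised divergence D_KL(actual || expected) + D_KL(expected || actual).
    A small eps guards against empty bins.
    """
    e = np.clip(np.asarray(expected, dtype=float), eps, None)
    a = np.clip(np.asarray(actual, dtype=float), eps, None)
    return np.sum((a - e) * np.log(a / e))

# Illustrative bin proportions for one model feature at development time vs today
baseline = [0.10, 0.20, 0.40, 0.20, 0.10]
current  = [0.12, 0.22, 0.36, 0.20, 0.10]
print(population_stability_index(baseline, current))
```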

It is similar to the Hellinger metric (in the sense that it induces the same affine connection on a statistical manifold).

Figure: Two distributions used to illustrate relative entropy.
Figure: Illustration of the relative entropy for two normal distributions; the typical asymmetry is clearly visible.
Figure: Pressure versus volume plot of available work from a mole of argon gas relative to ambient, calculated as $T_o$ times the Kullback–Leibler divergence.