f-divergence

In probability theory, an f-divergence is a function D_f(P ‖ Q) that measures the difference between two probability distributions P and Q. Many common divergences, such as the KL-divergence, the Hellinger distance, and the total variation distance, are special cases of f-divergence.

These divergences were introduced by Alfréd Rényi in the same paper in which he introduced the Rényi entropy; he proved that these divergences decrease in Markov processes.

Let P and Q be two probability distributions over a space Ω such that P ≪ Q (P is absolutely continuous with respect to Q), and let f: [0, ∞) → (−∞, +∞] be a convex function with f(1) = 0. The f-divergence of P from Q is then defined as

    D_f(P \parallel Q) = \int_\Omega f\!\left(\frac{dP}{dQ}\right) dQ,

and f is called the generator of D_f. In concrete applications, there is usually a reference distribution μ on Ω (for example, when Ω = R^n, the reference distribution is the Lebesgue measure), such that P, Q ≪ μ. Then we can use the Radon–Nikodym theorem to take their probability densities p and q, giving

    D_f(P \parallel Q) = \int_\Omega q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) d\mu(x).

When there is no such reference distribution ready at hand, we can simply define μ = (P + Q)/2 and proceed as above.
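For discrete distributions the density form of the definition reduces to a weighted sum, which the following minimal Python sketch illustrates (the helper name f_divergence and the example numbers are purely illustrative):

```python
import numpy as np

def f_divergence(p, q, f):
    """Compute D_f(P || Q) = sum_i q_i * f(p_i / q_i) for discrete densities p, q.

    Assumes q_i > 0 wherever p_i > 0 (absolute continuity).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

# The KL-divergence corresponds to the generator f(t) = t * log(t).
kl_generator = lambda t: t * np.log(t)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(f_divergence(p, q, kl_generator))   # D_KL(P || Q)
print(np.sum(p * np.log(p / q)))          # agrees with the direct KL formula
```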

In particular, the monotonicity implies that if a Markov process has a positive equilibrium probability distribution P*, then D_f(P(t) ‖ P*) is a monotonic (non-increasing) function of time, where the probability distribution P(t) is a solution of the Kolmogorov forward equations (or master equation), used to describe the time evolution of the probability distribution in the Markov process. This means that all the f-divergences D_f(P(t) ‖ P*) are the Lyapunov functions of the Kolmogorov forward equations. The converse statement is also true: if H(P) is a Lyapunov function for all Markov chains with positive equilibrium P* and is of trace form (H(P) = Σ_i h(P_i, P*_i)), then H(P) = D_f(P ‖ P*), for some convex function f.[3][4] For example, Bregman divergences in general do not have such a property and can increase in Markov processes.[5]
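A small numerical check of this monotonicity (an illustrative sketch; the transition matrix and starting distribution are arbitrary): evolve a distribution under a fixed Markov transition matrix and observe that the KL-divergence to the equilibrium never increases.

```python
import numpy as np

# Row-stochastic transition matrix; the evolution is P(t+1) = P(t) @ T.
T = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])

# Equilibrium distribution: left eigenvector of T for eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(T.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([1.0, 0.0, 0.0])        # start far from equilibrium
for t in range(10):
    print(t, kl(p, pi))              # non-increasing in t
    p = p @ T
```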

The f-divergences can be expressed using Taylor series and rewritten using a weighted sum of chi-type distances (Nielsen & Nock (2013)).

Every f-divergence has a variational (dual) representation in terms of the convex conjugate f* of its generator:

    D_f(P \parallel Q) = \sup_{g} \left\{ \mathbb{E}_P[g(X)] - \mathbb{E}_Q[f^*(g(X))] \right\},

where the supremum is taken over measurable functions g taking values in the domain of f*.[2] Using this theorem on total variation distance, with generator f(x) = ½|x − 1|, whose convex conjugate satisfies f*(y) = y on [−½, ½], we obtain

    \mathrm{TV}(P, Q) = \sup_{|g| \le 1/2} \left\{ \mathbb{E}_P[g(X)] - \mathbb{E}_Q[g(X)] \right\}.
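For discrete distributions this representation can be verified directly; the supremum is attained at g = ½·sign(p − q) (an illustrative sketch):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

tv_direct = 0.5 * np.abs(p - q).sum()

# Variational form: sup over |g| <= 1/2 of E_P[g] - E_Q[g], attained at g = 0.5*sign(p - q).
g = 0.5 * np.sign(p - q)
tv_variational = np.sum(p * g) - np.sum(q * g)

print(tv_direct, tv_variational)     # both equal 0.3 here
```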

For the χ²-divergence, defined by f(x) = (x − 1)² with convex conjugate f*(y) = y²/4 + y, applying this theorem yields, after substitution with h = 1 + g/2,

    \chi^2(P; Q) = \sup_{h} \left\{ 2\,\mathbb{E}_P[h(X)] - \mathbb{E}_Q[h(X)^2] \right\} - 1.

Since the objective on the right-hand side as h varies is not affine-invariant in general, unlike the χ²-divergence itself, one can further optimize over affine transformations of a fixed h, which yields the lower bound

    \chi^2(P; Q) \ge \frac{\left(\mathbb{E}_P[h(X)] - \mathbb{E}_Q[h(X)]\right)^2}{\operatorname{Var}_Q[h(X)]}

(the Hammersley–Chapman–Robbins bound).

Applying this theorem to the squared Hellinger distance, defined by f(x) = (√x − 1)² with convex conjugate f*(y) = y/(1 − y) for y < 1, yields two variational representations of the squared Hellinger distance:

    H^2(P; Q) = \sup_{g < 1} \left\{ \mathbb{E}_P[g(X)] - \mathbb{E}_Q\!\left[\frac{g(X)}{1 - g(X)}\right] \right\}
              = 2 - \inf_{h > 0} \left\{ \mathbb{E}_P[1/h(X)] + \mathbb{E}_Q[h(X)] \right\},

where the second form follows from the first by the substitution h = 1/(1 − g).
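A quick numerical check of the second representation for discrete distributions (illustrative; the pointwise optimizer is h = √(p/q)):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

h_sq_direct = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

# inf over h > 0 of E_P[1/h] + E_Q[h] is attained pointwise at h = sqrt(p/q).
h = np.sqrt(p / q)
h_sq_variational = 2.0 - (np.sum(p / h) + np.sum(q * h))

print(h_sq_direct, h_sq_variational)   # both equal 2 - 2*sum(sqrt(p*q))
```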

Applying this theorem to the KL-divergence, defined by f(x) = x ln x with convex conjugate f*(y) = e^{y − 1}, yields

    D_{KL}(P \parallel Q) = \sup_{g} \left\{ \mathbb{E}_P[g(X)] - \mathbb{E}_Q\!\left[e^{g(X) - 1}\right] \right\}.

For any fixed g this bound is weaker than the Donsker–Varadhan representation given below.

Assume the setup in the beginning of this section ("Variational representations"). Under this setup, the basic representation above can be tightened.[2] Applying the tightened theorem to the KL-divergence yields the Donsker–Varadhan representation:

    D_{KL}(P \parallel Q) = \sup_{g} \left\{ \mathbb{E}_P[g(X)] - \ln \mathbb{E}_Q\!\left[e^{g(X)}\right] \right\}.
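The following sketch compares the two KL bounds for the same witness function g on discrete distributions (illustrative numbers; g is chosen as the optimizer of the Donsker–Varadhan bound):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

kl_direct = np.sum(p * np.log(p / q))

g = np.log(p / q)   # optimal witness for the Donsker-Varadhan bound
naive_bound = np.sum(p * g) - np.sum(q * np.exp(g - 1.0))
donsker_varadhan = np.sum(p * g) - np.log(np.sum(q * np.exp(g)))

print(kl_direct)          # the KL-divergence itself
print(donsker_varadhan)   # equals the KL-divergence for this g
print(naive_bound)        # a valid but weaker lower bound for the same g
```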

Attempting to apply this theorem to the general α-divergence does not yield a closed-form expression.

The following table lists many of the common divergences between probability distributions and the possible generating functions to which they correspond.
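As a reference implementation of several of these generators, under the normalization D_f(P ‖ Q) = ∫ q f(p/q) dμ (a sketch; constant factors and sign conventions vary between sources):

```python
import numpy as np

# Common generators f(t); each is convex on (0, infinity) with f(1) = 0.
generators = {
    "KL-divergence":          lambda t: t * np.log(t),
    "reverse KL-divergence":  lambda t: -np.log(t),
    "squared Hellinger":      lambda t: (np.sqrt(t) - 1.0) ** 2,
    "total variation":        lambda t: 0.5 * np.abs(t - 1.0),
    "Pearson chi-squared":    lambda t: (t - 1.0) ** 2,
}

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
for name, f in generators.items():
    print(name, float(np.sum(q * f(p / q))))
```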

Notably, except for total variation distance, all others are special cases of α-divergence, or linear sums of α-divergences.

For each f-divergence D_f, its generating function is not uniquely defined, but only up to the addition of a term c(t − 1), where c is any real constant: the generators f(t) and f(t) + c(t − 1) define the same divergence.
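This is because the added term integrates to zero against any pair of probability densities, as a one-line check under the density form of the definition shows:

```latex
\int_\Omega q\left[f\!\left(\tfrac{p}{q}\right) + c\left(\tfrac{p}{q} - 1\right)\right] d\mu
  \;=\; D_f(P \parallel Q) + c \int_\Omega (p - q)\, d\mu
  \;=\; D_f(P \parallel Q).
```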

Moreover, the generator t·f(1/t) defines the divergence with the arguments swapped, D_{t f(1/t)}(P ‖ Q) = D_f(Q ‖ P); hence an f-divergence is symmetric whenever t·f(1/t) = f(t) + c(t − 1) for some real c. In particular, this shows that the squared Hellinger distance and Jensen-Shannon divergence are symmetric.[6]

The only f-divergence that is also an integral probability metric is the total variation distance.[7]

A pair of probability distributions can be viewed as a game of chance in which one of the distributions defines the official odds and the other contains the actual probabilities.

Knowledge of the actual probabilities allows a player to profit from the game.

For a large class of rational players, the expected profit rate has the same general form as the f-divergence.
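For instance, under logarithmic utility (Kelly betting) with a complete set of mutually exclusive outcomes and odds priced according to Q while outcomes actually follow P, the optimal expected log-growth rate works out to the KL-divergence D_KL(P ‖ Q); a small illustrative computation (the numbers are arbitrary):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])    # actual outcome probabilities
q = np.array([0.25, 0.25, 0.5])  # probabilities implied by the official odds (payout 1/q_i per unit staked)

b = p                            # Kelly bettor stakes fraction b_i = p_i of the bankroll on outcome i
wealth_factor = b / q            # bankroll multiplier if outcome i occurs

expected_log_growth = np.sum(p * np.log(wealth_factor))
kl_divergence = np.sum(p * np.log(p / q))

print(expected_log_growth, kl_divergence)   # equal: the optimal growth rate is D_KL(P || Q)
```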

[Figure: comparison between the generators of α-divergences, as α varies from −1 to 2.]