Integral probability metric

In probability theory, integral probability metrics are types of distance functions between probability distributions, defined by how well a class of functions can distinguish the two distributions.

In addition to theoretical importance, integral probability metrics are widely used in areas of statistics and machine learning.

The name "integral probability metric" was given by German statistician Alfred Müller;[1] the distances had also previously been called "metrics with a ζ-structure.

"[2] Integral probability metrics (IPMs) are distances on the space of distributions over a set

here the notation P f refers to the expectation of f under the distribution P. The absolute value in the definition is unnecessary, and often omitted, for the usual case where for every

The functions being optimized over are sometimes called "critic" functions; if a particular $f^*$ achieves the supremum, it is often termed a "witness function"[4] (it "witnesses" the difference in the distributions). These functions try to have large values for samples from P and small (likely negative) values for samples from Q; this can be thought of as a weaker version of classifiers, and indeed IPMs can be interpreted as the optimal risk of a particular classifier.[5]
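As a toy illustration of the definition and of witness functions, the supremum can be computed by brute force when everything is finite. The distributions, support points, and the small function class below are made up for this sketch and are not taken from the references:

```python
import numpy as np

# Two probability distributions P and Q on the finite set {0, 1, 2},
# and a small, hand-picked function class F: each "function" is just its
# vector of values on those three points.
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])
F = [
    np.array([1.0, 0.5, 0.0]),  # a critic that is large where P has more mass
    np.array([1.0, 0.0, 1.0]),
    np.array([0.0, 1.0, 0.0]),
]

# D_F(P, Q) = sup_{f in F} |P f - Q f|, where P f = E_{X~P} f(X) = P . f
gaps = [abs(P @ f - Q @ f) for f in F]
ipm_value = max(gaps)
witness = F[int(np.argmax(gaps))]  # the f attaining the supremum "witnesses" P != Q
print(ipm_value, witness)
```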

For any choice of $\mathcal{F}$, the resulting $D_{\mathcal{F}}$ satisfies all the definitions of a metric except that we may have $D_{\mathcal{F}}(P, Q) = 0$ for some P ≠ Q; this is variously termed a "pseudometric" or a "semimetric" depending on the community. $D_{\mathcal{F}}$ is a metric if and only if $\mathcal{F}$ separates points on the space of probability distributions, i.e. for any P ≠ Q there is some $f \in \mathcal{F}$ with $P f \neq Q f$;[1] most, but not all, common particular cases satisfy this property.
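For a minimal worked illustration of the separation condition (the specific classes here are chosen for illustration, not drawn from the references):

```latex
% Too small a class: only the zero function, so no pair of distributions
% is distinguished and D_F is only a pseudometric.
\mathcal{F}_0 = \{\, x \mapsto 0 \,\}
  \quad\Longrightarrow\quad
  D_{\mathcal{F}_0}(P, Q) = 0 \quad \text{for all } P, Q.

% A separating class on the real line: indicators of half-lines.
\mathcal{F}_K = \{\, \mathbf{1}_{(-\infty, t]} : t \in \mathbb{R} \,\}
  \quad\Longrightarrow\quad
  D_{\mathcal{F}_K}(P, Q) = \sup_{t \in \mathbb{R}} \bigl| P(X \le t) - Q(X \le t) \bigr|.
```

The second class generates the Kolmogorov metric; since a distribution on the real line is determined by its CDF, this class separates points and the resulting IPM is a genuine metric.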

Common examples of integral probability metrics include the Wasserstein-1 (Kantorovich–Rubinstein) distance, the Dudley metric, the total variation distance, the Kolmogorov metric, and the maximum mean discrepancy. All of these examples are metrics except when noted otherwise.
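A sketch of two of these distances in code, assuming SciPy is available; the discrete distributions are arbitrary illustrations, and `wasserstein_distance` is SciPy's implementation of the Wasserstein-1 distance, whose IPM form uses the 1-Lipschitz functions:

```python
from scipy.stats import wasserstein_distance

# Two explicit discrete distributions on the real line.
support = [0.0, 1.0, 2.0]
weights_p = [0.5, 0.3, 0.2]
weights_q = [0.2, 0.3, 0.5]

# Wasserstein-1: the IPM generated by the 1-Lipschitz functions
# (Kantorovich-Rubinstein duality).
print(wasserstein_distance(support, support, weights_p, weights_q))

# Total variation: an IPM generated by uniformly bounded functions (up to a
# convention-dependent constant); for discrete distributions it equals half
# the L1 distance between the probability vectors.
print(0.5 * sum(abs(p - q) for p, q in zip(weights_p, weights_q)))
```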

The f-divergences are probably the best-known way to measure dissimilarity of probability distributions.

It has been shown[5]: sec. 2  that the only functions which are both IPMs and f-divergences are of the form $c \cdot \mathrm{TV}(P, Q)$, where $c \in [0, \infty]$ and $\mathrm{TV}$ is the total variation distance between distributions.

One major difference between f-divergences and most IPMs is that when P and Q have disjoint support, all f-divergences take on a constant value;[17] by contrast, IPMs where functions in $\mathcal{F}$ are "smooth" can give "partial credit." For instance, consider the sequence $\delta_{1/n}$ of Dirac measures at 1/n; this sequence converges in distribution to $\delta_0$, and many IPMs satisfy $D_{\mathcal{F}}(\delta_{1/n}, \delta_0) \to 0$, but no nonzero f-divergence can satisfy this.

That is, many IPMs are continuous in weaker topologies than f-divergences.

This property is sometimes of substantial importance,[18] although other options also exist, such as considering f-divergences between distributions convolved with continuous noise.[18][19]
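A small numerical illustration of the Dirac-sequence example above, sketched with SciPy; the total variation value is simply the known constant for point masses with disjoint supports:

```python
from scipy.stats import wasserstein_distance

# delta_{1/n} versus delta_0: for every n the two supports are disjoint.
for n in (1, 10, 100, 1000):
    w1 = wasserstein_distance([1.0 / n], [0.0])  # 1-Lipschitz IPM: equals 1/n
    tv = 1.0  # total variation (and every f-divergence) stays saturated
    print(f"n={n}: Wasserstein-1 = {w1}, total variation = {tv}")
```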

Because IPM values between discrete distributions are often sensible, it is often reasonable to estimate $D_{\mathcal{F}}(P, Q)$ using a simple "plug-in" estimator: $D_{\mathcal{F}}(\hat{P}, \hat{Q})$, where $\hat{P}$ and $\hat{Q}$ are empirical measures of sample sets.

These empirical distances can be computed exactly for some classes $\mathcal{F}$;[5] estimation quality varies depending on the distance, but can be minimax-optimal in certain settings.[14][20][21]
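One case where the plug-in estimate has a closed form is the maximum mean discrepancy. The following is a sketch of the plug-in (biased, "V-statistic") estimator with a Gaussian kernel; the kernel, bandwidth, and sample distributions are arbitrary choices for illustration:

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)) for rows of x and y."""
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def mmd_plugin(x, y, bandwidth=1.0):
    """Plug-in (V-statistic) MMD between the two empirical measures."""
    kxx = gaussian_kernel(x, x, bandwidth).mean()
    kyy = gaussian_kernel(y, y, bandwidth).mean()
    kxy = gaussian_kernel(x, y, bandwidth).mean()
    return np.sqrt(max(kxx + kyy - 2 * kxy, 0.0))  # guard tiny negative rounding

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 1))   # samples from P
y = rng.normal(0.5, 1.0, size=(200, 1))   # samples from Q
print(mmd_plugin(x, y))
```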

When exact maximization is not available or too expensive, another commonly used scheme is to divide the samples into "training" sets (with empirical measures $\hat{P}_{\mathrm{train}}$ and $\hat{Q}_{\mathrm{train}}$) and "test" sets ($\hat{P}_{\mathrm{test}}$ and $\hat{Q}_{\mathrm{test}}$), find $\hat{f}$ approximately maximizing $|\hat{P}_{\mathrm{train}} f - \hat{Q}_{\mathrm{train}} f|$, and then use $|\hat{P}_{\mathrm{test}} \hat{f} - \hat{Q}_{\mathrm{test}} \hat{f}|$ as the estimate.[22][12][23][24] This estimator can possibly be consistent, but has a negative bias.[22]

In fact, no unbiased estimator can exist for any IPM[22]: thm. 3, although there is for instance an unbiased estimator of the squared maximum mean discrepancy.
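A minimal sketch of this data-splitting scheme, assuming (for illustration only) the simple critic class of linear functions with unit-norm weights, for which the training-set maximization has a closed form; practical uses often train far richer critics:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=(400, 2))   # samples from P
y = rng.normal(0.3, 1.0, size=(400, 2))   # samples from Q

# Split each sample set into "training" and "test" halves.
x_tr, x_te = x[:200], x[200:]
y_tr, y_te = y[:200], y[200:]

# Critic class: f_w(z) = <w, z> with ||w|| <= 1.  On the training halves, the
# maximizer of |mean_P f - mean_Q f| is w proportional to the mean difference.
diff = x_tr.mean(axis=0) - y_tr.mean(axis=0)
w = diff / np.linalg.norm(diff)

# Evaluate the learned critic on the held-out halves.
estimate = abs((x_te @ w).mean() - (y_te @ w).mean())
print(estimate)
```

The held-out evaluation avoids re-using the data that selected the critic, but because $\hat{f}$ only approximately attains the supremum, the resulting estimate tends to be biased downward, consistent with the negative bias noted above.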