Intuitively, if each distribution is viewed as a unit amount of earth (soil) piled on
Because of this analogy, the metric is known in computer science as the earth mover's distance.
The name "Wasserstein distance" was coined by R. L. Dobrushin in 1970, after learning of it in the work of Leonid Vaseršteĭn on Markov processes describing large systems of automata[1] (Russian, 1969).
However the metric was first defined by Leonid Kantorovich in The Mathematical Method of Production Planning and Organization[2] (Russian original 1939) in the context of optimal transport planning of goods and materials.
Most English-language publications use the German spelling "Wasserstein" (attributed to the name "Vaseršteĭn" (Russian: Васерштейн) being of Yiddish origin).
that gives the cost of transporting a unit mass from the point
such that at the end, both the pile of earth and the hole in the ground completely vanish.
In order for this plan to be meaningful, it must satisfy the following properties: That is, that the total mass moved out of an infinitesimal region around
As mentioned, the requirement for a plan to be valid is that it is a joint distribution with marginals
denote the set of all such measures as in the first section, the cost of the optimal plan is
This is a linear assignment problem, and can be solved by the Hungarian algorithm in cubic time.
Note that the second term (involving the trace) is precisely the (unnormalised) Bures metric between
This result generalises the earlier example of the Wasserstein distance between two point masses (at least in the case
), since a point mass can be regarded as a normal distribution with covariance matrix equal to zero, in which case the trace term disappears and only the term involving the Euclidean distance between the means remains.
In computer science, for example, the metric W1 is widely used to compare discrete distributions, e.g. the color histograms of two digital images; see earth mover's distance for more details.
In their paper 'Wasserstein GAN', Arjovsky et al.[5] use the Wasserstein-1 metric as a way to improve the original framework of generative adversarial networks (GAN), to alleviate the vanishing gradient and the mode collapse issues.
The special case of normal distributions is used in a Frechet inception distance.
[7] In computational biology, Wasserstein metric can be used to compare between persistence diagrams of cytometry datasets.
[9] The Wasserstein metric is used in integrated information theory to compute the difference between concepts and conceptual structures.
[10] The Wasserstein metric and related formulations have also been used to provide a unified theory for shape observable analysis in high energy and collider physics datasets.
[11][12] It can be shown that Wp satisfies all the axioms of a metric on the Wasserstein space Pp(M) consisting of all Borel probability measures on M having finite pth moment.
[13] The following dual representation of W1 is a special case of the duality theorem of Kantorovich and Rubinstein (1958): when μ and ν have bounded support,
where Lip(f) denotes the minimal Lipschitz constant for f. This form shows that W1 is an integral probability metric.
For the general case, the dual problem is found by converting sums to integrals:
Cédric Villani recounts the following interpretation from Luis Caffarelli:[16] Suppose you want to ship some coal from mines, distributed as
The Kantorovich duality states that the shipper can make a price schedule that makes you pay almost as much as you would ship yourself.This result can be pressed further to yield:Theorem (Kantorovich-Rubenstein duality) — When the probability space
The two infimal convolution steps are visually clear when the probability space is
That is, the mass should be conserved, and the velocity field should transport the probability distribution
of order two is Lipschitz equivalent to a negative-order homogeneous Sobolev norm.
to be a connected Riemannian manifold equipped with a positive measure