It was first formulated as a tool to assess the quality of Markov chain Monte Carlo samplers,[1] but has since been used in diverse settings in statistics, machine learning and computer science.
A natural notion of distance between two probability distributions P and Q, defined on a measurable space \mathcal{X}, is provided by an integral probability metric[3]

d_{\mathcal{M}}(P, Q) := \sup_{m \in \mathcal{M}} \bigl| \mathbb{E}_{X \sim P}[m(X)] - \mathbb{E}_{Y \sim Q}[m(Y)] \bigr|,     (1.1)

where \mathcal{M} denotes a set of measurable functions on \mathcal{X}, and where for the purposes of exposition we assume that the expectations exist and that the set \mathcal{M}
is sufficiently rich that (1.1) is indeed a metric on the set of probability distributions on \mathcal{X}.
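The Stein discrepancy itself is obtained from (1.1) by a particular choice of the function set. A minimal sketch of the standard construction is shown below; the symbols \mathcal{A}_P and \mathcal{F}, and the use of the Langevin Stein operator as the concrete example, are illustrative conventions rather than the only possibility.

```latex
% Sketch: from the integral probability metric (1.1) to a Stein discrepancy.
% Choosing the function set to be of the form  M = A_P F,  where A_P is a
% Stein operator satisfying  E_{X ~ P}[(A_P f)(X)] = 0  for every f in the
% Stein set F, makes the expectation under P in (1.1) vanish, giving
\[
  D_P(Q) \;:=\; \sup_{f \in \mathcal{F}}
    \bigl| \mathbb{E}_{X \sim Q}\!\bigl[ (\mathcal{A}_P f)(X) \bigr] \bigr| .
\]
% A standard example on R^d, for a target with smooth positive density p,
% is the Langevin Stein operator acting on vector fields f : R^d -> R^d,
\[
  (\mathcal{A}_P f)(x) \;=\; \nabla \log p(x) \cdot f(x) \;+\; \nabla \cdot f(x),
\]
% which depends on p only through grad log p, and hence only up to the
% normalisation constant of p.
```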
The graph Stein discrepancy (GSD) is actually the solution of a finite-dimensional linear program, with the size of the linear program growing with the number of distinct states in the support of Q.
Let \mathcal{F} be the unit ball in a (possibly vector-valued) reproducing kernel Hilbert space \mathcal{H}(K) with reproducing kernel K. With this choice of function set, the Stein discrepancy admits the closed form

D_P(Q) = \sqrt{ \mathbb{E}_{X, X' \sim Q}[ k_P(X, X') ] },

where X and X' are independent draws from Q and k_P denotes the Stein kernel, obtained by applying the Stein operator to K in each of its arguments. A Stein discrepancy constructed in this manner is called a kernel Stein discrepancy[5][6][7][8] and the construction is closely connected to the theory of kernel embedding of probability distributions.
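The closed form above lends itself to direct computation once the Stein kernel is available. The following is a minimal NumPy sketch, assuming the Langevin Stein operator and an inverse multiquadric base kernel; the function names (imq_kernel_terms, stein_kernel_matrix, ksd) and the parameter defaults are illustrative choices rather than a reference implementation.

```python
import numpy as np

def imq_kernel_terms(X, Y, c=1.0, beta=-0.5):
    """Inverse multiquadric kernel k(x, y) = (c^2 + ||x - y||^2)^beta,
    together with the derivative terms needed for the Stein kernel."""
    diff = X[:, None, :] - Y[None, :, :]                 # (n, m, d)
    r2 = np.sum(diff ** 2, axis=-1)                      # (n, m)
    base = c ** 2 + r2
    k = base ** beta
    grad_x_k = 2.0 * beta * base[..., None] ** (beta - 1) * diff   # dk/dx
    grad_y_k = -grad_x_k                                           # dk/dy
    d = X.shape[1]
    div_xy_k = (-2.0 * beta * d * base ** (beta - 1)
                - 4.0 * beta * (beta - 1) * base ** (beta - 2) * r2)  # trace of d^2 k / dx dy
    return k, grad_x_k, grad_y_k, div_xy_k

def stein_kernel_matrix(X, score, c=1.0, beta=-0.5):
    """Langevin Stein kernel k_P(x_i, x_j) evaluated on all pairs of rows of X.
    `score(X)` must return the rows of grad log p evaluated at the rows of X."""
    S = score(X)                                         # (n, d)
    k, gx, gy, div_xy = imq_kernel_terms(X, X, c, beta)
    term1 = div_xy
    term2 = np.einsum('ijd,jd->ij', gx, S)               # <grad_x k, score(x_j)>
    term3 = np.einsum('ijd,id->ij', gy, S)               # <grad_y k, score(x_i)>
    term4 = k * (S @ S.T)                                # k * <score(x_i), score(x_j)>
    return term1 + term2 + term3 + term4

def ksd(X, score, c=1.0, beta=-0.5):
    """V-statistic estimate of the kernel Stein discrepancy between the
    empirical distribution of X and the target with score function `score`."""
    KP = stein_kernel_matrix(X, score, c, beta)
    return np.sqrt(np.mean(KP))

# Example: samples assessed against a standard Gaussian target,
# whose score function is grad log p(x) = -x.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    print(ksd(X, score=lambda x: -x))
```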
Here the coefficient appearing in a diffusion Stein operator is a matrix-valued function determined by the infinitesimal generator of the diffusion.
Stein discrepancy can sometimes be computed in challenging settings where the probability distribution of interest is difficult to work with directly, for example when its density can be evaluated only up to an unknown normalisation constant.
A basic requirement of Stein discrepancy is that it is a statistical divergence, meaning that D_P(Q) \geq 0 for all Q and that D_P(Q) = 0 if and only if Q = P.
This property can be shown to hold for classical Stein discrepancy[1] and kernel Stein discrepancy,[6][7][8] provided that appropriate regularity conditions hold.
A stronger property, compared to being a statistical divergence, is convergence control, meaning that D_P(Q_n) \to 0 implies that the sequence of distributions (Q_n) converges to P in an appropriate sense.
For example, under appropriate regularity conditions, both the classical Stein discrepancy and graph Stein discrepancy enjoy Wasserstein convergence control, meaning that D_P(Q_n) \to 0 implies that the Wasserstein distance between Q_n and P converges to zero.[1][19][9] For the kernel Stein discrepancy, weak convergence control has been established[8][20] under regularity conditions on the distribution P and on the reproducing kernel K, which are satisfied in particular by the inverse multiquadric kernel. In contrast, some kernel Stein discrepancies, such as those based on the Gaussian kernel, provably do not enjoy weak convergence control.
The complementary property of convergence detection requires that the Stein discrepancy converges to zero whenever Q_n converges to P. For kernel Stein discrepancy, Wasserstein convergence detection has been established,[8] under appropriate regularity conditions on the distribution P and on the kernel K.
Given a probability distribution P defined on \mathcal{X}, the quantisation task is to select a small number of states x_1, \dots, x_n \in \mathcal{X} such that the associated empirical distribution

Q_n = \frac{1}{n} \sum_{i=1}^{n} \delta(x_i)

is a good approximation of P. Stein points are the states obtained by performing this quantisation through minimisation of a Stein discrepancy,

(x_1, \dots, x_n) \in \operatorname*{arg\,min}_{x_1, \dots, x_n \in \mathcal{X}} D_P\!\left( \frac{1}{n} \sum_{i=1}^{n} \delta(x_i) \right),     (3.1)

and, under appropriate regularity conditions, the resulting empirical distribution can be shown to converge to P.
Extensions of this result, to allow for imperfect numerical optimisation, have also been derived.
[20][22][21] Sophisticated optimisation algorithms have been designed to perform efficient quantisation based on Stein discrepancy, including gradient flow algorithms that aim to minimise kernel Stein discrepancy over an appropriate space of probability measures.
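As an illustration of quantisation driven by Stein discrepancy, the sketch below greedily grows a point set by repeatedly appending the candidate that most reduces the squared kernel Stein discrepancy. It assumes a precomputed Stein kernel matrix (for example from the stein_kernel_matrix sketch above), and selecting from a finite candidate pool is a simplification of the methods cited here, which typically optimise over the whole domain or employ gradient-flow dynamics; the same greedy rule, applied to a fixed pool of sample points, underlies the Stein thinning approach mentioned below.

```python
import numpy as np

def greedy_stein_selection(KP, n_points):
    """Greedy quantisation based on a precomputed Stein kernel matrix.

    `KP[i, j]` is assumed to hold k_P(x_i, x_j) for a pool of candidate states
    x_1, ..., x_N.  Adding a candidate x to the selected set increases the sum
    sum_{i,j} k_P(x_i, x_j) by  k_P(x, x) + 2 * sum_{selected} k_P(x, x_i),
    so at each step the candidate minimising this increment is appended.
    Candidates may be selected more than once."""
    diag = np.diag(KP)
    running = np.zeros(KP.shape[0])   # sum of k_P(x, x_i) over currently selected x_i
    selected = []
    for _ in range(n_points):
        best = int(np.argmin(diag + 2.0 * running))
        selected.append(best)
        running += KP[:, best]
    return selected
```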
[23] If one is allowed to consider weighted combinations of point masses, then more accurate approximation is possible compared to (3.1).
[5] Some authors[24][25] consider imposing, in addition, a non-negativity constraint on the weights, i.e. requiring w_i \geq 0 for each weight w_i in the weighted approximation \sum_{i=1}^{n} w_i \delta(x_i).
However, computation of the optimal weights can involve solving linear systems of equations that are numerically ill-conditioned.
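A minimal sketch of the weighted construction, again assuming a precomputed Stein kernel matrix: the weights that minimise the squared kernel Stein discrepancy subject to summing to one are obtained from a linear system, and the small ridge term below is one simple, illustrative way of coping with the ill-conditioning just mentioned. Enforcing the non-negativity constraint in addition would instead require a quadratic programming solver.

```python
import numpy as np

def stein_importance_weights(KP, ridge=1e-8):
    """Weights w minimising the squared kernel Stein discrepancy  w^T KP w
    of the weighted empirical distribution sum_i w_i * delta(x_i),
    subject to sum_i w_i = 1 (no non-negativity constraint).

    `KP` is a precomputed Stein kernel matrix, e.g. from the sketch above.
    The equality-constrained minimiser is proportional to KP^{-1} 1; a small
    ridge term is added because KP is often numerically ill-conditioned."""
    n = KP.shape[0]
    w = np.linalg.solve(KP + ridge * np.eye(n), np.ones(n))
    return w / w.sum()
```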
In particular, the greedy Stein thinning algorithm, which sequentially selects states so as to minimise the kernel Stein discrepancy of the retained set, has been shown to satisfy an explicit error bound. Non-myopic and mini-batch generalisations of the greedy algorithm have been demonstrated[26] to yield further improvement in approximation quality relative to computational cost.
The aim is then to find the element of the variational family that best approximates the distribution P of interest. A possible advantage of Stein discrepancy in this context,[28] compared to the traditional Kullback–Leibler variational objective, is that the Stein discrepancy can be estimated using samples from Q alone, without requiring an explicit density for Q. This property can be used to circumvent the use of flow-based generative models, for example, which impose diffeomorphism constraints in order to enforce absolute continuity of the variational distribution.
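In symbols, and using the notation introduced above, the variational objective being described can be sketched as follows; the symbol \mathcal{Q} for the variational family is an illustrative choice.

```latex
% Variational inference with a Stein discrepancy objective: select the member
% of the variational family Q that is closest to the target P, as measured by
% the Stein discrepancy,
\[
  Q^{\star} \;\in\; \operatorname*{arg\,min}_{Q \in \mathcal{Q}} \; D_P(Q) .
\]
% For a kernel Stein discrepancy,  D_P(Q)^2 = E_{X, X' ~ Q}[ k_P(X, X') ]  can
% be estimated from independent samples drawn from Q, so no explicit density
% of Q is needed to evaluate the objective.
```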
Stein discrepancy has been proposed as a tool to fit parametric statistical models to data.
Given a dataset and a parametric family of candidate models \{P_\theta : \theta \in \Theta\}, the aim is to find a parameter \theta for which P_\theta is compatible with the dataset, using a minimum Stein discrepancy estimator.[29] The approach is closely related to the framework of minimum distance estimation, with the role of the "distance" being played by the Stein discrepancy.
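A sketch of the estimator being described, writing \{P_\theta : \theta \in \Theta\} for the parametric model and Q_n for the empirical distribution of the n data points (these symbols are illustrative and need not match those of the cited work):

```latex
% Minimum Stein discrepancy estimation: choose the parameter whose model is
% closest to the data, as measured by a Stein discrepancy,
\[
  \hat{\theta}_n \;\in\; \operatorname*{arg\,min}_{\theta \in \Theta}\;
    D_{P_\theta}(Q_n),
  \qquad
  Q_n \;=\; \frac{1}{n} \sum_{i=1}^{n} \delta(x_i).
\]
% Because Stein operators typically require the density of P_theta only up to
% its normalisation constant, the objective remains computable for
% unnormalised models.
```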
Alternatively, a generalised Bayesian approach to estimation of the parameter \theta can be considered,[4] where, given a prior probability distribution with density function \pi(\theta), one constructs a generalised posterior whose probability density function is proportional to the prior density multiplied by an exponentiated Stein discrepancy loss, for some positive learning-rate parameter that must be specified.
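One common form for such a generalised posterior is sketched below; the use of the squared kernel Stein discrepancy as the loss, the scaling by the sample size n, and the symbol \beta for the learning-rate parameter follow the generalised Bayesian literature and may differ in detail from the cited construction.

```latex
% Generalised (Gibbs) posterior with a Stein discrepancy loss: the negative
% log-likelihood is replaced by a scaled discrepancy between the model P_theta
% and the empirical distribution Q_n of the data,
\[
  \pi_n(\theta) \;\propto\; \pi(\theta)\,
    \exp\!\bigl( -\beta\, n\, D_{P_\theta}(Q_n)^{2} \bigr),
  \qquad \beta > 0 .
\]
% The parameter beta controls how strongly the data inform the posterior and
% must be specified or calibrated.
```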
Stein discrepancy has also been used to construct goodness-of-fit tests for the distribution P.[30] Since such tests have a computational cost quadratic in the sample size, alternatives have been developed with (near-)linear runtimes.