Stein discrepancy

A Stein discrepancy is a statistical divergence between two probability distributions that is constructed using Stein's method. It was first formulated as a tool to assess the quality of Markov chain Monte Carlo samplers,[1] but has since been used in diverse settings in statistics, machine learning and computer science.

A standard approach to comparing two probability distributions $P$ and $Q$, defined on a common measurable space $\mathcal{X}$, is provided by an integral probability metric[3]

$$d_{\mathcal{M}}(P, Q) := \sup_{m \in \mathcal{M}} \left| \mathbb{E}_{X \sim P}[m(X)] - \mathbb{E}_{Y \sim Q}[m(Y)] \right|, \qquad (1.1)$$

where, for the purposes of exposition, we assume that the expectations exist and that the set $\mathcal{M}$ is sufficiently rich that (1.1) is indeed a metric on the set of probability distributions on $\mathcal{X}$.
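A Stein discrepancy modifies this construction so that expectations with respect to $P$ need not be computed. As a sketch of one standard formulation (the notation here is illustrative, and specific choices vary between the cited references), one selects a Stein operator $\mathcal{A}_P$ and a Stein set $\mathcal{M}$ satisfying $\mathbb{E}_{X \sim P}[\mathcal{A}_P m(X)] = 0$ for all $m \in \mathcal{M}$, so that the term involving $P$ in (1.1) vanishes and one obtains

$$D_P(Q) := \sup_{m \in \mathcal{M}} \left| \mathbb{E}_{X \sim Q}[\mathcal{A}_P m(X)] \right|.$$

For a distribution $P$ on $\mathbb{R}^d$ with a smooth, positive density function $p$, a widely used choice is the Langevin Stein operator $\mathcal{A}_P m(x) = \nabla \cdot m(x) + m(x) \cdot \nabla \log p(x)$, which depends on $p$ only through $\nabla \log p$ and is therefore unaffected by the normalisation constant of $p$.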

The graph Stein discrepancy (GSD) is in fact the solution of a finite-dimensional linear program, with the size of the linear program determined by the number of states in the support of $Q$, so that it can be computed exactly.

Alternatively, let the Stein set $\mathcal{M}$ be the unit ball in a (possibly vector-valued) reproducing kernel Hilbert space $\mathcal{H}(K)$ with reproducing kernel $K$; then the Stein discrepancy takes the closed form

$$D_P(Q) = \sqrt{ \mathbb{E}_{X, X' \sim Q} \left[ k_P(X, X') \right] },$$

where $X$ and $X'$ are independent and $k_P$ is the Stein kernel, obtained by applying the Stein operator $\mathcal{A}_P$ to the kernel $K$ in each of its two arguments. A Stein discrepancy constructed in this manner is called a kernel Stein discrepancy[5][6][7][8] and the construction is closely connected to the theory of kernel embedding of probability distributions.
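As a concrete numerical illustration (a minimal sketch, not code from the cited references), the following Python function estimates a kernel Stein discrepancy from samples, assuming the Langevin Stein operator and a Gaussian kernel with a fixed lengthscale; the function name, the kernel choice and the example target are illustrative, and in practice kernels with heavier tails (such as the inverse multiquadric kernel) are often preferred.

import numpy as np

def ksd_gaussian(x, score, lengthscale=1.0):
    # Kernel Stein discrepancy of the empirical distribution of the rows of x,
    # for a target P whose score function (gradient of the log-density) is `score`.
    # Uses the Langevin Stein operator and the Gaussian kernel
    # k(x, y) = exp(-||x - y||^2 / (2 * lengthscale^2)).
    n, d = x.shape
    s = score(x)                                     # (n, d) score evaluations; only grad log p is needed
    diff = x[:, None, :] - x[None, :, :]             # (n, n, d) pairwise differences x_i - x_j
    sq_dist = np.sum(diff ** 2, axis=-1)             # (n, n) squared Euclidean distances
    l2 = lengthscale ** 2
    k = np.exp(-sq_dist / (2 * l2))                  # base kernel matrix
    term1 = s @ s.T                                  # s(x_i) . s(x_j)
    term2 = np.einsum('id,ijd->ij', s, diff) / l2    # s(x_i) . grad_y k(x_i, x_j) / k
    term3 = -np.einsum('jd,ijd->ij', s, diff) / l2   # s(x_j) . grad_x k(x_i, x_j) / k
    term4 = d / l2 - sq_dist / l2 ** 2               # trace of the mixed second derivative / k
    kp = k * (term1 + term2 + term3 + term4)         # Stein kernel matrix k_P(x_i, x_j)
    return np.sqrt(np.mean(kp))                      # V-statistic estimate of the discrepancy

# Illustration: a standard Gaussian target, whose score function is z -> -z.
rng = np.random.default_rng(0)
samples = rng.normal(size=(500, 2))
print(ksd_gaussian(samples, score=lambda z: -z))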

Stein operators can also be derived from the infinitesimal generators of diffusion processes that leave $P$ invariant, giving rise to diffusion Stein discrepancies; in this construction, the Stein operator involves a matrix-valued function determined by the infinitesimal generator of the diffusion.
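One representative form reported in the literature on diffusion Stein discrepancies (a sketch only; the precise conditions and notation vary between references) applies the operator

$$\mathcal{A}_P m(x) = \frac{1}{p(x)} \nabla \cdot \big( p(x)\, M(x)\, m(x) \big)$$

to functions $m$, where $p$ is the density of $P$ and $M$ is the matrix-valued function mentioned above; taking $M$ to be the identity matrix recovers the Langevin Stein operator.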

Stein discrepancy can sometimes be computed in challenging settings where the probability distribution $P$ admits a probability density function $p$ that is known only up to an intractable normalisation constant.

A basic requirement of Stein discrepancy is that it is a statistical divergence, meaning that $D_P(Q) \geq 0$ for all probability distributions $Q$, with $D_P(Q) = 0$ if and only if $Q = P$.

This property can be shown to hold for the classical Stein discrepancy[1] and the kernel Stein discrepancy,[6][7][8] provided that appropriate regularity conditions hold.

A stronger property, compared to being a statistical divergence, is convergence control, meaning that $D_P(Q_n) \to 0$ implies that the sequence $(Q_n)$ converges to $P$ in an appropriate sense.

For example, under appropriate regularity conditions, both the classical Stein discrepancy and the graph Stein discrepancy enjoy Wasserstein convergence control, meaning that $D_P(Q_n) \to 0$ implies that the Wasserstein distance between $Q_n$ and $P$ converges to zero.[1][19][9] For the kernel Stein discrepancy, weak convergence control has been established[8][20] under regularity conditions on the distribution $P$ and on the reproducing kernel $K$; these conditions are satisfied, for example, by the inverse multiquadric kernel. In contrast, kernel Stein discrepancies based on kernels with rapidly decaying tails, such as those based on the Gaussian kernel, provably do not enjoy weak convergence control.

A complementary property is convergence detection, meaning that $D_P(Q_n) \to 0$ whenever the sequence $(Q_n)$ converges to $P$. For kernel Stein discrepancy, Wasserstein convergence detection has been established,[8] under appropriate regularity conditions on the distribution $P$ and on the reproducing kernel $K$.

Given a probability distribution $P$ of interest, the quantisation task is to select a small number of states $x_1, \ldots, x_n$ such that the associated discrete distribution $\frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$ is an accurate approximation of $P$. Stein discrepancy can be used to perform this selection, by choosing states that solve

$$\min_{x_1, \ldots, x_n} D_P\!\left( \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i} \right). \qquad (3.1)$$

Extensions of this result, to allow for imperfect numerical optimisation, have also been derived.[20][22][21] Sophisticated optimisation algorithms have been designed to perform efficient quantisation based on Stein discrepancy, including gradient flow algorithms that aim to minimise kernel Stein discrepancy over an appropriate space of probability measures.[23]

If one is allowed to consider weighted combinations of point masses, then more accurate approximation is possible compared to (3.1); in this case one selects weights $w_1, \ldots, w_n$, with $\sum_{i=1}^{n} w_i = 1$, that minimise the Stein discrepancy of the weighted distribution $\sum_{i=1}^{n} w_i \delta_{x_i}$.[5] Some authors[24][25] consider imposing, in addition, a non-negativity constraint on the weights, i.e. $w_i \geq 0$ for each $i$.

Computation of such optimal weights can involve solving linear systems of equations that are numerically ill-conditioned.
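To sketch where these linear systems arise (assuming, for illustration, that a kernel Stein discrepancy is used): the squared discrepancy of the weighted distribution is the quadratic form $w^\top K_P w$, where $(K_P)_{ij} = k_P(x_i, x_j)$ is the Stein kernel matrix, and minimising it subject to $\sum_i w_i = 1$ (ignoring any non-negativity constraint) yields

$$w^{\star} = \frac{K_P^{-1} \mathbf{1}}{\mathbf{1}^\top K_P^{-1} \mathbf{1}},$$

so that the weights are obtained by solving a linear system involving $K_P$, which may be ill-conditioned when some of the states $x_i$ are close together.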

In particular, the greedy Stein thinning algorithm has been shown to satisfy an explicit error bound on the Stein discrepancy of the resulting approximation. Non-myopic and mini-batch generalisations of the greedy algorithm have been demonstrated[26] to yield further improvement in approximation quality relative to computational cost.
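The greedy selection step can be sketched in a few lines of Python (an illustrative sketch only, not the reference implementation of Stein thinning): given the Stein kernel matrix over a set of candidate states, computed for example as in the earlier sketch, each iteration adds the candidate that most reduces the squared kernel Stein discrepancy of the empirical distribution on the selected states.

import numpy as np

def greedy_stein_thinning(kp, m):
    # Greedily select m indices from n candidates so as to minimise the kernel
    # Stein discrepancy of the empirical distribution on the selected states.
    # kp is the (n, n) Stein kernel matrix k_P(x_i, x_j) over the candidates.
    n = kp.shape[0]
    selected = []
    running = np.zeros(n)               # running[i] = sum over selected j of k_P(x_j, x_i)
    for _ in range(m):
        objective = np.diag(kp) / 2 + running
        i = int(np.argmin(objective))   # candidate giving the smallest increase in squared discrepancy
        selected.append(i)
        running += kp[i]
    return selected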

Stein discrepancy can also be used as a variational objective in variational inference, in which an approximation $Q_\theta$, belonging to a parametric family indexed by $\theta$, is sought for a distribution $P$ of interest: $\operatorname{arg\,min}_{\theta} D_P(Q_\theta)$. A possible advantage of Stein discrepancy in this context,[28] compared to the traditional Kullback–Leibler variational objective, is that the Stein discrepancy can be estimated using samples from $Q_\theta$ alone, so that $Q_\theta$ is not required to admit a tractable probability density function. This property can be used to circumvent the use of flow-based generative models, for example, which impose diffeomorphism constraints in order to enforce absolute continuity of $Q_\theta$.
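To make the comparison concrete (a sketch, in which $q_\theta$ denotes the density of $Q_\theta$ when it exists and $p$ the density of $P$): the Kullback–Leibler objective

$$\mathrm{KL}(Q_\theta \,\|\, P) = \mathbb{E}_{X \sim Q_\theta}\big[ \log q_\theta(X) - \log p(X) \big]$$

requires the evaluation of $\log q_\theta$, whereas a kernel Stein discrepancy objective can be estimated from samples $X_1, \ldots, X_m \sim Q_\theta$ alone, for example via the V-statistic $\frac{1}{m^2} \sum_{i,j=1}^{m} k_P(X_i, X_j)$, without $q_\theta$ being available in closed form.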

Stein discrepancy has been proposed as a tool to fit parametric statistical models to data.

The aim is to identify, from a parametric family $\{P_\theta : \theta \in \Theta\}$, a distribution $P_\theta$ which is compatible with the dataset $x_1, \ldots, x_n$, using a minimum Stein discrepancy estimator[29] such as $\hat{\theta}_n \in \operatorname{arg\,min}_{\theta \in \Theta} D_{P_\theta}\!\left( \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i} \right)$. The approach is closely related to the framework of minimum distance estimation, with the role of the "distance" being played by the Stein discrepancy.
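As an illustration of the idea (a minimal sketch, reusing the illustrative ksd_gaussian function from the earlier sketch, with an artificial dataset and model): the location parameter of a unit-variance Gaussian model can be estimated by numerically minimising the kernel Stein discrepancy between the model and the empirical distribution of the data.

import numpy as np
from scipy.optimize import minimize

# Artificial dataset; the model P_theta is N(theta, I), whose score function is
# x -> -(x - theta), so the normalising constant of the model is never needed.
data = np.random.default_rng(1).normal(loc=3.0, size=(200, 1))

def objective(theta):
    # Kernel Stein discrepancy between P_theta and the empirical distribution
    # of the data, using the ksd_gaussian sketch defined earlier.
    return ksd_gaussian(data, score=lambda x: -(x - theta))

result = minimize(objective, x0=np.zeros(1), method="Nelder-Mead")
print(result.x)  # minimum Stein discrepancy estimate of the location parameter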

Alternatively, a generalised Bayesian approach to estimation of the parameter $\theta$ can be considered,[4] where, given a prior probability distribution with density function $\pi(\theta)$, one constructs a generalised posterior with probability density function

$$\pi_\beta(\theta \mid x_1, \ldots, x_n) \propto \pi(\theta) \exp\big( -\beta\, L_n(\theta) \big)$$

for some $\beta > 0$, where $L_n(\theta)$ is a loss function based on the Stein discrepancy between $P_\theta$ and the empirical distribution of the dataset.

Stein discrepancy has also been used to construct goodness-of-fit tests, in which the discrepancy between a model $P$ and the empirical distribution of the data is used as a test statistic.[30] Since the aforementioned tests have a computational cost quadratic in the sample size, alternatives have been developed with (near-)linear runtimes.
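In the kernel case, a representative choice of test statistic (sketched here; details differ between the cited tests) is the V-statistic estimate of the squared kernel Stein discrepancy,

$$T_n = \frac{1}{n^2} \sum_{i,j=1}^{n} k_P(X_i, X_j),$$

with the distribution of $T_n$ under the null hypothesis approximated, for example, by bootstrap methods.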

Optimal quantisation using Stein discrepancy. The contours in this video represent level sets of a continuous probability distribution $P$, and we consider the task of summarising this distribution with a discrete set of states selected from its domain. In particular, we suppose that the density function of $P$ is known only up to proportionality, a setting where Markov chain Monte Carlo (MCMC) methods are widely used. In the first half of this video a Markov chain produces samples that are approximately distributed from $P$, with the sample path shown in black. In the second half of the video an algorithm, called Stein thinning,[21] is applied to select a subset of states from the sample path, with selected states shown in red. These states are selected based on greedy minimisation of a Stein discrepancy between the discrete distribution and $P$. Together, the selected states provide an approximation of $P$ that, in this instance, is more accurate than that provided by the original MCMC output.