In machine learning, the kernel embedding of distributions (also called the kernel mean or mean map) comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS).[1] A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis.
This learning framework is very general and can be applied to distributions over any space on which a sensible kernel function can be defined; for example, kernels have been proposed for data types including vectors in $\mathbb{R}^d$, discrete classes/categories, strings, graphs/networks, images, time series, manifolds, dynamical systems, and other structured objects.
The theory behind kernel embeddings of distributions has been primarily developed by Alex Smola, Le Song, Arthur Gretton, and Bernhard Schölkopf.
Commonly, methods for modeling complex distributions rely on parametric assumptions that may be unfounded or computationally challenging (e.g. Gaussian mixture models), while nonparametric methods like kernel density estimation (note: the smoothing kernels in this context have a different interpretation than the kernels discussed here) or characteristic function representation (via the Fourier transform of the distribution) break down in high-dimensional settings.[6]
Methods based on the kernel embedding of distributions sidestep these problems and also possess several advantages:[6] data may be modeled without restrictive assumptions about the form of the distributions, intermediate density estimation is not needed, and, when a characteristic kernel is used, the embedding uniquely determines the underlying distribution while (thanks to the kernel trick) computations on the potentially infinite-dimensional RKHS reduce in practice to Gram-matrix operations. Thus, learning via the kernel embedding of distributions offers a principled drop-in replacement for information theoretic approaches and is a framework which not only subsumes many popular methods in machine learning and statistics as special cases, but also can lead to entirely new learning algorithms.
The joint distribution of two random variables $X$ and $Y$ can be embedded into a tensor product feature space via the expected tensor product of feature maps, $\mathcal{C}_{XY} = \mathbb{E}[\varphi(X) \otimes \varphi(Y)]$.[2] By the equivalence between a tensor and a linear map, this joint embedding may be interpreted as an uncentered cross-covariance operator $\mathcal{C}_{XY} : \mathcal{H} \to \mathcal{H}$, characterized by $\langle f, \mathcal{C}_{XY}\, g \rangle_{\mathcal{H}} = \mathbb{E}[f(X)\, g(Y)]$ for all $f, g \in \mathcal{H}$.
This section illustrates how basic probabilistic rules may be reformulated as (multi)linear algebraic operations in the kernel embedding framework and is primarily based on the work of Song et al.[2][8] The following notation is adopted: $k$ denotes the kernel with feature map $\varphi(x) = k(x, \cdot)$, $\mathcal{H}$ the associated RKHS, and $\mu_X = \mathbb{E}[\varphi(X)]$ the embedding of the distribution of $X$. In practice, all embeddings are empirically estimated from data: given samples $\{x_1, \dots, x_n\}$ drawn from $P(X)$, the mean embedding is estimated as $\widehat{\mu}_X = \frac{1}{n} \sum_{i=1}^n \varphi(x_i)$, and the cross-covariance operator is estimated analogously from paired samples.
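As a concrete illustration, the following is a minimal sketch (not taken from the cited works) of evaluating an empirical mean embedding from samples; the Gaussian RBF kernel and the bandwidth `sigma` are illustrative assumptions.

```python
# Minimal sketch: empirical kernel mean embedding with a Gaussian RBF kernel.
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gram matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def empirical_mean_embedding(X, T, sigma=1.0):
    """Evaluate mu_hat_X(t) = (1/n) * sum_i k(x_i, t) at each query point t in T."""
    return rbf_kernel(np.atleast_2d(T), X, sigma).mean(axis=1)

# Example: embed n samples from P(X) and evaluate the embedding on a grid.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
grid = np.linspace(-3, 3, 7)[:, None]
print(empirical_mean_embedding(X, grid, sigma=0.5))
```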
In practical implementations, the kernel chain rule is computed from Gram matrices over the training samples, with regularized matrix inverses taking the place of inverse covariance operators.

In probability theory, a posterior distribution can be expressed in terms of a prior distribution and a likelihood function as
$$Q(Y \mid x) = \frac{P(x \mid Y)\, \pi(Y)}{Q(x)}, \qquad Q(x) = \int_{\Omega} P(x \mid y)\, \mathrm{d}\pi(y).$$
The analog of this rule in the kernel embedding framework expresses the kernel embedding of the conditional distribution in terms of conditional embedding operators which are modified by the prior distribution,
$$\mu_{Y \mid x}^{\pi} = \mathcal{C}_{Y \mid X}^{\pi}\, \varphi(x) = \mathcal{C}_{YX}^{\pi} \left( \mathcal{C}_{XX}^{\pi} \right)^{-1} \varphi(x),$$
where the prior-modified operators $\mathcal{C}_{YX}^{\pi}$ and $\mathcal{C}_{XX}^{\pi}$ are obtained from the kernel chain rule. In practical implementations, the kernel Bayes' rule is likewise computed from Gram matrices over the training samples, and two regularization parameters are used in this framework, one for each of the regularized operator inversions appearing in the estimator.
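The common computational building block of these rules is the empirical conditional embedding operator. The sketch below (not the full kernel Bayes' rule) shows how, under assumed choices of an RBF kernel, bandwidth `sigma`, and ridge parameter `lam`, the conditional mean embedding of $P(Y \mid X = x)$ reduces to a weight vector over the training sample.

```python
# Minimal sketch: empirical conditional mean embedding.  E[f(Y) | X = x] is
# approximated by sum_i beta_i(x) f(y_i), with beta(x) = (K + n*lam*I)^{-1} k_X(x).
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def conditional_embedding_weights(X, x_query, sigma=1.0, lam=1e-3):
    """Weights beta_i(x_query) representing the embedding of P(Y | X = x_query)."""
    n = X.shape[0]
    K = rbf_gram(X, X, sigma)                       # Gram matrix on the inputs
    k = rbf_gram(X, np.atleast_2d(x_query), sigma)  # kernel vector k_X(x_query)
    return np.linalg.solve(K + n * lam * np.eye(n), k).ravel()

# Example: estimate E[Y | X = 0.5] for Y = sin(2*pi*X) + noise (true value is 0).
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 1))
Y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=X.shape)
beta = conditional_embedding_weights(X, [0.5], sigma=0.1, lam=1e-3)
print(beta @ Y)
```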
The maximum mean discrepancy (MMD) is a distance measure between two distributions $P$ and $Q$ which is defined as the distance between their embeddings in the RKHS:[6]
$$\mathrm{MMD}(P, Q) = \left\| \mu_P - \mu_Q \right\|_{\mathcal{H}}.$$
While most distance measures between distributions, such as the widely used Kullback–Leibler divergence, either require density estimation (either parametrically or nonparametrically) or space partitioning/bias correction strategies,[6] the MMD is easily estimated as an empirical mean which is concentrated around the true value of the MMD. The characterization of this distance as the maximum mean discrepancy refers to the fact that computing the MMD is equivalent to finding the RKHS function (in the unit ball of $\mathcal{H}$) that maximizes the difference in expectations between the two probability distributions,
$$\mathrm{MMD}(P, Q) = \sup_{\|f\|_{\mathcal{H}} \le 1} \left( \mathbb{E}[f(X)] - \mathbb{E}[f(Y)] \right),$$
a form of integral probability metric.
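A hedged sketch of how the (biased, V-statistic) squared MMD can be estimated from two samples follows; the RBF kernel and bandwidth are illustrative assumptions, and unbiased U-statistic variants differ only in how the diagonal terms are handled.

```python
# Minimal sketch: biased empirical estimate of MMD^2(P, Q) = ||mu_P - mu_Q||^2.
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """MMD^2 estimate from samples X ~ P and Y ~ Q."""
    return (rbf_gram(X, X, sigma).mean()
            + rbf_gram(Y, Y, sigma).mean()
            - 2.0 * rbf_gram(X, Y, sigma).mean())

rng = np.random.default_rng(2)
same = mmd2(rng.normal(size=(500, 2)), rng.normal(size=(500, 2)))
diff = mmd2(rng.normal(size=(500, 2)), rng.normal(loc=1.0, size=(500, 2)))
print(same, diff)  # the second value should be markedly larger
```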
Although learning algorithms in the kernel embedding framework circumvent the need for intermediate density estimation, one may nonetheless use the empirical embedding to perform density estimation based on $n$ samples drawn from an underlying distribution $P_X^*$. This can be done by maximizing an entropy-like functional of a candidate distribution subject to the constraint that its kernel embedding lie close, in RKHS norm, to the empirical embedding $\widehat{\mu}_X$.[6]
The distribution which solves this optimization may be interpreted as a compromise between fitting the empirical kernel means of the samples well and still allocating a substantial portion of the probability mass to all regions of the probability space (much of which may not be represented in the training examples).
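The sketch below is a simplified rendering of this idea, fitting a discrete density on a fixed grid whose embedding matches the empirical embedding, with a small entropy bonus; the grid, kernel, and weighting parameters are illustrative assumptions rather than the formulation of the cited works.

```python
# Simplified sketch: fit grid weights w so that sum_j w_j*phi(z_j) matches the
# empirical mean embedding, with an entropy bonus spreading the probability mass.
import numpy as np
from scipy.optimize import minimize

def rbf_gram(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def embedding_density(X, grid, sigma=0.3, ent=1e-2):
    Kzz = rbf_gram(grid, grid, sigma)
    b = rbf_gram(grid, X, sigma).mean(axis=1)    # <phi(z_j), mu_hat_X>
    m = grid.shape[0]

    def objective(w):
        fit = w @ Kzz @ w - 2.0 * w @ b          # RKHS distance^2 (up to a constant)
        entropy = -np.sum(w * np.log(w + 1e-12))
        return fit - ent * entropy

    res = minimize(objective, np.full(m, 1.0 / m), method="SLSQP",
                   bounds=[(0.0, 1.0)] * m,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    return res.x

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 1))
grid = np.linspace(-4, 4, 41)[:, None]
w = embedding_density(X, grid)
print(grid[np.argmax(w), 0])  # the heaviest grid point should lie near the true mode 0
```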
Connections between the ideas underlying Gaussian processes and conditional random fields may be drawn with the estimation of conditional probability distributions in this fashion, if one views the feature mappings associated with the kernel as sufficient statistics in generalized (possibly infinite-dimensional) exponential families.
Measuring the statistical dependence between random variables $X$ and $Y$ (from any domains on which sensible kernels can be defined) can be formulated based on the Hilbert–Schmidt Independence Criterion[17]
$$\mathrm{HSIC}(X, Y) = \left\| \mathcal{C}_{XY} - \mu_X \otimes \mu_Y \right\|_{\mathrm{HS}}^2,$$
which can be used as a principled replacement for mutual information, Pearson correlation or any other dependence measure used in learning algorithms.
Given $n$ i.i.d. samples of each random variable, a simple parameter-free unbiased estimator of HSIC which exhibits concentration about the true value can be computed in $O(n^2)$ time from the Gram matrices of the two samples.
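For illustration, the following sketch computes the standard biased HSIC estimator $\mathrm{tr}(KHLH)/(n-1)^2$ (the unbiased variant referred to above differs slightly); the RBF kernels and bandwidths are illustrative assumptions.

```python
# Minimal sketch: biased HSIC estimator from centered Gram matrices.
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def hsic_biased(X, Y, sigma_x=1.0, sigma_y=1.0):
    n = X.shape[0]
    K = rbf_gram(X, X, sigma_x)
    L = rbf_gram(Y, Y, sigma_y)
    H = np.eye(n) - np.full((n, n), 1.0 / n)      # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 1))
print(hsic_biased(X, rng.normal(size=(300, 1))))                 # independent: near 0
print(hsic_biased(X, X ** 2 + 0.1 * rng.normal(size=(300, 1))))  # dependent: larger
```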
The desirable properties of HSIC have led to the formulation of numerous algorithms which utilize this dependence measure for a variety of common machine learning tasks such as: feature selection (BAHSIC [18]), clustering (CLUHSIC [19]), and dimensionality reduction (MUHSIC [20]).
HSIC can also be used to measure the joint dependence of more than two random variables; the question of when HSIC captures independence in this multivariate case has recently been studied.[21]

Belief propagation is a fundamental algorithm for inference in graphical models in which nodes repeatedly pass and receive messages corresponding to the evaluation of conditional expectations.
In a hidden Markov model (HMM), two key quantities of interest are the transition probabilities between hidden states and the emission probabilities of observations given hidden states. Using the kernel conditional distribution embedding framework, these quantities may be expressed in terms of samples from the HMM.
A serious limitation of the embedding methods in this domain is the need for training samples containing hidden states, as otherwise inference with arbitrary distributions in the HMM is not possible.
One common use of HMMs is filtering, in which the goal is to estimate the posterior distribution over the hidden state at time step $t$ given a history of the previous observations. If a training sample of hidden states and corresponding observations is given, one can in practice estimate the conditional embeddings of the transition and emission distributions, and filtering with kernel embeddings is thus implemented recursively: the weight vector over the training sample that represents the current belief state is updated after each new observation via a prediction step followed by a conditioning step.
Support measure machines (SMMs) generalize support vector machines (SVMs) to training examples which are probability distributions paired with labels.[22] SMMs solve the standard SVM dual optimization problem using the expected kernel
$$K(P, Q) = \mathbb{E}_{x \sim P,\, z \sim Q}\left[ k(x, z) \right] = \langle \mu_P, \mu_Q \rangle_{\mathcal{H}},$$
which is computable in closed form for many common distributions (such as the Gaussian) combined with popular embedding kernels $k$ (e.g. the Gaussian or polynomial kernel), and thus the SMM can be viewed as a flexible SVM in which a different data-dependent kernel (specified by the assumed form of the distribution attached to each training example) may be placed on each training point.
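For instance, when each training distribution is a Gaussian $N(m, S)$ and the embedding kernel is the Gaussian RBF kernel $k(x, z) = \exp(-\|x - z\|^2 / (2\sigma^2))$, the expected kernel admits the closed form used in the sketch below (an illustration under these assumptions, not a quote of the cited work):

```python
# Minimal sketch: closed-form expected RBF kernel between two Gaussian inputs,
#   E[k(x, z)] = |I + (S1 + S2)/sigma^2|^(-1/2)
#                * exp(-0.5 * (m1 - m2)^T (sigma^2 I + S1 + S2)^(-1) (m1 - m2)).
import numpy as np

def expected_rbf_kernel(m1, S1, m2, S2, sigma=1.0):
    d = len(m1)
    diff = np.asarray(m1, dtype=float) - np.asarray(m2, dtype=float)
    quad = diff @ np.linalg.solve(S1 + S2 + sigma ** 2 * np.eye(d), diff)
    det = np.linalg.det(np.eye(d) + (S1 + S2) / sigma ** 2)
    return det ** -0.5 * np.exp(-0.5 * quad)

# Example: the expected kernel shrinks as the two input distributions move apart.
S = 0.2 * np.eye(2)
print(expected_rbf_kernel([0, 0], S, [0, 0], S))
print(expected_rbf_kernel([0, 0], S, [2, 0], S))
```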
The goal of domain adaptation is the formulation of learning algorithms which generalize well when the training and test data have different distributions.
Domain adaptation problems typically involve differences between training and test domains such as covariate shift, target shift, or conditional shift.[23][24] By utilizing the kernel embedding of marginal and conditional distributions, practical approaches to deal with the presence of these types of differences between training and test domains can be formulated.
Target shift, in which the distribution of the outputs differs across domains, may be corrected by re-weighting the training examples with weights which solve an optimization problem that matches the embedding of the re-weighted training distribution to the embedding of the test distribution (where in practice, empirical approximations must be used).[23] To deal with location-scale (LS) conditional shift, one can perform an LS transformation of the training points, a coordinate-wise scaling and shifting of the training inputs, to obtain new transformed training data. The scaling and shifting parameters are estimated by minimizing the empirical kernel embedding distance between the transformed training data and the test data.[23] In general, the kernel embedding methods for dealing with LS conditional shift and target shift may be combined to find a reweighted transformation of the training data which mimics the test distribution, and these methods may perform well even in the presence of conditional shifts other than location-scale changes.
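The re-weighting idea can be sketched as follows (a simplified illustration in the spirit of kernel mean matching, not the exact procedure of [23]); the kernel, bandwidth, weight bounds, and regularization are illustrative assumptions:

```python
# Simplified sketch: choose nonnegative training weights whose weighted empirical
# embedding approximates the empirical embedding of the test inputs.
import numpy as np
from scipy.optimize import minimize

def rbf_gram(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def reweight(X_tr, X_te, sigma=1.0, reg=1e-3):
    """Weights w >= 0 with mean 1 minimizing ||(1/n) sum_i w_i phi(x_i^tr) - mu_hat_te||^2."""
    n = X_tr.shape[0]
    K = rbf_gram(X_tr, X_tr, sigma)
    kappa = rbf_gram(X_tr, X_te, sigma).mean(axis=1)

    def objective(w):
        return (w @ K @ w) / n ** 2 - 2.0 * (w @ kappa) / n + reg * (w @ w)

    res = minimize(objective, np.ones(n), method="SLSQP",
                   bounds=[(0.0, 10.0)] * n,
                   constraints=[{"type": "eq", "fun": lambda w: w.mean() - 1.0}])
    return res.x

rng = np.random.default_rng(5)
X_tr = rng.normal(0.0, 1.0, size=(100, 1))
X_te = rng.normal(1.0, 1.0, size=(100, 1))
w = reweight(X_tr, X_te, sigma=0.5)
print(np.average(X_tr.ravel(), weights=w))  # pulled toward the test mean of about 1
```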
Domain-invariant component analysis (DICA) thus extracts invariants, features that transfer across domains, and may be viewed as a generalization of many popular dimension-reduction methods such as kernel principal component analysis, transfer component analysis, and covariance operator inverse regression.[25]
In distribution regression, the goal is to regress from probability distributions to real- or vector-valued responses when only samples drawn from each input distribution are available (the two-stage sampled setting). Distribution regression has been successfully applied, for example, in supervised entropy learning and aerosol prediction using multispectral satellite images.
A simple approach is to represent each input sample set by its empirical mean embedding and to apply kernel ridge regression on these embeddings. Under mild regularity conditions this estimator can be shown to be consistent, and it can achieve the one-stage sampled minimax optimal rate (as if one had access to the true underlying distributions rather than only to samples from them).
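A hedged sketch of this two-stage procedure follows; the bag sizes, kernels, regression target, and ridge parameter are illustrative assumptions:

```python
# Simplified sketch: distribution regression via mean embeddings and kernel
# ridge regression, using the inner products of bag embeddings as the kernel.
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def set_kernel(bags_a, bags_b, sigma=1.0):
    """K[i, j] = <mu_hat_{A_i}, mu_hat_{B_j}> = mean over pairs of k(x, z)."""
    return np.array([[rbf_gram(A, B, sigma).mean() for B in bags_b] for A in bags_a])

rng = np.random.default_rng(6)
# Each bag holds samples from N(m, 1); the regression target is the bag parameter m.
means = rng.uniform(-2, 2, size=60)
bags = [rng.normal(m, 1.0, size=(50, 1)) for m in means]

K = set_kernel(bags, bags)
alpha = np.linalg.solve(K + 1e-3 * np.eye(len(bags)), means)   # kernel ridge regression

test_bag = [rng.normal(1.5, 1.0, size=(50, 1))]
print(set_kernel(test_bag, bags) @ alpha)   # predicted parameter, roughly 1.5
```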