Bayesian interpretation of kernel regularization

Within Bayesian statistics for machine learning, kernel methods arise from the assumption of an inner product space or similarity structure on inputs.

For some of these methods, such as support vector machines (SVMs), the original formulation and its regularization were not Bayesian in nature.

More recently, these methods have been extended to problems that deal with multiple outputs, such as in multi-task learning.[1]

A mathematical equivalence between the regularization and the Bayesian point of view is easily proved in cases where the reproducing kernel Hilbert space is finite-dimensional.

We start with a brief review of the main ideas underlying kernel methods for scalar learning, and briefly introduce the concepts of regularization and Gaussian processes.

The classical supervised learning problem requires estimating the output for some new input point $\mathbf{x}'$ by learning a scalar-valued estimator $\hat{f}(\mathbf{x}')$ on the basis of a training set $S$ consisting of $n$ input-output pairs, $S = (\mathbf{X}, \mathbf{Y}) = (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$.

Given a symmetric and positive bivariate function $k(\cdot, \cdot)$ called a kernel, one of the most popular estimators in machine learning is given by

$$\hat{f}(\mathbf{x}') = \mathbf{k}_{\mathbf{x}'}^\top (\mathbf{K} + \lambda n \mathbf{I})^{-1} \mathbf{Y}, \qquad (1)$$

where $\mathbf{K} \equiv k(\mathbf{X}, \mathbf{X})$ is the kernel matrix with entries $\mathbf{K}_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, $\mathbf{k}_{\mathbf{x}'} = [k(\mathbf{x}_1, \mathbf{x}'), \ldots, k(\mathbf{x}_n, \mathbf{x}')]^\top$, $\mathbf{Y} = [y_1, \ldots, y_n]^\top$, and $\lambda > 0$ is a regularization parameter.
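
A minimal sketch of this estimator in Python with NumPy is given below, assuming a Gaussian (RBF) kernel as the choice of $k$; the function names and toy data are illustrative and not part of the original formulation.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Gaussian (RBF) kernel: k(a, b) = exp(-||a - b||^2 / (2 l^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * length_scale ** 2))

def kernel_estimator(X, Y, X_new, lam=0.1, length_scale=1.0):
    """Estimator of equation (1): f_hat(x') = k_{x'}^T (K + lam*n*I)^{-1} Y."""
    n = X.shape[0]
    K = rbf_kernel(X, X, length_scale)           # kernel matrix, K_ij = k(x_i, x_j)
    k_new = rbf_kernel(X, X_new, length_scale)   # columns are k_{x'} for each new point
    c = np.linalg.solve(K + lam * n * np.eye(n), Y)   # coefficients (K + lam*n*I)^{-1} Y
    return k_new.T @ c

# toy data: noisy observations of a sine function (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
print(kernel_estimator(X, Y, X_new, lam=0.01))
```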

The main assumption in the regularization perspective is that the set of functions $\mathcal{F}$ is assumed to belong to a reproducing kernel Hilbert space $\mathcal{H}_k$.

For a function of the form $f(\mathbf{x}) = \sum_i c_i k(\mathbf{x}_i, \mathbf{x})$, the squared norm in an RKHS can be written as

$$\|f\|_k^2 = \sum_{i,j} c_i c_j k(\mathbf{x}_i, \mathbf{x}_j)$$

and could be viewed as measuring the complexity of the function.
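
As a small illustration, the squared norm of a function built from a few kernel centers is just the quadratic form $\mathbf{c}^\top \mathbf{K} \mathbf{c}$ in its coefficients; the sketch below assumes an RBF kernel and arbitrary illustrative values.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * length_scale ** 2))

# f(x) = sum_i c_i k(x_i, x) for a few centers x_i and coefficients c_i
X = np.array([[0.0], [1.0], [2.5]])
c = np.array([0.5, -1.0, 0.3])
K = rbf_kernel(X, X)

# squared RKHS norm: ||f||_k^2 = sum_{i,j} c_i c_j k(x_i, x_j) = c^T K c
norm_sq = c @ K @ c
print(norm_sq)
```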

The estimator in equation (1) arises as the minimizer of the regularized functional

$$\frac{1}{n} \sum_{i=1}^n (f(\mathbf{x}_i) - y_i)^2 + \lambda \|f\|_k^2. \qquad (2)$$

The first term in this functional, which measures the average of the squares of the errors between the $f(\mathbf{x}_i)$ and the $y_i$, is called the empirical risk and represents the cost we pay by predicting $f(\mathbf{x}_i)$ in place of the true value $y_i$.

The second term in the functional is the squared norm in an RKHS multiplied by a weight $\lambda$ and serves the purpose of stabilizing the problem[3][5] as well as of adding a trade-off between fitting and complexity of the estimator. The weight $\lambda$, called the regularizer, determines the degree to which instability and complexity of the estimator should be penalized (higher penalty for increasing values of $\lambda$).

The explicit form of the estimator in equation (1) is derived in two steps.

First, the representer theorem[9][10][11] states that the minimizer of the functional (2) can always be written as a linear combination of the kernels centered at the training-set points,

$$\hat{f}(\mathbf{x}') = \sum_{i=1}^n c_i k(\mathbf{x}_i, \mathbf{x}') = \mathbf{k}_{\mathbf{x}'}^\top \mathbf{c}, \qquad (3)$$

for some $\mathbf{c} \in \mathbb{R}^n$. Second, substituting this form into the functional (2) gives $\frac{1}{n} \|\mathbf{Y} - \mathbf{K}\mathbf{c}\|^2 + \lambda \mathbf{c}^\top \mathbf{K} \mathbf{c}$, which is convex in $\mathbf{c}$, so the coefficients are found by setting its gradient with respect to $\mathbf{c}$ to zero,

$$-\frac{1}{n} \mathbf{K}(\mathbf{Y} - \mathbf{K}\mathbf{c}) + \lambda \mathbf{K}\mathbf{c} = 0 \quad \Longrightarrow \quad \mathbf{c} = (\mathbf{K} + \lambda n \mathbf{I})^{-1} \mathbf{Y}.$$

Substituting this expression for the coefficients in equation (3), we obtain the estimator stated previously in equation (1),

$$\hat{f}(\mathbf{x}') = \mathbf{k}_{\mathbf{x}'}^\top (\mathbf{K} + \lambda n \mathbf{I})^{-1} \mathbf{Y}.$$

The notion of a kernel plays a crucial role in Bayesian probability as the covariance function of a stochastic process called the Gaussian process.
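
This derivation can be checked numerically: writing the functional (2) in terms of the coefficients gives $J(\mathbf{c}) = \frac{1}{n}\|\mathbf{Y} - \mathbf{K}\mathbf{c}\|^2 + \lambda \mathbf{c}^\top \mathbf{K}\mathbf{c}$, whose gradient vanishes at the closed-form solution. The sketch below uses illustrative data and an assumed RBF kernel.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * length_scale ** 2))

rng = np.random.default_rng(1)
n, lam = 20, 0.05
X = rng.uniform(-2, 2, size=(n, 1))
Y = np.cos(X[:, 0]) + 0.1 * rng.standard_normal(n)
K = rbf_kernel(X, X)

# closed-form coefficients obtained by setting the gradient of (2) to zero
c = np.linalg.solve(K + lam * n * np.eye(n), Y)

# gradient of J(c) = (1/n)||Y - Kc||^2 + lam * c^T K c
grad = -(2.0 / n) * K @ (Y - K @ c) + 2.0 * lam * K @ c
print(np.allclose(grad, 0.0, atol=1e-10))   # True: c is the stationary (minimizing) point
```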

As part of the Bayesian framework, the Gaussian process specifies the prior distribution that describes the prior beliefs about the properties of the function being modeled.

These beliefs are updated after taking into account observational data by means of a likelihood function that relates the prior beliefs to the observations.

A Gaussian process (GP) is a stochastic process in which any finite number of random variables that are sampled follow a joint Normal distribution.[12] The mean vector and covariance matrix of the Gaussian distribution completely specify the GP. GPs are usually used as prior distributions over functions, and as such the mean and covariance can be viewed as functions of the inputs; the covariance function is also called the kernel of the GP. Let a function $f$ follow a Gaussian process with mean function $m$ and kernel function $k$,

$$f \sim \mathcal{GP}(m, k).$$

In terms of the underlying Gaussian distribution, we have that for any finite set $\mathbf{X} = \{\mathbf{x}_i\}_{i=1}^n$, if we let $f(\mathbf{X}) = [f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)]^\top$, then

$$f(\mathbf{X}) \sim \mathcal{N}(\mathbf{m}, \mathbf{K}),$$

where $\mathbf{m} = m(\mathbf{X}) = [m(\mathbf{x}_1), \ldots, m(\mathbf{x}_n)]^\top$ is the mean vector and $\mathbf{K} = k(\mathbf{X}, \mathbf{X})$ is the covariance matrix of the multivariate Gaussian distribution.
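
Since $f(\mathbf{X})$ at any finite set of inputs is just a multivariate Gaussian, one can draw sample paths of the prior directly; the sketch below assumes a zero mean function and an RBF covariance, both illustrative choices.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * length_scale ** 2))

# finite set of inputs X; f(X) ~ N(m, K) with m = 0 and K = k(X, X)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
m = np.zeros(X.shape[0])
K = rbf_kernel(X, X)

# draw three sample paths of the GP prior evaluated at X
rng = np.random.default_rng(2)
samples = rng.multivariate_normal(m, K + 1e-8 * np.eye(len(X)), size=3)
print(samples.shape)   # (3, 50): each row is one sampled f(X)
```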

In a regression context, the likelihood function is usually assumed to be a Gaussian distribution and the observations to be independent and identically distributed (iid),

$$p(y \mid f, \mathbf{x}, \sigma^2) = \mathcal{N}(f(\mathbf{x}), \sigma^2).$$

This assumption corresponds to the observations being corrupted with zero-mean Gaussian noise with variance $\sigma^2$. The iid assumption makes it possible to factorize the likelihood function over the data points given the set of inputs $\mathbf{X}$ and the parameters $\theta$,

$$p(\mathbf{Y} \mid f, \mathbf{X}, \theta) = \prod_{i=1}^n p(y_i \mid f, \mathbf{x}_i, \sigma^2) = \mathcal{N}(f(\mathbf{X}), \sigma^2 \mathbf{I}),$$

where $\theta$ denotes the set of parameters which include the variance of the noise $\sigma^2$.[3][12][13]
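
The factorization can be made concrete by evaluating the log-likelihood both pointwise and as a single multivariate Gaussian with covariance $\sigma^2 \mathbf{I}$; the two agree, as the sketch below (with illustrative values) shows.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma2 = 10, 0.25
f_X = np.sin(np.linspace(0, 3, n))                      # f evaluated at the inputs
Y = f_X + np.sqrt(sigma2) * rng.standard_normal(n)      # observations with Gaussian noise

# sum of per-point Gaussian log-densities log p(y_i | f(x_i), sigma^2)
log_lik_factored = np.sum(
    -0.5 * np.log(2 * np.pi * sigma2) - (Y - f_X) ** 2 / (2 * sigma2)
)

# joint log-density of N(f(X), sigma^2 I) evaluated at Y
log_lik_joint = (
    -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((Y - f_X) ** 2) / (2 * sigma2)
)
print(np.allclose(log_lik_factored, log_lik_joint))   # True
```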

In the finite-dimensional case, every RKHS can be described in terms of a feature map $\Phi : \mathcal{X} \to \mathbb{R}^p$ such that

$$k(\mathbf{x}, \mathbf{x}') = \sum_{i=1}^p \Phi^i(\mathbf{x}) \Phi^i(\mathbf{x}').$$

Functions in the RKHS can then be written as $f_{\mathbf{w}}(\mathbf{x}) = \langle \mathbf{w}, \Phi(\mathbf{x}) \rangle$ with $\|f_{\mathbf{w}}\|_k = \|\mathbf{w}\|$, and a Gaussian prior $\mathbf{w} \sim \mathcal{N}(0, \mathbf{I})$ can be placed on the weights.
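
As an example of such a feature map, a homogeneous degree-2 polynomial kernel on $\mathbb{R}^2$ admits an explicit map into $\mathbb{R}^3$; the sketch below uses one standard choice of $\Phi$, an assumption for illustration rather than part of the original text.

```python
import numpy as np

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel k(x, z) = (x . z)^2 on R^2."""
    return (x @ z) ** 2

def feature_map(x):
    """Explicit finite-dimensional feature map Phi: R^2 -> R^3 for poly2_kernel."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([-0.5, 3.0])

# k(x, z) equals the inner product of the features, sum_i Phi^i(x) Phi^i(z)
print(np.isclose(poly2_kernel(x, z), feature_map(x) @ feature_map(z)))   # True
```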

The resulting posterior distribution is then given by

$$p(\mathbf{w} \mid \mathbf{Y}, \mathbf{X}, \sigma^2) \propto \exp\!\left( -\frac{1}{2\sigma^2} \|\mathbf{Y} - \boldsymbol{\Phi}^\top \mathbf{w}\|^2 - \frac{1}{2} \|\mathbf{w}\|^2 \right),$$

where $\boldsymbol{\Phi} = [\Phi(\mathbf{x}_1), \ldots, \Phi(\mathbf{x}_n)]$. We can see that a maximum a posteriori (MAP) estimate is equivalent to the minimization problem defining Tikhonov regularization, where in the Bayesian case the regularization parameter is related to the noise variance.
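
This equivalence can be verified numerically: with the prior $\mathbf{w} \sim \mathcal{N}(0, \mathbf{I})$ and Gaussian noise of variance $\sigma^2$, the posterior-mean (and MAP) prediction coincides with the estimator of equation (1) when $\lambda n = \sigma^2$. The sketch below uses illustrative data and an assumed RBF kernel.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * length_scale ** 2))

rng = np.random.default_rng(4)
n, sigma2 = 25, 0.04
X = rng.uniform(-3, 3, size=(n, 1))
Y = np.sin(X[:, 0]) + np.sqrt(sigma2) * rng.standard_normal(n)
X_new = np.linspace(-3, 3, 7).reshape(-1, 1)

K = rbf_kernel(X, X)
k_new = rbf_kernel(X, X_new)

# Bayesian (GP / MAP) prediction: posterior mean k_{x'}^T (K + sigma^2 I)^{-1} Y
gp_mean = k_new.T @ np.linalg.solve(K + sigma2 * np.eye(n), Y)

# Tikhonov / kernel ridge prediction of equation (1) with lambda = sigma^2 / n
lam = sigma2 / n
ridge = k_new.T @ np.linalg.solve(K + lam * n * np.eye(n), Y)

print(np.allclose(gp_mean, ridge))   # True: the two perspectives give the same estimator
```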

Whereas the loss function measures the error that is incurred when predicting $f(\mathbf{x})$ in place of $y$, the likelihood function measures how likely the observations are under the model that was assumed to be true in the generative process.