Kernel methods for vector output

Kernel methods are a well-established tool to analyze the relationship between input data and the corresponding output of a function.

In typical machine learning algorithms, these functions produce a scalar output.

Recent development of kernel methods for functions with vector-valued output is due, at least in part, to interest in simultaneously solving related problems.

Multi-label classification can be interpreted as mapping inputs to (binary) coding vectors with length equal to the number of classes.[2]

The use of probabilistic models and Gaussian processes was pioneered and largely developed in the context of geostatistics, where prediction over vector-valued output data is known as cokriging.

The regularization and kernel theory literature for vector-valued functions followed in the 2000s.[6][7]

While the Bayesian and regularization perspectives were developed independently, they are in fact closely related.[8]

Geostatistics literature calls this case heterotopic, and uses isotopic to indicate that each component of the output vector has the same set of inputs.[9]

Here, for simplicity in the notation, we assume the number and sample space of the data for each output are the same.

From the regularization perspective, the problem is to learn a function belonging to a reproducing kernel Hilbert space of vector-valued functions.

This is similar to the scalar case of Tikhonov regularization, with some extra care in the notation.

As in the scalar case, the solution can be written as a linear combination of the kernel evaluated at the training inputs; the coefficients are found by taking the derivative of the learning problem, setting it equal to zero, and substituting in this expression for the estimator.

It is possible, though non-trivial, to show that a representer theorem also holds for Tikhonov regularization in the vector-valued setting.
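
As a concrete illustration, the following minimal sketch solves the vector-valued Tikhonov problem in the form suggested by the representer theorem: stack the outputs, solve one regularized linear system in the block kernel matrix, and predict with the resulting vector-valued coefficients. The toy data, the RBF input kernel, and the output matrix B_out are all assumptions made for this example, not part of the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: N inputs in R^p, D-dimensional outputs (all values are assumed).
N, p, D = 50, 2, 3
X = rng.normal(size=(N, p))
Y = np.column_stack([np.sin(X[:, 0]), np.cos(X[:, 1]), X[:, 0] * X[:, 1]])
Y = Y + 0.05 * rng.normal(size=Y.shape)

def rbf(A, B, ell=1.0):
    # Scalar RBF kernel k(x, x') on the inputs.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

# Assumed separable matrix-valued kernel: K(x, x') = k(x, x') * B_out,
# where B_out encodes the relationships among the D outputs.
B_out = 0.5 * np.eye(D) + 0.5 * np.ones((D, D))

lam = 1e-2
K_big = np.kron(rbf(X, X), B_out)                # (N*D) x (N*D) block kernel matrix

# Representer-theorem form of the solution: stack the outputs and solve
# (K + lam * N * I) c = y for the stacked coefficient vector.
y_bar = Y.reshape(-1)                            # blocks ordered by data point
c_bar = np.linalg.solve(K_big + lam * N * np.eye(N * D), y_bar)
C = c_bar.reshape(N, D)                          # one D-dimensional coefficient per point

def predict(X_new):
    # f(x) = sum_i K(x_i, x) c_i = sum_i k(x_i, x) * (B_out @ c_i)
    return rbf(X_new, X) @ C @ B_out.T

print(np.abs(predict(X) - Y).mean())             # mean absolute training error on the toy data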

Equivalently, the matrix-valued kernel may be specified by a scalar kernel on the product of the input space and the set of output indices; an isometry exists between the Hilbert spaces associated with these two kernels.

The estimator of the vector-valued regularization framework can also be derived from a Bayesian viewpoint using Gaussian process methods in the case of a finite-dimensional reproducing kernel Hilbert space.

The derivation is similar to that of the scalar-valued case (see Bayesian interpretation of regularization).
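
The correspondence can be checked numerically: the Gaussian process posterior mean with i.i.d. noise variance equal to the regularization parameter times the number of samples coincides with the regularized estimator. The sketch below writes the two formulas side by side; the toy data and the output matrix B_out are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, lam = 30, 2, 0.05                           # toy sizes and regularization (assumed)
X = rng.normal(size=(N, 1))
Y = rng.normal(size=(N, D))
X_new = rng.normal(size=(5, 1))

def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

B_out = np.array([[1.0, 0.7], [0.7, 1.0]])        # assumed output matrix

# Regularization view: vector-valued kernel ridge coefficients and predictions.
K_big = np.kron(rbf(X, X), B_out)
C = np.linalg.solve(K_big + lam * N * np.eye(N * D), Y.reshape(-1)).reshape(N, D)
f_reg = rbf(X_new, X) @ C @ B_out.T

# Bayesian view: GP posterior mean with i.i.d. noise variance sigma^2 = lam * N.
sigma2 = lam * N
alpha = np.linalg.solve(K_big + sigma2 * np.eye(N * D), Y.reshape(-1))
K_cross = np.kron(rbf(X_new, X), B_out)           # cross-covariance, new vs. training
f_gp = (K_cross @ alpha).reshape(-1, D)

print(np.allclose(f_reg, f_gp))                   # True: the two estimators coincide
```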

A simple but broadly applicable class of multi-output kernels is separable: the product of a scalar kernel on the input space and a matrix that encodes the relationships among the outputs. Setting this output matrix to the identity matrix treats the outputs as unrelated and is equivalent to solving the scalar-output problems separately.
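
A small numerical check of this statement (the toy data and the RBF input kernel are assumptions of the sketch): with the identity as the output matrix, solving the joint block system gives exactly the same coefficients as solving each scalar-output problem on its own.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, lam = 30, 3, 0.1                            # toy sizes (assumed)
X = rng.normal(size=(N, 2))
Y = rng.normal(size=(N, D))

def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

Kx = rbf(X, X)

# Separable kernel with the identity as output matrix: joint block solve...
K_big = np.kron(Kx, np.eye(D))
C_joint = np.linalg.solve(
    K_big + lam * N * np.eye(N * D), Y.reshape(-1)
).reshape(N, D)

# ...equals D scalar-output problems solved independently, one per output.
C_separate = np.column_stack(
    [np.linalg.solve(Kx + lam * N * np.eye(N), Y[:, d]) for d in range(D)]
)

print(np.allclose(C_joint, C_separate))           # True
```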

Separable kernels can also be specified through a regularizer on the components of the vector-valued function; the simplest choice treats all the components as independent and is the same as solving the scalar problems separately.

Other choices couple the components, for example by grouping them together based on a cluster regularizer,[15] or through sparsity-based approaches which assume only a few of the features are needed.

The kernel derived from the linear model of coregionalization (LMC) is a sum of products of two covariance functions: one that models the dependence between the outputs, independently of the input vector (the coregionalization matrix), and one that models the input dependence, independently of the outputs (the covariance function on the inputs).

In the intrinsic coregionalization model (ICM), a simplified version of the LMC with a single covariance function on the inputs, this covariance function contributes equally to the construction of the autocovariances and cross covariances for the outputs.

Another simplified version of the LMC is the semiparametric latent factor model (SLFM), which corresponds to using a single latent function per input covariance function, so that each coregionalization matrix has rank one.
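
The sketch below builds these three covariance structures explicitly for toy data (the RBF input covariances and the random coregionalization matrices are assumptions made for illustration): the LMC kernel as a sum of Kronecker products, the ICM as the special case with a single term, and the SLFM as the special case with rank-one coregionalization matrices.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 40, 3                                       # toy sizes (assumed)
X = rng.normal(size=(N, 1))

def rbf(A, B, ell):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

ells = [0.5, 2.0]                                  # one input covariance per term (assumed)

def coregionalization_matrix(rank):
    # Random positive semi-definite D x D matrix of the given rank.
    W = rng.normal(size=(D, rank))
    return W @ W.T

# LMC: sum of terms, each a coregionalization matrix (output dependence)
# times a covariance function on the inputs (input dependence).
K_lmc = sum(np.kron(rbf(X, X, ell), coregionalization_matrix(2)) for ell in ells)

# ICM: a single term, so one input covariance builds all the
# autocovariances and cross covariances of the outputs.
K_icm = np.kron(rbf(X, X, 1.0), coregionalization_matrix(2))

# SLFM: rank-one coregionalization matrices, i.e. one latent function per
# input covariance function.
K_slfm = sum(np.kron(rbf(X, X, ell), coregionalization_matrix(1)) for ell in ells)

print(K_lmc.shape, K_icm.shape, K_slfm.shape)      # each (N*D, N*D)
```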

A non-trivial way to mix the latent functions is by convolving a base process with a smoothing kernel; if the base process is a Gaussian process, the convolved process is Gaussian as well.[21]
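
A minimal illustration of the idea (white noise standing in for the base process, and Gaussian smoothing kernels chosen arbitrarily): convolving one shared base process with a different smoothing kernel per output produces outputs that are correlated with each other.

```python
import numpy as np

rng = np.random.default_rng(4)

# Shared base process: white noise sampled on a grid (a stand-in assumption).
t = np.linspace(-5.0, 5.0, 501)
base = rng.normal(size=t.size)

def smoothing_kernel(width):
    # Gaussian smoothing kernel on the same grid, normalized to sum to one.
    g = np.exp(-0.5 * (t / width) ** 2)
    return g / g.sum()

# Each output convolves the same base process with its own smoothing kernel,
# which is what induces correlation between the outputs.
y1 = np.convolve(base, smoothing_kernel(0.3), mode="same")
y2 = np.convolve(base, smoothing_kernel(1.0), mode="same")

print(np.corrcoef(y1, y2)[0, 1])                   # clearly non-zero correlation
```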

When implementing an algorithm using any of the kernels above, practical issues of tuning the parameters and ensuring reasonable computation time must be addressed.

Approached from the regularization perspective, parameter tuning is similar to the scalar-valued case and can generally be accomplished with cross validation.
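
For example, the regularization parameter of the vector-valued kernel ridge estimator can be chosen by k-fold cross validation; the sketch below (toy data, an assumed RBF kernel, and the identity output matrix for simplicity) scans a small grid of values and keeps the one with the lowest held-out error.

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 60, 2                                       # toy sizes (assumed)
X = rng.normal(size=(N, 2))
Y = np.column_stack([np.sin(X[:, 0]), np.cos(X[:, 1])]) + 0.1 * rng.normal(size=(N, D))

def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def fit_predict(Xtr, Ytr, Xte, lam, B_out):
    # Vector-valued kernel ridge: solve the block system, then predict.
    n = len(Xtr)
    K_big = np.kron(rbf(Xtr, Xtr), B_out)
    C = np.linalg.solve(K_big + lam * n * np.eye(n * D), Ytr.reshape(-1)).reshape(n, D)
    return rbf(Xte, Xtr) @ C @ B_out.T

# 5-fold cross validation over a grid of regularization parameters (assumed grid).
B_out = np.eye(D)
folds = np.array_split(rng.permutation(N), 5)
errors = {}
for lam in [1e-3, 1e-2, 1e-1, 1.0]:
    err = 0.0
    for te in folds:
        tr = np.setdiff1d(np.arange(N), te)
        pred = fit_predict(X[tr], Y[tr], X[te], lam, B_out)
        err += ((pred - Y[te]) ** 2).mean()
    errors[lam] = err / len(folds)

print(min(errors, key=errors.get))                 # lambda with the lowest CV error
```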

Solving the required linear system is typically expensive in memory and time.

If the kernel is separable, however, a coordinate transform can convert the kernel matrix to a block-diagonal matrix, greatly reducing the computational burden by solving D independent subproblems (plus the eigendecomposition of the output matrix).

In particular, for a least squares loss function (Tikhonov regularization), there exists a closed form solution for the coefficients of each subproblem.
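
The following sketch demonstrates the shortcut for a separable kernel (the toy data, RBF kernel, and output matrix are assumptions of the example): eigendecomposing the output matrix and rotating the outputs turns the single large system into D small ones, and the result matches the direct solve.

```python
import numpy as np

rng = np.random.default_rng(6)
N, D, lam = 80, 4, 1e-2                            # toy sizes (assumed)
X = rng.normal(size=(N, 3))
Y = rng.normal(size=(N, D))

def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

Kx = rbf(X, X)
W = rng.normal(size=(D, D))
B_out = W @ W.T + np.eye(D)                        # assumed output matrix

# Direct approach: one (N*D) x (N*D) solve, expensive in time and memory.
K_big = np.kron(Kx, B_out)
C_direct = np.linalg.solve(
    K_big + lam * N * np.eye(N * D), Y.reshape(-1)
).reshape(N, D)

# Separable shortcut: eigendecompose the output matrix, rotate the outputs,
# solve D independent N x N systems (one per eigenvalue), rotate back.
evals, V = np.linalg.eigh(B_out)
Y_rot = Y @ V
C_rot = np.column_stack(
    [np.linalg.solve(e * Kx + lam * N * np.eye(N), Y_rot[:, d])
     for d, e in enumerate(evals)]
)
C_fast = C_rot @ V.T

print(np.allclose(C_direct, C_fast))               # True
```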

Some methods, such as maximization of the marginal likelihood (also known as evidence approximation, type II maximum likelihood, or empirical Bayes) and least squares, give point estimates of the vector of kernel parameters.
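
As an illustration of the marginal-likelihood route, the sketch below minimizes the negative log marginal likelihood of a multi-output Gaussian process with a separable kernel to obtain point estimates; the rank-one parameterization of the output matrix, the toy data, and the specific hyperparameters are assumptions chosen for the example.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
N, D = 40, 2                                       # toy sizes (assumed)
X = rng.normal(size=(N, 1))
Y = np.column_stack([np.sin(X[:, 0]), np.sin(X[:, 0] + 0.5)])
Y = Y + 0.1 * rng.normal(size=Y.shape)
y = Y.reshape(-1)

def rbf(A, B, ell):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def neg_log_marginal_likelihood(theta):
    # Hyperparameters (parameterization assumed for this sketch): log length-scale,
    # log noise variance, and a vector w giving a rank-one output matrix w w^T.
    log_ell, log_noise = theta[0], theta[1]
    w = theta[2:]
    B_out = np.outer(w, w) + 1e-6 * np.eye(D)
    K = np.kron(rbf(X, X, np.exp(log_ell)), B_out)
    K = K + (np.exp(log_noise) + 1e-8) * np.eye(N * D)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * N * D * np.log(2 * np.pi)

theta0 = np.concatenate([[0.0, -2.0], np.ones(D)])
result = minimize(neg_log_marginal_likelihood, theta0, method="L-BFGS-B")
print(result.x)                                    # point estimate of the hyperparameters
```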

There are also works employing a full Bayesian inference by assigning priors to the kernel parameters and computing the posterior distribution through a sampling procedure or a variational approach.

A summary of different methods for reducing computational complexity in multi-output Gaussian processes is presented in the literature.