Multiple kernel learning

Multiple kernel learning refers to a set of machine learning methods that use a predefined set of kernels and learn an optimal linear or non-linear combination of kernels as part of the algorithm.

Multiple kernel learning approaches have been used in many applications, such as event recognition in video,[1] object recognition in images,[2] and biomedical data fusion.

Most work has been done on the supervised learning case with linear combinations of kernels; however, many algorithms have also been developed for non-linear combinations and for semi-supervised and unsupervised settings.

The basic idea behind multiple kernel learning algorithms is to add an extra parameter to the minimization problem of the learning algorithm.

As an example, consider the case of supervised learning of a linear combination of a set of $n$ kernels,

$$K' = \sum_{i=1}^{n} \beta_i K_i,$$

where the coefficients $\beta_i$ are learned together with the parameters of the base algorithm. Since a nonnegative linear combination of kernels is itself a kernel, $K'$ can be used wherever a single kernel would be.
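As a concrete illustration, the sketch below forms a weighted sum of precomputed Gram matrices and trains an SVM on the combined kernel using scikit-learn; the weights here are hand-picked for illustration, whereas an MKL algorithm would learn them.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel
from sklearn.svm import SVC

# Toy data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Precompute a set of base Gram matrices K_i on the training data.
base_kernels = [
    linear_kernel(X),
    polynomial_kernel(X, degree=2),
    rbf_kernel(X, gamma=0.5),
]

# Hand-picked nonnegative weights beta_i; an MKL method would learn these.
beta = np.array([0.2, 0.3, 0.5])

# K' = sum_i beta_i K_i is itself a valid kernel matrix.
K_combined = sum(b * K for b, K in zip(beta, base_kernels))

# Train an SVM directly on the combined Gram matrix.
clf = SVC(kernel="precomputed").fit(K_combined, y)
print("training accuracy:", clf.score(K_combined, y))
```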

Adaptations of existing techniques, such as the Sequential Minimal Optimization algorithm, have also been developed for multiple kernel SVM-based methods.

The following categorization has been proposed by Gönen and Alpaydın (2011).[5]

Fixed rules approaches, such as the linear combination algorithm described above, use rules to set the combination of the kernels.

These do not require parameterization and use rules like summation and multiplication to combine the kernels.
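For example, a simple fixed rule is the unweighted sum of two base kernels, which is again a valid kernel:

$$K'(x_1, x_2) = K_1(x_1, x_2) + K_2(x_1, x_2).$$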

Heuristic approaches use a parameterized combination function, with the parameters typically determined for each individual kernel from its single-kernel performance or from some computation on the kernel matrix. For example, letting $\pi_m$ be the accuracy obtained using only the kernel $K_m$, and letting $\delta$ be a threshold less than the minimum of the single-kernel accuracies, we can define

$$\beta_m = \frac{\pi_m - \delta}{\sum_{h=1}^{n} (\pi_h - \delta)}.$$

Other approaches use a definition of kernel similarity, such as the alignment

$$A(K_1, K_2) = \frac{\langle K_1, K_2 \rangle}{\sqrt{\langle K_1, K_1 \rangle \langle K_2, K_2 \rangle}}.$$

Using this measure, Qiu and Lane (2009)[8] used the following heuristic to define the kernel weights:

$$\beta_m = \frac{A(K_m, YY^T)}{\sum_{h=1}^{n} A(K_h, YY^T)},$$

where $Y$ is the vector of training labels.
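A minimal sketch of this alignment-based heuristic is given below; the function names and toy kernels are illustrative rather than taken from any particular library.

```python
import numpy as np

def alignment(K1, K2):
    """Kernel alignment A(K1, K2) = <K1, K2> / sqrt(<K1, K1> <K2, K2>)."""
    return np.sum(K1 * K2) / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

def heuristic_weights(kernels, y):
    """Weight each base kernel by its alignment with the target kernel y y^T,
    normalized so that the weights sum to one."""
    ideal = np.outer(y, y)                      # target kernel Y Y^T
    scores = np.array([alignment(K, ideal) for K in kernels])
    return scores / scores.sum()

# Toy usage with a linear and a quadratic kernel.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = np.where(X[:, 0] > 0, 1.0, -1.0)            # labels in {-1, +1}
kernels = [X @ X.T, (X @ X.T + 1.0) ** 2]
print(heuristic_weights(kernels, y))
```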

Optimization approaches solve an optimization problem to determine the parameters of the kernel combination function. This has been done with similarity measures and structural risk minimization approaches.

For similarity measures such as the one defined above, the problem can be formulated as follows:[9]

$$\max_{\beta} A(K'_{tra}, YY^T),$$

where $K'_{tra}$ is the combined kernel evaluated on the training data.
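A hedged sketch of one such optimization is shown below: a generic numerical solver chooses nonnegative weights summing to one that maximize the alignment of the combined kernel with the target kernel (this illustrates the idea and is not a specific published solver).

```python
import numpy as np
from scipy.optimize import minimize

def alignment(K1, K2):
    return np.sum(K1 * K2) / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

def optimize_weights(kernels, y):
    """Maximize A(sum_m beta_m K_m, y y^T) over beta >= 0 with sum(beta) = 1."""
    ideal = np.outer(y, y)

    def neg_alignment(beta):
        K = sum(b * K_m for b, K_m in zip(beta, kernels))
        return -alignment(K, ideal)

    n = len(kernels)
    result = minimize(
        neg_alignment,
        x0=np.full(n, 1.0 / n),                   # start from uniform weights
        bounds=[(0.0, None)] * n,                 # beta_m >= 0
        constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1.0}],
    )
    return result.x

# Toy usage with a linear and an RBF kernel.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
rbf = np.exp(-0.5 * np.square(X[:, None, :] - X[None, :, :]).sum(-1))
print(optimize_weights([X @ X.T, rbf], y))
```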

Structural risk minimization approaches instead define a quality measure $\omega(K)$ for a kernel as the value of the objective function after solving a canonical SVM problem with that kernel, and then optimize $\omega(K'_{tra})$ over the admissible combined kernels.

Many other variations exist on the same idea, with different methods of refining and solving the problem, e.g., by requiring nonnegative weights for individual kernels or by using non-linear combinations of kernels.

Bayesian approaches put priors on the kernel parameters and learn the parameter values from the priors and the base algorithm.

For example, each kernel weight $\eta_m$ in the combined kernel $K' = \sum_{m} \eta_m K_m$ can be modeled with a zero-mean Gaussian and an inverse gamma variance prior.
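The following is a minimal sketch of that hierarchical prior, sampling kernel weights from it with illustrative shape and scale values; a full Bayesian treatment would infer the posterior over these weights rather than sample from the prior.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_kernel_weights(n_kernels, a=2.0, b=1.0):
    """Draw kernel weights eta_m from a zero-mean Gaussian whose variance is
    itself drawn from an inverse-gamma prior (a and b are illustrative
    hyperparameters, not values from any particular paper)."""
    variance = 1.0 / rng.gamma(shape=a, scale=1.0 / b)   # inverse-gamma sample
    return rng.normal(loc=0.0, scale=np.sqrt(variance), size=n_kernels)

print("sampled kernel weights:", sample_kernel_weights(n_kernels=3))
```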

This model is then estimated using a customized multinomial probit approach with a Gibbs sampler.[11] These methods have been used successfully in applications such as protein fold recognition and protein homology problems.[12][13]

Boosting approaches add new kernels iteratively until some stopping criterion that is a function of performance is reached.

An example of this is the MARK model developed by Bennett et al. (2002),[14] which builds an additive model of the form

$$f(x) = \sum_{i=1}^{N} \sum_{m=1}^{P} \alpha_i^m K_m(x_i, x) + b,$$

where the parameters $\alpha_i^m$ and $b$ are learned by gradient descent on a coordinate basis. Each iteration identifies the best kernel column to add, and the model is then rerun to generate the optimal weights.
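The snippet below is a simplified, hedged sketch of this boosting idea, written as a greedy least-squares procedure for illustration; it is not the published MARK algorithm itself.

```python
import numpy as np

def greedy_kernel_boosting(kernels, y, max_terms=10, tol=1e-3):
    """At each iteration, add the single kernel column that most reduces the
    squared residual; stop once the improvement falls below `tol`."""
    n = len(y)
    prediction = np.zeros(n)
    selected = []
    for _ in range(max_terms):
        residual = y - prediction
        best = None
        # Search over every column of every base Gram matrix.
        for m, K in enumerate(kernels):
            for i in range(n):
                col = K[:, i]
                denom = col @ col
                if denom == 0.0:
                    continue
                gain = (col @ residual) ** 2 / denom    # drop in squared error
                if best is None or gain > best[0]:
                    best = (gain, m, i, (col @ residual) / denom)
        if best is None or best[0] < tol:
            break                                       # stopping criterion met
        _, m, i, alpha = best
        prediction = prediction + alpha * kernels[m][:, i]
        selected.append((m, i, alpha))
    return prediction, selected
```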

Semi-supervised approaches have also been proposed. For example, an inductive procedure has been developed that uses a log-likelihood empirical loss and group LASSO regularization with conditional expectation consensus on unlabeled data for image categorization. The objective can be written as

$$\min_{f} L(f) + \lambda R(f) + \gamma \Theta(f),$$

where $L$ is the loss function (a weighted negative log-likelihood in this case), $R$ is the regularization term (group LASSO in this case), $\Theta$ is the conditional expectation consensus (CEC) penalty on unlabeled data, and $\lambda$, $\gamma$ are trade-off parameters.

The combined minimization problem is optimized using a modified block gradient descent algorithm.
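A minimal sketch of the structure of this objective is given below; the loss and CEC terms are left as caller-supplied placeholders, and only the group LASSO penalty (one group of parameters per base kernel) is spelled out. All names here are illustrative assumptions.

```python
import numpy as np

def group_lasso(theta, groups):
    # Group LASSO penalty: sum of the L2 norms of each parameter group.
    return sum(np.linalg.norm(theta[g]) for g in groups)

def composite_objective(theta, loss_fn, cec_fn, groups, lam, gamma):
    # Objective of the form L(theta) + lambda * R(theta) + gamma * Theta(theta).
    return loss_fn(theta) + lam * group_lasso(theta, groups) + gamma * cec_fn(theta)

# Toy usage: six parameters split into three per-kernel groups, with dummy
# stand-ins for the empirical loss and the CEC penalty.
theta = np.array([0.5, -0.2, 0.1, 0.0, 0.3, -0.4])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
value = composite_objective(
    theta,
    loss_fn=lambda t: np.sum(t ** 2),   # placeholder empirical loss
    cec_fn=lambda t: 0.0,               # placeholder CEC penalty
    groups=groups,
    lam=0.1,
    gamma=0.05,
)
print(value)
```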

Unsupervised multiple kernel learning has also been considered. In this problem, the data needs to be "clustered" into groups based on the kernel distances, and the objective measures how well each point is approximated by the kernel-weighted combination of the other members of its group. Finally, a regularization term is added to avoid overfitting. Combining these terms yields the overall minimization problem.

Zhuang et al. solve this problem by an alternating minimization method, optimizing the kernel coefficients and the group assignments in turn.
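A generic alternating-minimization skeleton is sketched below; the update functions are problem-specific placeholders, so this illustrates only the overall scheme rather than Zhuang et al.'s exact update rules.

```python
def alternating_minimization(update_weights, update_groups, beta0, groups0,
                             objective, max_iter=50, tol=1e-6):
    """Alternate between updating the kernel weights with the group
    assignments fixed and updating the groups with the weights fixed,
    stopping when the objective no longer improves."""
    beta, groups = beta0, groups0
    prev = objective(beta, groups)
    for _ in range(max_iter):
        beta = update_weights(groups)     # minimize over beta, groups fixed
        groups = update_groups(beta)      # minimize over groups, beta fixed
        current = objective(beta, groups)
        if prev - current < tol:          # no further improvement
            break
        prev = current
    return beta, groups
```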

For more information, see Zhuang et al.[16]

Several software libraries implementing MKL algorithms are also available.