Matrix regularization

In the field of statistical learning theory, matrix regularization generalizes notions of vector regularization to cases where the object to be learned is a matrix.

The purpose of regularization is to enforce conditions, for example sparsity or smoothness, that can produce stable predictive functions.

For example, in the more common vector framework, Tikhonov regularization optimizes over min_x ||Ax − y||² + λ||x||² to find a vector x that is a stable solution to the regression problem.
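
As a concrete illustration, here is a minimal NumPy sketch of the closed-form Tikhonov solution (the function name tikhonov and the random example data are illustrative, not from the article):

```python
import numpy as np

def tikhonov(A, y, lam):
    """Closed-form minimizer of ||A x - y||^2 + lam * ||x||^2."""
    d = A.shape[1]
    # Normal equations with a ridge term: (A^T A + lam * I) x = A^T y
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)

# Illustrative usage on random data
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
y = rng.standard_normal(50)
x_hat = tikhonov(A, y, lam=0.1)
```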

Ideas of feature and group selection can also be extended to matrices, and these can be generalized to the nonparametric case of multiple kernel learning.

The regularization penalty is typically chosen to be convex and is often selected to enforce sparsity (using ℓ1-type norms) and/or smoothness (using ℓ2-type norms).

In the problem of matrix completion, for example, the role of the Frobenius inner product is to select individual elements of the matrix to be learned, so the observed outputs are a sampling of its entries.

Reconstructing the matrix from a small set of sampled entries is possible only under certain restrictions on the matrix, and these restrictions can be enforced by the regularization function. For example, it may be assumed that the matrix is low-rank, in which case the regularization penalty can take the form of a nuclear norm, i.e. the sum of the singular values of the matrix.
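
A minimal sketch of the nuclear norm and of singular-value soft-thresholding, the proximal step used by many nuclear-norm matrix-completion solvers (the helper names nuclear_norm and svt are illustrative):

```python
import numpy as np

def nuclear_norm(W):
    """||W||_*: the sum of the singular values of W."""
    return np.linalg.svd(W, compute_uv=False).sum()

def svt(W, tau):
    """Singular-value soft-thresholding: the proximal operator of tau * ||.||_*,
    the basic step in many nuclear-norm matrix-completion solvers."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```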

Models used in multivariate regression are parameterized by a matrix of coefficients.

Many of the vector norms used in single-variable regression can be extended to the multivariate case. One example is the squared Frobenius norm, which can be viewed as an ℓ2-norm acting either entrywise or on the singular values of the coefficient matrix W: ||W||_F² = Σ_{ij} W_{ij}² = Σ_i σ_i(W)² = tr(WᵀW).

In the multivariate case the effect of regularizing with the Frobenius norm is the same as the vector case; very complex models will have larger norms, and, thus, will be penalized more.
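
A minimal sketch of Frobenius-norm (ridge) regularization in multivariate regression; because the penalty decouples across columns of the output matrix, it amounts to an independent ridge regression for each output variable (the function name multivariate_ridge is illustrative):

```python
import numpy as np

def multivariate_ridge(X, Y, lam):
    """Closed-form minimizer of ||X W - Y||_F^2 + lam * ||W||_F^2.
    The Frobenius penalty decouples over the columns of Y, so this is
    equivalent to an independent ridge regression for each output variable."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
```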

The setup for multi-task learning is almost the same as that of multivariate regression; the primary difference is that the input variables are also indexed by task (columns of the output matrix Y).

The problems can be coupled by adding an additional regularization penalty on the covariance of solutions across tasks, expressed through a matrix A that encodes the pairwise relationships between the tasks.

This scheme can be used both to enforce similarity of solutions across tasks and to learn the specific structure of task similarity by alternating between optimizations of the coefficient matrix W and the task-relationship matrix A.
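
A minimal sketch of one common way such a coupling penalty is written; the specific form λ·tr(W A Wᵀ), with A fixed and symmetric, is an assumption made here for illustration rather than a formula from the article. Under that assumption the stationarity condition is a Sylvester equation, which SciPy can solve directly:

```python
import numpy as np
from scipy.linalg import solve_sylvester

def coupled_multitask(X, Y, A, lam):
    """Minimize ||X W - Y||_F^2 + lam * tr(W A W^T) for a fixed symmetric
    task-relationship matrix A (an assumed form of the coupling penalty).
    Setting the gradient to zero gives the Sylvester equation
        (X^T X) W + W (lam * A) = X^T Y."""
    return solve_sylvester(X.T @ X, lam * A, X.T @ Y)
```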

Regularization by spectral filtering has been used to find stable solutions to problems such as those discussed above by addressing ill-posed matrix inversions (see for example Filter function for Tikhonov regularization).

In many cases the regularization function acts on the input (or kernel) to ensure a bounded inverse by eliminating small singular values, but it can also be useful to have spectral norms that act on the matrix that is to be learned.
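
A minimal sketch of spectral filtering in the Tikhonov case, where each singular value σ of the input matrix is replaced by the filtered value σ/(σ² + λ), damping the small singular values that make the inverse unstable (the function name is illustrative):

```python
import numpy as np

def tikhonov_spectral_filter(A, y, lam):
    """Tikhonov regularization written as spectral filtering: every singular
    value sigma of A is filtered by sigma / (sigma^2 + lam), which suppresses
    the small singular values responsible for the unstable inverse."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    filtered = s / (s**2 + lam)
    return Vt.T @ (filtered * (U.T @ y))
```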

Spectral regularization is also used to enforce a reduced-rank coefficient matrix in multivariate regression.[4]

In this setting, a reduced-rank coefficient matrix can be found by keeping just the top singular values of the unregularized solution, but this can be extended to keep any reduced set of singular values and vectors.
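
A minimal sketch of one simple version of this idea, obtained by truncating the singular value decomposition of the unconstrained least-squares solution (this keep-the-top-singular-values strategy is a simplification of full reduced-rank regression; names are illustrative):

```python
import numpy as np

def reduced_rank_coefficients(X, Y, rank):
    """Fit the unconstrained least-squares coefficient matrix, then keep only
    its top `rank` singular values and vectors (a simple truncation sketch)."""
    W_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)
    U, s, Vt = np.linalg.svd(W_ols, full_matrices=False)
    s[rank:] = 0.0  # discard all but the leading singular values
    return U @ np.diag(s) @ Vt
```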

Sparse optimization has become the focus of much research interest as a way to find solutions that depend on a small number of variables (see e.g. the Lasso method).

While regularization with an ℓ1-norm will find solutions with a small number of nonzero elements, applying an ℓ1-norm to different groups of variables can enforce structure in the sparsity of solutions.

The ℓ2,1-norm is used in multi-task learning to group features across tasks, such that all the elements in a given row of the coefficient matrix can be forced to zero as a group.

The grouping effect is achieved by taking the ℓ2-norm of each row and then taking the total penalty to be the sum of these row-wise norms.
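
This row-wise construction can be written in a couple of lines of NumPy (the helper name is illustrative):

```python
import numpy as np

def l21_norm(W):
    """l2,1 norm: the l2 norm of each row of W, summed over rows."""
    return np.linalg.norm(W, axis=1).sum()
```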

This regularization results in rows that will tend to be either all zeros or dense.

The same type of regularization can be used to enforce sparsity column-wise by taking the ℓ2-norms of the columns instead.

Algorithms for solving these group-sparsity problems extend the well-known Lasso and group Lasso methods, for example by allowing overlapping groups, and have been implemented via matching pursuit[7] and proximal gradient methods.
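
A minimal proximal-gradient sketch for the row-wise ℓ2,1-penalized problem min_W ½||XW − Y||_F² + λ||W||_{2,1}; the proximal operator is row-wise group soft-thresholding (function names, step-size choice, and iteration count are illustrative):

```python
import numpy as np

def prox_l21(W, tau):
    """Row-wise group soft-thresholding: proximal operator of tau * ||.||_{2,1}."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return W * scale

def group_sparse_regression(X, Y, lam, n_iter=500):
    """Proximal gradient descent for  min_W 0.5 * ||X W - Y||_F^2 + lam * ||W||_{2,1}."""
    W = np.zeros((X.shape[1], Y.shape[1]))
    step = 1.0 / np.linalg.norm(X, 2) ** 2      # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ W - Y)                # gradient of the smooth part
        W = prox_l21(W - step * grad, step * lam)
    return W
```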

With ℓ2,1 and related group norms, it is straightforward to enforce structure in the sparsity of a matrix either row-wise, column-wise, or in arbitrary blocks.

By enforcing group norms on blocks in multivariate or multi-task regression, for example, it is possible to find groups of input and output variables such that defined subsets of output variables (columns in the matrix Y) will depend on the same sparse set of input variables.

The ideas of structured sparsity and feature selection can be extended to the nonparametric case of multiple kernel learning.

Thus, by choosing a matrix regularization function of this group-norm type, it is possible to find a solution that is sparse in terms of which kernels are used, but dense in the coefficients of each kernel that is used.
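
As a sketch, assuming the kernel-expansion coefficients are organized into one block per kernel (a layout chosen here purely for illustration), the penalty described above is the sum across blocks of the within-block ℓ2 norms:

```python
import numpy as np

def kernel_group_penalty(coef_blocks):
    """Mixed norm over per-kernel coefficient blocks: the l2 norm within each
    block, summed (l1-like) across blocks.  Penalizing this drives whole
    kernels out of the model while keeping the surviving blocks dense."""
    return sum(np.linalg.norm(block) for block in coef_blocks)
```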

Multiple kernel learning can also be used as a form of nonlinear variable selection, or as a model aggregation technique (e.g. by taking the sum of squared norms and relaxing sparsity constraints).