Regularized least squares

Regularized least squares (RLS) is a family of methods for solving the least-squares problem while using regularization to further constrain the resulting solution. RLS is used for two main reasons. The first arises when the number of variables in the linear system exceeds the number of observations: the ordinary least-squares problem is then ill-posed, with infinitely many solutions, and RLS introduces additional constraints that determine the solution uniquely. The second reason for using RLS arises when the learned model suffers from poor generalization.

RLS can be used in such cases to improve the generalizability of the model by constraining it at training time.

This constraint can either force the solution to be "sparse" in some way or reflect other prior knowledge about the problem, such as information about correlations between features.

A Bayesian understanding of this can be reached by showing that RLS methods are often equivalent to placing priors on the solution of the least-squares problem.

Consider a training set of $n$ input–output pairs $\{(x_i, y_i)\}_{i=1}^{n}$ drawn from an unknown distribution. A good learning algorithm should provide an estimator $f$ with a small risk; since the underlying distribution is typically unknown, the empirical risk over the training set, measured here with the square loss $\frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2$, is minimized instead.

However, if the functions are drawn from a relatively unconstrained space, such as the set of square-integrable functions on the input space, this approach may overfit the training data and lead to poor generalization, so the complexity of $f$ must somehow be constrained or penalized.

In RLS, this is accomplished by choosing functions from a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ and adding to the objective a regularization term proportional to the squared norm of the function in $\mathcal{H}$:
$$\min_{f \in \mathcal{H}} \; \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - f(x_i)\bigr)^2 + \lambda \|f\|_{\mathcal{H}}^{2}, \qquad \lambda > 0.$$

Allowing an arbitrary loss function in place of the square loss, this approach defines a general class of algorithms named Tikhonov regularization.

By the representer theorem, the minimizer can be written as $f(x) = \sum_{j=1}^{n} c_j K(x, x_j)$ for some coefficient vector $c \in \mathbb{R}^{n}$, where $K$ is the reproducing kernel of $\mathcal{H}$. Substituting this form, the problem becomes $\min_{c \in \mathbb{R}^{n}} \frac{1}{n}\|Y - Kc\|_{2}^{2} + \lambda\, c^{\mathsf T} K c$, with kernel matrix $K_{ij} = K(x_i, x_j)$ and $Y = (y_1, \ldots, y_n)^{\mathsf T}$. As a smooth finite-dimensional problem is now being considered, it is possible to apply standard calculus tools.

In order to minimize the objective function, the gradient is calculated with respect to $c$ and set to zero: $\frac{1}{n} K (Kc - Y) + \lambda K c = 0$, which is satisfied by $c = (K + \lambda n I)^{-1} Y$.

This solution closely resembles that of standard linear regression, with the extra term $\lambda n I$ added to the matrix that is inverted.
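As a concrete illustration, here is a minimal NumPy sketch of this closed-form solution, assuming a Gaussian (RBF) kernel; the function names and the bandwidth parameter gamma are choices made for this example rather than anything fixed by the text.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def fit_kernel_rls(X, y, lam, gamma=1.0):
    """Closed-form kernel RLS coefficients: c = (K + lambda * n * I)^{-1} y."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def predict_kernel_rls(X_train, c, X_new, gamma=1.0):
    """Evaluate f(x) = sum_j c_j K(x, x_j) at new points."""
    return rbf_kernel(X_new, X_train, gamma) @ c
```

Using np.linalg.solve rather than forming the explicit inverse is the usual numerically safer choice.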

Every such kernel can be viewed as computing an inner product in a feature space. This follows from Mercer's theorem, which states that a continuous, symmetric, positive-definite kernel function can be expressed as $K(x, z) = \sum_{i=1}^{\infty} \sigma_i\, e_i(x)\, e_i(z)$, where the $e_i$ form an orthonormal basis of eigenfunctions and $\sigma_i \ge 0$ are the corresponding eigenvalues.

Least squares can be viewed as a likelihood maximization under an assumption of normally distributed residuals.

This is because the exponent of the Gaussian distribution is quadratic in the data, and so is the least-squares objective function.

In this framework, the regularization terms of RLS can be understood as encoding priors on the weight vector $w$; in particular, Tikhonov regularization corresponds to a normally distributed prior on $w$ centered at 0. To see this, first note that the OLS objective is proportional to the log-likelihood function when each sampled $y_i$ is normally distributed around $w^{\mathsf T} x_i$, and that a zero-mean normal prior on $w$ contributes a log-probability proportional to $-\|w\|_{2}^{2}$.

This gives a more intuitive interpretation for why Tikhonov regularization leads to a unique solution to the least-squares problem: there are infinitely many vectors $w$ satisfying the constraints obtained from the data, but since we come to the problem with a prior belief that $w$ is normally distributed around the origin, we will end up choosing a solution with this constraint in mind.
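To make the correspondence explicit, the following short derivation is a sketch under the standard assumptions of i.i.d. Gaussian noise with variance $\sigma^2$ and an independent zero-mean Gaussian prior with variance $\tau^2$ on each coordinate of $w$ (these two symbols are introduced only for this sketch):
$$
-\log p(w \mid X, Y) = -\log p(Y \mid X, w) - \log p(w) + \text{const}
= \frac{1}{2\sigma^{2}} \sum_{i=1}^{n} \bigl(y_i - w^{\mathsf T} x_i\bigr)^{2} + \frac{1}{2\tau^{2}} \|w\|_{2}^{2} + \text{const}.
$$
Minimizing over $w$ is therefore equivalent to ridge regression with regularization parameter $\lambda = \sigma^{2}/\tau^{2}$ when the residual sum of squares is left unnormalized, or $\lambda = \sigma^{2}/(n\tau^{2})$ under the $1/n$ empirical-risk scaling used above.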

One particularly common choice of penalty is the squared $\ell_2$ norm, $R(w) = \|w\|_{2}^{2}$; the resulting method is most commonly called Tikhonov regularization or ridge regression. Minimizing $\|Y - Xw\|_{2}^{2} + \lambda \|w\|_{2}^{2}$ admits the closed-form solution $w = (X^{\mathsf T} X + \lambda I)^{-1} X^{\mathsf T} Y$.

The name "ridge regression" alludes to the fact that the $\lambda I$ term adds positive entries along the diagonal "ridge" of the sample covariance matrix $X^{\mathsf T} X$.

When $\lambda = 0$ (ordinary least squares) and the number of features exceeds the number of observations, $X^{\mathsf T} X$ does not have full rank and cannot be inverted, which is why the least-squares problem can have infinitely many solutions. When $\lambda > 0$, adding $\lambda I$ to the sample covariance matrix ensures that all of its eigenvalues will be strictly greater than 0, so the matrix is invertible and the solution is unique.

Compared to ordinary least squares, the ridge estimator is not unbiased: it accepts some bias in order to reduce the variance and, frequently, the mean squared error.
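For illustration, here is a minimal NumPy sketch of the closed-form ridge solution; the problem sizes and the value of lam are arbitrary choices for this example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X^T X + lambda * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Even when there are more features than observations (so X^T X is singular),
# the regularized matrix is positive definite and the solve succeeds.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 50))   # n = 20 observations, d = 50 features
y = rng.standard_normal(20)
w = ridge_fit(X, y, lam=0.1)
```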

The regularization parameter $\lambda$ can also be chosen automatically rather than heuristically; one such algorithm obtains $\lambda$ as a fixed-point solution in the maximum-likelihood estimation of the parameters.[2] Although guarantees of convergence are not provided, the examples indicate that a satisfactory solution may be obtained after a couple of iterations.
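The iteration of reference [2] is not reproduced here; as an illustration of the general idea, the following is a sketch of one standard fixed-point scheme of this kind, the evidence approximation for Bayesian ridge regression, in which a prior precision alpha and a noise precision beta are re-estimated in turn and the effective regularization parameter is alpha / beta. The function name, initial values, and iteration count are assumptions made for this example.

```python
import numpy as np

def evidence_fixed_point(X, y, n_iter=10, alpha=1.0, beta=1.0):
    """Fixed-point updates of the evidence approximation for Bayesian ridge
    regression; alpha is the prior precision of w, beta the noise precision,
    and alpha / beta plays the role of lambda in the ridge solution above."""
    n, d = X.shape
    XtX = X.T @ X
    eigvals = np.linalg.eigvalsh(XtX)
    for _ in range(n_iter):
        # Posterior mean of w for the current (alpha, beta)
        m = beta * np.linalg.solve(beta * XtX + alpha * np.eye(d), X.T @ y)
        # Effective number of well-determined parameters
        gamma = np.sum(beta * eigvals / (beta * eigvals + alpha))
        # Fixed-point re-estimation of the precisions
        alpha = gamma / (m @ m)
        beta = (n - gamma) / np.sum((y - X @ m) ** 2)
    return alpha / beta, m
```

Here gamma is the effective number of well-determined parameters; the two updates are the stationarity conditions of the log marginal likelihood, iterated as a fixed point.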

The least absolute shrinkage and selection operator (LASSO) method is another popular choice; in lasso regression, the penalty is the $\ell_1$ norm, $R(w) = \|w\|_{1} = \sum_j |w_j|$.

Unlike Tikhonov regularization, this scheme does not have a convenient closed-form solution: instead, the solution is typically found using quadratic programming or more general convex optimization methods, as well as by specific algorithms such as the least-angle regression algorithm.

An important difference from Tikhonov regularization is that the lasso tends to force more entries of $w$ to be exactly 0, whereas Tikhonov regularization only makes the entries small without setting them to zero; the lasso is therefore more appropriate when few non-zero entries are expected, and Tikhonov regularization when the entries are expected to be small but not necessarily zero. Which of these regimes is more relevant depends on the specific data set at hand.
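As an example of the more general convex optimization methods mentioned above, here is a minimal proximal-gradient (ISTA) sketch for the lasso objective; the step-size rule, iteration count, and function names are choices made for this illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/2n)||y - Xw||^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    n, d = X.shape
    w = np.zeros(d)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * lam)
    return w
```

The soft-thresholding step is what drives coefficients exactly to zero, in line with the sparsity discussion above.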

The most extreme way to enforce sparsity is to say that the actual magnitude of the coefficients of $w$ does not matter; rather, the only thing that determines the complexity of $w$ is the number of non-zero entries, which corresponds to taking $R(w)$ to be the $\ell_0$ "norm" of $w$. This regularizer, while attractive for the sparsity it guarantees, is very difficult to optimize because the resulting problem is not convex; lasso regression is the minimal possible relaxation of $\ell_0$ penalization that yields a weakly convex optimization problem.

The elastic net combines the lasso and ridge penalties, $R(w) = \lambda_1 \|w\|_{1} + \lambda_2 \|w\|_{2}^{2}$. For $\lambda_1, \lambda_2 > 0$, the elastic net penalty function does not have a first derivative at 0 and is strictly convex, thereby taking on properties of both lasso regression and ridge regression.

One of the main properties of the Elastic Net is that it can select groups of correlated variables.
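As a minimal sketch, proximal gradient descent also handles the elastic net penalty written above; the only change from the lasso sketch is that the soft-thresholding step is followed by a rescaling. Parameter names and iteration counts are again choices made for this example.

```python
import numpy as np

def enet_prox_gradient(X, y, lam1, lam2, n_iter=500):
    """Minimize (1/2n)||y - Xw||^2 + lam1*||w||_1 + lam2*||w||_2^2
    by proximal gradient; the proximal step is a scaled soft-threshold."""
    n, d = X.shape
    w = np.zeros(d)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        z = w - step * grad
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam1, 0.0) / (1.0 + 2.0 * step * lam2)
    return w
```

Setting lam2 = 0 recovers the lasso iteration, while lam1 = 0 reduces it to a plain gradient method for the ridge objective.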

A partial list of RLS methods can be given by enumerating possible choices of the regularization function $R(\cdot)$, along with the name for each one, the corresponding prior if there is a simple one, and ways of computing the solution to the resulting optimization problem.