In statistics and statistical modeling, least-squares support-vector machines (LS-SVM) are least-squares versions of support-vector machines (SVM), a family of related supervised learning methods that analyze data and recognize patterns and that are used for classification and regression analysis. In this version, one finds the solution by solving a set of linear equations instead of the convex quadratic programming (QP) problem required for classical SVMs.
Least-squares SVM classifiers were proposed by Johan Suykens and Joos Vandewalle.[1] LS-SVMs are a class of kernel-based learning methods.
Given a training set $\{x_i, y_i\}_{i=1}^N$ with input data $x_i \in \mathbb{R}^n$ and corresponding binary class labels $y_i \in \{-1, +1\}$, the SVM[2] classifier, according to Vapnik's original formulation, satisfies the following conditions:

$$w^T \varphi(x_i) + b \ge 1, \quad \text{if } y_i = +1,$$
$$w^T \varphi(x_i) + b \le -1, \quad \text{if } y_i = -1,$$

which is equivalent to

$$y_i \left[ w^T \varphi(x_i) + b \right] \ge 1, \quad i = 1, \ldots, N,$$

where $\varphi(x)$ is the nonlinear map from the original space to a high- or infinite-dimensional space.

In case such a separating hyperplane does not exist, we introduce so-called slack variables $\xi_i \ge 0$ such that

$$y_i \left[ w^T \varphi(x_i) + b \right] \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, N.$$

The classifier is then obtained by minimizing

$$J_1(w, \xi) = \frac{1}{2} w^T w + c \sum_{i=1}^N \xi_i$$

subject to these constraints, where $c > 0$ is a regularization constant. Substituting $w$ by its expression in the Lagrangian formed from the appropriate objective and constraints, we will get the following quadratic programming problem:

$$\max_{\alpha} \; Q_1(\alpha) = -\frac{1}{2} \sum_{i,j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^N \alpha_i,$$

where $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$ is the kernel function.
Solving this QP problem subject to constraints in (1), we will get the hyperplane in the high-dimensional space and hence the classifier in the original space.
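For later comparison with the LS-SVM linear system, here is a minimal sketch of solving this dual QP numerically with a general-purpose constrained optimizer (the RBF kernel, the box constant c and all function names are illustrative choices, not prescribed by the formulation above):

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual_qp(X, y, c=1.0, sigma=1.0):
    """Maximize Q1(alpha) = sum(alpha) - 1/2 sum_ij alpha_i alpha_j y_i y_j K(x_i, x_j)
    subject to the standard dual constraints 0 <= alpha_i <= c and sum_i alpha_i y_i = 0."""
    N = X.shape[0]
    # RBF kernel matrix K_ij = exp(-||x_i - x_j||^2 / sigma^2)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / sigma**2)
    H = (y[:, None] * y[None, :]) * K                   # H_ij = y_i y_j K(x_i, x_j)

    # Minimize the negative of Q1 with a generic constrained solver.
    objective = lambda a: 0.5 * a @ H @ a - a.sum()
    grad = lambda a: H @ a - np.ones(N)
    constraints = [{'type': 'eq', 'fun': lambda a: a @ y, 'jac': lambda a: y}]
    bounds = [(0.0, c)] * N
    res = minimize(objective, np.zeros(N), jac=grad,
                   bounds=bounds, constraints=constraints, method='SLSQP')
    return res.x                                         # the dual variables alpha
```

The LS-SVM reformulation below replaces the inequality constraints with equality constraints, which is what turns this QP into a set of linear equations.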
The least-squares version of the SVM classifier is obtained by reformulating the minimization problem as

$$\min_{w, b, e} \; J_2(w, b, e) = \frac{\mu}{2} w^T w + \frac{\zeta}{2} \sum_{i=1}^N e_i^2,$$

subject to the equality constraints

$$y_i \left[ w^T \varphi(x_i) + b \right] = 1 - e_i, \quad i = 1, \ldots, N.$$

The least-squares SVM (LS-SVM) classifier formulation above implicitly corresponds to a regression interpretation with binary targets $y_i = \pm 1$. Using $y_i^2 = 1$, we have

$$\sum_{i=1}^N e_i^2 = \sum_{i=1}^N (y_i e_i)^2 = \sum_{i=1}^N \left( y_i - (w^T \varphi(x_i) + b) \right)^2.$$

Notice that this error would also make sense for least-squares data fitting, so that the same end result holds for the regression case. Hence the LS-SVM classifier formulation is equivalent to

$$J_2(w, b, e) = \mu E_W + \zeta E_D$$

with $E_W = \frac{1}{2} w^T w$ and $E_D = \frac{1}{2} \sum_{i=1}^N e_i^2$.

Both $\mu$ and $\zeta$ should be considered as hyperparameters that tune the amount of regularization versus the sum squared error. The solution depends only on the ratio $\gamma = \zeta / \mu$; both $\mu$ and $\zeta$ are nevertheless retained as parameters in order to provide a Bayesian interpretation to LS-SVM.
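That the solution depends only on the ratio $\gamma$ can be seen by rescaling the objective:

$$\frac{1}{\mu} J_2(w, b, e) = E_W + \frac{\zeta}{\mu} E_D = E_W + \gamma E_D,$$

and multiplying a cost function by a positive constant does not change its minimizer, so only $\gamma = \zeta / \mu$ affects the resulting classifier.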
The solution of the LS-SVM regressor will be obtained after we construct the Lagrangian function:

$$L_2(w, b, e; \alpha) = J_2(w, b, e) - \sum_{i=1}^N \alpha_i \left\{ \left[ w^T \varphi(x_i) + b \right] + e_i - y_i \right\},$$

where $\alpha_i \in \mathbb{R}$ are the Lagrange multipliers. The conditions for optimality are

$$\frac{\partial L_2}{\partial w} = 0 \;\Rightarrow\; w = \frac{1}{\mu} \sum_{i=1}^N \alpha_i \varphi(x_i), \qquad \frac{\partial L_2}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^N \alpha_i = 0,$$
$$\frac{\partial L_2}{\partial e_i} = 0 \;\Rightarrow\; \alpha_i = \zeta e_i, \qquad \frac{\partial L_2}{\partial \alpha_i} = 0 \;\Rightarrow\; y_i = w^T \varphi(x_i) + b + e_i, \quad i = 1, \ldots, N.$$

Elimination of $w$ and $e$ will yield a linear system instead of a quadratic programming problem:

$$\begin{bmatrix} 0 & 1_N^T \\ 1_N & \Omega + \gamma^{-1} I_N \end{bmatrix} \begin{bmatrix} b \\ \hat{\alpha} \end{bmatrix} = \begin{bmatrix} 0 \\ Y \end{bmatrix}$$

with the rescaled multipliers $\hat{\alpha} = \alpha / \mu$, $\gamma = \zeta / \mu$, $Y = [y_1, \ldots, y_N]^T$, $1_N = [1, \ldots, 1]^T$, and $\Omega$ the kernel matrix with entries $\Omega_{ij} = \varphi(x_i)^T \varphi(x_j) = K(x_i, x_j)$. The resulting model takes the form $y(x) = \sum_{i=1}^N \hat{\alpha}_i K(x, x_i) + b$.
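A minimal numerical sketch of this training step, assuming the RBF kernel discussed below and the $\gamma$-parametrized system above (function names, the toy data and the hyperparameter values are illustrative, not part of the original formulation):

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    # Pairwise squared distances, then K(x, z) = exp(-||x - z||^2 / sigma^2).
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / sigma**2)

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    """Solve the LS-SVM linear system for (b, alpha)."""
    N = X.shape[0]
    Omega = rbf_kernel(X, X, sigma)                      # kernel matrix
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0                                       # [0, 1_N^T]
    A[1:, 0] = 1.0                                       # [1_N, ...]
    A[1:, 1:] = Omega + np.eye(N) / gamma                # Omega + I/gamma
    rhs = np.concatenate(([0.0], y))                     # [0; Y]
    sol = np.linalg.solve(A, rhs)
    b, alpha = sol[0], sol[1:]
    return b, alpha

def lssvm_predict(X_train, b, alpha, X_new, sigma=1.0):
    # Decision values f(x) = sum_i alpha_i K(x, x_i) + b; take the sign for classification.
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

# Toy usage with +/-1 labels from two well-separated clusters
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2)) + np.repeat([[2, 2], [-2, -2]], 20, axis=0)
y = np.repeat([1.0, -1.0], 20)
b, alpha = lssvm_fit(X, y)
print(np.sign(lssvm_predict(X, b, alpha, X)))            # should recover the training labels
```

In contrast to the dual QP sketched earlier, training here amounts to a single dense linear solve of size N + 1.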
For the kernel function $K(\cdot, \cdot)$ one typically has the following choices:

- Linear kernel: $K(x, x_i) = x_i^T x$,
- Polynomial kernel of degree $d$: $K(x, x_i) = \left( 1 + x_i^T x / c \right)^d$,
- Radial basis function (RBF) kernel: $K(x, x_i) = \exp\!\left( -\|x - x_i\|^2 / \sigma^2 \right)$,
- MLP kernel: $K(x, x_i) = \tanh\!\left( k\, x_i^T x + \theta \right)$,

where $d$, $c$, $\sigma$, $k$ and $\theta$ are constants. The Mercer condition holds for all positive $c$ and $\sigma$ values in the polynomial and RBF case, but not for all possible choices of $k$ and $\theta$ in the MLP case. The scale parameters $c$, $\sigma$ and $k$ determine the scaling of the inputs in the polynomial, RBF and MLP kernel function. This scaling is related to the bandwidth of the kernel in statistics, where it is shown that the bandwidth is an important parameter governing the generalization behavior of a kernel method.
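The same choices written as plain functions on individual input vectors (a sketch; default parameter values are arbitrary):

```python
import numpy as np

def linear_kernel(x, xi):
    # K(x, x_i) = x_i^T x
    return xi @ x

def polynomial_kernel(x, xi, c=1.0, d=3):
    # K(x, x_i) = (1 + x_i^T x / c)^d
    return (1.0 + xi @ x / c) ** d

def rbf_kernel_pair(x, xi, sigma=1.0):
    # K(x, x_i) = exp(-||x - x_i||^2 / sigma^2)
    return np.exp(-np.sum((x - xi) ** 2) / sigma**2)

def mlp_kernel(x, xi, k=1.0, theta=1.0):
    # K(x, x_i) = tanh(k x_i^T x + theta); satisfies the Mercer condition only for some (k, theta)
    return np.tanh(k * (xi @ x) + theta)
```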
A Bayesian interpretation of the SVM has been proposed by Smola et al.
They showed that the use of different kernels in SVM can be regarded as defining different prior probability distributions on the functional space, as $P[f] \propto \exp\!\left( -\beta \|\hat{P} f\|^2 \right)$, where $\beta > 0$ is a constant and $\hat{P}$ is the regularization operator corresponding to the selected kernel.
A general Bayesian evidence framework was developed by MacKay,[3][4][5] who applied it to the problems of regression, feedforward neural networks and classification networks.
Given a data set $D$ and a model $H$ with parameter vector $w$, Bayesian inference is constructed with 3 levels of inference: the first level infers the model parameters for given hyperparameters, the second level infers the hyperparameters from the data, and the third level compares candidate models by their posterior probabilities. We can see that the Bayesian evidence framework is a unified theory for learning the model and model selection.
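Schematically, in the notation used below (model $H$, parameters $w$ and $b$, hyperparameters $\mu$ and $\zeta$), the three levels apply Bayes' rule as follows; this is the standard evidence-framework layout, stated here as an illustration:

$$\text{Level 1:}\quad p(w, b \mid D, \mu, \zeta, H) = \frac{p(D \mid w, b, \mu, \zeta, H)\, p(w, b \mid \mu, \zeta, H)}{p(D \mid \mu, \zeta, H)},$$
$$\text{Level 2:}\quad p(\mu, \zeta \mid D, H) = \frac{p(D \mid \mu, \zeta, H)\, p(\mu, \zeta \mid H)}{p(D \mid H)},$$
$$\text{Level 3:}\quad p(H \mid D) \propto p(D \mid H)\, p(H).$$

The evidence of each level serves as the likelihood of the next, which is what makes the framework a unified treatment of parameter estimation, hyperparameter tuning and model comparison.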
Kwok used the Bayesian evidence framework to interpret the formulation of SVM and model selection. He also applied the Bayesian evidence framework to support vector regression.
We assume that the data points are independently identically distributed (i.i.d.), so that:

$$p(D \mid w, b, \mu, \zeta, H) = \prod_{i=1}^N p(x_i, y_i \mid w, b, \mu, \zeta, H).$$

In order to obtain the least-squares cost function, it is assumed that the probability of a data point is proportional to:

$$p(x_i, y_i \mid w, b, \mu, \zeta, H) \propto p(e_i \mid w, b, \mu, \zeta, H).$$

A Gaussian distribution is taken for the errors $e_i = y_i - (w^T \varphi(x_i) + b)$:

$$p(e_i \mid w, b, \mu, \zeta, H) = \sqrt{\frac{\zeta}{2\pi}} \exp\!\left( -\frac{\zeta e_i^2}{2} \right).$$

For the prior, the elements of $w$ are assumed to follow a multivariate Gaussian distribution, which has variance $\mu^{-1}$:

$$p(w \mid \mu, H) \propto \exp\!\left( -\frac{\mu}{2} w^T w \right),$$

while the bias term $b$ is assumed to have a uniform (non-informative) prior distribution. Combining the preceding expressions, and neglecting all constants, Bayes' rule becomes

$$p(w, b \mid D, \mu, \zeta, H) \propto \exp\!\left( -\frac{\mu}{2} w^T w - \frac{\zeta}{2} \sum_{i=1}^N e_i^2 \right) = \exp\!\left( -J_2(w, b) \right).$$

The maximum posterior density estimates $w_{MP}$ and $b_{MP}$ are then obtained by minimizing the negative logarithm of (26), so we arrive at (10).
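Writing out the negative logarithm makes this final step explicit (constants dropped):

$$-\log p(w, b \mid D, \mu, \zeta, H) = \frac{\mu}{2} w^T w + \frac{\zeta}{2} \sum_{i=1}^N \left( y_i - (w^T \varphi(x_i) + b) \right)^2 + \text{const} = \mu E_W + \zeta E_D + \text{const},$$

so maximizing the posterior density is equivalent to minimizing the LS-SVM cost function $J_2 = \mu E_W + \zeta E_D$.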