Hinge loss

In machine learning, the hinge loss is a loss function used for training classifiers.

The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).[1]

For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as

$$\ell(y) = \max(0,\; 1 - t \cdot y)$$

Note that y should be the "raw" output of the classifier's decision function, not the predicted class label.
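As a minimal sketch of this definition, assuming the standard form max(0, 1 − t·y), the loss can be computed as below. The function name `hinge_loss` is illustrative and not taken from any particular library.

```python
def hinge_loss(t: int, y: float) -> float:
    """Hinge loss max(0, 1 - t*y) for a label t in {-1, +1} and a raw score y."""
    if t not in (-1, 1):
        raise ValueError("t must be -1 or +1")
    return max(0.0, 1.0 - t * y)


# y is the raw decision-function value, not the predicted label.
for score in (2.0, 0.5, -1.0):
    print(f"t=+1, y={score:+.1f} -> loss {hinge_loss(+1, score)}")
```

A score of 0.5 has the correct sign for t = +1 but still incurs a loss of 0.5, because it falls inside the margin. Passing the predicted label (+1) instead of the raw score would give a loss of 0 and hide that margin violation.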

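While binary SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion,[2] it is also possible to extend the hinge loss itself to the multiclass setting.[3]

For example, Crammer and Singer[4] defined it for a linear classifier as[5]

$$\ell(y) = \max\left(0,\; 1 + \max_{y \neq t} \mathbf{w}_y \cdot \mathbf{x} - \mathbf{w}_t \cdot \mathbf{x}\right)$$

where t is the target label and w_t and w_y are the model parameters (the weight vectors for classes t and y).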
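The Crammer and Singer definition can be sketched as follows, assuming the usual form max(0, 1 + max over y ≠ t of w_y·x − w_t·x) with one weight vector per class. The names `dot` and `crammer_singer_hinge`, and the choice to store the weights as a list of rows `W`, are assumptions made for illustration.

```python
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def crammer_singer_hinge(W: Sequence[Sequence[float]],
                         x: Sequence[float],
                         t: int) -> float:
    """Multiclass hinge loss: max(0, 1 + max_{y != t} w_y.x - w_t.x).

    W holds one weight vector per class (at least two classes);
    t is the index of the target class.
    """
    scores = [dot(w, x) for w in W]
    # Highest score among the non-target classes.
    best_other = max(s for k, s in enumerate(scores) if k != t)
    return max(0.0, 1.0 + best_other - scores[t])


W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # three classes, two features
x = [2.0, 0.5]                              # class scores: 2.0, 0.5, -2.5
print(crammer_singer_hinge(W, x, t=0))      # target wins by at least 1
print(crammer_singer_hinge(W, x, t=1))      # target trails the best rival
```

Only the strongest competing class contributes to this loss; the remaining classes do not affect it.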

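Weston and Watkins provided a similar definition, but with a sum rather than a max:[6][3]

$$\ell(y) = \sum_{y \neq t} \max\left(0,\; 1 + \mathbf{w}_y \cdot \mathbf{x} - \mathbf{w}_t \cdot \mathbf{x}\right)$$

In structured prediction, the hinge loss can be further extended to structured output spaces.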

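Structured SVMs with margin rescaling use the following variant, where w denotes the SVM's parameters, y the SVM's predictions, φ the joint feature function, and Δ the Hamming loss:

$$\ell(\mathbf{y}) = \max\left(0,\; \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle\right)$$

The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it.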

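It is not differentiable everywhere, but has a subgradient with respect to model parameters w of a linear SVM with score function y = w · x that is given by

$$\frac{\partial \ell}{\partial w_i} = \begin{cases} -t \cdot x_i & \text{if } t \cdot y < 1, \\ 0 & \text{otherwise.} \end{cases}$$

However, since the derivative of the hinge loss at t·y = 1 is undefined, smoothed versions may be preferred for optimization, such as Rennie and Srebro's[7]

$$\ell(y) = \begin{cases} \frac{1}{2} - ty & \text{if } ty \le 0, \\ \frac{1}{2}(1 - ty)^2 & \text{if } 0 < ty < 1, \\ 0 & \text{if } 1 \le ty \end{cases}$$

or the quadratically smoothed

$$\ell_\gamma(y) = \begin{cases} \frac{1}{2\gamma}\max(0,\; 1 - ty)^2 & \text{if } ty \ge 1 - \gamma, \\ 1 - \frac{\gamma}{2} - ty & \text{otherwise} \end{cases}$$

suggested by Zhang.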
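The subgradient and the two smoothed losses can be sketched in a few lines. This assumes a linear score y = w·x without a bias term and the standard piecewise forms of the Rennie and Srebro loss and of Zhang's quadratically smoothed loss with parameter γ. All function names here are illustrative, and returning zero at t·y = 1 is one valid choice of subgradient among several.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def hinge_subgradient(w: Sequence[float], x: Sequence[float], t: int) -> List[float]:
    """A subgradient of max(0, 1 - t*(w.x)) with respect to w.

    Returns -t*x when the margin is violated (t*y < 1) and zero otherwise;
    at the kink t*y == 1 zero is one of several valid choices.
    """
    y = dot(w, x)
    if t * y < 1:
        return [-t * xi for xi in x]
    return [0.0] * len(x)


def smooth_hinge(t: int, y: float) -> float:
    """Smoothed hinge in the piecewise form attributed to Rennie and Srebro."""
    z = t * y
    if z <= 0:
        return 0.5 - z
    if z < 1:
        return 0.5 * (1.0 - z) ** 2
    return 0.0


def quadratically_smoothed_hinge(t: int, y: float, gamma: float = 1.0) -> float:
    """Quadratically smoothed hinge with smoothing parameter gamma > 0."""
    z = t * y
    if z >= 1.0 - gamma:
        return max(0.0, 1.0 - z) ** 2 / (2.0 * gamma)
    return 1.0 - gamma / 2.0 - z


# One subgradient-descent step on a single example (step size is arbitrary).
w, x, t, step = [0.0, 0.0], [1.0, 2.0], 1, 0.1
g = hinge_subgradient(w, x, t)
w = [wi - step * gi for wi, gi in zip(w, g)]
print(w)
```

Both smoothed losses are differentiable at t·y = 1, which is why gradient-based optimizers can use them without special handling of the kink. As γ shrinks toward 0, the quadratically smoothed loss approaches the ordinary hinge loss.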