Supervised learning

In machine learning, supervised learning (SL) is a paradigm in which a model is trained on input objects (e.g., a vector of predictor variables) and their desired output values (also known as a supervisory signal), which are often human-provided labels.

The training process builds a function that maps new data to expected output values.[1]

An optimal scenario will allow the algorithm to accurately determine output values for unseen instances.

This requires the learning algorithm to generalize from the training data to unseen situations in a reasonable way (see inductive bias).[2]
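
A minimal sketch of this workflow (assuming scikit-learn and a hypothetical two-feature toy dataset) fits a model on labeled pairs and then predicts outputs for inputs it has never seen:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy labeled training data: each row of X_train is an input object (feature
# vector) and each entry of y_train is the desired output value (the
# supervisory signal).
X_train = np.array([[0.0, 0.2], [0.3, 0.1], [0.9, 1.1], [1.0, 0.8]])
y_train = np.array([0, 0, 1, 1])

# Training builds a function that maps inputs to predicted outputs.
model = LogisticRegression()
model.fit(X_train, y_train)

# The learned function is then applied to unseen instances; good
# generalization means these predictions are usually correct.
X_new = np.array([[0.1, 0.0], [0.95, 0.9]])
print(model.predict(X_new))
```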

Imagine that we have available several different, but equally good, training data sets.

A learning algorithm is biased for a particular input x if, when trained on each of these data sets, it is systematically incorrect when predicting the correct output for x. A learning algorithm with low bias must be flexible enough to fit the data well.

But if the learning algorithm is too flexible, it will fit each training data set differently, and hence have high variance.
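
The variance half of this tradeoff can be made concrete with a small simulation. The sketch below (a hypothetical setup: a sinusoidal "true" function, noisy labels, and polynomial fits via NumPy) trains models of different flexibility on many equally good data sets and inspects how their predictions at one input spread out:

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)   # assumed "true" function, for illustration
x_probe = 0.25                             # fixed input at which predictions are compared

def predictions_at(degree, n_datasets=200, n_points=15):
    """Fit a polynomial of the given degree on many training sets and
    collect its prediction at x_probe."""
    preds = []
    for _ in range(n_datasets):
        x = rng.uniform(0, 1, n_points)
        y = true_f(x) + rng.normal(0, 0.2, n_points)   # noisy labels
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x_probe))
    return np.array(preds)

for degree in (1, 10):
    p = predictions_at(degree)
    print(f"degree {degree}: bias ~ {p.mean() - true_f(x_probe):+.3f}, variance ~ {p.var():.3f}")

# The rigid degree-1 model shows larger bias but small variance; the flexible
# degree-10 model fits each data set differently and so has high variance.
```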

If the true function is highly complex (e.g., because it involves complex interactions among many different input features and behaves differently in different parts of the input space), then it will only be learnable from a large amount of training data paired with a "flexible" learning algorithm with low bias and high variance.

When the input feature vectors have high dimensionality, the many extra dimensions can confuse the learning algorithm and inflate its variance, even if the true function depends on only a few of the features. Hence, input data of large dimensions typically requires tuning the classifier to have low variance and high bias.

In practice, if the engineer can manually remove irrelevant features from the input data, doing so will likely improve the accuracy of the learned function.

This is an instance of the more general strategy of dimensionality reduction, which seeks to map the input data into a lower-dimensional space prior to running the supervised learning algorithm.
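
As a rough sketch of this strategy (assuming scikit-learn, and a synthetic dataset in which the informative features carry most of the variance so that PCA retains them), the inputs are projected into a lower-dimensional space before the supervised learner is run:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic high-dimensional data: only the first 2 of 50 features matter;
# the remaining 48 are irrelevant noise dimensions.
X = rng.normal(size=(200, 50))
X[:, :2] *= 3.0                                 # informative features get higher variance
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Map the input into a lower-dimensional space, then run the supervised learner.
model = make_pipeline(PCA(n_components=5), LogisticRegression())
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```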

A fourth issue is the degree of noise in the desired output values (the supervisory target variables).

Overfitting can occur even when there are no measurement errors (stochastic noise) if the target function is too complex for the chosen model. In such a situation, the part of the target function that cannot be modeled "corrupts" the training data; this phenomenon has been called deterministic noise.

When either type of noise is present, it is better to use an estimator with higher bias and lower variance.

In practice, there are several approaches to alleviate noise in the output values, such as early stopping to prevent overfitting, and detecting and removing noisy training examples prior to training the supervised learning algorithm.[4][5]
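
Early stopping can be sketched roughly as follows (assuming scikit-learn's SGDClassifier and a synthetic dataset with a hypothetical 10% of labels flipped to simulate noisy targets): training is halted once accuracy on a held-out validation set stops improving, rather than fitting the noisy labels ever more closely.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(600, 20))
y = (X[:, 0] - X[:, 1] > 0).astype(int)
flip = rng.random(len(y)) < 0.10            # simulate noisy supervisory targets
y[flip] = 1 - y[flip]

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
model = SGDClassifier(loss="log_loss", random_state=0)

best_score, patience, bad_epochs = -np.inf, 5, 0
for epoch in range(200):
    model.partial_fit(X_tr, y_tr, classes=np.array([0, 1]))
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, bad_epochs = score, 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:              # stop before the model memorizes the noise
        break

print(f"stopped after {epoch + 1} epochs, best validation accuracy {best_score:.2f}")
```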

Other factors also matter when choosing and applying a learning algorithm. When considering a new application, the engineer can compare multiple learning algorithms and experimentally determine which one works best on the problem at hand (see cross-validation).[6]
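
Such a comparison can be sketched with cross-validation (assuming scikit-learn and its built-in breast-cancer dataset; the two candidate algorithms are arbitrary choices for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Compare candidate learning algorithms by cross-validated accuracy and pick
# whichever works best on this particular problem.
candidates = [
    ("logistic regression", LogisticRegression(max_iter=5000)),
    ("random forest", RandomForestClassifier(random_state=0)),
]
for name, model in candidates:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```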

Empirical risk minimization seeks the function that best fits the training data.

Structural risk minimization includes a penalty function that controls the bias/variance tradeoff.

In both cases, it is assumed that the training set consists of a sample of independent and identically distributed pairs $(x_i, y_i)$.

To measure how well a function fits the training data, a loss function $L : Y \times Y \to \mathbb{R}^{\ge 0}$ is defined; for a training example $(x_i, y_i)$, the loss of predicting the value $\hat{y}$ is $L(y_i, \hat{y})$. The risk of a function $g$ is defined as its expected loss, which can be estimated from the training data as the empirical risk
$$R_{\mathrm{emp}}(g) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, g(x_i)).$$

In empirical risk minimization, the supervised learning algorithm seeks the function $g$ that minimizes this quantity. When $g$ is a conditional probability distribution $P(y \mid x)$ and the loss function is the negative log likelihood, $L(y, \hat{y}) = -\log P(y \mid x)$, then empirical risk minimization is equivalent to maximum likelihood estimation.
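
A rough numerical sketch of empirical risk minimization (a hypothetical logistic model with negative log-likelihood loss, minimized by plain gradient descent; all names and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -1.0]) + 0.3 * rng.normal(size=200) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def empirical_risk(w):
    # R_emp = (1/N) * sum_i L(y_i, g(x_i)) with negative log-likelihood loss
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

# Gradient descent on the empirical risk; with this loss, minimizing it
# coincides with maximum likelihood estimation of the logistic model.
w = np.zeros(2)
for _ in range(2000):
    p = sigmoid(X @ w)
    w -= 0.5 * (X.T @ (p - y)) / len(y)

print("weights:", w, " empirical risk:", empirical_risk(w))
```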

When the space of candidate functions is large or the training set is not sufficiently large, empirical risk minimization leads to high variance and poor generalization.

The learning algorithm is able to memorize the training examples without generalizing well (overfitting).

Structural risk minimization seeks to prevent overfitting by incorporating a regularization penalty into the optimization.

The regularization penalty can be viewed as implementing a form of Occam's razor that prefers simpler functions over more complex ones.

A wide variety of penalties have been employed that correspond to different definitions of complexity.

The supervised learning optimization problem then becomes finding the function $g$ that minimizes
$$J(g) = R_{\mathrm{emp}}(g) + \lambda \, C(g),$$
where $C(g)$ is the complexity penalty and the parameter $\lambda$ controls the bias-variance tradeoff. When $\lambda = 0$, this gives empirical risk minimization with low bias and high variance; when $\lambda$ is large, the learning algorithm will have high bias and low variance.
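
Continuing the illustrative sketch above, structural risk minimization can be shown by adding a squared-norm (L2) penalty to a squared-error empirical risk; setting the penalty weight to zero recovers plain empirical risk minimization (the dataset and parameter values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 30))                    # few examples, many features
w_true = np.zeros(30)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.1 * rng.normal(size=60)

def fit_ridge(lam):
    # Minimize J(w) = (1/n) * ||X w - y||^2 + lam * ||w||^2 (closed form).
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

for lam in (0.0, 0.1, 10.0):
    w = fit_ridge(lam)
    print(f"lambda = {lam:5.1f}   ||w|| = {np.linalg.norm(w):6.2f}")

# lambda = 0 is empirical risk minimization (low bias, high variance); larger
# lambda shrinks the weights toward simpler functions (higher bias, lower variance).
```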

The complexity penalty has a Bayesian interpretation as the negative log prior probability of $g$, $-\log P(g)$, in which case minimizing $J(g)$ corresponds to maximizing the posterior probability of $g$.

When the learned function is a joint probability distribution $P(x, y)$ and the loss function is the negative log likelihood $-\log P(x, y)$, risk minimization is said to perform generative training.

In some cases, the solution can be computed in closed form as in naive Bayes and linear discriminant analysis.
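
A rough illustration of such a closed-form solution (assuming Gaussian class-conditional features, as in Gaussian naive Bayes; the data and parameters are illustrative): the fitted model consists only of class priors and per-class feature means and variances estimated directly from the labeled data.

```python
import numpy as np

rng = np.random.default_rng(5)
# Toy labeled data drawn from two classes.
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# Closed-form "training": estimate P(y) and per-class feature means/variances.
params = {}
for c in (0, 1):
    Xc = X[y == c]
    params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0))

def log_joint(x, c):
    # log P(x, y=c) = log P(y=c) + sum_j log N(x_j; mu_j, var_j)
    prior, mu, var = params[c]
    return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

x_new = np.array([1.8, 1.5])
print("predicted class:", max((0, 1), key=lambda c: log_joint(x_new, c)))
```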

In supervised learning, the training data is labeled with the expected answers, while in unsupervised learning, the model identifies patterns or structures in unlabeled data.
Figure: tendency for a task to employ supervised vs. unsupervised methods. Task names straddling the circle boundaries are placed there intentionally, showing that the classical division in which imaginative tasks (left) employ unsupervised methods is blurred in today's learning schemes.
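
To make the contrast concrete (a minimal sketch assuming scikit-learn; the data are synthetic), the same feature matrix can be used with labels by a supervised classifier or without labels by a clustering algorithm:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(4.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)            # labels: the expected answers

# Supervised: the model is trained on (input, label) pairs.
clf = LogisticRegression().fit(X, y)
print("supervised prediction:", clf.predict([[3.5, 4.2]]))

# Unsupervised: the model sees only X and finds structure (clusters) on its own.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster assignment:  ", km.predict([[3.5, 4.2]]))
```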