Generalization error

For supervised learning applications in machine learning and statistical learning theory, generalization error[1] (also known as the out-of-sample error[2] or the risk) is a measure of how accurately an algorithm is able to predict outcomes for previously unseen data.

As learning algorithms are evaluated on finite samples, the evaluation of a learning algorithm may be sensitive to sampling error.

As a result, measurements of prediction error on the current data may not provide much information about the algorithm's predictive ability on new, unseen data.

The generalization error can be minimized by avoiding overfitting in the learning algorithm.

The performance of machine learning algorithms is commonly visualized by learning curve plots that show estimates of the generalization error throughout the learning process.

In a learning problem, the goal is to develop a function

is developed based on a data set of

The generalization error or expected loss or risk

is the expected value of the loss function

is the unknown joint probability distribution for

Without knowing the joint probability distribution

Instead, we can compute the error on sample data, which is called empirical error (or empirical risk).

data points, the empirical error of a candidate function

that is found by a learning algorithm based on the sample.

Instead, the aim of many problems in statistical learning theory is to bound or characterize the difference of the generalization error and the empirical error in probability: That is, the goal is to characterize the probability

Specifically, if an algorithm is symmetric (the order of inputs does not affect the result), has bounded loss and meets two stability conditions, it will generalize.

The first stability condition, leave-one-out cross-validation stability, says that to be stable, the prediction error for each data point when leave-one-out cross validation is used must converge to zero as

norm) is met if the prediction on a left-out datapoint does not change when a single data point is removed from the training dataset.

[3] A number of algorithms have been proven to be stable and as a result have bounds on their generalization error.

A list of these algorithms and the papers that proved stability is available here.

The concepts of generalization error and overfitting are closely related.

Overfitting occurs when the learned function

As a result, the function will perform well on the training set but not perform well on other data from the joint probability distribution of

Thus, the more overfitting occurs, the larger the generalization error.

The testing sample is previously unseen by the algorithm and so represents a random sample from the joint probability distribution of

This test sample allows us to approximate the expected error and as a result approximate a particular form of the generalization error.

Many algorithms exist to prevent overfitting.

The minimization algorithm can penalize more complex functions (known as Tikhonov regularization), or the hypothesis space can be constrained, either explicitly in the form of the functions or by adding constraints to the minimization function (Ivanov regularization).

The approach to finding a function that does not overfit is at odds with the goal of finding a function that is sufficiently complex to capture the particular characteristics of the data.

Keeping a function simple to avoid overfitting may introduce a bias in the resulting predictions, while allowing it to be more complex leads to overfitting and a higher variance in the predictions.

This figure illustrates the relationship between overfitting and the generalization error I [ f n ] - I S [ f n ]. Data points were generated from the relationship y = x with white noise added to the y values. In the left column, a set of training points is shown in blue. A seventh order polynomial function was fit to the training data. In the right column, the function is tested on data sampled from the underlying joint probability distribution of x and y . In the top row, the function is fit on a sample dataset of 10 datapoints. In the bottom row, the function is fit on a sample dataset of 100 datapoints. As we can see, for small sample sizes and complex functions, the error on the training set is small but error on the underlying distribution of data is large and we have overfit the data. As a result, generalization error is large. As the number of sample points increases, the prediction error on training and test data converges and generalization error goes to 0.