Early stopping

In machine learning, early stopping is a form of regularization used to avoid overfitting when training a model with an iterative method, such as gradient descent.

Such methods update the model to make it better fit the training data with each iteration. Up to a point, this improves the model's performance on data outside of the training set.

Past that point, however, improving the model's fit to the training data comes at the expense of increased generalization error.

Early stopping rules provide guidance as to how many iterations can be run before the learner begins to overfit.

Early stopping rules have been employed in many different machine learning methods, with varying amounts of theoretical foundation.

This section presents some of the basic machine-learning concepts required for a description of early stopping methods.

Overfitting occurs when a model fits the data in the training set well, while incurring a larger generalization error. Regularization, in this context, refers to modifying the learning algorithm so as to prevent overfitting; this generally involves imposing some sort of smoothness constraint on the learned model.[1]

This smoothness may be enforced explicitly, by fixing the number of parameters in the model, or by augmenting the cost function, as in Tikhonov regularization.
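For illustration (the formula below is a standard statement of ridge-type Tikhonov regularization for least squares, not drawn from the cited reference), the penalized objective augments the data-fit term with a weight-norm penalty whose strength is set by $\lambda$:

```latex
% Tikhonov (ridge) regularized least squares: the first term measures fit to
% the training data, the second penalizes large weights, and \lambda > 0
% controls the strength of the smoothness constraint.
\min_{w \in \mathbb{R}^{d}} \; \frac{1}{n} \sum_{i=1}^{n} \left( y_i - w^{\top} x_i \right)^{2} + \lambda \, \lVert w \rVert_{2}^{2}
```

Early stopping plays a broadly analogous role, with the number of iterations, rather than $\lambda$, acting as the tuning knob that limits model complexity.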

Gradient descent methods are first-order, iterative optimization methods. Each iteration updates an approximate solution to the optimization problem by taking a step in the direction of the negative of the gradient of the objective function.

By choosing the step size appropriately, such a method can be made to converge to a local minimum of the objective function. Gradient descent is used in machine learning by defining a loss function that reflects the learner's error on the training set and then minimizing that function.
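As a minimal sketch of such an iteration (assuming a least-squares objective and a fixed step size; the function and variable names are illustrative, not taken from any source), a gradient descent loop can be written as follows, and the number of iterations it runs is exactly the quantity that early stopping rules limit:

```python
import numpy as np

def gradient_descent(X, y, step_size=0.01, num_iters=100):
    """Minimal gradient descent for least-squares linear regression.

    Each iteration moves the weight vector a small step in the direction
    of the negative gradient of the mean squared error on (X, y).
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        residual = X @ w - y             # current prediction errors
        grad = 2.0 / n * X.T @ residual  # gradient of the mean squared error
        w -= step_size * grad            # step against the gradient
    return w
```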

Early stopping can be used to regularize non-parametric regression problems encountered in machine learning. For such problems, the hypothesis space is typically a reproducing kernel Hilbert space.[2] These spaces can be infinite-dimensional, in which case they can supply solutions that overfit training sets of arbitrary size.

One way to regularize non-parametric regression problems is to apply an early stopping rule to an iterative procedure such as gradient descent.

The early stopping rules proposed for these problems are based on analysis of upper bounds on the generalization error as a function of the iteration number.

They yield prescriptions for the number of iterations to run that can be computed prior to starting the solution process.

We want to control the difference between the expected risk of the sample iteration and the minimum expected risk, that is, the expected risk of the regression function. Writing $\mathcal{E}$ for the expected risk, $f_t$ for the sample iteration after $t$ steps, $\bar{f}_t$ for the corresponding population iteration, and $f_\rho$ for the regression function, the quantity to be controlled is

$$\mathcal{E}(f_t) - \mathcal{E}(f_\rho).$$

This difference can be rewritten as the sum of two terms: the difference in expected risk between the sample and population iterations and that between the population iteration and the regression function:

$$\mathcal{E}(f_t) - \mathcal{E}(f_\rho) = \left[\mathcal{E}(f_t) - \mathcal{E}(\bar{f}_t)\right] + \left[\mathcal{E}(\bar{f}_t) - \mathcal{E}(f_\rho)\right].$$

This equation presents a bias-variance tradeoff, which is then solved to give an optimal stopping rule that may depend on the unknown probability distribution.

For the analysis leading to the early stopping rule and bounds, the reader is referred to the original article.[3]

In practice, data-driven methods, e.g. cross-validation, can be used to obtain an adaptive stopping rule.
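A hedged sketch of such a data-driven rule is given below; the Gaussian kernel, the fixed step size, and the use of mean squared error are assumptions for illustration, not prescriptions from the cited analysis. Gradient descent is run on a kernel least-squares problem and the iterate with the lowest error on a held-out validation set is kept:

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Gram matrix of the Gaussian (RBF) kernel between the rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def early_stopped_kernel_regression(X_tr, y_tr, X_val, y_val,
                                    step_size=0.1, max_iters=200):
    """Gradient descent for kernel least squares, with the stopping
    iteration chosen adaptively by validation error."""
    K_tr = gaussian_kernel(X_tr, X_tr)    # kernel matrix on the training set
    K_val = gaussian_kernel(X_val, X_tr)  # kernel values between val and train
    n = len(y_tr)
    c = np.zeros(n)                       # coefficients of the kernel expansion
    best_c, best_err = c.copy(), np.inf
    for _ in range(max_iters):
        residual = K_tr @ c - y_tr
        c -= step_size / n * residual     # functional gradient step on the empirical risk
        val_err = np.mean((K_val @ c - y_val) ** 2)
        if val_err < best_err:            # keep the best iterate seen so far
            best_c, best_err = c.copy(), val_err
    return best_c
```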

It has been shown, for several boosting algorithms (including AdaBoost), that regularization via early stopping can provide guarantees of consistency, that is, that the result of the algorithm approaches the true solution as the number of samples goes to infinity.

Validation-based early stopping rules work by splitting the original training set into a new training set and a validation set; the error on the validation set is used as a proxy for the generalization error in determining when overfitting has begun. These methods are employed in the training of many iterative machine learning algorithms, including neural networks.

Prechelt gives the following summary of a naive implementation of holdout-based early stopping:[9]

1. Split the training data into a training set and a validation set.
2. Train only on the training set and evaluate the per-example error on the validation set from time to time, e.g. after every fifth epoch.
3. Stop training as soon as the error on the validation set is higher than it was the last time it was checked.
4. Use the weights the network had in that previous step as the result of the training run.

Cross-validation is an alternative that is applicable to non-time-series scenarios.
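A minimal sketch of this naive procedure is shown below; train_one_epoch, validation_error, get_weights, and set_weights are hypothetical caller-supplied helpers rather than the API of any particular library:

```python
def naive_early_stopping(train_one_epoch, validation_error,
                         get_weights, set_weights,
                         eval_every=5, max_epochs=1000):
    """Naive holdout-based early stopping: train, check the validation
    error every few epochs, and stop as soon as it increases, restoring
    the weights saved at the previous check."""
    best_weights = get_weights()          # weights at the last check
    prev_val_err = validation_error()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if epoch % eval_every == 0:
            val_err = validation_error()
            if val_err > prev_val_err:    # validation error has started to rise
                set_weights(best_weights) # roll back to the previous check
                break
            best_weights = get_weights()
            prev_val_err = val_err
    return best_weights
```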

Even this simple procedure is complicated in practice by the fact that the validation error may fluctuate during training, producing multiple local minima.

This complication has led to the creation of many ad hoc rules for deciding when overfitting has truly begun.

Figure 1.  The green line represents an overfitted model and the black line represents a regularized model. While the green line best follows the training data, it is too dependent on that data and is likely to have a higher error rate on new, unseen data (illustrated by the black-outlined dots) than the black line.