In general, as we increase the number of tunable parameters in a model, it becomes more flexible, and can better fit a training data set.
The bias–variance dilemma or bias–variance problem is the conflict in trying to simultaneously minimize these two sources of error, which prevent supervised learning algorithms from generalizing beyond their training set:[1][2] the bias error, arising from erroneous or overly simple assumptions in the learning algorithm, and the variance error, arising from excessive sensitivity to small fluctuations in the training set. The bias–variance decomposition is a way of analyzing a learning algorithm's expected generalization error with respect to a particular problem as a sum of three terms: the bias, the variance, and a quantity called the irreducible error, resulting from noise in the problem itself.
High-variance learning methods may represent their training set well but are at risk of overfitting to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that may fail to capture important regularities in the data (i.e. underfit).
Accuracy is one way of quantifying bias, and it can intuitively be improved by selecting from only local information. Consequently, a sample will appear accurate (i.e. have low bias) under such selection conditions, but the resulting model may underfit the data as a whole.
A graphical example would be a straight line fit to data exhibiting quadratic behavior overall.
Precision is a description of variance and generally can only be improved by selecting information from a comparatively larger space.
The option to select many data points over a broad sample space is the ideal condition for any analysis.
The limiting case, in which only a limited number of data points are selected over a broad sample space, may result in improved precision and lower variance overall, but may also result in an overreliance on the training data (overfitting).
To borrow from the previous example, the graphical representation would appear as a high-order polynomial fit to the same data exhibiting quadratic behavior.
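A small numerical sketch of these two graphical examples (the quadratic data, noise level, and polynomial degrees below are illustrative choices, not part of the original discussion): a degree-1 fit underfits the quadratic structure, while a degree-10 fit chases the training points and does worse on held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy quadratic ground truth: y = x^2 + noise
def make_data(n):
    x = rng.uniform(-3, 3, n)
    return x, x**2 + rng.normal(0, 1, n)

x_train, y_train = make_data(20)
x_test, y_test = make_data(1000)

for degree in (1, 2, 10):
    # Ordinary least-squares polynomial fit of the given degree
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:6.2f}, test MSE {test_mse:6.2f}")

# Typical outcome: the straight line (degree 1) has high error everywhere
# (underfitting, high bias); the degree-10 polynomial has near-zero training
# error but higher held-out error (overfitting, high variance); degree 2
# matches the true structure and does best on held-out data.
```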
To mitigate how much information is used from neighboring observations, a model can be smoothed via explicit regularization, such as shrinkage.
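As a sketch of shrinkage in this spirit (the penalty values and polynomial features below are arbitrary illustrations, not a recommended setting), ridge regression adds an L2 penalty to a high-order polynomial fit like the one above, smoothing it toward simpler solutions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Same kind of noisy quadratic data as in the previous sketch
x = rng.uniform(-3, 3, 20)
y = x**2 + rng.normal(0, 1, 20)
x_test = rng.uniform(-3, 3, 1000)
y_test = x_test**2 + rng.normal(0, 1, 1000)

def design(x, degree=10):
    # Polynomial feature matrix [1, x, x^2, ..., x^degree]
    return np.vander(x, degree + 1, increasing=True)

X, X_test = design(x), design(x_test)

for lam in (1e-4, 1e-2, 1.0, 100.0):
    # Ridge (shrinkage) solution: w = (X^T X + lam * I)^{-1} X^T y
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    test_mse = np.mean((X_test @ w - y_test) ** 2)
    print(f"lambda {lam:8g}: test MSE {test_mse:7.2f}")

# Larger lambda shrinks the coefficients, accepting a little bias in exchange
# for lower variance; if lambda is made too large the model underfits again.
```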
Suppose the data are generated as $y = f(x) + \varepsilon$, where $f$ is the true (unknown) function and the noise $\varepsilon$ has zero mean and variance $\sigma^2$. Finding a model $\hat f$, fitted to a training set $D$, that generalizes to points outside of the training set can be done with any of the countless algorithms used for supervised learning.
The more complex the model $\hat f$ is, the more data points it will capture, and the lower its bias will be. However, complexity will make the model "move" more to capture the data points, and hence its variance will be larger.
Let us write the mean-squared error of our model at a point $x$, with expectations taken over the training set $D$ and the noise:

$$\mathrm{MSE} \triangleq \operatorname{E}\!\big[(y - \hat f(x))^2\big] = \operatorname{E}\!\big[(f(x) + \varepsilon - \hat f(x))^2\big] = \operatorname{E}\!\big[(f(x) - \hat f(x))^2\big] + 2\,\operatorname{E}\!\big[(f(x) - \hat f(x))\,\varepsilon\big] + \operatorname{E}[\varepsilon^2].$$

We can show that the second term of this equation is null, because the noise $\varepsilon$ on the test label is independent of the fitted model and has zero mean:

$$\operatorname{E}\!\big[(f(x) - \hat f(x))\,\varepsilon\big] = \operatorname{E}\!\big[f(x) - \hat f(x)\big]\,\operatorname{E}[\varepsilon] = 0,$$

while the third term is simply the noise variance, $\operatorname{E}[\varepsilon^2] = \sigma^2$. Expanding the first term around $\operatorname{E}[\hat f(x)]$, and noting that the cross term vanishes because $f(x) - \operatorname{E}[\hat f(x)]$ is a constant with respect to $D$, gives

$$\operatorname{E}\!\big[(f(x) - \hat f(x))^2\big] = \big(f(x) - \operatorname{E}[\hat f(x)]\big)^2 + \operatorname{E}\!\big[\big(\hat f(x) - \operatorname{E}[\hat f(x)]\big)^2\big].$$

Eventually, we plug our derivations back into the original equation, and identify each term:

$$\mathrm{MSE}(x) = \underbrace{\big(f(x) - \operatorname{E}[\hat f(x)]\big)^2}_{\operatorname{Bias}[\hat f(x)]^2} + \underbrace{\operatorname{E}\!\big[\big(\hat f(x) - \operatorname{E}[\hat f(x)]\big)^2\big]}_{\operatorname{Var}[\hat f(x)]} + \sigma^2.$$

Finally, the MSE loss function (or negative log-likelihood) is obtained by taking the expectation value over $x \sim P$:

$$\mathrm{MSE} = \operatorname{E}_{x}\!\big[\operatorname{Bias}[\hat f(x)]^2 + \operatorname{Var}[\hat f(x)]\big] + \sigma^2.$$
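To make the decomposition above concrete, the following Monte Carlo sketch (the sine target, cubic estimator, and sample sizes are arbitrary illustrative choices) draws many training sets, fits an estimator at a fixed test point, and checks that squared bias plus variance plus noise variance reproduces the expected squared error:

```python
import numpy as np

rng = np.random.default_rng(2)

f = np.sin                       # true function f(x)
sigma = 0.3                      # noise standard deviation
x0 = 1.0                         # fixed test point
n, degree, trials = 30, 3, 20000

predictions = np.empty(trials)
sq_errors = np.empty(trials)
for t in range(trials):
    # Draw a fresh training set D and a fresh noisy test label y at x0
    x = rng.uniform(-np.pi, np.pi, n)
    y = f(x) + rng.normal(0, sigma, n)
    y0 = f(x0) + rng.normal(0, sigma)
    # Fit the model on D: here a cubic least-squares polynomial
    coeffs = np.polyfit(x, y, degree)
    predictions[t] = np.polyval(coeffs, x0)
    sq_errors[t] = (y0 - predictions[t]) ** 2

mse = sq_errors.mean()                       # estimate of E[(y - fhat(x0; D))^2]
bias2 = (predictions.mean() - f(x0)) ** 2    # (E_D[fhat(x0; D)] - f(x0))^2
variance = predictions.var()                 # Var_D[fhat(x0; D)]
# The two quantities below should agree up to Monte Carlo error.
print(f"MSE                     = {mse:.4f}")
print(f"bias^2 + var + sigma^2  = {bias2 + variance + sigma**2:.4f}")
```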
Dimensionality reduction and feature selection can decrease variance by simplifying models.
Adding features (predictors) tends to decrease bias, at the expense of introducing additional variance.
[14][15] For example, boosting combines many "weak" (high bias) models in an ensemble that has lower bias than the individual models, while bagging combines "strong" learners in a way that reduces their variance.
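As one illustration of the bagging half of that statement (using scikit-learn on an arbitrary noisy-sine dataset; the parameters are illustrative), a fully grown regression tree is a low-bias, high-variance learner, and averaging many such trees over bootstrap resamples typically reduces the variance:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor

rng = np.random.default_rng(3)

# Noisy sine data; a fully grown regression tree is a low-bias,
# high-variance learner on such a problem.
X = rng.uniform(-np.pi, np.pi, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 200)
X_test = rng.uniform(-np.pi, np.pi, (2000, 1))
y_test = np.sin(X_test[:, 0]) + rng.normal(0, 0.3, 2000)

tree = DecisionTreeRegressor(random_state=0).fit(X, y)
# BaggingRegressor averages many trees, each fit on a bootstrap resample of
# the training set (its default base estimator is a decision tree).
bag = BaggingRegressor(n_estimators=100, random_state=0).fit(X, y)

for name, model in (("single tree", tree), ("bagged trees", bag)):
    mse = np.mean((model.predict(X_test) - y_test) ** 2)
    print(f"{name:12s} test MSE: {mse:.3f}")

# Averaging over bootstrap resamples leaves the trees' (low) bias roughly
# unchanged while reducing their variance, so the ensemble typically
# achieves lower held-out error than any single tree.
```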
In the case of k-nearest neighbors regression, when the expectation is taken over the possible labeling of a fixed training set, a closed-form expression exists that relates the bias–variance decomposition to the parameter k:[8]: 37, 223 

$$\operatorname{E}\!\left[(y - \hat f(x))^2 \mid X = x\right] = \left(f(x) - \frac{1}{k}\sum_{i=1}^{k} f\big(N_i(x)\big)\right)^{2} + \frac{\sigma^2}{k} + \sigma^2,$$

where $N_1(x), \dots, N_k(x)$ are the $k$ nearest neighbors of $x$ in the training set. The squared bias (first term) tends to grow as $k$ increases, since ever more distant neighbors are averaged in, while the variance (second term) falls off as $\sigma^2/k$.
In fact, under "reasonable assumptions" the bias of the first-nearest neighbor (1-NN) estimator vanishes entirely as the size of the training set approaches infinity.
For the case of classification under the 0-1 loss (misclassification rate), it is possible to find a similar decomposition, with the caveat that the variance term becomes dependent on the target label.
It has been argued that, as the quantity of training data increases, the variance of learned models tends to decrease; hence, with larger training sets, error is minimised by methods that learn models with lower bias, while for smaller training sets it becomes ever more important to minimise variance.
[18] Even though the bias–variance decomposition does not directly apply in reinforcement learning, a similar tradeoff can also characterize generalization.
[20] Convergence diagnostics can be used to control bias via burn-in removal, but because of the limited computational budget a bias–variance trade-off arises,[21] leading to a wide range of approaches in which a controlled bias is accepted if it allows the variance, and hence the overall estimation error, to be reduced dramatically.
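A toy sketch of this trade-off (the AR(1) chain, its parameters, and the fixed run length are illustrative and not tied to any particular MCMC method): with a fixed computational budget, discarding more burn-in lowers the bias caused by the non-stationary start but raises the variance of the resulting estimate.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy Markov chain with known stationary mean 0: an AR(1) process
# x_{t+1} = rho * x_t + noise, deliberately started far from stationarity.
rho, sigma, start = 0.95, 1.0, 50.0
length, runs = 2000, 2000           # fixed computational budget per run

chains = np.empty((runs, length))
state = np.full(runs, start)
chains[:, 0] = state
for t in range(1, length):
    state = rho * state + rng.normal(0.0, sigma, runs)
    chains[:, t] = state

for burn_in in (0, 50, 200, 1000):
    # Estimate the stationary mean from the samples kept after burn-in removal
    estimates = chains[:, burn_in:].mean(axis=1)
    bias, var = estimates.mean() - 0.0, estimates.var()
    print(f"burn-in {burn_in:4d}: bias {bias:7.4f}  variance {var:.4f}  MSE {bias**2 + var:.4f}")

# Discarding more burn-in removes the bias caused by the non-stationary start,
# but leaves fewer samples in the average, so the variance of the estimate grows;
# some intermediate burn-in minimises the overall mean-squared error.
```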
In the context of human cognition, it has been argued (see references below) that the human brain resolves the dilemma in the case of the typically sparse, poorly characterized training sets provided by experience by adopting high-bias/low-variance heuristics.
This reflects the fact that a zero-bias approach has poor generalizability to new situations, and also unreasonably presumes precise knowledge of the true state of the world.
The resulting heuristics are relatively simple, but produce better inferences in a wider variety of situations.
[25] Geman et al.[12] argue that the bias–variance dilemma implies that abilities such as generic object recognition cannot be learned from scratch, but require a certain degree of "hard wiring" that is later tuned by experience.
This is because model-free approaches to inference require impractically large training sets if they are to avoid high variance.