Training, validation, and test data sets

Machine learning algorithms function by making data-driven predictions or decisions,[2] building a mathematical model from input data.[1]

The model (e.g. a naive Bayes classifier) is trained on the training data set using a supervised learning method, for example optimization methods such as gradient descent or stochastic gradient descent.[4]
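As a minimal sketch of this step, the snippet below fits a naive Bayes classifier to a training split using scikit-learn; the synthetic data, split ratio, and choice of library are illustrative assumptions rather than part of the article.

```python
# Illustrative sketch: fit a naive Bayes classifier to a training split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out 40% of the examples for later validation/testing.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.4, random_state=0)

model = GaussianNB()
model.fit(X_train, y_train)           # supervised learning on the training set only
print(model.score(X_train, y_train))  # accuracy on data the model has already seen
```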

The current model is run on the training data set and its output is compared with the target for each input; based on the result of this comparison and the specific learning algorithm being used, the parameters of the model are adjusted.
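This compare-and-adjust cycle can be illustrated with a hand-rolled gradient descent loop for a linear model; the data, learning rate, and number of epochs below are arbitrary choices for the sketch.

```python
# Minimal sketch of gradient descent on mean squared error for a linear model.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                      # training inputs
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)   # targets ("answer key")

w = np.zeros(3)   # model parameters
lr = 0.1          # learning rate
for epoch in range(100):
    predictions = X @ w                  # run the current model on the training set
    error = predictions - y              # compare the result with the target
    gradient = X.T @ error / len(y)      # direction that reduces the squared error
    w -= lr * gradient                   # adjust the parameters
print(w)                                 # approaches true_w
```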

The validation data set provides an unbiased evaluation of a model fit on the training data set[3] while tuning the model's hyperparameters[5] (e.g. the number and width of the hidden layers in a neural network[4]).
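One way to illustrate this tuning is to train several candidate networks that differ only in hidden-layer width and keep the one with the best validation score; the candidate widths and the use of scikit-learn's MLPClassifier are assumptions made for the example.

```python
# Sketch of hyperparameter tuning with a separate validation set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

best_width, best_score = None, -1.0
for width in (8, 32, 128):                    # candidate hyperparameter values
    model = MLPClassifier(hidden_layer_sizes=(width,), max_iter=500, random_state=0)
    model.fit(X_train, y_train)               # fit on the training set only
    score = model.score(X_val, y_val)         # compare candidates on the validation set
    if score > best_score:
        best_width, best_score = width, score
print(best_width, best_score)
```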

Monitoring the validation error in this simple way is complicated in practice by the fact that the validation data set's error may fluctuate during training, producing multiple local minima.[6]

To reduce the risk of issues such as over-fitting, the examples in the validation and test data sets should not be used to train the model.[5]
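The sketch below shows one way to respect this rule: a three-way split in which only the training portion is used to fit anything, including preprocessing, while the validation and test portions are merely transformed and scored. The split ratios and scikit-learn usage are illustrative.

```python
# Sketch of a three-way split with no leakage from validation or test data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 60% training, 20% validation, 20% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

scaler = StandardScaler().fit(X_train)   # statistics estimated from training data only
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)        # validation and test data are only transformed,
X_test_s = scaler.transform(X_test)      # never used to fit the scaler or the model
```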

An example of a hyperparameter for artificial neural networks is the number of hidden units in each layer.[13]

Since this model-selection procedure can itself lead to some overfitting to the validation set, the performance of the selected network should be confirmed by measuring its performance on a third independent set of data called a test set.

An application of this process is early stopping, where the candidate models are successive iterations of the same network, and training stops when the error on the validation set grows, choosing the previous model (the one with minimum error).
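A rough sketch of early stopping along these lines: train in epochs, track the validation error, remember the parameters at its minimum, and stop once the error grows past a small tolerance (the tolerance guards against the fluctuations noted above). The model, data, and tolerance are assumptions for the example.

```python
# Sketch of early stopping driven by validation error.
import copy
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = SGDClassifier(random_state=0)
classes = np.unique(y_train)
best_model, best_val_error = None, float("inf")
for epoch in range(50):
    model.partial_fit(X_train, y_train, classes=classes)  # one more training pass
    val_error = 1.0 - model.score(X_val, y_val)           # error on the validation set
    if val_error < best_val_error:                        # new minimum: remember this model
        best_model, best_val_error = copy.deepcopy(model), val_error
    elif val_error > best_val_error + 0.01:               # error has grown: stop and
        break                                             # fall back to the stored model
print(best_val_error)
```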

To assess the model's accuracy, the final model's predictions on the test set are compared to the examples' true classifications.

However, when using a method such as cross-validation, two partitions can be sufficient and effective, since results are averaged after repeated rounds of model training and testing to help reduce bias and variability.[12]
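For instance, k-fold cross-validation repeatedly re-partitions one pool of data into training and testing folds and averages the scores; the estimator and the choice of k = 5 below are illustrative.

```python
# Sketch of k-fold cross-validation over a single pool of data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())   # averaged accuracy and its variability
```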

The frequent reversal of the meanings of "validation set" and "test set" in the machine learning literature has been called the most blatant example of the terminological confusion that pervades artificial intelligence research.

To confirm the model's performance, an additional test data set held out from cross-validation is normally used.
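A sketch of that arrangement: set the test portion aside before cross-validation, use cross-validation on the remainder for model assessment, and score the final model once on the untouched test set. The split ratio and estimator are assumptions for the example.

```python
# Sketch of cross-validation combined with a final held-out test set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X_dev, y_dev, cv=5)
final_model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)
print(cv_scores.mean())                   # model-assessment estimate from cross-validation
print(final_model.score(X_test, y_test))  # confirmation on the untouched test set
```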

Omissions in the training of algorithms are a major cause of erroneous outputs.[17] An example of an omission of particular circumstances is a case where a boy was able to unlock a phone because his mother had registered her face under indoor, nighttime lighting, a condition which was not appropriately included in the training of the system.[17]

A training set (left) and a test set (right) from the same statistical population are shown as blue points. Two predictive models are fit to the training data. Both fitted models are plotted with both the training and test sets. In the training set, the MSE of the fit shown in orange is 4 whereas the MSE for the fit shown in green is 9. In the test set, the MSE for the fit shown in orange is 15 and the MSE for the fit shown in green is 13. The orange curve severely overfits the training data, since its MSE increases by almost a factor of four when comparing the test set to the training set. The green curve overfits the training data much less, as its MSE increases by less than a factor of 2.
Comic strip demonstrating a fictional erroneous computer output (making a coffee 5 million degrees, from a previous definition of "extra hot"). This can be classified as both a failure in logic and a failure to include various relevant environmental conditions.[17]