MSE is a risk function, corresponding to the expected value of the squared error loss.[3] In machine learning, specifically empirical risk minimization, MSE may refer to the empirical risk (the average loss on an observed data set), as an estimate of the true MSE (the true risk: the average loss on the actual population distribution).
As it is derived from the square of Euclidean distance, it is always a non-negative value that decreases as the error approaches zero.
Like the variance, MSE has the same units of measurement as the square of the quantity being estimated.
In an analogy to standard deviation, taking the square root of MSE yields the root-mean-square error or root-mean-square deviation (RMSE or RMSD), which has the same units as the quantity being estimated; for an unbiased estimator, the RMSE is the square root of the variance, known as the standard error.
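In symbols, writing $\hat{\theta}$ for an estimator of a parameter $\theta$ (notation introduced here for illustration),
\[ \operatorname{RMSE}(\hat{\theta}) = \sqrt{\operatorname{MSE}(\hat{\theta})} , \]
and when $\hat{\theta}$ is unbiased this equals $\sqrt{\operatorname{Var}(\hat{\theta})}$, the standard error of $\hat{\theta}$.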
The MSE either assesses the quality of a predictor (i.e., a function mapping arbitrary inputs to a sample of values of some random variable), or of an estimator (i.e., a mathematical function mapping a sample of data to an estimate of a parameter of the population from which the data is sampled).
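For the predictor case, a common concrete form is the average of the squared prediction errors over the $n$ observed data points (writing $Y_i$ for the observed values and $\hat{Y}_i$ for the predictions; the notation is illustrative):
\[ \operatorname{MSE} = \frac{1}{n}\sum_{i=1}^{n}\big(Y_i - \hat{Y}_i\big)^2 . \]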
The MSE of an estimator $\hat{\theta}$ with respect to an unknown parameter $\theta$ is defined as[1] $\operatorname{MSE}(\hat{\theta}) = \operatorname{E}_\theta\!\big[(\hat{\theta}-\theta)^2\big]$. This definition depends on the unknown parameter; therefore the MSE is an a priori property of an estimator.
The MSE could be a function of unknown parameters, in which case any estimator of the MSE based on estimates of these parameters would be a function of the data (and thus a random variable).
An even shorter proof can be achieved using the well-known formula that for a random variable $X$, $\operatorname{E}[X^2] = \operatorname{Var}(X) + \big(\operatorname{E}[X]\big)^2$.
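Applying that identity to $X = \hat{\theta} - \theta$, for which $\operatorname{Var}(X) = \operatorname{Var}(\hat{\theta})$ and $\operatorname{E}[X] = \operatorname{Bias}(\hat{\theta},\theta)$, gives the bias-variance decomposition:
\[ \operatorname{MSE}(\hat{\theta}) = \operatorname{E}\big[(\hat{\theta}-\theta)^2\big] = \operatorname{Var}(\hat{\theta}) + \operatorname{Bias}(\hat{\theta},\theta)^2 . \]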
In regression analysis, plotting the data is a more natural way to view its overall trend. The mean of the squared distances from each data point to the fitted regression model can be calculated and reported as the mean squared error. One example of a linear regression fitted with this approach is the least squares method, which evaluates the appropriateness of a linear regression model for a bivariate dataset,[6] but whose limitation is related to the known distribution of the data.
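As a sketch for the simple bivariate case $y \approx a + bx$, minimizing the mean squared error over the coefficients yields the familiar least-squares estimates (the symbols $a$, $b$, $\bar{x}$, $\bar{y}$ are introduced here for illustration):
\[ \hat{b} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} , \qquad \hat{a} = \bar{y} - \hat{b}\,\bar{x} . \]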
The term mean squared error is sometimes used to refer to the unbiased estimate of error variance: the residual sum of squares divided by the number of degrees of freedom.
The denominator is the sample size reduced by the number of model parameters estimated from the same data, (n−p) for p regressors or (n−p−1) if an intercept is used (see errors and residuals in statistics for more details).
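Written out under that convention, with $\hat{y}_i$ the fitted values and the denominator equal to the residual degrees of freedom described above (written $\nu$ here for illustration):
\[ s^2 = \frac{1}{\nu}\sum_{i=1}^{n}\big(y_i - \hat{y}_i\big)^2 , \qquad \nu = n-p \ \text{(or } n-p-1 \text{ if an intercept is used)} , \]
which is an unbiased estimator of the error variance $\sigma^2$ under the standard linear-model assumptions.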
In regression analysis, "mean squared error", often referred to as mean squared prediction error or "out-of-sample mean squared error", can also refer to the mean value of the squared deviations of the predictions from the true values, over an out-of-sample test space, generated by a model estimated over a particular sample space.
This also is a known, computed quantity, and it varies by sample and by out-of-sample test space.
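A minimal sketch of that quantity: for a held-out test set of $q$ points with observed values $y_j$ and model predictions $\hat{y}_j$ (the symbol $q$ and the indexing are chosen here for illustration),
\[ \operatorname{MSPE} = \frac{1}{q}\sum_{j=1}^{q}\big(y_j - \hat{y}_j\big)^2 . \]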
In the context of gradient descent algorithms, it is common to introduce a factor of $\tfrac{1}{2}$ to the MSE so that its derivative takes a simpler form; a value which is technically half the mean of the squared errors may therefore still be called the MSE.
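To see why the factor is convenient, consider the scaled objective (an illustrative form)
\[ J = \frac{1}{2n}\sum_{i=1}^{n}\big(\hat{y}_i - y_i\big)^2 , \qquad \frac{\partial J}{\partial \hat{y}_i} = \frac{1}{n}\big(\hat{y}_i - y_i\big) , \]
so the factor of 2 produced by differentiating the square cancels and the gradient carries no stray constant.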
Suppose we have a random sample of size $n$, $X_1,\dots,X_n$, from a population with mean $\mu$ and variance $\sigma^2$. The usual estimator for $\mu$ is the sample average $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, which has an expected value equal to the true mean $\mu$ (so it is unbiased) and a mean squared error of
\[ \operatorname{MSE}(\bar{X}) = \operatorname{E}\big[(\bar{X}-\mu)^2\big] = \frac{\sigma^2}{n} . \]
The usual estimator for the variance is the corrected (unbiased) sample variance $S^2_{n-1} = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2$. However, other estimators proportional to $S^2_{n-1}$ can be used, and an appropriate choice can always give a lower mean squared error. The optimal scaling depends on the excess kurtosis of the population: for a Gaussian distribution, where the excess kurtosis is zero, the MSE is minimized by dividing the sum of squared deviations by $n+1$ rather than $n-1$. The minimum possible excess kurtosis is $\gamma_2 = -2$,[a] which is achieved by a Bernoulli distribution with p = 1/2 (a coin flip), and the MSE is then minimized by dividing by $n-1+\tfrac{2}{n}$; in every case, slightly scaling down the unbiased estimator yields a lower MSE, a simple example of a shrinkage estimator.
The following table gives several estimators of the true parameters of the population, $\mu$ and $\sigma^2$, for the Gaussian case.
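For an i.i.d. Gaussian sample $X_1,\dots,X_n$ with mean $\mu$ and variance $\sigma^2$, the standard entries are as follows (a sketch; the exact selection of estimators may vary):
- $\hat{\theta} = \bar{X} = \frac{1}{n}\sum_i X_i$, unbiased estimator of $\mu$: $\operatorname{MSE}(\bar{X}) = \dfrac{\sigma^2}{n}$.
- $\hat{\theta} = S^2_{n-1} = \frac{1}{n-1}\sum_i (X_i-\bar{X})^2$, unbiased estimator of $\sigma^2$: $\operatorname{MSE} = \dfrac{2\sigma^4}{n-1}$.
- $\hat{\theta} = S^2_{n} = \frac{1}{n}\sum_i (X_i-\bar{X})^2$, biased estimator of $\sigma^2$: $\operatorname{MSE} = \dfrac{(2n-1)\,\sigma^4}{n^2}$.
- $\hat{\theta} = S^2_{n+1} = \frac{1}{n+1}\sum_i (X_i-\bar{X})^2$, biased estimator of $\sigma^2$ with minimum MSE in this family: $\operatorname{MSE} = \dfrac{2\sigma^4}{n+1}$.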
Both analysis of variance and linear regression techniques estimate the MSE as part of the analysis and use the estimated MSE to determine the statistical significance of the factors or predictors under study.
The goal of experimental design is to construct experiments in such a way that when the observations are analyzed, the MSE is close to zero relative to the magnitude of at least one of the estimated treatment effects.
In one-way analysis of variance, the MSE can be calculated by dividing the sum of squared errors by the error degrees of freedom.
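Concretely, for $k$ groups with $n_i$ observations in group $i$, group means $\bar{X}_{i\cdot}$, and $N = \sum_i n_i$ total observations (notation chosen here for illustration):
\[ \operatorname{MSE} = \frac{\operatorname{SSE}}{N-k} = \frac{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\big(X_{ij}-\bar{X}_{i\cdot}\big)^2}{N-k} , \]
where $N-k$ is the within-group (error) degrees of freedom.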
Carl Friedrich Gauss, who introduced the use of mean squared error, was aware of its arbitrariness and was in agreement with objections to it on these grounds.
The unquestioning use of mean squared error has been criticized by the decision theorist James Berger.
There are, however, some scenarios where mean squared error can serve as a good approximation to a loss function occurring naturally in an application.
[10] Like variance, mean squared error has the disadvantage of heavily weighting outliers.
[11] This is a result of the squaring of each term, which effectively weights large errors more heavily than small ones.
This property, undesirable in many applications, has led researchers to use alternatives such as the mean absolute error, or those based on the median.
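As a small numerical illustration of this outlier sensitivity (numbers chosen for illustration), take four errors of sizes 1, 1, 1 and 10: the mean absolute error is $(1+1+1+10)/4 = 3.25$, while the mean squared error is $(1+1+1+100)/4 = 25.75$, dominated almost entirely by the single large error.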