Bias of an estimator

Bias can also be measured with respect to the median, rather than the mean (expected value), in which case one distinguishes the median-unbiased property from the usual mean-unbiased property.

For example, there is no unbiased estimator for the reciprocal of the parameter of a binomial random variable: for any estimator δ, the expectation $\operatorname{E}_p[\delta(X)] = \sum_{k=0}^{n}\delta(k)\binom{n}{k}p^k(1-p)^{n-k}$ is a polynomial in p, and so remains bounded as p → 0, whereas 1/p does not.

[1] Suppose we have a statistical model, parameterized by a real number θ, giving rise to a probability distribution for observed data, $P_\theta(x) = P(x \mid \theta)$ (where θ is a fixed, unknown constant that is part of this distribution), and then we construct some estimator $\hat\theta$ that maps observed data to values that we hope are close to θ. The bias of $\hat\theta$ relative to θ is defined as

$$\operatorname{Bias}(\hat\theta, \theta) = \operatorname{E}_{x\mid\theta}\bigl[\hat\theta\bigr] - \theta = \operatorname{E}_{x\mid\theta}\bigl[\hat\theta - \theta\bigr],$$

where $\operatorname{E}_{x\mid\theta}$ denotes the expected value over the distribution $P(x \mid \theta)$, i.e. averaging over all possible observations x. The second equation follows since θ is measurable with respect to the conditional distribution $P(x \mid \theta)$.

Concretely, the naive estimator sums the squared deviations from the sample mean and divides by n,
$$S^2 = \frac{1}{n}\sum_{i=1}^{n}\bigl(X_i - \bar X\bigr)^2,$$
which is biased, since $\operatorname{E}\bigl[S^2\bigr] = \frac{n-1}{n}\sigma^2 < \sigma^2$.

Conversely, MSE can be minimized by dividing by a different number (depending on distribution), but this results in a biased estimator.

The corrected sample variance
$$s^2 = \frac{n}{n-1}\,S^2 = \frac{1}{n-1}\sum_{i=1}^{n}\bigl(X_i - \bar X\bigr)^2$$
is unbiased because
$$\operatorname{E}\bigl[s^2\bigr] = \frac{n}{n-1}\operatorname{E}\bigl[S^2\bigr] = \frac{n}{n-1}\cdot\frac{n-1}{n}\,\sigma^2 = \sigma^2,$$
where the transition to the second equality uses the result derived above for the biased estimator.

The factor n/(n − 1) relating the biased (uncorrected) and unbiased estimates of the variance is known as Bessel's correction.
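The effect of Bessel's correction can be seen by simulation. The sketch below is an illustration added here (not from the original text); the mean, variance, sample size, and number of replications are arbitrary choices. It repeatedly draws normal samples and compares the average of the two variance estimates with the true σ².

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n, reps = 5.0, 2.0, 10, 200_000

    samples = rng.normal(mu, sigma, size=(reps, n))
    xbar = samples.mean(axis=1, keepdims=True)
    ss = ((samples - xbar) ** 2).sum(axis=1)   # sum of squared deviations

    biased = ss / n           # divides by n      (uncorrected, biased)
    unbiased = ss / (n - 1)   # divides by n - 1  (Bessel's correction)

    print("true variance        :", sigma**2)       # 4.0
    print("mean of biased  S^2  :", biased.mean())  # ~ (n-1)/n * 4 = 3.6
    print("mean of unbiased s^2 :", unbiased.mean())  # ~ 4.0

On average the uncorrected estimate falls short of σ² by the factor (n − 1)/n, while the corrected estimate does not.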

Suppose X has a Poisson distribution with expectation λ, and that the quantity to be estimated from this single observation is $e^{-2\lambda}$, the square of the probability that X is zero. Since the expectation of an unbiased estimator δ(X) must equal the estimand, i.e.
$$\operatorname{E}\bigl[\delta(X)\bigr] = \sum_{x=0}^{\infty}\delta(x)\,\frac{\lambda^{x}e^{-\lambda}}{x!} = e^{-2\lambda},$$
the only function of the data constituting an unbiased estimator is $\delta(x) = (-1)^{x}$. To see this, note that when decomposing $e^{-\lambda}$ from the above expression for the expectation, the sum that is left is a Taylor series expansion of $e^{-\lambda}$ as well, yielding $e^{-\lambda}\,e^{-\lambda} = e^{-2\lambda}$ (see Characterizations of the exponential function).

By contrast, the (biased) maximum likelihood estimator $e^{-2X}$ is far better. Not only is its value always positive (the unbiased estimator $(-1)^{X}$ alternates between +1 and −1), but it is also more accurate in the sense that its mean squared error,
$$e^{\lambda\,(e^{-4}-1)} - 2e^{\lambda\,(e^{-2}-3)} + e^{-4\lambda},$$
is smaller; compare the unbiased estimator's MSE of
$$1 - e^{-4\lambda}.$$
The MSEs are functions of the true value λ.
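The contrast between the unbiased estimator $(-1)^{X}$ and the biased maximum-likelihood estimator $e^{-2X}$ can be checked numerically. The following sketch is an illustration added here (the values of λ and the replication count are arbitrary); it estimates both MSEs by simulation and compares them with the closed-form expressions above, which follow from the Poisson moment generating function.

    import numpy as np

    rng = np.random.default_rng(1)
    reps = 1_000_000

    for lam in (0.5, 2.0, 5.0):
        x = rng.poisson(lam, size=reps)
        target = np.exp(-2 * lam)

        mse_unbiased = np.mean(((-1.0) ** x - target) ** 2)        # estimator (-1)^X
        mse_mle      = np.mean((np.exp(-2.0 * x) - target) ** 2)   # estimator e^{-2X}

        # closed-form values derived from E[e^{tX}] = exp(lambda*(e^t - 1))
        mse_unbiased_exact = 1 - np.exp(-4 * lam)
        mse_mle_exact = (np.exp(lam * (np.exp(-4) - 1))
                         - 2 * np.exp(lam * (np.exp(-2) - 3))
                         + np.exp(-4 * lam))

        print(lam, mse_unbiased, mse_unbiased_exact, mse_mle, mse_mle_exact)

For every λ the biased estimator's MSE comes out smaller than that of the unique unbiased estimator.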

Consider a case where n tickets numbered from 1 to n are placed in a box and one is selected at random, giving a value X. If n is unknown, then the maximum-likelihood estimator of n is X, even though $\operatorname{E}[X] = (n+1)/2$; we can be certain only that n is at least X and is probably more. In this case, the natural unbiased estimator is 2X − 1, since $\operatorname{E}[2X-1] = 2\cdot\frac{n+1}{2} - 1 = n$.
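A simulation of the ticket example (a sketch added for illustration; the value of n is arbitrary) shows that the maximum-likelihood estimator X underestimates n on average, while 2X − 1 does not.

    import numpy as np

    rng = np.random.default_rng(2)
    n_true, reps = 20, 500_000

    x = rng.integers(1, n_true + 1, size=reps)   # one ticket drawn uniformly from 1..n

    print("E[X]      ~", x.mean())            # ~ (n+1)/2 = 10.5, biased for n = 20
    print("E[2X - 1] ~", (2 * x - 1).mean())  # ~ 20, unbiased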

This requirement seems for most purposes to accomplish as much as the mean-unbiased requirement and has the additional property that it is invariant under one-to-one transformations. Further properties of median-unbiased estimators have been noted by Lehmann, Birnbaum, van der Vaart and Pfanzagl.
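As a concrete illustration of this invariance (a sketch added here, not part of the original text): if V = Σ(Xᵢ − X̄)² for a normal sample, then V/m, where m is the median of the χ²ₙ₋₁ distribution, is a median-unbiased estimator of σ², and because the square root is monotone, √(V/m) is automatically a median-unbiased estimator of σ. The parameters below are arbitrary.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(3)
    mu, sigma, n, reps = 0.0, 3.0, 8, 200_000

    samples = rng.normal(mu, sigma, size=(reps, n))
    v = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    m = chi2.median(n - 1)          # median of the chi-squared(n-1) distribution

    est_var = v / m                 # median-unbiased for sigma^2
    est_sd = np.sqrt(est_var)       # median-unbiased for sigma, by monotonicity

    print("P(est_var <= sigma^2) ~", np.mean(est_var <= sigma**2))  # ~ 0.5
    print("P(est_sd  <= sigma)   ~", np.mean(est_sd <= sigma))      # ~ 0.5
    print("E[est_var] ~", est_var.mean(), "(not", sigma**2, "-> not mean-unbiased)")

No such invariance holds for mean-unbiasedness, as the same output shows.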

There are methods of constructing median-unbiased estimators for probability distributions that have monotone likelihood functions, such as one-parameter exponential families, which ensure that they are optimal (in a sense analogous to the minimum-variance property considered for mean-unbiased estimators).

[11] Any minimum-variance mean-unbiased estimator minimizes the risk (expected loss) with respect to the squared-error loss function (among mean-unbiased estimators), as observed by Gauss.

[12] A minimum-average absolute deviation median-unbiased estimator minimizes the risk with respect to the absolute loss function (among median-unbiased estimators), as observed by Laplace.

For example, the square root of the unbiased estimator of the population variance is not a mean-unbiased estimator of the population standard deviation: the corrected sample standard deviation $s = \sqrt{s^2}$ is biased, systematically underestimating σ (a consequence of Jensen's inequality, since the square root is strictly concave).
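This bias of the corrected sample standard deviation is easy to exhibit by simulation; the sketch below is illustrative only, with arbitrary σ, sample size, and replication count.

    import numpy as np

    rng = np.random.default_rng(4)
    sigma, n, reps = 2.0, 5, 500_000

    samples = rng.normal(0.0, sigma, size=(reps, n))
    s2 = samples.var(axis=1, ddof=1)   # unbiased estimator of sigma^2
    s = np.sqrt(s2)                    # corrected sample standard deviation

    print("E[s^2] ~", s2.mean(), "(unbiased for", sigma**2, ")")
    print("E[s]   ~", s.mean(), "(biased: systematically below", sigma, ")")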

While bias quantifies the average difference to be expected between an estimator and an underlying parameter, an estimator based on a finite sample can additionally be expected to differ from the parameter due to the randomness in the sample.

One measure which is used to try to reflect both types of difference is the mean square error,[2]
$$\operatorname{MSE}(\hat\theta) = \operatorname{E}\bigl[(\hat\theta - \theta)^2\bigr].$$
This can be shown to be equal to the square of the bias, plus the variance:[2]
$$\operatorname{MSE}(\hat\theta) = \bigl(\operatorname{E}[\hat\theta] - \theta\bigr)^2 + \operatorname{Var}(\hat\theta) = \operatorname{Bias}(\hat\theta,\theta)^2 + \operatorname{Var}(\hat\theta).$$
When the parameter is a vector, an analogous decomposition applies:[15]
$$\operatorname{MSE}(\hat\theta) = \operatorname{trace}\bigl(\operatorname{Cov}(\hat\theta)\bigr) + \bigl\lVert\operatorname{Bias}(\hat\theta,\theta)\bigr\rVert^2,$$
where $\operatorname{trace}(\operatorname{Cov}(\hat\theta))$ is the trace (sum of the diagonal elements) of the covariance matrix of the estimator and $\lVert\operatorname{Bias}(\hat\theta,\theta)\rVert^2$ is the squared Euclidean norm of the bias vector.
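The decomposition can be checked numerically. The sketch below (added for illustration, using the biased variance estimator S² from above as the example estimator, with arbitrary parameters) estimates MSE, bias, and variance by simulation and verifies that MSE ≈ bias² + variance.

    import numpy as np

    rng = np.random.default_rng(5)
    sigma, n, reps = 2.0, 10, 500_000
    theta = sigma ** 2

    samples = rng.normal(0.0, sigma, size=(reps, n))
    est = samples.var(axis=1, ddof=0)   # biased estimator S^2 of sigma^2

    mse = np.mean((est - theta) ** 2)
    bias = est.mean() - theta
    var = est.var()

    print("MSE          ~", mse)
    print("bias^2 + var ~", bias**2 + var)   # matches MSE up to simulation noise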

For example,[16] suppose an estimator of the form
$$T^2 = c\sum_{i=1}^{n}\bigl(X_i - \bar X\bigr)^2 = c\,nS^2$$
is sought for the population variance as above, but this time to minimise the MSE:
$$\operatorname{MSE} = \operatorname{E}\bigl[(T^2 - \sigma^2)^2\bigr] = \bigl(\operatorname{E}[T^2 - \sigma^2]\bigr)^2 + \operatorname{Var}(T^2).$$
If the variables X1 ... Xn follow a normal distribution, then nS²/σ² has a chi-squared distribution with n − 1 degrees of freedom, giving:
$$\operatorname{E}\bigl[nS^2\bigr] = (n-1)\sigma^2 \quad\text{and}\quad \operatorname{Var}\bigl(nS^2\bigr) = 2(n-1)\sigma^4,$$
and so
$$\operatorname{MSE} = \bigl(c(n-1) - 1\bigr)^2\sigma^4 + 2c^2(n-1)\sigma^4.$$
With a little algebra it can be confirmed that it is c = 1/(n + 1) which minimises this combined loss function, rather than c = 1/(n − 1), which minimises just the square of the bias.
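The claim that c = 1/(n + 1) minimises the MSE can be verified by simulation. The following sketch is illustrative only (n and σ are arbitrary choices); it compares the empirical MSE of c·Σ(Xᵢ − X̄)² for the three natural choices of c.

    import numpy as np

    rng = np.random.default_rng(6)
    sigma, n, reps = 2.0, 10, 1_000_000
    theta = sigma ** 2

    samples = rng.normal(0.0, sigma, size=(reps, n))
    ss = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

    for label, c in [("1/(n-1)", 1 / (n - 1)), ("1/n", 1 / n), ("1/(n+1)", 1 / (n + 1))]:
        mse = np.mean((c * ss - theta) ** 2)
        print(f"c = {label:8s}  MSE ~ {mse:.4f}")
    # The smallest MSE is obtained with c = 1/(n+1), at the price of some bias.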

More generally it is only in restricted classes of problems that there will be an estimator that minimises the MSE independently of the parameter values.

However, a bias–variance tradeoff is very common: a small increase in bias can often be traded for a larger decrease in variance, resulting in a more desirable estimator overall.

Most Bayesians are rather unconcerned about the unbiasedness (at least in the formal sampling-theory sense above) of their estimates.

For example, Gelman and coauthors (1995) write: "From a Bayesian perspective, the principle of unbiasedness is reasonable in the limit of large samples, but otherwise it is potentially misleading."

Such prior information plays no part in the sampling-theory approach; indeed, any attempt to include it would be considered "bias" away from what is indicated purely by the data.

To the extent that Bayesian calculations include prior information, it is therefore essentially inevitable that their results will not be "unbiased" in sampling theory terms.

For example, consider again the estimation of an unknown population variance σ² of a Normal distribution with unknown mean, where it is desired to optimise c in the expected loss function
$$\operatorname{ExpectedLoss} = \operatorname{E}\bigl[(c\,nS^2 - \sigma^2)^2\bigr] = \operatorname{E}\bigl[\sigma^4\bigl(c\,nS^2/\sigma^2 - 1\bigr)^2\bigr].$$
A standard choice of uninformative prior for this problem is the Jeffreys prior, $p(\sigma^2) \propto 1/\sigma^2$.

One consequence of adopting this prior is that S²/σ² remains a pivotal quantity, i.e. the probability distribution of S²/σ² depends only on S²/σ², independent of the value of S² or σ²:
$$p\bigl(S^2/\sigma^2 \mid S^2\bigr) = p\bigl(S^2/\sigma^2 \mid \sigma^2\bigr) = g\bigl(S^2/\sigma^2\bigr).$$
However, while
$$\operatorname{E}_{S^2\mid\sigma^2}\bigl[\sigma^4\bigl(c\,nS^2/\sigma^2 - 1\bigr)^2\bigr] = \sigma^4\,\operatorname{E}_{S^2\mid\sigma^2}\bigl[\bigl(c\,nS^2/\sigma^2 - 1\bigr)^2\bigr],$$
in contrast
$$\operatorname{E}_{\sigma^2\mid S^2}\bigl[\sigma^4\bigl(c\,nS^2/\sigma^2 - 1\bigr)^2\bigr] \neq \sigma^4\,\operatorname{E}_{\sigma^2\mid S^2}\bigl[\bigl(c\,nS^2/\sigma^2 - 1\bigr)^2\bigr]$$
— when the expectation is taken over the probability distribution of σ² given S², as it is in the Bayesian case, rather than of S² given σ², one can no longer take σ⁴ as a constant and factor it out.

The consequence of this is that, compared to the sampling-theory calculation, the Bayesian calculation puts more weight on larger values of σ2, properly taking into account (as the sampling-theory calculation cannot) that under this squared-loss function the consequence of underestimating large values of σ2 is more costly in squared-loss terms than that of overestimating small values of σ2.
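A numerical sketch of this effect (added here as an illustration, not part of the original text): under the Jeffreys prior, the distribution of nS²/σ² given S² is chi-squared with n − 1 degrees of freedom, so minimising the posterior expected squared loss over c amounts to setting c equal to the posterior mean of σ²/(nS²), i.e. E[1/Q] with Q ~ χ²ₙ₋₁. The simulation below estimates this optimal c and compares it with the sampling-theory value 1/(n + 1); the Bayesian value is larger, reflecting the extra weight placed on large values of σ². The sample size n and replication count are arbitrary.

    import numpy as np

    rng = np.random.default_rng(7)
    n, reps = 10, 2_000_000

    # Under the Jeffreys prior, Q = n*S^2/sigma^2 given S^2 is chi-squared(n-1),
    # and the c minimising the posterior expected squared loss is E[1/Q].
    q = rng.chisquare(n - 1, size=reps)
    c_bayes = np.mean(1.0 / q)

    print("Bayesian optimal c        ~", c_bayes)      # ~ 1/(n-3), here about 0.143
    print("sampling-theory optimal c =", 1 / (n + 1))  # 1/11, about 0.091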

[Figure] Sampling distributions of two alternative estimators for a parameter $\beta_0$: although $\hat\beta_1$ is unbiased, it is clearly inferior to the biased $\hat\beta_2$.

Ridge regression is one example of a technique where allowing a little bias may lead to a considerable reduction in variance, and more reliable estimates overall.
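A minimal sketch of this effect (illustrative only; the design, coefficients, noise level, and penalty λ below are arbitrary choices and the penalty is not tuned): ordinary least squares is unbiased but can have large variance when the predictors are strongly correlated, while ridge regression shrinks the coefficients, accepting some bias in exchange for a much lower variance and a smaller overall mean squared error.

    import numpy as np

    rng = np.random.default_rng(8)
    n, p, reps, lam = 30, 5, 5000, 10.0
    beta = np.array([1.0, 0.5, -0.5, 0.25, 0.0])

    # Strongly correlated design matrix, held fixed across replications
    base = rng.standard_normal((n, 1))
    X = base + 0.1 * rng.standard_normal((n, p))

    ols_est, ridge_est = [], []
    for _ in range(reps):
        y = X @ beta + rng.standard_normal(n)
        ols_est.append(np.linalg.solve(X.T @ X, X.T @ y))
        ridge_est.append(np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y))

    ols_est, ridge_est = np.array(ols_est), np.array(ridge_est)

    for name, est in [("OLS", ols_est), ("ridge", ridge_est)]:
        bias2 = np.sum((est.mean(axis=0) - beta) ** 2)
        var = np.sum(est.var(axis=0))
        print(f"{name:5s} bias^2 = {bias2:.3f}  variance = {var:.3f}  MSE = {bias2 + var:.3f}")

In this collinear setting the ridge estimates carry noticeable bias but far less variance, and their total MSE is well below that of the unbiased least-squares estimates.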