James–Stein estimator

The James–Stein estimator is a biased estimator of the mean vector of (possibly correlated) Gaussian distributed random variables, which dominates the ordinary (least squares) estimator in total mean squared error when three or more means are estimated simultaneously. It arose sequentially in two main published papers.

The earlier version of the estimator was developed in 1956,[1] when Charles Stein reached a relatively shocking conclusion that while the then-usual estimate of the mean, the sample mean, is admissible when the number of means $m \le 2$, it is inadmissible when $m \ge 3$.

Stein proposed a possible improvement to the estimator that shrinks the sample means $\bar{x}_i$ towards a more central mean vector $\boldsymbol\nu$

(which can be chosen a priori or commonly as the "average of averages" of the sample means, given all samples share the same size).
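As a small illustration of the "average of averages" choice (same notation as above, equal sample sizes assumed), each component of $\boldsymbol\nu$ is set to the grand mean of the $m$ sample means:

$$ \nu_i = \bar{\bar{x}} = \frac{1}{m}\sum_{j=1}^{m} \bar{x}_j \quad \text{for } i = 1, \dots, m. $$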

This observation is commonly referred to as Stein's example or paradox.

In 1961, Willard James and Charles Stein simplified the original process.

Suppose $\boldsymbol\theta$ is an unknown parameter vector of length $m$, and let $\mathbf{y}$ be a vector of observations of $\boldsymbol\theta$, so that $\mathbf{y} \sim N_m(\boldsymbol\theta, \sigma^2 I)$ with known variance $\sigma^2$. In real-world applications, this is a common situation in which a set of parameters is sampled, and the samples are corrupted by independent Gaussian noise.

Since this noise has a mean of zero, it may be reasonable to use the samples themselves as an estimate of the parameters; this is the least squares (maximum likelihood) estimator, $\hat{\boldsymbol\theta}_{LS} = \mathbf{y}$.

Stein demonstrated that, in terms of mean squared error $\operatorname{E}\bigl[\lVert \boldsymbol\theta - \hat{\boldsymbol\theta} \rVert^2\bigr]$, the least squares estimator $\hat{\boldsymbol\theta}_{LS}$ is sub-optimal to shrinkage-based estimators such as the James–Stein estimator $\hat{\boldsymbol\theta}_{JS}$.[1]
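For reference, a one-line computation under the setting above (not stated explicitly in the text) shows that the least squares estimator has constant risk:

$$ \operatorname{E}\bigl[\lVert \hat{\boldsymbol\theta}_{LS} - \boldsymbol\theta \rVert^2\bigr] = \operatorname{E}\bigl[\lVert \mathbf{y} - \boldsymbol\theta \rVert^2\bigr] = m\sigma^2 \quad \text{for every } \boldsymbol\theta. $$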

The paradoxical result, that there is a (possibly) better and never any worse estimate of $\boldsymbol\theta$ in mean squared error as compared with the sample mean, became known as Stein's phenomenon.
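For known $\sigma^2$, and with shrinkage towards the origin ($\boldsymbol\nu = \mathbf{0}$), the James–Stein estimator can be written as

$$ \hat{\boldsymbol\theta}_{JS} = \left( 1 - \frac{(m-2)\sigma^2}{\lVert \mathbf{y} \rVert^2} \right) \mathbf{y}. $$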

James and Stein showed that this estimator dominates $\hat{\boldsymbol\theta}_{LS}$ for any $m \ge 3$, meaning that the James–Stein estimator always achieves lower mean squared error (MSE) than the maximum likelihood estimator.[2][4]

By definition, this makes the least squares estimator inadmissible when $m \ge 3$.

Let $\boldsymbol\nu$ be an arbitrary fixed vector of dimension $m$; then there exists an estimator of the James–Stein type that shrinks towards $\boldsymbol\nu$ instead of towards the origin.
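In the notation above, this estimator is

$$ \hat{\boldsymbol\theta}_{JS}^{\boldsymbol\nu} = \boldsymbol\nu + \left( 1 - \frac{(m-2)\sigma^2}{\lVert \mathbf{y} - \boldsymbol\nu \rVert^2} \right) (\mathbf{y} - \boldsymbol\nu). $$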

A natural question to ask is whether the improvement over the usual estimator is independent of the choice of ν.

The answer is no: the improvement is small if $\lVert \boldsymbol\theta - \boldsymbol\nu \rVert$ is large. Thus, to get a very great improvement, some knowledge of the location of $\boldsymbol\theta$ is necessary.
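This dependence can be made quantitative. A standard computation (sketched here under the known-variance setting above, not spelled out in the text) gives the risk of the shrinkage estimator as

$$ \operatorname{E}\bigl[\lVert \hat{\boldsymbol\theta}_{JS}^{\boldsymbol\nu} - \boldsymbol\theta \rVert^2\bigr] = m\sigma^2 - (m-2)^2 \sigma^4 \, \operatorname{E}\!\left[ \frac{1}{\lVert \mathbf{y} - \boldsymbol\nu \rVert^2} \right], $$

which is always below the constant risk $m\sigma^2$ of the least squares estimator, equals $2\sigma^2$ when $\boldsymbol\theta = \boldsymbol\nu$, and approaches $m\sigma^2$ as $\lVert \boldsymbol\theta - \boldsymbol\nu \rVert \to \infty$.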

Of course, this is the quantity we are trying to estimate, so we do not have this knowledge a priori.

This can be considered a disadvantage of the estimator: the choice is not objective as it may depend on the beliefs of the researcher.

Nonetheless, James and Stein's result is that any finite guess ν improves the expected MSE over the maximum-likelihood estimator, which is tantamount to using an infinite ν, surely a poor guess.

Seeing the James–Stein estimator as an empirical Bayes method gives some intuition to this result: one assumes that $\boldsymbol\theta$ itself is a random variable with prior distribution $N(\mathbf{0}, A \cdot I)$, where the prior variance $A$ is estimated from the observed data itself.
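A brief sketch of the argument under this prior: the Bayes estimate of $\boldsymbol\theta$ given $\mathbf{y}$ is the posterior mean, and the unknown shrinkage weight is replaced by an unbiased estimate computed from the data,

$$ \operatorname{E}[\boldsymbol\theta \mid \mathbf{y}] = \left( 1 - \frac{\sigma^2}{A + \sigma^2} \right) \mathbf{y}, \qquad \operatorname{E}\!\left[ \frac{(m-2)\sigma^2}{\lVert \mathbf{y} \rVert^2} \right] = \frac{\sigma^2}{A + \sigma^2}, $$

since marginally $\mathbf{y} \sim N_m(\mathbf{0}, (A + \sigma^2) I)$, so that $\lVert \mathbf{y} \rVert^2 / (A + \sigma^2) \sim \chi^2_m$ and $\operatorname{E}[1/\chi^2_m] = 1/(m-2)$. Substituting the estimate for the unknown weight recovers the James–Stein estimator.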

A quirky example would be estimating the speed of light, tea consumption in Taiwan, and hog weight in Montana, all together.

The James–Stein estimator always improves upon the total MSE, i.e., the sum of the expected squared errors of each component.

Therefore, the total MSE in measuring light speed, tea consumption, and hog weight would improve by using the James–Stein estimator.

However, any particular component (such as the speed of light) would improve for some parameter values, and deteriorate for others.

The conclusion from this hypothetical example is that measurements should be combined if one is interested in minimizing their total MSE.
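As a minimal numerical sketch of this point (not part of the original discussion; it assumes NumPy, a made-up three-component parameter vector standing in for the three unrelated quantities, and a common known noise variance), one can compare the total squared error of the raw measurements with that of the James–Stein estimate shrinking towards the origin:

    import numpy as np

    # Illustrative Monte Carlo sketch: compare the total squared error of the raw
    # observations with that of the James-Stein estimate shrinking towards the
    # origin. The parameter values, noise level and trial count are made up.
    rng = np.random.default_rng(0)

    theta = np.array([3.0, -1.0, 2.5])   # hypothetical "true" parameters (m = 3)
    sigma = 1.0                          # known noise standard deviation
    m = theta.size                       # dimension; m >= 3 is required
    n_trials = 100_000

    sse_ls = sse_js = 0.0
    for _ in range(n_trials):
        y = theta + sigma * rng.standard_normal(m)        # y ~ N(theta, sigma^2 I)
        shrink = 1.0 - (m - 2) * sigma**2 / np.sum(y**2)  # James-Stein multiplier
        sse_ls += np.sum((y - theta) ** 2)                # least squares error
        sse_js += np.sum((shrink * y - theta) ** 2)       # James-Stein error

    print("total MSE, least squares:", sse_ls / n_trials)  # close to m * sigma^2
    print("total MSE, James-Stein  :", sse_js / n_trials)  # smaller on average

For a $\boldsymbol\theta$ far from the shrinkage target, the gain is modest, consistent with the earlier remark that large improvements require $\boldsymbol\nu$ to be close to $\boldsymbol\theta$.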

The James–Stein estimator has also found use in fundamental quantum theory, where the estimator has been used to improve the theoretical bounds of the entropic uncertainty principle for more than three measurements.[6]

An intuitive derivation and interpretation is given by the Galtonian perspective.

The basic James–Stein estimator has the peculiar property that for small values of $\lVert \mathbf{y} - \boldsymbol\nu \rVert$ the multiplier on $\mathbf{y} - \boldsymbol\nu$ is negative. This can be easily remedied by replacing this multiplier by zero when it is negative, which yields the positive-part James–Stein estimator.
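Written out in the notation above, the positive-part estimator is

$$ \hat{\boldsymbol\theta}_{JS+} = \boldsymbol\nu + \left( 1 - \frac{(m-2)\sigma^2}{\lVert \mathbf{y} - \boldsymbol\nu \rVert^2} \right)^{\!+} (\mathbf{y} - \boldsymbol\nu), $$

where $(a)^+ = \max(a, 0)$ denotes the positive part.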

The positive-part estimator dominates the basic James–Stein estimator, yet it is itself inadmissible.[4] This follows from a more general result which requires admissible estimators to be smooth.

The James–Stein estimator may seem at first sight to be a result of some peculiarity of the problem setting. In fact, the estimator exemplifies a very wide-ranging effect: the ordinary least squares estimator is often inadmissible for simultaneous estimation of several parameters.[citation needed]

This effect has been called Stein's phenomenon, and has been demonstrated for several different problem settings, some of which are briefly outlined below.

Figure: MSE (risk $R$) of the least squares estimator (ML) versus the James–Stein estimator (JS). The James–Stein estimator gives its best estimate when the norm of the actual parameter vector $\boldsymbol\theta$ is near zero.