In statistics, M-estimators are a broad class of extremum estimators for which the objective function is a sample average.[1]
Both non-linear least squares and maximum likelihood estimation are special cases of M-estimators.
For example, a maximum-likelihood estimate is the point where the derivative of the likelihood function with respect to the parameter is zero; thus, a maximum-likelihood estimator is a critical point of the score function.[8]
In many applications, such M-estimators can be thought of as estimating characteristics of the population.
A maximum-likelihood estimator satisfies
\[ \hat\theta = \arg\max_{\theta} \prod_{i=1}^n f(x_i, \theta) \]
or, equivalently,
\[ \hat\theta = \arg\min_{\theta} \sum_{i=1}^n \bigl(-\log f(x_i, \theta)\bigr). \]
Maximum-likelihood estimators have optimal properties in the limit of infinitely many observations under rather general conditions, but may be biased and not the most efficient estimators for finite samples.
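For instance, under a normal density with known variance $\sigma^2$ (a minimal worked case added here for illustration, not drawn from the surrounding text), the objective becomes
\[ \rho(x_i, \theta) = -\log f(x_i, \theta) = \frac{(x_i - \theta)^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2), \]
so minimizing $\sum_{i=1}^n \rho(x_i, \theta)$ is equivalent to minimizing $\sum_{i=1}^n (x_i - \theta)^2$, recovering least squares as a special case.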
In 1964, Peter J. Huber proposed generalizing maximum likelihood estimation to the minimization of
\[ \sum_{i=1}^n \rho(x_i, \theta), \]
where ρ is a function with certain properties (see below).
The solutions are called M-estimators ("M" for "maximum likelihood-type" (Huber, 1981, page 43)); other types of robust estimators include L-estimators, R-estimators and S-estimators.
Maximum likelihood estimators (MLE) are thus a special case of M-estimators.
With suitable rescaling, M-estimators are special cases of extremum estimators (in which more general functions of the observations can be used).
The function ρ, or its derivative, ψ, can be chosen in such a way as to give the estimator desirable properties (in terms of bias and efficiency) when the data are truly from the assumed distribution, and 'not bad' behaviour when the data are generated from a model that is, in some sense, close to the assumed distribution.
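As an illustrative sketch of such a choice (the function names, the use of NumPy, and the tuning constant k are assumptions added here, not part of the text), the widely used Huber loss is quadratic for small residuals and linear for large ones, which bounds the influence of outliers:

    import numpy as np

    def huber_rho(r, k=1.345):
        """Huber loss: quadratic for |r| <= k, linear beyond (illustrative sketch)."""
        r = np.asarray(r, dtype=float)
        return np.where(np.abs(r) <= k, 0.5 * r**2, k * np.abs(r) - 0.5 * k**2)

    def huber_psi(r, k=1.345):
        """Derivative of the Huber loss: the residual clipped to [-k, k]."""
        return np.clip(np.asarray(r, dtype=float), -k, k)

With k = 1.345 the resulting location estimator attains roughly 95% efficiency at the normal distribution while remaining resistant to heavy-tailed contamination.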
Rather than minimizing the objective directly, it is often simpler to differentiate it with respect to θ and solve for a root of the derivative; when this differentiation is possible, the M-estimator is said to be of ψ-type, and otherwise of ρ-type.
An M-estimator of ψ-type is a value $\hat\theta$ (if it exists) that solves the vector equation
\[ \sum_{i=1}^n \psi(x_i, \hat\theta) = 0. \]
For example, for the maximum likelihood estimator, ψ is the score function, $\psi(x, \theta) = \nabla_\theta \log f(x, \theta)$.
Such an estimator is not necessarily an M-estimator of ρ-type, but if ρ has a continuous first derivative with respect to θ, then a necessary condition for a ψ-type M-estimator to coincide with an M-estimator of ρ-type is that $\psi(x, \theta) = \nabla_\theta \rho(x, \theta)$.
The previous definitions can easily be extended to finite samples.
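A minimal numerical sketch of the finite-sample ψ-type formulation (the exponential-density example, the synthetic data, and the SciPy root finder are assumptions added for illustration): for $f(x, \lambda) = \lambda e^{-\lambda x}$ the score is $\psi(x, \lambda) = 1/\lambda - x$, and the root of $\sum_i \psi(x_i, \lambda) = 0$ reproduces the closed-form MLE $\hat\lambda = n / \sum_i x_i$.

    import numpy as np
    from scipy.optimize import brentq

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=2.0, size=500)           # synthetic sample, true rate 0.5

    def psi_sum(lam, data):
        """Sum of score contributions for f(x, lam) = lam * exp(-lam * x)."""
        return np.sum(1.0 / lam - data)

    lam_hat = brentq(psi_sum, 1e-6, 100.0, args=(x,))  # root of the estimating equation
    print(lam_hat, len(x) / x.sum())                   # the two values agree numerically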
Estimators based on redescending ψ functions (see below) have some additional desirable properties, such as complete rejection of gross outliers.
For many choices of ρ or ψ, no closed form solution exists and an iterative approach to computation is required.
However, in most cases an iteratively re-weighted least squares fitting algorithm can be used; this is typically the preferred method.
For some choices of ψ, specifically, redescending functions, the solution may not be unique.
Thus, some care is needed to ensure that good starting points are chosen.
Robust starting points, such as the median as an estimate of location and the median absolute deviation as a univariate estimate of scale, are common.
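A minimal IRLS sketch along these lines, using the median and MAD starting values just described (the Huber ψ, the tuning constant k, and the MAD normalizing factor 1.4826 are assumptions chosen for illustration, not prescribed by the text):

    import numpy as np

    def huber_location_irls(x, k=1.345, tol=1e-8, max_iter=100):
        """Huber M-estimate of location via iteratively re-weighted least squares (sketch)."""
        x = np.asarray(x, dtype=float)
        theta = np.median(x)                           # robust starting point
        scale = 1.4826 * np.median(np.abs(x - theta))  # MAD, scaled for the normal
        for _ in range(max_iter):
            r = (x - theta) / scale
            w = np.ones_like(r)
            big = np.abs(r) > k
            w[big] = k / np.abs(r[big])                # Huber weights psi(r) / r
            theta_new = np.sum(w * x) / np.sum(w)      # weighted least-squares update
            if abs(theta_new - theta) < tol * scale:
                return theta_new
            theta = theta_new
        return theta

On data contaminated by a gross outlier, for example huber_location_irls(np.append(np.random.normal(size=100), 50.0)), the weights shrink the outlier's contribution instead of letting it dominate the fit as the sample mean would.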
In the computation of M-estimators, it is sometimes useful to rewrite the objective function so that the dimension of parameters is reduced; this procedure is called "concentrating" or "profiling" the parameters.
Examples in which concentrating parameters increases computation speed include seemingly unrelated regressions (SUR) models.
Concentrating is not always possible, but when it is, it can facilitate computation to a great degree.[9]
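As a textbook illustration of the idea (a single-equation Gaussian regression rather than the SUR example cited above, added here as a sketch), the error variance can be concentrated out of the negative log-likelihood (dropping additive constants):
\[ \hat\sigma^2(\beta) = \frac{1}{n}\sum_{i=1}^n (y_i - x_i^{\mathsf T}\beta)^2,
\qquad
\min_{\beta, \sigma^2} \sum_{i=1}^n \left[\frac{(y_i - x_i^{\mathsf T}\beta)^2}{2\sigma^2} + \frac{1}{2}\log\sigma^2\right]
= \min_{\beta} \frac{n}{2}\bigl(1 + \log\hat\sigma^2(\beta)\bigr), \]
so the search is over β alone and the dimension of the optimization problem drops by one.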
Despite its computational appeal, concentrating parameters is of limited use in deriving the asymptotic properties of M-estimators.[10]
The presence of the concentrating function W (which expresses the concentrated-out parameters as a function of the retained ones) in each summand of the objective function makes it difficult to apply the law of large numbers and the central limit theorem.
It can be shown that, under suitable regularity conditions, M-estimators are asymptotically normally distributed; as such, Wald-type approaches to constructing confidence intervals and hypothesis tests can be used.
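A sketch of such a Wald-type interval for a scalar Huber location estimate, using the sandwich variance $A^{-1} B A^{-1} / n$ with $A = \mathbb{E}[-\partial\psi/\partial\theta]$ and $B = \mathbb{E}[\psi^2]$ (the helper below, its treatment of the MAD scale as fixed, and the tuning constant are assumptions for illustration; theta_hat would be a previously computed estimate such as the IRLS value above):

    import numpy as np
    from scipy.stats import norm

    def huber_wald_ci(x, theta_hat, k=1.345, level=0.95):
        """Wald-type confidence interval for a Huber location estimate (sketch)."""
        x = np.asarray(x, dtype=float)
        n = x.size
        s = 1.4826 * np.median(np.abs(x - np.median(x)))  # MAD scale, treated as known
        r = (x - theta_hat) / s
        psi = np.clip(r, -k, k)                           # Huber psi of standardized residuals
        dpsi = (np.abs(r) <= k).astype(float)             # derivative of the clipped psi
        A = np.mean(dpsi) / s                             # estimate of E[-d psi / d theta]
        B = np.mean(psi**2)                               # estimate of E[psi^2]
        se = np.sqrt(B / (n * A**2))                      # sandwich standard error
        z = norm.ppf(0.5 + level / 2)
        return theta_hat - z * se, theta_hat + z * se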
The influence function of an M-estimator of ψ-type is proportional to its defining ψ function; a proof of this property of M-estimators can be found in Huber (1981, Section 3.2).
For example, the mean corresponds to $\rho(x, \theta) = (x - \theta)^2$; for the median estimation of $(X_1, \ldots, X_n)$, we can instead define the ρ function as
\[ \rho(x, \theta) = |x - \theta|, \]
which is minimized when θ is the median of the $X_i$.
While this ρ function is not differentiable in θ, the corresponding ψ-type M-estimator, based on a subgradient of the ρ function, can be expressed as
\[ \psi(x, \theta) = \operatorname{sgn}(x - \theta). \]
M-estimators are consistent under various sets of conditions.
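A quick numerical check (synthetic data, added purely for illustration): at the sample median the estimating equation $\sum_i \operatorname{sgn}(x_i - \theta) = 0$ holds exactly when there are no ties.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.standard_cauchy(101)           # heavy-tailed synthetic sample
    theta = np.median(x)
    print(np.sum(np.sign(x - theta)))      # 0: the median solves the psi equation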
A typical set of assumptions is that the class of functions $\{\rho(\cdot, \theta) : \theta \in \Theta\}$ satisfies a uniform law of large numbers and that the optimum of the limiting objective function is well-separated.
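In the minimization form used above, these conditions can be written as follows (a standard formulation, sketched here rather than quoted from a specific source):
\[ \sup_{\theta \in \Theta}\left|\frac{1}{n}\sum_{i=1}^n \rho(x_i, \theta) - M(\theta)\right| \xrightarrow{\;p\;} 0,
\qquad
\inf_{\theta :\, d(\theta, \theta_0) \ge \varepsilon} M(\theta) > M(\theta_0) \quad \text{for every } \varepsilon > 0, \]
where $M(\theta) = \mathbb{E}\,\rho(X, \theta)$ is the population objective and $\theta_0$ its minimizer; together these imply that the M-estimator $\hat\theta_n$ converges in probability to $\theta_0$.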