Errors-in-variables model

An errors-in-variables model (also called a measurement error model) is a regression model that accounts for measurement errors in the independent variables. In contrast, standard regression models assume that those regressors have been measured exactly, or observed without error; as such, those models account only for errors in the dependent variables, or responses.

For simple linear regression the effect is an underestimate of the coefficient, known as the attenuation bias.

A practical application is the standard school science experiment for Hooke's Law, in which one estimates the relationship between the weight added to a spring and the amount by which the spring stretches.

Consider a simple linear model in which the true regressor x* is observed only with error:
$$ y_t = \alpha + \beta x_t^{*} + \varepsilon_t, \qquad x_t = x_t^{*} + \eta_t, $$
where the measurement error ηt is independent of the true value x*t. If the yt's are simply regressed on the observed xt's (see simple linear regression), then the estimator for the slope coefficient is
$$ \hat\beta_x = \frac{\tfrac{1}{T}\sum_{t=1}^{T}(x_t-\bar x)(y_t-\bar y)}{\tfrac{1}{T}\sum_{t=1}^{T}(x_t-\bar x)^{2}}, $$
which converges as the sample size T increases without bound:
$$ \hat\beta_x \;\xrightarrow{p}\; \frac{\operatorname{Cov}[x_t,\,y_t]}{\operatorname{Var}[x_t]} \;=\; \frac{\beta\,\sigma_{x^*}^{2}}{\sigma_{x^*}^{2}+\sigma_{\eta}^{2}} \;=\; \frac{\beta}{1+\sigma_{\eta}^{2}/\sigma_{x^*}^{2}}. $$

The variances are non-negative, so that in the limit the estimated slope is smaller in magnitude than the true β, an effect which statisticians call attenuation or regression dilution. This follows directly from the result quoted immediately above, since the attenuation factor 1/(1 + σ²η/σ²x*) is strictly less than one whenever the measurement-error variance is positive.
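To see the attenuation numerically, here is a minimal simulation sketch of the model above; the parameter values and the use of NumPy are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100_000
alpha, beta = 1.0, 2.0
sigma_star, sigma_eta, sigma_eps = 1.0, 0.5, 0.3    # illustrative values

x_star = rng.normal(0.0, sigma_star, T)             # latent "true" regressor
x = x_star + rng.normal(0.0, sigma_eta, T)          # observed with measurement error
y = alpha + beta * x_star + rng.normal(0.0, sigma_eps, T)

beta_hat = np.polyfit(x, y, 1)[0]                   # OLS slope of y on the observed x
beta_plim = beta / (1.0 + sigma_eta**2 / sigma_star**2)   # attenuated probability limit

print(f"true beta       = {beta}")
print(f"OLS estimate    = {beta_hat:.3f}")   # close to 1.6, not 2.0
print(f"attenuated plim = {beta_plim:.3f}")  # 1.6
```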

It can be argued that almost all existing data sets contain errors of different nature and magnitude, so that attenuation bias is extremely frequent (although in multivariate regression the direction of bias is ambiguous[5]).

Jerry Hausman sees this as an iron law of econometrics: "The magnitude of the estimate is usually smaller than expected."

If yt are the observed responses and xt are observed values of the regressors, then it is assumed there exist some latent variables y*t and x*t which follow the model's "true" functional relationship, and such that the observed quantities are their noisy observations:
$$ y_t = y_t^{*} + \varepsilon_t, \qquad x_t = x_t^{*} + \eta_t. $$
Some regressors, for example the constant corresponding to the intercept, may be assumed to be measured without error.

Depending on the specification these error-free regressors may or may not be treated separately; in the latter case it is simply assumed that the corresponding entries in the variance matrix of the measurement errors ηt are zero.

Unlike standard least squares regression (OLS), extending errors-in-variables regression (EiV) from the simple to the multivariable case is not straightforward, unless one treats all variables in the same way, i.e. assumes equal reliability.

The simple linear errors-in-variables model was already presented in the "motivation" section:
$$ y_t = \alpha + \beta x_t^{*} + \varepsilon_t, \qquad x_t = x_t^{*} + \eta_t, $$
where all variables are scalar.

The "true" regressor x* is treated as a random variable (structural model), independent of the measurement error η (classic assumption).

This model is identifiable provided the latent regressor x* is not normally distributed.[11] That is, the parameters α, β can be consistently estimated from the data set (xt, yt) without any additional information.

Before this identifiability result was established, statisticians attempted to apply the maximum likelihood technique by assuming that all variables are normal, and then concluded that the model is not identified.

The suggested remedy was to assume that some of the parameters of the model are known or can be estimated from the outside source.

In the case when the third central moment of the latent regressor x* is non-zero, the formula reduces to
$$ \hat\beta = \frac{\tfrac{1}{T}\sum_{t=1}^{T}(x_t-\bar x)(y_t-\bar y)^{2}}{\tfrac{1}{T}\sum_{t=1}^{T}(x_t-\bar x)^{2}(y_t-\bar y)}. $$

The multivariable model looks exactly like the simple linear model, only this time β, ηt, xt and x*t are k×1 vectors:
$$ y_t = \alpha + \beta' x_t^{*} + \varepsilon_t, \qquad x_t = x_t^{*} + \eta_t. $$
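Returning to the scalar model, a minimal sketch of the third-moment estimator above, assuming a skewed latent regressor (here exponential) so that its third central moment is non-zero; all numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200_000
alpha, beta = 0.5, 2.0

x_star = rng.exponential(1.0, T)               # skewed latent regressor, third moment != 0
x = x_star + rng.normal(0.0, 0.5, T)           # noisy observation
y = alpha + beta * x_star + rng.normal(0.0, 0.3, T)

dx, dy = x - x.mean(), y - y.mean()
beta_third = np.mean(dx * dy**2) / np.mean(dx**2 * dy)   # ratio of third-order moments
beta_ols = np.polyfit(x, y, 1)[0]                        # attenuated OLS slope

print(f"OLS slope          = {beta_ols:.3f}")    # biased toward zero
print(f"third-moment slope = {beta_third:.3f}")  # close to the true beta = 2
```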

In the case when εt, ηt1,..., ηtk are mutually independent, the parameter β is not identified if and only if, in addition to the conditions above, some of the errors can be written as the sum of two independent variables, one of which is normal.

Here ∘ designates the Hadamard product of matrices, and the variables xt, yt have been preliminarily de-meaned.

The authors of the method suggest using Fuller's modified IV estimator.

A generic non-linear measurement error model takes the form
$$ y_t = g(x_t^{*}) + \varepsilon_t, \qquad x_t = x_t^{*} + \eta_t. $$
Here the function g can be either parametric or non-parametric.

However, in the case of scalar x* the model is identified unless the function g is of the "log-exponential" form[20] and the latent regressor x* has a density of a corresponding special form, whose constants A, B, C, D, E, F may depend on the parameters a, b, c, d of g.

Despite this optimistic result, as of now no methods exist for estimating non-linear errors-in-variables models without any extraneous information.

However there are several techniques which make use of some additional data: either the instrumental variables, or repeated observations.

Simulated moments can be computed using the importance sampling algorithm: first we generate several random variables {vts ~ ϕ, s = 1,…,S, t = 1,…,T} from the standard normal distribution, and then the moments at the t-th observation are computed by averaging the moment function over these simulated draws; here θ = (β, σ, γ), A is just some function of the instrumental variables z, and H is a two-component vector of moments.

In the repeated observations approach, two (or maybe more) repeated observations of the regressor x* are available. Both observations contain their own measurement errors, however those errors are required to be independent:
$$ x_{1t} = x_t^{*} + \eta_{1t}, \qquad x_{2t} = x_t^{*} + \eta_{2t}. $$

Variables η1, η2 need not be identically distributed (although if they are, the efficiency of the estimator can be slightly improved).
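As an aside, the value of a second error-ridden measurement is easiest to see in the linear special case, where it can simply be used as an instrument for the first measurement. The following sketch, with purely illustrative parameters, is not one of the non-linear estimators discussed here; it only shows why repeated observations carry the information needed to undo the attenuation:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 200_000
alpha, beta = 1.0, 2.0

x_star = rng.normal(0.0, 1.0, T)
x1 = x_star + rng.normal(0.0, 0.6, T)     # first noisy measurement
x2 = x_star + rng.normal(0.0, 0.8, T)     # second, independent measurement
y = alpha + beta * x_star + rng.normal(0.0, 0.3, T)

beta_ols = np.polyfit(x1, y, 1)[0]                      # attenuated
beta_iv = np.cov(x2, y)[0, 1] / np.cov(x2, x1)[0, 1]    # x2 used as instrument for x1

print(f"OLS slope = {beta_ols:.3f}, IV slope = {beta_iv:.3f} (true beta = {beta})")
```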

With only these two observations it is possible to consistently estimate the density function of x* using Kotlarski's deconvolution technique.[22]
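A sketch of this idea, assuming two repeated measurements with independent errors and a zero-mean error on the first measurement (all distributions below are illustrative): Kotlarski's identity expresses the log-derivative of the characteristic function of x* through moments of the observed pair (x1, x2), which can be estimated empirically:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 100_000
x_star = rng.normal(0.0, 1.0, T)            # latent variable (standard normal here)
x1 = x_star + rng.laplace(0.0, 0.4, T)      # first measurement, zero-mean error
x2 = x_star + rng.normal(0.0, 0.5, T)       # second measurement, independent error

t_grid = np.linspace(0.0, 2.0, 201)
# Kotlarski: d/dt log phi_{x*}(t) = E[i*x1*exp(i*t*x2)] / E[exp(i*t*x2)]
log_deriv = np.array([
    (1j * x1 * np.exp(1j * t * x2)).mean() / np.exp(1j * t * x2).mean()
    for t in t_grid
])
# cumulative trapezoidal integration gives log phi_{x*}(t), hence phi_{x*}(t)
log_phi = np.concatenate(([0.0], np.cumsum(
    0.5 * (log_deriv[1:] + log_deriv[:-1]) * np.diff(t_grid))))
phi_hat = np.exp(log_phi)

# compare with the true characteristic function of x*, exp(-t^2/2)
print(np.max(np.abs(phi_hat - np.exp(-t_grid**2 / 2))))   # small for large T
```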

The estimator of interest involves an integral, which it would be possible to compute if we knew the conditional density function ƒx*|x.

Assuming for simplicity that η1, η2 are identically distributed, this conditional density can be computed via Bayes' rule from the density of x* and the density of the measurement errors (with slight abuse of notation, xj here denotes the j-th component of a vector).

All of these densities can be estimated by inverting the empirical characteristic functions.

In order to invert these characteristic functions one has to apply the inverse Fourier transform, with a trimming parameter C needed to ensure numerical stability.
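A sketch of this trimmed-inversion step, applied for simplicity to the empirical characteristic function of a simulated sample rather than to the deconvolved one from the previous step; the trimming parameter C and the grids are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
T = 50_000
x = rng.normal(0.0, 1.0, T)                  # simulated sample

C = 3.0                                      # trimming parameter
t_grid = np.linspace(-C, C, 601)
dt = t_grid[1] - t_grid[0]
phi_hat = np.array([np.exp(1j * t * x).mean() for t in t_grid])   # empirical c.f.

def density_estimate(s):
    # f(s) = (1/2pi) * integral_{-C}^{C} exp(-i*t*s) * phi_hat(t) dt  (Riemann sum)
    return (np.exp(-1j * t_grid * s) * phi_hat).sum().real * dt / (2 * np.pi)

grid = np.linspace(-2.0, 2.0, 5)
est = np.array([density_estimate(s) for s in grid])
true = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)
print(np.round(est, 3))    # should roughly match the true N(0,1) density
print(np.round(true, 3))
```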

Illustration of regression dilution (or attenuation bias) by a range of regression estimates in errors-in-variables models. Two regression lines (red) bound the range of linear regression possibilities. The shallow slope is obtained when the independent variable (or predictor) is on the x-axis. The steeper slope is obtained when the independent variable is on the y-axis. By convention, with the independent variable on the x-axis, the shallower slope is obtained. Green reference lines are averages within arbitrary bins along each axis. Note that the steeper green and red regression estimates are more consistent with smaller errors in the y-axis variable.