Regression dilution

Consider fitting a straight line for the relationship of an outcome variable y to a predictor variable x, and estimating the slope of the line.

Statistical variability, measurement error or random noise in the y variable causes uncertainty in the estimated slope, but not bias: on average, the procedure calculates the right slope.

In contrast, variability, measurement error or random noise in the x variable biases the estimated slope as well as making it imprecise: the greater the variance in the x measurement, the closer the estimated slope approaches zero instead of the true value.
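
A small simulation illustrates the effect. This is only a sketch; the true slope, the noise levels and the variable names (w for the error-prone measurement of x) are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_slope = 2.0

x = rng.normal(0.0, 1.0, n)                    # true predictor values
y = true_slope * x + rng.normal(0.0, 1.0, n)   # noise in y only
w = x + rng.normal(0.0, 1.0, n)                # x observed with measurement error

slope_y_on_x = np.polyfit(x, y, 1)[0]   # ordinary least-squares slope of y on x
slope_y_on_w = np.polyfit(w, y, 1)[0]   # ordinary least-squares slope of y on w

print(slope_y_on_x)  # close to 2.0: noise in y leaves the slope unbiased
print(slope_y_on_w)  # close to 1.0: noise in x drags the slope towards zero
```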

Recall that linear regression is not symmetric: the line of best fit for predicting y from x (the usual linear regression) is not the same as the line of best fit for predicting x from y.[2]

This bias, which arises when the x variable is measured with error, can be corrected using total least squares[3] and errors-in-variables models in general.
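
As an illustration of the total least squares approach, the sketch below fits the line by orthogonal regression, which assumes the measurement errors in the two variables have equal variance (the simplest errors-in-variables setting; the function name is ours):

```python
import numpy as np

def tls_slope(w, y):
    """Total least squares (orthogonal regression) slope of y on w,
    assuming the measurement errors in w and y have equal variance."""
    A = np.column_stack([w - w.mean(), y - y.mean()])
    # The fitted line runs along the first principal direction of the
    # centred data, i.e. the right singular vector with the largest
    # singular value.
    _, _, vt = np.linalg.svd(A, full_matrices=False)
    direction = vt[0]
    return direction[1] / direction[0]
```

Applied to data such as the simulation above, where the error variances in w and y happen to be equal, this recovers a slope close to the true value rather than the attenuated one.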

In many studies the x variable itself arises randomly: for example, in a medical study patients are recruited as a sample from a population, and their characteristics such as blood pressure may be viewed as arising from a random sample.

Under certain assumptions there is a known ratio between the true slope and the expected estimated slope; Frost and Thompson (2000) review several methods for estimating this ratio and hence correcting the estimated slope.[4] The term regression dilution ratio, although not defined in quite the same way by all authors, is used for this general approach, in which the usual linear regression is fitted, and then a correction applied.
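
In its simplest form the correction divides the naive slope by an estimate of the dilution ratio. The sketch below assumes the measurement-error variance of the predictor is known or has been estimated separately (for example from repeated measurements, as discussed below); the function name is ours:

```python
import numpy as np

def corrected_slope(w, y, error_variance):
    """Correct an ordinary least-squares slope for regression dilution.

    w: predictor measured with error; y: outcome;
    error_variance: variance of the measurement error in w.
    """
    naive_slope = np.polyfit(w, y, 1)[0]
    # Regression dilution ratio: the share of the variance of w that is
    # genuine variation in the true predictor rather than measurement error.
    dilution_ratio = (np.var(w) - error_variance) / np.var(w)
    return naive_slope / dilution_ratio
```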

The reply to Frost & Thompson by Longford (2001) refers the reader to other methods, expanding the regression model to acknowledge the variability in the x variable, so that no bias arises.[5]

Fuller (1987) is one of the standard references for assessing and correcting for regression dilution.[6]

Hughes (1993) shows that the regression dilution ratio methods apply approximately in survival models.[7]

Rosner (1992) shows that the ratio methods apply approximately to logistic regression models.[8]

Carroll et al. (1995) give more detail on regression dilution in nonlinear models, presenting the regression dilution ratio methods as the simplest case of regression calibration methods, in which additional covariates may also be incorporated.

Estimating the correction requires some estimate of the variability of the x variable; this will require repeated measurements of the x variable in the same individuals, either in a sub-study of the main data set, or in a separate data set.
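
One common sketch of such a calculation, assuming two replicate measurements per individual with independent errors (the array names are illustrative):

```python
import numpy as np

def error_variance_from_replicates(w1, w2):
    """Estimate the measurement-error variance from paired replicate
    measurements: the difference w1 - w2 has variance twice the error
    variance when the two errors are independent."""
    return np.var(w1 - w2) / 2.0
```

The result could then be passed as error_variance to a correction such as corrected_slope above.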

The case of multiple predictor variables subject to variability (possibly correlated) has been well-studied for linear regression, and for some non-linear regression models.[7]

Charles Spearman developed in 1904 a procedure for correcting correlations for regression dilution,[10] i.e., to "rid a correlation coefficient from the weakening effect of measurement error".[12]

The correction assures that the Pearson correlation coefficient across data units (for example, people) between two sets of variables is estimated in a manner that accounts for error contained within the measurement of those variables.

Let $\beta$ and $\theta$ be the true values of two attributes of some person or statistical unit, and let $\hat\beta = \beta + \epsilon_\beta$ and $\hat\theta = \theta + \epsilon_\theta$ be their estimates, where $\epsilon_\beta$ and $\epsilon_\theta$ are measurement errors assumed independent of the true values and of each other. The correlation between the two sets of estimates is then

\[
\operatorname{corr}(\hat\beta, \hat\theta)
= \frac{\operatorname{cov}(\beta, \theta)}{\sqrt{\bigl(\operatorname{var}[\beta] + \operatorname{var}[\epsilon_\beta]\bigr)\bigl(\operatorname{var}[\theta] + \operatorname{var}[\epsilon_\theta]\bigr)}}
= \rho \sqrt{R_\beta R_\theta},
\]

where $\rho$ is the correlation between the true attribute values and

\[
R_\beta = \frac{\operatorname{var}[\beta]}{\operatorname{var}[\hat\beta]}
\]

is the separation index of the estimates of $\beta$, which is analogous to Cronbach's alpha; that is, in terms of classical test theory, $R_\beta$ plays the role of a reliability coefficient. The disattenuated estimate of the correlation between the true attributes is therefore

\[
\rho = \frac{\operatorname{corr}(\hat\beta, \hat\theta)}{\sqrt{R_\beta R_\theta}}.
\]

Equivalently, given two variables $X'$ and $Y'$ measured with error as $X$ and $Y$, with observed correlation $r_{xy}$ and known reliabilities $r_{xx}$ and $r_{yy}$, the estimated correlation between $X'$ and $Y'$ corrected for attenuation is

\[
r_{x'y'} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}.
\]

How well the variables are measured affects the correlation of X and Y.

The correction for attenuation tells one what the estimated correlation is expected to be if one could measure X′ and Y′ with perfect reliability.
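
A minimal sketch of the correction, assuming the reliabilities of the two measurements are known (the function name is ours):

```python
import numpy as np

def disattenuated_correlation(x, y, reliability_x, reliability_y):
    """Spearman's correction for attenuation: divide the observed
    correlation by the geometric mean of the two reliabilities."""
    r_xy = np.corrcoef(x, y)[0, 1]
    return r_xy / np.sqrt(reliability_x * reliability_y)
```

Note that when the reliabilities themselves are estimated with error, the corrected value can exceed 1.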

To understand when a correction is needed, let y denote the outcome variable, x the true predictor variable, and w an approximate, error-prone observation of x. Frost and Thompson suggest, for example, that x may be the true, long-term blood pressure of a patient, and w may be the blood pressure observed on one particular clinic visit.

For prediction, no correction is needed: in the example, assuming that blood pressure measurements are similarly variable in future patients, our regression line of y on w (observed blood pressure) gives unbiased predictions.

Suppose the change in x is known under some new circumstance: to estimate the likely change in an outcome variable y, the slope of the regression of y on x is needed, not y on w. This arises in epidemiology.

To continue the example in which x denotes blood pressure, perhaps a large clinical trial has provided an estimate of the change in blood pressure under a new treatment; then the possible effect on y, under the new treatment, should be estimated from the slope in the regression of y on x.
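
Worked with illustrative numbers only (both slopes and the change in x below are hypothetical):

```python
naive_slope = 1.0   # hypothetical slope from the regression of y on w (attenuated)
true_slope = 2.0    # hypothetical slope of y on x, e.g. after correcting for dilution
delta_x = -1.0      # known change in the true predictor under the new circumstance

print(naive_slope * delta_x)  # understates the expected change in y
print(true_slope * delta_x)   # the estimate that is actually needed
```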

A correction is also needed when future observations will be variable, but not similarly variable: for example, if the current data set includes blood pressure measured with greater precision than is common in clinical practice.

One specific example of this arose when developing a regression equation based on a clinical trial, in which blood pressure was the average of six measurements, for use in clinical practice, where blood pressure is usually a single measurement.[14]
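
To see why the distinction matters, note that averaging six independent readings reduces the measurement-error variance by a factor of six, so the dilution ratio in the trial differs from that in clinical practice. A sketch with invented variance figures:

```python
true_var = 400.0          # hypothetical between-patient variance of true blood pressure
error_var_single = 100.0  # hypothetical measurement variance of a single reading

ratio_trial = true_var / (true_var + error_var_single / 6)  # average of six readings
ratio_clinic = true_var / (true_var + error_var_single)     # a single reading

# A slope estimated from the trial data would be rescaled by
# ratio_clinic / ratio_trial before being applied to single measurements.
print(ratio_trial, ratio_clinic, ratio_clinic / ratio_trial)
```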

All of these results can be shown mathematically, in the case of simple linear regression assuming normal distributions throughout (the framework of Frost & Thompson).
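
For example, under the structural model with normally distributed variables, writing $w = x + u$ with $x \sim N(\mu, \sigma_x^2)$ and independent measurement error $u \sim N(0, \sigma_u^2)$ (notation introduced here), the expected least-squares slope of y on w is the true slope multiplied by the dilution ratio:

\[
E\!\left[\hat\beta_{y\,\text{on}\,w}\right] = \beta \,\frac{\sigma_x^2}{\sigma_x^2 + \sigma_u^2},
\qquad
\lambda = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_u^2},
\]

which is the ratio that the correction methods above estimate and divide by.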

It has been argued that a poorly executed correction for regression dilution, in particular one performed without checking the underlying assumptions, may do more damage to an estimate than no correction at all.[15]

Regression dilution was first mentioned, under the name attenuation, by Spearman (1904).[16]

Those seeking a readable mathematical treatment might like to start with Frost and Thompson (2000).

Figure: Illustration of regression dilution (or attenuation bias) by a range of regression estimates in errors-in-variables models. Two regression lines (red) bound the range of linear regression possibilities. The shallow slope is obtained when the independent variable (or predictor) is on the abscissa (x-axis); the steeper slope is obtained when the independent variable is on the ordinate (y-axis). By convention, with the independent variable on the x-axis, the shallower slope is obtained. Green reference lines are averages within arbitrary bins along each axis. Note that the steeper green and red regression estimates are more consistent with smaller errors in the y-axis variable.

Figure: Suppose the green and blue data points capture the same data, but with errors (either +1 or −1 on the x-axis) for the green points. Minimizing error on the y-axis leads to a smaller slope for the green points, even though they are just a noisy version of the same data.