In statistics and in particular in regression analysis, leverage is a measure of how far away the independent variable values of an observation are from those of the other observations.
High-leverage points, if any, are outliers with respect to the independent variables.
That is, high-leverage points have no neighboring points in \(\mathbb{R}^{p}\) space, where \(p\) is the number of independent variables in a regression model.
This makes the fitted model likely to pass close to a high-leverage observation.[1] Hence high-leverage points have the potential to cause large changes in the parameter estimates when they are deleted, i.e., to be influential points.
Although an influential point will typically have high leverage, a high leverage point is not necessarily an influential point.
The leverage is typically defined as the diagonal elements of the hat matrix.[2] Consider the linear regression model \(y_i = \boldsymbol{x}_i^{\top}\boldsymbol{\beta} + \varepsilon_i\), \(i = 1, \ldots, n\), or in matrix form \(\boldsymbol{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}\), where \(X\) is the \(n \times p\) design matrix whose rows correspond to the observations and whose columns correspond to the independent or explanatory variables. The leverage score for the \(i\)-th observation is defined as
\[
h_{ii} = [H]_{ii} = \boldsymbol{x}_i^{\top} \left(X^{\top} X\right)^{-1} \boldsymbol{x}_i,
\]
the \(i\)-th diagonal element of the projection (hat) matrix \(H = X\left(X^{\top} X\right)^{-1} X^{\top}\). Thus the \(i\)-th leverage score can be viewed as the 'weighted' distance between \(\boldsymbol{x}_i\) and the mean of the \(\boldsymbol{x}_i\)'s (see its relation with the Mahalanobis distance below). It can also be interpreted as the degree by which the \(i\)-th measured (dependent) value \(y_i\) influences the \(i\)-th fitted (predicted) value \(\hat{y}_i\): mathematically,
\[
h_{ii} = \frac{\partial \hat{y}_i}{\partial y_i}.
\]
Hence, the leverage score is also known as the observation self-sensitivity or self-influence. Note that this leverage depends on the values of the explanatory variables \(\boldsymbol{x}_i\) of all observations but not on any of the values of the dependent variables \(y_i\).
Large leverage \(h_{ii}\) corresponds to an \(\boldsymbol{x}_i\) that lies far from the bulk of the covariate values. A common rule of thumb is to flag observation \(i\) as a high-leverage point if \(h_{ii} > 2p/n\), where \(p\) is the number of model parameters and \(n\) is the number of observations; some statisticians prefer the threshold of \(3p/n\).
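For illustration, the following NumPy sketch (with made-up data and one deliberately extreme observation) computes the leverage scores as the diagonal of the hat matrix and flags observations exceeding the \(2p/n\) rule of thumb:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy design matrix: an intercept column plus two explanatory variables, with
# one observation deliberately given extreme covariate values (all data made up).
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
X[0, 1:] = [6.0, -5.0]   # an outlier in the explanatory variables
p = X.shape[1]

# Leverage scores are the diagonal of H = X (X^T X)^{-1} X^T.  Using the reduced
# QR decomposition X = QR, H = Q Q^T, so h_ii is the squared norm of row i of Q.
Q, _ = np.linalg.qr(X)
h = np.sum(Q**2, axis=1)

print("sum of leverages:", h.sum())   # equals p, the number of columns of X
print("high-leverage points (h > 2p/n):", np.flatnonzero(h > 2 * p / n))
```

The QR-based computation avoids forming the full \(n \times n\) hat matrix explicitly, which matters when the number of observations is large.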
Leverage is closely related to the Mahalanobis distance (proof[4]).
The relationship between the two is
\[
D^2(\boldsymbol{x}_i) = (n - 1)\left(h_{ii} - \tfrac{1}{n}\right),
\]
where \(D^2(\boldsymbol{x}_i)\) is the squared Mahalanobis distance of \(\boldsymbol{x}_i\) from the sample mean of the explanatory variables (for a model that includes an intercept term). This relationship enables us to decompose leverage into meaningful components so that some sources of high leverage can be investigated analytically.[5]
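This identity can be checked numerically. The sketch below (made-up data, NumPy only) computes the leverages for a design matrix with an intercept column together with the squared Mahalanobis distances of the explanatory variables from their sample mean, and confirms that they agree up to the scaling above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: n observations of p explanatory variables.
n, p = 40, 2
Z = rng.normal(size=(n, p))

# Leverage computed from a design matrix that includes an intercept column.
X = np.column_stack([np.ones(n), Z])
Q, _ = np.linalg.qr(X)
h = np.sum(Q**2, axis=1)

# Squared Mahalanobis distance of each observation from the sample mean,
# using the unbiased sample covariance of the explanatory variables.
diff = Z - Z.mean(axis=0)
S_inv = np.linalg.inv(np.cov(Z, rowvar=False))   # np.cov divides by n - 1
D2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)

# Check the identity D^2(x_i) = (n - 1) * (h_ii - 1/n).
assert np.allclose(D2, (n - 1) * (h - 1 / n))
print(D2[:3], (n - 1) * (h[:3] - 1 / n))
```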
In a regression context, we combine leverage and influence functions to compute the degree to which estimated coefficients would change if we removed a single data point.
Denoting the regression residuals as \(\hat{e}_i = y_i - \boldsymbol{x}_i^{\top}\hat{\boldsymbol{\beta}}\), one can compare the estimated coefficient \(\hat{\boldsymbol{\beta}}\) to the leave-one-out estimated coefficient \(\hat{\boldsymbol{\beta}}^{(-i)}\), obtained by fitting the regression without the \(i\)-th observation, using the formula[8]
\[
\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}^{(-i)} = \frac{\left(X^{\top}X\right)^{-1}\boldsymbol{x}_i \hat{e}_i}{1 - h_{ii}}.
\]
To gain intuition for this formula, note that \(\partial\hat{\boldsymbol{\beta}}/\partial y_i = \left(X^{\top}X\right)^{-1}\boldsymbol{x}_i\) captures the potential for an observation to affect the regression parameters, and therefore \(\left(X^{\top}X\right)^{-1}\boldsymbol{x}_i\hat{e}_i\) captures the actual influence of that observation's deviation from its fitted value on the regression parameters. The formula divides by \(1 - h_{ii}\) to account for the fact that we remove the observation rather than adjusting its value, reflecting the fact that removal changes the distribution of covariates more when applied to high-leverage observations (i.e., those with outlier covariate values). Similar formulas arise when applying general formulas for statistical influence functions in the regression context.[9][10]
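The leave-one-out formula can be verified against a brute-force refit. The following sketch (made-up data and an arbitrarily chosen observation index) compares the closed-form coefficient change with the coefficients obtained by actually deleting the observation and refitting:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative regression: y = X beta + noise, with an intercept and two slopes.
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverage scores

i = 7   # an arbitrary observation to delete

# Closed-form change in the coefficients when observation i is removed:
# beta_hat - beta_hat^(-i) = (X^T X)^{-1} x_i e_i / (1 - h_ii)
delta = XtX_inv @ X[i] * resid[i] / (1 - h[i])

# Brute-force check: refit the regression without observation i.
mask = np.arange(n) != i
beta_loo, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)

assert np.allclose(beta_hat - beta_loo, delta)
print(delta)
```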
If we are in an ordinary least squares setting with fixed design matrix \(X\) and homoscedastic errors of variance \(\sigma^2\), then the \(i\)-th regression residual \(\hat{e}_i = y_i - \hat{y}_i\) has variance
\[
\operatorname{Var}(\hat{e}_i) = (1 - h_{ii})\,\sigma^2.
\]
In other words, an observation's leverage score determines the degree of noise in the model's misprediction of that observation, with higher leverage leading to less noise.
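A small Monte Carlo check of this variance formula, assuming a fixed made-up design and normal homoscedastic noise, might look as follows:

```python
import numpy as np

rng = np.random.default_rng(3)

# Fixed illustrative design, homoscedastic normal noise.
n, sigma = 20, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, -1.0, 0.5])

Q, _ = np.linalg.qr(X)
h = np.sum(Q**2, axis=1)    # leverage scores
M = np.eye(n) - Q @ Q.T     # residual-maker matrix I - H (symmetric)

# Simulate many data sets from the same design and record the residuals.
reps = 20000
Y = X @ beta + rng.normal(scale=sigma, size=(reps, n))
resid = Y @ M               # each row holds the residuals of one simulated fit

# The empirical residual variances should be close to (1 - h_ii) * sigma^2.
print(np.allclose(resid.var(axis=0), (1 - h) * sigma**2, rtol=0.05))
```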
Partial leverage (PL) is a measure of the contribution of the individual independent variables to the total leverage of each observation.
Specifically, partial leverage measures how the leverage \(h_{ii}\) changes as a variable is added to the regression model. The partial leverage of the \(i\)-th observation with respect to the \(j\)-th independent variable is the leverage of the corresponding point in the partial regression plot for the \(j\)-th variable.
Data points with large partial leverage for an independent variable can exert undue influence on the selection of that variable in automatic regression model building procedures.
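To illustrate the computation, the sketch below derives the partial leverages for one arbitrarily chosen variable from the residuals of regressing that column on the remaining columns, i.e. from the horizontal coordinates of the partial regression plot; reducing the leverage of those plotted points to a ratio of squared residuals is an assumption of this sketch, not a formula stated above:

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative design: an intercept plus three explanatory variables.
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
j = 2    # column whose partial leverage we want (hypothetical choice)

# Residuals from regressing column j on all the remaining columns.
others = np.delete(X, j, axis=1)
coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
xj_resid = X[:, j] - others @ coef

# Partial leverage of each observation for variable j: the leverage of the
# corresponding point in the partial regression plot, which for these
# mean-zero residuals is a ratio of squared residuals.
partial_leverage = xj_resid**2 / np.sum(xj_resid**2)

print(partial_leverage[:5], partial_leverage.sum())   # the values sum to 1
```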
Many statistical programs and packages, such as R and Python, include implementations of leverage.
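For example, in Python the leverage scores of a fitted ordinary least squares model can be obtained from the statsmodels package (a sketch with made-up data); in R, the analogous values are returned by the hatvalues() function:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)

# Made-up data for an ordinary least squares fit.
x = rng.normal(size=(40, 2))
y = 1.0 + x @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=40)

model = sm.OLS(y, sm.add_constant(x)).fit()
leverage = model.get_influence().hat_matrix_diag   # diagonal of the hat matrix

print(leverage[:5])
```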