Instrumental variables estimation

[1] Intuitively, IVs are used when an explanatory variable of interest is correlated with the error term (endogenous), in which case ordinary least squares and ANOVA give biased results.

Such correlation may occur when changes in the dependent variable change the value of at least one of the covariates (reverse causation), when there are omitted variables that affect both the dependent and independent variables, or when the covariates are subject to measurement error. Explanatory variables that suffer from one or more of these issues in the context of a regression are sometimes referred to as endogenous.

For example, suppose a researcher wishes to estimate the causal effect of smoking (X) on general health (Y).

The tax rate for tobacco products is a reasonable choice for an instrument because the researcher assumes that it can only be correlated with health through its effect on smoking.

The first use of an instrumental variable occurred in a 1928 book by Philip G. Wright, best known for his excellent description of the production, transport and sale of vegetable and animal oils in the early 1900s in the United States.

[6][7] In 1945, Olav Reiersøl applied the same approach in the context of errors-in-variables models in his dissertation, giving the method its name.

[8] Wright attempted to determine the supply and demand for butter using panel data on prices and quantities sold in the United States.

The problem was that price affected both supply and demand so that a function describing only one of the two could not be constructed directly from the observational data.

[9] Formal definitions of instrumental variables, using counterfactuals and graphical criteria, were given by Judea Pearl in 2000.

[11] Notions of causality in econometrics, and their relationship with instrumental variables and other methods, are discussed by Heckman (2008).

If the instrument Z affects the outcome only through its effect on the explanatory variable X (the exclusion restriction), then IV may identify the causal parameter of interest where OLS fails.

If there are additional covariates W, then the above definitions are modified so that Z qualifies as an instrument if the given criteria hold conditional on W. The essence of Pearl's definition is that these conditions do not rely on the specific functional form of the equations; they are therefore applicable to nonlinear equations, where U can be non-additive (see Non-parametric analysis).

Since U is unobserved, the requirement that Z be independent of U cannot be inferred from data and must instead be determined from the model structure, i.e., the data-generating process.

Suppose that we wish to estimate the effect of a university tutoring program on grade point average (GPA).

Proximity may also cause students to spend more time at the library, which in turn improves their GPA (see Figure 1).

Now, suppose that we notice that a student's "natural ability" affects his or her number of hours in the library as well as his or her GPA, as in Figure 3.

Using the causal graph, we see that Library Hours is a collider and that conditioning on it opens the path Proximity → Library Hours ← Natural Ability → GPA.

In this case, controlling for Library Hours still opens a spurious path from Proximity to GPA.

When X and the other unmeasured, causal variables collapsed into the e term are correlated, however, the OLS estimator is generally biased and inconsistent for β.
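The bias of OLS under endogeneity, and how an instrument recovers the true coefficient, can be sketched with a small simulation. All variable names and numeric values below are hypothetical choices for illustration, not part of the original exposition:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process: an unobserved confounder u drives
# both x and y, so x is correlated with the error term (endogenous).
u = rng.normal(size=n)             # unobserved confounder (part of the error)
z = rng.normal(size=n)             # instrument: shifts x, independent of u
x = 0.8 * z + 0.5 * u + rng.normal(size=n)
beta_true = 2.0
y = beta_true * x + u + rng.normal(size=n)

# OLS slope cov(x, y) / var(x): biased because cov(x, u) != 0
beta_ols = np.cov(x, y)[0, 1] / np.var(x)

# Simple IV (Wald) estimator cov(z, y) / cov(z, x): consistent for beta_true
beta_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(beta_ols)   # biased upward, away from 2.0
print(beta_iv)    # close to 2.0
```

With this design the OLS bias is roughly cov(x, u)/var(x) = 0.5/1.89 ≈ 0.26, while the IV ratio cancels the confounding because z is independent of u.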

Now an extension: suppose that there are more instruments than there are covariates in the equation of interest, so that Z is a T × M matrix with M > K. This is often called the over-identified case.

We can expand the inverse using the fact that, for any invertible n × n matrices A and B, (AB)−1 = B−1A−1 (see Invertible matrix § Properties); see Davidson and MacKinnon (1993).[14]: 218  There is an equivalent under-identified estimator for the case where M < K. Since the parameters are the solutions to a set of linear equations, an under-identified model using the set of equations does not have a unique solution.
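In the over-identified case, one common form of the generalized IV (two-stage least squares) estimator is β̂ = (X′P_Z X)⁻¹ X′P_Z y, where P_Z is the projection matrix onto the columns of Z. A minimal numerical sketch with three instruments and one endogenous regressor (all data hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, m = 50_000, 1, 3             # m > k: over-identified

u = rng.normal(size=n)             # unobserved confounder
Z = rng.normal(size=(n, m))        # three valid instruments
x = Z @ np.array([0.6, 0.4, 0.3]) + 0.5 * u + rng.normal(size=n)
X = x.reshape(-1, 1)
y = 2.0 * x + u + rng.normal(size=n)

# First-stage fitted values: P_Z X = Z (Z'Z)^{-1} Z' X
PZ_X = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)

# 2SLS: beta = (X' P_Z X)^{-1} X' P_Z y
beta_2sls = np.linalg.solve(PZ_X.T @ X, PZ_X.T @ y)
print(beta_2sls)    # close to the true coefficient 2.0
```

Using `np.linalg.solve` rather than forming the full n × n projection matrix keeps the computation feasible for large n.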

This is commonly known in the econometric literature as the forbidden regression,[15] because second-stage IV parameter estimates are consistent only in special cases.

A small correction must be made to the sum-of-squared residuals in the second-stage fitted model in order that the covariance matrix of the parameter estimates is calculated correctly.[10] The exposition above assumes that the causal effect of interest does not vary across observations, that is, that β is a constant.
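The correction to the second-stage residuals amounts to this: the error variance must be estimated from y − Xβ̂, evaluated at the original regressors X, not from the naive second-stage residuals y − X̂β̂. A sketch with hypothetical simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
u = rng.normal(size=n)
z = rng.normal(size=n)
x = 0.8 * z + 0.5 * u + rng.normal(size=n)
y = 2.0 * x + u + rng.normal(size=n)

Z = np.column_stack([np.ones(n), z])     # instruments incl. constant
X = np.column_stack([np.ones(n), x])     # regressors incl. constant

# First stage: fitted values X_hat = P_Z X
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
# Second stage point estimate (manual 2SLS)
beta = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

# Naive residuals, as reported by OLS software run on the second stage:
resid_naive = y - X_hat @ beta
# Corrected residuals: evaluate the structural equation at the ORIGINAL X
resid_iv = y - X @ beta

XtX_inv = np.linalg.inv(X_hat.T @ X_hat)
se_naive = np.sqrt(np.diag(resid_naive.var() * XtX_inv))
se_iv = np.sqrt(np.diag(resid_iv.var() * XtX_inv))
print(se_naive[1], se_iv[1])   # the two disagree; only se_iv is valid
```

In this particular simulation the naive residuals also pick up the first-stage prediction error (x − x̂) scaled by β, so the uncorrected standard error is inflated; in general the two simply disagree and only the corrected one is consistent.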

[1] Imbens and Angrist (1994) demonstrate that the linear IV estimate can be interpreted under weak conditions as a weighted average of local average treatment effects, where the weights depend on the elasticity of the endogenous regressor to changes in the instrumental variables.

Consequently, they are unlikely to have much success in predicting the ultimate outcome when they are used to replace the question predictor in the second-stage equation.

[20] A common rule of thumb for models with one endogenous regressor is: the F-statistic against the null that the excluded instruments are irrelevant in the first-stage regression should be larger than 10.
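The first-stage F-statistic behind this rule of thumb can be computed directly. A sketch for one endogenous regressor and one excluded instrument, using hypothetical simulated data in which the instrument is comfortably strong:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
u = rng.normal(size=n)
z = rng.normal(size=n)
x = 0.3 * z + 0.5 * u + rng.normal(size=n)   # hypothetical first stage

# First-stage regression of x on a constant and the excluded instrument z
Z = np.column_stack([np.ones(n), z])
g = np.linalg.solve(Z.T @ Z, Z.T @ x)
resid = x - Z @ g

# F-statistic for H0: the coefficient on z is zero (one restriction)
rss_restricted = np.sum((x - x.mean()) ** 2)   # constant-only model
rss_full = np.sum(resid ** 2)
F = (rss_restricted - rss_full) / (rss_full / (n - 2))
print(F)   # well above the rule-of-thumb threshold of 10
```

If this statistic fell below 10, the instrument would be considered weak and the IV estimates potentially unreliable.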

[21] The assumption that the instruments are not correlated with the error term in the equation of interest is not testable in exactly identified models.

In over-identified models, instrument exogeneity can be assessed with the Sargan–Hansen test, whose statistic can be computed as nR² (the number of observations multiplied by the coefficient of determination) from the OLS regression of the residuals onto the set of exogenous variables.

This statistic will be asymptotically chi-squared with m − k degrees of freedom under the null that the error term is uncorrelated with the instruments.
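The nR² over-identification statistic described above can be sketched as follows, using hypothetical simulated data with three valid instruments and one endogenous regressor (so m − k = 2 degrees of freedom):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, k = 20_000, 3, 1
u = rng.normal(size=n)
Z = rng.normal(size=(n, m))        # valid instruments: independent of u
x = Z @ np.array([0.6, 0.4, 0.3]) + 0.5 * u + rng.normal(size=n)
X = x.reshape(-1, 1)
y = 2.0 * x + u + rng.normal(size=n)

# 2SLS estimate and its residuals
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
beta = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)
resid = y - X @ beta

# Statistic: n * R^2 from regressing the IV residuals on all instruments
g = np.linalg.solve(Z.T @ Z, Z.T @ resid)
fitted = Z @ g
r2 = fitted.var() / resid.var()
stat = n * r2
print(stat)   # approx. chi-squared with m - k = 2 df under the null
```

Because the instruments here are valid by construction, the statistic should be a typical draw from a chi-squared distribution with 2 degrees of freedom, i.e. small; a large value would be evidence against instrument exogeneity.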