Contrary to popular belief, neither the Gauss–Markov theorem nor the more common maximum likelihood justification for ordinary least squares depends on any assumption about the correlation structure among the predictors[1][2][3] (although perfect collinearity can cause problems with some software).
There is no justification for the practice of removing collinear variables as part of regression analysis,[1][4][5][6][7] and doing so may constitute scientific misconduct.
The exception is perfect collinearity, where one predictor is an exact linear function of the others: because income is equal to expenses plus savings by definition, for example, it is incorrect to include all three variables in a regression simultaneously.
Similarly, including a dummy variable for every category (e.g., summer, autumn, winter, and spring) as well as an intercept term will result in perfect collinearity.[9]
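As a rough illustration (with invented numbers; the variable names are only for this example), the following Python sketch shows how both situations produce a rank-deficient design matrix, which is what causes the software problems mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Income is expenses plus savings by definition, so the three
# columns are perfectly collinear.
expenses = rng.normal(50, 10, n)
savings = rng.normal(10, 3, n)
income = expenses + savings
X_accounts = np.column_stack([np.ones(n), income, expenses, savings])
print(np.linalg.matrix_rank(X_accounts))  # 3, not 4: rank-deficient

# Dummy-variable trap: one dummy per season plus an intercept.
season = rng.integers(0, 4, n)
dummies = np.eye(4)[season]               # summer, autumn, winter, spring
X_seasons = np.column_stack([np.ones(n), dummies])
print(np.linalg.matrix_rank(X_seasons))   # 4, not 5: the dummies sum to the intercept
```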
The other common cause of perfect collinearity is attempting to use ordinary least squares when working with very wide datasets (those with more variables than observations).
These require more advanced data analysis techniques like Bayesian hierarchical modeling to produce meaningful results.
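The sketch below uses scikit-learn's BayesianRidge purely as a simple stand-in for a full hierarchical model, together with simulated data: with more variables than observations, the ordinary least squares normal equations have no unique solution, while a Bayesian shrinkage estimator still returns finite, usable estimates.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(1)
n, p = 30, 100                                 # far more variables than observations
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]         # only a few real effects
y = X @ beta + rng.normal(scale=0.5, size=n)

# The OLS normal equations are singular: X'X is 100 x 100 but has rank
# at most 30, so there is no unique least-squares solution.
print(np.linalg.matrix_rank(X.T @ X))          # 30

# A Bayesian shrinkage estimator still returns finite, regularized coefficients.
model = BayesianRidge().fit(X, y)
print(np.round(model.coef_[:5], 2))            # shrunken estimates of the first five effects
```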
When predictors are highly correlated, their individual effects are difficult to disentangle, and the regression produces poor coefficient estimates with large standard errors.
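A small simulation (all parameters invented; statsmodels is assumed here only for convenience) shows how the standard error of the same coefficient inflates as the predictors become more correlated:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200

def fit_and_report(corr):
    # Two predictors with the given correlation and identical true effects.
    cov = [[1.0, corr], [corr, 1.0]]
    X = rng.multivariate_normal([0, 0], cov, size=n)
    y = 1.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=1.0, size=n)
    result = sm.OLS(y, sm.add_constant(X)).fit()
    print(f"corr={corr:>4}: coef={result.params[1]:.2f}, se={result.bse[1]:.3f}")

fit_and_report(0.0)    # small standard error
fit_and_report(0.99)   # same model, but the standard error inflates sharply
```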
This confounding becomes substantially worse when researchers attempt to ignore or suppress it by excluding these variables from the regression (see #Misuse).
Excluding multicollinear variables from regressions will invalidate causal inference and produce worse estimates by removing important confounders.
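Continuing the same kind of simulation (numbers again invented), dropping one of two correlated predictors does shrink the reported standard error, but the coefficient on the remaining variable absorbs the omitted variable's effect and becomes badly biased:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, corr = 200, 0.9

# Two correlated predictors; only the estimate for x1 is of interest.
X = rng.multivariate_normal([0, 0], [[1, corr], [corr, 1]], size=n)
y = 1.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(X)).fit()
dropped = sm.OLS(y, sm.add_constant(X[:, [0]])).fit()

# Full model: roughly unbiased estimate near 1.0, with a larger standard error.
print(f"full model:  coef={full.params[1]:.2f}, se={full.bse[1]:.3f}")
# Dropping x2: smaller standard error, but the coefficient absorbs x2's
# effect and is biased toward 1 + corr * 1 = 1.9.
print(f"x2 excluded: coef={dropped.params[1]:.2f}, se={dropped.bse[1]:.3f}")
```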
When collinear variables must be kept in a model, regularization can stabilize the estimates; Bayesian hierarchical models (provided by software like BRMS) can perform such regularization automatically, learning informative priors from the data.
For example, complaints that coefficients have "wrong signs" or that confidence intervals "include unrealistic values" indicate that important prior information is not being incorporated into the model.
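As a rough sketch of the idea, using PyMC in place of BRMS and simulated data (the model structure and priors here are illustrative assumptions, not a prescribed recipe), a hierarchical prior whose scale is itself learned from the data shrinks correlated coefficients automatically:

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(4)
n, p, corr = 100, 5, 0.9

# Simulated, strongly correlated predictors (all values invented).
cov = np.full((p, p), corr) + (1 - corr) * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)
y = X @ np.array([1.0, -1.0, 0.0, 0.0, 0.0]) + rng.normal(size=n)

with pm.Model() as model:
    # The prior scale tau is itself given a prior and learned from the data,
    # so the amount of shrinkage is chosen automatically.
    tau = pm.HalfNormal("tau", sigma=1.0)
    beta = pm.Normal("beta", mu=0.0, sigma=tau, shape=p)
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    pm.Normal("y_obs", mu=pm.math.dot(X, beta), sigma=sigma, observed=y)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# Posterior means are shrunken, finite estimates despite the collinearity.
print(idata.posterior["beta"].mean(dim=("chain", "draw")).values)
```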
While the above strategies work in some situations, even advanced techniques may still produce estimates with large standard errors.[1] The scientific process often involves null or inconclusive results; not every experiment will be "successful" in the sense of decisively confirming the researcher's original hypothesis.
Variance inflation factors are often misused as criteria in stepwise regression (i.e. for variable inclusion/exclusion), a use that "lacks any logical basis but also is fundamentally misleading as a rule-of-thumb".[1]
Excluding variables with a high variance inflation factor also invalidates the calculated standard errors and p-values, by turning the results of the regression into a post hoc analysis.[14]
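For reference, the variance inflation factor for predictor j is 1/(1 − R_j²), where R_j² comes from regressing that predictor on all the others; the short sketch below (simulated data, statsmodels assumed) computes it directly rather than using it as an exclusion rule:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
n = 200

# Three predictors, two of them strongly correlated (invented data).
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
exog = sm.add_constant(np.column_stack([x1, x2, x3]))

# VIF_j = 1 / (1 - R_j^2); column 0 is the constant and is skipped.
for j, name in zip([1, 2, 3], ["x1", "x2", "x3"]):
    print(name, round(variance_inflation_factor(exog, j), 1))
```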
Because collinearity leads to large standard errors and p-values, which can make publishing articles more difficult, some researchers will try to suppress inconvenient data by removing strongly-correlated variables from their regression.
This procedure falls into the broader categories of p-hacking, data dredging, and post hoc analysis.
P-values and confidence intervals derived from post hoc analyses are invalid because they ignore the uncertainty introduced by the model selection procedure.
It is reasonable to exclude unimportant predictors if they are known ahead of time to have little or no effect on the outcome; for example, local cheese production should not be used to predict the height of skyscrapers.