FIGURE 2.1: Least squares estimation under collinearity. The only change in the data sets is the lightly colored data point. The planes are the estimated least squares fits.
Another problem with collinearity comes from attempting to use a fitted regression model for prediction. As was noted in Chapter 1, simple models tend to forecast better than more complex ones, since they make fewer assumptions about what the future will look like. If a model exhibiting collinearity is used for future prediction, the implicit assumption is that the relationships among the predicting variables, as well as their relationship with the target variable, remain the same in the future. This is less likely to be true if the predicting variables are collinear.
How can collinearity be diagnosed? The two-predictor model
$$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i $$
provides some guidance. It can be shown that in this case
$$ \mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2}{(1 - r_{12}^2)\sum_i (x_{i1} - \bar{x}_1)^2} $$
and
$$ \mathrm{Var}(\hat{\beta}_2) = \frac{\sigma^2}{(1 - r_{12}^2)\sum_i (x_{i2} - \bar{x}_2)^2}, $$
where $r_{12}$ is the correlation between $x_1$ and $x_2$. Note that as collinearity increases ($r_{12} \to \pm 1$), both variances tend to $\infty$. This effect is quantified in Table 2.1.
Table 2.1: Variance inflation caused by correlation of predictors in a two-predictor model.

| $r_{12}$ | Variance inflation $1/(1 - r_{12}^2)$ |
|---|---|
| 0.0 | 1.00 |
| 0.5 | 1.33 |
| 0.7 | 1.96 |
| 0.8 | 2.78 |
| 0.9 | 5.26 |
| 0.95 | 10.26 |
| 0.97 | 16.92 |
| 0.99 | 50.25 |
| 0.995 | 100.25 |
| 0.999 | 500.25 |
This ratio describes by how much the variances of the estimated slope coefficients are inflated due to observed collinearity relative to when the predictors are uncorrelated. It is clear that when the correlation is high, the variability (and hence the instability) of the estimated slopes can increase dramatically.
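The inflation ratio $1/(1 - r_{12}^2)$ is easy to compute directly. The following minimal sketch (the function name `variance_inflation` and the particular correlation values are illustrative choices, not from the original) shows how quickly the multiplier grows as $r_{12}$ approaches 1:

```python
def variance_inflation(r12: float) -> float:
    """Multiplier on Var(beta-hat) in the two-predictor model,
    relative to the case where the predictors are uncorrelated."""
    return 1.0 / (1.0 - r12 ** 2)

# Print the inflation factor for a few illustrative correlations.
for r12 in (0.0, 0.5, 0.9, 0.99):
    print(f"r12 = {r12:5.2f}   inflation = {variance_inflation(r12):8.2f}")
```

At $r_{12} = 0.5$ the variances are inflated by only about a third, but at $r_{12} = 0.99$ they are roughly fifty times larger than in the uncorrelated case.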
A diagnostic to determine this in general is the variance inflation factor ($VIF_j$) for each predicting variable, which is defined as
$$ VIF_j = \frac{1}{1 - R_j^2}, $$
where $R_j^2$ is the $R^2$ of the regression of the variable $x_j$ on the other predicting variables. $VIF_j$ gives the proportional increase in the variance of $\hat{\beta}_j$ compared to what it would have been if the predicting variables had been uncorrelated. There are no formal cutoffs as to what constitutes a large $VIF_j$, but collinearity is generally not a problem if the observed $VIF_j$ satisfies
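The auxiliary-regression definition of the variance inflation factor can be sketched directly with ordinary least squares. In this minimal example (the function name `vif` and the simulated predictors are illustrative assumptions, not from the original), each column of the predictor matrix is regressed on the remaining columns to obtain $R_j^2$, and $VIF_j = 1/(1 - R_j^2)$:

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from
    regressing column j of X on the remaining columns (with intercept)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        Z = np.column_stack([np.ones(n), others])   # add an intercept
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        tss = (y - y.mean()) @ (y - y.mean())       # total sum of squares
        out[j] = 1.0 / (1.0 - (1.0 - resid @ resid / tss))
    return out

# Simulated data: two nearly collinear predictors plus an independent one.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # almost identical to x1
x3 = rng.normal(size=200)
print(vif(np.column_stack([x1, x2, x3])))
```

With these simulated data the VIFs for the first two (collinear) predictors are very large, while the VIF for the independent predictor stays near 1, consistent with the interpretation of $VIF_j$ as a variance multiplier.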