2.2.2 COLLINEARITY
Recall that the importance of a predictor can be difficult to assess using $t$-tests when predictors are correlated with each other. A related issue is that of collinearity (sometimes somewhat redundantly referred to as multicollinearity), which refers to the situation when (some of) the predictors are highly correlated with each other. The presence of predicting variables that are highly correlated with each other can lead to instability in the regression coefficients, increasing their standard errors, and as a result the $t$-statistics for the variables can be deflated. This can be seen in Figure 2.1. The two plots refer to identical data sets, other than the one data point that is lightly colored. Dropping the data points down to the plane makes clear the high correlation between the predictors. The estimated regression plane changes substantially from the top plot to the bottom plot; a small change in only one data point causes a major change in the estimated regression function.
Thus, from a practical point of view, collinearity leads to two problems. First, it can happen that the overall $F$-statistic is significant, yet each of the individual $t$-statistics is not significant (more generally, the tail probability for the $F$-test is considerably smaller than those of any of the individual coefficient $t$-tests). Second, if the data are changed only slightly, the fitted regression coefficients can change dramatically. Note that while collinearity can have a large effect on regression coefficients and associated $t$-statistics, it does not have a large effect on overall measures of fit like the overall $F$-test or $R^2$, since adding unneeded variables (whether or not they are collinear with predictors already in the model) cannot increase the residual sum of squares (it can only decrease it or leave it roughly the same).
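The following R sketch (not from the book; the sample size, coefficients, and noise levels are illustrative) mimics the situation in Figure 2.1: two nearly collinear predictors are generated, one observation is perturbed slightly, and the fitted slopes can shift noticeably between the two fits.

```r
## Illustrative simulation of coefficient instability under collinearity.
set.seed(1)
n  <- 30
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # x2 is nearly collinear with x1
y  <- 10 + 2 * x1 + 1.5 * x2 + rnorm(n)

fit1 <- lm(y ~ x1 + x2)

## Perturb a single observation of x1 slightly and refit
x1b    <- x1
x1b[1] <- x1b[1] + 0.3
fit2 <- lm(y ~ x1b + x2)

round(coef(fit1), 2)
round(coef(fit2), 2)   # the slope estimates can change dramatically
```

Note also that `summary(fit1)` would typically show a highly significant overall $F$-test alongside weak individual $t$-statistics, the first practical symptom described above.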
FIGURE 2.1: Least squares estimation under collinearity. The only change in the data sets is the lightly colored data point. The planes are the estimated least squares fits.
Another problem with collinearity comes from attempting to use a fitted regression model for prediction. As was noted in Chapter 1, simple models tend to forecast better than more complex ones, since they make fewer assumptions about what the future will look like. If a model exhibiting collinearity is used for future prediction, the implicit assumption is that the relationships among the predicting variables, as well as their relationship with the target variable, remain the same in the future. This is less likely to be true if the predicting variables are collinear.
How can collinearity be diagnosed? The two-predictor model
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i$$
provides some guidance. It can be shown that in this case
$$\mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2}{(1 - r_{12}^2)\sum_i (x_{i1} - \bar{x}_1)^2}$$
and
$$\mathrm{Var}(\hat{\beta}_2) = \frac{\sigma^2}{(1 - r_{12}^2)\sum_i (x_{i2} - \bar{x}_2)^2},$$
where $r_{12}$ is the correlation between $x_1$ and $x_2$. Note that as collinearity increases ($r_{12} \to \pm 1$), both variances tend to $\infty$. This effect is quantified in Table 2.1.
Table 2.1: Variance inflation caused by correlation of predictors in a two-predictor model.

Correlation $r_{12}$ | Variance inflation $1/(1 - r_{12}^2)$
0.0 | 1.00
0.5 | 1.33
0.7 | 1.96
0.8 | 2.78
0.9 | 5.26
0.95 | 10.26
0.99 | 50.25
This ratio describes by how much the variances of the estimated slope coefficients are inflated due to observed collinearity relative to when the predictors are uncorrelated. It is clear that when the correlation is high, the variability (and hence the instability) of the estimated slopes can increase dramatically.
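The inflation factors in Table 2.1 follow directly from the formula above, as a quick R check confirms (the correlations are the illustrative values used in the table):

```r
## Variance inflation 1 / (1 - r12^2) for selected correlations
r12 <- c(0, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99)
data.frame(correlation = r12,
           inflation   = round(1 / (1 - r12^2), 2))
```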
A diagnostic to determine this in general is the variance inflation factor ($VIF_j$) for each predicting variable, which is defined as
$$VIF_j = \frac{1}{1 - R_j^2},$$
where $R_j^2$ is the $R^2$ of the regression of the $j$th predicting variable on the other predicting variables. $VIF_j$ gives the proportional increase in the variance of $\hat{\beta}_j$ compared to what it would have been if the predicting variables had been uncorrelated. There are no formal cutoffs as to what constitutes a large $VIF$, but collinearity is generally not a problem if the observed $VIF$ satisfies
$$VIF < \max\left(10, \frac{1}{1 - R^2}\right),$$
where $R^2$ is the usual $R^2$ for the regression fit. This means that either the predictors are more related to the target variable than they are to each other, or they are not related to each other very much. In either case the coefficient estimates are unlikely to be unstable, so collinearity is not a problem. If collinearity is present, a simplified model should be considered, but this is only a general guideline; sometimes two (or more) collinear predictors might be needed in order to adequately model the target variable. In the next section we discuss a methodology for judging the adequacy of fitted models and comparing them.