2.3.2 EXAMPLE — ESTIMATING HOME PRICES (CONTINUED)
Consider again the home price data examined in Section 1.4. We repeat the regression output from the model based on all of the predictors below:
Coefficients:
                Estimate  Std.Error  t value  Pr(>|t|)    VIF
(Intercept)   -7.149e+06  3.820e+06   -1.871  0.065043          .
Bedrooms      -1.229e+04  9.347e+03   -1.315  0.192361  1.262
Bathrooms      5.170e+04  1.309e+04    3.948  0.000171  1.420  ***
Living.area    6.590e+01  1.598e+01    4.124  9.22e-05  1.661  ***
Lot.size      -8.971e-01  4.194e+00   -0.214  0.831197  1.074
Year.built     3.761e+03  1.963e+03    1.916  0.058981  1.242  .
Property.tax   1.476e+00  2.832e+00    0.521  0.603734  1.300
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 47380 on 78 degrees of freedom
Multiple R-squared: 0.5065, Adjusted R-squared: 0.4685
F-statistic: 13.34 on 6 and 78 DF, p-value: 2.416e-10
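For readers following along in R, output along these lines can be produced with code like the following minimal sketch. The data frame name (homeprices) and response name (Sale.price) are assumptions for illustration; the predictor names match the output above, and the vif() function from the car package is one way to obtain the VIF column.

library(car)   # for vif()

fit.all <- lm(Sale.price ~ Bedrooms + Bathrooms + Living.area +
                Lot.size + Year.built + Property.tax,
              data = homeprices)

summary(fit.all)   # coefficient table, R-squared, F-statistic
vif(fit.all)       # variance inflation factors for the six predictors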
This is identical to the output given earlier, except that variance inflation factor (VIF) values are given for each predictor. It is apparent that there is virtually no collinearity among these predictors (recall that 1 is the minimum possible value of the VIF), which should make model selection more straightforward. The following output summarizes a best subsets fitting:
                       Mallows
Vars  R-Sq  R-Sq(adj)     Cp      AICc       S   Variables in model
  1   35.3     34.6      21.2   1849.9   52576   Bathrooms
  1   29.4     28.6      30.6   1857.3   54932   Living.area
  1   10.6      9.5      60.3   1877.4   61828   Year.built
  2   46.6     45.2       5.5   1835.7   48091   Bathrooms, Living.area
  2   38.9     37.5      17.5   1847.0   51397   Bathrooms, Year.built
  2   37.8     36.3      19.3   1848.6   51870   Bedrooms, Bathrooms
  3   49.4     47.5       3.0   1833.1   47092   Bathrooms, Living.area, Year.built
  3   48.2     46.3       4.9   1835.0   47635   Bedrooms, Bathrooms, Living.area
  3   46.6     44.7       7.3   1837.5   48346   Bathrooms, Living.area, Property.tax
  4   50.4     48.0       3.3   1833.3   46885   Bedrooms, Bathrooms, Living.area, Year.built
  4   49.5     47.0       4.7   1834.8   47304   Bathrooms, Living.area, Year.built, Property.tax
  4   49.4     46.9       5.0   1835.1   47380   Bathrooms, Living.area, Lot.size, Year.built
  5   50.6     47.5       5.0   1835.0   47094   Bedrooms, Bathrooms, Living.area, Year.built, Property.tax
  5   50.5     47.3       5.3   1835.2   47162   Bedrooms, Bathrooms, Living.area, Lot.size, Year.built
  5   49.6     46.4       6.7   1836.8   47599   Bathrooms, Living.area, Lot.size, Year.built, Property.tax
  6   50.6     46.9       7.0   1836.9   47381   All six predictors
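A best subsets search along these lines can be run with the regsubsets() function in the leaps package, as in the sketch below; R-squared, adjusted R-squared, and Mallows' Cp come directly from the fit, while S is computed from the residual sums of squares (an AICc column could be added in the same way, as discussed later in the section). The data frame and response names are again assumptions.

library(leaps)

subs <- regsubsets(Sale.price ~ Bedrooms + Bathrooms + Living.area +
                     Lot.size + Year.built + Property.tax,
                   data = homeprices, nvmax = 6, nbest = 3)
ss <- summary(subs)

n <- nrow(homeprices)
p <- rowSums(ss$which) - 1        # number of predictors in each model
S <- sqrt(ss$rss / (n - p - 1))   # standard error of the estimate for each model

data.frame(Vars = p, R.Sq = round(100 * ss$rsq, 1),
           R.Sq.adj = round(100 * ss$adjr2, 1),
           Cp = round(ss$cp, 1), S = round(S), ss$outmat)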
Output of this type provides the tools to choose among candidate models. The output provides summary statistics for the three models with the strongest fit for each number of predictors. So, for example, the best one-predictor model is based on Bathrooms, while the second best is based on Living.area; the best two-predictor model is based on Bathrooms and Living.area; and so on. The principle of parsimony noted earlier implies moving down the table as long as the gain in fit is big enough, but no further, thereby encouraging simplicity. A reasonable model selection strategy would not be based on only one possible measure, but would look at all of the measures together, using various guidelines to ultimately focus in on a few models (or only one) that best trade off strength of fit with simplicity, for example as follows:
1 Increase the number of predictors until the R² value levels off. Clearly, the highest R² for a given p cannot be smaller than that for a smaller value of p. If R² levels off, that implies that additional variables are not providing much additional fit. In this case, the largest R² values go from roughly 35% to 47% in moving from p = 1 to p = 2, which is clearly a large gain in fit, but beyond that more complex models do not provide much additional fit (particularly past p = 3). Thus, this guideline suggests choosing either p = 2 or p = 3.
2 Choose the model that maximizes the adjusted R². Recall from equation (1.7) that the adjusted R² equals

    adjusted R² = 1 − [(n − 1)/(n − p − 1)] (1 − R²).

It is apparent that the adjusted R² explicitly trades off strength of fit (R²) versus simplicity [the multiplier (n − 1)/(n − p − 1)], and it can decrease if predictors that do not add any predictive power are added to a model. Thus, it is reasonable to not complicate a model beyond the point where its adjusted R² increases. For these data, the adjusted R² is maximized at p = 4; a quick numerical check of this is given after this list.
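As a quick check of this guideline, the adjusted R² values in the table can be reproduced from equation (1.7) using the R² values in the regression output (with n = 85 observations here); a short sketch:

n  <- 85
R2 <- c(p3 = 0.4937, p4 = 0.5044)      # best three- and four-predictor models
p  <- c(p3 = 3,      p4 = 4)
1 - (n - 1) / (n - p - 1) * (1 - R2)   # about 0.475 and 0.480, as in the table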
The fourth column in the output refers to a criterion called Mallows' Cp (Mallows, 1973). This criterion equals

    Cp = Residual SS_p / σ̂²_full + 2(p + 1) − n,

where Residual SS_p is the residual sum of squares for the model being examined, p is the number of predictors in that model, and σ̂²_full is the residual mean square based on using all of the candidate predicting variables. Cp is designed to estimate the expected squared prediction error of a model. Like the adjusted R², Cp explicitly trades off strength of fit versus simplicity, with two differences: it is now small values that are desirable, and the penalty for complexity is stronger, in that the penalty term now multiplies the number of predictors in the model by 2, rather than by 1 (which means that using the adjusted R² will tend to lead to more complex models than using Cp will). This suggests another model selection rule:
1 Choose the model that minimizes Cp. In case of tied Cp values, the simplest model (smallest p) would be chosen. In these data, this rule implies choosing p = 3 (a quick numerical check of this value is given below).
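Since the Cp formula above involves only quantities reported in the output, the value for the best three-predictor model can be verified directly; in the minimal sketch below, the S values 47092 (best three-predictor model) and 47381 (model with all six predictors) come from the best subsets table above.

n <- 85
rss3        <- 47092^2 * (n - 3 - 1)   # residual sum of squares, p = 3
sigma2.full <- 47381^2                 # residual mean square, all six predictors
rss3 / sigma2.full + 2 * (3 + 1) - n   # approximately 3.0, as in the table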
An additional operational rule for the use of Cp has been suggested. When a particular model contains all of the necessary predictors, the residual mean square for the model should be roughly equal to σ², the true variance of the errors. Since the model that includes all of the predictors should also include all of the necessary ones, σ̂²_full should also be roughly equal to σ². This implies that if a model includes all of the necessary predictors, then

    Cp ≈ (n − p − 1)σ² / σ² + 2(p + 1) − n = p + 1.
This suggests the following model selection rule:
1 Choose the simplest model such that Cp ≈ p + 1 or smaller. In these data, this rule implies choosing p = 3.
A weakness of the Cp criterion is that its value depends on the largest set of candidate predictors (through σ̂²_full), which means that adding predictors that provide no predictive power to the set of candidate models can change the choice of best model. A general approach that avoids this is through the use of statistical information. A detailed discussion of the determination of information measures is beyond the scope of this book, but Burnham and Anderson (2002) provides extensive discussion of the topic. The Akaike Information Criterion (AIC), introduced by Akaike (1973),

    AIC = n log(Residual SS / n) + n log(2π) + n + 2(p + 2),    (2.2)

where the function log refers to natural logs, is such a measure, and it estimates the information lost in approximating the true model by a candidate model. It is clear from (2.2) that minimizing AIC achieves the goal of balancing strength of fit with simplicity, and because of the 2(p + 2) term in the criterion this will result in the choice of similar models as when minimizing Cp. It is well known that AIC has a tendency to lead to overfitting, particularly in small samples. That is, the penalty term in AIC designed to guard against too complicated a model is not strong enough. A modified version of AIC that helps address this problem is the corrected AIC,

    AICc = AIC + 2(p + 2)(p + 3) / (n − p − 3)    (2.3)
(Hurvich and Tsai, 1989). Equation (2.3) shows that (especially for small samples) models with fewer parameters will be more strongly preferred when minimizing AICc than when minimizing AIC, providing stronger protection against overfitting. In large samples, the two criteria are virtually identical, but in small samples, or when considering models with a large number of parameters, AICc is the better choice. This suggests the following model selection rule:
1 Choose the model that minimizes AICc. In case of tied AICc values, the simplest model (smallest p) would be chosen. In these data, this rule implies choosing p = 3, although the AICc value for p = 4 is virtually identical to that of p = 3. Note that the overall level of AICc values is not meaningful, and should not be compared to Cp values or to AICc values for other data sets; it is only the value for a model for a given data set relative to the values of others for that data set that matters.
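AIC and AICc as given in (2.2) and (2.3) are easy to compute from a fitted lm() object, as in the sketch below (fit.all is assumed to be the full six-predictor fit from the earlier sketch). Because different programs drop different additive constants from AIC, the absolute level of these values will generally not match any particular printed output; as noted above, only differences between models for the same data set are meaningful.

aic.c <- function(fit) {
  n   <- length(residuals(fit))
  p   <- length(coef(fit)) - 1                                  # number of predictors
  rss <- sum(residuals(fit)^2)
  aic <- n * log(rss / n) + n * log(2 * pi) + n + 2 * (p + 2)   # equation (2.2)
  aic + 2 * (p + 2) * (p + 3) / (n - p - 3)                     # equation (2.3)
}

aic.c(fit.all)   # compare with aic.c() applied to smaller candidate models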
Cp, AIC, and AICc have the desirable property that they are efficient model selection criteria. This means that in the (realistic) situation where the set of candidate models does not include the “true” model (that is, a good model is just viewed as a useful approximation to reality), as the sample gets larger the error obtained in making predictions using the model chosen using these criteria becomes indistinguishable from the error obtained using the best possible model among all candidate models. That is, in this large-sample predictive sense, it is as if the best approximation was known to the data analyst. Another well-known criterion, the Bayesian Information Criterion (BIC) [which substitutes log n for 2 in (2.2)], does not have this property, but is instead a consistent criterion. Such a criterion has the property that if the “true” model is in fact among the candidate models, the criterion will select that model with probability approaching 1 as the sample size increases. Thus, BIC is a more natural criterion to use if the goal is to identify the “true” predictors with nonzero slopes (which of course presumes that there are such things as “true” predictors in a “true” model). BIC will generally choose simpler models than AIC because of its stronger penalty (log n > 2 for n ≥ 8), and a version of BIC that adjusts for small samples as in (2.3) leads to even simpler models. This supports the notion that from a predictive point of view including a few unnecessary predictors (overfitting) is far less damaging than is omitting necessary predictors (underfitting).
A final way of comparing models is from a directly predictive point of view. Since a rough 95% prediction interval is ŷ ± 2σ̂, a useful model from a predictive point of view is one with a small standard error of the estimate σ̂, suggesting choosing a model that has small σ̂ while still being as simple as possible. That is,
1 Increase the number of predictors until σ̂ levels off. For these data (S in the output refers to σ̂), this implies choosing p = 3 or p = 4.
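The rough interval described above is easy to compare with the exact prediction interval that R produces; in this sketch, fit.all is the assumed full-model fit from earlier and newhouse is an assumed one-row data frame of predictor values for a new house.

sigma.hat <- summary(fit.all)$sigma                 # standard error of the estimate
yhat <- predict(fit.all, newdata = newhouse)
c(yhat - 2 * sigma.hat, yhat + 2 * sigma.hat)       # rough prediction interval
predict(fit.all, newdata = newhouse,
        interval = "prediction", level = 0.95)      # exact interval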
Taken together, all of these rules imply that the appropriate set of models to consider are those with two, three, or four predictors. Typically, the strongest model of each size (which will have the highest R², highest adjusted R², lowest Cp, lowest AICc, and lowest S, so there is no controversy as to which one is strongest) is examined. The output on pages 31–32 provides summaries for the top three models of each size, in case there are reasons to examine a second- or third-best model (if, for example, a predictor in the best model is difficult or expensive to measure), but here we focus on the best model of each size. First, here is output for the best four-predictor model.
Coefficients:
               Estimate  Std.Error  t value  Pr(>|t|)    VIF
(Intercept)  -6.852e+06  3.701e+06   -1.852    0.0678          .
Bedrooms     -1.207e+04  9.212e+03   -1.310    0.1940   1.252
Bathrooms     5.303e+04  1.275e+04    4.160  7.94e-05   1.374  ***
Living.area   6.828e+01  1.460e+01    4.676  1.17e-05   1.417  ***
Year.built    3.608e+03  1.898e+03    1.901    0.0609   1.187  .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 46890 on 80 degrees of freedom
Multiple R-squared: 0.5044, Adjusted R-squared: 0.4796
F-statistic: 20.35 on 4 and 80 DF, p-value: 1.356e-11
The t-statistic for the number of bedrooms suggests very little evidence that it adds anything useful given the other predictors in the model, so we consider now the best three-predictor model. This happens to be the best four-predictor model with the one statistically insignificant predictor omitted, but this does not have to be the case.
Coefficients:
               Estimate  Std.Error  t value  Pr(>|t|)    VIF
(Intercept)  -7.653e+06  3.666e+06   -2.087  0.039988          *
Bathrooms     5.223e+04  1.279e+04    4.084  0.000103   1.371  ***
Living.area   6.097e+01  1.355e+01    4.498  2.26e-05   1.210  ***
Year.built    4.001e+03  1.883e+03    2.125  0.036632   1.158  *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 47090 on 81 degrees of freedom
Multiple R-squared: 0.4937, Adjusted R-squared: 0.475
F-statistic: 26.33 on 3 and 81 DF, p-value: 5.489e-12
Each of the predictors is statistically significant at a .05 level, and this model recovers virtually all of the available fit (R² = 49.4%, while the R² using all six predictors is 50.6%), so this seems to be a reasonable model choice. The estimated slope coefficients are very similar to those from the model using all of the predictors (which is not surprising given the low collinearity in the data), so the interpretations of the estimated coefficients on page 17 still hold to a large extent. A plot of the residuals versus the fitted values and a normal plot of the residuals (Figure 2.2) look fine, and similar to those for the model using all six predictors in Figure 1.5; plots of the residuals versus each of the predictors in the model are similar to those in Figure 1.6, so they are not repeated here.
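Plots like those in Figure 2.2 can be produced with base R graphics, as in the sketch below; fit3 denotes the three-predictor fit, and the data frame and response names are assumptions as before.

fit3 <- lm(Sale.price ~ Bathrooms + Living.area + Year.built, data = homeprices)

par(mfrow = c(1, 2))
plot(fitted(fit3), residuals(fit3),
     xlab = "Fitted values", ylab = "Residuals")   # (a) residuals versus fitted values
abline(h = 0, lty = 2)
qqnorm(residuals(fit3))                            # (b) normal plot of the residuals
qqline(residuals(fit3))
par(mfrow = c(1, 1))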
Once a “best” model is chosen, it is tempting to use the usual inference tools (such as t-tests and F-tests) to try to explain the process being studied. Unfortunately, doing this while ignoring the model selection process can lead to problems. Since the model was chosen to be best (in some sense), it will tend to appear stronger than would be expected just by random chance. Conducting inference based on the chosen model as if it was the only one examined ignores an additional source of variability, that of actually choosing the model (model selection based on a different sample from the same population could very well lead to a different chosen “best” model). This is termed model selection uncertainty. As a result of ignoring model selection uncertainty, confidence intervals can have lower coverage than the nominal value, hypothesis tests can reject the null too often, and prediction intervals can be too narrow for their nominal coverage.
FIGURE 2.2: Residual plots for the home price data using the best three‐predictor model. (a) Plot of residuals versus fitted values. (b) Normal plot of the residuals.
Identifying and correcting for this uncertainty is a difficult problem and an active area of research, and it will be discussed further in Chapter 14. There are, however, a few things practitioners can do. First, it is not appropriate to emphasize too strongly the single “best” model; any model that has similar criteria values (such as AICc or Cp) to those of the best model should be recognized as one that could easily have been chosen as best based on a different sample from the same population, and any implications of such a model should be viewed as being as valid as those from the best model. Further, one should expect that p-values for the predictors included in a chosen model are potentially smaller than they should be, so taking a conservative attitude regarding statistical significance is appropriate. Thus, for the chosen three-predictor model summarized on page 35, number of bathrooms and living area are likely to correspond to real effects, but the reality of the year-built effect is more questionable.
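One way to get a feel for model selection uncertainty in practice is to repeat the best subsets search on resamples of the data and see how often each model is chosen as best. The sketch below (not part of the original analysis; names are assumptions as before) picks the model minimizing Cp on each of 200 bootstrap resamples and tabulates the winners.

library(leaps)

set.seed(1)
chosen <- replicate(200, {
  boot <- homeprices[sample(nrow(homeprices), replace = TRUE), ]
  subs <- regsubsets(Sale.price ~ Bedrooms + Bathrooms + Living.area +
                       Lot.size + Year.built + Property.tax,
                     data = boot, nvmax = 6)
  ss   <- summary(subs)
  best <- which.min(ss$cp)                  # model size minimizing Cp
  paste(colnames(ss$which)[-1][ss$which[best, -1]], collapse = " + ")
})
sort(table(chosen), decreasing = TRUE)      # how often each model is chosen as best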
There is a straightforward way to get a sense of the predictive power of a chosen model if enough data are available. This can be evaluated by holding out some data from the analysis (a holdout or validation sample), applying the selected model from the original data to the holdout sample (based on the previously estimated parameters, not estimates based on the new data), and then examining the predictive performance of the model. If, for example, the standard deviation of the errors from this prediction is not very different from the standard error of the estimate σ̂ in the original regression, chances are making inferences based on the chosen model will not be misleading. Similarly, if a (say) 95% prediction interval does not include roughly 95% of the new observations, that indicates poorer-than-expected predictive performance on new data.
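A minimal sketch of such a validation, assuming the original data are in homeprices and the held-out houses (with their observed sale prices) are in a data frame called holdout:

fit3 <- lm(Sale.price ~ Bathrooms + Living.area + Year.built, data = homeprices)

pred   <- predict(fit3, newdata = holdout, interval = "prediction")
errors <- holdout$Sale.price - pred[, "fit"]

mean(errors)               # average predictive error; a large value suggests forecasting bias
sd(errors)                 # compare with summary(fit3)$sigma from the original fit
mean(holdout$Sale.price >= pred[, "lwr"] &
     holdout$Sale.price <= pred[, "upr"])   # proportion covered by the prediction intervals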
FIGURE 2.3: Plot of observed versus predicted house sale price values of validation sample, with pointwise prediction interval limits superimposed. The dotted line corresponds to equality of observed values and predictions.
Figure 2.3 illustrates a validation of the three-predictor housing price model on a holdout sample of houses. The figure is a plot of the observed versus predicted prices, with pointwise prediction interval limits superimposed. The intervals contain roughly the proportion of the observed prices that would be expected, and the average predictive error on the new houses is small relative to the average observed price, not suggesting the presence of any forecasting bias in the model. Two of the houses, however, have sale prices well below what would have been expected, and this is reflected in a much higher standard deviation of the predictive errors than the standard error of the estimate from the fitted regression. If the two outlying houses are omitted, the standard deviation of the predictive errors is much smaller, suggesting that while the fitted model's predictive performance for most houses is in line with its performance on the original sample, there are indications that it might not predict well for the occasional unusual house.
If validating the model on new data this way is not possible, a simple adjustment that is helpful is to estimate the variance of the errors as

    σ̃² = Σ (y_i − ŷ_i)² / (n − p_max − 1),    (2.4)
where each fitted value ŷ_i is based on the chosen “best” model, and p_max is the number of predictors in the most complex model examined, in the sense of most predictors (Ye, 1998). Clearly, if very complex models are included among the set of candidate models, σ̃ can be much larger than the standard error of the estimate from the chosen model, with correspondingly wider prediction intervals. This reinforces the benefit of limiting the set of candidate models (and the complexity of the models in that set) from the start. In this case p_max = 6, compared to p = 3 for the chosen model, so the effect is not that pronounced.
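For the chosen three-predictor model, the adjustment in (2.4) amounts to dividing its residual sum of squares by n − 6 − 1 = 78 rather than n − 3 − 1 = 81; a short sketch (fit3 as in the earlier sketches):

n <- nrow(homeprices)
sqrt(sum(residuals(fit3)^2) / (n - 6 - 1))   # compare with summary(fit3)$sigma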
The adjustment of the denominator in (2.4) to account for model selection uncertainty is just a part of the more general problem that standard degrees of freedom calculations are no longer valid when multiple models are being compared to each other, as in the comparison of all models with a given number of predictors in best subsets. This affects other uses of those degrees of freedom, including the calculation of information measures like Cp, AIC, AICc, and BIC, and thus any decisions regarding model choice. This problem becomes progressively more serious as the number of potential predictors increases and is the subject of active research. This will be discussed further in Chapter 14.