2.5 Summary
In this chapter, we have discussed various issues related to model building and model selection. Such methods are important because both underfitting (omitting variables that are needed) and overfitting (including variables that are not needed) lead to problems in interpreting the results of regression analyses and in making predictions from fitted regression models. Hypothesis tests provide one tool for model building through formal comparisons of models. If one model is a special case of another, defined through a linear restriction, then a partial F-statistic provides a test of whether the more complex model provides significantly more predictive power than does the simpler one. One important example of a partial F-test is the standard t-test for the significance of a slope coefficient. Another important use of partial F-tests is in constructing models for data whose observations fall into two distinct subgroups; the candidate models allow for a common (pooled) relationship over the groups, constant shift relationships that differ in level but not in slope, or completely distinct relationships across the groups.
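As a concrete illustration, such nested comparisons can be carried out in R with the built-in anova() function, which reports the partial F-test when given two nested fitted models. In the sketch below the response y, predictor x, two-level grouping factor g, and data frame dat are hypothetical placeholders, not variables from any particular example.

pooled   <- lm(y ~ x, data = dat)        # common (pooled) relationship for both groups
shift    <- lm(y ~ x + g, data = dat)    # constant shift: same slope, different levels
separate <- lm(y ~ x * g, data = dat)    # completely different relationships across groups

# Each call compares the restricted model to the more complex one via a partial F-test.
anova(pooled, shift)      # do the group levels differ, given a common slope?
anova(shift, separate)    # do the slopes also differ between the groups?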
While useful, hypothesis tests do not provide a complete tool for model building. The problem is that a hypothesis test does not necessarily answer the question that is of primary importance to a data analyst. The t-test for a particular slope coefficient tests whether a variable adds predictive power given the other variables in the model; if the predictors are collinear, it could be that none of them adds anything given the others, even though each is very important on its own. A related problem is that collinearity can lead to great instability in regression coefficients and t-tests, making results difficult to interpret. Hypothesis tests also do not distinguish statistical significance (whether or not a true coefficient is exactly zero) from practical importance (whether or not a model allows an analyst to make important discoveries in the context of how the model is used in practice).
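One common way to see this effect in practice is to examine variance inflation factors alongside the individual t-tests. The short sketch below is illustrative only: the response y and predictors x1, x2, and x3 in the data frame dat are hypothetical, and the vif() function comes from the widely used car package rather than base R.

library(car)
fit <- lm(y ~ x1 + x2 + x3, data = dat)
summary(fit)   # individual t-tests can all look unimportant when predictors overlap ...
vif(fit)       # ... while large variance inflation factors point to the collinearity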
These considerations open up a broader spectrum of tools for model building than just hypothesis tests. Best subsets regression algorithms allow for the quick summarization of hundreds or even thousands of potential regression models. These summaries are guided by the principle of parsimony, which implies a tradeoff between strength of fit and simplicity: a model should only be as complex as it needs to be. Measures such as the adjusted R², Cp, and AICc explicitly provide this tradeoff, and are useful tools in helping to decide when a simpler model is preferred over a more complicated one. An effective model selection strategy uses these measures, as well as hypothesis tests and estimated prediction intervals, to suggest a set of potential “best” models, which can then be considered further. In doing so, it is important to remember that the variability that comes from model selection itself (model selection uncertainty) means that several models may well provide equally valid descriptions of the underlying population process. One way of assessing the effects of this type of uncertainty is to keep some of the observed data aside as a holdout sample, and then validate the chosen fitted model(s) on that held-out data.
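To make the strategy concrete, the following sketch performs a best subsets search with the regsubsets() function from the leaps package and sets aside a holdout sample for validation. The response y, predictors x1 through x4, the 25% holdout fraction, and the final chosen model are all hypothetical placeholders, not a recommendation for any particular data set.

library(leaps)

set.seed(1)                                         # keep roughly 25% of the data aside
hold  <- sample(nrow(dat), floor(0.25 * nrow(dat)))
train <- dat[-hold, ]
test  <- dat[hold, ]

subsets <- regsubsets(y ~ x1 + x2 + x3 + x4, data = train)
summary(subsets)$adjr2                              # adjusted R² for the best model of each size
summary(subsets)$cp                                 # Cp for the best model of each size

chosen <- lm(y ~ x1 + x3, data = train)             # one candidate “best” model
mean((test$y - predict(chosen, newdata = test))^2)  # prediction error on the held-out data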
A related point increasingly raised in recent years concerns replicability, or the lack thereof: the alarming tendency for supposedly established relationships not to reappear as strongly (or at all) when new data are examined. Much of this phenomenon comes from quite valid attempts to find appropriate representations of relationships in a complicated world (including those discussed here and in the next three chapters), but that doesn't alter the simple fact that interacting with data to make models more appropriate tends to make relationships look stronger than they actually are. Replication and validation of models (and of the entire model building process) should be a fundamental part of any exploration of a random process. Examining a problem further and discovering that a previously believed relationship does not replicate is not a failure of the scientific process; in fact, it is part of the essence of it.
A valid question regarding the logistics of performing model selection remains: what is the “correct” order in which to perform the different steps of model selection, assumption checking, and so on? Do you omit unusual observations first, and then try to determine the best model? Or do you work on variable selection, and then check diagnostics based on your chosen model? Unfortunately, there is no clear answer to this question, as neither order is guaranteed to work. The best answer is to try it both ways and see what happens; chances are results will be similar, and if they are not this could reveal alternative models that are equally valid and reasonable. What is certainly true is that if the data set is changed in any way, whether by omitting observations, taking logs, or anything else, model selection must be explored again, as the results previously obtained might not be appropriate for the new form of the data.
Although best subsets algorithms and modern computing power have made automatic model selection more feasible than it once was, they are still limited computationally to at most a few dozen predictors. In recent years, it has become more common for a data analyst to be faced with data sets containing hundreds or thousands of predictors, making such methods infeasible. Recent work has focused on alternatives to least squares called regularization methods, which can to some extent be viewed as implicit variable selectors, and which remain feasible for very large numbers of predictors. These methods are discussed further in Chapter 14.
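As a brief preview of that discussion, one popular implementation of such methods in R is the glmnet package. The sketch below assumes a hypothetical numeric predictor matrix X with many columns and a response vector y; it fits a lasso, one particular regularization method, with the penalty chosen by cross-validation.

library(glmnet)
cvfit <- cv.glmnet(X, y, alpha = 1)   # alpha = 1 gives the lasso penalty, tuned by cross-validation
coef(cvfit, s = "lambda.min")         # many coefficients are set exactly to zero, so the
                                      # fitted model acts as an implicit variable selector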