
2.2.1 USING HYPOTHESIS TESTS TO COMPARE MODELS


Determining whether individual regression coefficients are statistically significant (as discussed in Section 1.3.3) is an obvious first step in deciding whether a model is overspecified. A predictor that does not add significantly to model fit should have an estimated slope coefficient that is not significantly different from 0, and is thus identified by a small t-statistic. So, for example, in the analysis of home prices in Section 1.4, the regression output on page 17 suggests removing number of bedrooms, lot size, and property taxes from the model, as all three have insignificant t-values.

Recall that t-tests can only assess the contribution of a predictor given all of the others in the model. When predictors are correlated with each other, t-tests can give misleading indications of the importance of a predictor. Consider a two-predictor situation where the predictors are each highly correlated with the target variable, and are also highly correlated with each other. In this situation, it is likely that the t-statistic for each predictor will be relatively small. This is not an inappropriate result, since given one predictor the other adds little (being highly correlated with each other, one is redundant in the presence of the other). This means that the t-statistics are not effective in identifying important predictors when the two variables are highly correlated.
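
This phenomenon is easy to see with simulated data. The following R sketch (the data, variable names, and seed are invented purely for illustration, and are not from the book) fits a two-predictor model in which the predictors are nearly collinear; the individual t-statistics tend to be small even though the overall F-statistic is large.

## Simulated illustration of highly correlated predictors (not data from the book)
set.seed(2)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.02)              # x2 is nearly a copy of x1
y  <- 2 + x1 + x2 + rnorm(n)                # y is strongly related to both predictors
fit <- summary(lm(y ~ x1 + x2))
fit$coefficients[c("x1", "x2"), "t value"]  # slope t-statistics: typically unimpressive
fit$fstatistic                              # overall F-statistic: large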

The t-tests and F-test of Section 1.3.3 are special cases of a general formulation that is useful for comparing certain classes of models. It might be the case that a simpler version of a candidate model (a subset model) is adequate to fit the data. For example, consider taking a sample of college students and determining their college grade point average (GPA), Scholastic Aptitude Test (SAT) evidence-based reading and writing score (Reading), and SAT math score (Math). The full regression model to fit to these data is

$$\mathit{GPA} = \beta_0 + \beta_1 \mathit{Reading} + \beta_2 \mathit{Math} + \varepsilon.$$
Instead of considering reading and math scores separately, we could consider whether GPA can be predicted by one variable: total SAT score (Total), which is the sum of Reading and Math. This subset model is

$$\mathit{GPA} = \beta_0 + \beta_1 \mathit{Total} + \varepsilon,$$

with $\beta_1 = \beta_2$ (under this condition the full model collapses to $\beta_0 + \beta_1(\mathit{Reading} + \mathit{Math}) + \varepsilon = \beta_0 + \beta_1 \mathit{Total} + \varepsilon$). This equality condition is called a linear restriction, because it defines a linear condition on the parameters of the regression model (that is, it only involves additions, subtractions, and equalities of coefficients and constants).

The question about whether the total SAT score is sufficient to predict grade point average can be stated using a hypothesis test about this linear restriction. As always, the null hypothesis gets the benefit of the doubt; in this case, that is the simpler restricted (subset) model that the sum of Reading and Math is adequate, since it says that only one predictor is needed, rather than two. The alternative hypothesis is the unrestricted full model (with no conditions on $\beta_1$ and $\beta_2$). That is,

$$H_0\colon \beta_1 = \beta_2$$

versus

$$H_a\colon \beta_1 \neq \beta_2.$$

These hypotheses are tested using a partial F-test. The F-statistic has the form

$$F = \frac{(\text{Residual SS}_{subset} - \text{Residual SS}_{full})/d}{\text{Residual SS}_{full}/(n - p - 1)}, \qquad (2.1)$$

where n is the sample size, p is the number of predictors in the full model, and d is the difference between the number of parameters in the full model and the number of parameters in the subset model. This statistic is compared to an F distribution on (d, n − p − 1) degrees of freedom. So, for example, for this GPA/SAT example, p = 2 and d = 1, so the observed F-statistic would be compared to an F distribution on (1, n − 3) degrees of freedom. Some statistical packages allow specification of the full and subset models and will calculate the F-test, but others do not, and the statistic has to be calculated manually based on the fits of the two models.
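
In R, for example, two nested fits can be compared directly with anova(), which reports this partial F-statistic. The sketch below uses a simulated data frame; the variable names GPA, Reading, and Math follow the example, but the numbers are invented and are not the data analyzed in the book.

## Partial F-test for the GPA/SAT example (data simulated for illustration only)
set.seed(1)
n   <- 100
sat <- data.frame(Reading = rnorm(n, 600, 60),
                  Math    = rnorm(n, 620, 60))
sat$GPA   <- 1 + 0.002 * sat$Reading + 0.0025 * sat$Math + rnorm(n, 0, 0.3)
sat$Total <- sat$Reading + sat$Math
full_model   <- lm(GPA ~ Reading + Math, data = sat)   # unrestricted (full) model
subset_model <- lm(GPA ~ Total, data = sat)            # restricted model (beta1 = beta2)
anova(subset_model, full_model)                        # partial F-test on (1, n - 3) df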

An alternative form for the F-statistic above might make clearer what is going on here:

$$F = \frac{(R^2_{full} - R^2_{subset})/d}{(1 - R^2_{full})/(n - p - 1)}.$$
That is, if the strength of the fit of the full model (measured by $R^2$) isn't much larger than that of the subset model, the F-statistic is small, and we do not reject the subset model; if, on the other hand, the difference in $R^2$ values is large (implying that the fit of the full model is noticeably stronger), we do reject the subset model in favor of the full model.
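
Continuing the sketch above, the same statistic can be computed by hand from the two R^2 values, which is one way to carry out the test in a package that does not compare nested models directly.

## The same F-statistic computed from the two R^2 values (continues the sketch above)
R2_full   <- summary(full_model)$r.squared
R2_subset <- summary(subset_model)$r.squared
p <- 2                                   # predictors in the full model
d <- 1                                   # one restriction: beta1 = beta2
F_stat <- ((R2_full - R2_subset) / d) / ((1 - R2_full) / (n - p - 1))
F_stat                                   # agrees with the F value reported by anova()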

The F-statistic to test the overall significance of the regression is a special case of this construction (with restriction $\beta_1 = \cdots = \beta_p = 0$), as is each of the individual t-statistics that test the significance of any variable (with restriction $\beta_j = 0$). In the latter case $F = t_j^2$.
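
As a quick check of this last relationship in the simulated example above, the partial F-statistic for dropping a single predictor equals the square of that predictor's t-statistic in the full model.

## F = t^2 for a single zero restriction (continues the sketch above)
no_math <- lm(GPA ~ Reading, data = sat)               # full model with Math removed
anova(no_math, full_model)$F[2]                        # partial F for the restriction beta2 = 0
summary(full_model)$coefficients["Math", "t value"]^2  # squared t-statistic: the same value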
