Handbook of Regression Analysis With Applications in R, by Samprit Chatterjee

2.4 Indicator Variables and Modeling Interactions


It is not unusual for the observations in a sample to fall into two distinct subgroups; for example, people are either male or female. It might be that group membership has no relationship with the target variable (given other predictors); in that case, a pooled model, which ignores the grouping and pools the two groups together, is appropriate.

On the other hand, it is clearly possible that group membership is predictive for the target variable (for example, expected salaries differing for men and women given other control variables could indicate gender discrimination). Such effects can be explored easily using an indicator variable, which takes on the value \(1\) for one group and \(0\) for the other (such variables are sometimes called dummy variables or \(0/1\) variables). The model takes the form

\[
y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_{p-1} x_{(p-1)i} + \beta_p d_i + \varepsilon_i,
\]

where \(d_i\) is an indicator variable with value \(1\) if the observation is a member of the group and \(0\) otherwise. The usual interpretation of the slope still applies: \(\beta_p\) is the expected change in \(y\) associated with a one-unit change in \(d\), holding all else fixed. Since \(d\) only takes on the values \(0\) or \(1\), this is equivalent to saying that the expected target is \(\beta_p\) higher for group members (\(d = 1\)) than nonmembers (\(d = 0\)), holding all else fixed. This has the appealing interpretation of fitting a constant shift model, where the regression relationships for group members and nonmembers are identical, other than being shifted up or down; that is,

\[
E(y) = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}
\]

for nonmembers and

\[
E(y) = (\beta_0 + \beta_p) + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1}
\]

for members. The \(t\)-test for whether \(\beta_p = 0\) is thus a test of whether a constant shift model (two parallel regression lines, planes, or hyperplanes) is a significant improvement over a pooled model (one common regression line, plane, or hyperplane).

Would two different regression relationships be better still? Say there is only one numerical predictor \(x\); the full model that allows for two different regression lines is

\[
y_i = \beta_{10} + \beta_{11} x_i + \varepsilon_i
\]

for nonmembers (\(d_i = 0\)), and

\[
y_i = \beta_{20} + \beta_{21} x_i + \varepsilon_i
\]

for members (\(d_i = 1\)). The pooled model and the constant shift model can be made to be special cases of the full model by creating a new variable that is the product of \(x\) and \(d\). A regression model that includes this variable,

\[
y_i = \beta_0 + \beta_1 x_i + \beta_2 d_i + \beta_3 x_i d_i + \varepsilon_i,
\]

corresponds to the two different regression lines

\[
E(y) = \beta_0 + \beta_1 x
\]

for nonmembers (since \(d = 0\)), implying \(\beta_{10} = \beta_0\) and \(\beta_{11} = \beta_1\) above, and

\[
E(y) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x
\]

for members (since \(d = 1\)), implying \(\beta_{20} = \beta_0 + \beta_2\) and \(\beta_{21} = \beta_1 + \beta_3\) above.
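The full model can be sketched the same way: adding the product variable to the design matrix lets a single least squares fit recover both lines. A hedged illustration in Python with numpy (simulated data; the two true lines below are hypothetical):

```python
import numpy as np

# Sketch: fit the full model y = b0 + b1*x + b2*d + b3*x*d by least squares
# and read off the two implied regression lines. Values are hypothetical.
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 10, n)
d = np.repeat([0, 1], n // 2)
# True lines: nonmembers y = 1 + 2x, members y = 4 + 0.5x
y = np.where(d == 0, 1.0 + 2.0 * x, 4.0 + 0.5 * x) + rng.normal(0, 0.5, n)

X = np.column_stack([np.ones(n), x, d, x * d])   # includes the product x*d
b0, b1, b2, b3 = np.linalg.lstsq(X, y, rcond=None)[0]

# Implied lines: (b0, b1) for nonmembers; (b0 + b2, b1 + b3) for members.
print(f"nonmembers: {b0:.2f} + {b1:.2f}x")
print(f"members:    {b0 + b2:.2f} + {b1 + b3:.2f}x")
```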

The \(t\)-test for the slope of the product variable (\(\beta_3 = 0\)) is a test of whether the full model (two different regression lines) is significantly better than the constant shift model (two parallel regression lines); that is, it is a test of parallelism. The restriction \(\beta_2 = \beta_3 = 0\) defines the pooled model as a special case of the full model, so the partial \(F\)-statistic based on (2.1),

\[
F = \frac{\left(\mathrm{RSS}_{\text{pooled}} - \mathrm{RSS}_{\text{full}}\right)/2}{\mathrm{RSS}_{\text{full}}/(n - 4)},
\]

on \((2, n - 4)\) degrees of freedom, provides a test comparing the pooled model to the full model. This test is often called the Chow test (Chow, 1960) in the economics literature.
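The partial \(F\)-statistic only needs the residual sums of squares of the two nested fits. A sketch in Python with numpy (simulated data with deliberately different group lines, so the test should strongly reject pooling; all names and values are illustrative):

```python
import numpy as np

# Sketch of the partial F (Chow) test: compare the pooled model (intercept
# and x only) to the full model (adds d and x*d), using
# F = [(RSS_pooled - RSS_full)/2] / [RSS_full/(n - 4)].
rng = np.random.default_rng(2)
n = 120
x = rng.uniform(0, 10, n)
d = np.repeat([0, 1], n // 2)
y = np.where(d == 0, 1.0 + 2.0 * x, 6.0 + 1.0 * x) + rng.normal(0, 1.0, n)

def rss(X, y):
    """Residual sum of squares from a least squares fit."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return resid @ resid

ones = np.ones(n)
rss_pooled = rss(np.column_stack([ones, x]), y)
rss_full = rss(np.column_stack([ones, x, d, x * d]), y)

F = ((rss_pooled - rss_full) / 2) / (rss_full / (n - 4))
print(f"Chow F = {F:.1f} on (2, {n - 4}) df")
```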

These constructions can be easily generalized to multiple predictors, with different variations of models obtainable. For example, a regression model with unequal slopes for some predictors and equal slopes for others is fit by including products of the indicator and the predictor for the predictors with different slopes, and omitting them for the predictors with equal slopes. Appropriate \(t\)- and \(F\)-tests can then be constructed to make particular comparisons of models.

A reasonable question to ask at this point is “Why bother to fit the full model? Isn't it just the same as fitting two separate regressions on the two groups?” The answer is no. The full model fit above assumes that the variance of the errors is the same (the constant variance assumption), while fitting two separate regressions allows the variances to be different. The fitted slope coefficients from the full model will, however, be identical to those from two separate fits. What is gained by analyzing the data this way is the comparison of versions of pooled, constant shift, and full models based on group membership, including different slopes for some variables and equal slopes for others, something that is not possible if separate regressions are fit to the two groups.
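The claim that the full model's coefficients match two separate per-group fits is easy to check numerically. A sketch in Python with numpy (simulated data; the full model pools the error variance, yet its implied intercepts and slopes agree exactly with the separate regressions):

```python
import numpy as np

# Sketch: the coefficients implied by the full interaction model match those
# from two separate per-group regressions. Data and values are hypothetical.
rng = np.random.default_rng(3)
n = 80
x = rng.uniform(0, 10, n)
d = np.repeat([0, 1], n // 2)
y = np.where(d == 0, 1.0 + 2.0 * x, 4.0 + 0.5 * x) + rng.normal(0, 0.7, n)

ones = np.ones(n)
X_full = np.column_stack([ones, x, d, x * d])
b0, b1, b2, b3 = np.linalg.lstsq(X_full, y, rcond=None)[0]

def fit_line(xg, yg):
    """Simple linear regression (intercept, slope) for one group."""
    return np.linalg.lstsq(np.column_stack([np.ones(len(xg)), xg]), yg,
                           rcond=None)[0]

a0, a1 = fit_line(x[d == 0], y[d == 0])   # nonmembers, fit separately
c0, c1 = fit_line(x[d == 1], y[d == 1])   # members, fit separately

print(np.allclose([a0, a1], [b0, b1]))            # True
print(np.allclose([c0, c1], [b0 + b2, b1 + b3]))  # True
```

What differs between the two approaches is the error variance estimate, which the full model pools across groups, not the fitted lines themselves.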

Another way of saying that the relationship between a predictor and the target is different for members of the two different groups is that there is an interaction effect between the predictor and group membership on the target. Social scientists would say that the grouping has a moderating effect on the relationship between the predictor and the target. The fact that in the case of a grouping variable, the interaction can be fit by multiplying the two variables together has led to a practice that is common in some fields: to try to represent any interaction between variables (that is, any situation where the relationship between a predictor and the target is different for different values of another predictor) by multiplying them together. Unfortunately, this is not a very reasonable way to think about interactions for numerical predictors, since there are many ways that the effect of one variable on the target can differ depending on the value of another that have nothing to do with product functions. See Section 15.6 for further discussion.

