rel="nofollow" href="#ulink_edba1f88-7c52-5535-a136-ecab1a4b9b9e">(2.4) to account for model selection uncertainty is just a part of the more general problem that standard degrees of freedom calculations are no longer valid when multiple models are being compared to each other, as in the comparison of all models with a given number of predictors in best subsets. This affects other uses of those degrees of freedom, including the calculation of information measures like $R_a^2$, $C_p$, $AIC$, and $AIC_C$, and thus any decisions regarding model choice. This problem becomes progressively more serious as the number of potential predictors increases and is the subject of active research. This will be discussed further in Chapter 14.
2.4 Indicator Variables and Modeling Interactions
It is not unusual for the observations in a sample to fall into two distinct subgroups; for example, people are either male or female. It might be that group membership has no relationship with the target variable (given other predictors); such a pooled model ignores the grouping and pools the two groups together.
On the other hand, it is clearly possible that group membership is predictive for the target variable (for example, expected salaries differing for men and women given other control variables could indicate gender discrimination). Such effects can be explored easily using an indicator variable, which takes on the value $0$ for one group and $1$ for the other (such variables are sometimes called dummy variables or $0/1$ variables). The model takes the form
$$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_{p-1} x_{p-1,i} + \beta_p I_i + \varepsilon_i,$$
where $I$ is an indicator variable with value $1$ if the observation is a member of group $G$ and $0$ otherwise. The usual interpretation of the slope still applies: $\beta_p$ is the expected change in $y$ associated with a one‐unit change in $I$ holding all else fixed. Since $I$ only takes on the values $0$ or $1$, this is equivalent to saying that the expected target is $\beta_p$ higher for group members ($I = 1$) than nonmembers ($I = 0$), holding all else fixed. This has the appealing interpretation of fitting a constant shift model, where the regression relationships for group members and nonmembers are identical, other than being shifted up or down; that is,
$$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_{p-1} x_{p-1,i} + \varepsilon_i$$
for nonmembers and
$$y_i = (\beta_0 + \beta_p) + \beta_1 x_{1i} + \cdots + \beta_{p-1} x_{p-1,i} + \varepsilon_i$$
for members. The $t$-test for whether $\beta_p = 0$ is thus a test of whether a constant shift model (two parallel regression lines, planes, or hyperplanes) is a significant improvement over a pooled model (one common regression line, plane, or hyperplane).
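As a rough illustration of the constant shift model, the following sketch fits a regression with one numerical predictor and one 0/1 indicator by ordinary least squares and computes the usual $t$-statistic for the indicator's coefficient. The data are simulated here purely for illustration, and the names `x1`, `group`, `beta_hat`, and `t_shift` are assumptions introduced for this example rather than anything defined in the text.

```python
import numpy as np

# Simulated data (an assumption for illustration; not data from the text).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)               # one numerical predictor
group = rng.integers(0, 2, size=n)    # indicator: 1 = group member, 0 = nonmember
y = 1.0 + 2.0 * x1 + 1.5 * group + rng.normal(scale=0.5, size=n)

# Constant shift model: intercept, numerical predictor, indicator.
X = np.column_stack([np.ones(n), x1, group])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Standard errors from the usual least squares formulas.
resid = y - X @ beta_hat
df_resid = n - X.shape[1]
sigma2_hat = resid @ resid / df_resid
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)

# t-statistic for the indicator coefficient: constant shift versus pooled model.
t_shift = beta_hat[2] / np.sqrt(cov_beta[2, 2])

print("estimated shift for members:", beta_hat[2])
print("t-statistic for the shift:", t_shift)
```

The two fitted lines share the slope `beta_hat[1]` and differ only in their intercepts, `beta_hat[0]` for nonmembers and `beta_hat[0] + beta_hat[2]` for members, which is what makes them parallel.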
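Would two different regression relationships be better still? Say there is only one numerical predictor $x_1$; the full model that allows for two different regression lines is
$$y_i = \beta_{00} + \beta_{10} x_{1i} + \varepsilon_i$$
for nonmembers ($I = 0$), and
$$y_i = \beta_{01} + \beta_{11} x_{1i} + \varepsilon_i$$
for members ($I = 1$). The pooled model and the constant shift model can be made to be special cases of the full model, by creating a new variable that is the product of $x_1$ and $I$. A regression model that includes this variable,
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 I_i + \beta_3 x_{1i} I_i + \varepsilon_i,$$
corresponds to the two different regression lines
$$y_i = \beta_0 + \beta_1 x_{1i} + \varepsilon_i$$
for nonmembers (since $I = 0$), implying $\beta_{00} = \beta_0$ and $\beta_{10} = \beta_1$ above, and
$$y_i = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_{1i} + \varepsilon_i$$
for members (since $I = 1$), implying $\beta_{01} = \beta_0 + \beta_2$ and $\beta_{11} = \beta_1 + \beta_3$ above.
The $t$-test for the slope of the product variable ($\beta_3 = 0$) is a test of whether the full model (two different regression lines) is significantly better than the constant shift model (two parallel regression lines); that is, it is a test of parallelism. The restriction $\beta_2 = \beta_3 = 0$ defines the pooled model as a special case of the full model, so the partial $F$-statistic based on (2.1),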
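The correspondence between the single model with a product variable and the two separate regression lines can be checked numerically. The sketch below is a minimal illustration on simulated data, not data from the text. It fits the model containing the predictor, the indicator, and their product, then fits a separate line to each group, and compares the implied intercepts and slopes. The helper `ols` and all variable names are hypothetical and introduced only for this example.

```python
import numpy as np

def ols(X, y):
    """Least squares coefficients for design matrix X (hypothetical helper)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Simulated data (an assumption for illustration): two groups, different lines.
rng = np.random.default_rng(1)
n = 120
x1 = rng.uniform(0, 10, size=n)
ind = (np.arange(n) % 2).astype(float)   # alternate groups so both are populated
noise = rng.normal(scale=0.3, size=n)
y = np.where(ind == 1, 3.0 + 1.2 * x1, 1.0 + 0.5 * x1) + noise

# Full model: intercept, predictor, indicator, and product of the two.
X_full = np.column_stack([np.ones(n), x1, ind, x1 * ind])
b0, b1, b2, b3 = ols(X_full, y)

# Separate least squares lines for nonmembers and members.
non = ind == 0
mem = ind == 1
line_non = ols(np.column_stack([np.ones(non.sum()), x1[non]]), y[non])
line_mem = ols(np.column_stack([np.ones(mem.sum()), x1[mem]]), y[mem])

print("nonmembers:", (b0, b1), "vs separate fit", tuple(line_non))
print("members:   ", (b0 + b2, b1 + b3), "vs separate fit", tuple(line_mem))
```

Up to floating-point error, the intercept and slope implied by the full model for each group match those of the line fitted to that group alone. The single-model form is still useful because it puts the shift and the difference in slopes into individual coefficients that can be tested directly.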