with $\beta_1 = \beta_2$. This equality condition is called a linear restriction, because it defines a linear condition on the parameters of the regression model (that is, it only involves additions, subtractions, and equalities of coefficients and constants).
The question of whether the total SAT score is sufficient to predict grade point average can be stated using a hypothesis test about this linear restriction. As always, the null hypothesis gets the benefit of the doubt; in this case, that is the simpler restricted (subset) model that the sum of the Verbal and Math SAT scores is adequate, since it says that only one predictor is needed, rather than two. The alternative hypothesis is the unrestricted full model (with no conditions on the coefficients). That is,
$$
H_0\colon \beta_1 = \beta_2
$$
versus
$$
H_a\colon \beta_1 \neq \beta_2.
$$
These hypotheses are tested using a partial $F$-test. The $F$-statistic has the form
$$
F = \frac{(\text{Residual SS}_{\text{subset}} - \text{Residual SS}_{\text{full}})/d}{\text{Residual SS}_{\text{full}}/(n-p-1)},
$$
where $n$ is the sample size, $p$ is the number of predictors in the full model, and $d$ is the difference between the number of parameters in the full model and the number of parameters in the subset model. This statistic is compared to an $F$ distribution on $(d, n-p-1)$ degrees of freedom. So, for example, for this GPA/SAT example, $p = 2$ and $d = 1$ (three parameters in the full model versus two in the subset model), so the observed $F$-statistic would be compared to an $F$ distribution on $(1, n-3)$ degrees of freedom. Some statistical packages allow specification of the full and subset models and will calculate the partial $F$-test, but others do not, and the statistic has to be calculated manually based on the fits of the two models.
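When the partial $F$-test must be computed by hand, the calculation only requires the residual sums of squares from the two fits. The sketch below shows the arithmetic in Python with statsmodels, using simulated data in place of the GPA/SAT data; the variable names satv, satm, and gpa are hypothetical and the numbers are not from the book.

```python
# Manual partial F-test for the restriction beta1 = beta2 (total SAT score
# adequate), sketched on simulated data.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n = 100
satv = rng.normal(600, 70, n)                      # hypothetical SAT Verbal scores
satm = rng.normal(620, 70, n)                      # hypothetical SAT Math scores
gpa = 0.5 + 0.002 * satv + 0.003 * satm + rng.normal(0, 0.3, n)

# Full model: separate coefficients for the two predictors.
full = sm.OLS(gpa, sm.add_constant(np.column_stack([satv, satm]))).fit()

# Subset model: only the total score, i.e. the full model with beta1 = beta2.
subset = sm.OLS(gpa, sm.add_constant(satv + satm)).fit()

p = 2                                              # predictors in the full model
d = 1                                              # 3 parameters in full - 2 in subset
F = ((subset.ssr - full.ssr) / d) / (full.ssr / (n - p - 1))
p_value = stats.f.sf(F, d, n - p - 1)
print(F, p_value)

# statsmodels can also perform the comparison directly for nested fits:
print(full.compare_f_test(subset))                 # (F, p-value, d)
```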
An alternative form for the $F$-test above might make clearer what is going on here:
$$
F = \frac{(R^2_{\text{full}} - R^2_{\text{subset}})/d}{(1 - R^2_{\text{full}})/(n-p-1)}.
$$
That is, if the strength of the fit of the full model (measured by $R^2$) isn't much larger than that of the subset model, the $F$-statistic is small, and we do not reject the subset model; if, on the other hand, the difference in $R^2$ values is large (implying that the fit of the full model is noticeably stronger), we do reject the subset model in favor of the full model.
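To see why the two forms agree: both models are fit to the same response, so they share the same total sum of squares, and $R^2 = 1 - \text{Residual SS}/\text{Total SS}$ for each fit. Substituting,
$$
\text{Residual SS}_{\text{subset}} - \text{Residual SS}_{\text{full}} = \text{Total SS}\,(R^2_{\text{full}} - R^2_{\text{subset}})
\qquad\text{and}\qquad
\text{Residual SS}_{\text{full}} = \text{Total SS}\,(1 - R^2_{\text{full}}),
$$
and the Total SS factors cancel in the ratio, giving the $R^2$ form of the statistic.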
The $F$-statistic to test the overall significance of the regression is a special case of this construction (with restriction $\beta_1 = \cdots = \beta_p = 0$), as is each of the individual $t$-statistics that test the significance of any variable (with restriction $\beta_j = 0$). In the latter case $F = t_j^2$.
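As a quick check of the $F = t_j^2$ relationship, the following sketch (simulated data, hypothetical variable names) drops a single predictor and compares the resulting partial $F$-statistic to the squared $t$-statistic of that predictor in the full model.

```python
# Partial F-test for the single restriction beta2 = 0 versus the squared
# t-statistic of x2 in the full model; the two agree up to rounding.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 0.2 * x2 + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
subset = sm.OLS(y, sm.add_constant(x1)).fit()      # x2 dropped (beta2 = 0)

F, p_value, d = full.compare_f_test(subset)
print(F, full.tvalues[2] ** 2)                     # essentially identical values
```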
2.2.2 COLLINEARITY
Recall that the importance of a predictor can be difficult to assess using $t$-tests when predictors are correlated with each other. A related issue is that of collinearity (sometimes somewhat redundantly referred to as multicollinearity), which refers to the situation when (some of) the predictors are highly correlated with each other. The presence of predicting variables that are highly correlated with each other can lead to instability in the regression coefficients, increasing their standard errors, and as a result the $t$-statistics for the variables can be deflated. This can be seen in Figure 2.1. The two plots refer to identical data sets, other than the one data point that is lightly colored. Dropping the data points down to the predictor plane makes clear the high correlation between the predictors. The estimated regression plane changes substantially from the top plot to the bottom plot; a small change in only one data point causes a major change in the estimated regression function.
Thus, from a practical point of view, collinearity leads to two problems. First, it can happen that the overall $F$-statistic is significant, yet each of the individual $t$-statistics is not significant (more generally, the tail probability for the $F$-test is considerably smaller than that of any of the individual coefficient $t$-tests). Second, if the data are changed only slightly, the fitted regression coefficients can change dramatically. Note that while collinearity can have a large effect on regression coefficients and their associated $t$-statistics, it has little effect on the overall fit of the regression or on predictions made from it.
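A small simulation can make both problems visible; the sketch below (simulated data, not the data behind Figure 2.1) builds two nearly identical predictors, fits the regression, and then refits after nudging a single observation.

```python
# Collinearity in action: near-duplicate predictors, then a refit after a
# small change to one observation.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 30
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)           # x2 nearly equal to x1
y = 2.0 + 1.0 * x1 + 1.0 * x2 + rng.normal(scale=0.5, size=n)

fit1 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

x2_new = x2.copy()
x2_new[0] += 0.5                                   # perturb one data point slightly
fit2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2_new]))).fit()

print(fit1.params, fit2.params, sep="\n")          # slope estimates can shift a lot
print(fit1.fvalue, fit1.f_pvalue)                  # overall F is typically highly significant
print(fit1.tvalues)                                # ...while individual t-statistics look weak
```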