slope coefficients are very similar to those from the model using all predictors (which is not surprising given the low collinearity in the data), so the interpretations of the estimated coefficients on page 17 still hold to a large extent. A plot of the residuals versus the fitted values and a normal plot of the residuals (Figure 2.2) look fine, and are similar to those for the model using all six predictors in Figure 1.5; plots of the residuals versus each of the predictors in the model are similar to those in Figure 1.6, so they are not repeated here.
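These diagnostic plots can be reproduced with standard software. The following is a minimal sketch, assuming a fitted statsmodels OLS result `fit` for the three-predictor model (the object name is hypothetical, not from the text):

```python
import matplotlib.pyplot as plt
import scipy.stats as stats

def residual_plots(fit):
    resid = fit.resid
    fitted = fit.fittedvalues

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # (a) Residuals versus fitted values: look for curvature or nonconstant variance.
    ax1.scatter(fitted, resid)
    ax1.axhline(0, linestyle="--")
    ax1.set_xlabel("Fitted values")
    ax1.set_ylabel("Residuals")

    # (b) Normal (Q-Q) plot of the residuals: look for deviations from a straight line.
    stats.probplot(resid, dist="norm", plot=ax2)

    plt.tight_layout()
    plt.show()
```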
Once a “best” model is chosen, it is tempting to use the usual inference tools (such as t-tests and F-tests) to try to explain the process being studied. Unfortunately, doing this while ignoring the model selection process can lead to problems. Since the model was chosen to be best (in some sense), it will tend to appear stronger than would be expected just by random chance. Conducting inference based on the chosen model as if it were the only one examined ignores an additional source of variability, that of actually choosing the model (model selection based on a different sample from the same population could very well lead to a different chosen “best” model). This is termed model selection uncertainty. As a result of ignoring model selection uncertainty, confidence intervals can have lower coverage than the nominal value, hypothesis tests can reject the null hypothesis too often, and prediction intervals can be too narrow for their nominal coverage.
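The effect of ignoring the selection step can be seen in a small simulation. The following sketch (not from the text) generates data in which no predictor is related to the response, keeps the single predictor most correlated with the response, and then tests it at the nominal 5% level as if it had been prespecified; the naive rejection rate comes out well above 5%:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, k, reps = 100, 6, 2000
rejections = 0

for _ in range(reps):
    X = rng.normal(size=(n, k))   # six candidate predictors, none related to y
    y = rng.normal(size=n)        # the response is pure noise
    # "Model selection": keep only the predictor most correlated with y.
    best = np.argmax(np.abs(np.corrcoef(X.T, y)[-1, :-1]))
    fit = sm.OLS(y, sm.add_constant(X[:, best])).fit()
    if fit.pvalues[1] < 0.05:     # naive t-test that ignores the selection step
        rejections += 1

print(f"Naive rejection rate: {rejections / reps:.3f} (nominal level 0.05)")
```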
FIGURE 2.2: Residual plots for the home price data using the best three‐predictor model. (a) Plot of residuals versus fitted values. (b) Normal plot of the residuals.
Identifying and correcting for this uncertainty is a difficult problem and an active area of research; it will be discussed further in Chapter 14. There are, however, a few things practitioners can do. First, it is not appropriate to emphasize the single “best” model too strongly; any model with selection criterion values similar to those of the best model should be recognized as one that could easily have been chosen as best based on a different sample from the same population, and any implications of such a model should be viewed as being as valid as those from the best model. Further, one should expect that p-values for the predictors included in a chosen model are potentially smaller than they should be, so taking a conservative attitude regarding statistical significance is appropriate. Thus, for the chosen three-predictor model summarized on page 35, number of bathrooms and living area are likely to correspond to real effects, but the reality of the year built effect is more questionable.
There is a straightforward way to get a sense of the predictive power of a chosen model if enough data are available: hold out some data from the analysis (a holdout or validation sample), apply the model selected from the original data to the holdout sample (using the previously estimated parameters, not estimates based on the new data), and then examine the predictive performance of the model. If, for example, the standard deviation of the errors from this prediction is not very different from the standard error of the estimate in the original regression, chances are that making inferences based on the chosen model will not be misleading. Similarly, if a (say) 95% prediction interval does not include roughly 95% of the new observations, that indicates poorer-than-expected predictive performance on new data.
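A sketch of this validation check follows, assuming a statsmodels OLS result `fit` obtained via the formula interface on the original data and a holdout data frame `holdout` containing the same predictors along with the observed response (the names `fit`, `holdout`, and `price` are hypothetical, not from the text):

```python
import numpy as np

def validate(fit, holdout, response="price", level=0.95):
    # Predictions and pointwise prediction intervals for the holdout sample
    # (fit is assumed to come from the formula interface, so a DataFrame works here).
    summary = fit.get_prediction(holdout).summary_frame(alpha=1 - level)

    obs = holdout[response].to_numpy()
    errors = obs - summary["mean"].to_numpy()

    # Compare the spread of the holdout prediction errors with the standard
    # error of the estimate from the original regression.
    print("SD of prediction errors:", np.std(errors, ddof=1))
    print("Original sigma-hat     :", np.sqrt(fit.mse_resid))

    # Check whether roughly `level` of the new observations fall inside the
    # pointwise prediction intervals.
    inside = ((obs >= summary["obs_ci_lower"].to_numpy()) &
              (obs <= summary["obs_ci_upper"].to_numpy()))
    print("Prediction interval coverage:", inside.mean())
```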
FIGURE 2.3: Plot of observed versus predicted house sale price values of validation sample, with pointwise prediction interval limits superimposed. The dotted line corresponds to equality of observed values and predictions.
Figure 2.3 illustrates a validation of the three‐predictor housing price model on a holdout sample of houses. The figure is a plot of the observed versus predicted prices, with pointwise prediction interval limits superimposed. The intervals contain of the prices ( of ), and the average predictive error on the new houses is only (compared to an average observed price of more than ), not suggesting the presence of any forecasting bias in the model. Two of the houses, however, have sale prices well below what would have been expected (more than lower than expected), and this is reflected in a much higher standard deviation () of the predictive errors than from the fitted regression. If the two outlying houses are omitted, the standard deviation of the predictive errors is much smaller (), suggesting that while the fitted model's predictive performance for most houses is in line with its performance on the original sample, there are indications that it might not predict well for the occasional unusual house.
If validating the model on new data this way is not possible, a simple adjustment that is helpful is to estimate the variance of the errors as
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n - p_{\max} - 1},$$
where $\hat{y}_i$ is based on the chosen “best” model and $p_{\max}$ is the number of predictors in the most complex model examined, in the sense of most predictors (Ye, 1998). Clearly, if very complex models are included among the set of candidate models, $\hat{\sigma}$ can be much larger than the standard error of the estimate from the chosen model, with correspondingly wider prediction intervals. This reinforces the benefit of limiting the set of candidate models (and the complexity of the models in that set) from the start. In this case $p_{\max} = 6$, so the effect is not that pronounced.
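A minimal sketch of this adjustment, again assuming a statsmodels OLS result `fit` for the chosen model (the object name is hypothetical):

```python
import numpy as np

def adjusted_sigma(fit, p_max):
    # Residual sum of squares from the chosen model, but divided by degrees of
    # freedom based on the most complex candidate model (p_max predictors).
    rss = np.sum(fit.resid ** 2)
    n = int(fit.nobs)
    return np.sqrt(rss / (n - p_max - 1))

# Compare with the usual standard error of the estimate, np.sqrt(fit.mse_resid);
# here p_max = 6 (the model using all six predictors).
```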