Vars  R-Sq  R-Sq(adj)  Mallows Cp    AICc       S
 1    35.3    34.6        21.2      1849.9   52576
 1    29.4    28.6        30.6      1857.3   54932
 1    10.6     9.5        60.3      1877.4   61828
 2    46.6    45.2         5.5      1835.7   48091
 2    38.9    37.5        17.5      1847.0   51397
 2    37.8    36.3        19.3      1848.6   51870
 3    49.4    47.5         3.0      1833.1   47092
 3    48.2    46.3         4.9      1835.0   47635
 3    46.6    44.7         7.3      1837.5   48346
 4    50.4    48.0         3.3      1833.3   46885
 4    49.5    47.0         4.7      1834.8   47304
 4    49.4    46.9         5.0      1835.1   47380
 5    50.6    47.5         5.0      1835.0   47094
 5    50.5    47.3         5.3      1835.2   47162
 5    49.6    46.4         6.7      1836.8   47599
 6    50.6    46.9         7.0      1836.9   47381

[Columns of X's in the original output indicate which of Bedrooms, Bathrooms, Living.area, Lot.size, Year.built, and Property.tax enter each model.]
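Output of this kind can be generated by enumerating every subset of the candidate predictors and computing the summary statistics for each fitted model. The sketch below is a minimal illustration in Python (not the software that produced the output above); the data-loading line, file name, and response name Sale.price are assumptions for illustration, while the fit measures follow equations (1.7) and (2.1) discussed below.

```python
from itertools import combinations
import numpy as np
import pandas as pd

def fit_ols(X, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return float(resid @ resid)

def best_subsets(df, response, predictors, keep=3):
    y = df[response].to_numpy(dtype=float)
    n = len(y)
    tss = float(((y - y.mean()) ** 2).sum())
    # Residual mean square of the model using all candidate predictors,
    # needed for Mallows' C_p.
    rss_full = fit_ols(df[predictors].to_numpy(dtype=float), y)
    sigma2_full = rss_full / (n - len(predictors) - 1)

    rows = []
    for p in range(1, len(predictors) + 1):
        for subset in combinations(predictors, p):
            rss = fit_ols(df[list(subset)].to_numpy(dtype=float), y)
            r2 = 1 - rss / tss
            r2_adj = r2 - (p / (n - p - 1)) * (1 - r2)   # equation (1.7)
            cp = rss / sigma2_full + 2 * (p + 1) - n     # equation (2.1)
            s = np.sqrt(rss / (n - p - 1))               # residual standard error
            rows.append((p, 100 * r2, 100 * r2_adj, cp, s, subset))
    out = pd.DataFrame(rows, columns=["Vars", "R-Sq", "R-Sq(adj)", "Cp", "S", "model"])
    # Keep the `keep` strongest-fitting models for each number of predictors.
    return (out.sort_values(["Vars", "R-Sq"], ascending=[True, False])
               .groupby("Vars").head(keep).reset_index(drop=True))

# Hypothetical usage with the predictors shown in the output above:
# homes = pd.read_csv("homes.csv")   # file and response name are assumptions
# print(best_subsets(homes, "Sale.price",
#                    ["Bedrooms", "Bathrooms", "Living.area",
#                     "Lot.size", "Year.built", "Property.tax"]))
```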
Output of this type provides the tools to choose among candidate models. The output gives summary statistics for the three models with strongest fit for each number of predictors. So, for example, the best one-predictor model is based on Bathrooms, while the second best is based on Living.area; the best two-predictor model is based on Bathrooms and Living.area; and so on. The principle of parsimony noted earlier implies moving down the table as long as the gain in fit is big enough, but no further, thereby encouraging simplicity. A reasonable model selection strategy is not based on any single measure, but rather looks at all of the measures together, using various guidelines to ultimately focus in on a few models (or only one) that best trade off strength of fit with simplicity, for example as follows:
1. Increase the number of predictors until the $R^2$ value levels off. Clearly, the highest $R^2$ for a given $p$ cannot be smaller than that for a smaller value of $p$. If $R^2$ levels off, that implies that additional variables are not providing much additional fit. In this case, the largest $R^2$ values go from roughly $35\%$ to $47\%$ in moving from $p = 1$ to $p = 2$, which is clearly a large gain in fit, but beyond that more complex models do not provide much additional fit (particularly past $p = 4$). Thus, this guideline suggests choosing either $p = 3$ or $p = 4$.
2. Choose the model that maximizes the adjusted $R^2$. Recall from equation (1.7) that the adjusted $R^2$ equals
$$R_a^2 = R^2 - \frac{p}{n - p - 1}\,(1 - R^2).$$
It is apparent that $R_a^2$ explicitly trades off strength of fit ($R^2$) versus simplicity [the multiplier $p/(n-p-1)$], and it can decrease if predictors that do not add any predictive power are added to a model. Thus, it is reasonable to not complicate a model beyond the point where its adjusted $R^2$ stops increasing. For these data, $R_a^2$ is maximized at $p = 4$ (both of these guidelines are illustrated in the short sketch below).
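As a quick illustration, the best $R^2$ and adjusted $R^2$ values for each model size can be read directly off the output above and the two guidelines applied to them:

```python
# Best R-Sq and R-Sq(adj) (in percent) for each number of predictors p,
# copied from the output above.
best_r2     = {1: 35.3, 2: 46.6, 3: 49.4, 4: 50.4, 5: 50.6, 6: 50.6}
best_r2_adj = {1: 34.6, 2: 45.2, 3: 47.5, 4: 48.0, 5: 47.5, 6: 46.9}

# Guideline 1: look at how much R-Sq gains as p grows; the gains level off.
gains = {p: round(best_r2[p] - best_r2[p - 1], 1) for p in range(2, 7)}
print(gains)                                   # {2: 11.3, 3: 2.8, 4: 1.0, 5: 0.2, 6: 0.0}

# Guideline 2: choose the p that maximizes adjusted R-Sq.
print(max(best_r2_adj, key=best_r2_adj.get))   # 4
```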
The fourth column in the output refers to a criterion called Mallows' $C_p$ (Mallows, 1973). This criterion equals
$$C_p = \frac{\text{Residual } SS_p}{\hat{\sigma}^2_{\rm full}} + 2(p + 1) - n, \qquad (2.1)$$
where $\text{Residual } SS_p$ is the residual sum of squares for the model being examined, $p$ is the number of predictors in that model, and $\hat{\sigma}^2_{\rm full}$ is the residual mean square based on using all of the candidate predicting variables. $C_p$ is designed to estimate the expected squared prediction error of a model. Like $R_a^2$, $C_p$ explicitly trades off strength of fit versus simplicity, with two differences: it is now small values that are desirable, and the penalty for complexity is stronger, in that the penalty term now effectively multiplies the number of predictors in the model by $2$, rather than by $1$ (which means that using $R_a^2$ will tend to lead to more complex models than using $C_p$ will). This suggests another model selection rule:
3. Choose the model that minimizes $C_p$. In case of tied values, the simplest model (smallest $p$) would be chosen. In these data, this rule implies choosing $p = 3$.
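A direct computation of $C_p$ from equation (2.1) takes only a few lines. The sketch below uses hypothetical argument names, and checks the formula against the full six-predictor model in the output above, for which $C_p$ must equal $7$ regardless of the sample size (assuming the S column is the residual standard error of each model).

```python
# Mallows' C_p as in equation (2.1): rss_p is the residual sum of squares of the
# candidate model, sigma2_full the residual mean square of the full model.
def mallows_cp(rss_p: float, p: int, sigma2_full: float, n: int) -> float:
    return rss_p / sigma2_full + 2 * (p + 1) - n

# For the full six-predictor model, RSS_6 = (n - 7) * sigma2_full, so
# C_p = (n - 7) + 14 - n = 7 whatever n is, matching the reported 7.0.
n = 85                          # sample size; any value reproduces this identity
sigma2_full = 47381.0 ** 2      # S = 47381 for the full model in the output above
rss_full = (n - 7) * sigma2_full
print(mallows_cp(rss_full, 6, sigma2_full, n))   # 7.0
```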
An additional operational rule for the use of $C_p$ has been suggested. When a particular model contains all of the necessary predictors, the residual mean square for the model should be roughly equal to $\sigma^2$. Since the model that includes all of the candidate predictors should also include all of the necessary ones, $\hat{\sigma}^2_{\rm full}$ should also be roughly equal to $\sigma^2$. This implies that if a model includes all of the necessary predictors, its residual sum of squares should be roughly $(n - p - 1)\sigma^2$, and hence
$$C_p \approx \frac{(n - p - 1)\sigma^2}{\sigma^2} + 2(p + 1) - n = p + 1.$$
This suggests the following model selection rule:
4. Choose the simplest model such that $C_p \approx p + 1$ or smaller. In these data, this rule implies choosing $p = 3$.
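Both $C_p$ rules can be applied directly to the best (smallest) $C_p$ value for each model size in the output above:

```python
# Smallest Mallows' C_p for each number of predictors p, copied from the output.
best_cp = {1: 21.2, 2: 5.5, 3: 3.0, 4: 3.3, 5: 5.0, 6: 7.0}

# Rule 3: choose the model that minimizes C_p.
print(min(best_cp, key=best_cp.get))                        # 3

# Rule 4: choose the simplest model with C_p <= p + 1.
print(min(p for p, cp in best_cp.items() if cp <= p + 1))   # 3
```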
A weakness of the $C_p$ criterion is that its value depends on the largest set of candidate predictors (through $\hat{\sigma}^2_{\rm full}$), which means that adding predictors that provide no predictive power to the set of candidate models can change the choice of best model. A general approach that avoids this is through the use of statistical information. A detailed discussion of the determination of information measures is beyond the scope of this book, but Burnham and Anderson (2002) provides extensive discussion of the topic. The Akaike Information Criterion $AIC$, introduced by Akaike (1973),
$$AIC = n \log\!\left(\frac{\text{Residual } SS_p}{n}\right) + 2(p + 1), \qquad (2.2)$$
where the $\log(\cdot)$ function refers to natural logs, is such a measure, and it estimates the information lost in approximating the true model by a candidate model. It is clear from (2.2) that minimizing $AIC$ achieves the goal of balancing strength of fit with simplicity, and because of the $2p$ term in the criterion this will result in the choice of similar models as when minimizing $C_p$. It is well known that $AIC$ has a tendency to lead to overfitting, particularly in small samples. That is, the penalty term in $AIC$ designed to guard against too complicated