does not necessarily answer the question that is of primary importance to a data analyst. The
t-test for a particular slope coefficient tests whether a variable adds predictive power given the other variables in the model, but if predictors are collinear it could be that none adds anything given the others, even though each is very important on its own. A related problem is that collinearity can lead to great instability in regression coefficients and t-tests, making results difficult to interpret. Hypothesis tests also do not distinguish statistical significance (whether or not a true coefficient is exactly zero) from practical importance (whether or not a model gives the analyst the ability to make important discoveries in the context of how the model is used in practice).
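A minimal sketch of this point follows, assuming Python with numpy and statsmodels and using artificial data (not part of the original text): with two nearly identical predictors, the overall F-test is highly significant even though neither individual t-test need be, and large variance inflation factors flag the collinearity.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Two predictors that are nearly copies of each other (severe collinearity).
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
y = 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print(fit.f_pvalue)     # overall F-test: the model clearly has predictive power
print(fit.pvalues[1:])  # individual t-tests: each predictor can look unimportant
print([variance_inflation_factor(X, i) for i in (1, 2)])  # very large VIFs flag the problem
```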
These considerations open up a broader spectrum of tools for model building than just hypothesis tests. Best subsets regression algorithms allow for the quick summarization of hundreds or even thousands of potential regression models. The underlying principle of these summaries is the principle of parsimony, which embodies the tradeoff between strength of fit and simplicity: a model should only be as complex as it needs to be. Measures such as the adjusted R², Mallows' Cp, and AICc explicitly provide this tradeoff, and are useful tools in helping to decide when a simpler model is preferred over a more complicated one. An effective model selection strategy uses these measures, as well as hypothesis tests and estimated prediction intervals, to suggest a set of potential “best” models, which can then be considered further. In doing so, it is important to remember that the variability that comes from model selection itself (model selection uncertainty) means that it is likely that several models actually provide descriptions of the underlying population process that are equally valid. One way of assessing the effects of this type of uncertainty is to keep some of the observed data aside as a holdout sample, and then validate the chosen fitted model(s) on that held out data.
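A hypothetical sketch of such a strategy is given below. It assumes Python with numpy and statsmodels, uses simulated data, performs a brute-force best subsets search, and scores every candidate model with the adjusted R², Mallows' Cp, and one common small-sample correction to AIC (parameter-counting conventions for AICc vary), while keeping a holdout sample aside for later validation.

```python
from itertools import combinations
import numpy as np
import statsmodels.api as sm

def best_subsets(y, X):
    """Fit every subset of the columns of X and return (columns, adj R^2, Cp, AICc) rows."""
    n, m = X.shape
    full = sm.OLS(y, sm.add_constant(X)).fit()
    sigma2_full = full.mse_resid                     # error variance estimate from the full model
    rows = []
    for size in range(1, m + 1):
        for cols in combinations(range(m), size):
            fit = sm.OLS(y, sm.add_constant(X[:, list(cols)])).fit()
            p = size + 1                             # parameters including the intercept
            cp = fit.ssr / sigma2_full - n + 2 * p   # Mallows' Cp relative to the full model
            aicc = fit.aic + 2 * p * (p + 1) / (n - p - 1)  # one common AICc correction
            rows.append((cols, fit.rsquared_adj, cp, aicc))
    return rows

# Hypothetical usage: keep a holdout sample aside before searching.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 5))
y = 1.5 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=80)
train, hold = slice(0, 60), slice(60, 80)
for cols, adj_r2, cp, aicc in sorted(best_subsets(y[train], X[train]), key=lambda r: r[3])[:3]:
    print(cols, round(adj_r2, 3), round(cp, 2), round(aicc, 2))
```

The three lowest-AICc models printed at the end would then be candidates to validate on the held out rows.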
A related point raised increasingly in recent years concerns replicability, or the lack thereof: the alarming tendency for supposedly established relationships not to reappear as strongly (or at all) when new data are examined. Much of this phenomenon comes from quite valid attempts to find appropriate representations of relationships in a complicated world (including those discussed here and in the next three chapters), but that does not alter the simple fact that interacting with data to make models more appropriate tends to make things look stronger than they actually are. Replication and validation of models (and of the entire model building process) should be a fundamental part of any exploration of a random process. Examining a problem further and discovering that a previously believed relationship does not replicate is not a failure of the scientific process; in fact, it is part of the essence of it.
A valid question regarding the logistics of performing model selection remains: what is the “correct” order in which to perform the different steps of model selection, assumption checking, and so on? Do you omit unusual observations first, and then try to determine the best model? Or do you work on variable selection, and then check diagnostics based on your chosen model? Unfortunately, there is no clear answer to this question, as neither order is guaranteed to work. The best answer is to try it both ways and see what happens; chances are results will be similar, and if they are not this could reveal alternative models that are equally valid and reasonable. What is certainly true is that if the data set is changed in any way, whether by omitting observations, taking logs, or anything else, model selection must be explored again, as the results previously obtained might not be appropriate for the new form of the data.
Although best subsets algorithms and modern computing power have made automatic model selection more feasible than it once was, they are still limited computationally to at most a few dozen predictors. In recent years, it has become more common for a data analyst to be faced with data sets with hundreds or thousands of predictors, making such methods infeasible. Recent work has focused on alternatives to least squares called regularization methods, which can be viewed as effectively acting as variable selectors, and which are feasible for very large numbers of predictors. These methods are discussed further in Chapter 14.
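As a rough illustration of how such methods behave, the sketch below uses the lasso, one example of a regularization method; the simulated data and the use of Python with scikit-learn are assumptions for illustration, not part of the text. The lasso shrinks most coefficients exactly to zero, so it acts as a variable selector even with far more predictors than a best subsets search could handle.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Many more predictors than an exhaustive best subsets search could consider.
rng = np.random.default_rng(2)
n, m = 200, 500
X = rng.normal(size=(n, m))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)  # only two predictors matter

lasso = LassoCV(cv=5).fit(X, y)        # penalty strength chosen by cross-validation
selected = np.flatnonzero(lasso.coef_)  # predictors with nonzero estimated coefficients
print(selected)                         # typically a small set including columns 0 and 1
```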
KEY TERMS
AICc: A modified version of the Akaike Information Criterion (AIC) that guards against overfitting in small samples. It is used to compare models when performing model selection.
Best subsets regression: A procedure that generates the best-fitting models for each number of predictors in the model.
Chow test: A statistical (partial F-)test for determining whether a single regression model can be used to describe the regression relationships when two groups are present in the data.
Collinearity: When predictor variables in a regression fit are highly correlated with each other.
Constant shift model: Regression models that have different intercepts but the same slope coefficients for the predicting variables for different groups in the data.
Indicator variable: A variable that takes on the values 0 or 1, indicating whether a particular observation belongs to a certain group or not.
Interaction effect: When the relationship between a predictor and the target variable differs depending on the group in which an observation falls.
Linear restriction: A linear condition on the regression coefficients that defines a special case (subset) of a larger unrestricted model.
Mallows' Cp: A criterion used for comparing several competing models to each other. It is designed to estimate the expected squared prediction error of a model.
Model selection uncertainty: The variability in results that comes from the fact that model selection is an iterative process, arrived at after examination of several models, and therefore the final model chosen is dependent on the particular sample drawn from the population. Significance levels, confidence intervals, etc., are not exact, as they depend on a chosen model that is itself random. This should be recognized when interpreting results.
Overfitting: Including redundant or noninformative predictors in a fitted regression.
Partial F-test: An F-test used to compare the fit of an unrestricted model to that of a restricted model (defined by a linear restriction), in order to see if the restricted model is adequate to describe the relationship in the data.
Pooled model: A single model fit to the data that ignores group classification.
Replicability: The finding that a scientific experiment or modeling process obtains a consistent result when it is repeated.
Underfitting: Omitting essential informative variables in a fitted regression.
Variance inflation factor: A statistic giving the proportional increase in the variance of the sample regression coefficient for a particular predictor due to the linear association of the predictor with other predictors.
PART TWO Addressing Violations of Assumptions