Chapter 11 discusses the modeling of time‐to‐event data, often referred to as survival data. The response variable measures the length of time until an event occurs, and a common complication is that sometimes it is only known that a response value is greater than some number; that is, it is right‐censored. This can naturally occur, for example, in a clinical trial in which subjects enter the study at varying times and the event of interest has not occurred for some subjects by the end of the trial. Analysis focuses on the survival function (the probability of surviving past a given time) and the hazard function (the instantaneous rate of occurrence of the event at a given time, given survival to that time). Parametric models based on appropriate distributions such as the Weibull or log‐logistic can be fit in ways that take censoring into account. Semiparametric models, such as the Cox proportional hazards model (the most commonly‐used model) and the Buckley‐James estimator, are also available, and these weaken distributional assumptions. Modeling can be adapted to situations where event times are truncated, and also to situations where there are covariates that change over the life of the subject.
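To make the survival function concrete: it can be estimated nonparametrically from right‐censored data with the Kaplan–Meier product‐limit estimator. The following is a minimal sketch in Python (illustrative only and not taken from the book, whose analyses are all in R; the function name and data layout are our own):

```python
# Illustrative sketch (not from the book): the Kaplan-Meier estimate of the
# survival function from right-censored data. Each observation is a pair
# (time, event), where event=1 means the event occurred at that time and
# event=0 means the time is right-censored (the event had not yet occurred).

def kaplan_meier(observations):
    """Return a list of (t, S(t)) pairs at each distinct event time."""
    obs = sorted(observations)          # ties: order within a time does not matter here
    n = len(obs)
    at_risk = n                         # subjects still under observation
    surv = 1.0                          # running product-limit estimate
    curve = []
    i = 0
    while i < n:
        t = obs[i][0]
        deaths = sum(1 for (ti, ei) in obs if ti == t and ei == 1)
        removed = sum(1 for (ti, ei) in obs if ti == t)
        if deaths > 0:
            # multiply by the conditional probability of surviving past t
            surv *= (at_risk - deaths) / at_risk
            curve.append((t, surv))
        at_risk -= removed              # censored subjects leave the risk set after t
        i += removed
    return curve
```

For example, with observations `[(1,1), (2,0), (3,1), (4,1), (5,0)]` the estimate drops at times 1, 3, and 4, while the censored times 2 and 5 only shrink the risk set.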
Chapter 13 extends applications to data with multiple observations for each subject, reflecting some structure in the underlying process. Such data can take the form of nested or clustered data (such as students all in one classroom) or longitudinal data (where a variable is measured at multiple times for each subject). In this situation ignoring that structure induces correlation among the observations, reflecting unmodeled differences between classrooms and subjects, respectively. Mixed effects models generalize analysis of variance (ANOVA) models and time series models to this more complicated situation. Models with linear effects based on Gaussian distributions can be generalized to nonlinear models, and also can be generalized to non‐Gaussian distributions through the use of generalized linear mixed effects models.
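The induced correlation can be made concrete with the simplest mixed model, a random intercept: if each observation in cluster i is y_ij = mu + b_i + e_ij, with cluster effect variance var_b and residual variance var_e, then two observations in the same cluster share b_i and so are correlated even after the fixed effects are accounted for. A minimal sketch in Python (illustrative only, not from the book; the function names are our own):

```python
# Illustrative sketch (not from the book): the marginal covariance structure
# implied by a random-intercept model y_ij = mu + b_i + e_ij. Observations in
# the same cluster share the unobserved b_i, which puts var_b on every
# off-diagonal entry of the within-cluster covariance matrix.

def within_cluster_cov(n_per_cluster, var_b, var_e):
    """Marginal covariance matrix of the observations in one cluster."""
    return [[var_b + (var_e if i == j else 0.0)
             for j in range(n_per_cluster)]
            for i in range(n_per_cluster)]

def intraclass_correlation(var_b, var_e):
    """Correlation between any two observations in the same cluster."""
    return var_b / (var_b + var_e)
```

With var_b = 4 and var_e = 1, for example, each observation has variance 5, within-cluster covariances equal 4, and the intraclass correlation is 0.8; ignoring the clustering treats these correlated observations as independent.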
Modern data applications can involve very large (even massive) numbers of predictors, which can cause major problems for standard regression methods. Best subsets regression (discussed in Chapter 2) does not scale well to very large numbers of predictors, and Chapter 14 discusses approaches that do. Forward stepwise regression, in which potential predictors are stepped in one at a time, is an alternative to best subsets that scales to massive data sets. A systematic approach to reducing the dimensionality of a chosen regression model is through the use of regularization, in which the usual estimation criterion is augmented with a penalty that encourages sparsity; the most commonly‐used version of this is the lasso estimator, and it and its generalizations are discussed further.
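The lasso augments the least squares criterion with an L1 penalty on the coefficients, which shrinks some of them exactly to zero. One standard way to compute it is cyclic coordinate descent with soft‐thresholding, sketched below in Python (illustrative only, not from the book; this uses one common scaling convention for the penalty, and the names are our own):

```python
# Illustrative sketch (not from the book): the lasso computed by cyclic
# coordinate descent. The objective is (1/2n) * sum((y - Xb)^2) + lam * sum(|b_j|);
# the soft-threshold operator is what drives some coefficients exactly to zero.

def soft_threshold(z, g):
    """Shrink z toward zero by g, clamping to zero inside [-g, g]."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """X: list of rows, y: list of responses, lam: penalty weight."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # correlation of predictor j with the partial residual
            # (the residual computed with b_j held out)
            rho = sum(X[i][j] * (y[i] - sum(X[i][k] * b[k]
                                            for k in range(p) if k != j))
                      for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n  # assumed nonzero column
            b[j] = soft_threshold(rho, lam) / z
    return b
```

With orthogonal predictor columns the update converges in one pass; increasing `lam` shrinks the fitted coefficients and eventually zeroes them out, which is the sparsity the penalty encourages.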
Chapters 15 and 16 discuss methods that move away from prespecified relationships between the response and the predictors to nonparametric and semiparametric methods, in which the data are used to choose the form of the underlying relationship. In Chapter 15 linear or (explicitly specified) nonlinear relationships are replaced with the notion of relationships taking the form of smooth curves and surfaces. Estimation at a particular location is based on local information; that is, the values of the response in a local neighborhood of that location. This can be done through local versions of weighted least squares (local polynomial estimation) or local regularization (smoothing splines). Such methods can also be used to help identify interactions between numerical predictors in linear regression modeling. Single‐predictor smoothing estimators can be generalized to multiple predictors through the use of additive functions of smooth curves. Chapter 16 focuses on an extremely flexible class of nonparametric regression estimators, tree‐based methods. Trees are based on the notion of binary recursive partitioning. At each step a set of observations (a node) either is split into two parts (child nodes) on the basis of the values of a chosen variable, or is not split at all, with splits chosen to encourage homogeneity in the child nodes. This approach provides nonparametric alternatives to linear regression (regression trees), logistic and multinomial regression (classification trees), accelerated failure time and proportional hazards regression (survival trees), and mixed effects regression (longitudinal trees).
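The node‐splitting step of a regression tree can be made concrete: for a numeric predictor, candidate thresholds between adjacent observed values are scored by the total sum of squared errors in the two resulting child nodes, and the threshold with the smallest total is chosen. A minimal sketch in Python (illustrative only, not from the book; the function name and return convention are our own):

```python
# Illustrative sketch (not from the book): choosing the best binary split of a
# regression tree node on one numeric predictor x, by minimizing the combined
# sum of squared errors (SSE) around the child means -- the "homogeneity"
# criterion for regression trees.

def best_split(x, y):
    """Return (threshold, total_SSE); threshold is None if no split helps."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    pairs = sorted(zip(x, y))
    best = (None, sse(y))  # baseline: leave the node unsplit
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between tied x values
        thr = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [yv for xv, yv in pairs[:i]]
        right = [yv for xv, yv in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (thr, total)
    return best
```

Recursive partitioning applies this search to every predictor at every node and then recurses into each child, which is what builds the full tree.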
A final small change from the first edition to the second edition is in the title, which now includes the phrase With Applications in R. This is not really a change, of course, as all of the analyses in the first edition were performed using the statistical package R. Code for the output and figures in the book can (still) be found at its associated web site at http://people.stern.nyu.edu/jsimonof/RegressionHandbook/. As was the case in the first edition, even though the analyses are performed in R, we still discuss general issues relevant to any data analyst using statistical software, even when those issues do not apply specifically to R.
We would like to once again thank our students and colleagues for their encouragement and support, and in particular students for the tough questions that have definitely affected our views on statistical modeling and by extension this book. We would like to thank Jon Gurstelle, and later Kathleen Santoloci and Mindy Okura‐Marszycki, for approaching us with encouragement to undertake a second edition. We would like to thank Sarah Keegan for her patient support in bringing the book to fruition in her role as Project Editor. We would like to thank Roni Chambers for computing assistance, and Glenn Heller and Marc Scott for looking at earlier drafts of chapters. Finally, we would like to thank our families for their continuing love and support.
SAMPRIT CHATTERJEE
Brooksville, Maine
JEFFREY S. SIMONOFF
New York, New York
October, 2019
Preface to the First Edition
How to Use This Book
This book is designed to be a practical guide to regression modeling. There is little theory here, and methodology appears in the service of the ultimate goal of analyzing real data using appropriate regression tools. As such, the target audience of the book includes anyone who is faced with regression data [that is, data where there is a response variable that is being modeled as a function of other variable(s)], and whose goal is to learn as much as possible from that data.
The book can be used as a text for an applied regression course (indeed, much of it is based on handouts that have been given to students in such a course), but that is not its primary purpose; rather, it is aimed much more broadly as a source of practical advice on how to address the problems that come up when dealing with regression data. While a text is usually organized in a way that makes the chapters interdependent, successively building on each other, that is not the case here. Indeed, we encourage readers to dip into different chapters for practical advice on specific topics as needed. The pace of the book is faster than might typically be the