This, however, is not a cookbook that presents a mechanical approach to doing regression analysis. Data analysis is perhaps an art, and certainly a craft; we believe that the goal of any data analysis book should be to help analysts develop the skills and experience necessary to adjust to the inevitable twists and turns that come up when analyzing real data.
We assume that the reader possesses a nodding acquaintance with regression analysis. The reader should be familiar with the basic terminology and should have been exposed to basic regression techniques and concepts, at least at the level of simple (one‐predictor) linear regression. We also assume that the reader has access to a computer with an adequate regression package. The material presented here is not tied to any particular software. Almost all of the analyses described here can be performed by most standard packages, although the ease of doing so may vary. All of the analyses presented here were done using the free package R (R Development Core Team, 2017), which is available for many different operating system platforms (see http://www.R-project.org/ for more information). Code for the output and figures in the book can be found at its associated web site at http://people.stern.nyu.edu/jsimonof/RegressionHandbook/.
Each chapter of the book is laid out in a similar way, with most having at least four sections of specific types. First is an introduction, where the general issues that will be discussed in that chapter are presented. A section on concepts and background material follows, which focuses on the relationship of the chapter's material to the broader study of regression data. This section also provides any necessary theoretical background for the material. Sections on methodology follow, where the specific tools used in the chapter are discussed. This is where relevant algorithmic details are likely to appear. Finally, each chapter includes at least one analysis of real data using the methods discussed in the chapter (as well as appropriate material from earlier chapters), including both methodological and graphical analyses.
The book begins with discussion of the multiple regression model. Many regression textbooks start with discussion of simple regression before moving on to multiple regression. This is quite reasonable from a pedagogical point of view, since simple regression has the great advantage of being easy to understand graphically, but from a practical point of view simple regression is rarely the primary tool in analysis of real data. For that reason, we start with multiple regression, and note the simplifications that come from the special case of a single predictor. Chapter 1 describes the basics of the multiple regression model, including the assumptions being made, and both estimation and inference tools, while also giving an introduction to the use of residual plots to check assumptions.
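For orientation, the multiple regression model and its single-predictor special case are commonly written as below. The notation (response y_i, predictors x_{ji}, coefficients \beta_j, errors \varepsilon_i, with p predictors and n observations) is a standard convention assumed here for illustration; it is not necessarily the notation adopted in Chapter 1.

```latex
% Multiple regression with p predictors (assumed notation)
y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \varepsilon_i,
\qquad i = 1, \ldots, n

% Special case p = 1: simple regression
y_i = \beta_0 + \beta_1 x_{1i} + \varepsilon_i
```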
Since it is unlikely that the first model examined will ultimately be the final preferred model, Chapter 2 focuses on the very important areas of model building and model selection. This includes addressing the issue of collinearity, as well as the use of both hypothesis tests and information measures to help choose among candidate models.
Chapters 3 through 5 study common violations of regression assumptions, and methods available to address those model violations. Chapter 3 focuses on unusual observations (outliers and leverage points), while Chapter 4 describes how transformations (especially the log transformation) can often address both nonlinearity and nonconstant variance violations. Chapter 5 is an introduction to time series regression, and the problems caused by autocorrelation. Time series analysis is a vast area of statistical methodology, so our goal in this chapter is only to provide a good practical introduction to that area in the context of regression analysis.
Chapters 6 and 7 focus on the situation where there are categorical variables among the predictors. Chapter 6 treats analysis of variance (ANOVA) models, which include only categorical predictors, while Chapter 7 looks at analysis of covariance (ANCOVA) models, which include both numerical and categorical predictors. The examination of interaction effects is a fundamental aspect of these models, as are questions related to simultaneous comparison of many groups to each other. Data of this type often exhibit nonconstant variance related to the different subgroups in the population, and the appropriate tool to address this issue, weighted least squares, is also a focus here.
Chapters 8 through 10 examine the situation where the nature of the response variable is such that Gaussian‐based least squares regression is no longer appropriate. Chapter 8 focuses on logistic regression, designed for binary response data and based on the binomial random variable. While there are many parallels between logistic regression analysis and least squares regression analysis, there are also issues that come up in logistic regression that require special care. Chapter 9 uses the multinomial random variable to generalize the models of Chapter 8 to allow for multiple categories in the response variable, outlining models designed for response variables that either do or do not have ordered categories. Chapter 10 focuses on response data in the form of counts, where distributions like the Poisson and negative binomial play a central role. The connection between all these models through the generalized linear model framework is also exploited in this chapter.
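As a rough sketch of that connection, a generalized linear model relates a function of the mean response (the link function g) to a linear combination of the predictors. The symbols below (\mu_i, \pi_i, g, \beta_j) are standard textbook notation assumed for illustration, not taken from the chapters themselves, and only the usual default links are shown.

```latex
% General form (assumed notation): link g applied to the mean response
g(\mu_i) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi},
\qquad \mu_i = E(y_i)

% Least squares (Gaussian response): identity link
g(\mu_i) = \mu_i

% Logistic regression (binary response, \pi_i = P(y_i = 1)): logit link
g(\pi_i) = \log\!\left( \frac{\pi_i}{1 - \pi_i} \right)

% Count regression (Poisson or negative binomial response): log link
g(\mu_i) = \log \mu_i
```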
The final chapter focuses on situations where linearity does not hold, and a nonlinear relationship is necessary. Although these models are based on least squares, from both an algorithmic and inferential point of view there are strong connections with the models of Chapters 8 through 10, which we highlight.
This Handbook can be used in several different ways. First, a reader may use the book to find information on a specific topic. An analyst might want additional information on, for example, logistic regression or autocorrelation. The chapters on these (and other) topics provide the reader with this subject matter information. As noted above, the chapters also include at least one analysis of a data set, a clarification of computer output, and reference to sources where additional material can be found. The chapters in the book are to a large extent self‐contained and can be consulted independently of other chapters.
The book can also be used as a template for what we view as a reasonable approach to data analysis in general. This is based on the cyclical paradigm of model formulation, model fitting, model evaluation, and model updating leading back to model (re)formulation. Statistical significance of test statistics does not necessarily mean that an adequate model has been obtained. Further analysis needs to be performed before the fitted model can be regarded as an acceptable description of the data, and this book concentrates on this important aspect of regression methodology. Detection of deficiencies of fit is based on both testing and graphical methods, and both approaches are highlighted here.
This preface is intended to indicate ways in which the Handbook can be used. Our hope