CHAPTER 2
SIMPLE LINEAR REGRESSION
2.1 SIMPLE LINEAR REGRESSION MODEL
This chapter considers the simple linear regression model, that is, a model with a single regressor x whose relationship with a response y is a straight line. This simple linear regression model is

$$ y = \beta_0 + \beta_1 x + \varepsilon \qquad (2.1) $$
where the intercept β0 and the slope β1 are unknown constants and ε is a random error component. The errors are assumed to have mean zero and unknown variance σ2. Additionally we usually assume that the errors are uncorrelated. This means that the value of one error does not depend on the value of any other error.
It is convenient to view the regressor x as controlled by the data analyst and measured with negligible error, while the response y is a random variable. That is, there is a probability distribution for y at each possible value for x. The mean of this distribution is
$$ E(y \mid x) = \beta_0 + \beta_1 x \qquad (2.2a) $$
and the variance is
$$ \operatorname{Var}(y \mid x) = \operatorname{Var}(\beta_0 + \beta_1 x + \varepsilon) = \sigma^2 \qquad (2.2b) $$
Thus, the mean of y is a linear function of x although the variance of y does not depend on the value of x. Furthermore, because the errors are uncorrelated, the responses are also uncorrelated.
The parameters β0 and β1 are usually called regression coefficients. These coefficients have a simple and often useful interpretation. The slope β1 is the change in the mean of the distribution of y produced by a unit change in x. If the range of data on x includes x = 0, then the intercept β0 is the mean of the distribution of the response y when x = 0. If the range of x does not include zero, then β0 has no practical interpretation.
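To make these assumptions concrete, the following minimal Python sketch simulates the model at a few fixed values of x. The parameter values β0 = 10, β1 = 2, and σ = 1, and the use of normally distributed errors, are illustrative assumptions only; the model itself requires only that the errors have mean zero, constant variance σ2, and be uncorrelated.

```python
import numpy as np

# Illustrative values, not from the text: beta0 = 10, beta1 = 2, sigma = 1.
beta0, beta1, sigma = 10.0, 2.0, 1.0
rng = np.random.default_rng(0)

# Several fixed regressor values, with many responses at each.
x = np.repeat([1.0, 2.0, 3.0], 1000)
# Errors with mean zero and variance sigma^2 (drawn as normal purely for
# illustration; the model only assumes mean zero, constant variance,
# and uncorrelated errors).
eps = rng.normal(0.0, sigma, size=x.size)
y = beta0 + beta1 * x + eps  # the model of Eq. (2.1)

# At each fixed x, the sample mean tracks beta0 + beta1*x (Eq. 2.2a),
# while the sample variance stays near sigma^2 regardless of x (Eq. 2.2b).
for xv in (1.0, 2.0, 3.0):
    ys = y[x == xv]
    print(f"x = {xv}: mean = {ys.mean():.2f}, variance = {ys.var(ddof=1):.2f}")
```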
2.2 LEAST-SQUARES ESTIMATION OF THE PARAMETERS
The parameters β0 and β1 are unknown and must be estimated using sample data. Suppose that we have n pairs of data, say (y1, x1), (y2, x2), …, (yn, xn). As noted in Chapter 1, these data may result from a controlled experiment designed specifically to collect the data, from an observational study, or from existing historical records (a retrospective study).
2.2.1 Estimation of β0 and β1
The method of least squares is used to estimate β0 and β1. That is, we estimate β0 and β1 so that the sum of the squares of the differences between the observations yi and the straight line is a minimum. From Eq. (2.1) we may write

$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, 2, \ldots, n \qquad (2.3) $$
Equation (2.1) may be viewed as a population regression model while Eq. (2.3) is a sample regression model, written in terms of the n pairs of data (yi, xi), i = 1, 2, …, n. Thus, the least-squares criterion is
$$ S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \qquad (2.4) $$
The least-squares estimators of β0 and β1, say $\hat{\beta}_0$ and $\hat{\beta}_1$, must satisfy

$$ \left.\frac{\partial S}{\partial \beta_0}\right|_{\hat{\beta}_0, \hat{\beta}_1} = -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0 $$

and

$$ \left.\frac{\partial S}{\partial \beta_1}\right|_{\hat{\beta}_0, \hat{\beta}_1} = -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) x_i = 0 $$
Simplifying these two equations yields

$$ n \hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i $$

$$ \hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i x_i \qquad (2.5) $$
Equations (2.5) are called the least-squares normal equations. The solution to the normal equations is

$$ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \qquad (2.6) $$

and

$$ \hat{\beta}_1 = \frac{\displaystyle\sum_{i=1}^{n} y_i x_i - \frac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n}}{\displaystyle\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} \qquad (2.7) $$
where

$$ \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \quad \text{and} \quad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $$

are the averages of yi and xi, respectively. Therefore, $\hat{\beta}_0$ and $\hat{\beta}_1$ in Eqs. (2.6) and (2.7) are the least-squares estimators of the intercept and slope, respectively. The fitted simple linear regression model is then

$$ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \qquad (2.8) $$
Equation (2.8) gives a point estimate of the mean of y for a particular x.
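As a sketch of how Eqs. (2.6)–(2.8) are applied, the following Python fragment computes the least-squares estimates for a small made-up data set; the five (x, y) pairs are hypothetical and used only for illustration.

```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = x.size

# Slope from Eq. (2.7): corrected sum of cross products divided by the
# corrected sum of squares of the x_i.
beta1_hat = (np.sum(y * x) - np.sum(y) * np.sum(x) / n) / \
            (np.sum(x**2) - np.sum(x)**2 / n)
# Intercept from Eq. (2.6).
beta0_hat = y.mean() - beta1_hat * x.mean()

# Fitted line of Eq. (2.8): a point estimate of the mean of y at each x.
y_hat = beta0_hat + beta1_hat * x
print(f"beta0_hat = {beta0_hat:.3f}, beta1_hat = {beta1_hat:.3f}")
```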
Since the denominator of Eq. (2.7) is the corrected sum of squares of the xi and the numerator is the corrected sum of cross products of xi and yi, we may write these quantities in a more compact notation as
$$ S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = \sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (2.9) $$

and

$$ S_{xy} = \sum_{i=1}^{n} y_i x_i - \frac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n} = \sum_{i=1}^{n} y_i (x_i - \bar{x}) \qquad (2.10) $$
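The compact notation is easy to verify numerically. A short sketch, reusing the same hypothetical data as above, computes both corrected sums; the ratio $S_{xy}/S_{xx}$ reproduces the slope estimate of Eq. (2.7).

```python
import numpy as np

# Same hypothetical data as in the earlier sketch.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
n = x.size

# Eq. (2.9): corrected sum of squares of the x_i (both forms agree).
Sxx = np.sum(x**2) - np.sum(x)**2 / n            # = np.sum((x - x.mean())**2)
# Eq. (2.10): corrected sum of cross products of the x_i and y_i.
Sxy = np.sum(y * x) - np.sum(y) * np.sum(x) / n  # = np.sum(y * (x - x.mean()))

# The ratio Sxy / Sxx reproduces the slope estimate of Eq. (2.7).
print(f"Sxx = {Sxx:.3f}, Sxy = {Sxy:.3f}, Sxy/Sxx = {Sxy / Sxx:.3f}")
```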