3. Outliers are observations that differ considerably from the rest of the data. They can seriously disturb the least-squares fit. For example, consider the data in Figure 2.11. Observation A seems to be an outlier because it falls far from the line implied by the rest of the data. If this point is really an outlier, then the estimate of the intercept may be incorrect and the residual mean square may be an inflated estimate of σ2. The outlier may be a "bad value" that has resulted from a data-recording or some other error. On the other hand, the data point may not be a bad value and may be a highly useful piece of evidence concerning the process under investigation. Methods for detecting and dealing with outliers are discussed more completely in Chapter 4.

Figure 2.11  An outlier.

TABLE 2.9  Data Illustrating Nonsense Relationships between Variables
(y = number of certified mental defectives per 10,000 of estimated population in the U.K.;
x1 = number of radio receiver licenses issued in the U.K., in millions;
x2 = first name of the President of the U.S.)

Year     y      x1       x2
1924     8    1.350    Calvin
1925     8    1.960    Calvin
1926     9    2.270    Calvin
1927    10    2.483    Calvin
1928    11    2.730    Calvin
1929    11    3.091    Calvin
1930    12    3.647    Herbert
1931    16    4.620    Herbert
1932    18    5.497    Herbert
1933    19    6.260    Herbert
1934    20    7.012    Franklin
1935    21    7.618    Franklin
1936    22    8.131    Franklin
1937    23    8.593    Franklin

Source: Kendall and Yule [1950] and Tufte [1974].
4. As mentioned in Chapter 1, just because a regression analysis has indicated a strong relationship between two variables, this does not imply that the variables are related in any causal sense. Causality implies necessary correlation. Regression analysis can address only the issue of correlation; it cannot address the issue of necessity. Thus, our expectations of discovering cause-and-effect relationships from regression should be modest.

As an example of a "nonsense" relationship between two variables, consider the data in Table 2.9. This table presents the number of certified mental defectives in the United Kingdom per 10,000 of estimated population (y), the number of radio receiver licenses issued (x1), and the first name of the President of the United States (x2) for the years 1924–1937. We can show that the regression equation relating y to x1 is

\[ \hat{y} = 4.582 + 2.204\,x_1 \]

The t statistic for testing H0: β1 = 0 for this model is t0 = 27.312 (the P value is 3.58 × 10−12), and the coefficient of determination is R2 = 0.9842. That is, 98.42% of the variability in the data is explained by the number of radio receiver licenses issued. Clearly this is a nonsense relationship, as it is highly unlikely that the number of mental defectives in the population is functionally related to the number of radio receiver licenses issued. The reason for this strong statistical relationship is that y and x1 are monotonically related (two sequences of numbers are monotonically related if, as one sequence increases, the other always either increases or decreases). In this example y is increasing because diagnostic procedures for mental disorders became more refined over the years represented in the study, and x1 is increasing because of the emergence and low-cost availability of radio technology over those years.

Any two sequences of numbers that are monotonically related will exhibit similar properties. To illustrate this further, suppose we regress y on the number of letters in the first name of the U.S. president in the corresponding year (x2). This fit gives t0 = 8.996 (the P value is 1.11 × 10−6) and R2 = 0.8709. Clearly this is a nonsense relationship as well.

This is a simple demonstration of the problems that can arise in using regression analysis in large data mining studies, where there are many variables and often very many observations. Nonsense relationships are frequently encountered in these studies.
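The first of these fits can be reproduced directly from Table 2.9. The following Python sketch (not part of the original text) carries out the ordinary least-squares calculation for y on x1 and recovers essentially the same slope, t statistic, and R2 reported above.

import numpy as np

# Data from Table 2.9
x1 = np.array([1.350, 1.960, 2.270, 2.483, 2.730, 3.091, 3.647,
               4.620, 5.497, 6.260, 7.012, 7.618, 8.131, 8.593])
y = np.array([8, 8, 9, 10, 11, 11, 12, 16, 18, 19, 20, 21, 22, 23], dtype=float)

n = len(y)
Sxx = np.sum((x1 - x1.mean())**2)
Sxy = np.sum((x1 - x1.mean()) * (y - y.mean()))
Syy = np.sum((y - y.mean())**2)

b1 = Sxy / Sxx                       # slope estimate
b0 = y.mean() - b1 * x1.mean()       # intercept estimate
SSres = Syy - b1 * Sxy               # residual sum of squares
MSres = SSres / (n - 2)              # residual mean square
t0 = b1 / np.sqrt(MSres / Sxx)       # t statistic for H0: beta1 = 0
R2 = 1 - SSres / Syy                 # coefficient of determination

print(f"yhat = {b0:.3f} + {b1:.3f} x1, t0 = {t0:.2f}, R2 = {R2:.4f}")
# The very large R2 reflects only that y and x1 both increase over time,
# not any causal connection between them.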
5. In some applications of regression the value of the regressor variable x required to predict y is unknown. For example, consider predicting maximum daily load on an electric power generation system from a regression model relating the load to the maximum daily temperature. To predict tomorrow's maximum load, we must first predict tomorrow's maximum temperature. Consequently, the prediction of maximum load is conditional on the temperature forecast. The accuracy of the maximum load forecast depends on the accuracy of the temperature forecast. This must be considered when evaluating model performance.
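A minimal numerical sketch of this point follows (the coefficients, forecasts, and units are entirely hypothetical and not from the text): an error in the temperature forecast propagates directly into the load forecast.

# Hypothetical fitted model: load = b0 + b1 * max_temperature
b0, b1 = 1200.0, 35.0        # hypothetical coefficients (MW, MW per degree)

temp_forecast = 33.0         # forecast of tomorrow's maximum temperature (hypothetical)
temp_actual = 36.0           # temperature that actually occurs (hypothetical)

load_forecast = b0 + b1 * temp_forecast     # prediction conditional on the forecast
load_actual_mean = b0 + b1 * temp_actual    # mean load at the realized temperature

# The temperature forecast error shifts the load prediction by b1 * (error):
print(load_actual_mean - load_forecast)     # 35.0 * 3.0 = 105.0 MW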
Other abuses of regression are discussed in subsequent chapters. For further reading on this subject, see the article by Box [1966].
2.11 REGRESSION THROUGH THE ORIGIN
Some regression situations seem to imply that a straight line passing through the origin should be fit to the data. A no-intercept regression model often seems appropriate in analyzing data from chemical and other manufacturing processes. For example, the yield of a chemical process is zero when the process operating temperature is zero.
The no-intercept model is
\[ y = \beta_1 x + \varepsilon \qquad (2.48) \]
Given n observations (yi, xi), i = 1, 2, …, n, the least-squares function is

\[ S(\beta_1) = \sum_{i=1}^{n} (y_i - \beta_1 x_i)^2 \]
The only normal equation is
\[ \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i x_i \qquad (2.49) \]
and the least-squares estimator of the slope is

\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} y_i x_i}{\sum_{i=1}^{n} x_i^2} \qquad (2.50) \]
The estimator of the mean response E(y|x) = β1x is the fitted regression model

\[ \hat{y} = \hat{\beta}_1 x \qquad (2.51) \]
The estimator of σ2 is
\[ MS_{\mathrm{Res}} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 1} = \frac{\sum_{i=1}^{n} y_i^2 - \hat{\beta}_1 \sum_{i=1}^{n} y_i x_i}{n - 1} \qquad (2.52) \]
with n − 1 degrees of freedom.
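As a numerical illustration of these formulas (a sketch, not part of the text; the data values are hypothetical), the following Python fragment computes the no-intercept slope estimate, fitted values, and residual mean square.

import numpy as np

# Hypothetical paired observations (x_i, y_i)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

n = len(y)
b1 = np.sum(y * x) / np.sum(x**2)         # eq. (2.50): slope through the origin
y_hat = b1 * x                             # eq. (2.51): fitted values
MSres = np.sum((y - y_hat)**2) / (n - 1)   # eq. (2.52): estimator of sigma^2, n - 1 df

print(b1, MSres)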
Making the normality assumption on the errors, we may test hypotheses and construct confidence and prediction intervals for the no-intercept model. The 100(1 − α) percent CI on β1 is
\[ \hat{\beta}_1 - t_{\alpha/2,\,n-1} \sqrt{\frac{MS_{\mathrm{Res}}}{\sum_{i=1}^{n} x_i^2}} \;\le\; \beta_1 \;\le\; \hat{\beta}_1 + t_{\alpha/2,\,n-1} \sqrt{\frac{MS_{\mathrm{Res}}}{\sum_{i=1}^{n} x_i^2}} \qquad (2.53) \]
A 100(1 − α) percent CI on E(y|x0), the mean response at x = x0, is

\[ \hat{y}_0 - t_{\alpha/2,\,n-1} \sqrt{\frac{x_0^2\, MS_{\mathrm{Res}}}{\sum_{i=1}^{n} x_i^2}} \;\le\; E(y \mid x_0) \;\le\; \hat{y}_0 + t_{\alpha/2,\,n-1} \sqrt{\frac{x_0^2\, MS_{\mathrm{Res}}}{\sum_{i=1}^{n} x_i^2}} \qquad (2.54) \]

where \( \hat{y}_0 = \hat{\beta}_1 x_0 \).
The 100(1 − α) percent prediction interval on a future observation at x = x0, say y0, is

\[ \hat{y}_0 - t_{\alpha/2,\,n-1} \sqrt{MS_{\mathrm{Res}} \left( 1 + \frac{x_0^2}{\sum_{i=1}^{n} x_i^2} \right)} \;\le\; y_0 \;\le\; \hat{y}_0 + t_{\alpha/2,\,n-1} \sqrt{MS_{\mathrm{Res}} \left( 1 + \frac{x_0^2}{\sum_{i=1}^{n} x_i^2} \right)} \qquad (2.55) \]
Both the CI (2.54) and the prediction interval (2.55) widen as x0 increases.
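A self-contained continuation of the earlier sketch (again, not from the text; alpha, x0, and the data are illustrative choices) evaluates the intervals (2.53)–(2.55).

import numpy as np
from scipy import stats

# Hypothetical paired observations (x_i, y_i)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(y)

b1 = np.sum(y * x) / np.sum(x**2)              # eq. (2.50)
MSres = np.sum((y - b1 * x)**2) / (n - 1)      # eq. (2.52)
sum_x2 = np.sum(x**2)

alpha, x0 = 0.05, 3.5                           # illustrative choices
t_crit = stats.t.ppf(1 - alpha / 2, n - 1)      # t_{alpha/2, n-1}

# eq. (2.53): CI on beta1
ci_beta1 = (b1 - t_crit * np.sqrt(MSres / sum_x2),
            b1 + t_crit * np.sqrt(MSres / sum_x2))

# eq. (2.54): CI on the mean response at x0
y0_hat = b1 * x0
ci_mean = (y0_hat - t_crit * np.sqrt(x0**2 * MSres / sum_x2),
           y0_hat + t_crit * np.sqrt(x0**2 * MSres / sum_x2))

# eq. (2.55): prediction interval on a future observation at x0
pi_y0 = (y0_hat - t_crit * np.sqrt(MSres * (1 + x0**2 / sum_x2)),
         y0_hat + t_crit * np.sqrt(MSres * (1 + x0**2 / sum_x2)))

print(ci_beta1, ci_mean, pi_y0)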