Introduction to Linear Regression Analysis - Douglas C. Montgomery
2.10 SOME CONSIDERATIONS IN THE USE OF REGRESSION
Regression analysis is widely used and, unfortunately, frequently misused. There are several common abuses of regression that should be mentioned:
1 Regression models are intended as interpolation equations over the range of the regressor variable(s) used to fit the model. As observed previously, we must be careful if we extrapolate outside of this range. Refer to Figure 1.5.
2 The disposition of the x values plays an important role in the least-squares fit. While all points have equal weight in determining the height of the line, the slope is more strongly influenced by the remote values of x. For example, consider the data in Figure 2.9. The slope in the least-squares fit depends heavily on either or both of the points A and B. Furthermore, the remaining data would give a very different estimate of the slope if A and B were deleted. Situations such as this often require corrective action, such as further analysis and possible deletion of the unusual points, estimation of the model parameters with some technique that is less seriously influenced by these points than least squares, or restructuring the model, possibly by introducing further regressors.
A somewhat different situation is illustrated in Figure 2.10, where one of the 12 observations is very remote in x space. In this example the slope is largely determined by the extreme point. If this point is deleted, the slope estimate is probably zero. Because of the gap between the two clusters of points, we really have only two distinct information units with which to fit the model. Thus, there are effectively far fewer than the apparent 10 degrees of freedom for error.
Situations such as these seem to occur fairly often in practice. In general we should be aware that in some data sets one point (or a small cluster of points) may control key model properties.
3 Outliers are observations that differ considerably from the rest of the data. They can seriously disturb the least-squares fit. For example, consider the data in Figure 2.11. Observation A seems to be an outlier because it falls far from the line implied by the rest of the data. If this point is really an outlier, then the estimate of the intercept may be incorrect and the residual mean square may be an inflated estimate of $\sigma^2$. The outlier may be a “bad value” that has resulted from a data recording or some other error. On the other hand, the data point may not be a bad value and may be a highly useful piece of evidence concerning the process under investigation. Methods for detecting and dealing with outliers are discussed more completely in Chapter 4.
Figure 2.11 An outlier.
TABLE 2.9 Data Illustrating Nonsense Relationships between Variables
$y$: Number of Certified Mental Defectives per 10,000 of Estimated Population in the U.K.
$x_1$: Number of Radio Receiver Licenses Issued (Millions) in the U.K.
$x_2$: First Name of President of the U.S.
Year    $y$    $x_1$    $x_2$
1924     8     1.350    Calvin
1925     8     1.960    Calvin
1926     9     2.270    Calvin
1927    10     2.483    Calvin
1928    11     2.730    Calvin
1929    11     3.091    Calvin
1930    12     3.647    Herbert
1931    16     4.620    Herbert
1932    18     5.497    Herbert
1933    19     6.260    Herbert
1934    20     7.012    Franklin
1935    21     7.618    Franklin
1936    22     8.131    Franklin
1937    23     8.593    Franklin
Source: Kendall and Yule [1950] and Tufte [1974].
4 As mentioned in Chapter 1, a strong relationship between two variables indicated by a regression analysis does not imply that the variables are related in any causal sense. Causality implies necessary correlation. Regression analysis can only address the issue of correlation. It cannot address the issue of necessity. Thus, our expectations of discovering cause-and-effect relationships from regression should be modest.
As an example of a “nonsense” relationship between two variables, consider the data in Table 2.9. This table presents the number of certified mental defectives in the United Kingdom per 10,000 of estimated population ($y$), the number of radio receiver licenses issued ($x_1$), and the first name of the President of the United States ($x_2$) for the years 1924–1937. We can show that the regression equation relating $y$ to $x_1$ is
$$\hat{y} = 4.582 + 2.204x_1$$
The t statistic for testing $H_0\colon \beta_1 = 0$ for this model is $t_0 = 27.312$ (the P value is $3.58 \times 10^{-12}$), and the coefficient of determination is $R^2 = 0.9842$. That is, 98.42% of the variability in the data is explained by the number of radio receiver licenses issued. Clearly this is a nonsense relationship, as it is highly unlikely that the number of mental defectives in the population is functionally related to the number of radio receiver licenses issued. The reason for this strong statistical relationship is that $y$ and $x_1$ are monotonically related (two sequences of numbers are monotonically related if as one sequence increases, the other always either increases or decreases). In this example $y$ is increasing because diagnostic procedures for mental disorders are becoming more refined over the years represented in the study and $x_1$ is increasing because of the emergence and low-cost availability of radio technology over the years.
Any two sequences of numbers that are monotonically related will exhibit similar properties. To illustrate this further, suppose we regress $y$ on the number of letters in the first name of the U.S. president in the corresponding year. The model is
$$\hat{y} = -26.442 + 5.900x_2$$
with $t_0 = 8.996$ (the P value is $1.11 \times 10^{-6}$) and $R^2 = 0.8709$. Clearly this is a nonsense relationship as well.
This is a simple demonstration of the problems that can arise in using regression analysis in large data mining studies where there are many variables and often very many observations. Nonsense relationships are frequently encountered in these studies.
5 In some applications of regression the value of the regressor variable x required to predict y is unknown. For example, consider predicting maximum daily load on an electric power generation system from a regression model relating the load to the maximum daily temperature. To predict tomorrow’s maximum load, we must first predict tomorrow’s maximum temperature. Consequently, the prediction of maximum load is conditional on the temperature forecast. The accuracy of the maximum load forecast depends on the accuracy of the temperature forecast. This must be considered when evaluating model performance.
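The dependence described in point 5 can be made explicit with a short decomposition. This is only a sketch: it assumes the simple linear model $y = \beta_0 + \beta_1 x + \varepsilon$, and the symbol $\tilde{x}$ for the forecast of the unknown regressor value $x$ is notation introduced here for illustration, not taken from the text.

```latex
\begin{align*}
y - \hat{y}
  &= (\beta_0 + \beta_1 x + \varepsilon) - (\hat{\beta}_0 + \hat{\beta}_1 \tilde{x}) \\
  &= \underbrace{(\beta_0 - \hat{\beta}_0) + (\beta_1 - \hat{\beta}_1)\,\tilde{x}}_{\text{parameter estimation error}}
   \;+\; \underbrace{\beta_1 (x - \tilde{x})}_{\text{regressor forecast error}}
   \;+\; \underbrace{\varepsilon}_{\text{random error}}
\end{align*}
```

The middle term vanishes only when the forecast $\tilde{x}$ is exact. Its size scales with $\beta_1$, so in the power-load example a steep load-temperature relationship amplifies any error in the temperature forecast.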
Other abuses of regression are discussed in subsequent chapters. For further reading on this subject, see the article by Box [1966].
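The radio-license regression in point 4 can be checked directly from the Table 2.9 data. The following is a minimal sketch in plain Python using the usual simple linear regression formulas. The function name `simple_ols` and the variable names are chosen here for illustration and are not from the text.

```python
import math

# Table 2.9, years 1924-1937.
# y  = certified mental defectives per 10,000 of estimated population (U.K.)
# x1 = radio receiver licenses issued, in millions (U.K.)
y = [8, 8, 9, 10, 11, 11, 12, 16, 18, 19, 20, 21, 22, 23]
x1 = [1.350, 1.960, 2.270, 2.483, 2.730, 3.091, 3.647,
      4.620, 5.497, 6.260, 7.012, 7.618, 8.131, 8.593]


def simple_ols(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1, t0, r2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    syy = sum((yi - y_bar) ** 2 for yi in y)
    b1 = sxy / sxx                      # slope estimate
    b0 = y_bar - b1 * x_bar             # intercept estimate
    ss_res = syy - b1 * sxy             # residual sum of squares
    ms_res = ss_res / (n - 2)           # residual mean square
    t0 = b1 / math.sqrt(ms_res / sxx)   # t statistic for H0: beta1 = 0
    r2 = 1.0 - ss_res / syy             # coefficient of determination
    return b0, b1, t0, r2


b0, b1, t0, r2 = simple_ols(x1, y)
print(f"y_hat = {b0:.3f} + {b1:.3f} x1,  t0 = {t0:.3f},  R^2 = {r2:.4f}")
```

The fit should agree with the values quoted in point 4 (an $R^2$ of about 0.9842 and a t statistic of about 27.3), even though the relationship is nonsense. A high $R^2$ is easy to get from any two series that trend together over time.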
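The situation in point 2, where a single remote observation controls the slope, can be illustrated with a small numerical experiment. The data below are hypothetical, made up to resemble the pattern described for Figure 2.10; they are not the book's data. The sketch compares the least-squares slope with and without the remote point and computes the leverage $h_{ii} = 1/n + (x_i - \bar{x})^2 / S_{xx}$ of each observation. The function names are illustrative.

```python
# Hypothetical data (not from the book): 11 observations clustered in
# 1 <= x <= 2 with essentially no trend, plus one remote observation at x = 10.
x = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 10.0]
y = [5.1, 4.8, 5.3, 4.9, 5.0, 5.2, 4.7, 5.1, 4.9, 5.0, 5.0, 9.0]


def ls_slope(x, y):
    """Least-squares slope estimate Sxy / Sxx for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


def leverages(x):
    """Leverage h_ii = 1/n + (x_i - x_bar)^2 / Sxx of each observation."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]


slope_all = ls_slope(x, y)                # fit using all 12 points
slope_trimmed = ls_slope(x[:-1], y[:-1])  # fit with the remote point deleted
h = leverages(x)

print(f"slope with remote point:    {slope_all:.3f}")
print(f"slope without remote point: {slope_trimmed:.3f}")
print(f"leverage of remote point:   {h[-1]:.3f}")
```

With these made-up numbers the slope is clearly positive when the remote point is included and close to zero once it is deleted. The leverage of the remote point is close to 1, so that one observation largely determines the fitted line.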