Statistics - David W. Scott
1.2.3 Pearson's Father–Son Height Data Revisited
We have explored the two variables in this dataset individually, but an obvious question remains: how accurately can a son's height be predicted from his father's height? In the first frame of Figure 1.6, we display a scatter diagram of the pairs. This diagram clearly shows a positive tilt, consistent with the expectation that the sons of tall fathers are tall, and vice versa; however, the relationship does not seem as strong as in the brain–body weight dataset.
In the top right frame, we have placed a red dot at the location of the average heights of the fathers and sons. We have also drawn a straight line fit using the intuitive equation $y = x$ (son's height equals father's height). However, the equation $y = x + 1$ is an improvement, since we observed earlier that sons were 1 inch taller than their fathers on average. As a reference, we have also included a horizontal line at the average height of the sons. This line would be appropriate if there were no information about a son's height to be gleaned from his father's height; but a positive relationship (correlation) is clear.
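The comparison of these reference lines can be made concrete by measuring how far each one misses, on average. The sketch below is only an illustration: it uses synthetic pairs rather than Pearson's data, it assumes the intuitive line predicts the son's height as equal to the father's, and the names `rmse`, `mean_shift`, and `mean_son` are hypothetical helpers introduced here.

```python
import math
import random


def rmse(predict, fathers, sons):
    """Root mean squared error of predict(father) against the son's height."""
    squared = [(predict(f) - s) ** 2 for f, s in zip(fathers, sons)]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic stand-in for the father-son pairs (NOT Pearson's data).
# The centers, spreads, and tilt below are assumptions chosen only to
# produce a positively tilted cloud of points.
rng = random.Random(0)
fathers = [rng.gauss(68.0, 2.7) for _ in range(1000)]
sons = [69.0 + 0.5 * (f - 68.0) + rng.gauss(0.0, 2.4) for f in fathers]

mean_son = sum(sons) / len(sons)
# Average amount by which sons exceed their fathers in this sample.
mean_shift = sum(s - f for f, s in zip(fathers, sons)) / len(sons)

candidates = {
    "son = father": lambda f: f,
    "son = father + average difference": lambda f: f + mean_shift,
    "son = average son (horizontal line)": lambda f: mean_son,
}
for name, predict in candidates.items():
    print(f"{name}: RMSE = {rmse(predict, fathers, sons):.2f} inches")
```

Among all lines of slope 1, shifting by the average difference gives the smallest squared error, which is why the shifted line can only improve on the unshifted one.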
Galton (1886) was one of the first to observe that many scatter diagrams arising in nature have an appearance similar to that in Figure 1.6. He noted that the shape appeared elliptical, so he superimposed elliptical contours over the scatter diagram. The bottom left frame in Figure 1.6 shows three (nested) ellipses for these data. Recall that a general ellipse has five parameters: two for the center of the ellipse; two for the horizontal and vertical scales; and a fifth called the eccentricity. Galton focused on this fifth parameter, and the correlation coefficient was the result. Ironically, this parameter is often referred to today as Pearson's correlation coefficient.
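One common way to write such a family of ellipses, assuming they are taken to be contours of a bivariate normal density (an assumption made here for illustration, not something stated in this passage), uses exactly five parameters: the center $(\mu_x, \mu_y)$, the scales $\sigma_x$ and $\sigma_y$, and the correlation $\rho$:

```latex
\[
\frac{(x-\mu_x)^2}{\sigma_x^2}
\;-\; 2\rho\,\frac{(x-\mu_x)(y-\mu_y)}{\sigma_x\,\sigma_y}
\;+\; \frac{(y-\mu_y)^2}{\sigma_y^2}
\;=\; c\,(1-\rho^2), \qquad c > 0 .
\]
```

Different values of $c$ give nested ellipses with a common center. With $\rho = 0$ the axes of the ellipse are parallel to the coordinate axes, and as $|\rho|$ approaches 1 the ellipse narrows toward a line.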
Figure 1.6 Father–son height data collected by Karl Pearson.
In the final frame, we take advantage of the large sample size to try to understand whether the prediction (as weak as it may be) might be linear or nonlinear. For integer values of the rounded fathers' heights, we compute a three-point summary of the corresponding sons' heights. Each red dot is the arithmetic average of the sons' heights. The vertical lines display the (conditional) interquartile range. The final two red dots on each end are based on only a few points, so the IQR cannot be computed. These four red dots are drawn smaller to indicate that even the averages are not so reliable.
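The conditional summary described above can be sketched as follows. This is a minimal illustration on synthetic data, not the author's code: the function name `conditional_summary` and the cutoff `min_count` (the number of points below which the quartiles are skipped) are assumptions introduced here, and the quartile convention is the default of Python's `statistics.quantiles`.

```python
import random
import statistics
from collections import defaultdict


def conditional_summary(fathers, sons, min_count=5):
    """Group sons by the father's height rounded to the nearest inch.

    Returns {rounded_height: (count, q1, mean, q3)}; q1 and q3 are None
    when a group has fewer than min_count points (min_count is a
    hypothetical cutoff and must be at least 2).
    """
    groups = defaultdict(list)
    for f, s in zip(fathers, sons):
        groups[round(f)].append(s)

    summary = {}
    for height in sorted(groups):
        values = groups[height]
        avg = statistics.fmean(values)
        if len(values) >= min_count:
            q1, _, q3 = statistics.quantiles(values, n=4)
        else:
            q1 = q3 = None  # too few points for a meaningful IQR
        summary[height] = (len(values), q1, avg, q3)
    return summary


# Synthetic stand-in for the father-son pairs (NOT Pearson's data).
rng = random.Random(1)
fathers = [rng.gauss(68.0, 2.7) for _ in range(1000)]
sons = [69.0 + 0.5 * (f - 68.0) + rng.gauss(0.0, 2.4) for f in fathers]

for height, (count, q1, avg, q3) in conditional_summary(fathers, sons).items():
    iqr = f"[{q1:.1f}, {q3:.1f}]" if q1 is not None else "n/a (too few points)"
    print(f"father ~ {height} in: n = {count:3d}, mean son = {avg:.1f}, IQR = {iqr}")
```

Plotting the averages against the rounded heights, with the quartiles as vertical bars, would give a display of the kind described for the final frame.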
We see that these summary points clearly suggest a linear rather than a nonlinear fit. We also see that the two blue reference lines from the second frame both miss badly. A new (dashed) line with a slope of 1/2 appears to capture the linear trend quite well. The relationship between this slope and the correlation coefficient, as well as a genetic explanation, will be discussed in Chapter 4.1.5.
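As a preview of that relationship, the standard least-squares identity says the fitted slope equals the correlation coefficient times the ratio of the standard deviations, $r \, s_y / s_x$. The sketch below checks this numerically on synthetic data; it is not necessarily the route the later chapter takes, and the helper names `pearson_r`, `least_squares_slope`, and `sample_sd` are hypothetical.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation with the n - 1 divisor."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def pearson_r(xs, ys):
    """Pearson correlation coefficient of the paired values."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def least_squares_slope(xs, ys):
    """Slope of the least-squares line for predicting y from x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Synthetic stand-in for the father-son pairs (NOT Pearson's data).
rng = random.Random(2)
fathers = [rng.gauss(68.0, 2.7) for _ in range(1000)]
sons = [69.0 + 0.5 * (f - 68.0) + rng.gauss(0.0, 2.4) for f in fathers]

r = pearson_r(fathers, sons)
slope = least_squares_slope(fathers, sons)
print(f"r = {r:.3f}, least-squares slope = {slope:.3f}")
print(f"r * s_y / s_x = {r * sample_sd(sons) / sample_sd(fathers):.3f}")
```

When the two standard deviations are nearly equal, the slope is close to $r$ itself, so a slope well below 1 goes together with a correlation well below 1.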