Statistics - David W. Scott
1.1.1 Pearson's Father–Son Height Data
We illustrate these ideas on a set of data collected by Karl Pearson over a century ago. He recorded the height of each father and of one of his adult sons. In the left frame of Figure 1.1, we display a box‐and‐whiskers plot of these data. We see that the sons are taller than their fathers by about an inch. There are also more potential outliers among the sons for some reason.
In the middle frame of Figure 1.1, we show Tukey's stem‐and‐leaf plot of the 1078 differences of the heights of each son and his father. The range of the data is and the first seven sorted values rounded to one decimal place are . Each data point is decomposed into a stem and a leaf digit. Thus has a stem of and a leaf of 0. The top line is actually , although it is too small to see. With so much data, each stem is broken into two lines to provide more detail. Thus the next two lines show a stem of but no leaves twice. The fourth line shows and the fifth line reads and so on. This figure was generated using the command ; R Core Team (2018). (The default has half as many stems.) Thus the stem‐and‐leaf plot shows the frequency count of points for each stem as character strings.
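The stem‐and‐leaf construction described above can be sketched in a few lines of Python (this is not the R command used for the figure). It is a minimal sketch under stated assumptions: the function name stem_and_leaf and the sample values are hypothetical, each value is rounded to one decimal place, the integer part is the stem and the first decimal digit is the leaf, and every stem is split into two lines (leaves 0 to 4, then 5 to 9). The labeling and ordering of negative stems is our own choice and may differ from what R prints.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the text lines of a stem-and-leaf display (values must be non-empty).

    Assumption: one-decimal leaves, integer stems, two lines per stem.
    """
    rows = defaultdict(list)
    for x in values:
        tenths = round(x * 10)                  # work in integer tenths
        stem, leaf = divmod(abs(tenths), 10)    # e.g. 21 -> stem 2, leaf 1
        line = 2 * stem + (1 if leaf >= 5 else 0)
        # Negative values get their own rows, so that -0.3 and 0.3 stay apart.
        key = line if tenths >= 0 else -line - 1
        rows[key].append(leaf)

    lines = []
    for key in range(min(rows), max(rows) + 1):  # include empty stems
        line = key if key >= 0 else -key - 1
        label = ("-" if key < 0 else "") + str(line // 2)
        # Keep leaves in increasing data order: ascending on positive
        # stems, descending on negative stems.
        leaves = sorted(rows.get(key, []), reverse=key < 0)
        lines.append(f"{label:>3} | {''.join(map(str, leaves))}")
    return lines


# Hypothetical differences (son minus father, in inches); not Pearson's data.
sample = [-1.7, -1.2, -0.3, 0.2, 0.6, 2.0, 2.1]
for row in stem_and_leaf(sample):
    print(row)
```

The length of each row of leaves is the frequency count for that half‐stem, which is why the display can be read as a histogram turned on its side.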
In the right frame of Figure 1.1, we show the frequency counts in a histogram. The histogram uses a parameter called the bin width, $h$, to construct an equally spaced mesh of bin edges. Then we count the number of points in each interval. These counts are displayed as a bar chart. (The histogram can use any anchor point, although 0 is a common choice.) For the histogram shown, the anchor point selected was 0, and $h$ was chosen using Scott's rule, $h = 3.5\,\hat{\sigma}\,n^{-1/3}$; see Scott (1979). This rule is discussed in Section 9.1.4.1. The default choice in function hist is Sturges' rule, discussed in Section 9.1.4.3, which chooses 11 bins (not shown).
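The counting step behind the histogram can be sketched as follows. This is a minimal Python sketch, not the R function hist: the helper names (scott_bin_width, histogram_counts, sturges_bins) are hypothetical, the data are synthetic normal draws standing in for the 1078 height differences (the mean of 1.0 and standard deviation of 2.8 are illustrative assumptions, not Pearson's values), Scott's rule is taken in its usual form $h = 3.5\,\hat{\sigma}\,n^{-1/3}$, and Sturges' rule is taken as $1 + \log_2 n$ bins.

```python
import math
import random
import statistics


def scott_bin_width(data):
    """Scott's rule: h = 3.5 * s * n**(-1/3), with s the sample standard deviation."""
    return 3.5 * statistics.stdev(data) * len(data) ** (-1 / 3)


def histogram_counts(data, h, anchor=0.0):
    """Count the points falling in each bin [anchor + k*h, anchor + (k+1)*h)."""
    counts = {}
    for x in data:
        k = math.floor((x - anchor) / h)   # index of the bin containing x
        counts[k] = counts.get(k, 0) + 1
    return dict(sorted(counts.items()))


def sturges_bins(n):
    """Sturges' rule for the number of bins (before any rounding)."""
    return 1 + math.log2(n)


random.seed(0)
# Synthetic stand-in for the height differences; NOT the real data.
diffs = [random.gauss(1.0, 2.8) for _ in range(1078)]

h = scott_bin_width(diffs)
counts = histogram_counts(diffs, h, anchor=0.0)
for k, c in counts.items():
    # One '#' per 5 points gives a rough text bar chart.
    print(f"[{k * h:6.2f}, {(k + 1) * h:6.2f})  {'#' * (c // 5)}")
print("Sturges' rule, unrounded number of bins:", sturges_bins(len(diffs)))
```

For n = 1078, $1 + \log_2 n$ is a little above 11, which is consistent with the 11 bins quoted for Sturges' rule. Changing the anchor argument shifts every bin edge by the same amount without changing the bin width.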
The choice of $h$ is often considered a matter of convenience. The stem‐and‐leaf plot, which uses one‐digit integer stems, limits the available choices. By way of contrast, any positive real number can be used as the bin width of a histogram. In Figure 1.2, we show the histograms using $h$ chosen by Scott's rule, as well as one larger and one smaller bin width. Loosely speaking, the histograms using the larger bin width are missing useful information, while the histograms using the smaller bin width display spurious detail. We discuss strategies for finding the best choice of $h$ in Section 9.1. In any case, the histogram is a powerful tool for understanding the full distribution of data.
Figure 1.1 Displays of the father–son height data collected by Karl Pearson: (left) box‐and‐whiskers plot; (middle) stem‐and‐leaf plot; (right) histogram.
Figure 1.2 Histograms of the sons' heights (top row) and fathers' heights (bottom row) using three bin widths (from left to right); see text.