Applied Univariate, Bivariate, and Multivariate Statistics - Daniel J. Denis

2.27.1 Box‐and‐Whisker Plots

The boxplot was a contribution of John Tukey (1977) in the spirit of what is called exploratory data analysis, or “EDA,” which encouraged scientists to spend more of their energy on descriptive techniques instead of focusing exclusively on confirmatory statistical tests. Boxplots of parent heights from Galton's data are produced as follows:

> attach(Galton)
> boxplot(parent)
> library(lattice)
> bwplot(parent)

The boxplot displays what is generally known as the five‐number summary of a distribution: the minimum, first quartile, median, third quartile, and maximum. The summary function in R returns these values, along with the mean:

> summary(parent)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  64.00   67.50   68.50   68.31   69.50   73.00

Recall that the median is the point in the ordered data that divides the data set into two equal parts. The location of the median is given by (n + 1)/2. In Galton's data, there are 928 observations, and so the median is located at position 464.5 (i.e., (928 + 1)/2) in the ordered data set, midway between the 464th and 465th ordered values. For parent, this value is equal to 68.50. The first and third quartiles represent the 25th and 75th percentiles and are 67.50 and 69.50, respectively. We can also compute the range as

> range(parent)
[1] 64 73
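The (n + 1)/2 rule for locating the median can be checked with a short sketch. The vector x below is made up for illustration and is not Galton's data; with an even number of observations the location falls between two ordered values, which are then averaged.

```r
# Hypothetical data with an even number of observations (not from Galton)
x <- c(64, 66, 67, 68, 69, 70, 71, 73)

n <- length(x)
pos <- (n + 1) / 2                 # location of the median in the ordered data

sorted <- sort(x)
lower <- sorted[floor(pos)]        # ordered value just below the location
upper <- sorted[ceiling(pos)]      # ordered value just above the location
med_by_hand <- (lower + upper) / 2 # average the two neighbors

# Compare with R's built-in median
med_by_hand == median(x)
```

When n is odd, floor(pos) and ceiling(pos) coincide, so the same code simply returns the middle ordered value.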

We can also generate boxplots by category. Throughout the book, we use Fisher's iris data (Fisher, 1936) in which flower characteristics such as sepal and petal length are categorized by species of flower. We plot sepal length by species:

> library(lattice)
> attach(iris)
> bwplot(Sepal.Length ~ Species)

Data points falling beyond the whiskers of the plots may reveal the presence of outliers and should be investigated (though, of course, not necessarily deleted; see Section for a discussion). If you are completely unfamiliar with boxplots, see Denis (2020) for an overview.
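To list the points that fall beyond the whiskers rather than read them off the plot, one option is the base R function boxplot.stats. The sketch below assumes the default whisker rule, in which whiskers extend to the most extreme observation lying within 1.5 times the length of the box (coef = 1.5). The object names by_species and beyond_whiskers are introduced here for illustration only.

```r
# iris ships with base R (package datasets), so no extra packages are needed

# Split sepal length into one numeric vector per species
by_species <- split(iris$Sepal.Length, iris$Species)

# For each species, return the observations lying beyond the whiskers,
# assuming the default rule of 1.5 times the length of the box
beyond_whiskers <- lapply(by_species, function(v) {
  boxplot.stats(v, coef = 1.5)$out
})

beyond_whiskers
```

An empty numeric vector for a species means that no observations fall beyond its whiskers under this rule. Any values that are returned are candidates for inspection, not automatic deletions.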

Stem‐and‐leaf plots are also easily produced. These visual displays are a kind of “naked histogram,” because they reveal the actual observations in the data while also providing information about their frequency of occurrence. In 1710, John Arbuthnot analyzed data on the ratios of male to female births in London from 1629 to 1710 and in so doing made an argument for these births being a function of a “divine being” (Arbuthnot, 1710; Shoesmith, 1987). One of his variables was the number of male christenings (i.e., baptisms) over the period 1629–1710. We generate a stem‐and‐leaf plot in R of these male christenings using package aplpack (Wolf and Bielefeld, 2014), in which each “leaf” is the hundreds digit of an observation. For example, in the following plot, the first entry, 2|8, represents 2800, which is the actual minimum in the data, 2890, rounded down to the nearest hundred. Likewise, the maximum in the data is 8426 but is displayed as 8400 (the final leaf in 8|0012334):

> install.packages("aplpack")
> library(aplpack)
> library(HistData)
> attach(Arbuthnot)
> stem.leaf(Males)
1 | 2: represents 1200
 leaf unit: 100
            n: 82
    1     2. | 8
   10     3* | 011222334
   15     3. | 66777
   18     4* | 014
   25     4. | 6777899
   36     5* | 01112233444
   38     5. | 56
  (11)    6* | 00001122444
   33     6. | 5555899
   26     7* | 244
   23     7. | 5555666666778999
    7     8* | 0012334
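The way an observation is split into a stem and a leaf when the leaf unit is 100 can be sketched with integer arithmetic. This is only an illustration of the rounding down described above, applied to the minimum and maximum quoted in the text. It is not how aplpack is implemented internally, and the variable names are hypothetical.

```r
# Minimum and maximum of the male christenings, as quoted in the text
males <- c(2890, 8426)

leaf_unit <- 100                                    # each leaf counts hundreds

stems  <- males %/% (10 * leaf_unit)                # thousands digit
leaves <- (males %% (10 * leaf_unit)) %/% leaf_unit # hundreds digit, rounded down

display <- paste(stems, leaves, sep = " | ")
display
```

Here 2890 becomes the stem 2 with leaf 8, and 8426 becomes the stem 8 with leaf 4. The tens and units digits are dropped, not rounded to the nearest hundred.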
