Probability with R - Jane M. Horgan

3.7 GRAPHICAL DISPLAYS VERSUS SUMMARY STATISTICS

Before we finish, let us look at a simple, classic example of the importance of using graphical displays to provide insight into the data. The example is that of Anscombe (1973), who provides four data sets, given in Table 3.3 and often referred to as the Anscombe Quartet. Each data set consists of two variables on which there are 11 observations.

TABLE 3.3 The Anscombe Quartet

Data Set 1		Data Set 2		Data Set 3		Data Set 4
x1	y1	x2	y2	x3	y3	x4	y4
10	8.04	10	9.14	10	7.46	8	6.58
8	6.95	8	8.14	8	6.77	8	5.76
13	7.58	13	8.74	13	12.74	8	7.71
9	8.81	9	8.77	9	7.11	8	8.84
11	8.33	11	9.26	11	7.81	8	8.47
14	9.96	14	8.10	14	8.84	8	7.04
6	7.24	6	6.13	6	6.08	8	5.25
4	4.26	4	3.10	4	5.39	19	12.50
12	10.84	12	9.13	12	8.15	8	5.56
7	4.82	7	7.26	7	6.42	8	7.91
5	5.68	5	4.74	5	5.73	8	6.89

First, read the data into separate vectors.

x1 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y1 <- c(8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68)

and so on for x2, y2, x3, y3, x4, and y4. Then, for convenience, group the data into data frames as follows:

dataset1 <- data.frame(x1, y1)
dataset2 <- data.frame(x2, y2)
dataset3 <- data.frame(x3, y3)
dataset4 <- data.frame(x4, y4)
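Before the data frames above can be created, the six remaining vectors have to be entered. A minimal sketch follows, with the values transcribed from Table 3.3. The comparison against R's built-in `anscombe` data frame is an added sanity check that is not part of the original text; it assumes the built-in copy holds the same values as the table.

```r
# Data sets 2 to 4, transcribed column by column from Table 3.3
x2 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y2 <- c(9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74)
x3 <- c(10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5)
y3 <- c(7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73)
x4 <- c(8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8)
y4 <- c(6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89)

# Optional check for typing errors against the copy shipped with R
# (assumption: datasets::anscombe matches Table 3.3)
typed <- list(x2 = x2, y2 = y2, x3 = x3, y3 = y3, x4 = x4, y4 = y4)
matches <- sapply(names(typed),
                  function(v) isTRUE(all.equal(typed[[v]], anscombe[[v]])))
print(matches)
```

A `FALSE` in the printed vector would point to a column that was mistyped.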

When presented with data such as these, it is usual to obtain summary statistics. Let us do this using R.

To obtain the means of the variables in each data set, write

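# In current versions of R, mean() does not accept a whole data frame,
# so apply it to each column with sapply()
sapply(dataset1, mean)
      x1       y1
9.000000 7.500909
sapply(dataset2, mean)
      x2       y2
9.000000 7.500909
sapply(dataset3, mean)
 x3  y3
9.0 7.5
sapply(dataset4, mean)
      x4       y4
9.000000 7.500909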

The means of the x variables, as you can see, are identical across the four data sets, and the means of the y variables are practically identical.

Let us look at the standard deviations.

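# In current versions of R, sd() does not accept a whole data frame,
# so apply it to each column with sapply()
sapply(dataset1, sd)
      x1       y1
3.316625 2.031568
sapply(dataset2, sd)
      x2       y2
3.316625 2.031657
sapply(dataset3, sd)
      x3       y3
3.316625 2.030424
sapply(dataset4, sd)
      x4       y4
3.316625 2.030579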

The standard deviations, as you can see, are identical for the four x variables and practically identical for the four y variables.
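The means and standard deviations can also be collected into a single table, which makes the agreement across the four data sets easier to see. The sketch below uses R's built-in `anscombe` data frame rather than the vectors typed in earlier, on the assumption that it holds the same values as Table 3.3; the name `stats` is chosen here for illustration.

```r
# Built-in copy of the quartet: columns x1..x4 and y1..y4
data(anscombe)

# One row of means and one row of standard deviations, one column per variable
stats <- rbind(mean = sapply(anscombe, mean),
               sd   = sapply(anscombe, sd))

print(round(stats, 3))
```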

Calculating the mean and standard deviation is the usual way to summarize data. With these data, if this were all that we did, we would naively conclude that the four data sets are “equivalent,” since that is what the statistics say. But what do the statistics not say?

Investigating further, using graphical displays, gives a different picture. Pairwise plots would be the obvious exploratory technique to use with paired data.

par(mfrow = c(2, 2))
plot(x1, y1, xlim = c(0, 20), ylim = c(0, 13))
plot(x2, y2, xlim = c(0, 20), ylim = c(0, 13))
plot(x3, y3, xlim = c(0, 20), ylim = c(0, 13))
plot(x4, y4, xlim = c(0, 20), ylim = c(0, 13))

gives Fig. 3.20. Notice again the use of xlim and ylim to ensure that the scales on the axes are the same in the four plots, in order that a valid comparison can be made.

Figure 3.20 Plots of Four Data Sets with Same Means and Standard Deviations

Examining Fig. 3.20, we see that there are very great differences in the data sets:

1. Data set 1 is linear with some scatter;
2. Data set 2 is quadratic;
3. Data set 3 has an outlier. If the outlier were removed, the data would be linear;
4. Data set 4 contains x values that are equal except for one outlier. If the outlier were removed, the data would lie on a vertical line.
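The observations in the list above can be checked numerically as well as by eye. The following is a sketch only, not part of the original text: it assumes R's built-in `anscombe` data frame matches Table 3.3, and it takes the outlier in data set 3 to be the point with the largest y value and the outlier in data set 4 to be the point with the largest x value. The use of correlation and of a fitted quadratic as the checks is a choice made here.

```r
data(anscombe)  # assumption: built-in copy equals Table 3.3

# Data set 2: fit a quadratic in x2 and see how much variation it explains
quad_fit <- lm(y2 ~ x2 + I(x2^2), data = anscombe)
r2_quad <- summary(quad_fit)$r.squared

# Data set 3: drop the point with the largest y3, then measure linearity
out3 <- which.max(anscombe$y3)
r3_clean <- cor(anscombe$x3[-out3], anscombe$y3[-out3])

# Data set 4: drop the point with the largest x4 and list the x values left
out4 <- which.max(anscombe$x4)
x4_rest <- unique(anscombe$x4[-out4])

print(c(r2_quad = r2_quad, r3_clean = r3_clean))
print(x4_rest)  # a single value means the remaining points share one x
```

Values of `r2_quad` and `r3_clean` very close to 1 support the descriptions "quadratic" and "linear once the outlier is removed," and a single remaining x value in data set 4 confirms that the other ten points lie on a vertical line.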

Graphical displays are the core of getting “insight/feel” for the data. Such “insight/feel” does not come from the quantitative statistics; on the contrary, calculations of quantitative statistics should come after the exploratory data analysis using graphical displays.
