The Big R-Book - Philippe J. S. De Brouwer - Page 295
Note – A tibble is a special form of data-frame
A tibble and data frame will produce the same summaries.
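A quick way to convince oneself of this is to compare both summaries directly. The sketch below assumes that the note refers to the base function summary() and that the tibble package is installed; the names s_df and s_tbl are only illustrative.

```r
library(tibble)

# summary() of the plain data frame and of its tibble counterpart
s_df  <- summary(mtcars)
s_tbl <- summary(as_tibble(mtcars))

# Compare the two summary tables cell by cell
all.equal(s_df, s_tbl)
```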
We might want to produce specific information that follows the format of a table. To illustrate this, we start from the dataset mtcars and assume that we want to make a summary per brand for the top brands (defined as the brands that appear most frequently in our database).
library(tidyverse)  # not only for %>% but also for group_by, etc.

# In mtcars the type of the car is only in the row names,
# so we need to extract it to add it to the data
n <- rownames(mtcars)

# Now, add a column brand (use the first letters of the type)
t <- mtcars %>%
  mutate(brand = str_sub(n, 1, 4))  # add column
To achieve this, the function group_by() from dplyr will be very handy. Note that this function does not change the dataset as such; rather, it adds a layer of information about the grouping.
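The idea that grouping is only an extra layer of information can be checked with a few helper functions from dplyr. This is a minimal sketch; the object name by_cyl and the choice of cyl as grouping column are assumptions made for illustration.

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)   # grouped copy; mtcars itself is untouched

# The grouping is stored as metadata on the object
is_grouped_df(by_cyl)             # is this a grouped data frame?
group_vars(by_cyl)                # which column(s) define the groups?
n_groups(by_cyl)                  # how many groups are there?

# The data values are unchanged by the grouping
identical(dim(by_cyl), dim(mtcars))
identical(by_cyl$mpg, mtcars$mpg)
```

The grouping only starts to matter once a verb such as summarise() is applied to the grouped object.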
# First, we need to find out which are the most abundant brands
# in our dataset (set cutoff at 2: at least 2 cars in database)
top_brands <- count(t, brand) %>%
  filter(n >= 2)

# top_brands is not simplified to a vector in the tidyverse
print(top_brands)
## # A tibble: 5 x 2
##   brand     n
##   <chr> <int>
## 1 Fiat      2
## 2 Horn      2
## 3 Mazd      2
## 4 Merc      7
## 5 Toyo      2

grouped_cars <- t %>%                      # start with cars
  filter(brand %in% top_brands$brand) %>%  # only top-brands
  group_by(brand) %>%
  summarise(
    avgDSP = round(mean(disp), 1),
    avgCYL = round(mean(cyl), 1),
    minMPG = min(mpg),
    medMPG = median(mpg),
    avgMPG = round(mean(mpg), 2),
    maxMPG = max(mpg)
  )
print(grouped_cars)
## # A tibble: 5 x 7
##   brand avgDSP avgCYL minMPG medMPG avgMPG maxMPG
##   <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Fiat    78.8    4     27.3   29.8   29.8   32.4
## 2 Horn   309      7     18.7   20.0   20.0   21.4
## 3 Mazd   160      6     21     21     21     21
## 4 Merc   207.     6.3   15.2   17.8   19.0   24.4
## 5 Toyo    95.6    4     21.5   27.7   27.7   33.9

Table 8.2: Summary information based on the dataset mtcars.

brand | avgDSP | avgCYL | minMPG | medMPG | avgMPG | maxMPG |
Fiat  |   78.9 |    4.0 |   27.3 |  29.85 |  29.85 |   32.4 |
Horn  |  309.0 |    7.0 |   18.7 |  20.05 |  20.05 |   21.4 |
Mazd  |  160.0 |    6.0 |   21.0 |  21.00 |  21.00 |   21.0 |
Merc  |  207.2 |    6.3 |   15.2 |  17.80 |  19.01 |   24.4 |
Toyo  |   95.6 |    4.0 |   21.5 |  27.70 |  27.70 |   33.9 |
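As a cross-check, one row of the summary can be recomputed with base R only. The sketch below does this for the brand label Merc; the names cars, merc and merc_row are illustrative and not part of the original code. The result should agree with the Merc row of Table 8.2.

```r
# Rebuild the brand label with base R (first four letters of the row name)
cars <- mtcars
cars$brand <- substr(rownames(cars), 1, 4)

# Keep the rows labelled "Merc" and recompute the summary statistics
merc <- cars[cars$brand == "Merc", ]
merc_row <- c(
  avgDSP = round(mean(merc$disp), 1),
  avgCYL = round(mean(merc$cyl), 1),
  minMPG = min(merc$mpg),
  medMPG = median(merc$mpg),
  avgMPG = round(mean(merc$mpg), 2),
  maxMPG = max(merc$mpg)
)
print(merc_row)
```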
The sections on knitr and rmarkdown (respectively Chapter 33 on page 703 and Chapter 32 on page 699) will explain how to convert the output of grouped_cars via the function kable() into Table 8.2.
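Ahead of those chapters, the conversion can be sketched in a few lines. This is only a rough illustration, assuming that the packages dplyr and knitr are installed; it recreates a reduced version of the summary (two columns only) so that it runs on its own, and the names cars, top and tbl are illustrative.

```r
library(dplyr)
library(knitr)

# Recreate a reduced summary so that this sketch runs on its own
cars <- mtcars
cars$brand <- substr(rownames(cars), 1, 4)
top <- names(which(table(cars$brand) >= 2))   # brands with at least 2 cars

grouped_cars <- cars %>%
  filter(brand %in% top) %>%
  group_by(brand) %>%
  summarise(avgDSP = round(mean(disp), 1),
            avgMPG = round(mean(mpg), 2))

# kable() turns the tibble into a table in markup form; inside an
# R Markdown document the result is rendered as a formatted table
tbl <- kable(grouped_cars,
             caption = "Summary information based on the dataset mtcars.")
print(tbl)
```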
There are a few things about group_by() and summarise() that should be noted in order to make working with them easier. For example, summarise() works opposite to group_by() and hence peels back one layer of existing grouping; it is possible to use expressions in group_by(); and new groups replace existing ones by default. These aspects are illustrated in the following code.
# Each call to summarise() removes a layer of grouping:
by_vs_am <- mtcars %>% group_by(vs, am)
by_vs <- by_vs_am %>% summarise(n = n())
by_vs
## # A tibble: 4 x 3
## # Groups:   vs [2]
##      vs    am     n
##   <dbl> <dbl> <int>
## 1     0     0    12
## 2     0     1     6
## 3     1     0     7
## 4     1     1     7

by_vs %>% summarise(n = sum(n))
## # A tibble: 2 x 2
##      vs     n
##   <dbl> <int>
## 1     0    18
## 2     1    14

# To remove grouping, use ungroup:
by_vs %>%
  ungroup() %>%
  summarise(n = sum(n))
## # A tibble: 1 x 1
##       n
##   <int>
## 1    32

# You can group by expressions: this is just short-hand for
# a mutate/rename followed by a simple group_by:
mtcars %>% group_by(vsam = vs + am)
## # A tibble: 32 x 12
## # Groups:   vsam [3]
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1
##  2  21       6  160    110  3.9   2.88  17.0     0     1
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0
##  8  24.4     4  147.    62  3.69  3.19  20       1     0
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0
## # … with 22 more rows, and 3 more variables:
## #   gear <dbl>, carb <dbl>, vsam <dbl>

# By default, group_by overrides existing grouping:
mtcars %>%
  group_by(cyl) %>%
  group_by(vs, am) %>%
  group_vars()
## [1] "vs" "am"

# Use add = TRUE to append grouping levels:
mtcars %>%
  group_by(cyl) %>%
  group_by(vs, am, add = TRUE) %>%
  group_vars()
## [1] "cyl" "vs"  "am"
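Since dplyr 1.0.0 the argument add of group_by() is deprecated in favour of .add, and summarise() has an argument .groups that controls which grouping remains on the result. The following sketch assumes dplyr 1.0.0 or later; the names vars_added and counts are illustrative.

```r
library(dplyr)   # assumes dplyr >= 1.0.0

# .add = TRUE appends to an existing grouping (successor of add = TRUE)
vars_added <- mtcars %>%
  group_by(cyl) %>%
  group_by(vs, am, .add = TRUE) %>%
  group_vars()
vars_added

# .groups = "drop" returns an ungrouped result in a single step,
# instead of peeling back only the last grouping level
counts <- mtcars %>%
  group_by(vs, am) %>%
  summarise(n = n(), .groups = "drop")
group_vars(counts)   # no grouping variables remain
```

Stating .groups explicitly also makes it clear to the reader which grouping is expected after the call.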