5 Summarizing Data

5.1 Learning Objectives

This week we will:

  • Practice summarizing large datasets by groups
  • Tidy and untidy data with pivot_longer and pivot_wider
  • Join datasets using different _join functions

5.2 Introduction

This week we’re focused on building our ability to analyze data. We’ll incorporate new functions from the dplyr package, and explore table joins.

The information below augments the readings. Following this, you’ll be in good shape to start this week’s exercises. Let’s get started!

5.2.1 Readings (complete by class on Monday)

Required:

As you read through, my suggestion is to have the markdown version of this document open in Posit where you can take notes on important functions/concepts and also code along with the examples. This way you’ll get a head start on your cheat sheet for this week. The material below summarizes the readings and provides different examples, but feel free to copy in examples from the reading and tinker with them a bit.

These optional readings will also be useful follow-up:

  • dplyr lesson in Data Carpentry
  • dplyr vignette (long form example document): if you’re struggling, this provides additional examples that you can try on your own (and as a bonus they use a Star Wars data set). Package vignettes like this are an amazing resource for learning new packages, and evaluating their capabilities, once you get the basics.
  • tidyr pivot vignette: again provides some nice examples of the pivot_longer and pivot_wider functions with different data sets (although none quite as fun as the Star Wars data, alas).

5.3 group_by function: Summarizing info based on type/characteristic

Based on reading from R for Data Science, section 3.5: Groups

group_by(data, column_to_group_by)

or with pipes

data |> group_by(column_to_group_by)

  • combines rows into groups based on a column characteristic
  • provides the ability to summarize information based on one or more variables in your dataset when combined with summarize

Let’s start by loading the datasets package within RStudio. We’re going to explore the sleep data set.

library(datasets)
library(dplyr)
sleep
##    extra group ID
## 1    0.7     1  1
## 2   -1.6     1  2
## 3   -0.2     1  3
## 4   -1.2     1  4
## 5   -0.1     1  5
## 6    3.4     1  6
## 7    3.7     1  7
## 8    0.8     1  8
## 9    0.0     1  9
## 10   2.0     1 10
## 11   1.9     2  1
## 12   0.8     2  2
## 13   1.1     2  3
## 14   0.1     2  4
## 15  -0.1     2  5
## 16   4.4     2  6
## 17   5.5     2  7
## 18   1.6     2  8
## 19   4.6     2  9
## 20   3.4     2 10
?sleep

The dataset has 3 variables:

  • extra is a numeric variable representing the increase in hours of sleep
  • group is a categorical factor variable representing the drug given
  • ID is another factor for the patient ID

Suppose you want to know how the two drug treatment groups varied with respect to the extra hours of sleep. First we group the data by group, the drug given.

sleep |> group_by(group)
## # A tibble: 20 × 3
## # Groups:   group [2]
##    extra group ID   
##    <dbl> <fct> <fct>
##  1   0.7 1     1    
##  2  -1.6 1     2    
##  3  -0.2 1     3    
##  4  -1.2 1     4    
##  5  -0.1 1     5    
##  6   3.4 1     6    
##  7   3.7 1     7    
##  8   0.8 1     8    
##  9   0   1     9    
## 10   2   1     10   
## 11   1.9 2     1    
## 12   0.8 2     2    
## 13   1.1 2     3    
## 14   0.1 2     4    
## 15  -0.1 2     5    
## 16   4.4 2     6    
## 17   5.5 2     7    
## 18   1.6 2     8    
## 19   4.6 2     9    
## 20   3.4 2     10

Note that this looks the same as sleep, only now with Groups: group [2] listed as an attribute.
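Grouping doesn’t change the underlying values; it just attaches grouping metadata that downstream dplyr verbs respect. A quick sketch to inspect or remove that metadata (output not shown):

sleep_grouped <- sleep |> group_by(group)
group_vars(sleep_grouped) # reports the grouping variable, "group"
ungroup(sleep_grouped)    # drops the grouping attribute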

Now you can use the summarize function to calculate statistics for each group within the dataset. Here we only have 2 groups, so it’s a simple example.

# summarize the number of rows for each group
sleep |>
  group_by(group) |>
  summarize(N = n())
## # A tibble: 2 × 2
##   group     N
##   <fct> <int>
## 1 1        10
## 2 2        10

We can save this summary table as a new object

sleep_N <-
  sleep |>
  group_by(group) |>
  summarize(N = n())
sleep_N
## # A tibble: 2 × 2
##   group     N
##   <fct> <int>
## 1 1        10
## 2 2        10

The result here is trivial, but imagine if you had a larger dataset, like the Dipodomys survey from last week. One technique in coding is to start with something small, where you can easily hand-calculate the answer.

Let’s look at other functions you can use within summarize, starting with median to find the median number of extra sleep hours:

sleep |>
  group_by(group) |>
  summarize(median_extra_sleep = median(extra))
## # A tibble: 2 × 2
##   group median_extra_sleep
##   <fct>              <dbl>
## 1 1                   0.35
## 2 2                   1.75
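summarize isn’t limited to one statistic at a time; you can compute several in a single call. A quick sketch (output not shown):

sleep |>
  group_by(group) |>
  summarize(mean_extra = mean(extra), # average extra sleep per group
            sd_extra = sd(extra),     # spread within each group
            N = n())                  # number of rows per group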

Now let’s see what happens when there is missing data, by changing a value to NA:

# introduce a missing value by overwriting the first `extra` measurement
sleep[1, 1] <- NA # note: `=` also assigns, but the assignment arrows `<-` and
# `->` are generally best coding practice, as they distinguish assigning
# objects from setting function arguments

Now, if we run the summarize command for the median sleep, we’ll obtain an NA value:

sleep |>
  group_by(group) |>
  summarize(median_sleep = median(extra))
## # A tibble: 2 × 2
##   group median_sleep
##   <fct>        <dbl>
## 1 1            NA   
## 2 2             1.75

So what can we do? Start by adding the na.rm = TRUE argument, which tells median to remove NA values before computing.

sleep |>
  group_by(group) |>
  summarize(median_sleep = median(extra, na.rm = TRUE))
## # A tibble: 2 × 2
##   group median_sleep
##   <fct>        <dbl>
## 1 1             0   
## 2 2             1.75

If you still had NaN or some other troublesome value (depending on your data set), you could filter the data first: filter(data, !is.na(variable)).
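For instance, dropping the missing extra values before summarizing (output not shown):

sleep |>
  filter(!is.na(extra)) |> # keep only rows where extra was recorded
  group_by(group) |>
  summarize(median_sleep = median(extra))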

5.4 Grouping with multiple variables

sleep is a really simple data set, but what if we have a slightly bigger data set with multiple variables we want to group by and summarize? Let’s look at the “Carbon Dioxide Uptake in Grass Plants” CO2 data set. This data set contains ambient CO2 concentration (conc) and CO2 uptake rate (uptake) measurements for the grass species Echinochloa crus-galli, originating from Quebec or Mississippi (Type), that either underwent a chilling treatment or not (Treatment).

head(CO2)
##   Plant   Type  Treatment conc uptake
## 1   Qn1 Quebec nonchilled   95   16.0
## 2   Qn1 Quebec nonchilled  175   30.4
## 3   Qn1 Quebec nonchilled  250   34.8
## 4   Qn1 Quebec nonchilled  350   37.2
## 5   Qn1 Quebec nonchilled  500   35.3
## 6   Qn1 Quebec nonchilled  675   39.2

The obvious questions to ask of this data are:

  • What effect does the origin/type of grass species have on CO2 uptake?
  • What effect does the chilling treatment have on CO2 uptake?
  • What effect does ambient CO2 conc have on CO2 uptake?

Let’s try to answer these questions with one long pipe! The 3rd question is really a regression, which we’ll save for after spring break, but we can approximate an answer by separating the conc vector into a few groups and summarizing the uptake for each group.

So let’s make a plan. We want to group by Type and Treatment as well as conc, but first we need to make a simplified group_conc variable. Remember we use mutate to make new variables. case_when and cut are convenient functions to use within mutate to make categorical variables from numerical ones (we’ll use case_when here; a cut version appears at the end of this section). The easiest way to figure out how case_when works is to check out the Examples section in ?case_when. So we will,

  1. Make a new group_conc to simplify data a bit (let’s say 95 ~ “low” and 1000 ~ “high”).
  2. Group by Type, Treatment, and group_conc.
  3. Summarize the mean and standard deviation CO2 uptake of each group.

This table will allow us to compare the average uptake values of each of these groups. All we need to do to summarize all of these combinations of variables is pass the three variable names to group_by!

CO2 |>
  mutate(group_conc = case_when(conc == 1000 ~ "high", 
                               conc == 95 ~ "low", 
                               .default = NA)) |>
  group_by(Type, Treatment, group_conc) |>
  summarise(mean_uptake = mean(uptake), 
            sd_uptake = sd(uptake))
## `summarise()` has grouped output by 'Type', 'Treatment'. You can override using
## the `.groups` argument.
## # A tibble: 12 × 5
## # Groups:   Type, Treatment [4]
##    Type        Treatment  group_conc mean_uptake sd_uptake
##    <fct>       <fct>      <chr>            <dbl>     <dbl>
##  1 Quebec      nonchilled high              43.2      3.06
##  2 Quebec      nonchilled low               15.3      1.45
##  3 Quebec      nonchilled <NA>              37.8      4.91
##  4 Quebec      chilled    high              40.8      1.91
##  5 Quebec      chilled    low               12.9      3.12
##  6 Quebec      chilled    <NA>              33.7      5.72
##  7 Mississippi nonchilled high              31.6      3.85
##  8 Mississippi nonchilled low               11.3      0.7 
##  9 Mississippi nonchilled <NA>              27.8      4.45
## 10 Mississippi chilled    high              18.7      3.88
## 11 Mississippi chilled    low                9.6      1.65
## 12 Mississippi chilled    <NA>              16.5      3.23
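About that message: by default summarise drops only the last grouping variable (group_conc here) and keeps the rest. If you want a plain ungrouped result back, pass the .groups argument, as in this sketch (output not shown):

CO2 |>
  mutate(group_conc = case_when(conc == 1000 ~ "high",
                                conc == 95 ~ "low",
                                .default = NA)) |>
  group_by(Type, Treatment, group_conc) |>
  summarise(mean_uptake = mean(uptake),
            sd_uptake = sd(uptake),
            .groups = "drop") # return an ungrouped tibble and silence the message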

To make it easier for us to draw some conclusions from this, we can clean it up a little by removing the NA rows with filter and arranging by mean_uptake.

CO2 |>
  mutate(group_conc = case_when(conc == 1000 ~ "high", 
                               conc == 95 ~ "low", 
                               .default = NA)) |>
  group_by(Type, Treatment, group_conc) |>
  summarise(mean_uptake = mean(uptake), 
            sd_uptake = sd(uptake)) |>
  filter(!is.na(group_conc)) |>
  arrange(mean_uptake)
## `summarise()` has grouped output by 'Type', 'Treatment'. You can override using
## the `.groups` argument.
## # A tibble: 8 × 5
## # Groups:   Type, Treatment [4]
##   Type        Treatment  group_conc mean_uptake sd_uptake
##   <fct>       <fct>      <chr>            <dbl>     <dbl>
## 1 Mississippi chilled    low                9.6      1.65
## 2 Mississippi nonchilled low               11.3      0.7 
## 3 Quebec      chilled    low               12.9      3.12
## 4 Quebec      nonchilled low               15.3      1.45
## 5 Mississippi chilled    high              18.7      3.88
## 6 Mississippi nonchilled high              31.6      3.85
## 7 Quebec      chilled    high              40.8      1.91
## 8 Quebec      nonchilled high              43.2      3.06

Now looking at this final summary table we can see Quebec varieties had higher uptake rates than Mississippi ones, the chilling treatment reduced uptake rates, and high ambient concentrations increased uptake rates.
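As promised, here is a sketch of the same grouping done with cut, which bins a numeric vector into intervals instead of matching exact values (the break points here are just illustrative; output not shown):

CO2 |>
  mutate(group_conc = cut(conc,
                          breaks = c(0, 175, 675, Inf),         # interval edges
                          labels = c("low", "mid", "high"))) |> # one label per interval
  group_by(Type, Treatment, group_conc) |>
  summarise(mean_uptake = mean(uptake), .groups = "drop")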

5.5 Tidying and untidying data with pivot_ functions

Based on reading from R for Data Science, section 5.3 and 5.4: Lengthening and Widening data

Data can come your way in untidy forms which you will need to tidy up for analysis. Also, occasionally you may want to intentionally untidy data to do some analyses, or present the data in a shorter form table. For these tasks, so long as the data is organized systematically, you can make use of the tidyr package functions pivot_longer or pivot_wider.

For tidying untidy data, we can use pivot_longer. A common case is a variable encoded in the column names, like “cell_growth_5ug_ml_cefo”, where the column holds cell growth measurements (or “observations”) at a specific concentration of an antibiotic, 5 µg/mL cefotaxime in this case.

The reading uses a fun billboard data set.

library(tidyr)
billboard
## # A tibble: 317 × 79
##    artist     track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##    <chr>      <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2 Pac      Baby… 2000-02-26      87    82    72    77    87    94    99    NA
##  2 2Ge+her    The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
##  3 3 Doors D… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
##  4 3 Doors D… Loser 2000-10-21      76    76    72    69    67    65    55    59
##  5 504 Boyz   Wobb… 2000-04-15      57    34    25    17    17    31    36    49
##  6 98^0       Give… 2000-08-19      51    39    34    26    26    19     2     2
##  7 A*Teens    Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
##  8 Aaliyah    I Do… 2000-01-29      84    62    51    41    38    35    35    38
##  9 Aaliyah    Try … 2000-03-18      59    53    38    28    21    18    16    14
## 10 Adams, Yo… Open… 2000-08-26      76    76    74    69    68    67    61    58
## # ℹ 307 more rows
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …

Looking at this dataset we can see each row is a song, and there is a column for each week of the year containing the position of the song on the billboard chart that week. Definitely not tidy, right?! The “week” columns are actually values of a variable themselves. An easy way to identify this sort of untidiness is that the column names provide no indication of what values they contain.

To tidy this we want to make the wide table of weeks into a long table that has “week” as a variable and “position” as a variable. This would be an enormous task by hand, but fortunately pivot_longer makes it pretty easy. The first argument to pivot_longer is cols, the set of column names you want to condense into a single column. To the cols argument we can pass helper “tidy-select” functions, which are:

  • starts_with("a"): names that start with "a".

  • ends_with("z"): names that end with "z".

  • contains("b"): names that contain "b".

  • matches("x.y"): names that match regular expression x.y.

  • num_range(x, 1:4): names following the pattern, x1, x2, …, x4.

  • all_of(vars): all variables named in the character vector vars (errors if any are missing; any_of() skips the missing ones instead).

  • everything(): all variables.

  • last_col(): furthest column on the right.

  • where(is.numeric): all variables where is.numeric() returns TRUE.
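These helpers also work in select(), which is a handy way to test them before pivoting. A couple of quick sketches (output not shown):

billboard |> select(artist, track, num_range("wk", 1:4)) # wk1 through wk4
billboard |> select(!starts_with("wk"))                  # everything except the week columns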

Let’s use starts_with() to grab all the columns that start with “wk”. The other arguments that we need are names_to, which names the new column that will hold the old column names, and values_to, which names the new column that will hold the cell values.

billboard |> 
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    values_to = "rank"
  )
## # A tibble: 24,092 × 5
##    artist track                   date.entered week   rank
##    <chr>  <chr>                   <date>       <chr> <dbl>
##  1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1      87
##  2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2      82
##  3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3      72
##  4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4      77
##  5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5      87
##  6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6      94
##  7 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk7      99
##  8 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk8      NA
##  9 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk9      NA
## 10 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk10     NA
## # ℹ 24,082 more rows

Note how long this longer table is now, 24 thousand rows! This tidied dataset can now be analyzed and plotted with ease.
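For example, with the data in long form we can reuse group_by and summarize from earlier in the chapter, say to find each song’s best (lowest) chart position. A sketch (output not shown):

billboard |>
  pivot_longer(cols = starts_with("wk"),
               names_to = "week",
               values_to = "rank") |>
  filter(!is.na(rank)) |>             # drop weeks a song was not on the chart
  group_by(artist, track) |>
  summarize(best_rank = min(rank)) |> # each song's highest position
  arrange(best_rank)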

However, we can’t easily fit this dataset (or at least a few songs of it) onto a page in its tidy form. For that we need to widen it. We rarely need pivot_wider except when trying to display data in a more convenient table. For example, we could pivot_wider our billboard dataset back to the original. The arguments are just the opposite: names_from = "week" and values_from = "rank".

billboard |> 
  pivot_longer(
    cols = starts_with("wk"), 
    names_to = "week", 
    values_to = "rank"
  ) |>
  pivot_wider(
    names_from = "week", 
    values_from = "rank"
  )
## # A tibble: 317 × 79
##    artist     track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##    <chr>      <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2 Pac      Baby… 2000-02-26      87    82    72    77    87    94    99    NA
##  2 2Ge+her    The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
##  3 3 Doors D… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
##  4 3 Doors D… Loser 2000-10-21      76    76    72    69    67    65    55    59
##  5 504 Boyz   Wobb… 2000-04-15      57    34    25    17    17    31    36    49
##  6 98^0       Give… 2000-08-19      51    39    34    26    26    19     2     2
##  7 A*Teens    Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
##  8 Aaliyah    I Do… 2000-01-29      84    62    51    41    38    35    35    38
##  9 Aaliyah    Try … 2000-03-18      59    53    38    28    21    18    16    14
## 10 Adams, Yo… Open… 2000-08-26      76    76    74    69    68    67    61    58
## # ℹ 307 more rows
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …

5.6 Joins and Binds within dplyr

Based on reading from R for Data Science, chapter 19: Joins

Within Biological Systems Engineering and the larger field, there are many times when we use multiple tables to reduce redundant data, particularly as data is being collected. Imagine a very large table that repeats the same double-precision values over and over again, inflating the memory the dataset occupies. While modern computers are incredible in their storage capacity, I can attest that processing speed and memory allocation are, and will remain, considerations in your work. Here are some examples of such tasks:

  1. Perhaps, your experiment spans many days and you collected each day’s data in a different sheet that you now want to compile.

  2. Perhaps, you didn’t want to write down complete experimental conditions for each sample and instead used a code or number to indicate different conditions, now you want to add the full info to the table. Or maybe you measured something at a place over time and want to add weather station data as new columns.

The above are two different tasks: (1) simply stitching datasets together is a bind_, whereas the examples in (2) add new variables based on the values in some key/code variable, like the experiment code or date, which is a _join.

In dplyr we can use bind_rows to add new observations to a dataset, like binding multiple days’ data tables together. bind_rows matches columns by name, so check your column names carefully: a column present in one table but missing from another is filled with NA rather than raising an error.
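A toy sketch with made-up day tables (output not shown):

day1 <- tibble(sample = c("A", "B"), od600 = c(0.21, 0.35))
day2 <- tibble(sample = c("C", "D"), od600 = c(0.40, 0.52), temp_C = c(37, 37))
bind_rows(day1, day2) # stacks rows by column name; day1 rows get NA for temp_C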

You can also use bind_cols to add new columns to a dataset without any key or identifier variable, but this is risky, as there is no guarantee that the rows (observations) will be in the same order. It’s better to use a _join function which checks that the rows match based on shared variables.

Let’s illustrate joins using data from the Problem Set.

We’ll first read each of these into the workspace:

library(readr)
plots <- read_csv("data/plots.csv")
## Rows: 24 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): plot_type
## dbl (1): plot_id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
species <- read_csv("data/species.csv")
## Rows: 54 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): species_id, genus, species, taxa
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
surveys <- read_csv("data/surveys.csv")
## Rows: 35549 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): species_id, sex
## dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Note that when you do this, there are 24 plots, 54 types of species, but 35,549 lines of data in the surveys dataset!

These tables are all related. The question is: “What is the column that is shared by, or links, the plots and surveys datasets?”

plot_id!

What is the column that links the surveys dataset with the species?

species_id!

Let’s look at how we would combine these datasets. We’ll employ the function:

inner_join

From help:

The mutating joins add columns from dataset y to dataset x, matching rows based on the keys:

  • inner_join(x,y): includes all rows in x and y.

  • left_join(x,y): includes all rows in x.

  • right_join(x,y): includes all rows in y.

  • full_join(x,y): includes all rows in x or y. If a row in x matches multiple rows in y, all the rows in y will be returned once for each matching row in x.

The dplyr cheat sheet (also up in the Help>Cheat Sheets menu) demonstrates this visually.
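A toy sketch with made-up tables makes the differences concrete (output not shown):

x <- tibble(id = c(1, 2, 3), value_x = c("a", "b", "c"))
y <- tibble(id = c(2, 3, 4), value_y = c("B", "C", "D"))
inner_join(x, y, by = "id") # ids 2 and 3: only rows with a match in both
left_join(x, y, by = "id")  # ids 1-3: value_y is NA for id 1
full_join(x, y, by = "id")  # ids 1-4: NA wherever there is no match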

If you want to include the detailed names contained within the species table, use an inner join based on the common variable species_id, using the by argument to specify, as a character vector, the column name(s) that link the two datasets.

In this case, you end up with 3 additional variables that are within the species table added to the surveys data.

combo <- inner_join(surveys, species, by = "species_id")
head(combo)
## # A tibble: 6 × 12
##   record_id month   day  year plot_id species_id sex   hindfoot_length weight
##       <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>  <dbl>
## 1         1     7    16  1977       2 NL         M                  32     NA
## 2         2     7    16  1977       3 NL         M                  33     NA
## 3         3     7    16  1977       2 DM         F                  37     NA
## 4         4     7    16  1977       7 DM         M                  36     NA
## 5         5     7    16  1977       3 DM         M                  35     NA
## 6         6     7    16  1977       1 PF         M                  14     NA
## # ℹ 3 more variables: genus <chr>, species <chr>, taxa <chr>

If we wanted to join all three tables together, we could add plots to the first combo:

combo2 <- inner_join(combo, plots, by = "plot_id")

Or use pipes. This is actually more efficient because you aren’t creating the intermediate combo object, which takes up memory; plus, this way you won’t accidentally mix up combo and combo2. As always, more descriptive names would be better!

combo2 <- surveys %>% # oops, %>% is the older magrittr pipe; |> is the base R equivalent
  inner_join(species, by = "species_id") |>
  inner_join(plots, by = "plot_id")
head(combo2)
## # A tibble: 6 × 13
##   record_id month   day  year plot_id species_id sex   hindfoot_length weight
##       <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>  <dbl>
## 1         1     7    16  1977       2 NL         M                  32     NA
## 2         2     7    16  1977       3 NL         M                  33     NA
## 3         3     7    16  1977       2 DM         F                  37     NA
## 4         4     7    16  1977       7 DM         M                  36     NA
## 5         5     7    16  1977       3 DM         M                  35     NA
## 6         6     7    16  1977       1 PF         M                  14     NA
## # ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>

While it is best practice to specify the columns you are using to match the datasets (sometimes called keys or unique identifiers), the _join functions are pretty smart and will find the variables that match between the datasets automatically as well!

surveys |>
  inner_join(species) |>
  inner_join(plots) |>
  head() 
## Joining with `by = join_by(species_id)`
## Joining with `by = join_by(plot_id)`
## # A tibble: 6 × 13
##   record_id month   day  year plot_id species_id sex   hindfoot_length weight
##       <dbl> <dbl> <dbl> <dbl>   <dbl> <chr>      <chr>           <dbl>  <dbl>
## 1         1     7    16  1977       2 NL         M                  32     NA
## 2         2     7    16  1977       3 NL         M                  33     NA
## 3         3     7    16  1977       2 DM         F                  37     NA
## 4         4     7    16  1977       7 DM         M                  36     NA
## 5         5     7    16  1977       3 DM         M                  35     NA
## 6         6     7    16  1977       1 PF         M                  14     NA
## # ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>