5 Summarizing Data
5.1 Learning Objectives
This week we will:
- Practice summarizing large datasets by groups
- Tidy and untidy data with `pivot_longer` and `pivot_wider`
- Join datasets using different `_join` functions
5.2 Introduction
This week we’re focused on building our ability to analyze data. We’ll incorporate new functions from the dplyr package, and explore table joins.
The information below augments the readings. Following this, you’ll be in good shape to start this week’s exercises. Let’s get started!
5.2.1 Readings (complete by class on Monday)
Required:
- R for Data Science, section 3.5: Groups
- R for Data Science, sections 5.3 and 5.4: Lengthening and Widening data
- R for Data Science, chapter 19: Joins
As you read through, my suggestion is to have the markdown version of this document open in Posit, where you can take notes on important functions/concepts and also code along with the examples. This way you’ll get a head start on your cheat sheet for this week. The text below summarizes the readings and provides different examples, but feel free to copy in examples from the reading and tinker with them a bit.
These optional readings will also be useful follow-up:

- `dplyr` lesson in Data Carpentry
- `dplyr` vignette (long-form example document): if you’re struggling, this provides additional examples that you can try on your own (and as a bonus they use a Star Wars data set). Package vignettes like this are an amazing resource for learning new packages, and for evaluating their capabilities, once you get the basics.
- `tidyr` pivot vignette: again provides some nice examples of the `pivot_longer` and `pivot_wider` functions with a different data set (although none quite as fun as the Star Wars data, alas).
5.3 `group_by` function: Summarizing info based on type/characteristic
Based on reading from R for Data Science, section 3.5: Groups
group_by(data, column_to_group_by)
or with pipes
data |> group_by(column_to_group_by)
- combines rows into groups based on a column characteristic
- provides the ability to summarize information based on one or more variables in your dataset when combined with `summarize`
Let’s start by activating the datasets package within RStudio. We’re going to explore the `sleep` data set.
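If you’re coding along, the following should reproduce the printout below (the datasets package ships with base R and is typically attached already; loading it explicitly does no harm):

library(datasets) # attached by default in most R sessions
sleep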
## extra group ID
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## 4 -1.2 1 4
## 5 -0.1 1 5
## 6 3.4 1 6
## 7 3.7 1 7
## 8 0.8 1 8
## 9 0.0 1 9
## 10 2.0 1 10
## 11 1.9 2 1
## 12 0.8 2 2
## 13 1.1 2 3
## 14 0.1 2 4
## 15 -0.1 2 5
## 16 4.4 2 6
## 17 5.5 2 7
## 18 1.6 2 8
## 19 4.6 2 9
## 20 3.4 2 10
The dataset has 3 variables:

- `extra` is a numeric variable representing the increase in hours of sleep
- `group` is a categorical factor variable representing the drug given
- `ID` is another factor for the patient ID
Suppose you want to know how the 2 drug treatment groups varied with respect to the extra hours of sleep. First we group the data by `group`, the drug given.
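A minimal sketch of that grouping step (`group_by` comes from dplyr, so it must be loaded):

library(dplyr)
sleep |> group_by(group)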
## # A tibble: 20 × 3
## # Groups: group [2]
## extra group ID
## <dbl> <fct> <fct>
## 1 0.7 1 1
## 2 -1.6 1 2
## 3 -0.2 1 3
## 4 -1.2 1 4
## 5 -0.1 1 5
## 6 3.4 1 6
## 7 3.7 1 7
## 8 0.8 1 8
## 9 0 1 9
## 10 2 1 10
## 11 1.9 2 1
## 12 0.8 2 2
## 13 1.1 2 3
## 14 0.1 2 4
## 15 -0.1 2 5
## 16 4.4 2 6
## 17 5.5 2 7
## 18 1.6 2 8
## 19 4.6 2 9
## 20 3.4 2 10
Note that this looks the same as `sleep`, only with `Groups: group [2]` as an attribute.
Now you can use the `summarize` function to calculate statistics for each group within the dataset. Here we only have 2 groups, so it’s a simple example.
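A command along these lines should produce the counts below; `n()` simply counts the rows in each group:

sleep |>
  group_by(group) |>
  summarize(N = n())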
## # A tibble: 2 × 2
## group N
## <fct> <int>
## 1 1 10
## 2 2 10
We can save this summary table as a new object:
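For example (the object name `sleep_summary` is just illustrative):

sleep_summary <- sleep |>
  group_by(group) |>
  summarize(N = n())
sleep_summary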
## # A tibble: 2 × 2
## group N
## <fct> <int>
## 1 1 10
## 2 2 10
The result here is trivial, but imagine if you had a larger dataset, like the Dipodomys survey from last week. One technique in coding is to start with something small, where you can easily hand-calculate the answer.
Let’s look at other functions you can use within `summarize` (see this list of basic stats functions: https://www.dummies.com/education/math/statistics/base-r-statistical-functions/):

`sum()`, `mean()`, `median()`, `min()`, `max()`, `range()`, `quantile()`, `var()`, `sd()`, `cor()`, `summary()`
Let’s look at the median number of extra sleep hours:
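Something like:

sleep |>
  group_by(group) |>
  summarize(median_extra_sleep = median(extra))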
## # A tibble: 2 × 2
## group median_extra_sleep
## <fct> <dbl>
## 1 1 0.35
## 2 2 1.75
Now let’s look at what would happen if there were missing data, by changing a value:
# introduce a missing value by overwriting the first `extra` measurement
sleep[1, 1] = NA # Oh, I used `=` here to assign this;
# the assignment arrows, `<-` or `->`, are generally best coding
# practice as they differentiate variables and objects from function
# arguments
Now, if we implement the summarize command for the median sleep, we’ll obtain an `NA` value:
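Using the same pattern as before:

sleep |>
  group_by(group) |>
  summarize(median_sleep = median(extra))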
## # A tibble: 2 × 2
## group median_sleep
## <fct> <dbl>
## 1 1 NA
## 2 2 1.75
So what can we do? Start by adding the `na.rm = TRUE` argument:
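For example:

sleep |>
  group_by(group) |>
  summarize(median_sleep = median(extra, na.rm = TRUE)) # na.rm = TRUE drops NAs before computing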
## # A tibble: 2 × 2
## group median_sleep
## <fct> <dbl>
## 1 1 0
## 2 2 1.75
If you still had `NaN` or some other troublesome value (depending on your data set), you could filter the data: `filter(data, !is.na(variable))`
5.4 Grouping with multiple variables
`sleep` is a really simple data set, but what if we have a slightly bigger data set with multiple variables we want to group by and summarize? Let’s look at the “Carbon Dioxide Uptake in Grass Plants” `CO2` data set. This data set contains ambient CO2 (`conc`) and CO2 uptake rate (`uptake`) measurements for the grass species Echinochloa crus-galli from Quebec or Mississippi (`Type`) that has either undergone a chilling treatment or not (`Treatment`).
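The first few rows, via:

head(CO2)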
## Plant Type Treatment conc uptake
## 1 Qn1 Quebec nonchilled 95 16.0
## 2 Qn1 Quebec nonchilled 175 30.4
## 3 Qn1 Quebec nonchilled 250 34.8
## 4 Qn1 Quebec nonchilled 350 37.2
## 5 Qn1 Quebec nonchilled 500 35.3
## 6 Qn1 Quebec nonchilled 675 39.2
The obvious questions to ask of this data are:
- What effect does the origin/`Type` of grass species have on CO2 uptake?
- What effect does the chilling `Treatment` have on CO2 uptake?
- What effect does ambient CO2 `conc` have on CO2 uptake?
Let’s try to answer these questions with one long pipe! The 3rd question is ideally a regression, which we’ll save for after spring break, but we could separate the `conc` vector into a few groups and summarize the `uptake` for each group.
So let’s make a plan. We want to group by `Type` and `Treatment` as well as `conc`, but first we need to make a simplified `group_conc` variable. Remember we use `mutate` to make new variables. `case_when` or `cut` are convenient functions to use in `mutate` to make categorical variables from numerical variables. The easiest way to figure out how `case_when` works is to check out the Examples section in `?case_when`. So we will:
- Make a new `group_conc` variable to simplify the data a bit (let’s say 95 ~ “low” and 1000 ~ “high”).
- Group by `Type`, `Treatment`, and `group_conc`.
- Summarize the mean and standard deviation of CO2 uptake for each group.
This table will allow us to compare the average uptake values of each of these groups. All we need to do to summarize all of these combinations of variables is pass the three variable names to `group_by`!
CO2 |>
  mutate(group_conc = case_when(conc == 1000 ~ "high",
                                conc == 95 ~ "low",
                                .default = NA)) |>
  group_by(Type, Treatment, group_conc) |>
  summarise(mean_uptake = mean(uptake),
            sd_uptake = sd(uptake))
## `summarise()` has grouped output by 'Type', 'Treatment'. You can override using
## the `.groups` argument.
## # A tibble: 12 × 5
## # Groups: Type, Treatment [4]
## Type Treatment group_conc mean_uptake sd_uptake
## <fct> <fct> <chr> <dbl> <dbl>
## 1 Quebec nonchilled high 43.2 3.06
## 2 Quebec nonchilled low 15.3 1.45
## 3 Quebec nonchilled <NA> 37.8 4.91
## 4 Quebec chilled high 40.8 1.91
## 5 Quebec chilled low 12.9 3.12
## 6 Quebec chilled <NA> 33.7 5.72
## 7 Mississippi nonchilled high 31.6 3.85
## 8 Mississippi nonchilled low 11.3 0.7
## 9 Mississippi nonchilled <NA> 27.8 4.45
## 10 Mississippi chilled high 18.7 3.88
## 11 Mississippi chilled low 9.6 1.65
## 12 Mississippi chilled <NA> 16.5 3.23
To make it easier for us to draw some conclusions from this, we can clean it up a little bit by removing the `NA` rows and `arrange`-ing by `mean_uptake`.
CO2 |>
  mutate(group_conc = case_when(conc == 1000 ~ "high",
                                conc == 95 ~ "low",
                                .default = NA)) |>
  group_by(Type, Treatment, group_conc) |>
  summarise(mean_uptake = mean(uptake),
            sd_uptake = sd(uptake)) |>
  filter(!is.na(group_conc)) |>
  arrange(mean_uptake)
## `summarise()` has grouped output by 'Type', 'Treatment'. You can override using
## the `.groups` argument.
## # A tibble: 8 × 5
## # Groups: Type, Treatment [4]
## Type Treatment group_conc mean_uptake sd_uptake
## <fct> <fct> <chr> <dbl> <dbl>
## 1 Mississippi chilled low 9.6 1.65
## 2 Mississippi nonchilled low 11.3 0.7
## 3 Quebec chilled low 12.9 3.12
## 4 Quebec nonchilled low 15.3 1.45
## 5 Mississippi chilled high 18.7 3.88
## 6 Mississippi nonchilled high 31.6 3.85
## 7 Quebec chilled high 40.8 1.91
## 8 Quebec nonchilled high 43.2 3.06
Now looking at this final summary table, we can see that Quebec varieties had higher uptake rates than Mississippi varieties, the chilling treatment reduced uptake rates, and high ambient concentrations increased uptake rates.
5.5 Tidying and untidying data with `pivot_` functions
Based on reading from R for Data Science, sections 5.3 and 5.4: Lengthening and Widening data
Data can come your way in untidy forms which you will need to tidy up for analysis. Also, occasionally you may want to intentionally untidy data to do some analyses, or to present the data in a shorter-form table. For these tasks, so long as the data is organized systematically, you can make use of the tidyr package functions `pivot_longer` or `pivot_wider`.
For tidying untidy data, we can use `pivot_longer`. A common case is a variable encoded in column names, like “cell_growth_5ug_ml_cefo”, in which a column holds cell growth measurements (or “observations”) at a specific concentration of an antibiotic, 5 µg/mL cefotaxime in this case.
The reading uses a fun `billboard` data set.
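Printing it (after loading tidyr) gives:

library(tidyr)
billboard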
## # A tibble: 317 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
## 7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA
## 8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38
## 9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14
## 10 Adams, Yo… Open… 2000-08-26 76 76 74 69 68 67 61 58
## # ℹ 307 more rows
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …
Looking at this dataset we can see each row is a song, and there are columns for each week of the year which contain the position of the song on the billboard chart. Definitely not tidy, right?! The “week” columns are actually a variable themselves. An easy way to identify this sort of untidiness is that the column names provide no indication of what values they contain.
To tidy this we want to make the wide table of weeks into a long table that has “week” as a variable and “position” as a variable. This would be an enormous task to do by hand, but fortunately `pivot_longer` makes it pretty easy. The first argument to `pivot_longer` is `cols`, the list of column names you want to condense into a single column. To the `cols` argument we can pass helper “tidy-select” functions, which are:
- `starts_with("a")`: names that start with "a".
- `ends_with("z")`: names that end with "z".
- `contains("b")`: names that contain "b".
- `matches("x.y")`: names that match the regular expression x.y.
- `num_range(x, 1:4)`: names following the pattern x1, x2, …, x4.
- `all_of(vars)`: all the variables named in the character vector vars (use `any_of(vars)` to match just the ones that exist).
- `everything()`: all variables.
- `last_col()`: the furthest column on the right.
- `where(is.numeric)`: all variables where `is.numeric()` returns TRUE.
Let’s use `starts_with()` to grab all the columns that start with “wk”. The other arguments that we need are `names_to` and `values_to`:
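These name the new columns that will hold the old column names and their values, respectively. A sketch of the call that produces the long table below:

billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank"
  )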
## # A tibble: 24,092 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
## 8 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk8 NA
## 9 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk9 NA
## 10 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk10 NA
## # ℹ 24,082 more rows
Note how long this longer table is now: 24 thousand rows! This tidied dataset can now be analyzed and plotted with ease.
However, we can’t fit this dataset (or even a few songs) onto a page easily in its tidy form. For that we need to widen. It is rarely the case that we need to use `pivot_wider` except when trying to display data in a more convenient table. We could `pivot_wider` our billboard dataset back to the original, for example. The arguments are just the opposite: `names_from = "week"` and `values_from = "rank"`.
billboard |>
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank"
  ) |>
  pivot_wider(
    names_from = "week",
    values_from = "rank"
  )
## # A tibble: 317 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
## 7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA
## 8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38
## 9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14
## 10 Adams, Yo… Open… 2000-08-26 76 76 74 69 68 67 61 58
## # ℹ 307 more rows
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …
5.6 Joins and Binds within dplyr
Based on reading from R for Data Science, chapter 19: Joins
Within Biological Systems Engineering and the larger field, there are many times when we use multiple tables to reduce redundant data, particularly as data is being collected. Imagine you have a very large table that repeats double-precision data over and over again, resulting in a dataset that occupies far more computer memory than necessary. While modern computers are incredible in their storage capacity, I can attest that processing speed and memory allocation are, and will remain, considerations in your work. Here are some examples of such tasks:
1. Perhaps your experiment spans many days and you collected each day’s data in a different sheet that you now want to compile.
2. Perhaps you didn’t want to write down complete experimental conditions for each sample and instead used a code or number to indicate different conditions; now you want to add the full info to the table. Or maybe you measured something at a place over time and want to add weather station data as new columns.
The above are two different tasks: 1) just stitching datasets together is a `bind_`, whereas the examples in 2) add new variables based on the values in some key/code variable, like the experiment code or date, which is a `_join`.
In dplyr we can use `bind_rows` to add new observations to a dataset, like binding multiple days’ data tables together. `bind_rows` matches columns by name; any columns missing from one of the tables will be filled with `NA` (and you’ll get an error if matched columns have incompatible types).
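As a minimal sketch (the `day1`/`day2` tables and their values are made up for illustration):

library(dplyr)
day1 <- tibble(sample = c("A", "B"), od600 = c(0.12, 0.15)) # hypothetical day 1 measurements
day2 <- tibble(sample = c("A", "B"), od600 = c(0.31, 0.28)) # hypothetical day 2 measurements
bind_rows(day1, day2, .id = "day") # .id adds a column recording which table each row came from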
You can also use `bind_cols` to add new columns to a dataset without any key or identifier variable, but this is risky, as there is no guarantee that the rows (observations) will be in the same order. It’s better to use a `_join` function, which checks that the rows match based on shared variables.
Let’s illustrate joins using data from the Problem Set.
We’ll first read each of these into the workspace:
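Something like the following, though the exact file names/paths depend on how you saved the Problem Set data:

library(readr)
plots <- read_csv("plots.csv") # file names assumed; adjust to your files
species <- read_csv("species.csv")
surveys <- read_csv("surveys.csv")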
## Rows: 24 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): plot_type
## dbl (1): plot_id
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 54 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): species_id, genus, species, taxa
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 35549 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): species_id, sex
## dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Note that when you do this, there are 24 plots, 54 types of species, but 35,549 lines of data in the surveys dataset!
These tables are all related. The question is: “What is the column that is shared by, or links, the plots and surveys datasets?” `plot_id`!
What is the column that links the surveys dataset with the species? `species_id`!
If we want to combine the datasets, let’s look at how we would do this. We’ll employ the command `inner_join`. From the help:
The mutating joins add columns from dataset `y` to dataset `x`, matching rows based on the keys:

- `inner_join(x, y)`: includes all rows in `x` and `y`.
- `left_join(x, y)`: includes all rows in `x`.
- `right_join(x, y)`: includes all rows in `y`.
- `full_join(x, y)`: includes all rows in `x` or `y`.

If a row in `x` matches multiple rows in `y`, all the rows in `y` will be returned once for each matching row in `x`.
The dplyr cheat sheet (also up in the Help>Cheat Sheets menu) demonstrates this visually.
If you want to include the detailed names contained within the species table, use an inner join based on the common variable `species_id`, using the `by` argument to specify the column name(s), as a character vector, that link the two datasets.
In this case, you end up with 3 additional variables that are within the species table added to the surveys data.
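A sketch of that join, using the objects read in above:

combo <- inner_join(surveys, species, by = "species_id")
head(combo)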
## # A tibble: 6 × 12
## record_id month day year plot_id species_id sex hindfoot_length weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 7 16 1977 2 NL M 32 NA
## 2 2 7 16 1977 3 NL M 33 NA
## 3 3 7 16 1977 2 DM F 37 NA
## 4 4 7 16 1977 7 DM M 36 NA
## 5 5 7 16 1977 3 DM M 35 NA
## 6 6 7 16 1977 1 PF M 14 NA
## # ℹ 3 more variables: genus <chr>, species <chr>, taxa <chr>
If you wanted to join all three tables together we could add this to the first combo:
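A sketch:

combo2 <- inner_join(combo, plots, by = "plot_id")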
Or use pipes. This is actually more efficient because you aren’t creating the intermediate `combo` object that takes up memory; plus, this way you don’t accidentally mix up `combo` and `combo2`. As always, more descriptive names would be better!
combo2 <- surveys %>% # oops, this is what the pipe symbol used to be
  inner_join(species, by = "species_id") |>
  inner_join(plots, by = "plot_id")
head(combo2)
## # A tibble: 6 × 13
## record_id month day year plot_id species_id sex hindfoot_length weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 7 16 1977 2 NL M 32 NA
## 2 2 7 16 1977 3 NL M 33 NA
## 3 3 7 16 1977 2 DM F 37 NA
## 4 4 7 16 1977 7 DM M 36 NA
## 5 5 7 16 1977 3 DM M 35 NA
## 6 6 7 16 1977 1 PF M 14 NA
## # ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>
While it is best practice to specify the columns you are using to uniquely match the datasets (sometimes called unique identifiers), the `_join` functions are pretty smart and will find variables that match between the datasets automatically as well!
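For example, dropping the `by` arguments entirely produces the `Joining with ...` messages shown below:

surveys |>
  inner_join(species) |>
  inner_join(plots) |>
  head()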
## Joining with `by = join_by(species_id)`
## Joining with `by = join_by(plot_id)`
## # A tibble: 6 × 13
## record_id month day year plot_id species_id sex hindfoot_length weight
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 7 16 1977 2 NL M 32 NA
## 2 2 7 16 1977 3 NL M 33 NA
## 3 3 7 16 1977 2 DM F 37 NA
## 4 4 7 16 1977 7 DM M 36 NA
## 5 5 7 16 1977 3 DM M 35 NA
## 6 6 7 16 1977 1 PF M 14 NA
## # ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>