The R Language and Tidy Data Examples

2.7 Tidy Data

Each of the following datasets shows TB cases and some other variables per country organized in different ways.

table1
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
table2
## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
table3
## # A tibble: 6 × 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
# Spread across two tibbles
table4a  # cases
## # A tibble: 3 × 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
table4b  # population
## # A tibble: 3 × 3
##   country         `1999`     `2000`
##   <chr>            <dbl>      <dbl>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

There are three interrelated rules which make a dataset tidy:

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

Which table above is Tidy?

table1
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
table2
## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
table3
## # A tibble: 6 × 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
# Spread across two tibbles
table4a  # cases
## # A tibble: 3 × 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
table4b  # population
## # A tibble: 3 × 3
##   country         `1999`     `2000`
##   <chr>            <dbl>      <dbl>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

Why ensure that your data is tidy? There are two main advantages:

There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.

There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.

2.7.1 Let’s make a data dictionary for this dataset

country = The country in which TB case data was reported. year = Calendar year cases = Test-positive cases with culture-based and ELISA-based tests in these countries with sample dates within the year above. population = The self-reported population of each country according to their census data.

2.8 R Basics

2.8.1 Data types

You can create objects (variables~values, large data structures~think spreadsheets and databases, and functions) using the =, <- or -> operators. You can see what type of data (or data type) a variable is using the class function. Go ahead, try class(x). Data in R can be of several different, basic types:

Data Type aka Example
Logical Boolean TRUE, FALSE
Numeric float 42, 3.14,
Character string ‘a’ , “good”, “TRUE”, ‘23.4’
Integer 2L, 34L, 0L
Complex 3 + 2i
Raw hexadecimal “Hello” is stored as 48 65 6c 6c 6f

2.8.2 Functions

What is a function?

function_name(argument_name = argument_value)

Using Tab-complete to make function calls will prevent errors!

2.8.3 Objects

An object is essentially anything that shows up in the Environment pane!

  • functions
  • variables
  • data objects

2.8.4 Vectors

To demonstrate what a vector is let’s load some data!

2.9 Data, tibbles, dataframes

2.9.1 Reading in data

library(readxl)
storm <- read_excel(path = "data/hazards.xlsx", sheet = 2)
storm
## # A tibble: 1,609 × 14
##    Date                County monthly.ppt max.daily.ppt rain.shortage
##    <dttm>               <dbl>       <dbl>         <dbl>         <dbl>
##  1 1981-06-01 00:00:00  24015        4.21         0.826          1.23
##  2 1982-05-01 00:00:00  24015        3.67         2.02           4.06
##  3 1983-04-01 00:00:00  24019        7.03         2.21           2.33
##  4 1983-07-01 00:00:00  24047        1.94         0.915          5.32
##  5 1983-08-01 00:00:00  24015        2.00         0.992          6   
##  6 1984-03-01 00:00:00  51001        6.85         1.77           1.87
##  7 1984-05-01 00:00:00  10005        7.64         2.60           3.81
##  8 1985-09-01 00:00:00  51001        5.73         4.65           4.53
##  9 1988-05-01 00:00:00  10003        6.03         2.12           1.68
## 10 1989-03-01 00:00:00  10003        4.57         1.06           3.23
## # ℹ 1,599 more rows
## # ℹ 9 more variables: max.5.day.ppt <dbl>, monthly.tmin <dbl>,
## #   monthly.tmax <dbl>, ndays.more.30tmax <dbl>, Hazard <chr>, TotalLoss <dbl>,
## #   TotalLoss_drought <dbl>, TotalLoss_heat <dbl>, TotalLoss_storm <dbl>

Each column is a vector!

# check whether something is a vector with is.vector()
storm$County |> is.vector()
## [1] TRUE
# but some vectors are special like dates with formatting
storm$Date |> is.vector()
## [1] FALSE
# use head(), summary(), or view() to look at data
head(storm)
## # A tibble: 6 × 14
##   Date                County monthly.ppt max.daily.ppt rain.shortage
##   <dttm>               <dbl>       <dbl>         <dbl>         <dbl>
## 1 1981-06-01 00:00:00  24015        4.21         0.826          1.23
## 2 1982-05-01 00:00:00  24015        3.67         2.02           4.06
## 3 1983-04-01 00:00:00  24019        7.03         2.21           2.33
## 4 1983-07-01 00:00:00  24047        1.94         0.915          5.32
## 5 1983-08-01 00:00:00  24015        2.00         0.992          6   
## 6 1984-03-01 00:00:00  51001        6.85         1.77           1.87
## # ℹ 9 more variables: max.5.day.ppt <dbl>, monthly.tmin <dbl>,
## #   monthly.tmax <dbl>, ndays.more.30tmax <dbl>, Hazard <chr>, TotalLoss <dbl>,
## #   TotalLoss_drought <dbl>, TotalLoss_heat <dbl>, TotalLoss_storm <dbl>

2.9.2 Factors

Factors are categorical variables.

# look at a factor variable
storm$Hazard |> as.factor() |> head()
## [1] Severe Storm/Thunder Storm Severe Storm/Thunder Storm
## [3] Severe Storm/Thunder Storm Severe Storm/Thunder Storm
## [5] Severe Storm/Thunder Storm Severe Storm/Thunder Storm
## Levels: Severe Storm/Thunder Storm
# can you add new values to factors? 
Hazard_fac <- as.factor(storm$Hazard)
Hazard_fac[1] <- "not severe"
## Warning in `[<-.factor`(`*tmp*`, 1, value = "not severe"): invalid factor
## level, NA generated
# we get a warning that an NA was generated
Hazard_fac |> head()
## [1] <NA>                       Severe Storm/Thunder Storm
## [3] Severe Storm/Thunder Storm Severe Storm/Thunder Storm
## [5] Severe Storm/Thunder Storm Severe Storm/Thunder Storm
## Levels: Severe Storm/Thunder Storm
# adding new values without the levels can result in odd behavior 
storm$Hazard |> as.factor() |> c("not severe") |> tail()
## [1] "1"          "1"          "1"          "1"          "1"         
## [6] "not severe"

How would we check that a variable only contains certain values?

(storm$monthly.ppt > 7) |> length()
## [1] 1609
storm$Hazard |> unique()
## [1] "Severe Storm/Thunder Storm"
is.na(storm) |> head()
##       Date County monthly.ppt max.daily.ppt rain.shortage max.5.day.ppt
## [1,] FALSE  FALSE       FALSE         FALSE         FALSE         FALSE
## [2,] FALSE  FALSE       FALSE         FALSE         FALSE         FALSE
## [3,] FALSE  FALSE       FALSE         FALSE         FALSE         FALSE
## [4,] FALSE  FALSE       FALSE         FALSE         FALSE         FALSE
## [5,] FALSE  FALSE       FALSE         FALSE         FALSE         FALSE
## [6,] FALSE  FALSE       FALSE         FALSE         FALSE         FALSE
##      monthly.tmin monthly.tmax ndays.more.30tmax Hazard TotalLoss
## [1,]        FALSE        FALSE             FALSE  FALSE     FALSE
## [2,]        FALSE        FALSE             FALSE  FALSE     FALSE
## [3,]        FALSE        FALSE             FALSE  FALSE     FALSE
## [4,]        FALSE        FALSE             FALSE  FALSE     FALSE
## [5,]        FALSE        FALSE             FALSE  FALSE     FALSE
## [6,]        FALSE        FALSE             FALSE  FALSE     FALSE
##      TotalLoss_drought TotalLoss_heat TotalLoss_storm
## [1,]             FALSE          FALSE           FALSE
## [2,]             FALSE          FALSE           FALSE
## [3,]             FALSE          FALSE           FALSE
## [4,]             FALSE          FALSE           FALSE
## [5,]             FALSE          FALSE           FALSE
## [6,]             FALSE          FALSE           FALSE