The overarching goal of data wrangling is to have a tidy, easy-to-use dataset.

This is the third module in the Data Wrangling I topic.

Overview

Learning Objectives

Understanding principles of “tidy” data, using relational databases, and merging datasets.

Slide Deck



Video Lecture


Example

I’ll keep using the same repo / project as in data import and data manipulation, but create a new .Rmd for tidying. I’m also going to load some relevant packages, and limit the number of lines printed in a tibble.

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'tidyr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

options(tibble.print_min = 5)

pivot_longer

In data import, we used the haven package to load the PULSE biomarkers dataset from a .sas7bdat. Let’s reload those data and take a closer look:

pulse_df = 
  haven::read_sas("./data/public_pulse_data.sas7bdat") |>
  janitor::clean_names()

pulse_df
## # A tibble: 1,087 × 7
##      id   age sex   bdi_score_bl bdi_score_01m bdi_score_06m bdi_score_12m
##   <dbl> <dbl> <chr>        <dbl>         <dbl>         <dbl>         <dbl>
## 1 10003  48.0 male             7             1             2             0
## 2 10015  72.5 male             6            NA            NA            NA
## 3 10022  58.5 male            14             3             8            NA
## 4 10026  72.7 male            20             6            18            16
## 5 10035  60.4 male             4             0             1             2
## # ℹ 1,082 more rows

With our new understanding of tidy data, we quickly recognize a problem: the BDI score is spread across four columns, which correspond to four observation times. We can fix this problem using pivot_longer:

pulse_tidy_df = 
  pivot_longer(
    pulse_df, 
    bdi_score_bl:bdi_score_12m,
    names_to = "visit", 
    values_to = "bdi")

pulse_tidy_df
## # A tibble: 4,348 × 5
##      id   age sex   visit           bdi
##   <dbl> <dbl> <chr> <chr>         <dbl>
## 1 10003  48.0 male  bdi_score_bl      7
## 2 10003  48.0 male  bdi_score_01m     1
## 3 10003  48.0 male  bdi_score_06m     2
## 4 10003  48.0 male  bdi_score_12m     0
## 5 10015  72.5 male  bdi_score_bl      6
## # ℹ 4,343 more rows

This looks much better! However, now visit is an issue. The original column names were informative but we probably don’t need to keep the bdi_score_ prefix in each case. I’ll use an additional option in pivot_longer to address this:

pulse_tidy_df = 
  pivot_longer(
    pulse_df, 
    bdi_score_bl:bdi_score_12m,
    names_to = "visit", 
    names_prefix = "bdi_score_",
    values_to = "bdi")

pulse_tidy_df
## # A tibble: 4,348 × 5
##      id   age sex   visit   bdi
##   <dbl> <dbl> <chr> <chr> <dbl>
## 1 10003  48.0 male  bl        7
## 2 10003  48.0 male  01m       1
## 3 10003  48.0 male  06m       2
## 4 10003  48.0 male  12m       0
## 5 10015  72.5 male  bl        6
## # ℹ 4,343 more rows

In the preceding I’ve saved intermediate datasets to make each step clear. While this can be a helpful crutch as you’re trying out code, it is generally bad practice. There are also some additional transformations needed to wrap up the data wrangling process, like changing bl to 00m for consistency across visits and converting visit to a factor variable. (It’s possible that you would want visit to be a numeric variable instead, which could be done with a different call to mutate.)

Altogether, then, the code below will import, tidy, and transform the PULSE dataset into a usable format:

pulse_df = 
  haven::read_sas("./data/public_pulse_data.sas7bdat") |>
  janitor::clean_names() |>
  pivot_longer(
    bdi_score_bl:bdi_score_12m,
    names_to = "visit", 
    names_prefix = "bdi_score_",
    values_to = "bdi") |>
  mutate(
    visit = replace(visit, visit == "bl", "00m"),
    visit = factor(visit)) 

print(pulse_df, n = 12)
## # A tibble: 4,348 × 5
##       id   age sex   visit   bdi
##    <dbl> <dbl> <chr> <fct> <dbl>
##  1 10003  48.0 male  00m       7
##  2 10003  48.0 male  01m       1
##  3 10003  48.0 male  06m       2
##  4 10003  48.0 male  12m       0
##  5 10015  72.5 male  00m       6
##  6 10015  72.5 male  01m      NA
##  7 10015  72.5 male  06m      NA
##  8 10015  72.5 male  12m      NA
##  9 10022  58.5 male  00m      14
## 10 10022  58.5 male  01m       3
## 11 10022  58.5 male  06m       8
## 12 10022  58.5 male  12m      NA
## # ℹ 4,336 more rows

Now we’re in pretty okay shape :-).

Learning Assessment: In the litters data, the variables gd0_weight and gd18_weight give the weight of the mother mouse on gestational days 0 and 18. Write a data cleaning chain that retains only litter_number and these columns; produces new variables gd and weight; and makes gd a numeric variable taking values 0 and 18 (for the last part, you might want to use recode …). Is this version “tidy”?

Solution

The code below shows one approach to this data cleaning process:

litters_wide = 
  read_csv(
    "./data/FAS_litters.csv",
    na = c("NA", ".", "")) |>
  janitor::clean_names() |>
  select(litter_number, ends_with("weight")) |> 
  pivot_longer(
    gd0_weight:gd18_weight,
    names_to = "gd", 
    values_to = "weight") |> 
  mutate(
    gd = case_match(
      gd,
      "gd0_weight"  ~ 0,
      "gd18_weight" ~ 18
    ))
## Rows: 49 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Group, Litter Number
## dbl (6): GD0 weight, GD18 weight, GD of Birth, Pups born alive, Pups dead @ ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

In one sense, this is “tidy” because I have a variable for day and a variable for weight rather that using values in my variable names. However, it’s less useful if I’m interested in computing or analyzing weight gain during pregnancy.

pivot_wider

We’ve been exclusively interested in tidying data, but we’ve admitted that sometimes untidy is better for human consumption. For that reason we’re going to take a short digression into untidying your tidy data.

The code below creates a tidy dataset that could result from an analysis. This is the correct format for additional analysis or visualization, but doesn’t facilitate quick comparisons for human readers.

analysis_result = 
  tibble(
    group = c("treatment", "treatment", "placebo", "placebo"),
    time = c("pre", "post", "pre", "post"),
    mean = c(4, 8, 3.5, 4)
  )

analysis_result
## # A tibble: 4 × 3
##   group     time   mean
##   <chr>     <chr> <dbl>
## 1 treatment pre     4  
## 2 treatment post    8  
## 3 placebo   pre     3.5
## 4 placebo   post    4

An alternative presentation of the same data might have groups in rows, times in columns, and mean values in table cells. This is decidedly non-tidy; to get there from here we’ll need to use pivot_wider, which is the inverse of pivot_longer:

pivot_wider(
  analysis_result, 
  names_from = "time", 
  values_from = "mean")
## # A tibble: 2 × 3
##   group       pre  post
##   <chr>     <dbl> <dbl>
## 1 treatment   4       8
## 2 placebo     3.5     4

We’re pretty much there now – in some cases you might use select to reorder columns, and (depending on your goal) use knitr::kable() to produce a nicer table for reading.

Binding rows

We’ve looked at single-table non-tidy data, but non-tidiness often stems from relevant data spread across multiple tables. In the simplest case, these tables are basically the same and can be stacked to produce a tidy dataset. That’s the setting in LotR_words.xlsx, where the word counts for different races and genders in each movie in the trilogy are spread across distinct data rectangles (these data are based on this example).

To produce the desired tidy dataset, we first need to read each table and do some cleaning.

fellowship_ring = 
  readxl::read_excel("./data/LotR_Words.xlsx", range = "B3:D6") |>
  mutate(movie = "fellowship_ring")

two_towers = 
  readxl::read_excel("./data/LotR_Words.xlsx", range = "F3:H6") |>
  mutate(movie = "two_towers")

return_king = 
  readxl::read_excel("./data/LotR_Words.xlsx", range = "J3:L6") |>
  mutate(movie = "return_king")

Here it was necessary to add a variable to each dataframe indicating the movie; that information had stored elsewhere in the original spreadsheet. As an aside, the three code snippets above are all basically the same except for the range and the movie name – later we’ll see a better way to handle cases like this by writing our own functions, but this works for now.

Once each table is ready to go, we can stack them up using bind_rows and tidy the result:

lotr_tidy = 
  bind_rows(fellowship_ring, two_towers, return_king) |>
  janitor::clean_names() |>
  pivot_longer(
    female:male,
    names_to = "gender", 
    values_to = "words") |>
  mutate(race = str_to_lower(race)) |> 
  select(movie, everything()) 

lotr_tidy
## # A tibble: 18 × 4
##    movie           race   gender words
##    <chr>           <chr>  <chr>  <dbl>
##  1 fellowship_ring elf    female  1229
##  2 fellowship_ring elf    male     971
##  3 fellowship_ring hobbit female    14
##  4 fellowship_ring hobbit male    3644
##  5 fellowship_ring man    female     0
##  6 fellowship_ring man    male    1995
##  7 two_towers      elf    female   331
##  8 two_towers      elf    male     513
##  9 two_towers      hobbit female     0
## 10 two_towers      hobbit male    2463
## 11 two_towers      man    female   401
## 12 two_towers      man    male    3589
## 13 return_king     elf    female   183
## 14 return_king     elf    male     510
## 15 return_king     hobbit female     2
## 16 return_king     hobbit male    2673
## 17 return_king     man    female   268
## 18 return_king     man    male    2459

Having the data in this form will make it easier to make comparisons across movies, aggregate within races across the trilogy, and perform other analyses.

Joining datasets

Data can be spread across multiple related tables, in which case it is necessary to combine or join them prior to analysis. We’ll focus on the problem of combining two tables only; combining three or more is done step-by-step using the same ideas.

There are four major ways join dataframes x and y:

  • Inner: keeps data that appear in both x and y
  • Left: keeps data that appear in x
  • Right: keeps data that appear in y
  • Full: keeps data that appear in either x or y

Left joins are the most common, because they add data from a smaller table y into a larger table x without removing anything from x.

As an example, consider the data tables in FAS_pups.csv and FAS_litters.csv, which are related through the Litter Number variable. The former contains data unique to each pup, and the latter contains data unique to each litter. We can combine these using a left join of litter data into pup data; doing so retains data on each pup and adds data in new columns.

(While revisiting this example, take a look at the group variable in the litters dataset: this encodes both dose and day of treatment! We’ll fix that bit of untidiness as part of the processing pipeline. I’m also going to address a pet peeve of mine, which is coding sex as an ambiguous numeric variable.)

pup_df = 
  read_csv(
    "./data/FAS_pups.csv",
    na = c("NA", "", ".")) |>
  janitor::clean_names() |>
  mutate(
    sex = 
      case_match(
        sex, 
        1 ~ "male", 
        2 ~ "female"),
    sex = as.factor(sex)) 
## Rows: 313 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Litter Number
## dbl (5): Sex, PD ears, PD eyes, PD pivot, PD walk
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

litter_df = 
  read_csv(
    "./data/FAS_litters.csv",
    na = c("NA", ".", "")) |>
  janitor::clean_names() |>
  separate(group, into = c("dose", "day_of_tx"), sep = 3) |>
  relocate(litter_number) |>
  mutate(
    wt_gain = gd18_weight - gd0_weight,
    dose = str_to_lower(dose))
## Rows: 49 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Group, Litter Number
## dbl (6): GD0 weight, GD18 weight, GD of Birth, Pups born alive, Pups dead @ ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

fas_df = 
  left_join(pup_df, litter_df, by = "litter_number")

fas_df
## # A tibble: 313 × 15
##   litter_number sex   pd_ears pd_eyes pd_pivot pd_walk dose  day_of_tx
##   <chr>         <fct>   <dbl>   <dbl>    <dbl>   <dbl> <chr> <chr>    
## 1 #85           male        4      13        7      11 con   7        
## 2 #85           male        4      13        7      12 con   7        
## 3 #1/2/95/2     male        5      13        7       9 con   7        
## 4 #1/2/95/2     male        5      13        8      10 con   7        
## 5 #5/5/3/83/3-3 male        5      13        8      10 con   7        
## # ℹ 308 more rows
## # ℹ 7 more variables: gd0_weight <dbl>, gd18_weight <dbl>, gd_of_birth <dbl>,
## #   pups_born_alive <dbl>, pups_dead_birth <dbl>, pups_survive <dbl>,
## #   wt_gain <dbl>

We made the key explicit in the join. By default, the *_join functions in dplyr will try to determine the key(s) based on variable names in the datasets you want to join. This is often but not always sufficient, and an extra step to make the key clear will help you and others reading your code.

Note that joining is not particularly amenable to the |> operator because it is fundamentally non-linear: two separate datasets are coming together, rather than a single dataset being processed in a step-by-step fashion.

As a final point, the *_join functions are very much related to SQL syntax, but emphasize operations common to data analysis.

Learning Assessment: The datasets in this zip file contain de-identified responses to surveys included in past years of this course. Both contain a unique student identifier; the first has responses to a question about operating systems, and the second has responses to questions about degree program and git experience. Write a code chunk that imports and cleans both datasets, and then joins them.

Solution

I put both datasets in the data directory in my repo / project. The code below imports both datasets, cleans up variable names, and joins the datasets using left_join, inner_join, and anti_join.

surv_os = 
  read_csv("data/surv_os.csv") |> 
  janitor::clean_names() |> 
  rename(id = what_is_your_uni, os = what_operating_system_do_you_use)

surv_pr_git = 
  read_csv("data/surv_program_git.csv") |> 
  janitor::clean_names() |> 
  rename(
    id = what_is_your_uni, 
    prog = what_is_your_degree_program,
    git_exp = which_most_accurately_describes_your_experience_with_git)

left_join(surv_os, surv_pr_git)
inner_join(surv_os, surv_pr_git)
anti_join(surv_os, surv_pr_git)
anti_join(surv_pr_git, surv_os)

Both left_join and inner_join give warnings about “many-to-many” relationships. These indicate that the matching variable is not unique, and the warnings should be investigated. In the code below we look into the warning that “Row 7 of x matches multiple rows in y.”

surv_os |> slice(7)
surv_pr_git |> filter(id == "student_15")
Sure enough, two rows in surv_pr_git gave the id value student_15 – probably someone filled in the survey more than once. Adjudicating and correcting these issues can be an important step in data cleaning!

A quick note on names

For a while, folks were using gather and spread instead of pivot_longer and pivot_wider. The new functions were updated for good reasons; gather and spread will still exist, but they’re going to be less common over time and you may never see them.

Other materials

  • R for Data Science, of course, has a chapter on tidy data. The paper that lays out the underlying ideas may also be useful.
  • R for Data Science also has an excellent and very detailed presentation of joins.
  • Jenny Bryan’s Stat 545 class has content on tidy data – parts 1, 2, 3, and 4 are all good (some of the content above is very much related to this).
  • You should revisit the data import cheatsheet, which also has some tips for tidying.

The code that I produced working examples in lecture is here.