dplyr
Once you’ve imported data, you’re going to need to do some cleaning up.
This is the second module in the Data Wrangling I topic.
Clean and organize data using dplyr verbs and piping.
For this example, I’ll start a new R Markdown file in the repo / project I started for the Data Wrangling I topic; this will make it easy to load example data sets using the code I wrote in Data Import.
Once again we’re going to be using the tidyverse, so we’ll load that at the outset. We’re going to be looking at a lot of output, so I’ll print only three lines of each tibble by default. Lastly, we’ll focus on the data in FAS_litters.csv and FAS_pups.csv, so we’ll load those data and clean up the column names using what we learned in Data Import.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
options(tibble.print_min = 3)
litters_df =
read_csv("./data/FAS_litters.csv", na = c("NA", ".", ""))
## Rows: 49 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Group, Litter Number
## dbl (6): GD0 weight, GD18 weight, GD of Birth, Pups born alive, Pups dead @ ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
litters_df =
janitor::clean_names(litters_df)
pups_df =
read_csv("./data/FAS_pups.csv", na = c("NA", "."))
## Rows: 313 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Litter Number
## dbl (5): Sex, PD ears, PD eyes, PD pivot, PD walk
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pups_df =
janitor::clean_names(pups_df)
select
For a given analysis, you may only need a subset of the columns in a data table; extracting only what you need can helpfully de-clutter, especially when you have large datasets. Select columns using select.
You can specify the columns you want to keep by naming all of them:
select(litters_df, group, litter_number, gd0_weight, pups_born_alive)
## # A tibble: 49 × 4
## group litter_number gd0_weight pups_born_alive
## <chr> <chr> <dbl> <dbl>
## 1 Con7 #85 19.7 3
## 2 Con7 #1/2/95/2 27 8
## 3 Con7 #5/5/3/83/3-3 26 6
## # ℹ 46 more rows
You can specify a range of columns to keep:
select(litters_df, group:gd_of_birth)
## # A tibble: 49 × 5
## group litter_number gd0_weight gd18_weight gd_of_birth
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Con7 #85 19.7 34.7 20
## 2 Con7 #1/2/95/2 27 42 19
## 3 Con7 #5/5/3/83/3-3 26 41.4 19
## # ℹ 46 more rows
You can also specify columns you’d like to remove:
select(litters_df, -pups_survive)
## # A tibble: 49 × 7
## group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Con7 #85 19.7 34.7 20 3
## 2 Con7 #1/2/95/2 27 42 19 8
## 3 Con7 #5/5/3/83/3-3 26 41.4 19 6
## # ℹ 46 more rows
## # ℹ 1 more variable: pups_dead_birth <dbl>
You can rename variables as part of this process:
select(litters_df, GROUP = group, LiTtEr_NuMbEr = litter_number)
## # A tibble: 49 × 2
## GROUP LiTtEr_NuMbEr
## <chr> <chr>
## 1 Con7 #85
## 2 Con7 #1/2/95/2
## 3 Con7 #5/5/3/83/3-3
## # ℹ 46 more rows
If all you want to do is rename something, you can use rename instead of select. This will rename the variables you care about, and keep everything else:
rename(litters_df, GROUP = group, LiTtEr_NuMbEr = litter_number)
## # A tibble: 49 × 8
## GROUP LiTtEr_NuMbEr gd0_weight gd18_weight gd_of_birth pups_born_alive
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Con7 #85 19.7 34.7 20 3
## 2 Con7 #1/2/95/2 27 42 19 8
## 3 Con7 #5/5/3/83/3-3 26 41.4 19 6
## # ℹ 46 more rows
## # ℹ 2 more variables: pups_dead_birth <dbl>, pups_survive <dbl>
There are some handy helper functions for select; read about all of them using ?select_helpers. I use starts_with(), ends_with(), and contains() often, especially when variables are named with suffixes or other standard patterns:
select(litters_df, starts_with("gd"))
## # A tibble: 49 × 3
## gd0_weight gd18_weight gd_of_birth
## <dbl> <dbl> <dbl>
## 1 19.7 34.7 20
## 2 27 42 19
## 3 26 41.4 19
## # ℹ 46 more rows
I also frequently use everything(), which is handy for reorganizing columns without discarding anything:
select(litters_df, litter_number, pups_survive, everything())
## # A tibble: 49 × 8
## litter_number pups_survive group gd0_weight gd18_weight gd_of_birth
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 #85 3 Con7 19.7 34.7 20
## 2 #1/2/95/2 7 Con7 27 42 19
## 3 #5/5/3/83/3-3 5 Con7 26 41.4 19
## # ℹ 46 more rows
## # ℹ 2 more variables: pups_born_alive <dbl>, pups_dead_birth <dbl>
relocate does a similar thing (and is sort of like rename in that it’s handy but not critical):
relocate(litters_df, litter_number, pups_survive)
## # A tibble: 49 × 8
## litter_number pups_survive group gd0_weight gd18_weight gd_of_birth
## <chr> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 #85 3 Con7 19.7 34.7 20
## 2 #1/2/95/2 7 Con7 27 42 19
## 3 #5/5/3/83/3-3 5 Con7 26 41.4 19
## # ℹ 46 more rows
## # ℹ 2 more variables: pups_born_alive <dbl>, pups_dead_birth <dbl>
Lastly, like other functions in dplyr, select will export a dataframe even if you only select one column. Mostly this is fine, but sometimes you want the vector stored in the column. To pull a single variable, use pull.
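As a quick sketch of the difference (using a small tibble constructed here just for illustration), select returns a one-column tibble, while pull returns the underlying vector:

```r
library(dplyr)

# a small example tibble (illustrative values, not the FAS data)
df = tibble(group = c("Con7", "Con8"), gd0_weight = c(19.7, 26.0))

select(df, gd0_weight)  # still a tibble, with one column
pull(df, gd0_weight)    # the numeric vector stored in that column
```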
Learning Assessment: In the pups data, select the columns containing litter number, sex, and PD ears.
The chunk below performs this action:
select(pups_df, litter_number, sex, pd_ears)
## # A tibble: 313 × 3
## litter_number sex pd_ears
## <chr> <dbl> <dbl>
## 1 #85 1 4
## 2 #85 1 4
## 3 #1/2/95/2 1 5
## # ℹ 310 more rows
filter
Some data tables will include rows you don’t need for your current
analysis. Although you could remove specific row numbers using base R,
you shouldn’t – this might break if the raw data are updated, and the
thought process isn’t transparent. Instead, you should filter rows based
on logical expressions using the filter function. Like select, the first argument to filter is the dataframe you’re filtering; all subsequent arguments are logical expressions.
You will often filter using comparison operators (>, >=, <, <=, ==, and !=). You may also use %in% to detect if values appear in a set, and is.na() to find missing values. The results of comparisons are logical – the statement is TRUE or FALSE depending on the values you compare – and can be combined with other comparisons using the logical operators & and |, or negated using !.
Some ways you might filter the litters data are:
gd_of_birth == 20
pups_born_alive >= 2
pups_survive != 4
!(pups_survive == 4)
group %in% c("Con7", "Con8")
group == "Con7" & gd_of_birth == 20
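Expressions like these are passed directly as arguments to filter, which keeps only the rows where the expression is TRUE. A sketch using a toy tibble (constructed here for illustration, not the real litters data):

```r
library(dplyr)

# toy data mimicking a few litters columns (illustrative values)
litters = tibble(
  group           = c("Con7", "Con8", "Mod8"),
  gd_of_birth     = c(20, 19, 20),
  pups_born_alive = c(3, 8, 6)
)

filter(litters, group %in% c("Con7", "Con8"))        # keeps the first two rows
filter(litters, group == "Con7" & gd_of_birth == 20) # keeps only the first row
```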
A very common filtering step requires you to omit missing observations. You can do this with filter, but I recommend using drop_na from the tidyr package: drop_na(litters_df) will remove any row with a missing value, while drop_na(litters_df, wt_increase) will remove rows for which wt_increase is missing.
Filtering can be helpful for limiting a dataset to only those observations needed for an analysis. However, I recommend against the creation of many data subsets (e.g. one for each group). This can clutter up your workspace, and we’ll see good tools for the analysis of subsets before long.
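A minimal sketch of the difference between the two drop_na calls, using a toy tibble with missing values (constructed here for illustration):

```r
library(dplyr)
library(tidyr)

# toy data with missing values (illustrative)
df = tibble(
  group      = c("Con7", "Con7", "Con8"),
  gd0_weight = c(19.7, NA, 26.0),
  wt_gain    = c(15.0, 14.2, NA)
)

drop_na(df)           # keeps only the first row (no NAs anywhere)
drop_na(df, wt_gain)  # keeps the first two rows (wt_gain is not missing)
```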
Learning Assessment: In the pups data: filter to include only pups with sex 1; then, separately, filter to include only pups with sex 2 and PD walk less than 11.
The chunk below performs these actions:
filter(pups_df, sex == 1)
## # A tibble: 155 × 6
## litter_number sex pd_ears pd_eyes pd_pivot pd_walk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 #85 1 4 13 7 11
## 2 #85 1 4 13 7 12
## 3 #1/2/95/2 1 5 13 7 9
## # ℹ 152 more rows
filter(pups_df, sex == 2, pd_walk < 11)
## # A tibble: 127 × 6
## litter_number sex pd_ears pd_eyes pd_pivot pd_walk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 #1/2/95/2 2 4 13 7 9
## 2 #1/2/95/2 2 4 13 7 10
## 3 #1/2/95/2 2 5 13 8 10
## # ℹ 124 more rows
mutate
Sometimes you need to select columns; sometimes you need to change them or create new ones. You can do this using mutate. The example below creates a new variable measuring the difference between gd18_weight and gd0_weight, and modifies the existing group variable.
mutate(litters_df,
wt_gain = gd18_weight - gd0_weight,
group = str_to_lower(group)
)
## # A tibble: 49 × 9
## group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 con7 #85 19.7 34.7 20 3
## 2 con7 #1/2/95/2 27 42 19 8
## 3 con7 #5/5/3/83/3-3 26 41.4 19 6
## # ℹ 46 more rows
## # ℹ 3 more variables: pups_dead_birth <dbl>, pups_survive <dbl>, wt_gain <dbl>
A few things in this example are worth noting; in particular, creating a new variable that does exactly what you need can be a challenge – the more functions you know about, the easier this gets.
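Another point worth noting (sketched below with a toy tibble constructed for illustration): within a single mutate call, you can refer to a variable created earlier in the same call:

```r
library(dplyr)

# toy weights (illustrative values)
df = tibble(gd0_weight = c(19.7, 26.0), gd18_weight = c(34.7, 41.4))

mutate(df,
  wt_gain     = gd18_weight - gd0_weight,  # created here ...
  wt_gain_pct = wt_gain / gd0_weight * 100 # ... and used immediately
)
```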
Learning Assessment: In the pups data: create a variable that subtracts 7 from PD pivot; and create a variable that is the sum of all the PD variables.
The chunk below performs these actions:
mutate(pups_df, pivot_minus7 = pd_pivot - 7)
## # A tibble: 313 × 7
## litter_number sex pd_ears pd_eyes pd_pivot pd_walk pivot_minus7
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 #85 1 4 13 7 11 0
## 2 #85 1 4 13 7 12 0
## 3 #1/2/95/2 1 5 13 7 9 0
## # ℹ 310 more rows
mutate(pups_df, pd_sum = pd_ears + pd_eyes + pd_pivot + pd_walk)
## # A tibble: 313 × 7
## litter_number sex pd_ears pd_eyes pd_pivot pd_walk pd_sum
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 #85 1 4 13 7 11 35
## 2 #85 1 4 13 7 12 36
## 3 #1/2/95/2 1 5 13 7 9 34
## # ℹ 310 more rows
arrange
In comparison to the preceding, arranging is pretty straightforward. You can arrange the rows in your data according to the values in one or more columns:
head(arrange(litters_df, group, pups_born_alive), 10)
## # A tibble: 10 × 8
## group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Con7 #85 19.7 34.7 20 3
## 2 Con7 #5/4/2/95/2 28.5 44.1 19 5
## 3 Con7 #5/5/3/83/3-3 26 41.4 19 6
## 4 Con7 #4/2/95/3-3 NA NA 20 6
## 5 Con7 #2/2/95/3-2 NA NA 20 6
## 6 Con7 #1/2/95/2 27 42 19 8
## 7 Con7 #1/5/3/83/3-3/2 NA NA 20 9
## 8 Con8 #2/2/95/2 NA NA 19 5
## 9 Con8 #1/6/2/2/95-2 NA NA 20 7
## 10 Con8 #3/6/2/2/95-3 NA NA 20 7
## # ℹ 2 more variables: pups_dead_birth <dbl>, pups_survive <dbl>
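To sort in descending order, wrap a column in desc(); a sketch with a toy tibble (constructed here for illustration):

```r
library(dplyr)

# toy data (illustrative values)
df = tibble(group = c("Con7", "Con8", "Mod8"), pups_born_alive = c(3, 8, 6))

# largest litters first
arrange(df, desc(pups_born_alive))
```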
|>
We’ve seen several commands you will use regularly for data manipulation and cleaning. You will rarely use them in isolation. For example, suppose you want to load the data, clean the column names, remove pups_survive, and create wt_gain. There are a couple of options for this kind of multi-step data manipulation: define intermediate datasets at each step, or nest function calls.
The following is an example of the first option:
litters_df_raw =
read_csv("./data/FAS_litters.csv", na = c("NA", ".", ""))
## Rows: 49 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Group, Litter Number
## dbl (6): GD0 weight, GD18 weight, GD of Birth, Pups born alive, Pups dead @ ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
litters_df_clean_names = janitor::clean_names(litters_df_raw)
litters_df_selected_cols = select(litters_df_clean_names, -pups_survive)
litters_df_with_vars =
mutate(
litters_df_selected_cols,
wt_gain = gd18_weight - gd0_weight,
group = str_to_lower(group))
litters_df_with_vars_without_missing =
drop_na(litters_df_with_vars, wt_gain)
litters_df_with_vars_without_missing
## # A tibble: 31 × 8
## group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 con7 #85 19.7 34.7 20 3
## 2 con7 #1/2/95/2 27 42 19 8
## 3 con7 #5/5/3/83/3-3 26 41.4 19 6
## # ℹ 28 more rows
## # ℹ 2 more variables: pups_dead_birth <dbl>, wt_gain <dbl>
Below, we try the second option:
litters_df_clean =
drop_na(
mutate(
select(
janitor::clean_names(
read_csv("./data/FAS_litters.csv", na = c("NA", ".", ""))
),
-pups_survive
),
wt_gain = gd18_weight - gd0_weight,
group = str_to_lower(group)
),
wt_gain
)
## Rows: 49 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Group, Litter Number
## dbl (6): GD0 weight, GD18 weight, GD of Birth, Pups born alive, Pups dead @ ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
litters_df_clean
## # A tibble: 31 × 8
## group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 con7 #85 19.7 34.7 20 3
## 2 con7 #1/2/95/2 27 42 19 8
## 3 con7 #5/5/3/83/3-3 26 41.4 19 6
## # ℹ 28 more rows
## # ℹ 2 more variables: pups_dead_birth <dbl>, wt_gain <dbl>
These are both confusing and bad: the first clutters our workspace with intermediate objects, and the second has to be read inside out.
Piping solves this problem. It allows you to turn the nested approach into a sequential chain by passing the result of one function call as an argument to the next function call:
litters_df =
read_csv("./data/FAS_litters.csv", na = c("NA", ".", "")) |>
janitor::clean_names() |>
select(-pups_survive) |>
mutate(
wt_gain = gd18_weight - gd0_weight,
group = str_to_lower(group)) |>
drop_na(wt_gain)
## Rows: 49 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Group, Litter Number
## dbl (6): GD0 weight, GD18 weight, GD of Birth, Pups born alive, Pups dead @ ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
litters_df
## # A tibble: 31 × 8
## group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 con7 #85 19.7 34.7 20 3
## 2 con7 #1/2/95/2 27 42 19 8
## 3 con7 #5/5/3/83/3-3 26 41.4 19 6
## # ℹ 28 more rows
## # ℹ 2 more variables: pups_dead_birth <dbl>, wt_gain <dbl>
All three approaches result in the same dataset, but the piped commands are by far the most straightforward. The easiest way to read |> is “then”; the keyboard shortcuts are Ctrl + Shift + M (Windows) and Cmd + Shift + M (Mac). Note that by default RStudio will insert the “traditional” pipe %>%; you can switch to the native pipe through Global Options > Code > Use Native Pipe Operator. More on the difference between |> and %>% below.
The functions in dplyr (and much of the tidyverse) are designed to work smoothly with the pipe operator. By default, the pipe will take the result of one function call and use that as the first argument of the next function call; by design, functions in dplyr will take a tibble as an input and return a tibble as a result. As a consequence, functions in dplyr are easy to connect in a data cleaning chain.
In the majority of cases (and everywhere in the tidyverse) you can trust that the first argument is the right one and be happy with life, but there are some cases where what you’re piping isn’t going into the first argument. Here, using the placeholder _ is necessary to indicate where the object being piped should go. For example, to regress wt_gain on pups_born_alive, you might use:
litters_df |>
lm(wt_gain ~ pups_born_alive, data = _) |>
broom::tidy()
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 13.1 1.27 10.3 3.39e-11
## 2 pups_born_alive 0.605 0.173 3.49 1.55e- 3
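For comparison, the older magrittr pipe %>% (re-exported by dplyr) uses . as its placeholder rather than _; a sketch of the same kind of regression both ways, using a toy tibble constructed for illustration:

```r
library(dplyr)

# toy data (illustrative values)
df = tibble(pups_born_alive = c(3, 8, 6), wt_gain = c(15.0, 15.4, 12.1))

# native pipe (R >= 4.2): placeholder is _, and the argument must be named
fit_native = df |> lm(wt_gain ~ pups_born_alive, data = _)

# magrittr pipe: placeholder is .
fit_magrittr = df %>% lm(wt_gain ~ pups_born_alive, data = .)
```

Both fits are identical; the two pipes differ only in placeholder syntax here.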
There are limitations to the pipe. You shouldn’t have sequences that are too long; there isn’t a great way to deal with multiple inputs and outputs; and not everyone will know what |> means or does. That said, compared to days when R users only had the first two options, life is much better!
Learning Assessment: Write a chain of commands that: loads the pups data; cleans the column names; filters to include only pups with sex 1; removes the PD ears variable; and creates a variable that indicates whether PD pivot is greater than 7.
The chunk below performs these actions:
pups_df =
read_csv("./data/FAS_pups.csv", na = c("NA", ".")) |>
janitor::clean_names() |>
filter(sex == 1) |>
select(-pd_ears) |>
mutate(pd_pivot_gt7 = pd_pivot > 7)
## Rows: 313 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Litter Number
## dbl (5): Sex, PD ears, PD eyes, PD pivot, PD walk
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pups_df
## # A tibble: 155 × 6
## litter_number sex pd_eyes pd_pivot pd_walk pd_pivot_gt7
## <chr> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 #85 1 13 7 11 FALSE
## 2 #85 1 13 7 12 FALSE
## 3 #1/2/95/2 1 13 7 9 FALSE
## # ℹ 152 more rows
|> vs %>%
There are some subtle differences between |> and %>% that it can be helpful to read about, but in the vast majority of cases these are effectively interchangeable. While the similarities and differences can matter, there are a few points that are likely to come up more often:
You’ll see %>% a lot in code online. It was introduced first, and was very prevalent in tidyverse code for a long time before |> came along.
|> was introduced in R 4.1.0, and it changed some in 4.2.0 and 4.3.0; it’s pretty stable now and the main functionality is unlikely to change, but it won’t be familiar to everyone and can’t be used at all in older versions of R.
There’s lots of stuff out there on how to clean your data using dplyr. If you want to keep track of everything you do with dplyr, the tidylog package can be helpful. The code that I used to produce examples in lecture is here.