Importing is the first step in wrangling. This is the first module in the Data Wrangling I topic; in it, we'll read data into R from a variety of sources and parse variable types.
We're going to figure out how to import the datasets in this zip file. First, create a GitHub repo + local R project for today's work (I'm calling mine `data_wrangling_i`). Move the datasets to your repo / directory (perhaps in a sub-directory called `data`), and start an R Markdown; I'm reusing this template. We're also going to be making extensive use of the `tidyverse` package, so go ahead and load that in a "setup" code chunk.
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'tidyr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We’ve mentioned paths and your working directory in passing (when talking about best practices). A firmer understanding is helpful when you start loading data into R, because to load data R will have to know where to find it.
There are two kinds of paths:
Absolute paths are often conceptually easier, because you don’t really have to think about them – you’re just giving the complete address, starting from the root directory. These work from any current working directory on the machine. However, absolute paths make your code extremely fragile: if you move your directory, or share your project with someone else, the absolute path to files will change and your code won’t work.
An absolute path example is below:
"/Users/jeffgoldsmith/Dropbox/Work/Teaching/P8105/fall_2024/p8105.github.io/data/FAS_litters.csv"
Relative paths, meanwhile, navigate from your current working directory. Relative paths are portable, in that if you move your directory or share it with someone else, the relative path to files in the directory will stay the same and your code will work. For both of these reasons, relative paths are preferred in almost every setting.
(Hint: if you ever use `setwd()`, you're using absolute paths and taking big risks…)
The code below finds my current working directory:
getwd()
## [1] "/Users/jeffgoldsmith/Dropbox/Work/Teaching/P8105/fall_2024/p8105.github.io"
A relative path to the same file as the absolute path above is
"./data/FAS_litters.csv"
The table below, copied from R Programming for Research, gives useful shorthand notation for relative pathnames.
| Shorthand | Meaning |
|---|---|
| `~` | Home directory |
| `.` | Current working directory |
| `..` | One directory up from current working directory |
| `../..` | Two directories up from current working directory |
Tabbed autocomplete in RStudio works with both absolute and relative paths.
One note: by default, R Markdown files treat whatever directory they're in as the starting point for relative paths. For now, putting your R Markdown file in the same directory as your .Rproj file will save a lot of headaches – the same path will work when you knit the document and when you write / edit code interactively in the console. For larger projects, you might want to have a separate directory with several R Markdown files. In that case, using `here::here()` to identify and navigate from the project's top-level directory is great, but you need a reasonable understanding of paths before jumping into that. You can also modify the behavior of R Markdown files (in Global Options) to evaluate code chunks in the project directory, which alleviates most of the issues. But honestly, at that point, just learn how `here::here()` works …
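For completeness, a minimal sketch of `here::here()`, assuming the here package is installed: it builds paths starting from the project's top-level directory, no matter where your R Markdown file lives.

```r
library(here)

# here() constructs a path from the project root; this works identically
# when knitting and when running interactively
litters_df = read_csv(here("data", "FAS_litters.csv"))
```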
One caveat: if you're going to both share code and use confidential data, you have to take extra steps to ensure data security (especially if that directory lives on GitHub). One option is to use the `.gitignore` file to ensure data aren't uploaded inadvertently. Alternatively, you can store data outside of your shared project directory. In both cases there are ways to make your paths "stable".
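For instance, entries like the ones below in your `.gitignore` would keep a data sub-directory out of version control (the directory name is an assumption – match it to your project):

```
# don't track the data sub-directory or any CSVs
data/
*.csv
```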
Now that we have paths handled, we can start loading data. We're going to start with rectangular data tables (data in rows and columns, with data separated by a delimiter) saved in a plain text format. Of these, CSV (comma separated value) files are most common, and others are handled in basically the same way. To import a CSV, we'll use a function from `readr` (included in the tidyverse):
litters_df = read_csv(file = "./data/FAS_litters.csv")
## Rows: 49 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Group, Litter Number, GD0 weight, GD18 weight
## dbl (4): GD of Birth, Pups born alive, Pups dead @ birth, Pups survive
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Great – we've imported data! The first argument to `read_csv` is the path to the data, and the line above assigns the result of `read_csv` to `litters_df`. This function call also prints information about the column parsing. We'll talk more about both of these shortly.
I (almost always) use `janitor::clean_names()` to clean up variable names after importing data. Doing so will take whatever the column names are and convert them to lower snake case.
names(litters_df)
## [1] "Group" "Litter Number" "GD0 weight"
## [4] "GD18 weight" "GD of Birth" "Pups born alive"
## [7] "Pups dead @ birth" "Pups survive"
litters_df = janitor::clean_names(litters_df)
names(litters_df)
## [1] "group" "litter_number" "gd0_weight" "gd18_weight"
## [5] "gd_of_birth" "pups_born_alive" "pups_dead_birth" "pups_survive"
The `package::function` syntax lets you use a function from a package without loading the whole library. That's really helpful, because some packages have functions with the same name (e.g. `dplyr::filter` and `stats::filter`), and R has to choose which one you mean. In general, only load the packages you need to prevent this kind of confusion. You can even use the conflicted package if you want to force yourself into using good habits!
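As a sketch of that workflow (assuming a recent version of the conflicted package is installed), you can declare your preference once and R stops guessing:

```r
library(conflicted)

# Prefer dplyr::filter; any other unresolved conflicts now raise errors
# instead of being silently resolved by load order
conflicts_prefer(dplyr::filter)
```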
Learning Assessment: Make sure you are able to load the `FAS_pups.csv` dataset. Use both absolute and relative paths. Quit RStudio and move the directory containing your project, data, and R Markdown document. Repeat the previous data import process; do both absolute and relative paths still work?
For me, loading the data with absolute and relative paths might look like the following:
pups_df = read_csv(file = "./data/FAS_pups.csv")
pups_df = read_csv(file = "~/Desktop/data_wrangling_i/data/FAS_pups.csv")
If I move the directory holding this example, the relative path works and the absolute path is broken.
The first thing to do after importing the data (unless `read_csv` gives warnings) is to look at it. If there are unexpected results during data import, you'll catch a lot of them here. In addition to printing the data (which is my first step), I often use `View` / `view`, `str`, `head`, and `tail`:
litters_df
## # A tibble: 49 × 8
## group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Con7 #85 19.7 34.7 20 3
## 2 Con7 #1/2/95/2 27 42 19 8
## 3 Con7 #5/5/3/83/3-3 26 41.4 19 6
## 4 Con7 #5/4/2/95/2 28.5 44.1 19 5
## 5 Con7 #4/2/95/3-3 <NA> <NA> 20 6
## 6 Con7 #2/2/95/3-2 <NA> <NA> 20 6
## 7 Con7 #1/5/3/83/3-3/2 <NA> <NA> 20 9
## 8 Con8 #3/83/3-3 <NA> <NA> 20 9
## 9 Con8 #2/95/3 <NA> <NA> 20 8
## 10 Con8 #3/5/2/2/95 28.5 <NA> 20 8
## # ℹ 39 more rows
## # ℹ 2 more variables: pups_dead_birth <dbl>, pups_survive <dbl>
tail(litters_df, 5)
## # A tibble: 5 × 8
## group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Low8 #100 20 39.2 20 8
## 2 Low8 #4/84 21.8 35.2 20 4
## 3 Low8 #108 25.6 47.5 20 8
## 4 Low8 #99 23.5 39 20 6
## 5 Low8 #110 25.5 42.7 20 7
## # ℹ 2 more variables: pups_dead_birth <dbl>, pups_survive <dbl>
Another tool that I use sometimes is `skimr::skim`, although for me it's more "neat" than useful.
skimr::skim(litters_df)
| | |
|---|---|
| Name | litters_df |
| Number of rows | 49 |
| Number of columns | 8 |
| Column type frequency: | |
| character | 4 |
| numeric | 4 |
| Group variables | None |

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| group | 0 | 1.00 | 4 | 4 | 0 | 6 | 0 |
| litter_number | 0 | 1.00 | 3 | 15 | 0 | 49 | 0 |
| gd0_weight | 13 | 0.73 | 1 | 4 | 0 | 26 | 0 |
| gd18_weight | 15 | 0.69 | 1 | 4 | 0 | 31 | 0 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| gd_of_birth | 0 | 1 | 19.65 | 0.48 | 19 | 19 | 20 | 20 | 20 | ▅▁▁▁▇ |
| pups_born_alive | 0 | 1 | 7.35 | 1.76 | 3 | 6 | 8 | 8 | 11 | ▁▃▂▇▁ |
| pups_dead_birth | 0 | 1 | 0.33 | 0.75 | 0 | 0 | 0 | 0 | 4 | ▇▂▁▁▁ |
| pups_survive | 0 | 1 | 6.41 | 2.05 | 1 | 5 | 7 | 8 | 9 | ▁▃▂▇▇ |
I'm not going to show the output of the line below, because `View` / `view` don't work well inside of .Rmd files. Instead, call these functions from the console, and be sure to remove them from any code chunks in your .Rmd.
view(litters_df)
Arguments to `read_*`
In the best case, the data are stored in the CSV without any weirdness – there are no blank lines or columns, the first row is the variable names, and missing values are stored in sensible ways. When this isn't the case, arguments to `read_csv` are helpful. The ones I use most frequently are:

* `col_names`: usually `TRUE`. If `FALSE`, column names are `X1`, `X2`, …. You can also supply column names.
* `na`: string vector containing character expressions for missing values.
* `skip`: number of rows to skip before reading data.

For example, the call below will skip the first 10 lines of data and not assume the first row contains variable names:
litters_df =
read_csv(file = "./data/FAS_litters.csv",
skip = 10, col_names = FALSE)
## Rows: 40 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): X1, X2, X3, X4
## dbl (4): X5, X6, X7, X8
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(litters_df)
## # A tibble: 6 × 8
## X1 X2 X3 X4 X5 X6 X7 X8
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Con8 #3/5/2/2/95 28.5 <NA> 20 8 0 8
## 2 Con8 #5/4/3/83/3 28 <NA> 19 9 0 8
## 3 Con8 #1/6/2/2/95-2 <NA> <NA> 20 7 0 6
## 4 Con8 #3/5/3/83/3-3-2 <NA> <NA> 20 8 0 8
## 5 Con8 #2/2/95/2 <NA> <NA> 19 5 0 4
## 6 Con8 #3/6/2/2/95-3 <NA> <NA> 20 7 0 7
These arguments generally work for other members of the `read_*` family of functions.
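As a quick illustration of supplying your own column names, you can skip the original header row and pass a character vector to `col_names` (the names below are my own choices, not anything dictated by the dataset):

```r
# Skip the header row and supply custom (hypothetical) column names
litters_renamed =
  read_csv("./data/FAS_litters.csv",
           skip = 1,
           col_names = c("group", "litter_number",
                         "gd0_weight", "gd18_weight", "gd_of_birth",
                         "pups_born_alive", "pups_dead_birth", "pups_survive"))
```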
A careful look at `litters_df` suggests we might also need to use the `na` argument in `read_csv()` – in `GD0 weight` and `GD18 weight` there are "values" of `.`. In general (and in this case) the `.` value is used to indicate missingness. And if we open `FAS_litters.csv`, we'll see that `NA` and blank entries are also used to indicate a missing value. The line below will treat `.`, `NA`, and blank cells as missing.
litters_df =
read_csv(
file = "./data/FAS_litters.csv",
na = c(".", "NA", ""))
## Rows: 49 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Group, Litter Number
## dbl (6): GD0 weight, GD18 weight, GD of Birth, Pups born alive, Pups dead @ ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(litters_df)
## # A tibble: 6 × 8
## Group `Litter Number` `GD0 weight` `GD18 weight` `GD of Birth`
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Con7 #85 19.7 34.7 20
## 2 Con7 #1/2/95/2 27 42 19
## 3 Con7 #5/5/3/83/3-3 26 41.4 19
## 4 Con7 #5/4/2/95/2 28.5 44.1 19
## 5 Con7 #4/2/95/3-3 NA NA 20
## 6 Con7 #2/2/95/3-2 NA NA 20
## # ℹ 3 more variables: `Pups born alive` <dbl>, `Pups dead @ birth` <dbl>,
## # `Pups survive` <dbl>
I skipped the `col_types` argument because it's worth talking about in some detail. The `read_*` functions will attempt to guess the data type stored in each column; by default, these guesses are based on the first 1000 rows. The guesses are usually pretty good, and "bad" guesses often indicate that there's something in the data you should pay more attention to – before treating `.` as missing, `GD0 weight` was parsed as a character variable, but after indicating how missingness is coded it's numeric. In some cases, you'll want to give explicit column specifications. This is done using the `cols` function, where you can give each column a type:
litters_df =
read_csv(file = "./data/FAS_litters.csv",
na = c(".", "NA", ""),
col_types = cols(
Group = col_character(),
`Litter Number` = col_character(),
`GD0 weight` = col_double(),
`GD18 weight` = col_double(),
`GD of Birth` = col_integer(),
`Pups born alive` = col_integer(),
`Pups dead @ birth` = col_integer(),
`Pups survive` = col_integer()
)
)
tail(litters_df)
## # A tibble: 6 × 8
## Group `Litter Number` `GD0 weight` `GD18 weight` `GD of Birth`
## <chr> <chr> <dbl> <dbl> <int>
## 1 Low8 #79 25.4 43.8 19
## 2 Low8 #100 20 39.2 20
## 3 Low8 #4/84 21.8 35.2 20
## 4 Low8 #108 25.6 47.5 20
## 5 Low8 #99 23.5 39 20
## 6 Low8 #110 25.5 42.7 20
## # ℹ 3 more variables: `Pups born alive` <int>, `Pups dead @ birth` <int>,
## # `Pups survive` <int>
Note that you don't have to specify the variable type for every column; you can focus only on the ones that are difficult:
litters_df =
read_csv(file = "./data/FAS_litters.csv",
na = c(".", "NA", ""),
col_types = cols(
Group = col_factor()
)
)
head(litters_df)
## # A tibble: 6 × 8
## Group `Litter Number` `GD0 weight` `GD18 weight` `GD of Birth`
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 Con7 #85 19.7 34.7 20
## 2 Con7 #1/2/95/2 27 42 19
## 3 Con7 #5/5/3/83/3-3 26 41.4 19
## 4 Con7 #5/4/2/95/2 28.5 44.1 19
## 5 Con7 #4/2/95/3-3 NA NA 20
## 6 Con7 #2/2/95/3-2 NA NA 20
## # ℹ 3 more variables: `Pups born alive` <dbl>, `Pups dead @ birth` <dbl>,
## # `Pups survive` <dbl>
Learning Assessment: Repeat the data import process above for the file `FAS_pups.csv`. Make sure the column names are reasonable, and take some quick looks at the dataset. What happens if your specifications for column parsing aren't reasonable (e.g. character instead of double, or vice versa)?
After some tinkering, the code I’d use to import and view the pups dataset is below.
pups_df =
read_csv("./data/FAS_pups.csv",
na = c(".", "NA"), col_types = "fddddd")
skimr::skim(pups_df)
Some kinds of poor column specification still "work" – you can code a number as a character, for example, even if you shouldn't. Others won't work – coding a character or factor as a number is nonsense, and you'll end up with a column full of `NA` values.
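To see this concretely, here's a sketch of a bad specification – forcing `Group` (a character variable) to parse as a number – along with readr's `problems()` function, which lists the values that failed to parse:

```r
# Forcing a character column to parse as numeric yields a column of NAs;
# problems() shows each failed parse
litters_bad =
  read_csv("./data/FAS_litters.csv",
           na = c(".", "NA", ""),
           col_types = cols(Group = col_double()))

problems(litters_bad)
```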
Non-CSV plain text files (e.g. tab delimited files) can be handled using `read_delim()`. This is very similar to `read_csv`, but you have to specify a delimiter.
CSV format is great, but you'll encounter a lot of Excel files too. Although you can export these to a CSV, don't – use the `readxl` package instead! This is tidyverse-adjacent but you'll have to download and install it separately. The `read_excel` function in this package has many of the same arguments as `read_csv`, including `col_names`, `na`, `skip`, and `col_types`, and can be used in basically the same way. There is also a `sheet` option (useful when there are multiple sheets in the Excel file) and the `range` option (when you want to read in a specific data rectangle). Lastly, in RStudio you can use File > Import Dataset > From Excel for a GUI interface. The code below illustrates `read_excel`.
library(readxl)
mlb11_df = read_excel("data/mlb11.xlsx", n_max = 20)
head(mlb11_df, 5)
## # A tibble: 5 × 12
## team runs at_bats hits homeruns bat_avg strikeouts stolen_bases wins
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Texas Rang… 855 5659 1599 210 0.283 930 143 96
## 2 Boston Red… 875 5710 1600 203 0.28 1108 102 90
## 3 Detroit Ti… 787 5563 1540 169 0.277 1143 49 95
## 4 Kansas Cit… 730 5672 1560 129 0.275 1006 153 71
## 5 St. Louis … 762 5532 1513 162 0.273 978 57 90
## # ℹ 3 more variables: new_onbase <dbl>, new_slug <dbl>, new_obs <dbl>
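To sketch the `sheet` and `range` arguments mentioned above (the sheet number and cell range here are assumptions – adjust them to your file):

```r
# Read a specific rectangle of cells from the first sheet
mlb_subset_df = read_excel("data/mlb11.xlsx", sheet = 1, range = "A1:D6")
```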
The last tidyverse package for data import we'll note is `haven`, which is used to import data files from SAS, Stata, and SPSS into R. An example for reading data from SAS follows.
library(haven)
pulse_df = read_sas("./data/public_pulse_data.sas7bdat")
head(pulse_df, 5)
## # A tibble: 5 × 7
## ID age Sex BDIScore_BL BDIScore_01m BDIScore_06m BDIScore_12m
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 10003 48.0 male 7 1 2 0
## 2 10015 72.5 male 6 NA NA NA
## 3 10022 58.5 male 14 3 8 NA
## 4 10026 72.7 male 20 6 18 16
## 5 10035 60.4 male 4 0 1 2
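haven has parallel functions for the other formats; a sketch for Stata and SPSS files (the filenames below are hypothetical) is:

```r
# Analogous haven functions for Stata (.dta) and SPSS (.sav) files
stata_df = read_dta("./data/example_data.dta")
spss_df = read_sav("./data/example_data.sav")
```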
You can read in data that isn’t coming as a flat file (think data rectangle), but it’s beyond our current scope.
The functions in `readr` are relatively new, and can be used in place of base R's `read.csv`, `read.table`, and so on. The base R versions tend to be slower (very noticeably for large datasets), and the default options can make less sense for modern datasets. Meanwhile, the `readr` functions export tibbles, and keep characters as characters (instead of converting to factors …).
Learning Assessment: Import the `FAS_pups.csv` dataset using `read.csv`. Compare the class of this dataset to the one produced by `read_csv`. Try printing both in the console – what happens? After cleaning up the names, try accessing the `Sex` variable using `S` (e.g., `pups_df$S`). What happens?
The code below shows how I might load the dataset using different import functions, and how I might compare the results.
pups_base = read.csv("./data/FAS_pups.csv")
pups_readr = read_csv("./data/FAS_pups.csv")
View(pups_base)
View(pups_readr)
pups_base
pups_readr
pups_base$S
pups_readr$S
In short, `read_csv` produces tibbles, which are very similar to the base R data frames produced by `read.csv`. However, tibbles have some features that can help prevent mistakes and unwanted behavior. For example, printing a tibble shows only the first few rows, and tibbles don't do partial matching of column names – `pups_base$S` silently returns the `Sex` column, while `pups_readr$S` returns `NULL` with a warning.
You can open many data files using RStudio's drop-down menus. To import an Excel spreadsheet, for example, you can use File > Import Dataset > From Excel. This allows several import options, previews the data, and shows the code necessary for importing. Importing in this way will load the data into your current session, but you'll have to copy the import code to your R Markdown file to ensure reproducibility. This approach to importing data can be helpful when you're getting started, but gaining proficiency with writing code directly will be helpful in the long term and is more consistent with the goals of the course.
As a final point, you will sometimes need to export data after you have imported and cleaned it. The `write_*` functions in `readr` address this problem.
The content in this page draws heavily from several sources; each of the things below goes into more detail in one way or another.
The code that I used to produce working examples in lecture is here.