Importing is the first step in wrangling.
This is the first module in the Data Wrangling I topic.
Read data into R from a variety of sources and parse variable types.
We’re going to figure out how to import the datasets in this zip file. First,
create a GitHub repo + local R project for today’s work (I’m calling
data_wrangling_i). Move the datasets to your repo /
directory (perhaps in a sub-directory called
start an R Markdown; I’m reusing this template. We’re also going to be
making extensive use of the
tidyverse package, so go ahead
and load that in a “setup” code chunk.
library(tidyverse) ## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ── ## ✔ dplyr 1.1.3 ✔ readr 2.1.4 ## ✔ forcats 1.0.0 ✔ stringr 1.5.0 ## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1 ## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0 ## ✔ purrr 1.0.2 ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ## ✖ dplyr::filter() masks stats::filter() ## ✖ dplyr::lag() masks stats::lag() ## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We’ve mentioned paths and your working directory in passing (when talking about best practices). A firmer understanding is helpful when you start loading data into R, because to load data R will have to know where to find it.
There are two kinds of paths:
Absolute paths are often conceptually easier, because you don’t really have to think about them – you’re just giving the complete address, starting from the root directory. These work from any current working directory on the machine. However, absolute paths make your code extremely fragile: if you move your directory, or share your project with someone else, the absolute path to files will change and your code won’t work.
An absolute path example is below:
Relative paths, meanwhile, navigate from your current working directory. Relative paths are portable, in that if you move your directory or share it with someone else, the relative path to files in the directory will stay the same and your code will work. For both of these reasons, relative paths are preferred in almost every setting.
(Hint: if you ever use
setwd(), you’re using absolute
paths and taking
The code below finds my current working directory:
getwd() ##  "/Users/jeff/Library/CloudStorage/Dropbox/Work/Teaching/P8105/fall_2023/p8105.github.io"
A relative path to the same file as the absolute path above is
The table below, copied from R Programming for Research, gives useful shorthand notation for relative pathnames.
||Current working directory|
||One directory up from current working directory|
||Two directories up from current working directory|
Tabbed autocomplete in RStudio works with both absolute and relative paths.
One note: by default, R Markdown files treat whatever directory
they’re in as the starting point for relative paths. For now, putting
your R Markdown file in the same directory as you .RProj file will save
a lot of headaches – the same path will work when you knit the document
and when you write / edit code interactive in the console. For larger
projects, you might want to have a separate directory with several R
Markdown files. In that case using
to identify and navigate from the project top-level directory
is great, but you need a reasonable understanding of paths before
jumping into that. You can also modify the behavior of R Markdown files
(in Global Options) to evaluate code chunks in the project directory,
which alleviates most of the issues. But honestly, at that point, just
here::here() works …
One caveat: if you’re going to both share code and use confidential
data, you have to take extra steps to ensure data security (especially
if that directory lives on GitHub). One option is to use the
.gitignore file to ensure data aren’t uploaded
inadvertently. Alternatively, you can store data outside of your shared
project directory. In both cases there are ways to make you paths
Now that we have paths handled, we can start loading data. We’re
going to start with rectangular data tables (data in rows and columns,
with data separated by a delimiter) saved in a plain text format. Of
these, CSV (comma separated value) files are most common, and others are
handled in basically the same way. To import a CSV, we’ll use a function
readr (included in the tidyverse):
litters_data = read_csv(file = "./data/FAS_litters.csv") ## Rows: 49 Columns: 8 ## ── Column specification ──────────────────────────────────────────────────────── ## Delimiter: "," ## chr (2): Group, Litter Number ## dbl (6): GD0 weight, GD18 weight, GD of Birth, Pups born alive, Pups dead @ ... ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Great – we’ve imported data! The first argument to
read_csv is the path to the data, and the line above
assigns the result of
litters_data. This function call also prints information
about the column parsing. We’ll talk more about both of these
I (almost always) use
janitor::clean_names() to clean up
variable names after importing data. Doing so will take whatever the
column names are and convert them to lower snake case.
names(litters_data) ##  "Group" "Litter Number" "GD0 weight" ##  "GD18 weight" "GD of Birth" "Pups born alive" ##  "Pups dead @ birth" "Pups survive" litters_data = janitor::clean_names(litters_data) names(litters_data) ##  "group" "litter_number" "gd0_weight" "gd18_weight" ##  "gd_of_birth" "pups_born_alive" "pups_dead_birth" "pups_survive"
package::function syntax lets you use a function
from a package without loading the whole library. That’s really helpful,
because some packages have functions with the same name
stats::filter), and R
has to choose which one you mean. In general, only load the packages you
need to prevent this kind of confusion. You can even use the
package if you want to force yourself into using good habits!
Learning Assessment: Make sure you are able
to load the
FAS_pups.csv dataset. Use both absolute and
relative paths. Quit R Studio and move the directory containing your
project, data, and R Markdown document. Repeat the previous data import
process; do both absolute and relative paths still work?
For me, loading the data with absolute and relative paths might look like the following:
pups_data = read_csv(file = "./data/FAS_pups.csv") pups_data = read_csv(file = "~/Desktop/data_wranging_i/data/FAS_litters.csv")
If I move the directory holding this example, the relatitve path works and the absolute path is broken.
The first thing to do after importing the data (unless
read_csv gives warnings) is to look at it. If there are
unexpected results during data import, you’ll catch a lot of them here.
In addition to printing the data (which is my first step), I often use
litters_data ## # A tibble: 49 × 8 ## group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Con7 #85 19.7 34.7 20 3 ## 2 Con7 #1/2/95/2 27 42 19 8 ## 3 Con7 #5/5/3/83/3-3 26 41.4 19 6 ## 4 Con7 #5/4/2/95/2 28.5 44.1 19 5 ## 5 Con7 #4/2/95/3-3 NA NA 20 6 ## 6 Con7 #2/2/95/3-2 NA NA 20 6 ## 7 Con7 #1/5/3/83/3-3/2 NA NA 20 9 ## 8 Con8 #3/83/3-3 NA NA 20 9 ## 9 Con8 #2/95/3 NA NA 20 8 ## 10 Con8 #3/5/2/2/95 28.5 NA 20 8 ## # ℹ 39 more rows ## # ℹ 2 more variables: pups_dead_birth <dbl>, pups_survive <dbl> tail(litters_data, 5) ## # A tibble: 5 × 8 ## group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 Low8 #100 20 39.2 20 8 ## 2 Low8 #4/84 21.8 35.2 20 4 ## 3 Low8 #108 25.6 47.5 20 8 ## 4 Low8 #99 23.5 39 20 6 ## 5 Low8 #110 25.5 42.7 20 7 ## # ℹ 2 more variables: pups_dead_birth <dbl>, pups_survive <dbl>
Another tool that I use sometimes is
although for me it’s more “neat” than useful.
|Number of rows||49|
|Number of columns||8|
|Column type frequency:|
Variable type: character
Variable type: numeric
I’m not going to show the output of the line below, because
view don’t work well inside of .Rmd
files. Instead, call these functions from the console, and be sure to
remove them from any code chunks in your .Rmd.
In the best case, the data are stored in the csv without any
weirdness – there are no blank lines or columns, the first row is the
variable name, missing values are stored in sensible ways. When this
isn’t the case, arguments to
read_csv are helpful. The ones
I use most frequently are:
FALSE, column names are
X1, … . You can also supply column names.
na: string vector containing character expressions for missing values.
skip: number of rows to skip before reading data.
For example, the call below will skip the first 50 lines of data and not assume the first row are variable names:
litters_data = read_csv(file = "./data/FAS_litters.csv", skip = 10, col_names = FALSE) ## Rows: 40 Columns: 8 ## ── Column specification ──────────────────────────────────────────────────────── ## Delimiter: "," ## chr (2): X1, X2 ## dbl (6): X3, X4, X5, X6, X7, X8 ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. head(litters_data) ## # A tibble: 6 × 8 ## X1 X2 X3 X4 X5 X6 X7 X8 ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Con8 #3/5/2/2/95 28.5 NA 20 8 0 8 ## 2 Con8 #5/4/3/83/3 28 NA 19 9 0 8 ## 3 Con8 #1/6/2/2/95-2 NA NA 20 7 0 6 ## 4 Con8 #3/5/3/83/3-3-2 NA NA 20 8 0 8 ## 5 Con8 #2/2/95/2 NA NA 19 5 0 4 ## 6 Con8 #3/6/2/2/95-3 NA NA 20 7 0 7
These arguments generally work for other members of the
read_* family of functions.
I skipped the
col_types argument because it’s worth
talking about in some detail. The
read_* functions will
attempt to guess the data type stored in each column; by default, these
guesses are based on the first 1000 rows. The guesses are also usually
pretty good. In some cases, though, you’ll want to give explicit column
specifications. This is done using the
cols function, where
you can give each column a type:
litters_data = read_csv(file = "./data/FAS_litters.csv", col_types = cols( Group = col_character(), `Litter Number` = col_character(), `GD0 weight` = col_double(), `GD18 weight` = col_double(), `GD of Birth` = col_integer(), `Pups born alive` = col_integer(), `Pups dead @ birth` = col_integer(), `Pups survive` = col_integer() ) ) tail(litters_data) ## # A tibble: 6 × 8 ## Group `Litter Number` `GD0 weight` `GD18 weight` `GD of Birth` ## <chr> <chr> <dbl> <dbl> <int> ## 1 Low8 #79 25.4 43.8 19 ## 2 Low8 #100 20 39.2 20 ## 3 Low8 #4/84 21.8 35.2 20 ## 4 Low8 #108 25.6 47.5 20 ## 5 Low8 #99 23.5 39 20 ## 6 Low8 #110 25.5 42.7 20 ## # ℹ 3 more variables: `Pups born alive` <int>, `Pups dead @ birth` <int>, ## # `Pups survive` <int>
Note that you don’t have to specify the variable type for every column, and can only focus on ones that are difficult:
litters_data = read_csv(file = "./data/FAS_litters.csv", col_types = cols( Group = col_factor() ) ) head(litters_data) ## # A tibble: 6 × 8 ## Group `Litter Number` `GD0 weight` `GD18 weight` `GD of Birth` ## <fct> <chr> <dbl> <dbl> <dbl> ## 1 Con7 #85 19.7 34.7 20 ## 2 Con7 #1/2/95/2 27 42 19 ## 3 Con7 #5/5/3/83/3-3 26 41.4 19 ## 4 Con7 #5/4/2/95/2 28.5 44.1 19 ## 5 Con7 #4/2/95/3-3 NA NA 20 ## 6 Con7 #2/2/95/3-2 NA NA 20 ## # ℹ 3 more variables: `Pups born alive` <dbl>, `Pups dead @ birth` <dbl>, ## # `Pups survive` <dbl>
Learning Assessment: Repeat the data import
process above for the file
FAS_pups.csv. Make sure the
column names are reasonable, and take some quick looks at the dataset.
What happens if your specifications for column parsing aren’t reasonable
(e.g. character instead of double, or vice versa)?
After some tinkering, the code I’d use to import and view the pups dataset is below.
pups_data = read_csv("./data/FAS_pups.csv", col_types = "ciiiii") skimr::skim(pups_data)
If I move the directory holding this example, the relatitve path works and the absolute path is broken.
Non-CSV plain text files (e.g. tab delimited files) can be handled
read_delim(). This is very similar to
read_csv, but you have to specify a delimiter.
CSV format is great, but you’ll encounter a lot of Excel files too.
Although you can export these to a CSV, don’t – use the
readxl package instead! This is tidyverse-adjacent but
you’ll have to download and install it separately. The
read_excel function in this package has many of the same
col_types, and can
be used in basically the same way. There is also a
option (useful when there are multiple sheets in the Excel file) and the
range option (when you want to read in a specific data
rectangle). Lastly, in RStudio you can use File > Import Dataset >
From Excel for a GUI interface. The code below illustrates
library(readxl) mlb11_data = read_excel("data/mlb11.xlsx", n_max = 20) head(mlb11_data, 5) ## # A tibble: 5 × 12 ## team runs at_bats hits homeruns bat_avg strikeouts stolen_bases wins ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Texas Rang… 855 5659 1599 210 0.283 930 143 96 ## 2 Boston Red… 875 5710 1600 203 0.28 1108 102 90 ## 3 Detroit Ti… 787 5563 1540 169 0.277 1143 49 95 ## 4 Kansas Cit… 730 5672 1560 129 0.275 1006 153 71 ## 5 St. Louis … 762 5532 1513 162 0.273 978 57 90 ## # ℹ 3 more variables: new_onbase <dbl>, new_slug <dbl>, new_obs <dbl>
The last tidyverse package for data import we’ll note is
haven, which is used to import into R data files from SAS,
Stata, and SPSS. An example for reading data from SAS follows.
library(haven) pulse_data = read_sas("./data/public_pulse_data.sas7bdat") head(pulse_data, 5) ## # A tibble: 5 × 7 ## ID age Sex BDIScore_BL BDIScore_01m BDIScore_06m BDIScore_12m ## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 10003 48.0 male 7 1 2 0 ## 2 10015 72.5 male 6 NA NA NA ## 3 10022 58.5 male 14 3 8 NA ## 4 10026 72.7 male 20 6 18 16 ## 5 10035 60.4 male 4 0 1 2
You can read in data that isn’t coming as a flat file (think data rectangle), but it’s beyond our current scope.
The functions in
readr are relatively new, and can be
used in place of base R’s
read.table, and so on. The base R versions tend to be
slower (very noticeably for large datasets), and the default options can
make less sense for modern datasets. Meanwhile, the
functions export tibbles, and keep characters as characters (instead of
converting to factors …).
Learning Assessment: Import the
FAS_pups.csv dataset using
the class of this dataset to the one produced by
Try printing both in the console – what happens? After cleaning up the
names, try accessing the
Sex variable using
pups_data$S). What happens?
The code below shows how I might load the dataset using different import functions, and how I might compare the results.
pups_base = read.csv("./data/FAS_pups.csv") pups_readr = read_csv("./data/FAS_pups.csv") View(pups_base) View(pups_readr) pups_base pups_readr pups_base$S pups_readr$S
read_csvproduces tibbles which are very similar to the base R data frames produced by
read.csv. However, tibbles have some features that can help prevent mistakes and unwanted behavior.
You can open many data files using RStudio’s drop-down menus. To import an excel spreadsheet, for example, you can use File > Import Dataset > From Excel. This allows several import options, previews the data, and shows the code necessary for importing. Importing in this way will load the data into your current session, but you’ll have to copy the import code to your RMarkdown file to ensure reproducibility. This approach to importing data can be helpful when you’re getting started, but gaining proficiency with writing code directly will be helpful in the long term and is more consistent with the goals of the course.
As a final point, you will sometimes need to export data after you
have imported and cleaned it. The
write_* functions in
readr address this problem.
The content in this page draws heavily from several sources; each of the things below goes into more detail in one way or another.
The code that I produced working examples in lecture is here.