Importing is the first step in wrangling.

This is the first module in the Data Wrangling I topic.

Overview

Learning Objectives

Read data into R from a variety of sources and parse variable types.

Slide Deck



Video Lecture


Example

We’re going to figure out how to import the datasets in this zip file. First, create a GitHub repo + local R project for today’s work (I’m calling mine data_wrangling_i). Move the datasets to your repo / directory (perhaps in a sub-directory called data), and start an R Markdown; I’m reusing this template. We’re also going to be making extensive use of the tidyverse package, so go ahead and load that in a “setup” code chunk.

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'tidyr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Paths

We’ve mentioned paths and your working directory in passing (when talking about best practices). A firmer understanding is helpful when you start loading data into R, because to load data R will have to know where to find it.

There are two kinds of paths:

  • Absolute: a file or folder’s “full address” on your computer
  • Relative: directions to a file or folder from your current working directory

Absolute paths are often conceptually easier, because you don’t really have to think about them – you’re just giving the complete address, starting from the root directory. These work from any current working directory on the machine. However, absolute paths make your code extremely fragile: if you move your directory, or share your project with someone else, the absolute path to files will change and your code won’t work.

An absolute path example is below:

"/Users/jeffgoldsmith/Dropbox/Work/Teaching/P8105/fall_2024/p8105.github.io/data/FAS_litters.csv"

Relative paths, meanwhile, navigate from your current working directory. Relative paths are portable, in that if you move your directory or share it with someone else, the relative path to files in the directory will stay the same and your code will work. For both of these reasons, relative paths are preferred in almost every setting.

(Hint: if you ever use setwd(), you’re using absolute paths and taking big risks…)

The code below finds my current working directory:

getwd()
## [1] "/Users/jeffgoldsmith/Dropbox/Work/Teaching/P8105/fall_2024/p8105.github.io"

A relative path to the same file as the absolute path above is

"./data/FAS_litters.csv"

The table below, copied from R Programming for Research, gives useful shorthand notation for relative pathnames.

Shorthand Meaning
~ Home directory
. Current working directory
.. One directory up from current working directory
../.. Two directories up from current working directory

Tabbed autocomplete in RStudio works with both absolute and relative paths.

One note: by default, R Markdown files treat whatever directory they’re in as the starting point for relative paths. For now, putting your R Markdown file in the same directory as you .RProj file will save a lot of headaches – the same path will work when you knit the document and when you write / edit code interactive in the console. For larger projects, you might want to have a separate directory with several R Markdown files. In that case using here::here() to identify and navigate from the project top-level directory is great, but you need a reasonable understanding of paths before jumping into that. You can also modify the behavior of R Markdown files (in Global Options) to evaluate code chunks in the project directory, which alleviates most of the issues. But honestly, at that point, just learn how here::here() works …

One caveat: if you’re going to both share code and use confidential data, you have to take extra steps to ensure data security (especially if that directory lives on GitHub). One option is to use the .gitignore file to ensure data aren’t uploaded inadvertently. Alternatively, you can store data outside of your shared project directory. In both cases there are ways to make you paths “stable”.

Importing data tables

Now that we have paths handled, we can start loading data. We’re going to start with rectangular data tables (data in rows and columns, with data separated by a delimiter) saved in a plain text format. Of these, CSV (comma separated value) files are most common, and others are handled in basically the same way. To import a CSV, we’ll use a function from readr (included in the tidyverse):

litters_df = read_csv(file = "./data/FAS_litters.csv")
## Rows: 49 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Group, Litter Number, GD0 weight, GD18 weight
## dbl (4): GD of Birth, Pups born alive, Pups dead @ birth, Pups survive
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Great – we’ve imported data! The first argument to read_csv is the path to the data, and the line above assigns the result of read_csv to litters_df. This function call also prints information about the column parsing. We’ll talk more about both of these shortly.

I (almost always) use janitor::clean_names() to clean up variable names after importing data. Doing so will take whatever the column names are and convert them to lower snake case.

names(litters_df)
## [1] "Group"             "Litter Number"     "GD0 weight"       
## [4] "GD18 weight"       "GD of Birth"       "Pups born alive"  
## [7] "Pups dead @ birth" "Pups survive"
litters_df = janitor::clean_names(litters_df)
names(litters_df)
## [1] "group"           "litter_number"   "gd0_weight"      "gd18_weight"    
## [5] "gd_of_birth"     "pups_born_alive" "pups_dead_birth" "pups_survive"

The package::function syntax lets you use a function from a package without loading the whole library. That’s really helpful, because some packages have functions with the same name (e.g. dplyr::filter and stats::filter), and R has to choose which one you mean. In general, only load the packages you need to prevent this kind of confusion. You can even use the conflicted package if you want to force yourself into using good habits!

Learning Assessment: Make sure you are able to load the FAS_pups.csv dataset. Use both absolute and relative paths. Quit R Studio and move the directory containing your project, data, and R Markdown document. Repeat the previous data import process; do both absolute and relative paths still work?

Solution

For me, loading the data with absolute and relative paths might look like the following:

pups_df = read_csv(file = "./data/FAS_pups.csv")
pups_df = read_csv(file = "~/Desktop/data_wranging_i/data/FAS_litters.csv")

If I move the directory holding this example, the relative path works and the absolute path is broken.

Looking at data

The first thing to do after importing the data (unless read_csv gives warnings) is to look at it. If there are unexpected results during data import, you’ll catch a lot of them here. In addition to printing the data (which is my first step), I often use View / view, str, head, and tail:

litters_df
## # A tibble: 49 × 8
##    group litter_number   gd0_weight gd18_weight gd_of_birth pups_born_alive
##    <chr> <chr>           <chr>      <chr>             <dbl>           <dbl>
##  1 Con7  #85             19.7       34.7                 20               3
##  2 Con7  #1/2/95/2       27         42                   19               8
##  3 Con7  #5/5/3/83/3-3   26         41.4                 19               6
##  4 Con7  #5/4/2/95/2     28.5       44.1                 19               5
##  5 Con7  #4/2/95/3-3     <NA>       <NA>                 20               6
##  6 Con7  #2/2/95/3-2     <NA>       <NA>                 20               6
##  7 Con7  #1/5/3/83/3-3/2 <NA>       <NA>                 20               9
##  8 Con8  #3/83/3-3       <NA>       <NA>                 20               9
##  9 Con8  #2/95/3         <NA>       <NA>                 20               8
## 10 Con8  #3/5/2/2/95     28.5       <NA>                 20               8
## # ℹ 39 more rows
## # ℹ 2 more variables: pups_dead_birth <dbl>, pups_survive <dbl>
tail(litters_df, 5)
## # A tibble: 5 × 8
##   group litter_number gd0_weight gd18_weight gd_of_birth pups_born_alive
##   <chr> <chr>         <chr>      <chr>             <dbl>           <dbl>
## 1 Low8  #100          20         39.2                 20               8
## 2 Low8  #4/84         21.8       35.2                 20               4
## 3 Low8  #108          25.6       47.5                 20               8
## 4 Low8  #99           23.5       39                   20               6
## 5 Low8  #110          25.5       42.7                 20               7
## # ℹ 2 more variables: pups_dead_birth <dbl>, pups_survive <dbl>

Another tool that I use sometimes is skimr::skim, although for me it’s more “neat” than useful.

skimr::skim(litters_df)
Data summary
Name litters_df
Number of rows 49
Number of columns 8
_______________________
Column type frequency:
character 4
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
group 0 1.00 4 4 0 6 0
litter_number 0 1.00 3 15 0 49 0
gd0_weight 13 0.73 1 4 0 26 0
gd18_weight 15 0.69 1 4 0 31 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
gd_of_birth 0 1 19.65 0.48 19 19 20 20 20 ▅▁▁▁▇
pups_born_alive 0 1 7.35 1.76 3 6 8 8 11 ▁▃▂▇▁
pups_dead_birth 0 1 0.33 0.75 0 0 0 0 4 ▇▂▁▁▁
pups_survive 0 1 6.41 2.05 1 5 7 8 9 ▁▃▂▇▇

I’m not going to show the output of the line below, because View / view don’t work well inside of .Rmd files. Instead, call these functions from the console, and be sure to remove them from any code chunks in your .Rmd.

view(litters_df)

Arguments to read_*

In the best case, the data are stored in the csv without any weirdness – there are no blank lines or columns, the first row is the variable name, missing values are stored in sensible ways. When this isn’t the case, arguments to read_csv are helpful. The ones I use most frequently are:

  • col_names: usually TRUE. If FALSE, column names are X1, X1, … . You can also supply column names.
  • na: string vector containing character expressions for missing values.
  • skip: number of rows to skip before reading data.

For example, the call below will skip the first 10 lines of data and not assume the first row are variable names:

litters_df = 
    read_csv(file = "./data/FAS_litters.csv",
    skip = 10, col_names = FALSE)
## Rows: 40 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): X1, X2, X3, X4
## dbl (4): X5, X6, X7, X8
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(litters_df)
## # A tibble: 6 × 8
##   X1    X2              X3    X4       X5    X6    X7    X8
##   <chr> <chr>           <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Con8  #3/5/2/2/95     28.5  <NA>     20     8     0     8
## 2 Con8  #5/4/3/83/3     28    <NA>     19     9     0     8
## 3 Con8  #1/6/2/2/95-2   <NA>  <NA>     20     7     0     6
## 4 Con8  #3/5/3/83/3-3-2 <NA>  <NA>     20     8     0     8
## 5 Con8  #2/2/95/2       <NA>  <NA>     19     5     0     4
## 6 Con8  #3/6/2/2/95-3   <NA>  <NA>     20     7     0     7

These arguments generally work for other members of the read_* family of functions.

A careful look at litters_df suggests we might also need to use the na argument in read_csv() – in GD0 weight and GD18 weight there are “values” of .. In general (and in this case) the . value is used to indicate missingness. And if we open FAS_litters.csv, we’ll see that NA and blank entries are also used to indicate a missing value.

The line below will treat ., NA, and blank cells as missing.

litters_df = 
    read_csv(
        file = "./data/FAS_litters.csv",
        na = c(".", "NA", ""))
## Rows: 49 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Group, Litter Number
## dbl (6): GD0 weight, GD18 weight, GD of Birth, Pups born alive, Pups dead @ ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(litters_df)
## # A tibble: 6 × 8
##   Group `Litter Number` `GD0 weight` `GD18 weight` `GD of Birth`
##   <chr> <chr>                  <dbl>         <dbl>         <dbl>
## 1 Con7  #85                     19.7          34.7            20
## 2 Con7  #1/2/95/2               27            42              19
## 3 Con7  #5/5/3/83/3-3           26            41.4            19
## 4 Con7  #5/4/2/95/2             28.5          44.1            19
## 5 Con7  #4/2/95/3-3             NA            NA              20
## 6 Con7  #2/2/95/3-2             NA            NA              20
## # ℹ 3 more variables: `Pups born alive` <dbl>, `Pups dead @ birth` <dbl>,
## #   `Pups survive` <dbl>

Parsing columns

I skipped the col_types argument because it’s worth talking about in some detail. The read_* functions will attempt to guess the data type stored in each column; by default, these guesses are based on the first 1000 rows. The guesses are also usually pretty good, and “bad” guesses often indicate that there’s something in the data you should pay more attention to – before treating . as missing GD0 weight was a character, but after indicating this is missing it’s a numeric variable.

In some cases, you’ll want to give explicit column specifications. This is done using the cols function, where you can give each column a type:

litters_df = 
    read_csv(file = "./data/FAS_litters.csv",
        na = c(".", "NA", ""),
    col_types = cols(
      Group = col_character(),
      `Litter Number` = col_character(),
      `GD0 weight` = col_double(),
      `GD18 weight` = col_double(),
      `GD of Birth` = col_integer(),
      `Pups born alive` = col_integer(),
      `Pups dead @ birth` = col_integer(),
      `Pups survive` = col_integer()
    )
  )

tail(litters_df)
## # A tibble: 6 × 8
##   Group `Litter Number` `GD0 weight` `GD18 weight` `GD of Birth`
##   <chr> <chr>                  <dbl>         <dbl>         <int>
## 1 Low8  #79                     25.4          43.8            19
## 2 Low8  #100                    20            39.2            20
## 3 Low8  #4/84                   21.8          35.2            20
## 4 Low8  #108                    25.6          47.5            20
## 5 Low8  #99                     23.5          39              20
## 6 Low8  #110                    25.5          42.7            20
## # ℹ 3 more variables: `Pups born alive` <int>, `Pups dead @ birth` <int>,
## #   `Pups survive` <int>

Note that you don’t have to specify the variable type for every column, and can only focus on ones that are difficult:

litters_df = 
    read_csv(file = "./data/FAS_litters.csv",
        na = c(".", "NA", ""),
    col_types = cols(
      Group = col_factor()
    )
)

head(litters_df)
## # A tibble: 6 × 8
##   Group `Litter Number` `GD0 weight` `GD18 weight` `GD of Birth`
##   <fct> <chr>                  <dbl>         <dbl>         <dbl>
## 1 Con7  #85                     19.7          34.7            20
## 2 Con7  #1/2/95/2               27            42              19
## 3 Con7  #5/5/3/83/3-3           26            41.4            19
## 4 Con7  #5/4/2/95/2             28.5          44.1            19
## 5 Con7  #4/2/95/3-3             NA            NA              20
## 6 Con7  #2/2/95/3-2             NA            NA              20
## # ℹ 3 more variables: `Pups born alive` <dbl>, `Pups dead @ birth` <dbl>,
## #   `Pups survive` <dbl>

Learning Assessment: Repeat the data import process above for the file FAS_pups.csv. Make sure the column names are reasonable, and take some quick looks at the dataset. What happens if your specifications for column parsing aren’t reasonable (e.g. character instead of double, or vice versa)?

Solution

After some tinkering, the code I’d use to import and view the pups dataset is below.

pups_df = 
    read_csv("./data/FAS_pups.csv",
        na = c(".", "NA"), col_types = "fddddd")

skimr::skim(pups_df)

Some kinds of poor column specification still “work” – you can code a number as a character, for example, even if you shouldn’t. Others won’t work – coding a character or factor as a number is nonsense, and you’ll end up with a column full of NA values.

Other file formats

Non-CSV plain text files (e.g. tab delimited files) can be handled using read_delim(). This is very similar to read_csv, but you have to specify a delimiter.

CSV format is great, but you’ll encounter a lot of Excel files too. Although you can export these to a CSV, don’t – use the readxl package instead! This is tidyverse-adjacent but you’ll have to download and install it separately. The read_excel function in this package has many of the same arguments as read_csv, including col_names, na, skip, and col_types, and can be used in basically the same way. There is also a sheet option (useful when there are multiple sheets in the Excel file) and the range option (when you want to read in a specific data rectangle). Lastly, in RStudio you can use File > Import Dataset > From Excel for a GUI interface. The code below illustrates read_excel.

library(readxl)

mlb11_df = read_excel("data/mlb11.xlsx", n_max = 20)

head(mlb11_df, 5)
## # A tibble: 5 × 12
##   team         runs at_bats  hits homeruns bat_avg strikeouts stolen_bases  wins
##   <chr>       <dbl>   <dbl> <dbl>    <dbl>   <dbl>      <dbl>        <dbl> <dbl>
## 1 Texas Rang…   855    5659  1599      210   0.283        930          143    96
## 2 Boston Red…   875    5710  1600      203   0.28        1108          102    90
## 3 Detroit Ti…   787    5563  1540      169   0.277       1143           49    95
## 4 Kansas Cit…   730    5672  1560      129   0.275       1006          153    71
## 5 St. Louis …   762    5532  1513      162   0.273        978           57    90
## # ℹ 3 more variables: new_onbase <dbl>, new_slug <dbl>, new_obs <dbl>

The last tidyverse package for data import we’ll note is haven, which is used to import into R data files from SAS, Stata, and SPSS. An example for reading data from SAS follows.

library(haven)

pulse_df = read_sas("./data/public_pulse_data.sas7bdat")

head(pulse_df, 5)
## # A tibble: 5 × 7
##      ID   age Sex   BDIScore_BL BDIScore_01m BDIScore_06m BDIScore_12m
##   <dbl> <dbl> <chr>       <dbl>        <dbl>        <dbl>        <dbl>
## 1 10003  48.0 male            7            1            2            0
## 2 10015  72.5 male            6           NA           NA           NA
## 3 10022  58.5 male           14            3            8           NA
## 4 10026  72.7 male           20            6           18           16
## 5 10035  60.4 male            4            0            1            2

You can read in data that isn’t coming as a flat file (think data rectangle), but it’s beyond our current scope.

Comparison with Base R

The functions in readr are relatively new, and can be used in place of base R’s read.csv, read.table, and so on. The base R versions tend to be slower (very noticeably for large datasets), and the default options can make less sense for modern datasets. Meanwhile, the readr functions export tibbles, and keep characters as characters (instead of converting to factors …).

Learning Assessment: Import the FAS_pups.csv dataset using read.csv. Compare the class of this dataset to the one produced by read_csv. Try printing both in the console – what happens? After cleaning up the names, try accessing the Sex variable using S (e.g., pups_df$S). What happens?

Solution

The code below shows how I might load the dataset using different import functions, and how I might compare the results.

pups_base = read.csv("./data/FAS_pups.csv")
pups_readr = read_csv("./data/FAS_pups.csv")

View(pups_base)
View(pups_readr)

pups_base
pups_readr

pups_base$S
pups_readr$S
In short, read_csv produces tibbles which are very similar to the base R data frames produced by read.csv. However, tibbles have some features that can help prevent mistakes and unwanted behavior.

Importing using File > Import

You can open many data files using RStudio’s drop-down menus. To import an excel spreadsheet, for example, you can use File > Import Dataset > From Excel. This allows several import options, previews the data, and shows the code necessary for importing. Importing in this way will load the data into your current session, but you’ll have to copy the import code to your RMarkdown file to ensure reproducibility. This approach to importing data can be helpful when you’re getting started, but gaining proficiency with writing code directly will be helpful in the long term and is more consistent with the goals of the course.

Exporting data

As a final point, you will sometimes need to export data after you have imported and cleaned it. The write_* functions in readr address this problem.

Other materials

The content in this page draws heavily from several sources; each of the things below goes into more detail in one way or another.

The code that I produced working examples in lecture is here.