We’ve noted that functions are helpful when you repeat code more than twice; we’ve also noted that a lot of statistical methods involve doing the same thing a large number of times. Putting those together motivates a careful approach to iteratation.

Meanwhile, R’s data structures, especially data frames, are surprisingly flexible. This is useful when the “observations” you want to store become more complex than single values; for example, each row many contain a few scalar observations as well a complete data set. In these cases, list columns are an appropriate column type, and map functions provide a way to interact with those columns.

This is the second module in the Iteration topic; the relevant slack channel is here.

Example

I’ll write code for today’s content in a new R Markdown document called iteration_and_listcols.Rmd in the iteration repo / directory I started last time. The code chunk below loads the tidyverse and sets a seed for reproducibility.

library(tidyverse)

set.seed(1)

Things are gonna get a little weird…

Lists

We need a brief digression about lists before we do anything.

In R, vectors are limited to a single data class – all elements are characters, or all numeric, or all logical. Trying to join the following vectors will result in coersion, as would creating vectors of mixed types.

vec_numeric = 5:8
vec_char = c("My", "name", "is", "Jeff")
vec_logical = c(TRUE, TRUE, TRUE, FALSE)

Lists provide a way to store anything you want. This flexibility is great, but is offset by a certain … clunkiness. Lists contain indexed elements, and the indexed elements themselves be scalars, vectors, or other things entirely.

l = list(vec_numeric = 5:8,
         mat         = matrix(1:8, 2, 4),
         vec_logical = c(TRUE, FALSE),
         summary     = summary(rnorm(1000)))
l
## $vec_numeric
## [1] 5 6 7 8
## 
## $mat
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    7
## [2,]    2    4    6    8
## 
## $vec_logical
## [1]  TRUE FALSE
## 
## $summary
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.00805 -0.69737 -0.03532 -0.01165  0.68843  3.81028

Lists can be accessed using names or indices, and the things in lists can be accessed in the way you would usually access an object of that type.

l$vec_numeric
## [1] 5 6 7 8
l[[1]]
## [1] 5 6 7 8
l[[1]][1:3]
## [1] 5 6 7

Lists seem bizarre but are really useful. Right now, we’ll use them to hold general inputs and outputs of iterative processes. Even more importantly, we’ll see that data frames are actually a very specific kind of list – one comprised of vectors of the same length – which is why they can store variables of different types.

for loops

For this example, I’m going to start with the pretty simple data frame defined below, and confirm that “under the hood” this is a list.

df = tibble(
  a = rnorm(20, 3, 1),
  b = rnorm(20, 0, 5),
  c = rnorm(20, 10, .2),
  d = rnorm(20, -3, 1)
)

is.list(df)
## [1] TRUE

I’d like to apply my simple mean_and_sd function from writing functions to each column of this dataframe. For completeness, that function is below.

mean_and_sd = function(x) {
  
  if (!is.numeric(x)) {
    stop("Argument x should be numeric")
  } else if (length(x) == 1) {
    stop("Cannot be computed for length 1 vectors")
  }
  
  mean_x = mean(x)
  sd_x = sd(x)

  tibble(
    mean = mean_x, 
    sd = sd_x
  )
}

We can apply the mean_and_sd function to each column of df using the lines below. Throughout this content, I’ll take advantage of the fact that data frames are a kind of list – keeping this in mind when you’re iterating is really useful .

mean_and_sd(df[[1]])
## # A tibble: 1 x 2
##    mean    sd
##   <dbl> <dbl>
## 1  2.70  1.12
mean_and_sd(df[[2]])
## # A tibble: 1 x 2
##    mean    sd
##   <dbl> <dbl>
## 1 0.416  4.08
mean_and_sd(df[[3]])
## # A tibble: 1 x 2
##    mean    sd
##   <dbl> <dbl>
## 1  10.1 0.191
mean_and_sd(df[[4]])
## # A tibble: 1 x 2
##    mean    sd
##   <dbl> <dbl>
## 1 -3.43  1.18

But now we’ve broken our “don’t repeat code more than twice” rule! Specifically, we’ve applied the same function / operation to the elements of a list sequentially. This is exactly the kind of code repetition for loops address

Below, I define an output list with the same number of entries as my target dataframe; a sequence to iterate over; and a for loop body that applies the mean_and_sd function for each sequence element and saves the result.

output = vector("list", length = 4)

for (i in 1:4) {
  output[[i]] = mean_and_sd(df[[i]])
}

This is already much cleaner than using four almost-identical lines of code, and will make life easier the larger our sequence gets.

In this example, I bypassed a common first step in writing loops because I already had the function I wanted to repeat. Frequently, however, I’ll start with repeated code segements, then abstract the underlying process into a function, and then wrap things up in a for loop.

map

A criticism of for loops is that there’s a lot of overhead – you have to define your output vector / list, there’s the for loop bookkeeping to do, etc – that distracts from the purpose of the code. In this case, we want to apply mean_and_sd to each column of df, but we have to scan inside the for loop to figure that out.

The map functions in purrr try to make the purpose of your code clear. Compare the loop above to the line below.

output = map(df, mean_and_sd)

The first argument to map is the vector / list ( / data frame) we want to iterate over, and the second argument is the function we want to apply to each element. The line above will produce the same output as the previous loop, but is clearer and easier to understand (once you’re used to map …).

It’s sometimes necessary to be more specific in giving arguments to map. In particular, using .x = df for the input list and ~ mean_and_sd(.x) to specify the function applied to the input list (using .x as a placeholder) will produce the same result.

output = map_df(.x = df, ~ mean_and_sd(.x))

This code (using map) is why we pointed out in writing functions that functions can be passed as arguments to other functions. The second argument in map(df, mean_and_sd) is a function we just wrote. To see how powerful this can be, suppose we wanted to apply a different function, say median, to each column of df. The chunk below includes both the loop and the map approach.

output = vector("list", length = 4)

for (i in 1:4) {
  output[[i]] = median(df[[i]])
}

output = map(df, median)
# output = map(.x = df, ~ median(.x))

Again, both options produce the same output, but the map places the focus squarely on the function you want to apply by removing much of the bookkeeping.

map variants

There are some useful variants to the basic map function if you know what kind of output you’re going to produce. Below we use map_dbl because median outputs a single numeric value each time; the result is a vector instead of a list. Using the .id argument keeps the names of the elements in the input list.

output = map_dbl(df, median, .id = "input")

If we tried to use map_int or map_lgl, we’d get an error because the output of median isn’t a integer or a logical. This is a good way to help catch mistakes when they arise.

Similarly, since we know mean_and_sd produces a data frame, we can use the output-specific map_dfr; this will produce a single data frame.

output = map_dfr(df, mean_and_sd, .id = "input")

The map_df variants can be helpful when your map statement is part of a longer chain of piped commands.

Finally, map2 (and map2_dbl, etc) is helpful when your function has two arguments. In these cases, I find it very helpful to be specific about arguments using something like the following:

output = map2(.x = input_1, .y = input_2, ~func(arg_1 = .x, arg_2 = .y))

Although we won’t go into this in detail, one example where you might use this is in loading the LoTR data using functions and inputs we defined previously.

lotr_data = map2_df(
  .x = cell_ranges, .y = movie_names, 
  ~lotr_load_and_tidy(path = "./data/LotR_Words.xlsx", range = .x, movie_name = .y))

Revisiting Napoleon

In reading data from the web and elsewhere, we wrote code that allowed us to scrape information in Amazon reviews; in writing functions we wrapped that code into a function called read_page_reviews which, for a given url, produces a data frame containing review titles, star ratings, and text.

library(rvest)
## Loading required package: xml2
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
## 
##     pluck
## The following object is masked from 'package:readr':
## 
##     guess_encoding

read_page_reviews = function(url) {
  
  h = read_html(url)
  
  title = h %>%
    html_nodes("#cm_cr-review_list .review-title") %>%
    html_text()
  
  stars = h %>%
    html_nodes("#cm_cr-review_list .review-rating") %>%
    html_text() %>%
    str_extract("\\d") %>%
    as.numeric()
  
  text = h %>%
    html_nodes(".review-data:nth-child(5)") %>%
    html_text()
  
  data_frame(title, stars, text)
}

Learning Assessment: Use this function to read the five pages of reviews on the URLs defined in the code chunk below.

url_base = "https://www.amazon.com/product-reviews/B00005JNBQ/ref=cm_cr_arp_d_viewopt_rvwer?ie=UTF8&reviewerType=avp_only_reviews&sortBy=recent&pageNumber="
vec_urls = str_c(url_base, 1:5)

Solution

First I’ll define a vector of URLs to act as an input, and then I’ll iterate over that vector using both a for loop and a map_df statement.

output = vector("list", 5)

for (i in 1:5) {
  output[[i]] = read_page_reviews(vec_urls[[i]])
}

dynamite_reviews = bind_rows(output)

dynamite_reviews = map_df(vec_urls, read_page_reviews)

As with previous examples, using a for loop is pretty okay but the map_df call is clearer.

List columns

Shifting gears a bit, let’s revisit the weather data from visualization and elsewhere; these data consist of one year of observations from three monitoring stations. The code below pulls these data into R (using the rnoaa package, which interacts with the NOAA API).

weather = 
  rnoaa::meteo_pull_monitors(
    c("USW00094728", "USC00519397", "USS0023B17S"),
    var = c("PRCP", "TMIN", "TMAX"), 
    date_min = "2016-01-01",
    date_max = "2016-12-31") %>%
  mutate(
    name = recode(id, USW00094728 = "CentralPark_NY", 
                      USC00519397 = "Waikiki_HA",
                      USS0023B17S = "Waterhole_WA"),
    tmin = tmin / 10,
    tmax = tmax / 10) %>%
  select(name, id, everything())
## Registered S3 method overwritten by 'crul':
##   method                 from
##   as.character.form_file httr
## Registered S3 method overwritten by 'hoardr':
##   method           from
##   print.cache_info httr
## file path:          /Users/jeffgoldsmith/Library/Caches/rnoaa/ghcnd/USW00094728.dly
## file last updated:  2018-06-14 13:08:48
## file min/max dates: 1869-01-01 / 2018-06-30
## file path:          /Users/jeffgoldsmith/Library/Caches/rnoaa/ghcnd/USC00519397.dly
## file last updated:  2018-06-14 13:09:00
## file min/max dates: 1965-01-01 / 2018-06-30
## file path:          /Users/jeffgoldsmith/Library/Caches/rnoaa/ghcnd/USS0023B17S.dly
## file last updated:  2018-06-14 13:09:04
## file min/max dates: 1999-09-01 / 2018-06-30

The station name and id are constant across the year’s temperature and precipitation data. For that reason, we can reorganize these data into a new data frame with a single row for each station. Weather data will be separated into three station-specific data frames, each of which is the data “observation” for the respective station.

weather_nest = 
  nest(weather, data = date:tmin)

weather_nest
## # A tibble: 3 x 3
##   name           id                    data
##   <chr>          <chr>       <list<df[,4]>>
## 1 CentralPark_NY USW00094728      [366 × 4]
## 2 Waikiki_HA     USC00519397      [366 × 4]
## 3 Waterhole_WA   USS0023B17S      [366 × 4]

Here I’ve used nest by specifying a column range to collapse within remaining variable values.

The name column is a character column – if you pull this column from the weather data frame, the result is a character vector. Similarly, the data column is a list column – on it’s own, it’s a list.

weather_nest %>% pull(name)
## [1] "CentralPark_NY" "Waikiki_HA"     "Waterhole_WA"
weather_nest %>% pull(data)
## <list_of<
##   tbl_df<
##     date: date
##     prcp: double
##     tmax: double
##     tmin: double
##   >
## >[3]>
## [[1]]
## # A tibble: 366 x 4
##   date        prcp  tmax  tmin
##   <date>     <dbl> <dbl> <dbl>
## 1 2016-01-01     0   5.6   1.1
## 2 2016-01-02     0   4.4   0  
## 3 2016-01-03     0   7.2   1.7
## 4 2016-01-04     0   2.2  -9.9
## 5 2016-01-05     0  -1.6 -11.6
## 6 2016-01-06     0   5    -3.8
## # … with 360 more rows
## 
## [[2]]
## # A tibble: 366 x 4
##   date        prcp  tmax  tmin
##   <date>     <dbl> <dbl> <dbl>
## 1 2016-01-01     0  29.4  16.7
## 2 2016-01-02     0  28.3  16.7
## 3 2016-01-03     0  28.3  16.7
## 4 2016-01-04     0  28.3  16.1
## 5 2016-01-05     0  27.2  16.7
## 6 2016-01-06     0  27.2  20  
## # … with 360 more rows
## 
## [[3]]
## # A tibble: 366 x 4
##   date        prcp  tmax  tmin
##   <date>     <dbl> <dbl> <dbl>
## 1 2016-01-01     0   1.7  -5.9
## 2 2016-01-02    25  -0.1  -6  
## 3 2016-01-03     0  -5   -10  
## 4 2016-01-04    25   0.3  -9.8
## 5 2016-01-05    25   1.9  -1.8
## 6 2016-01-06    25   1.4  -2.6
## # … with 360 more rows

The list column really is a list, and will behave as such elsewhere in R. So, for example, you can examine the first list entry using usual list index procedures.

weather_nest$data[[1]]
## # A tibble: 366 x 4
##   date        prcp  tmax  tmin
##   <date>     <dbl> <dbl> <dbl>
## 1 2016-01-01     0   5.6   1.1
## 2 2016-01-02     0   4.4   0  
## 3 2016-01-03     0   7.2   1.7
## 4 2016-01-04     0   2.2  -9.9
## 5 2016-01-05     0  -1.6 -11.6
## 6 2016-01-06     0   5    -3.8
## # … with 360 more rows

Of course, if you can nest data you should be able to unnest it as well, and you can (with the caveat that you’re unnesting a list column that contains a data frame).

unnest(weather_nest, cols = data)
## # A tibble: 1,098 x 6
##   name           id          date        prcp  tmax  tmin
##   <chr>          <chr>       <date>     <dbl> <dbl> <dbl>
## 1 CentralPark_NY USW00094728 2016-01-01     0   5.6   1.1
## 2 CentralPark_NY USW00094728 2016-01-02     0   4.4   0  
## 3 CentralPark_NY USW00094728 2016-01-03     0   7.2   1.7
## 4 CentralPark_NY USW00094728 2016-01-04     0   2.2  -9.9
## 5 CentralPark_NY USW00094728 2016-01-05     0  -1.6 -11.6
## 6 CentralPark_NY USW00094728 2016-01-06     0   5    -3.8
## # … with 1,092 more rows

Nesting columns can help with data organization and comprehension by masking complexity you’re less concerned about right now and clarifying the things you are concerned about. In the weather data, it can be helpful to think of stations as the basic unit of observation, and daily weather recordings as a more granular level of observation. Nesting can also simplify the use of analytic approaches across levels of a higher variable.

Operations on list columns

You will need to be able to manipulate list columns, but usual operations for columns that might appear in mutate (like mean or recode) often don’t apply to the entries in a list column.

Instead, recognizing list columns as list columns motivates an approach for working with nested data frames.

Suppose we want to fit the simple linear regression relating tmax to tmin for each station-specific data frame. First I’ll write a quick function that takes a data frame as the sole argument to fit this model.

weather_lm = function(df) {
  lm(tmax ~ tmin, data = df)
}

Let’s make sure this works on a single data frame.

weather_lm(weather_nest$data[[1]])
## 
## Call:
## lm(formula = tmax ~ tmin, data = df)
## 
## Coefficients:
## (Intercept)         tmin  
##       7.779        1.045

Great! Keeping in mind that weather$data is a list, we can apply our weather_lm function to each data frame using map.

map(weather_nest$data, weather_lm)
## [[1]]
## 
## Call:
## lm(formula = tmax ~ tmin, data = df)
## 
## Coefficients:
## (Intercept)         tmin  
##       7.779        1.045  
## 
## 
## [[2]]
## 
## Call:
## lm(formula = tmax ~ tmin, data = df)
## 
## Coefficients:
## (Intercept)         tmin  
##      22.489        0.326  
## 
## 
## [[3]]
## 
## Call:
## lm(formula = tmax ~ tmin, data = df)
## 
## Coefficients:
## (Intercept)         tmin  
##       6.851        1.245

You can avoid the creation of a dedicated function using map’s syntax for “anonymous” (i.e. not named and saved) functions. This is fine for really short operations, but I typically write a dedicated function instead of using this option.

map(weather_nest$data, ~lm(tmax ~ tmin, data = .x))
## [[1]]
## 
## Call:
## lm(formula = tmax ~ tmin, data = .x)
## 
## Coefficients:
## (Intercept)         tmin  
##       7.779        1.045  
## 
## 
## [[2]]
## 
## Call:
## lm(formula = tmax ~ tmin, data = .x)
## 
## Coefficients:
## (Intercept)         tmin  
##      22.489        0.326  
## 
## 
## [[3]]
## 
## Call:
## lm(formula = tmax ~ tmin, data = .x)
## 
## Coefficients:
## (Intercept)         tmin  
##       6.851        1.245

The map function returns a list; we could store the results as a new list column … !!!

We’ve been using mutate to define a new variable in a data frame, especially one that is a function of an existing variable. That’s exactly what we will keep doing.

weather_nest = 
  weather_nest %>% 
  mutate(models = map(data, weather_lm))

weather_nest
## # A tibble: 3 x 4
##   name           id                    data models
##   <chr>          <chr>       <list<df[,4]>> <list>
## 1 CentralPark_NY USW00094728      [366 × 4] <lm>  
## 2 Waikiki_HA     USC00519397      [366 × 4] <lm>  
## 3 Waterhole_WA   USS0023B17S      [366 × 4] <lm>

This is great! We now have a data frame that has rows for each station; columns contain weather datasets and fitted models. This makes it very easy to keep track of models across stations, and to perform additional analyses.

This is, for sure, a fairly complex bit of code, but in just a few lines we’re able to fit separate linear models to each of our stations. And, once you get used to list columns, map, and the rest of it, these lines of code are pretty clear and can be extended to larger datasets with more complex structures.

Revisiting past examples

Earlier, we looked at scraping reviews of Napoleon Dynamite from several pages on Amazon. This required the creation of input and output vectors / lists; now we’ll repeat this using data frames and list columns.

dynamite_reviews = 
  tibble(page = 1:5,
         urls = str_c(url_base, page)) %>% 
  mutate(reviews = map(urls, read_page_reviews)) %>% 
  unnest()
## Warning: `cols` is now required.
## Please use `cols = c(reviews)`

I’ll also revisit a way to load and tidy the LoTR words data that starts with a dataframe containing cell ranges for each movie. Of all the approaches we’ve seen, this is probably my favorite.

lotr_cell_ranges = 
  tibble(
    movie = c("fellowship_ring", "two_towers", "return_king"),
    cells = c("B3:D6", "F3:H6", "J3:L6")
  )

lotr_tidy = 
  lotr_cell_ranges %>% 
  mutate(
    word_data = map(cells, ~readxl::read_excel("./data/LotR_Words.xlsx", range = .x))
  ) %>% 
  unnest(cols = word_data) %>% 
  janitor::clean_names() %>% 
  pivot_longer(
    female:male,
    names_to = "sex",
    values_to = "words") %>%
  mutate(race = str_to_lower(race)) %>% 
  select(movie, everything(), -cells) 

Other materials

Iteration can be tricky – the readings below will help as you work through this!

  • The chapter on iteration in R for Data Science has a useful treatment of iteration using loops and map
  • Jenny Bryan’s purrr tutorial has a lot of useful information and examples
  • R Programming for Data Science has information on loops and loop functions; given Roger Peng’s tendency towards base R he focuses on lapply and others instead of map
  • This question and response on stack overflow explains why one might prefer map to lapply

List columns take some getting used to; there are some materials to help with that.

  • R for Data Science has a chapter on fitting many models
  • Jenny Bryan’s purrr tutorial has useful list-column examples
  • The R file used to create the starwars dataset (in the tidyverse) using the Star Wars API processes list output (from the API) using several map variants

The code that I produced working examples in lecture is here.