In data import, we saw how to load data from a variety of formats; this is a fairly standard way to get data that have been gathered as part of a study. In a lot of cases, though, you’re going to have to go out and get the data you want or need. That’s what we’re covering now.

This is the first module in the Data Wrangling II topic; the relevant Slack channel is here.

Example

As always, I’ll create a new GH Repo + local project called data_wrangling_ii, and open a new R Markdown file called reading_data_from_the_web.Rmd. Although we’ll mostly be getting data from the web, we’ll revisit some of these examples, so I’ll create a data subdirectory and put those in it.

There are some new additions to our standard packages (rvest and httr); I’m loading everything we need now. Now’s also the time to “install” the Selector Gadget.

library(tidyverse)
library(rvest)
library(httr)

Extracting tables

This page contains data from the National Survey on Drug Use and Health; it includes tables for drug use in the past year or month, separately for specific kinds of drug use. These data are potentially useful for analysis, and we’d like to be able to read in the first table.

First, let’s make sure we can load the data from the web.

url = "http://samhda.s3-us-gov-west-1.amazonaws.com/s3fs-public/field-uploads/2k15StateFiles/NSDUHsaeShortTermCHG2015.htm"
drug_use_xml = read_html(url)

drug_use_xml
## {xml_document}
## <html lang="en">
## [1] <head>\n<link rel="P3Pv1" href="http://www.samhsa.gov/w3c/p3p.xml">\ ...
## [2] <body>\r\n\r\n<noscript>\r\n<p>Your browser's Javascript is off. Hyp ...

Doesn’t look like much, but we’re there. Rather than trying to grab something using a CSS selector, let’s try our luck extracting the tables from the HTML.

drug_use_xml %>%
  html_nodes(css = "table")
## {xml_nodeset (15)}
##  [1] <table class="rti" border="1" cellspacing="0" cellpadding="1" width ...
##  [2] <table class="rti" border="1" cellspacing="0" cellpadding="1" width ...
##  [3] <table class="rti" border="1" cellspacing="0" cellpadding="1" width ...
##  [4] <table class="rti" border="1" cellspacing="0" cellpadding="1" width ...
##  [5] <table class="rti" border="1" cellspacing="0" cellpadding="1" width ...
##  [6] <table class="rti" border="1" cellspacing="0" cellpadding="1" width ...
##  [7] <table class="rti" border="1" cellspacing="0" cellpadding="1" width ...
##  [8] <table class="rti" border="1" cellspacing="0" cellpadding="1" width ...
##  [9] <table class="rti" border="1" cellspacing="0" cellpadding="1" width ...
## [10] <table class="rti" border="1" cellspacing="0" cellpadding="1" width ...
## [11] <table class="rti" border="1" cellspacing="0" cellpadding="1" width ...
## [12] <table class="rti" border="1" cellspacing="0" cellpadding="1" width ...
## [13] <table class="rti" border="1" cellspacing="0" cellpadding="1" width ...
## [14] <table class="rti" border="1" cellspacing="0" cellpadding="1" width ...
## [15] <table class="rti" border="1" cellspacing="0" cellpadding="1" width ...

This has extracted all of the tables on the original page; that’s why we have a list with 15 elements. (We haven’t really talked about lists yet, but for now you can think of them as a general collection of objects in R. As we proceed, syntax for extracting individual elements from a list will become clear, and we’ll talk lots about lists in list columns.)
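
Since we'll use exactly that extraction syntax in a second, here's a quick illustration on a toy list of my own (not from the page): [[ pulls out a single element, by position or by name.

example_list = list(numbers = 1:3, greeting = "hello")

example_list[[1]]
## [1] 1 2 3

example_list[["greeting"]]
## [1] "hello"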

We’re only focused on the first table for now, so let’s get the contents from the first list element.

table_marj = 
  drug_use_xml %>% 
  html_nodes(css = "table") %>% 
  .[[1]] %>%
  html_table()

I won’t print the table here, but if you look at it you’ll notice a problem: the “note” at the bottom of the table appears in every column in the first row. We need to remove that; I’ll also convert to a tibble so that things print nicely.

table_marj = 
  drug_use_xml %>% 
  html_nodes(css = "table") %>% 
  .[[1]] %>%
  html_table() %>%
  slice(-1) %>%       # drop the first row, which holds the repeated "note"
  as_tibble()

table_marj
## # A tibble: 56 x 16
##   State `12+(2013-2014)` `12+(2014-2015)` `12+(P Value)` `12-17(2013-201…
##   <chr> <chr>            <chr>            <chr>          <chr>           
## 1 Tota… 12.90a           13.36            0.002          13.28b          
## 2 Nort… 13.88a           14.66            0.005          13.98           
## 3 Midw… 12.40b           12.76            0.082          12.45           
## 4 South 11.24a           11.64            0.029          12.02           
## 5 West  15.27            15.62            0.262          15.53a          
## 6 Alab… 9.98             9.60             0.426          9.90            
## # … with 50 more rows, and 11 more variables: `12-17(2014-2015)` <chr>,
## #   `12-17(P Value)` <chr>, `18-25(2013-2014)` <chr>,
## #   `18-25(2014-2015)` <chr>, `18-25(P Value)` <chr>,
## #   `26+(2013-2014)` <chr>, `26+(2014-2015)` <chr>, `26+(P Value)` <chr>,
## #   `18+(2013-2014)` <chr>, `18+(2014-2015)` <chr>, `18+(P Value)` <chr>

Success!! At least, mostly. These data aren’t tidy, but we’ll worry about that soon.
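
If you're curious, below is a sketch of one possible tidying approach. It drops the p-value columns, pivots to a long format, and splits the age group from the year; I'm assuming the trailing letters on some values (like the a in 12.90a) are significance footnotes that can be stripped before converting to numeric.

table_marj %>%
  select(-contains("P Value")) %>%
  pivot_longer(
    -State,
    names_to = "age_year", 
    values_to = "percent") %>%
  separate(age_year, into = c("age", "year"), sep = "\\(") %>%
  mutate(
    year = str_remove(year, "\\)"),
    percent = str_remove(percent, "[a-c]$"),
    percent = as.numeric(percent))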

Learning assessment: Create a data frame that contains the cost of living table for New York from this page.

Solution

The code below shows one approach to this data cleaning process:

nyc_cost = 
  read_html("https://www.bestplaces.net/cost_of_living/city/new_york/new_york") %>%
  html_nodes(css = "table") %>%
  .[[1]] %>%
  html_table(header = TRUE)

In case you didn’t know, NYC is kind of expensive.

CSS Selectors

Suppose we’d like to scrape the data about the Harry Potter Saga from the IMDB page. The first step is the same as before – we need to get the HTML.

hpsaga_html = 
  read_html("https://www.imdb.com/list/ls000630791/")

The information isn’t stored in a handy table, so we’re going to isolate the CSS selector for each element we care about. A bit of clicking around with the Selector Gadget gets me the selectors that appear in the code below.

For each element, I’ll use the CSS selector in html_nodes() to extract the relevant HTML code, and convert it to text. Then I can combine these into a data frame.

title_vec = 
  hpsaga_html %>%
  html_nodes(".lister-item-header a") %>%
  html_text()

# the gross revenue selector is opaque; it came from clicking around with Selector Gadget
gross_rev_vec = 
  hpsaga_html %>%
  html_nodes(".text-small:nth-child(7) span:nth-child(5)") %>%
  html_text()

runtime_vec = 
  hpsaga_html %>%
  html_nodes(".runtime") %>%
  html_text()

hpsaga_df = 
  tibble(
    title = title_vec,
    rev = gross_rev_vec,
    runtime = runtime_vec)
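
A quick sanity check is worthwhile at this point: if one of the selectors had missed elements for some movies, these vectors would have different lengths and tibble() would have thrown an error. Checking the lengths directly makes that failure mode easy to spot.

length(title_vec)
length(gross_rev_vec)
length(runtime_vec)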

Learning assessment: This page contains the 10 most recent reviews of the movie “Napoleon Dynamite”. Use a process similar to the one above to extract the titles of the reviews. Note: getting the star ratings from Amazon is trickier, but the CSS path "#cm_cr-review_list .review-rating" helps – I discovered this after about an hour of googling around.

Solution

The code below will give me relevant information for the ten most recent reviews on Amazon:

url = "https://www.amazon.com/product-reviews/B00005JNBQ/ref=cm_cr_arp_d_viewopt_rvwer?ie=UTF8&reviewerType=avp_only_reviews&sortBy=recent&pageNumber=1"

dynamite_html = read_html(url)

review_titles = 
  dynamite_html %>%
  html_nodes(".a-text-bold span") %>%
  html_text()

review_stars = 
  dynamite_html %>%
  html_nodes("#cm_cr-review_list .review-rating") %>%
  html_text()

review_text = 
  dynamite_html %>%
  html_nodes(".review-text-content span") %>%
  html_text()

reviews = tibble(
  title = review_titles,
  stars = review_stars,
  text = review_text
)

Using an API

New York City has a great open data resource, and we’ll use that for our API examples. Although most (all?) of these datasets can be accessed by clicking through a website, we’ll access them directly using the API to improve reproducibility and make it easier to update results to reflect new data.

As a simple example, this page describes a dataset containing annual water consumption in NYC, along with the population in each year. First, we’ll import this as a CSV and parse it.

nyc_water = 
  GET("https://data.cityofnewyork.us/resource/waf7-5gvc.csv") %>% 
  content("parsed")
## Parsed with column specification:
## cols(
##   year = col_double(),
##   new_york_city_population = col_double(),
##   nyc_consumption_million_gallons_per_day = col_double(),
##   per_capita_gallons_per_person_per_day = col_double()
## )

We can also import this dataset as a JSON file. This takes a bit more work (and this is, really, a pretty easy case), but it’s still doable.

nyc_water = 
  GET("https://data.cityofnewyork.us/resource/waf7-5gvc.json") %>% 
  content("text") %>%
  jsonlite::fromJSON() %>%
  as_tibble()
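
One caveat worth flagging: depending on the endpoint, the JSON version may represent numbers as strings, so columns can arrive as character and need explicit conversion. If that happens here (check your own output), something like the following would fix it, since every column in this dataset is numeric-valued.

nyc_water = 
  nyc_water %>% 
  mutate(across(everything(), as.numeric))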

Data.gov also has a lot of data available using their API; often this is available as CSV or JSON as well. For example, we might be interested in data coming from BRFSS. This is importable via the API as a CSV (JSON, in this example, is much more complicated).

brfss_smart2010 = 
  GET("https://data.cdc.gov/api/views/acme-vg9e/rows.csv?accessType=DOWNLOAD") %>% 
  content("parsed")

Both of the previous examples are extremely easy – we accessed data that is essentially a data table, and we had a very straightforward API.

To get a sense of how this becomes complicated, let’s look at the Pokemon API (which is also pretty nice).

poke = 
  GET("http://pokeapi.co/api/v2/pokemon/1") %>%
  content()

poke$name
## [1] "bulbasaur"

poke$height
## [1] 7

poke$abilities
## [[1]]
## [[1]]$ability
## [[1]]$ability$name
## [1] "chlorophyll"
## 
## [[1]]$ability$url
## [1] "https://pokeapi.co/api/v2/ability/34/"
## 
## 
## [[1]]$is_hidden
## [1] TRUE
## 
## [[1]]$slot
## [1] 3
## 
## 
## [[2]]
## [[2]]$ability
## [[2]]$ability$name
## [1] "overgrow"
## 
## [[2]]$ability$url
## [1] "https://pokeapi.co/api/v2/ability/65/"
## 
## 
## [[2]]$is_hidden
## [1] FALSE
## 
## [[2]]$slot
## [1] 1

To build a Pokemon dataset for analysis, you’d need to distill the data returned from the API into a useful format; iterate across all pokemon; and combine the results.
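
As a minimal sketch of what that might look like, you could wrap the request in a function and iterate with tools from purrr. Note that get_pokemon() is a helper I’m defining here for illustration, not part of any package.

# a hypothetical helper: request one pokemon and keep a few fields
get_pokemon = function(id) {
  
  poke = 
    GET(str_c("https://pokeapi.co/api/v2/pokemon/", id)) %>% 
    content()
  
  tibble(
    name = poke$name,
    height = poke$height,
    abilities = str_c(map_chr(poke$abilities, ~ .x$ability$name), collapse = ", ")
  )
}

# iterate over the first few pokemon and stack the one-row results
pokemon_df = map(1:5, get_pokemon) %>% bind_rows()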

For both of the API examples we saw today, it wouldn’t be terrible to just download the CSV, document where it came from carefully, and move on. APIs are more helpful when the full dataset is complex and you only need pieces, or when the data are updated regularly.

Be reasonable

When you’re reading data from the web, remember you’re accessing resources on someone else’s server – either by reading HTML or by accessing data via an API. In some cases, those who make data public will take steps to limit bandwidth devoted to a small number of users. Amazon and IMDB, for example, probably won’t notice if you scrape small amounts of data but would notice if you tried to read data from thousands of pages every time you knitted a document.

Similarly, API developers can (and will) limit the number of database entries that can be accessed in a single request. In those cases you’d have to take some steps to iterate over “pages” and combine the results; as an example, our code for the NYC Restaurant Inspections does this. In some cases, API developers protect themselves from unreasonable use by requiring users to be authenticated – it’s still possible to use httr in these cases, but we won’t get into it.
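
To make the paging idea concrete, here’s a minimal sketch that reuses the water consumption endpoint from above. It assumes the Socrata $limit and $offset query parameters (standard for NYC Open Data), and it pauses between requests to stay polite.

# request one "page" of rows at a time, starting at the given offset
get_page = function(offset) {
  
  Sys.sleep(1)  # be reasonable: don't hammer the server
  
  GET(
    "https://data.cityofnewyork.us/resource/waf7-5gvc.csv",
    query = list("$limit" = 20, "$offset" = offset)) %>% 
    content("parsed")
}

# combine a few pages into one data frame
nyc_water_pages = map(c(0, 20, 40), get_page) %>% bind_rows()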

Other materials

  • A recent short course presented similar topics to those above; a GitHub repo for the course is here
  • A lot of NYC data is public; this is a good place to start looking for interesting data
  • There are some cool projects based on scraped data; the RStudio community collected some here
  • Check out the R file used to create the starwars dataset (in the tidyverse) using the Star Wars API (from the maker of the Pokemon API).
  • Some really helpful R packages are wrappers for APIs – the rnoaa package we’ve used is an example, and so is rtweet

The code that I used to produce working examples in lecture is here.