By now, we have established some fundamental tools for doing data science. Before moving forward, it’s important to revisit our definition of data science, and especially our discussion of connotation.

The Slack channel for today’s example is here.

Example

This example is based entirely on live-coding and uses the NYC Airbnb data. The data can be imported using the p8105.datasets package:

library(tidyverse)  # provides %>% and count(), used below
library(p8105.datasets)

data(nyc_airbnb)

As always, I’ll do today’s coding in an R Markdown file, sitting in a GitHub repo / R Project.

Understanding variables

First, let’s take a few minutes to understand the dataset and the variables it contains.

# View(nyc_airbnb)
str(nyc_airbnb)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 40753 obs. of  17 variables:
##  $ id                            : num  7949480 16042478 1886820 6627449 5557381 ...
##  $ review_scores_location        : num  10 NA NA 10 10 10 10 9 10 9 ...
##  $ name                          : chr  "City Island Sanctuary relaxing BR & Bath w Parking" "WATERFRONT STUDIO APARTMENT" "Quaint City Island Community." "Large 1 BDRM in Great location" ...
##  $ host_id                       : num  119445 9117975 9815788 13886510 28811542 ...
##  $ host_name                     : chr  "Linda & Didier" "Collins" "Steve" "Arlene" ...
##  $ neighbourhood_group           : chr  "Bronx" "Bronx" "Bronx" "Bronx" ...
##  $ neighbourhood                 : chr  "City Island" "City Island" "City Island" "City Island" ...
##  $ lat                           : num  -73.8 -73.8 -73.8 -73.8 -73.8 ...
##  $ long                          : num  40.9 40.9 40.8 40.8 40.9 ...
##  $ room_type                     : chr  "Private room" "Private room" "Entire home/apt" "Entire home/apt" ...
##  $ price                         : num  99 200 300 125 69 125 85 39 95 125 ...
##  $ minimum_nights                : num  1 7 7 3 3 2 1 2 3 2 ...
##  $ number_of_reviews             : num  25 0 0 12 86 41 74 114 5 206 ...
##  $ last_review                   : Date, format: "2017-04-23" NA ...
##  $ reviews_per_month             : num  1.59 NA NA 0.54 3.63 2.48 5.43 2.06 5 2.98 ...
##  $ calculated_host_listings_count: num  1 1 1 1 1 1 1 4 3 4 ...
##  $ availability_365              : num  170 180 365 335 352 129 306 306 144 106 ...

nyc_airbnb %>%
  count(room_type)
## # A tibble: 3 x 2
##   room_type           n
##   <chr>           <int>
## 1 Entire home/apt 19937
## 2 Private room    19626
## 3 Shared room      1190

nyc_airbnb %>%
  count(neighbourhood_group)
## # A tibble: 5 x 2
##   neighbourhood_group     n
##   <chr>               <int>
## 1 Bronx                 649
## 2 Brooklyn            16810
## 3 Manhattan           19212
## 4 Queens               3821
## 5 Staten Island         261
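
Two further checks can help when getting to know the variables: how room types are distributed within each borough, and how many values are missing in each column (the str() output above already shows NA in a few of them). The following is only a sketch of one way to do this; the object names room_by_borough and missing_by_variable are made up for illustration and are not part of the lecture code.

```r
library(tidyverse)
library(p8105.datasets)

data(nyc_airbnb)

# Cross-tabulate listings: one row per borough, one column per room type.
# values_fill = 0 covers any borough / room type combination with no listings.
room_by_borough <-
  nyc_airbnb %>%
  count(neighbourhood_group, room_type) %>%
  pivot_wider(names_from = room_type, values_from = n, values_fill = 0)

# Count missing values in every column, then list the worst offenders first.
missing_by_variable <-
  nyc_airbnb %>%
  summarize(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  arrange(desc(n_missing))

room_by_borough
missing_by_variable
```

Variables with many missing values (the review-related ones are likely candidates) deserve some thought before they are used to answer questions.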

Brainstorming questions

A major element of data science is asking questions, and this dataset provides some rich opportunities. For example, we might ask:

  • Does rating vary by neighborhood, room type, or both?
  • How is price related to other variables?
  • Where are rentals located?

We’ll take a few minutes as a class to brainstorm some additional questions, and then try to answer some of them.
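
As a starting point for the first two questions, one possible approach is a grouped summary by borough and room type. This is a minimal sketch rather than the code from lecture. It assumes that review_scores_location, the only rating-like variable in the str() output above, is an acceptable stand-in for “rating”, and the name rating_price_summary is hypothetical.

```r
library(tidyverse)
library(p8105.datasets)

data(nyc_airbnb)

# Summarize listings within each borough / room type combination.
# Assumption: review_scores_location is used as the "rating" of a listing.
rating_price_summary <-
  nyc_airbnb %>%
  group_by(neighbourhood_group, room_type) %>%
  summarize(
    n_listings = n(),
    mean_location_score = mean(review_scores_location, na.rm = TRUE),
    median_price = median(price, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(median_price))

rating_price_summary
```

The median is used for price because a handful of very expensive listings could pull a mean upward; that choice is a judgment call and worth revisiting after plotting the price distribution.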

Other materials

Some additional links on leaflet:

  • A quick introduction is here
  • A more thorough overview is here
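
To connect leaflet to the question of where rentals are located, here is a minimal sketch of an interactive map. It rests on two assumptions. First, in the str() output above the column named lat holds values near -73.8 and the column named long holds values near 40.9, which looks swapped for New York City, so the sketch passes lat as the longitude and long as the latitude. Second, the map is restricted to Staten Island (261 listings) only to keep it small. The name staten_island_map is made up for illustration.

```r
library(tidyverse)
library(leaflet)
library(p8105.datasets)

data(nyc_airbnb)

staten_island_map <-
  nyc_airbnb %>%
  filter(neighbourhood_group == "Staten Island") %>%
  leaflet() %>%
  addTiles() %>%  # default OpenStreetMap background tiles
  # Assumption: the dataset's `lat` column contains longitudes and its
  # `long` column contains latitudes, judging by the values shown by str().
  addCircleMarkers(lng = ~lat, lat = ~long, radius = 2, popup = ~name)

staten_island_map
```

If the points land somewhere far from New York, the assumption about the swapped columns is wrong and the two arguments should be exchanged.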

Some other reading on thinking like a data scientist:

The code that I produced while working through examples in lecture is here.