What is Data Science?

We have, by now, established some fundamental tools for doing data science. It’s important to revisit our definition, and especially our discussion of connotation, before moving forward.

Overview

Learning Objectives

Pull together skills learned through this point to produce analytical summaries and reports.

Slide Deck

What is Data Science? Revisited from Jeff Goldsmith.

Example

This example is based entirely on live-coding and uses the NYC Airbnb data. The data can be imported using the p8105.datasets package:

# tidyverse provides %>% and count(), which are used below
library(tidyverse)
library(p8105.datasets)

data(nyc_airbnb)

As always, I’ll do today’s coding in an R Markdown file, sitting in a GitHub repo / R Project.

Understanding variables

First, let’s take a few minutes to understand the dataset and the variables it contains.

# View(nyc_airbnb)
str(nyc_airbnb)
## spc_tbl_ [40,753 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id                            : num [1:40753] 7949480 16042478 1886820 6627449 5557381 ...
##  $ review_scores_location        : num [1:40753] 10 NA NA 10 10 10 10 9 10 9 ...
##  $ name                          : chr [1:40753] "City Island Sanctuary relaxing BR & Bath w Parking" "WATERFRONT STUDIO APARTMENT" "Quaint City Island Community." "Large 1 BDRM in Great location" ...
##  $ host_id                       : num [1:40753] 119445 9117975 9815788 13886510 28811542 ...
##  $ host_name                     : chr [1:40753] "Linda & Didier" "Collins" "Steve" "Arlene" ...
##  $ neighbourhood_group           : chr [1:40753] "Bronx" "Bronx" "Bronx" "Bronx" ...
##  $ neighbourhood                 : chr [1:40753] "City Island" "City Island" "City Island" "City Island" ...
##  $ lat                           : num [1:40753] 40.9 40.9 40.8 40.8 40.9 ...
##  $ long                          : num [1:40753] -73.8 -73.8 -73.8 -73.8 -73.8 ...
##  $ room_type                     : chr [1:40753] "Private room" "Private room" "Entire home/apt" "Entire home/apt" ...
##  $ price                         : num [1:40753] 99 200 300 125 69 125 85 39 95 125 ...
##  $ minimum_nights                : num [1:40753] 1 7 7 3 3 2 1 2 3 2 ...
##  $ number_of_reviews             : num [1:40753] 25 0 0 12 86 41 74 114 5 206 ...
##  $ last_review                   : Date[1:40753], format: "2017-04-23" NA ...
##  $ reviews_per_month             : num [1:40753] 1.59 NA NA 0.54 3.63 2.48 5.43 2.06 5 2.98 ...
##  $ calculated_host_listings_count: num [1:40753] 1 1 1 1 1 1 1 4 3 4 ...
##  $ availability_365              : num [1:40753] 170 180 365 335 352 129 306 306 144 106 ...

nyc_airbnb %>%
  count(room_type)
## # A tibble: 3 × 2
##   room_type           n
##   <chr>           <int>
## 1 Entire home/apt 19937
## 2 Private room    19626
## 3 Shared room      1190

nyc_airbnb %>%
  count(neighbourhood_group)
## # A tibble: 5 × 2
##   neighbourhood_group     n
##   <chr>               <int>
## 1 Bronx                 649
## 2 Brooklyn            16810
## 3 Manhattan           19212
## 4 Queens               3821
## 5 Staten Island         261
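The str() output shows NA values in several columns, so it can help to count them before asking questions of the data. The following is a minimal sketch, not part of the original live-coding; the name missing_summary and the choice of columns are assumptions made for illustration.

```r
library(dplyr)
library(p8105.datasets)

data(nyc_airbnb)

# Count listings and missing values in two review-related columns.
# `missing_summary` is a hypothetical name chosen for this sketch.
missing_summary <-
  nyc_airbnb %>%
  summarize(
    n_listings = n(),
    missing_rating = sum(is.na(review_scores_location)),
    missing_reviews_per_month = sum(is.na(reviews_per_month))
  )

missing_summary
```

Any later summary of ratings has to decide how to handle these missing values, for example with na.rm = TRUE.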

Brainstorming questions

A major element of data science is asking questions, and this dataset provides some rich opportunities. For example, we might ask:

Does rating vary by neighborhood, room type, or both?
How is price related to other variables?
Where are rentals located?
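As one possible starting point for the first question, the sketch below averages ratings within each borough and room type. It assumes that review_scores_location is the rating of interest and that listings without a rating can be ignored; rating_summary is a hypothetical name, and this is not necessarily the approach taken in lecture.

```r
library(dplyr)
library(tidyr)
library(p8105.datasets)

data(nyc_airbnb)

# Mean location rating for each borough and room type.
# Assumption: listings with a missing rating are dropped via na.rm = TRUE.
rating_summary <-
  nyc_airbnb %>%
  group_by(neighbourhood_group, room_type) %>%
  summarize(
    mean_rating = mean(review_scores_location, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  # One row per borough, one column per room type
  pivot_wider(names_from = room_type, values_from = mean_rating)

rating_summary
```

Putting boroughs in rows and room types in columns makes it easier to see whether ratings differ along one dimension, the other, or both.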

We’ll take a few minutes as a class to brainstorm some additional questions, and then try to answer some of them.
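The price question can be approached in a similar way. This sketch, under the assumption that the median is a reasonable summary for skewed prices, tabulates the median price by borough and room type; price_summary is a hypothetical name introduced here.

```r
library(dplyr)
library(p8105.datasets)

data(nyc_airbnb)

# Median price and listing count for each borough and room type,
# sorted from most to least expensive.
price_summary <-
  nyc_airbnb %>%
  group_by(neighbourhood_group, room_type) %>%
  summarize(
    n_listings = n(),
    median_price = median(price, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(median_price))

price_summary
```

Keeping n_listings alongside the median helps flag groups that are too small to interpret with confidence.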

Other materials

Some additional links on leaflet:

A quick introduction is here
A more thorough overview is here
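For the question of where rentals are located, a minimal leaflet sketch might look like the following. The sample size, seed, and marker styling are arbitrary choices made for illustration, and rental_map is a hypothetical name.

```r
library(dplyr)
library(leaflet)
library(p8105.datasets)

data(nyc_airbnb)

# Plot a random sample of listings so the map stays responsive.
# The seed and sample size are arbitrary.
set.seed(1)

rental_map <-
  nyc_airbnb %>%
  filter(!is.na(lat), !is.na(long)) %>%
  slice_sample(n = 1000) %>%
  leaflet() %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~long, lat = ~lat,
    radius = 2, stroke = FALSE, fillOpacity = 0.5
  )

rental_map
```

In an R Markdown document knit to HTML, printing rental_map renders an interactive map.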

Some other reading on thinking like a data scientist:

If you didn’t listen before, now’s the time for Chris Volinsky’s “How Industry Views Data Science Education in Statistics Departments”
Jeff Leek’s problem forward blog post
This (somewhat long) post has great insights into a data science mindset
Although we’re playing with public (scraped) data, you might be interested in How R Helps Airbnb Make the Most of Its Data

The code that I produced working through examples in lecture is here.