We have, by now, established some fundamental tools for doing data science. It’s important to revisit our definition, and especially our discussion of connotation, before moving forward.
Pull together skills learned through this point to produce analytical summaries and reports.
This example is based entirely on live-coding and uses the NYC Airbnb data. The data can be imported using the p8105.datasets package:
library(tidyverse)       # provides %>% and count(), used below
library(p8105.datasets)
data(nyc_airbnb)
As always, I’ll do today’s coding in an R Markdown file, sitting in a GitHub repo / R Project.
First, let’s take a few minutes to understand the dataset and the variables it contains.
# View(nyc_airbnb)
str(nyc_airbnb)
## spc_tbl_ [40,753 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:40753] 7949480 16042478 1886820 6627449 5557381 ...
## $ review_scores_location : num [1:40753] 10 NA NA 10 10 10 10 9 10 9 ...
## $ name : chr [1:40753] "City Island Sanctuary relaxing BR & Bath w Parking" "WATERFRONT STUDIO APARTMENT" "Quaint City Island Community." "Large 1 BDRM in Great location" ...
## $ host_id : num [1:40753] 119445 9117975 9815788 13886510 28811542 ...
## $ host_name : chr [1:40753] "Linda & Didier" "Collins" "Steve" "Arlene" ...
## $ neighbourhood_group : chr [1:40753] "Bronx" "Bronx" "Bronx" "Bronx" ...
## $ neighbourhood : chr [1:40753] "City Island" "City Island" "City Island" "City Island" ...
## $ lat : num [1:40753] -73.8 -73.8 -73.8 -73.8 -73.8 ...
## $ long : num [1:40753] 40.9 40.9 40.8 40.8 40.9 ...
## $ room_type : chr [1:40753] "Private room" "Private room" "Entire home/apt" "Entire home/apt" ...
## $ price : num [1:40753] 99 200 300 125 69 125 85 39 95 125 ...
## $ minimum_nights : num [1:40753] 1 7 7 3 3 2 1 2 3 2 ...
## $ number_of_reviews : num [1:40753] 25 0 0 12 86 41 74 114 5 206 ...
## $ last_review : Date[1:40753], format: "2017-04-23" NA ...
## $ reviews_per_month : num [1:40753] 1.59 NA NA 0.54 3.63 2.48 5.43 2.06 5 2.98 ...
## $ calculated_host_listings_count: num [1:40753] 1 1 1 1 1 1 1 4 3 4 ...
## $ availability_365 : num [1:40753] 170 180 365 335 352 129 306 306 144 106 ...
nyc_airbnb %>%
count(room_type)
## # A tibble: 3 × 2
##   room_type           n
##   <chr>           <int>
## 1 Entire home/apt 19937
## 2 Private room    19626
## 3 Shared room      1190
nyc_airbnb %>%
count(neighbourhood_group)
## # A tibble: 5 × 2
##   neighbourhood_group     n
##   <chr>               <int>
## 1 Bronx                 649
## 2 Brooklyn            16810
## 3 Manhattan           19212
## 4 Queens               3821
## 5 Staten Island         261
A major element of data science is to ask questions, and this dataset provides some rich opportunities. For example, we might ask:
We’ll take a few minutes as a class to brainstorm some additional questions, and then try to answer some of them.
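To make the idea of "asking a question of the data" concrete, here is a minimal sketch of one possible starting point. The question itself (how does the typical nightly price vary by borough and room type?) is an illustrative assumption, not necessarily one raised in lecture, and the names `price_summary`, `n_listings`, and `median_price` are hypothetical. It uses only columns that appear in the `str()` output above.

```r
library(tidyverse)
library(p8105.datasets)

data(nyc_airbnb)

# Illustrative question (an assumption, not from the lecture):
# how does the typical nightly price vary by borough and room type?
price_summary =
  nyc_airbnb %>%
  group_by(neighbourhood_group, room_type) %>%
  summarize(
    n_listings = n(),                            # listings in each group
    median_price = median(price, na.rm = TRUE),  # median is less sensitive to extreme prices
    .groups = "drop"                             # return an ungrouped tibble
  ) %>%
  arrange(desc(median_price))

price_summary
```

The median is used here rather than the mean because listing prices are likely to be right-skewed; swapping in `mean()` or adding other summaries is a small change once the grouping is in place.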
Some additional links on leaflet:
Some other reading on thinking like a data scientist:
The code that I produced while working through examples in lecture is here.