Here, we’ll provide a basic definition of “data science” and discuss the connotations of the term in several contexts.

Example

For the purpose of this class, we’ll use the following working definition of data science:

Data science is the use of data to formulate and rigorously answer questions in a process that emphasizes clarity, reproducibility, and collaboration, and that recognizes code as a primary means of communication. 

In coming modules, we’ll learn about wrangling data, making visualizations, and conducting analyses. Throughout, we’ll focus on modern tools that facilitate best practices for working with data, including organization, reproducibility, and clear coding. Material will be presented in a way that combines didactic content with hands-on coding elements. Below are two examples we’ll return to later in the course.

Before introducing these, I’ll load the tidyverse.

library(tidyverse)

The next chunk of code loads and tidies an example dataset, which includes daily records of several weather-related variables at each of three weather stations.

weather_df = 
  rnoaa::meteo_pull_monitors(c("USW00094728", "USC00519397", "USS0023B17S"),
                      var = c("PRCP", "TMIN", "TMAX"), 
                      date_min = "2017-01-01",
                      date_max = "2017-12-31") %>%
  mutate(
    name = recode(id, USW00094728 = "CentralPark_NY", 
                      USC00519397 = "Waikiki_HA",
                      USS0023B17S = "Waterhole_WA"),
    tmin = tmin / 10,
    tmax = tmax / 10) %>%
  select(name, id, everything())
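To see what the tidying step does without downloading anything, here is a minimal sketch on a made-up two-row data frame. The names `toy_raw` and `toy_tidy` and all of the values are assumptions for illustration only; the one fact carried over is that the raw temperatures are recorded in tenths of a degree Celsius, which is why the chunk above divides by 10.

```r
library(dplyr)

# Hypothetical stand-in for the raw station data; the values are invented.
# Temperatures are stored in tenths of a degree Celsius, as in the raw data.
toy_raw = 
  tibble(
    id = c("USW00094728", "USC00519397"),
    date = as.Date(c("2017-01-01", "2017-01-01")),
    tmin = c(6, 172),
    tmax = c(89, 267))

toy_tidy = 
  toy_raw %>% 
  mutate(
    # attach a readable station name to each station id
    name = recode(id, USW00094728 = "CentralPark_NY", 
                      USC00519397 = "Waikiki_HA"),
    # convert tenths of a degree to degrees Celsius
    tmin = tmin / 10,
    tmax = tmax / 10) %>% 
  # move the identifying columns to the front
  select(name, id, everything())

toy_tidy
```

The result has one row per station per day, with readable names and temperatures in degrees Celsius, which is roughly the shape the plotting code expects.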

As we’ll discuss, a major element of working with data is producing visualizations. The plot below shows the maximum temperature at each of the three stations, as well as smooth trends over time to illustrate seasonal effects. This is produced using ggplot2, a package in the tidyverse that we’ll talk more about soon.

weather_df %>% 
  ggplot(aes(x = date, y = tmax, color = name)) + 
  geom_point(alpha = .5) +
  geom_smooth(se = FALSE) + 
  theme(legend.position = "bottom")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The next example uses data on Airbnb rentals in NYC, and is a bit more complex. The code below combines several steps to produce a map showing a sample of 5000 rentals in Brooklyn, Manhattan, and Queens; some important information (average rating, price, number of reviews) can be found by interacting with the map itself.

library(leaflet)
library(p8105.datasets)

data("nyc_airbnb")

nyc_airbnb = 
  nyc_airbnb %>% 
  mutate(stars = review_scores_location / 2) %>% 
  rename(boro = neighbourhood_group)

pal = colorNumeric(
  palette = "viridis",
  domain = nyc_airbnb$stars)

nyc_airbnb %>% 
  filter(boro %in% c("Manhattan", "Brooklyn", "Queens")) %>% 
  # keep only rentals with a non-missing star rating
  drop_na(stars) %>% 
  sample_n(5000) %>% 
  mutate(click_label = str_c("<b>$", price, "</b><br>", stars, " stars<br>", number_of_reviews, " reviews")) %>% 
  leaflet() %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addCircleMarkers(~lat, ~long, radius = .1, color = ~pal(stars), popup = ~click_label)
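The `pal` object in the chunk above can look a bit mysterious, so here is a small sketch of what `colorNumeric()` gives back. The name `pal_demo` and the 1-to-5 domain are assumptions made up for this illustration; the map itself uses the observed `stars` values as the domain.

```r
library(leaflet)

# colorNumeric() returns a function, not a vector of colors
pal_demo = colorNumeric(
  palette = "viridis",
  domain = c(1, 5))

# calling that function on numbers inside the domain gives one color per value
demo_colors = pal_demo(c(1, 3, 5))

demo_colors
```

Because `pal_demo` is a function, it can be used inside a formula like `~pal(stars)`, which is how the map assigns each marker a color based on its rating.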

Other materials

Lots of folks have opinions about what data science is. Here’s a collection of things that are worth reading (or watching).

We also touched on useful resources for learning data science. Each class session will have relevant readings; the following are useful in giving an overview about how to learn and find help.

  • Stack Overflow has a useful guide on how to ask a good question
  • Julia Evans’s blog also has a useful guide on how to ask good questions (note: hers has a cartoon!)
  • (Tip: the fact that there are guides on asking questions means it isn’t always easy or obvious how to do it well. That’s fine! Learning how to ask the right questions is important, and you should practice.)
  • This blog post and the follow-up disavowal are both interesting; one deals with learning to program and asking questions, and the other notes that flippant answers online are discouraging to people who want to learn but are regrettably common.