Bootstrapping is a popular resampling-based approach to statistical inference, and is helpful when usual statistical methods are intractable or inappropriate. The idea is to draw repeated samples from your original sample with replacement, thereby approximating the repeated sampling framework. Using list columns to store bootstrap samples is natural and provides a “tidy” approach to resampling-based inference.

This is the third module in the Linear Models topic; the relevant slack channel is here.

## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::lag()    masks stats::lag()

## Example

I’ll write code for today’s content in a new R Markdown document called bootstrapping.Rmd in the linear_models directory / repo. The code chunk below loads usual packages and sets a seed for reproducibility.

library(tidyverse)
library(p8105.datasets)

set.seed(1)

### Bootstrapping

Bootstrapping is based on the idea of repeated sampling which underlies most approaches to statistical inference. Traditionally, the distribution of a sample statistic (sample mean, SLR coefficients, etc.) for repeated, random draws from a population has been established theoretically. These theoretical distributions make some assumptions about the underlying population from which samples are drawn, or depend on large sample sizes for asymptotic results.

In cases where the assumptions aren’t met, or sample sizes aren’t large enough for asymptotics to kick in, it is still necessary to make inferences using the sample statistic. In these cases, drawing repeatedly from the original population would be great – one could simple draw a lot of samples and look at the empirical (rather than theoretical) distribution. But, as we said in iteration and simulation, repeated sampling just doesn’t happen in the real world.

Repeated sampling can happen on a computer though. To bootstrap, one draws repeated samples (with the same sample size) from the original sample with replacement to mimic the process of drawing repeated samples from the population. The bootstrap samples will differ from the original sample, and the sample statistic of interest (sample mean, SLR coefficients, etc.) can be computed for each bootstrap sample. Looking at the distribution of the statistic across samples gives a sense of the uncertainty in the estimate.

### Bootstrapping in SLR

Let’s create some simulated data. First I’ll generate x, then an error sampled from a normal distribution, and then a response y; this all gets stored in sim_df_const. Then I’ll modify this by multiplying the errors by a term that involves x, and create a new response variable y.

n_samp = 250

sim_df_const =
tibble(
x = rnorm(n_samp, 1, 1),
error = rnorm(n_samp, 0, 1),
y = 2 + 3 * x + error
)

sim_df_nonconst = sim_df_const %>%
mutate(
error = error * .75 * x,
y = 2 + 3 * x + error
)

By generating data in this way, I’m creating one case in which the usual linear regression assumptions hold and one case in which they don’t. The plot below illustrates the differences between the dataset.

sim_df =
bind_rows(const = sim_df_const, nonconst = sim_df_nonconst, .id = "data_source")

sim_df %>%
ggplot(aes(x = x, y = y)) +
geom_point(alpha = .5) +
stat_smooth(method = "lm") +
facet_grid(~data_source) 

These datasets have roughly the same overall variance, but the left panel shows data with constant variance and the right panel shows data with non-constant variance. For this reason, ordinary least squares should provide reasonable estimates in both cases, but inference is standard inference approaches may only be justified for the data on the left.

The output below shows results from fitting simple linear regressions to both datasets.

lm(y ~ x, data = sim_df_const) %>%
broom::tidy() %>%
knitr::kable(digits = 3)
term estimate std.error statistic p.value
(Intercept) 1.977 0.098 20.157 0
x 3.045 0.070 43.537 0

lm(y ~ x, data = sim_df_nonconst) %>%
broom::tidy() %>%
knitr::kable(digits = 3)
term estimate std.error statistic p.value
(Intercept) 1.934 0.105 18.456 0
x 3.112 0.075 41.661 0

Despite the very different error structures, standard errors for coefficient estimates are similar in both cases!

We’ll use the bootstrap to make inference for the data on the right. This is intended largely as an illustration for how to use the bootstrap in cases where the theoretical distribution is “unknown”, although for these data in particular weighted least squares could be more appropriate.

### Drawing one bootstrap sample

Let’s write a quick function to generate our bootstrap samples. This function should have the data frame as the argument, and should return a sample from that dataframe drawn with replacement.

boot_sample = function(df) {
sample_frac(df, replace = TRUE)
}

We should also do a quick check to see if this is working.

boot_sample(sim_df_nonconst) %>%
ggplot(aes(x = x, y = y)) +
geom_point(alpha = .5) +
stat_smooth(method = "lm")

That looks about right. In comparison with the original data, the bootstrap sample has the same characteristics but isn’t a perfect duplicate – some original data points appear more than once, others don’t appear at all.

### Drawing many bootstrap samples

We’re going to draw repeated samples with replacement, and then analyze each of those samples separately. It would be really great to have a data structure that makes it possible to keep track of everything. Maybe a list column??!

Let’s give that a try:

boot_straps =
data_frame(
strap_number = 1:1000,
strap_sample = rerun(1000, boot_sample(sim_df_nonconst))
)
## Warning: data_frame() is deprecated, use tibble().
## This warning is displayed once per session.

boot_straps
## # A tibble: 1,000 x 2
##    strap_number strap_sample
##           <int> <list>
##  1            1 <tibble [250 × 3]>
##  2            2 <tibble [250 × 3]>
##  3            3 <tibble [250 × 3]>
##  4            4 <tibble [250 × 3]>
##  5            5 <tibble [250 × 3]>
##  6            6 <tibble [250 × 3]>
##  7            7 <tibble [250 × 3]>
##  8            8 <tibble [250 × 3]>
##  9            9 <tibble [250 × 3]>
## 10           10 <tibble [250 × 3]>
## # … with 990 more rows

We can do a few of quick checks to make sure this has worked as intended. First we’ll look at a couple of bootstrap samples.

boot_straps %>%
filter(strap_number %in% 1:2) %>%
mutate(strap_sample = map(strap_sample, ~arrange(.x, x))) %>%
pull(strap_sample)
## [[1]]
## # A tibble: 250 x 3
##         x   error       y
##     <dbl>   <dbl>   <dbl>
##  1 -1.89   1.62   -2.04
##  2 -1.89   1.62   -2.04
##  3 -1.21  -0.781  -2.43
##  4 -1.21  -0.781  -2.43
##  5 -1.00   0.832  -0.169
##  6 -0.989 -1.97   -2.93
##  7 -0.914 -0.908  -1.65
##  8 -0.606 -0.106   0.0774
##  9 -0.536  0.0227  0.413
## 10 -0.524 -0.536  -0.106
## # … with 240 more rows
##
## [[2]]
## # A tibble: 250 x 3
##         x  error       y
##     <dbl>  <dbl>   <dbl>
##  1 -1.29   1.40  -0.454
##  2 -0.989 -1.97  -2.93
##  3 -0.914 -0.908 -1.65
##  4 -0.914 -0.908 -1.65
##  5 -0.805  0.292 -0.123
##  6 -0.805  0.292 -0.123
##  7 -0.665 -0.544 -0.539
##  8 -0.641 -0.416 -0.338
##  9 -0.606 -0.106  0.0774
## 10 -0.606 -0.106  0.0774
## # … with 240 more rows

Seems okay – some values are repeated, some don’t appear in both datasets. Next I’ll use ggplot to show some of these datasets, and to include a linear fit for each.

boot_straps %>%
filter(strap_number %in% 1:3) %>%
unnest(strap_sample) %>%
ggplot(aes(x = x, y = y)) +
geom_point(alpha = .5) +
stat_smooth(method = "lm", se = FALSE) +
facet_grid(~strap_number) 

This shows some of the differences across bootstrap samples, and shows that the fitted regression lines aren’t the same for every bootstrap sample.

### Analyzing bootstrap samples

My goal, of course, isn’t to analyze bootstrap samples by plotting them – I’d like to get a sense of the variability in estimated intercepts and slopes across all my bootstrap samples.

To do that, I’ll use the analytic pipeline we established when looking at nested datasets in linear models: fit the model; tidy the output; unnest and examine the results. The code chunk below uses this pipeline to look at bootstrap standard errors for the estimated regression coefficients.

bootstrap_results =
boot_straps %>%
mutate(
models = map(strap_sample, ~lm(y ~ x, data = .x) ),
results = map(models, broom::tidy)) %>%
select(-strap_sample, -models) %>%
unnest() %>%
group_by(term) %>%
summarize(boot_se = sd(estimate))
## Warning: cols is now required.
## Please use cols = c(results)

bootstrap_results %>%
knitr::kable(digits = 3)
term boot_se
(Intercept) 0.075
x 0.101

Comparing these to the results of ordinary least squares, the standard error for the intercept is much smaller and the standard error for the intercept is a bit larger. This is reasonable, given the non-constant variance in the data given smaller residuals around zero and larger residuals in the the tails of the x distribution.

For a simple linear regression, we can show the fitted lines for each bootstrap sample to build intuition for these results.

boot_straps %>%
unnest(strap_sample) %>%
ggplot(aes(x = x, y = y)) +
geom_line(aes(group = strap_number), stat = "smooth", method = "lm", se = FALSE, alpha = .1, color = "blue") +
geom_point(data = sim_df_nonconst, alpha = .5)

In comparison to the standard error bands in our previous plot (which are based on OLS), the distribution of regression lines is narrower near $$x = 0$$ and wider at the ends of the $$x$$ distribution.

### bootstrap

Bootstrapping is common enough that it’s been automated, to some degree, in the modelr::boostrap function. This function makes it easy to draw bootstrap samples, and stores them in a mostly-helpful way – as a resample object that can be converted to and treated like a data frame. (This keeps you from having to actually store 1000 dataframes, and saves a lot of memory on your computer.)

boot_straps =
sim_df_nonconst %>%
modelr::bootstrap(n = 1000)

boot_straps$strap[[1]] ## <resample [250 x 3]> 8, 132, 69, 225, 180, 122, 34, 170, 216, 122, ... as_data_frame(boot_straps$strap[[1]])
## Warning: as_data_frame() is deprecated, use as_tibble() (but mind the new semantics).
## This warning is displayed once per session.
## # A tibble: 250 x 3
##         x  error     y
##     <dbl>  <dbl> <dbl>
##  1  1.74   0.747  7.96
##  2  0.411  0.343  3.58
##  3  1.15  -1.12   4.34
##  4 -0.157 -0.159  1.37
##  5  2.21  -1.13   7.50
##  6  2.34   0.488  9.52
##  7  0.946 -0.498  4.34
##  8  1.21   1.55   7.17
##  9  2.52   0.528 10.1
## 10  2.34   0.488  9.52
## # … with 240 more rows

Let’s repeat our analysis pipeline using the bootstrap function instead of our own process for drawing samples with replacement.

sim_df_nonconst %>%
modelr::bootstrap(n = 1000) %>%
mutate(
models = map(strap, ~lm(y ~ x, data = .x) ),
results = map(models, broom::tidy)) %>%
select(-strap, -models) %>%
unnest(results) %>%
group_by(term) %>%
summarize(boot_se = sd(estimate))
## # A tibble: 2 x 2
##   term        boot_se
##   <chr>         <dbl>
## 1 (Intercept)  0.0790
## 2 x            0.104

The results are the same (up to resampling variability), and the code to get here is pretty clean.

Also, check this out – to bootstrap the dataset with constant error variance, we only have to change the input dataframe!

sim_df_const %>%
modelr::bootstrap(n = 1000) %>%
mutate(models = map(strap, ~lm(y ~ x, data = .x) ),
results = map(models, broom::tidy)) %>%
select(-strap, -models) %>%
unnest(results) %>%
group_by(term) %>%
summarize(boot_se = sd(estimate))
## # A tibble: 2 x 2
##   term        boot_se
##   <chr>         <dbl>
## 1 (Intercept)  0.101
## 2 x            0.0737

These results generally agree with the output of the OLS procedure, which is nice.

### Airbnb data

As a final example, we’ll revisit the Airbnb data. The code chunk below loads and tidies the data.

data("nyc_airbnb")

nyc_airbnb =
nyc_airbnb %>%
mutate(stars = review_scores_location / 2) %>%
rename(
boro = neighbourhood_group,
neighborhood = neighbourhood) %>%
filter(boro != "Staten Island") %>%
select(price, stars, boro, neighborhood, room_type)

I’ll make a quick plot showing these data, with particular emphasis on the features I’m interested in analyzing: price as an outcome with stars and room_type as covariates.

nyc_airbnb %>%
ggplot(aes(x = stars, y = price, color = room_type)) +
geom_point()
## Warning: Removed 9962 rows containing missing values (geom_point).

In this plot (and in linear models, we noticed that some large outliers in price might affect estimates and inference for the association between star rating and price. Because estimates are likely to be sensitive to those outliers and “usual” rules for inference may not apply, the code chunk below uses the bootstrap to examine the distribution of regression coefficients under repeated sampling.

nyc_airbnb %>%
filter(boro == "Manhattan") %>%
modelr::bootstrap(n = 1000) %>%
mutate(
models = map(strap, ~ lm(price ~ stars + room_type, data = .x)),
results = map(models, broom::tidy)) %>%
select(results) %>%
unnest(results) %>%
filter(term == "stars") %>%
ggplot(aes(x = estimate)) + geom_density()

This distribution has a heavy tail extending to low values and a bit of a “shoulder”, features that may be related to the frequency with which large outliers are included in the bootstrap sample.

## Other materials

List columns take some getting used to; there are some materials to help with that.

Boostrapping and resampling are also new concepts; the materials below explore these using tidyverse approaches.

The code that I produced working examples in lecture is here.