Linear regression models are fundamental in statistics and data science. When seeking to understand how covariates are associated with outcomes, linear models are among the first, best options. Although other regression approaches are possible, the flexibility and interpretability and of linear models make them essential.
This content assumes some familiarity with linear models, and focuses on the implementation of models in R rather than on the theory or interpretation of the models themselves.
This is the first module in the Linear Models topic.
Review fundamentals of linear and generalized linear models, fit models in R, and tidy results for further analysis.
I’ll write code for today’s content in a new R Markdown document
linear_models.Rmd in a
directory / repo. The code chunk below loads some usual packages and
sets a seed for reproducibility.
library(tidyverse) library(p8105.datasets) set.seed(1)
The code below loads and cleans the Airbnb data, which we’ll use as a primary example for fitting linear models.
data("nyc_airbnb") nyc_airbnb = nyc_airbnb %>% mutate(stars = review_scores_location / 2) %>% rename( borough = neighbourhood_group, neighborhood = neighbourhood) %>% filter(borough != "Staten Island") %>% select(price, stars, borough, neighborhood, room_type)
An good place to start is to consider price as an outcome that may depend on rating and borough. We fit that initial model in the following code.
fit = lm(price ~ stars + borough, data = nyc_airbnb)
lm function begins with the formula specification –
outcome on the left of the
~ and predictors separated by
+ on the right. As we’ll see shortly, interactions between
variables can be specified using
*. You can also specify an
intercept-only model (
outcome ~ 1), a model with no
outcome ~ 0 + ...), and a model using all
available predictors (
outcome ~ .).
R will treat categorical (factor) covariates appropriately and
predictably: indicator variables are created for each non-reference
category and included in your model, and the factor level is treated as
the reference. As with
ggplot, being careful with factors
is therefore critical!
nyc_airbnb = nyc_airbnb %>% mutate( borough = fct_infreq(borough), room_type = fct_infreq(room_type)) fit = lm(price ~ stars + borough, data = nyc_airbnb)
It’s important to note that changing reference categories won’t change “fit” or statistical sigificance, but can affect ease of interpretation.
The output of a
lm is an object of class
– a very specific list that isn’t a dataframe but that can be
manipulated using other functions. Some common functions for interacting
lm fits are below, although we omit the output.
summary(fit) summary(fit)$coef coef(fit) fitted.values(fit)
The reason that we omit the output is that it’s a huge pain to deal
summary produces an object of class
summary.lm, which is also a list – that’s how we extracted
the coefficients using
produces a vector of coefficient values, and
is a vector of fitted values. None of this is tidy.
It’s helpful to know about the products of
lm and to
know there are a range of ways to interact with models in base R. That
said, for the most part it’s easiest to use tidy tools.
broom package has functions for obtaining a quick
summary of the model and for cleaning up the coefficient table.
fit %>% broom::glance()
## # A tibble: 1 × 12 ## r.squared adj.r.…¹ sigma stati…² p.value df logLik AIC BIC devia…³ ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.0342 0.0341 182. 271. 6.73e-229 4 -2.02e5 4.04e5 4.04e5 1.01e9 ## # … with 2 more variables: df.residual <int>, nobs <int>, and abbreviated ## # variable names ¹adj.r.squared, ²statistic, ³deviance
fit %>% broom::tidy()
## # A tibble: 5 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 19.8 12.2 1.63 1.04e- 1 ## 2 stars 32.0 2.53 12.7 1.27e- 36 ## 3 boroughBrooklyn -49.8 2.23 -22.3 6.32e-109 ## 4 boroughQueens -77.0 3.73 -20.7 2.58e- 94 ## 5 boroughBronx -90.3 8.57 -10.5 6.64e- 26
Both of these functions produce data frames, which makes it straightforward to include the results in subsequent steps.
fit %>% broom::tidy() %>% select(term, estimate, p.value) %>% mutate(term = str_replace(term, "^borough", "Borough: ")) %>% knitr::kable(digits = 3)
As an aside,
broom::tidy works with lots of things,
including most of the functions for model fitting you’re likely to run
into (survival, mixed models, additive models, …).
Regression diagnostics can identify issues in model fit, especially related to certain failures in model assumptions. Examining residuals and fitted values are therefore an imporant component of any modeling exercise.
modelr package can be used to add residuals and
fitted values to a dataframe.
## # A tibble: 40,492 × 6 ## price stars borough neighborhood room_type resid ## <dbl> <dbl> <fct> <chr> <fct> <dbl> ## 1 99 5 Bronx City Island Private room 9.47 ## 2 200 NA Bronx City Island Private room NA ## 3 300 NA Bronx City Island Entire home/apt NA ## 4 125 5 Bronx City Island Entire home/apt 35.5 ## 5 69 5 Bronx City Island Private room -20.5 ## 6 125 5 Bronx City Island Entire home/apt 35.5 ## 7 85 5 Bronx City Island Entire home/apt -4.53 ## 8 39 4.5 Bronx Allerton Private room -34.5 ## 9 95 5 Bronx Allerton Entire home/apt 5.47 ## 10 125 4.5 Bronx Allerton Entire home/apt 51.5 ## # … with 40,482 more rows
## # A tibble: 40,492 × 6 ## price stars borough neighborhood room_type pred ## <dbl> <dbl> <fct> <chr> <fct> <dbl> ## 1 99 5 Bronx City Island Private room 89.5 ## 2 200 NA Bronx City Island Private room NA ## 3 300 NA Bronx City Island Entire home/apt NA ## 4 125 5 Bronx City Island Entire home/apt 89.5 ## 5 69 5 Bronx City Island Private room 89.5 ## 6 125 5 Bronx City Island Entire home/apt 89.5 ## 7 85 5 Bronx City Island Entire home/apt 89.5 ## 8 39 4.5 Bronx Allerton Private room 73.5 ## 9 95 5 Bronx Allerton Entire home/apt 89.5 ## 10 125 4.5 Bronx Allerton Entire home/apt 73.5 ## # … with 40,482 more rows
Like many things in the tidyverse, the first argument is a dataframe. That makes it easy to included steps adding residuals or predictions in pipeline of commands to conduct inspections and perform diagnostics.
nyc_airbnb %>% modelr::add_residuals(fit) %>% ggplot(aes(x = borough, y = resid)) + geom_violin()
nyc_airbnb %>% modelr::add_residuals(fit) %>% ggplot(aes(x = stars, y = resid)) + geom_point()
This example has some obvious issues, most notably the presence of extremely large outliers in price and a generally skewed residual distribution. There are a few things we might try to do here – including creating a formal rule for the exclusion of outliers, transforming the price variable (e.g. using a log transformation), or fitting a model that is robust to outliers. Dealing with these issues isn’t really the purpose of this class, though, so we’ll note the issues and move on; shortly we’ll look at using the bootstrap for inference in cases like this, where standard approaches to inference may fail.
(For what it’s worth, I’d probably use a combination of median regression, which is less sensitive to outliers than OLS, and maybe bootstrapping for inference. If that’s not feasible, I’d omit rentals with price over $1000 (< 0.5% of the sample) from the primary analysis and examine these separately. I usually avoid transforming the outcome, because the results model is difficult to interpret.)
We’ll comment briefly on hypothesis testing. Model summaries include results of t-tests for single coefficients, and are the standard way of assessing statistical significance.
Testing multiple coefficients is somewhat more complicated. A useful approach is to use nested models, meaning that the terms in a simple “null” model are a subset of the terms in a more complex “alternative” model. The are formal tests for comparing the null and alternative models, even when several coefficients are added in the alternative model. Tests of this kind are required to assess the significance of a categorical predictor with more than two levels, as in the example below.
fit_null = lm(price ~ stars + borough, data = nyc_airbnb) fit_alt = lm(price ~ stars + borough + room_type, data = nyc_airbnb)
The test of interest is implemented in the
function which, of course, can be summarized using
anova(fit_null, fit_alt) %>% broom::tidy()
## # A tibble: 2 × 6 ## res.df rss df sumsq statistic p.value ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 30525 1005601724. NA NA NA NA ## 2 30523 921447496. 2 84154228. 1394. 0
Note that this works for nested models only. Comparing non-nested models is a common problem that requires other methods; we’ll see one approach in cross validation.
We’ll now turn our attention to fitting models to datasets nested
within variables – meaning, essentially, that we’ll use
nest to create a list column containing datasets and fit
separate models to each. This is very different from fitting nested
models, even though the terminology is similar.
In the airbnb data, we might think that star ratings and room type affects price differently in each borough. One way to allow this kind of effect modification is through interaction terms:
nyc_airbnb %>% lm(price ~ stars * borough + room_type * borough, data = .) %>% broom::tidy() %>% knitr::kable(digits = 3)
This works, but the output takes time to think through – the expected change in price comparing an entire apartment to a private room in Queens, for example, involves the main effect of room type and the Queens / private room interaction.
Alternatively, we can nest within boroughs and fit borough-specific models associating price with rating and room type:
nest_lm_res = nyc_airbnb %>% nest(data = -borough) %>% mutate( models = map(data, ~lm(price ~ stars + room_type, data = .x)), results = map(models, broom::tidy)) %>% select(-data, -models) %>% unnest(results)
The results of this approach are given in the table below.
nest_lm_res %>% select(borough, term, estimate) %>% mutate(term = fct_inorder(term)) %>% pivot_wider( names_from = term, values_from = estimate) %>% knitr::kable(digits = 3)
|borough||(Intercept)||stars||room_typePrivate room||room_typeShared room|
The estimates here are the same as those in the model containing interactions, but are easier to extract from the output.
Fitting models to nested datasets is a way of performing stratified analyses. These have a tradeoff: stratified models make it easy to interpret covariate effects in each stratum, but don’t provide a mechanism for assessing the significance of differences across strata.
An even more extreme example is the assessment of neighborhood effects in Manhattan. The code chunk below fits neighborhood-specific models:
manhattan_airbnb = nyc_airbnb %>% filter(borough == "Manhattan") manhattan_nest_lm_res = manhattan_airbnb %>% nest(data = -neighborhood) %>% mutate( models = map(data, ~lm(price ~ stars + room_type, data = .x)), results = map(models, broom::tidy)) %>% select(-data, -models) %>% unnest(results)
And the chunk below shows neighborhood-specific estimates for the coefficients related to room type.
manhattan_nest_lm_res %>% filter(str_detect(term, "room_type")) %>% ggplot(aes(x = neighborhood, y = estimate)) + geom_point() + facet_wrap(~term) + theme(axis.text.x = element_text(angle = 80, hjust = 1))
There is, generally speaking, a reduction in room price for a private room or a shared room compared to an entire apartment, but this varies quite a bit across neighborhoods.
With this many factor levels, it really isn’t a good idea to fit
models with main effects or interactions for each. Instead, you’d be
best-off using a mixed model, with random intercepts and slopes for each
neighborhood. Although it’s well beyond the scope of this class, code to
fit a mixed model with neighborhood-level random intercepts and random
slopes for room type is below. And, of course, we can tidy the results
using a mixed-model spinoff of the
manhattan_airbnb %>% lme4::lmer(price ~ stars + room_type + (1 + room_type | neighborhood), data = .) %>% broom.mixed::tidy()
## boundary (singular) fit: see help('isSingular')
## # A tibble: 11 × 6 ## effect group term estimate std.e…¹ stati…² ## <chr> <chr> <chr> <dbl> <dbl> <dbl> ## 1 fixed <NA> (Intercept) 250. 26.6 9.41 ## 2 fixed <NA> stars -3.16 5.00 -0.631 ## 3 fixed <NA> room_typePrivate room -124. 7.80 -15.9 ## 4 fixed <NA> room_typeShared room -157. 12.9 -12.2 ## 5 ran_pars neighborhood sd__(Intercept) 59.3 NA NA ## 6 ran_pars neighborhood cor__(Intercept).room_typePri… -0.987 NA NA ## 7 ran_pars neighborhood cor__(Intercept).room_typeSha… -1.00 NA NA ## 8 ran_pars neighborhood sd__room_typePrivate room 36.7 NA NA ## 9 ran_pars neighborhood cor__room_typePrivate room.ro… 0.992 NA NA ## 10 ran_pars neighborhood sd__room_typeShared room 43.6 NA NA ## 11 ran_pars Residual sd__Observation 198. NA NA ## # … with abbreviated variable names ¹std.error, ²statistic
Mixed models are pretty great!
Linear models are appropriate for outcomes that follow a continuous distribution, but binary outcomes are common. In these cases, logistic regression is a useful analytic framework.
The Washington Post has gathered data on homicides in 50 large U.S. cities and made the data available through a GitHub repository; the final CSV is here. You can read their accompanying article here. We’ll use data on unresolved murders in Baltimore, MD to illustrate logistic regression in R. The code below imports, cleans, and generally wrangles the data for analysis.
baltimore_df = read_csv("data/homicide-data.csv") %>% filter(city == "Baltimore") %>% mutate( resolved = as.numeric(disposition == "Closed by arrest"), victim_age = as.numeric(victim_age), victim_race = fct_relevel(victim_race, "White")) %>% select(resolved, victim_age, victim_race, victim_sex)
## Rows: 52179 Columns: 12 ## ── Column specification ──────────────────────────────────────────────────────── ## Delimiter: "," ## chr (9): uid, victim_last, victim_first, victim_race, victim_age, victim_sex... ## dbl (3): reported_date, lat, lon ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Using these data, we can fit a logistic regression for the binary
“resolved” outcome and victim demographics as predictors. This uses the
glm function with the family specified to account for the
non-Gaussian outcome distribution.
fit_logistic = baltimore_df %>% glm(resolved ~ victim_age + victim_race + victim_sex, data = ., family = binomial())
Many of the same tools we used to work with
lm fits can
be used for
glm fits. The table below summaries the
coefficients from the model fit; because logistic model estimates are
log odds ratios, we include a step to compute odds ratios as well.
fit_logistic %>% broom::tidy() %>% mutate(OR = exp(estimate)) %>% select(term, log_OR = estimate, OR, p.value) %>% knitr::kable(digits = 3)
Homicides in which the victim is Black are substantially less likely to be resolved that those in which the victim is white; for other races the effects are not significant, possible due to small sample sizes. Homicides in which the victim is male are significantly less like to be resolved than those in which the victim is female. The effect of age is statistically significant, but careful data inspections should be conducted before interpreting too deeply.
We can also compute fitted values; similarly to the estimates in the model summary, these are expressed as log odds and can be transformed to produce probabilities for each subject.
baltimore_df %>% modelr::add_predictions(fit_logistic) %>% mutate(fitted_prob = boot::inv.logit(pred))
## # A tibble: 2,827 × 6 ## resolved victim_age victim_race victim_sex pred fitted_prob ## <dbl> <dbl> <fct> <chr> <dbl> <dbl> ## 1 0 17 Black Male -0.654 0.342 ## 2 0 26 Black Male -0.720 0.327 ## 3 0 21 Black Male -0.683 0.335 ## 4 1 61 White Male -0.131 0.467 ## 5 1 46 Black Male -0.864 0.296 ## 6 1 27 Black Male -0.727 0.326 ## 7 1 21 Black Male -0.683 0.335 ## 8 1 16 Black Male -0.647 0.344 ## 9 1 21 Black Male -0.683 0.335 ## 10 1 44 Black Female 0.0297 0.507 ## # … with 2,817 more rows
modelrpackage also has a website
The code that I produced working examples in lecture is here.