This assignment reinforces ideas in Linear Models.
Due: December 2 at 11:59pm.
Please submit (via courseworks) the web address of the GitHub repo containing your work for this assignment; git commits after the due date will cause the assignment to be considered late.
R Markdown documents included as part of your solutions must not install packages, and should only load the packages necessary for your submission to knit.
Problem | Points |
---|---|
Problem 0 | 20 |
Problem 1 | – |
Problem 2 | 40 |
Problem 3 | 40 |
Optional survey | No points |
This “problem” focuses on the structure of your submission, especially the use of git and GitHub for reproducibility, R Projects to organize your work, R Markdown to write reproducible reports, relative paths to load data from local files, and reasonable naming structures for your files. To that end:
- Name your repo / directory `p8105_hw6_YOURUNI` (e.g. `p8105_hw6_ajg2202` for Jeff); this naming is suggested but not required.
- Create an .Rmd file named `p8105_hw6_YOURUNI.Rmd` that renders to `github_document`.
Your solutions to Problems 1, 2, and 3 should be implemented in your .Rmd file, and your git commit history should reflect the process you used to solve these Problems.
For this Problem, we will assess adherence to the instructions above regarding repo structure, git commit history, and whether we are able to knit your .Rmd to ensure that your work is reproducible. Adherence to appropriate styling and clarity of code will be assessed in Problems 1+ using the style rubric.
This homework includes figures; the readability of your embedded plots (e.g. font sizes, axis labels, titles) will be assessed in Problems 1+.
For this problem, we’ll use the 2017 Central Park weather data that we’ve seen elsewhere. The code chunk below (adapted from the course website) will download these data.
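    library(tidyverse)  # provides %>%, mutate(), recode(), and select()

    # Download 2017 daily precipitation and temperature for the Central Park station
    weather_df =
      rnoaa::meteo_pull_monitors(
        c("USW00094728"),
        var = c("PRCP", "TMIN", "TMAX"),
        date_min = "2017-01-01",
        date_max = "2017-12-31") %>%
      mutate(
        name = recode(id, USW00094728 = "CentralPark_NY"),
        tmin = tmin / 10,  # convert tenths of degrees C to degrees C
        tmax = tmax / 10) %>%
      select(name, id, everything())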
The bootstrap is helpful when you’d like to perform inference for a parameter / value / summary that doesn’t have an easy-to-write-down distribution in the usual repeated sampling framework. We’ll focus on a simple linear regression with `tmax` as the response and `tmin` as the predictor, and are interested in the distribution of two quantities estimated from these data: \(\hat{r}^2\) and \(\log(\hat{\beta}_0 * \hat{\beta}_1)\).
Use 5000 bootstrap samples and, for each bootstrap sample, produce estimates of these two quantities. Plot the distribution of your estimates, and describe these in words. Using the 5000 bootstrap estimates, identify the 2.5% and 97.5% quantiles to provide a 95% confidence interval for \(\hat{r}^2\) and \(\log(\hat{\beta}_0 * \hat{\beta}_1)\). Note: `broom::glance()` is helpful for extracting \(\hat{r}^2\) from a fitted regression, and `broom::tidy()` (with some additional wrangling) should help in computing \(\log(\hat{\beta}_0 * \hat{\beta}_1)\).
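As a rough illustration of the bootstrap workflow described above, the sketch below resamples a simple linear regression fit to simulated data rather than the Central Park data. The data frame `sim_df`, its columns `x` and `y`, and the use of 200 rather than 5000 resamples are assumptions chosen to keep the example small and fast.

```r
library(tidyverse)
library(modelr)

set.seed(1)

# Simulated stand-in for the weather data (hypothetical columns x and y)
sim_df = tibble(
  x = rnorm(100, mean = 10, sd = 3),
  y = 7 + 1.04 * x + rnorm(100)
)

# Draw bootstrap samples, refit the regression on each, and extract
# r^2 and log(beta0_hat * beta1_hat) from every fit
boot_results =
  sim_df %>%
  modelr::bootstrap(n = 200) %>%
  mutate(
    models = map(strap, ~ lm(y ~ x, data = .x)),
    glanced = map(models, broom::glance),
    tidied = map(models, broom::tidy),
    r_squared = map_dbl(glanced, "r.squared"),
    # the two rows of tidy() output are the intercept and the slope
    log_b0_b1 = map_dbl(tidied, ~ log(prod(.x$estimate)))
  ) %>%
  select(.id, r_squared, log_b0_b1)

# Percentile-based 95% intervals from the bootstrap distributions
boot_ci =
  boot_results %>%
  summarize(
    r2_lower = quantile(r_squared, 0.025),
    r2_upper = quantile(r_squared, 0.975),
    log_lower = quantile(log_b0_b1, 0.025),
    log_upper = quantile(log_b0_b1, 0.975)
  )
```

The logarithm is only defined when the product of the two estimates is positive, which holds in this simulation by construction; with real data that is worth checking before summarizing.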
The Washington Post has gathered data on homicides in 50 large U.S. cities and made the data available through a GitHub repository here. You can read their accompanying article here.
Create a `city_state` variable (e.g. “Baltimore, MD”), and a binary variable indicating whether the homicide is solved. Omit cities Dallas, TX; Phoenix, AZ; and Kansas City, MO – these don’t report victim race. Also omit Tulsa, AL – this is a data entry mistake. For this problem, limit your analysis to those for whom `victim_race` is `white` or `black`. Be sure that `victim_age` is numeric.
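The cleaning steps above might look roughly like the following on a tiny hand-made table. Every column name other than `city_state`, `victim_race`, and `victim_age` (in particular `city`, `state`, and `disposition`), the value "Closed by arrest", and the capitalization of the race labels are hypothetical; check them against the actual file before reusing anything here.

```r
library(tidyverse)

# Toy records; column names and values are hypothetical stand-ins
toy_df = tibble(
  city = c("Baltimore", "Baltimore", "Tulsa", "Dallas", "Chicago"),
  state = c("MD", "MD", "AL", "TX", "IL"),
  disposition = c("Closed by arrest", "Open/No arrest", "Closed by arrest",
                  "Open/No arrest", "Closed by arrest"),
  victim_race = c("Black", "White", "White", "Black", "Hispanic"),
  victim_age = c("34", "Unknown", "20", "41", "27")
)

clean_df =
  toy_df %>%
  mutate(
    city_state = str_c(city, ", ", state),
    # 1 if the case is treated as solved, 0 otherwise
    resolved = as.numeric(disposition == "Closed by arrest"),
    # non-numeric strings such as "Unknown" become NA
    victim_age = suppressWarnings(as.numeric(victim_age))
  ) %>%
  filter(
    !(city_state %in% c("Dallas, TX", "Phoenix, AZ", "Kansas City, MO", "Tulsa, AL")),
    victim_race %in% c("White", "Black")
  )
```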
For the city of Baltimore, MD, use the `glm` function to fit a logistic regression with resolved vs unresolved as the outcome and victim age, sex and race as predictors. Save the output of `glm` as an R object; apply `broom::tidy` to this object; and obtain the estimate and confidence interval of the adjusted odds ratio for solving homicides comparing male victims to female victims keeping all other variables fixed.
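One way to go from a fitted logistic regression to an odds ratio with a confidence interval is sketched below on simulated data. The data frame `sim_homicide_df`, its column names, and the simulated effect sizes are assumptions, and the interval shown is a Wald interval (estimate ± 1.96 standard errors on the log-odds scale, then exponentiated), which is one reasonable choice rather than the required one.

```r
library(tidyverse)

set.seed(2)

# Simulated stand-in data; all column names are hypothetical
sim_homicide_df = tibble(
  victim_age = round(runif(500, 15, 80)),
  victim_sex = sample(c("Female", "Male"), 500, replace = TRUE),
  victim_race = sample(c("Black", "White"), 500, replace = TRUE),
  resolved = rbinom(500, 1, prob = if_else(victim_sex == "Male", 0.4, 0.6))
)

# Logistic regression saved as an R object
fit_logistic =
  glm(resolved ~ victim_age + victim_sex + victim_race,
      data = sim_homicide_df, family = binomial())

# Exponentiate the log odds ratio and its Wald limits
or_male_vs_female =
  fit_logistic %>%
  broom::tidy() %>%
  mutate(
    OR = exp(estimate),
    CI_lower = exp(estimate - 1.96 * std.error),
    CI_upper = exp(estimate + 1.96 * std.error)
  ) %>%
  filter(term == "victim_sexMale") %>%
  select(term, OR, CI_lower, CI_upper)
```

Because `victim_sex` is a character column, R treats "Female" as the reference level, so the `victim_sexMale` term compares male to female victims with age and race held fixed.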
Now run `glm` for each of the cities in your dataset, and extract the adjusted odds ratio (and CI) for solving homicides comparing male victims to female victims. Do this within a “tidy” pipeline, making use of `purrr::map`, list columns, and `unnest` as necessary to create a dataframe with estimated ORs and CIs for each city.
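The list-column pattern mentioned above can be sketched as follows, again on simulated data. The three city labels, the column names, and the reduced set of predictors are hypothetical, and the Wald interval is one possible choice of CI.

```r
library(tidyverse)

set.seed(3)

# Simulated data for three hypothetical cities
sim_city_df = tibble(
  city_state = rep(c("City A, AA", "City B, BB", "City C, CC"), each = 300),
  victim_age = round(runif(900, 15, 80)),
  victim_sex = sample(c("Female", "Male"), 900, replace = TRUE),
  resolved = rbinom(900, 1, prob = 0.5)
)

city_or_df =
  sim_city_df %>%
  nest(data = -city_state) %>%               # one row per city, data in a list column
  mutate(
    models = map(data, ~ glm(resolved ~ victim_age + victim_sex,
                             data = .x, family = binomial())),
    results = map(models, broom::tidy)
  ) %>%
  select(city_state, results) %>%
  unnest(results) %>%                        # back to one row per model term
  filter(term == "victim_sexMale") %>%
  mutate(
    OR = exp(estimate),
    CI_lower = exp(estimate - 1.96 * std.error),
    CI_upper = exp(estimate + 1.96 * std.error)
  ) %>%
  select(city_state, OR, CI_lower, CI_upper)
```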
Create a plot that shows the estimated ORs and CIs for each city. Organize cities according to estimated OR, and comment on the plot.
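A plot of this kind could be built along the following lines. The estimates in `or_df` are typed in by hand purely for illustration and do not come from any real data; the column names are likewise assumptions.

```r
library(tidyverse)

# Hypothetical estimates, entered by hand for illustration only
or_df = tibble(
  city_state = c("City A, AA", "City B, BB", "City C, CC"),
  OR = c(0.8, 0.4, 1.1),
  CI_lower = c(0.6, 0.3, 0.7),
  CI_upper = c(1.1, 0.6, 1.7)
)

# Reorder the city factor so the x-axis runs from smallest to largest OR
ordered_df =
  or_df %>%
  mutate(city_state = fct_reorder(city_state, OR))

or_plot =
  ordered_df %>%
  ggplot(aes(x = city_state, y = OR)) +
  geom_point() +
  geom_errorbar(aes(ymin = CI_lower, ymax = CI_upper), width = 0.2) +
  geom_hline(yintercept = 1, linetype = "dashed") +  # OR = 1 means no difference
  labs(x = "City", y = "Adjusted OR (male vs female)") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```

Rotating the axis labels helps readability once there are dozens of cities on the x-axis.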
In this problem, you will analyze data gathered to understand the effects of several variables on a child’s birthweight. This dataset, available here, consists of roughly 4000 children and includes the following variables:
- `babysex`: baby’s sex (male = 1, female = 2)
- `bhead`: baby’s head circumference at birth (centimeters)
- `blength`: baby’s length at birth (centimeters)
- `bwt`: baby’s birth weight (grams)
- `delwt`: mother’s weight at delivery (pounds)
- `fincome`: family monthly income (in hundreds, rounded)
- `frace`: father’s race (1 = White, 2 = Black, 3 = Asian, 4 = Puerto Rican, 8 = Other, 9 = Unknown)
- `gaweeks`: gestational age in weeks
- `malform`: presence of malformations that could affect weight (0 = absent, 1 = present)
- `menarche`: mother’s age at menarche (years)
- `mheigth`: mother’s height (inches)
- `momage`: mother’s age at delivery (years)
- `mrace`: mother’s race (1 = White, 2 = Black, 3 = Asian, 4 = Puerto Rican, 8 = Other)
- `parity`: number of live births prior to this pregnancy
- `pnumlbw`: previous number of low birth weight babies
- `pnumgsa`: number of prior small for gestational age babies
- `ppbmi`: mother’s pre-pregnancy BMI
- `ppwt`: mother’s pre-pregnancy weight (pounds)
- `smoken`: average number of cigarettes smoked per day during pregnancy
- `wtgain`: mother’s weight gain during pregnancy (pounds)

Load and clean the data for regression analysis (i.e. convert numeric to factor where appropriate, check for missing data, etc.).
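Converting coded numeric columns to factors and counting missing values might look like this on a tiny hand-made table. The data frame `toy_birth_df` and its values are made up; the variable names and codes follow the list above.

```r
library(tidyverse)

# Toy rows using three of the variables listed above (values are made up)
toy_birth_df = tibble(
  babysex = c(1, 2, 2, 1),
  malform = c(0, 0, 1, 0),
  bwt = c(3100, 2900, NA, 3500)
)

# Recode numeric codes as labeled factors
clean_birth_df =
  toy_birth_df %>%
  mutate(
    babysex = factor(babysex, levels = c(1, 2), labels = c("male", "female")),
    malform = factor(malform, levels = c(0, 1), labels = c("absent", "present"))
  )

# Count missing values in every column
missing_counts =
  clean_birth_df %>%
  summarize(across(everything(), ~ sum(is.na(.x))))
```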
Propose a regression model for birthweight. This model may be based on a hypothesized structure for the factors that underlie birthweight, on a data-driven model-building process, or a combination of the two. Describe your modeling process and show a plot of model residuals against fitted values – use `add_predictions` and `add_residuals` in making this plot.
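The mechanics of a residuals-versus-fitted plot with `add_predictions` and `add_residuals` can be sketched on simulated data as below. The simulated coefficients and the choice of two predictors are arbitrary assumptions for illustration and are not a suggested model.

```r
library(tidyverse)
library(modelr)

set.seed(4)

# Simulated stand-in data; coefficients are arbitrary
sim_birth_df = tibble(
  gaweeks = rnorm(200, 39, 2),
  blength = rnorm(200, 50, 2.5),
  bwt = -4000 + 30 * gaweeks + 120 * blength + rnorm(200, sd = 300)
)

fit = lm(bwt ~ gaweeks + blength, data = sim_birth_df)

# add_predictions() appends a `pred` column; add_residuals() appends `resid`
resid_df =
  sim_birth_df %>%
  add_predictions(fit) %>%
  add_residuals(fit)

resid_plot =
  resid_df %>%
  ggplot(aes(x = pred, y = resid)) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted value (grams)", y = "Residual (grams)")
```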
Compare your model to two others:
Make this comparison in terms of the cross-validated prediction error; use `crossv_mc` and functions in `purrr` as appropriate.
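A minimal sketch of comparing models by cross-validated prediction error is shown below on simulated data. The two candidate models, the variable names, the 50 train/test splits, and the use of RMSE as the error measure are all assumptions made for illustration.

```r
library(tidyverse)
library(modelr)

set.seed(5)

# Simulated data in which both predictors matter
sim_df = tibble(
  x1 = rnorm(300),
  x2 = rnorm(300),
  y = 2 + 3 * x1 + 1.5 * x2 + rnorm(300)
)

cv_df =
  crossv_mc(sim_df, n = 50) %>%     # 50 random train/test splits
  mutate(
    train = map(train, as_tibble),
    test = map(test, as_tibble)
  ) %>%
  mutate(
    small_mod = map(train, ~ lm(y ~ x1, data = .x)),
    large_mod = map(train, ~ lm(y ~ x1 + x2, data = .x)),
    # evaluate each fitted model on the held-out split
    rmse_small = map2_dbl(small_mod, test, ~ rmse(model = .x, data = .y)),
    rmse_large = map2_dbl(large_mod, test, ~ rmse(model = .x, data = .y))
  )

# Average held-out RMSE per model
cv_summary =
  cv_df %>%
  select(starts_with("rmse")) %>%
  pivot_longer(everything(), names_to = "model", values_to = "rmse",
               names_prefix = "rmse_") %>%
  group_by(model) %>%
  summarize(mean_rmse = mean(rmse))
```

Plotting the full distribution of RMSE values across splits (for example with violin plots) usually says more than the mean alone.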
Note that although we expect your model to be reasonable, model building itself is not a main idea of the course and we don’t necessarily expect your model to be “optimal”.
If you’d like, you can complete this short survey after you’ve finished the assignment.