Context

This assignment reinforces ideas in Iteration.

Due date and submission

Due: November 11 at 4:00pm.

Please submit (via courseworks) the web address of the GitHub repo containing your work for this assignment; git commits after the due date will cause the assignment to be considered late.

R Markdown documents included as part of your solutions must not install packages, and should only load the packages necessary for your submission to knit.

Points

Problem Points
Problem 0 20
Problem 1 20
Problem 2 25
Problem 3 35
Optional survey No points

Problem 0

This “problem” focuses on structure of your submission, especially the use git and GitHub for reproducibility, R Projects to organize your work, R Markdown to write reproducible reports, relative paths to load data from local files, and reasonable naming structures for your files. To that end:

  • create a public GitHub repo + local R Project; we suggest naming this repo / directory p8105_hw5_YOURUNI (e.g. p8105_hw5_ajg2202 for Jeff), but that’s not required
  • create a single .Rmd file named p8105_hw5_YOURUNI.Rmd that renders to github_document
  • create a subdirectory to store the local data files used in the assignment, and use relative paths to access these data files
  • submit a link to your repo via Courseworks

Your solutions to Problems 1, 2, and 3 should be implemented in your .Rmd file, and your git commit history should reflect the process you used to solve these Problems.

For this Problem, we will assess adherence to the instructions above regarding repo structure, git commit history, and whether we are able to knit your .Rmd to ensure that your work is reproducible. Adherence to appropriate styling and clarity of code will be assessed in Problems 1+ using the style rubric.

This homework includes figures; the readability of your embedded plots (e.g. font sizes, axis labels, titles) will be assessed in Problems 1+.

Problem 1

The code chunk below loads the iris dataset from the tidyverse package and introduces some missing values in each column. The purpose of this problem is to fill in those missing values.

library(tidyverse)

set.seed(10)

iris_with_missing = iris %>% 
  map_df(~replace(.x, sample(1:150, 20), NA)) %>%
  mutate(Species = as.character(Species))

There are two cases to address:

  • For numeric variables, you should fill in missing values with the mean of non-missing values
  • For character variables, you should fill in missing values with "virginica"

Write a function that takes a vector as an argument; replaces missing values using the rules defined above; and returns the resulting vector. Apply this function to the columns of iris_with_missing using a map statement.

Problem 2

This zip file contains data from a longitudinal study that included a control arm and an experimental arm. Data for each participant is included in a separate file, and file names include the subject ID and arm.

Create a tidy dataframe containing data from all participants, including the subject ID, arm, and observations over time:

  • Start with a dataframe containing all file names; the list.files function will help
  • Iterate over file names and read in data for each subject using purrr::map and saving the result as a new variable in the dataframe
  • Tidy the result; manipulate file names to include control arm and subject ID, make sure weekly observations are “tidy”, and do any other tidying that’s necessary

Make a spaghetti plot showing observations on each subject over time, and comment on differences between groups.

Problem 3

When designing an experiment or analysis, a common question is whether it is likely that a true effect will be detected – put differently, whether a false null hypothesis will be rejected. The probability that a false null hypothesis is rejected is referred to as power, and it depends on several factors, including: the sample size; the effect size; and the error variance. In this problem, you will conduct a simulation to explore power in a simple linear regression.

First set the following design elements:

  • Fix \(n=30\)
  • Fix \(x_{i1}\) as draws from a standard Normal distribution
  • Fix \(\beta_0 = 2\)
  • Fix \(\sigma^2 = 50\)

Set \(\beta_1=0\). Generate 10000 datasets from the model \[ y_i = \beta_0 + \beta_1 x_{i1} + \epsilon_{i} \] with \(\epsilon_{i} ∼ N[0,\sigma^2]\). For each dataset, save \(\hat{\beta}_1\) and the p-value arising from a test of \(H : \beta_1 = 0\) using \(\alpha = 0.05\). Hint: to obtain the estimate and p-value, use broom::tidy to clean the output of lm.

Repeat the above for \(\beta_1 = \{1, 2, 3, 4, 5, 6\}\), and complete the following:

  • Make a plot showing the proportion of times the null was rejected (the power of the test) on the y axis and the true value of \(\beta_1\) on the x axis. Describe the association between effect size and power.
  • Make a plot showing the average estimate of \(\hat{\beta}_1\) on the y axis and the true value of \(\beta_1\) on the x axis. Make a second plot (or overlay on the first) the average estimate of \(\hat{\beta}_1\) only in samples for which the null was rejected on the y axis and the true value of \(\beta_1\) on the x axis. Is the sample average of \(\hat{\beta}_1\) across tests for which the null is rejected approximately equal to the true value of \(\beta_1\)? Why or why not?

Extra topics survey

Please complete this survey regarding extra topics to cover at the end of the semester.

Optional post-assignment survey

If you’d like, a you can complete this short survey after you’ve finished the assignment.