This assignment reinforces ideas in Visualization and EDA.
Due: October 15 at 11:59pm.
Please submit (via courseworks) the web address of the GitHub repo containing your work for this assignment; git commits after the due date will cause the assignment to be considered late.
R Markdown documents included as part of your solutions must not install packages, and should only load the packages necessary for your submission to knit.
Problem | Points |
---|---|
Problem 0 | 20 |
Problem 1 | – |
Problem 2 | 40 |
Problem 3 | 40 |
Optional survey | No points |
This “problem” focuses on the structure of your submission, especially the use of git and GitHub for reproducibility, R Projects to organize your work, R Markdown to write reproducible reports, relative paths to load data from local files, and reasonable naming structures for your files.
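To that end:

- create a GitHub repo and local R Project; we suggest naming this repo / directory p8105_hw2_YOURUNI (e.g. p8105_hw2_ajg2202 for Jeff), but that’s not required
- create an .Rmd file named p8105_hw2_YOURUNI.Rmd that renders to github_document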
Your solutions to Problems 1, 2, and 3 should be implemented in your .Rmd file, and your git commit history should reflect the process you used to solve these Problems.
For this Problem, we will assess adherence to the instructions above regarding repo structure, git commit history, and whether we are able to knit your .Rmd to ensure that your work is reproducible. Adherence to appropriate styling and clarity of code will be assessed in Problems 1+ using the style rubric.
This homework includes figures; the readability of your embedded plots (e.g. font sizes, axis labels, titles) will be assessed in Problems 1+.
This problem uses the Instacart data. DO NOT include this dataset in your local data directory; instead, load the data from the p8105.datasets package using:
library(p8105.datasets)
data("instacart")
The goal is to do some exploration of this dataset. To that end, write a short description of the dataset, noting the size and structure of the data, describing some key variables, and giving illustrative examples of observations. Then, do or answer the following (commenting on the results of each):
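For the dataset description, a small helper along the following lines can report size, column types, and a few example rows. This is only a sketch: `describe_data` is a hypothetical helper, and the toy data frame stands in for `instacart`, with made-up values and illustrative column names.

```r
# Hypothetical helper: summarize the size and structure of any data frame.
describe_data <- function(df, n_examples = 3) {
  list(
    n_rows = nrow(df),
    n_cols = ncol(df),
    col_types = vapply(df, function(col) class(col)[1], character(1)),
    examples = head(df, n_examples)
  )
}

# Toy stand-in for the real data (values and column names are made up).
toy_orders <- data.frame(
  order_id = c(1L, 1L, 2L, 3L),
  product_name = c("Bananas", "Whole Milk", "Bananas", "Spinach"),
  aisle = c("fresh fruits", "milk", "fresh fruits", "packaged vegetables"),
  stringsAsFactors = FALSE
)

toy_summary <- describe_data(toy_orders)
toy_summary$n_rows     # number of observations
toy_summary$col_types  # class of each column
toy_summary$examples   # first few rows as illustrative observations
```

With the real data, the same call would be `describe_data(instacart)` after running `data("instacart")`.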
Accelerometers have become an appealing alternative to self-report techniques for studying physical activity in observational studies and clinical trials, largely because of their relative objectivity. During observation periods, the devices measure “activity counts” in a short period; one-minute intervals are common. Because accelerometers can be worn comfortably and unobtrusively, they produce around-the-clock observations.
This problem uses five weeks of accelerometer data collected on a 63-year-old male with BMI 25, who was admitted to the Advanced Cardiac Care Center of Columbia University Medical Center and diagnosed with congestive heart failure (CHF). The data can be downloaded here. In this spreadsheet, variables activity.* are the activity counts for each minute of a 24-hour day starting at midnight.
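Because each minute is stored in its own `activity.*` column, it is often easier to work with one row per day and minute. The sketch below shows one possible way to do that in base R on a toy data frame. The identifier columns `week`, `day_id`, and `day`, and all values, are assumptions made for illustration; they are not taken from the real spreadsheet.

```r
# Toy stand-in for the spreadsheet: two days, three minutes each.
# The real data would have one activity.* column per minute of the day;
# `week`, `day_id`, and `day` are assumed identifier columns.
wide <- data.frame(
  week = c(1, 1),
  day_id = c(1, 2),
  day = c("Friday", "Monday"),
  activity.1 = c(88, 1),
  activity.2 = c(11, 1),
  activity.3 = c(0, 5)
)

# Find the activity.* columns and recover the minute index from each name.
activity_cols <- grep("^activity\\.", names(wide), value = TRUE)
minute_index <- as.integer(sub("^activity\\.", "", activity_cols))

# Stack the columns: unlist() concatenates them column by column, so the
# identifiers repeat once per column and each minute repeats once per row.
long <- data.frame(
  day_id = rep(wide$day_id, times = length(activity_cols)),
  day = rep(wide$day, times = length(activity_cols)),
  minute = rep(minute_index, each = nrow(wide)),
  activity_count = unlist(wide[activity_cols], use.names = FALSE)
)

# Order by day, then by minute since midnight.
long <- long[order(long$day_id, long$minute), ]
rownames(long) <- NULL
long
```

In a tidyverse workflow, `tidyr::pivot_longer()` performs the same reshaping more concisely; the base R version is used here only to keep the sketch free of package dependencies.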
This problem uses the NY NOAA data. DO NOT include this dataset in your local data directory; instead, load the data from the p8105.datasets package using:
library(p8105.datasets)
data("ny_noaa")
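Once the data are loaded, one quick way to gauge how much of each variable is missing is the per-column proportion of `NA` values. The sketch below uses a hypothetical helper, `missing_share`, on a toy data frame with made-up values; only the column names `tmax`, `tmin`, and `snow` are borrowed from this problem.

```r
# Hypothetical helper: proportion of missing values in each column.
missing_share <- function(df) {
  colMeans(is.na(df))
}

# Toy stand-in for the weather data (values are made up).
toy_weather <- data.frame(
  tmax = c(12, NA, 30, NA),
  tmin = c(5, 1, NA, NA),
  snow = c(0, 25, 0, 0)
)

toy_missing <- missing_share(toy_weather)
toy_missing
```

With the real data, the same call would be `missing_share(ny_noaa)`.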
The goal is to do some exploration of this dataset. To that end, write a short description of the dataset, noting the size and structure of the data, describing some key variables, and indicating the extent to which missing data is an issue. Then, do or answer the following (commenting on the results of each):
(i) tmax vs tmin for the full dataset (note that a scatterplot may not be the best option); and (ii) make a plot showing the distribution of snowfall values greater than 0 and less than 100 separately by year.

If you’d like, you can complete this short survey after you’ve finished the assignment.