This assignment reinforces ideas in Data Wrangling I.
Due: Oct 1 at 11:59pm.
Please submit (via courseworks) the web address of the GitHub repo containing your work for this assignment; git commits after the due date will cause the assignment to be considered late.
R Markdown documents included as part of your solutions must not install packages, and should only load the packages necessary for your submission to knit.
Problem | Points |
---|---|
Problem 0 | 20 |
Problem 1 | – |
Problem 2 | 40 |
Problem 3 | 40 |
Optional survey | No points |
This “problem” focuses on the structure of your submission, especially the use of git and GitHub for reproducibility, R Projects to organize your work, R Markdown to write reproducible reports, relative paths to load data from local files, and reasonable naming structures for your files.
To that end:
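- create a GitHub repo and a local R Project for this assignment; a suggested name for the repo and directory is p8105_hw2_YOURUNI (e.g. p8105_hw2_ajg2202 for Jeff), but that’s not required
- create a single .Rmd file named p8105_hw2_YOURUNI.Rmd that renders to github_document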
Your solutions to Problems 1, 2, and 3 should be implemented in your .Rmd file, and your git commit history should reflect the process you used to solve these Problems.
For this Problem, we will assess adherence to the instructions above regarding repo structure, git commit history, and whether we are able to knit your .Rmd to ensure that your work is reproducible. Adherence to appropriate styling and clarity of code will be assessed in Problems 1+.
This problem uses the FiveThirtyEight data; these data were gathered to create the interactive graphic on this page. In particular, we’ll use the data in pols-month.csv, unemployment.csv, and snp.csv. Our goal is to merge these into a single data frame using year and month as keys across datasets.
First, clean the data in pols-month.csv. Use separate() to break up the variable mon into integer variables year, month, and day; replace month number with month name; create a president variable taking values gop and dem, and remove prez_dem and prez_gop; and remove the day variable.
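As a rough illustration of these steps, here is a minimal sketch on a toy data frame. The rows in pols_toy are made up and only mimic the columns named above, and the assumption that prez_dem == 1 marks a Democratic president should be checked against the values in the real file.

```r
library(tidyverse)

# Toy stand-in for pols-month.csv; the values are made up for illustration
pols_toy <- tibble(
  mon      = c("1947-01-15", "1947-02-15", "1953-03-15"),
  prez_gop = c(0, 0, 1),
  prez_dem = c(1, 1, 0)
)

pols_clean <- pols_toy |>
  # split "YYYY-MM-DD" into three columns; convert = TRUE yields integers
  separate(mon, into = c("year", "month", "day"), sep = "-", convert = TRUE) |>
  mutate(
    # month.name is a built-in vector: 1 -> "January", 2 -> "February", ...
    month = month.name[month],
    # assumption: prez_dem == 1 means a Democratic president, otherwise GOP
    president = if_else(prez_dem == 1, "dem", "gop")
  ) |>
  select(-prez_gop, -prez_dem, -day)

pols_clean
```

Indexing month.name keeps the month values as plain character strings, which is convenient when the same key has to match across several datasets.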
Second, clean the data in snp.csv using a similar process to the above. For consistency across datasets, arrange according to year and month, and organize so that year and month are the leading columns.
Third, tidy the unemployment data so that it can be merged with the previous datasets. This process will involve switching from “wide” to “long” format; ensuring that key variables have the same name; and ensuring that key variables take the same values.
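The wide-to-long step might look like the sketch below. The column names in unemp_toy (Year, Jan, Feb) and the values are assumptions for illustration, not a description of the real file.

```r
library(tidyverse)

# Toy wide table: one row per year, one column per month (made-up values)
unemp_toy <- tibble(
  Year = c(1948L, 1949L),
  Jan  = c(3.4, 4.3),
  Feb  = c(3.8, 4.7)
)

unemp_long <- unemp_toy |>
  # one row per year-month combination
  pivot_longer(Jan:Feb, names_to = "month", values_to = "unemployment") |>
  # key variables should have the same names as in the other datasets
  rename(year = Year) |>
  # key variables should take the same values: "Jan" -> "January", etc.
  mutate(month = month.name[match(month, month.abb)])

unemp_long
```

Here match() looks up each abbreviation in the built-in month.abb vector, and the resulting position is used to index month.name.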
Join the datasets by merging snp into pols, and merging unemployment into the result.
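One way to read "merging into" is as a left join that keeps every row of pols; that choice of join is an assumption here, as are the toy values below.

```r
library(tidyverse)

# Toy versions of the three cleaned datasets (made-up values)
pols_toy <- tibble(
  year = c(1950L, 1950L),
  month = c("January", "February"),
  president = c("dem", "dem")
)
snp_toy <- tibble(year = 1950L, month = "January", close = 17.05)
unemp_toy <- tibble(
  year = c(1950L, 1950L),
  month = c("January", "February"),
  unemployment = c(6.5, 6.4)
)

merged <- pols_toy |>
  left_join(snp_toy, by = c("year", "month")) |>   # snp into pols
  left_join(unemp_toy, by = c("year", "month"))    # unemployment into the result

merged
```

Rows of pols without a match in snp keep an NA in the joined columns, which is worth mentioning when describing the resulting dataset.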
Write a short paragraph about these datasets. Explain briefly what each dataset contained, and describe the resulting dataset (e.g. give the dimension, range of years, and names of key variables).
Note: we could have used a date variable as a key instead of creating year and month keys; doing so would help with some kinds of plotting, and be a more accurate representation of the data. Date formats are tricky, though. For more information check out the lubridate package in the tidyverse.
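For the curious, a date-based key could be sketched as follows. This is not required for the assignment; the toy values and the column name month_key are hypothetical.

```r
library(tidyverse)
library(lubridate)

# Toy dates in "YYYY-MM-DD" form (made-up values)
toy <- tibble(mon = c("1950-01-15", "1950-02-15"))

toy_dated <- toy |>
  mutate(
    date = ymd(mon),                               # parse text into a Date
    month_key = floor_date(date, unit = "month")   # first day of that month
  )

toy_dated
```

Rounding every date down to the first of its month gives a single key that sorts chronologically, unlike month names, which sort alphabetically.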
This problem uses the Mr. Trash Wheel dataset, available as an Excel file on the course website.
Read and clean the Mr. Trash Wheel sheet:
- import the sheet using read_excel
- convert variables to integers where appropriate (using as.integer)
Use a similar process to import, clean, and organize the data for Professor Trash Wheel and Gwynnda, and combine this with the Mr. Trash Wheel dataset to produce a single tidy dataset. To keep track of which Trash Wheel is which, you may need to add an additional variable to both datasets before combining.
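The combining step can be sketched with toy data frames standing in for sheets that would be imported with read_excel (which accepts sheet and range arguments to select part of a workbook). The column names, including the hypothetical item_count, and all values are assumptions for illustration.

```r
library(tidyverse)

# Toy stand-ins for two imported sheets (made-up values)
mr_toy <- tibble(
  dumpster = 1:2,
  weight_tons = c(4.31, 2.74),
  item_count = c(7.2, 4.6)   # hypothetical count stored with decimals
) |>
  mutate(
    item_count = as.integer(round(item_count)),  # round, then store as integer
    trash_wheel = "Mr. Trash Wheel"              # identifies the source sheet
  )

prof_toy <- tibble(dumpster = 1:2, weight_tons = c(1.79, 1.58)) |>
  mutate(trash_wheel = "Professor Trash Wheel")

# bind_rows() stacks by column name and fills missing columns with NA
wheels <- bind_rows(mr_toy, prof_toy) |>
  relocate(trash_wheel)

wheels
```

Adding the identifying variable before stacking is what makes it possible to filter or summarize by Trash Wheel afterwards.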
Write a paragraph about these data; you are encouraged to use inline R. Be sure to note the number of observations in the resulting dataset, and give examples of key variables. For available data, what was the total weight of trash collected by Professor Trash Wheel? What was the total number of cigarette butts collected by Gwynnda in June of 2022?
Home and rental prices have generally increased over the last decade. Zillow, a popular website used to search for homes for sale or rent, is uniquely positioned to provide insights into trends in the real estate market. In response to broad interest, the company releases data for research. In this project, we’ll look at the Zillow Observed Rent Index (ZORI) in New York City between January 2015 and August 2024.
NYC is divided into five boroughs. Each of these boroughs is its own county, and in some cases the borough name and county name differ; for example, Manhattan is New York County. Moreover, boroughs are divided into neighborhoods. Rental price data provided by Zillow do not include information about neighborhoods within boroughs, but this information can be accessed separately.
Both datasets are available here.
Create a single, well-organized dataset with all the information contained in these data files. To that end: import, clean, tidy, and otherwise wrangle each of these datasets; check for completeness and correctness across datasets (e.g. by viewing individual datasets and monitoring warning messages); merge to create a single, final dataset; and organize this so that variables and observations are in meaningful orders.
Briefly describe the resulting tidy dataset. How many total observations exist? How many unique ZIP codes are included, and how many unique neighborhoods?
Which ZIP codes appear in the ZIP code dataset but not in the Zillow Rental Price dataset? Using a few illustrative examples, discuss why these ZIP codes might be excluded from the Zillow dataset.
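A filtering join is one way to find keys present in one dataset but not another. The sketch below uses toy data; the column names zip_code, neighborhood, and rent, and all values, are hypothetical.

```r
library(tidyverse)

# Toy stand-ins for the two datasets (made-up values)
zip_toy <- tibble(
  zip_code = c(10001L, 10002L, 10003L),
  neighborhood = c("Area A", "Area B", "Area C")
)
zori_toy <- tibble(
  zip_code = c(10001L, 10003L),
  rent = c(3500, 3200)
)

# anti_join() keeps rows of zip_toy with no match in zori_toy
missing_zips <- anti_join(zip_toy, zori_toy, by = "zip_code")

missing_zips
```

Swapping the order of the two arguments answers the reverse question, which is a useful check for completeness across datasets.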
Rental prices fluctuated dramatically during the COVID-19 pandemic. For all available ZIP codes, compare rental prices in January 2021 to prices in January 2020. Make a table that shows the 10 ZIP codes (along with the borough and neighborhood) with the largest drop in price from January 2020 to January 2021. Comment.
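One possible shape for this comparison is sketched below on toy data in long format. The column names zip_code, date, and rent and all values are hypothetical, and the borough and neighborhood columns are left out for brevity.

```r
library(tidyverse)

# Toy long-format rental data (made-up values)
rent_toy <- tibble(
  zip_code = rep(c("A", "B", "C"), each = 2),
  date = rep(as.Date(c("2020-01-31", "2021-01-31")), times = 3),
  rent = c(2000, 1800, 3000, 2400, 1500, 1550)
)

largest_drops <- rent_toy |>
  mutate(year = format(date, "%Y")) |>
  # keep only the two Januaries being compared
  filter(format(date, "%m") == "01", year %in% c("2020", "2021")) |>
  # one row per ZIP code, with columns rent_2020 and rent_2021
  pivot_wider(
    id_cols = zip_code,
    names_from = year,
    values_from = rent,
    names_prefix = "rent_"
  ) |>
  mutate(change = rent_2021 - rent_2020) |>
  arrange(change) |>      # most negative change (largest drop) first
  slice_head(n = 10)

largest_drops
```

Putting the two years side by side makes the difference a simple column operation; ZIP codes missing either January would show an NA change and may need separate handling.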
If you’d like, you can complete this short survey after you’ve finished the assignment.