In contrast to Homework assignments, you must work completely independently on this project – do not discuss your approach, your code, or your results with any other students, and do not use the discussion board for questions related to this project. If questions do arise, please email the instructor and lead TA.
At this point, we’ve covered Building Blocks, Data Wrangling I, Visualization and EDA, and Data Wrangling II. These topics give a broad introduction into the commonly-used tools of data science, and are the main focus of this project.
Due: October 23 at 11:59pm.
The course’s emphasis on workflow – especially the use of git and GitHub for reproducibility, R Projects to organize your work, R Markdown to write reproducible reports, relative paths to load data from local files, and reasonable naming structures for your files – will be reflected in your Midterm Project submission.
To that end:

- we suggest naming your repo and R Project `p8105_mtp_YOURUNI` (e.g. `p8105_mtp_ajg2202` for Jeff), but that’s not required
- your report should be a single file named `p8105_mtp_YOURUNI.Rmd` that renders to `github_document`
Grading will assess adherence to the instructions above and whether we are able to knit your .Rmd, as well as appropriate styling and clarity of code. This project includes figures; the readability of your embedded plots (e.g. font sizes, axis labels, titles) will also be assessed.
For this project, you should write a report describing your work in a way that targets a reasonably sophisticated collaborator – not an expert data scientist, but an interested observer. You should comment on findings (for example, describe trends in tables and figures). Structure your report to include sections corresponding to the problems below. Write in a reproducible way (e.g. using inline R code where necessary) and include relevant code chunks and their output. Include only relevant information, and adhere to a strict 500-word limit (this excludes figures and tables, code chunks, inline code, YAML, and other non-text elements).
You can check your word count using `wordcountaddin::text_stats("p8105_mtp_YOURUNI.Rmd")`; installation instructions can be found on the wordcountaddin package website. We’ll use the “koRpus” count. NOTE: you do not need to include a word count in your report, and running `wordcountaddin::text_stats("p8105_mtp_YOURUNI.Rmd")` in your document can interfere with our automated reproducibility checks.
A common step when moving is to file a Change of Address form with the United States Postal Service to ensure that mail is delivered to a new address. In response to broad interest, the USPS makes aggregate Change of Address (COA) data publicly available. In this project, we’ll look at COA data in New York City between 2018 and 2022.
This dataset includes the total number of COAs to and from each ZIP code in NYC for each calendar month. In each sheet, the `TOTAL PERM IN` and `TOTAL PERM OUT` variables indicate the total number of permanent address changes going into and out of each ZIP code.
NYC is divided into five boroughs. Each of these boroughs is its own county, and in some cases the borough name and county name differ; for example, Manhattan is New York County. Moreover, boroughs are divided into neighborhoods. COA data provided by USPS may not accurately reflect NYC’s counties, boroughs, and neighborhoods, and a supplementary dataset including these is available here.
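One way to sketch the county-to-borough recoding described below is with `case_match()`; here, `zip_df` and `county_name` are hypothetical names to be adjusted to the actual ZIP code data:

```r
library(tidyverse)

# Recode county names to borough names; `zip_df` and `county_name`
# are placeholder names for the imported ZIP code data
zip_df = 
  zip_df |> 
  mutate(
    borough = case_match(
      county_name,
      "New York" ~ "Manhattan",
      "Kings"    ~ "Brooklyn",
      "Richmond" ~ "Staten Island",
      "Bronx"    ~ "Bronx",
      "Queens"   ~ "Queens"
    )
  )
```

The county–borough correspondence itself (New York County = Manhattan, Kings = Brooklyn, Richmond = Staten Island) is standard; only the variable names above are assumptions.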
Provide a brief introduction to the raw data and the goals of your report.
Import, tidy, combine, and otherwise clean the data. In the ZIP code data, create a `borough` variable using county names. When importing COA data, add a `year` variable for later use; also, create a `net_change` variable by subtracting outbound COAs from inbound COAs. Resolve any issues that arise when merging COA and ZIP code data. Restrict your dataset to only the variables necessary for later parts of this report. Describe the major steps in the data wrangling process in words, including the steps you took to address data quality issues.
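As a sketch of the import step – assuming the COA workbook has one sheet per year named for that year, and that the file path below matches your repo’s structure (both are assumptions to adjust):

```r
library(tidyverse)
library(readxl)

# Hypothetical path; one sheet per year is assumed
coa_path = "data/USPS_COA.xlsx"

coa_df = 
  excel_sheets(coa_path) |> 
  map(
    \(sheet) 
      read_excel(coa_path, sheet = sheet) |> 
      mutate(year = as.integer(sheet))  # sheet name assumed to be the year
  ) |> 
  bind_rows() |> 
  janitor::clean_names() |> 
  mutate(net_change = total_perm_in - total_perm_out)
```

After `clean_names()`, the `TOTAL PERM IN` and `TOTAL PERM OUT` columns become `total_perm_in` and `total_perm_out`, which is what the `net_change` computation above relies on.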
Briefly describe the resulting tidy dataset. How many total observations exist? How many unique ZIP codes are included, and how many unique neighborhoods?
Compare the `city` variable in the COA data to the `borough` variable you created using the ZIP code data. Make two small tables showing the most common values of `city` in the boroughs of Manhattan and Queens. Comment on any data quality issues you observe.
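These tables can be sketched with `count()`, assuming a merged dataset called `coa_zip_df` (a hypothetical name); the Queens table would follow the same pattern:

```r
# Most common `city` values among Manhattan observations
coa_zip_df |> 
  filter(borough == "Manhattan") |> 
  count(city, sort = TRUE) |> 
  head(5) |> 
  knitr::kable()
```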
There are 60 months between 2018 and 2022, but many ZIP codes have fewer than 60 observations; most of these are also missing `neighborhood` values. Discuss why this might be the case, using a few concrete examples as illustration.
Create a reader-friendly table showing the average of `net_change` in each borough and year. Comment on trends in this table.
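A sketch of this summary, again using the hypothetical merged dataset `coa_zip_df`; spreading years across columns keeps the table compact and reader-friendly:

```r
coa_zip_df |> 
  group_by(borough, year) |> 
  summarize(avg_net_change = mean(net_change, na.rm = TRUE)) |> 
  pivot_wider(names_from = year, values_from = avg_net_change) |> 
  knitr::kable(digits = 1)
```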
Make a table showing, across all observed data, the five lowest values of `net_change`. This table should include ZIP code, neighborhood, and the year and month. Make a similar table showing, across data observed before 2020, the five highest values of `net_change`; this table should also include ZIP code, neighborhood, and the year and month.
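`slice_min()` and `slice_max()` are one way to build these tables; the variable names below (`zip_code`, `month`, and the dataset name `coa_zip_df`) are assumptions to adjust:

```r
# Five lowest net_change values across all observed data
coa_zip_df |> 
  slice_min(net_change, n = 5) |> 
  select(zip_code, neighborhood, year, month, net_change) |> 
  knitr::kable()

# Five highest net_change values before 2020
coa_zip_df |> 
  filter(year < 2020) |> 
  slice_max(net_change, n = 5) |> 
  select(zip_code, neighborhood, year, month, net_change) |> 
  knitr::kable()
```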
Understanding monthly `net_change` over the five-year period can give insights into trends in moving. Make a plot showing neighborhood-level average `net_change` values (i.e. averaging across ZIP codes within neighborhoods) against month over all five years. Your plot should facilitate comparisons across boroughs. Include this visualization in your report and export it to a `results` directory in your repository. Comment on any significant elements of this plot.
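One possible skeleton for this plot, faceting by borough to support the required comparisons; it assumes the hypothetical `coa_zip_df` with numeric `year` and `month` variables, and combines them into a date for the x-axis:

```r
neighborhood_df = 
  coa_zip_df |> 
  group_by(borough, neighborhood, year, month) |> 
  summarize(avg_net_change = mean(net_change, na.rm = TRUE)) |> 
  mutate(date = lubridate::make_date(year, month))

net_change_plot = 
  neighborhood_df |> 
  ggplot(aes(x = date, y = avg_net_change, group = neighborhood)) + 
  geom_line(alpha = .3) + 
  facet_wrap(~ borough) + 
  labs(
    x = "Month", y = "Average net change",
    title = "Neighborhood-level average net change, 2018–2022"
  )

# Export to the `results` directory in the repo
ggsave("results/net_change_plot.png", net_change_plot, width = 10, height = 6)
```

Faceting is only one design choice; color or linetype could also distinguish boroughs, but with many neighborhoods per borough, facets tend to stay more readable.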
Lastly, note any limitations of this dataset for understanding changes in ZIP code level population sizes.