In contrast to Homework assignments, you must work completely independently on this project – do not discuss your approach, your code, or your results with any other students, and do not use the discussion board for questions related to this project. If questions do arise, please email the instructor and lead TAs.
At this point, we’ve covered Building Blocks, Data Wrangling I, and Visualization and EDA. These three topics give a broad introduction into the commonly-used tools of data science, and are the main focus of this project.
Due: October 24 at 11:59pm.
The course’s emphasis on workflow – especially the use git and GitHub for reproducibility, R Projects to organize your work, R Markdown to write reproducible reports, relative paths to load data from local files, and reasonable naming structures for your files – will be reflected in your Midterm Project submission.
To that end:
p8105_mtp_YOURUNI
(e.g. p8105_mtp_ajg2202
for Jeff), but that’s not required
p8105_mtp_YOURUNI.Rmd
that renders to github_document
We will assess adherence to the instructions above and whether we are able to knit your .Rmd in the grading of this project. Adherence to appropriate styling and clarity of code will be assessed. This project includes figures; the readability of your embedded plots (e.g. font sizes, axis labels, titles) will be assessed.
For this project, you should write a report describing your work in a
way that targets a reasonably sophisticated collaborator – not an expert
data scientist, but an interested observer. Structure your report to
include sections corresponding to the problems below. Write in a
reproducible way (e.g. using inline R code where necessary) and include
relevant code chunks and their output. Include only relevant
information, and adhere to a strict-500 word limit (this excludes
figures and tables, code chunks, inline code, YAML, and other non-text
elements). You can check your word count using
wordcountaddin::text_stats("p8105_mtp_YOURUNI.Rmd")
;
installation instructions can be found on the wordcountaddin
package website. We’ll use the “koRpus” count.
A friend of mine, over the last several years, has been closely monitoring the weight of three pet dogs – Simone, Gagne, and Raisin, shown left to right in the picture below. The dogs are frequently weighed on a dedicated scale, and their weights are recorded. To ensure accuracy over time, a standard object is also weighed from time to time. Occasional notes, often indicating time spent in a kennel, are included.
The data for this project are available here.
Provide a brief introduction to the raw data and the goals of your report.
Import, tidy, and otherwise clean the weight data, omitting notes entered as text. Describe the major steps in the process. Note any issues with data entry you identify, and what steps you took to address them.
Produce a second dataframe, containing only the notes entered as text and the date on which they were written.
Export both dataframes as CSVs; store in the same directory as the raw data.
Briefly describe the resulting tidy dataset containing weights. How many unique dates are included in the dataset? Make a well-formatted table showing the number of observations for each dog, along with their average weight and the standard deviation.
Create a two-panel plot showing:
Include this visualization in your report and export it to a
results
directory in your repository.
Comment on these visualizations. Do you see any trends? What other conclusions can you draw from these data, and the notes that accompany them?