In contrast to Homework assignments, you must work completely independently on this project – do not discuss your approach, your code, or your results with any other students, and do not use the discussion board for questions related to this project. If questions do arise, please email the instructor and lead TAs.


At this point, we’ve covered Building Blocks, Data Wrangling I, and Visualization and EDA. These three topics give a broad introduction into the commonly-used tools of data science, and are the main focus of this project.

Due date

Due: October 24 at 11:59pm.


The course’s emphasis on workflow – especially the use git and GitHub for reproducibility, R Projects to organize your work, R Markdown to write reproducible reports, relative paths to load data from local files, and reasonable naming structures for your files – will be reflected in your Midterm Project submission.

To that end:

  • create a private GitHub repo + local R Project; we suggest naming this repo / directory p8105_mtp_YOURUNI (e.g. p8105_mtp_ajg2202 for Jeff), but that’s not required
    • non-private repos will be treated as inconsistent with the independent work requirement and as violations of the academic integrity policy
  • add the GitHub user “bst-p8105” as a collaborator on the project, which will give us (and only us) access to your repo
  • create a single .Rmd file named p8105_mtp_YOURUNI.Rmd that renders to github_document
  • submit a link to your repo via Courseworks

We will assess adherence to the instructions above and whether we are able to knit your .Rmd in the grading of this project. Adherence to appropriate styling and clarity of code will be assessed. This project includes figures; the readability of your embedded plots (e.g. font sizes, axis labels, titles) will be assessed.


For this project, you should write a report describing your work in a way that targets a reasonably sophisticated collaborator – not an expert data scientist, but an interested observer. Structure your report to include sections corresponding to the problems below. Write in a reproducible way (e.g. using inline R code where necessary) and include relevant code chunks and their output. Include only relevant information, and adhere to a strict-500 word limit (this excludes figures and tables, code chunks, inline code, YAML, and other non-text elements). You can check your word count using wordcountaddin::text_stats("p8105_mtp_YOURUNI.Rmd"); installation instructions can be found on the wordcountaddin package website. We’ll use the “koRpus” count.


A friend of mine, over the last several years, has been closely monitoring the weight of three pet dogs – Simone, Gagne, and Raisin, shown left to right in the picture below. The dogs are frequently weighed on a dedicated scale, and their weights are recorded. To ensure accuracy over time, a standard object is also weighed from time to time. Occasional notes, often indicating time spent in a kennel, are included.

The data for this project are available here.


Problem 1 – Data.

Provide a brief introduction to the raw data and the goals of your report.

Import, tidy, and otherwise clean the weight data, omitting notes entered as text. Describe the major steps in the process. Note any issues with data entry you identify, and what steps you took to address them.

Produce a second dataframe, containing only the notes entered as text and the date on which they were written.

Export both dataframes as CSVs; store in the same directory as the raw data.

Problem 2 – EDA

Briefly describe the resulting tidy dataset containing weights. How many unique dates are included in the dataset? Make a well-formatted table showing the number of observations for each dog, along with their average weight and the standard deviation.

Problem 3 – Visualization.

Create a two-panel plot showing:

  • In the left panel, the distribution of weights for each dog.
  • In the right panel, each dog’s weight over time.

Include this visualization in your report and export it to a results directory in your repository.

Comment on these visualizations. Do you see any trends? What other conclusions can you draw from these data, and the notes that accompany them?