Writing with data

You will typically (if not always) need to summarize your work in writing. This page describes how to do so using R Markdown.

This is the second module in the Building blocks topic.

Overview

Learning Objectives

We’ll be getting up-to-speed on the basic tools in R. As part of that, we’ll define and establish some good habits.

Slide Deck

Writing with data from Jeff Goldsmith.

Video Lecture

Example

Before jumping in, one short note about the default RStudio treatment of R Markdown documents: it behaves like a “notebook” and shows output mixed in with the code, rather than in the console or viewer. I don’t like this, and I’m definitely not the only one. I might just be old and set in my ways, but there are good reasons to avoid notebooks. This behavior can be turned off using Global Options > R Markdown > Show output inline.

Basic RMD

Below is a first RMD file. To follow along, create a .Rmd file using File > New File > R Markdown, and replace the default text with what’s below; you can also download the template here. Don’t forget to keep this in a directory you can find again later!

---
title: "Simple document"
output: html_document
---

I'm an R Markdown document! 

# Section 1

Here's a **code chunk** that samples from 
a _normal distribution_:

```{r}
samp = rnorm(100)
length(samp)
```

# Section 2

I can take the mean of the sample, too!
The mean is `r mean(samp)`.

There are three major components to this file:

YAML header: The segment at the beginning of the document bracketed by ---s.
Text + inline R: Written text with simple formatting like # heading, **bold**, and _italic_
Code chunks: Blocks of code surrounded by ```

The combination of these elements allows you a great deal of flexibility and power as an author.

R Markdown documents are rendered to produce complete reports with text, formatting, and code results by “knitting”. You can knit your document using the Knit button in RStudio or Cmd / Ctrl + Shift + K; both call rmarkdown::render() (which you can also run directly from the R console if you prefer). Behind the scenes, knitr runs the code and creates a Markdown document, and pandoc translates that to the output format you specify (e.g. HTML, PDF, Word).
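If you'd like to try knitting from the R console, one way is rmarkdown::render(). The sketch below assumes the rmarkdown package and pandoc are installed (RStudio bundles pandoc); the file contents and the temporary file path are made up for illustration.

```r
library(rmarkdown)

# Write a tiny R Markdown file to a temporary location (illustrative content).
fence = strrep("`", 3)
rmd_lines = c(
  "---",
  'title: "Simple document"',
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r}"),
  "samp = rnorm(100)",
  "length(samp)",
  fence,
  "",
  "The mean is `r mean(samp)`."
)
rmd_path = file.path(tempdir(), "simple_document.Rmd")
writeLines(rmd_lines, rmd_path)

# render() runs knitr and then pandoc; it returns the path to the output file.
html_path = render(rmd_path, quiet = TRUE)
file.exists(html_path)
```

The returned path is handy if you want to open the result right away, e.g. with browseURL(html_path).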

Learning assessment: Take two minutes and create an R Markdown document as above. Make sure you can knit it and find the result in your local directory.

Code chunks and snippets

We’ll start with code chunks, since these are the distinguishing feature of R Markdown documents. The code chunks take the place of scripts in that they hold the code you use to produce your results. However, they tend to be briefer and more self-contained: you’re nestling these bits of code among the text that supports them and the results they produce. You can still execute code in chunks using Cmd/Ctrl + Enter, and you will still develop code by writing and refining until you have something you’re happy with.

Although the benefits will mostly become apparent later, I recommend you get in the habit of naming your code chunks now using {r chunk_name}. I also recommend inserting code chunks using hotkeys (Opt + Cmd + I for Mac, Ctrl + Alt + I for Windows).

Beyond the name, you can customize the behavior of your code chunk via options defined in the chunk header. Some common options are:

eval = FALSE: code will be displayed but not executed; results are not included.
echo = FALSE: code will be executed but not displayed; results are included.
include = FALSE: code will be executed, but neither the code nor its results are displayed.
message = FALSE and warning = FALSE: prevent messages and warnings from being displayed.
results = "hide" and fig.show = "hide": prevent results and figures from being shown, respectively.
collapse = TRUE: the code and its output will be collapsed into a single block rather than shown in separate blocks.
error: errors in code will stop rendering when FALSE; errors in code will be printed in the doc when TRUE. The default is FALSE and you should almost never change it.

Use these options to be judicious about what you include in your report. Remember to keep your audience in mind: how much do they want or need to see?
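If you find yourself repeating the same options in every chunk header, one common pattern is to set document-wide defaults once, in a setup chunk, using knitr::opts_chunk$set(). The particular defaults below are only an illustration, not a recommendation from this course; options given in an individual chunk header still override them.

```r
library(knitr)

# Set document-wide defaults; these apply to every later chunk unless
# that chunk's header says otherwise (the values here are illustrative).
opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE
)

# Check what the current defaults are.
opts_chunk$get("message")
opts_chunk$get("warning")
```

A chunk like this usually goes at the very top of the document, often with include = FALSE so the reader never sees it.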

You can also cache the results of a code chunk, but we will largely avoid this. Caching can save time by saving the results of a code chunk instead of re-executing when the document is knit. However, you have to be careful when using this option since downstream code can depend on upstream changes. Controlling this behavior through the dependson option can help, but if you cache code you’ll want to periodically clear your cache to ensure you’re getting reproducible results.

Inserting brief code snippets inline is sometimes helpful; I use these to give the sample size or summary statistics in text. You can insert code inline using `r `, often in conjunction with the format() function to clean up your output.
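To make the format() suggestion concrete, here is a small sketch of the kind of cleanup I mean. The values are made up and stand in for results computed earlier in a document.

```r
# Illustrative values standing in for results computed earlier in a document.
n_obs = 1500L
avg = 3.14159
total = 1000000

avg_rounded = round(avg, digits = 2)                              # 3.14
two_places = format(2, nsmall = 2)                                # "2.00"
n_fmt = format(n_obs, big.mark = ",")                             # "1,500"
total_fmt = format(total, big.mark = ",", scientific = FALSE)     # "1,000,000"
```

Inline, this might look like: The sample has `r format(n_obs, big.mark = ",")` observations. Without scientific = FALSE, large round numbers like 1000000 can print in scientific notation.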

Learning assessment: Write a named code chunk that creates a dataframe comprising: a numeric variable containing a random sample of size 500 from a normal variable with mean 1; a logical vector indicating whether each sampled value is greater than zero; and a numeric vector containing the absolute value of each element. Then, produce a histogram of the absolute value variable just created. Add an inline summary giving the median value rounded to two decimal places. What happens if you add eval = FALSE to the code chunk? What about echo = FALSE?

Solution

The snippet below shows the relevant section of an R Markdown document.

The chunk below creates a dataframe containing a sample of size 500 from a 
random normal variable, constructs the specified logical vector, takes the 
absolute value of each element of that sample, and produces a histogram of 
the absolute value. The code chunk also finds the median of the absolute 
values and stores it for easy in-line printing.

```{r learning_assessment_1}
library(tidyverse)

la_df = tibble(
  norm_samp = rnorm(500, mean = 1),
  norm_samp_pos = norm_samp > 0,
  abs_norm_samp = abs(norm_samp)
)

# histogram of the absolute values
ggplot(la_df, aes(x = abs_norm_samp)) + 
  geom_histogram()

# median of the absolute values, stored for inline printing
median_samp = median(pull(la_df, abs_norm_samp))
```

The median of the variable containing absolute values is 
`r round(median_samp, digits = 2)`.

Formatting text

There are a huge number of ways to format your documents. The overview below is essentially copied from R for Data Science; a link to a handy cheatsheet is below.

Text formatting 
------------------------------------------------------------

*italic*  or  _italic_
**bold**  or  __bold__
`code`
superscript^2^ and subscript~2~


Headings
------------------------------------------------------------

# 1st Level Header

## 2nd Level Header

### 3rd Level Header


Lists
------------------------------------------------------------

*   Bulleted list item 1

*   Item 2

    * Item 2a

    * Item 2b

1.  Numbered list item 1

1.  Item 2. The numbers are incremented automatically in the output.


Tables 
------------------------------------------------------------

First Header  | Second Header
------------- | -------------
Content Cell  | Content Cell
Content Cell  | Content Cell

You’ll need to refer to this list (or to similar resources) pretty often at first, but most of it will become second-nature after you’ve written a few documents.

Learning assessment: After the previous code chunk, write a bullet list giving the mean, median, and standard deviation of the original random sample.

Solution

The snippet below shows the relevant section of an R Markdown document.

* The mean of the sample is `r mean(pull(la_df, norm_samp))`
* The median of the sample is `r median(pull(la_df, norm_samp))`
* The standard deviation of the sample is `r sd(pull(la_df, norm_samp))`

YAML and output formats

The YAML header controls global features of the document. I generally will include both the author and date in each document I produce.

author: "Jeff Goldsmith"
date: 2024-09-12

We’re mostly concerned with the output format, which is controlled through the output: field.

The snippet below will produce an HTML document. Notice that this has subfields to add a table of contents, and float that table alongside the content. These lines are used throughout the course website.

output:
  html_document:
    toc: true
    toc_float: true

HTML documents are great because they allow interactivity in a way that static formats (PDF, Word) do not. For example, adding the subfield code_folding: hide under html_document will hide all the code in the document until the reader clicks to show it (I almost always use this for collaborative reports).
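The same options can also be set from R rather than YAML. As a sketch, rmarkdown::html_document() takes toc, toc_float, and code_folding as arguments and returns a format object you can hand to rmarkdown::render(). The file name in the commented-out line is hypothetical.

```r
library(rmarkdown)

# Build the HTML output format in R instead of in the YAML header.
html_fmt = html_document(
  toc = TRUE,
  toc_float = TRUE,
  code_folding = "hide"
)
class(html_fmt)

# "report.Rmd" is a hypothetical file name; uncomment to render it.
# render("report.Rmd", output_format = html_fmt)
```

I'd still keep these settings in the YAML header for most documents, since that keeps the file self-describing; the R version is mainly useful when you script your rendering.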

That said, some collaborators will need or prefer static documents. You can create these using the YAML snippets below, which will produce a PDF or a Word document, respectively, when knitted. These require extra software (LaTeX and Word, respectively).

output:
  pdf_document: default

output:
  word_document: default

The formatting for both PDF and Word documents can be controlled through options as well, although these can be tricky to control (especially for Word documents). If you’re really interested in generating reports using Word, you may want to read up on the redoc package!
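If you need more than one format from the same source, you don't have to edit the YAML back and forth: rmarkdown::render() accepts a vector of format names. The sketch below assumes rmarkdown and pandoc are installed, and uses a throwaway document in a temporary directory; I've left out pdf_document so that it runs without LaTeX.

```r
library(rmarkdown)

# A throwaway document for illustration; no output format is set in its YAML.
rmd_path = file.path(tempdir(), "formats_demo.Rmd")
writeLines(c("---", 'title: "Formats demo"', "---", "", "Some text."), rmd_path)

# Passing a vector of format names renders the document once per format.
out_files = render(
  rmd_path,
  output_format = c("html_document", "word_document"),
  quiet = TRUE
)
basename(out_files)
```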

We use the github_document format extensively in this course, and talk more about why in git and github. Later in the course we’ll talk about some other output formats – especially dashboards and websites – introducing other YAML options as needed.

New workflow

All of this suggests a slight modification to the previous workflow:

Create a directory with a reasonable name and path (e.g. ~/Documents/School/P8105/Homework_2/)
Put an R Project in the directory
Keep everything related to the analysis – data inputs, scripts / R Markdown files, reports, output – in there, and use R Markdown as much as possible
Periodically check for reproducibility of the analysis

The emphasis on R Markdown is new; this should become your default behavior when starting any new project.

A last note about reproducibility: each time you knit an R Markdown file, knitr uses a new R session to run the included code. As a result, knitting is a great way to make sure your analysis is self-contained and you should knit frequently!
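One caveat if you render from the console instead of the Knit button: rmarkdown::render() evaluates code in your current session by default, so objects already in your workspace can hide missing code. A rough way to mimic the button is to render in a brand-new R process. The helper below is a sketch, not part of any package; its name and the file name in the usage line are hypothetical, and it assumes pandoc can be found from outside RStudio.

```r
# Render a file in a brand-new R process, roughly what the Knit button does.
# Returns TRUE if rendering finished without error, FALSE otherwise.
knit_in_fresh_session = function(rmd_path) {
  rscript = file.path(R.home("bin"), "Rscript")
  expr = sprintf("rmarkdown::render(%s, quiet = TRUE)", deparse(rmd_path))
  status = system2(rscript, args = c("-e", shQuote(expr)))
  status == 0
}

# "homework_2.Rmd" is a hypothetical file name.
# knit_in_fresh_session("homework_2.Rmd")
```

If this returns FALSE for a document that knits fine from your console, the document probably relies on something that only exists in your interactive session.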

Learning assessment: Convert the scripts for creating a data frame and producing basic plots (from best practices) to self-contained R Markdown files that produce HTML documents.

Solution

Here is an R Markdown file for vector classes:

---
title: "Exploring Vector Classes"
author: "Jeff Goldsmith"
date: 2024-09-10
output: html_document
---

```{r setup, include = FALSE}
library(tidyverse)
```

The purpose of this file is to examine a few data types (or data classes) in R.

First we create a dataframe containing variables of four different types.

```{r}
example_df = tibble(
  vec_numeric = 5:8,
  vec_char = c("My", "name", "is", "Jeff"),
  vec_logical = c(TRUE, TRUE, TRUE, FALSE),
  vec_factor = factor(c("male", "male", "female", "female"))
)
```

The variable `vec_numeric` has class `r class(pull(example_df, vec_numeric))`, and the variable `vec_factor` has class `r class(pull(example_df, vec_factor))`.

And here’s one for basic plots:

---
title: "Basic Plots"
author: "Jeff Goldsmith"
date: 2024-09-10
output: html_document
---

```{r setup, include = FALSE}
library(tidyverse)
```

The purpose of this file is to present a couple of basic plots using `ggplot`.

First we create a dataframe containing variables for our plots.

```{r df_create}
set.seed(1234)

plot_df = tibble(
  x = rnorm(1000, sd = .5),
  y = 1 + 2 * x + rnorm(1000)
)
```

First we show a histogram of the `x` variable.

```{r x_hist}
ggplot(plot_df, aes(x = x)) + 
    geom_histogram()
```

Next we show a scatterplot of `y` vs `x`. 

```{r yx_scatter}
ggplot(plot_df, aes(x = x, y = y)) + 
    geom_point()
```

Other materials

Constructing useful documents that combine text and code is the subject of several online guides. See below for a sampling:

The R Markdown cheatsheet is a useful resource once you have the basics down.
R For Data Science devotes chapters to R Markdown, additional output formats, and a useful workflow.
The Intro to R Markdown from RStudio overlaps a lot with the previous bullet, but is also handy to review.
This chapter from R Basics is a good intro to R Markdown.
If webinars are more to your liking, this intro is good.

You’ll also find a lot of stuff online that has been written using R Markdown and, with a little digging, can often find the .Rmd file as well. This is a great way to spot new tools and figure out how to incorporate them in your own documents!

The code that I produced while working through examples in lecture is here.