You will typically (if not always) need to summarize your work in writing. This page describes how to do so using R Markdown.
This is the second module in the Building blocks topic.
We’ll be getting up-to-speed on the basic tools in R. As part of that, we’ll define and establish some good habits.
Before jumping in, one short note about the default RStudio treatment of R Markdown documents: it behaves like a “notebook” and shows output mixed in with the code, rather than in the console or viewer. I don’t like this, and I’m definitely not the only one. I might just be old and set in my ways, but there are good reasons to avoid notebooks. This behavior can be turned off using Global Options > R Markdown > Show output inline.
Below is a first R Markdown file. To follow along, create a .Rmd file using File > New File > R Markdown, and replace the default text with what’s below; you can also download the template here. Don’t forget to keep this in a directory you can find again later!
---
title: "Simple document"
output: html_document
---
I'm an R Markdown document!
# Section 1
Here's a **code chunk** that samples from
a _normal distribution_:
```{r}
samp = rnorm(100)
length(samp)
```
# Section 2
I can take the mean of the sample, too!
The mean is `r mean(samp)`.
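There are three major components to this file:
* a YAML header, surrounded by ---s;
* text with simple formatting like # heading, **bold**, and _italic_;
* chunks of R code, surrounded by three backticks.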
The combination of these elements allows you a great deal of flexibility and power as an author.
R Markdown documents are rendered to produce complete reports with text, formatting, and code results by “knitting”. You can knit your document using the RStudio GUI or Cmd / Ctrl + Shift + K; these options execute rmarkdown::render (which you can run directly from the command line if you prefer). Behind the scenes, knitr is creating a Markdown document and pandoc is translating that to the output format you specify (e.g. HTML, .pdf, .docx).
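To see the first of those two steps in isolation, here is a minimal sketch that calls knitr::knit on a throwaway file. The file contents and temporary paths are illustrative assumptions, not part of the course materials; the point is only that knitr alone yields plain Markdown with the chunk's output written into it.

```r
library(knitr)

# Three backticks, built programmatically so the chunk fence is easy to write.
fence = strrep("`", 3)

# Write a tiny, throwaway R Markdown file (hypothetical content and location).
rmd_path = tempfile(fileext = ".Rmd")
writeLines(
  c(
    "---",
    "title: \"Simple document\"",
    "output: html_document",
    "---",
    "",
    "I'm an R Markdown document!",
    "",
    paste0(fence, "{r}"),
    "samp = rnorm(100)",
    "length(samp)",
    fence
  ),
  rmd_path
)

# Step 1 only: knitr runs the chunk and writes a plain Markdown file.
md_path = knit(
  rmd_path,
  output = tempfile(fileext = ".md"),
  quiet = TRUE,
  envir = new.env()
)
md_lines = readLines(md_path)

# The chunk's printed result now appears as ordinary text in the Markdown.
print(md_lines)
```

Calling rmarkdown::render(rmd_path) instead would carry out both steps, assuming pandoc is available on your machine.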
Learning assessment: Take two minutes and create an R Markdown document as above. Make sure you can knit it and find the result in your local directory.
We’ll start with code chunks, since these are the distinguishing feature of R Markdown documents. The code chunks take the place of scripts in that they hold the code you use to produce your results. However, they tend to be briefer and more self-contained: you’re nestling these bits of code among the text that supports them and the results they produce. You can still execute code in chunks using Cmd/Ctrl + Enter, and you will still develop code by writing and refining until you have something you’re happy with.
Although the benefits will mostly become apparent later, I recommend you get in the habit of naming your code chunks now using {r chunk_name}. I also recommend inserting code chunks using hotkeys (Opt + Cmd + I for Mac, Ctrl + Alt + I for Windows).
Beyond the name, you can customize the behavior of your code chunk via options defined in the chunk header. Some common options are:
* eval = FALSE: code will be displayed but not executed; results are not included.
* echo = FALSE: code will be executed but not displayed; results are included.
* include = FALSE: code will be executed, but neither the code nor its results are displayed.
* message = FALSE and warning = FALSE: prevents messages and warnings from being displayed.
* results = "hide" and fig.show = "hide": prevents results and figures from being shown, respectively.
* collapse = TRUE: output will be collapsed into a single block shown at the end of the chunk.
* error: errors in code will stop rendering when FALSE; errors in code will be printed in the doc when TRUE. The default is FALSE and you should almost never change it.

Use these options to be judicious about what you include in your report. Remember to keep your audience in mind: how much do they want or need to see?
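As a rough illustration of how these options sit in a chunk header, the sketch below names a chunk and combines three of the options listed above. The chunk name samp_hist and its contents are made up for this example.

```{r samp_hist, echo = FALSE, message = FALSE, warning = FALSE}
# With echo = FALSE, readers see the histogram but not this code;
# message = FALSE and warning = FALSE keep any messages or warnings out of the report.
samp = rnorm(100)
hist(samp)
```

Switching echo = FALSE to eval = FALSE would reverse the effect: the code would be shown, but no histogram would be produced.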
You can also cache the results of a code chunk, but we will largely avoid this. Caching can save time by saving the results of a code chunk instead of re-executing when the document is knit. However, you have to be careful when using this option since downstream code can depend on upstream changes. Controlling this behavior through the dependson option can help, but if you cache code you’ll want to periodically clear your cache to ensure you’re getting reproducible results.
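If you do try caching, a chunk might look something like the sketch below. The chunk name slow_sim and the simulation are hypothetical stand-ins for a genuinely slow step.

```{r slow_sim, cache = TRUE}
# Hypothetical slow step: 1000 simulated sample means, each from 500 normal draws.
# With cache = TRUE, knitr stores the result and reuses it until this chunk changes.
sim_means = replicate(1000, mean(rnorm(500)))
sd(sim_means)
```

A later cached chunk that uses sim_means could add dependson = "slow_sim" to its header, so that its own cache is refreshed whenever slow_sim changes.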
Inserting brief code snippets inline is sometimes helpful; I use these to give the sample size or summary statistics in text. You can insert code inline using `r `, often in conjunction with the format() function to clean up your output.
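One way this can look in practice is sketched below; the chunk name and the object names n_samp and mean_fmt are illustrative choices. The chunk stores the values, and format() with nsmall = 2 pads the rounded mean to two decimal places.

```{r inline_values}
samp = rnorm(100)

# Store the pieces you want to report so the inline code stays short.
n_samp = length(samp)

# Round first, then format so that two decimal places are always printed.
mean_fmt = format(round(mean(samp), 2), nsmall = 2)
```

The surrounding text can then read: The sample of size `r n_samp` has mean `r mean_fmt`.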
Learning assessment: Write a named code chunk that creates a dataframe comprised of: a numeric variable containing a random sample of size 500 from a normal variable with mean 1; a logical vector indicating whether each sampled value is greater than zero; and a numeric vector containing the absolute value of each element. Then, produce a histogram of the absolute value variable just created. Add an inline summary giving the median value rounded to two decimal places. What happens if you set eval = FALSE to the code chunk? What about echo = FALSE?
The snippet below shows the relevant section of an R Markdown document.
The chunk below creates a dataframe containing a sample of size 500 from a
random normal variable, constructs the specified logical vector, takes the
absolute value of each element of that sample, and produces a histogram of
the absolute value. The code chunk also finds the median of the sample and
stores it for easy in-line printing.
```{r learning_assessment_1}
library(tidyverse)
la_df = tibble(
  norm_samp = rnorm(500, mean = 1),
  norm_samp_pos = norm_samp > 0,
  abs_norm_samp = abs(norm_samp)
)
ggplot(la_df, aes(x = abs_norm_samp)) + geom_histogram()
median_samp = median(pull(la_df, norm_samp))
```
The median of the sample is
`r round(median_samp, digits = 2)`.
There are a huge number of ways to format your documents. The overview below is essentially copied from R for Data Science; a link to a handy cheatsheet is below.
Text formatting
------------------------------------------------------------
*italic* or _italic_
**bold** or __bold__
`code`
superscript^2^ and subscript~2~
Headings
------------------------------------------------------------
# 1st Level Header
## 2nd Level Header
### 3rd Level Header
Lists
------------------------------------------------------------
* Bulleted list item 1
* Item 2
    * Item 2a
    * Item 2b
1. Numbered list item 1
1. Item 2. The numbers are incremented automatically in the output.
Tables
------------------------------------------------------------
First Header | Second Header
------------- | -------------
Content Cell | Content Cell
Content Cell | Content Cell
You’ll need to refer to this list (or to similar resources) pretty often at first, but most of it will become second-nature after you’ve written a few documents.
Learning assessment: After the previous code chunk, write a bullet list giving the mean, median, and standard deviation of the original random sample.
The snippet below shows the relevant section of an R Markdown document.
* The mean of the sample is `r mean(pull(la_df, norm_samp))`
* The median of the sample is `r median(pull(la_df, norm_samp))`
* The standard deviation of the sample is `r sd(pull(la_df, norm_samp))`
The YAML header controls global features of the document. I generally will include both the author and date in each document I produce.
author: "Jeff Goldsmith"
date: 2024-09-12
We’re mostly concerned with the output format, which is controlled through the output: field.
The snippet below will produce an HTML document. Notice that this has subfields to add a table of contents, and float that table alongside the content. These lines are used throughout the course website.
output:
  html_document:
    toc: true
    toc_float: true
HTML documents are great because they allow interactivity in a way that static formats (PDF, Word) do not. For example, adding the subfield code_folding: hide under html_document will hide all the code in the document until the reader clicks to show it (I almost always use this for collaborative reports).
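As a sketch, assuming you also want the table of contents settings used elsewhere on this site, the output field with code folding turned on would look like this:

```yaml
output:
  html_document:
    toc: true
    toc_float: true
    code_folding: hide
```

Indentation matters here: code_folding sits at the same level as toc because both are subfields of html_document.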
That said, some collaborators will need or prefer static documents. You can create these using the YAML snippets below, which will produce a PDF and a Word document, respectively, when knitted. These require extra software (LaTeX and Word, respectively).
output:
  pdf_document: default

output:
  word_document: default
The formatting for both PDF and Word documents can be controlled
through options as well, although these can be tricky to control
(especially for Word documents). If you’re really interested in
generating reports using Word, you may want to read up on the redoc
package!
We use the github_document format extensively in this course, and talk more about why in git and github. Later in the course we’ll talk about some other output formats (especially dashboards and websites), introducing other YAML options as needed.
All of this suggests a slight modification to the previous workflow:
(~/Documents/School/P8105/Homework_2/)
The bold stuff is new; this should become your default behavior when starting any new project.
A last note about reproducibility: each time you knit an R Markdown file, knitr uses a new R session to run the included code. As a result, knitting is a great way to make sure your analysis is self-contained and you should knit frequently!
Learning assessment: Convert the scripts for exploring vector classes and producing basic plots (from best practices) to self-contained R Markdown files that produce HTML documents.
Here is an R Markdown file for vector classes:
---
title: "Exploring Vector Classes"
author: "Jeff Goldsmith"
date: 2019-09-10
output: html_document
---
```{r setup, include = FALSE}
library(tidyverse)
```
The purpose of this file is to examine a few data types (or data classes) in R.
First we create a dataframe containing variables of four different types.
```{r}
example_df = tibble(
  vec_numeric = 5:8,
  vec_char = c("My", "name", "is", "Jeff"),
  vec_logical = c(TRUE, TRUE, TRUE, FALSE),
  vec_factor = factor(c("male", "male", "female", "female"))
)
```
The variable `vec_numeric` has class `r class(pull(example_df, vec_numeric))`, and the variable `vec_factor` has class `r class(pull(example_df, vec_factor))`.
And here’s one for basic plots:
---
title: "Basic Plots"
author: "Jeff Goldsmith"
date: 2019-09-10
output: html_document
---
```{r setup, include = FALSE}
library(tidyverse)
```
The purpose of this file is to present a couple of basic plots using `ggplot`.
First we create a dataframe containing variables for our plots.
```{r df_create}
set.seed(1234)
plot_df = tibble(
  x = rnorm(1000, sd = .5),
  y = 1 + 2 * x + rnorm(1000)
)
```
First we show a histogram of the `x` variable.
```{r x_hist}
ggplot(plot_df, aes(x = x)) + geom_histogram()
```
Next we show a scatterplot of `y` vs `x`.
```{r yx_scatter}
ggplot(plot_df, aes(x = x, y = y)) + geom_point()
```
Constructing useful documents that combine text and code is the subject of several online guides. See below for a sampling:
You’ll also find a lot of stuff online that has been written using R Markdown and, with a little digging, can often find the .Rmd file as well. This is a great way to spot new tools and figure out how to incorporate them in your own documents!
The code that I produced while working through examples in lecture is here.