Writing Functions

If you use the same code twice, you need a function – this will improve code readability, facilitate troubleshooting, and reduce chances for mistakes. This content looks at the best approaches for writing R functions.

This is the first module in the Iteration topic.

Overview

Learning Objectives

Create simple R functions to abstract common processes.

Slide Deck

Writing Functions from Jeff Goldsmith.

Video Lecture

Example

For this topic, I’ll create a GitHub repo + directory / R Project called iteration. I’ll write code for today’s content in a new R Markdown document called writing_functions.Rmd, and I’m going to load the usual packages. I’m also setting the seed so that the output on this page is fixed.

library(tidyverse)
library(rvest)

set.seed(1)

My first function

The best way to build up a function is to start with code you’ve written outside a function. To see how this might work, I’ll start with a simple example: the chunk below takes a sample from a normal distribution and then computes the vector of Z scores for the sample.

x_vec = rnorm(25, mean = 5, sd = 3)

(x_vec - mean(x_vec)) / sd(x_vec)

##  [1] -0.83687228  0.01576465 -1.05703126  1.50152998  0.16928872 -1.04107494
##  [7]  0.33550276  0.59957343  0.42849461 -0.49894708  1.41364561  0.23279252
## [13] -0.83138529 -2.50852027  1.00648110 -0.22481531 -0.19456260  0.81587675
## [19]  0.68682298  0.44756609  0.78971253  0.64568566 -0.09904161 -2.27133861
## [25]  0.47485186

If I want to repeat this (admittedly simple) process for lots of samples, I might want to have a function that takes the sample as an argument, computes the vector of Z scores in the body, and returns the result. I define such a function below.

z_scores = function(x) {
  
  z = (x - mean(x)) / sd(x)
  z
  
}

z_scores(x_vec)

##  [1] -0.83687228  0.01576465 -1.05703126  1.50152998  0.16928872 -1.04107494
##  [7]  0.33550276  0.59957343  0.42849461 -0.49894708  1.41364561  0.23279252
## [13] -0.83138529 -2.50852027  1.00648110 -0.22481531 -0.19456260  0.81587675
## [19]  0.68682298  0.44756609  0.78971253  0.64568566 -0.09904161 -2.27133861
## [25]  0.47485186

I can try this with a few samples and confirm that it works. I should also try to think of ways this code might break; the attempts below try a variety of inputs to see what happens.

z_scores(3)

## [1] NA

z_scores("my name is jeff")

## Error in x - mean(x): non-numeric argument to binary operator

z_scores(iris)

## Error in is.data.frame(x): 'list' object cannot be coerced to type 'double'

z_scores(sample(c(TRUE, FALSE), 25, replace = TRUE))

##  [1] -0.7348469  1.3063945 -0.7348469 -0.7348469  1.3063945  1.3063945
##  [7] -0.7348469 -0.7348469 -0.7348469  1.3063945  1.3063945 -0.7348469
## [13] -0.7348469 -0.7348469 -0.7348469 -0.7348469 -0.7348469  1.3063945
## [19] -0.7348469 -0.7348469 -0.7348469 -0.7348469  1.3063945  1.3063945
## [25]  1.3063945

These all did something I didn’t want, but only two returned errors. To avoid behavior you don’t want (i.e. to “fail noisily and as soon as possible”) we’ll add some checks on the argument values using conditional statements.

z_scores = function(x) {
  
  if (!is.numeric(x)) {
    stop("Argument x should be numeric")
  } else if (length(x) == 1) {
    stop("Z scores cannot be computed for length 1 vectors")
  }
  
  z = mean(x) / sd(x)
  
  z
}

Fantastic – we have a pretty solid function for computing Z scores!

Multiple outputs

In some cases it might be better to return the mean and standard deviation instead of the Z scores. A first option is to store each of the values in a named list, and to return that list. (We’ll talk more about lists in iteration and listcols.)

mean_and_sd = function(x) {
  
  if (!is.numeric(x)) {
    stop("Argument x should be numeric")
  } else if (length(x) == 1) {
    stop("Cannot be computed for length 1 vectors")
  }
  
  mean_x = mean(x)
  sd_x = sd(x)

  list(mean = mean_x, 
       sd = sd_x)
}

Alternatively, we might store values in a data frame.

mean_and_sd = function(x) {
  
  if (!is.numeric(x)) {
    stop("Argument x should be numeric")
  } else if (length(x) == 1) {
    stop("Cannot be computed for length 1 vectors")
  }
  
  mean_x = mean(x)
  sd_x = sd(x)

  tibble(
    mean = mean_x, 
    sd = sd_x
  )
}

In general, either of these will be fine; which one you choose will depend on what kind of values you want to return, and what you plan to do with the function itself. If you want to return the original sample along with the computed values, a list might make sense; if you plan to run your function a lot and study the results, having a data frame will make it easier to use other tools. We’ll see more of that in iteration and simulation.

Multiple inputs

As exciting as it is to compute Z scores, let’s start setting our sights higher. I’d like to have a function that takes a given sample size along with a true mean and standard deviation, simulates data from a normal distribution, and returns the estimated mean and standard deviation. I’ll start from the code below.

sim_data = tibble(
  x = rnorm(30, mean = 2, sd = 3)
)

sim_data |> 
  summarize(
    mu_hat = mean(x),
    sigma_hat = sd(x)
  )

## # A tibble: 1 × 2
##   mu_hat sigma_hat
##    <dbl>     <dbl>
## 1   1.63      3.06

You should take a few minutes to examine this code – make a plot of the simulated data to make sure it “makes sense”, take a look at the result of computing the mean and standard deviation, etc.

Once you’re satisfied, it’s time to wrap things up in a function. I’d like to be able to change the sample size and parameters, so those will be my arguments; the code that simulates data and computes the sample mean and SD go in the body; and the return statement should include the estimates. A function that does all this, using default values for the mean and standard deviation, is below.

sim_mean_sd = function(n, mu = 2, sigma = 3) {
  
  sim_data = tibble(
    x = rnorm(n, mean = mu, sd = sigma),
  )
  
  sim_data |> 
    summarize(
      mu_hat = mean(x),
      sigma_hat = sd(x)
    )
}

Repeated calls to sim_mean_sd() will give a sense of sampling variability in estimating the mean and standard deviation from a sample; take a few minutes to run sim_mean_sd(30) a few times, and then to run sim_mean_sd(300), and think about the results. We’ll examine this more formally in iteration and in simulation.

This is also a good time to point out how R handles argument matching. We can use positional matching, meaning the first value supplied is taken to be the first argument, the second value is the second argument, and so on. We do this with tidyverse functions a lot; the first argument is always a dataframe, and we just supply that dataframe in the first position. We also use positional matching when we call mean(x) or sim_mean_sd(30, 5, 1).

Alternatively, we can use named matching, which uses the argument name in the function call. Named matching can be a bit more stable when you’re writing your own functions (in case you decide to change the order of the inputs, for example) but isn’t strictly necessary. Named arguments can be supplied in any order: sim_mean_sd(n = 30, mu = 5, sd = 1) is equivalent to sim_mean_sd(sd = 1, n = 30, mu = 5).

Revisiting past examples

There have been a couple of times in this class that I’ve had to write code I didn’t like, because it would have made sense to write a function. We’ll revisit those quickly to see how we could improve our code.

Loading LoTR data

In tidy data, we broke the “only copy code twice” rule when we used the code below to process the LoTR words data:

fellowship_ring = readxl::read_excel("./data/LotR_Words.xlsx", range = "B3:D6") |>
  mutate(movie = "fellowship_ring")

two_towers = readxl::read_excel("./data/LotR_Words.xlsx", range = "F3:H6") |>
  mutate(movie = "two_towers")

return_king = readxl::read_excel("./data/LotR_Words.xlsx", range = "J3:L6") |>
  mutate(movie = "return_king")

lotr_tidy = bind_rows(fellowship_ring, two_towers, return_king) |>
  janitor::clean_names() |>
  pivot_longer(
    female:male,
    names_to = "sex",
    values_to = "words") |> 
  mutate(race = str_to_lower(race)) |> 
  select(movie, everything())

Learning Assessment: Try to write a function that can be used to abstract the data loading and cleaning process. Use this function to recreate the tidied LoTR dataset.

Solution

The function below will read in and clean LoTR data – it differs from the previous code by including some data tidying steps in the function rather than after data have been combined, but produces the same result.

lotr_load_and_tidy = function(path, range, movie_name) {
  
  df = 
    readxl::read_excel(path, range = range) |>
    janitor::clean_names() |>
    pivot_longer(
      female:male,
      names_to = "sex",
      values_to = "words") |>
    mutate(
      race = str_to_lower(race),
      movie = movie_name) |> 
    select(movie, everything())
  
  df
  
}

lotr_tidy = 
  bind_rows(
    lotr_load_and_tidy("data/LotR_Words.xlsx", "B3:D6", "fellowship_ring"),
    lotr_load_and_tidy("data/LotR_Words.xlsx", "F3:H6", "two_towers"),
    lotr_load_and_tidy("data/LotR_Words.xlsx", "J3:L6", "return_king"))

Having a function that handles the loading and cleaning is great – if I decide I want to change the tidying step, I only have to do it once, and I don’t have to worry about mistakes creeping in through copy-and-paste errors!

NSDUH

In reading data from the web, we wrote code that allowed us to scrape information in from a page containing NSDUH results. We refined the resulting table in strings and factors. A version of our prior code is below.

nsduh_url = "http://samhda.s3-us-gov-west-1.amazonaws.com/s3fs-public/field-uploads/2k15StateFiles/NSDUHsaeShortTermCHG2015.htm"

nsduh_html = read_html(nsduh_url)

data_marj = 
  nsduh_html |> 
  html_table() |> 
  nth(1) |>
  slice(-1) |> 
  select(-contains("P Value")) |>
  pivot_longer(
    -State,
    names_to = "age_year", 
    values_to = "percent") |>
  separate(age_year, into = c("age", "year"), sep = "\\(") |>
  mutate(
    year = str_replace(year, "\\)", ""),
    percent = str_replace(percent, "[a-c]$", ""),
    percent = as.numeric(percent)) |>
  filter(!(State %in% c("Total U.S.", "Northeast", "Midwest", "South", "West")))

Let’s write a quick function to scrape review information for other tables on this page. We’ll pass in the HTML data as an argument so we don’t scrape it each time, along with a number and name for the table we want to process.

nsduh_table <- function(html, table_num, table_name) {
  
  table = 
    html |> 
    html_table() |> 
    nth(table_num) |>
    slice(-1) |> 
    select(-contains("P Value")) |>
    pivot_longer(
      -State,
      names_to = "age_year", 
      values_to = "percent") |>
    separate(age_year, into = c("age", "year"), sep = "\\(") |>
    mutate(
      year = str_replace(year, "\\)", ""),
      percent = str_replace(percent, "[a-c]$", ""),
      percent = as.numeric(percent),
      name = table_name) |>
    filter(!(State %in% c("Total U.S.", "Northeast", "Midwest", "South", "West")))
  
  table
  
}

Next we’ll use this to get a few different tables and combine the results.

nsduh_results = 
  bind_rows(
    nsduh_table(nsduh_html, 1, "marj_one_year"),
    nsduh_table(nsduh_html, 4, "cocaine_one_year"),
    nsduh_table(nsduh_html, 5, "heroin_one_year")
  )

In addition to being less likely to result in errors than copy-and-pasting, using a function is clearer: once the purpose of the function is known, using that function to extract separate tables uses only three easy-to-understand lines of code.

Functions as arguments

One powerful tool is the ability to pass functions as arguments into functions. This might seem like a weird thing to do, but it has a lot of handy applications – we’ll see just how far it goes in the next modules in this topic.

As a quick example, suppose we wanted to get a sense of how similar or different values in a vector are to each other. There are lots of ways to measure this – variance, standard deviation, range, inter-quartile range – and some are more appropriate in some cases than in others. The function below allows you to input a vector and a function, and returns the result of applying the specified function to the vector input.

x_vec = rnorm(25, 0, 1)

my_summary = function(x, summ_func) {
  summ_func(x)
}

my_summary(x_vec, sd)

## [1] 0.753654

my_summary(x_vec, IQR)

## [1] 1.083116

my_summary(x_vec, var)

## [1] 0.5679943

This example is pretty trivial – you could just apply those functions directly to x and skip the hassle – but in many cases the idea of passing functions as arguments is really powerful. As a practical example, remember that you can reorder factors according to different summaries in fct_reorder!

Scoping and names

Take a look at the code below. Will the call f(x = y) work? If so, what will it produce? What is the current value of x, y, and z?

f = function(x) {
  z = x + y
  z
}

x = 1
y = 2

f(x = y)

Examples like this are tricky, but emphasize an issue that comes up a lot in writing functions: you define a variable in your global environment and use it in your function, but it isn’t passed as an argument. This is easy to miss, especially when you go from code written in chunks to a function, and can be hard to track down if you empty your working directory or change a variable name. The best advice I have is to give your arguments useful names and think carefully about where everything is defined, and to periodically restart R and try everything again!

Other materials

There are chapters on functions in R for Data Science and Advanced R
Jenny Bryan has a three part series on writing functions (each part is short)
R Programming for Data Science has nice chapters on functions and scoping
The Basics of UNIX Philosophy, linked to above, apply broadly to designing code

The code that I produced working examples in lecture is here.