If you use the same code twice, you need a function – this will improve code readability, facilitate troubleshooting, and reduce chances for mistakes. This content looks at the best approaches for writing R functions.

This is the first module in the Iteration topic.

Create simple R functions to abstract common processes.

For this topic, I’ll create a GitHub repo + directory / R Project
called `iteration`

. I’ll write code for today’s content in a
new R Markdown document called `writing_functions.Rmd`

, and
I’m going to load the usual packages. I’m also setting the seed so that
the output on this page is fixed.

```
library(tidyverse)
library(rvest)
set.seed(1)
```

The best way to build up a function is to start with code you’ve written outside a function. To see how this might work, I’ll start with a simple example: the chunk below takes a sample from a normal distribution and then computes the vector of Z scores for the sample.

```
x_vec = rnorm(25, mean = 5, sd = 3)
(x_vec - mean(x_vec)) / sd(x_vec)
```

```
## [1] -0.83687228 0.01576465 -1.05703126 1.50152998 0.16928872 -1.04107494
## [7] 0.33550276 0.59957343 0.42849461 -0.49894708 1.41364561 0.23279252
## [13] -0.83138529 -2.50852027 1.00648110 -0.22481531 -0.19456260 0.81587675
## [19] 0.68682298 0.44756609 0.78971253 0.64568566 -0.09904161 -2.27133861
## [25] 0.47485186
```

If I want to repeat this (admittedly simple) process for lots of
samples, I might want to have a function that takes the sample as an
*argument*, computes the vector of Z scores in the *body*,
and *returns* the result. I define such a function below.

```
z_scores = function(x) {
z = (x - mean(x)) / sd(x)
z
}
z_scores(x_vec)
```

```
## [1] -0.83687228 0.01576465 -1.05703126 1.50152998 0.16928872 -1.04107494
## [7] 0.33550276 0.59957343 0.42849461 -0.49894708 1.41364561 0.23279252
## [13] -0.83138529 -2.50852027 1.00648110 -0.22481531 -0.19456260 0.81587675
## [19] 0.68682298 0.44756609 0.78971253 0.64568566 -0.09904161 -2.27133861
## [25] 0.47485186
```

I can try this with a few samples and confirm that it works. I should also try to think of ways this code might break; the attempts below try a variety of inputs to see what happens.

`z_scores(3)`

`## [1] NA`

`z_scores("my name is jeff")`

`## Error in x - mean(x): non-numeric argument to binary operator`

`z_scores(iris)`

`## Error in is.data.frame(x): 'list' object cannot be coerced to type 'double'`

`z_scores(sample(c(TRUE, FALSE), 25, replace = TRUE))`

```
## [1] -0.7348469 1.3063945 -0.7348469 -0.7348469 1.3063945 1.3063945
## [7] -0.7348469 -0.7348469 -0.7348469 1.3063945 1.3063945 -0.7348469
## [13] -0.7348469 -0.7348469 -0.7348469 -0.7348469 -0.7348469 1.3063945
## [19] -0.7348469 -0.7348469 -0.7348469 -0.7348469 1.3063945 1.3063945
## [25] 1.3063945
```

These all did something I didn’t want, but only two returned errors. To avoid behavior you don’t want (i.e. to “fail noisily and as soon as possible”) we’ll add some checks on the argument values using conditional statements.

```
z_scores = function(x) {
if (!is.numeric(x)) {
stop("Argument x should be numeric")
} else if (length(x) == 1) {
stop("Z scores cannot be computed for length 1 vectors")
}
z = mean(x) / sd(x)
z
}
```

Fantastic – we have a pretty solid function for computing Z scores!

In some cases it might be better to return the mean and standard deviation instead of the Z scores. A first option is to store each of the values in a named list, and to return that list. (We’ll talk more about lists in iteration and listcols.)

```
mean_and_sd = function(x) {
if (!is.numeric(x)) {
stop("Argument x should be numeric")
} else if (length(x) == 1) {
stop("Cannot be computed for length 1 vectors")
}
mean_x = mean(x)
sd_x = sd(x)
list(mean = mean_x,
sd = sd_x)
}
```

Alternatively, we might store values in a data frame.

```
mean_and_sd = function(x) {
if (!is.numeric(x)) {
stop("Argument x should be numeric")
} else if (length(x) == 1) {
stop("Cannot be computed for length 1 vectors")
}
mean_x = mean(x)
sd_x = sd(x)
tibble(
mean = mean_x,
sd = sd_x
)
}
```

In general, either of these will be fine; which one you choose will depend on what kind of values you want to return, and what you plan to do with the function itself. If you want to return the original sample along with the computed values, a list might make sense; if you plan to run your function a lot and study the results, having a data frame will make it easier to use other tools. We’ll see more of that in iteration and simulation.

As exciting as it is to compute Z scores, let’s start setting our sights higher. I’d like to have a function that takes a given sample size along with a true mean and standard deviation, simulates data from a normal distribution, and returns the estimated mean and standard deviation. I’ll start from the code below.

```
sim_data = tibble(
x = rnorm(30, mean = 2, sd = 3)
)
sim_data %>%
summarize(
mu_hat = mean(x),
sigma_hat = sd(x)
)
```

```
## # A tibble: 1 × 2
## mu_hat sigma_hat
## <dbl> <dbl>
## 1 1.63 3.06
```

You should take a few minutes to examine this code – make a plot of the simulated data to make sure it “makes sense”, take a look at the result of computing the mean and standard deviation, etc.

Once you’re satisfied, it’s time to wrap things up in a function. I’d
like to be able to change the sample size and parameters, so those will
be my *arguments*; the code that simulates data and computes the
sample mean and SD go in the *body*; and the *return*
statement should include the estimates. A function that does all this,
using default values for the mean and standard deviation, is below.

```
sim_mean_sd = function(n, mu = 2, sigma = 3) {
sim_data = tibble(
x = rnorm(n, mean = mu, sd = sigma),
)
sim_data %>%
summarize(
mu_hat = mean(x),
sigma_hat = sd(x)
)
}
```

Repeated calls to `sim_mean_sd()`

will give a sense of
sampling variability in estimating the mean and standard deviation from
a sample; take a few minutes to run `sim_mean_sd(30)`

a few
times, and then to run `sim_mean_sd(300)`

, and think about
the results. We’ll examine this more formally in iteration and in simulation.

This is also a good time to point out how R handles argument
matching. We can use *positional* matching, meaning the first
value supplied is taken to be the first argument, the second value is
the second argument, and so on. We do this with `tidyverse`

functions a lot; the first argument is always a dataframe, and we just
supply that dataframe in the first position. We also use positional
matching when we call `mean(x)`

or
`sim_mean_sd(30, 5, 1)`

.

Alternatively, we can use *named* matching, which uses the
argument name in the function call. Named matching can be a bit more
stable when you’re writing your own functions (in case you decide to
change the order of the inputs, for example) but isn’t strictly
necessary. Named arguments can be supplied in any order:
`sim_mean_sd(n = 30, mu = 5, sd = 1)`

is equivalent to
`sim_mean_sd(sd = 1, n = 30, mu = 5)`

.

There have been a couple of times in this class that I’ve had to write code I didn’t like, because it would have made sense to write a function. We’ll revisit those quickly to see how we could improve our code.

In reading data from the web, we wrote code that allowed us to scrape information in Amazon reviews. That code is below.

```
url = "https://www.amazon.com/product-reviews/B00005JNBQ/ref=cm_cr_arp_d_viewopt_rvwer?ie=UTF8&reviewerType=avp_only_reviews&sortBy=recent&pageNumber=1"
dynamite_html = read_html(url)
review_titles =
dynamite_html %>%
html_nodes(".a-text-bold span") %>%
html_text()
review_stars =
dynamite_html %>%
html_nodes("#cm_cr-review_list .review-rating") %>%
html_text() %>%
str_extract("^\\d") %>%
as.numeric()
review_text =
dynamite_html %>%
html_nodes(".review-text-content span") %>%
html_text() %>%
str_replace_all("\n", "") %>%
str_trim()
reviews = tibble(
title = review_titles,
stars = review_stars,
text = review_text
)
```

Let’s write a quick function to scrape review information for any URL
to an Amazon review page. Note that this contains some extra code for
manipulating the `review_text`

– our previous code worked for
the first page, but broke on a later page. The added lines remove
entries in `review_text`

that come from embedded images after
extracting text from HTML using CSS tags.

```
read_page_reviews <- function(url) {
html = read_html(url)
review_titles =
html %>%
html_nodes(".a-text-bold span") %>%
html_text()
review_stars =
html %>%
html_nodes("#cm_cr-review_list .review-rating") %>%
html_text() %>%
str_extract("^\\d") %>%
as.numeric()
review_text =
html %>%
html_nodes(".review-text-content span") %>%
html_text() %>%
str_replace_all("\n", "") %>%
str_trim() %>%
str_subset("The media could not be loaded.", negate = TRUE) %>%
str_subset("^$", negate = TRUE)
tibble(
title = review_titles,
stars = review_stars,
text = review_text
)
}
```

Next we’ll use this to read in reviews from a few pages and combine the results.

```
url_base = "https://www.amazon.com/product-reviews/B00005JNBQ/ref=cm_cr_arp_d_viewopt_rvwer?ie=UTF8&reviewerType=avp_only_reviews&sortBy=recent&pageNumber="
vec_urls = str_c(url_base, 1:5)
dynamite_reviews = bind_rows(
read_page_reviews(vec_urls[1]),
read_page_reviews(vec_urls[2]),
read_page_reviews(vec_urls[3]),
read_page_reviews(vec_urls[4]),
read_page_reviews(vec_urls[5])
)
dynamite_reviews
```

In tidy data, we broke the “only copy code twice” rule when we used the code below to process the LoTR words data:

```
fellowship_ring = readxl::read_excel("./data/LotR_Words.xlsx", range = "B3:D6") %>%
mutate(movie = "fellowship_ring")
two_towers = readxl::read_excel("./data/LotR_Words.xlsx", range = "F3:H6") %>%
mutate(movie = "two_towers")
return_king = readxl::read_excel("./data/LotR_Words.xlsx", range = "J3:L6") %>%
mutate(movie = "return_king")
lotr_tidy = bind_rows(fellowship_ring, two_towers, return_king) %>%
janitor::clean_names() %>%
gather(key = sex, value = words, female:male) %>%
mutate(race = str_to_lower(race)) %>%
select(movie, everything())
```

** Learning Assessment:** Try to write a
function that can be used to abstract the data loading and cleaning
process. Use this function to recreate the tidied LoTR dataset.

The function below will read in and clean LoTR data – it differs from the previous code by including some data tidying steps in the function rather than after data have been combined, but produces the same result.

```
lotr_load_and_tidy = function(path, range, movie_name) {
df = readxl::read_excel(path, range = range) %>%
janitor::clean_names() %>%
gather(key = sex, value = words, female:male) %>%
mutate(race = str_to_lower(race),
movie = movie_name)
df
}
lotr_tidy =
bind_rows(
lotr_load_and_tidy("./data/LotR_Words.xlsx", "B3:D6", "fellowship_ring"),
lotr_load_and_tidy("./data/LotR_Words.xlsx", "F3:H6", "two_towers"),
lotr_load_and_tidy("./data/LotR_Words.xlsx", "J3:L6", "return_king")) %>%
select(movie, everything())
```

Having a function that handles the loading and cleaning is great – if I decide I want to change the tidying step, I only have to do it once, and I don’t have to worry about mistakes creeping in through copy-and-paste errors!

One powerful tool is the ability to pass functions as arguments into functions. This might seem like a weird thing to do, but it has a lot of handy applications – we’ll see just how far it goes in the next modules in this topic.

As a quick example, suppose we wanted to get a sense of how similar or different values in a vector are to each other. There are lots of ways to measure this – variance, standard deviation, range, inter-quartile range – and some are more appropriate in some cases than in others. The function below allows you to input a vector and a function, and returns the result of applying the specified function to the vector input.

```
x_vec = rnorm(25, 0, 1)
my_summary = function(x, summ_func) {
summ_func(x)
}
my_summary(x_vec, sd)
```

`## [1] 0.753654`

`my_summary(x_vec, IQR)`

`## [1] 1.083116`

`my_summary(x_vec, var)`

`## [1] 0.5679943`

This example is pretty trivial – you could just apply those functions
directly to `x`

and skip the hassle – but in many cases the
idea of passing functions as arguments is really powerful. As a
practical example, remember that you can reorder factors according to
different summaries in `fct_reorder`

!

Take a look at the code below. Will the call `f(x = y)`

work? If so, what will it produce? What is the current value of
`x`

, `y`

, and `z`

?

```
f = function(x) {
z = x + y
z
}
x = 1
y = 2
f(x = y)
```

Examples like this are tricky, but emphasize an issue that comes up a lot in writing functions: you define a variable in your global environment and use it in your function, but it isn’t passed as an argument. This is easy to miss, especially when you go from code written in chunks to a function, and can be hard to track down if you empty your working directory or change a variable name. The best advice I have is to give your arguments useful names and think carefully about where everything is defined, and to periodically restart R and try everything again!

- There are chapters on functions in R for Data Science and Advanced R
- Jenny Bryan has a three part series on writing functions (each part is short)
- R Programming for Data Science has nice chapters on functions and scoping
- The Basics of UNIX Philosophy, linked to above, apply broadly to designing code

The code that I produced working examples in lecture is here.