34 Column-wise operations in dplyr

Throughout the chapters in this book we have learned to do a really vast array of useful data transformations and statistical analyses with the help of the dplyr package.

So far, however, we’ve always done these transformations and statistical analyses on one column of our data frame at a time. There isn’t anything inherently “wrong” with this approach, but, for reasons we’ve already discussed, there are often advantages to telling R what you want to do one time, and then asking R to do that thing repeatedly across all, or a subset of, the columns in your data frame. That is exactly what dplyr’s across() function allows us to do.

There are so many ways we might want to use the across() function in our R programs. I can’t begin to cover, or even imagine, them all. Instead, the goal of this chapter is just to provide you with an overview of the across() function and show you some examples of using it with filter(), mutate(), and summarise() to get you thinking about how you might want to use it in your R programs.

Before we discuss further, let’s take a look at a quick example. The first thing we will need to do is load dplyr.

library(dplyr)

Then, we will simulate some data. In this case, we are creating a data frame that contains three columns of 10 random numbers:

set.seed(123)
df_xyz <- tibble(
  x = rnorm(10),
  y = rnorm(10),
  z = rnorm(10)
) %>% 
  print()
## # A tibble: 10 × 3
##          x      y      z
##      <dbl>  <dbl>  <dbl>
##  1 -0.560   1.22  -1.07 
##  2 -0.230   0.360 -0.218
##  3  1.56    0.401 -1.03 
##  4  0.0705  0.111 -0.729
##  5  0.129  -0.556 -0.625
##  6  1.72    1.79  -1.69 
##  7  0.461   0.498  0.838
##  8 -1.27   -1.97   0.153
##  9 -0.687   0.701 -1.14 
## 10 -0.446  -0.473  1.25

Up to this point, if we wanted to find the mean of each column, we would probably have written code like this:

df_xyz %>% 
  summarise(
    x_mean = mean(x),
    y_mean = mean(y),
    z_mean = mean(y)
  )
## # A tibble: 1 × 3
##   x_mean y_mean z_mean
##    <dbl>  <dbl>  <dbl>
## 1 0.0746  0.209  0.209

With the help of the across() function, we can now get the mean of each column like this:

df_xyz %>%
  summarise(
    across(
      .cols  = everything(),
      .fns   = mean,
      .names = "{col}_mean"
    )
  )
## # A tibble: 1 × 3
##   x_mean y_mean z_mean
##    <dbl>  <dbl>  <dbl>
## 1 0.0746  0.209 -0.425

Now, you might ask why this is a better approach. Fair question.

In this case, using across() doesn’t actually reduce the number of lines of code we wrote. In fact, we wrote two additional lines when we used the across() function. However, imagine if we added 20 additional columns to our data frame. Using the first approach, we would have to write 20 additional lines of code inside the summarise() function. Using the across() approach, we wouldn’t have to change our code at all.

Perhaps more importantly, did you notice that I “accidently” forgot to replace y with z when I copied and pasted z_mean = mean(y) in the code chunk for the first approach? That mistake is fairly easy to catch and fix in this very simple example. But, in real-world projects, mistakes like this are easy to make, and not always so easy to catch. We are much less likely to make similar mistakes when we use across().

34.1 The across() function

As of this writing, the across function is a relatively new addition to dplyr. You will only be able to use across() if you have dplyr 1.0.0 (released 2020-06-01) or higher. If you aren’t sure what version you have, you can copy and paste packageVersion("dplyr") into your R console. If you need to update your dplyr version, you can copy and paste install.packages("dplyr") into your R console.

We will always use across() inside of one of the dplyr verbs we’ve been learning about. Specifically, filter(), mutate(), and summarise(). We will not use across() outside of the dplyr verbs. Additionally, we will always use across() within the context of a data frame.

To view the help documentation for across(), you can copy and paste ?dplyr::across into your R console. If you do, you will see that across() has four arguments. They are:

1️⃣.cols. The value we pass to this argument should be columns of the data frame we want to operate on. We can once again use tidy-select argument modifiers here. In the example above, we used the everything() tidy-select modifier to tell R that we wanted to operate on all of the columns in the data frame.

2️⃣.fns. This is where you tell across() what function, or functions, you want to apply to the columns you selected in .cols. In the example above, we passed the mean function to the .fns argument. Notice that we typed mean without the parentheses.

3️⃣.... In this case, the ... argument is where we pass any additional arguments to the function we passed to the .fns argument. For example, we passed the mean function to the .fns argument above. In the data frame above, none of the columns had any missing values. Let’s go ahead and add some missing values so that we can take a look at how ... works in across().

df_xyz$x[2] <- NA_real_
df_xyz$y[4] <- NA_real_
df_xyz$z[6] <- NA_real_
df_xyz
## # A tibble: 10 × 3
##          x      y      z
##      <dbl>  <dbl>  <dbl>
##  1 -0.560   1.22  -1.07 
##  2 NA       0.360 -0.218
##  3  1.56    0.401 -1.03 
##  4  0.0705 NA     -0.729
##  5  0.129  -0.556 -0.625
##  6  1.72    1.79  NA    
##  7  0.461   0.498  0.838
##  8 -1.27   -1.97   0.153
##  9 -0.687   0.701 -1.14 
## 10 -0.446  -0.473  1.25

As we’ve already seen many times, R won’t drop the missing values and carry out a complete case analysis by default:

df_xyz %>% 
  summarise(
    x_mean = mean(x),
    y_mean = mean(y),
    z_mean = mean(y)
  )
## # A tibble: 1 × 3
##   x_mean y_mean z_mean
##    <dbl>  <dbl>  <dbl>
## 1     NA     NA     NA

Instead, we have to explicitly tell R to carry out a complete case analysis. We can do so by filtering our rows with missing data (more on this soon) or by changing the value of the mean() function’s na.rm argument from FALSE (the default) to TRUE:

df_xyz %>% 
  summarise(
    x_mean = mean(x, na.rm = TRUE),
    y_mean = mean(y, na.rm = TRUE),
    z_mean = mean(z, na.rm = TRUE)
  )
## # A tibble: 1 × 3
##   x_mean y_mean z_mean
##    <dbl>  <dbl>  <dbl>
## 1  0.108  0.220 -0.284

When we use across(), we will need to pass the na.rm = TRUE to the mean() function in across()’s ... argument like this:

df_xyz %>%
  summarise(
    across(
      .cols  = everything(),
      .fns   = mean,
      na.rm  = TRUE,
      .names = "{col}_mean"
    )
  )
## # A tibble: 1 × 3
##   x_mean y_mean z_mean
##    <dbl>  <dbl>  <dbl>
## 1  0.108  0.220 -0.284

Notice that we do not actually type out ... = or anything like that.

4️⃣.names. You can use this argument to adjust the column names that will result from the operation you pass to .fns. In the example above, we used the special {cols} keyword to use each of the column names that were passed to the .cols argument as the first part of each of the new columns’ names. Then, we asked R to add a literal underscore and the word “mean” because these are all mean values. That resulted in the new column names you see above. The default value for .names is just {cols}. So, if we hadn’t modified the value passed to the .names argument, our results would have looked like this:

df_xyz %>%
  summarise(
    across(
      .cols  = everything(),
      .fns   = mean,
      na.rm  = TRUE
    )
  )
## # A tibble: 1 × 3
##       x     y      z
##   <dbl> <dbl>  <dbl>
## 1 0.108 0.220 -0.284

There is also a special {fn} keyword that we can use to pass the name of each of the functions we used in .fns as part of the new column names. However, in order to get {fn} to work the way we want it to, we have to pass a list of name-function pairs to the .fns argument. Let me show you what I mean.

First, we will keep the code exactly as it was, but replace “mean” with “{fn}” in the .names argument:

df_xyz %>%
  summarise(
    across(
      .cols  = everything(),
      .fns   = mean, 
      na.rm  = TRUE,
      .names = "{col}_{fn}"
    )
  )
## # A tibble: 1 × 3
##     x_1   y_1    z_1
##   <dbl> <dbl>  <dbl>
## 1 0.108 0.220 -0.284

This is not the result we wanted. Because, we didn’t name the function that we passed to .fns, across() essentially used “function number 1” as its name. In order to get the result we want, we need to pass a list of name-function pairs to the .fns argument like this:

df_xyz %>% 
  summarise(
    across(
      .cols  = everything(),
      .fns   = list(mean = mean),
      na.rm  = TRUE,
      .names = "{col}_{fn}"
    )
  )
## # A tibble: 1 × 3
##   x_mean y_mean z_mean
##    <dbl>  <dbl>  <dbl>
## 1  0.108  0.220 -0.284

Although it may not be self-evident from just looking at the code above, the first mean in the list(mean = mean) name-function pair is a name that we are choosing to be passed to the new column names. Theoretically, we could have picked any name. For example:

df_xyz %>% 
  summarise(
    across(
      .cols  = everything(),
      .fns   = list(r4epi = mean),
      na.rm  = TRUE,
      .names = "{col}_{fn}"
    )
  )
## # A tibble: 1 × 3
##   x_r4epi y_r4epi z_r4epi
##     <dbl>   <dbl>   <dbl>
## 1   0.108   0.220  -0.284

The second mean in the list(mean = mean) name-function pair is the name of the actual function we want to apply to the columns in .cols. This part of the name-function pair must be the name of the function that we actually want to apply to the columns in .cols. Otherwise, we will get an error:

df_xyz %>% 
  summarise(
    across(
      .cols  = everything(),
      .fns   = list(mean = r4epi),
      na.rm  = TRUE,
      .names = "{col}_{fn}"
    )
  )
## Error: Problem with `summarise()` input `..1`.
## ℹ `..1 = across(...)`.
## x object 'r4epi' not found

An additional advantage of passing a list of name-function pairs to the .fns argument is that we can pass multiple functions at once. For example, let’s say that we want the minimum and maximum value of each column in our data frame. Without across() we might do that analysis like this:

df_xyz %>% 
  summarise(
    x_min = min(x, na.rm = TRUE),
    x_max = max(x, na.rm = TRUE),
    y_min = min(y, na.rm = TRUE),
    y_max = max(y, na.rm = TRUE),
    z_min = min(z, na.rm = TRUE),
    z_max = max(z, na.rm = TRUE)
  )
## # A tibble: 1 × 6
##   x_min x_max y_min y_max z_min z_max
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -1.27  1.72 -1.97  1.79 -1.14  1.25

But, we can simply pass min and max as a list of name-function pairs if we use across():

df_xyz %>% 
  summarise(
    across(
      .cols  = everything(),
      .fns   = list(min = min, max = max),
      na.rm  = TRUE,
      .names = "{col}_{fn}"
    )
  )
## # A tibble: 1 × 6
##   x_min x_max y_min y_max z_min z_max
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -1.27  1.72 -1.97  1.79 -1.14  1.25

How great is that?!?

So, we’ve seen how to pass an individual function to the .fns argument and we’ve seen how to pass a list containing multiple functions to the .fns argument. There is actually a third syntax for passing functions to the .fns argument. The across() documentation calls it “a purrr-style lambda.” This can be a little bit confusing, so I’m going to show you an example, and then walk through it step by step.

df_xyz %>% 
  summarise(
    across(
      .cols  = everything(),
      .fns   = ~ mean(.x, na.rm = TRUE),
      .names = "{col}_mean"
    )
  )
## # A tibble: 1 × 3
##   x_mean y_mean z_mean
##    <dbl>  <dbl>  <dbl>
## 1  0.108  0.220 -0.284

The purrr-style lambda always begins with the tilde symbol (~). Then we type out a function call behind the tilde symbol. We place the special .x symbol inside the function call where we would normally want to type the name of the column we want the function to operate on. The across() function will then substitute each column name we passed to the .cols argument for .x sequentially. In the example above, there isn’t really any good reason to use this syntax. However, this syntax can be useful at times. We will see some examples below.

34.2 Across with filter

We’ve already discussed complete case analyses multiple times in this book. That is, including only the rows from our data frame that don’t have any missing values in our analysis. We’ve already seen how we can use the filter() function to remove the rows of a single column where the data are missing. For example:

df_xyz %>% 
  filter(!is.na(x)) %>% 
  summarise(mean = mean(x))
## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1 0.108

We can use the across() function, inside the filter() function, to remove all rows with missing values from our data frame:

df_xyz %>% 
  filter(
    across(
      .cols = everything(),
      .fns  = !is.na
    )
  )
## Error: Problem with `filter()` input `..1`.
## ℹ Input `..1` is `across(.cols = everything(), .fns = !is.na)`.
## x invalid argument type

Wait. Why didn’t this work? Can you spot the problem?

The reason the code above didn’t work is because we actually passed two functions to the .fns argument. Remember, operators in R are functions. So, !is.na is two functions – ! and is.na().

Hmmm, well, above we passed two functions to the .fns argument using a list of name-function pairs. Maybe we should try that:

df_xyz %>% 
  filter(
    across(
      .cols = everything(),
      .fns  = list(not = `!`, missing = is.na)
    )
  )
## # A tibble: 0 × 3
## # … with 3 variables: x <dbl>, y <dbl>, z <dbl>

That didn’t work either. One of the problems is that the syntax above applies each function to the columns separately, but we want them applied to the columns sequentially. We don’t want !(x) and separately is.na(x). We want !is.na(x). To solve this problem, we need pass one function to .fns that executes !is.na() for each column. It sounds like we may need to write our own function.

Let’s think back to the previous chapter. How might we write this function? Well, this is one way we could write it:

not_missing <- function(col) {
  !is.na(col)
}

Next, let’s test it out on a single column:

df_xyz %>% 
  filter(not_missing(x))
## # A tibble: 9 × 3
##         x      y      z
##     <dbl>  <dbl>  <dbl>
## 1 -0.560   1.22  -1.07 
## 2  1.56    0.401 -1.03 
## 3  0.0705 NA     -0.729
## 4  0.129  -0.556 -0.625
## 5  1.72    1.79  NA    
## 6  0.461   0.498  0.838
## 7 -1.27   -1.97   0.153
## 8 -0.687   0.701 -1.14 
## 9 -0.446  -0.473  1.25

It works! The result above only includes rows from our data frame where the value of the x column was non-missing. We can now use our not_missing() function inside of across():

df_xyz %>% 
  filter(
    across(
      .cols = everything(),
      .fns  = not_missing
    )
  )
## # A tibble: 7 × 3
##        x      y      z
##    <dbl>  <dbl>  <dbl>
## 1 -0.560  1.22  -1.07 
## 2  1.56   0.401 -1.03 
## 3  0.129 -0.556 -0.625
## 4  0.461  0.498  0.838
## 5 -1.27  -1.97   0.153
## 6 -0.687  0.701 -1.14 
## 7 -0.446 -0.473  1.25

That is exactly the result we wanted!

The not_missing() function we wrote above works great. But, what if we needed it for this specific case only, and we knew we weren’t going to need it again at any other place in our program. Well, in that circumstance we may choose to use an anonymous function instead of a named function. What do I mean by that? Well, when we created not_missing() above, we gave it a name – “not_missing.” Later, in a separate code chunk, we called the not_missing() function by passing its name to the .fns argument. In other words, there were two steps: creating the function and calling the function. Because, creating and calling were separate steps, we had to give the function a name. How else we would call it at a later time?

However, it is possible to essentially combine these two steps into one. In other words, we can call, or execute the code inside our function, at the moment the function is created. When we do so, there is no need to name the function. Therefore, it’s anonymous. Aside from not assigning the function a name, however, it is identical to the named version. As a reminder, here’s the code we used to define, or create, the not_missing() function:

not_missing <- function(col) {
  !is.na(col)
}

Now, here’s how we might use the same code as an anonymous function:

df_xyz %>% 
  filter(
    across(
      .cols = everything(),
      .fns  = function(col) {
        !is.na(col)
      }
    )
  )
## # A tibble: 7 × 3
##        x      y      z
##    <dbl>  <dbl>  <dbl>
## 1 -0.560  1.22  -1.07 
## 2  1.56   0.401 -1.03 
## 3  0.129 -0.556 -0.625
## 4  0.461  0.498  0.838
## 5 -1.27  -1.97   0.153
## 6 -0.687  0.701 -1.14 
## 7 -0.446 -0.473  1.25

It is the exact same code and produces the exact same result. The only difference is that we don’t assign a name to the function and it isn’t stored in our global environment to be called again in the future. If we don’t care about calling it again in the future, then this method allows us to type fewer lines of code and reduce the amount of unnecessary “stuff” in our global environment. Pretty cool!

There’s actually a way that we can reduce the amount of code we typed above even more. Remember, that the .fns argument will accept purrr-style lambdas. In this context, “lambda” is really just another way of saying “anonymous function,” and using the purrr-style lambda syntax requires us to type fewer characters.

We first need to replace function(col) with a ~ to convert the traditional anonymous function syntax we used in the code chunk above to the purrr-style lambda syntax:

df_xyz %>% 
  filter(
    across(
      .cols = everything(),
      .fns  = ~ {
        !is.na(col)
      }
    )
  )
## Warning in is.na(col): is.na() applied to non-(list or vector) of type 'closure'

## Warning in is.na(col): is.na() applied to non-(list or vector) of type 'closure'

## Warning in is.na(col): is.na() applied to non-(list or vector) of type 'closure'
## # A tibble: 10 × 3
##          x      y      z
##      <dbl>  <dbl>  <dbl>
##  1 -0.560   1.22  -1.07 
##  2 NA       0.360 -0.218
##  3  1.56    0.401 -1.03 
##  4  0.0705 NA     -0.729
##  5  0.129  -0.556 -0.625
##  6  1.72    1.79  NA    
##  7  0.461   0.498  0.838
##  8 -1.27   -1.97   0.153
##  9 -0.687   0.701 -1.14 
## 10 -0.446  -0.473  1.25

However, this code doesn’t work because R doesn’t know what the col inside !is.na(col) is. We are no longer defining it with function(col). Do you remember which special symbol we can use to pass the function in .fns each of the columns passed to .cols?

We can use the special .x symbol inside the function call where we would normally want to type the name of the column we want the function to operate on:

df_xyz %>% 
  filter(
    across(
      .cols = everything(),
      .fns  = ~ {
        !is.na(.x) # Use the special .x
      }
    )
  )
## # A tibble: 7 × 3
##        x      y      z
##    <dbl>  <dbl>  <dbl>
## 1 -0.560  1.22  -1.07 
## 2  1.56   0.401 -1.03 
## 3  0.129 -0.556 -0.625
## 4  0.461  0.498  0.838
## 5 -1.27  -1.97   0.153
## 6 -0.687  0.701 -1.14 
## 7 -0.446 -0.473  1.25

Because our function body is on one line only, we can go ahead and get rid of the curly braces too:

df_xyz %>% 
  filter(
    across(
      .cols = everything(),
      .fns  = ~ !is.na(.x)
    )
  )
## # A tibble: 7 × 3
##        x      y      z
##    <dbl>  <dbl>  <dbl>
## 1 -0.560  1.22  -1.07 
## 2  1.56   0.401 -1.03 
## 3  0.129 -0.556 -0.625
## 4  0.461  0.498  0.838
## 5 -1.27  -1.97   0.153
## 6 -0.687  0.701 -1.14 
## 7 -0.446 -0.473  1.25

In this section, we learned how to use the across() function inside the filter() function to drop all rows with a missing value from our data frame in an efficient way. This required us to write our own function, and we learned three different methods for passing the function we wrote to across(). We passed a named function, an anonymous function, and a purrr-style lambda. In this case, the method you choose to use is largely a matter of preference.

34.3 Across with mutate

We’ve already seen a number of examples of manipulating columns of our data frames using the mutate() function. In this section, we are going to take a look at two examples where using the across() function inside mutate() will allow us to apply the same manipulation to multiple columns in our data frame at once.

Let’s go ahead and simulate the same demographics data frame we simulated for the recoding missing section of the conditional operations chapter. Let’s also add two new columns: a four-category education column and a six-category income column. For all columns except id and age, a value of 7 represents “Don’t know” and a value of 9 represents “refused.”

set.seed(123)
demographics <- tibble(
  id       = 1:10,
  age      = c(sample(1:30, 9, TRUE), NA),
  race     = c(1, 2, 1, 4, 7, 1, 2, 9, 1, 3),
  hispanic = c(7, 0, 1, 0, 1, 0, 1, 9, 0, 1),
  edu_4cat = c(4, 2, 9, 1, 2, 3, 4, 9, 3, 3),
  inc_6cat = c(1, 4, 1, 1, 5, 3, 2, 2, 7, 9)
) %>% 
  print()
## # A tibble: 10 × 6
##       id   age  race hispanic edu_4cat inc_6cat
##    <int> <int> <dbl>    <dbl>    <dbl>    <dbl>
##  1     1    15     1        7        4        1
##  2     2    19     2        0        2        4
##  3     3    14     1        1        9        1
##  4     4     3     4        0        1        1
##  5     5    10     7        1        2        5
##  6     6    18     1        0        3        3
##  7     7    22     2        1        4        2
##  8     8    11     9        9        9        2
##  9     9     5     1        0        3        7
## 10    10    NA     3        1        3        9

When working with data like this, it’s common to want to recode all the 7’s and 9’s to NA’s. We saw how to do that one column at a time already:

demographics %>% 
  mutate(
    race     = if_else(race == 7 | race == 9, NA_real_, race),
    hispanic = if_else(race == 7 | hispanic == 9, NA_real_, hispanic),
    edu_4cat = if_else(edu_4cat == 7 | edu_4cat == 9, NA_real_, edu_4cat)
  )
## # A tibble: 10 × 6
##       id   age  race hispanic edu_4cat inc_6cat
##    <int> <int> <dbl>    <dbl>    <dbl>    <dbl>
##  1     1    15     1        7        4        1
##  2     2    19     2        0        2        4
##  3     3    14     1        1       NA        1
##  4     4     3     4        0        1        1
##  5     5    10    NA       NA        2        5
##  6     6    18     1        0        3        3
##  7     7    22     2        1        4        2
##  8     8    11    NA       NA       NA        2
##  9     9     5     1        0        3        7
## 10    10    NA     3        1        3        9

🚩In the code chunk above, we have essentially the same code copied more than twice. That’s a red flag that we should be thinking about removing unnecessary repetition from our code.

Also, did you notice that I forgot to replace race with hispanic in hispanic = if_else(race == 7 | hispanic == 9, NA_real_, hispanic)? This time, I didn’t write “forgot” in quotes because I really did forget and only noticed it later. In this case, the error caused a value of 1 to be recoded to NA in the hispanic column. These typos we’ve been talking about really do happen – even to me!

Here’s how we can use across() in this situation:

demographics %>% 
  mutate(
    across(
      .cols = c(-id, -age),
      .fns  = ~ if_else(.x == 7 | .x == 9, NA_real_, .x)
    )
  )
## # A tibble: 10 × 6
##       id   age  race hispanic edu_4cat inc_6cat
##    <int> <int> <dbl>    <dbl>    <dbl>    <dbl>
##  1     1    15     1       NA        4        1
##  2     2    19     2        0        2        4
##  3     3    14     1        1       NA        1
##  4     4     3     4        0        1        1
##  5     5    10    NA        1        2        5
##  6     6    18     1        0        3        3
##  7     7    22     2        1        4        2
##  8     8    11    NA       NA       NA        2
##  9     9     5     1        0        3       NA
## 10    10    NA     3        1        3       NA

👆Here’s what we did above:

  • We used a purrr-style lambda to replace 7’s and 9’s in all columns in our data frame, except id and age, with NA.

  • Remember, the special .x symbol is just shorthand for each column passed to the .cols argument.

As another example, let’s say that we are once again working with data from a drug trial that includes a list of side effects for each person:

set.seed(123)
drug_trial <- tibble(
  id           = 1:10,
  se_headache  = sample(0:1, 10, TRUE),
  se_diarrhea  = sample(0:1, 10, TRUE),
  se_dry_mouth = sample(0:1, 10, TRUE),
  se_nausea    = sample(0:1, 10, TRUE)
) %>% 
 print()
## # A tibble: 10 × 5
##       id se_headache se_diarrhea se_dry_mouth se_nausea
##    <int>       <int>       <int>        <int>     <int>
##  1     1           0           1            0         0
##  2     2           0           1            1         1
##  3     3           0           1            0         0
##  4     4           1           0            0         1
##  5     5           0           1            0         1
##  6     6           1           0            0         0
##  7     7           1           1            1         0
##  8     8           1           0            1         0
##  9     9           0           0            0         0
## 10    10           0           0            1         1

Now, we want to create a factor version of each of the side effect columns. We’ve already learned how to do so one column at a time:

drug_trial %>% 
  mutate(
    se_headache_f  = factor(se_headache, 0:1, c("No", "Yes")),
    se_diarrhea_f  = factor(se_diarrhea, 0:1, c("No", "Yes")),
    se_dry_mouth_f = factor(se_dry_mouth, 0:1, c("No", "Yes"))
  )
## # A tibble: 10 × 8
##       id se_headache se_diarrhea se_dry_mouth se_nausea se_headache_f
##    <int>       <int>       <int>        <int>     <int> <fct>        
##  1     1           0           1            0         0 No           
##  2     2           0           1            1         1 No           
##  3     3           0           1            0         0 No           
##  4     4           1           0            0         1 Yes          
##  5     5           0           1            0         1 No           
##  6     6           1           0            0         0 Yes          
##  7     7           1           1            1         0 Yes          
##  8     8           1           0            1         0 Yes          
##  9     9           0           0            0         0 No           
## 10    10           0           0            1         1 No           
## # … with 2 more variables: se_diarrhea_f <fct>, se_dry_mouth_f <fct>

🚩Once again, we have essentially the same code copied more than twice. That’s a red flag that we should be thinking about removing unnecessary repetition from our code. Here’s how we can use across() to do so:

drug_trial %>% 
  mutate(
    across(
      .cols  = starts_with("se"),
      .fns   = ~ factor(.x, 0:1, c("No", "Yes")),
      .names = "{col}_f"
    )
  )
## # A tibble: 10 × 9
##       id se_headache se_diarrhea se_dry_mouth se_nausea se_headache_f
##    <int>       <int>       <int>        <int>     <int> <fct>        
##  1     1           0           1            0         0 No           
##  2     2           0           1            1         1 No           
##  3     3           0           1            0         0 No           
##  4     4           1           0            0         1 Yes          
##  5     5           0           1            0         1 No           
##  6     6           1           0            0         0 Yes          
##  7     7           1           1            1         0 Yes          
##  8     8           1           0            1         0 Yes          
##  9     9           0           0            0         0 No           
## 10    10           0           0            1         1 No           
## # … with 3 more variables: se_diarrhea_f <fct>, se_dry_mouth_f <fct>,
## #   se_nausea_f <fct>

👆Here’s what we did above:

  • We used a purrr-style lambda to create a factor version of all the side effect columns in our data frame.

  • We used the .names argument to add an "_f" to the end of the new column names.

34.4 Across with summarise

Let’s return to the ehr data frame we used in the chapter on working with character strings for our first example of using across() inside of summarise:

library(readr)
library(stringr)
ehr <- read_rds("/Users/bradcannell/Dropbox/Datasets/epcr/ehr.Rds")

For this example, the only column we will concern ourselves with is the symptoms column:

symptoms <- ehr %>% 
  select(symptoms) %>% 
  print()
## # A tibble: 15 × 1
##    symptoms                            
##    <chr>                               
##  1 "\"Pain\", \"Headache\", \"Nausea\""
##  2 "Pain"                              
##  3 "Pain"                              
##  4 "\"Nausea\", \"Headache\""          
##  5 "\"Pain\", \"Headache\""            
##  6 "\"Nausea\", \"Headache\""          
##  7 "Pain"                              
##  8  <NA>                               
##  9 "Pain"                              
## 10  <NA>                               
## 11 "\"Nausea\", \"Headache\""          
## 12 "\"Headache\", \"Pain\", \"Nausea\""
## 13 "Headache"                          
## 14 "\"Headache\", \"Pain\", \"Nausea\""
## 15  <NA>

You may recall that we created dummy variables for each symptom like this:

symptoms <- symptoms %>% 
  mutate(
    pain     = str_detect(symptoms, "Pain"),
    headache = str_detect(symptoms, "Headache"),
    nausea   = str_detect(symptoms, "Nausea")
  ) %>% 
  print()
## # A tibble: 15 × 4
##    symptoms                             pain  headache nausea
##    <chr>                                <lgl> <lgl>    <lgl> 
##  1 "\"Pain\", \"Headache\", \"Nausea\"" TRUE  TRUE     TRUE  
##  2 "Pain"                               TRUE  FALSE    FALSE 
##  3 "Pain"                               TRUE  FALSE    FALSE 
##  4 "\"Nausea\", \"Headache\""           FALSE TRUE     TRUE  
##  5 "\"Pain\", \"Headache\""             TRUE  TRUE     FALSE 
##  6 "\"Nausea\", \"Headache\""           FALSE TRUE     TRUE  
##  7 "Pain"                               TRUE  FALSE    FALSE 
##  8  <NA>                                NA    NA       NA    
##  9 "Pain"                               TRUE  FALSE    FALSE 
## 10  <NA>                                NA    NA       NA    
## 11 "\"Nausea\", \"Headache\""           FALSE TRUE     TRUE  
## 12 "\"Headache\", \"Pain\", \"Nausea\"" TRUE  TRUE     TRUE  
## 13 "Headache"                           FALSE TRUE     FALSE 
## 14 "\"Headache\", \"Pain\", \"Nausea\"" TRUE  TRUE     TRUE  
## 15  <NA>                                NA    NA       NA

🗒Side Note: Some of you may have noticed that we repeated ourselves more than twice in the code chunk above and thought about using across() to remove it. Unfortunately, across() won’t solve our problem in this situation. We will some of the tools we learn about in later chapters if we want to remove this repetition.

And finally, we used the table() function to get a count of how many people reported having a headache:

table(symptoms$headache)
## 
## FALSE  TRUE 
##     4     8

This is where the example stopped in the chapter on working with character strings. However, what if we wanted to know how many people reported the other symptoms as well? Well, we could repeatedly call the table() function:

table(symptoms$pain)
## 
## FALSE  TRUE 
##     4     8
table(symptoms$nausea)
## 
## FALSE  TRUE 
##     6     6

But, that would cause us to copy and paste repeatedly. Additionally, wouldn’t it be nice to view these counts in a way that makes them easier to compare? One solution would be to use summarise() like this:

symptoms %>% 
  summarise(
    had_headache = sum(headache, na.rm = TRUE),
    had_pain     = sum(pain, na.rm = TRUE),
    had_nausea   = sum(nausea, na.rm = TRUE)
  )
## # A tibble: 1 × 3
##   had_headache had_pain had_nausea
##          <int>    <int>      <int>
## 1            8        8          6

This works, but we can do better with across():

symptoms %>% 
  summarise(
    across(
      .cols  = c(headache, pain, nausea),
      .fns   = ~ sum(.x, na.rm = TRUE)
    )
  )
## # A tibble: 1 × 3
##   headache  pain nausea
##      <int> <int>  <int>
## 1        8     8      6

Great! But, wouldn’t it be nice to know the proportion of people with each symptom as well? You may recall that R treats TRUE and FALSE as 1 and 0 when used in a mathematical operation. Additionally, you may also already be aware that the mean of a set of 1’s and 0’s is equal to the proportion of 1’s in the set. For example, there are three ones and three zeros in the set (1, 1, 1, 0, 0, 0). The proportion of 1’s in the set is 3 out of 6, which is 0.5. Equivalently, the mean value of the set is (1 + 1 + 1 + 0 + 0 + 0) / 6, which equals 3 / 6, which is 0.5. So, when we have dummy variables like headache, pain, and nausea above, passing them to the mean() function returns the proportion of TRUE values. In this case, the proportion of people who had each symptom. We know we can do that calculation like this:

symptoms %>% 
  summarise(
    had_headache = mean(headache, na.rm = TRUE),
    had_pain     = mean(pain, na.rm = TRUE),
    had_nausea   = mean(nausea, na.rm = TRUE)
  )
## # A tibble: 1 × 3
##   had_headache had_pain had_nausea
##          <dbl>    <dbl>      <dbl>
## 1        0.667    0.667        0.5

As before, we can do better with the across() function like this:

symptoms %>% 
  summarise(
    across(
      .cols = c(pain, headache, nausea),
      .fns  = ~ mean(.x, na.rm = TRUE)
    )
  )
## # A tibble: 1 × 3
##    pain headache nausea
##   <dbl>    <dbl>  <dbl>
## 1 0.667    0.667    0.5

Now, at this point, we might think, “wouldn’t it be nice to see the count and the proportion in the same result?” Well, we can do that by supplying our purrr-style lambdas as functions in a list of name-function pairs like this:

symptom_summary <- symptoms %>% 
  summarise(
    across(
      .cols = c(pain, headache, nausea),
      .fns  = list(
        count = ~ sum(.x, na.rm = TRUE),
        prop  = ~ mean(.x, na.rm = TRUE)
      )
    )
  ) %>% 
  print()
## # A tibble: 1 × 6
##   pain_count pain_prop headache_count headache_prop nausea_count nausea_prop
##        <int>     <dbl>          <int>         <dbl>        <int>       <dbl>
## 1          8     0.667              8         0.667            6         0.5

In this case, it’s probably fine to stop here. But, what if we had 20 or 30 symptoms that we were analyzing? It would be really difficult to read and compare them arranged horizontally like this, wouldn’t it?

Do you recall us discussing restructuring our results in the chapter on restructuring data frames? This is a circumstance where we might want to use pivot_longer() to make our results easier to read and interpret:

symptom_summary %>% 
  tidyr::pivot_longer(
    cols      = everything(),
    names_to  = c("symptom", ".value"),
    names_sep = "_"
  )
## # A tibble: 3 × 3
##   symptom  count  prop
##   <chr>    <int> <dbl>
## 1 pain         8 0.667
## 2 headache     8 0.667
## 3 nausea       6 0.5

There! Isn’t that result much easier to read?

For our final example of this chapter, let’s return the first example from the writing functions chapter. We started with some simulated study data:

study <- tibble(
  age       = c(32, 30, 32, 29, 24, 38, 25, 24, 48, 29, 22, 29, 24, 28, 24, 25, 
                25, 22, 25, 24, 25, 24, 23, 24, 31, 24, 29, 24, 22, 23, 26, 23, 
                24, 25, 24, 33, 27, 25, 26, 26, 26, 26, 26, 27, 24, 43, 25, 24, 
                27, 28, 29, 24, 26, 28, 25, 24, 26, 24, 26, 31, 24, 26, 31, 34, 
                26, 25, 27, NA),
  age_group = c(2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
                1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 
                1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 
                2, 1, 1, 1, NA),
  gender    = c(2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 
                1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 2, 2, 1, 2, 2, 1, 
                1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 2, 1, 
                1, 1, 2, 1, NA),
  ht_in     = c(70, 63, 62, 67, 67, 58, 64, 69, 65, 68, 63, 68, 69, 66, 67, 65, 
                64, 75, 67, 63, 60, 67, 64, 73, 62, 69, 67, 62, 68, 66, 66, 62, 
                64, 68, NA, 68, 70, 68, 68, 66, 71, 61, 62, 64, 64, 63, 67, 66, 
                69, 76, NA, 63, 64, 65, 65, 71, 66, 65, 65, 71, 64, 71, 60, 62, 
                61, 69, 66, NA),
  wt_lbs    = c(216, 106, 145, 195, 143, 125, 138, 140, 158, 167, 145, 297, 146, 
                125, 111, 125, 130, 182, 170, 121, 98, 150, 132, 250, 137, 124, 
                186, 148, 134, 155, 122, 142, 110, 132, 188, 176, 188, 166, 136, 
                147, 178, 125, 102, 140, 139, 60, 147, 147, 141, 232, 186, 212, 
                110, 110, 115, 154, 140, 150, 130, NA, 171, 156, 92, 122, 102, 
                163, 141, NA),
  bmi       = c(30.99, 18.78, 26.52, 30.54, 22.39, 26.12, 23.69, 20.67, 26.29, 
                25.39, 25.68, 45.15, 21.56, 20.17, 17.38, 20.8, 22.31, 22.75, 
                26.62, 21.43, 19.14, 23.49, 22.66, 32.98, 25.05, 18.31, 29.13, 
                27.07, 20.37, 25.01, 19.69, 25.97, 18.88, 20.07, NA, 26.76, 
                26.97, 25.24, 20.68, 23.72, 24.82, 23.62, 18.65, 24.03, 23.86, 
                10.63, 23.02, 23.72, 20.82, 28.24, NA, 37.55, 18.88, 18.3, 
                19.13, 21.48, 22.59, 24.96, 21.63, NA, 29.35, 21.76, 17.97, 
                22.31, 19.27, 24.07, 22.76, NA),
  bmi_3cat  = c(3, 1, 2, 3, 1, 2, 1, 1, 2, 2, 2, 3, 1, 1, 1, 1, 1, 1, 2, 1, 1, 
                1, 1, 3, 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, NA, 2, 2, 2, 1, 1, 1, 1, 
                1, 1, 1, 1, 1, 1, 1, 2, NA, 3, 1, 1, 1, 1, 1, 1, 1, NA, 2, 1, 
                1, 1, 1, 1, 1, NA)
) %>% 
  mutate(
    age_group = factor(age_group, labels = c("Younger than 30", "30 and Older")),
    gender    = factor(gender, labels = c("Female", "Male")),
    bmi_3cat  = factor(bmi_3cat, labels = c("Normal", "Overweight", "Obese"))
  ) %>% 
  print()
## # A tibble: 68 × 7
##      age age_group       gender ht_in wt_lbs   bmi bmi_3cat  
##    <dbl> <fct>           <fct>  <dbl>  <dbl> <dbl> <fct>     
##  1    32 30 and Older    Male      70    216  31.0 Obese     
##  2    30 30 and Older    Female    63    106  18.8 Normal    
##  3    32 30 and Older    Female    62    145  26.5 Overweight
##  4    29 Younger than 30 Male      67    195  30.5 Obese     
##  5    24 Younger than 30 Female    67    143  22.4 Normal    
##  6    38 30 and Older    Female    58    125  26.1 Overweight
##  7    25 Younger than 30 Female    64    138  23.7 Normal    
##  8    24 Younger than 30 Male      69    140  20.7 Normal    
##  9    48 30 and Older    Male      65    158  26.3 Overweight
## 10    29 Younger than 30 Male      68    167  25.4 Overweight
## # … with 58 more rows

And wrote our own function to calculate the number of missing values, mean, median, min, and max for all of the continuous variables:

continuous_stats <- function(var) {
  study %>% 
    summarise(
      n_miss = sum(is.na({{ var }})),
      mean   = mean({{ var }}, na.rm = TRUE),
      median = median({{ var }}, na.rm = TRUE),
      min    = min({{ var }}, na.rm = TRUE),
      max    = max({{ var }}, na.rm = TRUE)
    )
}

We then used that function to calculate our statistics of interest for each continuous variable:

continuous_stats(age)
## # A tibble: 1 × 5
##   n_miss  mean median   min   max
##    <int> <dbl>  <dbl> <dbl> <dbl>
## 1      1  26.9     26    22    48
continuous_stats(ht_in)
## # A tibble: 1 × 5
##   n_miss  mean median   min   max
##    <int> <dbl>  <dbl> <dbl> <dbl>
## 1      3  66.0     66    58    76
continuous_stats(wt_lbs)
## # A tibble: 1 × 5
##   n_miss  mean median   min   max
##    <int> <dbl>  <dbl> <dbl> <dbl>
## 1      2  148.   142.    60   297
continuous_stats(bmi)
## # A tibble: 1 × 5
##   n_miss  mean median   min   max
##    <int> <dbl>  <dbl> <dbl> <dbl>
## 1      4  23.6   22.9  10.6  45.2

This is definitely an improvement over all the copying and pasting we were doing before we wrote our own function. However, there is still some unnecessary repetition above. One way we can remove this repetition is to use across() like this:

summary_stats <- study %>% 
  summarise(
    across(
      .cols = c(age, ht_in, wt_lbs, bmi),
      .fns  = list(
        n_miss = ~ sum(is.na(.x)),
        mean   = ~ mean(.x, na.rm = TRUE),
        median = ~ median(.x, na.rm = TRUE),
        min    = ~ min(.x, na.rm = TRUE),
        max    = ~ max(.x, na.rm = TRUE)
      )
    ) 
  ) %>% 
  print()
## # A tibble: 1 × 20
##   age_n_miss age_mean age_median age_min age_max ht_in_n_miss ht_in_mean
##        <int>    <dbl>      <dbl>   <dbl>   <dbl>        <int>      <dbl>
## 1          1     26.9         26      22      48            3       66.0
## # … with 13 more variables: ht_in_median <dbl>, ht_in_min <dbl>,
## #   ht_in_max <dbl>, wt_lbs_n_miss <int>, wt_lbs_mean <dbl>,
## #   wt_lbs_median <dbl>, wt_lbs_min <dbl>, wt_lbs_max <dbl>, bmi_n_miss <int>,
## #   bmi_mean <dbl>, bmi_median <dbl>, bmi_min <dbl>, bmi_max <dbl>

This method works, but it has the same problem that our symptom summaries had above. Our results are hard to read and interpret because they are arranged horizontally. We can once again pivot this data longer, but it won’t be quite as easy as it was before. Our first attempt might look like this:

summary_stats %>% 
  tidyr::pivot_longer(
    cols      = everything(),
    names_to  = c("characteristic", ".value"),
    names_sep = "_"
  )
## Warning: Expected 2 pieces. Additional pieces discarded in 12 rows [1, 6, 7, 8,
## 9, 10, 11, 12, 13, 14, 15, 16].
## # A tibble: 12 × 8
##    characteristic     n  mean median   min   max  `in`   lbs
##    <chr>          <int> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 age                1  26.9   26    22    48    NA     NA 
##  2 ht                NA  NA     NA    NA    NA     3     NA 
##  3 ht                NA  NA     NA    NA    NA    66.0   NA 
##  4 ht                NA  NA     NA    NA    NA    66     NA 
##  5 ht                NA  NA     NA    NA    NA    58     NA 
##  6 ht                NA  NA     NA    NA    NA    76     NA 
##  7 wt                NA  NA     NA    NA    NA    NA      2 
##  8 wt                NA  NA     NA    NA    NA    NA    148.
##  9 wt                NA  NA     NA    NA    NA    NA    142.
## 10 wt                NA  NA     NA    NA    NA    NA     60 
## 11 wt                NA  NA     NA    NA    NA    NA    297 
## 12 bmi                4  23.6   22.9  10.6  45.2  NA     NA

What do you think the problem is here?

Well, we passed an underscore to the names_sep argument. This tells pivot_longer() that that character string on the left side of the underscore should make up the values of the new characteristic column and each unique character string on the right side of the underscore should be used to create a new column name. In the symptoms data, this worked fine because all of the column names followed this pattern (e.g., pain_count and pain_prop). But, do the column names in summary_stats always follow this pattern? What about age_n_miss and ht_in_n_miss? All the extra underscores in the column names makes this pattern ineffective.

There are probably many ways we could address this problem. I think the most straightforward way is probably to go back to the code we used to create summary_stats and use the .names argument to separate the column name and statistic name with a character other than an underscore. Maybe a hyphen instead:

summary_stats <- study %>% 
  summarise(
    across(
      .cols  = c(age, ht_in, wt_lbs, bmi),
      .fns   = list(
        n_miss = ~ sum(is.na(.x)),
        mean   = ~ mean(.x, na.rm = TRUE),
        median = ~ median(.x, na.rm = TRUE),
        min    = ~ min(.x, na.rm = TRUE),
        max    = ~ max(.x, na.rm = TRUE)
      ),
      .names = "{col}-{fn}" # This is the new part of the code
    ) 
  ) %>% 
  print()
## # A tibble: 1 × 20
##   `age-n_miss` `age-mean` `age-median` `age-min` `age-max` `ht_in-n_miss`
##          <int>      <dbl>        <dbl>     <dbl>     <dbl>          <int>
## 1            1       26.9           26        22        48              3
## # … with 14 more variables: ht_in-mean <dbl>, ht_in-median <dbl>,
## #   ht_in-min <dbl>, ht_in-max <dbl>, wt_lbs-n_miss <int>, wt_lbs-mean <dbl>,
## #   wt_lbs-median <dbl>, wt_lbs-min <dbl>, wt_lbs-max <dbl>, bmi-n_miss <int>,
## #   bmi-mean <dbl>, bmi-median <dbl>, bmi-min <dbl>, bmi-max <dbl>

Now, we can simply pass a hyphen to the names_sep argument to pivot_longer():

summary_stats %>% 
  tidyr::pivot_longer(
    cols      = everything(),
    names_to  = c("characteristic", ".value"),
    names_sep = "-"
  )
## # A tibble: 4 × 6
##   characteristic n_miss  mean median   min   max
##   <chr>           <int> <dbl>  <dbl> <dbl> <dbl>
## 1 age                 1  26.9   26    22    48  
## 2 ht_in               3  66.0   66    58    76  
## 3 wt_lbs              2 148.   142.   60   297  
## 4 bmi                 4  23.6   22.9  10.6  45.2

Look at how much easier those results are to read!

I’m a big fan of using across() in conjunction with the dplyr verbs. It allows me to remove a lot of the unnecessary repetition from my code in a way that integrates pretty seamlessly with the tools I’m already using. Perhaps you will see value in using across() as well. In the next chapter, we will learn about using for loops to remove unnecessary repetition from our code.