11 Using Pipes
11.1 What are pipes?
🤔 What are pipes? This %>%
is the pipe operator. The pipe operator is not part of base R. So, you will need to install and load a package to use it. There is actually more than one package that you can use. I recommend that you install and load the dplyr
package.
You can install the
dplyr
package by copying and pasting the following command in your R consoleinstall.packages("dplyr")
You can load the
dplyr
package by copying and pasting the following command in your R consolelibrary(dplyr)
🤔 What does the pipe operator do? In my opinion, the pipe operator makes your R code much easier to read and understand.
🤔 How does it do that? It makes your R code easier to read and understand by allowing you to view your nested functions in the order you want them to execute, as opposed to viewing them literally nested inside of each other.
You were first introduced to nesting functions in the Let’s get programming chapter. Recall that functions return values, and the R language allows us to directly pass those returned values into other functions for further calculations. We referred to this as nesting functions and said it was a big deal because it allows us to do very complex operations in a scalable way and without storing a bunch of unneeded values.
In that chapter, we also discussed a potential downside of nesting functions. Namely, our R code can become really difficult to read when we start nesting lots of functions inside one another.
Pipes allow us to retain the benefits of nesting functions without making our code really difficult to read. At this point, I think it’s best to show you an example. In the code below we want to generate a sequence of numbers, then we want to calculate the log of each of the numbers, and then find the mean of the logged values.
# Performing an operation using a series of steps.
<- seq(from = 2, to = 100, by = 2)
my_numbers <- log(my_numbers)
my_numbers_logged <- mean(my_numbers_logged)
mean_my_numbers_logged mean_my_numbers_logged
## [1] 3.662703
👆Here’s what we did above:
- We created a vector of numbers called
my_numbers
using theseq()
function.
- Then we used the
log()
function to create a new vector of numbers calledmy_numbers_logged
, which contains the log values of the numbers inmy_numbers
.
- Then we used the
mean()
function to create a new vector calledmean_my_numbers_logged
, which contains the mean of the log values inmy_numbers_logged
.
- Finally, we printed the value of
mean_my_numbers_logged
to the screen to view.
The obvious first question here is, “why would I ever want to do that?” Good question! You probably won’t ever want to do what we just did in the code chunk above, but we haven’t learned many functions for working with real data yet and I don’t want to distract you with a bunch of new functions right now. Instead, I want to demonstrate what pipes do. So, we’re stuck with this silly example.
👍 What’s nice about the code above? I would argue that it is pretty easy to read because each line does one thing and it follows a series of steps in logical order – create the numbers, log the numbers, get the mean.
👎 What could be better about the code above? All we really wanted was the mean value of the logged numbers (i.e., mean_my_numbers_logged
); however, on our way to getting mean_my_numbers_logged
we also created two other objects that we don’t care about – my_numbers
and my_numbers_logged
. It took us time to do the extra typing required to create those objects, and those objects are now cluttering up our global environment. It may not seem like that big of a deal here, but in a real data analysis project these things can really add up.
Next, let’s try nesting these functions instead:
# Performing an operation using nested functions.
<- mean(log(seq(from = 2, to = 100, by = 2)))
mean_my_numbers_logged mean_my_numbers_logged
## [1] 3.662703
👆Here’s what we did above:
We created a vector of numbers called
mean_my_numbers_logged
by nesting theseq()
function inside of thelog()
function and nesting thelog()
function inside of themean()
function.Then, we printed the value of
mean_my_numbers_logged
to the screen to view.
👍 What’s nice about the code above? It is certainly more efficient than the sequential step method we used at first. We went from using 4 lines of code to using 2 lines of code, and we didn’t generate any unneeded objects.
👎 What could be better about the code above? Many people would say that this code is harder to read than than the the sequential step method we used at first. This is primarily due to the fact that each line no longer does one thing, and the code no longer follows a sequence of steps from start to finish. For example, the final operation we want to do is calculate the mean, but the mean()
function is the first function we use in the code.
Finally, let’s try see what this code looks like when we use pipes:
# Performing an operation using pipes.
<- seq(from = 2, to = 100, by = 2) %>%
mean_my_numbers_logged log() %>%
mean()
mean_my_numbers_logged
## [1] 3.662703
👆Here’s what we did above:
We created a vector of numbers called
mean_my_numbers_logged
by passing the result of theseq()
function directly to thelog()
function using the pipe operator, and passing the result of the thelog()
function directly to themean()
function using the pipe operator.Then, we printed the value of
mean_my_numbers_logged
to the screen to view.
👏 As you can see, by using pipes we were able to retain the benefits of performing the operation in a series of steps (i.e., each line of code does one thing and they follow in sequential order) and the benefits of nesting functions (i.e., more efficient code).
The utility of the pipe operator may not be immediately apparent to you based on this very simple example. So, next I’m going to show you a little snippet of code from one of my research projects. In the code chunk that follows, the operation I’m trying to perform on my data is written in two different ways – without pipes and with pipes. It’s very unlikely that you will know what this code does, but that isn’t really the point. Just try to get a sense of which version is easier for you to read.
# Nest functions without pipes
<- select(ungroup(filter(group_by(filter(merged_data, !is.na(incident_number)), incident_number), row_number() == 1)), date_entered, detect_data, validation)
responses
# Nest functions with pipes
<- merged_data %>%
responses filter(!is.na(incident_number)) %>%
group_by(incident_number) %>%
filter(row_number() == 1) %>%
ungroup() %>%
select(date_entered, detect_data, validation)
What do you think? Even without knowing what this code does, do you feel like one version is easier to read than the other?
11.2 How do pipes work?
Perhaps I’ve convinced you that pipes are generally useful. But, it may not be totally obvious to you how to use them. They are actually really simple. Start by thinking about pipes as having a left side and a right side.

Figure 11.1: Pipes have a left side and a right side.
The thing on the right side of the pipe operator should always be a function.

Figure 11.2: A function should always be to the right of the pipe operator.
The thing on the left side of the pipe operator can be a function or an object.

Figure 11.3: A function or an object can be to the left of the pipe operator.
All the pipe operator does is take the thing on the left side and pass it to the first argument of the function on the right side.

Figure 11.4: Pipe the left side to the first argument of the function on the right side.
It’s a really simple concept, but it can also cause people a lot of confusion at first. So, let’s take look at a couple more concrete examples.
Below we pass a vector of numbers to the to the mean()
function, which returns the mean value of those numbers to us.
mean(c(2, 4, 6, 8))
## [1] 5
We can also use a pipe to pass that vector of numbers to the mean()
function.
c(2, 4, 6, 8) %>% mean()
## [1] 5
So, the R interpreter took the thing on the left side of the pipe operator, stuck it into the first argument of the function on the right side of the pipe operator, and then executed the function. In this case, the mean()
function doesn’t require any other arguments, so we don’t have to write anything else inside of the mean()
function’s parentheses. When we see c(2, 4, 6, 8) %>% mean()
, R sees mean(c(2, 4, 6, 8))
Here’s one more example. Pretty soon we will learn how to use the filter()
function from the dplyr
package to keep only a subset of rows from our data frame. Let’s start by simulating some data:
# Simulate some data
<- tibble(
height_and_weight id = c("001", "002", "003", "004", "005"),
sex = c("Male", "Male", "Female", "Female", "Male"),
ht_in = c(71, 69, 64, 65, 73),
wt_lbs = c(190, 176, 130, 154, 173)
)
height_and_weight
## # A tibble: 5 × 4
## id sex ht_in wt_lbs
## <chr> <chr> <dbl> <dbl>
## 1 001 Male 71 190
## 2 002 Male 69 176
## 3 003 Female 64 130
## 4 004 Female 65 154
## 5 005 Male 73 173
In order to work, the filter()
function requires us to pass two values to it. The first value is the name of the data frame object with the rows we want to subset. The second is the condition used to subset the rows. Let’s say that we want to do a subgroup analysis using only the females in our data frame. We could use the filter()
function like so:
# First value = data frame name (height_and_weight)
# Second value = condition for keeping rows (when the value of sex is Female)
filter(height_and_weight, sex == "Female")
## # A tibble: 2 × 4
## id sex ht_in wt_lbs
## <chr> <chr> <dbl> <dbl>
## 1 003 Female 64 130
## 2 004 Female 65 154
👆Here’s what we did above:
- We kept only the rows from the data frame called
height_and_weight
that had a value ofFemale
for the variable calledsex
usingdplyr
’sfilter()
function.
We can also use a pipe to pass the height_and_weight
data frame to the filter()
function.
# First value = data frame name (height_and_weight)
# Second value = condition for keeping rows (when the value of sex is Female)
%>% filter(sex == "Female") height_and_weight
## # A tibble: 2 × 4
## id sex ht_in wt_lbs
## <chr> <chr> <dbl> <dbl>
## 1 003 Female 64 130
## 2 004 Female 65 154
As you can see, we get the exact same result. So, the R interpreter took the thing on the left side of the pipe operator, stuck it into the first argument of the function on the right side of the pipe operator, and then executed the function. In this case, the filter()
function needs a value supplied to two arguments in order work. So, we wrote sex == "Female"
inside of the filter()
function’s parentheses. When we see height_and_weight %>% filter(sex == "Female")
, R sees filter(height_and_weight, sex == "Female")
.
🗒Side Note: This pattern – a data frame piped into a function, which is usually then piped into one or more additional functions is something that you will see over and over in this book.
Don’t worry too much about how the filter()
function works. That isn’t the point here. The two main takeaways so far are:
Pipes make your code easier to read once you get used to them.
The R interpreter knows how to automatically take whatever is on the left side of the pipe operator and make it the value that gets passed to the first argument of the function on the right side of the pipe operator.
11.2.1 Keyboard shortcut
Typing %>%
over and over can be tedious! Thankfully, RStudio provides a keyboard shortcut for inserting the pipe operator into your R code.
On Mac type shift + command + m
.
On Windows type shift + control + m
It may not seem totally intuitive at first, but this shortcut is really handy once you get used to it.
11.2.2 Pipe style
As with all the code we write, style is an important consideration. I generally agree with the recommendations given in the Tidyverse style guide. In particular, I tend to use pipes in such a way that each line of my code does one, and only one, thing. For example:
# Each line does one thing - Use
<- merged_data %>%
responses filter(!is.na(incident_number)) %>%
group_by(incident_number) %>%
filter(row_number() == 1) %>%
ungroup() %>%
select(date_entered, detect_data, validation)
# Some lines do more than one thing - Don't use
<- merged_data %>% filter(!is.na(incident_number)) %>%
responses group_by(incident_number) %>% filter(row_number() == 1) %>%
ungroup() %>% select(date_entered, detect_data, validation)
As previously stated, there is a certain amount of subjectivity in what constitutes “good” style. But, I will once again reiterate that it is important to adopt some style and use it consistently.
11.3 Final thought on pipes
I think it’s important to note that not everyone in the R programming community is a fan of using pipes. I hope I’ve made a compelling case for why I use pipes, but I acknowledge that it is ultimately a preference, and that using pipes is not the best choice in all circumstances. Whether or not you choose to use the pipe operator is up to you; however, I will be using them extensively throughout the remainder of this book.