30 Conditional Operations
There will often be times that we want to modify the values in one column of our data based on the values in one or more other columns in our data. For example, maybe we want to create a column that contains the region of the country someone is from, based on another column that contains the state they are from.
We don’t really have a way to do this with the tools we currently have in our toolbox. We can manually type out all the region values, but that isn’t very scalable. Wouldn’t it be nice if we could just give R some rules, or conditions (e.g., TX is in the South, CA is in the West), and have R fill in the region values for us? Well, that’s exactly what we are going to learn how to do in this chapter.
These kinds of operations are called conditional operations because we type in a set of conditions, R evaluates those conditions, and then executes a different process or procedure based on whether or not the condition is met.
As a silly example, let’s say that we want our daughters to wear a raincoat if it’s raining outside, but we don’t want them to wear a raincoat if it is not raining outside. So, we give them a conditional request: “If it’s raining outside, then make sure to wear your raincoat, please. Otherwise, please don’t wear your raincoat.”
In this hypothetical scenario, they say, “yes, dad,” and then go to the window to see if it’s raining. Then, they choose their next action (i.e., raincoat wearing) depending on whether the condition (raining) is met or not.
Just like we have to ask our daughters to put on a raincoat using conditional logic, we sometimes have to ask R to execute commands using conditional logic. Additionally, we have to do so in a way that R understands. For example, we can use dplyr’s if_else() function to ask R to execute commands conditionally. Let’s go ahead and take a look at an example now:
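The code that simulated the data is not reproduced here, so the following is only a minimal sketch of one way to build the tibble shown next. The name rainy_days is our assumption for illustration, and the values are typed in by hand rather than generated.

```r
library(dplyr)

# `rainy_days` is a hypothetical name chosen for this sketch.
# One row per day, with a character column describing the weather.
rainy_days <- tibble(
  day     = 1:5,
  weather = c("rain", "rain", "no rain", "rain", "no rain")
)

rainy_days
```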
# A tibble: 5 × 2
day weather
<int> <chr>
1 1 rain
2 2 rain
3 3 no rain
4 4 rain
5 5 no rain
👆Here’s what we did above:
- We simulated some data that contains information about whether or not it rained on each of 5 days.
Now, let’s say that we want to create a new column in our data frame called raincoat. We want the value of raincoat to be "wear" on rainy days and "no wear" on days when it isn’t raining. Here’s how we can do that with the if_else() function:
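A sketch of what that call might look like, assuming the data live in a tibble we have named rainy_days (a hypothetical name) and using the argument names condition, true, and false that if_else() documents:

```r
library(dplyr)

# Hypothetical data frame matching the weather data printed earlier.
rainy_days <- tibble(
  day     = 1:5,
  weather = c("rain", "rain", "no rain", "rain", "no rain")
)

# Return "wear" where weather equals "rain" and "no wear" everywhere else.
rainy_days_coat <- rainy_days %>%
  mutate(
    raincoat = if_else(
      condition = weather == "rain",
      true      = "wear",
      false     = "no wear"
    )
  )

rainy_days_coat
```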
# A tibble: 5 × 3
day weather raincoat
<int> <chr> <chr>
1 1 rain wear
2 2 rain wear
3 3 no rain no wear
4 4 rain wear
5 5 no rain no wear
👆Here’s what we did above:
- We used dplyr’s if_else() function to assign the values "wear" and "no wear" to the column raincoat conditional on the values in each row of the weather column.
- You can type ?if_else into your R console to view the help documentation for this function and follow along with the explanation below.
- The first argument to the if_else() function is the condition argument. The condition should typically be composed of a series of operands and operators (we’ll talk more about these soon) that tell R the condition(s) that we want it to test. For example, is the value of weather equal to "rain"?
- The second argument to the if_else() function is the true argument. The value passed to the true argument tells R what value the if_else() function should return when the condition is TRUE. In this case, we told if_else() to return the character value "wear".
- The third argument to the if_else() function is the false argument. The value passed to the false argument tells R what value the if_else() function should return when the condition is FALSE. In this case, we told if_else() to return the character value "no wear".
- Finally, we assigned all the values returned by the if_else() function to a new column that we named raincoat.
🗒Side Note: For the rest of the book, we will pass values to the if_else() function by position instead of name. In other words, we won’t write condition =, true =, or false = anymore. However, the first value passed to the if_else() function will always be passed to the condition argument, the second value will always be passed to the true argument, and the third value will always be passed to the false argument.
Before moving on, let’s dive into this a little further. R must always be able to reduce whatever value we pass to the condition argument of if_else() to TRUE or FALSE. That’s how R views any expression we pass to the condition argument. We can even pass the literal value TRUE or the literal value FALSE (not that doing so has much practical application):
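For instance, a call along these lines, with the values passed by position, would return the result printed next. The variable name coat is ours.

```r
library(dplyr)

# The condition is the literal value TRUE, so the second argument is returned.
coat <- if_else(TRUE, "wear", "no wear")
coat
```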
[1] "wear"
Because the value passed to the condition argument is TRUE (in this case, literally), the if_else() function returns the value "wear". What happens if we use this code to assign values to the raincoat column?
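As a sketch, and again assuming a tibble named rainy_days (our name, not one taken from the text), the assignment could look like this:

```r
library(dplyr)

# Hypothetical data frame matching the weather data printed earlier.
rainy_days <- tibble(
  day     = 1:5,
  weather = c("rain", "rain", "no rain", "rain", "no rain")
)

# if_else() returns a single value here; mutate() recycles it to every row.
always_wear <- rainy_days %>%
  mutate(raincoat = if_else(TRUE, "wear", "no wear"))

always_wear
```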
# A tibble: 5 × 3
day weather raincoat
<int> <chr> <chr>
1 1 rain wear
2 2 rain wear
3 3 no rain wear
4 4 rain wear
5 5 no rain wear
Again, the if_else() function returns the value "wear" because the value passed to the condition argument is TRUE. Then, R uses its recycling rules to copy the value "wear" to every row of the raincoat column. What do you think will happen if we pass the value FALSE to the condition argument instead?
# A tibble: 5 × 3
day weather raincoat
<int> <chr> <chr>
1 1 rain no wear
2 2 rain no wear
3 3 no rain no wear
4 4 rain no wear
5 5 no rain no wear
Hopefully, that was the result you expected. The if_else() function returns the value "no wear" because the value passed to the condition argument is FALSE. Then, R uses its recycling rules to copy the value "no wear" to every row of the raincoat column.
We can take this a step further and actually pass a vector of logical (TRUE/FALSE) values to the condition argument. For example:
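A sketch of that idea, with the logical values typed by hand so that they line up with the five rows of a hypothetical rainy_days tibble:

```r
library(dplyr)

# Hypothetical data frame matching the weather data printed earlier.
rainy_days <- tibble(
  day     = 1:5,
  weather = c("rain", "rain", "no rain", "rain", "no rain")
)

# One logical value per row: TRUE returns "wear", FALSE returns "no wear".
manual_coat <- rainy_days %>%
  mutate(
    raincoat = if_else(c(TRUE, TRUE, FALSE, TRUE, FALSE), "wear", "no wear")
  )

manual_coat
```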
# A tibble: 5 × 3
day weather raincoat
<int> <chr> <chr>
1 1 rain wear
2 2 rain wear
3 3 no rain no wear
4 4 rain wear
5 5 no rain no wear
In reality, that’s sort of what we did in the very first if_else() example above. But, instead of typing the values manually, we used an expression that returned a vector of logical values. Specifically, we used the equality operator (==) to check whether or not each value in the weather column was equal to the value "rain".
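We can see that logical vector directly by evaluating the comparison on its own. The sketch below uses a plain character vector with the same values as the weather column; the names weather and is_rain are ours.

```r
# The same five weather values, stored in a plain character vector.
weather <- c("rain", "rain", "no rain", "rain", "no rain")

# == compares every element to "rain" and returns one logical value per element.
is_rain <- weather == "rain"
is_rain
```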
[1] TRUE TRUE FALSE TRUE FALSE
That pretty much covers the basics of how the if_else() function works. Next, let’s take a look at some of the different combinations of operands and operators that we can combine and pass to the condition argument of the if_else() function.
30.1 Operands and operators
Let’s start by taking a look at some commonly used operands:
As we can see in the table above, operands are the values we want to check, or test. Operands can be variables or they can be individual values (also called constants). The example above (weather == "rain") contained two operands: the variable weather and the character constant "rain". The operator we used in this case was the equality operator (==). Next, let’s take a look at some other commonly used operators.
We think that most of the operators above will be familiar, or at least intuitive, for most of you. However, we do want to provide a little bit of commentary for a few of them.
- We haven’t seen the %in% operator before, but we will wait to discuss it below.
- Some of you may have been a little surprised by the results we get from using less than (<) and greater than (>) with characters. It’s basically just testing alphabetical order. A comes before B in the alphabet, so A is less than B. Additionally, when two letters are the same, the upper-case letter is considered greater than the lowercase letter. However, alphabetical order takes precedence over case. So, b is still greater than A even though b is lowercase and A is upper case. (The exact ordering can depend on the locale R is running in.)
- Many of you may not have seen the modulus operator (%%) before. The modulus operator returns the remainder that is left after dividing two numbers. For example, 4 divided by 2 is 2 with a remainder of 0 because 2 goes into 4 exactly two times. Said another way, 2 * 2 = 4 and 4 - 4 = 0. So, 4 %% 2 = 0. However, 3 divided by 2 is 1 with a remainder of 1 because 2 goes into 3 one time with 1 left over. Said another way, 2 * 1 = 2 and 3 - 2 = 1. So, 3 %% 2 = 1. How is this useful? Well, the only times we can remember using the modulus operator have been when we needed to separate even and odd rows of a data frame. For example, let’s say that we have a data frame where each person has two rows. The first row always corresponds to treatment A and the second row always corresponds to treatment B. However, for some reason (maybe blinding?), there was no treatment column in the data when we received it. We could use the modulus operator to add a treatment column like this:
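A sketch of that approach, assuming a small tibble that we have named trial (a hypothetical name) and dplyr’s row_number() helper to get each row’s position:

```r
library(dplyr)

# Hypothetical data: two rows per person, no treatment column.
trial <- tibble(
  id      = c(1, 1, 2, 2),
  outcome = c(0, 1, 1, 1)
)

# Odd-numbered rows leave a remainder of 1 when divided by 2, so they get "A".
# Even-numbered rows leave a remainder of 0, so they get "B".
trial_treated <- trial %>%
  mutate(treatment = if_else(row_number() %% 2 == 1, "A", "B"))

trial_treated
```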
# A tibble: 4 × 2
id outcome
<dbl> <dbl>
1 1 0
2 1 1
3 2 1
4 2 1
# A tibble: 4 × 3
id outcome treatment
<dbl> <dbl> <chr>
1 1 0 A
2 1 1 B
3 2 1 A
4 2 1 B
- We also want to remind you that we should always use the is.na() function, not the equality operator (==), to check for missing values. Using the equality operator when there are missing values can give results that may be unexpected. For example:
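The sketch below shows one way to reproduce this behavior. The tibble name names_df is our assumption; the values match the output printed next.

```r
library(dplyr)

# Hypothetical data: the third value of name1 is missing.
names_df <- tibble(
  name1 = c("Jon", "John", NA),
  name2 = c("Jon", "Jon", "Jon")
)

# Comparing a missing value with == returns NA rather than FALSE.
names_compared <- names_df %>%
  mutate(name_match = name1 == name2)

names_compared
```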
# A tibble: 3 × 3
name1 name2 name_match
<chr> <chr> <lgl>
1 Jon Jon TRUE
2 John Jon FALSE
3 <NA> Jon NA
Many of us would expect the third value of the name_match column to be FALSE instead of NA. There are a couple of different ways we can get FALSE in the third row instead of NA. One way, although not necessarily the best way, is to use the if_else() function:
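A sketch of that approach, again using the hypothetical names_df tibble:

```r
library(dplyr)

# Hypothetical data: the third value of name1 is missing.
names_df <- tibble(
  name1 = c("Jon", "John", NA),
  name2 = c("Jon", "Jon", "Jon")
)

names_fixed <- names_df %>%
  mutate(
    name_match = name1 == name2,
    # Where name_match is missing, return FALSE; otherwise keep the existing value.
    name_match = if_else(is.na(name_match), FALSE, name_match)
  )

names_fixed
```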
# A tibble: 3 × 3
name1 name2 name_match
<chr> <chr> <lgl>
1 Jon Jon TRUE
2 John Jon FALSE
3 <NA> Jon FALSE
👆Here’s what we did above:
- We used dplyr’s if_else() function to assign the value FALSE to the column name_match where the original value of name_match was NA.
- The value we passed to the condition argument was is.na(name_match). In doing so, we asked R to check each value of the name_match column and see if it was NA.
- If it was NA, then we wanted to return the value that we passed to the true argument. Somewhat confusingly, the value we passed to the true argument was FALSE. All that means is that we wanted if_else() to return the literal value FALSE when the value for name_match was NA.
- If the value in name_match was NOT NA, then we wanted to return the value that we passed to the false argument. In this case, we asked R to return the value that already exists in the name_match column.
- In more informal language, we asked R to replace missing values in the name_match column with FALSE and leave the rest of the values unchanged.
30.2 Testing multiple conditions simultaneously
So far, we have only ever passed one condition to the condition argument of the if_else() function. However, we can pass as many conditions as we want. Having said that, more than 2, or maybe 3, gets very convoluted. Let’s go ahead and take a look at a couple of examples now. We’ll start by simulating some blood pressure data:
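The simulation code itself is not reproduced here. As a stand-in, the sketch below simply types in the values from the output that follows, so it recreates the same tibble without any random number generation. The name blood_pressure matches the name used later in this chapter.

```r
library(dplyr)

# Values copied by hand from the printed tibble; this is not the original simulation.
blood_pressure <- tibble(
  id     = 1:10,
  sysbp  = c(152, 120, 119, 123, 135, 83, 191, 147, 209, 166),
  diasbp = c(78, 60, 88, 76, 85, 54, 116, 95, 100, 106)
)

blood_pressure
```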
# A tibble: 10 × 3
id sysbp diasbp
<int> <dbl> <dbl>
1 1 152 78
2 2 120 60
3 3 119 88
4 4 123 76
5 5 135 85
6 6 83 54
7 7 191 116
8 8 147 95
9 9 209 100
10 10 166 106
A person may be categorized as having normal blood pressure when their systolic blood pressure is less than 120 mmHg AND their diastolic blood pressure is less than 80 mmHg. We can use this information and the if_else() function to create a new column in our data frame that contains information about whether each person in our simulated data frame has normal blood pressure or not:
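A sketch of how that could be written, with the blood pressure values typed in by hand so the example runs on its own:

```r
library(dplyr)

# Values copied by hand from the printed tibble; this is not the original simulation.
blood_pressure <- tibble(
  id     = 1:10,
  sysbp  = c(152, 120, 119, 123, 135, 83, 191, 147, 209, 166),
  diasbp = c(78, 60, 88, 76, 85, 54, 116, 95, 100, 106)
)

# Both comparisons must be TRUE for a row to be labeled "Normal".
bp_normal <- blood_pressure %>%
  mutate(bp = if_else(sysbp < 120 & diasbp < 80, "Normal", "Not Normal"))

bp_normal
```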
# A tibble: 10 × 4
id sysbp diasbp bp
<int> <dbl> <dbl> <chr>
1 1 152 78 Not Normal
2 2 120 60 Not Normal
3 3 119 88 Not Normal
4 4 123 76 Not Normal
5 5 135 85 Not Normal
6 6 83 54 Normal
7 7 191 116 Not Normal
8 8 147 95 Not Normal
9 9 209 100 Not Normal
10 10 166 106 Not Normal
👆Here’s what we did above:
- We used dplyr’s if_else() function to create a new column in our data frame (bp) that contains information about whether each person has normal blood pressure or not.
- We actually passed two conditions to the condition argument. The first condition was that the value of sysbp had to be less than 120. The second condition was that the value of diasbp had to be less than 80.
- Because we separated these conditions with the AND operator (&), both conditions had to be true in order for the if_else() function to return the value we passed to the true argument – "Normal". Otherwise, the if_else() function returned the value we passed to the false argument – "Not Normal".
- Participant 2 had a systolic blood pressure of 120 and a diastolic blood pressure of 60. Although 60 is less than 80 (condition number 2), 120 is not less than 120 (condition number 1). So, the value returned by the if_else() function was "Not Normal".
- Participant 3 had a systolic blood pressure of 119 and a diastolic blood pressure of 88. Although 119 is less than 120 (condition number 1), 88 is not less than 80 (condition number 2). So, the value returned by the if_else() function was "Not Normal".
- Participant 6 had a systolic blood pressure of 83 and a diastolic blood pressure of 54. In this case, conditions 1 and 2 were both met. So, the value returned by the if_else() function was "Normal".
This is useful! However, in some cases, we need to be able to test conditions sequentially, rather than simultaneously, and return a different value for each condition.
30.3 Testing a sequence of conditions
Let’s say that we wanted to create a new column in our blood_pressure data frame that contains each person’s blood pressure category according to the following scale:
This is the perfect opportunity to use dplyr’s case_when() function. Take a look:
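The sketch below is one set of conditions that reproduces the categories in the output that follows. Because the scale itself is not reproduced in this text, the cutoffs used here (120, 130, and 140 for systolic; 80 and 90 for diastolic) are our assumption, inferred from the printed results rather than taken from the scale.

```r
library(dplyr)

# Values copied by hand from the printed tibble; this is not the original simulation.
blood_pressure <- tibble(
  id     = 1:10,
  sysbp  = c(152, 120, 119, 123, 135, 83, 191, 147, 209, 166),
  diasbp = c(78, 60, 88, 76, 85, 54, 116, 95, 100, 106)
)

# Assumed cutoffs. Each two-sided formula is a condition ~ a value to return.
bp_categories <- blood_pressure %>%
  mutate(
    bp = case_when(
      sysbp < 120 & diasbp < 80                                   ~ "Normal",
      sysbp >= 120 & sysbp < 130 & diasbp < 80                    ~ "Elevated",
      (sysbp >= 130 & sysbp < 140) | (diasbp >= 80 & diasbp < 90) ~ "Hypertension Stage 1",
      sysbp >= 140 | diasbp >= 90                                 ~ "Hypertension Stage 2"
    )
  )

bp_categories
```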
# A tibble: 10 × 4
id sysbp diasbp bp
<int> <dbl> <dbl> <chr>
1 1 152 78 Hypertension Stage 2
2 2 120 60 Elevated
3 3 119 88 Hypertension Stage 1
4 4 123 76 Elevated
5 5 135 85 Hypertension Stage 1
6 6 83 54 Normal
7 7 191 116 Hypertension Stage 2
8 8 147 95 Hypertension Stage 2
9 9 209 100 Hypertension Stage 2
10 10 166 106 Hypertension Stage 2
👆Here’s what we did above:
- We used dplyr’s case_when() function to create a new column in our data frame (bp) that contains information about each person’s blood pressure category.
- You can type ?case_when into your R console to view the help documentation for this function and follow along with the explanation below.
- The case_when() function only has a single argument – the ... argument. You should pass one or more two-sided formulas separated by commas to this argument. What in the heck does that mean?
  - When the help documentation refers to a two-sided formula, it means this: LHS ~ RHS. Here, LHS means left-hand side and RHS means right-hand side.
  - The LHS should be the condition or conditions that we want to test. You can think of this as being equivalent to the condition argument of the if_else() function.
  - The RHS should be the value we want the case_when() function to return when the condition on the left-hand side is met. You can think of this as being equivalent to the true argument of the if_else() function.
  - The tilde symbol (~) is used to separate the conditions on the left-hand side and the return values on the right-hand side.
- The case_when() function doesn’t have a direct equivalent to the if_else() function’s false argument. Instead, it evaluates each two-sided formula sequentially until it finds a condition that is met. If it never finds a condition that is met, then it returns an NA. We will expand on this more below.
- Finally, we assigned all the values returned by the case_when() function to a new column that we named bp.
🗒Side Note: Traditionally, the tilde symbol (~) is used to represent relationships in a statistical model. Here, it doesn’t have that meaning. We assume this symbol was picked somewhat out of necessity. Remember, any of the comparison operators, arithmetic operators, and logical operators may be used to define a condition in the left-hand side, and commas are used to separate multiple two-sided formulas. That doesn’t leave very many symbols to choose from. So, tilde it is. That’s our guess anyway.
The case_when() function was really useful for creating the bp column above, but there was also a lot going on there. Next, we’ll take a look at a slightly less complex example and clarify a few things along the way.
30.4 Recoding variables
In epidemiology, recoding variables is really common. For example, we may collect information about people’s ages as a continuous variable, but decide that it makes more sense to collapse age into age categories for our analysis. Let’s say that our analysis plan calls for assigning each of our participants to one of the following age categories:
- 1 = child when the participant is less than 12 years old
- 2 = adolescent when the participant is between the ages of 12 and less than 18
- 3 = adult when the participant is 18 years old or older
🗒Side Note: You may not have ever heard of collapsing variables before. It simply means combining two or more values of our variable. We can collapse continuous variables into categories, as we discussed in the example above, or we can collapse categories into broader categories (as we will see with the race category example below). After we collapse a variable, it always contains fewer (and broader) possible values than it contained before we collapsed it.
We’re going to show you how to do this below using the case_when() function. However, we’re going to do it piecemeal so that we can highlight a few important concepts. First, let’s simulate some data that includes 10 participants’ ages.
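As before, the simulation code is not reproduced here, so this sketch types in the ages from the output that follows. The tibble name ages is our assumption.

```r
library(dplyr)

# `ages` is a hypothetical name. Values are copied by hand from the printed tibble,
# including the missing age in the last row.
ages <- tibble(
  id  = 1:10,
  age = c(15L, 19L, 14L, 3L, 10L, 18L, 22L, 11L, 5L, NA)
)

ages
```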
# A tibble: 10 × 2
id age
<int> <int>
1 1 15
2 2 19
3 3 14
4 4 3
5 5 10
6 6 18
7 7 22
8 8 11
9 9 5
10 10 NA
Then, let’s start the process of collapsing the age column into a new column called age_3cat that contains the 3 age categories we discussed above:
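A sketch of that first step, using a hand-typed tibble that we have named ages (a hypothetical name) and the single two-sided formula age < 12 ~ 1:

```r
library(dplyr)

# Hypothetical data frame with the same ages as the printed tibble.
ages <- tibble(
  id  = 1:10,
  age = c(15L, 19L, 14L, 3L, 10L, 18L, 22L, 11L, 5L, NA)
)

# Only one two-sided formula so far: ages under 12 get a 1, everything else gets NA.
ages_step1 <- ages %>%
  mutate(age_3cat = case_when(age < 12 ~ 1))

ages_step1
```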
# A tibble: 10 × 3
id age age_3cat
<int> <int> <dbl>
1 1 15 NA
2 2 19 NA
3 3 14 NA
4 4 3 1
5 5 10 1
6 6 18 NA
7 7 22 NA
8 8 11 1
9 9 5 1
10 10 NA NA
👆Here’s what we did above:
- We used dplyr’s case_when() function to create a new column in our data frame (age_3cat) that will eventually categorize each participant into one of 3 categories depending on their continuous age value.
- Notice that we only passed one two-sided formula to the case_when() function – age < 12 ~ 1.
  - The LHS of the two-sided formula is age < 12. This tells the case_when() function to check whether or not each value in the age column is less than 12.
  - The RHS of the two-sided formula is 1. This tells the case_when() function what value to return each time it finds a value less than 12 in the age column.
  - The tilde symbol is used to separate the LHS and the RHS of the two-sided formula.
- Here is how the case_when() function basically works. It will test the condition on the left-hand side for each value of the variable, or variables, passed to the left-hand side (i.e., age). If the condition is met (i.e., < 12), then it will return the value on the right-hand side of the tilde (i.e., 1). If the condition is not met, it will test the condition in the next two-sided formula. When there are no more two-sided formulas, then it will return an NA.
  - Above, the first value in age is 15. 15 is NOT less than 12. So, case_when() tries to move on to the next two-sided formula. However, there is no next two-sided formula. So, the first value returned by the case_when() function is NA. The same is true for the next two values of age.
  - The fourth value in age is 3. 3 is less than 12. So, the fourth value returned by the case_when() function is 1. And so on…
  - Finally, after the case_when() function has tested all conditions, the returned values are assigned to a new column that we named age_3cat.
- Notice that we named the new variable age_3cat. We’re not sure where we picked up this naming convention, but we use it a lot when we collapse variables. The basic format is the name of the variable we’re collapsing, an underscore, and the number of categories in the collapsed variable. We like using this convention for two reasons. First, the resulting column names are meaningful and informative. Second, we don’t have to spend any time trying to think of a different meaningful or informative name for our new variable. It’s totally fine if you don’t adopt this naming convention, but we would recommend that you try to use names that are more informative than age2 or something like that.
- Notice that we used a number (1) on the right-hand side of the two-sided formula above. We could have used a character value instead (i.e., "child"); however, for reasons we discussed in the section on factor variables, we prefer to recode our variables using numeric categories and then later create a factor version of the variable using the _f naming convention.
Now, let’s add a second two-sided formula to our case_when() function.
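A sketch of the two-formula version, again using the hypothetical ages tibble:

```r
library(dplyr)

# Hypothetical data frame with the same ages as the printed tibble.
ages <- tibble(
  id  = 1:10,
  age = c(15L, 19L, 14L, 3L, 10L, 18L, 22L, 11L, 5L, NA)
)

# Two-sided formulas are separated by a comma and tested in order.
ages_step2 <- ages %>%
  mutate(
    age_3cat = case_when(
      age < 12             ~ 1,
      age >= 12 & age < 18 ~ 2
    )
  )

ages_step2
```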
# A tibble: 10 × 3
id age age_3cat
<int> <int> <dbl>
1 1 15 2
2 2 19 NA
3 3 14 2
4 4 3 1
5 5 10 1
6 6 18 NA
7 7 22 NA
8 8 11 1
9 9 5 1
10 10 NA NA
👆Here’s what we did above:
- We used dplyr’s case_when() function to create a new column in our data frame (age_3cat) that will eventually categorize each participant into one of 3 categories depending on their continuous age value.
- Notice that this time we passed two two-sided formulas to the case_when() function – age < 12 ~ 1 and age >= 12 & age < 18 ~ 2.
- Notice that we separated the two two-sided formulas with a comma (i.e., immediately after the 1 in age < 12 ~ 1).
- Notice that the second two-sided formula is actually testing two conditions. First, it tests whether or not the value of age is greater than or equal to 12. Then, it tests whether or not the value of age is less than 18.
- Because we separated the two conditions with the AND operator (&), both must be TRUE for case_when() to return the value 2. Otherwise, it will move on to the next two-sided formula.
  - Above, the first value in age is 15. 15 is NOT less than 12. So, case_when() moves on to evaluate the next two-sided formula. 15 is greater than or equal to 12 AND 15 is less than 18. Because both conditions of the second two-sided formula were met, case_when() returns the value on the right-hand side of the second two-sided formula – 2. So, the first value returned by the case_when() function is 2.
  - The second value in age is 19. 19 is NOT less than 12. So, case_when() moves on to evaluate the next two-sided formula. 19 is greater than or equal to 12, but 19 is NOT less than 18. So, case_when() tries to move on to the next two-sided formula. However, there is no next two-sided formula. So, the second value returned by the case_when() function is NA.
  - The fourth value in age is 3. 3 is less than 12. So, the fourth value returned by the case_when() function is 1. At this point, because a condition was met, case_when() does not continue checking the current value of age against the remaining two-sided formulas. It returns a 1 and moves on to the next value of age.
  - Finally, after the case_when() function has tested all conditions, the returned values are assigned to a new column that we named age_3cat.
In everyday speech, we may express the second two-sided formula above as “categorize all people between the ages of 12 and 18 as an adolescent.” We want to make two points about that before moving on.
First, while that statement may be totally reasonable in everyday speech, it isn’t quite specific enough for what we are trying to do here. “Between 12 and 18” is a little bit ambiguous. What category is a person put in if they are exactly 12? What category are they put in if they are exactly 18? So, clearly we need to be more precise. We’re not aware of any hard and fast rules for making these kinds of decisions about categorization, but we tend to include the lower end of the range in the current category and exclude the value on the upper end of the range in the current category. So, in the example above, we would say, “categorize all people between the ages of 12 and less than 18 as an adolescent.”
Second, when we are testing for a “between” condition like this one, we often see students write code like this: age >= 12 & < 18. R won’t understand that. We have to use the column name in each condition to be tested (i.e., age >= 12 & age < 18), even though it doesn’t change. Otherwise, we get an error that looks something like this:
Error in parse(text = input): <text>:5:19: unexpected '<'
4: age < 12 ~ 1,
5: age >= 12 & <
^
Ok, let’s go ahead and wrap up this age category variable:
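A sketch of the completed version. The third formula, age >= 18 ~ 3, follows from the adult category defined earlier; the tibble name ages remains our assumption.

```r
library(dplyr)

# Hypothetical data frame with the same ages as the printed tibble.
ages <- tibble(
  id  = 1:10,
  age = c(15L, 19L, 14L, 3L, 10L, 18L, 22L, 11L, 5L, NA)
)

# Three categories: 1 = child, 2 = adolescent, 3 = adult.
# The missing age matches none of the conditions, so it stays NA.
ages_3cat <- ages %>%
  mutate(
    age_3cat = case_when(
      age < 12             ~ 1,
      age >= 12 & age < 18 ~ 2,
      age >= 18            ~ 3
    )
  )

ages_3cat
```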
# A tibble: 10 × 3
id age age_3cat
<int> <int> <dbl>
1 1 15 2
2 2 19 3
3 3 14 2
4 4 3 1
5 5 10 1
6 6 18 3
7 7 22 3
8 8 11 1
9 9 5 1
10 10 NA NA
👆Here’s what we did above:
- We used dplyr’s case_when() function to create a new column in our data frame (age_3cat) that categorized each participant into one of 3 categories depending on their continuous age value.
30.5 case_when() is lazy
What do we mean when we say that case_when() is lazy? Well, it may not have registered when we mentioned it above, but case_when() stops evaluating two-sided formulas for a value as soon as it finds one that is TRUE. For example:
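A sketch that reproduces the two tibbles printed next. The tibble name df is our assumption, and the middle formula (number < 3) is inferred from the output.

```r
library(dplyr)

# Hypothetical data frame with three numbers.
df <- tibble(number = c(1, 2, 3))

# Every number is less than 4, but case_when() returns the value for the
# first condition that is TRUE and never reaches the later formulas.
df_size <- df %>%
  mutate(
    size = case_when(
      number < 2 ~ "Small",
      number < 3 ~ "Medium",
      number < 4 ~ "Large"
    )
  )

df_size
```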
# A tibble: 3 × 1
number
<dbl>
1 1
2 2
3 3
# A tibble: 3 × 2
number size
<dbl> <chr>
1 1 Small
2 2 Medium
3 3 Large
Why wasn’t the value for the size column "Large" in every row of the data frame? After all, 1, 2, and 3 are all less than 4, and number < 4 was the final possible two-sided formula that could have been evaluated for each value of number. The answer is that case_when() is lazy. The first value in number is 1. 1 is less than 2. So, the condition in the first two-sided formula evaluates to TRUE. So, case_when() immediately returns the value on the right-hand side ("Small") and does not continue checking two-sided formulas. It moves on to the next value of number.
The fact that case_when() is lazy isn’t a bad thing. It’s just something to be aware of. In fact, we can often use it to our advantage. For example, we can use case_when()’s laziness to rewrite the age_3cat code from above a little more succinctly:
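A sketch of the shorter version, using the same hypothetical ages tibble. Only the second formula changes; the result should match the longer version.

```r
library(dplyr)

# Hypothetical data frame with the same ages as the printed tibble.
ages <- tibble(
  id  = 1:10,
  age = c(15L, 19L, 14L, 3L, 10L, 18L, 22L, 11L, 5L, NA)
)

# The second formula no longer repeats age >= 12. Any value that reaches it
# has already failed the age < 12 test in the first formula.
ages_lazy <- ages %>%
  mutate(
    age_3cat = case_when(
      age < 12  ~ 1,
      age < 18  ~ 2,
      age >= 18 ~ 3
    )
  )

ages_lazy
```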
# A tibble: 10 × 3
id age age_3cat
<int> <int> <dbl>
1 1 15 2
2 2 19 3
3 3 14 2
4 4 3 1
5 5 10 1
6 6 18 3
7 7 22 3
8 8 11 1
9 9 5 1
10 10 NA NA
👆Here’s what we did above:
- Because case_when() is lazy, we were able to omit the age >= 12 condition from the second two-sided formula. It’s unnecessary because the value 1 is immediately returned for every person with an age value less than 12. By definition, any value being evaluated in the second two-sided formula (age < 18) has an age value greater than