30  Conditional Operations

There will often be times that we want to modify the values in one column of our data based on the values in one or more other columns in our data. For example, maybe we want to create a column that contains the region of the country someone is from, based on another column that contains the state they are from.

We don’t really have a way to do this with the tools we currently have in our toolbox. We can manually type out all the region values, but that isn’t very scalable. Wouldn’t it be nice if we could just give R some rules, or conditions (e.g., TX is in the South, CA is in the West), and have R fill in the region values for us? Well, that’s exactly what we are going to learn how to do in this chapter.

These kinds of operations are called conditional operations because we type in a set of conditions, R evaluates those conditions, and then executes a different process or procedure based on whether or not the condition is met.

As a silly example, let’s say that we want our daughters to wear a raincoat if it’s raining outside, but we don’t want them to wear a raincoat if it is not raining outside. So, we give them a conditional request: “If it’s raining outside, then make sure to wear your raincoat, please. Otherwise, please don’t wear your raincoat.”

In this hypothetical scenario, they say, “yes, dad,” and then go to the window to see if it’s raining. Then, they choose their next action (i.e., raincoat wearing) depending on whether the condition (raining) is met or not.

Just like we have to ask our daughters to put on a raincoat using conditional logic, we sometimes have to ask R to execute commands using conditional logic. Additionally, we have to do so in a way that R understands. For example, we can use dplyr’s if_else() function to ask R to execute commands conditionally. Let’s go ahead and take a look at an example now:

# A tibble: 5 × 2
    day weather
  <int> <chr>  
1     1 rain   
2     2 rain   
3     3 no rain
4     4 rain   
5     5 no rain

👆Here’s what we did above:

Now, let’s say that we want to create a new column in our data frame called raincoat. We want the value of raincoat to be wear on rainy days and no wear on days when it isn’t raining. Here’s how we can do that with the if_else() function:
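The following is a minimal sketch of code that could produce the result described here. The data frame name `rainy_days` is an assumption made for illustration; the values are copied from the tibble printed above.

```r
library(dplyr)

# Hypothetical data frame name; values match the tibble shown above
rainy_days <- tibble(
  day     = 1:5,
  weather = c("rain", "rain", "no rain", "rain", "no rain")
)

# Return "wear" where weather is "rain" and "no wear" everywhere else
rainy_days_coat <- rainy_days %>%
  mutate(
    raincoat = if_else(
      condition = weather == "rain",
      true      = "wear",
      false     = "no wear"
    )
  )

rainy_days_coat
```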

# A tibble: 5 × 3
    day weather raincoat
  <int> <chr>   <chr>   
1     1 rain    wear    
2     2 rain    wear    
3     3 no rain no wear 
4     4 rain    wear    
5     5 no rain no wear 

👆Here’s what we did above:

🗒Side Note: For the rest of the book, we will pass values to the if_else() function by position instead of name. In other words, we won’t write condition =, true =, or false = anymore. However, the first value passed to the if_else() function will always be passed to the condition argument, the second value will always be passed to the true argument, and the third value will always be passed to the false argument.

Before moving on, let’s dive into this a little further. R must always be able to reduce whatever value we pass to the condition argument of if_else() to TRUE or FALSE. That’s how R views any expression we pass to the condition argument. We can literally even pass the value TRUE or the value FALSE (not that doing so has much practical application):
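As a small sketch, passing a literal `TRUE` to the condition argument might look like this (the object name `coat` is only for illustration):

```r
library(dplyr)

# A literal TRUE always selects the value passed to `true`
coat <- if_else(condition = TRUE, true = "wear", false = "no wear")
coat
```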

[1] "wear"

Because the value passed to the condition argument is TRUE (in this case, literally), the if_else() function returns the value wear. What happens if we use this code to assign values to the raincoat column?

# A tibble: 5 × 3
    day weather raincoat
  <int> <chr>   <chr>   
1     1 rain    wear    
2     2 rain    wear    
3     3 no rain wear    
4     4 rain    wear    
5     5 no rain wear    

Again, the if_else() function returns the value wear because the value passed to the condition argument is TRUE. Then, R uses its recycling rules to copy the value wear to every row of the raincoat column. What do you think will happen if we pass the value FALSE to the condition argument instead?

# A tibble: 5 × 3
    day weather raincoat
  <int> <chr>   <chr>   
1     1 rain    no wear 
2     2 rain    no wear 
3     3 no rain no wear 
4     4 rain    no wear 
5     5 no rain no wear 

Hopefully, that was the result you expected. The if_else() function returns the value no wear because the value passed to the condition argument is FALSE. Then, R uses its recycling rules to copy the value no wear to every row of the raincoat column.

We can take this a step further and actually pass a vector of logical (TRUE/FALSE) values to the condition argument. For example:
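One way this might be written is sketched below. The data frame name `rainy_days` is assumed for illustration, and the logical values are typed by hand, one per row.

```r
library(dplyr)

# Hypothetical data frame name; values match the tibbles shown earlier
rainy_days <- tibble(
  day     = 1:5,
  weather = c("rain", "rain", "no rain", "rain", "no rain")
)

# One hand-typed logical value per row of the data frame
rainy_days_manual <- rainy_days %>%
  mutate(
    raincoat = if_else(
      c(TRUE, TRUE, FALSE, TRUE, FALSE),
      "wear",
      "no wear"
    )
  )

rainy_days_manual
```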

# A tibble: 5 × 3
    day weather raincoat
  <int> <chr>   <chr>   
1     1 rain    wear    
2     2 rain    wear    
3     3 no rain no wear 
4     4 rain    wear    
5     5 no rain no wear 

In reality, that’s sort of what we did in the very first if_else() example above. But, instead of typing the values manually, we used an expression that returned a vector of logical values. Specifically, we used the equality operator (==) to check whether each value in the weather column was equal to the value “rain”.
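The comparison can be run on its own to see the logical vector it returns. In this sketch, `weather` is a plain character vector holding the same values as the weather column.

```r
# Same values as the weather column, stored as a plain vector
weather <- c("rain", "rain", "no rain", "rain", "no rain")

# The equality operator is vectorized: it compares every element to "rain"
is_rain <- weather == "rain"
is_rain
```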

[1]  TRUE  TRUE FALSE  TRUE FALSE

That pretty much covers the basics of how the if_else() function works. Next, let’s take a look at some of the different combinations of operands and operators that we can combine and pass to the condition argument of the if_else() function.

30.1 Operands and operators

Let’s start by taking a look at some commonly used operands:

As we can see in the table above, operands are the values we want to check, or test. Operands can be variables or they can be individual values (also called constants). The example above (weather == "rain") contained two operands: the variable weather and the character constant "rain". The operator we used in this case was the equality operator (==). Next, let’s take a look at some other commonly used operators.

We think that most of the operators above will be familiar, or at least intuitive, for most of you. However, we do want to provide a little bit of commentary for a few of them.

  • We haven’t seen the %in% operator before, but we will wait to discuss it below.

  • Some of you may have been a little surprised by the results we get from using less than (<) and greater than (>) with characters. It’s basically just testing alphabetical order. A comes before B in the alphabet, so A is less than B. Additionally, when two letters are the same, the uppercase letter is considered greater than the lowercase letter (at least in most common locales; the exact ordering depends on the collation rules of your system’s locale). However, alphabetical order takes precedence over case. So, b is still greater than A even though b is lowercase and A is uppercase.

  • Many of you may not have seen the modulus operator (%%) before. The modulus operator returns the remainder that is left after dividing two numbers. For example, 4 divided by 2 is 2 with a remainder of 0 because 2 goes into 4 exactly two times. Said another way, 2 * 2 = 4 and 4 - 4 = 0. So, 4 %% 2 = 0. However, 3 divided by 2 is 1 with a remainder of 1 because 2 goes into 3 one time with 1 left over. Said another way, 2 * 1 = 2 and 3 - 2 = 1. So, 3 %% 2 = 1. How is this useful? Well, the only times we can remember using the modulus operator have been when we needed to separate even and odd rows of a data frame. For example, let’s say that we have a data frame where each person has two rows. The first row always corresponds to treatment A and the second row always corresponds to treatment B. However, for some reason (maybe blinding?), there was no treatment column in the data when we received it. We could use the modulus operator to add a treatment column like this:
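A minimal sketch of that approach follows. The data frame name `trial` is hypothetical, and the sketch assumes the rows are already sorted so that each person’s treatment A row comes first.

```r
library(dplyr)

# Hypothetical data frame name; values match the tibble shown below
trial <- tibble(
  id      = c(1, 1, 2, 2),
  outcome = c(0, 1, 1, 1)
)

# Odd row numbers leave a remainder of 1 when divided by 2; even rows leave 0
trial_trt <- trial %>%
  mutate(treatment = if_else(row_number() %% 2 == 1, "A", "B"))

trial_trt
```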

# A tibble: 4 × 2
     id outcome
  <dbl>   <dbl>
1     1       0
2     1       1
3     2       1
4     2       1
# A tibble: 4 × 3
     id outcome treatment
  <dbl>   <dbl> <chr>    
1     1       0 A        
2     1       1 B        
3     2       1 A        
4     2       1 B        
  • We also want to remind you that we should always use the is.na() function, not the equality operator (==), to check for missing values. Using the equality operator when there are missing values can give results that may be unexpected. For example:
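A sketch of the kind of comparison that produces this behavior is shown here. The data frame name `name_data` is an assumption; the values are taken from the output below.

```r
library(dplyr)

# Hypothetical data frame name; the third value of name1 is missing
name_data <- tibble(
  name1 = c("Jon", "John", NA),
  name2 = c("Jon", "Jon", "Jon")
)

# Comparing NA to anything with == returns NA, not FALSE
name_data_match <- name_data %>%
  mutate(name_match = name1 == name2)

name_data_match
```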
# A tibble: 3 × 3
  name1 name2 name_match
  <chr> <chr> <lgl>     
1 Jon   Jon   TRUE      
2 John  Jon   FALSE     
3 <NA>  Jon   NA        

Many of us would expect the third value of the name_match column to be FALSE instead of NA. There are a couple of different ways we can get FALSE in the third row instead of NA. One way, although not necessarily the best way, is to use the if_else() function:
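One possible way to write this is sketched below, again using the hypothetical name `name_data`.

```r
library(dplyr)

# Hypothetical data frame name; the third value of name1 is missing
name_data <- tibble(
  name1 = c("Jon", "John", NA),
  name2 = c("Jon", "Jon", "Jon")
)

name_data_fixed <- name_data %>%
  mutate(
    # Step 1: the raw comparison, which yields NA in the third row
    name_match = name1 == name2,
    # Step 2: replace NA with FALSE and keep every other value as it is
    name_match = if_else(is.na(name_match), FALSE, name_match)
  )

name_data_fixed
```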

# A tibble: 3 × 3
  name1 name2 name_match
  <chr> <chr> <lgl>     
1 Jon   Jon   TRUE      
2 John  Jon   FALSE     
3 <NA>  Jon   FALSE     

👆Here’s what we did above:

  • We used dplyr’s if_else() function to assign the value FALSE to the column name_match where the original value of name_match was NA.

  • The value we passed to the condition argument was is.na(name_match). In doing so, we asked R to check each value of the name_match column and see if it was NA.

  • If it was NA, then we wanted to return the value that we passed to the true argument. Somewhat confusingly, the value we passed to the true argument was FALSE. All that means is that we wanted if_else() to return the literal value FALSE when the value for name_match was NA.

  • If the value in name_match was NOT NA, then we wanted to return the value that we passed to the false argument. In this case, we asked R to return the value that already exists in the name_match column.

  • In more informal language, we asked R to replace missing values in the name_match column with FALSE and leave the rest of the values unchanged.

30.2 Testing multiple conditions simultaneously

So far, we have only ever passed one condition to the condition argument of the if_else() function. However, we can pass as many conditions as we want. Having said that, more than 2, or maybe 3, gets very convoluted. Let’s go ahead and take a look at a couple of examples now. We’ll start by simulating some blood pressure data:

# A tibble: 10 × 3
      id sysbp diasbp
   <int> <dbl>  <dbl>
 1     1   152     78
 2     2   120     60
 3     3   119     88
 4     4   123     76
 5     5   135     85
 6     6    83     54
 7     7   191    116
 8     8   147     95
 9     9   209    100
10    10   166    106

A person may be categorized as having normal blood pressure when their systolic blood pressure is less than 120 mmHg AND their diastolic blood pressure is less than 80 mmHg. We can use this information and the if_else() function to create a new column in our data frame that contains information about whether each person in our simulated data frame has normal blood pressure or not:
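The sketch below shows one way to do this. The values are typed in from the tibble printed above rather than re-simulated, because the simulation code and its random seed are not shown here.

```r
library(dplyr)

# Values copied from the printed tibble (not re-simulated)
blood_pressure <- tibble(
  id     = 1:10,
  sysbp  = c(152, 120, 119, 123, 135, 83, 191, 147, 209, 166),
  diasbp = c(78, 60, 88, 76, 85, 54, 116, 95, 100, 106)
)

# Both conditions must be TRUE for a row to be labeled "Normal"
blood_pressure_normal <- blood_pressure %>%
  mutate(bp = if_else(sysbp < 120 & diasbp < 80, "Normal", "Not Normal"))

blood_pressure_normal
```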

# A tibble: 10 × 4
      id sysbp diasbp bp        
   <int> <dbl>  <dbl> <chr>     
 1     1   152     78 Not Normal
 2     2   120     60 Not Normal
 3     3   119     88 Not Normal
 4     4   123     76 Not Normal
 5     5   135     85 Not Normal
 6     6    83     54 Normal    
 7     7   191    116 Not Normal
 8     8   147     95 Not Normal
 9     9   209    100 Not Normal
10    10   166    106 Not Normal

👆Here’s what we did above:

  • We used dplyr’s if_else() function to create a new column in our data frame (bp) that contains information about whether each person has normal blood pressure or not.

  • We actually passed two conditions to the condition argument. The first condition was that the value of sysbp had to be less than 120. The second condition was that the value of diasbp had to be less than 80.

  • Because we separated these conditions with the AND operator (&), both conditions had to be true in order for the if_else() function to return the value we passed to the true argument – Normal. Otherwise, the if_else() function returned the value we passed to the false argument – Not Normal.

  • Participant 2 had a systolic blood pressure of 120 and a diastolic blood pressure of 60. Although 60 is less than 80 (condition number 2), 120 is not less than 120 (condition number 1). So, the value returned by the if_else() function was Not Normal.

  • Participant 3 had a systolic blood pressure of 119 and a diastolic blood pressure of 88. Although 119 is less than 120 (condition number 1), 88 is not less than 80 (condition number 2). So, the value returned by the if_else() function was Not Normal.

  • Participant 6 had a systolic blood pressure of 83 and a diastolic blood pressure of 54. In this case, conditions 1 and 2 were met. So, the value returned by the if_else() function was Normal.

This is useful! However, in some cases, we need to be able to test conditions sequentially, rather than simultaneously, and return a different value for each condition.

30.3 Testing a sequence of conditions

Let’s say that we wanted to create a new column in our blood_pressure data frame that contains each person’s blood pressure category according to the following scale:

This is the perfect opportunity to use dplyr’s case_when() function. Take a look:
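Here is a sketch of how such a call might look. The cutoffs are assumptions chosen to be consistent with the output shown below (they follow commonly cited blood pressure categories), and the values are typed in from the earlier tibble.

```r
library(dplyr)

# Values copied from the printed tibble (not re-simulated)
blood_pressure <- tibble(
  id     = 1:10,
  sysbp  = c(152, 120, 119, 123, 135, 83, 191, 147, 209, 166),
  diasbp = c(78, 60, 88, 76, 85, 54, 116, 95, 100, 106)
)

# Assumed cutoffs; each formula is condition ~ value to return
blood_pressure_cat <- blood_pressure %>%
  mutate(
    bp = case_when(
      sysbp < 120 & diasbp < 80                ~ "Normal",
      sysbp >= 120 & sysbp < 130 & diasbp < 80 ~ "Elevated",
      # Stage 2 is listed before Stage 1 so that a person with one
      # reading in each range is placed in the more severe category
      sysbp >= 140 | diasbp >= 90              ~ "Hypertension Stage 2",
      (sysbp >= 130 & sysbp < 140) |
        (diasbp >= 80 & diasbp < 90)           ~ "Hypertension Stage 1"
    )
  )

blood_pressure_cat
```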

# A tibble: 10 × 4
      id sysbp diasbp bp                  
   <int> <dbl>  <dbl> <chr>               
 1     1   152     78 Hypertension Stage 2
 2     2   120     60 Elevated            
 3     3   119     88 Hypertension Stage 1
 4     4   123     76 Elevated            
 5     5   135     85 Hypertension Stage 1
 6     6    83     54 Normal              
 7     7   191    116 Hypertension Stage 2
 8     8   147     95 Hypertension Stage 2
 9     9   209    100 Hypertension Stage 2
10    10   166    106 Hypertension Stage 2

👆Here’s what we did above:

  • We used dplyr’s case_when() function to create a new column in our data frame (bp) that contains information about each person’s blood pressure category.

  • You can type ?case_when into your R console to view the help documentation for this function and follow along with the explanation below.

  • The main argument to the case_when() function is the ... argument (dplyr 1.1.0 and later also accept the optional .default, .ptype, and .size arguments). You should pass one or more two-sided formulas separated by commas to the ... argument. What in the heck does that mean?

    • When the help documentation refers to a two-sided formula, it means this: LHS ~ RHS. Here, LHS means left-hand side and RHS means right-hand side.

    • The LHS should be the condition or conditions that we want to test. You can think of this as being equivalent to the condition argument of the if_else() function.

    • The RHS should be the value we want the case_when() function to return when the condition on the left-hand side is met. You can think of this as being equivalent to the true argument of the if_else() function.

    • The tilde symbol (~) is used to separate the conditions on the left-hand side and the return values on the right-hand side.

  • As we use it here, the case_when() function doesn’t have a direct equivalent to the if_else() function’s false argument (although dplyr 1.1.0 and later provide a .default argument that can play that role). Instead, it evaluates each two-sided formula sequentially until it finds a condition that is met. If it never finds a condition that is met, then it returns an NA. We will expand on this more below.

  • Finally, we assigned all the values returned by the case_when() function to a new column that we named bp.

🗒Side Note: Traditionally, the tilde symbol (~) is used to represent relationships in a statistical model. Here, it doesn’t have that meaning. We assume this symbol was picked somewhat out of necessity. Remember, any of the comparison operators, arithmetic operators, and logical operators may be used to define a condition in the left-hand side, and commas are used to separate multiple two-sided formulas. That doesn’t leave very many symbols to choose from. So, tilde it is. That’s our guess anyway.

The case_when() function was really useful for creating the bp column above, but there was also a lot going on there. Next, we’ll take a look at a slightly less complex example and clarify a few things along the way.

30.4 Recoding variables

In epidemiology, recoding variables is really common. For example, we may collect information about people’s ages as a continuous variable, but decide that it makes more sense to collapse age into age categories for our analysis. Let’s say that our analysis plan calls for assigning each of our participants to one of the following age categories:

1 = child when the participant is less than 12 years old
2 = adolescent when the participant is between the ages of 12 and less than 18
3 = adult when the participant is 18 years old or older

🗒Side Note: You may not have ever heard of collapsing variables before. It simply means combining two or more values of our variable. We can collapse continuous variables into categories, as we discussed in the example above, or we can collapse categories into broader categories (as we will see with the race category example below). After we collapse a variable, it always contains fewer (and broader) possible values than it contained before we collapsed it.

We’re going to show you how to do this below using the case_when() function. However, we’re going to do it piecemeal so that we can highlight a few important concepts. First, let’s simulate some data that includes 10 participants’ ages.

# A tibble: 10 × 2
      id   age
   <int> <int>
 1     1    15
 2     2    19
 3     3    14
 4     4     3
 5     5    10
 6     6    18
 7     7    22
 8     8    11
 9     9     5
10    10    NA

Then, let’s start the process of collapsing the age column into a new column called age_3cat that contains the 3 age categories we discussed above:
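A sketch of this first step might look like the following. The data frame name `ages` is hypothetical, and the values are typed in from the tibble printed above rather than re-simulated.

```r
library(dplyr)

# Hypothetical data frame name; the last age is missing
ages <- tibble(
  id  = 1:10,
  age = c(15L, 19L, 14L, 3L, 10L, 18L, 22L, 11L, 5L, NA)
)

# Only one formula so far: ages under 12 become 1, everything else is NA
ages_step1 <- ages %>%
  mutate(
    age_3cat = case_when(
      age < 12 ~ 1
    )
  )

ages_step1
```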

# A tibble: 10 × 3
      id   age age_3cat
   <int> <int>    <dbl>
 1     1    15       NA
 2     2    19       NA
 3     3    14       NA
 4     4     3        1
 5     5    10        1
 6     6    18       NA
 7     7    22       NA
 8     8    11        1
 9     9     5        1
10    10    NA       NA

👆Here’s what we did above:

  • We used dplyr’s case_when() function to create a new column in our data frame (age_3cat) that will eventually categorize each participant into one of 3 categories depending on their continuous age value.

  • Notice that we only passed one two-sided formula to the case_when() function – age < 12 ~ 1.

    • The LHS of the two-sided formula is age < 12. This tells the case_when() function to check whether each value in the age column is less than 12.

    • The RHS of the two-sided formula is 1. This tells the case_when() function what value to return each time it finds a value less than 12 in the age column.

    • The tilde symbol is used to separate the LHS and the RHS of the two-sided formula.

  • Here is how the case_when() function basically works. It will test the condition on the left-hand side for each value of the variable, or variables, passed to the left-hand side (i.e., age). If the condition is met (i.e., < 12), then it will return the value on the right-hand side of the tilde (i.e., 1). If the condition is not met, it will test the condition in the next two-sided formula. When there are no more two-sided formulas, then it will return an NA.

    • Above, the first value in age is 15. 15 is NOT less than 12. So, case_when() tries to move on to the next two-sided formula. However, there is no next two-sided formula. So, the first value returned by the case_when() function is NA. The same is true for the next two values of age.

    • The fourth value in age is 3. 3 is less than 12. So, the fourth value returned by the case_when() function is 1. And so on…

    • Finally, after the case_when() function has tested all conditions, the returned values are assigned to a new column that we named age_3cat.

  • Notice that we named the new variable age_3cat. We’re not sure where we picked up this naming convention, but we use it a lot when we collapse variables. The basic format is the name of the variable we’re collapsing, an underscore, and the number of categories in the collapsed variable. We like using this convention for two reasons. First, the resulting column names are meaningful and informative. Second, we don’t have to spend any time trying to think of a different meaningful or informative name for our new variable. It’s totally fine if you don’t adopt this naming convention, but we would recommend that you try to use names that are more informative than age2 or something like that.

  • Notice that we used a number (1) on the right-hand side of the two-sided formula above. We could have used a character value instead (e.g., child); however, for reasons we discussed in the section on factor variables, we prefer to recode our variables using numeric categories and then later create a factor version of the variable using the _f naming convention.

Now, let’s add a second two-sided formula to our case_when() function.
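With the second formula added, the call might be sketched like this (the data frame name `ages` is again hypothetical, with values copied from the printed tibble):

```r
library(dplyr)

# Hypothetical data frame name; the last age is missing
ages <- tibble(
  id  = 1:10,
  age = c(15L, 19L, 14L, 3L, 10L, 18L, 22L, 11L, 5L, NA)
)

# Two formulas, separated by a comma; ages of 18 and over still get NA
ages_step2 <- ages %>%
  mutate(
    age_3cat = case_when(
      age < 12             ~ 1,
      age >= 12 & age < 18 ~ 2
    )
  )

ages_step2
```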

# A tibble: 10 × 3
      id   age age_3cat
   <int> <int>    <dbl>
 1     1    15        2
 2     2    19       NA
 3     3    14        2
 4     4     3        1
 5     5    10        1
 6     6    18       NA
 7     7    22       NA
 8     8    11        1
 9     9     5        1
10    10    NA       NA

👆Here’s what we did above:

  • We used dplyr’s case_when() function to create a new column in our data frame (age_3cat) that will eventually categorize each participant into one of 3 categories depending on their continuous age value.

  • Notice that this time we passed two two-sided formulas to the case_when() function – age < 12 ~ 1 and age >= 12 & age < 18 ~ 2.

    • Notice that we separated the two two-sided formulas with a comma (i.e., immediately after the 1 in age < 12 ~ 1).

    • Notice that the second two-sided formula is actually testing two conditions. First, it tests whether or not the value of age is greater than or equal to 12. Then, it tests whether or not the value of age is less than 18.

    • Because we separated the two conditions with the and operator (&), both must be TRUE for case_when() to return the value 2. Otherwise, it will move on to the next two-sided formula.

    • Above, the first value in age is 15. 15 is NOT less than 12. So, case_when() moves on to evaluate the next two-sided formula. 15 is greater than or equal to 12 AND 15 is less than 18. Because both conditions of the second two-sided formula were met, case_when() returns the value on the right-hand side of the second two-sided formula – 2. So, the first value returned by the case_when() function is 2.

    • The second value in age is 19. 19 is NOT less than 12. So, case_when() moves on to evaluate the next two-sided formula. 19 is greater than or equal to 12, but 19 is NOT less than 18. So, case_when() tries to move on to the next two-sided formula. However, there is no next two-sided formula. So, the second value returned by the case_when() function is NA.

    • The fourth value in age is 3. 3 is less than 12. So, the fourth value returned by the case_when() function is 1. At this point, because a condition was met, case_when() does not continue checking the current value of age against the remaining two-sided formulas. It returns a 1 and moves on to the next value of age.

    • Finally, after the case_when() function has tested all conditions, the returned values are assigned to a new column that we named age_3cat.

  • In everyday speech, we may express the second two-sided formula above as “categorize all people between the ages of 12 and 18 as an adolescent.” We want to make two points about that before moving on.

    • First, while that statement may be totally reasonable in everyday speech, it isn’t quite specific enough for what we are trying to do here. “Between 12 and 18” is a little bit ambiguous. What category is a person put in if they are exactly 12? What category are they put in if they are exactly 18? So, clearly we need to be more precise. We’re not aware of any hard and fast rules for making these kinds of decisions about categorization, but we tend to include the lower end of the range in the current category and exclude the value on the upper end of the range in the current category. So, in the example above, we would say, “categorize all people between the ages of 12 and less than 18 as an adolescent.”

    • Second, when we are testing for a “between” condition like this one, we often see students write code like this: age >= 12 & < 18. R won’t understand that. We have to use the column name in each condition to be tested (i.e., age >= 12 & age < 18), even though it doesn’t change. Otherwise, we get an error that looks something like this:

Error in parse(text = input): <text>:5:19: unexpected '<'
4:       age < 12             ~ 1,
5:       age >= 12 & <
                     ^

Ok, let’s go ahead and wrap up this age category variable:
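A sketch of the completed recode is shown below, using the three age categories defined earlier and the hypothetical data frame name `ages`.

```r
library(dplyr)

# Hypothetical data frame name; the last age is missing
ages <- tibble(
  id  = 1:10,
  age = c(15L, 19L, 14L, 3L, 10L, 18L, 22L, 11L, 5L, NA)
)

# 1 = child, 2 = adolescent, 3 = adult; a missing age stays NA
ages_final <- ages %>%
  mutate(
    age_3cat = case_when(
      age < 12             ~ 1,
      age >= 12 & age < 18 ~ 2,
      age >= 18            ~ 3
    )
  )

ages_final
```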

# A tibble: 10 × 3
      id   age age_3cat
   <int> <int>    <dbl>
 1     1    15        2
 2     2    19        3
 3     3    14        2
 4     4     3        1
 5     5    10        1
 6     6    18        3
 7     7    22        3
 8     8    11        1
 9     9     5        1
10    10    NA       NA

👆Here’s what we did above:

  • We used dplyr’s case_when() function to create a new column in our data frame (age_3cat) that categorized each participant into one of 3 categories depending on their continuous age value.

30.5 case_when() is lazy

What do we mean when we say that case_when() is lazy? Well, it may not have registered when we mentioned it above, but case_when() stops evaluating two-sided formulas for a value as soon as it finds one whose condition is TRUE. For example:
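The behavior can be sketched with a small data frame. The name `numbers` is hypothetical, and the middle cutoff (number < 3) is inferred from the output rather than stated in the text.

```r
library(dplyr)

# Hypothetical data frame name
numbers <- tibble(number = c(1, 2, 3))

# Every value is less than 4, but each row keeps the FIRST match only
numbers_size <- numbers %>%
  mutate(
    size = case_when(
      number < 2 ~ "Small",
      number < 3 ~ "Medium",
      number < 4 ~ "Large"
    )
  )

numbers_size
```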

# A tibble: 3 × 1
  number
   <dbl>
1      1
2      2
3      3
# A tibble: 3 × 2
  number size  
   <dbl> <chr> 
1      1 Small 
2      2 Medium
3      3 Large 

Why wasn’t the value for the size column Large in every row of the data frame? After all, 1, 2, and 3 are all less than 4, and number < 4 was the final possible two-sided formula that could have been evaluated for each value of number. The answer is that case_when() is lazy. The first value in number is 1. 1 is less than 2. So, the condition in the first two-sided formula evaluates to TRUE. So, case_when() immediately returns the value on the right-hand side (Small) and does not continue checking two-sided formulas. It moves on to the next value of number.

The fact that case_when() is lazy isn’t a bad thing. It’s just something to be aware of. In fact, we can often use it to our advantage. For example, we can use case_when()’s laziness to rewrite the age_3cat code from above a little more succinctly:
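A more succinct version might be sketched as follows, with the same hypothetical `ages` data frame as before.

```r
library(dplyr)

# Hypothetical data frame name; the last age is missing
ages <- tibble(
  id  = 1:10,
  age = c(15L, 19L, 14L, 3L, 10L, 18L, 22L, 11L, 5L, NA)
)

# Any age that reaches the second formula is already known to be 12 or more
ages_short <- ages %>%
  mutate(
    age_3cat = case_when(
      age < 12  ~ 1,
      age < 18  ~ 2,
      age >= 18 ~ 3
    )
  )

ages_short
```

The order of the formulas matters in this version: swapping the first two would label every child as category 2.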

# A tibble: 10 × 3
      id   age age_3cat
   <int> <int>    <dbl>
 1     1    15        2
 2     2    19        3
 3     3    14        2
 4     4     3        1
 5     5    10        1
 6     6    18        3
 7     7    22        3
 8     8    11        1
 9     9     5        1
10    10    NA       NA

👆Here’s what we did above:

  • Because case_when() is lazy, we were able to omit the age >= 12 condition from the second two-sided formula. It’s unnecessary because the value 1 is immediately returned for every person with an age value less than 12. By definition, any value being evaluated in the second two-sided formula (age < 18) has an age value greater than