25 Creating and modifying columns

Two of the most fundamental data management tasks are creating new columns in your data frame and modifying existing ones. In fact, we’ve already talked about creating and modifying columns at a few different places in the book.

In this book, we are actually going to learn 4 different methods for creating and modifying columns of a data frame. They are:

  1. Using name-value pairs to add columns to a data frame during its initial creation. This was one of the first methods we used in this book for creating columns in a data frame. However, this method does not apply to creating or modifying columns in a data frame that already exists. Therefore, we won’t discuss it much in this chapter.

  2. Dollar sign notation. This is probably the most commonly used base R way of creating and modifying columns in a data frame. In this book, we won’t use it as much as we use dplyr::mutate(), but you will see it all over the place in the R community.

  3. Bracket notation. Again, we won’t use bracket notation very often in this book. However, we will use it later on when we learn about for loops. Therefore, I’m going to introduce you to using bracket notation to create and modify data frame columns now.

  4. The mutate() function from the dplyr package. This is the method that I will use the vast majority of the time in this book (and in my real-life projects). I’m going to recommend that you do the same.

25.1 Creating data frames

Very early on, in the Let’s get programming chapter, we learned how to create data frame columns using name-value pairs passed directly into the tibble() function.

class <- tibble(
  names   = c("John", "Sally", "Brad", "Anne"),
  heights = c(68, 63, 71, 72)
)
class
## # A tibble: 4 × 2
##   names heights
##   <chr>   <dbl>
## 1 John       68
## 2 Sally      63
## 3 Brad       71
## 4 Anne       72

This is an absolutely fundamental R programming skill, and one that you will likely use often. However, most people would not consider this to be a “data management” task, which is the focus of this part of the book. Further, we’ve really already covered all we need to cover about creating columns this way. So, I’m not going to write anything further about this method.

25.2 Dollar sign notation

Later in the Let’s get programming chapter, we learned about dollar sign notation. At that time, we used dollar sign notation to access or “get” values from a column.

class$heights
## [1] 68 63 71 72

However, we can also use dollar sign notation to create and/or modify columns in our data frame. For example:

class$heights <- class$heights / 12
class
## # A tibble: 4 × 2
##   names heights
##   <chr>   <dbl>
## 1 John     5.67
## 2 Sally    5.25
## 3 Brad     5.92
## 4 Anne     6

👆Here’s what we did above:

  • We modified the values in the heights column of our class data frame using dollar sign notation. More specifically, we converted the values in the heights column from inches to feet. We did this by telling R to “get” the values for the heights column and divide them by 12 (class$heights / 12) and then assign those new values back to the heights column (class$heights <-). In this case, that has the effect of modifying the values of a column that already exists.

🗒Side Note: I would actually suggest that you don’t typically do what I just did above in a real-world analysis. It’s typically safer to create a new variable with the modified values (e.g. height_feet) and leave the original values in the original variable as-is.
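As a rough sketch of that safer approach, the code below adds a height_feet column and leaves heights alone. It uses a throwaway copy of the data called class_demo (a name made up for this example), built with data.frame() so that the code runs on its own:

```r
# class_demo is a hypothetical stand-in for the class data frame
class_demo <- data.frame(
  names   = c("John", "Sally", "Brad", "Anne"),
  heights = c(68, 63, 71, 72)
)

# Create a new column instead of overwriting heights
class_demo$height_feet <- class_demo$heights / 12

# The original values in inches are still available
class_demo$heights
class_demo$height_feet
```

If the conversion turns out to be wrong, the original heights column is still there to start over from.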

We can also create a new variable in our data frame in a similar way. All we have to do is use a valid column name (that doesn’t already exist in the data frame) on the left side of our assignment arrow. For example:

class$grades <- c(89, 92, 86, 98)
class
## # A tibble: 4 × 3
##   names heights grades
##   <chr>   <dbl>  <dbl>
## 1 John     5.67     89
## 2 Sally    5.25     92
## 3 Brad     5.92     86
## 4 Anne     6        98

👆Here’s what we did above:

  • We created a new column in our class data frame using dollar sign notation. We assigned the values 89, 92, 86, and 98 to that column with the assignment arrow.

25.3 Bracket notation

We also learned how to access or “get” values from a column using bracket notation in the Let’s get programming chapter. There, we actually used a combination of dollar sign and bracket notation to access single individual values from a data frame column. For example:

class$heights[3]
## [1] 5.916667

But, we can also use bracket notation to access or “get” the entire column. For example:

class[["heights"]]
## [1] 5.666667 5.250000 5.916667 6.000000

👆Here’s what we did above:

  • We used bracket notation to get all of the values from the heights column of the class data frame.

I’d like you to notice a couple of things about the example above. First, notice that this is the exact same result we got from class$heights. Well, technically, the heights are now in feet instead of inches, but you know what I mean. R returned a numeric vector containing the values from the heights column to us. Second, notice that we used double brackets (i.e., two brackets on each side of the column name), and that the column name is wrapped in quotation marks. Both are required to get this result.

Similar to dollar sign notation, we can also create and/or modify columns in our data frame using bracket notation. For example, let’s convert those heights back to inches using bracket notation:

class[["heights"]] <- class[["heights"]] * 12
class
## # A tibble: 4 × 3
##   names heights grades
##   <chr>   <dbl>  <dbl>
## 1 John       68     89
## 2 Sally      63     92
## 3 Brad       71     86
## 4 Anne       72     98

And, let’s go ahead and add one more variable to our data frame using bracket notation.

class[["rank"]] <- c(3, 2, 4, 1)
class
## # A tibble: 4 × 4
##   names heights grades  rank
##   <chr>   <dbl>  <dbl> <dbl>
## 1 John       68     89     3
## 2 Sally      63     92     2
## 3 Brad       71     86     4
## 4 Anne       72     98     1

Somewhat confusingly, we can also access, create, and modify data frame columns using single brackets. For example:

class["heights"]
## # A tibble: 4 × 1
##   heights
##     <dbl>
## 1      68
## 2      63
## 3      71
## 4      72

Notice, however, that this returns a different result than class$heights and class[["heights"]]. The results returned from class$heights and class[["heights"]] were numeric vectors with 4 elements. The result returned from class["heights"] was a data frame with 1 column and 4 rows.

I don’t want you to get too hung up on the difference between single and double brackets right now. As I said, we are primarily going to use mutate() to create and modify data frame columns in this book. For now, it’s enough for you to simply be aware that single brackets and double brackets are a thing, and they can sometimes return different results. I will make sure to point out whether or not that matters when we use bracket notation later in the book.
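If you’d like to check the difference for yourself, here is a minimal sketch. The bracket_demo data frame and the object names as_vector and as_df are made up for this illustration, and I’m using data.frame() so the code runs without loading any packages:

```r
# bracket_demo is a hypothetical data frame used only for this comparison
bracket_demo <- data.frame(heights = c(68, 63, 71, 72))

as_vector <- bracket_demo[["heights"]] # double brackets
as_df     <- bracket_demo["heights"]   # single brackets

is.numeric(as_vector)    # double brackets return a plain numeric vector
is.data.frame(as_vector)
is.data.frame(as_df)     # single brackets return a one-column data frame
dim(as_df)
```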

25.4 Modify individual values

Before moving on to the mutate() function, I wanted to quickly discuss using dollar sign and bracket notation for modifying individual values in a column. Recall that we already learned how to access individual column values in the Let’s get programming chapter.

class$heights[3]
## [1] 71

As you may have guessed, we can also get the result above using only bracket notation.

class[["heights"]][3]
## [1] 71

Not only can we use these methods to get individual values from a column in a data frame, but we can also use these methods to modify an individual value in a column of a data frame. When might we want to do this? Well, I generally do this in one of two different circumstances.

  • First, I may do this when I’m writing my own R functions (you’ll learn how to do this later) and I want to make sure the function still behaves in the way I intended when there are small changes to the data. So, I may add a missing value to a column or something like that.

  • The second circumstance is when there are little one-off typos in the data. For example, let’s say I imported a data frame that looked like this:

## # A tibble: 4 × 2
##      id site 
##   <dbl> <chr>
## 1     1 TX   
## 2     2 CA   
## 3     3 tx   
## 4     4 CA

Notice that tx in the third row of data isn’t capitalized. Remember, R is a case-sensitive language, so this will likely cause us problems down the road if we don’t fix it. The easiest way to do so is probably:

study_data$site[3] <- "TX"
study_data
## # A tibble: 4 × 2
##      id site 
##   <dbl> <chr>
## 1     1 TX   
## 2     2 CA   
## 3     3 TX   
## 4     4 CA

Keep in mind that I said that I fix little one-off typos. If I needed to change tx to TX in multiple different places in the data, I wouldn’t use this method. Instead, I would use a conditional operation, which we will discuss later in the book.
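To give you a rough idea of what that might look like, here is a small sketch that uses a logical test inside the brackets to change every tx at once. This is just one possible approach, not necessarily the one we will use later in the book, and study_demo is a made-up data frame:

```r
# study_demo is a hypothetical data frame with the same typo in two rows
study_demo <- data.frame(
  id   = 1:6,
  site = c("TX", "CA", "tx", "CA", "tx", "TX"),
  stringsAsFactors = FALSE
)

# The logical test selects every row where site is exactly "tx"
study_demo$site[study_demo$site == "tx"] <- "TX"

table(study_demo$site)
```

Because the test picks the rows for us, the same line works whether the typo appears in 2 rows or 2,000.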

25.5 The mutate() function

# Load dplyr for the mutate function
library(dplyr)

We first discussed mutate() in the chapter on exporting data, and again in the Introduction to data management chapter. As I said there, the first two arguments to mutate() are .data and ....

The value passed to .data should always be a data frame. In this book, we will often pass data frames to the .data argument using the pipe operator (e.g., df %>% mutate()).

The value passed to the ... argument should be a name-value pair or multiple name-value pairs separated by commas. The ... argument is where you will tell mutate() to create or modify columns in your data frame and how.

  • Name-value pairs look like this: column name = value. The only thing that distinguishes whether you are creating or modifying a column is the column name in the name-value pair. If the column name in the name-value pair matches the name of an existing column in the data frame, then mutate() will modify that existing column. If the column name in the name-value pair does NOT match the name of an existing column in the data frame, then mutate() will create a new column in the data frame with a matching column name.
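Here is a minimal sketch of both cases in a single call to mutate(). The mutate_demo data frame is a made-up copy of the class data from earlier in the chapter, and result is just a name I chose for the output:

```r
library(dplyr)

# mutate_demo is a hypothetical copy of the class data
mutate_demo <- tibble(
  names   = c("John", "Sally", "Brad", "Anne"),
  heights = c(68, 63, 71, 72)
)

result <- mutate_demo %>%
  mutate(
    heights = heights / 12,     # name matches an existing column, so it is modified
    grades  = c(89, 92, 86, 98) # name matches no existing column, so it is created
  )

names(result)
```

Notice that mutate() returns a new data frame. The mutate_demo data frame itself is unchanged unless we assign the result back to it.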

Let’s take a look at a couple of examples. To get us started, let’s simulate some data that is a little more interesting than the class data we used above.

set.seed(123)

drug_trial <- tibble(
  # Study id, there are 20 people enrolled in the trial.
  id = rep(1:20, each = 3),
  # Follow-up year, 0 = baseline, 1 = year one, 2 = year two.
  year = rep(0:2, times = 20),
  # Participant age at baseline. Must be between the ages of 35 and 75 at 
  # baseline to be eligible for the study
  age = sample(35:75, 20, TRUE) %>% rep(each = 3),
  # Drug the participant received, Placebo or active
  drug = sample(c("Placebo", "Active"), 20, TRUE) %>% 
    rep(each = 3),
  # Reported headaches side effect, Y/N
  se_headache = if_else(
    drug == "Placebo", 
    sample(0:1, 60, TRUE, c(.95,.05)), 
    sample(0:1, 60, TRUE, c(.10, .90))
  ),
  # Reported diarrhea side effect, Y/N
  se_diarrhea = if_else(
    drug == "Placebo", 
    sample(0:1, 60, TRUE, c(.98,.02)), 
    sample(0:1, 60, TRUE, c(.20, .80))
  ),
  # Reported dry mouth side effect, Y/N
  se_dry_mouth = if_else(
    drug == "Placebo", 
    sample(0:1, 60, TRUE, c(.97,.03)), 
    sample(0:1, 60, TRUE, c(.30, .70))
  ),
  # Participant had myocardial infarction in study year, Y/N
  mi = if_else(
    drug == "Placebo", 
    sample(0:1, 60, TRUE, c(.85, .15)), 
    sample(0:1, 60, TRUE, c(.80, .20))
  )
)

👆Here’s what we did above:

  • We are simulating some drug trial data that includes the following variables:

    • id: Study id, there are 20 people enrolled in the trial.

    • year: Follow-up year, 0 = baseline, 1 = year one, 2 = year two.

    • age: Participant age at baseline. Must be between the ages of 35 and 75 at baseline to be eligible for the study.

    • drug: Drug the participant received, Placebo or Active.

    • se_headache: Reported headaches side effect, Y/N.

    • se_diarrhea: Reported diarrhea side effect, Y/N.

    • se_dry_mouth: Reported dry mouth side effect, Y/N.

    • mi: Participant had myocardial infarction in study year, Y/N.

  • We used the tibble() function above to create our data frame instead of the data.frame() function. This allows us to pass the drug column as a value to the if_else() function when we create se_headache, se_diarrhea, se_dry_mouth, and mi. If we had used data.frame() instead, we would have had to create se_headache, se_diarrhea, se_dry_mouth, and mi in a separate step.

  • We used a new function, if_else(), above to help us simulate this data. This function allows us to do something called conditional operations. There will be an entire chapter on conditional operations later in the book.

  • We used a new function, sample(), above to help us simulate this data. We used this function to randomly assign values to age, drug, se_headache, se_diarrhea, se_dry_mouth, and mi instead of manually assigning each value ourselves.

    • You can type ?sample into your R console to view the help documentation for this function and follow along with the explanation below.

    • The first argument to the sample() function is the x argument. You should pass a vector of values you want R to randomly choose from. For example, we told R to select values from a vector of numbers that spanned between 35 and 75 to fill-in the age column. Alternatively, we told R to select values from a character vector that included the values “Placebo” and “Active” to fill-in the drug column.

    • The second argument to the sample() function is the size argument. You should pass a number to the size argument. That number tells R how many times to choose a value from the vector of possible values passed to the x argument.

    • The third argument to the sample() function is the replace argument. The default value passed to the replace argument is FALSE. This tells R that once it has chosen a value from the vector of possible values passed to the x argument, it can’t choose that value again. If you want R to be able to choose the same value more than once, then you have to pass the value TRUE to the replace argument.

    • The fourth argument to the sample() function is the prob argument. The default value passed to the prob argument is NULL. This just means that this argument is optional. Passing a vector of probabilities to this argument allows you to adjust how likely it is that R will choose certain values from the vector of possible values passed to the x argument.

    • Finally, notice that we also used the set.seed() function at the very top of the code chunk. We did this because the sample() function chooses values at random. That means every time we run the code above, we get different values. That makes it difficult for me to write about the data because it’s constantly changing. When we use the set.seed() function, the values will still be randomly selected, but they will be the same randomly selected values every time. It doesn’t matter what number you pass to the set.seed() function as long as you pass the same number every time you want to get the same random values. For example:

# No set.seed - Random values
sample(1:100, 10, TRUE)
##  [1]  5 29 50 70 74 26 73 11  6 96
# No set.seed - Different random values
sample(1:100, 10, TRUE)
##  [1] 76 83 91 56 96 27 94 68 88 28
# Use set.seed - Random values
set.seed(456)
sample(1:100, 10, TRUE)
##  [1] 35 38 85 27 25 78 31 73 79 90
# Use set.seed again - Same random values
set.seed(456)
sample(1:100, 10, TRUE)
##  [1] 35 38 85 27 25 78 31 73 79 90
# Use set.seed with different value - Different random values
set.seed(789)
sample(1:100, 10, TRUE)
##  [1]  45  12  42  26  99  37 100  43  67  70

  • It’s not important that you fully understand the sample() function at this point. I’m just including it for those of you who are interested in simulating some slightly more complex data than we have simulated so far. The rest of you can just copy and paste the code if you want to follow along.
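For those of you who do want to experiment, here is a small sketch that names all four arguments to sample() explicitly. The object names (draws, too_many, first, and second) and the probabilities are made up for this illustration:

```r
set.seed(123)

# Draw 1,000 values, with "Placebo" much more likely than "Active"
draws <- sample(
  x       = c("Placebo", "Active"),
  size    = 1000,
  replace = TRUE,
  prob    = c(0.9, 0.1)
)
prop.table(table(draws))

# With the default replace = FALSE, R can't draw more values than x contains
too_many <- tryCatch(
  sample(x = 1:5, size = 10),
  error = function(e) "error"
)
too_many

# The same seed gives the same "random" values
set.seed(456)
first <- sample(1:100, 10, TRUE)
set.seed(456)
second <- sample(1:100, 10, TRUE)
identical(first, second)
```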

25.5.1 Adding or modifying a single column

This is probably the simplest case of adding a new column. We are going to use mutate() to add a single new column to the drug_trial data frame. Let’s say we want to add a column called complete that is equal to 0 if the participant showed up for all follow-up visits and equal to 1 if they didn’t. In this case, we simulated our data in such a way that we have complete follow-up for every participant. So, the value for complete should be 0 in all 60 rows of the data frame. We can do this in a few different ways.

drug_trial %>% 
  mutate(complete = c(
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
  )
## # A tibble: 60 × 9
##       id  year   age drug    se_headache se_diarrhea se_dry_mouth    mi complete
##    <int> <int> <int> <chr>         <int>       <int>        <int> <int>    <dbl>
##  1     1     0    65 Active            0           1            1     0        0
##  2     1     1    65 Active            1           1            1     0        0
##  3     1     2    65 Active            1           1            0     0        0
##  4     2     0    49 Active            1           1            1     0        0
##  5     2     1    49 Active            0           0            1     0        0
##  6     2     2    49 Active            1           1            1     0        0
##  7     3     0    48 Placebo           0           0            0     0        0
##  8     3     1    48 Placebo           0           0            0     0        0
##  9     3     2    48 Placebo           0           0            0     0        0
## 10     4     0    37 Placebo           0           0            0     0        0
## # … with 50 more rows

So, that works, but typing that out is no fun. Not to mention, this isn’t scalable at all. What if we needed 1,000 zeros? There’s actually a much easier way to get the result above, which may surprise you. Take a look 👀:

drug_trial %>% 
  mutate(complete = 0)
## # A tibble: 60 × 9
##       id  year   age drug    se_headache se_diarrhea se_dry_mouth    mi complete
##    <int> <int> <int> <chr>         <int>       <int>        <int> <int>    <dbl>
##  1     1     0    65 Active            0           1            1     0        0
##  2     1     1    65 Active            1           1            1     0        0
##  3     1     2    65 Active            1           1            0     0        0
##  4     2     0    49 Active            1           1            1     0        0
##  5     2     1    49 Active            0           0            1     0        0
##  6     2     2    49 Active            1           1            1     0        0
##  7     3     0    48 Placebo           0           0            0     0        0
##  8     3     1    48 Placebo           0           0            0     0        0
##  9     3     2    48 Placebo           0           0            0     0        0
## 10     4     0    37 Placebo           0           0            0     0        0
## # … with 50 more rows

How easy is that? Just pass the value to the name-value pair once and R will use it in every row. This works because of something called the recycling rules ♻️. In a nutshell, this means that R will change the length of vectors in certain situations all by itself when it thinks it knows what you “meant.” So, above we gave R a length 1 vector, 0 (i.e., a numeric vector with one value in it), and R changed it to a length 60 vector behind the scenes so that it could complete the operation it thought we were trying to complete.

25.5.2 Recycling rules

♻️The recycling rules work as long as the length of the longer vector is an integer multiple of the length of the shorter vector. For example, every vector (column) in an R data frame must have the same length. In this case, that length is 60. The length of the value we used in the name-value pair above was 1 (i.e., a single 0). Therefore, the longer vector had a length of 60 and the shorter vector had a length of 1. Because 60 * 1 = 60, the length of the longer vector (60) is an integer multiple (60) of the length of the shorter vector (1). But, what if we had tried to pass the values 0 and 1 to the column instead of just zero?

drug_trial %>% 
  mutate(complete = c(0, 1))
## Error: Problem with `mutate()` column `complete`.
## ℹ `complete = c(0, 1)`.
## ℹ `complete` must be size 60 or 1, not 2.

This doesn’t work, but it actually isn’t for the reason you may be thinking. Because 30 * 2 = 60, the length of the longer vector (60) is an integer multiple (30) of the length of the shorter vector (2). However, tidyverse functions throw errors when you try to recycle anything other than a single value (i.e., a length 1 vector). They are designed this way to protect you from accidentally getting unexpected results. So, I’m going to switch back over to using base R to round out our discussion of the recycling rules. Let’s try our example above again using base R:

drug_trial$complete <- c(0,1)
## Error: Assigned data `c(0, 1)` must be compatible with existing data.
## x Existing data has 60 rows.
## x Assigned data has 2 rows.
## ℹ Only vectors of size 1 are recycled.
drug_trial
## # A tibble: 60 × 8
##       id  year   age drug    se_headache se_diarrhea se_dry_mouth    mi
##    <int> <int> <int> <chr>         <int>       <int>        <int> <int>
##  1     1     0    65 Active            0           1            1     0
##  2     1     1    65 Active            1           1            1     0
##  3     1     2    65 Active            1           1            0     0
##  4     2     0    49 Active            1           1            1     0
##  5     2     1    49 Active            0           0            1     0
##  6     2     2    49 Active            1           1            1     0
##  7     3     0    48 Placebo           0           0            0     0
##  8     3     1    48 Placebo           0           0            0     0
##  9     3     2    48 Placebo           0           0            0     0
## 10     4     0    37 Placebo           0           0            0     0
## # … with 50 more rows

Wait, why are we still getting an error? Well, take a look at the output below and see if you can figure it out.

class(drug_trial)
## [1] "tbl_df"     "tbl"        "data.frame"

It may not be totally obvious, but this is telling us that drug_trial is a tibble – an enhanced data frame. Remember, we created drug_trial using the tibble() function instead of the data.frame() function. Because tibbles are part of the tidyverse, they throw the same recycling errors that the mutate() function did above. So, we’ll need to create a non-tibble version of drug_trial to finish our discussion of recycling rules.

drug_trial_df <- as.data.frame(drug_trial)
class(drug_trial_df)
## [1] "data.frame"

There we go! A regular old data frame.

drug_trial_df$complete <- c(0,1)
drug_trial_df
##    id year age    drug se_headache se_diarrhea se_dry_mouth mi complete
## 1   1    0  65  Active           0           1            1  0        0
## 2   1    1  65  Active           1           1            1  0        1
## 3   1    2  65  Active           1           1            0  0        0
## 4   2    0  49  Active           1           1            1  0        1
## 5   2    1  49  Active           0           0            1  0        0
## 6   2    2  49  Active           1           1            1  0        1
## 7   3    0  48 Placebo           0           0            0  0        0
## 8   3    1  48 Placebo           0           0            0  0        1
## 9   3    2  48 Placebo           0           0            0  0        0
## 10  4    0  37 Placebo           0           0            0  0        1
## 11  4    1  37 Placebo           0           0            0  0        0
## 12  4    2  37 Placebo           0           0            0  1        1
## 13  5    0  71 Placebo           0           0            0  0        0
## 14  5    1  71 Placebo           0           0            0  0        1
## 15  5    2  71 Placebo           0           0            0  0        0
## 16  6    0  48 Placebo           0           0            0  0        1
## 17  6    1  48 Placebo           0           0            0  1        0
## 18  6    2  48 Placebo           0           0            0  1        1
## 19  7    0  59  Active           1           1            1  0        0
## 20  7    1  59  Active           1           1            0  0        1
## 21  7    2  59  Active           1           1            1  0        0
## 22  8    0  60 Placebo           0           0            0  0        1
## 23  8    1  60 Placebo           0           0            0  0        0
## 24  8    2  60 Placebo           0           0            0  0        1
## 25  9    0  61  Active           1           1            1  0        0
## 26  9    1  61  Active           0           1            1  0        1
## 27  9    2  61  Active           1           0            0  0        0
## 28 10    0  39  Active           1           0            1  0        1
## 29 10    1  39  Active           1           0            0  0        0
## 30 10    2  39  Active           1           1            1  0        1
## 31 11    0  61 Placebo           0           0            0  0        0
## 32 11    1  61 Placebo           0           0            0  1        1
## 33 11    2  61 Placebo           0           0            0  0        0
## 34 12    0  62 Placebo           1           0            1  0        1
## 35 12    1  62 Placebo           0           0            0  0        0
## 36 12    2  62 Placebo           0           0            0  0        1
## 37 13    0  43 Placebo           0           0            0  0        0
## 38 13    1  43 Placebo           0           0            0  0        1
## 39 13    2  43 Placebo           0           0            0  0        0
## 40 14    0  63 Placebo           0           0            0  0        1
## 41 14    1  63 Placebo           0           0            0  0        0
## 42 14    2  63 Placebo           0           0            0  0        1
## 43 15    0  69  Active           1           1            1  0        0
## 44 15    1  69  Active           1           0            1  0        1
## 45 15    2  69  Active           1           1            1  0        0
## 46 16    0  42 Placebo           0           0            0  0        1
## 47 16    1  42 Placebo           0           0            1  0        0
## 48 16    2  42 Placebo           0           0            0  1        1
## 49 17    0  60 Placebo           0           0            0  0        0
## 50 17    1  60 Placebo           0           0            0  0        1
## 51 17    2  60 Placebo           1           0            0  0        0
## 52 18    0  41  Active           1           1            1  0        1
## 53 18    1  41  Active           1           1            1  0        0
## 54 18    2  41  Active           1           1            0  1        1
## 55 19    0  43 Placebo           0           0            0  0        0
## 56 19    1  43 Placebo           0           0            0  0        1
## 57 19    2  43 Placebo           0           0            0  0        0
## 58 20    0  53 Placebo           0           0            0  0        1
## 59 20    1  53 Placebo           0           0            0  0        0
## 60 20    2  53 Placebo           0           0            0  0        1

As you can see, the values 0 and 1 are now recycled as expected. Because 30 * 2 = 60, the length of the longer vector (60) is an integer multiple (30) of the length of the shorter vector (2). Now, what happens in a situation where the length of the longer vector is not an integer multiple of the length of the shorter vector?

drug_trial_df$complete <- c(0, 1, 2, 3, 4, 5, 6) # 7 values
## Error in `$<-.data.frame`(`*tmp*`, complete, value = c(0, 1, 2, 3, 4, : replacement has 7 rows, data has 60

60 / 7 = 8.571429 – not an integer. Because there is no integer value that we can multiply by 7 to get the number 60, R throws us an error telling us that it isn’t able to use the recycling rules.
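If you want to check ahead of time whether a base R data frame will be able to recycle a vector, the modulo operator (%%) is handy because it returns the remainder after division. The can_recycle() helper below is hypothetical, something written just for this sketch, and it only describes the base R data frame behavior shown above. Tibbles and mutate() are stricter.

```r
# Hypothetical helper: TRUE when `long` is an integer multiple of `short`
can_recycle <- function(long, short) {
  long %% short == 0
}

can_recycle(60, 1) # TRUE: 60 * 1 = 60
can_recycle(60, 2) # TRUE: 30 * 2 = 60
can_recycle(60, 7) # FALSE: 7 does not divide evenly into 60

60 %% 7 # The remainder is 4, not 0
```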

Finally, the recycling rules don’t only apply to creating new data frame columns. They apply in all cases where R is using two vectors to perform an operation. For example, R uses the recycling rules in mathematical operations.

nums <- 1:10
nums
##  [1]  1  2  3  4  5  6  7  8  9 10

To demonstrate, we create a simple numeric vector above. This vector just contains the numbers 1 through 10. Now, we can add 1 to each of those numbers like so:

nums + 1
##  [1]  2  3  4  5  6  7  8  9 10 11

Notice how R used the recycling rules to add 1 to every number in the nums vector. We didn’t have to explicitly tell R to add 1 to each number. This is sometimes referred to as vectorization. A function that performs an action on all elements of a vector, rather than having to be explicitly programmed to perform an action on each element of a vector, is a vectorized function. Remember that mathematical operators, including +, are functions in R. More specifically, + is a vectorized function. In fact, most built-in R functions are vectorized. Why am I telling you this? It isn’t intended to confuse you, but when I was learning R I came across this term all the time in R resources and help pages, and I had no idea what it meant. I hope that the very simple example above makes it easy to understand what vectorization means, and that you won’t be intimidated when it pops up while you’re trying to get help with your R programs.
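To make the contrast a little more concrete, here is a rough sketch of what we would have to write if + were not vectorized. It uses a for loop, which we haven’t covered yet, so don’t worry about the details. The object names looped and vectorized are made up for this illustration:

```r
nums <- 1:10

# Without vectorization: explicitly add 1 to each element, one at a time
looped <- numeric(length(nums))
for (i in seq_along(nums)) {
  looped[i] <- nums[i] + 1
}

# With vectorization: R handles every element for us
vectorized <- nums + 1

all.equal(looped, vectorized)
```

Both approaches give the same values, but the vectorized version is shorter and easier to read.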

Ok, so what happens when we add a longer vector and a shorter vector?

nums + c(1, 2)
##  [1]  2  4  4  6  6  8  8 10 10 12

As expected, R uses the recycling rules to change the length of the short vector to match the length of the longer vector, and then performs the operation – in this case, addition. So, the net result is 1 + 1 = 2, 2 + 2 = 4, 3 + 1 = 4, 4 + 2 = 6, etc. You probably already guessed what’s going to happen if we try to add a length 3 vector to nums, but let’s go ahead and take a look for the sake of completeness:

nums + c(1, 2, 3)
## Warning in nums + c(1, 2, 3): longer object length is not a multiple of shorter
## object length
##  [1]  2  4  6  5  7  9  8 10 12 11

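Yep, R complains. Notice, however, that this time we get a warning rather than an error. 10 / 3 = 3.333333 – not an integer. Because there is no integer value that we can multiply by 3 to get the number 10, R can’t recycle the shorter vector a whole number of times. In mathematical operations, R goes ahead and recycles the shorter vector anyway, stopping partway through the last cycle, returns a result, and warns us that the longer object length is not a multiple of the shorter object length.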
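One way to see exactly what R did with the length 3 vector is to stretch it out ourselves with the base R rep_len() function. This is only a sketch for illustration, and the object names (short, recycled, manual, and automatic) are made up. Note that R still returned values above, which is why the message is a warning.

```r
nums  <- 1:10
short <- c(1, 2, 3)

# Repeat the short vector until it has as many values as nums
recycled <- rep_len(short, length(nums))
recycled
# 1 2 3 1 2 3 1 2 3 1

# Adding the stretched vector gives the same values R returned above
manual    <- nums + recycled
automatic <- suppressWarnings(nums + short)
identical(manual, automatic)
```

The last cycle is cut short after the first value, which is rarely what we intend. That is why R warns us.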

Now that you understand R’s recycling rules, let’s return to our motivating example.

drug_trial %>% 
  mutate(complete = 0)
## # A tibble: 60 × 9
##       id  year   age drug    se_headache se_diarrhea se_dry_mouth    mi complete
##    <int> <int> <int> <chr>         <int>       <int>        <int> <int>    <dbl>
##  1     1     0    65 Active            0           1            1     0        0
##  2     1     1    65 Active            1           1            1     0        0
##  3     1     2    65 Active            1           1            0     0        0
##  4     2     0    49 Active            1           1            1     0        0
##  5     2     1    49 Active            0           0            1     0        0
##  6     2     2    49 Active            1           1            1     0        0
##  7     3     0    48 Placebo           0           0            0     0        0
##  8     3     1    48 Placebo           0           0            0     0        0
##  9     3     2    48 Placebo           0           0            0     0        0
## 10     4     0    37 Placebo           0           0            0     0        0
## # … with 50 more rows

This method works, but not always. And, it can sometimes give us unintended results. You may have originally thought to yourself, “we’ve already learned the rep() function. Let’s use that.” In fact, that’s a great idea!

drug_trial %>% 
  mutate(complete = rep(0, 60))
## # A tibble: 60 × 9
##       id  year   age drug    se_headache se_diarrhea se_dry_mouth    mi complete
##    <int> <int> <int> <chr>         <int>       <int>        <int> <int>    <dbl>
##  1     1     0    65 Active            0           1            1     0        0
##  2     1     1    65 Active            1           1            1     0        0
##  3     1     2    65 Active            1           1            0     0        0
##  4     2     0    49 Active            1           1            1     0        0
##  5     2     1    49 Active            0           0            1     0        0
##  6     2     2    49 Active            1           1            1     0        0
##  7     3     0    48 Placebo           0           0            0     0        0
##  8     3     1    48 Placebo           0           0            0     0        0
##  9     3     2    48 Placebo           0           0            0     0        0
## 10     4     0    37 Placebo           0           0            0     0        0
## # … with 50 more rows

That’s a lot less typing than the first method we tried, and it also has the added benefit of providing code that is easier for humans to read. We can both look at the code we used in the first method and tell that there are a bunch of zeros, but it’s hard to guess exactly how many, and it’s hard to feel completely confident that there isn’t a 1 in there somewhere that our eyes are missing. By contrast, it’s easy to look at rep(0, 60) and know that there are exactly 60 zeros, and only 60 zeros.
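One caveat worth knowing is that mutate() is stricter than base R’s vector arithmetic: it only recycles values of length 1, and a vector of any other length has to match the number of rows exactly. The sketch below uses a small made-up tibble (toy is a hypothetical stand-in for drug_trial) to show this, along with nrow() as one way to avoid hard-coding the row count.

```r
library(dplyr)

# Hypothetical stand-in for drug_trial with 4 rows.
toy <- tibble(id = 1:4)

# A length-1 value is recycled to every row.
toy %>% mutate(complete = 0)

# rep() with nrow() builds a vector that always matches the data.
toy_rep <- toy %>% mutate(complete = rep(0, nrow(toy)))
toy_rep

# A length-2 vector is NOT recycled by mutate(); it raises an error instead.
length_2_result <- tryCatch(
  toy %>% mutate(complete = c(0, 1)),
  error = function(e) "error"
)
length_2_result
```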

25.5.3 Using existing variables in name-value pairs

In the example above, we create a new column called complete by directly supplying values for that column in the name-value pair. In my experience, it is probably more common to create new columns in our data frames by combining or transforming the values of columns that already exist in our data frame. You’ve already seen an example of doing so when we created factor versions of variables. As an additional example, we could create a factor version of our mi variable like this:

drug_trial %>% 
  mutate(mi_f = factor(mi, c(0, 1), c("No", "Yes")))
## # A tibble: 60 × 9
##       id  year   age drug    se_headache se_diarrhea se_dry_mouth    mi mi_f 
##    <int> <int> <int> <chr>         <int>       <int>        <int> <int> <fct>
##  1     1     0    65 Active            0           1            1     0 No   
##  2     1     1    65 Active            1           1            1     0 No   
##  3     1     2    65 Active            1           1            0     0 No   
##  4     2     0    49 Active            1           1            1     0 No   
##  5     2     1    49 Active            0           0            1     0 No   
##  6     2     2    49 Active            1           1            1     0 No   
##  7     3     0    48 Placebo           0           0            0     0 No   
##  8     3     1    48 Placebo           0           0            0     0 No   
##  9     3     2    48 Placebo           0           0            0     0 No   
## 10     4     0    37 Placebo           0           0            0     0 No   
## # … with 50 more rows

Notice that in the code above, we didn’t tell R what values to use for mi_f by typing them explicitly in the name-value pair. Instead, we told R to go get the values of the column mi, do some stuff to those values, and then assign those modified values to a column in the data frame and name that column mi_f.

Here’s another example. It’s common to mean-center numeric values for many different kinds of analyses. For example, this is often done in regression analysis to aid in the interpretation of regression coefficients. We can easily mean-center numeric variables inside our mutate() function like so:

drug_trial %>% 
  mutate(age_center = age - mean(age))
## # A tibble: 60 × 9
##       id  year   age drug  se_headache se_diarrhea se_dry_mouth    mi age_center
##    <int> <int> <int> <chr>       <int>       <int>        <int> <int>      <dbl>
##  1     1     0    65 Acti…           0           1            1     0       11.3
##  2     1     1    65 Acti…           1           1            1     0       11.3
##  3     1     2    65 Acti…           1           1            0     0       11.3
##  4     2     0    49 Acti…           1           1            1     0       -4.7
##  5     2     1    49 Acti…           0           0            1     0       -4.7
##  6     2     2    49 Acti…           1           1            1     0       -4.7
##  7     3     0    48 Plac…           0           0            0     0       -5.7
##  8     3     1    48 Plac…           0           0            0     0       -5.7
##  9     3     2    48 Plac…           0           0            0     0       -5.7
## 10     4     0    37 Plac…           0           0            0     0      -16.7
## # … with 50 more rows

Notice how succinctly we were able to express this fairly complicated task. We had to find the mean of the variable age in the drug_trial data frame, subtract that value from the value for age in each row of the data frame, and then create a new column in the data frame containing the mean-centered values. Because mutate()’s name-value pairs can accept complex expressions as a value, and because all of the functions used in the code above are vectorized, we can perform this task using only a single, easy-to-read line of code (age_center = age - mean(age)).
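A quick way to check a mean-centered column is to confirm that its mean is (numerically) zero. The sketch below uses a small made-up vector of ages rather than the real drug_trial data.

```r
library(dplyr)

# Hypothetical ages; not the full drug_trial data.
toy <- tibble(age = c(65, 49, 48, 37))

# mean(age) is a single number, so it is recycled across all rows.
toy_centered <- toy %>%
  mutate(age_center = age - mean(age))

toy_centered

# The centered values should average to zero (up to floating point error).
mean(toy_centered$age_center)
```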

25.5.4 Adding or modifying multiple columns

In all of the examples above, we passed a single name-value pair to the ... argument of the mutate() function. If we want to create or modify multiple columns, we don’t need to keep typing the mutate() function over and over. We can simply pass multiple name-value pairs, separated by commas, to the ... argument. And, there is no limit to the number of pairs we can pass. This is part of the beauty of the ... argument in R. For example, we have three variables in drug_trial that capture information about whether or not the participant reported side effects including headache, diarrhea, and dry mouth. Currently, those are all stored as integer vectors that can take the values 0 and 1. Let’s say that we want to also create factor versions of those vectors:

drug_trial %>% 
  mutate(
    se_headache_f  = factor(se_headache, c(0, 1), c("No", "Yes")),
    se_diarrhea_f  = factor(se_diarrhea, c(0, 1), c("N0", "Yes")),
    se_dry_mouth_f = factor(se_dry_mouth, c(0, 1), c("No", "Yes"))
  )
## # A tibble: 60 × 11
##       id  year   age drug    se_headache se_diarrhea se_dry_mouth    mi
##    <int> <int> <int> <chr>         <int>       <int>        <int> <int>
##  1     1     0    65 Active            0           1            1     0
##  2     1     1    65 Active            1           1            1     0
##  3     1     2    65 Active            1           1            0     0
##  4     2     0    49 Active            1           1            1     0
##  5     2     1    49 Active            0           0            1     0
##  6     2     2    49 Active            1           1            1     0
##  7     3     0    48 Placebo           0           0            0     0
##  8     3     1    48 Placebo           0           0            0     0
##  9     3     2    48 Placebo           0           0            0     0
## 10     4     0    37 Placebo           0           0            0     0
## # … with 50 more rows, and 3 more variables: se_headache_f <fct>,
## #   se_diarrhea_f <fct>, se_dry_mouth_f <fct>

👆Here’s what we did above:

  • We created three new factor columns in the drug_trial data called se_headache_f, se_diarrhea_f, and se_dry_mouth_f.

  • We created all columns inside a single mutate() function.

  • Notice that I created one variable per line. I suggest you do the same. It just makes your code much easier to read.

So, adding or modifying multiple columns is really easy with mutate(). But, did any of you notice an error? Take a look at the line of code that creates se_diarrhea_f. Instead of writing the “No” label with an “N” and an “o,” I accidentally wrote it with an “N” and a zero. I find that when I have to type something over and over like this, I am more likely to make a mistake. Further, if I ever need to change the levels or labels, I will have to change them in every factor() function in the code above.
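Typos like this are easy to miss in printed output, especially when the new columns are truncated from the display. One way to catch them, sketched here with a made-up 0/1 vector, is to inspect the levels of each factor after creating it.

```r
# Hypothetical 0/1 values standing in for a side effect column.
se_diarrhea <- c(1, 1, 0, 0)

# Reproduce the typo: "N0" (with a zero) instead of "No".
se_diarrhea_f <- factor(se_diarrhea, c(0, 1), c("N0", "Yes"))

# levels() makes the mislabeled category visible.
levels(se_diarrhea_f)

# A simple check against the labels we meant to use.
identical(levels(se_diarrhea_f), c("No", "Yes"))
```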

For these reasons (and others), programmers of many languages – including R – are taught the DRY principle. DRY is an acronym for don’t repeat yourself. We will discuss the DRY principle again in the chapter on repeated operations, but for now, it just means that you typically don’t want to type code that is the same (or nearly the same) over and over in your programs. Here’s one way we could reduce the repetition in the code above:

# Create a vector of 0/1 levels that can be reused below.
yn_levs <- c(0, 1)
# Create a vector of "No"/"Yes" labels that can be reused below.
yn_labs <- c("No", "Yes")

drug_trial %>% 
  mutate(
    se_headache_f  = factor(se_headache, yn_levs, yn_labs),
    se_diarrhea_f  = factor(se_diarrhea, yn_levs, yn_labs),
    se_dry_mouth_f = factor(se_dry_mouth, yn_levs, yn_labs)
  )
## # A tibble: 60 × 11
##       id  year   age drug    se_headache se_diarrhea se_dry_mouth    mi
##    <int> <int> <int> <chr>         <int>       <int>        <int> <int>
##  1     1     0    65 Active            0           1            1     0
##  2     1     1    65 Active            1           1            1     0
##  3     1     2    65 Active            1           1            0     0
##  4     2     0    49 Active            1           1            1     0
##  5     2     1    49 Active            0           0            1     0
##  6     2     2    49 Active            1           1            1     0
##  7     3     0    48 Placebo           0           0            0     0
##  8     3     1    48 Placebo           0           0            0     0
##  9     3     2    48 Placebo           0           0            0     0
## 10     4     0    37 Placebo           0           0            0     0
## # … with 50 more rows, and 3 more variables: se_headache_f <fct>,
## #   se_diarrhea_f <fct>, se_dry_mouth_f <fct>

Notice that in the code above we type c(0, 1) and c("No", "Yes") once each instead of 3 times each. In the chapter on repeated operations we will learn techniques for removing even more repetition from the code above.
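As one possible intermediate step (not necessarily the approach taken later in the book), the shared levels and labels could also be wrapped in a small helper function. The name yn_factor() below is hypothetical, and the tibble is made up for illustration.

```r
library(dplyr)

# Hypothetical helper: converts a 0/1 vector to a "No"/"Yes" factor.
yn_factor <- function(x) {
  factor(x, levels = c(0, 1), labels = c("No", "Yes"))
}

# Made-up data standing in for two of the side effect columns.
toy <- tibble(
  se_headache = c(0, 1, 1),
  se_diarrhea = c(1, 1, 0)
)

toy_f <- toy %>%
  mutate(
    se_headache_f = yn_factor(se_headache),
    se_diarrhea_f = yn_factor(se_diarrhea)
  )

toy_f
```

With a helper like this, the levels and labels live in exactly one place, so a change to either only has to be made once.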

25.5.5 Rowwise mutations

In all the examples above we used the values from a single already existing variable in our name-value pair. However, we can also use the values from multiple variables in our name-value pairs.

For example, we have three variables in our drug_trial data that capture information about whether or not the participant reported side effects including headache, diarrhea, and dry mouth (sounds like every drug commercial that exists 😂). What if we want to know if our participants reported any side effect at each follow-up? That requires us to combine and transform data from across three different columns! This is one of those situations where there are many different ways we could accomplish this task, but I’m going to use dplyr’s rowwise() function to do so in the following code:

drug_trial %>% 
  rowwise() %>% 
  mutate(any_se_year = sum(se_headache, se_diarrhea, se_dry_mouth) > 0)
## # A tibble: 60 × 9
## # Rowwise: 
##       id  year   age drug    se_headache se_diarrhea se_dry_mouth    mi
##    <int> <int> <int> <chr>         <int>       <int>        <int> <int>
##  1     1     0    65 Active            0           1            1     0
##  2     1     1    65 Active            1           1            1     0
##  3     1     2    65 Active            1           1            0     0
##  4     2     0    49 Active            1           1            1     0
##  5     2     1    49 Active            0           0            1     0
##  6     2     2    49 Active            1           1            1     0
##  7     3     0    48 Placebo           0           0            0     0
##  8     3     1    48 Placebo           0           0            0     0
##  9     3     2    48 Placebo           0           0            0     0
## 10     4     0    37 Placebo           0           0            0     0
## # … with 50 more rows, and 1 more variable: any_se_year <lgl>

👆Here’s what we did above:

  • We created a new column in the drug_trial data called any_se_year using the mutate() function.

  • We used the rowwise() function to tell R to group the data frame by rows. Said another way, rowwise() tells R to do any calculations that follow across columns instead of within columns. Don’t worry, there are more examples below.

  • The value we passed to the name-value pair inside mutate() was actually the result of two calculations.

    • First, R summed the values of se_headache, se_diarrhea, and se_dry_mouth (i.e., sum(se_headache, se_diarrhea, se_dry_mouth)).

    • Next, R compared that summed value to 0. If the summed value was greater than 0, then the value assigned to any_se_year was TRUE. Otherwise, the value assigned to any_se_year was FALSE.

Because there is some new stuff in the code above, I’m going to break it down a little bit further. We’ll start with rowwise(). And, to reduce distractions as much as possible, I’m going to create a new data frame with only the columns we need for this example (sneak peek at the next chapter):

drug_trial_sub <- drug_trial %>% 
  select(id, year, starts_with("se")) %>% 
  print()
## # A tibble: 60 × 5
##       id  year se_headache se_diarrhea se_dry_mouth
##    <int> <int>       <int>       <int>        <int>
##  1     1     0           0           1            1
##  2     1     1           1           1            1
##  3     1     2           1           1            0
##  4     2     0           1           1            1
##  5     2     1           0           0            1
##  6     2     2           1           1            1
##  7     3     0           0           0            0
##  8     3     1           0           0            0
##  9     3     2           0           0            0
## 10     4     0           0           0            0
## # … with 50 more rows

Let’s start by discussing what rowwise() does. As we discussed above, most built-in R functions are vectorized. They do things to entire vectors, and data frame columns are vectors. So, without using rowwise() the sum() function would have returned the value 54:

drug_trial_sub %>% 
  mutate(any_se_year = sum(se_headache, se_diarrhea, se_dry_mouth))
## # A tibble: 60 × 6
##       id  year se_headache se_diarrhea se_dry_mouth any_se_year
##    <int> <int>       <int>       <int>        <int>       <int>
##  1     1     0           0           1            1          54
##  2     1     1           1           1            1          54
##  3     1     2           1           1            0          54
##  4     2     0           1           1            1          54
##  5     2     1           0           0            1          54
##  6     2     2           1           1            1          54
##  7     3     0           0           0            0          54
##  8     3     1           0           0            0          54
##  9     3     2           0           0            0          54
## 10     4     0           0           0            0          54
## # … with 50 more rows

Any guesses why it returns 54? Here’s a hint:

sum(c(0, 1, 0))
## [1] 1
sum(c(1, 1, 0))
## [1] 2
sum(
  c(0, 1, 0),
  c(1, 1, 0)
)
## [1] 3

When we pass a single numeric vector to the sum() function, it adds together all the numbers in that vector. When we pass two or more numeric vectors to the sum() function, it adds together all the numbers in all the vectors combined. Our data frame columns are no different:

sum(drug_trial_sub$se_headache)
## [1] 20
sum(drug_trial_sub$se_diarrhea)
## [1] 16
sum(drug_trial_sub$se_dry_mouth)
## [1] 18
sum(
  drug_trial_sub$se_headache,
  drug_trial_sub$se_diarrhea,
  drug_trial_sub$se_dry_mouth
)
## [1] 54

Hopefully, you see that the sum() function is taking the total of all three vectors added together, which is a single number (54), and then using recycling rules to assign that value to every row of any_se_year.
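The difference between adding within vectors and adding across them can be seen with two short made-up vectors. In this sketch, sum() collapses everything into one number, while the + operator works position by position:

```r
# Made-up values standing in for two side effect columns.
headache <- c(0, 1, 1)
diarrhea <- c(1, 1, 0)

# sum() adds every value in every vector: one number comes back.
total <- sum(headache, diarrhea)
total

# + adds position by position: one number per "row" comes back.
per_row <- headache + diarrhea
per_row
```

The second result has one value per row, which is the shape of answer we want for any_se_year.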

Using rowwise() tells R to add across the columns instead of within the columns. So, add the first value for se_headache to the first value for se_diarrhea to the first value for se_dry_mouth, assign that value to the first value of any_se_year, and then repeat for each subsequent row. This is what that result looks like:

drug_trial_sub %>% 
  rowwise() %>% 
  mutate(any_se_year = sum(se_headache, se_diarrhea, se_dry_mouth))
## # A tibble: 60 × 6
## # Rowwise: 
##       id  year se_headache se_diarrhea se_dry_mouth any_se_year
##    <int> <int>       <int>       <int>        <int>       <int>
##  1     1     0           0           1            1           2
##  2     1     1           1           1            1           3
##  3     1     2           1           1            0           2
##  4     2     0           1           1            1           3
##  5     2     1           0           0            1           1
##  6     2     2           1           1            1           3
##  7     3     0           0           0            0           0
##  8     3     1           0           0            0           0
##  9     3     2           0           0            0           0
## 10     4     0           0           0            0           0
## # … with 50 more rows

Because the value for each side effect could only be 0 (if not reported) or 1 (if reported) then the rowwise sum of those numbers is a count of the number of side effects reported in each row. For example, person 1 reported not having headaches (0), having diarrhea (1), and having dry mouth (1) at baseline (year == 0). And, 0 + 1 + 1 = 2 – the same value you see for any_se_year in that row. For instructional purposes, let’s run the code above again, but change the name of the variable to n_se_year (i.e., the count of side effects a participant reported in a given year).
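A minimal sketch of that renamed version is shown below. To keep it self-contained, it rebuilds only the first five rows of drug_trial_sub by hand from the output printed earlier (the object names first_rows and first_rows_counted are hypothetical):

```r
library(dplyr)

# First five rows of drug_trial_sub, copied from the printed output above.
first_rows <- tibble(
  id           = c(1L, 1L, 1L, 2L, 2L),
  year         = c(0L, 1L, 2L, 0L, 1L),
  se_headache  = c(0L, 1L, 1L, 1L, 0L),
  se_diarrhea  = c(1L, 1L, 1L, 1L, 0L),
  se_dry_mouth = c(1L, 1L, 0L, 1L, 1L)
)

# Same calculation as before, but the new column is named n_se_year.
first_rows_counted <- first_rows %>%
  rowwise() %>%
  mutate(n_se_year = sum(se_headache, se_diarrhea, se_dry_mouth))

first_rows_counted
```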

This may be a useful result in and of itself. However, we said we wanted a variable that captured whether a participant reported any side effect at each follow-up. Because n_se_year is a count of the side effects reported for that participant in that year, a value of 0 means that no side effects were reported, and a value greater than 0 means that one or more side effects were reported. Generally, we can test inequalities like this in the following way:

# Is 0 greater than 0?
0 > 0
## [1] FALSE
# Is 2 greater than 0?
2 > 0
## [1] TRUE

In our specific situation, instead of using a number on the left side of the inequality, we can use our calculated n_se_year variable values on the left side of the inequality:

drug_trial_sub %>% 
  rowwise() %>% 
  mutate(
    n_se_year   = sum(se_headache, se_diarrhea, se_dry_mouth),
    any_se_year = n_se_year > 0
  )
## # A tibble: 60 × 7
## # Rowwise: 
##       id  year se_headache se_diarrhea se_dry_mouth n_se_year any_se_year
##    <int> <int>       <int>       <int>        <int>     <int> <lgl>      
##  1     1     0           0           1            1         2 TRUE       
##  2     1     1           1           1            1         3 TRUE       
##  3     1     2           1           1            0         2 TRUE       
##  4     2     0           1           1            1         3 TRUE       
##  5     2     1           0           0            1         1 TRUE       
##  6     2     2           1           1            1         3 TRUE       
##  7     3     0           0           0            0         0 FALSE      
##  8     3     1           0           0            0         0 FALSE      
##  9     3     2           0           0            0         0 FALSE      
## 10     4     0           0           0            0         0 FALSE      
## # … with 50 more rows

In this way, any_se_year is TRUE if the participant reported any side effect in that year and FALSE if they reported no side effects in that year. We could write the code more succinctly like this:

drug_trial_sub %>% 
  rowwise() %>% 
  mutate(any_se_year = sum(se_headache, se_diarrhea, se_dry_mouth) > 0)
## # A tibble: 60 × 6
## # Rowwise: 
##       id  year se_headache se_diarrhea se_dry_mouth any_se_year
##    <int> <int>       <int>       <int>        <int> <lgl>      
##  1     1     0           0           1            1 TRUE       
##  2     1     1           1           1            1 TRUE       
##  3     1     2           1           1            0 TRUE       
##  4     2     0           1           1            1 TRUE       
##  5     2     1           0           0            1 TRUE       
##  6     2     2           1           1            1 TRUE       
##  7     3     0           0           0            0 FALSE      
##  8     3     1           0           0            0 FALSE      
##  9     3     2           0           0            0 FALSE      
## 10     4     0           0           0            0 FALSE      
## # … with 50 more rows

But, is that really what we want to do? The answer is, it depends. If we are going to stop here, then the succinct code may be what we want. But, what if we also want to know if the participant reported all side effects in each year? Perhaps you’ve already worked out what that code would look like. Perhaps you’re thinking something like:

drug_trial_sub %>% 
  rowwise() %>% 
  mutate(
    any_se_year = sum(se_headache, se_diarrhea, se_dry_mouth) > 0,
    all_se_year = sum(se_headache, se_diarrhea, se_dry_mouth) == 3
  )
## # A tibble: 60 × 7
## # Rowwise: 
##       id  year se_headache se_diarrhea se_dry_mouth any_se_year all_se_year
##    <int> <int>       <int>       <int>        <int> <lgl>       <lgl>      
##  1     1     0           0           1            1 TRUE        FALSE      
##  2     1     1           1           1            1 TRUE        TRUE       
##  3     1     2           1           1            0 TRUE        FALSE      
##  4     2     0           1           1            1 TRUE        TRUE       
##  5     2     1           0           0            1 TRUE        FALSE      
##  6     2     2           1           1            1 TRUE        TRUE       
##  7     3     0           0           0            0 FALSE       FALSE      
##  8     3     1           0           0            0 FALSE       FALSE      
##  9     3     2           0           0            0 FALSE       FALSE      
## 10     4     0           0           0            0 FALSE       FALSE      
## # … with 50 more rows

That works, and hopefully, you’re able to reason out why it works. But, there we go repeating code again! So, in this case, we have to choose between more succinct code and the DRY principle. When presented with that choice, I will typically favor the DRY principle. Therefore, my code would look like this:

drug_trial_sub %>% 
  rowwise() %>% 
  mutate(
    n_se_year   = sum(se_headache, se_diarrhea, se_dry_mouth),
    any_se_year = n_se_year > 0,
    all_se_year = n_se_year == 3
  )
## # A tibble: 60 × 8
## # Rowwise: 
##       id  year se_headache se_diarrhea se_dry_mouth n_se_year any_se_year
##    <int> <int>       <int>       <int>        <int>     <int> <lgl>      
##  1     1     0           0           1            1         2 TRUE       
##  2     1     1           1           1            1         3 TRUE       
##  3     1     2           1           1            0         2 TRUE       
##  4     2     0           1           1            1         3 TRUE       
##  5     2     1           0           0            1         1 TRUE       
##  6     2     2           1           1            1         3 TRUE       
##  7     3     0           0           0            0         0 FALSE      
##  8     3     1           0           0            0         0 FALSE      
##  9     3     2           0           0            0         0 FALSE      
## 10     4     0           0           0            0         0 FALSE      
## # … with 50 more rows, and 1 more variable: all_se_year <lgl>

Not only am I less likely to make a typing error in this code, but I think the differences between each line of code (i.e., what that line of code is doing) stand out more. In other words, the intent of the code isn’t buried in unneeded words.

Before moving on, I also want to point out that the method above would not have worked on factors. For example:

drug_trial_sub %>% 
  mutate(
    se_headache  = factor(se_headache, yn_levs, yn_labs),
    se_diarrhea  = factor(se_diarrhea, yn_levs, yn_labs),
    se_dry_mouth = factor(se_dry_mouth, yn_levs, yn_labs)
  ) %>% 
  rowwise() %>% 
  mutate(
    n_se_year   = sum(se_headache, se_diarrhea, se_dry_mouth),
    any_se_year = n_se_year > 0,
    all_se_year = n_se_year == 3
  )
## Error: Problem with `mutate()` column `n_se_year`.
## ℹ `n_se_year = sum(se_headache, se_diarrhea, se_dry_mouth)`.
## x 'sum' not meaningful for factors
## ℹ The error occurred in row 1.

The sum() function cannot add factors. Back when I first introduced factors in this book, I suggested that you keep the numeric version of your variables in your data frames and create factors as new variables. I said that I thought this was a good idea because I often find that it can be useful to have both versions of the variable hanging around during the analysis process. The situation above is an example of what I was talking about.
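If only the factor version of a variable were available, the 0/1 coding could still be recovered before summing. The sketch below, which uses a made-up factor, first shows that sum() fails on a factor and then rebuilds the 0/1 values by comparing against the "Yes" label.

```r
# Made-up "No"/"Yes" factor standing in for a side effect column.
se_headache_f <- factor(c(0, 1, 1), levels = c(0, 1), labels = c("No", "Yes"))

# sum() refuses to work on factors, so the error is caught here.
factor_sum <- tryCatch(sum(se_headache_f), error = function(e) "error")
factor_sum

# Comparing to "Yes" gives TRUE/FALSE, which as.integer() turns into 1/0.
se_headache_num <- as.integer(se_headache_f == "Yes")
se_headache_num
sum(se_headache_num)
```

Keeping the original numeric column around, as suggested above, avoids the need for this kind of round trip.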

25.5.6 Group_by mutations

So far, we’ve created variables that tell us if our participants reported any side effects in a given year and if they reported all 3 side effects in a given year. The next logical question might be to ask if each participant experienced any side effect in any year. For that, we will need dplyr’s group_by() function. Before discussing group_by(), I’m going to show you the code I would use to accomplish this task:

drug_trial_sub %>% 
  rowwise() %>% 
  mutate(
    n_se_year   = sum(se_headache, se_diarrhea, se_dry_mouth),
    any_se_year = n_se_year > 0,
    all_se_year = n_se_year == 3
  ) %>% 
  group_by(id) %>% 
  mutate(any_se = sum(any_se_year) > 0)
## # A tibble: 60 × 9
## # Groups:   id [20]
##       id  year se_headache se_diarrhea se_dry_mouth n_se_year any_se_year
##    <int> <int>       <int>       <int>        <int>     <int> <lgl>      
##  1     1     0           0           1            1         2 TRUE       
##  2     1     1           1           1            1         3 TRUE       
##  3     1     2           1           1            0         2 TRUE       
##  4     2     0           1           1            1         3 TRUE       
##  5     2     1           0           0            1         1 TRUE       
##  6     2     2           1           1            1         3 TRUE       
##  7     3     0           0           0            0         0 FALSE      
##  8     3     1           0           0            0         0 FALSE      
##  9     3     2           0           0            0         0 FALSE      
## 10     4     0           0           0            0         0 FALSE      
## # … with 50 more rows, and 2 more variables: all_se_year <lgl>, any_se <lgl>

👆Here’s what we did above:

  • We created a new column in the drug_trial_sub data called any_se using the mutate() function. The any_se column is TRUE if the participant reported any side effect in any year and FALSE if they never reported a side effect in any year.

  • We first grouped the data by id using the group_by() function. Note that grouping the data by id with group_by() overrides grouping the data by row with rowwise() as soon as R gets to that point in the code. In other words, the data is grouped by row from rowwise() %>% to group_by(id) %>% and grouped by id after.

🗒Side Note: You can use dplyr::ungroup() to ungroup your data frames. This works regardless of whether you grouped them with rowwise() or group_by().
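As a small illustration of that side note, the sketch below groups a made-up tibble both ways and then removes the grouping. It relies on dplyr’s group_vars() and on the "rowwise_df" class that rowwise() adds; the object names are hypothetical.

```r
library(dplyr)

# Made-up data: two participants with two rows each.
toy <- tibble(id = c(1, 1, 2, 2), x = c(0, 1, 0, 0))

by_row <- toy %>% rowwise()
by_id  <- toy %>% group_by(id)

# Check how each data frame is grouped.
inherits(by_row, "rowwise_df")
group_vars(by_id)

# ungroup() removes either kind of grouping.
inherits(ungroup(by_row), "rowwise_df")
group_vars(ungroup(by_id))
```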

I already introduced group_by() in the chapter on numerical descriptions of categorical variables. I also said that group_by() operationalizes the Split - Apply - Combine strategy for data analysis. That means that we split our data frame up into smaller data frames, apply our calculation separately to each smaller data frame, and then combine those individual results back together as a single result.

So, in the example above, the drug_trial_sub data frame was split into twenty separate little data frames (i.e., one for each study id). Because there are 3 rows for each study id, each of these 20 little data frames had three rows.

Each of those 20 little data frames was then passed to the mutate() function. The name-value pair inside the mutate() function, any_se = sum(any_se_year) > 0, told R to add up all the values for the column any_se_year (i.e., sum(any_se_year)), compare that summed value to 0 (i.e., sum(any_se_year) > 0), and then assign TRUE to any_se if the summed value is greater than zero and FALSE otherwise. Then, all 20 of the little data frames were combined back together and returned to us as a single data frame. For comparison, here is the same code without group_by(id). The data frame is still grouped by row at that point, so sum(any_se_year) is calculated one row at a time and any_se just repeats the value of any_se_year:

drug_trial_sub %>% 
  rowwise() %>% 
  mutate(
    n_se_year   = sum(se_headache, se_diarrhea, se_dry_mouth),
    any_se_year = n_se_year > 0,
    all_se_year = n_se_year == 3
  ) %>% 
  mutate(any_se = sum(any_se_year) > 0)
## # A tibble: 60 × 9
## # Rowwise: 
##       id  year se_headache se_diarrhea se_dry_mouth n_se_year any_se_year
##    <int> <int>       <int>       <int>        <int>     <int> <lgl>      
##  1     1     0           0           1            1         2 TRUE       
##  2     1     1           1           1            1         3 TRUE       
##  3     1     2           1           1            0         2 TRUE       
##  4     2     0           1           1            1         3 TRUE       
##  5     2     1           0           0            1         1 TRUE       
##  6     2     2           1           1            1         3 TRUE       
##  7     3     0           0           0            0         0 FALSE      
##  8     3     1           0           0            0         0 FALSE      
##  9     3     2           0           0            0         0 FALSE      
## 10     4     0           0           0            0         0 FALSE      
## # … with 50 more rows, and 2 more variables: all_se_year <lgl>, any_se <lgl>

You may be wondering why I used the sum() function when the values for any_se_year are not numbers. The way R treats logical vectors can actually be pretty useful in situations like this. That is, when mathematical operations are applied to logical vectors, R treats FALSE as a 0 and TRUE as a 1. So, for participant 1, R calculated the value for any_se something like this:

any_se_year <- c(TRUE, TRUE, TRUE)
any_se_year
## [1] TRUE TRUE TRUE
sum_any_se_year <- sum(any_se_year)
sum_any_se_year
## [1] 3
any_se <- sum_any_se_year > 0
any_se
## [1] TRUE

R used the recycling rules to copy that result to the other two rows of data from participant 1. R then repeated that process for every other participant, and then returned the combined data frame to us.
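To make the Split - Apply - Combine idea concrete, the sketch below carries out the three steps by hand in base R on a made-up data frame with two participants. It only illustrates the logic; it is not a description of how dplyr works internally.

```r
# Made-up data: participant 1 reported a side effect once, participant 2 never.
toy <- data.frame(
  id          = c(1, 1, 1, 2, 2, 2),
  any_se_year = c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)
)

# Split: one small data frame per id.
pieces <- split(toy, toy$id)

# Apply: compute any_se within each piece. The single TRUE/FALSE
# result is recycled to every row of that piece.
pieces <- lapply(pieces, function(piece) {
  piece$any_se <- sum(piece$any_se_year) > 0
  piece
})

# Combine: stack the pieces back into one data frame.
combined <- do.call(rbind, pieces)
combined$any_se
```

Every row for participant 1 ends up with the same any_se value, even though only one of that participant’s rows had any_se_year equal to TRUE.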

I hope you found the example above useful. I think it’s fairly representative of the kinds of data management stuff I tend to do on a day-to-day basis. Of course, missing data always complicates things (more to come on that!). In the next chapter, we will round out our introduction to the basics of data management by learning how to subset rows and columns of a data frame.