26 Subsetting data frames

Subsetting data frames is another one of the most common data management tasks I carry out in my data analysis projects. Subsetting data frames just refers to the process of deciding which columns and rows to keep in your data frame and which to drop.

For example, I may need to subset the rows of a data frame because I’m interested in understanding a subpopulation in my sample. Below, we only want to analyze the rows that correspond to participants from Texas.

Or, perhaps I’m only interested in a subset of the statistics returned to me in a data frame of analysis results. Below, I only want to view and present the variable name, variable category, count, and percent.

Fortunately, the dplyr package includes functions that make it really easy for us to subset our data frames – even in some fairly complicated ways. Let’s start by simulating the same drug trial data we simulated in the last chapter and use it to work through some examples.

# Load dplyr
library(dplyr)
set.seed(123)

drug_trial <- tibble(
  # Follow-up year, 0 = baseline, 1 = year one, 2 = year two.
  year = rep(0:2, times = 20),
  # Participant age at baseline. Must be between the ages of 35 and 75 at 
  # baseline to be eligible for the study
  age = sample(35:75, 20, TRUE) %>% rep(each = 3),
  # Drug the participant received, Placebo or active
  drug = sample(c("Placebo", "Active"), 20, TRUE) %>% 
    rep(each = 3),
  # Reported headaches side effect, Y/N
  se_headache = if_else(
    drug == "Placebo", 
    sample(0:1, 60, TRUE, c(.95,.05)), 
    sample(0:1, 60, TRUE, c(.10, .90))
  ),
  # Reported diarrhea side effect, Y/N
  se_diarrhea = if_else(
    drug == "Placebo", 
    sample(0:1, 60, TRUE, c(.98,.02)), 
    sample(0:1, 60, TRUE, c(.20, .80))
  ),
  # Reported dry mouth side effect, Y/N
  se_dry_mouth = if_else(
    drug == "Placebo", 
    sample(0:1, 60, TRUE, c(.97,.03)), 
    sample(0:1, 60, TRUE, c(.30, .70))
  ),
  # Participant had myocardial infarction in study year, Y/N
  mi = if_else(
    drug == "Placebo", 
    sample(0:1, 60, TRUE, c(.85, .15)), 
    sample(0:1, 60, TRUE, c(.80, .20))
  )
)

As a reminder, we are simulating some drug trial data that includes the following variables:

  • id: Study id, there are 20 people enrolled in the trial.

  • year: Follow-up year, 0 = baseline, 1 = year one, 2 = year two.

  • age: Participant age at baseline. Must be between the ages of 35 and 75 at baseline to be eligible for the study.

  • drug: Drug the participant received, Placebo or active.

  • se_headache: Reported headaches side effect, Y/N.

  • se_diarrhea: Reported diarrhea side effect, Y/N.

  • se_dry_mouth: Reported dry mouth side effect, Y/N.

  • mi: Participant had myocardial infarction in study year, Y/N.

Actually, this data is slightly different than the data we used in the last chapter. Did you catch the difference? Take another look:

drug_trial
## # A tibble: 60 × 7
##     year   age drug    se_headache se_diarrhea se_dry_mouth    mi
##    <int> <int> <chr>         <int>       <int>        <int> <int>
##  1     0    65 Active            0           1            1     0
##  2     1    65 Active            1           1            1     0
##  3     2    65 Active            1           1            0     0
##  4     0    49 Active            1           1            1     0
##  5     1    49 Active            0           0            1     0
##  6     2    49 Active            1           1            1     0
##  7     0    48 Placebo           0           0            0     0
##  8     1    48 Placebo           0           0            0     0
##  9     2    48 Placebo           0           0            0     0
## 10     0    37 Placebo           0           0            0     0
## # … with 50 more rows

We forgot to put a study id in our data. Because we simulated this data above, the best way to fix this oversight is to make the necessary change to the simulation code above. But, let’s pretend that someone sent us this data instead, and we have to add a new study id column to it. Well, we now know how to use the mutate() function to add columns to our data frame. We can do so like this:

drug_trial <- drug_trial %>% 
  mutate(
    # Study id, there are 20 people enrolled in the trial.
    id = rep(1:20, each = 3)
  ) %>% 
  print()
## # A tibble: 60 × 8
##     year   age drug    se_headache se_diarrhea se_dry_mouth    mi    id
##    <int> <int> <chr>         <int>       <int>        <int> <int> <int>
##  1     0    65 Active            0           1            1     0     1
##  2     1    65 Active            1           1            1     0     1
##  3     2    65 Active            1           1            0     0     1
##  4     0    49 Active            1           1            1     0     2
##  5     1    49 Active            0           0            1     0     2
##  6     2    49 Active            1           1            1     0     2
##  7     0    48 Placebo           0           0            0     0     3
##  8     1    48 Placebo           0           0            0     0     3
##  9     2    48 Placebo           0           0            0     0     3
## 10     0    37 Placebo           0           0            0     0     4
## # … with 50 more rows

And now we have the study id in our data. But, by default R adds new columns as the rightmost column of the data frame. In terms of analysis, it doesn’t really matter where this column is located in our data. R couldn’t care less. However, when humans look at this data, they typically expect the study id (or some other identifier) to be the first column in the data frame. That is a job for select().

26.1 The select() function

drug_trial %>% 
  select(id, year, age, se_headache, se_diarrhea, se_dry_mouth, mi)
## # A tibble: 60 × 7
##       id  year   age se_headache se_diarrhea se_dry_mouth    mi
##    <int> <int> <int>       <int>       <int>        <int> <int>
##  1     1     0    65           0           1            1     0
##  2     1     1    65           1           1            1     0
##  3     1     2    65           1           1            0     0
##  4     2     0    49           1           1            1     0
##  5     2     1    49           0           0            1     0
##  6     2     2    49           1           1            1     0
##  7     3     0    48           0           0            0     0
##  8     3     1    48           0           0            0     0
##  9     3     2    48           0           0            0     0
## 10     4     0    37           0           0            0     0
## # … with 50 more rows

👆Here’s what we did above:

  • We used the select() function to change the order of the columns in the drug_trial data frame so that id would be the first variable in the data frame when reading from left to right.

  • You can type ?select into your R console to view the help documentation for this function and follow along with the explanation below.

  • The first argument to the select() function is .data. The value passed to .data should always be a data frame. In this book, we will often pass data frames to the .data argument using the pipe operator (e.g., df %>% select()).

  • The second argument to the select() function is .... The value passed to the ... argument should be column names or expressions that return column positions. We’ll dive deeper into this soon.

More generally, the select() function tells R which variables in your data frame to keep (or drop) and in what order.

The code above gave us the result we wanted. 👏 But, it can be tedious and error prone to manually type every variable name inside the select() function. Did you notice that I forgot the drug column “by accident?”

Thankfully, the select() function is one of several dplyr functions that accept tidy-select argument modifiers (i.e., functions and operators). In this chapter, I will show you some of the tidy-select argument modifiers I regularly use, but you can always type ?dplyr_tidy_select into your console to see a complete list.
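Two other tidy-select helpers I reach for often are contains(), which matches a literal string anywhere in a column name, and where(), which selects columns whose values satisfy a predicate function. Here is a minimal sketch using a small made-up data frame (the column names here are hypothetical, not from our drug trial data):

```r
library(dplyr)

# A tiny, made-up data frame just for illustration
df <- tibble(
  id        = 1:3,
  se_nausea = c(0, 1, 0),
  se_rash   = c(1, 0, 0),
  notes     = c("a", "b", "c")
)

# contains() keeps columns whose names contain a literal string
df %>% select(contains("se_"))

# where() keeps columns whose values satisfy a predicate function
df %>% select(where(is.character))
```

The first select() returns se_nausea and se_rash; the second returns only the character column, notes.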

In our little example above, we could have used the tidy-select everything() function to make our code easier to write, and we wouldn’t have accidentally missed the drug column. We can do so like this:

drug_trial <- drug_trial %>% 
  select(id, everything()) %>% 
  print()
## # A tibble: 60 × 8
##       id  year   age drug    se_headache se_diarrhea se_dry_mouth    mi
##    <int> <int> <int> <chr>         <int>       <int>        <int> <int>
##  1     1     0    65 Active            0           1            1     0
##  2     1     1    65 Active            1           1            1     0
##  3     1     2    65 Active            1           1            0     0
##  4     2     0    49 Active            1           1            1     0
##  5     2     1    49 Active            0           0            1     0
##  6     2     2    49 Active            1           1            1     0
##  7     3     0    48 Placebo           0           0            0     0
##  8     3     1    48 Placebo           0           0            0     0
##  9     3     2    48 Placebo           0           0            0     0
## 10     4     0    37 Placebo           0           0            0     0
## # … with 50 more rows

👆Here’s what we did above:

  • We used the select() function to change the order of the columns in the drug_trial data frame so that id would be the first variable in the data frame when reading from left to right.

  • Rather than explicitly typing the other column names, we used the everything() tidy-select function. As you may have guessed, everything() tells R to apply the operation (in this case, keep) to all the other variables not explicitly mentioned.

For our next example, let’s go ahead and add our mean-centered age variable to our drug_trial data again. We did this for the first time in the last chapter, in case you missed it.

drug_trial <- drug_trial %>% 
  mutate(age_center = age - mean(age)) %>% 
  print()
## # A tibble: 60 × 9
##       id  year   age drug  se_headache se_diarrhea se_dry_mouth    mi age_center
##    <int> <int> <int> <chr>       <int>       <int>        <int> <int>      <dbl>
##  1     1     0    65 Acti…           0           1            1     0       11.3
##  2     1     1    65 Acti…           1           1            1     0       11.3
##  3     1     2    65 Acti…           1           1            0     0       11.3
##  4     2     0    49 Acti…           1           1            1     0       -4.7
##  5     2     1    49 Acti…           0           0            1     0       -4.7
##  6     2     2    49 Acti…           1           1            1     0       -4.7
##  7     3     0    48 Plac…           0           0            0     0       -5.7
##  8     3     1    48 Plac…           0           0            0     0       -5.7
##  9     3     2    48 Plac…           0           0            0     0       -5.7
## 10     4     0    37 Plac…           0           0            0     0      -16.7
## # … with 50 more rows

One way I will often use select() is for performing quick little data checks. For example, let’s say that I wanted to make sure the code I wrote above actually did what I intended it to do. If I print the entire data frame to the screen, age and age_center aren’t directly side-by-side, and there’s a lot of other visual clutter from the other variables. In a case like this, I would use select() to get a clearer picture:

drug_trial %>% 
  select(age, age_center)
## # A tibble: 60 × 2
##      age age_center
##    <int>      <dbl>
##  1    65       11.3
##  2    65       11.3
##  3    65       11.3
##  4    49       -4.7
##  5    49       -4.7
##  6    49       -4.7
##  7    48       -5.7
##  8    48       -5.7
##  9    48       -5.7
## 10    37      -16.7
## # … with 50 more rows

👆Here’s what we did above:

  • We used the select() function to view the age and age_center columns only.

  • We can type individual column names, separated by commas, into select() to return a data frame containing only those columns, and in that order.

⚠️Warning: Notice that I didn’t assign our result above to anything (i.e., there’s no drug_trial <-). If I had done so, the drug_trial data would have contained these two columns only. I didn’t want to drop the other columns. I could have assigned the result of the code to a different R object (e.g., check_age <-), but it wasn’t really necessary. I just wanted to quickly view age and age_center side-by-side for data checking purposes. When I’m satisfied that I coded it correctly, I can move on. There’s no need to save those results to an R object.

You may also recall that we wanted to subset the drug_trial data to include only the columns we needed for the rowwise demonstrations. Here is the code we used to do so:

drug_trial %>% 
  select(id, year, starts_with("se"))
## # A tibble: 60 × 5
##       id  year se_headache se_diarrhea se_dry_mouth
##    <int> <int>       <int>       <int>        <int>
##  1     1     0           0           1            1
##  2     1     1           1           1            1
##  3     1     2           1           1            0
##  4     2     0           1           1            1
##  5     2     1           0           0            1
##  6     2     2           1           1            1
##  7     3     0           0           0            0
##  8     3     1           0           0            0
##  9     3     2           0           0            0
## 10     4     0           0           0            0
## # … with 50 more rows

👆Here’s what we did above:

  • We used the select() function to view the id, year, se_headache, se_diarrhea, and se_dry_mouth columns only.

  • We used the tidy-select starts_with() function to select all the side effect variables.

We already know that we can use everything() to select all of the other variables in a data frame, but what if we just want to grab a range or group of other variables in a data frame? tidy-select makes it easy for us. Above, we used the starts_with() function to select all the columns with names that literally start with the letters “se.” Because all of the side effect columns are directly next to each other (i.e., no columns in between them) we could have also used the colon operator : like this:

drug_trial %>% 
  select(id, year, se_headache:se_dry_mouth)
## # A tibble: 60 × 5
##       id  year se_headache se_diarrhea se_dry_mouth
##    <int> <int>       <int>       <int>        <int>
##  1     1     0           0           1            1
##  2     1     1           1           1            1
##  3     1     2           1           1            0
##  4     2     0           1           1            1
##  5     2     1           0           0            1
##  6     2     2           1           1            1
##  7     3     0           0           0            0
##  8     3     1           0           0            0
##  9     3     2           0           0            0
## 10     4     0           0           0            0
## # … with 50 more rows

While either method gets us the same result, I tend to prefer using starts_with() when possible. I think it makes your code easier to read (i.e., “Oh, he’s selecting all the side effect columns here.”).

In addition to starts_with(), there is also an ends_with() tidy-select function that can also be useful. For example, we’ve named factors with the _f naming convention throughout the book. We could use that, along with the ends_with() function, to create a subset of our data that includes only the factor versions of our side effects columns.

# Add the side effect factor columns to our data frame again...
yn_levs <- c(0, 1)
yn_labs <- c("No", "Yes")

drug_trial <- drug_trial %>% 
  mutate(
    se_headache_f  = factor(se_headache, yn_levs, yn_labs),
    se_diarrhea_f  = factor(se_diarrhea, yn_levs, yn_labs),
    se_dry_mouth_f = factor(se_dry_mouth, yn_levs, yn_labs)
  )
drug_trial %>% 
  select(id, year, ends_with("_f"))
## # A tibble: 60 × 5
##       id  year se_headache_f se_diarrhea_f se_dry_mouth_f
##    <int> <int> <fct>         <fct>         <fct>         
##  1     1     0 No            Yes           Yes           
##  2     1     1 Yes           Yes           Yes           
##  3     1     2 Yes           Yes           No            
##  4     2     0 Yes           Yes           Yes           
##  5     2     1 No            No            Yes           
##  6     2     2 Yes           Yes           Yes           
##  7     3     0 No            No            No            
##  8     3     1 No            No            No            
##  9     3     2 No            No            No            
## 10     4     0 No            No            No            
## # … with 50 more rows

🗒Side Note: Variable names are important! Throughout this book, I’ve tried to repeatedly emphasize the importance of coding style – including the way we name our R objects. Many people who are new to data management and analysis (and some who aren’t, MDL) don’t fully appreciate the importance of such things. I hope that the preceding two examples are helping you to see why the little details, like variable names, are important. Using consistent variable naming conventions, for example, allows us to write code that requires less typing, is easier for humans to skim and understand, and is less prone to typos and other related errors.

We can also select columns we want to keep by position instead of name. I don’t do this often. I think it’s generally better to use column names or tidy-select argument modifiers when subsetting columns in your data frame. However, I do sometimes select columns by position when I’m writing my own functions. Therefore, I want to quickly show you what this looks like:

drug_trial %>% 
  select(1:2, 4)
## # A tibble: 60 × 3
##       id  year drug   
##    <int> <int> <chr>  
##  1     1     0 Active 
##  2     1     1 Active 
##  3     1     2 Active 
##  4     2     0 Active 
##  5     2     1 Active 
##  6     2     2 Active 
##  7     3     0 Placebo
##  8     3     1 Placebo
##  9     3     2 Placebo
## 10     4     0 Placebo
## # … with 50 more rows

👆Here’s what we did above:

  • We passed column numbers to the select() function to keep the 1st, 2nd, and 4th columns from our drug_trial data frame.

Finally, in addition to using select() to keep columns in our data frame, we can also use select() to explicitly drop columns from our data frame. To do so, we just need to use either the subtraction symbol (-) or the Not operator (!).

Think back to our example from the previous chapter. There we created some new variables that captured information about participants reporting any and all side effects. During that process we created a column that contained a count of the side effects experienced in each year – n_se_year.

drug_trial_sub <- drug_trial %>% 
  rowwise() %>% 
  mutate(
    n_se_year   = sum(se_headache, se_diarrhea, se_dry_mouth),
    any_se_year = n_se_year > 0,
    all_se_year = n_se_year == 3
  ) %>% 
  group_by(id) %>% 
  mutate(any_se = sum(any_se_year) > 0) %>% 
  ungroup() %>% 
  select(id:year, n_se_year:any_se) %>% 
  print()
## # A tibble: 60 × 6
##       id  year n_se_year any_se_year all_se_year any_se
##    <int> <int>     <int> <lgl>       <lgl>       <lgl> 
##  1     1     0         2 TRUE        FALSE       TRUE  
##  2     1     1         3 TRUE        TRUE        TRUE  
##  3     1     2         2 TRUE        FALSE       TRUE  
##  4     2     0         3 TRUE        TRUE        TRUE  
##  5     2     1         1 TRUE        FALSE       TRUE  
##  6     2     2         3 TRUE        TRUE        TRUE  
##  7     3     0         0 FALSE       FALSE       FALSE 
##  8     3     1         0 FALSE       FALSE       FALSE 
##  9     3     2         0 FALSE       FALSE       FALSE 
## 10     4     0         0 FALSE       FALSE       FALSE 
## # … with 50 more rows

Let’s say we decided we don’t need the n_se_year column now that we’ve created any_se_year, all_se_year, and any_se. We can easily drop it from the data frame in a couple of ways:

drug_trial_sub %>% 
  select(-n_se_year)
## # A tibble: 60 × 5
##       id  year any_se_year all_se_year any_se
##    <int> <int> <lgl>       <lgl>       <lgl> 
##  1     1     0 TRUE        FALSE       TRUE  
##  2     1     1 TRUE        TRUE        TRUE  
##  3     1     2 TRUE        FALSE       TRUE  
##  4     2     0 TRUE        TRUE        TRUE  
##  5     2     1 TRUE        FALSE       TRUE  
##  6     2     2 TRUE        TRUE        TRUE  
##  7     3     0 FALSE       FALSE       FALSE 
##  8     3     1 FALSE       FALSE       FALSE 
##  9     3     2 FALSE       FALSE       FALSE 
## 10     4     0 FALSE       FALSE       FALSE 
## # … with 50 more rows
drug_trial_sub %>% 
  select(!n_se_year)
## # A tibble: 60 × 5
##       id  year any_se_year all_se_year any_se
##    <int> <int> <lgl>       <lgl>       <lgl> 
##  1     1     0 TRUE        FALSE       TRUE  
##  2     1     1 TRUE        TRUE        TRUE  
##  3     1     2 TRUE        FALSE       TRUE  
##  4     2     0 TRUE        TRUE        TRUE  
##  5     2     1 TRUE        FALSE       TRUE  
##  6     2     2 TRUE        TRUE        TRUE  
##  7     3     0 FALSE       FALSE       FALSE 
##  8     3     1 FALSE       FALSE       FALSE 
##  9     3     2 FALSE       FALSE       FALSE 
## 10     4     0 FALSE       FALSE       FALSE 
## # … with 50 more rows

Note that we could have also dropped it indirectly by selecting everything else:

drug_trial_sub %>% 
  select(id:year, any_se_year:any_se)
## # A tibble: 60 × 5
##       id  year any_se_year all_se_year any_se
##    <int> <int> <lgl>       <lgl>       <lgl> 
##  1     1     0 TRUE        FALSE       TRUE  
##  2     1     1 TRUE        TRUE        TRUE  
##  3     1     2 TRUE        FALSE       TRUE  
##  4     2     0 TRUE        TRUE        TRUE  
##  5     2     1 TRUE        FALSE       TRUE  
##  6     2     2 TRUE        TRUE        TRUE  
##  7     3     0 FALSE       FALSE       FALSE 
##  8     3     1 FALSE       FALSE       FALSE 
##  9     3     2 FALSE       FALSE       FALSE 
## 10     4     0 FALSE       FALSE       FALSE 
## # … with 50 more rows

But, I think this is generally a bad idea. Not only is it more typing, but skimming through your code later doesn’t really tell me (or future you) what you were trying to accomplish there.

26.2 The rename() function

Sometimes, we want to change the names of some, or all, of the columns in our data frame. For me, this most commonly comes up with data I’ve imported from someone else. For example, let’s say I’m importing data that uses column names that aren’t super informative. We saw column names like that when we imported NHANES data. It looked something like this:

nhanes <- tibble(
  SEQN = c(1:4),
  ALQ101 = c(1, 2, 1, 2),
  ALQ110 = c(2, 2, 2, 1)
) %>% 
  print()
## # A tibble: 4 × 3
##    SEQN ALQ101 ALQ110
##   <int>  <dbl>  <dbl>
## 1     1      1      2
## 2     2      2      2
## 3     3      1      2
## 4     4      2      1

We previously learned how to change these column names on import (i.e., col_names), but let’s say we didn’t do that for whatever reason. We can rename columns in our data frame using the rename() function like so:

nhanes %>% 
  rename(
    id = SEQN,
    drinks_12_year = ALQ101,
    drinks_12_life = ALQ110
  )
## # A tibble: 4 × 3
##      id drinks_12_year drinks_12_life
##   <int>          <dbl>          <dbl>
## 1     1              1              2
## 2     2              2              2
## 3     3              1              2
## 4     4              2              1

👆Here’s what we did above:

  • We used the rename() function to change the name of each column in the nhanes data frame to be more informative.

  • You can type ?rename into your R console to view the help documentation for this function and follow along with the explanation below.

  • The first argument to the rename() function is .data. The value passed to .data should always be a data frame. In this book, we will often pass data frames to the .data argument using the pipe operator (e.g., df %>% rename()).

  • The second argument to the rename() function is .... The value passed to the ... argument should be a name-value pair, or a series of name-value pairs separated by commas. The name-value pairs should be in the format new_name = original_name.

I think these names are much better, but for the sake of argument let’s say that we wanted to keep the original names – just coerce them to lowercase. We can do that using the rename_with() variation of the rename() function in combination with the tolower() function:

nhanes %>% 
  rename_with(tolower)
## # A tibble: 4 × 3
##    seqn alq101 alq110
##   <int>  <dbl>  <dbl>
## 1     1      1      2
## 2     2      2      2
## 3     3      1      2
## 4     4      2      1

👆Here’s what we did above:

  • We used the rename_with() function to coerce all column names in the nhanes data frame to lowercase.

  • You can type ?rename_with into your R console to view the help documentation for this function and follow along with the explanation below.

  • The first argument to the rename_with() function is .data. The value passed to .data should always be a data frame. In this book, we will often pass data frames to the .data argument using the pipe operator (e.g., df %>% rename_with()).

  • The second argument to the rename_with() function is .fn. The value passed to the .fn argument should be a function that you want to apply to all the columns selected in the .cols argument (see below).

  • The third argument to the rename_with() function is .cols. The value passed to the .cols argument should be the columns you want to apply the function passed to the .fn argument to. You can select the columns using tidy-select argument modifiers.
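To see the .cols argument in action, here is a short sketch that rebuilds the small nhanes data frame from above and lowercases only the columns that start with “ALQ”, leaving SEQN untouched:

```r
library(dplyr)

# Rebuild the small NHANES-style data frame from above
nhanes <- tibble(
  SEQN   = 1:4,
  ALQ101 = c(1, 2, 1, 2),
  ALQ110 = c(2, 2, 2, 1)
)

# Apply tolower only to the columns selected by starts_with("ALQ")
nhanes %>% 
  rename_with(tolower, .cols = starts_with("ALQ"))
```

The result keeps SEQN as-is but renames ALQ101 and ALQ110 to alq101 and alq110.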

26.3 The filter() function

We just saw how to keep and drop columns in our data frame using the select() function. We can keep and drop rows in our data frame using the filter() function or the slice() function.

Recall that we can select columns by position instead of name:

drug_trial %>% 
  select(1:2, 4)
## # A tibble: 60 × 3
##       id  year drug   
##    <int> <int> <chr>  
##  1     1     0 Active 
##  2     1     1 Active 
##  3     1     2 Active 
##  4     2     0 Active 
##  5     2     1 Active 
##  6     2     2 Active 
##  7     3     0 Placebo
##  8     3     1 Placebo
##  9     3     2 Placebo
## 10     4     0 Placebo
## # … with 50 more rows

We can also select rows we want to keep by position. Again, I don’t do this often, but it is sometimes useful when I’m writing my own functions. Therefore, I want to quickly show you what this looks like:

drug_trial %>% 
  slice(1:5)
## # A tibble: 5 × 12
##      id  year   age drug   se_headache se_diarrhea se_dry_mouth    mi age_center
##   <int> <int> <int> <chr>        <int>       <int>        <int> <int>      <dbl>
## 1     1     0    65 Active           0           1            1     0       11.3
## 2     1     1    65 Active           1           1            1     0       11.3
## 3     1     2    65 Active           1           1            0     0       11.3
## 4     2     0    49 Active           1           1            1     0       -4.7
## 5     2     1    49 Active           0           0            1     0       -4.7
## # … with 3 more variables: se_headache_f <fct>, se_diarrhea_f <fct>,
## #   se_dry_mouth_f <fct>

👆Here’s what we did above:

  • We used the slice() function to keep only the first 5 rows in the drug_trial data frame.

  • You can type ?slice into your R console to view the help documentation for this function and follow along with the explanation below.

  • The first argument to the slice() function is .data. The value passed to .data should always be a data frame. In this book, we will often pass data frames to the .data argument using the pipe operator (e.g., df %>% slice()).

  • The second argument to the slice() function is .... The value passed to the ... argument should be the row numbers you want returned to you.

Generally speaking, I’m far more likely to use the filter() function to select only a subset of rows from my data frame. Two of the most common scenarios, of many possible scenarios, where I want to subset rows include:

  • Performing a subgroup analysis. This is a situation where I want my analysis to include only some of the people (or places, or things) in my data frame.

  • Performing a complete case analysis. This is a situation where I want to remove rows that contain missing values from my data frame before performing an analysis.
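As a quick preview of the second scenario, a complete case analysis on a single column usually boils down to filtering out the rows where that column is NA. Here is a minimal sketch using a made-up data frame (the column names are hypothetical):

```r
library(dplyr)

# A made-up data frame with one missing age value
df <- tibble(
  id  = 1:4,
  age = c(35, NA, 52, 60)
)

# Keep only the rows with a non-missing age
df %>% 
  filter(!is.na(age))
```

The row with the missing age (id 2) is dropped, leaving three complete rows for analysis.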

26.3.1 Subgroup analysis

Let’s say that I want to count the number of people in the drug trial who reported having headaches in the baseline year by drug status (active vs. placebo). We would first use filter() to keep only the rows that contain data from the baseline year:

drug_trial %>% 
  filter(year == 0)
## # A tibble: 20 × 12
##       id  year   age drug  se_headache se_diarrhea se_dry_mouth    mi age_center
##    <int> <int> <int> <chr>       <int>       <int>        <int> <int>      <dbl>
##  1     1     0    65 Acti…           0           1            1     0     11.3  
##  2     2     0    49 Acti…           1           1            1     0     -4.7  
##  3     3     0    48 Plac…           0           0            0     0     -5.7  
##  4     4     0    37 Plac…           0           0            0     0    -16.7  
##  5     5     0    71 Plac…           0           0            0     0     17.3  
##  6     6     0    48 Plac…           0           0            0     0     -5.7  
##  7     7     0    59 Acti…           1           1            1     0      5.3  
##  8     8     0    60 Plac…           0           0            0     0      6.3  
##  9     9     0    61 Acti…           1           1            1     0      7.3  
## 10    10     0    39 Acti…           1           0            1     0    -14.7  
## 11    11     0    61 Plac…           0           0            0     0      7.3  
## 12    12     0    62 Plac…           1           0            1     0      8.3  
## 13    13     0    43 Plac…           0           0            0     0    -10.7  
## 14    14     0    63 Plac…           0           0            0     0      9.3  
## 15    15     0    69 Acti…           1           1            1     0     15.3  
## 16    16     0    42 Plac…           0           0            0     0    -11.7  
## 17    17     0    60 Plac…           0           0            0     0      6.3  
## 18    18     0    41 Acti…           1           1            1     0    -12.7  
## 19    19     0    43 Plac…           0           0            0     0    -10.7  
## 20    20     0    53 Plac…           0           0            0     0     -0.700
## # … with 3 more variables: se_headache_f <fct>, se_diarrhea_f <fct>,
## #   se_dry_mouth_f <fct>

👆Here’s what we did above:

  • We used the filter() function to keep only the rows in the drug_trial data frame that contain data from the baseline year.

  • You can type ?filter into your R console to view the help documentation for this function and follow along with the explanation below.

  • The first argument to the filter() function is .data. The value passed to .data should always be a data frame. In this book, we will often pass data frames to the .data argument using the pipe operator (e.g., df %>% filter()).

  • The second argument to the filter() function is .... The value passed to the ... argument should be an expression, or multiple expressions separated by commas, that return logical values (TRUE or FALSE). The ... argument is where you will tell filter() how to decide which rows to keep.

⚠️Warning: Remember that in the R language = (i.e., one equal sign) and == (i.e., two equal signs) are different things. The = operator tells R to make the thing on the left equal to the thing on the right. In other words, it assigns values. The == operator asks R if the thing on the left is equal to the thing on the right. In other words, it tests the equality of values.
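A tiny base R illustration of the difference:

```r
x <- 1   # assignment: x now holds the value 1
x == 1   # equality test: TRUE
x == 2   # equality test: FALSE
```

If you accidentally type filter(year = 0) instead of filter(year == 0), dplyr will stop with an error rather than silently filter the wrong rows.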

Now, we can use the descriptive analysis techniques we’ve already learned to answer our research question:

drug_trial %>% 
  filter(year == 0) %>% 
  group_by(drug, se_headache_f) %>% 
  summarise(n = n())
## # A tibble: 4 × 3
## # Groups:   drug [2]
##   drug    se_headache_f     n
##   <chr>   <fct>         <int>
## 1 Active  No                1
## 2 Active  Yes               6
## 3 Placebo No               12
## 4 Placebo Yes               1

So, 6 out of 7 (~ 86%) of the people in our active drug group reported headaches in the baseline year. Now, let’s say that we have reason to suspect that the drug affects people differently based on their age. Let’s go ahead and repeat this analysis, but only in a subgroup of people who are below age 65. Again, we can use the filter() function to do this:

drug_trial %>% 
  filter(year == 0) %>% 
  filter(age < 65) %>% 
  group_by(drug, se_headache_f) %>% 
  summarise(n = n())
## # A tibble: 3 × 3
## # Groups:   drug [2]
##   drug    se_headache_f     n
##   <chr>   <fct>         <int>
## 1 Active  Yes               5
## 2 Placebo No               11
## 3 Placebo Yes               1

Wow! It looks like everyone under age 65 who received active drug also reported headaches!

We can show this more explicitly by passing the value FALSE to the .drop argument of group_by(). This tells R to keep all factor levels in the output, even levels that were observed zero times in the data.

drug_trial %>% 
  filter(year == 0) %>% 
  filter(age < 65) %>% 
  group_by(drug, se_headache_f, .drop = FALSE) %>% 
  summarise(n = n())
## # A tibble: 4 × 3
## # Groups:   drug [2]
##   drug    se_headache_f     n
##   <chr>   <fct>         <int>
## 1 Active  No                0
## 2 Active  Yes               5
## 3 Placebo No               11
## 4 Placebo Yes               1

Finally, we could make our code above more succinct by combining our two filter() calls into one:

drug_trial %>% 
  filter(year == 0 & age < 65) %>% 
  group_by(drug, se_headache_f, .drop = FALSE) %>% 
  summarise(n = n())
## # A tibble: 4 × 3
## # Groups:   drug [2]
##   drug    se_headache_f     n
##   <chr>   <fct>         <int>
## 1 Active  No                0
## 2 Active  Yes               5
## 3 Placebo No               11
## 4 Placebo Yes               1

👆Here’s what we did above:

  • We used the filter() function to keep only the rows in the drug_trial data frame that contain data from the baseline year AND (&) contain data from rows with a value that is less than 65 in the age column. The AND (&) here is important. A row must satisfy both of these conditions in order for R to keep it in the returned data frame. If we had used OR instead (filter(year == 0 | age < 65)), then only one condition OR the other would need to be met for R to keep the row in the returned data frame.

🗒Side Note: In the R language, we use the vertical bar operator (|) to create OR conditions. On many keyboards, | shares a key with the backslash (\), near the enter/return key. Don't confuse it with dplyr's pipe operator (%>%), which is a different thing entirely.
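To make the AND/OR difference concrete, here is a small sketch using a hypothetical data frame (toy is not part of the drug trial data; it exists only to illustrate):

```r
library(dplyr)

toy <- tibble(year = c(0, 0, 1, 1), age = c(70, 50, 70, 50))

# AND (&): both conditions must be TRUE.
# Only the row with year 0 AND age 50 survives.
toy %>% filter(year == 0 & age < 65)

# OR (|): at least one condition must be TRUE.
# Three rows survive; only the row with year 1 and age 70 is dropped.
toy %>% filter(year == 0 | age < 65)
```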

26.3.2 Complete case analysis

Now let’s say that we want to compare age at baseline by drug status (active vs. placebo). Additionally, let’s say that we have some missing values in our data.

Let’s first simulate some new data with missing values:

drug_trial_short <- drug_trial %>%
  filter(year == 0) %>% 
  slice(1:10) %>% 
  mutate(
    age  = replace(age, 1, NA),
    drug = replace(drug, 4, NA)
  ) %>% 
  print()
## # A tibble: 10 × 12
##       id  year   age drug  se_headache se_diarrhea se_dry_mouth    mi age_center
##    <int> <int> <int> <chr>       <int>       <int>        <int> <int>      <dbl>
##  1     1     0    NA Acti…           0           1            1     0       11.3
##  2     2     0    49 Acti…           1           1            1     0       -4.7
##  3     3     0    48 Plac…           0           0            0     0       -5.7
##  4     4     0    37 <NA>            0           0            0     0      -16.7
##  5     5     0    71 Plac…           0           0            0     0       17.3
##  6     6     0    48 Plac…           0           0            0     0       -5.7
##  7     7     0    59 Acti…           1           1            1     0        5.3
##  8     8     0    60 Plac…           0           0            0     0        6.3
##  9     9     0    61 Acti…           1           1            1     0        7.3
## 10    10     0    39 Acti…           1           0            1     0      -14.7
## # … with 3 more variables: se_headache_f <fct>, se_diarrhea_f <fct>,
## #   se_dry_mouth_f <fct>

👆Here’s what we did above:

  • We used the filter() and slice() functions to create a new data frame containing a subset of the original drug_trial data frame: the first 10 of the baseline-year rows.

  • We used the replace() function to replace the first value of age with NA and the fourth value of drug with NA.

  • You can type ?replace into your R console to view the help documentation for this function.

If we try to answer our research question above without dealing with the missing data, we get the following undesirable results:

drug_trial_short %>% 
  group_by(drug) %>% 
  summarise(mean_age = mean(age))
## # A tibble: 3 × 2
##   drug    mean_age
##   <chr>      <dbl>
## 1 Active      NA  
## 2 Placebo     56.8
## 3 <NA>        37

One way we can improve our result is by setting the na.rm argument of the mean() function to TRUE.

drug_trial_short %>% 
  group_by(drug) %>% 
  summarise(mean_age = mean(age, na.rm = TRUE))
## # A tibble: 3 × 2
##   drug    mean_age
##   <chr>      <dbl>
## 1 Active      52  
## 2 Placebo     56.8
## 3 <NA>        37

But, we previously saw how it can sometimes be more efficient to drop the row with missing data from the data frame explicitly. This is called a complete case analysis or list-wise deletion.

drug_trial_short %>% 
  filter(!is.na(age)) %>% 
  group_by(drug) %>% 
  summarise(mean_age = mean(age))
## # A tibble: 3 × 2
##   drug    mean_age
##   <chr>      <dbl>
## 1 Active      52  
## 2 Placebo     56.8
## 3 <NA>        37

However, we still have that missing value for drug. We can easily drop the row with the missing value by adding an additional condition to the ... argument of our filter() function:

drug_trial_short %>% 
  filter(!is.na(age) & !is.na(drug)) %>% 
  group_by(drug) %>% 
  summarise(mean_age = mean(age))
## # A tibble: 2 × 2
##   drug    mean_age
##   <chr>      <dbl>
## 1 Active      52  
## 2 Placebo     56.8

26.4 Deduplication

Another common data management task that I want to discuss in this chapter is deduplicating data. Let’s go ahead and simulate some data to illustrate what I mean:

df <- tribble(
  ~id, ~day, ~x,
    1, 1, 1,
    1, 2, 11,
    2, 1, 12,
    2, 2, 13,
    2, 2, 14,
    3, 1, 12,
    3, 1, 12,
    3, 2, 13,
    4, 1, 13,
    5, 1, 10,
    5, 2, 11,
    5, 1, 10
) %>% 
  print()
## # A tibble: 12 × 3
##       id   day     x
##    <dbl> <dbl> <dbl>
##  1     1     1     1
##  2     1     2    11
##  3     2     1    12
##  4     2     2    13
##  5     2     2    14
##  6     3     1    12
##  7     3     1    12
##  8     3     2    13
##  9     4     1    13
## 10     5     1    10
## 11     5     2    11
## 12     5     1    10

👆A few features of this data frame to take note of:

  • All ids except 4 have multiple observations.

  • ID 2 has two rows with duplicate values for id and day, but different values for x. These rows are partial duplicates.

  • ID 3 has two rows with duplicate values for all three columns (i.e., 3, 1, 12). These rows are complete duplicates.

  • ID 5 also has two rows with duplicate values for all three columns (i.e., 5, 1, 10). These rows are complete duplicates. However, they are not in sequential order in the dataset.

26.4.1 The distinct() function

We can use dplyr’s distinct() function to remove all complete duplicates from the data frame:

df %>% 
  distinct()
## # A tibble: 10 × 3
##       id   day     x
##    <dbl> <dbl> <dbl>
##  1     1     1     1
##  2     1     2    11
##  3     2     1    12
##  4     2     2    13
##  5     2     2    14
##  6     3     1    12
##  7     3     2    13
##  8     4     1    13
##  9     5     1    10
## 10     5     2    11

👆Here’s what we did above:

  • We used the distinct() function to keep only one row from a group of complete duplicate rows in the df data frame.

  • You can type ?distinct into your R console to view the help documentation for this function and follow along with the explanation below.

  • The first argument to the distinct() function is .data. The value passed to .data should always be a data frame. In this book, we will often pass data frames to the .data argument using the pipe operator (e.g., df %>% distinct()).

  • The second argument to the distinct() function is .... The value passed to the ... argument should be the variables to use when determining uniqueness. Passing no variables to the ... argument is equivalent to passing all variables to the ... argument.

26.4.2 Complete duplicate rows - add tag

If we want to identify the complete duplicate rows without immediately dropping them, we can use the duplicated() function inside the mutate() function. This creates a new column in our data frame that has the value TRUE when the row is a complete duplicate and the value FALSE otherwise.

df %>% 
  mutate(dup = duplicated(df))
## # A tibble: 12 × 4
##       id   day     x dup  
##    <dbl> <dbl> <dbl> <lgl>
##  1     1     1     1 FALSE
##  2     1     2    11 FALSE
##  3     2     1    12 FALSE
##  4     2     2    13 FALSE
##  5     2     2    14 FALSE
##  6     3     1    12 FALSE
##  7     3     1    12 TRUE 
##  8     3     2    13 FALSE
##  9     4     1    13 FALSE
## 10     5     1    10 FALSE
## 11     5     2    11 FALSE
## 12     5     1    10 TRUE

Alternatively, we could get the same result using:

df %>% 
  group_by_all() %>% 
  mutate(
    n_row = row_number(),
    dup   = n_row > 1
  )
## # A tibble: 12 × 5
## # Groups:   id, day, x [10]
##       id   day     x n_row dup  
##    <dbl> <dbl> <dbl> <int> <lgl>
##  1     1     1     1     1 FALSE
##  2     1     2    11     1 FALSE
##  3     2     1    12     1 FALSE
##  4     2     2    13     1 FALSE
##  5     2     2    14     1 FALSE
##  6     3     1    12     1 FALSE
##  7     3     1    12     2 TRUE 
##  8     3     2    13     1 FALSE
##  9     4     1    13     1 FALSE
## 10     5     1    10     1 FALSE
## 11     5     2    11     1 FALSE
## 12     5     1    10     2 TRUE

👆Here’s what we did above:

  • We used the group_by_all() function to split our data frame into multiple data frames grouped by all the columns in df.

  • We used the row_number() function to sequentially count every row in each of the little data frames created by group_by_all(). We assigned the sequential count to a new column named n_row.

  • We created a new column named dup that has a value of TRUE when the value of n_row is greater than 1 and FALSE otherwise.

Notice that R only tags the second row in a set of duplicate rows as a duplicate. Below, we tag both rows in each set of complete duplicates by checking for duplicates from the first row down (duplicated(.)) and from the last row up (duplicated(., fromLast = TRUE)):

df %>% 
  mutate(dup = duplicated(.) | duplicated(., fromLast = TRUE))
## # A tibble: 12 × 4
##       id   day     x dup  
##    <dbl> <dbl> <dbl> <lgl>
##  1     1     1     1 FALSE
##  2     1     2    11 FALSE
##  3     2     1    12 FALSE
##  4     2     2    13 FALSE
##  5     2     2    14 FALSE
##  6     3     1    12 TRUE 
##  7     3     1    12 TRUE 
##  8     3     2    13 FALSE
##  9     4     1    13 FALSE
## 10     5     1    10 TRUE 
## 11     5     2    11 FALSE
## 12     5     1    10 TRUE

26.4.3 Partial duplicate rows

We can also use the distinct() function to remove partial duplicate rows. This time, we pass the columns that R should use to determine uniqueness to the ... argument:

df %>% 
  distinct(id, day, .keep_all = TRUE)
## # A tibble: 9 × 3
##      id   day     x
##   <dbl> <dbl> <dbl>
## 1     1     1     1
## 2     1     2    11
## 3     2     1    12
## 4     2     2    13
## 5     3     1    12
## 6     3     2    13
## 7     4     1    13
## 8     5     1    10
## 9     5     2    11

👆Here’s what we did above:

  • We used the distinct() function to keep only one row from a group of duplicate rows in the df data frame.

  • You can type ?distinct into your R console to view the help documentation for this function and follow along with the explanation below.

  • This time we passed the column names id and day to the ... argument. This tells R to consider any rows that have the same value of id AND day to be duplicates – even if they have different values in their other columns.

  • The .keep_all argument tells R to return all of the columns in df to us – not just the columns that we are testing for uniqueness (i.e., id and day).
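For comparison, here is a small sketch of what happens when we omit .keep_all, using just the two partially duplicated rows for ID 2 (id_2 is a hypothetical fragment, not part of the data frames above):

```r
library(dplyr)

id_2 <- tribble(
  ~id, ~day, ~x,
    2,    2, 13,
    2,    2, 14
)

# Without .keep_all, distinct() returns only the columns we tested
# for uniqueness -- id and day -- so x is dropped entirely.
id_2 %>% distinct(id, day)

# With .keep_all = TRUE, x is returned too. distinct() keeps the
# first row it encounters in each group, so x is 13 here.
id_2 %>% distinct(id, day, .keep_all = TRUE)
```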

26.4.4 Partial duplicate rows - add tag

We can tag partial duplicate rows in a similar fashion to the way we tagged complete duplicate rows above:

df %>% 
  group_by(id, day) %>% 
  mutate(
    count = row_number(), # Counts rows by group
    dup   = count > 1     # TRUE if there is more than one row per group
  )
## # A tibble: 12 × 5
## # Groups:   id, day [9]
##       id   day     x count dup  
##    <dbl> <dbl> <dbl> <int> <lgl>
##  1     1     1     1     1 FALSE
##  2     1     2    11     1 FALSE
##  3     2     1    12     1 FALSE
##  4     2     2    13     1 FALSE
##  5     2     2    14     2 TRUE 
##  6     3     1    12     1 FALSE
##  7     3     1    12     2 TRUE 
##  8     3     2    13     1 FALSE
##  9     4     1    13     1 FALSE
## 10     5     1    10     1 FALSE
## 11     5     2    11     1 FALSE
## 12     5     1    10     2 TRUE

26.4.5 Count the number of duplicates

Finally, sometimes it can be useful to get a count of the number of duplicate rows. The code below returns a data frame that summarizes the number of rows that contain duplicate values for id and day, and what those duplicate values are.

df %>% 
  group_by(id, day) %>% 
  filter(n() > 1) %>% 
  count()
## # A tibble: 3 × 3
## # Groups:   id, day [3]
##      id   day     n
##   <dbl> <dbl> <int>
## 1     2     2     2
## 2     3     1     2
## 3     5     1     2

26.4.6 What to do about duplicates

Finding duplicates is only half the battle. After finding them, you have to decide what to do about them. In some ways it’s hard to give clear-cut advice on this because different situations require different decisions. However, here are some things you may want to consider:

  • If two or more rows are complete duplicates, then the additional rows provide no additional information. I have a hard time thinking of a scenario where dropping them would be a problem. Additionally, because they are completely identical, it doesn’t matter which row you drop.

  • If you have two or more rows that are partial duplicates, then you will want to look for obvious errors in the other variables. When you have two rows that are partial duplicates, and one row has very obvious errors in it, then keeping the row without the obvious errors is usually the correct decision. Having said that, you should meticulously document which rows you dropped and why, and make that information known to anyone consuming the results of your analysis.

  • When there are no obvious errors, deciding which rows to keep and which to drop can be really tricky. In this situation the best advice I can give is to be systematic in your approach. What I mean by that is to choose a strategy that seems least likely to introduce bias into your data and then apply that strategy consistently throughout your data. So, something like always keeping the first row among a group of duplicate rows. However, keep in mind that if rows are ordered by date, this strategy could easily introduce bias. In that case, some other strategy may be more appropriate. And again, you should meticulously document which rows you dropped and why, and make that information known to anyone consuming the results of your analysis.

  • Finally, I can definitively tell you a strategy that you should never use. That is, you should never pick and choose, or even give the appearance of picking and choosing, rows with values that are aligned with the results you want to see. I hope the unethical nature of this strategy is blatantly obvious to you.
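As a concrete sketch of the "always keep the first row" strategy applied to the df data frame from above (the arrange() call is my addition, to make "first" explicit and reproducible):

```r
# Sort so that "first" is well-defined, then keep only the first
# row within each id/day group.
df %>% 
  arrange(id, day) %>% 
  group_by(id, day) %>% 
  filter(row_number() == 1) %>% 
  ungroup()
```

For this particular df, this returns the same nine rows as distinct(id, day, .keep_all = TRUE), but it states the keep-first rule explicitly, which makes the decision easier to document.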

Congratulations! 🎉 At this point, you are well-versed in all of the dplyr verbs. More importantly, you now have a foundation of tools you can call upon to complete many of the basic data management tasks that you will encounter. In the rest of the data management part of the book, we will build on these tools, and learn some new ones, to solve more complex data management problems.