27 Subsetting Data Frames
Subsetting data frames is another one of the most common data management tasks I carry out in my data analysis projects. Subsetting data frames just refers to the process of deciding which columns and rows to keep in your data frame and which to drop.
For example, I may need to subset the rows of a data frame because I’m interested in understanding a subpopulation in my sample. Below, we only want to analyze the rows that correspond to participants from Texas.
Or, perhaps I’m only interested in a subset of the statistics returned to me in a data frame of analysis results. Below, I only want to view and present the variable name, variable category, count, and percent.
Fortunately, the dplyr package includes functions that make it really easy for us to subset our data frames – even in some fairly complicated ways. Let's start by simulating the same drug trial data we simulated in the last chapter and use it to work through some examples.
set.seed(123)
drug_trial <- tibble(
# Follow-up year, 0 = baseline, 1 = year one, 2 = year two.
year = rep(0:2, times = 20),
# Participant age at baseline. Must be between the ages of 35 and 75 at
# baseline to be eligible for the study
age = sample(35:75, 20, TRUE) %>% rep(each = 3),
# Drug the participant received, Placebo or active
drug = sample(c("Placebo", "Active"), 20, TRUE) %>%
rep(each = 3),
# Reported headaches side effect, Y/N
se_headache = if_else(
drug == "Placebo",
sample(0:1, 60, TRUE, c(.95,.05)),
sample(0:1, 60, TRUE, c(.10, .90))
),
# Report diarrhea side effect, Y/N
se_diarrhea = if_else(
drug == "Placebo",
sample(0:1, 60, TRUE, c(.98,.02)),
sample(0:1, 60, TRUE, c(.20, .80))
),
# Report dry mouth side effect, Y/N
se_dry_mouth = if_else(
drug == "Placebo",
sample(0:1, 60, TRUE, c(.97,.03)),
sample(0:1, 60, TRUE, c(.30, .70))
),
# Participant had myocardial infarction in study year, Y/N
mi = if_else(
drug == "Placebo",
sample(0:1, 60, TRUE, c(.85, .15)),
sample(0:1, 60, TRUE, c(.80, .20))
)
)
As a reminder, we are simulating some drug trial data that includes the following variables:
id: Study id, there are 20 people enrolled in the trial.
year: Follow-up year, 0 = baseline, 1 = year one, 2 = year two.
age: Participant age at baseline. Must be between the ages of 35 and 75 at baseline to be eligible for the study.
drug: Drug the participant received, Placebo or active.
se_headache: Reported headaches side effect, Y/N.
se_diarrhea: Report diarrhea side effect, Y/N.
se_dry_mouth: Report dry mouth side effect, Y/N.
mi: Participant had myocardial infarction in study year, Y/N.
Actually, this data is slightly different from the data we used in the last chapter. Did you catch the difference? Take another look:
## # A tibble: 60 × 7
## year age drug se_headache se_diarrhea se_dry_mouth mi
## <int> <int> <chr> <int> <int> <int> <int>
## 1 0 65 Active 0 1 1 0
## 2 1 65 Active 1 1 1 0
## 3 2 65 Active 1 1 0 0
## 4 0 49 Active 1 1 1 0
## 5 1 49 Active 0 0 1 0
## 6 2 49 Active 1 1 1 0
## 7 0 48 Placebo 0 0 0 0
## 8 1 48 Placebo 0 0 0 0
## 9 2 48 Placebo 0 0 0 0
## 10 0 37 Placebo 0 0 0 0
## # ℹ 50 more rows
We forgot to put a study id in our data. Because we simulated this data above, the best way to fix this oversight is to make the necessary change to the simulation code above. But, let's pretend that someone sent us this data instead, and we have to add a new study id column to it. Well, we now know how to use the mutate() function to add columns to our data frame. We can do so like this:
drug_trial <- drug_trial %>%
mutate(
# Study id, there are 20 people enrolled in the trial.
id = rep(1:20, each = 3)
) %>%
print()
## # A tibble: 60 × 8
## year age drug se_headache se_diarrhea se_dry_mouth mi id
## <int> <int> <chr> <int> <int> <int> <int> <int>
## 1 0 65 Active 0 1 1 0 1
## 2 1 65 Active 1 1 1 0 1
## 3 2 65 Active 1 1 0 0 1
## 4 0 49 Active 1 1 1 0 2
## 5 1 49 Active 0 0 1 0 2
## 6 2 49 Active 1 1 1 0 2
## 7 0 48 Placebo 0 0 0 0 3
## 8 1 48 Placebo 0 0 0 0 3
## 9 2 48 Placebo 0 0 0 0 3
## 10 0 37 Placebo 0 0 0 0 4
## # ℹ 50 more rows
And now we have the study id in our data. But, by default, R adds new columns as the rightmost column of the data frame. In terms of analysis, it doesn't really matter where this column is located in our data. R couldn't care less. However, when humans look at this data, they typically expect the study id (or some other identifier) to be the first column in the data frame. That is a job for select().
27.1 The select() function
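The output below was produced by typing each column name explicitly inside select(). A reconstruction of that call, based on the columns shown in the output, looks like this:

```r
# Move id to the front by typing each column name by hand.
# This works, but it is tedious and error-prone, as discussed below.
drug_trial %>%
  select(id, year, age, se_headache, se_diarrhea, se_dry_mouth, mi)
```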
## # A tibble: 60 × 7
## id year age se_headache se_diarrhea se_dry_mouth mi
## <int> <int> <int> <int> <int> <int> <int>
## 1 1 0 65 0 1 1 0
## 2 1 1 65 1 1 1 0
## 3 1 2 65 1 1 0 0
## 4 2 0 49 1 1 1 0
## 5 2 1 49 0 0 1 0
## 6 2 2 49 1 1 1 0
## 7 3 0 48 0 0 0 0
## 8 3 1 48 0 0 0 0
## 9 3 2 48 0 0 0 0
## 10 4 0 37 0 0 0 0
## # ℹ 50 more rows
👆Here's what we did above:
- We used the select() function to change the order of the columns in the drug_trial data frame so that id would be the first variable in the data frame when reading from left to right.
- You can type ?select into your R console to view the help documentation for this function and follow along with the explanation below.
- The first argument to the select() function is .data. The value passed to .data should always be a data frame. In this book, we will often pass data frames to the .data argument using the pipe operator (e.g., df %>% select()).
- The second argument to the select() function is ... . The value passed to the ... argument should be column names or expressions that return column positions. We'll dive deeper into this soon.
More generally, the select() function tells R which variables in your data frame to keep (or drop) and in what order.
The code above gave us the result we wanted. 👏 But, it can be tedious and error-prone to manually type every variable name inside the select() function. Did you notice that I forgot the drug column "by accident"?
Thankfully, the select() function is one of several dplyr functions that accept tidy-select argument modifiers (i.e., functions and operators). In this chapter, I will show you some of the tidy-select argument modifiers I regularly use, but you can always type ?dplyr_tidy_select into your console to see a complete list.
In our little example above, we could have used the tidy-select everything() function to make our code easier to write, and we wouldn't have accidentally missed the drug column. We can do so like this:
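A sketch of that code is below. The assignment back to drug_trial is an assumption on my part, based on the fact that id appears as the first column in later output:

```r
# Put id first and keep everything else in its current order
drug_trial <- drug_trial %>%
  select(id, everything()) %>%
  print()
```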
## # A tibble: 60 × 8
## id year age drug se_headache se_diarrhea se_dry_mouth mi
## <int> <int> <int> <chr> <int> <int> <int> <int>
## 1 1 0 65 Active 0 1 1 0
## 2 1 1 65 Active 1 1 1 0
## 3 1 2 65 Active 1 1 0 0
## 4 2 0 49 Active 1 1 1 0
## 5 2 1 49 Active 0 0 1 0
## 6 2 2 49 Active 1 1 1 0
## 7 3 0 48 Placebo 0 0 0 0
## 8 3 1 48 Placebo 0 0 0 0
## 9 3 2 48 Placebo 0 0 0 0
## 10 4 0 37 Placebo 0 0 0 0
## # ℹ 50 more rows
👆Here's what we did above:
- We used the select() function to change the order of the columns in the drug_trial data frame so that id would be the first variable in the data frame when reading from left to right.
- Rather than explicitly typing the other column names, we used the everything() tidy-select function. As you may have guessed, everything() tells R to do X (in this case, keep) to all the other variables not explicitly mentioned.
For our next example, let's go ahead and add our mean-centered age variable to our drug_trial data again. We did this for the first time in the last chapter, in case you missed it.
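A sketch of that code, assuming age is centered on the overall mean of age (which matches the values shown below):

```r
# Center age on the overall mean age
drug_trial <- drug_trial %>%
  mutate(age_center = age - mean(age)) %>%
  print()
```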
## # A tibble: 60 × 9
## id year age drug se_headache se_diarrhea se_dry_mouth mi age_center
## <int> <int> <int> <chr> <int> <int> <int> <int> <dbl>
## 1 1 0 65 Active 0 1 1 0 11.3
## 2 1 1 65 Active 1 1 1 0 11.3
## 3 1 2 65 Active 1 1 0 0 11.3
## 4 2 0 49 Active 1 1 1 0 -4.7
## 5 2 1 49 Active 0 0 1 0 -4.7
## 6 2 2 49 Active 1 1 1 0 -4.7
## 7 3 0 48 Placebo 0 0 0 0 -5.7
## 8 3 1 48 Placebo 0 0 0 0 -5.7
## 9 3 2 48 Placebo 0 0 0 0 -5.7
## 10 4 0 37 Placebo 0 0 0 0 -16.7
## # ℹ 50 more rows
One way I will often use select() is for performing quick little data checks. For example, let's say that I wanted to make sure the code I wrote above actually did what I intended it to do. If I print the entire data frame to the screen, age and age_center aren't directly side-by-side, and there's a lot of other visual clutter from the other variables. In a case like this, I would use select() to get a clearer picture:
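A minimal version of that check:

```r
# View only the original and centered age columns, side-by-side
drug_trial %>%
  select(age, age_center)
```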
## # A tibble: 60 × 2
## age age_center
## <int> <dbl>
## 1 65 11.3
## 2 65 11.3
## 3 65 11.3
## 4 49 -4.7
## 5 49 -4.7
## 6 49 -4.7
## 7 48 -5.7
## 8 48 -5.7
## 9 48 -5.7
## 10 37 -16.7
## # ℹ 50 more rows
👆Here's what we did above:
- We used the select() function to view the age and age_center columns only.
- We can type individual column names, separated by commas, into select() to return a data frame containing only those columns, and in that order.
⚠️Warning: Notice that I didn't assign our result above to anything (i.e., there's no drug_trial <-). If I had done so, the drug_trial data would have contained these two columns only. I didn't want to drop the other columns. I could have assigned the result of the code to a different R object (e.g., check_age <-), but it wasn't really necessary. I just wanted to quickly view age and age_center side-by-side for data checking purposes. When I'm satisfied that I coded it correctly, I can move on. There's no need to save those results to an R object.
You may also recall that we wanted to subset the drug_trial data to include only the columns we needed for the rowwise demonstrations. Here is the code we used to do so:
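A likely version of that code, using the tidy-select starts_with() function described below:

```r
# Keep id, year, and every column whose name starts with "se"
drug_trial %>%
  select(id, year, starts_with("se"))
```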
## # A tibble: 60 × 5
## id year se_headache se_diarrhea se_dry_mouth
## <int> <int> <int> <int> <int>
## 1 1 0 0 1 1
## 2 1 1 1 1 1
## 3 1 2 1 1 0
## 4 2 0 1 1 1
## 5 2 1 0 0 1
## 6 2 2 1 1 1
## 7 3 0 0 0 0
## 8 3 1 0 0 0
## 9 3 2 0 0 0
## 10 4 0 0 0 0
## # ℹ 50 more rows
👆Here's what we did above:
- We used the select() function to view the id, year, se_headache, se_diarrhea, and se_dry_mouth columns only.
- We used the tidy-select starts_with() function to select all the side effect variables.
We already know that we can use everything() to select all of the other variables in a data frame, but what if we just want to grab a range or group of other variables? tidy-select makes this easy for us. Above, we used the starts_with() function to select all the columns with names that literally start with the letters "se". Because all of the side effect columns are directly next to each other (i.e., no columns in between them), we could have also used the colon operator (:) like this:
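A sketch of the colon-operator version:

```r
# Keep id, year, and the range of columns from se_headache through se_dry_mouth
drug_trial %>%
  select(id, year, se_headache:se_dry_mouth)
```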
## # A tibble: 60 × 5
## id year se_headache se_diarrhea se_dry_mouth
## <int> <int> <int> <int> <int>
## 1 1 0 0 1 1
## 2 1 1 1 1 1
## 3 1 2 1 1 0
## 4 2 0 1 1 1
## 5 2 1 0 0 1
## 6 2 2 1 1 1
## 7 3 0 0 0 0
## 8 3 1 0 0 0
## 9 3 2 0 0 0
## 10 4 0 0 0 0
## # ℹ 50 more rows
While either method gets us the same result, I tend to prefer using starts_with()
when possible. I think it makes your code easier to read (i.e., “Oh, he’s selecting all the side effect columns here.”).
In addition to starts_with(), there is also an ends_with() tidy-select function that can be useful. For example, we've named factors with the _f naming convention throughout the book. We can use that convention, along with the ends_with() function, to create a subset of our data that includes only the factor versions of our side effect columns.
# Add the side effect factor columns to our data frame again...
yn_levs <- c(0, 1)
yn_labs <- c("No", "Yes")
drug_trial <- drug_trial %>%
mutate(
se_headache_f = factor(se_headache, yn_levs, yn_labs),
se_diarrhea_f = factor(se_diarrhea, yn_levs, yn_labs),
se_dry_mouth_f = factor(se_dry_mouth, yn_levs, yn_labs)
)
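With those factor columns in place, a sketch of the ends_with() selection:

```r
# Keep id, year, and every column whose name ends with "_f"
drug_trial %>%
  select(id, year, ends_with("_f"))
```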
## # A tibble: 60 × 5
## id year se_headache_f se_diarrhea_f se_dry_mouth_f
## <int> <int> <fct> <fct> <fct>
## 1 1 0 No Yes Yes
## 2 1 1 Yes Yes Yes
## 3 1 2 Yes Yes No
## 4 2 0 Yes Yes Yes
## 5 2 1 No No Yes
## 6 2 2 Yes Yes Yes
## 7 3 0 No No No
## 8 3 1 No No No
## 9 3 2 No No No
## 10 4 0 No No No
## # ℹ 50 more rows
🗒Side Note: Variable names are important! Throughout this book, I’ve tried to repeatedly emphasize the importance of coding style – including the way we name our R objects. Many people who are new to data management and analysis (and some who aren’t, MDL) don’t fully appreciate the importance of such things. I hope that the preceding two examples are helping you to see why the little details, like variable names, are important. Using consistent variable naming conventions, for example, allows us to write code that requires less typing, is easier for humans to skim and understand, and is less prone to typos and other related errors.
We can also select columns we want to keep by position instead of name. I don’t do this often. I think it’s generally better to use column names or tidy-select argument modifiers when subsetting columns in your data frame. However, I do sometimes select columns by position when I’m writing my own functions. Therefore, I want to quickly show you what this looks like:
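A sketch of selecting by position, with the column numbers inferred from the output shown below:

```r
# Keep the 1st, 2nd, and 4th columns (id, year, and drug)
drug_trial %>%
  select(1, 2, 4)
```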
## # A tibble: 60 × 3
## id year drug
## <int> <int> <chr>
## 1 1 0 Active
## 2 1 1 Active
## 3 1 2 Active
## 4 2 0 Active
## 5 2 1 Active
## 6 2 2 Active
## 7 3 0 Placebo
## 8 3 1 Placebo
## 9 3 2 Placebo
## 10 4 0 Placebo
## # ℹ 50 more rows
👆Here's what we did above:
- We passed column numbers to the select() function to keep the 1st, 2nd, and 4th columns from our drug_trial data frame.
Finally, in addition to using select() to keep columns in our data frame, we can also use select() to explicitly drop columns from our data frame. To do so, we just need to use either the subtraction symbol (-) or the Not operator (!).
Think back to our example from the previous chapter. There we created some new variables that captured information about participants reporting any and all side effects. During that process we created a column that contained a count of the side effects experienced in each year – n_se_year.
drug_trial_sub <- drug_trial %>%
rowwise() %>%
mutate(
n_se_year = sum(se_headache, se_diarrhea, se_dry_mouth),
any_se_year = n_se_year > 0,
all_se_year = n_se_year == 3
) %>%
group_by(id) %>%
mutate(any_se = sum(any_se_year) > 0) %>%
ungroup() %>%
select(id:year, n_se_year:any_se) %>%
print()
## # A tibble: 60 × 6
## id year n_se_year any_se_year all_se_year any_se
## <int> <int> <int> <lgl> <lgl> <lgl>
## 1 1 0 2 TRUE FALSE TRUE
## 2 1 1 3 TRUE TRUE TRUE
## 3 1 2 2 TRUE FALSE TRUE
## 4 2 0 3 TRUE TRUE TRUE
## 5 2 1 1 TRUE FALSE TRUE
## 6 2 2 3 TRUE TRUE TRUE
## 7 3 0 0 FALSE FALSE FALSE
## 8 3 1 0 FALSE FALSE FALSE
## 9 3 2 0 FALSE FALSE FALSE
## 10 4 0 0 FALSE FALSE FALSE
## # ℹ 50 more rows
Let's say we decided we don't need the n_se_year column now that we've created any_se_year, all_se_year, and any_se. We can easily drop it from the data frame in a couple of ways:
## # A tibble: 60 × 5
## id year any_se_year all_se_year any_se
## <int> <int> <lgl> <lgl> <lgl>
## 1 1 0 TRUE FALSE TRUE
## 2 1 1 TRUE TRUE TRUE
## 3 1 2 TRUE FALSE TRUE
## 4 2 0 TRUE TRUE TRUE
## 5 2 1 TRUE FALSE TRUE
## 6 2 2 TRUE TRUE TRUE
## 7 3 0 FALSE FALSE FALSE
## 8 3 1 FALSE FALSE FALSE
## 9 3 2 FALSE FALSE FALSE
## 10 4 0 FALSE FALSE FALSE
## # ℹ 50 more rows
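The second way uses the Not operator instead:

```r
# Drop n_se_year with the Not operator
drug_trial_sub %>%
  select(!n_se_year)
```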
## # A tibble: 60 × 5
## id year any_se_year all_se_year any_se
## <int> <int> <lgl> <lgl> <lgl>
## 1 1 0 TRUE FALSE TRUE
## 2 1 1 TRUE TRUE TRUE
## 3 1 2 TRUE FALSE TRUE
## 4 2 0 TRUE TRUE TRUE
## 5 2 1 TRUE FALSE TRUE
## 6 2 2 TRUE TRUE TRUE
## 7 3 0 FALSE FALSE FALSE
## 8 3 1 FALSE FALSE FALSE
## 9 3 2 FALSE FALSE FALSE
## 10 4 0 FALSE FALSE FALSE
## # ℹ 50 more rows
Note that we could have also dropped it indirectly by selecting everything else:
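That indirect approach might look like this – a sketch, with the column list inferred from the output below:

```r
# Keep every column except n_se_year by naming the rest explicitly
drug_trial_sub %>%
  select(id, year, any_se_year, all_se_year, any_se)
```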
## # A tibble: 60 × 5
## id year any_se_year all_se_year any_se
## <int> <int> <lgl> <lgl> <lgl>
## 1 1 0 TRUE FALSE TRUE
## 2 1 1 TRUE TRUE TRUE
## 3 1 2 TRUE FALSE TRUE
## 4 2 0 TRUE TRUE TRUE
## 5 2 1 TRUE FALSE TRUE
## 6 2 2 TRUE TRUE TRUE
## 7 3 0 FALSE FALSE FALSE
## 8 3 1 FALSE FALSE FALSE
## 9 3 2 FALSE FALSE FALSE
## 10 4 0 FALSE FALSE FALSE
## # ℹ 50 more rows
But, I think this is generally a bad idea. Not only is it more typing, but skimming through your code doesn’t really tell me (or future you) what you were trying to accomplish there.
27.2 The rename() function
Sometimes, we want to change the names of some, or all, of the columns in our data frame. For me, this most commonly comes up with data I’ve imported from someone else. For example, let’s say I’m importing data that uses column names that aren’t super informative. We saw column names like that when we imported NHANES data. It looked something like this:
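A small stand-in for that imported data. The object name nhanes is hypothetical; only the column names and values come from the output below:

```r
# Hypothetical stand-in for the imported NHANES data
nhanes <- tribble(
  ~SEQN, ~ALQ101, ~ALQ110,
      1,       1,       2,
      2,       2,       2,
      3,       1,       2,
      4,       2,       1
)
```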
## # A tibble: 4 × 3
## SEQN ALQ101 ALQ110
## <int> <dbl> <dbl>
## 1 1 1 2
## 2 2 2 2
## 3 3 1 2
## 4 4 2 1
We previously learned how to change these column names on import (i.e., col_names), but let's say we didn't do that for whatever reason. We can rename columns in our data frame using the rename() function like so:
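Assuming the imported data is stored in an object called nhanes (a hypothetical name), the rename() call looks like this:

```r
# Rename the NHANES columns to something more informative
nhanes %>%
  rename(
    id             = SEQN,
    drinks_12_year = ALQ101,
    drinks_12_life = ALQ110
  )
```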
## # A tibble: 4 × 3
## id drinks_12_year drinks_12_life
## <int> <dbl> <dbl>
## 1 1 1 2
## 2 2 2 2
## 3 3 1 2
## 4 4 2 1
👆Here's what we did above:
- We used the rename() function to change the name of each column in the data frame to be more informative.
- You can type ?rename into your R console to view the help documentation for this function and follow along with the explanation below.
- The first argument to the rename() function is .data. The value passed to .data should always be a data frame. In this book, we will often pass data frames to the .data argument using the pipe operator (e.g., df %>% rename()).
- The second argument to the rename() function is ... . The value passed to the ... argument should be a name-value pair, or a series of name-value pairs separated by commas. The name-value pairs should be in the format new name = original name.
I think these names are much better, but for the sake of argument let's say that we wanted to keep the original names – just coerce them to lowercase. We can do that using the rename_with() variation of the rename() function in combination with the tolower() function:
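Again assuming the object is called nhanes (a hypothetical name), a sketch of that call:

```r
# Coerce all column names to lowercase
nhanes %>%
  rename_with(tolower)
```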
## # A tibble: 4 × 3
## seqn alq101 alq110
## <int> <dbl> <dbl>
## 1 1 1 2
## 2 2 2 2
## 3 3 1 2
## 4 4 2 1
👆Here's what we did above:
- We used the rename_with() function to coerce all column names in the data frame to lowercase.
- You can type ?rename_with into your R console to view the help documentation for this function and follow along with the explanation below.
- The first argument to the rename_with() function is .data. The value passed to .data should always be a data frame. In this book, we will often pass data frames to the .data argument using the pipe operator (e.g., df %>% rename_with()).
- The second argument to the rename_with() function is .fn. The value passed to the .fn argument should be a function that you want to apply to all the columns selected in the .cols argument (see below).
- The third argument to the rename_with() function is .cols. The value passed to the .cols argument should be the columns you want to apply the function passed to the .fn argument to. You can select the columns using tidy-select argument modifiers.
27.3 The filter() function
We just saw how to keep and drop columns in our data frame using the select()
function. We can keep and drop rows in our data frame using the filter() function or the slice() function.
Similar to selecting columns by position instead of name:
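A sketch of that select() call:

```r
# Select columns by position
drug_trial %>%
  select(1, 2, 4)
```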
## # A tibble: 60 × 3
## id year drug
## <int> <int> <chr>
## 1 1 0 Active
## 2 1 1 Active
## 3 1 2 Active
## 4 2 0 Active
## 5 2 1 Active
## 6 2 2 Active
## 7 3 0 Placebo
## 8 3 1 Placebo
## 9 3 2 Placebo
## 10 4 0 Placebo
## # ℹ 50 more rows
We can also select rows we want to keep by position. Again, I don’t do this often, but it is sometimes useful when I’m writing my own functions. Therefore, I want to quickly show you what this looks like:
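A sketch of that slice() call:

```r
# Keep only the first 5 rows, by position
drug_trial %>%
  slice(1:5)
```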
## # A tibble: 5 × 12
## id year age drug se_headache se_diarrhea se_dry_mouth mi age_center se_headache_f se_diarrhea_f se_dry_mouth_f
## <int> <int> <int> <chr> <int> <int> <int> <int> <dbl> <fct> <fct> <fct>
## 1 1 0 65 Active 0 1 1 0 11.3 No Yes Yes
## 2 1 1 65 Active 1 1 1 0 11.3 Yes Yes Yes
## 3 1 2 65 Active 1 1 0 0 11.3 Yes Yes No
## 4 2 0 49 Active 1 1 1 0 -4.7 Yes Yes Yes
## 5 2 1 49 Active 0 0 1 0 -4.7 No No Yes
👆Here's what we did above:
- We used the slice() function to keep only the first 5 rows in the drug_trial data frame.
- You can type ?slice into your R console to view the help documentation for this function and follow along with the explanation below.
- The first argument to the slice() function is .data. The value passed to .data should always be a data frame. In this book, we will often pass data frames to the .data argument using the pipe operator (e.g., df %>% slice()).
- The second argument to the slice() function is ... . The value passed to the ... argument should be the row numbers you want returned to you.
Generally speaking, I'm far more likely to use the filter() function to select only a subset of rows from my data frame. Two of the most common scenarios, of many possible scenarios, where we want to subset rows include:
- Performing a subgroup analysis. This is a situation where I want my analysis to include only some of the people (or places, or things) in my data frame.
- Performing a complete case analysis. This is a situation where I want to remove rows that contain missing values from my data frame before performing an analysis.
27.3.1 Subgroup analysis
Let's say that I want to count the number of people in the drug trial who reported having headaches in the baseline year, by drug status (active vs. placebo). We would first use filter() to keep only the rows that contain data from the baseline year:
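A sketch of that filter() call:

```r
# Keep only the baseline-year rows
drug_trial %>%
  filter(year == 0)
```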
## # A tibble: 20 × 12
## id year age drug se_headache se_diarrhea se_dry_mouth mi age_center se_headache_f se_diarrhea_f se_dry_mouth_f
## <int> <int> <int> <chr> <int> <int> <int> <int> <dbl> <fct> <fct> <fct>
## 1 1 0 65 Active 0 1 1 0 11.3 No Yes Yes
## 2 2 0 49 Active 1 1 1 0 -4.7 Yes Yes Yes
## 3 3 0 48 Placebo 0 0 0 0 -5.7 No No No
## 4 4 0 37 Placebo 0 0 0 0 -16.7 No No No
## 5 5 0 71 Placebo 0 0 0 0 17.3 No No No
## 6 6 0 48 Placebo 0 0 0 0 -5.7 No No No
## 7 7 0 59 Active 1 1 1 0 5.3 Yes Yes Yes
## 8 8 0 60 Placebo 0 0 0 0 6.3 No No No
## 9 9 0 61 Active 1 1 1 0 7.3 Yes Yes Yes
## 10 10 0 39 Active 1 0 1 0 -14.7 Yes No Yes
## 11 11 0 61 Placebo 0 0 0 0 7.3 No No No
## 12 12 0 62 Placebo 1 0 1 0 8.3 Yes No Yes
## 13 13 0 43 Placebo 0 0 0 0 -10.7 No No No
## 14 14 0 63 Placebo 0 0 0 0 9.3 No No No
## 15 15 0 69 Active 1 1 1 0 15.3 Yes Yes Yes
## 16 16 0 42 Placebo 0 0 0 0 -11.7 No No No
## 17 17 0 60 Placebo 0 0 0 0 6.3 No No No
## 18 18 0 41 Active 1 1 1 0 -12.7 Yes Yes Yes
## 19 19 0 43 Placebo 0 0 0 0 -10.7 No No No
## 20 20 0 53 Placebo 0 0 0 0 -0.700 No No No
👆Here's what we did above:
- We used the filter() function to keep only the rows in the drug_trial data frame that contain data from the baseline year.
- You can type ?filter into your R console to view the help documentation for this function and follow along with the explanation below.
- The first argument to the filter() function is .data. The value passed to .data should always be a data frame. In this book, we will often pass data frames to the .data argument using the pipe operator (e.g., df %>% filter()).
- The second argument to the filter() function is ... . The value passed to the ... argument should be an expression, or multiple expressions separated by commas, that return a logical value (e.g., year == 0). The ... argument is where you tell filter() how to decide which rows to keep.
⚠️Warning: Remember that in the R language = (i.e., one equal sign) and == (i.e., two equal signs) are different things. The = operator tells R to make the thing on the left equal to the thing on the right. In other words, it assigns values. The == operator asks R if the thing on the left is equal to the thing on the right. In other words, it tests the equality of values.
Now, we can use the descriptive analysis techniques we’ve already learned to answer our research question:
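For example, counting reported headaches at baseline by drug status – a sketch combining filter() with the group_by()/summarise() pattern from earlier chapters:

```r
# Count baseline headache reports by drug status
drug_trial %>%
  filter(year == 0) %>%
  group_by(drug, se_headache_f) %>%
  summarise(n = n())
```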
## # A tibble: 4 × 3
## # Groups: drug [2]
## drug se_headache_f n
## <chr> <fct> <int>
## 1 Active No 1
## 2 Active Yes 6
## 3 Placebo No 12
## 4 Placebo Yes 1
So, 6 out of 7 (~86%) of the people in our active drug group reported headaches in the baseline year. Now, let's say that we have reason to suspect that the drug affects people differently based on their age. Let's go ahead and repeat this analysis, but only in a subgroup of people who are below age 65. Again, we can use the filter() function to do this:
drug_trial %>%
filter(year == 0) %>%
filter(age < 65) %>%
group_by(drug, se_headache_f) %>%
summarise(n = n())
## # A tibble: 3 × 3
## # Groups: drug [2]
## drug se_headache_f n
## <chr> <fct> <int>
## 1 Active Yes 5
## 2 Placebo No 11
## 3 Placebo Yes 1
Wow! It looks like everyone under age 65 who received active drug also reported headaches!
We can show this more explicitly by passing the value FALSE to the .drop argument of group_by(). This tells R to keep all factor levels in the output, even if they were observed zero times in the data.
drug_trial %>%
filter(year == 0) %>%
filter(age < 65) %>%
group_by(drug, se_headache_f, .drop = FALSE) %>%
summarise(n = n())
## # A tibble: 4 × 3
## # Groups: drug [2]
## drug se_headache_f n
## <chr> <fct> <int>
## 1 Active No 0
## 2 Active Yes 5
## 3 Placebo No 11
## 4 Placebo Yes 1
Finally, we could make our code above more succinct by combining our two filter() calls into one:
drug_trial %>%
filter(year == 0 & age < 65) %>%
group_by(drug, se_headache_f, .drop = FALSE) %>%
summarise(n = n())
## # A tibble: 4 × 3
## # Groups: drug [2]
## drug se_headache_f n
## <chr> <fct> <int>
## 1 Active No 0
## 2 Active Yes 5
## 3 Placebo No 11
## 4 Placebo Yes 1
👆Here's what we did above:
- We used the filter() function to keep only the rows in the drug_trial data frame that contain data from the baseline year AND (&) have a value less than 65 in the age column. The AND (&) here is important. A row must satisfy both of these conditions for R to keep it in the returned data frame. If we had used OR instead (filter(year == 0 | age < 65)), then a row would only need to meet one condition OR the other for R to keep it in the returned data frame.
🗒Side Note: In the R language, we use the vertical bar operator (|) to create OR conditions. It is probably the key immediately to the right of your enter/return key on your keyboard. Don't confuse it with the pipe operator (%>%).
27.3.2 Complete case analysis
Now let’s say that we want to compare age at baseline by drug status (active vs. placebo). Additionally, let’s say that we have some missing values in our data.
Let’s first simulate some new data with missing values:
drug_trial_short <- drug_trial %>%
filter(year == 0) %>%
slice(1:10) %>%
mutate(
age = replace(age, 1, NA),
drug = replace(drug, 4, NA)
) %>%
print()
## # A tibble: 10 × 12
## id year age drug se_headache se_diarrhea se_dry_mouth mi age_center se_headache_f se_diarrhea_f se_dry_mouth_f
## <int> <int> <int> <chr> <int> <int> <int> <int> <dbl> <fct> <fct> <fct>
## 1 1 0 NA Active 0 1 1 0 11.3 No Yes Yes
## 2 2 0 49 Active 1 1 1 0 -4.7 Yes Yes Yes
## 3 3 0 48 Placebo 0 0 0 0 -5.7 No No No
## 4 4 0 37 <NA> 0 0 0 0 -16.7 No No No
## 5 5 0 71 Placebo 0 0 0 0 17.3 No No No
## 6 6 0 48 Placebo 0 0 0 0 -5.7 No No No
## 7 7 0 59 Active 1 1 1 0 5.3 Yes Yes Yes
## 8 8 0 60 Placebo 0 0 0 0 6.3 No No No
## 9 9 0 61 Active 1 1 1 0 7.3 Yes Yes Yes
## 10 10 0 39 Active 1 0 1 0 -14.7 Yes No Yes
👆Here's what we did above:
- We used the filter() and slice() functions to create a new data frame that contains only a subset of our original drug_trial data frame. The subset includes only the first 10 rows remaining after keeping only the baseline-year rows from the original data frame.
- We used the replace() function to replace the first value of age with NA and the fourth value of drug with NA.
- You can type ?replace into your R console to view the help documentation for this function.
If we try to answer our research question above without dealing with the missing data, we get the following undesirable results:
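A sketch of that naive attempt, using the drug_trial_short data created above:

```r
# Mean baseline age by drug status, ignoring the missing values
drug_trial_short %>%
  group_by(drug) %>%
  summarise(mean_age = mean(age))
```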
## # A tibble: 3 × 2
## drug mean_age
## <chr> <dbl>
## 1 Active NA
## 2 Placebo 56.8
## 3 <NA> 37
One way we can improve our result is by adding the na.rm argument to the mean() function.
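That adjustment looks like this:

```r
# Drop missing values of age inside mean() itself
drug_trial_short %>%
  group_by(drug) %>%
  summarise(mean_age = mean(age, na.rm = TRUE))
```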
## # A tibble: 3 × 2
## drug mean_age
## <chr> <dbl>
## 1 Active 52
## 2 Placebo 56.8
## 3 <NA> 37
But, we previously saw how it can sometimes be more efficient to drop the row with missing data from the data frame explicitly. This is called a complete case analysis or list-wise deletion.
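A sketch of the complete case approach for age:

```r
# Drop the row with a missing age before summarising
drug_trial_short %>%
  filter(!is.na(age)) %>%
  group_by(drug) %>%
  summarise(mean_age = mean(age))
```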
## # A tibble: 3 × 2
## drug mean_age
## <chr> <dbl>
## 1 Active 52
## 2 Placebo 56.8
## 3 <NA> 37
However, we still have that missing value for drug
. We can easily drop the row with the missing value by adding an additional value to the ...
argument of our filter()
function:
drug_trial_short %>%
filter(!is.na(age) & !is.na(drug)) %>%
group_by(drug) %>%
summarise(mean_age = mean(age))
## # A tibble: 2 × 2
## drug mean_age
## <chr> <dbl>
## 1 Active 52
## 2 Placebo 56.8
27.4 Deduplication
Another common data management task that I want to discuss in this chapter is deduplicating data. Let’s go ahead and simulate some data to illustrate what I mean:
df <- tribble(
~id, ~day, ~x,
1, 1, 1,
1, 2, 11,
2, 1, 12,
2, 2, 13,
2, 2, 14,
3, 1, 12,
3, 1, 12,
3, 2, 13,
4, 1, 13,
5, 1, 10,
5, 2, 11,
5, 1, 10
) %>%
print()
## # A tibble: 12 × 3
## id day x
## <dbl> <dbl> <dbl>
## 1 1 1 1
## 2 1 2 11
## 3 2 1 12
## 4 2 2 13
## 5 2 2 14
## 6 3 1 12
## 7 3 1 12
## 8 3 2 13
## 9 4 1 13
## 10 5 1 10
## 11 5 2 11
## 12 5 1 10
All ids except 4 have multiple observations.
- ID 2 has rows with duplicate values for id and day, but a non-duplicate value for x. These rows are partial duplicates.
- ID 3 has a row with duplicate values for all three columns (i.e., 3, 1, 12). These rows are complete duplicates.
- ID 5 has a row with duplicate values for all three columns (i.e., 5, 1, 10). These rows are complete duplicates. However, they are not in sequential order in the dataset.
27.4.1 The distinct() function
We can use dplyr's distinct() function to remove all complete duplicates from the data frame:
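A sketch of that call:

```r
# Keep only one row from each group of complete duplicate rows
df %>%
  distinct()
```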
## # A tibble: 10 × 3
## id day x
## <dbl> <dbl> <dbl>
## 1 1 1 1
## 2 1 2 11
## 3 2 1 12
## 4 2 2 13
## 5 2 2 14
## 6 3 1 12
## 7 3 2 13
## 8 4 1 13
## 9 5 1 10
## 10 5 2 11
👆Here's what we did above:
- We used the distinct() function to keep only one row from each group of complete duplicate rows in the df data frame.
- You can type ?distinct into your R console to view the help documentation for this function and follow along with the explanation below.
- The first argument to the distinct() function is .data. The value passed to .data should always be a data frame. In this book, we will often pass data frames to the .data argument using the pipe operator (e.g., df %>% distinct()).
- The second argument to the distinct() function is ... . The value passed to the ... argument should be the variables to use when determining uniqueness. Passing no variables to the ... argument is equivalent to passing all variables to the ... argument.
27.4.2 Complete duplicate rows - add tag
If we want to identify the complete duplicate rows without immediately dropping them, we can use the duplicated() function inside the mutate() function. This creates a new column in our data frame that has the value TRUE when the row is a complete duplicate and the value FALSE otherwise.
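A sketch of that code:

```r
# Tag the second (and later) occurrence of each complete duplicate row
df %>%
  mutate(dup = duplicated(df))
```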
## # A tibble: 12 × 4
## id day x dup
## <dbl> <dbl> <dbl> <lgl>
## 1 1 1 1 FALSE
## 2 1 2 11 FALSE
## 3 2 1 12 FALSE
## 4 2 2 13 FALSE
## 5 2 2 14 FALSE
## 6 3 1 12 FALSE
## 7 3 1 12 TRUE
## 8 3 2 13 FALSE
## 9 4 1 13 FALSE
## 10 5 1 10 FALSE
## 11 5 2 11 FALSE
## 12 5 1 10 TRUE
Alternatively, we could get the same result using:
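A sketch of that alternative; the helper column name n_row comes from the output below:

```r
# Group by every column, then count rows within each group
df %>%
  group_by_all() %>%
  mutate(
    n_row = row_number(),
    dup   = n_row > 1
  )
```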
## # A tibble: 12 × 5
## # Groups: id, day, x [10]
## id day x n_row dup
## <dbl> <dbl> <dbl> <int> <lgl>
## 1 1 1 1 1 FALSE
## 2 1 2 11 1 FALSE
## 3 2 1 12 1 FALSE
## 4 2 2 13 1 FALSE
## 5 2 2 14 1 FALSE
## 6 3 1 12 1 FALSE
## 7 3 1 12 2 TRUE
## 8 3 2 13 1 FALSE
## 9 4 1 13 1 FALSE
## 10 5 1 10 1 FALSE
## 11 5 2 11 1 FALSE
## 12 5 1 10 2 TRUE
👆Here’s what we did above:

We used the group_by_all() function to split our data frame into multiple data frames grouped by all the columns in df.

We used the row_number() function to sequentially count every row in each of the little data frames created by group_by_all(). We assigned the sequential count to a new column named n_row.

We created a new column named dup that has a value of TRUE when the value of n_row is greater than 1 and FALSE otherwise.
Notice that R only tags the second row in a set of duplicate rows as a duplicate. Below, we tag both rows that contain complete duplicate values.
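The code chunk for this output is missing as well; a common way to tag every copy, sketched here, combines duplicated() scanning from the top with duplicated() scanning from the bottom (fromLast = TRUE). The df data frame is reconstructed from the printed tibbles:

```r
library(dplyr)

# Example data, reconstructed from the printed tibbles in this section
df <- tibble(
  id  = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 5, 5, 5),
  day = c(1, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 1),
  x   = c(1, 11, 12, 13, 14, 12, 12, 13, 13, 10, 11, 10)
)

# duplicated(df) tags later copies; duplicated(df, fromLast = TRUE)
# tags earlier copies. Their union tags every row in a duplicate set.
df %>%
  mutate(dup = duplicated(df) | duplicated(df, fromLast = TRUE))
```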
## # A tibble: 12 × 4
## id day x dup
## <dbl> <dbl> <dbl> <lgl>
## 1 1 1 1 FALSE
## 2 1 2 11 FALSE
## 3 2 1 12 FALSE
## 4 2 2 13 FALSE
## 5 2 2 14 FALSE
## 6 3 1 12 TRUE
## 7 3 1 12 TRUE
## 8 3 2 13 FALSE
## 9 4 1 13 FALSE
## 10 5 1 10 TRUE
## 11 5 2 11 FALSE
## 12 5 1 10 TRUE
27.4.3 Partial duplicate rows
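The code chunk for this section is missing from this version of the text; based on the explanation below the output, it passes id and day to distinct() along with .keep_all = TRUE. A self-contained sketch (df reconstructed from the printed tibbles):

```r
library(dplyr)

# Example data, reconstructed from the printed tibbles in this section
df <- tibble(
  id  = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 5, 5, 5),
  day = c(1, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 1),
  x   = c(1, 11, 12, 13, 14, 12, 12, 13, 13, 10, 11, 10)
)

# Keep the first row for each combination of id and day,
# returning all columns (not just id and day)
df %>%
  distinct(id, day, .keep_all = TRUE)
```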
## # A tibble: 9 × 3
## id day x
## <dbl> <dbl> <dbl>
## 1 1 1 1
## 2 1 2 11
## 3 2 1 12
## 4 2 2 13
## 5 3 1 12
## 6 3 2 13
## 7 4 1 13
## 8 5 1 10
## 9 5 2 11
👆Here’s what we did above:

We used the distinct() function to keep only one row from a group of duplicate rows in the df data frame.

You can type ?distinct into your R console to view the help documentation for this function and follow along with the explanation below.

This time we passed the column names id and day to the ... argument. This tells R to consider any rows that have the same values of id AND day to be duplicates – even if they have different values in their other columns.

The .keep_all argument tells R to return all of the columns in df to us – not just the columns that we are testing for uniqueness (i.e., id and day).
27.4.4 Partial duplicate rows - add tag
We can tag partial duplicate rows in a similar fashion to the way we tagged complete duplicate rows above:
df %>%
  group_by(id, day) %>%
  mutate(
    count = row_number(), # Counts rows by group
    dup   = count > 1     # TRUE if there is more than one row per group
  )
## # A tibble: 12 × 5
## # Groups: id, day [9]
## id day x count dup
## <dbl> <dbl> <dbl> <int> <lgl>
## 1 1 1 1 1 FALSE
## 2 1 2 11 1 FALSE
## 3 2 1 12 1 FALSE
## 4 2 2 13 1 FALSE
## 5 2 2 14 2 TRUE
## 6 3 1 12 1 FALSE
## 7 3 1 12 2 TRUE
## 8 3 2 13 1 FALSE
## 9 4 1 13 1 FALSE
## 10 5 1 10 1 FALSE
## 11 5 2 11 1 FALSE
## 12 5 1 10 2 TRUE
27.4.5 Count the number of duplicates
Finally, sometimes it can be useful to get a count of the number of duplicate rows. The code below returns a data frame that summarizes the number of rows that contain duplicate values for id and day, and what those duplicate values are.
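The code chunk itself is missing from this version of the text; one call that produces the output below, sketched here, filters to groups with more than one row and then counts each group. The df data frame is reconstructed from the printed tibbles:

```r
library(dplyr)

# Example data, reconstructed from the printed tibbles in this section
df <- tibble(
  id  = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 5, 5, 5),
  day = c(1, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 1),
  x   = c(1, 11, 12, 13, 14, 12, 12, 13, 13, 10, 11, 10)
)

df %>%
  group_by(id, day) %>%
  filter(n() > 1) %>% # Keep only id/day groups with more than one row
  count()             # One row per group, with the group size in n
```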
## # A tibble: 3 × 3
## # Groups: id, day [3]
## id day n
## <dbl> <dbl> <int>
## 1 2 2 2
## 2 3 1 2
## 3 5 1 2
27.4.6 What to do about duplicates
Finding duplicates is only half the battle. After finding them, you have to decide what to do about them. In some ways it’s hard to give clear-cut advice on this because different situations require different decisions. However, here are some things you may want to consider:
If two or more rows are complete duplicates, then the additional rows provide no additional information. I have a hard time thinking of a scenario where dropping them would be a problem. Additionally, because they are completely identical, it doesn’t matter which row you drop.
If you have two or more rows that are partial duplicates, then you will want to look for obvious errors in the other variables. When you have two rows that are partial duplicates, and one row has very obvious errors in it, then keeping the row without the obvious errors is usually the correct decision. Having said that, you should meticulously document which rows you dropped and why, and make that information known to anyone consuming the results of your analysis.
When there are no obvious errors, deciding which rows to keep and which to drop can be really tricky. In this situation, the best advice I can give is to be systematic in your approach. What I mean by that is to choose a strategy that seems least likely to introduce bias into your data and then apply that strategy consistently throughout your data. So, something like always keeping the first row among a group of duplicate rows. However, keep in mind that if the order of the rows is meaningful (e.g., sorted by date), this strategy could easily introduce bias. In that case, some other strategy may be more appropriate. And again, you should meticulously document which rows you dropped and why, and make that information known to anyone consuming the results of your analysis.
Finally, I can definitively tell you a strategy that you should never use. That is, you should never pick and choose, or even give the appearance of picking and choosing, rows with values that are aligned with the results you want to see. I hope the unethical nature of this strategy is blatantly obvious to you.
Congratulations! 🎉 At this point, you are well-versed in all of the dplyr verbs. More importantly, you now have a foundation of tools you can call upon to complete many of the basic data management tasks that you will encounter. In the rest of the data management part of the book, we will build on these tools, and learn some new ones, to solve more complex data management problems.