25 Introduction to Data Management

Way back in the Getting Started chapter, I told you that managing data includes all the things you may have to do to your data to get it ready for analysis. I also talked about the 80/20 “rule.” The basic idea of the 80/20 rule is that data management, rather than analysis, is where you will spend the majority of your time and effort (roughly 80% of it) on just about any project that makes use of data. Unfortunately, I can’t cover strategies for overcoming every single data management challenge that you will encounter in epidemiology. However, in this part of the book, I will try to give you a foundation in some of the most common data management tasks that you will encounter. I will also try to point you towards some of the best tools and resources for data management that the R community has to offer.

25.1 Multiple paradigms for data management in R

Before moving on to examples of how to accomplish specific data management tasks, I think this is the right point in the book to touch on a couple of high-level concepts that we have more or less ignored thus far.

R is pretty unique among the major statistical programming applications used in epidemiology in many ways. Among them is that R has multiple paradigms for data management. That’s what I’m calling them, anyway. What I mean is that there are three primary approaches that the vast majority of R users rely on for data management: base R, the data.table package, and the dplyr package. There is a tremendous amount of overlap in the data management tasks you can perform with base R, data.table, and dplyr, but the syntax for each is very different, as are their relative strengths and weaknesses.
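To give you a feel for just how different the syntax can be, here is a quick sketch of the same task (keeping the rows where x equals 0) in all three paradigms. The little data frame below, and all of its names, are made up for this illustration:

dat <- data.frame(
  id = c(1, 2, 3),
  x  = c(0, 1, 0)
)

# Base R: bracket subsetting with a logical test
dat[dat$x == 0, ]

# data.table: convert to a data.table, then subset inside the brackets
library(data.table)
dt <- as.data.table(dat)
dt[x == 0]

# dplyr: the filter() verb
library(dplyr)
dat %>% 
  filter(x == 0)

All three return the same two rows (the rows where x is 0); they just get there in very different ways.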

In this book, we will primarily use the dplyr paradigm for data management. We will do so because I believe in using the best tool to get the job done. Currently, I believe that the best tool for managing data in R is usually dplyr, especially when you are new to R. However, there will be cases where I will show you how to use base R to accomplish a task. Where I do this, it’s either because I think base R is the best tool for the job, or because I think you are very likely to see the base R way used when you go looking for help with a related data management challenge, and I don’t want you to be totally clueless about what you’re looking at.

As of this writing, I’ve decided not to specifically discuss using the data.table package for data management. I think data.table is a great package, and I use it when I think it’s the best tool for the job. However, I think introducing data.table in a text aimed primarily at inexperienced R users would create more confusion than it would resolve. The last thing I’ll say about data.table for now is that you may want to consider learning more about it if you routinely work with very large data sets (e.g., millions of rows). For reasons that are beyond the scope of this book, data.table is currently much faster than dplyr. However, for most of the work I do, and for all of what we will do in this book, the time difference will be imperceptible to you. Literally milliseconds.

25.2 The dplyr package

At this point in the book, you’ve already been exposed to several of the most important functions in the dplyr package. You saw the filter() function in the Speaking R’s language chapter, the mutate() function in the chapter on exporting data, and the summarise() function all over the descriptive analysis part of the book. However, I mostly glossed over the details at those points. In this section, I want to dive just a tiny bit deeper into how the dplyr functions work – but not too deep.

25.2.1 The dplyr verbs

The dplyr package includes five main functions for managing data: mutate(), select(), filter(), arrange(), and summarise(). These five functions are often referred to as the dplyr verbs, and the first two arguments to all five of them are .data and the ... argument. Let’s go ahead and discuss each of those two arguments a little bit more.

🗒Side Note: I don’t want to give you the impression that dplyr only contains five functions. In fact, dplyr contains many functions, and they are all designed to work together in a very intentional way.
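Before we dig into those two arguments, here is a quick sketch of all five verbs working together on a little data frame. The data and column names below are made up for this illustration:

library(dplyr)

demo <- tibble(
  id  = c(1, 2, 3, 4),
  age = c(32, 28, 40, 36)
)

demo %>% 
  filter(age > 30) %>%                       # Keep rows where age is over 30
  mutate(age_months = age * 12) %>%          # Create a new column
  select(id, age_months) %>%                 # Keep only these two columns
  arrange(age_months) %>%                    # Sort from smallest to largest
  summarise(mean_months = mean(age_months))  # Collapse to one summary row
## # A tibble: 1 × 1
##   mean_months
##         <dbl>
## 1         432

Don’t worry about memorizing any of this right now; we will walk through how to use each of these verbs in the chapters ahead.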

25.2.2 The .data argument

I first introduced you to data frames in the Let’s get programming chapter, and we’ve been using them as our primary structure for storing and analyzing data ever since. The R language allows for other data structures (e.g., vectors, lists, and matrices), but data frames are the most commonly used data structure for the kinds of things we do in epidemiology. Thankfully, the dplyr package is designed specifically to help people like you and me manage data stored in data frames. Therefore, dplyr verbs always receive a data frame as input and return a data frame as output. Specifically, the value passed to the .data argument must always be a data frame, and you will get an error if you attempt to pass any other data structure to it. For example:

# No problem
df <- tibble(
  id = c(1, 2, 3),
  x  = c(0, 1, 0)
)

df %>% 
  filter(x == 0)
## # A tibble: 2 × 2
##      id     x
##   <dbl> <dbl>
## 1     1     0
## 2     3     0

# Problem
l <- list(
  id = c(1, 2, 3),
  x  = c(0, 1, 0)
) 

l %>% 
  filter(x == 0)
## Error in UseMethod("filter"): no applicable method for 'filter' applied to an object of class "list"
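
As an aside, if your data does come to you as a list like this, one simple fix is to convert it to a data frame before piping it to the dplyr verbs. Here is a minimal sketch using as_tibble(), which works because each element of l is a vector of the same length:

l %>% 
  as_tibble() %>% 
  filter(x == 0)
## # A tibble: 2 × 2
##      id     x
##   <dbl> <dbl>
## 1     1     0
## 2     3     0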

25.2.3 The ... argument

The second value passed to all of the dplyr verbs is the ... argument. If you are new to R, this probably seems like a really weird argument. And, it kind of is! But, it’s also really useful. The ... argument (pronounced “dot dot dot”) has special meaning in the R language. This is true for all functions that use the ... argument – not just dplyr verbs. The ... argument can be used to pass multiple arguments to a function without knowing exactly what those arguments will look like ahead of time – including entire expressions. For example:

df %>% 
  filter(x == 0)
## # A tibble: 2 × 2
##      id     x
##   <dbl> <dbl>
## 1     1     0
## 2     3     0

Above, we passed a data frame to the .data argument of the filter() function. The second value we passed to the filter() function was x == 0. Think about it: x is an object (i.e., a column in the data frame), == is a function (remember that operators are functions in R), and 0 is a value. Together, they form an expression (x == 0) that tells R to perform a relatively complex operation: compare every value of x to the value 0 and tell me whether they are the same. If you are new to programming, this may not seem like a big deal, but it’s really handy to be able to pass that much information to a single argument of a single function.
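If the claim that == is a function sounds strange, here is a quick demonstration. The two lines below do exactly the same thing; the second one just writes the function call out explicitly:

c(0, 1, 0) == 0
## [1]  TRUE FALSE  TRUE
`==`(c(0, 1, 0), 0)
## [1]  TRUE FALSE  TRUE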

If this is all really confusing to you, don’t get too hung up on it right now. The ... argument is an important component of the R language, but you don’t have to fully understand it in order to use R. If nothing else, just know that the ... is the second argument to all the dplyr verbs, and it is generally where you will tell R what you want to do to the columns of your data frame (i.e., keep them, drop them, create them, sort them, etc.).
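If you’d like to see the ... argument at work outside of dplyr, here is a minimal sketch. The function below, add_all(), is just something I made up for this illustration; the point is that ... lets it accept any number of values without knowing ahead of time how many there will be:

add_all <- function(...) {
  values <- c(...)  # Gather however many values were passed into one vector
  sum(values)
}

add_all(1, 2)
## [1] 3
add_all(1, 2, 3, 4)
## [1] 10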

25.2.4 Non-standard evaluation

A final little peculiarity about the tidyverse packages – dplyr being one of them – that I want to discuss in this chapter is something called non-standard evaluation. How non-standard evaluation works really isn’t that important for us. If I’m being honest, I don’t even fully understand how it works “under the hood.” But, it is one of the big advantages of using dplyr, and therefore worth mentioning. Do you remember the section in the Let’s get programming chapter on common errors? In that section, I wrote about how a vector that lives directly in the global environment is a different thing to R than a vector that lives as a column inside a data frame. So, weight and class$weight are different things, and if you want to access the weight values in class$weight, then you have to make sure to write the whole thing out. But, have you noticed that we don’t have to do that in dplyr verbs? For example:

df %>% 
  filter(df$x == 0)
## # A tibble: 2 × 2
##      id     x
##   <dbl> <dbl>
## 1     1     0
## 2     3     0

In the example above we wrote out the column name using dollar sign notation. But, we don’t have to:

df %>% 
  filter(x == 0)
## # A tibble: 2 × 2
##      id     x
##   <dbl> <dbl>
## 1     1     0
## 2     3     0

When we don’t tell a dplyr verb exactly which data frame a column lives in, it assumes the column lives in the data frame passed to the .data argument. This is really handy for at least two reasons:

  • It reduces the amount of typing we have to do when we write our code. 👏

  • It makes our code easier to read. Without all the data frame names and dollar signs strewn about, it’s much easier to glance at our code and see what it’s actually doing.
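
In case you’re wondering what happens when a name exists in both places, the dplyr verbs look in the data frame first. For example, here is a quick sketch showing that creating a vector named x in the global environment doesn’t change the result at all; filter() still finds the x column inside df:

x <- c(99, 99, 99)  # A vector named x in the global environment

df %>% 
  filter(x == 0)    # dplyr still uses the x column inside df
## # A tibble: 2 × 2
##      id     x
##   <dbl> <dbl>
## 1     1     0
## 2     3     0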

Overall, non-standard evaluation is a great thing – at least in my opinion. However, it will present some challenges that we will have to overcome if we plan to use dplyr verbs inside of functions and loops. Don’t worry, we’ll come back to this topic later in the book.

Now that you (hopefully) have a better general understanding of the dplyr verbs, let’s go take a look at how to use them for data management.