41 Introduction to Epidemiology

This chapter is under heavy development and may still undergo significant changes.

This book has primarily been about R programming and data management up to this point. We have tried to create examples and scenarios that would resonate with epidemiologists and other people who are interested in epidemiology, but there was very little information in the previous chapters that epidemiology can claim exclusive ownership of. The same may be true for the rest of the book; however, from this point on, we will shift focus slightly from learning about R generally to learning about how to use R as a tool for grasping the concepts, and conducting the analyses, that are central to the practice of modern epidemiology.

This book isn’t intended to be a broad introduction to epidemiology. If you’re reading this, we expect that you’ve already had some exposure to the basics of epidemiology. If you haven’t had any exposure to the basics of epidemiology, Epidemiology by Leon Gordis is a popular introductory textbook and we recommend that you start there. Alternatively, you could read Modern Epidemiology by Lash, VanderWeele, Haneuse, and Rothman, which is our personal favorite, but is also regarded as a challenging text by some. Having said all of that, we would like to briefly touch on some core concepts that are important for epidemiology as they apply to this book. Namely, we want to introduce measurement and uncertainty.

41.1 Measurement

Epidemiology is typically defined as some version of, “the study of the occurrence and distribution of health-related states or events in specified populations, including the study of the determinants influencing such states, and the application of this knowledge to control the health problems.”[15] We usually say, “who gets sick or stays healthy, and why.” Because this isn’t an introductory course on epidemiology, we’re not going to attempt to pick apart all the nuances of either definition. However, it may be worth taking a step back and thinking about how we study the occurrence and distribution of these health states. Do we consult powerful oracles or deities? Not typically. Do we go into a deep meditative state until the answers just occur to us? Not typically. At least, doing that alone would not be very convincing evidence to most people. Instead, we almost always study health-related states and populations by measuring characteristics about them that are thought to be relevant. Then, we look for useful patterns in those measurements. Recording those measurements typically results in data, and looking for useful patterns typically occurs by applying statistical procedures to the data. Hence, data and statistics are probably the two most commonly used tools in many epidemiologists’ toolboxes.

We want to quickly note that the relevance of a characteristic is often based on previous observations and/or the relevance of the same characteristic(s) to other similar health-related events or populations. To be even more specific, when we say that we are “measuring characteristics,” we mean that we are recording numerical or qualitative values somewhere as we observe varying quantities or qualities of those characteristics. Sometimes those values are more or less dictated by nature (e.g., you have a certain genetic variation or you don’t), sometimes they are socially constructed (e.g., race), and sometimes they are assigned somewhat arbitrarily by us (e.g., mild, moderate, and severe pain). Note that what we measure, how we measure it, and how we interpret the measurements we collect are driven by what we believe to be important, and make up part of the assumptions we bring to the research process. This subjectivity should give you some pause. While this is not an ethics or philosophy textbook, it’s important to point out that data is not value-neutral. If you want to do this work well, you should take the time to think through the assumptions you bring to the table.
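To make “recording values” a little more concrete, here is a minimal sketch of how a few measured characteristics might be stored in R. The participants, column names, and values are all made up for illustration; they do not come from any real study.

```r
# Hypothetical measurements for five study participants (made-up values)
measurements <- data.frame(
  id = 1:5,
  # More or less dictated by nature: carries a given genetic variant or not
  variant = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  # Assigned somewhat arbitrarily by us: ordered pain categories
  pain = factor(
    c("mild", "severe", "moderate", "mild", "moderate"),
    levels = c("mild", "moderate", "severe"),
    ordered = TRUE
  ),
  # A numerical measurement
  age_years = c(34, 51, 47, 29, 62)
)

# Inspect how R stores each characteristic
str(measurements)

# Count participants in each pain category
table(measurements$pain)

# Ordered categories can be compared with each other, but where we draw the
# lines between "mild", "moderate", and "severe" is still our decision
measurements$pain >= "moderate"
```

Notice that the choice to store pain as three ordered categories, rather than as a 0 to 10 score or as free text, is itself one of the assumptions we bring to the data.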

If that all sounds a little too “deep” to be meaningful, we think the relevant takeaway for our purposes is that a typical day in the life of most epidemiologists includes attempting to describe, predict, and/or explain health-related phenomena. A potentially helpful way to frame this is that epidemiologists are storytellers. We tell stories about how different things impact people’s health, and we tell stories about whose health is impacted. What makes us different from some other storytellers is our heavy reliance on quantitative data to help us sort through which stories are useful for impacting population health.

41.2 Uncertainty

In epidemiology generally, and parts of this book specifically, it is important that we become comfortable with uncertainty. What do we mean when we say “uncertainty”? In epidemiological research (and quantitative research more generally), we actually need to get comfortable with uncertainty in a few different ways: statistical uncertainty, uncertainty in the research process itself, and epistemological uncertainty.

First, we should get comfortable with uncertainty in the statistical sense, which is something we will try to measure and quantify. Indeed, it may be an oversimplification, but not entirely inaccurate, to define statistics as the science of quantifying uncertainty. In the chapters that follow, we will discuss statistical uncertainty in more detail. For now, it is important to understand that even when everything goes well, our estimates are not perfectly precise. Statistical estimation does not give us the “right” answer, but it helps us understand what is most plausible based on our data. This is often a challenge for new students of epidemiology who are used to taking math classes where there is just one correct answer to a problem. As an epidemiologist, it is important to get comfortable with statistical uncertainty because of how heavily we rely on statistical methods to make sense of quantitative data.
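A small simulation can make this idea more tangible. The sketch below assumes a hypothetical condition with a true prevalence of 20% and imagines repeating the same study of 100 people many times. The prevalence, sample size, and number of studies are arbitrary values chosen only for illustration.

```r
# Hypothetical scenario: the true prevalence of a condition is 20%
set.seed(123)              # make the simulation reproducible
true_prevalence <- 0.20    # assumed value, chosen for illustration
n_people <- 100            # people measured in each study
n_studies <- 1000          # number of times we repeat the study

# For each study, count the cases among n_people and convert to a proportion
estimates <- rbinom(n_studies, size = n_people, prob = true_prevalence) / n_people

# The estimates cluster around the truth, but individual studies differ
summary(estimates)

# Standard error of a proportion: sqrt(p * (1 - p) / n)
theoretical_se <- sqrt(true_prevalence * (1 - true_prevalence) / n_people)
theoretical_se

# The spread of the simulated estimates should be close to the value above
sd(estimates)
```

Every simulated study was carried out “perfectly,” yet the estimates still vary from one study to the next. In real research we only get to see one of those estimates, which is why we try to quantify how much it could plausibly differ from the truth.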

Next, in a very practical sense, all students of epidemiology (including us) must get comfortable with a lack of certainty in the research process itself. Epidemiology is not a series of checklists and procedures. Students often wish that there were a simple checklist or algorithm that they could follow that would always lead them to the correct answer. But epidemiological inquiry does not work that way. In reality, we rely on (sometimes untestable) assumptions to inform how we conduct research (we will return to this again when we discuss directed acyclic graphs, or DAGs). A common phrase in our classrooms is “don’t let the data do the thinking for you.” Ultimately, the quantitative analyses discussed in this book are just one part of the process of making a (ideally) well-reasoned scientific argument about the research question under study. Clear-cut procedures that provide valid and reliable, black-and-white answers are the exception rather than the rule. On the bright side, this should also provide us with some measure of job security for the foreseeable future. If epidemiology could simply be reduced to a checklist or algorithm, then our jobs would almost certainly be outsourced to computers in no time.

And finally, we have to get comfortable with epistemological uncertainty, that is, uncertainty at the level of our conclusions. In epidemiology, the questions we are called to answer are often causal questions (e.g., Did this cause that? If we stop this, can we stop that?). On one hand, these questions are usually incredibly exciting and have the potential to lead to real, tangible changes in population health. On the other hand, our two primary tools, data and statistics, are not sufficient to answer such questions in and of themselves. The conclusions that we are able to make are almost always loaded with caveats and assumptions. This is true even for many non-causal questions. However, the questions being asked are often so important that we don’t have the luxury of completely deferring our conclusions until some imaginary later date when the stars align and the inner workings of the world are magically revealed to us with complete clarity. No, we must often do our best with the information and resources that are available to us now. Therefore, when we make conclusions, we will have to be comfortable with the fact that they may be wholly or partially incorrect, there may be important exceptions, and we may have to revisit them in the future when circumstances, or the information available to us, change. Even when they are entirely correct, we will often not be able to prove as much with complete certainty. Said another way, our conclusions will rarely, if ever, be exactly correct or provable; however, sometimes they will still be useful for a given purpose. That is an unsettling thought for many people. But it is true nevertheless, and we must become comfortable with it if we want to practice epidemiology. If it makes you feel any better, the history of public health, including epidemiology, is littered with notable examples of conclusions that were not entirely correct or not entirely proven, yet were still extremely useful: for example, citrus fruit to prevent and cure scurvy, drinking tainted water as a cause of cholera, and tobacco smoke as a cause of lung cancer (and many other diseases).

41.3 Summary

As we’ve already seen, R is a powerful tool for accessing, managing, analyzing, and presenting data. In the chapters that follow, we will learn how to use R to describe, predict, and explain health-related phenomena in populations of people. We will also use R to help ourselves develop a more concrete understanding of the key concepts related to correctly carrying out epidemiologic research.


15. Porta M, ed. A Dictionary of Epidemiology. Oxford University Press; 2008.