19 Measures of central tendency

In previous sections you’ve seen methods for describing individual categorical variables. Now we’ll switch over to numerically describing numerical variables.

In epidemiology, we often want to describe the “typical” person in a population with respect to some characteristic that is recorded as a numerical variable – like height or weight. The most basic, and probably most commonly used, way to do so is with a measure of central tendency.

In this chapter we’ll discuss three measures of central tendency:

  • The mean
  • The median
  • The mode

Now, this is not a statistics course. But, I’m going to briefly discuss these measures, and some of their characteristics, below to make sure that we’re all on the same page when we discuss the interpretation of our results.

The mean

When we talk about the typical, or “average,” value of some variable measured on a continuous scale, we are usually talking about the mean value of that variable. To be even more specific, we are usually talking about the arithmetic mean value. This value has some favorable characteristics that make it a good description of central tendency.

👍 For starters it’s simple. Most people are familiar with the mean, and at the very least, have some intuitive sense of what it means (no pun intended).

👍 In addition, there can be only one mean value for any set of values.

However, there are a couple of potentially problematic characteristics of the mean as well:

👎 It’s susceptible to extreme values in your data. In other words, a couple of people with very atypical values for the characteristic you are interested in can drastically alter the value of the mean, and your estimate for the typical person in your population of interest along with it.

👎 Additionally, it’s very possible to calculate a mean value that is not actually observed anywhere in your data.

🗒Side Note: The sample mean is often referred to as \(\bar{x}\), which pronounced “x bar.”

The median

The median is probably the second most commonly used measure of central tendency. Like the mean, it’s computationally simple and relatively straightforward to understand. 👍 There can be one, and only one, median. 👍 And, its value may also be unobserved in the data.👎

However, unlike the mean, it’s relatively resistant to extreme values. 👍 In fact, when the median is used as the measure of central tendency, it’s often because the person conducting the analysis suspects that extreme values in the data are likely to distort the mean.

The mode

And finally, we have the mode, or the value that is most often observed in the data. It doesn’t get much simpler than that. 👍 But, unlike the mean and the median, there can be more than one mode for a given set of values. In fact, there can even be no mode if all the values are observed the exact same number of times.👎

However, if there is a mode, by definition it’s observed in the data.👍

Now that we are all on the same page with respect to the fundamentals of central tendency, let’s take a look at how to calculate these measures using R.

19.1 Calculate the mean

Calculating the mean is really straightforward. We can just use base R’s built-in mean() function.

# Load the dplyr package. We will need several of dpylr's functions in the 
# code below.
library(dplyr)
# Simulate some data
height_and_weight_20 <- tribble(
  ~id,   ~sex,     ~ht_in, ~wt_lbs,
  "001", "Male",   71,     190,
  "002", "Male",   69,     177,
  "003", "Female", 64,     130,
  "004", "Female", 65,     153,
  "005", NA,       73,     173,
  "006", "Male",   69,     182,
  "007", "Female", 68,     186,
  "008", NA,       73,     185,
  "009", "Female", 71,     157,
  "010", "Male",   66,     155,
  "011", "Male",   71,     213,
  "012", "Female", 69,     151,
  "013", "Female", 66,     147,
  "014", "Female", 68,     196,
  "015", "Male",   75,     212,
  "016", "Female", 69,     19000,
  "017", "Female", 66,     194,
  "018", "Female", 65,     176,
  "019", "Female", 65,     176,
  "020", "Female", 65,     102
)

👆 Here’s what we did above:

  • We loaded the tibble package so that we could use its tribble() function.

  • We used the tribble() function to simulate some data – heights and weights for 20 hypothetical students.

    • The tribble() function creates something called a tibble. A tibble is the tidyverse version of a data frame. In fact, it is a data frame, but with some additional functionality. You can use the link to read more about it if you’d like.

    • We used the tribble() function instead of the data.frame() function to create our data frame above because we can use the tribble() function to create our data frames in rows (like you see above) instead of columns with the c() function.

    • Using the tribble() function to create a data frame isn’t any better or worse than using the data.frame() function. I just wanted you to be aware that it exists and is sometimes useful.

mean(height_and_weight_20$ht_in)
## [1] 68.4

👆 Here’s what we did above:

  • We used base R’s mean() function to calculate the mean of the column “ht_in” from the data frame “height_and_weight_20.”

    • Note: if you just type mean(ht_in) you will get an error. That’s because R will look for an object called “ht_in” in the global environment.

    • However, we didn’t create an object called “ht_in.” We created an object (in this case a data frame) called “height_and_weight_20.” That object has a column in it called “ht_in.”

    • So, we must specifically tell R to look for the “ht_in” column in the data frame “height_and_weight_20.” Using base R, we can do that in one of two ways: height_and_weight_20$ht_in or height_and_weight_20[["ht_in"]].

19.2 Calculate the median

Similar to above, we can use base R’s median() function to calculate the median.

median(height_and_weight_20$ht_in)
## [1] 68.5

👆 Here’s what we did above:

  • We used base R’s median() function to calculate the modian of the column “ht_in” from the data frame “height_and_weight_20.”

19.3 Calculate the mode

Base R does not have a built-in mode() function. Well, it actually does have a mode() function, but for some reason that function does not return the mode value(s) of a set of numbers. Instead, the mode() function gets or sets the type or storage mode of an object. For example:

mode(height_and_weight_20$ht_in)
## [1] "numeric"

This is clearly not what we are looking for. So, how do we find the mode value(s)? Well, we are going to have to build our own mode function. Later in the book, I’ll return to this function and walk you through how I built it one step at a time. For now, just copy and paste the code into R on your computer. Keep in mind, as is almost always the case with R, my way of writing this function is only one of multiple possible ways.

mode_val <- function(x) {
  
  # Count the number of occurrences for each value of x
  value_counts <- table(x)
  
  # Get the maximum number of times any value is observed
  max_count <- max(value_counts)
  
  # Create and index vector that identifies the positions that correspond to
  # count values that are the same as the maximum count value: TRUE if so
  # and false otherwise
  index <- value_counts == max_count
  
  # Use the index vector to get all values that are observed the same number 
  # of times as the maximum number of times that any value is observed
  unique_values <- names(value_counts)
  result <- unique_values[index]
  
  # If result is the same length as value counts that means that every value
  # occured the same number of times. If every value occurred the same number
  # of times, then there is no mode
  no_mode <- length(value_counts) == length(result)
  
  # If there is no mode then change the value of result to NA
  if (no_mode) {
    result <- NA
  }
  
  # Return result
  result
}
mode_val(height_and_weight_20$ht_in)
## [1] "65" "69"

👆 Here’s what we did above:

  • We created our own function, mode_val(), that takes a vector (or data frame column) as a value to its “x” argument and returns the mode value(s) of that vector.

  • We can also see that the function works as expected when there is more than one mode value. In this case, “65” and “69” each occur 4 times in the column “ht_in.”

19.4 Compare mean, median, and mode

Now that you know how to calculate the mean, median, and mode, let’s compare these three measures of central tendency. This is a good opportunity to demonstrate some of the different characteristics of each that we spoke about earlier.

height_and_weight_20 %>% 
  summarise(
    min_weight    = min(wt_lbs),
    mean_weight   = mean(wt_lbs),
    median_weight = median(wt_lbs),
    mode_weight   = mode_val(wt_lbs) %>% as.double(),
    max_weight    = max(wt_lbs)
  )
## # A tibble: 1 × 5
##   min_weight mean_weight median_weight mode_weight max_weight
##        <dbl>       <dbl>         <dbl>       <dbl>      <dbl>
## 1        102       1113.          176.         176      19000

👆 Here’s what we did above:

  • We used the mean() function, median() function, and our mode_val() function inside of dpylr’s summarise() function to find the mean, median, and mode values of the column “wt_lbs” in the “height_and_weight_20” data frame.

  • We also used the as.double() function to convert the value returned by mode_val() – “176” – from a character string to a numeric double. This isn’t strictly necessary, but I think it looks better.

  • Finally, we used base R’s min() and max() functions to view the lowest and highest weights in our sample.

19.5 Data checking

Do you see any red flags 🚩as you scan the results? Do you really think a mean weight of 1,113 pounds sounds reasonable? This should definitely be a red flag for you. Now move your gaze three columns to the right and notice that the maximum value of weight is 19,000 lbs – an impossible value for a study in human populations. In this case the real weight was supposed to be 190 pounds, but the person entering the data accidently got a little trigger-happy with the zero key.

This is an example of what I meant when I said “We can use descriptive analysis to uncover errors in our data” in the Introduction to descriptive analysis chapter. Often times, for various reasons, some observations for a given variable take on values that don’t make sense. Starting by calculating some basic descriptive statistics for each variable is one approach you can use to try to figure out if you have values in your data that don’t make sense.

In this case we can just go back and fix our data, but what if we didn’t know this value was an error? What if it were a value that was technically possible, but very unlikely? Well, we can’t just go changing values in our data. It’s unethical, and in some cases illegal. Below, we discuss the how the properties of the median and mode can come in handy in situations such as this.

19.6 Properties of mean, median, and mode

Despite the fact that this impossibly extreme value is in our data, the median and mode estimates are reasonable estimates of the typical person’s weight in this sample. This is what I meant when I said that the median and mode were more “resistant to extreme values” than the mean.

You may also notice that no person in our sample had an actual weight of 1,112.75 (the mean) or even 176.5 (the median). This is what I meant above when I said that the mean and median values are “not necessarily observed in the data.”

In this case, the mode value (176) is also a more reasonable estimate of the average person’s weight than the mean. And unlike the mean and the median, participants 18 and 19 actually weigh 176 pounds. I’m not saying that the mode is always the best measure of central tendency to use. However, I am saying that you can often learn useful information from your data by calculating and comparing these relatively simple descriptive statistics on each of your numeric variables.

19.7 Missing data

In numerical descriptions of categorical variables we saw that we could use the dplyr::filter() function to remove all the rows from our data frame that contained a missing value for any of our variables of interest. We learned that this is called a complete case analysis. This method should pretty much always work, but in this section I’m going to show you an alternative method for dropping missing values from your analysis that you are likely to come across often when reading R documentation – the na.rm argument.

Many R functions that perform calculations on numerical variables include an na.rm – short for “Remove NA” – argument. By default, this argument is typically set to FALSE. By passing the value TRUE to this argument, we can perform a complete case analysis. Let’s quickly take a look at how it works.

We already saw that we can calculate the mean value of a numeric vector using the mean() function:

mean(c(1, 2, 3))
## [1] 2

But, what happens when our vector has a missing value?

mean(c(1, NA, 3))
## [1] NA

As you can see, the mean() function returns NA by default when we pass it a numeric vector that contains a missing value. It took me a little while to wrap my head around why this is the case when I was a student. Perhaps some of you are confused as well. The logic goes something like this. In R, an NA doesn’t represent the absence of a value – a value that doesn’t exist at all; rather, it represents a value that does exist, but is unknown to us. So, if I ask you to tell me the mean of a set of numbers that contains 1, some unknown number, and 3 what would your answer be? Well, you can’t just give me the mean of 1 and 2. That would imply that the unknown number doesn’t exist. Further, you can’t really give me any numeric answer because that answer will depend on the value of the missing number. So, the only logical answer to give me is something like “I don’t know” or “it depends.” 🤷 That is essentially what R is telling us when it returns an NA.

While this answer is technically correct, it usually isn’t very satisfying to us. Instead, we often want R to calculate the mean of the numbers that remain after all missing values are removed from the original set. The implicit assumption is that the mean of that reduced set of numbers will be “close enough” to the mean of the original set of numbers for our purposes. We can ask R to do this by changing the value of the na.rm argument from FALSE – the default – to TRUE.

mean(c(1, NA, 3), na.rm = TRUE)
## [1] 2

In this case, the mean of the original set of numbers (2) and the mean of our complete case analysis (2) are identical. That won’t always be the case.

Finally, let’s compare using filter() and na.rm = TRUE in a dplyr pipeline. We will first use the replace() function to add some missing values to our height_and_weight_20 data.

height_and_weight_20 <- height_and_weight_20 %>% 
  mutate(ht_in = replace(ht_in, c(1, 2), NA)) %>% 
  print()
## # A tibble: 20 × 4
##    id    sex    ht_in wt_lbs
##    <chr> <chr>  <dbl>  <dbl>
##  1 001   Male      NA    190
##  2 002   Male      NA    177
##  3 003   Female    64    130
##  4 004   Female    65    153
##  5 005   <NA>      73    173
##  6 006   Male      69    182
##  7 007   Female    68    186
##  8 008   <NA>      73    185
##  9 009   Female    71    157
## 10 010   Male      66    155
## 11 011   Male      71    213
## 12 012   Female    69    151
## 13 013   Female    66    147
## 14 014   Female    68    196
## 15 015   Male      75    212
## 16 016   Female    69  19000
## 17 017   Female    66    194
## 18 018   Female    65    176
## 19 019   Female    65    176
## 20 020   Female    65    102

👆Here’s what we did above:

  • Replaced the 1st and 2nd value of ht_in with NA (missing) using the replace() function.

Here’s what our results look like when we don’t perform a complete case analysis.

height_and_weight_20 %>% 
  summarise(
    min_height    = min(ht_in),
    mean_height   = mean(ht_in),
    median_height = median(ht_in),
    mode_height   = mode_val(ht_in),
    max_height    = max(ht_in)
  )
## # A tibble: 1 × 5
##   min_height mean_height median_height mode_height max_height
##        <dbl>       <dbl>         <dbl> <chr>            <dbl>
## 1         NA          NA            NA 65                  NA

Here’s what our results look like when we use the filter() function.

height_and_weight_20 %>% 
  filter(!is.na(ht_in)) %>% 
  summarise(
    min_height    = min(ht_in),
    mean_height   = mean(ht_in),
    median_height = median(ht_in),
    mode_height   = mode_val(ht_in),
    max_height    = max(ht_in)
  )
## # A tibble: 1 × 5
##   min_height mean_height median_height mode_height max_height
##        <dbl>       <dbl>         <dbl> <chr>            <dbl>
## 1         64        68.2            68 65                  75

And, here’s what our results look like when we change the na.rm argument to TRUE.

height_and_weight_20 %>% 
  summarise(
    min_height    = min(ht_in, na.rm = TRUE),
    mean_height   = mean(ht_in, na.rm = TRUE),
    median_height = median(ht_in, na.rm = TRUE),
    mode_height   = mode_val(ht_in),
    max_height    = max(ht_in, na.rm = TRUE)
  )
## # A tibble: 1 × 5
##   min_height mean_height median_height mode_height max_height
##        <dbl>       <dbl>         <dbl> <chr>            <dbl>
## 1         64        68.2            68 65                  75

As you can see, both methods give us the same result. The method you choose to use will typically just come down to personal preference.

19.8 Using meantables

In the sections above, we learned how to use dplyr functions to calculate various measures of central tendency for continuous variables. However, there can be a fair amount of code writing involved when using those methods. The more we have to repeatedly type code, the more tedious and error-prone it becomes. This is an idea we will return to many times in this book. Luckily, the R programming language allows us to write our own functions, which solves both of those problems.

Later in this book, I will show you how to write your own functions. For the time being, I’m going to suggest that you install and use a package I created called meantables. The meantables package is basically an enhanced version of the code we wrote in the sections above. I designed it to help us quickly make tables of descriptive statistics for continuous variables, and it’s specifically designed to work in a dplyr pipeline.

Like all packages, you need to first install it…

# You may be asked if you want to update other packages on your computer that
# meantables uses. Go ahead and do so.
install.packages("meantables")

And then load it…

# After installing meantables on your computer, you can load it just like you
# would any other package.
library(meantables)

Now, let’s use the mean_table() function from meantables package to rerun our analysis from above.

height_and_weight_20 %>%
  filter(!is.na(ht_in)) %>%
  mean_table(ht_in)
## # A tibble: 1 × 8
##   response_var     n  mean   sem   lcl   ucl   min   max
##   <chr>        <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ht_in           18  68.2 0.774  66.6  69.8    64    75

👆Here’s what we did above:

  • We used filter() to keep only the rows that have a non-missing value for ht_in and passed the data frame on to the mean_table() function using a pipe.

  • We told the mean_table() function to create a table of summary statistics for the variable ht_in. This is just an R data frame of useful statistics about a single continuous variable.

  • The summary statistics in the table above include:

    • response_var: The name of the variable (column) we are analyzing.

    • n: The number of non-missing values of response_var being analyzed in the current analysis.

    • mean: The mean of all n values of response_var.

    • sem: The standard error of the mean of all n values of response_var.

    • lcl: The lower (95%, by default) confidence limit for the percentage mean.

    • ucl: The upper (95%, by default) confidence limit for the percentage mean.

    • min: The minimum value of response_var.

    • max: The maximum value of response_var.

We will continue using the meantables package at various points throughout the book. I will also show you some other cool things we can do with meantables. For now, all you need to know how to do is use the mean_table() function to calculate basic descriptive statistics for single continuous variables.