19 Numerical Descriptions of Categorical Variables

We’ll begin our discussion of descriptive statistics in the categorical half of our flow chart. Specifically, we’ll start by numerically describing categorical variables. As a reminder, categorical variables are variables whose values fit into categories.

Some examples of categorical variables commonly seen in public health data are: sex, race or ethnicity, and level of educational attainment.

Notice that there is no inherent numeric value to any of these categories. Having said that, we can, and often will, assign a numeric value to each category using R.

The two most common numerical descriptions of categorical variables are probably the frequency count (you will often hear this referred to as simply the frequency, the count, or the n) and the proportion or percentage (the percentage is just the proportion multiplied by 100).

The count is simply the number of observations, in this case people, which fall into each possible category.

The proportion is just the count divided by the total number of observations. For example, if 2 out of 5 people are in the Asian race category, then the proportion of people in that category is 2 / 5 = .40, or 40%.
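To make the arithmetic concrete, here is a minimal sketch using a made-up vector of five race categories. The values are hypothetical and were chosen only so that two of the five people fall into the Asian category.

```r
# Hypothetical race categories for five people (illustrative values only)
race <- c("Asian", "Black", "Asian", "White", "White")

# Frequency count: the number of people in each category
counts <- table(race)
counts

# Proportion: each count divided by the total number of observations
props <- counts / length(race)
props

# Percentage: the proportion multiplied by 100
props * 100
```

The rest of the chapter does the same thing with data frames instead of a bare vector.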

The remainder of this chapter is devoted to learning how to calculate frequency counts and percentages using R.

19.1 Factors

We first learned about factors in the Let’s Get Programming chapter. Before moving on to calculating frequency counts and percentages, we will discuss factors in slightly greater depth here. As a reminder, factors can be useful for representing categorical data in R. To demonstrate, let’s simulate a simple little data frame.

# Load dplyr for tibble()
library(dplyr)
demo <- tibble(
  id  = c("001", "002", "003", "004"),
  age = c(30, 67, 52, 56),
  edu = c(3, 1, 4, 2)
)

👆 Here’s what we did above:

  • We created a data frame that is meant to simulate some demographic information about 4 hypothetical study participants.

  • The first variable (id) is the participant’s study id.

  • The second variable (age) is the participant’s age at enrollment in the study.

  • The third variable (edu) is the highest level of formal education the participant completed. Where:

    • 1 = Less than high school

    • 2 = High school graduate

    • 3 = Some college

    • 4 = College graduate

Each participant in our data frame has a value for edu – 1, 2, 3, or 4. The value they have for that variable corresponds to the highest level of formal education they have completed, which is split up into categories that we defined. We can see which category each person is in by viewing the data.

demo
## # A tibble: 4 × 3
##   id      age   edu
##   <chr> <dbl> <dbl>
## 1 001      30     3
## 2 002      67     1
## 3 003      52     4
## 4 004      56     2

We can see that person 001 is in category 3, person 002 is in category 1, and so on. This compact representation of the categories is convenient for data entry and data manipulation, but it also has an obvious limitation – what do these numbers mean? I defined what these values mean for you above, but if you didn’t have that information, or some kind of prior knowledge about the process that was used to gather this data, then you would likely have no idea what these numbers mean.

Now, we could have solved that problem by making education a character vector from the beginning. For example:

demo <- tibble(
  id       = c("001", "002", "003", "004"),
  age      = c(30, 67, 52, 56),
  edu      = c(3, 1, 4, 2),
  edu_char = c(
    "Some college", "Less than high school", "College graduate", 
    "High school graduate"
  )
)

demo
## # A tibble: 4 × 4
##   id      age   edu edu_char             
##   <chr> <dbl> <dbl> <chr>                
## 1 001      30     3 Some college         
## 2 002      67     1 Less than high school
## 3 003      52     4 College graduate     
## 4 004      56     2 High school graduate

But, this strategy also has a few limitations.

👎 First, entering data this way requires more typing. Not such a big deal in this case because we only have 4 participants. But, imagine typing out the categories as character strings 10, 20, or 100 times. 😫

👎 Second, R summarizes character vectors alphabetically by default, which may not be the ideal way to order some categorical variables.

👎 Third, creating categorical variables in our data frame as character vectors limits us to inputting only observed values for that variable. However, there are cases when other categories are possible and just didn’t apply to anyone in our data. That information may be useful to know.

At this point, I’m going to show you how to coerce a variable to a factor in your data frame. Then, I will return to showing you how using factors can overcome some of the limitations outlined above.

19.1.1 Coerce a numeric variable

The code below shows one method for coercing a numeric vector into a factor.

# Load dplyr for pipes and mutate()
library(dplyr)
demo <- demo %>% 
  mutate(
    edu_f = factor(
      x      = edu,
      levels = 1:4,
      labels = c(
        "Less than high school", "High school graduate", "Some college", 
        "College graduate"
      )
    )
  )

demo
## # A tibble: 4 × 5
##   id      age   edu edu_char              edu_f                
##   <chr> <dbl> <dbl> <chr>                 <fct>                
## 1 001      30     3 Some college          Some college         
## 2 002      67     1 Less than high school Less than high school
## 3 003      52     4 College graduate      College graduate     
## 4 004      56     2 High school graduate  High school graduate

👆Here’s what we did above:

  • We used dplyr’s mutate() function to create a new variable (edu_f) in the data frame called demo. The purpose of the mutate() function is to add new variables to data frames. We will discuss mutate() in greater detail later in the book.

    • You can type ?mutate into your R console to view the help documentation for this function and follow along with the explanation below.

    • We assigned this new data frame the name demo using the assignment operator (<-).

    • Because we assigned it the name demo, our previous data frame named demo (i.e., the one that didn’t include edu_f) no longer exists in our global environment. If we had wanted to keep that data frame in our global environment, we would have needed to assign our new data frame a different name (e.g., demo_w_factor).

  • The first argument to the mutate() function is the .data argument. The value passed to the .data argument should be a data frame that is currently in our global environment. We passed the data frame demo to the .data argument using the pipe operator (%>%), which is why demo isn’t written inside mutate’s parentheses.

  • The second argument to the mutate() function is the ... argument. The value passed to the ... argument should be a name-value pair. That means a variable name, followed by an equal sign, followed by the values to be assigned to that variable name (name = value).

    • The name we passed to the ... argument was edu_f. This value tells R what to name the new variable we are creating.

      • If we had used the name edu instead, then the previous values in the edu variable would have been replaced with the new values. That is sometimes what you want to happen. However, when it comes to creating factors, I typically keep the numeric version of the variable in my data frame (e.g., edu) and add a new factor variable. I just often find that it can be useful to have both versions of the variable hanging around during the analysis process.

      • I also use the _f naming convention in my code. That means that when I create a new factor variable I name it the same thing the original variable was named with the addition of _f (for factor) at the end.

    • In this case, the value that will be assigned to the name edu_f will be the values returned by the factor() function. This is an example of nesting functions.

  • We used the factor() function to create a factor vector.

    • You can type ?factor into your R console to view the help documentation for this function and follow along with the explanation below.

    • The first argument to the factor() function is the x argument. The value passed to the x argument should be a vector of data. We passed the edu vector to the x argument.

    • The second argument to the factor() function is the levels argument. This argument tells R the unique values that the new factor variable can take. We used the shorthand 1:4 to tell R that edu_f can take the unique values 1, 2, 3, or 4.

    • The third argument to the factor() function is the labels argument. The value passed to the labels argument should be a character vector of labels (i.e., descriptive text) for each value in the levels argument. The order of the labels in the character vector we pass to the labels argument should match the order of the values passed to the levels argument. For example, the ordering of levels and labels above tells R that 1 should be labeled with “Less than high school”, 2 should be labeled with “High school graduate”, etc.
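One consequence of the levels argument is worth seeing in code: any value in x that is not listed in levels becomes NA. The sketch below uses a made-up vector (edu_with_typo and edu_check_f are hypothetical names, not part of the demo data frame) that contains a stray 5.

```r
# Hypothetical education codes; the 5 is not one of the declared levels
edu_with_typo <- c(3, 1, 4, 2, 5)

edu_check_f <- factor(
  x      = edu_with_typo,
  levels = 1:4,
  labels = c(
    "Less than high school", "High school graduate", "Some college",
    "College graduate"
  )
)

# The fifth element is NA because 5 was not listed in levels
edu_check_f

# levels() returns the labels in the order we declared them
levels(edu_check_f)
```

R does not warn about the unmatched value, so it can be worth checking for unexpected NA values after creating a factor.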

When we printed the data frame above, the values in edu_f looked the same as the character strings displayed in edu_char. Notice, however, that the variable type displayed below edu_char in the data frame above is <chr> for character, whereas the variable type displayed below edu_f is <fct> for factor. Although labels make factors look like character vectors, factors are still integer vectors under the hood. For example:

as.numeric(demo$edu_char)
## Warning: NAs introduced by coercion
## [1] NA NA NA NA
as.numeric(demo$edu_f)
## [1] 3 1 4 2
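If you want to poke at this yourself, the base R functions below expose the individual pieces of a factor. This is a small self-contained sketch that rebuilds the same factor as a standalone vector instead of a data frame column.

```r
edu <- c(3, 1, 4, 2)
edu_f <- factor(
  x      = edu,
  levels = 1:4,
  labels = c(
    "Less than high school", "High school graduate", "Some college",
    "College graduate"
  )
)

class(edu_f)        # "factor"
typeof(edu_f)       # "integer", the underlying storage type
levels(edu_f)       # the labels, in level order
as.integer(edu_f)   # the integer codes: 3 1 4 2
as.character(edu_f) # the labels as plain character strings
```

Keep in mind that the integer codes are positions in the levels vector. They match the original edu values here only because we declared the levels as 1:4.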

There are two main reasons that you may want to use factors instead of character vectors at times:

👍 First, R summarizes character vectors alphabetically by default, which may not be the ideal way to order some categorical variables. However, we can explicitly set the order of factor levels. This will be useful to us later when we analyze categorical variables. Here is a glimpse of things to come:

table(demo$edu_char)
## 
##      College graduate  High school graduate Less than high school          Some college 
##                     1                     1                     1                     1
table(demo$edu_f)
## 
## Less than high school  High school graduate          Some college      College graduate 
##                     1                     1                     1                     1

👆Here’s what we did above:

  • You can type ?base::table into your R console to view the help documentation for this function and follow along with the explanation below.

  • We used the table() function to get a count of the number of times each unique value of edu_char appears in our data frame. In this case, each value appears one time. Notice that the results are returned to us in alphabetical order.

  • Next, we used the table() function to get a count of the number of times each unique value of edu_f appears in our data frame. Again, each value appears one time. Notice, however, that this time the results are returned to us in the order that we passed to the levels argument of the factor() function above.

👍 Second, creating categorical variables in our data frame as character vectors limits us to inputting only observed values for that variable. However, there are cases when other categories are possible and just didn’t apply to anyone in our data. That information may be useful to know. Factors allow us to tell R that other values are possible, even when they are unobserved in our data. For example, let’s add a fifth possible category to our education variable – graduate school.

demo <- demo %>% 
  mutate(
    edu_5cat_f = factor(
      x      = edu,
      levels = 1:5,
      labels = c(
        "Less than high school", "High school graduate", "Some college", 
        "College graduate", "Graduate school"
      )
    )
  )

demo
## # A tibble: 4 × 6
##   id      age   edu edu_char              edu_f                 edu_5cat_f           
##   <chr> <dbl> <dbl> <chr>                 <fct>                 <fct>                
## 1 001      30     3 Some college          Some college          Some college         
## 2 002      67     1 Less than high school Less than high school Less than high school
## 3 003      52     4 College graduate      College graduate      College graduate     
## 4 004      56     2 High school graduate  High school graduate  High school graduate

Now, let’s use the table() function once again to count the number of times each unique level of edu_char appears in the data frame and the number of times each unique level of edu_5cat_f appears in the data frame:

table(demo$edu_char)
## 
##      College graduate  High school graduate Less than high school          Some college 
##                     1                     1                     1                     1
table(demo$edu_5cat_f)
## 
## Less than high school  High school graduate          Some college      College graduate       Graduate school 
##                     1                     1                     1                     1                     0

Notice that R now tells us that the value Graduate school was possible but was observed zero times in the data.
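If you later decide that you do not want unobserved levels in your output, base R's droplevels() function removes them. A minimal sketch, rebuilding the five-category factor as a standalone vector:

```r
edu <- c(3, 1, 4, 2)
edu_5cat_f <- factor(
  x      = edu,
  levels = 1:5,
  labels = c(
    "Less than high school", "High school graduate", "Some college",
    "College graduate", "Graduate school"
  )
)

# Five levels were declared, even though only four were observed
nlevels(edu_5cat_f)

# droplevels() returns a copy of the factor without the unused level
edu_observed_f <- droplevels(edu_5cat_f)
nlevels(edu_observed_f)
table(edu_observed_f)
```

The original factor is left unchanged, so you can keep both versions around if that is useful.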

19.1.2 Coerce a character variable

It is also possible to coerce character vectors to factors. For example, we can coerce edu_char to a factor like so:

demo <- demo %>% 
  mutate(
    edu_f_from_char = factor(
      x      = edu_char,
      levels = c(
        "Less than high school", "High school graduate", "Some college", 
        "College graduate", "Graduate school"
      )
    )
  )

demo
## # A tibble: 4 × 7
##   id      age   edu edu_char              edu_f                 edu_5cat_f            edu_f_from_char      
##   <chr> <dbl> <dbl> <chr>                 <fct>                 <fct>                 <fct>                
## 1 001      30     3 Some college          Some college          Some college          Some college         
## 2 002      67     1 Less than high school Less than high school Less than high school Less than high school
## 3 003      52     4 College graduate      College graduate      College graduate      College graduate     
## 4 004      56     2 High school graduate  High school graduate  High school graduate  High school graduate
table(demo$edu_f_from_char)
## 
## Less than high school  High school graduate          Some college      College graduate       Graduate school 
##                     1                     1                     1                     1                     0

👆Here’s what we did above:

  • We coerced a character vector (edu_char) to a factor using the factor() function.

  • Because the levels are character strings, there was no need to pass any values to the labels argument this time. Keep in mind, though, that the order of the values passed to the levels argument matters. It will be the order that the factor levels will be displayed in your analyses.
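One pitfall to watch for when coercing character vectors: the strings in the data must match the strings in levels exactly, including capitalization and spaces. Anything that does not match becomes NA without a warning. The sketch below uses a made-up vector (edu_raw is a hypothetical name) to show this.

```r
# Hypothetical responses: the second value is lowercase and the third has a
# trailing space, so neither matches a declared level exactly
edu_raw <- c("Some college", "some college", "College graduate ")

edu_raw_f <- factor(
  x      = edu_raw,
  levels = c(
    "Less than high school", "High school graduate", "Some college",
    "College graduate", "Graduate school"
  )
)

edu_raw_f

# Count how many values failed to match a level
sum(is.na(edu_raw_f))
```

Comparing the number of NA values before and after coercion is a quick way to catch this kind of mismatch.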

Now that we know how to use factors, let’s return to our discussion of describing categorical variables.

19.2 Height and Weight Data

Below, we’re going to learn to do descriptive analysis in R by experimenting with some simulated data that contains several people’s sex, height, and weight. You can follow along with this lesson by copying and pasting the code chunks below into your R session.

# Load the dplyr package. We will need several of dplyr's functions in the 
# code below.
library(dplyr)
# Simulate some data
height_and_weight_20 <- tibble(
  id = c(
    "001", "002", "003", "004", "005", "006", "007", "008", "009", "010", "011", 
    "012", "013", "014", "015", "016", "017", "018", "019", "020"
  ),
  sex = c(1, 1, 2, 2, 1, 1, 2, 1, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2),
  sex_f = factor(sex, 1:2, c("Male", "Female")),
  ht_in = c(
    71, 69, 64, 65, 73, 69, 68, 73, 71, 66, 71, 69, 66, 68, 75, 69, 66, 65, 65, 
    65
  ),
  wt_lbs = c(
    190, 176, 130, 154, 173, 182, 140, 185, 157, 155, 213, 151, 147, 196, 212, 
    190, 194, 176, 176, 102
  )
)

19.2.1 View the data

Let’s start our analysis by taking a quick look at our data…

height_and_weight_20
## # A tibble: 20 × 5
##    id      sex sex_f  ht_in wt_lbs
##    <chr> <dbl> <fct>  <dbl>  <dbl>
##  1 001       1 Male      71    190
##  2 002       1 Male      69    176
##  3 003       2 Female    64    130
##  4 004       2 Female    65    154
##  5 005       1 Male      73    173
##  6 006       1 Male      69    182
##  7 007       2 Female    68    140
##  8 008       1 Male      73    185
##  9 009       2 Female    71    157
## 10 010       1 Male      66    155
## 11 011       1 Male      71    213
## 12 012       2 Female    69    151
## 13 013       2 Female    66    147
## 14 014       2 Female    68    196
## 15 015       1 Male      75    212
## 16 016       2 Female    69    190
## 17 017       2 Female    66    194
## 18 018       2 Female    65    176
## 19 019       2 Female    65    176
## 20 020       2 Female    65    102

👆Here’s what we did above:

  • Simulated some data that we can use to practice categorical data analysis.

  • We viewed the data and found that it has 5 variables (columns) and 20 observations (rows).

  • Also notice that you can use the “Next” button at the bottom right corner of the printed data frame to view rows 11 through 20 if you are viewing this data in RStudio.

19.3 Calculating frequencies

Now that we’re able to easily view our data, let’s return to the original purpose of this demonstration – calculating frequencies and proportions. At this point, I suspect that few of you would have any trouble telling me that the frequency of females in this data is 12 and the frequency of males in this data is 8. It’s pretty easy to just count the number of females and males in this small data set with only 20 rows. Further, if I asked you what proportion of this sample is female, most of you would still be able to easily tell me 12/20 = 0.6, or 60%. But, what if we had 100 observations or 1,000,000 observations? You’d get sick of counting pretty quickly. Fortunately, you don’t have to! Let R do it for you! As is almost always the case with R, there are multiple ways we can calculate the statistics that we’re interested in.

19.3.1 The base R table function

As we already saw above, we can use the base R table() function like this:

table(height_and_weight_20$sex)
## 
##  1  2 
##  8 12
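Two small variations are sketched below as suggestions. Passing the factor version of the variable (sex_f) gives labeled output, and base R's prop.table() function converts a table of counts into proportions. The block recreates only the sex column so that it runs on its own.

```r
sex <- c(1, 1, 2, 2, 1, 1, 2, 1, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2)
sex_f <- factor(sex, levels = 1:2, labels = c("Male", "Female"))

# Counts, labeled with the factor labels instead of 1 and 2
sex_counts <- table(sex_f)
sex_counts

# Proportions: each count divided by the sum of all counts
sex_props <- prop.table(sex_counts)
sex_props
```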

Additionally, we can use the CrossTable() function from the gmodels package, which gives us a little more information by default.

19.3.2 The gmodels CrossTable function

# Like all packages, you will have to install gmodels
# (install.packages("gmodels")) before you can use the CrossTable() function.
gmodels::CrossTable(height_and_weight_20$sex)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  20 
## 
##  
##           |         1 |         2 | 
##           |-----------|-----------|
##           |         8 |        12 | 
##           |     0.400 |     0.600 | 
##           |-----------|-----------|
## 
## 
## 
## 

19.3.3 The tidyverse way

The final way I’m going to discuss here is the tidyverse way, which is my preference. We will have to write a little additional code, but the end result will be more flexible, more readable, and will return our statistics to us in a data frame that we can save and use for further analysis. Let’s walk through this step by step…

🗒Side Note: You should already be familiar with the pipe operator (%>%), but if it doesn’t look familiar to you, you can learn more about it in Using pipes. Don’t forget, if you are using RStudio, you can use the keyboard shortcut shift + command + m (Mac) or shift + control + m (Windows) to insert the pipe operator.

First, we don’t want to view the individual values in our data frame. Instead, we want to condense those values into summary statistics. This is a job for the summarise() function.

height_and_weight_20 %>% 
  summarise()
## # A tibble: 1 × 0

As you can see, summarise() doesn’t do anything interesting on its own. We need to tell it what kind of summary information we want. We can use the n() function to count rows. By default, it will count all the rows in the data frame. For example:

height_and_weight_20 %>% 
  summarise(n())
## # A tibble: 1 × 1
##   `n()`
##   <int>
## 1    20

👆Here’s what we did above:

  • We passed our entire data frame to the summarise() function and asked it to count the number of rows in the data frame.

  • The result we get is a new data frame with 1 column (named n()) and one row with the value 20 (the number of rows in the original data frame).

This is a great start. However, we really want to count the number of rows that have the value “Female” for sex_f, and then separately count the number of rows that have the value “Male” for sex_f. Said another way, we want to break our data frame up into smaller data frames – one for each value of sex_f – and then count the rows. This is exactly what dplyr’s group_by() function does.

height_and_weight_20 %>%
  group_by(sex_f) %>% 
  summarise(n())
## # A tibble: 2 × 2
##   sex_f  `n()`
##   <fct>  <int>
## 1 Male       8
## 2 Female    12

And, that’s what we want.

🗒Side Note: dplyr’s group_by() function operationalizes the Split - Apply - Combine strategy for data analysis. That sounds sort of fancy, but all it really means is that we split our data frame up into smaller data frames, apply our calculation separately to each smaller data frame, and then combine those individual results back together as a single result. So, in the example above, the height_and_weight_20 data frame was split into two separate little data frames (i.e., one for females and one for males), then the summarise() and n() functions counted the number of rows in each of the two smaller data frames (i.e., 12 and 8 respectively), and finally combined those individual results into a single data frame, which was printed to the screen for us to view.
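For intuition only, here is roughly what that strategy looks like when it is spelled out by hand with base R. This illustrates the idea on a small made-up data frame; it is not a description of how dplyr is implemented internally.

```r
# A small hypothetical data frame
df <- data.frame(
  id    = c("001", "002", "003", "004", "005"),
  sex_f = factor(
    c("Male", "Male", "Female", "Female", "Female"),
    levels = c("Male", "Female")
  )
)

# Split: one smaller data frame per level of sex_f
pieces <- split(df, df$sex_f)

# Apply: count the rows in each piece
counts <- sapply(pieces, nrow)

# Combine: gather the individual results into a single data frame
result <- data.frame(sex_f = names(counts), n = unname(counts))
result
```

The group_by() and summarise() pipeline does all three steps for us in two short lines.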

However, it will be awkward to work with a variable named n() (i.e., with parentheses) in the future. Let’s go ahead and assign it a different name. We can assign it any valid name we want. Some names that might make sense are n, frequency, or count. I’m going to go ahead and just name it n without the parentheses.

height_and_weight_20 %>%
  group_by(sex_f) %>% 
  summarise(n = n())
## # A tibble: 2 × 2
##   sex_f      n
##   <fct>  <int>
## 1 Male       8
## 2 Female    12

👆Here’s what we did above:

  • We added n = to our summarise function (summarise(n = n())) so that our count column in the resulting data frame would be named n instead of n().

Finally, estimating categorical frequencies like this is such a common operation that dplyr has a shortcut for it – count(). We can use the count() function to get the same result that we got above.

height_and_weight_20 %>% 
  count(sex_f)
## # A tibble: 2 × 2
##   sex_f      n
##   <fct>  <int>
## 1 Male       8
## 2 Female    12
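One detail to be aware of when counting factors: by default, count() returns rows only for levels that were actually observed. Its .drop argument controls this. The sketch below rebuilds a small version of the demo data frame from the factors section to show the difference.

```r
library(dplyr)

demo <- tibble(
  edu        = c(3, 1, 4, 2),
  edu_5cat_f = factor(
    x      = edu,
    levels = 1:5,
    labels = c(
      "Less than high school", "High school graduate", "Some college",
      "College graduate", "Graduate school"
    )
  )
)

# Default behavior: only the four observed levels are returned
demo %>%
  count(edu_5cat_f)

# .drop = FALSE keeps the unobserved level and gives it a count of 0
edu_counts <- demo %>%
  count(edu_5cat_f, .drop = FALSE)

edu_counts
```

This gives the same zero count for Graduate school that table() reported earlier, but in a data frame.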

19.4 Calculating percentages

In addition to frequencies, we will often be interested in calculating percentages for categorical variables. As always, there are many ways to accomplish this task in R. From here on out, I’m going to primarily use tidyverse functions.

In this case, the proportion of people in our data who are female can be calculated as the number who are female (12) divided by the total number of people in the data (20). Because we already know that there are 20 people in the data, we could calculate proportions like this:

height_and_weight_20 %>% 
  count(sex_f) %>% 
  mutate(prop = n / 20)
## # A tibble: 2 × 3
##   sex_f      n  prop
##   <fct>  <int> <dbl>
## 1 Male       8   0.4
## 2 Female    12   0.6

👆Here’s what we did above:

  • Because the count() function returns a data frame just like any other data frame, we can manipulate it in the same ways we can manipulate any other data frame.

  • So, we used dplyr’s mutate() function to create a new variable in the data frame named prop. Again, we could have given it any valid name.

  • Then we set the value of prop to be equal to the value of n divided by 20.

This works, but it would be better to have R calculate the total number of observations for the denominator (20) than for us to manually type it in. In this case, we can do that with the sum() function.

height_and_weight_20 %>% 
  count(sex_f) %>% 
  mutate(prop = n / sum(n))
## # A tibble: 2 × 3
##   sex_f      n  prop
##   <fct>  <int> <dbl>
## 1 Male       8   0.4
## 2 Female    12   0.6

👆Here’s what we did above:

  • Instead of manually typing in the total count for our denominator (20), we had R calculate it for us using the sum() function. The sum() function added together all the values of the variable n (i.e., 12 + 8 = 20).

Finally, we just need to multiply our proportion by 100 to convert it to a percentage.

height_and_weight_20 %>% 
  count(sex_f) %>% 
  mutate(percent = n / sum(n) * 100)
## # A tibble: 2 × 3
##   sex_f      n percent
##   <fct>  <int>   <dbl>
## 1 Male       8      40
## 2 Female    12      60

👆Here’s what we did above:

  • Changed the name of the variable we are creating from prop to percent. But, we could have given it any valid name.

  • Multiplied the proportion by 100 to convert it to a percentage.

19.5 Missing data

In the real world, you will frequently encounter data that has missing values. Let’s quickly take a look at an example by adding some missing values to our data frame.

height_and_weight_20 <- height_and_weight_20 %>% 
  mutate(sex_f = replace(sex, c(2, 9), NA)) %>% 
  print()
## # A tibble: 20 × 5
##    id      sex sex_f ht_in wt_lbs
##    <chr> <dbl> <dbl> <dbl>  <dbl>
##  1 001       1     1    71    190
##  2 002       1    NA    69    176
##  3 003       2     2    64    130
##  4 004       2     2    65    154
##  5 005       1     1    73    173
##  6 006       1     1    69    182
##  7 007       2     2    68    140
##  8 008       1     1    73    185
##  9 009       2    NA    71    157
## 10 010       1     1    66    155
## 11 011       1     1    71    213
## 12 012       2     2    69    151
## 13 013       2     2    66    147
## 14 014       2     2    68    196
## 15 015       1     1    75    212
## 16 016       2     2    69    190
## 17 017       2     2    66    194
## 18 018       2     2    65    176
## 19 019       2     2    65    176
## 20 020       2     2    65    102

👆Here’s what we did above:

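  • We overwrote sex_f with a copy of sex in which the 2nd and 9th values were replaced with NA (missing) using the replace() function. Because sex is a numeric variable, sex_f is now numeric (<dbl>) as well, so its categories print as 1 (Male) and 2 (Female) in the output that follows.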
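If you would rather keep the Male and Female labels while adding missing values, one option is to apply replace() to the factor column itself. This is an alternative sketch, not what the code above does, and it uses a shortened, hypothetical data frame named small.

```r
library(dplyr)

small <- tibble(
  sex   = c(1, 1, 2, 2),
  sex_f = factor(sex, levels = 1:2, labels = c("Male", "Female"))
)

# Replace the 2nd and 4th values of the factor with NA
small_na <- small %>%
  mutate(sex_f = replace(sex_f, c(2, 4), NA))

small_na

# The column is still a factor, and its levels are unchanged
class(small_na$sex_f)
levels(small_na$sex_f)
```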

Now let’s see how our code from above handles this:

height_and_weight_20 %>% 
  count(sex_f) %>% 
  mutate(percent = n / sum(n) * 100)
## # A tibble: 3 × 3
##   sex_f     n percent
##   <dbl> <int>   <dbl>
## 1     1     7      35
## 2     2    11      55
## 3    NA     2      10

As you can see, we are now treating missing as if it were a category of sex_f. Sometimes this will be the result you want. However, often you will want the n and percent of non-missing values for your categorical variable. This is sometimes referred to as a complete case analysis. There are a couple of different ways we can handle this. Here, I will simply filter out rows with a missing value for sex_f using dplyr’s filter() function.

height_and_weight_20 %>% 
  filter(!is.na(sex_f)) %>% 
  count(sex_f) %>% 
  mutate(percent = n / sum(n) * 100)
## # A tibble: 2 × 3
##   sex_f     n percent
##   <dbl> <int>   <dbl>
## 1     1     7    38.9
## 2     2    11    61.1

👆Here’s what we did above:

  • We used filter() to keep only the rows that have a non-missing value for sex_f. 

    • In the R language, we use the is.na() function to tell the R interpreter to identify NA (missing) values in a vector. We cannot use something like sex_f == NA to identify NA values, which is sometimes confusing for people who are coming to R from other statistical languages.

    • In the R language, ! is the NOT operator. It sort of means “do the opposite.”

    • So, filter() tells R which rows of a data frame to keep, and is.na(sex_f) tells R to find rows with an NA value for the variable sex_f. Together, filter(is.na(sex_f)) would tell R to keep rows with an NA value for the variable sex_f. Adding the NOT operator ! tells R to do the opposite – keep rows that do NOT have an NA value for the variable sex_f.

  • We used our code from above to calculate the n and percent of non-missing values of sex_f. 
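The behavior of == with NA is easy to check on a small vector. The sketch below uses a made-up vector x to show why is.na() is needed.

```r
x <- c(1, NA, 2)

# Comparing with == returns NA for every element, so it cannot be used to
# find missing values
x == NA

# is.na() returns TRUE where a value is missing
is.na(x)

# Negating with ! flags the non-missing values instead
!is.na(x)

# Keep only the non-missing values
x[!is.na(x)]
```

The expression inside filter() works the same way: rows where !is.na(sex_f) is TRUE are kept.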

19.6 Formatting results

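Notice that our percentages are no longer whole numbers. The printed tibble abbreviates them to 3 significant digits (38.9 and 61.1), but the values stored in the data frame have many more digits to the right of the decimal (e.g., 7 / 18 * 100 = 38.88889). If we wanted to present our findings somewhere (e.g., a journal article or a report for our employer) we would almost never want to display that many digits. Let’s get R to round these numbers for us.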

height_and_weight_20 %>% 
  filter(!is.na(sex_f)) %>% 
  count(sex_f) %>% 
  mutate(percent = (n / sum(n) * 100) %>% round(2))
## # A tibble: 2 × 3
##   sex_f     n percent
##   <dbl> <int>   <dbl>
## 1     1     7    38.9
## 2     2    11    61.1

👆Here’s what we did above:

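  • We passed the calculated percentage values (n / sum(n) * 100) to the round() function to round our percentages to 2 decimal places. The stored values are now 38.89 and 61.11, even though the printed tibble still shows them as 38.9 and 61.1.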

    • Notice that we had to wrap n / sum(n) * 100 in parentheses in order to pass it to the round() function with a pipe.

    • We could have alternatively written our R code this way: mutate(percent = round(n / sum(n) * 100, 2)).
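To see the effect of rounding without the tibble print method in the way, we can work with a single number. The sketch below uses the first percentage from the table above, 7 / 18 * 100, and also shows base R's sprintf() function as one option for producing a formatted string for a report.

```r
pct <- 7 / 18 * 100

# Printed with R's default of 7 significant digits
pct

# Rounded to 2 decimal places and to 1 decimal place
round(pct, 2)
round(pct, 1)

# sprintf() returns a character string with a fixed number of decimals
pct_label <- sprintf("%.1f%%", pct)
pct_label
```

Note that sprintf() returns text, not a number, so it is best used as a final formatting step after all calculations are done.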

19.7 Using freqtables

In the sections above, we learned how to use dplyr functions to calculate the frequency and percentage of observations that take on each value of a categorical variable. However, there can be a fair amount of code writing involved when using those methods. The more we have to repeatedly type code, the more tedious and error-prone it becomes. This is an idea we will return to many times in this book. Luckily, the R programming language allows us to write our own functions, which solves both of those problems.

Later in this book, I will show you how to write your own functions. For the time being, I’m going to suggest that you install and use a package I created called freqtables. The freqtables package is basically an enhanced version of the code we wrote in the sections above. I designed it to help us quickly make tables of descriptive statistics (i.e., counts, percentages, confidence intervals) for categorical variables, and it’s specifically designed to work in a dplyr pipeline.

Like all packages, you need to first install it…

# You may be asked if you want to update other packages on your computer that
# freqtables uses. Go ahead and do so.
install.packages("freqtables")

And then load it…

# After installing freqtables on your computer, you can load it just like you
# would any other package.
library(freqtables)

Now, let’s use the freq_table() function from the freqtables package to rerun our analysis from above.

height_and_weight_20 %>%
  filter(!is.na(sex_f)) %>%
  freq_table(sex_f)
## # A tibble: 2 × 9
##   var   cat       n n_total percent    se t_crit   lcl   ucl
##   <chr> <chr> <int>   <int>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 sex_f 1         7      18    38.9  11.8   2.11  18.2  64.5
## 2 sex_f 2        11      18    61.1  11.8   2.11  35.5  81.8

👆Here’s what we did above:

  • We used filter() to keep only the rows that have a non-missing value for sex_f and passed the data frame on to the freq_table() function using a pipe.

  • We told the freq_table() function to create a univariate frequency table for the variable sex_f. A “univariate frequency table” just means a table (data frame) of useful statistics about a single categorical variable.

  • The univariate frequency table above includes:

    • var: The name of the categorical variable (column) we are analyzing.

    • cat: Each of the different categories the variable var contains – in this case 1 (Male) and 2 (Female).

    • n: The number of rows where var equals the value in cat. In this case, there are 7 rows where the value of sex_f is 1 (Male), and 11 rows where the value of sex_f is 2 (Female).

    • n_total: The sum of all the n values. This is also the total number of rows in the data frame currently being analyzed.

    • percent: The percent of rows where var equals the value in cat.

    • se: The standard error of the percent. This value is not terribly useful on its own; however, it’s necessary for calculating the 95% confidence intervals.

    • t_crit: The critical value from the t distribution. This value is not terribly useful on its own; however, it’s necessary for calculating the 95% confidence intervals.

    • lcl: The lower (95%, by default) confidence limit for the percentage percent.

    • ucl: The upper (95%, by default) confidence limit for the percentage percent.
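To connect these columns to one another, the sketch below recomputes the first row of the table by hand in base R. The formulas are assumptions inferred from the printed values, not taken from the freqtables documentation: a standard error with n_total - 1 in the denominator and a confidence interval built on the log-odds scale happen to reproduce the numbers shown above. Check the package documentation before relying on them.

```r
n       <- 7
n_total <- 18
p       <- n / n_total

# Standard error of the proportion (assumed form: n_total - 1 in the denominator)
se <- sqrt(p * (1 - p) / (n_total - 1))

# Critical value from the t distribution for a 95% interval
t_crit <- qt(0.975, df = n_total - 1)

# Assumed interval: built on the log-odds (logit) scale, then transformed
# back to a proportion with plogis(), the inverse logit
logit_p  <- log(p / (1 - p))
se_logit <- se / (p * (1 - p))
lcl <- plogis(logit_p - t_crit * se_logit)
ucl <- plogis(logit_p + t_crit * se_logit)

# Express everything as percentages, rounded to 1 decimal place
round(c(percent = p, se = se, lcl = lcl, ucl = ucl) * 100, 1)
round(t_crit, 2)
```

Building the interval on the log-odds scale keeps the limits between 0% and 100%, which is why the interval is not symmetric around 38.9.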

We will continue using the freqtables package at various points throughout the book. I will also show you some other cool things we can do with freqtables. For now, all you need to know how to do is use the freq_table() function to calculate frequencies and percentages for single categorical variables.

🏆 Congratulations! You now know how to use R to do some basic descriptive analysis of individual categorical variables.