R for Epidemiology

3 Navigating the RStudio interface

You now have R and RStudio on your computer and you have some idea of what R and RStudio are. At this point, it is really common for people to open RStudio and get totally overwhelmed. “What am I looking at?” “What do I click first?” “Where do I even start?” Don’t worry if these, or similar, thoughts have crossed your mind. You are in good company and we will start to clear some of them up in this chapter.

When you first load RStudio you should see a screen that looks very similar to what you see in the picture below (Figure 3.1). In the current view, you see three panes and each pane has multiple tabs. Don’t beat yourself up if this isn’t immediately obvious. I’ll make it clearer soon.

Figure 3.1: The default RStudio user interface.

3.1 The console

The first pane we are going to talk about is the Console/Terminal/Jobs pane (Figure 3.2).

Figure 3.2: The R console.

It’s called the Console/Terminal/Jobs pane because it has three tabs you can click on: Console, Terminal, and Jobs. However, we will mostly refer to it as the Console pane and we will mostly ignore the Terminal and Jobs tabs. We aren’t ignoring them because they aren’t useful; rather, we are ignoring them because using them isn’t essential for anything we discuss anytime soon, and I want to keep things as simple as possible.

The console is the most basic way to interact with R. You can type a command to R into the console prompt (the prompt looks like “>”) and R will respond to what you type. For example, below I’ve typed 1 + 1, hit enter, and the R console returned the sum of the numbers 1 and 1 (Figure 3.3).

Figure 3.3: Doing some addition in the R console.
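For anyone following along without the screenshot, here is the same command as text. This is only a sketch of what the console shows; the lines that start with # are notes added for illustration, and R ignores them.

```r
# Type this at the console prompt (>) and press Enter
1 + 1
# [1] 2
```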

The number 1 you see in brackets before the 2 (i.e., [1]) is telling you that this line of results starts with the first result. That fact is obvious here because there is only one result. To make this idea clearer, let’s show you a result with multiple lines.

Figure 3.4: Demonstrating a function that returns multiple results.

In the screenshot above (Figure 3.4), we see a couple of new things demonstrated.

First, as promised, we have more than one line of results (or output). The first line of results starts with a 1 in brackets (i.e., [1]), which indicates that this line of results starts with the first result. In this case the first result is the number 2. The second line of results starts with a 29 in brackets (i.e., [29]), which indicates that this line of results starts with the twenty-ninth result. In this case the twenty-ninth result is the number 58. If you count the numbers in the first line, there should be 28 – results 1 through 28. I also want to make it clear that “1” and “29” are NOT results themselves. They are just helping us count the number of results per line.
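If you would like to check the counting yourself, the sketch below may help. It uses the length() function and bracket notation, neither of which has been introduced yet, so treat them as a preview rather than something you need to understand now. Also keep in mind that where R wraps a line of results depends on how wide your console is, so the bracketed number at the start of your second line may not be 29.

```r
# How many results does the call return in total?
length(seq(from = 2, to = 100, by = 2))
# [1] 50

# Ask for the twenty-ninth result directly
seq(from = 2, to = 100, by = 2)[29]
# [1] 58
```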

The second new thing here that you may have noticed is our use of a function. Functions are a BIG DEAL in R. So much so that R is called a functional language. You don’t really need to know all the details of what that means; however, you should know that, in general, everything you do in R you will do with a function. By contrast, everything you create in R will be an object. If you wanted to make an analogy between the R language and the English language, functions are verbs – they do things – and objects are nouns – they are things. This may be confusing right now. Don’t worry. It will become clearer soon.

Most of the time, we use a function in R by typing the function name followed by parentheses. For example, seq(), sum(), and mean().

Question: What is the name of the function we used in the example above?

It’s the seq() function, short for sequence. Inside the function, you may notice that there are three pairs of words and numbers, each joined by an equal symbol and separated from the next pair by a comma. They are from = 2, to = 100, and by = 2. In this case, from, to, and by are all arguments to the seq() function. I don’t know why they are called arguments, but as far as we are concerned, they just are. We will learn more about functions and arguments later, but for now just know that arguments give functions the information they need to give us the result we want.

In this case, the seq() function gives us a sequence of numbers, but we have to give it information about where that sequence should start, where it should end, and how big each step between values should be. Here the sequence begins with the value we gave to the from argument (i.e., 2), ends with the value we gave to the to argument (i.e., 100), and increases at each step by the number we gave to the by argument (i.e., 2). So, 2, 4, 6, 8 … 100.
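To see how the three arguments work together, here are two shorter sequences you could try in the console. The particular numbers are arbitrary and chosen only for illustration; the commented lines show what R prints.

```r
# Start at 2, end at 10, increase by 2 at each step
seq(from = 2, to = 10, by = 2)
# [1]  2  4  6  8 10

# Start at 0, end at 100, increase by 25 at each step
seq(from = 0, to = 100, by = 25)
# [1]   0  25  50  75 100
```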

While it’s convenient, let’s also learn some programming terminology:

  • Arguments: Arguments always go inside the parentheses of a function and give the function the information it needs to give us the result we want.

  • Pass: In programming lingo, you pass a value to a function argument. For example, in the function call seq(from = 2, to = 100, by = 2) we could say that we passed a value of 2 to the from argument, we passed a value of 100 to the to argument, and we passed a value of 2 to the by argument.

  • Returns: Instead of saying, “the seq() function gives us a sequence of numbers…” we could say, “the seq() function returns a sequence of numbers…” In programming lingo, functions return one or more results.

🗒Side Note: The seq() function isn’t particularly important or noteworthy. I essentially chose it at random to illustrate some key points. However, arguments, passing values, and return values are extremely important concepts and we will return to them many times.
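Here is a small sketch of the terminology above in action. It assumes nothing beyond base R. The sum() function and the idea of putting one function inside another are covered later, so the last example is only a preview.

```r
# Passing values to arguments by name, as in the example above
seq(from = 2, to = 100, by = 2)

# When we leave the names off, seq() matches the values to from, to, and by
# in that order, so it returns the same sequence
identical(seq(from = 2, to = 100, by = 2), seq(2, 100, 2))
# [1] TRUE

# The value a function returns can be passed to another function
sum(seq(from = 2, to = 100, by = 2))
# [1] 2550
```

Naming the arguments takes a little more typing, but it makes the code easier to read, which is why the examples here spell them out.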

3.2 The environment pane

The second pane we are going to talk about is the Environment/History/Connections pane (Figure 3.5). However, we will mostly refer to it as the Environment pane and we will mostly ignore the History and Connections tabs. We aren’t ignoring them because they aren’t useful; rather, we are ignoring them because using them isn’t essential for anything we will discuss anytime soon, and I want to keep things as simple as possible.

Figure 3.5: The environment pane.

The Environment pane shows you all the objects that R can currently use for data management or analysis. In Figure 3.5, our environment is empty. Let’s create an object and add it to our Environment.

Figure 3.6: The vector x in the global environment.

Here we see that we created a new object called x, which now appears in our Global Environment (Figure 3.6). This gives us another great opportunity to discuss some new concepts.

First, we created the x object in the Console by assigning the value 2 to the letter x. We did this by typing “x” followed by a less-than symbol (<), a dash symbol (-), and the number 2. R is kind of unique in this way. I have never seen another programming language (although I’m sure they are out there) that uses <- to assign values to variables. By the way, <- is called the assignment operator (or assignment arrow), and “assign” here means “make x contain 2” or “put 2 inside x.”

In many other languages you would write that as x = 2. But, for whatever reason, in R it is <-. Unfortunately, <- is more awkward to type than =. Fortunately, RStudio gives us a keyboard shortcut to make it easier. To type the assignment operator in RStudio, press Option + - (dash key) on a Mac or Alt + - (dash key) on a PC, and RStudio will insert <- complete with spaces on either side of the arrow. This may still seem awkward at first, but you will get used to it.

🗒Side Note: A note about using the letter “x”: By convention, the letter “x” is a widely used variable name. You will see it used a lot in example documents and online. However, there is nothing special about the letter x. We could have just as easily used any other letter (a <- 2), word (variable <- 2), or descriptive name (my_favorite_number <- 2) that is allowed by R.

Second, you can see that our Global Environment now includes the object x, which has a value of 2. In this case, we would say that x is a numeric vector of length 1 (i.e., it has one value stored in it). We will talk more about vectors and vector types soon. For now, just notice that objects that you can manipulate or analyze in R will appear in your Global Environment.
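As a rough sketch, the code below repeats the assignment and then asks R a few questions about x. The class(), length(), and ls() functions are not introduced in this chapter; they are used here only to confirm what the Environment pane is showing.

```r
# Assign the value 2 to x
x <- 2

# Typing the name of an object prints the value stored in it
x
# [1] 2

# x is a numeric vector with one value stored in it
class(x)
# [1] "numeric"
length(x)
# [1] 1

# List the names of the objects in the Global Environment.
# The result depends on what else you have created in your session.
ls()
```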

⚠️Warning: R is a case sensitive language. That means that uppercase x (X) and lowercase x (x) are different things to R. So, if you assign 2 to lowercase x (x <- 2) and then later ask R to tell you what number you stored in uppercase X, you will get an error (Error: object 'X' not found).
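If you want to see this for yourself, one way is sketched below. It assumes you have not created an object named uppercase X in your session. In the console you could simply type X and read the error; the tryCatch() function used here is well beyond this chapter and is included only so that the example can run from start to finish without stopping.

```r
# Assign 2 to lowercase x
x <- 2
x
# [1] 2

# Ask for uppercase X and capture the error message instead of stopping
msg <- tryCatch(X, error = function(e) conditionMessage(e))
msg
# [1] "object 'X' not found"
```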

3.3 The files pane

Next, let’s talk about the Files/Plots/Packages/Help/Viewer pane (Figure 3.7). That’s a mouthful.

Figure 3.7: The Files/Plots/Packages/Help/Viewer pane.

Again, some of these tabs are more applicable for us than others. For us, the files tab and the help tab will probably be the most useful. You can think of the files tab as a mini Finder window (for Mac) or a mini File Explorer window (for PC). The help tab is also extremely useful once you get acclimated to it.

Figure 3.8: The help tab.

For example, in the screenshot above (Figure 3.8), we typed seq into the search bar. The help pane then shows us a page of documentation for the seq() function. The documentation includes a brief description of what the function does, outlines all the arguments the seq() function recognizes, and, if you scroll down, gives examples of using the seq() function. Admittedly, this help documentation can seem a little like reading Greek (assuming you don’t speak Greek) at first. But you will get more comfortable using it with practice. I hated the help documentation when I was learning R. Now, I use it all the time.
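You don’t have to use the search bar to get to a help page. As a small sketch, either of the base R commands below, typed into the console, should bring up the same documentation for seq(). In RStudio the page opens in the help tab.

```r
# Put a question mark in front of the function name
?seq

# Or pass the function name to the help() function
help("seq")
```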

3.4 The source pane

There is actually a fourth pane available in RStudio. If you click on the icon shown below (Figure 3.9), you will get a dropdown box with a list of files you can create (Figure 3.10).

Figure 3.9: Click the new source file icon.

If you click any of these options, a new pane will appear. I will arbitrarily pick the first option – R Script.

Figure 3.10: New source file options.

When I do, a new pane appears. It’s called the source pane. In this case, the source pane contains an untitled R Script. We won’t get into the details now because I don’t want to overwhelm you, but soon you will do the majority of your R programming in the source pane.

Figure 3.11: A blank R script in the source pane.

3.5 RStudio preferences

Finally, I’m going to recommend that you change a few settings in RStudio before we move on.

Start by going to RStudio -> Preferences on a Mac (Figure 3.12).

Figure 3.12: Select the preferences menu on Mac.

Or start by going to Tools -> Global Options on Windows (Figure 3.13).

Figure 3.13: Select the global options menu on Windows.

In the “General” tab, I recommend unchecking the “Restore .RData into workspace at startup” checkbox. I also recommend setting the “Save workspace to .RData on exit” dropdown to “Never.” Finally, I recommend unchecking the “Always save history (even when not saving .RData)” checkbox (Figure 3.14).

Figure 3.14: General options tab.

On the “Appearance” tab, I’m going to change my Editor Theme to Twilight (Figure 3.15). It’s not so much that I’m recommending you change yours (this is entirely personal preference); I’m just letting you know why my screenshots will look different from here on out.

Figure 3.15: Appearance tab.

I’m sure you still have lots of questions at this point. That’s totally natural. However, I hope you now feel like you have some idea of what you are looking at when you open RStudio. Most of you will naturally get more comfortable with RStudio as we move through the book. For those of you who want more resources now, here are some suggestions.

  1. RStudio IDE cheatsheet

  2. ModernDive: What are R and RStudio?