R for Epidemiology
53 Missing Data
This chapter is under heavy development and may still undergo significant changes.