Appendix A — Glossary

Anchor: A regular expression (regex) metacharacter that anchors a match to a position in a string. The caret (^) anchors to the start of the string, and the dollar sign ($) anchors to the end of the string.
Arguments: Values provided inside the parentheses of a function to specify what the function should act on or how it should behave.
Bivariate: Describes analyses or relationships involving exactly two variables.
Collapse: To summarize a data set by grouping and reducing multiple observations into a single summary value per group, often using functions like summarise() in dplyr.
Complete case analysis: An analysis that includes only observations with complete data for all variables in the model, excluding any rows with missing values.
Console: The interactive window in RStudio (usually bottom-left) where R commands can be typed and executed immediately. The R console is useful for testing small pieces of code and interactive data exploration. However, we recommend using R scripts or Quarto files for all but the simplest programming or data-analysis tasks.
Conditional Operations: Conditional operations control the flow of a program by executing different blocks of code depending on whether specified conditions are TRUE or FALSE. In R this includes the if / else family, vectorised helpers such as ifelse(), and higher-level wrappers like case_when().
Data Checks: Processes that verify the accuracy, completeness, or validity of data before analysis. Examples include type checks (e.g., numeric vs. character), range checks (e.g., no ages below 0), completeness checks (e.g., missing-value rates), and cross-field consistency checks (e.g., start ≤ end dates).
Data frame: R’s primary data structure for storing tabular data. Each column is a vector, and all columns must have the same number of rows.
For loop: A control structure that repeats a block of code once for each element in a sequence or vector.
Frequency count: The number of times a value or category appears in a dataset. Also called a frequency, count, or n.
Functions: A reusable block of code that performs a specific task when called. Functions take inputs (arguments) and return outputs. Functions promote modularity, abstraction, and reproducibility.
Global environment: The main workspace in an R session where user-defined variables and functions are stored unless otherwise specified.
Issue (GitHub): A GitHub feature used to track tasks, bugs, enhancements, or other requests within a project.
Iteratively: A method of solving a problem by repeatedly executing a set of instructions in a step-by-step manner, often using loops. This approach can improve efficiency and help prevent errors.
Join: An operation that merges two tables based on shared key columns. In dplyr, functions like inner_join(), left_join(), etc., determine how unmatched rows are handled.
Key: A column or set of columns that uniquely identifies each row in a data set and enables precise merging with other tables.
Lexical scoping rules: A set of rules that determine which variables are accessible in a function based on where they were defined in the code hierarchy.
List-wise Deletion: A method of handling missing data by excluding any row that has at least one missing value in variables of interest, leaving only complete cases.
Long: A tidy data format where each row represents one measurement at one time point for one unit, and columns contain values and their corresponding identifiers (e.g., variable name or time).
MDL: The Minimum Description Length (MDL) principle is a model selection principle stating that the best model is the one that minimizes the combined complexity of the model and the data encoded using that model.
Marginal totals: Row and column totals in a contingency table, used to summarize the distribution of each variable and to calculate the overall total.
Mean: The arithmetic mean—often denoted $\bar{x}$ —is calculated by summing all values in a numeric variable and dividing by the total number of values.
Median: The middle number in an ordered list of values. When the list has an even number of elements, the median is the average of the two middle numbers. Compared with the mean, the median is relatively resistant to extreme values.
Merge: A base-R term (function merge()) for combining two data frames by matching rows on one or more key variables. Rows that do not match can be kept, dropped, or produce missing values depending on the arguments.
Mode: The value that occurs most often in a set of data. A data set may be unimodal (one mode), multimodal (many modes), or have no mode (all values are equally frequent).
Non-standard Evaluation: A programming behavior in which functions capture or modify expressions instead of immediately evaluating their values, commonly used in tidyverse packages.
Objects: Named containers for storing data or functions in R. Common object types include vectors, lists, data frames, and functions.
Outcome variable: The variable being predicted or explained in an analysis; also called the dependent variable.
Pass: To supply a value or object to a function’s argument when calling that function. For example, passing 2 to the from argument in seq(from = 2, to = 100, by = 2).
Percentage: A value representing a part per hundred. Calculated by dividing the number of occurrences by the total number of observations and multiplying by 100. For example, 25% means 25 out of 100.
Person-level: Describes data organized at the level of the individual, where each row corresponds to one person.
Person-period: Describes data structured with multiple rows per person, usually representing repeated measurements across time or conditions.
Predictor variable: A variable used to explain or predict the value of another variable; also called an independent variable or explanatory variable.
Proportion: A number between 0 and 1 that represents the fraction of a total. Calculated by dividing the number of occurrences by the total number of observations.
Quantifier: A regular expression metacharacter that defines how many times a pattern must repeat. Common quantifiers include * (0 +), + (1 +), ? (0 – 1), and {m,n} (between m and n times).
R: “R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues.”¹ R is open source, and you can download it for free from The Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/.
Range: The difference between the maximum and minimum values in a data set.
Regular Expressions: Compact strings that describe search patterns for text. Regular expressions (regexes) are used for tasks such as finding, extracting, replacing, or validating character data (stringr, grepl(), gsub(), etc.).
Repository: “A repository contains all of your code, your files, and each file’s revision history. You can discuss and manage your work within the repository.”² A repository can exist locally on your computer or remotely on a server such as GitHub.
Return: A command in a function that specifies what value to output when the function finishes running. Instead of saying, “the seq() function gives us a sequence of numbers…,” we could say, “the seq() function returns a sequence of numbers.”
RStudio: An integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor with direct code execution, and tools for plotting, debugging, and work space management. RStudio is available as open-source desktop software and as server versions.³
Split - Apply - Combine: A data-analysis strategy used by dplyr::group_by(), for example, that involves splitting data into smaller components, applying a calculation separately to each component, and then combining the individual results into a single output.
Standard deviation: A measure of spread equal to the square root of the variance—the average of the squared differences between each value and the mean.
Token: A basic unit in text processing, typically referring to individual pieces of data like words, numbers, or punctuation marks. Tokens include literal characters (a), metacharacters (\d), or entire character classes ([A-Z]).
Two-way frequency table: A table that displays the joint distribution of two categorical variables, showing the frequency of each combination of categories. Also called a crosstab or contingency table.
Univariate: Pertaining to a single variable. Univariate analysis describes or summarizes one variable at a time.
Variance: A measure of spread calculated as the average of the squared differences between each value and the mean.
Vectorization: A programming technique in which operations are applied to entire vectors (or matrices/data frames) in a single step rather than iterating element-by-element. Vectorized code in R (x * 2, mean(x)) is clear and fast because the heavy computation occurs in compiled code under the hood.
Wide: A data format where repeated measures or variables are spread across multiple columns (e.g., score_T1, score_T2, score_T3 for test scores across three terms), with one row per subject or unit.