44 Introduction to Directed Acyclic Graphs

This chapter is under heavy development and may still undergo significant changes.

The field of statistics has produced many tools that we can use to quantify, and therefore estimate, the effect of random error if we are willing to make certain assumptions. Similarly, pioneers in the field of causal inference (e.g., Judea Pearl, James Robins, Sander Greenland, and Miguel Hernán) have refined and tested tools that allow us to estimate average causal effects from observational data if we are willing to make certain assumptions. One of those tools – the one we will focus on in this chapter – is called a directed acyclic graph, or DAG for short. A very simple DAG is shown here.

Put simply, DAGs are just graphs that help us tell a story about causes and effects. For example, perhaps the x node in the DAG above represents the position of a light switch and the y node represents the status of a light bulb (on vs. off). Then, the causal story that the DAG above would tell is that changing the position of the light switch causes the light bulb to turn on and off. Because it is a graph, the entire story of causes and effects is summarized in a compact visual representation. More importantly, it turns out that these graphs, powered by a mathematical language called graph theory, contain mathematical information that will eventually allow us to make estimates of causal effects, if we are willing to accept some assumptions. In this chapter, we will discuss some of the basic nuts and bolts of DAGs, as well as how to create DAGs with R. In future chapters, we will progressively learn more about using DAGs as a tool for causal inference.

44.1 Basic DAG structures and vocabulary

DAGs are made from nodes and edges. In figure 44.1, x, y, and z are nodes, and A and B are edges. You may see DAGs where the nodes are literal dots, and you may see DAGs that use variable names as the nodes. Figure 44.1 has an example of each. Nodes represent the variables we are modeling, and edges encode the relationships between those variables.

Two nodes are said to be adjacent to each other if there is an edge connecting them. For example, x and y are adjacent in figure 44.1, but x and z are not.

Edges can be directed or undirected. Edges that are directed have an arrowhead on at least one end, and edges that are undirected do not. Edges A and B in figure 44.1 are both directed because they both have an arrowhead on one end. The “start” or “out” side of a directed edge is the side that doesn’t have an arrowhead, and the “end” or “in” side does have an arrowhead. For example, edge A in figure 44.1 starts at node x and ends at node y. If all of the edges in a graph are directed, then the graph is said to be a directed graph.

Are the graphs in figure 44.1 both directed graphs?

Yes, they are both directed graphs because all of the edges are directed.

When there are two nodes connected by a directed edge, the start of the edge is the cause and the end of the edge is the effect. So, the story figure 44.1 tells is that x causes y and y causes z.

Figure 44.1: Basic DAG Structures: Nodes and edges.
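The vocabulary above can also be checked in code. The following is a minimal sketch that assumes the dagitty R package (introduced later in this chapter) is installed and that figure 44.1 shows the graph x -> y -> z. The object name g is our own choice for illustration.

```r
library(dagitty)

# Assumed reconstruction of figure 44.1: x causes y, and y causes z
g <- dagitty("dag { x -> y -> z }")

# The nodes in the graph
names(g)

# The edges in the graph, returned as a data frame with one row per edge
edges(g)

# The nodes that are adjacent to x (connected to x by an edge)
adjacentNodes(g, "x")
```

If the graph matches the figure, y should be listed as adjacent to x, and z should not.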

Relationships between nodes in a DAG are often described in terms of family relationships. The node at the start of a directed edge is called a parent of the node the directed edge ends at. Conversely, the node at the end of the directed edge is called a child of the node the directed edge started from. In figure 44.2, x is a parent of y and a grandparent of z. Equivalently, y is a child of x, z is a child of y, and z is a grandchild of x.

Figure 44.2: Basic DAG Structures: Descendants.
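As a rough illustration, the family relationships described above can be queried with the dagitty R package (an assumption here, since the package is only introduced later in the chapter). The graph below is our reconstruction of figure 44.2 as x -> y -> z.

```r
library(dagitty)

# Assumed reconstruction of figure 44.2
g <- dagitty("dag { x -> y -> z }")

# Direct relatives: one edge away
parents(g, "y")   # the node(s) with an edge ending at y
children(g, "y")  # the node(s) with an edge starting at y

# Extended relatives: any number of edges away.
# By default, dagitty also lists the queried node itself in these results.
ancestors(g, "z")
descendants(g, "x")
```

Here we would expect x to be the only parent of y, and z to appear among the descendants of x, which matches the "grandchild" language used in the text.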

A directed path (often simply referred to as a path) is any arrow-based route between two variables on the graph. In figure 44.3, there is a path from x to y, from y to z, and from x to z that goes through y. Paths are either open or closed according to d-separation rules, which we will discuss soon.

Figure 44.3: Basic DAG Structures: Paths.

When there is at least one directed path from a node that leads back to itself, the graph is said to be cyclic. In figure 44.4, there is a path – from x to y to z to x – that begins and ends at x. Therefore, the graph in figure 44.4 is cyclic. If there are no cyclic paths in a graph (as is the case in every other graph we’ve seen in this chapter), then the graph is said to be acyclic. When our graphs are directed and acyclic, they are called directed acyclic graphs, or DAGs.

Figure 44.4: Basic DAG Structures: A cyclic graph.
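Checking for cycles by eye gets harder as graphs grow. The sketch below assumes the dagitty R package is installed and uses its isAcyclic() function. The two graphs are our reconstructions of the chain and of the cyclic graph in figure 44.4, and the object names are hypothetical.

```r
library(dagitty)

# A chain: no directed path leads back to its starting node
acyclic_graph <- dagitty("dag { x -> y -> z }")

# Assumed reconstruction of figure 44.4: x -> y -> z -> x
cyclic_graph <- dagitty("dag { x -> y -> z -> x }")

# TRUE when the graph contains no cycles, FALSE otherwise
isAcyclic(acyclic_graph)
isAcyclic(cyclic_graph)
```

Only the first graph qualifies as a DAG. The second is directed, but not acyclic.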

Colliders exist where two arrowheads “collide” into a node. For example, in figure 44.5, the arrow from x to y and the arrow from z to y “collide” into each other at y.

Figure 44.5: Basic DAG Structures: Colliders.

Note that colliders are path specific. For example, y is a collider on the x -> y <- z path in figure 44.6, but it is not a collider on the x -> y -> W path.

Figure 44.6: Basic DAG Structures: A slightly more complex collider.
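To make the path-specific nature of colliders a little more concrete, here is a sketch using the paths() function from the dagitty R package (assumed to be installed). The graph is our reconstruction of figure 44.6, and the open/closed status it reports relies on the d-separation rules that are discussed later.

```r
library(dagitty)

# Assumed reconstruction of figure 44.6
g <- dagitty("dag {
  x -> y
  z -> y
  y -> W
}")

# All paths between x and z, and whether each one is open
x_to_z <- paths(g, from = "x", to = "z")
x_to_z

# All paths between x and W, and whether each one is open
x_to_W <- paths(g, from = "x", to = "W")
x_to_W
```

The only path between x and z passes through y as a collider, so it is reported as closed. On the path from x to W, y is not a collider, and the path is reported as open.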

Common causes are another important concept when using DAGs. In figure 44.7, y is a common cause of x and z. We know this because an arrow points from y to x and from y to z.

It’s important to note that statistical associations can follow any path regardless of the direction of the arrows (in the absence of colliders). However, causal effects only follow the direction of the arrows (provided our assumptions are correct). So, in figure 44.7, we expect x to be associated with z even though x does not cause z. We can also say that y confounds the relationship (or lack of relationship) between x and z. Confounding is one of the most critical concepts in all of epidemiology, and it has an entire chapter devoted to it later in the book.

Figure 44.7: Basic DAG Structures: Common causes.
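A small simulation can illustrate association without causation. The sketch below uses only base R. The sample size, seed, and effect sizes are arbitrary assumptions chosen for illustration: y causes both x and z, but x has no effect on z.

```r
# Make the simulation reproducible
set.seed(44)
n <- 10000

# y is a common cause of x and z; x and z do not affect each other
y <- rnorm(n)
x <- y + rnorm(n)
z <- y + rnorm(n)

# x and z are associated even though neither causes the other.
# With these settings, the theoretical correlation is 0.5.
cor_xz <- cor(x, z)
cor_xz

# After removing the part of x and z explained by y,
# the remaining association should be close to zero
x_resid <- resid(lm(x ~ y))
z_resid <- resid(lm(z ~ y))
cor_xz_given_y <- cor(x_resid, z_resid)
cor_xz_given_y
```

The first correlation reflects the open path x <- y -> z. The second suggests that the association comes from the common cause rather than from any effect of x on z.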

44.2 Creating DAGs in R

There are a number of R packages we can use to create and analyze DAGs in R. DAGitty is a popular tool for creating and analyzing DAGs. Notably, DAGitty has a graphical interface we can use to create, edit, and analyze DAGs directly in our web browser. We can access DAGitty in our browser by navigating to https://www.dagitty.net/ and clicking on Launch DAGitty online in your browser, which is the far-left box of figure 44.8 below.

Figure 44.8: Screenshot of the DAGitty homepage.

You might have also noticed the dagitty R package in figure 44.8. It’s a great package that allows us to use DAGitty directly from within an R session, and I encourage interested readers to check it out. However, for the sake of efficiency, we will focus on using a different R package, built on top of dagitty, ggplot2, and ggraph, for the remainder of this chapter. That package is called ggdag.

# Load the packages we will need below
library(dplyr, warn.conflicts = FALSE)
library(ggdag, warn.conflicts = FALSE)
library(ggplot2)

44.3 Chains

The first DAG structure we will learn how to create with ggdag is called a chain. A chain consists of three nodes and two edges, with one edge going into the middle node and the other edge coming out of the middle node.¹⁷ We have already seen several examples of chains above, for example, figure 44.3. Let’s look at the code we used to create figure 44.3.

# Create a DAG called chain
chain <- dagify(
  y ~ x, # The form is effect ~ cause
  z ~ y,
  # Optionally add coordinates to control the placement of the nodes on the DAG
  coords = list( 
    x = c(x = 1, y = 2, z = 3),
    y = c(x = 0, y = 0, z = 0)
  )
)

# Plot the dag called chain and print it to the screen
ggdag(chain) +
  theme_dag()
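Because ggdag is built on top of dagitty, the object returned by dagify() can also be inspected with dagitty functions. The following sketch assumes both packages are installed. It recreates the chain without coordinates so that the example stands on its own.

```r
library(ggdag)

# Recreate the chain DAG (coordinates omitted because we are not plotting)
chain <- dagify(
  y ~ x, # The form is effect ~ cause
  z ~ y
)

# dagify() returns a dagitty object
inherits(chain, "dagitty")

# The chain contains no cycles
dagitty::isAcyclic(chain)

# List the directed paths from x to z
chain_paths <- dagitty::paths(chain, from = "x", to = "z", directed = TRUE)
chain_paths$paths
```

We would expect a single directed path from x to z that passes through y, which is what makes this structure a chain.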