Visualising participant recruitment in R using Sankey plots

learning
tutorial
visualisation
R
Author

Craig Parylo

Published

February 28, 2024

Introduction

Sankey diagrams are great tools to visualise flows through a system. They show connections between the steps of a process where the width of the arrows is proportional to the flow.

I’m working on an evaluation of a risk screening process for people aged between 55-74 years and a history of smoking. In this Targeted Lung Health Check (TLHC) programme1 eligible people are invited to attend a free lung check where those assessed at high risk of lung cancer are then offered low-dose CT screening scans.

1 Please visit the NHS England site for for more background.

We used Sankey diagrams to visualise how people have engaged with the programme, from recruitment, attendance at appointments, their outcome from risk assessment, attendance at CT scans and will eventually be extended to cover the impact of the screening on early detection of those diagnosed with lung cancer.

This blog post is about the technical process of preparing record-level data for visualisation in a Sankey plot using R and customising it to enhance look and feel. Here is how the finished product will look:

Data wrangling

First we’ll attach some packages. I’ll be using plotly for the visualisation of the Sankey chart, tidygraph for graph manipulation and scales to handle colour transformation and rescaling values. We will also be using the tidyverse and glue packages for general data wrangling and reactable to preview our data as we go along.

# libraries
library(tidyverse) # 'tidy' data wrangling
library(plotly) # sankey visualisation framework
library(reactable) # viewing interactive datatables
library(glue) # concatenating strings
library(tidygraph) # api for graph / network manipulation
library(scales) # used for colour transformation

Get the data

In this example we will work with a simplified set of data focused on invitations.

The invites table holds details of when people were sent a letter or message inviting them to take part, how many times they were invited and how the person responded.

The people eligible for the programme are identified up-front and are represented by a unique ID with one row per person. Let’s assume each person receives at least one invitation to take part, they can have one of three outcomes:

  1. They accept the invitation and agree to take part,

  2. They decline the invitation,

  3. They do not respond to the invitation.

If the person doesn’t respond to the first invitation they may be sent a second invitation and could be offered a third invitation if they didn’t respond to the second.

Here is the specification for our simplified invites table:

Invites specification
Field Type Description
Participant ID Integer A unique identifier for each person.
Invite date 1 Date

The date the person was first invited to participate.

Every person will have a date in this field.

Invite date 2 Date The date a second invitation was sent.
Invite date 3 Date The date a third invitation was sent.
Invite outcome Text The outcome from the invite, one of either ‘Accepted’, ‘Declined’ or ‘No response’.

Everyone receives at least one invite. Assuming a third of these respond (to either accept or decline) then two-thirds receive a follow-up invite. Of these, we assume half respond, meaning the remaining participants receive a third invite.

Here we generate 100 rows of example data to populate our table.

Code
# set a randomisation seed for reproducibility
set.seed(seed = 1234)

# define some parameters
start_date <- as.Date("2019-01-01")
end_date <- as.Date("2021-01-01")
rows <- 100

df_invites_1 <- tibble(
  # create a unique id for each participant
  participant_id = 1:rows,

  # create a random initial invite date between our start and end dates
  invite_1_date = sample(
    seq(start_date, end_date, by = "day"),
    size = rows, replace = T
  ),

  # create a random outcome for this participant
  invite_outcome = sample(
    x = c("Accepted", "Declined", "No response"),
    size = rows, replace = T
  )
)

# take a sample of participants and allocate them a second invite date
df_invites_2 <- df_invites_1 |>
  # sample two thirds of participants to get a second invite
  slice_sample(prop = 2 / 3) |>
  # allocate a date between 10 and 30 days following the first
  mutate(
    invite_2_date = invite_1_date + sample(10:30, size = n(), replace = T)
  ) |>
  # keep just id and second date
  select(participant_id, invite_2_date)


# take a sample of those with a second invite and allocate them a third invite date
df_invites_3 <- df_invites_2 |>
  # sample half of these to get a third invite
  slice_sample(prop = 1 / 2) |>
  # allocate a date between 10 to 30 days following the second
  mutate(
    invite_3_date = invite_2_date + sample(10:30, size = n(), replace = T)
  ) |>
  # keep just id and second date
  select(participant_id, invite_3_date)

# combine the 2nd and 3rd invites with the first table
df_invites <- df_invites_1 |>
  left_join(
    y = df_invites_2,
    by = "participant_id"
  ) |>
  left_join(
    y = df_invites_3,
    by = "participant_id"
  ) |>
  # move the outcome field after the third invite
  relocate(invite_outcome, .after = invite_3_date)

# housekeeping
rm(df_invites_1, df_invites_2, df_invites_3, start_date, end_date, rows)

# view our data
df_invites |>
  reactable(defaultPageSize = 5)

Generated invite table

Determine milestone outcomes

The next step is to take our source table and convert the data into a series of milestones (and associated outcomes) that represents how our invited participants moved through the pathway.

In our example we have five milestones to represent in our Sankey plot:

  • Our eligible population (everyone in our invites table),

  • The result from the first invitation,

  • The result from the second invitation,

  • The result from the third invitation,

  • The overall invite outcome.

Aside from the eligible population, where everyone starts with the same value, participants will have one of several outcomes at each milestone. This step is about naming these milestones and the outcomes.

It is important that each milestone-outcome has unique values. An outcome of ‘No response’ can be recorded against the first, second and third invite, and we wish to see these outcomes separately represented on the Sankey (rather than just one ‘No response’), so each outcome must be made unique. In this example we prefix the outcome from each invite with the number of the invite, e.g. ‘Invite 1 No response’.

The reason for this will become clearer when we come to plot the Sankey, but for now we produce these milestone-outcomes from our invites table.

Code
df_milestones <- df_invites |>
  mutate(
    # everyone starts in the eligible population
    start_population = "Eligible population",

    # work out what happened following the first invite
    invite_1_outcome = case_when(
      # if a second invite was sent we assume there was no outcome from the first
      !is.na(invite_2_date) ~ "Invitation 1 No response",
      # otherwise the overall outcome resulted from the first invite
      .default = glue("Invitation 1 {invite_outcome}")
    ),

    # work out what happened following the second invite
    invite_2_outcome = case_when(
      # if a third invite was sent we assume there was no outcome from the second
      !is.na(invite_3_date) ~ "Invitation 2 No response",
      # if a second invite was sent but no third then
      !is.na(invite_2_date) ~ glue("Invitation 2 {invite_outcome}"),
      # default to NA if neither of the above are true
      .default = NA
    ),

    # work out what happened following the third invite
    invite_3_outcome = case_when(
      # if a third invite was sent then the outcome is the overall outcome
      !is.na(invite_3_date) ~ glue("Invitation 3 {invite_outcome}"),
      # otherwise mark as NA
      .default = NA
    )
  ) |>
  # exclude the dates as they are no longer needed
  select(-contains("_date")) |>
  # move the overall invite outcome to the end
  relocate(invite_outcome, .after = invite_3_outcome)

# view our data
df_milestones |>
  reactable(defaultPageSize = 5)

Milestone-outcomes for participants

Calculate flows

Next we take pairs of milestone-outcomes and calculate the number of participants that moved between them.

Here we utilise the power of dplyr::summarise with an argument .by to group by our data before counting the number of unique participants who move between our start and end groups.

For invites 2 and 3 we perform two sets of summaries:

  1. The first where the values in the to and from fields contain details.

  2. The second to capture cases where the to destination is NULL. This is because the participant responded at the previous invite so there was no subsequent invite. In these cases we flow the participant to the overall invite outcome.2

2 If you are thinking there is a lot of repetition here, you’re right. In practice I abstracted both steps to a function and passed in the parameters for the from and to variables and simplified my workflow a little, however, I’m showing it in plain form here for simplification.

Code
df_flows <- bind_rows(

  # flow from population to invite 1
  df_milestones |>
    filter(!is.na(start_population) & !is.na(invite_1_outcome)) |>
    rename(from = start_population, to = invite_1_outcome) |>
    summarise(
      flow = n_distinct(participant_id, na.rm = T),
      .by = c(from, to)
    ),

  # flow from invite 1 to invite 2 (where not NA)
  df_milestones |>
    filter(!is.na(invite_1_outcome) & !is.na(invite_2_outcome)) |>
    rename(from = invite_1_outcome, to = invite_2_outcome) |>
    summarise(
      flow = n_distinct(participant_id, na.rm = T),
      .by = c(from, to)
    ),

  # flow from invite 1 to overall invite outcome (where invite 2 is NA)
  df_milestones |>
    filter(!is.na(invite_1_outcome) & is.na(invite_2_outcome)) |>
    rename(from = invite_1_outcome, to = invite_outcome) |>
    summarise(
      flow = n_distinct(participant_id, na.rm = T),
      .by = c(from, to)
    ),

  # flow from invite 2 to invite 3 (where not NA)
  df_milestones |>
    filter(!is.na(invite_2_outcome) & !is.na(invite_3_outcome)) |>
    rename(from = invite_2_outcome, to = invite_3_outcome) |>
    summarise(
      flow = n_distinct(participant_id, na.rm = T),
      .by = c(from, to)
    ),

  # flow from invite 2 to overall invite outcome (where invite 3 is NA)
  df_milestones |>
    filter(!is.na(invite_2_outcome) & is.na(invite_3_outcome)) |>
    rename(from = invite_2_outcome, to = invite_outcome) |>
    summarise(
      flow = n_distinct(participant_id, na.rm = T),
      .by = c(from, to)
    ),

  # final flow - invite 3 to overall outcome (where both are not NA)
  df_milestones |>
    filter(!is.na(invite_3_outcome) & !is.na(invite_outcome)) |>
    rename(from = invite_3_outcome, to = invite_outcome) |>
    summarise(
      flow = n_distinct(participant_id, na.rm = T),
      .by = c(from, to)
    )
)

# view our data
df_flows |>
  reactable(defaultPageSize = 5)

Flows of participants between milestones

Sankey plot

We now have a neat little summary of movements of participants between the milestones in our recruitment pathway. However, this ‘tidy’ data isn’t the format required by plotly, so the next steps are to prepare it ready for plotting.

Preparing for plotly

Plotly expects to be fed two sets of data:

  1. Nodes - these are the milestones we have in our from and to fields,

  2. Edges - these are the flows that occur between nodes, the flow in our table.

It is possible to extract this data by hand but I found using the tidygraph package was much easier and more convenient.

df_sankey <- df_flows |>
  # convert our flows data to a tidy graph object
  as_tbl_graph()

The tidygraph package splits our data into nodes and edges. We can selectively work on each by ‘activating’ them - here is the nodes list:

df_sankey |>
  activate(what = "nodes") |>
  as_tibble() |>
  reactable(defaultPageSize = 5)

You can see each unique node name listed. The row numbers for these nodes are used as reference IDs in the edges object:

df_sankey |>
  activate(what = "edges") |>
  as_tibble() |>
  reactable(defaultPageSize = 5)

We now have enough information to generate our Sankey.

First we extract our nodes and edges to separate data frames then convert the ID values to be zero-based (starts at 0) as this is what plotly is expecting. To do this is as simple as subtracting 1 from the value of the IDs.

Finally we pass these two dataframes to plotly’s node and link function inputs to generate the plot.

Code
# extract the nodes to a dataframe
nodes <- df_sankey |>
  activate(nodes) |>
  data.frame() |>
  mutate(
    id = row_number() - 1
  )

# extract the edges to a dataframe
edges <- df_sankey |>
  activate(edges) |>
  data.frame() |>
  mutate(
    from = from - 1,
    to = to - 1
  )

# plot our sankey
plot_ly(
  # setup
  type = "sankey",
  orientation = "h",
  arrangement = "snap",

  # use our node data
  node = list(
    label = nodes$name
  ),

  # use our link data
  link = list(
    source = edges$from,
    target = edges$to,
    value = edges$flow
  )
)

Our first sankey

Not bad!

We can see the structure of our Sankey now. Can you see the relative proportions of participants who did or didn’t respond to our first invite? Marvel at how those who responded to the first invite flow into our final outcome. How about those who didn’t respond to the first invitation go on to receive a second invite?

Plotly’s charts are interactive. Try hovering your cursor over the nodes and edges to highlight them and a pop-up box will appear giving you additional details. You can reorder the vertical position of the nodes by dragging them above or below an adjacent node.

This looks functional.

Styling our Sankey

Now we have the foundations of our Sankey I’d like to move on to its presentation. Specifically I’d like to:

  • use colour coding to clearly group those who accept or decline the invite,

  • improve the readability of the node titles,

  • add additional information to the pop-up boxes when you hover over nodes and edges, and

  • control the positioning of the nodes in the plot.

As our nodes and edges objects are dataframes it is straightforward to add this styling information directly to them.

For the nodes object we define colours based on the name of each node and manually position them in the plot

Code
# get the eligible population as a single value
# NB, will be used to work out % amounts in each node and edge
temp_eligible_pop <- df_flows |>
  filter(from == "Eligible population") |>
  summarise(total = sum(flow, na.rm = T)) |>
  pull(total)

# style our nodes object
nodes <- nodes |>
  mutate(
    # colour ----
    # add colour definitions, green for accepted, red for declined
    colour = case_when(
      str_detect(name, "Accepted") ~ "#44bd32",
      str_detect(name, "Declined") ~ "#c23616",
      str_detect(name, "No response") ~ "#7f8fa6",
      str_detect(name, "Eligible population") ~ "#7f8fa6"
    ),

    # add a semi-transparent colour for the edges based on node colours
    colour_fade = col2hcl(colour = colour, alpha = 0.3),

    # positioning ----
    # NB, I found that to position nodes you need to supply both
    # horizontal and vertical positions
    # NNB, it was a bit of trial and error to get the these positions just
    # right

    # horizontal positions (0 = left, 1 = right)
    x = case_when(
      str_detect(name, "Eligible population") ~ 1,
      str_detect(name, "Invitation 1") ~ 2,
      str_detect(name, "Invitation 2") ~ 3,
      str_detect(name, "Invitation 3") ~ 4,
      .default = 5
    ) |> rescale(to = c(0.001, 0.9)),

    # vertical position (1 = bottom, 0 = top)
    y = case_when(
      str_detect(name, "Eligible population") ~ 5,
      # invite 1
      str_detect(name, "Invitation 1 Accepted") ~ 1,
      str_detect(name, "Invitation 1 No response") ~ 5,
      str_detect(name, "Invitation 1 Declined") ~ 8.5,
      # invite 2
      str_detect(name, "Invitation 2 Accepted") ~ 2,
      str_detect(name, "Invitation 2 No response") ~ 5,
      str_detect(name, "Invitation 2 Declined") ~ 7.8,
      # invite 3
      str_detect(name, "Invitation 3 Accepted") ~ 2.7,
      str_detect(name, "Invitation 3 No response") ~ 5.8,
      str_detect(name, "Invitation 3 Declined") ~ 7.2,
      # final outcomes
      str_detect(name, "Accepted") ~ 1,
      str_detect(name, "No response") ~ 5,
      str_detect(name, "Declined") ~ 8,
      .default = 5
    ) |> rescale(to = c(0.001, 0.999))
  ) |>
  # add in a custom field to show the percentage flow
  left_join(
    y = df_flows |>
      group_by(to) |>
      summarise(
        flow = sum(flow, na.rm = T),
        flow_perc = percent(flow / temp_eligible_pop, accuracy = 0.1),
      ) |>
      select(name = to, flow_perc),
    by = "name"
  )

# view our nodes data
nodes |>
  reactable(defaultPageSize = 5)

Styling the nodes dataframe

Next we move to styling the edges, which is a much simpler prospect:

Code
edges <- edges |>
  mutate(
    # add a label for each flow to tell us how many people are in each
    label = number(flow, big.mark = ","),
    # add a percentage flow figure
    flow_perc = percent(flow / temp_eligible_pop, accuracy = 0.1)
  ) |>
  # add the faded colour from our nodes object to match the destinations
  left_join(
    y = nodes |> select(to = id, colour_fade),
    by = "to"
  )

# view our edges data
edges |>
  reactable(defaultPageSize = 5)

Styling the edges dataframe

We now have stylised node and edge tables ready and can bring it all together. Note the use of customdata and hovertemplate help to bring in additional information and styling to the pop-up boxes that appear when you hover over each flow and node.

Code
# plot our stylised sankey
plot_ly(
  # setup
  type = "sankey",
  orientation = "h",
  arrangement = "snap",

  # use our node data
  node = list(
    label = nodes$name,
    color = nodes$colour,
    x = nodes$x,
    y = nodes$y,
    customdata = nodes$flow_perc,
    hovertemplate = "%{label}<br /><b>%{value}</b> participants<br /><b>%{customdata}</b> of eligible population"
  ),

  # use our edge data
  link = list(
    source = edges$from,
    target = edges$to,
    value = edges$flow,
    label = edges$label,
    color = edges$colour_fade,
    customdata = edges$flow_perc,
    hovertemplate = "%{source.label} → %{target.label}<br /><b>%{value}</b> participants<br /><b>%{customdata}</b> of eligible population"
  )
) |>
  layout(
    font = list(
      family = "Arial, Helvetica, sans-serif",
      size = 12
    ),
    # make the background transparent (also removes the text shadow)
    paper_bgcolor = "rgba(0,0,0,0)"
  ) |>
  config(responsive = T)

A stylish Sankey

Conclusion

Creating Sankey plots in R using plotly is an effective way to visualise patient pathways.

In our project we embedded Sankey plots within an interactive Shiny app which allows for selective filters that update the resulting plot. This allowed us to quickly compare the effects of different models of delivering the screening programme, geography, deprivation levels, patient demographic, or any combination of these.

Their use has helped the programme team better understand patient flows through the pathway, where the points of drop-off are and compare / contrast the effects of different models of delivering the screening programme on patient engagement.

Feedback from external stakeholders has been positive too, noting how easy it is to engage with and understand this style of presentation.

In this blog post we have wrangled a dataset to describe how people flow between steps in a process and then produced a Sankey diagram with some stylistic touches to make an effective visualisation.

I hope this post helps you feel better prepared to use Sankeys in your work.