Visualising participant recruitment in R using Sankey plots
Introduction
Sankey diagrams are great tools to visualise flows through a system. They show connections between the steps of a process where the width of the arrows is proportional to the flow.
I’m working on an evaluation of a risk screening process for people aged between 55-74 years and a history of smoking. In this Targeted Lung Health Check (TLHC) programme1 eligible people are invited to attend a free lung check where those assessed at high risk of lung cancer are then offered low-dose CT screening scans.
1 Please visit the NHS England site for for more background.
We used Sankey diagrams to visualise how people have engaged with the programme, from recruitment, attendance at appointments, their outcome from risk assessment, attendance at CT scans and will eventually be extended to cover the impact of the screening on early detection of those diagnosed with lung cancer.
This blog post is about the technical process of preparing record-level data for visualisation in a Sankey plot using R
and customising it to enhance look and feel. Here is how the finished product will look:
Data wrangling
First we’ll attach some packages. I’ll be using plotly for the visualisation of the Sankey chart, tidygraph for graph manipulation and scales to handle colour transformation and rescaling values. We will also be using the tidyverse and glue packages for general data wrangling and reactable to preview our data as we go along.
# libraries
library(tidyverse) # 'tidy' data wrangling
library(plotly) # sankey visualisation framework
library(reactable) # viewing interactive datatables
library(glue) # concatenating strings
library(tidygraph) # api for graph / network manipulation
library(scales) # used for colour transformation
Get the data
In this example we will work with a simplified set of data focused on invitations.
The invites table holds details of when people were sent a letter or message inviting them to take part, how many times they were invited and how the person responded.
The people eligible for the programme are identified up-front and are represented by a unique ID with one row per person. Let’s assume each person receives at least one invitation to take part, they can have one of three outcomes:
They accept the invitation and agree to take part,
They decline the invitation,
They do not respond to the invitation.
If the person doesn’t respond to the first invitation they may be sent a second invitation and could be offered a third invitation if they didn’t respond to the second.
Here is the specification for our simplified invites table:
Field | Type | Description |
---|---|---|
Participant ID | Integer | A unique identifier for each person. |
Invite date 1 | Date | The date the person was first invited to participate. Every person will have a date in this field. |
Invite date 2 | Date | The date a second invitation was sent. |
Invite date 3 | Date | The date a third invitation was sent. |
Invite outcome | Text | The outcome from the invite, one of either ‘Accepted’, ‘Declined’ or ‘No response’. |
Everyone receives at least one invite. Assuming a third of these respond (to either accept or decline) then two-thirds receive a follow-up invite. Of these, we assume half respond, meaning the remaining participants receive a third invite.
Here we generate 100 rows of example data to populate our table.
Code
# set a randomisation seed for reproducibility
set.seed(seed = 1234)
# define some parameters
<- as.Date("2019-01-01")
start_date <- as.Date("2021-01-01")
end_date <- 100
rows
<- tibble(
df_invites_1 # create a unique id for each participant
participant_id = 1:rows,
# create a random initial invite date between our start and end dates
invite_1_date = sample(
seq(start_date, end_date, by = "day"),
size = rows, replace = T
),
# create a random outcome for this participant
invite_outcome = sample(
x = c("Accepted", "Declined", "No response"),
size = rows, replace = T
)
)
# take a sample of participants and allocate them a second invite date
<- df_invites_1 |>
df_invites_2 # sample two thirds of participants to get a second invite
slice_sample(prop = 2 / 3) |>
# allocate a date between 10 and 30 days following the first
mutate(
invite_2_date = invite_1_date + sample(10:30, size = n(), replace = T)
|>
) # keep just id and second date
select(participant_id, invite_2_date)
# take a sample of those with a second invite and allocate them a third invite date
<- df_invites_2 |>
df_invites_3 # sample half of these to get a third invite
slice_sample(prop = 1 / 2) |>
# allocate a date between 10 to 30 days following the second
mutate(
invite_3_date = invite_2_date + sample(10:30, size = n(), replace = T)
|>
) # keep just id and second date
select(participant_id, invite_3_date)
# combine the 2nd and 3rd invites with the first table
<- df_invites_1 |>
df_invites left_join(
y = df_invites_2,
by = "participant_id"
|>
) left_join(
y = df_invites_3,
by = "participant_id"
|>
) # move the outcome field after the third invite
relocate(invite_outcome, .after = invite_3_date)
# housekeeping
rm(df_invites_1, df_invites_2, df_invites_3, start_date, end_date, rows)
# view our data
|>
df_invites reactable(defaultPageSize = 5)
Generated invite table
Determine milestone outcomes
The next step is to take our source table and convert the data into a series of milestones (and associated outcomes) that represents how our invited participants moved through the pathway.
In our example we have five milestones to represent in our Sankey plot:
Our eligible population (everyone in our invites table),
The result from the first invitation,
The result from the second invitation,
The result from the third invitation,
The overall invite outcome.
Aside from the eligible population, where everyone starts with the same value, participants will have one of several outcomes at each milestone. This step is about naming these milestones and the outcomes.
It is important that each milestone-outcome has unique values. An outcome of ‘No response’ can be recorded against the first, second and third invite, and we wish to see these outcomes separately represented on the Sankey (rather than just one ‘No response’), so each outcome must be made unique. In this example we prefix the outcome from each invite with the number of the invite, e.g. ‘Invite 1 No response’.
The reason for this will become clearer when we come to plot the Sankey, but for now we produce these milestone-outcomes from our invites table.
Code
<- df_invites |>
df_milestones mutate(
# everyone starts in the eligible population
start_population = "Eligible population",
# work out what happened following the first invite
invite_1_outcome = case_when(
# if a second invite was sent we assume there was no outcome from the first
!is.na(invite_2_date) ~ "Invitation 1 No response",
# otherwise the overall outcome resulted from the first invite
.default = glue("Invitation 1 {invite_outcome}")
),
# work out what happened following the second invite
invite_2_outcome = case_when(
# if a third invite was sent we assume there was no outcome from the second
!is.na(invite_3_date) ~ "Invitation 2 No response",
# if a second invite was sent but no third then
!is.na(invite_2_date) ~ glue("Invitation 2 {invite_outcome}"),
# default to NA if neither of the above are true
.default = NA
),
# work out what happened following the third invite
invite_3_outcome = case_when(
# if a third invite was sent then the outcome is the overall outcome
!is.na(invite_3_date) ~ glue("Invitation 3 {invite_outcome}"),
# otherwise mark as NA
.default = NA
)|>
) # exclude the dates as they are no longer needed
select(-contains("_date")) |>
# move the overall invite outcome to the end
relocate(invite_outcome, .after = invite_3_outcome)
# view our data
|>
df_milestones reactable(defaultPageSize = 5)
Milestone-outcomes for participants
Calculate flows
Next we take pairs of milestone-outcomes and calculate the number of participants that moved between them.
Here we utilise the power of dplyr::summarise
with an argument .by
to group by our data before counting the number of unique participants who move between our start and end groups.
For invites 2 and 3 we perform two sets of summaries:
The first where the values in the
to
andfrom
fields contain details.The second to capture cases where the
to
destination is NULL. This is because the participant responded at the previous invite so there was no subsequent invite. In these cases we flow the participant to the overall invite outcome.2
2 If you are thinking there is a lot of repetition here, you’re right. In practice I abstracted both steps to a function and passed in the parameters for the from
and to
variables and simplified my workflow a little, however, I’m showing it in plain form here for simplification.
Code
<- bind_rows(
df_flows
# flow from population to invite 1
|>
df_milestones filter(!is.na(start_population) & !is.na(invite_1_outcome)) |>
rename(from = start_population, to = invite_1_outcome) |>
summarise(
flow = n_distinct(participant_id, na.rm = T),
.by = c(from, to)
),
# flow from invite 1 to invite 2 (where not NA)
|>
df_milestones filter(!is.na(invite_1_outcome) & !is.na(invite_2_outcome)) |>
rename(from = invite_1_outcome, to = invite_2_outcome) |>
summarise(
flow = n_distinct(participant_id, na.rm = T),
.by = c(from, to)
),
# flow from invite 1 to overall invite outcome (where invite 2 is NA)
|>
df_milestones filter(!is.na(invite_1_outcome) & is.na(invite_2_outcome)) |>
rename(from = invite_1_outcome, to = invite_outcome) |>
summarise(
flow = n_distinct(participant_id, na.rm = T),
.by = c(from, to)
),
# flow from invite 2 to invite 3 (where not NA)
|>
df_milestones filter(!is.na(invite_2_outcome) & !is.na(invite_3_outcome)) |>
rename(from = invite_2_outcome, to = invite_3_outcome) |>
summarise(
flow = n_distinct(participant_id, na.rm = T),
.by = c(from, to)
),
# flow from invite 2 to overall invite outcome (where invite 3 is NA)
|>
df_milestones filter(!is.na(invite_2_outcome) & is.na(invite_3_outcome)) |>
rename(from = invite_2_outcome, to = invite_outcome) |>
summarise(
flow = n_distinct(participant_id, na.rm = T),
.by = c(from, to)
),
# final flow - invite 3 to overall outcome (where both are not NA)
|>
df_milestones filter(!is.na(invite_3_outcome) & !is.na(invite_outcome)) |>
rename(from = invite_3_outcome, to = invite_outcome) |>
summarise(
flow = n_distinct(participant_id, na.rm = T),
.by = c(from, to)
)
)
# view our data
|>
df_flows reactable(defaultPageSize = 5)
Flows of participants between milestones
Sankey plot
We now have a neat little summary of movements of participants between the milestones in our recruitment pathway. However, this ‘tidy’ data isn’t the format required by plotly, so the next steps are to prepare it ready for plotting.
Preparing for plotly
Plotly expects to be fed two sets of data:
Nodes - these are the milestones we have in our
from
andto
fields,Edges - these are the flows that occur between nodes, the
flow
in our table.
It is possible to extract this data by hand but I found using the tidygraph package was much easier and more convenient.
<- df_flows |>
df_sankey # convert our flows data to a tidy graph object
as_tbl_graph()
The tidygraph package splits our data into nodes and edges. We can selectively work on each by ‘activating’ them - here is the nodes list:
|>
df_sankey activate(what = "nodes") |>
as_tibble() |>
reactable(defaultPageSize = 5)
You can see each unique node name listed. The row numbers for these nodes are used as reference IDs in the edges object:
|>
df_sankey activate(what = "edges") |>
as_tibble() |>
reactable(defaultPageSize = 5)
We now have enough information to generate our Sankey.
First we extract our nodes and edges to separate data frames then convert the ID values to be zero-based (starts at 0) as this is what plotly is expecting. To do this is as simple as subtracting 1 from the value of the IDs.
Finally we pass these two dataframes to plotly’s node
and link
function inputs to generate the plot.
Code
# extract the nodes to a dataframe
<- df_sankey |>
nodes activate(nodes) |>
data.frame() |>
mutate(
id = row_number() - 1
)
# extract the edges to a dataframe
<- df_sankey |>
edges activate(edges) |>
data.frame() |>
mutate(
from = from - 1,
to = to - 1
)
# plot our sankey
plot_ly(
# setup
type = "sankey",
orientation = "h",
arrangement = "snap",
# use our node data
node = list(
label = nodes$name
),
# use our link data
link = list(
source = edges$from,
target = edges$to,
value = edges$flow
) )
Our first sankey
Not bad!
We can see the structure of our Sankey now. Can you see the relative proportions of participants who did or didn’t respond to our first invite? Marvel at how those who responded to the first invite flow into our final outcome. How about those who didn’t respond to the first invitation go on to receive a second invite?
Plotly’s charts are interactive. Try hovering your cursor over the nodes and edges to highlight them and a pop-up box will appear giving you additional details. You can reorder the vertical position of the nodes by dragging them above or below an adjacent node.
This looks functional.
Styling our Sankey
Now we have the foundations of our Sankey I’d like to move on to its presentation. Specifically I’d like to:
use colour coding to clearly group those who accept or decline the invite,
improve the readability of the node titles,
add additional information to the pop-up boxes when you hover over nodes and edges, and
control the positioning of the nodes in the plot.
As our nodes
and edges
objects are dataframes it is straightforward to add this styling information directly to them.
For the nodes object we define colours based on the name of each node and manually position them in the plot
Code
# get the eligible population as a single value
# NB, will be used to work out % amounts in each node and edge
<- df_flows |>
temp_eligible_pop filter(from == "Eligible population") |>
summarise(total = sum(flow, na.rm = T)) |>
pull(total)
# style our nodes object
<- nodes |>
nodes mutate(
# colour ----
# add colour definitions, green for accepted, red for declined
colour = case_when(
str_detect(name, "Accepted") ~ "#44bd32",
str_detect(name, "Declined") ~ "#c23616",
str_detect(name, "No response") ~ "#7f8fa6",
str_detect(name, "Eligible population") ~ "#7f8fa6"
),
# add a semi-transparent colour for the edges based on node colours
colour_fade = col2hcl(colour = colour, alpha = 0.3),
# positioning ----
# NB, I found that to position nodes you need to supply both
# horizontal and vertical positions
# NNB, it was a bit of trial and error to get the these positions just
# right
# horizontal positions (0 = left, 1 = right)
x = case_when(
str_detect(name, "Eligible population") ~ 1,
str_detect(name, "Invitation 1") ~ 2,
str_detect(name, "Invitation 2") ~ 3,
str_detect(name, "Invitation 3") ~ 4,
.default = 5
|> rescale(to = c(0.001, 0.9)),
)
# vertical position (1 = bottom, 0 = top)
y = case_when(
str_detect(name, "Eligible population") ~ 5,
# invite 1
str_detect(name, "Invitation 1 Accepted") ~ 1,
str_detect(name, "Invitation 1 No response") ~ 5,
str_detect(name, "Invitation 1 Declined") ~ 8.5,
# invite 2
str_detect(name, "Invitation 2 Accepted") ~ 2,
str_detect(name, "Invitation 2 No response") ~ 5,
str_detect(name, "Invitation 2 Declined") ~ 7.8,
# invite 3
str_detect(name, "Invitation 3 Accepted") ~ 2.7,
str_detect(name, "Invitation 3 No response") ~ 5.8,
str_detect(name, "Invitation 3 Declined") ~ 7.2,
# final outcomes
str_detect(name, "Accepted") ~ 1,
str_detect(name, "No response") ~ 5,
str_detect(name, "Declined") ~ 8,
.default = 5
|> rescale(to = c(0.001, 0.999))
) |>
) # add in a custom field to show the percentage flow
left_join(
y = df_flows |>
group_by(to) |>
summarise(
flow = sum(flow, na.rm = T),
flow_perc = percent(flow / temp_eligible_pop, accuracy = 0.1),
|>
) select(name = to, flow_perc),
by = "name"
)
# view our nodes data
|>
nodes reactable(defaultPageSize = 5)
Styling the nodes dataframe
Next we move to styling the edges, which is a much simpler prospect:
Code
<- edges |>
edges mutate(
# add a label for each flow to tell us how many people are in each
label = number(flow, big.mark = ","),
# add a percentage flow figure
flow_perc = percent(flow / temp_eligible_pop, accuracy = 0.1)
|>
) # add the faded colour from our nodes object to match the destinations
left_join(
y = nodes |> select(to = id, colour_fade),
by = "to"
)
# view our edges data
|>
edges reactable(defaultPageSize = 5)
Styling the edges dataframe
We now have stylised node and edge tables ready and can bring it all together. Note the use of customdata
and hovertemplate
help to bring in additional information and styling to the pop-up boxes that appear when you hover over each flow and node.
Code
# plot our stylised sankey
plot_ly(
# setup
type = "sankey",
orientation = "h",
arrangement = "snap",
# use our node data
node = list(
label = nodes$name,
color = nodes$colour,
x = nodes$x,
y = nodes$y,
customdata = nodes$flow_perc,
hovertemplate = "%{label}<br /><b>%{value}</b> participants<br /><b>%{customdata}</b> of eligible population"
),
# use our edge data
link = list(
source = edges$from,
target = edges$to,
value = edges$flow,
label = edges$label,
color = edges$colour_fade,
customdata = edges$flow_perc,
hovertemplate = "%{source.label} → %{target.label}<br /><b>%{value}</b> participants<br /><b>%{customdata}</b> of eligible population"
)|>
) layout(
font = list(
family = "Arial, Helvetica, sans-serif",
size = 12
),# make the background transparent (also removes the text shadow)
paper_bgcolor = "rgba(0,0,0,0)"
|>
) config(responsive = T)
A stylish Sankey
Conclusion
Creating Sankey plots in R using plotly is an effective way to visualise patient pathways.
In our project we embedded Sankey plots within an interactive Shiny app which allows for selective filters that update the resulting plot. This allowed us to quickly compare the effects of different models of delivering the screening programme, geography, deprivation levels, patient demographic, or any combination of these.
Their use has helped the programme team better understand patient flows through the pathway, where the points of drop-off are and compare / contrast the effects of different models of delivering the screening programme on patient engagement.
Feedback from external stakeholders has been positive too, noting how easy it is to engage with and understand this style of presentation.
In this blog post we have wrangled a dataset to describe how people flow between steps in a process and then produced a Sankey diagram with some stylistic touches to make an effective visualisation.
I hope this post helps you feel better prepared to use Sankeys in your work.