<- readRDS(file = "workforce_staff_group.rds") staff_group
Map and Nest
I want to share a framework that I like using occasionally for data analysis. It’s the nest-and-map and it’s helped me countless times when I’m working with related datasets. By combining {purrr} mapping with {tidyr} nesting, I can keep my analysis steps linked, allowing me to easily track from a summary or plot, back to the original data.
The main funtions we’ll need are
tidyr::nest()
dplyr::mutate()
purrr::map()
purrr::walk()
Example on NHS workforce statistics
The NHS workforce statistics are official statistics published monthly for England.
I want to perform an analysis for each of the 42 integrated care systems (ICS). The {tidyr} nest()
function creates a list-column, where each cell contains a mini dataframe for each grouping.
Let’s group by ICS, and call the nested data column raw_data
.
<- staff_group |>
group_by_ics ::nest(raw_data = -ics_name) tidyr
The new column is a list-column, with each cell containing an entire tibble of data relating to that individual ICS.
#' echo: false
head(group_by_ics)
# A tibble: 6 × 2
ics_name raw_data
<chr> <list>
1 South East London <tibble [8 × 6]>
2 North East London <tibble [7 × 6]>
3 North Central London <tibble [12 × 6]>
4 North West London <tibble [10 × 6]>
5 South West London <tibble [8 × 6]>
6 Devon <tibble [7 × 6]>
We can grab these mini datasets in the usual way and explore them interactively.
$raw_data[[1]] group_by_ics
# A tibble: 8 × 6
organisation_name total hchs_doctors nurses_health_visitors midwives
<chr> <dbl> <dbl> <dbl> <dbl>
1 Total 58394 7108 14939 926
2 Guy's and St Thomas' NHS F… 21361 3003 6196 281
3 King's College Hospital NH… 13158 2443 4202 375
4 Lewisham and Greenwich NHS… 6617 979 2103 271
5 London Ambulance Service N… 7050 4 44 0
6 NHS South East London ICB 617 9 43 0
7 Oxleas NHS Foundation Trust 4094 200 1196 0
8 South London and Maudsley … 5496 471 1155 0
# ℹ 1 more variable: ambulance_staff <dbl>
Next, let’s apply some simple processing, say converting absolute numbers into percentages, to each of the ICSs in turn.
We use mutate()
to create a new list-column staff_percent
and map()
to apply the processing function to each cell in turn. 1
See function definition for convert_percent()
#' Convert percent
#' @param raw_staff Tibble containing organisation_name, total and a number of staff categories
#' @return Tibble like raw_staff but with staff categories represented as percentages rather than absolute numbers
<- function(staff){
convert_percent |>
staff ::mutate(dplyr::across(.cols = -c(organisation_name, total),
dplyr.fns = \(x)x/total)) |>
::rename("Doctors" = "hchs_doctors",
dplyr"Nurses" = "nurses_health_visitors",
"Ambulance staff" = "ambulance_staff",
"Midwives" = "midwives")
}
<-
processed_staff |>
group_by_ics ::mutate(
dplyrstaff_percent = purrr::map(raw_data, convert_percent)
)
Where I think this map-and-nest process really comes into its own is creating plots. Often, I find myself wanting to create a couple of different plots for each grouping, and then optionally save the plots with sensible names. Particularly in the analysis stage, I like having these plots in the same row as the raw data, so I can quickly compare and validate.
I’ve created two functions, plot_barchart()
and plot_waffle()
which take the data and create charts.
See definition for plot_barchart()
& plot_waffle()
#' Plot barchart
#' Makes a bar chart of staff perentages by organisation
#' @param df tibble of staff data in percent format
<- function(df) {
plot_barchart |>
df ::filter(organisation_name != "Total") |>
dplyr::select(-total) |>
dplyr::pivot_longer(cols = -c(organisation_name), names_to = "job", values_to = "percent") |>
tidyr::ggplot(ggplot2::aes(x = percent, y = organisation_name, fill = job)) +
ggplot2::geom_col(position = "dodge") +
ggplot2::scale_x_continuous(labels = scales::percent_format(scale = 100)) +
ggplot2::labs(x = "", y = "") +
ggplot2::scale_fill_su() +
StrategyUnitTheme::theme_minimal() +
ggplot2::theme(legend.title = ggplot2::element_blank())
ggplot2
}
#' Plot waffle
#' Makes a waffle chart to visualise staff breakdown at an ICS level
#' @param raw_staff count data of staff
#' @param title Title for the graphic
<- function(raw_staff, title) {
plot_waffle <-
waffle_data |>
raw_staff ::filter(organisation_name == "Total") |>
dplyr::select(-total, -organisation_name) |>
dplyr::pivot_longer(cols = dplyr::everything(), names_to = "names", values_to = "vals") |>
tidyr::mutate(vals = round(vals / 100))
dplyr
::ggplot(waffle_data, ggplot2::aes(fill = names, values = vals)) +
ggplot2::geom_waffle(n_rows = 8, size = 0.33, colour = "white") +
waffle::coord_equal() +
ggplot2::theme_void() +
ggplot2::theme(legend.title = ggplot2::element_blank()) +
ggplot2::ggtitle(title)
ggplot2 }
Again, using mutate()
I can create a new column called barchart
and I can map()
the function plot_barchart()
, applying it to each row at a time.
<-
graphs |>
processed_staff ::mutate(
dplyrbarchart = purrr::map(staff_percent, plot_barchart)
)
The resulting column barchart
is again a list-column, but this time instead of containing a tibble, it holds a ggplot object. A whole ggplot in a single cell. 2
If we want to pass two arguments to our function, we can replace map()
with map2()
. Here we’re using map2()
to pass the ics_name
column to use as a title in our waffle plot. 3
<-
graphs |>
processed_staff ::mutate(
dplyrwaffle = purrr::map2(raw_data, ics_name,
plot_waffle(data, title)
\(data, title)
) )
Putting it all together
All of these mutate()
steps can actually be called in one step. Here’s the full workflow again in full after a little refactor. I’ve also used pivot_longer()
to move the two plotting columns into a single plot column. This will make it easier for me to generate nice filenames, and save the plots.
<-
results |>
staff_group ::nest(raw_data = -ics_name) |>
tidyr::mutate(
dplyrstaff_percent = purrr::map(raw_data, convert_percent),
barchart = purrr::map(staff_percent, plot_barchart),
waffle = purrr::map2(raw_data, ics_name, \(data, title) plot_waffle(data, title))
|>
) ::pivot_longer(cols = c(barchart, waffle), names_to = "plot_type", values_to = "plot") |>
tidyr::mutate(filename = glue::glue("{snakecase::to_snake_case(ics_name)}_{plot_type}.png")) dplyr
The walk()
family of functions in {purrr} are used when the function you’re applying does not return an object, but is being used for it’s side-effect, for example reading or writing files.
Here we call walk2()
, passing in both the filename column and the plots column are arguments to save all the plots.
::walk2(
purrr$filename,
results$plot,
results::ggsave(file.path("plots", filename), plot, width = 10, height = 6)
\(filename, plot) ggplot2 )
By keeping everything together in one nested structure, I personally find it much easier to keep track of my analyses. If you’re doing a more complex or permenant analysis, you might want to consider setting up a more formal data processing pipeline, and following RAP principals.
Footnotes
In this example, we actually didn’t need to nest first. We could have performed the
mutate()
step on the full dataset.↩︎This totally blew my mind the first time I saw it 🤯.↩︎
We’re mapping the relationship between the two inputs and the
plot_waffle()
with an anonymous function. This shorthand syntax for anonymous functions came in R v 4.1.0. For compatibility with older versions of R, you’ll need the~
operator. For the different ways you can specify functions in {purrr} see the help file.↩︎