<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Data Science @ The Strategy Unit</title>
<link>https://the-strategy-unit.github.io/data_science/blogs/</link>
<atom:link href="https://the-strategy-unit.github.io/data_science/blogs/index.xml" rel="self" type="application/rss+xml"/>
<description>Blogs from the Data Science Team at The Strategy Unit</description>
<generator>quarto-1.9.37</generator>
<lastBuildDate>Fri, 07 Nov 2025 00:00:00 GMT</lastBuildDate>
<item>
  <title>Pandas vs. Polars</title>
  <link>https://the-strategy-unit.github.io/data_science/blogs/posts/2025-11-07_pandas_vs_polars/</link>
  <description><![CDATA[ 





<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>Data processing is at the heart of our work at the Strategy Unit. As our datasets have grown and the NHP model has become more complex, we’ve found that our pipeline takes longer to complete.</p>
<p>Like many data science teams, we’ve relied heavily on Pandas, Python’s well-established standard for data processing. It’s familiar and extensively documented. However, as we’ve scaled up our operations, we’ve encountered performance bottlenecks that motivated us to seek more efficient solutions.</p>
<p>Polars is frequently discussed as a possible improvement over Pandas. With bindings for three languages (<a href="https://docs.rs/polars/latest/polars/">Rust</a>, JavaScript for both the <a href="https://pola-rs.github.io/nodejs-polars/">Node.js/Deno backend</a> and the <a href="https://github.com/pola-rs/js-polars">frontend</a>, and <a href="https://docs.pola.rs/api/python/stable/reference/index.html">Python</a>) and two execution modes (<a href="https://docs.pola.rs/user-guide/concepts/lazy-api/">lazy</a> and eager), Polars offers more flexibility in implementation and potential performance gains.</p>
<p>In this article, we’ll share our experience reimplementing one of our core data processing modules from Pandas to Polars, including the challenges we faced, the benefits we’ve seen, and the lessons learned along the way. You can find our <a href="https://github.com/The-Strategy-Unit/nhp_products">nhp_products</a> code on GitHub. We’ve kept both implementations for comparison, with the Polars modules bearing a <code>_pl.py</code> suffix.</p>
<div class="quarto-layout-panel" data-layout-ncol="2">
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-11-07_pandas_vs_polars/Rolling-Panda-1647869962.gif" height="200" class="figure-img"></p>
<figcaption>Panda</figcaption>
</figure>
</div>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-11-07_pandas_vs_polars/playful-polar-bear-cubs.gif" height="200" class="figure-img"></p>
<figcaption>Polar bears</figcaption>
</figure>
</div>
</div>
</div>
</div>
</section>
<section id="background-the-detailed-results-processing-challenge" class="level2">
<h2 class="anchored" data-anchor-id="background-the-detailed-results-processing-challenge">Background: The Detailed Results Processing Challenge</h2>
<p>The module we chose for our first Polars implementation was our detailed results processor. This critical component takes full model results and produces detailed aggregations of expected hospital activity across inpatient (IP), outpatient (OP), and Accident &amp; Emergency (A&amp;E) services.</p>
<p>The processing involves:</p>
<ol type="1">
<li>Loading hundreds of results from Azure storage</li>
<li>Merging these with reference datasets</li>
<li>Processing 256 Monte Carlo simulation runs</li>
<li>Aggregating the results into statistical summaries</li>
<li>Validating the outputs against aggregated results</li>
<li>Saving the final results as both CSV and Parquet files</li>
</ol>
<p>This workload is particularly demanding because:</p>
<ul>
<li>Each run processes thousands of records</li>
<li>The full pipeline handles 256 separate runs (for statistical robustness)</li>
<li>Memory consumption grows substantially during processing</li>
<li>The reference datasets contain multiple categorical dimensions for analysis</li>
</ul>
<p>Let’s look at how the implementations differ between our original Pandas version and the new Polars version.</p>
</section>
<section id="implementation-comparison" class="level2">
<h2 class="anchored" data-anchor-id="implementation-comparison">Implementation Comparison</h2>
<section id="code-organisation" class="level3">
<h3 class="anchored" data-anchor-id="code-organisation">Code Organisation</h3>
<p>One of the first things you’ll notice when comparing the implementations is how the Polars version breaks down complex operations into smaller, more focused functions:</p>
<p><strong>Pandas (Monolithic Functions):</strong></p>
<div id="574e78ba" class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _process_inpatient_results(</span>
<span id="cb1-2">    ctx: ProcessContext,</span>
<span id="cb1-3">    output_dir: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>,</span>
<span id="cb1-4">) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb1-5">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Function contains ~150 lines handling everything from:</span></span>
<span id="cb1-6">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># - Data loading</span></span>
<span id="cb1-7">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># - Reference preparation</span></span>
<span id="cb1-8">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># - Results processing for all runs</span></span>
<span id="cb1-9">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># - Dictionary aggregation</span></span>
<span id="cb1-10">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># - Validation and saving</span></span>
<span id="cb1-11">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ...</span></span></code></pre></div></div>
</details>
</div>
<p><strong>Polars (Granular Functions):</strong></p>
<div id="c7554c34" class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _process_inpatient_results(</span>
<span id="cb2-2">    ctx: ProcessContext,</span>
<span id="cb2-3">    output_dir: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>,</span>
<span id="cb2-4">) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb2-5">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Higher-level orchestration (~40 lines)</span></span>
<span id="cb2-6">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ...</span></span>
<span id="cb2-7"></span>
<span id="cb2-8"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _load_ip_reference_data(ctx: ProcessContext) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> pl.DataFrame:</span>
<span id="cb2-9">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Focused on just loading reference data (~15 lines)</span></span>
<span id="cb2-10">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ...</span></span>
<span id="cb2-11"></span>
<span id="cb2-12"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _process_ip_run(</span>
<span id="cb2-13">    run: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>,</span>
<span id="cb2-14">    reference_df: pl.DataFrame,</span>
<span id="cb2-15">    params: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>,</span>
<span id="cb2-16">    model_runs: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>,</span>
<span id="cb2-17">    ip_columns: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>[<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">str</span>],</span>
<span id="cb2-18">) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb2-19">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Processes a single run only (~30 lines)</span></span>
<span id="cb2-20">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ...</span></span>
<span id="cb2-21"></span>
<span id="cb2-22"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _validate_ip_results(</span>
<span id="cb2-23">    model_runs_df: pl.DataFrame,</span>
<span id="cb2-24">    actual_results_df: pl.DataFrame</span>
<span id="cb2-25">) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb2-26">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Focused validation logic (~20 lines)</span></span>
<span id="cb2-27">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ...</span></span></code></pre></div></div>
</details>
</div>
<p>This modular approach is a design choice, made as part of continually improving our codebase, rather than a Polars-specific trait. It aims to make the code clearer and easier to test and maintain: each function has a single responsibility rather than handling multiple concerns.</p>
</section>
<section id="syntax-and-api-differences" class="level3">
<h3 class="anchored" data-anchor-id="syntax-and-api-differences">Syntax and API Differences</h3>
<section id="data-loading-and-joining" class="level4">
<h4 class="anchored" data-anchor-id="data-loading-and-joining">Data Loading and Joining</h4>
<p>Here, the difference between the two implementations is subtle, which shows that switching between the two APIs doesn’t always involve a steep learning curve:</p>
<p><strong>Pandas:</strong></p>
<div id="80529de1" class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load and merge data</span></span>
<span id="cb3-2">original_df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> az.load_data_file(data_connection, model_version_data, trust, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ip"</span>, baseline_year)</span>
<span id="cb3-3">reference_df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> original_df.copy().drop(columns<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"speldur"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"classpat"</span>])</span>
<span id="cb3-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ...</span></span>
<span id="cb3-5">merged <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> reference_df.merge(df, on<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"rn"</span>, how<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"inner"</span>)</span></code></pre></div></div>
</details>
</div>
<p><strong>Polars:</strong></p>
<div id="43e021b9" class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load and join data</span></span>
<span id="cb4-2">original_df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> az_pl.load_data_file(data_connection, model_version_data, trust, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ip"</span>, baseline_year)</span>
<span id="cb4-3">reference_df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> original_df.drop([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"speldur"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"classpat"</span>])</span>
<span id="cb4-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ...</span></span>
<span id="cb4-5">merged <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> reference_df.join(df, on<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"rn"</span>, how<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"inner"</span>)</span></code></pre></div></div>
</details>
</div>
<p>The Polars version uses the arguably more intuitive <code>.join()</code> method instead of <code>.merge()</code>. Otherwise the two implementations align closely, except that the Polars version does not make an explicit copy: <code>.drop()</code> returns a new DataFrame and leaves <code>original_df</code> untouched.</p>
</section>
<section id="null-handling" class="level4">
<h4 class="anchored" data-anchor-id="null-handling">Null Handling</h4>
<p>Null handling differs significantly:</p>
<p><strong>Pandas:</strong></p>
<div id="51f8493a" class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Fill missing values</span></span>
<span id="cb5-2">original_df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> az.load_data_file(...).fillna(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unknown"</span>)</span></code></pre></div></div>
</details>
</div>
<p><strong>Polars:</strong></p>
<div id="8d0db356" class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Fill null values with more explicit control</span></span>
<span id="cb6-2">original_df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> az_pl.load_data_file(...)</span>
<span id="cb6-3">original_df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> original_df.with_columns(</span>
<span id="cb6-4">    [</span>
<span id="cb6-5">        pl.col(col).map_elements(</span>
<span id="cb6-6">            <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">lambda</span> x: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> x <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> x, return_dtype<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>pl.Utf8</span>
<span id="cb6-7">        )</span>
<span id="cb6-8">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> col <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> original_df.select(pl.col(pl.Utf8)).columns</span>
<span id="cb6-9">    ]</span>
<span id="cb6-10">).fill_null(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"unknown"</span>)</span></code></pre></div></div>
</details>
</div>
<p>While the Polars version is more verbose here, it gives finer control over how missing data is handled. This reflects a fundamental difference between the libraries: Pandas represents missing values as <code>NaN</code> (for numeric types) or <code>None</code> (for other types), while Polars uses a single <code>null</code> value for <a href="https://docs.pola.rs/user-guide/expressions/missing-data/">missing data</a> of any type. Our code explicitly converts empty strings to <code>None</code> before filling with “unknown”. By standardising the treatment of missing values, we ensure consistent results across both implementations, which matters during validation, where we verify that both versions produce identical outputs.</p>
</section>
<section id="data-aggregation-and-filtering" class="level4">
<h4 class="anchored" data-anchor-id="data-aggregation-and-filtering">Data Aggregation and Filtering</h4>
<p>Data filtering and aggregation show some of the most significant syntax differences:</p>
<p><strong>Pandas:</strong></p>
<div id="be693c61" class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Extract a specific value</span></span>
<span id="cb7-2">detailed_beddays_principal <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb7-3">    model_runs_df.loc[</span>
<span id="cb7-4">        (</span>
<span id="cb7-5">            <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">slice</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>),</span>
<span id="cb7-6">            <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">slice</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>),</span>
<span id="cb7-7">            <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">slice</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>),</span>
<span id="cb7-8">            <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">slice</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>),</span>
<span id="cb7-9">            <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">slice</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>),</span>
<span id="cb7-10">            <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">slice</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>),</span>
<span id="cb7-11">            <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">slice</span>(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>),</span>
<span id="cb7-12">            <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"beddays"</span>,</span>
<span id="cb7-13">        ),</span>
<span id="cb7-14">        :,</span>
<span id="cb7-15">    ]</span>
<span id="cb7-16">    .<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>()</span>
<span id="cb7-17">    .loc[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"mean"</span>]</span>
<span id="cb7-18">    .astype(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb7-19">)</span></code></pre></div></div>
</details>
</div>
<p><strong>Polars:</strong></p>
<div id="02d34a3a" class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Extract the same value with explicit filtering</span></span>
<span id="cb8-2">detailed_beddays_principal <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>(</span>
<span id="cb8-3">    model_runs_df.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">filter</span>(pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"measure"</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"beddays"</span>)</span>
<span id="cb8-4">    .select(pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"mean"</span>).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>())</span>
<span id="cb8-5">    .item()</span>
<span id="cb8-6">)</span></code></pre></div></div>
</details>
</div>
<p>The Polars version is more compact and easier to read: its SQL-like syntax states the intent (filter, select, aggregate) by column name, rather than relying on the position of a level in a MultiIndex.</p>
</section>
<section id="lazy-evaluation" class="level4">
<h4 class="anchored" data-anchor-id="lazy-evaluation">Lazy Evaluation</h4>
<p>As mentioned in the introduction, one of Polars’ key features is its support for lazy evaluation, which enables query optimisation:</p>
<p><strong>Pandas:</strong></p>
<div id="68fda7ff" class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Each operation executes immediately</span></span>
<span id="cb9-2">result <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb9-3">    df[df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"measure"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"beddays"</span>]</span>
<span id="cb9-4">    .groupby([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sitetret"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"age_group"</span>])</span>
<span id="cb9-5">    .agg({<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"value"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sum"</span>})</span>
<span id="cb9-6">    .reset_index()</span>
<span id="cb9-7">)</span></code></pre></div></div>
</details>
</div>
<p><strong>Polars:</strong></p>
<div id="2df46ec8" class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Builds an execution plan that can be optimised</span></span>
<span id="cb10-2">result <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb10-3">    df.lazy()</span>
<span id="cb10-4">    .<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">filter</span>(pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"measure"</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"beddays"</span>)</span>
<span id="cb10-5">    .group_by([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sitetret"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"age_group"</span>])</span>
<span id="cb10-6">    .agg(pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"value"</span>).<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>())</span>
<span id="cb10-7">    .collect()  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Only executes the plan when needed</span></span>
<span id="cb10-8">)</span></code></pre></div></div>
</details>
</div>
<p>Lazy evaluation allows Polars to analyse and optimise the entire query before execution, often resulting in better performance for complex operations. The <a href="https://docs.pola.rs/user-guide/migration/pandas/#polars-can-lazily-evaluate-queries-and-apply-query-optimization">query optimisation engine</a> can eliminate redundant steps, reorder operations, and make better use of available resources (e.g.&nbsp;pushing filter operations before joins to reduce memory usage, combining consecutive transformations, or parallelising independent operations across multiple CPU cores).</p>
</section>
</section>
<section id="memory-management" class="level3">
<h3 class="anchored" data-anchor-id="memory-management">Memory Management</h3>
<p>Both implementations are designed to manage memory, but with fundamental differences in how data is stored and processed:</p>
<p><strong>Pandas:</strong></p>
<div id="1766e7bc" class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Choose batch size and load with caching</span></span>
<span id="cb11-2">batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Balance between memory usage and I/O performance</span></span>
<span id="cb11-3">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> az.load_model_run_results_file(</span>
<span id="cb11-4">    container_client<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>results_connection,</span>
<span id="cb11-5">    params<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{</span>
<span id="cb11-6">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ... parameters</span></span>
<span id="cb11-7">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"batch_size"</span>: batch_size,  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># This enables batch loading</span></span>
<span id="cb11-8">    },</span>
<span id="cb11-9">)</span>
<span id="cb11-10"></span>
<span id="cb11-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Clean up memory</span></span>
<span id="cb11-12"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">del</span> model_runs_df, model_runs, original_df, reference_df</span>
<span id="cb11-13">gc.collect()</span>
<span id="cb11-14">logger.info(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Memory cleaned after IP processing, current usage: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>get_memory_usage()<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> MB"</span>)</span></code></pre></div></div>
</details>
</div>
<p><strong>Polars:</strong></p>
<div id="55311043" class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Similar batch loading pattern</span></span>
<span id="cb12-2">batch_size <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Note larger batch size than Pandas</span></span>
<span id="cb12-3">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> az_pl.load_model_run_results_file(</span>
<span id="cb12-4">    container_client<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>results_connection,</span>
<span id="cb12-5">    params<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>{</span>
<span id="cb12-6">        <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ... parameters</span></span>
<span id="cb12-7">        <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"batch_size"</span>: batch_size,</span>
<span id="cb12-8">    },</span>
<span id="cb12-9">)</span>
<span id="cb12-10"></span>
<span id="cb12-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Similar memory clean-up</span></span>
<span id="cb12-12"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _clean_ip_memory(</span>
<span id="cb12-13">    model_runs_df: pl.DataFrame,</span>
<span id="cb12-14">    model_runs: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">dict</span>,</span>
<span id="cb12-15">    original_df: pl.DataFrame,</span>
<span id="cb12-16">    reference_df: pl.DataFrame,</span>
<span id="cb12-17">) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">None</span>:</span>
<span id="cb12-18">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">del</span> model_runs_df, model_runs, original_df, reference_df</span>
<span id="cb12-19">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ... cache clearing and collection</span></span>
<span id="cb12-20">    gc.collect()</span>
<span id="cb12-21">    logger.debug(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Memory cleaned after IP processing, current usage: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>get_memory_usage()<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.2f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> MB"</span>)</span></code></pre></div></div>
</details>
</div>
<p>A key difference is that Polars uses <a href="https://docs.pola.rs/user-guide/migration/pandas/#memory-format">Apache Arrow’s columnar memory format</a>, which is often more memory-efficient than Pandas’ NumPy-backed columns, particularly for string data and missing values. In our case, this allowed Polars to handle a larger batch size without increasing memory pressure.</p>
<p>Both implementations explicitly clean up memory, with the Polars version extracting this into a dedicated function for improved code organisation.</p>
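<p>The batch-and-release pattern used by both implementations can be sketched in plain Python. This is a minimal illustration only: <code>iter_batches</code>, <code>process_in_batches</code> and the stand-in loader are hypothetical names, and the real <code>load_model_run_results_file</code> may batch its work quite differently.</p>

```python
import gc
from typing import Callable, Iterator, Sequence


def iter_batches(items: Sequence, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if not batch_size >= 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def process_in_batches(run_ids: Sequence, batch_size: int, load_batch: Callable) -> list:
    """Load one batch at a time, keep only a small summary, then free the batch."""
    summaries = []
    for batch in iter_batches(run_ids, batch_size):
        frame = load_batch(batch)      # hypothetical loader returning the batch data
        summaries.append(sum(frame))   # keep the aggregate, not the raw batch
        del frame                      # drop the reference before the next load
        gc.collect()
    return summaries


# Stand-in loader for illustration: pretends each run yields one number.
fake_loader = lambda ids: [run_id * 10 for run_id in ids]
batch_totals = process_in_batches(list(range(5)), batch_size=2, load_batch=fake_loader)
print(batch_totals)
```

<p>A larger <code>batch_size</code> means fewer round trips to storage but more data held in memory at once, which is the trade-off behind the two values (30 and 50) shown above.</p>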
</section>
<section id="type-handling-and-expression-system" class="level3">
<h3 class="anchored" data-anchor-id="type-handling-and-expression-system">Type Handling and Expression System</h3>
<p>Pandas and Polars differ significantly in how they handle types and expressions. Polars uses a <a href="https://deepwiki.com/pola-rs/polars/6-type-system">strongly-typed system</a> based on Apache Arrow’s type system, while Pandas handles types more implicitly.</p>
<p>Polars’ expression system (with <code>pl.col()</code> syntax) is a core part of its design, allowing for clear, composable operations that the query optimiser can understand and optimise:</p>
<p><strong>Explicit Null Checking:</strong></p>
<div id="c923a298" class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Count rows with empty sitetret for validation</span></span>
<span id="cb13-2">empty_sitetret_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> results.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">filter</span>(</span>
<span id="cb13-3">    pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sitetret"</span>).is_null() <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|</span> (pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sitetret"</span>) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>)</span>
<span id="cb13-4">).height</span>
<span id="cb13-5"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> empty_sitetret_count <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>:</span>
<span id="cb13-6">    logger.debug(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Found </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>empty_sitetret_count<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> rows with empty sitetret in run </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>run<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span></code></pre></div></div>
</details>
</div>
<p><strong>Type System Consistency:</strong></p>
<div id="a6ba36fb" class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Ensure consistent numerical representation for statistics</span></span>
<span id="cb14-2">op_model_runs_df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> op_model_runs_df.with_columns(</span>
<span id="cb14-3">    [</span>
<span id="cb14-4">        pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lwr_ci"</span>).fill_nan(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span>),</span>
<span id="cb14-5">        pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"median"</span>).fill_nan(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span>),</span>
<span id="cb14-6">        pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"mean"</span>).fill_nan(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span>),</span>
<span id="cb14-7">        pl.col(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"upr_ci"</span>).fill_nan(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span>),</span>
<span id="cb14-8">    ]</span>
<span id="cb14-9">)</span></code></pre></div></div>
</details>
</div>
<p>These examples demonstrate how Polars requires explicit handling of <code>null</code>s and <code>NaN</code> values, in contrast to Pandas’ more implicit approach. This explicitness, while slightly more verbose, helps prevent subtle bugs that can occur with implicit type conversions.</p>
</section>
</section>
<section id="performance-benefits" class="level2">
<h2 class="anchored" data-anchor-id="performance-benefits">Performance Benefits</h2>
<p>The most compelling reasons to switch from Pandas to Polars are its columnar storage, expression-based API, and query optimisation engine. In our benchmarks, we measured total execution time <em>including</em> I/O operations (reading from Azure Blob Storage and writing to local files), providing a realistic picture of performance in production environments. Our experience reimplementing our Pandas code in Polars taught us that:</p>
<ol type="1">
<li><p><strong>Processing Speed</strong>: Our Polars implementation, without any additional optimisations that might increase performance, is more than 32% faster than its Pandas counterpart. Our benchmark included I/O time in speed calculations, so the result depends on Azure and network speed as well as the computational performance of the specific machine used to perform this benchmark. With the benefit of hindsight, benchmarking dataframe operations <em>separately</em> from loading data from Azure Blob Storage would have isolated the computational gains, which are likely larger than the end-to-end figure suggests. The Polars team’s <a href="https://pola.rs/posts/benchmarks/">comprehensive benchmarks</a> report improved performance across a wide range of operations.</p></li>
<li><p><strong>I/O Performance</strong>: The Polars implementation handles file operations more efficiently, particularly when loading and saving Parquet files, as it’s designed to work natively with the Apache Arrow format.</p></li>
<li><p><strong>Parallel Processing</strong>: Polars automatically utilises multiple CPU cores by default, whereas Pandas operations are typically single-threaded. While we haven’t explicitly configured parallel execution in our implementation, we benefit from this feature automatically.</p></li>
<li><p><strong>Larger-than-RAM Processing</strong>: Polars can <a href="https://github.com/pola-rs/polars#handles-larger-than-ram-data">handle datasets larger than available RAM</a> through its lazy API and streaming engine. This is crucial for our future work as healthcare datasets continue to grow in size and complexity.</p></li>
<li><p><strong>Lightweight Dependencies</strong>: Unlike Pandas, which depends on NumPy and several other libraries, Polars is certainly <a href="https://github.com/pola-rs/polars#lightweight">more lightweight</a> with no Python dependencies. This makes deployment simpler and reduces potential version conflicts in complex projects.</p></li>
</ol>
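<p>One way to separate I/O time from compute time in a future benchmark is to time each stage on its own. The sketch below uses only the standard library; <code>timed</code>, <code>load_rows</code> and <code>transform</code> are hypothetical stand-ins, with a short sleep simulating a network read rather than a real call to Azure Blob Storage.</p>

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per named stage.
timings = defaultdict(float)


@contextmanager
def timed(stage):
    """Add the elapsed time of the enclosed block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] += time.perf_counter() - start


def load_rows():
    time.sleep(0.01)  # stand-in for a network read
    return list(range(1000))


def transform(rows):
    return sum(r * r for r in rows)  # stand-in for dataframe work


with timed("io"):
    rows = load_rows()
with timed("compute"):
    total = transform(rows)

compute_share = timings["compute"] / (timings["io"] + timings["compute"])
print(f"compute share of total time: {compute_share:.1%}")
```

<p>Reporting the two stages separately shows how much of an end-to-end improvement comes from the dataframe library and how much is outside its control.</p>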
<p>For context, processing a typical scenario with 256 runs that previously took about 57 minutes with Pandas now completes in 38 minutes with Polars, with basic Polars API knowledge.</p>
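<p>As a quick check of how these figures relate, the arithmetic can be written out as below. The 57-minute figure is approximate, so the percentages are indicative rather than exact.</p>

```python
pandas_minutes = 57  # approximate, from the scenario above
polars_minutes = 38

time_saved = pandas_minutes - polars_minutes       # 19 minutes
reduction_pct = 100 * time_saved / pandas_minutes  # share of the original runtime saved
speedup = pandas_minutes / polars_minutes          # ratio of old runtime to new runtime

print(f"Saved {time_saved} min: {reduction_pct:.1f}% less runtime, {speedup:.1f}x speed-up")
```

<p>The same result can therefore be quoted as roughly a one-third reduction in runtime or as a 1.5x speed-up, depending on which baseline is used.</p>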
</section>
<section id="trade-offs-and-considerations" class="level2">
<h2 class="anchored" data-anchor-id="trade-offs-and-considerations">Trade-offs and Considerations</h2>
<p>Despite the clear performance advantages, there are some trade-offs to consider:</p>
<section id="learning-curve" class="level3">
<h3 class="anchored" data-anchor-id="learning-curve">Learning Curve</h3>
<p>Polars has a different API than Pandas, requiring time for teams to adjust. While many operations are similar, there are enough differences to necessitate a learning period.</p>
</section>
<section id="ecosystem-integration" class="level3">
<h3 class="anchored" data-anchor-id="ecosystem-integration">Ecosystem Integration</h3>
<p>Pandas integrates seamlessly with libraries such as NumPy, Scikit-Learn and Matplotlib, which are foundational for Data Science and Machine Learning work. Polars’ <a href="https://docs.pola.rs/user-guide/ecosystem/">ecosystem integration</a> shouldn’t be a concern at this stage, given its wider support for third party libraries and its stable API.</p>
</section>
<section id="small-dataset-performance" class="level3">
<h3 class="anchored" data-anchor-id="small-dataset-performance">Small Dataset Performance</h3>
<p>For smaller datasets -fewer than a few tens of thousands of rows- the performance advantage of Polars is less pronounced and sometimes negligible. The overhead of setting up optimised execution can actually make Polars slightly slower for smaller scale operations.</p>
</section>
</section>
<section id="practical-advice-for-adoption" class="level2">
<h2 class="anchored" data-anchor-id="practical-advice-for-adoption">Practical Advice for Adoption</h2>
<p>Based on our experience, here are some recommendations for teams considering a similar transition:</p>
<section id="start-with-non-critical-components" class="level3">
<h3 class="anchored" data-anchor-id="start-with-non-critical-components">Start with Non-Critical Components</h3>
<p>Begin your Polars adoption with non-critical data processing components where you can thoroughly test and validate outputs against your existing Pandas implementation.</p>
</section>
<section id="incremental-migration" class="level3">
<h3 class="anchored" data-anchor-id="incremental-migration">Incremental Migration</h3>
<p>Rather than rewriting entire systems at once, migrate components incrementally. Our approach of maintaining both implementations side by side allowed for a detailed comparison that can lead to an informed decision.</p>
</section>
<section id="validate-validate-validate" class="level3">
<h3 class="anchored" data-anchor-id="validate-validate-validate">Validate, Validate, Validate</h3>
<p>Implement rigorous validation to ensure the Polars implementation produces identical (or appropriately similar) results to your Pandas code. We found tiny discrepancies due to the difference in handling missing data that needed to be addressed.</p>
</section>
<section id="review-memory-management" class="level3">
<h3 class="anchored" data-anchor-id="review-memory-management">Review Memory Management</h3>
<p>Polars handles memory differently than Pandas. Review your memory management practices and consider Polars’ lazy evaluation capabilities for very large datasets.</p>
</section>
<section id="keep-documentation-updated" class="level3">
<h3 class="anchored" data-anchor-id="keep-documentation-updated">Keep Documentation Updated</h3>
<p>Maintain clear documentation about the differences between your Pandas and Polars implementations, particularly regarding API differences and performance characteristics.</p>
</section>
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>Our experiment in converting our detailed results processor from Pandas to Polars has been instructive and positive. The expression-based API is more readable, making our code more maintainable. As a bonus, our pipeline runs faster, especially when processing larger datasets.</p>
<p>For teams working with large datasets, particularly those experiencing performance bottlenecks with Pandas, Polars represents a compelling alternative that maintains much of the familiar dataframe paradigm while delivering measurable performance improvements.</p>
<p>We’re continuing to investigate ways to improve our codebase, its speed of execution, memory and overall efficiency, and looking forward to leveraging more effective approaches that’ll render our code even more powerful.</p>
<hr>
<p><em>This blog post reflects the experience of the Strategy Unit’s Data Science team in implementing Polars for detailed results processing. The performance improvements mentioned are specific to our workloads and may vary depending on your specific use case.</em></p>


</section>

 ]]></description>
  <category>Python</category>
  <category>Data Science</category>
  <category>Performance</category>
  <guid>https://the-strategy-unit.github.io/data_science/blogs/posts/2025-11-07_pandas_vs_polars/</guid>
  <pubDate>Fri, 07 Nov 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Verifying Git Commits</title>
  <link>https://the-strategy-unit.github.io/data_science/blogs/posts/2025-09-18_verifying_git_commits/</link>
  <description><![CDATA[ 





<p>Ever looked at a list of commits on GitHub and noticed ✅ green checks next to some commits, but not others?<br>
Wondered what they mean?</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-09-18_verifying_git_commits/nhp_inputs-commits.png" class="img-fluid figure-img"></p>
<figcaption>Screenshot of the nhp_inputs repository, showing some verified commits and other unverified commits.</figcaption>
</figure>
</div>
<p>Here’s a screenshot of our <a href="https://github.com/The-Strategy-Unit/nhp_inputs/commits/main/">nhp_inputs</a> repository.<br>
You can see that some commits (merge commits, specifically here) have those checks, while others don’t.</p>
<p>The green check indicates that the commit has been cryptographically signed — proving it was created by the stated author and hasn’t been tampered with.</p>
<p>There’s nothing stopping someone from changing their Git username and email to impersonate another author.<br>
You might have seen this yourself when your local Git config isn’t set correctly (<a href="https://superuser.com/questions/1435213/github-why-do-i-appear-twice-on-every-commit">example and fix</a>).</p>
<p>So, why are those merge commits verified?</p>
<p>When you make a commit directly on GitHub (either by editing a file in the browser or merging a pull request), GitHub knows it’s you and “signs” the commit using its <a href="https://en.wikipedia.org/wiki/GNU_Privacy_Guard">GPG</a> private key.<br>
Others can then independently verify the commit by validating it against <a href="https://keys.openpgp.org/search?q=B5690EEEBB952194">GitHub’s public key</a>.</p>
<section id="can-you-sign-your-own-commits" class="level2">
<h2 class="anchored" data-anchor-id="can-you-sign-your-own-commits">Can you sign your own commits?</h2>
<p>Absolutely — and you probably should!<br>
This isn’t just a GitHub feature; it’s built directly into Git.</p>
<p>Below is a screenshot of our <a href="https://github.com/the-strategy-unit/nhp_model/commits/main">nhp_model</a>, where all the commits I’ve authored have been signed with <a href="https://keyserver.ubuntu.com/pks/lookup?search=8F3C2735D62D6993&amp;fingerprint=on&amp;op=index">my GPG key</a>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-09-18_verifying_git_commits/nhp_model-commits.png" class="img-fluid figure-img"></p>
<figcaption>Screenshot of the nhp_model repository, showing all commits are verified.</figcaption>
</figure>
</div>
<p>Setting up GPG can be a bit of a faff, but there’s an easier way: using SSH keys to sign commits.<br>
You may already have SSH set up for pushing and pulling from GitHub, so this is a no-brainer.<br>
Even if you don’t, setting up SSH is quick.</p>
</section>
<section id="easy-approach-using-ssh" class="level2">
<h2 class="anchored" data-anchor-id="easy-approach-using-ssh">Easy approach: using SSH</h2>
<p>For more details, see <a href="https://docs.gitlab.com/user/project/repository/signed_commits/ssh/">“Sign commits with SSH keys” from GitLab’s docs</a>, but here’s the short version:</p>
<p>First, generate an SSH key if you haven’t already. Open a terminal (on Windows you might need Git Bash) and run:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode sh code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ssh-keygen</span> <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-t</span> ed25519 <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">-C</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"my.email@example.com"</span></span></code></pre></div></div>
<p>By default, this creates a key pair in <code>~/.ssh</code>; the public key is saved as <code>~/.ssh/id_ed25519.pub</code>, where <code>~</code> is your home directory (for example, on Windows: <code>C:\Users\thomas.jemmett\.ssh\id_ed25519.pub</code>).</p>
<p>Then run:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode sh code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> config <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--global</span> gpg.format ssh</span>
<span id="cb2-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> config <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--global</span> user.signingkey ~/.ssh/id_ed25519.pub</span>
<span id="cb2-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> config <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--global</span> tag.gpgsign true</span>
<span id="cb2-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">git</span> config <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">--global</span> commit.gpgsign true</span></code></pre></div></div>
<p>From now on, your commits will be signed — but GitHub won’t yet recognise the signature! You need to upload your SSH key for signing: see <a href="https://docs.github.com/en/authentication/connecting-to-github-with-ssh/adding-a-new-ssh-key-to-your-github-account#adding-a-new-ssh-key-to-your-account">“Adding a new SSH key to your account”, GitHub Docs</a>.</p>
<p>You need to make sure that you set this as a key for signing, not for authentication. If you have already uploaded your key for authentication, you need to upload it again for signing. Using the same key for both is fine.</p>
</section>
<section id="harder-approach-gpg" class="level2">
<h2 class="anchored" data-anchor-id="harder-approach-gpg">Harder approach: GPG</h2>
<p>While GitHub supports SSH-signed commits, not all providers do. GPG is a more established way of performing these cryptographic operations.</p>
<p>It does take more setup, and you may need to install additional tools. Rather than going step by step here, you should read GitHub’s docs on creating a key.</p>
<p>If you do create a GPG key, consider using a modern algorithm like ECDSA or EdDSA.</p>
<p>It’s also a good idea to <a href="https://www.gnupg.org/gph/en/manual/x457.html">send your new key to a few keyservers</a> (e.g.&nbsp;<a href="https://pgp.mit.edu">pgp.mit.edu</a>, <a href="https://keyserver.ubuntu.com">keyserver.ubuntu.com</a>, <a href="https://keys.openpgp.org">keys.openpgp.org</a>).</p>
<p>If some of your colleagues also use GPG, you might <a href="https://gist.github.com/F21/b0e8c62c49dfab267ff1d0c6af39ab84">sign their keys</a> to help establish a <a href="https://en.wikipedia.org/wiki/Chain_of_trust">Chain of trust</a>.</p>
</section>
<section id="paranoid-approach" class="level2">
<h2 class="anchored" data-anchor-id="paranoid-approach">Paranoid approach</h2>
<p>Once your commits are verifiable, you need to ensure your private key never leaks. If an attacker obtained your SSH or GPG key, they could create commits pretending to be you and sign them so others would completely trust them.</p>
<p>If you’re ultra-paranoid (or a maintainer of a popular repository), consider this:</p>
<p>Generate a GPG master key on an air-gapped machine (i.e.&nbsp;not connected to the internet), use it to generate short-lived subkeys, then copy the subkeys to a hardware security key (HSK) like a <a href="https://support.yubico.com/hc/en-us/articles/360013790259-Using-Your-YubiKey-with-OpenPGP">Yubikey</a>.</p>
<p>A good guide is the blog series <a href="https://chipsenkbeil.com/posts/applying-gpg-and-yubikey-part-1-overview/">Applying GPG and Yubikey</a> (though I’d choose ECC rather than the RSA the series suggests). Part 6 covers integrating this with a YubiKey.</p>
<p>This approach stores your cryptographic secrets in a way they can’t be extracted (HSKs only import keys, not export them), meaning an attacker would have to physically steal your device.</p>


</section>

 ]]></description>
  <category>Git</category>
  <category>GitHub</category>
  <guid>https://the-strategy-unit.github.io/data_science/blogs/posts/2025-09-18_verifying_git_commits/</guid>
  <pubDate>Thu, 18 Sep 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Positron for Product Owners</title>
  <link>https://the-strategy-unit.github.io/data_science/blogs/posts/2025-09-09_positron_for_product/</link>
  <description><![CDATA[ 





<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-09-09_positron_for_product/PST-Blog-PositronforProductOwners-A.jpg" class="img-fluid" alt="A data scientist working and an inset of the Positron IDE"></p>
<p>The clever cookies at Posit (creators of RStudio) recently released Positron, their new ‘next generation’ integrated development environment (IDE), after 2+ years of development. This software is multilingual, and will look familiar to those who have used VSCode in the past. It has been created to build upon the successes of RStudio for statistical computing and graphics, to create a seamless workflow for the development of data science projects. It’s a lovely piece of software that I’ve recently started working with and is providing me with some useful features to help in my role as a product owner. I wanted to give a little round-up of what I’ve found useful with it so far, for someone whose daily tasks differ from standard data science or analysis (but includes elements of both).</p>
<p>Firstly, what do I mean by a ‘product owner’? I’ll keep this brief since there is already so much out there that eloquently <a href="https://www.scrum.org/resources/what-is-a-product-owner">describes this role</a> and its relationship with others in the <a href="https://agilealliance.org/agile101/">Agile</a> universe. Essentially, my job is to ensure our data science team is building the right thing, at the right time, for our customers.<br>
The data science team’s time is spent marshalling their wizardry (as it seems to me) to write code, create packages, and curate workflows to meet the requirements for the tasks at hand. My job is to make sure those tasks are the right ones for meeting customer needs. This involves a lot of thinking about stakeholder requirements, translating them to a prioritised list of tools or functionality elements that will truly answer their questions, gathering and assessing user feedback, improving the user experience, and trying to find the sweet spot where the time, money and effort spent on development is appropriate for the value of the new thing for the customer.</p>
<p>I’m a data scientist myself, with a background in R and Python. I love writing code and building R packages and have done so in RStudio for some time now. A product owner’s working week is more focused on planning, prioritising, prioritising again, thinking about processes, documenting decisions and so forth.<br>
But to contribute to our team’s work I also write little bits of code, write documentation, create diagrams and infographics, undertake some QA, review PRs and so forth. Positron has made many of these elements easy and smooth. Here’s a round-up of the ways Positron is helping me with the day job:</p>
<section id="curating-the-product-backlog" class="level2">
<h2 class="anchored" data-anchor-id="curating-the-product-backlog">Curating the product backlog</h2>
<p>This is probably my favourite thing about Positron. The GitHub interface in RStudio was ok, but I still found myself using the command line, which felt like it interrupted my flow of work to a degree. With Positron, I never have to leave the IDE. Once you have authenticated your GitHub account in Positron, you can click the GitHub icon in the activity panel to see all the pull requests and issues for that repo. You can create PRs from that menu too – it’s all self-explanatory and works really well. Clicking on one of your repo issues will also open it as an additional tab in the editor pane, and you can interact with it there just as you would at github.com! No need to have browsers and shells open, it’s all just there. This helps my workflow massively, allowing me to keep my train of thought going.</p>
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-09-09_positron_for_product/image1-391x1024.png" class="img-fluid" alt="Screenshot of the Positron IDE Source Control panel" width="300"></p>
</section>
<section id="writing-plans-and-processes-and-sharing-them" class="level2">
<h2 class="anchored" data-anchor-id="writing-plans-and-processes-and-sharing-them">Writing plans and processes and sharing them</h2>
<p>Quarto, which is already bilingual, is the tool I use to write meeting notes, lists and plans, all within a locally version-controlled Quarto book. The whole book is searchable, so it’s easy to find the notes I’m looking for, and Positron makes working with Quarto very easy. Positron comes with a bootstrapped Quarto extension and the CLI is bundled with Positron, so no extra setup is needed. I can create my planning documents, with nice mermaid flow diagrams and such, and I can publish them easily from Positron to share with my team. I’ve written a package that I can use within the IDE to automatically append notes to any of the .qmd pages of my notes book. I can even create diagrams and charts in the IDE, and save them into my notes. If the diagram is complex, I can append the code chunks and output to my book too, since that’s what Quarto is great at.</p>
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-09-09_positron_for_product/image2.png" class="img-fluid" alt="Screenshot of the Positron IDE showing a Quarto document"></p>
<p>I love Positron’s feature that allows scrolling back through the iterations of a plot too, which makes rolling the code back to a preferred solution nice and simple.</p>
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-09-09_positron_for_product/image3.png" class="img-fluid" alt="Screenshot of the Positron IDE plotting pane"></p>
</section>
<section id="testing-new-functionality-and-ux" class="level2">
<h2 class="anchored" data-anchor-id="testing-new-functionality-and-ux">Testing new functionality and UX</h2>
<p>I need to understand how our products are experienced by our users, so it helps me to run our apps in development locally. Using Positron allows me to use a very similar workflow to my familiar RStudio one, including using {devtools} for documenting and running package checks and using devtools::load_all() to test out new functionality locally. In addition to this, Positron makes working with tests really simple (which is saying something coming from me!). The ‘testing’ icon in the activity pane opens a dedicated interface sidebar, which neatly displays your tests and their results (provided you are in a package folder). It lets you easily rerun failing tests, see their error messages, and rerun batches of tests or all tests at once. This is a really great addition to the IDE.</p>
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-09-09_positron_for_product/image4.png" class="img-fluid" alt="Screenshot of the Positron IDE testing pane" width="300"></p>
</section>
<section id="juggling-many-projects" class="level2">
<h2 class="anchored" data-anchor-id="juggling-many-projects">Juggling many projects</h2>
<p>I flit between projects often. Double-clicking the R Project file allows you to start a new RStudio session with the project already loaded. But there is no direct equivalent of the .Rproj file for Positron.<br>
This is a little jarring at first, but opening a folder to begin working in a project is a familiar workflow from VS Code.<br>
I wanted a quick and easy way to switch between projects, and have found the Project Manager Extension useful for this. It creates a new multi-folder icon in the activity pane, which when clicked gives a neat overview of saved projects (folders), which I can tag easily to bundle together related projects. Lovely!</p>
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-09-09_positron_for_product/image5.jpg" class="img-fluid" alt="Screenshot of the Positron IDE Extension Marketplace" width="300"></p>
</section>
<section id="helpers-for-doing-good-data-science" class="level2">
<h2 class="anchored" data-anchor-id="helpers-for-doing-good-data-science">Helpers for doing good data science</h2>
<p>I don’t want to lose my skills as a data scientist; I like to dabble and make useful tools for myself, and very occasionally I get my hands dirty writing code with the team. Positron is helping me write better code, quickly, which helps me look like a data scientist (I hope!). My favourite features so far are having my code formatted automatically when I save an R or Quarto file by the excellent Air formatter (shipped as standard with Positron), using <a href="https://positron.posit.co/migrate-rstudio-code.html#r-snippets">code snippets</a> to get the syntax of common functions right first time, and the cool multi-cursor for editing in multiple places at once!</p>
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-09-09_positron_for_product/image6.png" class="img-fluid" alt="Screenshot of the Positron settings.json file"></p>
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-09-09_positron_for_product/Untitled-design.gif" class="img-fluid" alt="Animation showing the use of multicursor functionality" width="300"></p>
<p>Overall, Positron feels like a nice place to be. I’m finding it to be an IDE that helps with so many aspects of being a product owner, and that can only help us to keep building the right thing at the right time.</p>


</section>

 ]]></description>
  <category>R</category>
  <category>Positron</category>
  <category>Git</category>
  <guid>https://the-strategy-unit.github.io/data_science/blogs/posts/2025-09-09_positron_for_product/</guid>
  <pubDate>Tue, 09 Sep 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Imputing missing data</title>
  <link>https://the-strategy-unit.github.io/data_science/blogs/posts/2025-07-28_imputing-data/</link>
  <description><![CDATA[ 





<p>In the world of data analysis, missing values can feel like puzzle pieces that just won’t fit, leaving analysts frustrated and insights incomplete. But what if I told you that imputing these missing values could be the key to unlocking a treasure trove of insights? By skilfully filling in the gaps, we can enhance the integrity of our datasets <em>and</em> elevate the quality of our analyses.</p>
<p>This blog explores the importance of addressing missing data and how an effective imputation strategy can transform incomplete datasets into powerful tools for decision-making.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-07-28_imputing-data/mice_imputation.jpg" class="img-fluid figure-img" alt="A mouse looking at a wall of binary data containing missing values. The mouse is imputing the missing values, which are coloured red."></p>
<figcaption>Imputing missing data with MICE</figcaption>
</figure>
</div>
<p>Recently, I encountered a challenge while working with a dataset that had approximately 3% missing values. The goal was to match sites that received an intervention with control sites using the <a href="https://cran.r-project.org/package=MatchIt">{MatchIt}</a> package. However, the package excluded records with missing data, which would have resulted in the loss of around 40 intervention sites.</p>
<p>Further investigation revealed the missingness was not at random, which meant that proceeding with a complete-case analysis would have introduced biases into the resulting analysis.</p>
<p>To address these issues, I turned to the Multiple Imputation by Chained Equations (MICE) algorithm, implemented in the <a href="https://cran.r-project.org/package=mice">{mice}</a> package. The MICE algorithm offered a solution by imputing likely values for the missing data, which reduced bias in the resulting analysis and increased the accuracy of the matched sites.</p>
<section id="whats-the-problem-with-missing-data" class="level2">
<h2 class="anchored" data-anchor-id="whats-the-problem-with-missing-data">What’s the problem with missing data?</h2>
<p>When datasets contain gaps, whether due to non-response in surveys, data entry errors, or system malfunctions, the integrity of the analysis is compromised.</p>
<p>In my case, missing data presented two main challenges. First, the missingness prevented some of my intervention sites from being matched with control sites, which reduced the overall dataset size and excluded important information from my analysis. Second, the type of missingness in my data meant that excluding incomplete records introduced biases into the dataset.</p>
<p>Understanding the type of missingness in your dataset is essential for choosing the right approach to handle it.</p>
<section id="types-of-missingness" class="level3">
<h3 class="anchored" data-anchor-id="types-of-missingness">Types of missingness</h3>
<p>Missing data can be categorised into three types:</p>
<ul>
<li><p><strong>Missing Completely at Random (MCAR)</strong>, where the missingness is entirely independent of both observed and unobserved data</p></li>
<li><p><strong>Missing at Random (MAR)</strong>, where the missingness is related to observed data but not the missing values themselves</p></li>
<li><p><strong>Missing Not at Random (MNAR)</strong>, where the missingness is related to the unobserved data.</p></li>
</ul>
<p>These types are summarised in the table below, along with strategies for handling them.</p>
<table class="caption-top table">
<caption><a href="https://medium.com/data-and-beyond/types-of-missing-data-in-data-analysis-theoretical-background-8b907c1ea33a">Types of missing data in data analysis: theoretical background</a>, Medium</caption>
<thead>
<tr class="header">
<th>Missing Completely at Random (MCAR)</th>
<th>Missing at Random (MAR)</th>
<th>Missing Not at Random (MNAR)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td><p>The missingness is completely unrelated to the data (no pattern).</p>
<p>Analyses remain unbiased as missingness does not systematically affect the data.</p>
<p>Remove rows with missing data, or simple imputation using, e.g.&nbsp;mean.</p></td>
<td><p>The gaps are related to observed variable/s (conditional).</p>
<p>Biases are introduced if missingness is not accounted for.</p>
<p>Can be addressed using imputation methods using observed data to predict missing values.</p></td>
<td><p>The gaps are related to observed variable/s <strong>and also</strong> to the missing values themselves.</p>
<p>Need to collect a sample of missing data to be sure.</p>
<p>Analysis is biased.</p>
<p>Sophisticated techniques can <em>possibly</em> fix.</p></td>
</tr>
</tbody>
</table>
<p>Using techniques described in <a href="https://www.epirhandbook.com/en/new_pages/missing_data.html#assess-missingness-in-a-data-frame">The Epidemiologist R Handbook</a>, I found that the missingness in my data was non-random, specifically either MAR or MNAR. Since I could not collect any of the missing values (as I was working with a nationally-produced dataset), I decided to treat my data as Missing at Random (MAR), which meant that biases could be introduced if I failed to account for the missingness in the data.</p>
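<p>To make the three mechanisms concrete, here is a minimal sketch in base R. It is not the analysis from my project: the toy <code>age</code> and <code>salary</code> values, the missingness probabilities and the object names are all assumptions chosen purely for illustration. It compares the mean of the values that remain observed with the true mean under each mechanism.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical toy data: salary depends on age (all values are illustrative)
set.seed(42)
n &lt;- 1000
age &lt;- sample(18:65, size = n, replace = TRUE)
salary &lt;- 20000 + 400 * age + rnorm(n, mean = 0, sd = 5000)

# MCAR: every salary has the same 20% chance of being missing
miss_mcar &lt;- runif(n) &lt; 0.2

# MAR: the chance of a missing salary depends only on the observed age
miss_mar &lt;- runif(n) &lt; ifelse(age &gt; 50, 0.4, 0.1)

# MNAR: the chance of a missing salary depends on the salary itself
miss_mnar &lt;- runif(n) &lt; ifelse(salary &gt; 45000, 0.4, 0.1)

# Compare the complete-case mean under each mechanism with the true mean
c(
  true = mean(salary),
  mcar = mean(salary[!miss_mcar]),
  mar  = mean(salary[!miss_mar]),
  mnar = mean(salary[!miss_mnar])
)</code></pre></div>
</div>
<p>In a simulation like this, the MCAR mean should sit close to the true mean, while the MAR and MNAR means tend to be pulled downwards because higher salaries are more likely to be removed. With real data the mechanism is not known in advance, which is why assessing missingness matters.</p>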
</section>
</section>
<section id="handling-missingness" class="level2">
<h2 class="anchored" data-anchor-id="handling-missingness">Handling missingness</h2>
<p>Handling missingness can be approached through various techniques, each with its own strengths and weaknesses.</p>
<p>One common method is to <strong>delete rows containing missing data</strong> (i.e.&nbsp;complete-case analysis). This method is simple and is often the default in statistical packages. However, it is only appropriate when the data is missing completely at random (MCAR).</p>
<p>Another approach is <strong>imputation</strong>, where missing values are filled in based on observed data. Simple methods include mean or median imputation, which replaces each missing value with the mean or median of the observed values. While straightforward, these methods can reduce variability in the dataset. More sophisticated techniques, such as regression imputation, use regression models to predict and fill in missing values based on other variables. This can provide more accurate estimates, but can introduce bias if the model is not well-specified.</p>
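<p>The loss of variability under mean imputation is easy to see in a small base R sketch. The vector <code>x</code>, the number of values removed and the object names below are assumptions made up for this illustration.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical example: 200 values, 60 of which are then set to missing
set.seed(1)
x &lt;- rnorm(200, mean = 50, sd = 10)
x_missing &lt;- x
x_missing[sample(200, size = 60)] &lt;- NA

# Mean imputation: replace every NA with the mean of the observed values
x_imputed &lt;- ifelse(
  is.na(x_missing),
  mean(x_missing, na.rm = TRUE),
  x_missing
)

# The imputed values all sit exactly on the mean, so the spread shrinks
c(
  sd_observed = sd(x_missing, na.rm = TRUE),
  sd_after_mean_imputation = sd(x_imputed)
)</code></pre></div>
</div>
<p>Because every imputed value equals the observed mean, the imputed values add nothing to the sum of squared deviations while the sample size grows, so the standard deviation after imputation is always smaller than that of the observed values.</p>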
<p>For more advanced handling of missing data, <strong>multiple imputation</strong> creates several different imputed datasets and combines the results to account for uncertainty in the missing data. This approach offers more robust estimates and better reflects the variability in the data.</p>
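<p>As a rough sketch of what this looks like in practice, the snippet below runs the impute, analyse and pool steps with the {mice} package. It assumes {mice} is installed and uses the small <code>nhanes</code> example dataset that ships with the package rather than the data discussed in this post. The model formula and the choice of five imputations using predictive mean matching are illustrative assumptions, not recommendations.</p>
<div class="cell">
<div class="sourceCode cell-code"><pre class="sourceCode r"><code class="sourceCode r"># Assumes the {mice} package is installed; nhanes is an example dataset
# bundled with it (columns: age, bmi, hyp, chl)

# 1. Impute: create five completed datasets using predictive mean matching
imp &lt;- mice::mice(
  mice::nhanes,
  m = 5,
  method = "pmm",
  seed = 123,
  printFlag = FALSE
)

# 2. Analyse: fit the same (illustrative) model to each completed dataset
fits &lt;- with(imp, lm(bmi ~ age + chl))

# 3. Pool: combine the five sets of estimates using Rubin's rules
pooled &lt;- mice::pool(fits)
summary(pooled)</code></pre></div>
</div>
<p>The pooled standard errors reflect both the usual sampling variability and the extra uncertainty from not knowing the missing values, which is what a single imputed dataset cannot capture.</p>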
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-07-28_imputing-data/techniques_for_missing_data.png" class="img-fluid figure-img" alt="A list of techniques for handling missing data in ascending order of complexity."></p>
<figcaption>Techniques for handling missing data</figcaption>
</figure>
</div>
</section>
<section id="applied-example" class="level2">
<h2 class="anchored" data-anchor-id="applied-example">Applied example</h2>
<p>Now that we’ve covered the theory of missingness, its impact on analyses, and ways to deal with it, let’s put these techniques into practice using a fictional dataset.</p>
<ul>
<li><p>We’ll create a sample dataset that simulates authentic relationships between age, gender and salary. This will provide the foundation for our experiment.</p></li>
<li><p>Next, we’ll intentionally introduce non-random missingness into the dataset to create a realistic scenario with incomplete data.</p></li>
<li><p>We’ll then fill in the missing data using four different imputation techniques.</p></li>
<li><p>Finally, we’ll compare the results from each imputation technique with the original dataset to evaluate their performance and see how well they work with this example dataset.</p></li>
</ul>
<section id="create-a-complete-dataset" class="level3">
<h3 class="anchored" data-anchor-id="create-a-complete-dataset">Create a complete dataset</h3>
<p>First, let’s create an example dataset consisting of UK salaries by age and gender in 2025. This data is loosely based on figures reported in a <a href="https://www.forbes.com/uk/advisor/business/average-uk-salary-by-age/">Forbes article</a> and on Office for National Statistics (ONS) data.</p>
<p>Our data will contain details for 1,000 people’s salary along with their age and gender:</p>
<ul>
<li><p>Age will be generated independently from a Poisson distribution with a mean of 40 years.</p></li>
<li><p>Gender will be generated independently as a random sample from a list of either <code>Male</code> or <code>Female</code>.</p></li>
<li><p>Salary will depend on both age and gender: there is an ‘n’-shaped relationship between age and salary, with the highest salaries for those aged 40 to 49 years, and a 7% gender pay gap between men and women.</p></li>
</ul>
<div class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># decide the number of rows to create</span></span>
<span id="cb1-2">rows <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span></span>
<span id="cb1-3"></span>
<span id="cb1-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Ensure reproducibility</span></span>
<span id="cb1-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span>)</span>
<span id="cb1-6"></span>
<span id="cb1-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate the data</span></span>
<span id="cb1-8">df_complete <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span></span>
<span id="cb1-9">  tibble<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(</span>
<span id="cb1-10">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># age and gender are independent</span></span>
<span id="cb1-11">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">age =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rpois</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> rows, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">lambda =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">40</span>),</span>
<span id="cb1-12">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">gender =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Male'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Female'</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> rows, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">replace =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>),</span>
<span id="cb1-13"></span>
<span id="cb1-14">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># generate salary based on age group</span></span>
<span id="cb1-15">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">salary_age =</span> dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">case_when</span>(</span>
<span id="cb1-16">      age <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%in%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">18</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">21</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> rows, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">24000</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5000</span>),</span>
<span id="cb1-17">      age <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%in%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">22</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">29</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> rows, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">33000</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6000</span>),</span>
<span id="cb1-18">      age <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%in%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">39</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> rows, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">40000</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7000</span>),</span>
<span id="cb1-19">      age <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%in%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">40</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">49</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> rows, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">43000</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8000</span>),</span>
<span id="cb1-20">      age <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%in%</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">59</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> rows, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">41000</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">7000</span>),</span>
<span id="cb1-21">      age <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">59</span>       <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rnorm</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> rows, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">36000</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5000</span>)</span>
<span id="cb1-22">    ),</span>
<span id="cb1-23"></span>
<span id="cb1-24">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># gender pay gap of 7% across all age ranges</span></span>
<span id="cb1-25">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">salary =</span> dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">if_else</span>(</span>
<span id="cb1-26">      gender <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'Female'</span>,</span>
<span id="cb1-27">      salary_age <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.93</span>,</span>
<span id="cb1-28">      salary_age</span>
<span id="cb1-29">    ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb1-30">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">digits =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb1-31">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb1-32">  dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>salary_age)</span></code></pre></div></div>
</details>
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># see a sample</span></span>
<span id="cb3-2">df_complete <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb3-3">  dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">slice_head</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb3-4">  gt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gt</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb3-5">  gt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">opt_stylize</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">style =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">add_row_striping =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>)</span></code></pre></div></div>
</details>
<div class="cell-output-display">
<div id="fgmozmnhcz" style="padding-left:0px;padding-right:0px;padding-top:10px;padding-bottom:10px;overflow-x:auto;overflow-y:auto;width:auto;height:auto;">
<style>#fgmozmnhcz table {
  font-family: system-ui, 'Segoe UI', Roboto, Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol', 'Noto Color Emoji';
  -webkit-font-smoothing: antialiased;
  -moz-osx-font-smoothing: grayscale;
}

#fgmozmnhcz thead, #fgmozmnhcz tbody, #fgmozmnhcz tfoot, #fgmozmnhcz tr, #fgmozmnhcz td, #fgmozmnhcz th {
  border-style: none;
}

#fgmozmnhcz p {
  margin: 0;
  padding: 0;
}

#fgmozmnhcz .gt_table {
  display: table;
  border-collapse: collapse;
  line-height: normal;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 3px;
  border-top-color: #D5D5D5;
  border-right-style: solid;
  border-right-width: 3px;
  border-right-color: #D5D5D5;
  border-bottom-style: solid;
  border-bottom-width: 3px;
  border-bottom-color: #D5D5D5;
  border-left-style: solid;
  border-left-width: 3px;
  border-left-color: #D5D5D5;
}

#fgmozmnhcz .gt_caption {
  padding-top: 4px;
  padding-bottom: 4px;
}

#fgmozmnhcz .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  padding-left: 5px;
  padding-right: 5px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#fgmozmnhcz .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 3px;
  padding-bottom: 5px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#fgmozmnhcz .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#fgmozmnhcz .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D5D5D5;
}

#fgmozmnhcz .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D5D5D5;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D5D5D5;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#fgmozmnhcz .gt_col_heading {
  color: #FFFFFF;
  background-color: #004D80;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#fgmozmnhcz .gt_column_spanner_outer {
  color: #FFFFFF;
  background-color: #004D80;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#fgmozmnhcz .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#fgmozmnhcz .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#fgmozmnhcz .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D5D5D5;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 5px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#fgmozmnhcz .gt_spanner_row {
  border-bottom-style: hidden;
}

#fgmozmnhcz .gt_group_heading {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D5D5D5;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D5D5D5;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  text-align: left;
}

#fgmozmnhcz .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D5D5D5;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D5D5D5;
  vertical-align: middle;
}

#fgmozmnhcz .gt_from_md > :first-child {
  margin-top: 0;
}

#fgmozmnhcz .gt_from_md > :last-child {
  margin-bottom: 0;
}

#fgmozmnhcz .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D5D5D5;
  border-left-style: solid;
  border-left-width: 1px;
  border-left-color: #D5D5D5;
  border-right-style: solid;
  border-right-width: 1px;
  border-right-color: #D5D5D5;
  vertical-align: middle;
  overflow-x: hidden;
}

#fgmozmnhcz .gt_stub {
  color: #FFFFFF;
  background-color: #929292;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D5D5D5;
  padding-left: 5px;
  padding-right: 5px;
}

#fgmozmnhcz .gt_stub_row_group {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 5px;
  padding-right: 5px;
  vertical-align: top;
}

#fgmozmnhcz .gt_row_group_first td {
  border-top-width: 2px;
}

#fgmozmnhcz .gt_row_group_first th {
  border-top-width: 2px;
}

#fgmozmnhcz .gt_summary_row {
  color: #FFFFFF;
  background-color: #5F5F5F;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#fgmozmnhcz .gt_first_summary_row {
  border-top-style: solid;
  border-top-color: #D5D5D5;
}

#fgmozmnhcz .gt_first_summary_row.thick {
  border-top-width: 2px;
}

#fgmozmnhcz .gt_last_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D5D5D5;
}

#fgmozmnhcz .gt_grand_summary_row {
  color: #FFFFFF;
  background-color: #929292;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#fgmozmnhcz .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D5D5D5;
}

#fgmozmnhcz .gt_last_grand_summary_row_top {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-bottom-style: double;
  border-bottom-width: 6px;
  border-bottom-color: #D5D5D5;
}

#fgmozmnhcz .gt_striped {
  background-color: #F4F4F4;
}

#fgmozmnhcz .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D5D5D5;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D5D5D5;
}

#fgmozmnhcz .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#fgmozmnhcz .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding-top: 4px;
  padding-bottom: 4px;
  padding-left: 5px;
  padding-right: 5px;
}

#fgmozmnhcz .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#fgmozmnhcz .gt_sourcenote {
  font-size: 90%;
  padding-top: 4px;
  padding-bottom: 4px;
  padding-left: 5px;
  padding-right: 5px;
}

#fgmozmnhcz .gt_left {
  text-align: left;
}

#fgmozmnhcz .gt_center {
  text-align: center;
}

#fgmozmnhcz .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#fgmozmnhcz .gt_font_normal {
  font-weight: normal;
}

#fgmozmnhcz .gt_font_bold {
  font-weight: bold;
}

#fgmozmnhcz .gt_font_italic {
  font-style: italic;
}

#fgmozmnhcz .gt_super {
  font-size: 65%;
}

#fgmozmnhcz .gt_footnote_marks {
  font-size: 75%;
  vertical-align: 0.4em;
  position: initial;
}

#fgmozmnhcz .gt_asterisk {
  font-size: 100%;
  vertical-align: 0;
}

#fgmozmnhcz .gt_indent_1 {
  text-indent: 5px;
}

#fgmozmnhcz .gt_indent_2 {
  text-indent: 10px;
}

#fgmozmnhcz .gt_indent_3 {
  text-indent: 15px;
}

#fgmozmnhcz .gt_indent_4 {
  text-indent: 20px;
}

#fgmozmnhcz .gt_indent_5 {
  text-indent: 25px;
}

#fgmozmnhcz .katex-display {
  display: inline-flex !important;
  margin-bottom: 0.75em !important;
}

#fgmozmnhcz div.Reactable > div.rt-table > div.rt-thead > div.rt-tr.rt-tr-group-header > div.rt-th-group:after {
  height: 0px !important;
}
</style>

<table class="gt_table caption-top table table-sm table-striped small" data-quarto-bootstrap="false">
<thead>
<tr class="gt_col_headings header">
<th id="age" class="gt_col_heading gt_columns_bottom_border gt_right" data-quarto-table-cell-role="th" scope="col">age</th>
<th id="gender" class="gt_col_heading gt_columns_bottom_border gt_left" data-quarto-table-cell-role="th" scope="col">gender</th>
<th id="salary" class="gt_col_heading gt_columns_bottom_border gt_right" data-quarto-table-cell-role="th" scope="col">salary</th>
</tr>
</thead>
<tbody class="gt_table_body">
<tr class="odd">
<td class="gt_row gt_right" headers="age">36</td>
<td class="gt_row gt_left" headers="gender">Male</td>
<td class="gt_row gt_right" headers="salary">34410</td>
</tr>
<tr class="even">
<td class="gt_row gt_right" headers="age">47</td>
<td class="gt_row gt_left" headers="gender">Male</td>
<td class="gt_row gt_right" headers="salary">41889</td>
</tr>
<tr class="odd">
<td class="gt_row gt_right" headers="age">29</td>
<td class="gt_row gt_left" headers="gender">Male</td>
<td class="gt_row gt_right" headers="salary">31711</td>
</tr>
<tr class="even">
<td class="gt_row gt_right" headers="age">40</td>
<td class="gt_row gt_left" headers="gender">Female</td>
<td class="gt_row gt_right" headers="salary">18838</td>
</tr>
<tr class="odd">
<td class="gt_row gt_right" headers="age">50</td>
<td class="gt_row gt_left" headers="gender">Male</td>
<td class="gt_row gt_right" headers="salary">30264</td>
</tr>
<tr class="even">
<td class="gt_row gt_right" headers="age">42</td>
<td class="gt_row gt_left" headers="gender">Female</td>
<td class="gt_row gt_right" headers="salary">33879</td>
</tr>
<tr class="odd">
<td class="gt_row gt_right" headers="age">31</td>
<td class="gt_row gt_left" headers="gender">Male</td>
<td class="gt_row gt_right" headers="salary">29416</td>
</tr>
<tr class="even">
<td class="gt_row gt_right" headers="age">29</td>
<td class="gt_row gt_left" headers="gender">Female</td>
<td class="gt_row gt_right" headers="salary">17698</td>
</tr>
<tr class="odd">
<td class="gt_row gt_right" headers="age">47</td>
<td class="gt_row gt_left" headers="gender">Female</td>
<td class="gt_row gt_right" headers="salary">44293</td>
</tr>
<tr class="even">
<td class="gt_row gt_right" headers="age">42</td>
<td class="gt_row gt_left" headers="gender">Female</td>
<td class="gt_row gt_right" headers="salary">50074</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<p>Here we see a sample of the 1,000 rows in our <em>Complete</em> dataset. The data has built-in relationships between salary and age, and between salary and gender.</p>
<p>We can visualise the data for age and salary as density distributions. Below are two function definitions: one generates a density plot for a given variable, and the other compares the age and salary distributions of our complete dataset with those of another dataset.</p>
<div class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' Function to create a density plot</span></span>
<span id="cb4-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' </span></span>
<span id="cb4-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param df Tibble of data</span></span>
<span id="cb4-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param x_val Variable to plot on the x axis</span></span>
<span id="cb4-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param x_label String - Name of the variable in the x axis</span></span>
<span id="cb4-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param summary String - Summary stats (mean and sd) for `x_val`</span></span>
<span id="cb4-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param complete Boolean (default = FALSE) - Is this complete data?</span></span>
<span id="cb4-8">plot_density <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(df, x_val, x_label, summary, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">complete =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>) {</span>
<span id="cb4-9">  </span>
<span id="cb4-10">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># define our fill colour:</span></span>
<span id="cb4-11">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># reference (complete) data is coloured orange,</span></span>
<span id="cb4-12">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># comparison data is coloured blue</span></span>
<span id="cb4-13">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (complete) {</span>
<span id="cb4-14">    fill_colour <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">adjustcolor</span>(</span>
<span id="cb4-15">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">col =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#f9bf07"</span>,</span>
<span id="cb4-16">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha.f =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span></span>
<span id="cb4-17">    )</span>
<span id="cb4-18">  } <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> {</span>
<span id="cb4-19">    fill_colour <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">adjustcolor</span>(</span>
<span id="cb4-20">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">col =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#5881c1"</span>,</span>
<span id="cb4-21">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha.f =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span></span>
<span id="cb4-22">    )</span>
<span id="cb4-23">  }</span>
<span id="cb4-24">  </span>
<span id="cb4-25">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># produce the plot</span></span>
<span id="cb4-26">  p <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> </span>
<span id="cb4-27">    df <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb4-28">    ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> {{x_val}})) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-29">    ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_density</span>(</span>
<span id="cb4-30">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> fill_colour,</span>
<span id="cb4-31">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">outline.type =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"upper"</span></span>
<span id="cb4-32">    ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-33">    ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">title =</span> x_label) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-34">    ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">annotate</span>(</span>
<span id="cb4-35">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">geom =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'text'</span>,</span>
<span id="cb4-36">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">label =</span> summary,</span>
<span id="cb4-37">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">14</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>.pt,</span>
<span id="cb4-38">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(df <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pull</span>({{x_val}}), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>),</span>
<span id="cb4-39">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>,</span>
<span id="cb4-40">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">hjust =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>,</span>
<span id="cb4-41">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">vjust =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span></span>
<span id="cb4-42">    ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-43">    ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">base_size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">18</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-44">    ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(</span>
<span id="cb4-45">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># titles</span></span>
<span id="cb4-46">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">plot.title =</span> ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_text</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">22</span>),</span>
<span id="cb4-47">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># axes</span></span>
<span id="cb4-48">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.ticks =</span> ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>(),</span>
<span id="cb4-49">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.line =</span> ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>(),</span>
<span id="cb4-50">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.title =</span> ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>(),</span>
<span id="cb4-51">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.text.y =</span> ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>(),</span>
<span id="cb4-52">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">axis.text.x =</span> ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_text</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">18</span>),</span>
<span id="cb4-53">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># grid lines</span></span>
<span id="cb4-54">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">panel.grid =</span> ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>(),</span>
<span id="cb4-55">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">panel.border =</span> ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>(),</span>
<span id="cb4-56">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># background</span></span>
<span id="cb4-57">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">plot.background =</span> ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>()</span>
<span id="cb4-58">    )</span>
<span id="cb4-59">  </span>
<span id="cb4-60">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># add thousands suffix to salary outputs</span></span>
<span id="cb4-61">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (x_label <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Salary"</span>) {</span>
<span id="cb4-62">    p <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span></span>
<span id="cb4-63">      p <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb4-64">      ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_x_continuous</span>(</span>
<span id="cb4-65">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">labels =</span> scales<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">label_number</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">suffix =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'k'</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">scale =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e-3</span>)</span>
<span id="cb4-66">      )</span>
<span id="cb4-67">  }</span>
<span id="cb4-68">  </span>
<span id="cb4-69">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># return the plot</span></span>
<span id="cb4-70">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(p)</span>
<span id="cb4-71">}</span>
<span id="cb4-72"></span>
<span id="cb4-73"></span>
<span id="cb4-74"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' Function to compare density distributions</span></span>
<span id="cb4-75"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#'</span></span>
<span id="cb4-76"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param df_complete Tibble - the 'complete' dataset</span></span>
<span id="cb4-77"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param df Tibble - the comparison dataset containing missing or imputed values</span></span>
<span id="cb4-78"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#'</span></span>
<span id="cb4-79"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @returns ggplot2 object</span></span>
<span id="cb4-80">compare_distributions <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(df_complete, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">df =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span>) {</span>
<span id="cb4-81">  </span>
<span id="cb4-82">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># summary stats for age and salary</span></span>
<span id="cb4-83">  complete_summary_age <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> glue<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glue</span>(</span>
<span id="cb4-84">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Mean: {round(mean(df_complete$age, na.rm = TRUE), digits = 2)}</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>,</span>
<span id="cb4-85">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SD: {round(sd(df_complete$age, na.rm = TRUE), digits = 2)}"</span></span>
<span id="cb4-86">  )</span>
<span id="cb4-87">  </span>
<span id="cb4-88">  complete_summary_salary <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> glue<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glue</span>(</span>
<span id="cb4-89">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Mean: {round(mean(df_complete$salary, na.rm = TRUE)/1000, digits = 2)}k</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>,</span>
<span id="cb4-90">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SD: {round(sd(df_complete$salary, na.rm = TRUE)/1000, digits = 2)}k"</span></span>
<span id="cb4-91">  )</span>
<span id="cb4-92">  </span>
<span id="cb4-93">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># get the 'complete' density plots</span></span>
<span id="cb4-94">  complete_age <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span></span>
<span id="cb4-95">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot_density</span>(</span>
<span id="cb4-96">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">df =</span> df_complete,</span>
<span id="cb4-97">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x_val =</span> age,</span>
<span id="cb4-98">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x_label =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Age"</span>,</span>
<span id="cb4-99">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">summary =</span> complete_summary_age,</span>
<span id="cb4-100">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">complete =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span></span>
<span id="cb4-101">    )</span>
<span id="cb4-102">  </span>
<span id="cb4-103">  complete_salary <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span></span>
<span id="cb4-104">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot_density</span>(</span>
<span id="cb4-105">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">df =</span> df_complete,</span>
<span id="cb4-106">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x_val =</span> salary,</span>
<span id="cb4-107">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x_label =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Salary"</span>,</span>
<span id="cb4-108">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">summary =</span> complete_summary_salary,</span>
<span id="cb4-109">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">complete =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span></span>
<span id="cb4-110">    )</span>
<span id="cb4-111">  </span>
<span id="cb4-112">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># check whether a comparison dataset has been provided</span></span>
<span id="cb4-113">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> (<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">missing</span>(df)) {</span>
<span id="cb4-114">    </span>
<span id="cb4-115">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># summary stats for age and salary</span></span>
<span id="cb4-116">    input_summary_age <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> glue<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glue</span>(</span>
<span id="cb4-117">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Mean: {round(mean(df$age, na.rm = TRUE), digits = 2)}</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>,</span>
<span id="cb4-118">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SD: {round(sd(df$age, na.rm = TRUE), digits = 2)}"</span></span>
<span id="cb4-119">    )</span>
<span id="cb4-120">  </span>
<span id="cb4-121">    input_summary_salary <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> glue<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glue</span>(</span>
<span id="cb4-122">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Mean: {round(mean(df$salary, na.rm = TRUE)/1000, digits = 2)}k</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>,</span>
<span id="cb4-123">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SD: {round(sd(df$salary, na.rm = TRUE)/1000, digits = 2)}k"</span></span>
<span id="cb4-124">    )</span>
<span id="cb4-125">    </span>
<span id="cb4-126">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># get the comparison density plots</span></span>
<span id="cb4-127">    comparison_age <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span></span>
<span id="cb4-128">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot_density</span>(</span>
<span id="cb4-129">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">df =</span> df,</span>
<span id="cb4-130">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x_val =</span> age,</span>
<span id="cb4-131">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x_label =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Age"</span>,</span>
<span id="cb4-132">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">summary =</span> input_summary_age,</span>
<span id="cb4-133">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">complete =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span></span>
<span id="cb4-134">      )</span>
<span id="cb4-135">    </span>
<span id="cb4-136">    comparison_salary <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span></span>
<span id="cb4-137">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot_density</span>(</span>
<span id="cb4-138">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">df =</span> df,</span>
<span id="cb4-139">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x_val =</span> salary,</span>
<span id="cb4-140">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x_label =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Salary"</span>,</span>
<span id="cb4-141">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">summary =</span> input_summary_salary,</span>
<span id="cb4-142">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">complete =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span></span>
<span id="cb4-143">      )</span>
<span id="cb4-144">    </span>
<span id="cb4-145">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># combine the charts to compare</span></span>
<span id="cb4-146">    plots <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> </span>
<span id="cb4-147">      patchwork<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">wrap_plots</span>(</span>
<span id="cb4-148">        complete_age, comparison_age,</span>
<span id="cb4-149">        complete_salary, comparison_salary,</span>
<span id="cb4-150">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">nrow =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span></span>
<span id="cb4-151">      )</span>
<span id="cb4-152">  } <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span> {</span>
<span id="cb4-153">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># there is only the complete dataset included, so show that</span></span>
<span id="cb4-154">    plots <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span></span>
<span id="cb4-155">      patchwork<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">wrap_plots</span>(</span>
<span id="cb4-156">        complete_age, complete_salary,</span>
<span id="cb4-157">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">ncol =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb4-158">      )</span>
<span id="cb4-159">  }</span>
<span id="cb4-160">  </span>
<span id="cb4-161">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># return the plots</span></span>
<span id="cb4-162">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(plots)</span>
<span id="cb4-163">  </span>
<span id="cb4-164">}</span></code></pre></div></div>
</details>
</div>
<p>We’ll see examples comparing datasets later, but for now let’s examine how <code>age</code> and <code>salary</code> are distributed in our pristine <code>complete</code> dataset.</p>
<div class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">compare_distributions</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">df_complete =</span> df_complete)</span></code></pre></div></div>
</details>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-07-28_imputing-data/index_files/figure-html/unnamed-chunk-3-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>The average (mean) age of people in our dataset is 39.91 years with a standard deviation (SD) of 6.24 years. This indicates that most individuals are clustered around this average age, but there is some variability, with ages ranging from 18 to 65 years.</p>
<p>The average (mean) salary is £39,400 with a SD of £7,410. There is a notable spread in salaries, influenced by factors such as age and gender.</p>
<p>While we have established these averages, it’s important to note that neither the age nor salary distribution is perfectly smooth. Several factors contribute to the irregularities observed in the density plots:</p>
<p><strong>Randomness in data generation:</strong> The use of a Poisson distribution for age generation introduces inherent randomness. This randomness can lead to fluctuations in the density curve, resulting in peaks and troughs that may not represent a perfectly normal distribution.</p>
<p><strong>Relationships between salary, age and gender:</strong> The salary calculations were influenced by both age and gender, creating an ‘n’-shaped pattern in which salaries peak for individuals aged 40 to 49 years. The 7% gender pay gap further complicates the distribution, as it introduces additional variability based on gender.</p>
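<p>The effect of Poisson-generated ages can be sketched in a few lines of R. This is a minimal illustration rather than the code used to build our dataset: the rate <code>lambda = 40</code> is an assumption chosen to sit near the observed mean, and the seed is arbitrary. For a Poisson distribution the variance equals the mean, so the SD should land near the square root of the rate.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># illustrative only: lambda = 40 is an assumed rate, not taken from the post
set.seed(123)
ages &lt;- rpois(n = 1000, lambda = 40)

# for a Poisson distribution the variance equals the mean,
# so the SD should be close to sqrt(40)
mean(ages)
sd(ages)
sqrt(40)

# counts per integer age: the uneven counts are what produce
# the small peaks and troughs in a density plot
head(table(ages))</code></pre></div>
</div>
<p>Because ages are whole numbers drawn at random, neighbouring ages rarely have identical counts, which is one reason the density curve is not perfectly smooth.</p>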
<p>These plots represent the complete dataset, showcasing all the complexities we introduced during the data generation process. In many ways, they illustrate an idealised version of data - one that is fully complete, with no missing values and clear relationships between variables. This is often referred to as the “unknowable truth” of the data, as it reflects a perfect scenario that analysts rarely experience in practice.</p>
</section>
<section id="introducing-missingness-in-our-dataset" class="level3">
<h3 class="anchored" data-anchor-id="introducing-missingness-in-our-dataset">Introducing missingness in our dataset</h3>
<p>We will now introduce missingness into our complete dataset, creating a new dataset that reflects a more realistic situation. Specifically, we will implement a Missing At Random (MAR) mechanism for age and salary, where the probability that a value is missing depends on other observed variables.</p>
<p>We will remove 100 values (10% of the dataset) from each variable based on the following criteria:</p>
<p><strong>Age missingness:</strong> we will randomly select 100 individuals whose salary is above £45,000 and set their age to missing.</p>
<p><strong>Salary missingness:</strong> we will randomly select 100 individuals whose age is below 40 years and set their salary to missing.</p>
<p><strong>Gender missingness:</strong> we will randomly select 100 individuals to have their gender value set to missing, independent of other variables (that is, missing completely at random).</p>
<div class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># take a copy of the complete data</span></span>
<span id="cb6-2">df_missing <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> df_complete</span>
<span id="cb6-3">seed <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb6-4"></span>
<span id="cb6-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># age - where salary &gt; 45000</span></span>
<span id="cb6-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(seed) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># for reproducibility</span></span>
<span id="cb6-7">df_missing<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>age[</span>
<span id="cb6-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">which</span>(df_complete<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>salary <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">45000</span>), </span>
<span id="cb6-9">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, </span>
<span id="cb6-10">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">replace =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>)] <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span></span>
<span id="cb6-11">  </span>
<span id="cb6-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># salary - where age &lt; 40</span></span>
<span id="cb6-13"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(seed) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># for reproducibility</span></span>
<span id="cb6-14">df_missing<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>salary[</span>
<span id="cb6-15">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">which</span>(df_complete<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>age <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">40</span>), </span>
<span id="cb6-16">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, </span>
<span id="cb6-17">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">replace =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>)] <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span></span>
<span id="cb6-18"></span>
<span id="cb6-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># gender - 100 randomly</span></span>
<span id="cb6-20"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(seed) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># for reproducibility</span></span>
<span id="cb6-21">df_missing<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>gender[</span>
<span id="cb6-22">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>rows, </span>
<span id="cb6-23">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, </span>
<span id="cb6-24">         <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">replace =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>)] <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span></span>
<span id="cb6-25"></span>
<span id="cb6-26"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># see a sample</span></span>
<span id="cb6-27">missing_cell_style <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> gt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cell_fill</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#ec6555"</span>)</span>
<span id="cb6-28">df_missing <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb6-29">  dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">slice_head</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb6-30">  gt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gt</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb6-31">  gt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">opt_stylize</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">style =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">add_row_striping =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb6-32">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># highlight cells with missing values</span></span>
<span id="cb6-33">  gt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tab_style</span>(</span>
<span id="cb6-34">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">style =</span> missing_cell_style,</span>
<span id="cb6-35">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">locations =</span> gt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cells_body</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">columns =</span> age, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">rows =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(age))</span>
<span id="cb6-36">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb6-37">  gt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tab_style</span>(</span>
<span id="cb6-38">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">style =</span> missing_cell_style,</span>
<span id="cb6-39">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">locations =</span> gt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cells_body</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">columns =</span> gender, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">rows =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(gender))</span>
<span id="cb6-40">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb6-41">  gt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tab_style</span>(</span>
<span id="cb6-42">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">style =</span> missing_cell_style,</span>
<span id="cb6-43">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">locations =</span> gt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cells_body</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">columns =</span> salary, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">rows =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(salary))</span>
<span id="cb6-44">  )</span></code></pre></div></div>
</details>
<div class="cell-output-display">
<div id="xlizkwkczz" style="padding-left:0px;padding-right:0px;padding-top:10px;padding-bottom:10px;overflow-x:auto;overflow-y:auto;width:auto;height:auto;">
<style>#xlizkwkczz table {
  font-family: system-ui, 'Segoe UI', Roboto, Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol', 'Noto Color Emoji';
  -webkit-font-smoothing: antialiased;
  -moz-osx-font-smoothing: grayscale;
}

#xlizkwkczz thead, #xlizkwkczz tbody, #xlizkwkczz tfoot, #xlizkwkczz tr, #xlizkwkczz td, #xlizkwkczz th {
  border-style: none;
}

#xlizkwkczz p {
  margin: 0;
  padding: 0;
}

#xlizkwkczz .gt_table {
  display: table;
  border-collapse: collapse;
  line-height: normal;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 3px;
  border-top-color: #D5D5D5;
  border-right-style: solid;
  border-right-width: 3px;
  border-right-color: #D5D5D5;
  border-bottom-style: solid;
  border-bottom-width: 3px;
  border-bottom-color: #D5D5D5;
  border-left-style: solid;
  border-left-width: 3px;
  border-left-color: #D5D5D5;
}

#xlizkwkczz .gt_caption {
  padding-top: 4px;
  padding-bottom: 4px;
}

#xlizkwkczz .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  padding-left: 5px;
  padding-right: 5px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#xlizkwkczz .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 3px;
  padding-bottom: 5px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#xlizkwkczz .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#xlizkwkczz .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D5D5D5;
}

#xlizkwkczz .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D5D5D5;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D5D5D5;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#xlizkwkczz .gt_col_heading {
  color: #FFFFFF;
  background-color: #004D80;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#xlizkwkczz .gt_column_spanner_outer {
  color: #FFFFFF;
  background-color: #004D80;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#xlizkwkczz .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#xlizkwkczz .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#xlizkwkczz .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D5D5D5;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 5px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#xlizkwkczz .gt_spanner_row {
  border-bottom-style: hidden;
}

#xlizkwkczz .gt_group_heading {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D5D5D5;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D5D5D5;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  text-align: left;
}

#xlizkwkczz .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D5D5D5;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D5D5D5;
  vertical-align: middle;
}

#xlizkwkczz .gt_from_md > :first-child {
  margin-top: 0;
}

#xlizkwkczz .gt_from_md > :last-child {
  margin-bottom: 0;
}

#xlizkwkczz .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D5D5D5;
  border-left-style: solid;
  border-left-width: 1px;
  border-left-color: #D5D5D5;
  border-right-style: solid;
  border-right-width: 1px;
  border-right-color: #D5D5D5;
  vertical-align: middle;
  overflow-x: hidden;
}

#xlizkwkczz .gt_stub {
  color: #FFFFFF;
  background-color: #929292;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D5D5D5;
  padding-left: 5px;
  padding-right: 5px;
}

#xlizkwkczz .gt_stub_row_group {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 5px;
  padding-right: 5px;
  vertical-align: top;
}

#xlizkwkczz .gt_row_group_first td {
  border-top-width: 2px;
}

#xlizkwkczz .gt_row_group_first th {
  border-top-width: 2px;
}

#xlizkwkczz .gt_summary_row {
  color: #FFFFFF;
  background-color: #5F5F5F;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#xlizkwkczz .gt_first_summary_row {
  border-top-style: solid;
  border-top-color: #D5D5D5;
}

#xlizkwkczz .gt_first_summary_row.thick {
  border-top-width: 2px;
}

#xlizkwkczz .gt_last_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D5D5D5;
}

#xlizkwkczz .gt_grand_summary_row {
  color: #FFFFFF;
  background-color: #929292;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#xlizkwkczz .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D5D5D5;
}

#xlizkwkczz .gt_last_grand_summary_row_top {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-bottom-style: double;
  border-bottom-width: 6px;
  border-bottom-color: #D5D5D5;
}

#xlizkwkczz .gt_striped {
  background-color: #F4F4F4;
}

#xlizkwkczz .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D5D5D5;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D5D5D5;
}

#xlizkwkczz .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#xlizkwkczz .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding-top: 4px;
  padding-bottom: 4px;
  padding-left: 5px;
  padding-right: 5px;
}

#xlizkwkczz .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#xlizkwkczz .gt_sourcenote {
  font-size: 90%;
  padding-top: 4px;
  padding-bottom: 4px;
  padding-left: 5px;
  padding-right: 5px;
}

#xlizkwkczz .gt_left {
  text-align: left;
}

#xlizkwkczz .gt_center {
  text-align: center;
}

#xlizkwkczz .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#xlizkwkczz .gt_font_normal {
  font-weight: normal;
}

#xlizkwkczz .gt_font_bold {
  font-weight: bold;
}

#xlizkwkczz .gt_font_italic {
  font-style: italic;
}

#xlizkwkczz .gt_super {
  font-size: 65%;
}

#xlizkwkczz .gt_footnote_marks {
  font-size: 75%;
  vertical-align: 0.4em;
  position: initial;
}

#xlizkwkczz .gt_asterisk {
  font-size: 100%;
  vertical-align: 0;
}

#xlizkwkczz .gt_indent_1 {
  text-indent: 5px;
}

#xlizkwkczz .gt_indent_2 {
  text-indent: 10px;
}

#xlizkwkczz .gt_indent_3 {
  text-indent: 15px;
}

#xlizkwkczz .gt_indent_4 {
  text-indent: 20px;
}

#xlizkwkczz .gt_indent_5 {
  text-indent: 25px;
}

#xlizkwkczz .katex-display {
  display: inline-flex !important;
  margin-bottom: 0.75em !important;
}

#xlizkwkczz div.Reactable > div.rt-table > div.rt-thead > div.rt-tr.rt-tr-group-header > div.rt-th-group:after {
  height: 0px !important;
}
</style>

<table class="gt_table caption-top table table-sm table-striped small" data-quarto-bootstrap="false">
<thead>
<tr class="gt_col_headings header">
<th id="age" class="gt_col_heading gt_columns_bottom_border gt_right" data-quarto-table-cell-role="th" scope="col">age</th>
<th id="gender" class="gt_col_heading gt_columns_bottom_border gt_left" data-quarto-table-cell-role="th" scope="col">gender</th>
<th id="salary" class="gt_col_heading gt_columns_bottom_border gt_right" data-quarto-table-cell-role="th" scope="col">salary</th>
</tr>
</thead>
<tbody class="gt_table_body">
<tr class="odd">
<td class="gt_row gt_right" headers="age">36</td>
<td class="gt_row gt_left" headers="gender">Male</td>
<td class="gt_row gt_right" headers="salary">34410</td>
</tr>
<tr class="even">
<td class="gt_row gt_right" headers="age">47</td>
<td class="gt_row gt_left" headers="gender">Male</td>
<td class="gt_row gt_right" headers="salary">41889</td>
</tr>
<tr class="odd">
<td class="gt_row gt_right" headers="age">29</td>
<td class="gt_row gt_left" headers="gender">Male</td>
<td class="gt_row gt_right" headers="salary">31711</td>
</tr>
<tr class="even">
<td class="gt_row gt_right" headers="age">40</td>
<td class="gt_row gt_left" headers="gender">Female</td>
<td class="gt_row gt_right" headers="salary">18838</td>
</tr>
<tr class="odd">
<td class="gt_row gt_right" headers="age">50</td>
<td class="gt_row gt_left" headers="gender">Male</td>
<td class="gt_row gt_right" headers="salary">30264</td>
</tr>
<tr class="even">
<td class="gt_row gt_right" headers="age">42</td>
<td class="gt_row gt_left" headers="gender">Female</td>
<td class="gt_row gt_right" headers="salary">33879</td>
</tr>
<tr class="odd">
<td class="gt_row gt_right" headers="age">31</td>
<td class="gt_row gt_left" headers="gender" style="background-color: #EC6555">NA</td>
<td class="gt_row gt_right" headers="salary">29416</td>
</tr>
<tr class="even">
<td class="gt_row gt_right" headers="age">29</td>
<td class="gt_row gt_left" headers="gender">Female</td>
<td class="gt_row gt_right" headers="salary" style="background-color: #EC6555">NA</td>
</tr>
<tr class="odd">
<td class="gt_row gt_right" headers="age">47</td>
<td class="gt_row gt_left" headers="gender">Female</td>
<td class="gt_row gt_right" headers="salary">44293</td>
</tr>
<tr class="even">
<td class="gt_row gt_right" headers="age">42</td>
<td class="gt_row gt_left" headers="gender">Female</td>
<td class="gt_row gt_right" headers="salary">50074</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<p>Here we can see a sample from our dataset with missingness introduced. Two missing values are shown here, highlighted in red.</p>
<p>We will now explore techniques for handling this missingness.</p>
</section>
<section id="technique-1---complete-case-analysis" class="level3">
<h3 class="anchored" data-anchor-id="technique-1---complete-case-analysis">Technique 1 - complete-case analysis</h3>
<p>The complete-case analysis approach, also known as listwise deletion, involves removing any cases that have missing values. This means only observations with complete data for age, gender and salary are used in the analysis.</p>
<div class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># remove records containing missing values</span></span>
<span id="cb7-2">df_removal <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span></span>
<span id="cb7-3">  df_missing <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb7-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">na.omit</span>()</span>
<span id="cb7-5"></span>
<span id="cb7-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># see the result</span></span>
<span id="cb7-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">compare_distributions</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">df_complete =</span> df_complete, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">df =</span> df_removal)</span></code></pre></div></div>
</details>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-07-28_imputing-data/index_files/figure-html/unnamed-chunk-5-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>Here we see the original dataset (left, yellow) compared with the newly created dataset (right, blue) produced by omitting records that contain one or more missing values.</p>
<p>This approach has a significant impact on the shape of the distributions, especially for salary. In the newly created dataset, there is a noticeable gap or ‘chunk’ missing from the right-hand side of the salary distribution. Additionally, the salary standard deviation decreases, indicating there is less variability in the reduced dataset than we know existed in the original complete data.</p>
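<p>As a minimal sketch of what listwise deletion does to the row count, the base R example below uses a hypothetical five-row data frame (<code>toy</code>, not the dataset used in this post) and <code>complete.cases()</code> to flag and count the rows that would be dropped.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># hypothetical toy data (an assumption for illustration, not the post's dataset)
toy &lt;- data.frame(
  age = c(36, 47, NA, 40, 50),
  salary = c(34410, NA, 31711, 18838, 30264)
)

# flag rows that have no missing value in any column
is_complete &lt;- complete.cases(toy)

# listwise deletion keeps only those rows (the same rows na.omit() would keep)
toy_complete &lt;- toy[is_complete, ]

# number of rows lost to deletion
n_dropped &lt;- sum(!is_complete)</code></pre></div>
<p>In this toy example only two of the ten cells are missing, yet two of the five rows are removed, so the observed values in those rows are discarded as well.</p>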
</section>
<section id="technique-2---substituting-averages" class="level3">
<h3 class="anchored" data-anchor-id="technique-2---substituting-averages">Technique 2 - substituting averages</h3>
<p>This technique involves replacing missing values with the mean of the observed values for age and salary.</p>
<div class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1">df_average <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> </span>
<span id="cb8-2">  df_missing <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb8-3">  dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(</span>
<span id="cb8-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">age =</span> dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">case_match</span>(</span>
<span id="cb8-5">      age,</span>
<span id="cb8-6">      <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(age, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>),</span>
<span id="cb8-7">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.default =</span> age</span>
<span id="cb8-8">    ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb8-9">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">digits =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>),</span>
<span id="cb8-10">    </span>
<span id="cb8-11">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">salary =</span> dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">case_match</span>(</span>
<span id="cb8-12">      salary,</span>
<span id="cb8-13">      <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(salary, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>),</span>
<span id="cb8-14">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.default =</span> salary</span>
<span id="cb8-15">    ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb8-16">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">digits =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb8-17">  )</span>
<span id="cb8-18"></span>
<span id="cb8-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># see the result</span></span>
<span id="cb8-20"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">compare_distributions</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">df_complete =</span> df_complete, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">df =</span> df_average)</span></code></pre></div></div>
</details>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-07-28_imputing-data/index_files/figure-html/unnamed-chunk-6-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>This technique produces unusual distributions, characterised by noticeable spikes in the central region of the newly created data (right, blue), which result from substituting all missing values with the average. While this method does not significantly alter the mean values, it does lead to much smaller standard deviations for both age and salary. This reduction indicates there is considerably less variability in the imputed data compared to the original dataset.</p>
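<p>A small worked example may help show why the standard deviation shrinks. The vector below is hypothetical (it is not taken from the post's dataset); the sketch fills its gaps with the observed mean and compares the standard deviation before and after.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># hypothetical salaries with two missing values (not the post's data)
salary_toy &lt;- c(20000, 30000, NA, 40000, NA, 50000)

# mean of the four observed values
observed_mean &lt;- mean(salary_toy, na.rm = TRUE)

# replace each NA with the observed mean
salary_filled &lt;- ifelse(is.na(salary_toy), observed_mean, salary_toy)

# the mean is unchanged, but the SD shrinks: the filled-in values add
# nothing to the sum of squared deviations while n grows from 4 to 6
sd_observed &lt;- sd(salary_toy, na.rm = TRUE)
sd_filled &lt;- sd(salary_filled)</code></pre></div>
<p>More generally, when <code>n_obs</code> of <code>n</code> values are observed, filling the rest with the observed mean scales the sample standard deviation by <code>sqrt((n_obs - 1) / (n - 1))</code>, which is <code>sqrt(3 / 5)</code> in this toy case.</p>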
</section>
<section id="technique-3---linear-regression" class="level3">
<h3 class="anchored" data-anchor-id="technique-3---linear-regression">Technique 3 - linear regression</h3>
<p>In this technique, each missing value is imputed from a linear regression on the other numeric variable: salary is imputed from its relationship with age, and age is imputed from its relationship with salary. The models are fitted on the complete records only, and the predictions use the mean-imputed data from Technique 2 as inputs so that every record has a predictor value.</p>
<div class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># model age and salary</span></span>
<span id="cb9-2">model_age <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glm</span>(</span>
<span id="cb9-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> df_missing <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">na.omit</span>(),</span>
<span id="cb9-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">formula =</span> age <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> salary,</span>
<span id="cb9-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.action =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"na.exclude"</span></span>
<span id="cb9-6">)</span>
<span id="cb9-7"></span>
<span id="cb9-8">model_salary <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glm</span>(</span>
<span id="cb9-9">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> df_missing <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">na.omit</span>(),</span>
<span id="cb9-10">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">formula =</span> salary <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> age,</span>
<span id="cb9-11">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.action =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"na.exclude"</span></span>
<span id="cb9-12">)</span>
<span id="cb9-13"></span>
<span id="cb9-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># fill in details using regression</span></span>
<span id="cb9-15">df_regression <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span></span>
<span id="cb9-16">  df_missing <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb9-17">  dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(</span>
<span id="cb9-18">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">age_pred =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">predict</span>(</span>
<span id="cb9-19">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">object =</span> model_age,</span>
<span id="cb9-20">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">newdata =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb9-21">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">gender =</span> df_average<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>gender, </span>
<span id="cb9-22">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">salary =</span> df_average<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>salary)</span>
<span id="cb9-23">    ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb9-24">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">digits =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>),</span>
<span id="cb9-25">    </span>
<span id="cb9-26">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">salary_pred =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">predict</span>(</span>
<span id="cb9-27">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">object =</span> model_salary,</span>
<span id="cb9-28">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">newdata =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(</span>
<span id="cb9-29">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">gender =</span> df_average<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>gender, </span>
<span id="cb9-30">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">age =</span> df_average<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>age)</span>
<span id="cb9-31">    ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb9-32">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">digits =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>),</span>
<span id="cb9-33">    </span>
<span id="cb9-34">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">age =</span> dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coalesce</span>(age, age_pred),</span>
<span id="cb9-35">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">salary =</span> dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coalesce</span>(salary, salary_pred)</span>
<span id="cb9-36">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb9-37">  dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(age_pred, salary_pred))</span>
<span id="cb9-38"></span>
<span id="cb9-39"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># see the result</span></span>
<span id="cb9-40"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">compare_distributions</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">df_complete =</span> df_complete, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">df =</span> df_regression)</span></code></pre></div></div>
</details>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-07-28_imputing-data/index_files/figure-html/unnamed-chunk-7-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>This technique also produces unusual distributions, characterised by noticeable spikes in the new dataset (right, blue). Both the imputed age and imputed salary exhibit much smaller standard deviations compared to the original dataset, indicating reduced variability in the imputed data.</p>
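<p>One reason regression imputation understates variability is that every imputed value sits exactly on the fitted line and so carries none of the scatter seen in the observed data. The sketch below illustrates this with simulated data; the data frame <code>toy</code>, the simulated relationship and the choice of missing rows are all assumptions made for illustration, and it uses <code>lm()</code> rather than the <code>glm()</code> call above.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># simulate a hypothetical age/salary relationship (illustration only)
set.seed(42)
toy &lt;- data.frame(age = round(runif(50, min = 20, max = 60)))
toy$salary &lt;- 1000 * toy$age + rnorm(50, sd = 5000)

# remove salary for the first ten rows
missing_rows &lt;- 1:10
toy$salary[missing_rows] &lt;- NA

# fit on the complete rows (lm() drops rows with NA by default)
fit &lt;- lm(salary ~ age, data = toy)

# predict salary for every row, then fill only the gaps
pred &lt;- predict(fit, newdata = toy)
toy$salary_filled &lt;- ifelse(is.na(toy$salary), pred, toy$salary)

# imputed rows have zero residual; observed rows keep their scatter
resid_imputed &lt;- toy$salary_filled[missing_rows] - pred[missing_rows]
resid_observed &lt;- toy$salary_filled[-missing_rows] - pred[-missing_rows]</code></pre></div>
<p>Comparing <code>sd(resid_observed)</code> with <code>sd(resid_imputed)</code> shows that the imputed rows contribute no residual variation at all.</p>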
</section>
<section id="technique-4---multivariate-imputation-by-chained-equations-mice" class="level3">
<h3 class="anchored" data-anchor-id="technique-4---multivariate-imputation-by-chained-equations-mice">Technique 4 - Multivariate Imputation by Chained Equations (MICE)</h3>
<p>The MICE algorithm imputes missing values iteratively using a series of regression models. It initialises each missing value with a starting estimate. Then, for each variable with missing values, it fits a regression model on the observed values using the other variables as predictors, and uses that model to predict the missing values. This cycle is repeated for each variable in turn, updating the estimates at every iteration. The whole procedure is run several times to create multiple imputed datasets, and analyses run on these are usually pooled to obtain a single, final estimate. For the visual comparison here we use one completed dataset: <code>mice::complete()</code> returns the first imputed dataset by default.</p>
<div class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># let {mice} suggest an imputation method for each variable</span></span>
<span id="cb10-2">init <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> mice<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mice</span>(df_missing, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">maxit =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb10-3"></span>
<span id="cb10-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># calculate the imputed values</span></span>
<span id="cb10-5">imputed <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> mice<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mice</span>(</span>
<span id="cb10-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> df_missing,          <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># data to be used</span></span>
<span id="cb10-7">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">m =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>,                     <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># number of multiple imputations (default = 5)</span></span>
<span id="cb10-8">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">method =</span> init<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>method,       <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># imputation method per variable, as suggested by {mice}</span></span>
<span id="cb10-9">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">seed =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">123</span>,                 <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># for reproducibility</span></span>
<span id="cb10-10">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">maxit =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>,                 <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 10 iterations (default = 5)</span></span>
<span id="cb10-11">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">printFlag =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span>           <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># don't print history to the console</span></span>
<span id="cb10-12">)</span>
<span id="cb10-13"></span>
<span id="cb10-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># complete the data</span></span>
<span id="cb10-15">df_mice <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> mice<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">complete</span>(imputed)</span>
<span id="cb10-16"></span>
<span id="cb10-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># see the result</span></span>
<span id="cb10-18"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">compare_distributions</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">df_complete =</span> df_complete, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">df =</span> df_mice)</span></code></pre></div></div>
</details>
<div class="cell-output-display">
<div>
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-07-28_imputing-data/index_files/figure-html/unnamed-chunk-8-1.png" class="img-fluid figure-img" width="672"></p>
</figure>
</div>
</div>
</div>
<p>Here we see the original dataset (left, yellow) compared with the newly created dataset (right, blue) produced by multiple imputation using the MICE algorithm.</p>
<p>While this approach is not without its imperfections, it provides significantly better estimates for the missing data compared to previous techniques. As a result, the density distributions of the imputed dataset resemble those of the original dataset, indicating a more accurate representation of the underlying data.</p>
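<p>A <code>mice</code> result holds all of the imputed datasets, and <code>mice::complete()</code> chooses which of them to return through its <code>action</code> argument. The sketch below is illustrative only: it uses the <code>nhanes</code> example data shipped with {mice} rather than this post's dataset, and the values of <code>m</code> and <code>maxit</code> are arbitrary small numbers chosen to keep the run short.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># small illustrative run on the nhanes example data shipped with {mice}
imp &lt;- mice::mice(
  data = mice::nhanes,
  m = 3,              # three imputed datasets (arbitrary, for illustration)
  maxit = 2,          # two iterations (arbitrary, for illustration)
  seed = 123,
  printFlag = FALSE
)

# default behaviour: return the first imputed dataset only
first_default &lt;- mice::complete(imp)
first_explicit &lt;- mice::complete(imp, action = 1)

# all imputed datasets stacked in long format, indexed by the .imp column
all_long &lt;- mice::complete(imp, action = "long")</code></pre></div>
<p>The long format makes it possible to compare the imputed datasets with one another, or to summarise across them, instead of relying on a single completed dataset.</p>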
</section>
<section id="technique-review" class="level3">
<h3 class="anchored" data-anchor-id="technique-review">Technique review</h3>
<p>We have explored four techniques for handling missing data: complete-case analysis, mean imputation, linear regression and multiple imputation using the MICE algorithm. Now, let’s consolidate our findings to see how these techniques compare in terms of their effectiveness.</p>
<p>In the tables below, we compile the mean and standard deviation (SD) values for age and salary from each technique, along with their absolute differences from the original dataset’s mean and standard deviation. This comparison highlights how well each method handles the missing data.</p>
<section id="mice-performs-best-at-imputing-age" class="level4">
<h4 class="anchored" data-anchor-id="mice-performs-best-at-imputing-age">MICE performs best at imputing age</h4>
<p>The MICE algorithm performed the best at imputing missing age data, providing mean and standard deviation values that are the closest overall match to the original dataset. While it is not perfect, it significantly improves the accuracy of the imputed values.</p>
<p>Both the mean imputation and linear regression techniques performed reasonably well in estimating the average value of age. However, they resulted in significantly different standard deviations compared to the original dataset. This discrepancy indicates that the resulting distributions were altered, as observed in our earlier plot comparisons.</p>
<div class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># list the data showing different imputation approaches</span></span>
<span id="cb11-2">data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> </span>
<span id="cb11-3">  tibble<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tribble</span>(</span>
<span id="cb11-4">    <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>df, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>df_name,</span>
<span id="cb11-5">    df_complete, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Original"</span>,</span>
<span id="cb11-6">    df_removal, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Complete-case"</span>,</span>
<span id="cb11-7">    df_average, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Average"</span>,</span>
<span id="cb11-8">    df_regression, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Regression"</span>,</span>
<span id="cb11-9">    df_mice, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"MICE"</span></span>
<span id="cb11-10">  )</span>
<span id="cb11-11"></span>
<span id="cb11-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># summarise the results to a single df</span></span>
<span id="cb11-13">df_summary_test <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span></span>
<span id="cb11-14">  purrr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map2_dfr</span>(</span>
<span id="cb11-15">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.x =</span> data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>df,</span>
<span id="cb11-16">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.y =</span> data<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>df_name,</span>
<span id="cb11-17">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.f =</span> \(.x, .y) {</span>
<span id="cb11-18">      .x <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb11-19">        dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(</span>
<span id="cb11-20">          <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">set =</span> .y,</span>
<span id="cb11-21">          <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">age_mean =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(age),</span>
<span id="cb11-22">          <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">age_sd =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sd</span>(age),</span>
<span id="cb11-23">          <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">salary_mean =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(salary),</span>
<span id="cb11-24">          <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">salary_sd =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sd</span>(salary)</span>
<span id="cb11-25">        )</span>
<span id="cb11-26">    }</span>
<span id="cb11-27">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb11-28">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># split age and salary to separate rows</span></span>
<span id="cb11-29">  tidyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pivot_longer</span>(</span>
<span id="cb11-30">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cols =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>set,</span>
<span id="cb11-31">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">names_to =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"measure"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">".value"</span>),</span>
<span id="cb11-32">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">names_pattern =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"(.*)_(.*)"</span></span>
<span id="cb11-33">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb11-34">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># convert salary to thousands to put on the same scale as age</span></span>
<span id="cb11-35">  dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(</span>
<span id="cb11-36">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">case_when</span>(</span>
<span id="cb11-37">      measure <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"salary"</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> mean <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e-3</span>,</span>
<span id="cb11-38">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.default =</span> mean</span>
<span id="cb11-39">    ),</span>
<span id="cb11-40">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">case_when</span>(</span>
<span id="cb11-41">      measure <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"salary"</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> sd <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e-3</span>,</span>
<span id="cb11-42">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.default =</span> sd</span>
<span id="cb11-43">    )</span>
<span id="cb11-44">  )</span>
<span id="cb11-45"></span>
<span id="cb11-46"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># define a function to display results in a formatted table</span></span>
<span id="cb11-47"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' Produce a summary table of results for a given measure</span></span>
<span id="cb11-48"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#'</span></span>
<span id="cb11-49"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param df Tibble containing summary results from each imputation approach</span></span>
<span id="cb11-50"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param .measure String identifying the measure to summarise (either 'age' or 'salary')</span></span>
<span id="cb11-51"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#'</span></span>
<span id="cb11-52"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @returns gt table</span></span>
<span id="cb11-53">imputation_summary_table <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(df, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.measure =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"age"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"salary"</span>)) {</span>
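<span id="cb11-54">  .measure <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">match.arg</span>(.measure) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ensure a single, valid measure is used</span></span>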
<span id="cb11-55">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># get the original data as reference</span></span>
<span id="cb11-56">  df_original <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span></span>
<span id="cb11-57">    df <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb11-58">    dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(measure <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> .measure, set <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Original"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb11-59">    dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(mean, sd)</span>
<span id="cb11-60">  </span>
<span id="cb11-61">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># filter the data for the specified measure and work out differences</span></span>
<span id="cb11-62">  df_summary <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span></span>
<span id="cb11-63">    df <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb11-64">    dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(measure <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> .measure) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb11-65">    dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(</span>
<span id="cb11-66">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># add the 'original' mean and sd values to each row</span></span>
<span id="cb11-67">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean_original =</span> df_original<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>mean,</span>
<span id="cb11-68">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd_original =</span> df_original<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>sd,</span>
<span id="cb11-69">      </span>
<span id="cb11-70">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># work out differences between the imputation and original</span></span>
<span id="cb11-71">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean_difference =</span> dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">case_when</span>(</span>
<span id="cb11-72">        set <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Original"</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span>, <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># leave original blank</span></span>
<span id="cb11-73">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.default =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">abs</span>(mean <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> mean_original)</span>
<span id="cb11-74">      ),</span>
<span id="cb11-75">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd_difference =</span> dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">case_when</span>(</span>
<span id="cb11-76">        set <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Original"</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span>, <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># leave original blank</span></span>
<span id="cb11-77">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.default =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">abs</span>(sd <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> sd_original)</span>
<span id="cb11-78">      ),</span>
<span id="cb11-79">      </span>
<span id="cb11-80">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># determine which approach gives the overall lowest difference</span></span>
<span id="cb11-81">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">overall_difference =</span> mean_difference <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> sd_difference,</span>
<span id="cb11-82">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">trophy =</span> dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">case_when</span>(</span>
<span id="cb11-83">        overall_difference <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">min</span>(overall_difference, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"🏆"</span></span>
<span id="cb11-84">      )</span>
<span id="cb11-85">    )</span>
<span id="cb11-86">  </span>
<span id="cb11-87">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># display as a formatted table</span></span>
<span id="cb11-88">  tab <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> </span>
<span id="cb11-89">    df_summary <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb11-90">    gt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gt</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb11-91">    gt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tab_options</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">quarto.disable_processing =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">TRUE</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb11-92">    gt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fmt_number</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">decimals =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb11-93">    gt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cols_hide</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">columns =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(</span>
<span id="cb11-94">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"measure"</span>, </span>
<span id="cb11-95">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"mean_original"</span>, </span>
<span id="cb11-96">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sd_original"</span></span>
<span id="cb11-97">    )) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb11-98">    gt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sub_missing</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">missing_text =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb11-99">    gt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data_color</span>(</span>
<span id="cb11-100">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">columns =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"mean_difference"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sd_difference"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"overall_difference"</span>),</span>
<span id="cb11-101">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">palette =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#5881c1"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#dff9fb"</span>),</span>
<span id="cb11-102">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na_color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"white"</span></span>
<span id="cb11-103">    ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb11-104">    gt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cols_label</span>(</span>
<span id="cb11-105">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">set =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Set"</span>,</span>
<span id="cb11-106">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Mean"</span>,</span>
<span id="cb11-107">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SD"</span>,</span>
<span id="cb11-108">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">mean_difference =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Mean"</span>,</span>
<span id="cb11-109">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sd_difference =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SD"</span>,</span>
<span id="cb11-110">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">overall_difference =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Overall"</span>,</span>
<span id="cb11-111">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">trophy =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Trophy"</span></span>
<span id="cb11-112">    ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb11-113">    gt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tab_spanner</span>(</span>
<span id="cb11-114">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">columns =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(mean_difference, sd_difference, overall_difference),</span>
<span id="cb11-115">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">label =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Difference from 'Original'"</span></span>
<span id="cb11-116">    ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> </span>
<span id="cb11-117">    gt<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">cols_align</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">columns =</span> trophy, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">align =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"center"</span>)</span>
<span id="cb11-118">  </span>
<span id="cb11-119">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(tab)</span>
<span id="cb11-120">}</span>
<span id="cb11-121"></span>
<span id="cb11-122"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># summarise the approaches for 'age'</span></span>
<span id="cb11-123"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">imputation_summary_table</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">df =</span> df_summary_test, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.measure =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"age"</span>)</span></code></pre></div></div>
</details>
<div class="cell-output-display">
<div id="zucvoulpeb" style="padding-left:0px;padding-right:0px;padding-top:10px;padding-bottom:10px;overflow-x:auto;overflow-y:auto;width:auto;height:auto;">
<style>#zucvoulpeb table {
  font-family: system-ui, 'Segoe UI', Roboto, Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol', 'Noto Color Emoji';
  -webkit-font-smoothing: antialiased;
  -moz-osx-font-smoothing: grayscale;
}

#zucvoulpeb thead, #zucvoulpeb tbody, #zucvoulpeb tfoot, #zucvoulpeb tr, #zucvoulpeb td, #zucvoulpeb th {
  border-style: none;
}

#zucvoulpeb p {
  margin: 0;
  padding: 0;
}

#zucvoulpeb .gt_table {
  display: table;
  border-collapse: collapse;
  line-height: normal;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#zucvoulpeb .gt_caption {
  padding-top: 4px;
  padding-bottom: 4px;
}

#zucvoulpeb .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  padding-left: 5px;
  padding-right: 5px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#zucvoulpeb .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 3px;
  padding-bottom: 5px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#zucvoulpeb .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#zucvoulpeb .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#zucvoulpeb .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#zucvoulpeb .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#zucvoulpeb .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#zucvoulpeb .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#zucvoulpeb .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#zucvoulpeb .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 5px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#zucvoulpeb .gt_spanner_row {
  border-bottom-style: hidden;
}

#zucvoulpeb .gt_group_heading {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  text-align: left;
}

#zucvoulpeb .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#zucvoulpeb .gt_from_md > :first-child {
  margin-top: 0;
}

#zucvoulpeb .gt_from_md > :last-child {
  margin-bottom: 0;
}

#zucvoulpeb .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#zucvoulpeb .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 5px;
  padding-right: 5px;
}

#zucvoulpeb .gt_stub_row_group {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 5px;
  padding-right: 5px;
  vertical-align: top;
}

#zucvoulpeb .gt_row_group_first td {
  border-top-width: 2px;
}

#zucvoulpeb .gt_row_group_first th {
  border-top-width: 2px;
}

#zucvoulpeb .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#zucvoulpeb .gt_first_summary_row {
  border-top-style: solid;
  border-top-color: #D3D3D3;
}

#zucvoulpeb .gt_first_summary_row.thick {
  border-top-width: 2px;
}

#zucvoulpeb .gt_last_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#zucvoulpeb .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#zucvoulpeb .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#zucvoulpeb .gt_last_grand_summary_row_top {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-bottom-style: double;
  border-bottom-width: 6px;
  border-bottom-color: #D3D3D3;
}

#zucvoulpeb .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#zucvoulpeb .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#zucvoulpeb .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#zucvoulpeb .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding-top: 4px;
  padding-bottom: 4px;
  padding-left: 5px;
  padding-right: 5px;
}

#zucvoulpeb .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#zucvoulpeb .gt_sourcenote {
  font-size: 90%;
  padding-top: 4px;
  padding-bottom: 4px;
  padding-left: 5px;
  padding-right: 5px;
}

#zucvoulpeb .gt_left {
  text-align: left;
}

#zucvoulpeb .gt_center {
  text-align: center;
}

#zucvoulpeb .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#zucvoulpeb .gt_font_normal {
  font-weight: normal;
}

#zucvoulpeb .gt_font_bold {
  font-weight: bold;
}

#zucvoulpeb .gt_font_italic {
  font-style: italic;
}

#zucvoulpeb .gt_super {
  font-size: 65%;
}

#zucvoulpeb .gt_footnote_marks {
  font-size: 75%;
  vertical-align: 0.4em;
  position: initial;
}

#zucvoulpeb .gt_asterisk {
  font-size: 100%;
  vertical-align: 0;
}

#zucvoulpeb .gt_indent_1 {
  text-indent: 5px;
}

#zucvoulpeb .gt_indent_2 {
  text-indent: 10px;
}

#zucvoulpeb .gt_indent_3 {
  text-indent: 15px;
}

#zucvoulpeb .gt_indent_4 {
  text-indent: 20px;
}

#zucvoulpeb .gt_indent_5 {
  text-indent: 25px;
}

#zucvoulpeb .katex-display {
  display: inline-flex !important;
  margin-bottom: 0.75em !important;
}

#zucvoulpeb div.Reactable > div.rt-table > div.rt-thead > div.rt-tr.rt-tr-group-header > div.rt-th-group:after {
  height: 0px !important;
}
</style>
<table class="gt_table" data-quarto-disable-processing="true" data-quarto-bootstrap="false">
  <thead>
    <tr class="gt_col_headings gt_spanner_row">
      <th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="2" colspan="1" scope="col" id="set">Set</th>
      <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="2" colspan="1" scope="col" id="mean">Mean</th>
      <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="2" colspan="1" scope="col" id="sd">SD</th>
      <th class="gt_center gt_columns_top_border gt_column_spanner_outer" rowspan="1" colspan="3" scope="colgroup" id="Difference from 'Original'">
        <div class="gt_column_spanner">Difference from 'Original'</div>
      </th>
      <th class="gt_col_heading gt_columns_bottom_border gt_center" rowspan="2" colspan="1" scope="col" id="trophy">Trophy</th>
    </tr>
    <tr class="gt_col_headings">
      <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="mean_difference">Mean</th>
      <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="sd_difference">SD</th>
      <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="overall_difference">Overall</th>
    </tr>
  </thead>
  <tbody class="gt_table_body">
    <tr><td headers="set" class="gt_row gt_left">Original</td>
<td headers="mean" class="gt_row gt_right">39.91</td>
<td headers="sd" class="gt_row gt_right">6.24</td>
<td headers="mean_difference" class="gt_row gt_right" style="background-color: #FFFFFF; color: #000000;"><br></td>
<td headers="sd_difference" class="gt_row gt_right" style="background-color: #FFFFFF; color: #000000;"><br></td>
<td headers="overall_difference" class="gt_row gt_right" style="background-color: #FFFFFF; color: #000000;"><br></td>
<td headers="trophy" class="gt_row gt_center"><br></td></tr>
    <tr><td headers="set" class="gt_row gt_left">Complete-case</td>
<td headers="mean" class="gt_row gt_right">40.20</td>
<td headers="sd" class="gt_row gt_right">6.36</td>
<td headers="mean_difference" class="gt_row gt_right" style="background-color: #DFF9FB; color: #000000;">0.30</td>
<td headers="sd_difference" class="gt_row gt_right" style="background-color: #7A9BCE; color: #FFFFFF;">0.12</td>
<td headers="overall_difference" class="gt_row gt_right" style="background-color: #DAF4F9; color: #000000;">0.42</td>
<td headers="trophy" class="gt_row gt_center"><br></td></tr>
    <tr><td headers="set" class="gt_row gt_left">Average</td>
<td headers="mean" class="gt_row gt_right">39.76</td>
<td headers="sd" class="gt_row gt_right">5.95</td>
<td headers="mean_difference" class="gt_row gt_right" style="background-color: #98B6DC; color: #000000;">0.15</td>
<td headers="sd_difference" class="gt_row gt_right" style="background-color: #DFF9FB; color: #000000;">0.29</td>
<td headers="overall_difference" class="gt_row gt_right" style="background-color: #DFF9FB; color: #000000;">0.43</td>
<td headers="trophy" class="gt_row gt_center"><br></td></tr>
    <tr><td headers="set" class="gt_row gt_left">Regression</td>
<td headers="mean" class="gt_row gt_right">39.93</td>
<td headers="sd" class="gt_row gt_right">5.99</td>
<td headers="mean_difference" class="gt_row gt_right" style="background-color: #5881C1; color: #FFFFFF;">0.02</td>
<td headers="sd_difference" class="gt_row gt_right" style="background-color: #CAE4F2; color: #000000;">0.25</td>
<td headers="overall_difference" class="gt_row gt_right" style="background-color: #9CB9DD; color: #000000;">0.27</td>
<td headers="trophy" class="gt_row gt_center"><br></td></tr>
    <tr><td headers="set" class="gt_row gt_left">MICE</td>
<td headers="mean" class="gt_row gt_right">39.86</td>
<td headers="sd" class="gt_row gt_right">6.31</td>
<td headers="mean_difference" class="gt_row gt_right" style="background-color: #678CC7; color: #FFFFFF;">0.05</td>
<td headers="sd_difference" class="gt_row gt_right" style="background-color: #5881C1; color: #FFFFFF;">0.08</td>
<td headers="overall_difference" class="gt_row gt_right" style="background-color: #5881C1; color: #FFFFFF;">0.13</td>
<td headers="trophy" class="gt_row gt_center">🏆</td></tr>
  </tbody>
  
</table>
</div>
</div>
</div>
</section>
<section id="mice-performs-best-at-imputing-salary" class="level4">
<h4 class="anchored" data-anchor-id="mice-performs-best-at-imputing-salary">MICE performs best at imputing salary</h4>
<p>The MICE algorithm also performed best at imputing missing salary data, providing mean and standard deviation values that are the closest overall match to the original dataset.</p>
<div class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb12-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># summarise the approaches for 'salary'</span></span>
<span id="cb12-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">imputation_summary_table</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">df =</span> df_summary_test, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.measure =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"salary"</span>)</span></code></pre></div></div>
</details>
<div class="cell-output-display">
<div id="mpyixsehej" style="padding-left:0px;padding-right:0px;padding-top:10px;padding-bottom:10px;overflow-x:auto;overflow-y:auto;width:auto;height:auto;">
<style>#mpyixsehej table {
  font-family: system-ui, 'Segoe UI', Roboto, Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol', 'Noto Color Emoji';
  -webkit-font-smoothing: antialiased;
  -moz-osx-font-smoothing: grayscale;
}

#mpyixsehej thead, #mpyixsehej tbody, #mpyixsehej tfoot, #mpyixsehej tr, #mpyixsehej td, #mpyixsehej th {
  border-style: none;
}

#mpyixsehej p {
  margin: 0;
  padding: 0;
}

#mpyixsehej .gt_table {
  display: table;
  border-collapse: collapse;
  line-height: normal;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#mpyixsehej .gt_caption {
  padding-top: 4px;
  padding-bottom: 4px;
}

#mpyixsehej .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  padding-left: 5px;
  padding-right: 5px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#mpyixsehej .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 3px;
  padding-bottom: 5px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#mpyixsehej .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#mpyixsehej .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#mpyixsehej .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#mpyixsehej .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#mpyixsehej .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#mpyixsehej .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#mpyixsehej .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#mpyixsehej .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 5px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#mpyixsehej .gt_spanner_row {
  border-bottom-style: hidden;
}

#mpyixsehej .gt_group_heading {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  text-align: left;
}

#mpyixsehej .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#mpyixsehej .gt_from_md > :first-child {
  margin-top: 0;
}

#mpyixsehej .gt_from_md > :last-child {
  margin-bottom: 0;
}

#mpyixsehej .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#mpyixsehej .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 5px;
  padding-right: 5px;
}

#mpyixsehej .gt_stub_row_group {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 5px;
  padding-right: 5px;
  vertical-align: top;
}

#mpyixsehej .gt_row_group_first td {
  border-top-width: 2px;
}

#mpyixsehej .gt_row_group_first th {
  border-top-width: 2px;
}

#mpyixsehej .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#mpyixsehej .gt_first_summary_row {
  border-top-style: solid;
  border-top-color: #D3D3D3;
}

#mpyixsehej .gt_first_summary_row.thick {
  border-top-width: 2px;
}

#mpyixsehej .gt_last_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#mpyixsehej .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#mpyixsehej .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#mpyixsehej .gt_last_grand_summary_row_top {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-bottom-style: double;
  border-bottom-width: 6px;
  border-bottom-color: #D3D3D3;
}

#mpyixsehej .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#mpyixsehej .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#mpyixsehej .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#mpyixsehej .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding-top: 4px;
  padding-bottom: 4px;
  padding-left: 5px;
  padding-right: 5px;
}

#mpyixsehej .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#mpyixsehej .gt_sourcenote {
  font-size: 90%;
  padding-top: 4px;
  padding-bottom: 4px;
  padding-left: 5px;
  padding-right: 5px;
}

#mpyixsehej .gt_left {
  text-align: left;
}

#mpyixsehej .gt_center {
  text-align: center;
}

#mpyixsehej .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#mpyixsehej .gt_font_normal {
  font-weight: normal;
}

#mpyixsehej .gt_font_bold {
  font-weight: bold;
}

#mpyixsehej .gt_font_italic {
  font-style: italic;
}

#mpyixsehej .gt_super {
  font-size: 65%;
}

#mpyixsehej .gt_footnote_marks {
  font-size: 75%;
  vertical-align: 0.4em;
  position: initial;
}

#mpyixsehej .gt_asterisk {
  font-size: 100%;
  vertical-align: 0;
}

#mpyixsehej .gt_indent_1 {
  text-indent: 5px;
}

#mpyixsehej .gt_indent_2 {
  text-indent: 10px;
}

#mpyixsehej .gt_indent_3 {
  text-indent: 15px;
}

#mpyixsehej .gt_indent_4 {
  text-indent: 20px;
}

#mpyixsehej .gt_indent_5 {
  text-indent: 25px;
}

#mpyixsehej .katex-display {
  display: inline-flex !important;
  margin-bottom: 0.75em !important;
}

#mpyixsehej div.Reactable > div.rt-table > div.rt-thead > div.rt-tr.rt-tr-group-header > div.rt-th-group:after {
  height: 0px !important;
}
</style>
<table class="gt_table" data-quarto-disable-processing="true" data-quarto-bootstrap="false">
  <thead>
    <tr class="gt_col_headings gt_spanner_row">
      <th class="gt_col_heading gt_columns_bottom_border gt_left" rowspan="2" colspan="1" scope="col" id="set">Set</th>
      <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="2" colspan="1" scope="col" id="mean">Mean</th>
      <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="2" colspan="1" scope="col" id="sd">SD</th>
      <th class="gt_center gt_columns_top_border gt_column_spanner_outer" rowspan="1" colspan="3" scope="colgroup" id="Difference from 'Original'">
        <div class="gt_column_spanner">Difference from 'Original'</div>
      </th>
      <th class="gt_col_heading gt_columns_bottom_border gt_center" rowspan="2" colspan="1" scope="col" id="trophy">Trophy</th>
    </tr>
    <tr class="gt_col_headings">
      <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="mean_difference">Mean</th>
      <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="sd_difference">SD</th>
      <th class="gt_col_heading gt_columns_bottom_border gt_right" rowspan="1" colspan="1" scope="col" id="overall_difference">Overall</th>
    </tr>
  </thead>
  <tbody class="gt_table_body">
    <tr><td headers="set" class="gt_row gt_left">Original</td>
<td headers="mean" class="gt_row gt_right">39.40</td>
<td headers="sd" class="gt_row gt_right">7.41</td>
<td headers="mean_difference" class="gt_row gt_right" style="background-color: #FFFFFF; color: #000000;"><br></td>
<td headers="sd_difference" class="gt_row gt_right" style="background-color: #FFFFFF; color: #000000;"><br></td>
<td headers="overall_difference" class="gt_row gt_right" style="background-color: #FFFFFF; color: #000000;"><br></td>
<td headers="trophy" class="gt_row gt_center"><br></td></tr>
    <tr><td headers="set" class="gt_row gt_left">Complete-case</td>
<td headers="mean" class="gt_row gt_right">38.42</td>
<td headers="sd" class="gt_row gt_right">7.06</td>
<td headers="mean_difference" class="gt_row gt_right" style="background-color: #DFF9FB; color: #000000;">0.98</td>
<td headers="sd_difference" class="gt_row gt_right" style="background-color: #DDF7FA; color: #000000;">0.35</td>
<td headers="overall_difference" class="gt_row gt_right" style="background-color: #DFF9FB; color: #000000;">1.33</td>
<td headers="trophy" class="gt_row gt_center"><br></td></tr>
    <tr><td headers="set" class="gt_row gt_left">Average</td>
<td headers="mean" class="gt_row gt_right">39.54</td>
<td headers="sd" class="gt_row gt_right">7.06</td>
<td headers="mean_difference" class="gt_row gt_right" style="background-color: #648AC6; color: #FFFFFF;">0.14</td>
<td headers="sd_difference" class="gt_row gt_right" style="background-color: #DFF9FB; color: #000000;">0.35</td>
<td headers="overall_difference" class="gt_row gt_right" style="background-color: #85A4D3; color: #FFFFFF;">0.49</td>
<td headers="trophy" class="gt_row gt_center"><br></td></tr>
    <tr><td headers="set" class="gt_row gt_left">Regression</td>
<td headers="mean" class="gt_row gt_right">39.33</td>
<td headers="sd" class="gt_row gt_right">7.09</td>
<td headers="mean_difference" class="gt_row gt_right" style="background-color: #5881C1; color: #FFFFFF;">0.07</td>
<td headers="sd_difference" class="gt_row gt_right" style="background-color: #D3EDF6; color: #000000;">0.32</td>
<td headers="overall_difference" class="gt_row gt_right" style="background-color: #799BCE; color: #FFFFFF;">0.39</td>
<td headers="trophy" class="gt_row gt_center"><br></td></tr>
    <tr><td headers="set" class="gt_row gt_left">MICE</td>
<td headers="mean" class="gt_row gt_right">39.48</td>
<td headers="sd" class="gt_row gt_right">7.38</td>
<td headers="mean_difference" class="gt_row gt_right" style="background-color: #5B83C2; color: #FFFFFF;">0.09</td>
<td headers="sd_difference" class="gt_row gt_right" style="background-color: #5881C1; color: #FFFFFF;">0.03</td>
<td headers="overall_difference" class="gt_row gt_right" style="background-color: #5881C1; color: #FFFFFF;">0.12</td>
<td headers="trophy" class="gt_row gt_center">🏆</td></tr>
  </tbody>
  
</table>
</div>
</div>
</div>
<p>Complete-case analysis performed poorly on this dataset: it produced the largest mean differences of any method for both age (0.30) and salary (0.98). This is concerning, given that complete-case analysis is the default way many statistical packages handle missing data. It highlights the importance of reviewing, and potentially adjusting for, missing data to ensure accurate results.</p>
</section>
</section>
</section>
<section id="summary" class="level2">
<h2 class="anchored" data-anchor-id="summary">Summary</h2>
<p>Missing data can significantly impact the integrity of your findings, leading to skewed results and misleading conclusions. Understanding how to manage missing values is essential for drawing accurate insights.</p>
<p>We identified three types of missingness to consider:</p>
<p><strong>Missing Completely At Random (MCAR):</strong> missingness is random and unrelated to observed or unobserved data.</p>
<p><strong>Missing At Random (MAR):</strong> missingness is related to observed data, but not the missing data itself.</p>
<p><strong>Missing Not At Random (MNAR):</strong> missingness is related to the unobserved data itself.</p>
<p>To explore techniques for handling missingness, we created a synthetic dataset of 1,000 individuals with age, gender and salary, built to reflect realistic relationships between these variables. We then introduced MAR missingness to simulate everyday scenarios.</p>
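<p>As a rough illustration of the MAR mechanism described above, the Python sketch below (not the R code used in this post) builds a small hypothetical dataset and then hides salary values with a probability that depends only on age. The coefficients, missingness probabilities and variable names are all assumptions chosen for illustration.</p>

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical synthetic data: salary (in thousands) rises with age
ages = [random.gauss(40, 6) for _ in range(1000)]
salaries = [20 + 0.5 * age + random.gauss(0, 5) for age in ages]


def salary_is_missing(age):
    """MAR: the chance of a missing salary depends only on the observed age."""
    p_missing = 0.7 if age > 40 else 0.1  # assumed probabilities
    return p_missing > random.random()


# Replace some salaries with None according to the MAR rule
observed = [None if salary_is_missing(a) else s for a, s in zip(ages, salaries)]

# Complete-case analysis simply drops the rows with a missing salary
complete_cases = [s for s in observed if s is not None]

true_mean = statistics.mean(salaries)
complete_case_mean = statistics.mean(complete_cases)

print(f"Rows kept:                 {len(complete_cases)} of {len(salaries)}")
print(f"True mean salary:          {true_mean:.2f}")
print(f"Complete-case mean salary: {complete_case_mean:.2f}")
```

<p>Because older people in this sketch tend to earn more and are also more likely to have a missing salary, dropping incomplete rows should pull the complete-case mean below the true mean.</p>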
<p>We used four methods for handling missing data:</p>
<ul>
<li><p>Complete-case analysis</p></li>
<li><p>Mean imputation</p></li>
<li><p>Linear regression</p></li>
<li><p>Multiple imputation using the MICE algorithm</p></li>
</ul>
<p>The MICE algorithm emerged as the most accurate, closely mirroring the original dataset’s characteristics. Mean imputation and linear regression provided decent average estimates but understated variability, giving lower standard deviations and narrower distributions than the original.</p>
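<p>The effect of mean imputation on variability can be seen in a minimal Python sketch (again, not the R code behind this post). It assumes a hypothetical variable with 30% of values removed completely at random, which is simpler than the MAR setup used here, and all names and numbers are illustrative.</p>

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical complete variable (e.g. age)
original = [random.gauss(40, 6) for _ in range(1000)]

# Hide roughly 30% of the values at random (None marks a missing value)
observed = [None if 0.3 > random.random() else x for x in original]

# Mean imputation: replace every missing value with the mean of the known values
known = [x for x in observed if x is not None]
observed_mean = statistics.mean(known)
mean_imputed = [observed_mean if x is None else x for x in observed]

print(f"Original SD:     {statistics.stdev(original):.2f}")
print(f"Mean-imputed SD: {statistics.stdev(mean_imputed):.2f}")
```

<p>Every imputed value sits exactly on the mean, so the average is preserved while the spread shrinks. Methods such as MICE add variation to the imputed values, which is one reason they can track the original standard deviation more closely.</p>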
<p>This example serves as a powerful reminder of the importance of choosing the right approach for missing data. So, the next time you encounter missing values, remember: how you handle them can shape the story your data tells.</p>


</section>

 ]]></description>
  <category>R</category>
  <category>Imputation</category>
  <category>Learning</category>
  <category>MICE</category>
  <guid>https://the-strategy-unit.github.io/data_science/blogs/posts/2025-07-28_imputing-data/</guid>
  <pubDate>Mon, 28 Jul 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>What does a scrummaster do, anyway?</title>
  <link>https://the-strategy-unit.github.io/data_science/blogs/posts/2025-06-20-how-we-sprint/</link>
  <description><![CDATA[ 





<p>We use sprints to manage our workload developing <a href="https://connect.strategyunitwm.nhs.uk/nhp/project_information/">the open source NHP Model</a> and several adjacent products. We’ve written before about sprint roles: <a href="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-05-16%20data-science-as-a-product/">Claire as Product Owner</a> and <a href="https://the-strategy-unit.github.io/data_science/presentations/2024-11-21_agile_project_management/">Chris as Team Leader</a>. This blogpost attempts to distil what it is that a <em>scrummaster</em> does, both as a form of internal documentation and as a way of sharing our learning with others.</p>
<section id="practical-tasks" class="level2">
<h2 class="anchored" data-anchor-id="practical-tasks">Practical tasks</h2>
<section id="organising-meetings" class="level3">
<h3 class="anchored" data-anchor-id="organising-meetings">Organising meetings</h3>
<p>On a practical level, a scrummaster organises and conducts all the meetings that are a regular part of the sprint. These are:</p>
<ul>
<li>Sprint planning (1 hour) with Product Owner before sprint starts</li>
<li>Sprint kickoff (2.5 hours) at start of sprint, with all team members</li>
<li>Sprint catchups (1 hour) weekly during sprint duration, with all team members</li>
<li>Sprint retro (2 hours) at end of sprint, with all team members</li>
</ul>
<p>We have created a GitHub template that acts as a checklist to help us organise these meetings, including suggested agendas and useful links. We currently work in 3-week sprints, with 2 weeks for coding and 1 week for Quality Assurance (QA). We then work on other projects between sprints (a “fallow week”), effectively working in 4-week cycles.</p>
</section>
<section id="managing-meetings-and-tasks" class="level3">
<h3 class="anchored" data-anchor-id="managing-meetings-and-tasks">Managing meetings and tasks</h3>
<p>The scrummaster leads all meetings during the sprint, taking team members through the agenda. They help people working on the sprint to decide what they are going to do and how they are going to do it. If there are any blockers, the scrummaster helps sprint participants to resolve them. The scrummaster is also responsible for helping sprint members keep to deadlines by regularly checking in on progress.</p>
<p>The scrummaster works with the Product Owner to agree the priority order of the tasks in the sprint. If there are any unexpected development requests or changes in priority during the sprint, the scrummaster helps to work out what can and can’t move, distribute tasks equally amongst the team, and manage workloads.</p>
<p>We manage our sprints using <a href="https://docs.github.com/en/issues/planning-and-tracking-with-projects/learning-about-projects/about-projects">GitHub Projects</a>, and the scrummaster is responsible for helping keep this tidy - for example, by ensuring that any uncompleted issues at the end of a sprint are either closed, or assigned to a new sprint.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-06-20-how-we-sprint/wallace-and-gromit-train.gif" class="img-fluid figure-img" alt="Gromit laying train tracks down before a fast moving train."></p>
<figcaption>Scrummaster in action</figcaption>
</figure>
</div>
</section>
</section>
<section id="knowledge-and-skills-required" class="level2">
<h2 class="anchored" data-anchor-id="knowledge-and-skills-required">Knowledge and skills required</h2>
<section id="seeing-the-big-picture-knowledge-of-the-project-as-a-whole" class="level3">
<h3 class="anchored" data-anchor-id="seeing-the-big-picture-knowledge-of-the-project-as-a-whole">Seeing the big picture: Knowledge of the project as a whole</h3>
<p>The scrummaster should have an understanding of the project as a whole, including an awareness of who the key stakeholders are, and what their priorities might be. The project that we’re working on has lots of <a href="https://connect.strategyunitwm.nhs.uk/nhp/project_information/project_plan_and_summary/components-overview.html">interconnected parts</a>, and there are often external time-sensitive pressures impacting our work as well. Having this broad overview enables the scrummaster to spot potential blockers and issues, and ensure that the product develops in a way that meets the needs of all stakeholders.</p>
</section>
<section id="an-eye-for-detail-technical-understanding" class="level3">
<h3 class="anchored" data-anchor-id="an-eye-for-detail-technical-understanding">An eye for detail: Technical understanding</h3>
<p>The scrummaster does not necessarily have to be actively involved in the sprint in terms of contributing code, but they should have enough technical knowledge to be able to understand (in broad terms) what the requirements are for each piece of work forming part of the sprint. They should also be able to signpost sprint participants to relevant resources and help them with decision making, where required.</p>
</section>
<section id="communication-skills" class="level3">
<h3 class="anchored" data-anchor-id="communication-skills">Communication skills</h3>
<p>The scrummaster coordinates communication between team members working on interconnected elements of tasks, and between the sprint team and key stakeholders. They ensure that development team members have all the information they need to accomplish their sprint goals.</p>
<p>In our team, the scrummaster is responsible for writing up the user-facing “model updates”, which provide a broad overview of the developments at the end of each sprint. This often involves translating complex technical details into human-readable terms.</p>
</section>
</section>
<section id="sharing-the-load" class="level2">
<h2 class="anchored" data-anchor-id="sharing-the-load">Sharing the load</h2>
<p>While there are benefits to keeping the same person as scrummaster, we’ve recently switched to rotating the role among team members. This will provide respite for everyone and also improve big-picture knowledge across the team. In turn, this will improve our bus factor, so that we won’t depend on any one individual to keep all the scrummaster knowledge in their brain.</p>


</section>

 ]]></description>
  <category>GitHub</category>
  <category>Scrum</category>
  <category>Agile</category>
  <guid>https://the-strategy-unit.github.io/data_science/blogs/posts/2025-06-20-how-we-sprint/</guid>
  <pubDate>Fri, 20 Jun 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Data science as a product</title>
  <dc:creator>Claire Welsh</dc:creator>
  <link>https://the-strategy-unit.github.io/data_science/blogs/posts/2025-05-16 data-science-as-a-product/</link>
  <description><![CDATA[ 





<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-05-16 data-science-as-a-product/StockCake-Healthcare Data Analysis_1747409640.jpg" class="img-fluid figure-img"></p>
<figcaption>Data Science</figcaption>
</figure>
</div>
<p>Data science in healthcare is booming. Every day new advances emerge that promise to deepen understanding, improve screening or optimise treatments. But as this panoply of models grows, so does the scrapheap of potentially useful ones that have never helped a single patient. The vast majority of promising new models, even when published in prestigious journals and cited extensively, are <a href="https://www.nature.com/articles/s41698-024-00553-6">never fully translated</a> into clinical practice.<br>
There are many reasons for this, and many potential solutions. Viewing a model through the lens of a ‘product’, produced and optimised for ‘customers’, is one potential solution.</p>
<p>At The Strategy Unit, our acute demand forecasting model for the New Hospitals Programme has been in use for nearly 2 years. The suite of apps, simulation models, bespoke functions and outputs that make up this ‘model’ has been constantly evolving since its inception.<br>
This is a fast-paced environment in which to do data science, since the team are continuously improving the model alongside its active deployment. The NHP model is not ‘new’, and has already had considerable national impact, but there are a whole host of potential applications which we are yet to explore.<br>
It is all too easy to get your head down and work away at the next issue in the backlog without realising that doing something else, something you may not even have thought of, could be preferable.</p>
<p>To avoid this problem, we have created a ‘product team’ for the NHP model.<br>
Borrowing from the realm of software development and <a href="https://www.scrum.org/resources/what-scrum-module">Scrum</a>, our team has a Product Owner, a Product Manager and a Customer Engagement Manager, all overseen by a project director. The team is in its infancy, but the need for it was obvious almost immediately. The overarching goal of the team is to ensure that the model (and all its component parts) meets the needs of current and future ‘customers’, and to prioritise work so that the medium and long-term ambitions for the product are met.<br>
Explaining how we do this first requires a brief outline of the roles of the product team members:</p>
<p><strong>Product Owner</strong>: As Product Owner, it is my responsibility to ensure that the work the data science team does is moving us forward towards our shared goals. This means that I need to have a clear view of the work backlog and how each element relates to the vision for the product, and to understand the competing priorities coming from our stakeholders. I distil those into plans of action that the data science team can fulfil.</p>
<p><strong>Product Manager</strong>: This person’s priority is to have a clear understanding of where we are now. Who are the current users, and are their needs being met? What niche does the model currently fit into? Who are the main competitors? What are the potential opportunities for growth and development? What are we good at, and what could or should we improve? The Product Manager does not need to be a data scientist but needs to think holistically and be comfortable talking to coders, planners, customers and everyone in between. They have a key role to play in helping prioritise the data science team’s work and contextualising new requests.</p>
<p><strong>Customer Engagement Manager</strong>: How do we know if the model is fulfilling the needs of our existing customers? What are the pain points? How could we improve the offer? Gathering and assimilating this information is essential if we are to keep the model friction-free and fit-for-purpose. This role is very time consuming and iterative but is crucial to the model’s success and eventual impact.</p>
<p>Our product team works closely together.<br>
We hold regular meetings and end up having ad hoc chats most days.<br>
We create our own tools for planning, prioritisation and information dissemination, using whatever software works for us. Currently this means a combination of GitHub projects, Excel workbooks, Quarto documents and Canva. Within the first few months of the team’s inception, we’ve created a product vision and goal, adopted a prioritisation framework for assessing proposed new functionality, created Gantt charts and a bespoke product team backlog, and gathered hours of user feedback. All this user feedback is fed into the data science team’s work backlog, so that it is acted on in the order determined by our prioritisation exercises. Our current users know that their opinions matter, and that we are continually improving the tool so that it is as intuitive, functional and impactful as they need it to be to do their jobs. Where the product team identifies new use cases for the model, the work needed to understand their implications is captured, examined, and measured.<br>
Through conversation and research, we help the team to make decisions around whether a risky new path may be worth the work.<br>
Although data scientists are excellent at building useful tools, conceptualisation of the needs of new users is a different challenge entirely.<br>
We want our data scientists to do data science because they love it and excel at it – other tasks should sit with other teams.</p>
<p>Every iteration of data science work (we operate in ‘sprints’, meaning we release code every 4 weeks) moves us one increment closer to the ultimate vision for the product, and this journey is transparent, building the trust of stakeholders. The product team helps us sharpen the point of the spear, helping us clarify what we should do and why, and it does so without burdening the data science team with any extra workload. This frees them up to do what they do best.</p>
<p>It’s early days for this way of working, but I believe we’ve already seen benefits.</p>
<p>The more our workload grows, the more important it is to have team members whose job it is to help steer the ship where we all want it to go, to realise opportunities and to help us make the most of our work.</p>



 ]]></description>
  <category>GitHub</category>
  <category>Scrum</category>
  <category>Agile</category>
  <guid>https://the-strategy-unit.github.io/data_science/blogs/posts/2025-05-16 data-science-as-a-product/</guid>
  <pubDate>Fri, 16 May 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Taking a Hackathon approach to exploring new methods in NLP</title>
  <dc:creator>YiWen Hon</dc:creator>
  <link>https://the-strategy-unit.github.io/data_science/blogs/posts/2025-05-06_NLP-hackathon/</link>
  <description><![CDATA[ 





<p>At the Strategy Unit, we’re lucky to have an awesome Evaluation team who often process and categorise large amounts of free text in <a href="https://www.strategyunitwm.nhs.uk/sites/default/files/2025-01/Strategy%20Unit%20Interactive%20Evaluation%20Guide.pdf">the work that they do</a>. This can be a time-consuming, labour-intensive process, and presents a good opportunity for AI/machine learning to help reduce some of this burden using Natural Language Processing (NLP). We want to use technology to help augment the existing skills and expertise of our qualitative analysts.</p>
<p>The Data Science team has intended for some time to explore ways of helping to automate some Evaluation tasks. However, we lacked the opportunity to do so effectively, given competing demands on our time and capacity. We finally decided to take a Hackathon-like approach which worked really well for us. We’re sharing our methods and findings here, in case they’re helpful to others. You can find our code on <a href="https://github.com/ai-mindset/TagSurvey">this GitHub repo</a>. We approached the problem from a few different angles, each of which lives in a separate branch.</p>
<section id="setup-defining-the-problem" class="level2">
<h2 class="anchored" data-anchor-id="setup-defining-the-problem">Setup: Defining the problem</h2>
<p>The key to success when you don’t have much time is to define your problem and objectives well. This helps keep your session focused and realistic. I had a good preparatory meeting with Andriana, our collaborator in the Evaluation team, who provided some examples of the free text data generated from Evaluation studies and their intended uses. We decided to use some data from a recent survey, which needed to be categorised into one or more of six different categories - a multilabel classification problem, with a small dataset of only 460 rows.</p>
<p>Our mission was to develop a model to help the Evaluation team categorise responses quickly, with a reasonable degree of performance.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-05-06_NLP-hackathon/mission_impossible.jpg" class="img-fluid figure-img"></p>
<figcaption>Mission impossible</figcaption>
</figure>
</div>
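<p>To make “multilabel” concrete, here is a minimal sketch of one way such data could be represented: each response gets one 0/1 flag per category, and any number of flags can be set. The category names and the <code>to_multi_hot</code> helper are hypothetical, made up for illustration; the survey’s real six categories are not listed in this post.</p>

```python
from typing import List

# Hypothetical category names; the survey's six real categories are not listed in this post.
LABELS = ["access", "communication", "cost", "staffing", "waiting times", "other"]


def to_multi_hot(assigned: List[str], labels: List[str] = LABELS) -> List[int]:
    """Return one 0/1 flag per category, in the order of `labels`."""
    unknown = set(assigned) - set(labels)
    if unknown:
        raise ValueError(f"Unknown categories: {sorted(unknown)}")
    return [1 if label in assigned else 0 for label in labels]
```

<p>A response tagged with both “cost” and “access” would have two flags set rather than being forced into a single class, which is what separates this from ordinary multiclass classification.</p>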
</section>
<section id="the-hackathon-format" class="level2">
<h2 class="anchored" data-anchor-id="the-hackathon-format">The Hackathon format</h2>
<p>We set aside one working day for our Hackathon. Having the event in our calendars was useful - we were able to focus uninterrupted on one problem all day, without other meetings and obligations. We kept our schedule pretty loose, but ended up with three meetings: one in the morning, one at lunchtime, and one at the end of the day. We opted to work separately, although we discussed the option of pair programming all day as well.</p>
<p>The meeting in the morning was mostly spent talking through the problem, and discussing the approaches we were going to take. This was so that we could avoid duplicating our efforts, and explore different potential solutions.</p>
<p>Our lunchtime meeting was just a quick status update; I hadn’t done much coding by then because I’d spent hours trying to sort out my virtual environment! There’s quite a lot of setup required for some of these more state-of-the-art packages, and Windows isn’t the ideal operating system…</p>
<p>We reconvened in the evening to discuss what we’d achieved. It was great having quite a tight deadline - it made me really focus on the task at hand.</p>
<p>As an aside, I used ChatGPT to scaffold a lot of my code, and to help with debugging - it saved me so much time! I was able to use my time thinking through the problem and working out my approach, instead of trawling through Stack Overflow to figure out why my code wasn’t working. For me, the Hackathon would have been much less successful without it.</p>
<p>Between the three of us, we tried two different methods out.</p>
</section>
<section id="zero-shot-classification-with-fine-tuning" class="level2">
<h2 class="anchored" data-anchor-id="zero-shot-classification-with-fine-tuning">Zero shot classification with fine-tuning</h2>
<p>This method is more like traditional machine learning. It is relatively quick to run and requires less computing power than the zero shot prompting method. We used the Hugging Face <a href="https://huggingface.co/facebook/bart-large-mnli">facebook/bart-large-mnli</a> model, which could be fine-tuned as part of a human-in-the-loop training cycle. However, retraining is quite slow, taking about 20 minutes even with such a small dataset.</p>
<div class="cell" data-layout-align="default">
<div class="cell-output-display">
<div>
<p></p><figure class="figure"><p></p>
<div>
<pre class="mermaid mermaid-js">flowchart TB
    A("Model labels data") --&gt; B("Humans check and correct data")
    B --&gt; n3("Model retrains")
    n3 --&gt; A
    nn("Give model labels and data") --&gt; A

    style A stroke:#000000,fill:#276DC2,color:#fff
    style nn stroke: #000000,fill:#ffde57
    style B stroke: #000000,fill:#ffde57
    style n3 stroke: #000000,fill:#276DC2,color:#fff
</pre>
</div>
<p></p></figure><p></p>
</div>
</div>
</div>
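<p>As a rough illustration of the “model labels data” step, the sketch below shows how zero shot labelling with <code>facebook/bart-large-mnli</code> might look using the <code>transformers</code> zero-shot-classification pipeline. It is not the code from our repo: the category names, the 0.5 threshold and the <code>build_classifier</code> and <code>classify</code> helpers are all assumptions made for illustration, and the fine-tuning step is not shown.</p>

```python
from typing import Callable, List

# Hypothetical category names; the survey's six real categories are not listed in this post.
LABELS = ["access", "communication", "cost", "staffing", "waiting times", "other"]


def build_classifier():
    """Load the zero-shot pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily: only needed for real runs

    return pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


def classify(text: str, classifier: Callable, labels: List[str] = LABELS,
             threshold: float = 0.5) -> List[str]:
    """Return every label whose score reaches the (assumed) threshold."""
    # multi_label=True scores each label independently, so several can pass.
    result = classifier(text, labels, multi_label=True)
    return [
        label
        for label, score in zip(result["labels"], result["scores"])
        if score >= threshold
    ]
```

<p>With <code>transformers</code> installed, <code>classify(response_text, build_classifier())</code> would return the suggested categories for one response, ready for a human to check and correct. Passing the classifier in as an argument means the model is loaded once and reused, and it also makes the function easy to try out with a stand-in.</p>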
</section>
<section id="zero-shot-prompting" class="level2">
<h2 class="anchored" data-anchor-id="zero-shot-prompting">Zero shot prompting</h2>
<p>This method uses <a href="https://ollama.com/">Ollama</a> to run complex Large Language Models (LLMs) locally. It relies on good prompt engineering, meaning that we had to test several different ways of asking the model to complete the task. No retraining was required. We just used our own machines to do this, and a computer with decent hardware can generate just over 80 tokens per second, meaning that it’s reasonably quick. In my judgement, performance was slightly better than with the fine-tuning method, and it would even be possible to <a href="https://github.com/langchain-ai/openevals">evaluate results automatically</a>, with human oversight, using a method called “LLM eval”.</p>
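<p>A minimal sketch of the prompting approach is below, assuming an Ollama server running locally on its default port. The prompt wording, the category names, the model name <code>llama3.2</code> and the helper functions are all hypothetical; they show the general shape of the technique, not the prompts we actually settled on.</p>

```python
import json
import urllib.request
from typing import Callable, List

# Hypothetical category names; the survey's six real categories are not listed in this post.
LABELS = ["access", "communication", "cost", "staffing", "waiting times", "other"]


def build_prompt(text: str, labels: List[str]) -> str:
    """Ask for a comma-separated list drawn only from the allowed labels."""
    options = ", ".join(labels)
    return (
        "Classify the survey response into one or more of these categories: "
        f"{options}.\n"
        "Reply with a comma-separated list of category names and nothing else.\n\n"
        f"Response: {text}"
    )


def parse_labels(reply: str, labels: List[str]) -> List[str]:
    """Keep only recognised labels, in the order given in `labels`."""
    found = {part.strip().strip(".").lower() for part in reply.split(",")}
    return [label for label in labels if label.lower() in found]


def ask_ollama(prompt: str, model: str = "llama3.2",
               url: str = "http://localhost:11434/api/generate") -> str:
    """Send one non-streaming request to a locally running Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


def categorise(text: str, labels: List[str] = LABELS,
               ask: Callable[[str], str] = ask_ollama) -> List[str]:
    """Prompt the model, then discard anything that is not an allowed label."""
    return parse_labels(ask(build_prompt(text, labels)), labels)
```

<p>Filtering the reply against the allowed list is a cheap guard against the model inventing categories. Most of the iteration in this approach goes into the wording inside <code>build_prompt</code>.</p>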
</section>
<section id="next-steps" class="level2">
<h2 class="anchored" data-anchor-id="next-steps">Next steps</h2>
<p>Both methods performed reasonably well, and Andriana could see how they would be able to help the Evaluation team in the future. The Evaluation team would need to explore how exactly these models might fit into the qualitative analysis process and which types of projects they would be relevant for, as these methods are likely more applicable to projects with ‘simpler’ data. We also spotted some opportunities for quick wins with better cleaning of the data. Having built these simple proofs of concept on our local machines, we now need to think about how we would productionise these approaches in the real world. I also want to spend time thinking about how the Data Science team and the Evaluation team can work more closely together in the future.</p>
<p>Overall, I would say that the Hackathon format was a great way to explore new methods in a collaborative way, without too much disruption of regular day to day responsibilities.</p>


</section>

 ]]></description>
  <category>learning</category>
  <category>NLP</category>
  <category>AI</category>
  <guid>https://the-strategy-unit.github.io/data_science/blogs/posts/2025-05-06_NLP-hackathon/</guid>
  <pubDate>Tue, 06 May 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Using GitHub Apps for Authentication</title>
  <dc:creator>Thomas Jemmett</dc:creator>
  <link>https://the-strategy-unit.github.io/data_science/blogs/posts/2025-04-30_using-github-apps-for-authentication/</link>
  <description><![CDATA[ 





<p>Recently, we’ve been working on building an internal dashboard to monitor the repositories in our GitHub organisation. The intention is to perform various checks, such as ensuring each repo has a <a href="https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners">CODEOWNERS</a> file.</p>
<p>GitHub has a <a href="https://docs.github.com/en/rest">REST API</a> that can do all of the things we need, but we hit a bit of a snag early on. We want this dashboard to update itself on our <a href="https://posit.co/products/enterprise/connect/">Posit Connect</a> server—but authenticating with the GitHub API requires a <a href="https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens">Personal Access Token (PAT)</a>.</p>
<p>PATs are useful, but they’re managed by a user (not an organisation), and ideally should be short-lived and expire regularly.</p>
<p>What we really need is a more robust way of authenticating with GitHub.</p>
<section id="github-apps" class="level2">
<h2 class="anchored" data-anchor-id="github-apps">GitHub Apps</h2>
<blockquote class="blockquote">
<p>A GitHub App is a type of integration that you can build to interact with and extend the functionality of GitHub. You can build a GitHub App to provide flexibility and reduce friction in your processes, without needing to sign in a user or create a service account.</p>
<p>Common use cases for GitHub Apps include:</p>
<ul>
<li>Automating tasks or background processes</li>
</ul>
<p>…</p>
</blockquote>
<p><a href="https://docs.github.com/en/apps/creating-github-apps/about-creating-github-apps/about-creating-github-apps">About GitHub Apps</a>.</p>
<p>Sounds ideal, right? And it turns out it’s pretty easy to create your own app too! Well, there are a few steps, and a bit of boilerplate code to write, but I’ll get to that later.</p>
<p>If you explore that link, you’ll find all the details needed to create your own app—but I’ll quickly note the steps I took below.</p>
<section id="creating-the-app" class="level3">
<h3 class="anchored" data-anchor-id="creating-the-app">Creating the App</h3>
<ol type="1">
<li>Go to your organisation’s <strong>Settings</strong> page on GitHub.</li>
<li>At the bottom of the left-hand navigation, find <strong>Developer settings</strong> and choose <strong>GitHub Apps</strong>.</li>
<li>Click the <strong>New GitHub App</strong> button.</li>
<li>Give it a name (I named ours “Strategy Unit GitHub Dashboard”).</li>
<li>For the Homepage URL, set it to where the app will be deployed.</li>
<li>Skip down to <strong>Webhook</strong> and uncheck the <strong>Active</strong> checkbox.</li>
<li>Grant the app only the minimum permissions required. In my case, I gave <strong>repository metadata</strong> read access—additional permissions can be granted later if needed.</li>
<li>Leave <strong>Where can this GitHub App be Installed?</strong> set to <strong>Only on this account</strong>.</li>
<li>Click <strong>Create GitHub App</strong>.</li>
<li>On the newly created app page, a small menu should appear on the left with <strong>Install App</strong> near the bottom. Use that to install the app into your organisation.</li>
<li>Back on the app’s settings page, note the <strong>App ID</strong> near the top.</li>
<li>At the bottom of the settings page, click <strong>Generate a private key</strong>—this will download a private key for later use.</li>
</ol>
</section>
<section id="using-the-app" class="level3">
<h3 class="anchored" data-anchor-id="using-the-app">Using the App</h3>
<p>We can now use the app to authenticate with the GitHub API. But to perform requests—like listing repositories—we still need a token.</p>
<p><em>Wait, I thought we were trying to avoid using PATs?</em></p>
<p>Well… yes. But we’ll use the GitHub App to generate a PAT for us!<br>
Let me outline the workflow and show how to generate the token using R and the <code>{httr2}</code> package.</p>
<p>If you haven’t used <code>{httr2}</code> before, the final code example includes extra comments explaining what’s going on.</p>
<section id="generate-a-jwt" class="level4">
<h4 class="anchored" data-anchor-id="generate-a-jwt">1. Generate a JWT</h4>
<p>First, we need to create a <a href="https://jwt.io/">JSON Web Token (JWT)</a> issued by our app (using the App ID and private key from earlier):</p>
<div class="cell">
<details class="code-fold">
<summary>show code for get_github_jwt()</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' Get GitHub JWT for an Application</span></span>
<span id="cb1-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#'</span></span>
<span id="cb1-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param key Path to the private key file or a string containing the private</span></span>
<span id="cb1-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#'     key. Defaults to the environment variable `GITHUB_APP_PRIVATE_KEY`.</span></span>
<span id="cb1-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param app_id GitHub App ID. Defaults to the environment variable</span></span>
<span id="cb1-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#'     `GITHUB_APP_ID`.</span></span>
<span id="cb1-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param expiry_time Expiry time for the JWT in seconds. Defaults to 30s.</span></span>
<span id="cb1-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#'</span></span>
<span id="cb1-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @return A JSON Web Token (JWT) for the GitHub App.</span></span>
<span id="cb1-10">get_github_jwt <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(</span>
<span id="cb1-11">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">key =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Sys.getenv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"GITHUB_APP_PRIVATE_KEY"</span>),</span>
<span id="cb1-12">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">app_id =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Sys.getenv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"GITHUB_APP_ID"</span>),</span>
<span id="cb1-13">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">expiry_time =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>) {</span>
<span id="cb1-14">  private_key <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> openssl<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read_key</span>(key)</span>
<span id="cb1-15"></span>
<span id="cb1-16">  now <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.numeric</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Sys.time</span>())</span>
<span id="cb1-17">  claim <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">jwt_claim</span>(</span>
<span id="cb1-18">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">iat =</span> now,</span>
<span id="cb1-19">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">exp =</span> now <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> expiry_time,</span>
<span id="cb1-20">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">iss =</span> app_id</span>
<span id="cb1-21">  )</span>
<span id="cb1-22"></span>
<span id="cb1-23">  httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">jwt_encode_sig</span>(claim, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">key =</span> private_key)</span>
<span id="cb1-24">}</span></code></pre></div></div>
</details>
</div>
</section>
<section id="get-the-installation-id-for-the-app" class="level4">
<h4 class="anchored" data-anchor-id="get-the-installation-id-for-the-app">2. Get the Installation ID for the App</h4>
<p>Next, we need the App’s installation ID.</p>
<p>You could find it manually via your organisation’s settings page under Installed Apps, but that’s cumbersome. Instead, we’ll use the API and our JWT to fetch it. Since we created the app and restricted installation to our org only, we can assume there’s just one installation.</p>
<div class="cell">
<details class="code-fold">
<summary>show code for get_github_app_installation_id()</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' Get the Installation ID for a GitHub App</span></span>
<span id="cb2-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#'</span></span>
<span id="cb2-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' Assumes the app has a single installation and returns the first one listed.</span></span>
<span id="cb2-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#'</span></span>
<span id="cb2-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param jwt JSON Web Token (JWT) for the GitHub App. Defaults to the output of</span></span>
<span id="cb2-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#'     `get_github_jwt()`.</span></span>
<span id="cb2-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param github_api_ep The base URL for the GitHub API. Defaults to</span></span>
<span id="cb2-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#'     "https://api.github.com/".</span></span>
<span id="cb2-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#'</span></span>
<span id="cb2-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @return The ID of the app's first installation.</span></span>
<span id="cb2-11">get_github_app_installation_id <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(</span>
<span id="cb2-12">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">jwt =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_github_jwt</span>(),</span>
<span id="cb2-13">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">github_api_ep =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://api.github.com/"</span>) {</span>
<span id="cb2-14">  resp <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">request</span>(github_api_ep) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb2-15">    httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">req_url_path_append</span>(</span>
<span id="cb2-16">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"app"</span>,</span>
<span id="cb2-17">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"installations"</span></span>
<span id="cb2-18">    ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb2-19">    httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">req_method</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"GET"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb2-20">    httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">req_auth_bearer_token</span>(jwt) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb2-21">    httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">req_headers</span>(</span>
<span id="cb2-22">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Accept =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"application/vnd.github+json"</span></span>
<span id="cb2-23">    ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb2-24">    httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">req_perform</span>()</span>
<span id="cb2-25"></span>
<span id="cb2-26">  httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">resp_check_status</span>(resp)</span>
<span id="cb2-27"></span>
<span id="cb2-28">  httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">resp_body_json</span>(resp)[[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]][[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"id"</span>]]</span>
<span id="cb2-29">}</span></code></pre></div></div>
</details>
</div>
</section>
<section id="generate-a-pat" class="level4">
<h4 class="anchored" data-anchor-id="generate-a-pat">3. Generate a PAT</h4>
<p>We’re now ready to generate the token we’ll use for API requests.</p>
<div class="cell">
<details class="code-fold">
<summary>show code for get_github_iat_pat()</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' Get GitHub PAT from Installation Access Token</span></span>
<span id="cb3-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#'</span></span>
<span id="cb3-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param jwt JSON Web Token (JWT) for the GitHub App. Defaults to the output of</span></span>
<span id="cb3-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#'     `get_github_jwt()`.</span></span>
<span id="cb3-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param installation_id GitHub Installation ID. Defaults to the output of</span></span>
<span id="cb3-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#'     `get_github_app_installation_id()`.</span></span>
<span id="cb3-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param github_api_ep The base URL for the GitHub API. Defaults to</span></span>
<span id="cb3-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#'     "https://api.github.com/".</span></span>
<span id="cb3-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#'</span></span>
<span id="cb3-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @return A personal access token (PAT) with permissions granted to the app.</span></span>
<span id="cb3-11">get_github_iat_pat <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(</span>
<span id="cb3-12">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">jwt =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_github_jwt</span>(),</span>
<span id="cb3-13">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">installation_id =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_github_app_installation_id</span>(),</span>
<span id="cb3-14">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">github_api_ep =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://api.github.com/"</span>) {</span>
<span id="cb3-15">  resp <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">request</span>(github_api_ep) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb3-16">    httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">req_url_path_append</span>(</span>
<span id="cb3-17">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"app"</span>,</span>
<span id="cb3-18">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"installations"</span>,</span>
<span id="cb3-19">      installation_id,</span>
<span id="cb3-20">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"access_tokens"</span></span>
<span id="cb3-21">    ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb3-22">    httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">req_auth_bearer_token</span>(jwt) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb3-23">    httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">req_headers</span>(</span>
<span id="cb3-24">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Accept =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"application/vnd.github+json"</span></span>
<span id="cb3-25">    ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb3-26">    httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">req_method</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"POST"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb3-27">    httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">req_perform</span>()</span>
<span id="cb3-28"></span>
<span id="cb3-29">  httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">resp_check_status</span>(resp)</span>
<span id="cb3-30"></span>
<span id="cb3-31">  httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">resp_body_json</span>(resp) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb3-32">    purrr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pluck</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"token"</span>)</span>
<span id="cb3-33">}</span></code></pre></div></div>
</details>
</div>
</section>
</section>
</section>
<section id="putting-it-all-together" class="level2">
<h2 class="anchored" data-anchor-id="putting-it-all-together">Putting it all together</h2>
<p>Now that we can generate a token using our app, we can write a function to query the list of repositories.</p>
<p>We need to keep in mind that the API returns a maximum of 100 items per page. Fortunately, <code>{httr2}</code> makes it easy to iterate through paginated responses.</p>
<div class="cell">
<details class="code-fold">
<summary>show code for get_repos()</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' Get GitHub Repositories for an organisation</span></span>
<span id="cb4-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#'</span></span>
<span id="cb4-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param org The name of the GitHub organisation.</span></span>
<span id="cb4-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param pat Personal Access Token (PAT) for authentication. Defaults to the</span></span>
<span id="cb4-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#'     output of `get_github_iat_pat()`.</span></span>
<span id="cb4-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param github_api_ep The base URL for the GitHub API. Defaults to</span></span>
<span id="cb4-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#'     "https://api.github.com/".</span></span>
<span id="cb4-8">get_repos <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(</span>
<span id="cb4-9">    org,</span>
<span id="cb4-10">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">pat =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_github_iat_pat</span>(),</span>
<span id="cb4-11">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">github_api_ep =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://api.github.com/"</span>) {</span>
<span id="cb4-12">  req <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">request</span>(github_api_ep) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-13">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># build the url up, this should create something like</span></span>
<span id="cb4-14">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># https://api.github.com/orgs/YOUR_ORG/repos</span></span>
<span id="cb4-15">    httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">req_url_path_append</span>(</span>
<span id="cb4-16">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"orgs"</span>,</span>
<span id="cb4-17">      org,</span>
<span id="cb4-18">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"repos"</span></span>
<span id="cb4-19">    ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-20">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># add the correct request header for authentication using our PAT</span></span>
<span id="cb4-21">    httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">req_auth_bearer_token</span>(pat) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-22">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># additional headers GitHub expects to be passed to their API</span></span>
<span id="cb4-23">    httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">req_headers</span>(</span>
<span id="cb4-24">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Accept =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"application/vnd.github+json"</span>,</span>
<span id="cb4-25">      <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"X-GitHub-Api-Version"</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2022-11-28"</span></span>
<span id="cb4-26">    ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-27">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># append url query parameters, this should look something like</span></span>
<span id="cb4-28">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># https://api.github.com/orgs/YOUR_ORG/repos?per_page=100&amp;page=1&amp;sort=created</span></span>
<span id="cb4-29">    httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">req_url_query</span>(</span>
<span id="cb4-30">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">per_page =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>, <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># anything between 1 and 100 max, as per the docs</span></span>
<span id="cb4-31">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">page =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,</span>
<span id="cb4-32">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">sort =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"created"</span></span>
<span id="cb4-33">    )</span>
<span id="cb4-34"></span>
<span id="cb4-35">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># because the API will only return a maximum of 100 items at a time, we need</span></span>
<span id="cb4-36">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># to query multiple times for each page of results.</span></span>
<span id="cb4-37">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># {httr2} makes this super easy, as the GitHub api returns page links in the</span></span>
<span id="cb4-38">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Link header as per RFC 8288 (https://datatracker.ietf.org/doc/html/rfc8288)</span></span>
<span id="cb4-39">  resps <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">req_perform_iterative</span>(</span>
<span id="cb4-40">    req,</span>
<span id="cb4-41">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">next_req =</span> httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">iterate_with_link_url</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">rel =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"next"</span>)</span>
<span id="cb4-42">  )</span>
<span id="cb4-43"></span>
<span id="cb4-44">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ensure that we got a non-error response for each request</span></span>
<span id="cb4-45">  purrr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">walk</span>(resps, httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>resp_check_status)</span>
<span id="cb4-46"></span>
<span id="cb4-47">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># get the data from each response, iterate over them and just extract the</span></span>
<span id="cb4-48">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># "name" field that is returned for each item</span></span>
<span id="cb4-49">  resps <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-50">    httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">resps_data</span>(httr2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span>resp_body_json) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-51">    purrr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map_chr</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span>)</span>
<span id="cb4-52">}</span></code></pre></div></div>
</details>
</div>
<p>Finally, run the function like this:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># replace the below as required</span></span>
<span id="cb5-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Sys.setenv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"GITHUB_APP_ID"</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"[app_id]"</span>)</span>
<span id="cb5-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Sys.setenv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"GITHUB_APP_PRIVATE_KEY"</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"path-to-your.private-key.pem"</span>)</span>
<span id="cb5-4"></span>
<span id="cb5-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_repos</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Your Organisation"</span>)</span></code></pre></div></div>
</div>


</section>

 ]]></description>
  <category>GitHub</category>
  <category>learning</category>
  <category>deployment</category>
  <guid>https://the-strategy-unit.github.io/data_science/blogs/posts/2025-04-30_using-github-apps-for-authentication/</guid>
  <pubDate>Wed, 30 Apr 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Mapping my R journey so far: ten things that I have done along the way</title>
  <dc:creator>Sheila Ali</dc:creator>
  <link>https://the-strategy-unit.github.io/data_science/blogs/posts/2024-11-29-mapping-my-r-learning/</link>
  <description><![CDATA[ 





<p>This blog post follows up on a talk I gave last year at Coffee and Coding about my experiences of learning how to code using RStudio. Here I build on that talk to share some more reflections and advice for others who are starting out on their R learning journey.</p>
<ol type="1">
<li><strong>I faced up to my fears</strong></li>
</ol>
<p>I have tried to learn R a few times over several years, with mixed success. When I first tried learning it a few years ago, I only managed to learn some basics. The second time, I was going through a crisis of confidence about my ability, and so when I had difficulties with learning R, I thought it was more evidence to show that I couldn’t do it. I tried again, and got to the stage of making a plot with some of the data that was included with RStudio. Soon after that I got swept up in the demands of everyday life, and gradually my work moved away from the world of quantitative data into qualitative research, and I had fewer opportunities to use R. Still, in the back of my mind I had this strange feeling of both wanting to avoid R and wondering what it would have been like if I had persisted with learning it.</p>
<p>A couple of years later, when I started my current job, I heard about the NHS-R community, and felt encouraged to learn R again. I tried to join my colleagues who were participating in <a href="https://adventofcode.com/2024/about">Advent of Code</a>. But I couldn’t understand a lot of what was going on, and when I tried to participate in some of the exercises, I immediately hit some hurdles with the basics, which was discouraging.</p>
<p>It seemed important to try and change my approach, so that learning R didn’t seem so daunting. I came across the <a href="https://cran.r-project.org/web/packages/aRtsy/readme/README.html">aRtsy</a> package and was amazed by the colourful and intricate artwork that it could produce. But better still, all of the code was open-source. I experimented with the code, making very small changes to see what kind of images it would create.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2024-11-29-mapping-my-r-learning/Nebula.jpg" class="img-fluid figure-img"></p>
<figcaption>An image generated using the <a href="https://cran.r-project.org/web/packages/aRtsy/readme/README.html">aRtsy</a> package and the <a href="https://koenderks.github.io/aRtsy/#nebula">canvas_nebula</a> function</figcaption>
</figure>
</div>
<p>I also discovered colour palettes such as those in the <a href="https://github.com/karthik/wesanderson">wesanderson</a> package, and tried experimenting with those along with the generative art functions. I soon found that my fear of R was quickly replaced by a geeky fascination with all of the beautiful artwork that could be created with only a few lines of code. It felt like a low-stakes situation, because the worst that could happen was that the code wouldn’t work. Suddenly, the process of coding felt less intimidating, and it had opened up a wealth of possibilities<sup>1</sup>.</p>
<ol start="2" type="1">
<li><strong>I found a supportive community</strong></li>
</ol>
<p>The great thing about R is that it is free and open source. I believe this lends itself well to a culture of shared learning. When I joined the SU’s <a href="https://the-strategy-unit.github.io/data_science/blogs/posts/2024-05-13_one-year-coffee-code/">Coffee and Coding</a> sessions and <a href="https://nhsrway.nhsrcommunity.com/community-handbook.html#coffee-and-coding">NHS-R community’s Coffee and Code</a>, I felt like a child asking very silly questions, but to my surprise, all of the people I have met have been keen to answer my questions. I learned to recognise and value the people in those communities who would encourage me and fellow learners by making time to answer our questions and help us learn.</p>
<ol start="3" type="1">
<li><strong>I approached learning R like I would approach learning any other language</strong></li>
</ol>
<p>This meant learning some of the key words and phrases, and getting exposure to the language in various ways: reading learning materials, watching tutorials, spending time with people who were using it, and writing my own code. This had an incremental effect: over time, the more information I absorbed, the more familiar I became with the terminology.</p>
<ol start="4" type="1">
<li><strong>I set myself a goal and structured my learning to help me reach it</strong></li>
</ol>
<p>In my day job, I was working on a qualitative case study and wanted to illustrate my findings using geospatial and population density data in the form of a choropleth map. Unfortunately this was one of the most challenging tasks I could have chosen as an R novice, but luckily, I had kind mentors who both believed I could achieve the task and were also on hand to help me learn the skills I needed. So I set myself the goal of trying to learn how to create a choropleth map by the end of the year. This involved breaking the task down into steps, and learning skills which I could build on along the way. I celebrated my small wins, even the tiny ones, until I achieved the goals I set for myself.</p>
<ol start="5" type="1">
<li><strong>I figured out how I learn best</strong></li>
</ol>
<p>This involved watching tutorials on YouTube, working through books (such as <a href="https://r4ds.hadley.nz/">R for Data Science</a> and <a href="https://r4np.com/">R for non-programmers</a>), trying out online coding courses, using search engines and forums, and asking my colleagues and mentors for advice about what resources I should look at as well as what to avoid.</p>
<p>Although learning resources were plentiful, I faced some common barriers when trying to use them. Tutorials were often not written in a way that let me reproduce the code or access the data they cited, or they were written in very technical language, which meant that I had to go away and learn some key concepts to be able to understand them properly. Therefore an important part of the learning journey for me has been to gradually build up a vocabulary of words and concepts in RStudio. This has enabled me to better understand what key concepts I need to learn, and to understand the content of any training materials or tutorials. I realised that chipping away at it, spending an hour here and there, several times a week, was the best approach for me, with some bigger blocks of time set aside occasionally for more difficult tasks where I could spend a couple of hours trying out different things or understanding the problem in more depth.</p>
<ol start="6" type="1">
<li><strong>I applied what I was learning to real data</strong></li>
</ol>
<p>When I became more confident with trying out some packages and functions in R, I decided to find opportunities to apply my learning to real data. I practised using the inbuilt datasets in RStudio, the palmerpenguins dataset, and the datasets that were referred to in the books and learning resources I was using. For creating my choropleth maps, I then used data from the UK <a href="https://www.ons.gov.uk/census">Census</a> as well as geographical data on local authority boundaries. Applying my learning to real data was an essential step in learning some of the key data wrangling skills.</p>
<ol start="7" type="1">
<li><strong>I embraced failure and started using it as a tool for learning</strong></li>
</ol>
<p>Over time I understood that failure is part of the learning journey, and a helpful tool for the learning process itself. If I could figure out what didn’t work, that often gave me information about what had gone wrong. This was useful as it either pointed me towards what I needed to fix, or gave me the words and concepts I could look into to help me solve the problem. Sometimes the process of trying to learn different functions accidentally produced hilariously terrible results<sup>2</sup>.</p>
<p>As well as providing some humour to contrast with the often frustrating process of learning to code, these failures also helped me to get unstuck. More often than not, they were a catalyst for problem-solving as they provided useful information about what specific aspect of the code had gone wrong, which would give me a clue about what I needed to look into to fix the problem.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2024-11-29-mapping-my-r-learning/Terrible map.png" class="img-fluid figure-img"></p>
<figcaption>This is an example of one of the terrible maps I accidentally made, where the map ended up so small it could have been a data point, the axes were changed beyond recognition, and one of the map markers, which was supposed to be located in the North of England, ended up with coordinates somewhere in the Atlantic Ocean.</figcaption>
</figure>
</div>
<ol start="8" type="1">
<li><strong>I looked for inspiration to encourage me to keep going</strong></li>
</ol>
<p>One of my worries about trying to learn R was that learning new things took more time, now that I was years older than the last time I tried. But I was fairly confident that there must have been other people out there who had successfully learned how to code when they were my age or older. This led to a fascinating rabbit hole of learning about people who had successfully learned to code later in life and the hidden history of <a href="https://www.codecademy.com/resources/blog/eniac-six-women-programmed-computer/">women in coding</a>. I bookmarked these stories so that I could revisit them on days when I was having a difficult time understanding a particular concept or getting my code to work.</p>
<ol start="9" type="1">
<li><strong>I <em>made it sew</em></strong></li>
</ol>
<p>Throughout my R learning journey, I have found that coding has been a useful conduit for my creativity, and similarly, my creative projects outside of work have been a catalyst for learning some key concepts related to coding<sup>3</sup>.</p>
<p>I realised this a few months ago when my friend got me a beginner’s embroidery kit, and as I followed the pattern and learned how to create the different types of embroidery stitch, I reflected that just like with the embroidery pattern I was working on, I needed to structure the coding for the map in <a href="https://ggplot2.tidyverse.org/reference/layer_geoms.html#:~:text=In%20ggplot2%2C%20a%20plot%20in,displayed%2C%20not%20what%20is%20displayed.">layers</a>. This led me to approach the process like I would for an art project<sup>4</sup> to identify what I needed to do to adequately visualise both types of data that I wanted to include in the map.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2024-11-29-mapping-my-r-learning/R textile map.jpg" class="img-fluid figure-img" alt="Textile art piece showing a map with the letter R - for decorative purposes only" width="384"></p>
<figcaption>An abstract textile art piece that I made to illustrate the blog post. This symbolises my non-linear R learning journey - with overlapping and convoluted pathways, dead ends, and roadblocks along the way.</figcaption>
</figure>
</div>
<ol start="10" type="1">
<li><strong>I started learning about how to stay involved in the community</strong></li>
</ol>
<p>As I write this, it has been over a year since I re-started my R learning journey in earnest. Early on in the journey, I remember feeling overwhelmed by the kindness and helpfulness of the community. I decided to channel these feelings into learning as best I could, so that I could then pass the learning on. I was reminded of this when I attended the most recent <a href="https://nhsrcommunity.com/conference24.html">RPYSOC conference</a> where I once again experienced the warm sense of collaboration and community in NHS-R and NHS.pycom. Therefore my aim for 2025 and beyond is to continue my R learning journey (and become more familiar with GitHub), so that I can give back to the wonderful communities that helped me to find my way.</p>




<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>If this topic is of interest, I would recommend getting involved in the <a href="https://github.com/rfordatascience/tidytuesday">Tidy Tuesday community activity</a> and also having a look at <a href="https://nrennie.rbind.io/tidytuesday-shiny-app/">Nicola Rennie’s data visualisations</a>.↩︎</p></li>
<li id="fn2"><p>This amused me greatly, as a fan of the <a href="https://www.instagram.com/terriblemap/p/DCh2NhfB2JX/">Terrible Maps social media pages</a>.↩︎</p></li>
<li id="fn3"><p>This has also worked the other way around, with my R learning journey helping me with learning new crafts. I have recently begun learning sewing and dressmaking. I have quickly found that the learning journey is just as intimidating, meticulous and complicated as it was for learning R. I have also unintentionally chosen a very complicated project for a beginner, which has resulted in a very steep learning curve and lots of failures and mistakes along the way. Throughout the process, I have applied some of the same principles as I did for learning coding. For example, one of the key parts of my journey of learning sewing and dressmaking has been the process of embracing and learning from failure. This has been essential both in terms of knowing what not to do next time, but also to learn how to fix mistakes, ideally early on in a practice situation (e.g.&nbsp;when creating a mock-up). Luckily there is a large community of supportive fellow learners and patient mentors, who are keen to help with fixing mistakes and to pass on their knowledge to new learners. I’m pleased to say, with a lot of help (and many failures) along the way, I did eventually manage to produce three choropleth maps and submitted them with the report late last year.↩︎</p></li>
<li id="fn4"><p>Throughout the journey I have realised that thinking about the problem like an artist has been very helpful, because it allows me to use a similarly iterative approach. I wanted my choropleth maps to show both the population density and the underlying terrain when superimposed. To do this, I used the <a href="https://colorbrewer2.org/#type=sequential&amp;scheme=BuGn&amp;n=3">colorbrewer2</a> tool to test out different colour palettes, and changed the opacity and terrain to identify which colours would clearly show the population data and the terrain underneath. The tool let me test this on an example map and showed me the hexadecimal colour codes for the colours in the palettes. Once I had found some combinations that would likely work for my particular map, I then iteratively adjusted the aesthetics in my R code until I found a combination that worked for my data. &nbsp;↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>learning</category>
  <guid>https://the-strategy-unit.github.io/data_science/blogs/posts/2024-11-29-mapping-my-r-learning/</guid>
  <pubDate>Mon, 10 Mar 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Introduction to text vectorization</title>
  <dc:creator>YiWen Hon</dc:creator>
  <link>https://the-strategy-unit.github.io/data_science/blogs/posts/2025-01-03_text-vectorization/NLP - text vectorisation.html</link>
  <description><![CDATA[ 





<section id="what-is-vectorisation-and-why-do-we-need-to-do-it" class="level1">
<h1>What is vectorisation and why do we need to do it?</h1>
<p>This post consists of the Jupyter notebook that was used during a Coffee &amp; Coding session providing an overview of text vectorization, a key concept in Natural Language Processing.</p>
<p>Let’s take as our first example a dataset of reviews from IMDB. The aim is to classify whether each review is positive or negative, based on the words in its text.</p>
<div id="8f91e146-7a4f-410a-8922-61d65baa8c6d" class="cell" data-execution_count="1">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb1-2">pd.set_option(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'display.max_colwidth'</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">400</span>)</span>
<span id="cb1-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Dataset from  https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download</span></span>
<span id="cb1-4">data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'IMDB Dataset.csv'</span>)</span>
<span id="cb1-5">data</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="1">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">review</th>
<th data-quarto-table-cell-role="th">sentiment</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">0</th>
<td>One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.&lt;br /&gt;&lt;br /&gt;The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regard...</td>
<td>positive</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">1</th>
<td>A wonderful little production. &lt;br /&gt;&lt;br /&gt;The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. &lt;br /&gt;&lt;br /&gt;The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the ref...</td>
<td>positive</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">2</th>
<td>I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof...</td>
<td>positive</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">3</th>
<td>Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet &amp; his parents are fighting all the time.&lt;br /&gt;&lt;br /&gt;This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.&lt;br /&gt;&lt;br /&gt;OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Paren...</td>
<td>negative</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">4</th>
<td>Petter Mattei's "Love in the Time of Money" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. &lt;br /&gt;&lt;br /&gt;This being a variation on the Arthur Schnitzler's play about the same theme, the director transfers the action t...</td>
<td>positive</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">...</th>
<td>...</td>
<td>...</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">49995</th>
<td>I thought this movie did a down right good job. It wasn't as creative or original as the first, but who was expecting it to be. It was a whole lotta fun. the more i think about it the more i like it, and when it comes out on DVD I'm going to pay the money for it very proudly, every last cent. Sharon Stone is great, she always is, even if her movie is horrible(Catwoman), but this movie isn't, t...</td>
<td>positive</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">49996</th>
<td>Bad plot, bad dialogue, bad acting, idiotic directing, the annoying porn groove soundtrack that ran continually over the overacted script, and a crappy copy of the VHS cannot be redeemed by consuming liquor. Trust me, because I stuck this turkey out to the end. It was so pathetically bad all over that I had to figure it was a fourth-rate spoof of Springtime for Hitler.&lt;br /&gt;&lt;br /&gt;The girl who ...</td>
<td>negative</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">49997</th>
<td>I am a Catholic taught in parochial elementary schools by nuns, taught by Jesuit priests in high school &amp; college. I am still a practicing Catholic but would not be considered a "good Catholic" in the church's eyes because I don't believe certain things or act certain ways just because the church tells me to.&lt;br /&gt;&lt;br /&gt;So back to the movie...its bad because two people are killed by this nun w...</td>
<td>negative</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">49998</th>
<td>I'm going to have to disagree with the previous comment and side with Maltin on this one. This is a second rate, excessively vicious Western that creaks and groans trying to put across its central theme of the Wild West being tamed and kicked aside by the steady march of time. It would like to be in the tradition of "Butch Cassidy and the Sundance Kid", but lacks that film's poignancy and char...</td>
<td>negative</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">49999</th>
<td>No one expects the Star Trek movies to be high art, but the fans do expect a movie that is as good as some of the best episodes. Unfortunately, this movie had a muddled, implausible plot that just left me cringing - this is by far the worst of the nine (so far) movies. Even the chance to watch the well known characters interact in another movie can't save this movie - including the goofy scene...</td>
<td>negative</td>
</tr>
</tbody>
</table>

<p>50000 rows × 2 columns</p>
</div>
</div>
</div>
<div id="9e1d193b-d805-427f-a1d8-6cf58c6add58" class="cell" data-execution_count="2">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Our dataset is balanced, with an equal number of positive and negative reviews.</span></span>
<span id="cb2-2">data[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'sentiment'</span>].value_counts()</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="2">
<pre><code>sentiment
positive    25000
negative    25000
Name: count, dtype: int64</code></pre>
</div>
</div>
<div id="f71b0820-4727-4c42-9624-68fbda534108" class="cell" data-execution_count="3">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># In this cell, we are trying to use a very basic machine learning model (Multinomial Naive Bayes) </span></span>
<span id="cb4-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># to predict the sentiment of the text (whether it was positive or negative) based on the text.</span></span>
<span id="cb4-3"></span>
<span id="cb4-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.naive_bayes <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> MultinomialNB</span>
<span id="cb4-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.model_selection <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> cross_validate</span>
<span id="cb4-6"></span>
<span id="cb4-7">naivebayes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> MultinomialNB()</span>
<span id="cb4-8"></span>
<span id="cb4-9">X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> data.review</span>
<span id="cb4-10"></span>
<span id="cb4-11">cv_nb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> cross_validate(</span>
<span id="cb4-12">    naivebayes,</span>
<span id="cb4-13">    X,</span>
<span id="cb4-14">    data.sentiment,</span>
<span id="cb4-15">    scoring <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"accuracy"</span></span>
<span id="cb4-16">)</span>
<span id="cb4-17"></span>
<span id="cb4-18"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(cv_nb[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'test_score'</span>].mean(),<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb4-19"></span>
<span id="cb4-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ⚠️ Uh oh!! we're getting an error... let's decode it together</span></span>
<span id="cb4-21"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ValueError: could not convert string to float (the model cannot work with the reviews as raw strings!)</span></span></code></pre></div></div>
<div class="cell-output cell-output-error">
<div class="ansi-escaped-output">
<pre><span class="ansi-red-fg ansi-bold">---------------------------------------------------------------------------</span>
<span class="ansi-red-fg ansi-bold">ValueError</span>                                Traceback (most recent call last)
Cell <span class="ansi-green-fg ansi-bold">In[3], line 11</span>
<span class="ansi-green-fg">      7</span> naivebayes <span style="color:rgb(98,98,98)">=</span> MultinomialNB()
<span class="ansi-green-fg">      9</span> X <span style="color:rgb(98,98,98)">=</span> data<span style="color:rgb(98,98,98)">.</span>review
<span class="ansi-green-fg ansi-bold">---&gt; 11</span> cv_nb <span style="color:rgb(98,98,98)">=</span> cross_validate(
<span class="ansi-green-fg">     12</span>     naivebayes,
<span class="ansi-green-fg">     13</span>     X,
<span class="ansi-green-fg">     14</span>     data<span style="color:rgb(98,98,98)">.</span>sentiment,
<span class="ansi-green-fg">     15</span>     scoring <span style="color:rgb(98,98,98)">=</span> <span style="color:rgb(175,0,0)">"</span><span style="color:rgb(175,0,0)">accuracy</span><span style="color:rgb(175,0,0)">"</span>
<span class="ansi-green-fg">     16</span> )
<span class="ansi-green-fg">     18</span> <span style="color:rgb(0,135,0)">round</span>(cv_nb[<span style="color:rgb(175,0,0)">'</span><span style="color:rgb(175,0,0)">test_score</span><span style="color:rgb(175,0,0)">'</span>]<span style="color:rgb(98,98,98)">.</span>mean(),<span style="color:rgb(98,98,98)">2</span>)

File <span class="ansi-green-fg ansi-bold">c:\Users\Yiwen.Hon\AppData\Local\miniconda3\envs\nlp\Lib\site-packages\sklearn\utils\_param_validation.py:211</span>, in <span class="ansi-cyan-fg">validate_params.&lt;locals&gt;.decorator.&lt;locals&gt;.wrapper</span><span class="ansi-blue-fg ansi-bold">(*args, **kwargs)</span>
<span class="ansi-green-fg">    205</span> <span style="font-weight:bold;color:rgb(0,135,0)">try</span>:
<span class="ansi-green-fg">    206</span>     <span style="font-weight:bold;color:rgb(0,135,0)">with</span> config_context(
<span class="ansi-green-fg">    207</span>         skip_parameter_validation<span style="color:rgb(98,98,98)">=</span>(
<span class="ansi-green-fg">    208</span>             prefer_skip_nested_validation <span style="font-weight:bold;color:rgb(175,0,255)">or</span> global_skip_validation
<span class="ansi-green-fg">    209</span>         )
<span class="ansi-green-fg">    210</span>     ):
<span class="ansi-green-fg ansi-bold">--&gt; 211</span>         <span style="font-weight:bold;color:rgb(0,135,0)">return</span> func(<span style="color:rgb(98,98,98)">*</span>args, <span style="color:rgb(98,98,98)">*</span><span style="color:rgb(98,98,98)">*</span>kwargs)
<span class="ansi-green-fg">    212</span> <span style="font-weight:bold;color:rgb(0,135,0)">except</span> InvalidParameterError <span style="font-weight:bold;color:rgb(0,135,0)">as</span> e:
<span class="ansi-green-fg">    213</span>     <span style="font-style:italic;color:rgb(95,135,135)"># When the function is just a wrapper around an estimator, we allow</span>
<span class="ansi-green-fg">    214</span>     <span style="font-style:italic;color:rgb(95,135,135)"># the function to delegate validation to the estimator, but we replace</span>
<span class="ansi-green-fg">    215</span>     <span style="font-style:italic;color:rgb(95,135,135)"># the name of the estimator by the name of the function in the error</span>
<span class="ansi-green-fg">    216</span>     <span style="font-style:italic;color:rgb(95,135,135)"># message to avoid confusion.</span>
<span class="ansi-green-fg">    217</span>     msg <span style="color:rgb(98,98,98)">=</span> re<span style="color:rgb(98,98,98)">.</span>sub(
<span class="ansi-green-fg">    218</span>         <span style="color:rgb(175,0,0)">r</span><span style="color:rgb(175,0,0)">"</span><span style="color:rgb(175,0,0)">parameter of </span><span style="color:rgb(175,0,0)">\</span><span style="color:rgb(175,0,0)">w+ must be</span><span style="color:rgb(175,0,0)">"</span>,
<span class="ansi-green-fg">    219</span>         <span style="color:rgb(175,0,0)">f</span><span style="color:rgb(175,0,0)">"</span><span style="color:rgb(175,0,0)">parameter of </span><span style="font-weight:bold;color:rgb(175,95,135)">{</span>func<span style="color:rgb(98,98,98)">.</span><span style="color:rgb(0,0,135)">__qualname__</span><span style="font-weight:bold;color:rgb(175,95,135)">}</span><span style="color:rgb(175,0,0)"> must be</span><span style="color:rgb(175,0,0)">"</span>,
<span class="ansi-green-fg">    220</span>         <span style="color:rgb(0,135,0)">str</span>(e),
<span class="ansi-green-fg">    221</span>     )

File <span class="ansi-green-fg ansi-bold">c:\Users\Yiwen.Hon\AppData\Local\miniconda3\envs\nlp\Lib\site-packages\sklearn\model_selection\_validation.py:328</span>, in <span class="ansi-cyan-fg">cross_validate</span><span class="ansi-blue-fg ansi-bold">(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, return_indices, error_score)</span>
<span class="ansi-green-fg">    308</span> parallel <span style="color:rgb(98,98,98)">=</span> Parallel(n_jobs<span style="color:rgb(98,98,98)">=</span>n_jobs, verbose<span style="color:rgb(98,98,98)">=</span>verbose, pre_dispatch<span style="color:rgb(98,98,98)">=</span>pre_dispatch)
<span class="ansi-green-fg">    309</span> results <span style="color:rgb(98,98,98)">=</span> parallel(
<span class="ansi-green-fg">    310</span>     delayed(_fit_and_score)(
<span class="ansi-green-fg">    311</span>         clone(estimator),
<span class="ansi-green-fg ansi-bold">   (...)</span>
<span class="ansi-green-fg">    325</span>     <span style="font-weight:bold;color:rgb(0,135,0)">for</span> train, test <span style="font-weight:bold;color:rgb(175,0,255)">in</span> indices
<span class="ansi-green-fg">    326</span> )
<span class="ansi-green-fg ansi-bold">--&gt; 328</span> _warn_or_raise_about_fit_failures(results, error_score)
<span class="ansi-green-fg">    330</span> <span style="font-style:italic;color:rgb(95,135,135)"># For callable scoring, the return type is only know after calling. If the</span>
<span class="ansi-green-fg">    331</span> <span style="font-style:italic;color:rgb(95,135,135)"># return type is a dictionary, the error scores can now be inserted with</span>
<span class="ansi-green-fg">    332</span> <span style="font-style:italic;color:rgb(95,135,135)"># the correct key.</span>
<span class="ansi-green-fg">    333</span> <span style="font-weight:bold;color:rgb(0,135,0)">if</span> <span style="color:rgb(0,135,0)">callable</span>(scoring):

File <span class="ansi-green-fg ansi-bold">c:\Users\Yiwen.Hon\AppData\Local\miniconda3\envs\nlp\Lib\site-packages\sklearn\model_selection\_validation.py:414</span>, in <span class="ansi-cyan-fg">_warn_or_raise_about_fit_failures</span><span class="ansi-blue-fg ansi-bold">(results, error_score)</span>
<span class="ansi-green-fg">    407</span> <span style="font-weight:bold;color:rgb(0,135,0)">if</span> num_failed_fits <span style="color:rgb(98,98,98)">==</span> num_fits:
<span class="ansi-green-fg">    408</span>     all_fits_failed_message <span style="color:rgb(98,98,98)">=</span> (
<span class="ansi-green-fg">    409</span>         <span style="color:rgb(175,0,0)">f</span><span style="color:rgb(175,0,0)">"</span><span style="font-weight:bold;color:rgb(175,95,0)">\n</span><span style="color:rgb(175,0,0)">All the </span><span style="font-weight:bold;color:rgb(175,95,135)">{</span>num_fits<span style="font-weight:bold;color:rgb(175,95,135)">}</span><span style="color:rgb(175,0,0)"> fits failed.</span><span style="font-weight:bold;color:rgb(175,95,0)">\n</span><span style="color:rgb(175,0,0)">"</span>
<span class="ansi-green-fg">    410</span>         <span style="color:rgb(175,0,0)">"</span><span style="color:rgb(175,0,0)">It is very likely that your model is misconfigured.</span><span style="font-weight:bold;color:rgb(175,95,0)">\n</span><span style="color:rgb(175,0,0)">"</span>
<span class="ansi-green-fg">    411</span>         <span style="color:rgb(175,0,0)">"</span><span style="color:rgb(175,0,0)">You can try to debug the error by setting error_score=</span><span style="color:rgb(175,0,0)">'</span><span style="color:rgb(175,0,0)">raise</span><span style="color:rgb(175,0,0)">'</span><span style="color:rgb(175,0,0)">.</span><span style="font-weight:bold;color:rgb(175,95,0)">\n</span><span style="font-weight:bold;color:rgb(175,95,0)">\n</span><span style="color:rgb(175,0,0)">"</span>
<span class="ansi-green-fg">    412</span>         <span style="color:rgb(175,0,0)">f</span><span style="color:rgb(175,0,0)">"</span><span style="color:rgb(175,0,0)">Below are more details about the failures:</span><span style="font-weight:bold;color:rgb(175,95,0)">\n</span><span style="font-weight:bold;color:rgb(175,95,135)">{</span>fit_errors_summary<span style="font-weight:bold;color:rgb(175,95,135)">}</span><span style="color:rgb(175,0,0)">"</span>
<span class="ansi-green-fg">    413</span>     )
<span class="ansi-green-fg ansi-bold">--&gt; 414</span>     <span style="font-weight:bold;color:rgb(0,135,0)">raise</span> <span style="font-weight:bold;color:rgb(215,95,95)">ValueError</span>(all_fits_failed_message)
<span class="ansi-green-fg">    416</span> <span style="font-weight:bold;color:rgb(0,135,0)">else</span>:
<span class="ansi-green-fg">    417</span>     some_fits_failed_message <span style="color:rgb(98,98,98)">=</span> (
<span class="ansi-green-fg">    418</span>         <span style="color:rgb(175,0,0)">f</span><span style="color:rgb(175,0,0)">"</span><span style="font-weight:bold;color:rgb(175,95,0)">\n</span><span style="font-weight:bold;color:rgb(175,95,135)">{</span>num_failed_fits<span style="font-weight:bold;color:rgb(175,95,135)">}</span><span style="color:rgb(175,0,0)"> fits failed out of a total of </span><span style="font-weight:bold;color:rgb(175,95,135)">{</span>num_fits<span style="font-weight:bold;color:rgb(175,95,135)">}</span><span style="color:rgb(175,0,0)">.</span><span style="font-weight:bold;color:rgb(175,95,0)">\n</span><span style="color:rgb(175,0,0)">"</span>
<span class="ansi-green-fg">    419</span>         <span style="color:rgb(175,0,0)">"</span><span style="color:rgb(175,0,0)">The score on these train-test partitions for these parameters</span><span style="color:rgb(175,0,0)">"</span>
<span class="ansi-green-fg ansi-bold">   (...)</span>
<span class="ansi-green-fg">    423</span>         <span style="color:rgb(175,0,0)">f</span><span style="color:rgb(175,0,0)">"</span><span style="color:rgb(175,0,0)">Below are more details about the failures:</span><span style="font-weight:bold;color:rgb(175,95,0)">\n</span><span style="font-weight:bold;color:rgb(175,95,135)">{</span>fit_errors_summary<span style="font-weight:bold;color:rgb(175,95,135)">}</span><span style="color:rgb(175,0,0)">"</span>
<span class="ansi-green-fg">    424</span>     )

<span class="ansi-red-fg ansi-bold">ValueError</span>: 
All the 5 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
1 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\Yiwen.Hon\AppData\Local\miniconda3\envs\nlp\Lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\Yiwen.Hon\AppData\Local\miniconda3\envs\nlp\Lib\site-packages\sklearn\base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Yiwen.Hon\AppData\Local\miniconda3\envs\nlp\Lib\site-packages\sklearn\naive_bayes.py", line 745, in fit
    X, y = self._check_X_y(X, y)
           ^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Yiwen.Hon\AppData\Local\miniconda3\envs\nlp\Lib\site-packages\sklearn\naive_bayes.py", line 578, in _check_X_y
    return self._validate_data(X, y, accept_sparse="csr", reset=reset)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Yiwen.Hon\AppData\Local\miniconda3\envs\nlp\Lib\site-packages\sklearn\base.py", line 621, in _validate_data
    X, y = check_X_y(X, y, **check_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Yiwen.Hon\AppData\Local\miniconda3\envs\nlp\Lib\site-packages\sklearn\utils\validation.py", line 1147, in check_X_y
    X = check_array(
        ^^^^^^^^^^^^
  File "c:\Users\Yiwen.Hon\AppData\Local\miniconda3\envs\nlp\Lib\site-packages\sklearn\utils\validation.py", line 917, in check_array
    array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Yiwen.Hon\AppData\Local\miniconda3\envs\nlp\Lib\site-packages\sklearn\utils\_array_api.py", line 380, in _asarray_with_order
    array = numpy.asarray(array, order=order, dtype=dtype)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Yiwen.Hon\AppData\Local\miniconda3\envs\nlp\Lib\site-packages\pandas\core\series.py", line 1022, in __array__
    arr = np.asarray(values, dtype=dtype)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: could not convert string to float: '"Sorte Nula" is the #1 Box Office Portuguese movie of 2004. This extreme low budget production (estimated USD$150,000) opened during Christmas opposite American Blockbusters like National Treasure, Polar Express, The Incredibles and Alexander but rapidly caught the adulation of the Portuguese moviegoers. Despite the harsh competition, the small film did surprisingly well, topping all other Portuguese films of the past two years in its first weeks. The film is a mystery/murder with a humorous tone cleverly written and directed by Fernando Fragata who has become a solid reference in the European independent film arena. Did I like the film? Oh, yes!'

--------------------------------------------------------------------------------
4 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\Yiwen.Hon\AppData\Local\miniconda3\envs\nlp\Lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\Yiwen.Hon\AppData\Local\miniconda3\envs\nlp\Lib\site-packages\sklearn\base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Yiwen.Hon\AppData\Local\miniconda3\envs\nlp\Lib\site-packages\sklearn\naive_bayes.py", line 745, in fit
    X, y = self._check_X_y(X, y)
           ^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Yiwen.Hon\AppData\Local\miniconda3\envs\nlp\Lib\site-packages\sklearn\naive_bayes.py", line 578, in _check_X_y
    return self._validate_data(X, y, accept_sparse="csr", reset=reset)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Yiwen.Hon\AppData\Local\miniconda3\envs\nlp\Lib\site-packages\sklearn\base.py", line 621, in _validate_data
    X, y = check_X_y(X, y, **check_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Yiwen.Hon\AppData\Local\miniconda3\envs\nlp\Lib\site-packages\sklearn\utils\validation.py", line 1147, in check_X_y
    X = check_array(
        ^^^^^^^^^^^^
  File "c:\Users\Yiwen.Hon\AppData\Local\miniconda3\envs\nlp\Lib\site-packages\sklearn\utils\validation.py", line 917, in check_array
    array = _asarray_with_order(array, order=order, dtype=dtype, xp=xp)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Yiwen.Hon\AppData\Local\miniconda3\envs\nlp\Lib\site-packages\sklearn\utils\_array_api.py", line 380, in _asarray_with_order
    array = numpy.asarray(array, order=order, dtype=dtype)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Yiwen.Hon\AppData\Local\miniconda3\envs\nlp\Lib\site-packages\pandas\core\series.py", line 1022, in __array__
    arr = np.asarray(values, dtype=dtype)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: could not convert string to float: "One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.&lt;br /&gt;&lt;br /&gt;The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.&lt;br /&gt;&lt;br /&gt;It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.&lt;br /&gt;&lt;br /&gt;I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences, forget charm, forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal, I couldn't say I was ready for it, but as I watched more, I developed a taste for Oz, and got accustomed to the high levels of graphic violence. Not just violence, but injustice (crooked guards who'll be sold out for a nickel, inmates who'll kill on order and get away with it, well mannered, middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz, you may become comfortable with what is uncomfortable viewing....thats if you can get in touch with your darker side."
</pre>
</div>
</div>
</div>
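<p>The last line of each traceback is the important one. As a minimal sketch of the same failure without scikit-learn (the <code>review</code> string below is made up for illustration, not a row from the dataset), Python's built-in <code>float</code> raises the same kind of error when it is handed free text:</p>

```python
# Hypothetical review text, used only for illustration
review = "A wonderful little production."

try:
    # Text cannot be interpreted as a number, so this raises ValueError
    float(review)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
```

<p>The model hits the same wall when it tries to turn the whole review column into a numeric array.</p>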
<section id="key-learning" class="level2">
<h2 class="anchored" data-anchor-id="key-learning">KEY LEARNING</h2>
<p>When working with text data, computers need to convert the words into numbers before they can work with them. Hence vectorisation: the process of converting words into numbers. There are a few different approaches and concepts, which we’ll explore today:</p>
<ul>
<li>Tokenization</li>
<li>Bag of words</li>
<li>TF-IDF</li>
<li>n-grams</li>
<li>Word2Vec embeddings</li>
</ul>
<p>Finally, we’ll look (at a very, very high level!) at how Transformer/attention-based approaches to word vectorisation have transformed NLP.</p>
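<p>To make "converting words into numbers" concrete, here is a deliberately naive sketch that is not one of the approaches listed above: it simply gives every distinct word an integer ID. The documents and the variable names (<code>docs</code>, <code>vocabulary</code>, <code>encoded</code>) are invented for this illustration.</p>

```python
# Toy documents, invented for this sketch
docs = ["the cat sat", "the dog sat down"]

# Build a vocabulary: each distinct word gets the next free integer ID
vocabulary = {}
for doc in docs:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)

# Replace each word with its ID
encoded = [[vocabulary[word] for word in doc.split()] for doc in docs]

print(vocabulary)  # {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3, 'down': 4}
print(encoded)     # [[0, 1, 2], [0, 3, 2, 4]]
```

<p>The IDs here carry no meaning beyond "this is the same word as before", which is one reason the richer approaches in the list exist.</p>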
</section>
<section id="tokenization" class="level2">
<h2 class="anchored" data-anchor-id="tokenization">Tokenization</h2>
<p>Tokenization means breaking up texts into their individual components, or tokens.</p>
<div id="15d31329-25e9-48b6-8bdf-0febd73ab855" class="cell" data-execution_count="4">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> nltk.tokenize <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> word_tokenize</span>
<span id="cb5-2"></span>
<span id="cb5-3">text <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Had a slight weapons malfunction but, uh everything's perfectly all right now. We're fine. We're all fine here now. Thank you. How are you?"</span></span>
<span id="cb5-4"></span>
<span id="cb5-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Document before tokenization</span></span>
<span id="cb5-6"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(text)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Had a slight weapons malfunction but, uh everything's perfectly all right now. We're fine. We're all fine here now. Thank you. How are you?</code></pre>
</div>
</div>
<div id="8b24b8c6-70f7-43b7-9bd7-abc24051e093" class="cell" data-execution_count="5">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1">word_tokens <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> word_tokenize(text)</span>
<span id="cb7-2"></span>
<span id="cb7-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Document after tokenization - each token is separated out. Contractions like "everything's" are now two tokens: "everything" and "'s"</span></span>
<span id="cb7-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(word_tokens)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>['Had', 'a', 'slight', 'weapons', 'malfunction', 'but', ',', 'uh', 'everything', "'s", 'perfectly', 'all', 'right', 'now', '.', 'We', "'re", 'fine', '.', 'We', "'re", 'all', 'fine', 'here', 'now', '.', 'Thank', 'you', '.', 'How', 'are', 'you', '?']</code></pre>
</div>
</div>
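<p>For intuition about what a tokenizer does, the sketch below uses only Python's standard <code>re</code> module. It is a rough approximation, not how NLTK's <code>word_tokenize</code> works: the function name <code>simple_tokenize</code> and the pattern are assumptions made for this example, and the results differ on contractions.</p>

```python
import re

def simple_tokenize(text):
    # A run of word characters is one token; each punctuation mark is its own token
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("We're fine. How are you?")
print(tokens)  # ['We', "'", 're', 'fine', '.', 'How', 'are', 'you', '?']
```

<p>Note that this naive pattern splits "We're" into three tokens, whereas <code>word_tokenize</code> above keeps "'re" together as a single token. Handling such cases sensibly is a good reason to rely on an established tokenizer.</p>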
</section>
<section id="some-terminology" class="level2">
<h2 class="anchored" data-anchor-id="some-terminology">Some terminology</h2>
<ul>
<li>Tokens: the smaller units (words, punctuation marks) that the text has been broken down into</li>
<li>Document: the unit of text we’re analysing. It could be a sentence, a paragraph, or a whole book; different breakdowns suit different purposes</li>
<li>Corpus: the collection of documents being analysed</li>
</ul>
</section>
<section id="bag-of-words" class="level2">
<h2 class="anchored" data-anchor-id="bag-of-words">Bag of words</h2>
<div id="38809574-719a-4831-bcd6-04573dbca33d" class="cell" data-execution_count="6">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1">texts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb9-2">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'I love to run'</span>,</span>
<span id="cb9-3">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'the cat does not eat fruit'</span>,</span>
<span id="cb9-4">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'run to the cat'</span>,</span>
<span id="cb9-5">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'I love to eat fruit. fruit fruit fruit fruit'</span></span>
<span id="cb9-6">]</span></code></pre></div></div>
</div>
<div id="bba90d4b-d321-4433-ad3d-15bba9896669" class="cell" data-execution_count="7">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.feature_extraction.text <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> CountVectorizer</span>
<span id="cb10-2"></span>
<span id="cb10-3">count_vectorizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> CountVectorizer()</span>
<span id="cb10-4">X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> count_vectorizer.fit_transform(texts)</span>
<span id="cb10-5">X.toarray()</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="7">
<pre><code>array([[0, 0, 0, 0, 1, 0, 1, 0, 1],
       [1, 1, 1, 1, 0, 1, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 1, 5, 1, 0, 0, 0, 1]], dtype=int64)</code></pre>
</div>
</div>
<p>Each column is a different word, and the count vectorizer simply counts how many times each word appears in each sentence.</p>
<p>🤔 Can you guess which column represents which word?</p>
<p>The easiest one to spot is the fourth column: the word “fruit” appears 5 times in the last sentence.</p>
<div id="2aba1c56-6c07-4ea0-9ade-2751de503a65" class="cell" data-scrolled="true" data-execution_count="8">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb12-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Visualising what the vectorizer has done</span></span>
<span id="cb12-2"></span>
<span id="cb12-3">vectorized_texts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame(</span>
<span id="cb12-4">    X.toarray(),</span>
<span id="cb12-5">    columns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> count_vectorizer.get_feature_names_out(),</span>
<span id="cb12-6">    index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> texts</span>
<span id="cb12-7">)</span>
<span id="cb12-8"></span>
<span id="cb12-9">vectorized_texts</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="8">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">cat</th>
<th data-quarto-table-cell-role="th">does</th>
<th data-quarto-table-cell-role="th">eat</th>
<th data-quarto-table-cell-role="th">fruit</th>
<th data-quarto-table-cell-role="th">love</th>
<th data-quarto-table-cell-role="th">not</th>
<th data-quarto-table-cell-role="th">run</th>
<th data-quarto-table-cell-role="th">the</th>
<th data-quarto-table-cell-role="th">to</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">I love to run</th>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">the cat does not eat fruit</th>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">run to the cat</th>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">I love to eat fruit. fruit fruit fruit fruit</th>
<td>0</td>
<td>0</td>
<td>1</td>
<td>5</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<p>We will try the same code from above - this time on the <em>vectorised</em> text instead of the raw text! This time we shouldn’t get any errors.</p>
<div id="9986c995-e6c3-4218-a81c-563aa2b534d8" class="cell" data-execution_count="9">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1">naivebayes <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> MultinomialNB()</span>
<span id="cb13-2">count_vectorizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> CountVectorizer()</span>
<span id="cb13-3"></span>
<span id="cb13-4">X <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> count_vectorizer.fit_transform(data.review)</span>
<span id="cb13-5"></span>
<span id="cb13-6">cv_nb <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> cross_validate(</span>
<span id="cb13-7">    naivebayes,</span>
<span id="cb13-8">    X,</span>
<span id="cb13-9">    data.sentiment,</span>
<span id="cb13-10">    scoring <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"accuracy"</span></span>
<span id="cb13-11">)</span>
<span id="cb13-12"></span>
<span id="cb13-13"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(cv_nb[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'test_score'</span>].mean(),<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="9">
<pre><code>0.85</code></pre>
</div>
</div>
<p>Our accuracy score is 85%, which isn’t too bad.</p>
<p>What are the limitations of this approach?</p>
<ul>
<li>No context</li>
<li>Word order not available</li>
<li>All words treated the same</li>
<li>Very simplistic approach!</li>
</ul>
</section>
<section id="tf-idf" class="level2">
<h2 class="anchored" data-anchor-id="tf-idf">TF-IDF</h2>
<p><strong>TERM FREQUENCY (TF)</strong></p>
<p>The more often a word appears in a document, relative to the other words in it, the more likely it is to be important to that document.</p>
<p>Example: if a word makes up a large share of the words in a document, it is probably important to the overall meaning of that document.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-01-03_text-vectorization/NLP - text vectorisation_files/figure-html/a8008e62-48cb-4ecc-ac9c-1a4ec4bceee4-1-4310763b-ece9-458a-b345-2dc4902b093b.png" class="img-fluid figure-img"></p>
<figcaption>image.png</figcaption>
</figure>
</div>
<div id="fee59c74-01e9-4abc-8b77-de687fa5ae11" class="cell" data-execution_count="10">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1">texts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb15-2">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'I love to run'</span>,</span>
<span id="cb15-3">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'the cat does not eat the fruit'</span>,</span>
<span id="cb15-4">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'run to the cat'</span>,</span>
<span id="cb15-5">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'I love to eat the fruit. fruit fruit fruit fruit'</span></span>
<span id="cb15-6">]</span></code></pre></div></div>
</div>
<div id="a2f947df-e050-40f7-b30e-69fbfe1d19fa" class="cell" data-execution_count="11">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># In document 4, the Term Frequency (TF) of the word FRUIT is?</span></span>
<span id="cb16-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># The word fruit appears 5 times</span></span>
<span id="cb16-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># There are 10 words in the document</span></span>
<span id="cb16-4"></span>
<span id="cb16-5"><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span></span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="11">
<pre><code>0.5</code></pre>
</div>
</div>
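<p>The same calculation can be done in code rather than by hand. This is a minimal sketch that assumes a very simple tokenizer (lowercase the text, then split on runs of word characters); the variable names are ours.</p>

```python
import re
from collections import Counter

doc = 'I love to eat the fruit. fruit fruit fruit fruit'

# Simple illustrative tokenizer: lowercase, then keep runs of word characters
words = re.findall(r"\w+", doc.lower())
counts = Counter(words)

# Term frequency = occurrences of the word / total words in the document
tf_fruit = counts["fruit"] / len(words)
print(tf_fruit)  # 0.5
```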
<p><strong>DOCUMENT FREQUENCY (DF)</strong></p>
<p>If a word appears in many documents of a corpus, it is not very useful for understanding any particular document.</p>
<p>Example: on eurosport.com/football, the word “football” appears in almost every article, so on this website “football” is an unimportant word!</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-01-03_text-vectorization/NLP - text vectorisation_files/figure-html/82900ce8-5b7e-47fa-beb6-209161aa5e68-1-67dfc1ad-72a7-4f91-8e47-5c2089494ced.png" class="img-fluid figure-img"></p>
<figcaption>image.png</figcaption>
</figure>
</div>
<p>For the word “football” on Eurosport, we would expect this formula to be close to 1 since the number of docs containing the word “football” will probably only be slightly less than the total number of docs (out of 100 maybe only 5 don’t have the word “football”, so we get 95/100).</p>
<p>If the word “football” appears in all the articles, it is not very useful for helping us distinguish between two articles. But if only a few documents contain words like “concussion” or “wellbeing” (e.g.&nbsp;they appear in 2/100 articles), those words will be much more useful in determining the topic of an article (it is probably specifically about player welfare).</p>
<p>💡 Thus the intuition of the term frequency - inverse document frequency approach is to give a high weight to any term which appears frequently in a single document, but not in too many documents of the corpus.</p>
<div id="a04a9f13-c46a-4940-ae5c-9bbb5f72e990" class="cell" data-execution_count="12">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb18-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Which words appear frequently in our small corpus</span></span>
<span id="cb18-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># and might not be useful for deriving meaning?</span></span>
<span id="cb18-3"></span>
<span id="cb18-4">texts <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb18-5">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'I love to run'</span>,</span>
<span id="cb18-6">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'the cat does not eat the fruit'</span>,</span>
<span id="cb18-7">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'run to the cat'</span>,</span>
<span id="cb18-8">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'I love to eat the fruit. fruit fruit fruit fruit'</span></span>
<span id="cb18-9">]</span>
<span id="cb18-10"></span>
<span id="cb18-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># the</span></span>
<span id="cb18-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># to</span></span></code></pre></div></div>
</div>
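<p>To check the answer, we can count how many documents each word appears in. This is a rough sketch using the same simple regex tokenizer assumption as before (not the tokenizer scikit-learn uses internally); <code>doc_freq</code> is our own name.</p>

```python
import re
from collections import Counter

texts = [
    'I love to run',
    'the cat does not eat the fruit',
    'run to the cat',
    'I love to eat the fruit. fruit fruit fruit fruit'
]

# Document frequency: the number of documents that contain each word.
# Using a set per document means repeats within a document count once.
doc_freq = Counter()
for text in texts:
    doc_freq.update(set(re.findall(r"\w+", text.lower())))

for word, n_docs in doc_freq.most_common():
    print(word, n_docs)
```

<p>“the” and “to” each appear in 3 of the 4 documents, more than any other word, which is why they tell us the least about any single document.</p>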
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-01-03_text-vectorization/NLP - text vectorisation_files/figure-html/6bd701e4-2031-4d99-a8d0-0058e4172030-1-065d751b-b7ce-4ccf-984f-6cb7ee919f76.png" class="img-fluid figure-img"></p>
<figcaption>image.png</figcaption>
</figure>
</div>
<div id="3e24743e-0d06-47df-bf35-2c06fc345488" class="cell" data-execution_count="13">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb19" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb19-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> sklearn.feature_extraction.text <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> TfidfVectorizer</span>
<span id="cb19-2"></span>
<span id="cb19-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Instantiating the TfidfVectorizer</span></span>
<span id="cb19-4">tf_idf_vectorizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> TfidfVectorizer()</span>
<span id="cb19-5"></span>
<span id="cb19-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Training it on the texts</span></span>
<span id="cb19-7">weighted_words <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame(tf_idf_vectorizer.fit_transform(texts).toarray(),</span>
<span id="cb19-8">                 columns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> tf_idf_vectorizer.get_feature_names_out(),</span>
<span id="cb19-9">                index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> texts)</span>
<span id="cb19-10"></span>
<span id="cb19-11">weighted_words</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="13">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">cat</th>
<th data-quarto-table-cell-role="th">does</th>
<th data-quarto-table-cell-role="th">eat</th>
<th data-quarto-table-cell-role="th">fruit</th>
<th data-quarto-table-cell-role="th">love</th>
<th data-quarto-table-cell-role="th">not</th>
<th data-quarto-table-cell-role="th">run</th>
<th data-quarto-table-cell-role="th">the</th>
<th data-quarto-table-cell-role="th">to</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">I love to run</th>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.613667</td>
<td>0.000000</td>
<td>0.613667</td>
<td>0.000000</td>
<td>0.496816</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">the cat does not eat the fruit</th>
<td>0.336350</td>
<td>0.426618</td>
<td>0.336350</td>
<td>0.336350</td>
<td>0.000000</td>
<td>0.426618</td>
<td>0.000000</td>
<td>0.544609</td>
<td>0.000000</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">run to the cat</th>
<td>0.549578</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.549578</td>
<td>0.444931</td>
<td>0.444931</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">I love to eat the fruit. fruit fruit fruit fruit</th>
<td>0.000000</td>
<td>0.000000</td>
<td>0.187942</td>
<td>0.939709</td>
<td>0.187942</td>
<td>0.000000</td>
<td>0.000000</td>
<td>0.152155</td>
<td>0.152155</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
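<p>To see where these numbers come from, we can reproduce the first row of the table by hand. The sketch below assumes scikit-learn’s default settings for <code>TfidfVectorizer</code>: a smoothed inverse document frequency, <code>ln((1 + n) / (1 + df)) + 1</code>, followed by scaling each row to unit length. The document frequencies are counted from our four sentences; the helper name <code>smoothed_idf</code> is ours.</p>

```python
import math

n_docs = 4

# Document frequencies for the words in 'I love to run'
# ("I" is dropped by the default tokenizer because it is a single character)
df = {"love": 2, "run": 2, "to": 3}

def smoothed_idf(doc_freq, n_docs):
    """Smoothed inverse document frequency (assumed scikit-learn default)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Each word appears once in this sentence, so its raw count is 1
raw = {word: 1 * smoothed_idf(d, n_docs) for word, d in df.items()}

# Scale the row so that its Euclidean length is 1
norm = math.sqrt(sum(value ** 2 for value in raw.values()))
weights = {word: round(value / norm, 6) for word, value in raw.items()}

print(weights)
```

<p>The values should come out at roughly 0.6137 for “love” and “run” and 0.4968 for “to”, in line with the first row of the table: “to” is down-weighted because it appears in more documents.</p>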
<p>Weaknesses of this approach?</p>
<ul>
<li>word order still missing</li>
<li>relationship between words still missing</li>
</ul>
</section>
<section id="n-grams" class="level2">
<h2 class="anchored" data-anchor-id="n-grams">n-grams</h2>
<div id="1eaffc72-4839-499b-8157-cd8d1d5f95ba" class="cell" data-execution_count="14">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb20" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb20-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># The following two sentences have exactly the same representation in bag-of-words / TF-IDF approaches</span></span>
<span id="cb20-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># However, they have very different meanings!</span></span>
<span id="cb20-3"></span>
<span id="cb20-4">sentences <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> [</span>
<span id="cb20-5">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I like cats but not dogs"</span>,</span>
<span id="cb20-6">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"I like dogs but not cats"</span></span>
<span id="cb20-7">]</span></code></pre></div></div>
</div>
<div id="89d9f7b6-db18-4b1f-a58d-21aa0d147190" class="cell" data-execution_count="15">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb21" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb21-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Vectorize the sentences</span></span>
<span id="cb21-2">count_vectorizer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> CountVectorizer()</span>
<span id="cb21-3">sentences_vectorized <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> count_vectorizer.fit_transform(sentences)</span>
<span id="cb21-4"></span>
<span id="cb21-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Show the representations in a nice DataFrame</span></span>
<span id="cb21-6">sentences_vectorized <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame(</span>
<span id="cb21-7">    sentences_vectorized.toarray(),</span>
<span id="cb21-8">    columns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> count_vectorizer.get_feature_names_out(),</span>
<span id="cb21-9">    index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sentences</span>
<span id="cb21-10">)</span>
<span id="cb21-11"></span>
<span id="cb21-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Show the vectorized words</span></span>
<span id="cb21-13">sentences_vectorized</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="15">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">but</th>
<th data-quarto-table-cell-role="th">cats</th>
<th data-quarto-table-cell-role="th">dogs</th>
<th data-quarto-table-cell-role="th">like</th>
<th data-quarto-table-cell-role="th">not</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">I like cats but not dogs</th>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">I like dogs but not cats</th>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<p>🧑🏻‍🏫 When using a bag-of-words representation, an efficient way to capture context is to consider:</p>
<ul>
<li>the count of single tokens (unigrams)</li>
<li>the count of pairs (bigrams), triplets (trigrams), and more generally sequences of n words, also known as n-grams</li>
</ul>
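<p>As a quick illustration of what an n-gram is, here is a minimal pure-Python sketch. The <code>ngrams</code> helper is hypothetical (it is not a scikit-learn function) and it assumes the text has already been split into tokens on whitespace.</p>

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I like cats but not dogs".lower().split()

print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

<p>Because bigrams such as “not dogs” keep neighbouring words together, they retain a little of the word order that single-word counts throw away.</p>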
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-01-03_text-vectorization/NLP - text vectorisation_files/figure-html/38e1eb15-abc8-4d63-ad85-3fe790f862b4-1-dd4e9a1c-ffed-482a-95f7-15b60b19c7d7.png" class="img-fluid" alt="image.png"></p>
<p>😥 With a unigram vectorization, we couldn’t distinguish two sentences with the same words, despite their meanings being quite different.</p>
<div id="d36bce7a-55c2-4044-8d10-a1b9c4101d69" class="cell" data-scrolled="true" data-execution_count="16">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb22" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb22-1">sentences_vectorized</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="16">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">but</th>
<th data-quarto-table-cell-role="th">cats</th>
<th data-quarto-table-cell-role="th">dogs</th>
<th data-quarto-table-cell-role="th">like</th>
<th data-quarto-table-cell-role="th">not</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">I like cats but not dogs</th>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">I like dogs but not cats</th>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<p>👩🏻‍🔬 What about a bigram vectorization?</p>
<div id="ccab66a1-f168-4669-87f2-36f75b8317d8" class="cell" data-execution_count="17">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb23" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb23-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Vectorize the sentences</span></span>
<span id="cb23-2">count_vectorizer_n_gram <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> CountVectorizer(ngram_range <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># unigrams and bigrams</span></span>
<span id="cb23-3">sentences_vectorized_n_gram <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> count_vectorizer_n_gram.fit_transform(sentences)</span>
<span id="cb23-4"></span>
<span id="cb23-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Show the representations in a nice DataFrame</span></span>
<span id="cb23-6">sentences_vectorized_n_gram <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame(</span>
<span id="cb23-7">    sentences_vectorized_n_gram.toarray(),</span>
<span id="cb23-8">    columns <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> count_vectorizer_n_gram.get_feature_names_out(),</span>
<span id="cb23-9">    index <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> sentences</span>
<span id="cb23-10">)</span>
<span id="cb23-11"></span>
<span id="cb23-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Show the vectorized sentences with unigrams and bigrams (pairs of words)</span></span>
<span id="cb23-13">sentences_vectorized_n_gram</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="17">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">but</th>
<th data-quarto-table-cell-role="th">but not</th>
<th data-quarto-table-cell-role="th">cats</th>
<th data-quarto-table-cell-role="th">cats but</th>
<th data-quarto-table-cell-role="th">dogs</th>
<th data-quarto-table-cell-role="th">dogs but</th>
<th data-quarto-table-cell-role="th">like</th>
<th data-quarto-table-cell-role="th">like cats</th>
<th data-quarto-table-cell-role="th">like dogs</th>
<th data-quarto-table-cell-role="th">not</th>
<th data-quarto-table-cell-role="th">not cats</th>
<th data-quarto-table-cell-role="th">not dogs</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">I like cats but not dogs</th>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">I like dogs but not cats</th>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>0</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
</section>
<section id="word2vec-embeddings" class="level2">
<h2 class="anchored" data-anchor-id="word2vec-embeddings">Word2Vec embeddings</h2>
<p>Word embeddings attempt to capture the semantic meaning of words in numerical form.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-01-03_text-vectorization/NLP - text vectorisation_files/figure-html/1ee0885b-cb03-4b52-ab0c-4208318b8a4e-2-c49ccd37-ec3a-4a4b-b1ee-a786f4038074.png" class="img-fluid figure-img"></p>
<figcaption>image.png</figcaption>
</figure>
</div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-01-03_text-vectorization/NLP - text vectorisation_files/figure-html/1ee0885b-cb03-4b52-ab0c-4208318b8a4e-1-3eba3f5d-26d5-415d-8fc4-773086347326.png" class="img-fluid figure-img"></p>
<figcaption>image.png</figcaption>
</figure>
</div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-01-03_text-vectorization/NLP - text vectorisation_files/figure-html/298ea242-139b-456d-9a1c-0d0cdcadecb6-1-ab28afa4-e32d-4b42-9b5d-5e73e7e5cbd4.png" class="img-fluid figure-img"></p>
<figcaption>image.png</figcaption>
</figure>
</div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-01-03_text-vectorization/NLP - text vectorisation_files/figure-html/9549f3f0-43a0-4386-885b-dc465878dd07-1-b1873ad5-e140-4118-9330-e7dbec2a2fc1.png" class="img-fluid figure-img"></p>
<figcaption>image.png</figcaption>
</figure>
</div>
<div id="bd2d7f0d-404e-42e7-bcac-996527cc0ca0" class="cell" data-execution_count="18">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb24" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb24-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> gensim.downloader</span>
<span id="cb24-2"></span>
<span id="cb24-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Lots of different pretrained embeddings we can use for free!</span></span>
<span id="cb24-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>(gensim.downloader.info()[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'models'</span>].keys()))</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']</code></pre>
</div>
</div>
<div id="60389a85" class="cell" data-execution_count="19">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb26" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb26-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># We will use a vectoriser trained on Wikipedia today</span></span>
<span id="cb26-2">model_wiki <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> gensim.downloader.load(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'glove-wiki-gigaword-50'</span>)</span></code></pre></div></div>
</div>
<div id="1266afa0-fa31-4011-b396-bf61a868871a" class="cell" data-execution_count="20">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb27" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb27-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Vectors based on Wikipedia 2014 + Gigaword 5: 6B tokens, 400K vocab!</span></span>
<span id="cb27-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 50 dimensions</span></span>
<span id="cb27-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># N.B. Words not in glove-wiki-gigaword-50 will not have vectors computed. For example, if there was a niche word or acronym like "NHS-R" there would not be a vector for this word.</span></span>
<span id="cb27-4"></span>
<span id="cb27-5">model_wiki[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cat"</span>]</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="20">
<pre><code>array([ 0.45281 , -0.50108 , -0.53714 , -0.015697,  0.22191 ,  0.54602 ,
       -0.67301 , -0.6891  ,  0.63493 , -0.19726 ,  0.33685 ,  0.7735  ,
        0.90094 ,  0.38488 ,  0.38367 ,  0.2657  , -0.08057 ,  0.61089 ,
       -1.2894  , -0.22313 , -0.61578 ,  0.21697 ,  0.35614 ,  0.44499 ,
        0.60885 , -1.1633  , -1.1579  ,  0.36118 ,  0.10466 , -0.78325 ,
        1.4352  ,  0.18629 , -0.26112 ,  0.83275 , -0.23123 ,  0.32481 ,
        0.14485 , -0.44552 ,  0.33497 , -0.95946 , -0.097479,  0.48138 ,
       -0.43352 ,  0.69455 ,  0.91043 , -0.28173 ,  0.41637 , -1.2609  ,
        0.71278 ,  0.23782 ], dtype=float32)</code></pre>
</div>
</div>
<div id="9a98fe73-b161-4c46-b581-1bfef59af25d" class="cell" data-execution_count="21">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb29" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb29-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># King is to Queen as Man is to ...</span></span>
<span id="cb29-2"></span>
<span id="cb29-3">example_1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model_wiki[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"queen"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> model_wiki[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"king"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> model_wiki[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"man"</span>]</span>
<span id="cb29-4">model_wiki.most_similar(example_1)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="21">
<pre><code>('woman', 0.8903914093971252)</code></pre>
</div>
</div>
<div id="46a673f2-7cc3-4f4e-a537-d4b9e470b91d" class="cell" data-execution_count="22">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb31" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb31-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Similar words to cat</span></span>
<span id="cb31-2"></span>
<span id="cb31-3">model_wiki.most_similar(model_wiki[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cat"</span>])</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="22">
<pre><code>[('cat', 1.0),
 ('dog', 0.9218006134033203),
 ('rabbit', 0.8487820625305176),
 ('monkey', 0.804108202457428),
 ('rat', 0.7891963124275208),
 ('cats', 0.7865270972251892),
 ('snake', 0.7798910140991211),
 ('dogs', 0.7795815467834473),
 ('pet', 0.7792249917984009),
 ('mouse', 0.7731667160987854)]</code></pre>
</div>
</div>
<div id="337cbcfb-2722-4db6-aaf0-437433d84348" class="cell" data-execution_count="23">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb33" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb33-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Opposite of cold...?</span></span>
<span id="cb33-2"></span>
<span id="cb33-3">example_2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model_wiki[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"good"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> model_wiki[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"evil"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> model_wiki[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cold"</span>]</span>
<span id="cb33-4">model_wiki.most_similar(example_2)[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="23">
<pre><code>('warm', 0.7870427966117859)</code></pre>
</div>
</div>
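<p>For intuition, the scores returned by <code>most_similar()</code> are cosine similarities between vectors. The sketch below repeats the king/queen arithmetic on made-up 3-dimensional vectors. The numbers and the helper <code>cosine_similarity()</code> are hypothetical and are not taken from the GloVe model, so treat this as an illustration of the idea rather than of gensim's internals.</p>
<pre class="sourceCode python"><code class="sourceCode python">import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (made-up numbers, for illustration only)
toy_vectors = {
    "king": np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man": np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.2, 0.8]),
}

# Same arithmetic as the king/queen example above
target = toy_vectors["queen"] - toy_vectors["king"] + toy_vectors["man"]

# Rank every word by cosine similarity to the target vector
ranked = sorted(
    toy_vectors,
    key=lambda word: cosine_similarity(toy_vectors[word], target),
    reverse=True,
)
print(ranked[0])</code></pre>
<p>With these toy values the closest word to the target vector is "woman", mirroring the result from the real model.</p>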
</section>
<section id="attention-mechanism" class="level2">
<h2 class="anchored" data-anchor-id="attention-mechanism">Attention mechanism</h2>
<p>Attention is the basis of transformer-based neural networks such as the models behind ChatGPT. The paper that started it all: <a href="https://arxiv.org/abs/1706.03762">Attention is all you need</a>.</p>
<ol type="1">
<li><p>Each token (word) embedding gets projected ➡️ into 3 further vectors: the query, key and value vectors (768 dimensions each in a model such as BERT-base)</p></li>
<li><p>We compute a scaled dot-product 🔴 on the query and key vectors to work out how much each word relates to those around it</p></li>
<li><p>Take these scores and normalize with softmax ⤵️</p></li>
<li><p>Multiply by our value vectors ❎, sum and pass to our dense neural network</p></li>
</ol>
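<p>The four steps above can be sketched in a few lines of NumPy. This is a minimal single-head version with no masking, using tiny sizes and random projection matrices (in a trained model these are learned). All variable names here are assumptions made for illustration.</p>
<pre class="sourceCode python"><code class="sourceCode python">import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(seed=0)
n_tokens, d_model = 4, 8  # tiny sizes for illustration; real models are far larger

# Step 1: project each token embedding into query, key and value vectors.
# The projection matrices are random here; in a trained model they are learned.
embeddings = rng.normal(size=(n_tokens, d_model))
w_query, w_key, w_value = (rng.normal(size=(d_model, d_model)) for _ in range(3))
queries = embeddings @ w_query
keys = embeddings @ w_key
values = embeddings @ w_value

# Step 2: scaled dot-product between every query and every key
scores = queries @ keys.T / np.sqrt(d_model)

# Step 3: softmax turns each row of scores into weights that sum to 1
weights = softmax(scores)

# Step 4: weighted sum of the value vectors, one output vector per token
attended = weights @ values
print(weights.shape, attended.shape)</code></pre>
<p>Each row of <code>weights</code> says how much one token attends to every other token, and <code>attended</code> holds the resulting context-aware vector for each token.</p>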
<p>⚠️ <strong>TLDR</strong>: The main point is that each word is now represented by 768 * 3 numbers! This is partly what makes LLMs so powerful (and resource-hungry) ⚠️</p>
<p>In the simple bag-of-words and TF-IDF approaches, each word was represented by only 1 number.</p>
<p>In more complex word embeddings, each word was represented by between 50 and 300 numbers.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-01-03_text-vectorization/NLP - text vectorisation_files/figure-html/c0406057-f220-47b2-9387-f9b32d33957a-1-084cca64-8713-4c78-81af-e4e730b44636.png" class="img-fluid figure-img"></p>
</figure>
</div>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2025-01-03_text-vectorization/NLP - text vectorisation_files/figure-html/a0b68e32-96bc-45e7-b6d1-4ed62f0ed925-1-0a2c45c7-282c-44e1-94f3-0d1ded4caaf7.png" class="img-fluid figure-img"></p>
</figure>
</div>
</section>
<section id="resources" class="level2">
<h2 class="anchored" data-anchor-id="resources">Resources</h2>
<ul>
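<li>R text mining book: <a href="https://www.tidytextmining.com/">https://www.tidytextmining.com/</a></li>
<li>Hugging Face tutorials (Python): <a href="https://huggingface.co/learn/nlp-course/chapter1/1">https://huggingface.co/learn/nlp-course/chapter1/1</a></li>
<li>Great video on attention: <a href="https://www.youtube.com/watch?v=zxQyTK8quyY">https://www.youtube.com/watch?v=zxQyTK8quyY</a></li>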
</ul>


</section>
</section>

 ]]></description>
  <category>NLP</category>
  <category>Python</category>
  <category>Tutorial</category>
  <guid>https://the-strategy-unit.github.io/data_science/blogs/posts/2025-01-03_text-vectorization/NLP - text vectorisation.html</guid>
  <pubDate>Fri, 03 Jan 2025 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Deploy previews with GitHub pages</title>
  <dc:creator>Rhian Davies</dc:creator>
  <link>https://the-strategy-unit.github.io/data_science/blogs/posts/2024-12-04-gha-branch-preview/</link>
  <description><![CDATA[ 





<p>When reviewing a pull request (PR) for a Quarto website, it’s good practice to check the rendered output, as well as the code. This is useful for ensuring that the changes look as expected, for example, ensuring that bullet points have rendered correctly, or that images are well sized.</p>
<p>However, it’s a pain for the reviewer to clone the repository and render the Quarto site locally just to check it looks correct. Wouldn’t it be nice if when the PR was created, you automatically got a deployed version of your changes to look at?</p>
<p>Other development platforms like <a href="https://docs.netlify.com/site-deploys/deploy-previews/">Netlify</a> and <a href="https://vercel.com/docs/deployments/preview-deployments">Vercel</a> have offered deploy previews for a while, and although these are free for individual users in public repos, they aren’t free for organisations.</p>
<p>There has been <a href="https://github.com/orgs/community/discussions/7730">discussion of GitHub deploy preview for a few years</a>, but there is currently no ETA for this feature. However, there is a popular GitHub marketplace action <a href="https://github.com/marketplace/actions/deploy-pr-preview">deploy-pr-preview</a> by <a href="https://github.com/rossjrw">rossjrw</a> which does just what we need.</p>
<p>The features of this action are:</p>
<ul>
<li>Creates and deploys previews of pull requests to your GitHub Pages site</li>
<li>Leaves a comment on the pull request with a link to the preview so that you and your team can collaborate on new features faster</li>
<li>Updates the deployment and the comment whenever new commits are pushed to the pull request</li>
<li>Cleans up after itself — removes deployed previews when the pull request is closed</li>
</ul>
<section id="how-to-use-deploy-pr-preview" class="level2">
<h2 class="anchored" data-anchor-id="how-to-use-deploy-pr-preview">How to use deploy-pr-preview</h2>
<p>We weren’t doing any CI/CD on PRs initially, so first I need to define a new workflow. Workflows are defined in <code>.yaml</code> files in the <code>.github/workflows</code> folder. At the top of the workflow, I need to give it a name and tell it <em>when</em> to trigger. In this case I want it to trigger on any PR.</p>
<pre class="{yaml}"><code>name: Quarto Preview

on:
  pull_request:
    types:
      - opened
      - reopened
      - synchronize
      - closed</code></pre>
<p>Once I’ve defined <em>when</em> it should run, I need to specify <em>what</em> it should run. That tends to involve a number of steps, such as:</p>
<ul>
<li>Checking out the repository</li>
<li>Installing system dependencies</li>
<li>Installing packages (via {renv})</li>
<li>Rendering the Quarto site</li>
</ul>
<p>We already have a <code>publish.yml</code> workflow for main which has a number of relevant steps which I’ve borrowed.</p>
<p>The file looks a bit like this:</p>
<pre class="{yaml}"><code>jobs:
  build-deploy:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: ./.github/workflows
    permissions:
      contents: write
      pull-requests: write
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Use cache
        uses: actions/cache@v4
        with:
          key: freeze
          path: _freeze

      - name: Install System Dependencies
        run:  bash install_system_deps.sh

      - name: Set up R
        uses: r-lib/actions/setup-r@v2
        with:
          use-public-rspm: true
      
      - name: Set up renv
        uses: r-lib/actions/setup-renv@v2

      - name: Setup Quarto
        uses: quarto-dev/quarto-actions/setup@v2

      - name: Render
        uses: quarto-dev/quarto-actions/render@v2</code></pre>
<p>Once the site has rendered, we just need to add a step to deploy the PR.</p>
<pre class="{yaml}"><code>      - name: Deploy PR Preview
        uses: rossjrw/pr-preview-action@v1.4.8
        with:
          source-dir: ./_site/</code></pre>
<p>Now every time a PR is created or updated, the GitHub Action bot will spring into action.</p>
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2024-12-04-gha-branch-preview/deploy-pr-preview.PNG" class="img-fluid" alt="A screenshot of a GitHub PR. Under the description, there is a comment from the GitHub actions bot with links to the deployed PR."><br>
</p>
</section>
<section id="adjusting-the-publish.yml" class="level2">
<h2 class="anchored" data-anchor-id="adjusting-the-publish.yml">Adjusting the publish.yml</h2>
<p>That’s all working and looking good! However, there is one last change we need to make. When someone merges into main, the publish action is triggered, and sometimes in that process the pr-previews can get wiped. This means that your preview link would no longer work if someone merges another branch to main whilst you’re working on your PR.</p>
<p>We were using the <a href="https://github.com/quarto-dev/quarto-actions">standard Quarto publish action</a> which, as far as I know, does not have a built-in option to exclude folders from the clean-up process.</p>
<p>Instead, I’ve replaced the Quarto action with another marketplace action, <a href="https://github.com/JamesIves/github-pages-deploy-action">github-pages-deploy-action</a> by <a href="https://github.com/JamesIves">JamesIves</a>, which has a built-in way to exclude certain folders from the clean-up. Since it’s not a Quarto-specific publish action, we need to remember to add a Quarto render step first. The first snippet below is the original publish step; the second is its replacement.</p>
<pre class="{yaml}"><code>      - name: Publish to GitHub Pages (and render)
        uses: quarto-dev/quarto-actions/publish@v2
        with:
          target: gh-pages</code></pre>
<pre class="{yaml}"><code>      - name: Render Quarto
        uses: quarto-dev/quarto-actions/render@v2

      - name: Deploy 🚀
        uses: JamesIves/github-pages-deploy-action@v4
        with:
          folder: _site/
          clean-exclude: pr-preview/
          force: false</code></pre>
</section>
<section id="spring-cleaning" class="level2">
<h2 class="anchored" data-anchor-id="spring-cleaning">Spring cleaning</h2>
<p>Now that we have a preview workflow (for branches) and a publish workflow (for main), we have a number of duplicated steps. As any good programmer knows, this is not ideal as it goes against the DRY principle and makes code harder to maintain.</p>
<p>I couldn’t find a quick way for two workflows to share steps<sup>1</sup>, so as a small improvement, I moved the lengthy system dependencies call into its own script and called it via the following step.</p>
<pre class="{yaml}"><code>      - name: Install System Dependencies
        run:  bash install_system_deps.sh</code></pre>
<p>which I think is a bit nicer to read than the original.</p>
<pre class="{yaml}"><code>      - name: Install System Dependencies
        run: |
          sudo apt update
          sudo apt install -y cmake
          sudo apt install -y gdal-bin
          sudo apt install -y git
          sudo apt install -y libcurl4-openssl-dev
          ...</code></pre>
<p>In order for the actions to find the script, I also specified the path when I defined the job.</p>
<pre class="{yaml}"><code>jobs:
  build-deploy:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: ./.github/workflows</code></pre>
<p>Now we’ve got this working for GitHub pages 🎉 we’d also like to start using it for some of our deployments on Posit Connect. But that’s for another day.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>If you know how to do this, do let me know, or make a <a href="https://github.com/The-Strategy-Unit/data_science/pulls">PR</a>.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>GitHub</category>
  <category>learning</category>
  <category>deployment</category>
  <guid>https://the-strategy-unit.github.io/data_science/blogs/posts/2024-12-04-gha-branch-preview/</guid>
  <pubDate>Wed, 04 Dec 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Using GitHub to plan and organise Coffee &amp; Coding</title>
  <dc:creator>YiWen Hon</dc:creator>
  <link>https://the-strategy-unit.github.io/data_science/blogs/posts/2024-11-12_coffee-coding-github-planner/</link>
  <description><![CDATA[ 





<section id="coffee-coding" class="level2">
<h2 class="anchored" data-anchor-id="coffee-coding">Coffee &amp; Coding</h2>
<p>Coffee &amp; Coding is a fortnightly hour-long session organised by the Data Science team, open to all members of the Strategy Unit with an interest in coding. It’s been <a href="../../../blogs/posts/2024-05-13_one-year-coffee-code/index.html" target="_blank">well received</a> and is a valued source of professional development and general geekery in the team.</p>
<p>We’ve been experimenting with using <a href="https://github.com/">GitHub</a> as an organisational tool for our team’s work, and are testing the same approach for Coffee &amp; Coding sessions as well. Previously, future Coffee &amp; Coding sessions were haphazardly listed in a Google Doc that was only accessible to members of the Data Science team, and we wanted a more open approach. We also didn’t have a good record of topics that were previously covered.</p>
<p>You’ll need a GitHub account to enjoy the full functionality of the planner. If you need help setting this up, get in touch with any member of the Data Science team.</p>
<p>Any feedback on this new system for organising and planning Coffee &amp; Coding is very welcome! Hope you enjoy using it.</p>
</section>
<section id="viewing-upcoming-sessions" class="level2">
<h2 class="anchored" data-anchor-id="viewing-upcoming-sessions">Viewing upcoming sessions</h2>
<p>We have created <a href="https://github.com/orgs/The-Strategy-Unit/projects/14/views/1">a fully open GitHub project for tracking Coffee &amp; Coding sessions</a>. Any sessions with scheduled dates can be seen in the “Upcoming sessions” view. Clicking on a session title brings up more information, including a brief overview of the session and the people running it. Users with GitHub accounts can make comments or post emoji reactions.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2024-11-12_coffee-coding-github-planner/upcoming-sessions.gif" class="img-fluid figure-img" alt="A short clip showing a person clicking on an upcoming session title. A pop up box appears with more information"></p>
<figcaption>Viewing upcoming session details</figcaption>
</figure>
</div>
</section>
<section id="adding-session-ideas" class="level2">
<h2 class="anchored" data-anchor-id="adding-session-ideas">Adding session ideas</h2>
<p>To add a session idea:</p>
<ol type="1">
<li><a href="https://github.com/The-Strategy-Unit/data_science/issues/new?template=Blank+issue">Create a new issue</a> on the <a href="https://github.com/The-Strategy-Unit/data_science">data_science repository</a>. Provide a useful title and description for the session.</li>
<li>Give your new issue the label C&amp;C☕</li>
<li>If you would like to run or contribute to the session, assign yourself to it.</li>
<li>Click “Create” to save your session idea as a GitHub issue. You should then be able to see it listed as a “Potential session” on the planner, and others will be able to view, vote for, and comment on your session idea.</li>
</ol>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2024-11-12_coffee-coding-github-planner/creating-session.gif" class="img-fluid figure-img" alt="A short clip showing a person creating a new session idea as a GitHub issue, and giving it a title, description, and label"></p>
<figcaption>Adding a session idea</figcaption>
</figure>
</div>
</section>
<section id="voting-for-session-ideas" class="level2">
<h2 class="anchored" data-anchor-id="voting-for-session-ideas">Voting for session ideas</h2>
<p>We will use thumbs up (👍) emoji reactions to suggested sessions as a voting system to help us with planning and scheduling.</p>
<p>If you see any potential sessions that you are interested in, react to them with a thumbs up emoji. You can see all planned sessions, in order of votes received, <a href="https://github.com/The-Strategy-Unit/data_science/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22C%26C%20%E2%98%95%22%20sort%3Areactions-%2B1-desc">listed here</a>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2024-11-12_coffee-coding-github-planner/voting-for-session.gif" class="img-fluid figure-img" alt="A short clip showing a person reacting to a GitHub issue with a thumbs up emoji"></p>
<figcaption>Voting for a session idea</figcaption>
</figure>
</div>


</section>

 ]]></description>
  <category>GitHub</category>
  <category>learning</category>
  <guid>https://the-strategy-unit.github.io/data_science/blogs/posts/2024-11-12_coffee-coding-github-planner/</guid>
  <pubDate>Tue, 12 Nov 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Map and Nest</title>
  <dc:creator>Rhian Davies</dc:creator>
  <link>https://the-strategy-unit.github.io/data_science/blogs/posts/2024-08-08_map-and-nest/</link>
  <description><![CDATA[ 





<p>I want to share a framework that I like using occasionally for data analysis. It’s called map-and-nest, and it’s helped me countless times when I’m working with related datasets. By combining <a href="https://purrr.tidyverse.org/">{purrr}</a> mapping with <a href="https://tidyr.tidyverse.org/">{tidyr}</a> nesting, I can keep my analysis steps linked, allowing me to easily track from a summary or plot, back to the original data.</p>
<p>The main functions we’ll need are:</p>
<ul>
<li><code>tidyr::nest()</code></li>
<li><code>dplyr::mutate()</code></li>
<li><code>purrr::map()</code></li>
<li><code>purrr::walk()</code></li>
</ul>
<section id="example-on-nhs-workforce-statistics" class="level2">
<h2 class="anchored" data-anchor-id="example-on-nhs-workforce-statistics">Example on NHS workforce statistics</h2>
<p>The <a href="https://digital.nhs.uk/data-and-information/publications/statistical/nhs-workforce-statistics">NHS workforce statistics</a> are official statistics published monthly for England.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1">staff_group <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">readRDS</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">file =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"workforce_staff_group.rds"</span>)</span></code></pre></div></div>
</div>
<p>I want to perform an analysis for each of the 42 integrated care systems (ICS). The {tidyr} <code>nest()</code> function creates a list-column, where each cell contains a mini dataframe for each grouping.</p>
<p>Let’s group by ICS, and call the nested data column <code>raw_data</code>.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1">group_by_ics <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> staff_group <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb2-2">    tidyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nest</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">raw_data =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>ics_name)</span></code></pre></div></div>
</div>
<p>The new column is a list-column, with each cell containing an entire tibble of data relating to that individual ICS.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' echo: false</span></span>
<span id="cb3-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">head</span>(group_by_ics)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 6 × 2
  ics_name             raw_data         
  &lt;chr&gt;                &lt;list&gt;           
1 South East London    &lt;tibble [8 × 6]&gt; 
2 North East London    &lt;tibble [7 × 6]&gt; 
3 North Central London &lt;tibble [12 × 6]&gt;
4 North West London    &lt;tibble [10 × 6]&gt;
5 South West London    &lt;tibble [8 × 6]&gt; 
6 Devon                &lt;tibble [7 × 6]&gt; </code></pre>
</div>
</div>
<p>We can grab these mini datasets in the usual way and explore them interactively.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">group_by_ics<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>raw_data[[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>]]</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code># A tibble: 8 × 6
  organisation_name           total hchs_doctors nurses_health_visitors midwives
  &lt;chr&gt;                       &lt;dbl&gt;        &lt;dbl&gt;                  &lt;dbl&gt;    &lt;dbl&gt;
1 Total                       58394         7108                  14939      926
2 Guy's and St Thomas' NHS F… 21361         3003                   6196      281
3 King's College Hospital NH… 13158         2443                   4202      375
4 Lewisham and Greenwich NHS…  6617          979                   2103      271
5 London Ambulance Service N…  7050            4                     44        0
6 NHS South East London ICB     617            9                     43        0
7 Oxleas NHS Foundation Trust  4094          200                   1196        0
8 South London and Maudsley …  5496          471                   1155        0
# ℹ 1 more variable: ambulance_staff &lt;dbl&gt;</code></pre>
</div>
</div>
<p>Next, let’s apply some simple processing, say converting absolute numbers into percentages, to each of the ICSs in turn.</p>
<p>We use <code>mutate()</code> to create a new list-column <code>staff_percent</code> and <code>map()</code> to apply the processing function to each cell in turn. <sup>1</sup></p>
<details>
<summary>
See function definition for <code>convert_percent()</code>
</summary>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' Convert percent</span></span>
<span id="cb7-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param staff Tibble containing organisation_name, total and a number of staff categories</span></span>
<span id="cb7-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @return Tibble like staff but with staff categories represented as percentages rather than absolute numbers</span></span>
<span id="cb7-4">convert_percent <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(staff){</span>
<span id="cb7-5">    staff <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb7-6">    dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">across</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.cols =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(organisation_name, total),</span>
<span id="cb7-7">                  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.fns =</span>  \(x)x<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>total)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb7-8">    dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rename</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Doctors"</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"hchs_doctors"</span>,</span>
<span id="cb7-9">                  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Nurses"</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"nurses_health_visitors"</span>,</span>
<span id="cb7-10">                  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Ambulance staff"</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"ambulance_staff"</span>,</span>
<span id="cb7-11">                  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Midwives"</span> <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"midwives"</span>)</span>
<span id="cb7-12">}</span></code></pre></div></div>
</div>
</details>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1">processed_staff <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span></span>
<span id="cb8-2">group_by_ics <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb8-3">    dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(</span>
<span id="cb8-4">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">staff_percent =</span> purrr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map</span>(raw_data, convert_percent)</span>
<span id="cb8-5">    )</span></code></pre></div></div>
</div>
<p>Where I think this map-and-nest process really comes into its own is creating plots. Often, I find myself wanting to create a couple of different plots for each grouping, and then optionally save the plots with sensible names. Particularly in the analysis stage, I like having these plots in the same row as the raw data, so I can quickly compare and validate.</p>
<p>I’ve created two functions, <code>plot_barchart()</code> and <code>plot_waffle()</code>, which take the data and create charts.</p>
<details>
<summary>
See definition for <code>plot_barchart()</code> &amp; <code>plot_waffle()</code>
</summary>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' Plot barchart</span></span>
<span id="cb9-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' Makes a bar chart of staff percentages by organisation</span></span>
<span id="cb9-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param df tibble of staff data in percent format</span></span>
<span id="cb9-4">plot_barchart <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(df) {</span>
<span id="cb9-5">  df <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb9-6">    dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(organisation_name <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Total"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb9-7">    dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>total) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb9-8">    tidyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pivot_longer</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cols =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(organisation_name), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">names_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"job"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">values_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"percent"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb9-9">    ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> percent, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> organisation_name, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> job)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb9-10">    ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_col</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">position =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dodge"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb9-11">    ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_x_continuous</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">labels =</span> scales<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">percent_format</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">scale =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb9-12">    ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">labs</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">""</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb9-13">    StrategyUnitTheme<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">scale_fill_su</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb9-14">    ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_minimal</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb9-15">    ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">legend.title =</span> ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>())</span>
<span id="cb9-16">}</span>
<span id="cb9-17"></span>
<span id="cb9-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' Plot waffle</span></span>
<span id="cb9-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' Makes a waffle chart to visualise staff breakdown at an ICS level</span></span>
<span id="cb9-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param raw_staff count data of staff</span></span>
<span id="cb9-21"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#' @param title Title for the graphic</span></span>
<span id="cb9-22">plot_waffle <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(raw_staff, title) {</span>
<span id="cb9-23">waffle_data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span></span>
<span id="cb9-24">raw_staff <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb9-25">    dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(organisation_name <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Total"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb9-26">    dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>total, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>organisation_name) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb9-27">    tidyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pivot_longer</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cols =</span> dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">everything</span>(), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">names_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"names"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">values_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"vals"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb9-28">    dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">vals =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">round</span>(vals <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span>))</span>
<span id="cb9-29"></span>
<span id="cb9-30">ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(waffle_data, ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fill =</span> names, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">values =</span> vals)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb9-31">  waffle<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_waffle</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n_rows =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.33</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"white"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb9-32">  ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">coord_equal</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb9-33">  ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme_void</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> </span>
<span id="cb9-34">  ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">theme</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">legend.title =</span> ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">element_blank</span>()) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb9-35">  ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggtitle</span>(title)</span>
<span id="cb9-36">}</span></code></pre></div></div>
</div>
</details>
<p>Again, using <code>mutate()</code>, I can create a new column called <code>barchart</code> and <code>map()</code> the function <code>plot_barchart()</code> over <code>staff_percent</code>, applying it to one row at a time.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1">graphs <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span></span>
<span id="cb10-2">processed_staff <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb10-3">    dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(</span>
<span id="cb10-4">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">barchart =</span>  purrr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map</span>(staff_percent, plot_barchart)</span>
<span id="cb10-5">    ) </span></code></pre></div></div>
</div>
<p>The resulting column <code>barchart</code> is again a list-column, but this time instead of containing a tibble, it holds a ggplot object. A whole ggplot in a single cell. <sup>2</sup></p>
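<p>To make the idea of a plot stored in a cell concrete, here is a minimal sketch that uses the built-in <code>mtcars</code> dataset rather than the staff data. The <code>nested</code> table, the <code>scatter</code> column and the <code>first_plot</code> variable are assumptions made for illustration only.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Toy data (an assumption for illustration): mtcars nested by cylinder count
nested &lt;- mtcars |&gt;
  tidyr::nest(raw_data = -cyl) |&gt;
  dplyr::mutate(
    scatter = purrr::map(
      raw_data,
      \(df) ggplot2::ggplot(df, ggplot2::aes(x = wt, y = mpg)) + ggplot2::geom_point()
    )
  )

class(nested$scatter)              # the column itself is a list
first_plot &lt;- nested$scatter[[1]]  # pull the plot out of the first row
class(first_plot)                  # the element is a ggplot object
print(first_plot)                  # draw it</code></pre></div>
</div>
<p>Extracting a single element with double square brackets is a handy way to check one plot against the raw data sitting in the same row.</p>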
<p>If we want to pass two arguments to our function, we can replace <code>map()</code> with <code>map2()</code>. Here we’re using <code>map2()</code> to pass the <code>ics_name</code> column to use as a title in our waffle plot. <sup>3</sup></p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1">graphs <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span></span>
<span id="cb11-2">graphs <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb11-3">    dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(</span>
<span id="cb11-4">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">waffle =</span>  purrr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map2</span>(raw_data, ics_name, </span>
<span id="cb11-5">            \(data, title) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot_waffle</span>(data, title)</span>
<span id="cb11-6">        )</span>
<span id="cb11-7">    ) </span></code></pre></div></div>
</div>
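<p>As a rough sketch of the different ways a two-argument function can be handed to <code>map2()</code>, the example below compares the anonymous-function shorthand, the formula shorthand and passing the function by name. The helper <code>make_label()</code> and the input vectors are hypothetical stand-ins for <code>plot_waffle()</code> and the nested columns.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical two-argument helper standing in for plot_waffle()
make_label &lt;- function(n, name) paste0(name, ": ", n)

staff_counts &lt;- c(10, 20)
ics_names &lt;- c("North", "South")

# Anonymous-function shorthand (R 4.1.0 and later)
via_lambda &lt;- purrr::map2(staff_counts, ics_names, \(n, name) make_label(n, name))

# Formula shorthand, which also works on older versions of R
via_formula &lt;- purrr::map2(staff_counts, ics_names, ~ make_label(.x, .y))

# Passing the function directly, relying on argument order
via_name &lt;- purrr::map2(staff_counts, ics_names, make_label)

identical(via_lambda, via_formula)
identical(via_lambda, via_name)</code></pre></div>
</div>
<p>Passing the function by name is the shortest form, but the explicit anonymous function makes it obvious which input feeds which argument.</p>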
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2024-08-08_map-and-nest/example_bar_chart.png" class="img-fluid figure-img"></p>
<figcaption>An example bar chart plot</figcaption>
</figure>
</div>
</section>
<section id="putting-it-all-together" class="level2">
<h2 class="anchored" data-anchor-id="putting-it-all-together">Putting it all together</h2>
<p>All of these <code>mutate()</code> steps can actually be combined into a single call. Here’s the full workflow after a little refactor. I’ve also used <code>pivot_longer()</code> to move the two plotting columns into a single plot column. This will make it easier for me to generate nice filenames and save the plots.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb12" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb12-1">results <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span></span>
<span id="cb12-2">staff_group <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb12-3">    tidyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nest</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">raw_data =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>ics_name) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb12-4">    dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(</span>
<span id="cb12-5">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">staff_percent =</span> purrr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map</span>(raw_data, convert_percent),</span>
<span id="cb12-6">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">barchart =</span>  purrr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map</span>(staff_percent, plot_barchart),</span>
<span id="cb12-7">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">waffle =</span>  purrr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">map2</span>(raw_data, ics_name, \(data, title) <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot_waffle</span>(data, title)) </span>
<span id="cb12-8">    )     <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb12-9">    tidyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pivot_longer</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">cols =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(barchart, waffle), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">names_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"plot_type"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">values_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"plot"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb12-10">    dplyr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">filename =</span> glue<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glue</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"{snakecase::to_snake_case(ics_name)}_{plot_type}.png"</span>))</span></code></pre></div></div>
</div>
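<p>To see what the final <code>pivot_longer()</code> and <code>glue()</code> steps do to the shape of the table, here is a small sketch using made-up ICS names and placeholder strings in place of real ggplot objects. Everything in <code>toy_plots</code> is an assumption for illustration.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Made-up names and placeholder "plots" (assumptions, not real data)
toy_plots &lt;- tibble::tibble(
  ics_name = c("Example North ICS", "Example South ICS"),
  barchart = c("bar_north", "bar_south"),
  waffle = c("waffle_north", "waffle_south")
)

toy_long &lt;- toy_plots |&gt;
  # one row per ICS and plot type, instead of one column per plot type
  tidyr::pivot_longer(cols = c(barchart, waffle), names_to = "plot_type", values_to = "plot") |&gt;
  # build a filename from the ICS name and the plot type
  dplyr::mutate(filename = glue::glue("{snakecase::to_snake_case(ics_name)}_{plot_type}.png"))

toy_long</code></pre></div>
</div>
<p>Two rows with two plot columns become four rows, each with its own filename.</p>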
<p>The <code>walk()</code> family of functions in {purrr} is used when the function you’re applying does not return an object, but is called for its side effect, for example reading or writing files.</p>
<p>Here we call <code>walk2()</code>, passing in both the filename column and the plot column as arguments to save all the plots.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb13-1">purrr<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">walk2</span>(</span>
<span id="cb13-2">  results<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>filename,</span>
<span id="cb13-3">  results<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>plot,</span>
<span id="cb13-4">  \(filename, plot) ggplot2<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggsave</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">file.path</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"plots"</span>, filename), plot, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">width =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">height =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">6</span>)</span>
<span id="cb13-5">)</span></code></pre></div></div>
</div>
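<p>The same pattern can be tried without any plots. The sketch below writes two small text files to a temporary directory. The directory name, filenames and contents are all hypothetical, and the directory is created up front so that the write cannot fail on a missing folder.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical output location inside the session's temporary directory
out_dir &lt;- file.path(tempdir(), "walk2_demo")
dir.create(out_dir, showWarnings = FALSE)

filenames &lt;- c("a.txt", "b.txt")
contents &lt;- c("first file", "second file")

# walk2() calls the function once per pair, purely for the side effect
returned &lt;- purrr::walk2(
  filenames,
  contents,
  \(filename, text) writeLines(text, file.path(out_dir, filename))
)

list.files(out_dir)              # the files now exist on disk
identical(returned, filenames)   # walk2() hands back its first input unchanged</code></pre></div>
</div>
<p>Because <code>walk2()</code> returns its first input invisibly, it can sit in the middle of a pipeline without interrupting it.</p>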
<p>By keeping everything together in one nested structure, I personally find it much easier to keep track of my analyses. If you’re doing a more complex or permanent analysis, you might want to consider setting up a more formal data processing pipeline and following RAP principles.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>In this example, we actually didn’t need to nest first. We could have performed the <code>mutate()</code> step on the full dataset.↩︎</p></li>
<li id="fn2"><p>This totally blew my mind the first time I saw it 🤯.↩︎</p></li>
<li id="fn3"><p>We’re using an anonymous function to map the two inputs onto the arguments of <code>plot_waffle()</code>. This shorthand syntax for anonymous functions was introduced in R 4.1.0. For compatibility with older versions of R, you’ll need the <code>~</code> formula shorthand instead. For the different ways you can specify functions in {purrr}, see the <a href="https://purrr.tidyverse.org/reference/map.html">help file</a>.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>purrr</category>
  <category>R</category>
  <category>tutorial</category>
  <guid>https://the-strategy-unit.github.io/data_science/blogs/posts/2024-08-08_map-and-nest/</guid>
  <pubDate>Thu, 08 Aug 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Storing data safely</title>
  <dc:creator>[YiWen Hon](mailto:yiwen.hon1@nhs.net)</dc:creator>
  <dc:creator>[Matt Dray](mailto:matt.dray@nhs.net)</dc:creator>
  <dc:creator>[Claire Welsh](mailto:claire.welsh8@nhs.net)</dc:creator>
  <link>https://the-strategy-unit.github.io/data_science/blogs/posts/2024-05-22_storing-data-safely/</link>
  <description><![CDATA[ 





<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p><strong>UPDATED</strong>: Please see the Addendum to this blog, added 2025-04-03, regarding accessing data from SharePoint.</p>
</div>
</div>
<section id="coffee-coding" class="level2">
<h2 class="anchored" data-anchor-id="coffee-coding">Coffee &amp; Coding</h2>
<p>In a recent Coffee &amp; Coding session we chatted about storing data safely for use in Reproducible Analytical Pipelines (RAP), and <a href="https://the-strategy-unit.github.io/data_science/presentations/2024-05-16_store-data-safely/">the slides from the presentation are now available</a>. We discussed the use of <a href="https://docs.posit.co/connect/user/pins/">Posit Connect Pins</a> and <a href="https://azure.microsoft.com/en-gb/products/storage/blobs/">Azure Storage</a>.</p>
<p>To avoid duplication, this blog post will not cover the pros and cons of each approach, and will instead focus on documenting the code that was used in our live demonstrations. I would recommend that you look through the slides before using the code in this blog post and keep them alongside, as they provide lots of useful context!</p>
</section>
<section id="posit-connect-pins" class="level2">
<h2 class="anchored" data-anchor-id="posit-connect-pins">Posit Connect Pins</h2>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># A brief intro to using {pins} to store, version, share and protect a dataset</span></span>
<span id="cb1-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># on Posit Connect. Documentation: https://pins.rstudio.com/</span></span>
<span id="cb1-3"></span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Setup -------------------------------------------------------------------</span></span>
<span id="cb1-6"></span>
<span id="cb1-7"></span>
<span id="cb1-8"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">install.packages</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pins"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dplyr"</span>)) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># if not yet installed</span></span>
<span id="cb1-9"></span>
<span id="cb1-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">suppressPackageStartupMessages</span>({</span>
<span id="cb1-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(pins)</span>
<span id="cb1-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(dplyr) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># for wrangling and the 'starwars' demo dataset</span></span>
<span id="cb1-13">})</span>
<span id="cb1-14"></span>
<span id="cb1-15">board <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">board_connect</span>() <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># will error if you haven't authenticated before</span></span>
<span id="cb1-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Error in `check_auth()`: ! auth = `auto` has failed to find a way to authenticate:</span></span>
<span id="cb1-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># • `server` and `key` not provided for `auth = 'manual'`</span></span>
<span id="cb1-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># • Can't find CONNECT_SERVER and CONNECT_API_KEY envvars for `auth = 'envvar'`</span></span>
<span id="cb1-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># • rsconnect package not installed for `auth = 'rsconnect'`</span></span>
<span id="cb1-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Run `rlang::last_trace()` to see where the error occurred.</span></span>
<span id="cb1-21"></span>
<span id="cb1-22"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># To authenticate</span></span>
<span id="cb1-23"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># In RStudio: Tools &gt; Global Options &gt; Publishing &gt; Connect... &gt; Posit Connect</span></span>
<span id="cb1-24"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># public URL of the Strategy Unit Posit Connect Server: connect.strategyunitwm.nhs.uk</span></span>
<span id="cb1-25"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Your browser will open to the Posit Connect web page and you'll be prompted</span></span>
<span id="cb1-26"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># for your password. Enter it and you'll be authenticated.</span></span>
<span id="cb1-27"></span>
<span id="cb1-28"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Once authenticated</span></span>
<span id="cb1-29">board <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">board_connect</span>()</span>
<span id="cb1-30"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Connecting to Posit Connect 2024.03.0 at</span></span>
<span id="cb1-31"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># &lt;https://connect.strategyunitwm.nhs.uk&gt;</span></span>
<span id="cb1-32"></span>
<span id="cb1-33">board <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pin_list</span>() <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># see all the pins on that board</span></span>
<span id="cb1-34"></span>
<span id="cb1-35"></span>
<span id="cb1-36"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create a pin ------------------------------------------------------------</span></span>
<span id="cb1-37"></span>
<span id="cb1-38"></span>
<span id="cb1-39"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Write a dataset to the board as a pin</span></span>
<span id="cb1-40">board <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pin_write</span>(</span>
<span id="cb1-41">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> starwars,</span>
<span id="cb1-42">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">name =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"starwars_demo"</span></span>
<span id="cb1-43">)</span>
<span id="cb1-44"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Guessing `type = 'rds'`</span></span>
<span id="cb1-45"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Writing to pin 'matt.dray/starwars_demo'</span></span>
<span id="cb1-46"></span>
<span id="cb1-47">board <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pin_exists</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"starwars_demo"</span>)</span>
<span id="cb1-48"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ! Use a fully specified name including user name: "matt.dray/starwars_demo",</span></span>
<span id="cb1-49"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># not "starwars_demo".</span></span>
<span id="cb1-50"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># [1] TRUE</span></span>
<span id="cb1-51"></span>
<span id="cb1-52">pin_name <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"matt.dray/starwars_demo"</span></span>
<span id="cb1-53"></span>
<span id="cb1-54">board <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pin_exists</span>(pin_name) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># logical, TRUE/FALSE</span></span>
<span id="cb1-55">board <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pin_meta</span>(pin_name) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># metadata, see also 'metadata' arg in pin_write()</span></span>
<span id="cb1-56">board <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pin_browse</span>(pin_name) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># view the pin in the browser</span></span>
<span id="cb1-57"></span>
<span id="cb1-58"></span>
<span id="cb1-59"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Permissions -------------------------------------------------------------</span></span>
<span id="cb1-60"></span>
<span id="cb1-61"></span>
<span id="cb1-62"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># You can let people see and edit a pin. Log into Posit Connect and select the</span></span>
<span id="cb1-63"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># pin under 'Content'. In the 'Settings' panel on the right-hand side, adjust</span></span>
<span id="cb1-64"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># the 'sharing' options in the 'Access' tab.</span></span>
<span id="cb1-65"></span>
<span id="cb1-66"></span>
<span id="cb1-67"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Overwrite and version ---------------------------------------------------</span></span>
<span id="cb1-68"></span>
<span id="cb1-69"></span>
<span id="cb1-70">starwars_droids <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> starwars <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb1-71">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(species <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Droid"</span>) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># beep boop</span></span>
<span id="cb1-72"></span>
<span id="cb1-73">board <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pin_write</span>(</span>
<span id="cb1-74">  starwars_droids,</span>
<span id="cb1-75">  pin_name,</span>
<span id="cb1-76">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">type =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"rds"</span></span>
<span id="cb1-77">)</span>
<span id="cb1-78"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Writing to pin 'matt.dray/starwars_demo'</span></span>
<span id="cb1-79"></span>
<span id="cb1-80">board <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pin_versions</span>(pin_name) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># see version history</span></span>
<span id="cb1-81">board <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pin_versions_prune</span>(pin_name, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># remove history</span></span>
<span id="cb1-82">board <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pin_versions</span>(pin_name)</span>
<span id="cb1-83"></span>
<span id="cb1-84"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># What if you try to overwrite the data but it hasn't changed?</span></span>
<span id="cb1-85">board <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pin_write</span>(</span>
<span id="cb1-86">  starwars_droids,</span>
<span id="cb1-87">  pin_name,</span>
<span id="cb1-88">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">type =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"rds"</span></span>
<span id="cb1-89">)</span>
<span id="cb1-90"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># ! The hash of pin "matt.dray/starwars_demo" has not changed.</span></span>
<span id="cb1-91"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># • Your pin will not be stored.</span></span>
<span id="cb1-92"></span>
<span id="cb1-93"></span>
<span id="cb1-94"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Use the pin -------------------------------------------------------------</span></span>
<span id="cb1-95"></span>
<span id="cb1-96"></span>
<span id="cb1-97"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># You can read a pin to your local machine, or access it from a Quarto file</span></span>
<span id="cb1-98"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># or Shiny app hosted on Connect, for example. If the output and the pin are</span></span>
<span id="cb1-99"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># both on Connect, no authentication is required; the board is defaulted to</span></span>
<span id="cb1-100"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># the Posit Connect instance where they're both hosted.</span></span>
<span id="cb1-101"></span>
<span id="cb1-102">board <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb1-103">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pin_read</span>(pin_name) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># like you would use e.g. read_csv</span></span>
<span id="cb1-104">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">with</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> _, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot</span>(mass, height)) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># wow!</span></span>
<span id="cb1-105"></span>
<span id="cb1-106"></span>
<span id="cb1-107"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Delete pin --------------------------------------------------------------</span></span>
<span id="cb1-108"></span>
<span id="cb1-109"></span>
<span id="cb1-110">board <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pin_exists</span>(pin_name) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># logical, good function for error handling</span></span>
<span id="cb1-111">board <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pin_delete</span>(pin_name)</span>
<span id="cb1-112">board <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pin_exists</span>(pin_name)</span></code></pre></div></div>
</div>
</section>
<section id="azure-storage-in-r" class="level2">
<h2 class="anchored" data-anchor-id="azure-storage-in-r">Azure Storage in R</h2>
<p>You will need an .Renviron file containing the four environment variables listed below for the code to work. This .Renviron file should be ignored by git. You can share the contents of .Renviron files with other team members via Teams, email, or SharePoint.</p>
<p>Below is a sample .Renviron file:</p>
<pre><code>AZ_STORAGE_EP=https://STORAGEACCOUNT.blob.core.windows.net/
AZ_STORAGE_CONTAINER=container-name
AZ_TENANT_ID=long-sequence-of-numbers-and-letters
AZ_APP_ID=another-long-sequence-of-numbers-and-letters</code></pre>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">install.packages</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"AzureAuth"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"AzureStor"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"arrow"</span>)) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># if not yet installed</span></span>
<span id="cb3-2"></span>
<span id="cb3-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load all environment variables</span></span>
<span id="cb3-4">ep_uri <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Sys.getenv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"AZ_STORAGE_EP"</span>)</span>
<span id="cb3-5">app_id <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Sys.getenv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"AZ_APP_ID"</span>)</span>
<span id="cb3-6">container_name <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Sys.getenv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"AZ_STORAGE_CONTAINER"</span>)</span>
<span id="cb3-7">tenant <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Sys.getenv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"AZ_TENANT_ID"</span>)</span>
<span id="cb3-8"></span>
<span id="cb3-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Authenticate</span></span>
<span id="cb3-10">token <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> AzureAuth<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_azure_token</span>(</span>
<span id="cb3-11">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://storage.azure.com"</span>,</span>
<span id="cb3-12">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">tenant =</span> tenant,</span>
<span id="cb3-13">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">app =</span> app_id,</span>
<span id="cb3-14">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">auth_type =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"device_code"</span></span>
<span id="cb3-15">)</span>
<span id="cb3-16"></span>
<span id="cb3-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># If you have not authenticated before, you will be taken to an external page to</span></span>
<span id="cb3-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># authenticate! Use your mlcsu.nhs.uk account.</span></span>
<span id="cb3-19"></span>
<span id="cb3-20"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Connect to container</span></span>
<span id="cb3-21">endpoint <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> AzureStor<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">blob_endpoint</span>(ep_uri, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">token =</span> token)</span>
<span id="cb3-22">container <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> AzureStor<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">storage_container</span>(endpoint, container_name)</span>
<span id="cb3-23"></span>
<span id="cb3-24"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># List files in container</span></span>
<span id="cb3-25">blob_list <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> AzureStor<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list_blobs</span>(container)</span>
<span id="cb3-26"></span>
<span id="cb3-27"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># If you get a 403 error when trying to interact with the container, you may</span></span>
<span id="cb3-28"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># have to clear your Azure token and re-authenticate using a different browser.</span></span>
<span id="cb3-29"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Use AzureAuth::clean_token_directory() to clear your token, then repeat the</span></span>
<span id="cb3-30"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># AzureAuth::get_azure_token() step above.</span></span>
<span id="cb3-31"></span>
<span id="cb3-32"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Upload specific file to container</span></span>
<span id="cb3-33">AzureStor<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">storage_upload</span>(container, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"data/ronald.jpeg"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"newdir/ronald.jpeg"</span>)</span>
<span id="cb3-34"></span>
<span id="cb3-35"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Upload contents of a local directory to container</span></span>
<span id="cb3-36">AzureStor<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">storage_multiupload</span>(container, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"data/*"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"newdir"</span>)</span>
<span id="cb3-37"></span>
<span id="cb3-38"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Check files have uploaded</span></span>
<span id="cb3-39">blob_list <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> AzureStor<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list_blobs</span>(container)</span>
<span id="cb3-40"></span>
<span id="cb3-41"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load file directly from Azure container</span></span>
<span id="cb3-42">df_from_azure <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> AzureStor<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">storage_read_csv</span>(</span>
<span id="cb3-43">  container,</span>
<span id="cb3-44">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"newdir/cats.csv"</span>,</span>
<span id="cb3-45">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">show_col_types =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">FALSE</span></span>
<span id="cb3-46">)</span>
<span id="cb3-47"></span>
<span id="cb3-48"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load file directly from Azure container (by temporarily downloading file</span></span>
<span id="cb3-49"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># and storing it in memory)</span></span>
<span id="cb3-50">parquet_in_memory <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> AzureStor<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">storage_download</span>(</span>
<span id="cb3-51">  container,</span>
<span id="cb3-52">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">src =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"newdir/cats.parquet"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">dest =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NULL</span></span>
<span id="cb3-53">)</span>
<span id="cb3-54">parq_df <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> arrow<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read_parquet</span>(parquet_in_memory)</span>
<span id="cb3-55"></span>
<span id="cb3-56"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Delete from Azure container (!!!)</span></span>
<span id="cb3-57"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> (blobfile <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> blob_list<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>name) {</span>
<span id="cb3-58">  AzureStor<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">delete_storage_file</span>(container, blobfile)</span>
<span id="cb3-59">}</span></code></pre></div></div>
</div>
</section>
<section id="azure-storage-in-python" class="level2">
<h2 class="anchored" data-anchor-id="azure-storage-in-python">Azure Storage in Python</h2>
<p>This will use the same environment variables as the R version, just stored in a .env file instead.</p>
<p>We didn’t cover this in the presentation, so it’s not in the slides, but the code should be self-explanatory.</p>
<div class="quarto-embed-nb-cell">
<div id="cell-0" class="cell" data-execution_count="1">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> os</span>
<span id="cb4-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> io</span>
<span id="cb4-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb4-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> dotenv <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> load_dotenv</span>
<span id="cb4-5"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> azure.identity <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> DefaultAzureCredential</span>
<span id="cb4-6"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> azure.storage.blob <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> ContainerClient</span></code></pre></div></div>
</div>
<div id="cell-1" class="cell" data-execution_count="2">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load all environment variables</span></span>
<span id="cb5-2">load_dotenv()</span>
<span id="cb5-3">account_url <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> os.getenv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'AZ_STORAGE_EP'</span>)</span>
<span id="cb5-4">container_name <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> os.getenv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'AZ_STORAGE_CONTAINER'</span>)</span></code></pre></div></div>
</div>
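<p>Because <code>os.getenv()</code> returns <code>None</code> for a variable that isn’t set, a typo in the .env file may only surface later as a confusing connection error. The following is a minimal sketch of an upfront check, not part of the original notebook. The names <code>REQUIRED_VARS</code> and <code>missing_env_vars</code> are hypothetical, and the sketch assumes that only the two variables read in the cell above are needed on the Python side.</p>

```python
import os

# Assumption: only the two variables read in the cell above are required.
# Both names come from the sample .Renviron file shown earlier.
REQUIRED_VARS = ("AZ_STORAGE_EP", "AZ_STORAGE_CONTAINER")


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping rather than the real environment
example_env = {"AZ_STORAGE_EP": "https://STORAGEACCOUNT.blob.core.windows.net/"}
print(missing_env_vars(environ=example_env))  # ['AZ_STORAGE_CONTAINER']
```

<p>Calling <code>missing_env_vars()</code> with no arguments after <code>load_dotenv()</code> checks the real environment. An empty list means both values were found.</p>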
<div id="cell-2" class="cell" data-execution_count="3">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Authenticate</span></span>
<span id="cb6-2">default_credential <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DefaultAzureCredential()</span></code></pre></div></div>
</div>
<p>The first time you run this, you might need to authenticate via the Azure CLI.</p>
<p>Download it from <a href="https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-windows?tabs=azure-cli">https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-windows?tabs=azure-cli</a></p>
<p>Install it, then run <code>az login</code> in your terminal. Once you have logged in with your browser, try the <code>DefaultAzureCredential()</code> step again!</p>
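<p>If you’re not sure whether the Azure CLI is already installed, one quick way to check from Python is to look for the <code>az</code> executable on your PATH. This is a minimal sketch using only the standard library. The helper name <code>azure_cli_available</code> is hypothetical, and finding the executable doesn’t tell you whether you’re logged in.</p>

```python
import shutil


def azure_cli_available(executable="az"):
    """Return True if an executable called `executable` is found on PATH."""
    return shutil.which(executable) is not None


if not azure_cli_available():
    print("Azure CLI not found on PATH; install it, then run `az login`.")
```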
<div id="cell-4" class="cell" data-execution_count="4">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Connect to container</span></span>
<span id="cb7-2">container_client <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> ContainerClient(account_url, container_name, default_credential)</span></code></pre></div></div>
</div>
<div id="cell-5" class="cell" data-execution_count="5">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb8-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># List files in container whose names start with 'newdir'</span></span>
<span id="cb8-2">blob_list <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> container_client.list_blob_names()</span>
<span id="cb8-3"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> blob <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> blob_list:</span>
<span id="cb8-4">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> blob.startswith(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'newdir'</span>):</span>
<span id="cb8-5">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(blob)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>newdir/cats.parquet
newdir/ronald.jpeg</code></pre>
</div>
</div>
<div id="cell-6" class="cell" data-execution_count="6">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb10-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Upload file to container</span></span>
<span id="cb10-2"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">file</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'data/cats.csv'</span>, mode<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"rb"</span>) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> data:</span>
<span id="cb10-3">    blob_client <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> container_client.upload_blob(name<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'newdir/cats.csv'</span>, </span>
<span id="cb10-4">                                               data<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>data, </span>
<span id="cb10-5">                                               overwrite<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span></code></pre></div></div>
</div>
<div id="cell-7" class="cell" data-execution_count="7">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Check files have uploaded - list files in container again</span></span>
<span id="cb11-2">blob_list <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> container_client.list_blobs()</span>
<span id="cb11-3"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> blob <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> blob_list:</span>
<span id="cb11-4">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> blob[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>].startswith(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'newdir'</span>):</span>
<span id="cb11-5">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(blob[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>])</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>newdir/cats.csv
newdir/cats.parquet
newdir/ronald.jpeg</code></pre>
</div>
</div>
<div id="cell-8" class="cell" data-execution_count="8">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb13" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Download file from Azure container to temporary filepath</span></span>
<span id="cb13-2"></span>
<span id="cb13-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Connect to blob</span></span>
<span id="cb13-4">blob_client <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> container_client.get_blob_client(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'newdir/cats.csv'</span>)</span>
<span id="cb13-5"></span>
<span id="cb13-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Write to local file from blob</span></span>
<span id="cb13-7">temp_filepath <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> os.path.join(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'temp_data'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'cats.csv'</span>)</span>
<span id="cb13-8"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">file</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>temp_filepath, mode<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"wb"</span>) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> sample_blob:</span>
<span id="cb13-9">    download_stream <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> blob_client.download_blob()</span>
<span id="cb13-10">    sample_blob.write(download_stream.readall())</span>
<span id="cb13-11">cat_data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(temp_filepath)</span>
<span id="cb13-12">cat_data.head()</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="8">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">Name</th>
<th data-quarto-table-cell-role="th">Physical_characteristics</th>
<th data-quarto-table-cell-role="th">Behaviour</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">0</th>
<td>Ronald</td>
<td>White and ginger</td>
<td>Lazy and greedy but undoubtedly cutest and best</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">1</th>
<td>Kaspie</td>
<td>Small calico</td>
<td>Sweet and very shy but adventurous</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">2</th>
<td>Hennimore</td>
<td>Pale orange</td>
<td>Unhinged and always in a state of panic</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">3</th>
<td>Thug cat</td>
<td>Black and white - very large</td>
<td>Local bully</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">4</th>
<td>Son of Stripey</td>
<td>Grey tabby</td>
<td>Very vocal</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<div id="cell-9" class="cell" data-execution_count="9">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb14" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb14-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load directly from Azure - no local copy</span></span>
<span id="cb14-2"></span>
<span id="cb14-3">download_stream <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> blob_client.download_blob()</span>
<span id="cb14-4">stream_object <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> io.BytesIO(download_stream.readall())</span>
<span id="cb14-5">cat_data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(stream_object)</span>
<span id="cb14-6">cat_data</span></code></pre></div></div>
<div class="cell-output cell-output-display" data-execution_count="9">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th"></th>
<th data-quarto-table-cell-role="th">Name</th>
<th data-quarto-table-cell-role="th">Physical_characteristics</th>
<th data-quarto-table-cell-role="th">Behaviour</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">0</th>
<td>Ronald</td>
<td>White and ginger</td>
<td>Lazy and greedy but undoubtedly cutest and best</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">1</th>
<td>Kaspie</td>
<td>Small calico</td>
<td>Sweet and very shy but adventurous</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">2</th>
<td>Hennimore</td>
<td>Pale orange</td>
<td>Unhinged and always in a state of panic</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">3</th>
<td>Thug cat</td>
<td>Black and white - very large</td>
<td>Local bully</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">4</th>
<td>Son of Stripey</td>
<td>Grey tabby</td>
<td>Very vocal</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
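<p>The in-memory pattern above works because <code>io.BytesIO</code> gives raw bytes a file-like interface, so any reader that accepts a file object can parse them without a local copy. Below is a minimal, standard-library-only sketch of the same idea. It assumes a hard-coded bytes literal standing in for <code>download_stream.readall()</code> and uses the <code>csv</code> module rather than pandas; the helper name <code>rows_from_csv_bytes</code> and the sample data are illustrative, not part of the original notebook.</p>

```python
import csv
import io

# Stand-in for download_stream.readall(): the raw bytes of a small CSV file.
# (Hypothetical sample data, not fetched from Azure.)
blob_bytes = b"Name,Behaviour\nRonald,Lazy\nKaspie,Shy\n"


def rows_from_csv_bytes(raw):
    """Parse CSV bytes held in memory, without writing a local copy."""
    # BytesIO gives the bytes a file-like interface; TextIOWrapper decodes them.
    buffer = io.BytesIO(raw)
    text = io.TextIOWrapper(buffer, encoding="utf-8", newline="")
    return list(csv.DictReader(text))


rows = rows_from_csv_bytes(blob_bytes)
print(rows[0]["Name"])
```

<p>Keeping the data in memory avoids leaving copies of sensitive files on disk, at the cost of holding the whole file in RAM.</p>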
<div id="cell-10" class="cell" data-execution_count="10">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb15" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb15-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># !!!!!!!!! Delete from Azure container !!!!!!!!!</span></span>
<span id="cb15-2">blob_client <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> container_client.get_blob_client(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'newdir/cats.csv'</span>)</span>
<span id="cb15-3">blob_client.delete_blob()</span></code></pre></div></div>
</div>
<div id="cell-11" class="cell" data-execution_count="11">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb16" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb16-1">blob_list <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> container_client.list_blobs()</span>
<span id="cb16-2"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> blob <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> blob_list:</span>
<span id="cb16-3">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> blob[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>].startswith(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'newdir'</span>):</span>
<span id="cb16-4">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(blob[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'name'</span>])</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>newdir/cats.parquet
newdir/ronald.jpeg</code></pre>
</div>
</div>
</div>
</section>
<section id="addendum-accessing-data-from-sharepoint" class="level2">
<h2 class="anchored" data-anchor-id="addendum-accessing-data-from-sharepoint">ADDENDUM: Accessing data from SharePoint</h2>
<p>SharePoint is Microsoft’s content and knowledge management tool. Many teams across the NHS use it to store all sorts of files that need to be preserved or shared within and between teams, but that also need to be kept secure.</p>
<p>Accessing SharePoint requires user authentication, which you’ll be prompted for in the browser when you try to access SharePoint from R. Note that you must ‘follow’ a SharePoint site before you can fetch its content.</p>
<p>To access data on SharePoint, follow these steps:</p>
<ol type="1">
<li><p>Navigate to the SharePoint page that has the file of interest in it, using your browser.</p></li>
<li><p>Click the small star in the top right corner of the window, labelled ‘Follow’.</p></li>
</ol>
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2024-05-22_storing-data-safely/Follow.jpg" class="img-fluid" alt="A screenshot of the upper-right corner of a SharePoint site in a web browser, highlighting a star-shaped toggle labelled 'follow'."></p>
<ol start="3" type="1">
<li><p>Open (or create) your project’s <code>.Renviron</code> file, used for <a href="https://docs.posit.co/ide/user/ide/guide/environments/r/managing-r.html#renviron">storing environment variables</a>. You can do this by running <code>usethis::edit_r_environ(scope = "project")</code> from the R console.</p></li>
<li><p>Save two new environment variables to the <code>.Renviron</code> file:</p>
<ol type="a">
<li><code>SP_SITE_NAME</code>: the name of the site (i.e.&nbsp;the name at the top of the browser screen)</li>
<li><code>SP_FILE_PATH</code>: the full file path from below the <code>Documents/</code> folder to the file itself, including the file type extension.</li>
</ol></li>
<li><p>Save and close the <code>.Renviron</code> file then either restart your R session (CTRL + Shift + F10 on Windows machines) or run <code>readRenviron(".Renviron")</code> to make the new variables available in your session.</p></li>
<li><p>Use the code below to read an Excel file (xlsx) into R memory, or adapt it to read other file types.</p></li>
</ol>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb18" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb18-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Read in the securely saved path and site variables</span></span>
<span id="cb18-2">sharepoint_site <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Sys.getenv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SP_SITE_NAME"</span>)</span>
<span id="cb18-3">template_path <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">Sys.getenv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"SP_FILE_PATH"</span>)</span>
<span id="cb18-4"></span>
<span id="cb18-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># access the dataset and save it into a temporary file</span></span>
<span id="cb18-6">site <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> Microsoft365R<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_sharepoint_site</span>(sharepoint_site)</span>
<span id="cb18-7">drv <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> site<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">get_drive</span>()</span>
<span id="cb18-8">tmp_file <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tempfile</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">fileext =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">".xlsx"</span>)</span>
<span id="cb18-9">drv<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">download_file</span>(template_path, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">dest =</span> tmp_file)</span>
<span id="cb18-10"></span>
<span id="cb18-11">data <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> readxl<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">::</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read_xlsx</span>(tmp_file)</span>
<span id="cb18-12"></span>
<span id="cb18-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Tidy up</span></span>
<span id="cb18-14"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">unlink</span>(tmp_file)</span></code></pre></div></div>
</div>
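<p>If you keep the same settings in environment variables for a Python workflow, it can help to check them up front so that a missing value fails with a clear message rather than an obscure download error. The sketch below is one way to do that. It assumes the variables are already set in the process environment (Python does not read <code>.Renviron</code> itself), and the helper name <code>require_env</code> and the example values are hypothetical.</p>

```python
import os


def require_env(names, environ=None):
    """Return the named environment variables, failing early if any is unset or empty."""
    env = os.environ if environ is None else environ
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}


# Demonstration with a stand-in mapping; omit `environ` to read the real environment.
example_env = {"SP_SITE_NAME": "My Team Site", "SP_FILE_PATH": "folder/file.xlsx"}
settings = require_env(["SP_SITE_NAME", "SP_FILE_PATH"], environ=example_env)
print(settings["SP_SITE_NAME"])
```
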


</section>

 ]]></description>
  <category>learning</category>
  <category>R</category>
  <category>Python</category>
  <guid>https://the-strategy-unit.github.io/data_science/blogs/posts/2024-05-22_storing-data-safely/</guid>
  <pubDate>Wed, 22 May 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>One year of coffee &amp; coding</title>
  <dc:creator>Rhian Davies</dc:creator>
  <link>https://the-strategy-unit.github.io/data_science/blogs/posts/2024-05-13_one-year-coffee-code/</link>
  <description><![CDATA[ 





<p>The data science team have been running coffee &amp; coding sessions for just over a year now. When I joined the Strategy Unit, I was really pleased to see these sessions running as I think making time to discuss and share technical knowledge is highly valuable, especially as an organisation grows.</p>
<p>Coffee and coding sessions run every two weeks and usually take the form of a short presentation followed by a discussion, although we have also had other formats, including live coding demos and show and tell for projects.</p>
<p>We figured it would be a good idea to do a quick survey of attendees to make sure that the sessions were beneficial and see if there were any suggestions for future sessions. We had 11 responses, all of which were really positive, with 90% agreeing that the sessions are interesting, and over 80% saying that they learn new things. Respondents said that the sessions were well varied across the technical spectrum and that they “almost always learn something useful”.</p>
<p>The two main themes of the results were that sessions were <em>inclusive</em> and <em>sparked collaboration.</em> ✨</p>
<blockquote class="blockquote">
<p>I like that everyone can contribute</p>
</blockquote>
<blockquote class="blockquote">
<p>It’s great seeing what else people are doing</p>
</blockquote>
<blockquote class="blockquote">
<p>I get more ideas for future projects</p>
</blockquote>
<p>Some of the main suggestions included more content for newer programmers and encouraging the wider analytical team to share real project examples.</p>
<p>So with that, why not consider presenting? The sessions are informal and everyone is welcome to contribute. If you’ve got something to share, please let a member of the data science team know.</p>
<p>As a reminder, materials for our previous sessions are available under <a href="https://the-strategy-unit.github.io/data_science/presentations/">Presentations</a>.</p>



 ]]></description>
  <category>learning</category>
  <guid>https://the-strategy-unit.github.io/data_science/blogs/posts/2024-05-13_one-year-coffee-code/</guid>
  <pubDate>Mon, 13 May 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>RStudio Tips and Tricks</title>
  <dc:creator>Matt Dray</dc:creator>
  <link>https://the-strategy-unit.github.io/data_science/blogs/posts/2024-03-21_rstudio-tips/</link>
  <description><![CDATA[ 





<section id="coffee-coding" class="level2">
<h2 class="anchored" data-anchor-id="coffee-coding">Coffee &amp; Coding</h2>
<p>In a recent Coffee &amp; Coding session we chatted about tips and tricks for <a href="https://posit.co/products/open-source/rstudio/">RStudio</a>, the popular and free Integrated Development Environment (IDE) that many Strategy Unit analysts use to write R code.</p>
<p>RStudio has lots of neat features but many are tucked away in submenus. This session was a chance for the community to uncover and discuss some hidden gems to make our work easier and faster.</p>
</section>
<section id="official-guidance" class="level2">
<h2 class="anchored" data-anchor-id="official-guidance">Official guidance</h2>
<p><a href="https://posit.co/">Posit</a> is the company who build and maintain RStudio. They host a number of cheatsheets on their website, <a href="https://rstudio.github.io/cheatsheets/html/rstudio-ide.html">including one for RStudio</a>. They also have a more <a href="https://docs.posit.co/ide/user/">in-depth user guide</a>.</p>
</section>
<section id="command-palette" class="level2">
<h2 class="anchored" data-anchor-id="command-palette">Command palette</h2>
<p>RStudio has a powerful built-in <a href="https://docs.posit.co/ide/user/ide/guide/ui/command-palette.html">Command Palette</a>, which is a special search box that gives instant access to features and settings without needing to find them in the menus. Many of the tips and tricks we discussed can be found by searching in the Palette. Open it with the keyboard shortcut <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>P</kbd>.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2024-03-21_rstudio-tips/command-palette.png" class="img-fluid figure-img" style="width:100.0%" data-fig.alt="The RStudio command palette. It's a search box with some suggested actions underneath, like 'create a new R script' and 'insert pipe operator'. Some of these options show what grouping they belong to, like 'help'. Others display keyboard shortcuts."></p>
<figcaption>Opening the Command Palette.</figcaption>
</figure>
</div>
<p>For example, let’s say you forgot how to restart R. If you open the Command Palette and start typing ‘restart’, you’ll see the option ‘Restart R Session’. Clicking it will do exactly that. Handily, the Palette also displays the keyboard shortcut (<kbd>Control</kbd> + <kbd>Shift</kbd> + <kbd>F10</kbd> on Windows) as a reminder.</p>
<p>As for settings, a search for ‘rainbow’ in the Command Palette will find ‘Use rainbow parentheses’, an option to help prevent bracket-mismatch errors by colouring pairs of parentheses. What’s nice is that the checkbox to toggle the feature appears right there in the palette so you can change it immediately.</p>
<p>I refer to menu paths and keyboard shortcuts in the rest of this post, but bear in mind that you can use the Command Palette instead.</p>
</section>
<section id="options" class="level2">
<h2 class="anchored" data-anchor-id="options">Options</h2>
<p>In general, most settings can be found under Tools &gt; Global Options… and many of these are discussed in the rest of this post.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2024-03-21_rstudio-tips/options.png" class="img-fluid figure-img" style="width:100.0%" data-fig.alt="RStudio options window opened to the 'basic' tab of the 'general' section. There are options to adjust things like your workspace and history settings. Other sections include settings for the console, appearance and accessibility."></p>
<figcaption>Adjusting workspace and history settings.</figcaption>
</figure>
</div>
<p>But there are a few settings in particular that we recommend you change to help maximise reproducibility and reduce the chance of confusion. Under General &gt; Basic, uncheck ‘Restore .RData into workspace at startup’ and select ‘Never’ from the dropdown options next to ‘Save workspace to .RData on exit’. These options mean you start with the ‘blank slate’ of an empty environment when you open a project, allowing you to rebuild objects from scratch<sup>1</sup>.</p>
</section>
<section id="keyboard-shortcuts" class="level2">
<h2 class="anchored" data-anchor-id="keyboard-shortcuts">Keyboard shortcuts</h2>
<p>You can speed up day-to-day coding with keyboard shortcuts instead of clicking buttons in the interface.</p>
<p>You can see some available shortcuts in RStudio if you navigate to Help &gt; Keyboard Shortcuts Help, or use the shortcut <kbd>Alt</kbd> + <kbd>Shift</kbd> + <kbd>K</kbd> (how meta). You can go to Help &gt; Modify Keyboard Shortcuts… to search all shortcuts and change them to what you prefer<sup>2</sup>.</p>
<p>We discussed a number of handy shortcuts that we use frequently<sup>3</sup>. You can:</p>
<ul>
<li>re-indent lines to the appropriate depth with <kbd>Control</kbd> + <kbd>I</kbd></li>
<li>reformat code with <kbd>Control</kbd> + <kbd>Shift</kbd> + <kbd>A</kbd></li>
<li>turn one or more lines into a comment with <kbd>Control</kbd> + <kbd>Shift</kbd> + <kbd>C</kbd></li>
<li>insert the pipe operator (<code>%&gt;%</code> or <code>|&gt;</code><sup>4</sup>) with <kbd>Control</kbd> + <kbd>Shift</kbd> + <kbd>M</kbd><sup>5</sup></li>
<li>insert the assignment arrow (<code>&lt;-</code>) with <kbd>Alt</kbd> + <kbd>-</kbd> (hyphen)</li>
<li>highlight a function in the script or console and press <kbd>F1</kbd> to open the function documentation in the ‘Help’ pane</li>
<li>use ‘Find in Files’ to search for a particular variable, function or string across all the files in your project, with <kbd>Control</kbd> + <kbd>Shift</kbd> + <kbd>F</kbd></li>
</ul>
</section>
<section id="themes" class="level2">
<h2 class="anchored" data-anchor-id="themes">Themes</h2>
<p>You can change a number of settings to alter RStudio’s theme, colours and fonts to whatever you desire.</p>
<p>You can <a href="https://docs.posit.co/ide/user/ide/guide/ui/appearance.html">change the default theme</a> in Tools &gt; Global Options… &gt; Appearance &gt; Editor theme and select one from the pre-installed list. You can upload new themes by clicking the ‘Add’ button and selecting a theme from your computer. They typically have the file extension .rstheme and can be downloaded from the web, or you can create or tweak one yourself. <a href="https://www.garrickadenbuie.com/project/rsthemes/">The {rsthemes} package</a> has a number of options and also allows you to switch between themes and automatically switch between light and dark themes depending on the time of day.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2024-03-21_rstudio-tips/appearance.png" class="img-fluid figure-img" style="width:100.0%" data-fig.alt="RStudio options window opened to the 'appearance' section. The font has been changed to an installed font called 'Fira Code'. The theme has been changed to an installed theme called 'viridis'."></p>
<figcaption>Customising the appearance and font.</figcaption>
</figure>
</div>
<p>In the same ‘Appearance’ submenu as the theme settings, you can find an option to change fonts. Monospace fonts, ones where each character takes up the same width, will appear here automatically if you’ve installed them on your computer. One popular font for coding is <a href="https://github.com/tonsky/FiraCode">Fira Code</a>, which has the special property of converting certain sets of characters into ‘ligatures’, which some people find easier to read. For example, the base pipe will appear as a rightward-pointing arrow rather than its constituent vertical-pipe and greater-than symbol (<code>|&gt;</code>).</p>
</section>
<section id="panes" class="level2">
<h2 class="anchored" data-anchor-id="panes">Panes</h2>
<section id="layout" class="level3">
<h3 class="anchored" data-anchor-id="layout">Layout</h3>
<p>The structural layout of RStudio’s panes can be adjusted. One simple thing you can do is minimise and maximise each pane by clicking the window icons in their upper-right corners. This is useful when you want more screen real-estate for a particular pane.</p>
<p>You can move pane locations too. Click the ‘Workspace Panes’ button (a square with four more inside it) at the top of the IDE to see a number of settings. For example, you can select ‘Console on the right’ to move the R console to the upper-right pane, which you may prefer for maximising the vertical space in which code is shown. You could also click Pane Layout… in this menu to be taken to Tools &gt; Global Options… &gt; Pane layout, where you can click ‘Add Column’ to insert new script panes that allow you to inspect and write multiple files side-by-side.</p>
</section>
<section id="script-navigation" class="level3">
<h3 class="anchored" data-anchor-id="script-navigation">Script navigation</h3>
<p>The script pane in particular has a nice feature for navigating through sections of your script or Quarto/R Markdown files. Click the ‘Show Document Outline’ button or use the keyboard shortcut <kbd>Control</kbd> + <kbd>Shift</kbd> + <kbd>O</kbd> to slide open a tray that provides a nice indented list of all the sections and function definitions in your file.</p>
<p>Section headers are auto-detected in a Quarto or R Markdown document wherever the Markdown header markup has been used: one hashmark (<code>#</code>) for a level 1 header, two for level 2, and so on. To add section headers to an R Script, add at least four hyphens after a commented line that starts with <code>#</code>. Use two or more hashes at the start of the comment to increase the nestedness of that section.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Header ------------------------------------------------------------------</span></span>
<span id="cb1-2"></span>
<span id="cb1-3"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">## Section ----</span></span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="do" style="color: #5E5E5E;
background-color: null;
font-style: italic;">### Subsection ----</span></span></code></pre></div></div>
</div>
<p>Note that <kbd>Ctrl</kbd> + <kbd>Shift</kbd> + <kbd>R</kbd> will open a dialog box for you to input the name of a section header, which will be inserted and automatically padded to 75 characters to provide a strong visual cue between sections.</p>
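<p>To make the header rule concrete, here is a small sketch of how such a padded comment line could be built: hash marks for the nesting level, the title, then trailing hyphens up to the target width, never fewer than four. It is written in Python purely for illustration. The function name <code>section_header</code> is hypothetical, and the padding logic mimics the behaviour described above rather than RStudio’s actual implementation.</p>

```python
def section_header(title, level=1, width=75):
    """Build an R-script section comment: hashes, title, then trailing hyphens."""
    prefix = "#" * level + " " + title + " "
    # At least four hyphens are needed for the line to count as a section header.
    hyphens = max(4, width - len(prefix))
    return prefix + "-" * hyphens


print(section_header("Header"))
print(section_header("Section", level=2, width=0))
```
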
<p>As well as the document outline, there’s also a reminder in the lower-left of the script pane that gives the name of the section that your cursor is currently in. A symbol is also shown: a hashmark means it’s a headed section and an ‘f’ means it’s a function definition. You can click this to jump to other sections.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2024-03-21_rstudio-tips/rstudio-sections.png" class="img-fluid figure-img" style="width:100.0%" data-fig.alt="RStudio script pane. There are three headers denoted with one, two and three hash marks, named 'header', 'section' and 'subsection'. There is a demo function named 'add_one' in the subsection. The outline panel is open on the right and shows these items nested under each other. The cursor is on the subsection header and this is noted under the script."></p>
<figcaption>Navigating with headers in the R script pane.</figcaption>
</figure>
</div>
</section>
<section id="background-jobs" class="level3">
<h3 class="anchored" data-anchor-id="background-jobs">Background jobs</h3>
<p>Perhaps an under-used pane is ‘<a href="https://docs.posit.co/ide/user/ide/guide/tools/jobs.html">Background jobs</a>’. This is where you can run a separate R process that keeps your R console free. Go to Tools &gt; Background Jobs &gt; Start Background Job… to expose this tab if it isn’t already listed alongside the R console.</p>
<p>Why might you want to do this? As I write this post, there’s a background process to detect changes to the Quarto document that I’m writing and then update a preview I have running in the browser. You can do something similar for Shiny apps. You can continue to develop your app and test things in the console and the app preview will update on save. You won’t need to keep hitting the ‘Render’ or ‘Run app’ button every time you make a change.</p>
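<p>Background jobs can also be started from code rather than the menu. The sketch below assumes the {rstudioapi} package is installed and that the code is run inside RStudio. The script contents and the job name are made up for illustration.</p>

```r
# Write a small throwaway script to run as a background job.
job_script <- tempfile(fileext = ".R")
writeLines(
  c(
    "Sys.sleep(5) # stand-in for a long-running task",
    "result <- sum(1:10)"
  ),
  con = job_script
)

# Only try to launch the job when the code is running inside RStudio.
if (requireNamespace("rstudioapi", quietly = TRUE) && rstudioapi::isAvailable()) {
  rstudioapi::jobRunScript(path = job_script, name = "Example background job")
}
```

<p>The job should then appear in the ‘Background Jobs’ tab while the console stays free for other work.</p>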
</section>
</section>
<section id="magic-wand" class="level2">
<h2 class="anchored" data-anchor-id="magic-wand">Magic wand</h2>
<p>There’s a miscellany of useful tools available when you click the ‘magic wand’ button in the script pane.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://the-strategy-unit.github.io/data_science/blogs/posts/2024-03-21_rstudio-tips/wand.png" class="img-fluid figure-img" style="width:100.0%" data-fig.alt="The magic wand submenu, which contains options to do things like 'rename in scope', 'reflow comment' and 'reindent lines'."></p>
<figcaption>Abracadabra! Casting open the ‘magic wand’ menu.</figcaption>
</figure>
</div>
<p>This includes:</p>
<ul>
<li>‘Rename in Scope’, which is like find-and-replace but only changes instances within the same ‘scope’: select the variable <code>x</code>, choose Rename in Scope and you can then edit all in-scope instances of that variable at the same time (e.g.&nbsp;to rename them)</li>
<li>‘Reflow Comment’, which you can click after highlighting a comment block to have the comments automatically line-break at the maximum width</li>
<li>‘Insert Roxygen Skeleton’, which you can click when your cursor is inside the body of a function you’ve written and a {roxygen2} documentation template will be added above your function with a <code>@param</code> tag pre-filled for each argument name</li>
</ul>
<p>The menu also includes ‘Comment/Uncomment Lines’, ‘Reindent Lines’ and ‘Reformat Lines’, mentioned above in the keyboard shortcuts section.</p>
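<p>For a sense of what ‘Insert Roxygen Skeleton’ gives you, the sketch below shows roughly the template it adds above a small function. The exact fields may differ between RStudio versions, and the body of <code>add_one()</code> is an assumption made for this illustration.</p>

```r
#' Title
#'
#' @param x
#'
#' @return
#' @export
#'
#' @examples
add_one <- function(x) {
  x + 1
}

add_one(1)
```

<p>You would then replace ‘Title’ and fill in the description after each tag.</p>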
</section>
<section id="wrapping-up" class="level2">
<h2 class="anchored" data-anchor-id="wrapping-up">Wrapping up</h2>
<p>Time was limited in our discussion. There are so many more tips and tricks that we didn’t get to. Let us know what we missed, or what your favourite shortcuts and settings are.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>For the same reason it’s a good idea to restart R on a frequent basis. You may assume that an object <code>x</code> in your environment was made in a certain way and contains certain information, but does it? What if you overwrote it at some point and forgot? Best to wipe the slate clean and rebuild it from scratch. Jenny Bryan has <a href="https://www.tidyverse.org/blog/2017/12/workflow-vs-script/">written an explainer</a>.↩︎</p></li>
<li id="fn2"><p>You can ‘snap focus’ to the script and console panes with the pre-existing shortcuts <kbd>Control</kbd> + <kbd>1</kbd> and <kbd>Control</kbd> + <kbd>2</kbd>. My next most-used pane is the terminal, so I’ve re-mapped the shortcut to <kbd>Control</kbd> + <kbd>3</kbd>.↩︎</p></li>
<li id="fn3"><p>The classic shortcuts of select-all (<kbd>Control</kbd> + <kbd>A</kbd>), cut (<kbd>Control</kbd> + <kbd>X</kbd>), copy (<kbd>Control</kbd> + <kbd>C</kbd>), paste (<kbd>Control</kbd> + <kbd>V</kbd>), undo (<kbd>Control</kbd> + <kbd>Z</kbd>) and redo (<kbd>Control</kbd> + <kbd>Shift</kbd> + <kbd>Z</kbd>) are all available when editing.↩︎</p></li>
<li id="fn4"><p>Note that you can set the default pipe to the base-R version (<code>|&gt;</code>) by checking the box at Tools &gt; Global Options… &gt; Code &gt; Use native pipe operator.↩︎</p></li>
<li id="fn5"><p>Probably ‘M’ for {magrittr}, the name of the package that contains the <code>%&gt;%</code> incarnation of the operator.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>learning</category>
  <category>R</category>
  <guid>https://the-strategy-unit.github.io/data_science/blogs/posts/2024-03-21_rstudio-tips/</guid>
  <pubDate>Thu, 21 Mar 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Visualising participant recruitment in R using Sankey plots</title>
  <dc:creator>Craig Parylo</dc:creator>
  <link>https://the-strategy-unit.github.io/data_science/blogs/posts/2024-02-28_sankey_plot/</link>
  <description><![CDATA[ 





<section id="introduction" class="level1 page-columns page-full">
<h1>Introduction</h1>
<p>Sankey diagrams are great tools to visualise flows through a system. They show connections between the steps of a process where the width of the arrows is proportional to the flow.</p>
<p>I’m working on an evaluation of a risk screening process for people aged 55 to 74 years with a history of smoking. In this Targeted Lung Health Check (TLHC) programme<sup>1</sup> eligible people are invited to attend a free lung check, and those assessed as being at high risk of lung cancer are then offered low-dose CT screening scans.</p>
<div class="no-row-height column-margin column-container"><div id="fn1"><p><sup>1</sup>&nbsp;Please visit the <a href="https://www.england.nhs.uk/contact-us/privacy-notice/how-we-use-your-information/our-services/evaluation-of-the-targeted-lung-health-check-programme/">NHS England</a> site for more background.</p></div></div><p>We used Sankey diagrams to visualise how people have engaged with the programme: recruitment, attendance at appointments, outcome from risk assessment and attendance at CT scans. The diagrams will eventually be extended to cover the impact of the screening on early detection of those diagnosed with lung cancer.</p>
<p>This blog post is about the technical process of preparing record-level data for visualisation in a Sankey plot using <code>R</code> and customising the plot to enhance its look and feel. Here is how the finished product will look:</p>
<div class="cell">
<div class="cell-output-display">
<div class="plotly html-widget html-fill-item" id="htmlwidget-8d3c50b5bc54d0f0514d" style="width:672px;height:480px;"></div>
<script type="application/json" data-for="htmlwidget-8d3c50b5bc54d0f0514d">{"x":{"visdat":{"74a468a063c8":["function () ","plotlyVisDat"]},"cur_data":"74a468a063c8","attrs":{"74a468a063c8":{"orientation":"h","arrangement":"snap","node":{"label":["Eligible population","Invitation 1 No response","Invitation 1 Declined","Invitation 1 Accepted","Invitation 2 No response","Invitation 2 Accepted","Invitation 2 Declined","Invitation 3 Accepted","Invitation 3 Declined","Invitation 3 No response","Declined","Accepted","No response"],"color":["#7f8fa6","#7f8fa6","#c23616","#44bd32","#7f8fa6","#44bd32","#c23616","#44bd32","#c23616","#7f8fa6","#c23616","#44bd32","#7f8fa6"],"x":[0.001,0.22575000000000001,0.22575000000000001,0.22575000000000001,0.45050000000000001,0.45050000000000001,0.45050000000000001,0.67525000000000002,0.67525000000000002,0.67525000000000002,0.90000000000000002,0.90000000000000002,0.90000000000000002],"y":[0.53326666666666667,0.53326666666666667,0.999,0.001,0.53326666666666667,0.13406666666666667,0.90585333333333329,0.22721333333333335,0.82601333333333338,0.63972000000000007,0.93246666666666667,0.001,0.53326666666666667],"customdata":[null,"77.0%","15.0%","8.0%","46.0%","12.0%","8.0%","9.0%","10.0%","14.0%","33.0%","29.0%","38.0%"],"hovertemplate":"%{label}<br /><b>%{value}<\/b> participants<br /><b>%{customdata}<\/b> of eligible 
population"},"link":{"source":[0,0,0,1,1,1,2,3,1,4,4,4,5,6,4,7,8,9],"target":[2,1,3,4,5,6,10,11,12,7,8,9,11,10,12,11,10,12],"value":[15,77,8,46,12,8,15,8,11,9,10,14,12,8,13,9,10,14],"label":["15","77","8","46","12","8","15","8","11","9","10","14","12","8","13","9","10","14"],"color":["#C236164C","#7F8FA64C","#44BD324C","#7F8FA64C","#44BD324C","#C236164C","#C236164C","#44BD324C","#7F8FA64C","#44BD324C","#C236164C","#7F8FA64C","#44BD324C","#C236164C","#7F8FA64C","#44BD324C","#C236164C","#7F8FA64C"],"customdata":["15.0%","77.0%","8.0%","46.0%","12.0%","8.0%","15.0%","8.0%","11.0%","9.0%","10.0%","14.0%","12.0%","8.0%","13.0%","9.0%","10.0%","14.0%"],"hovertemplate":"%{source.label} → %{target.label}<br /><b>%{value}<\/b> participants<br /><b>%{customdata}<\/b> of eligible population"},"alpha_stroke":1,"sizes":[10,100],"spans":[1,20],"type":"sankey"}},"layout":{"margin":{"b":40,"l":60,"t":25,"r":10},"font":{"family":"Arial, Helvetica, sans-serif","size":12},"paper_bgcolor":"rgba(0,0,0,0)","hovermode":"closest","showlegend":false},"source":"A","config":{"modeBarButtonsToAdd":["hoverclosest","hovercompare"],"showSendToCloud":false,"responsive":true},"data":[{"orientation":"h","arrangement":"snap","node":{"label":["Eligible population","Invitation 1 No response","Invitation 1 Declined","Invitation 1 Accepted","Invitation 2 No response","Invitation 2 Accepted","Invitation 2 Declined","Invitation 3 Accepted","Invitation 3 Declined","Invitation 3 No response","Declined","Accepted","No 
response"],"color":["#7f8fa6","#7f8fa6","#c23616","#44bd32","#7f8fa6","#44bd32","#c23616","#44bd32","#c23616","#7f8fa6","#c23616","#44bd32","#7f8fa6"],"x":[0.001,0.22575000000000001,0.22575000000000001,0.22575000000000001,0.45050000000000001,0.45050000000000001,0.45050000000000001,0.67525000000000002,0.67525000000000002,0.67525000000000002,0.90000000000000002,0.90000000000000002,0.90000000000000002],"y":[0.53326666666666667,0.53326666666666667,0.999,0.001,0.53326666666666667,0.13406666666666667,0.90585333333333329,0.22721333333333335,0.82601333333333338,0.63972000000000007,0.93246666666666667,0.001,0.53326666666666667],"customdata":[null,"77.0%","15.0%","8.0%","46.0%","12.0%","8.0%","9.0%","10.0%","14.0%","33.0%","29.0%","38.0%"],"hovertemplate":"%{label}<br /><b>%{value}<\/b> participants<br /><b>%{customdata}<\/b> of eligible population"},"link":{"source":[0,0,0,1,1,1,2,3,1,4,4,4,5,6,4,7,8,9],"target":[2,1,3,4,5,6,10,11,12,7,8,9,11,10,12,11,10,12],"value":[15,77,8,46,12,8,15,8,11,9,10,14,12,8,13,9,10,14],"label":["15","77","8","46","12","8","15","8","11","9","10","14","12","8","13","9","10","14"],"color":["#C236164C","#7F8FA64C","#44BD324C","#7F8FA64C","#44BD324C","#C236164C","#C236164C","#44BD324C","#7F8FA64C","#44BD324C","#C236164C","#7F8FA64C","#44BD324C","#C236164C","#7F8FA64C","#44BD324C","#C236164C","#7F8FA64C"],"customdata":["15.0%","77.0%","8.0%","46.0%","12.0%","8.0%","15.0%","8.0%","11.0%","9.0%","10.0%","14.0%","12.0%","8.0%","13.0%","9.0%","10.0%","14.0%"],"hovertemplate":"%{source.label} → %{target.label}<br /><b>%{value}<\/b> participants<br /><b>%{customdata}<\/b> of eligible 
population"},"type":"sankey","frame":null}],"highlight":{"on":"plotly_click","persistent":false,"dynamic":false,"selectize":false,"opacityDim":0.20000000000000001,"selected":{"opacity":1},"debounce":0},"shinyEvents":["plotly_hover","plotly_click","plotly_selected","plotly_relayout","plotly_brushed","plotly_brushing","plotly_clickannotation","plotly_doubleclick","plotly_deselect","plotly_afterplot","plotly_sunburstclick"],"base_url":"https://plot.ly"},"evals":[],"jsHooks":[]}</script>
</div>
</div>
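<p>Under the hood, a plotly Sankey needs just two ingredients: a set of node labels and a set of links that point from one node to another by zero-based position. The sketch below is a minimal, unstyled version that uses the first-invitation counts from the plot above. It assumes the {plotly} package is installed and is not the code used to build the finished plot.</p>

```r
library(plotly)

# node labels; links refer to these by zero-based position
node_labels <- c(
  "Eligible population",      # 0
  "Invitation 1 No response", # 1
  "Invitation 1 Declined",    # 2
  "Invitation 1 Accepted"     # 3
)

fig <- plot_ly(
  type = "sankey",
  orientation = "h",
  node = list(label = node_labels),
  link = list(
    source = c(0, 0, 0), # all flows start at 'Eligible population'
    target = c(1, 2, 3),
    value = c(77, 15, 8) # link widths are proportional to these counts
  )
)

fig
```

<p>The rest of this post is about producing those node and link vectors from record-level data and then styling the result.</p>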
</section>
<section id="data-wrangling" class="level1 page-columns page-full">
<h1>Data wrangling</h1>
<p>First we’ll attach some packages. I’ll be using <a href="https://plotly.com/r/sankey-diagram/">plotly</a> for the visualisation of the Sankey chart, <a href="https://tidygraph.data-imaginist.com/">tidygraph</a> for graph manipulation and <a href="https://scales.r-lib.org/">scales</a> to handle colour transformation and rescaling values. We will also be using the <a href="https://www.tidyverse.org/">tidyverse</a> and <a href="https://glue.tidyverse.org/">glue</a> packages for general data wrangling and <a href="https://glin.github.io/reactable/index.html">reactable</a> to preview our data as we go along.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># libraries</span></span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidyverse) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 'tidy' data wrangling</span></span>
<span id="cb1-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(plotly) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># sankey visualisation framework</span></span>
<span id="cb1-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(reactable) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># viewing interactive datatables</span></span>
<span id="cb1-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(glue) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># concatenating strings</span></span>
<span id="cb1-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidygraph) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># api for graph / network manipulation</span></span>
<span id="cb1-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(scales) <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># used for colour transformation</span></span></code></pre></div></div>
</div>
<section id="get-the-data" class="level2">
<h2 class="anchored" data-anchor-id="get-the-data">Get the data</h2>
<p>In this example we will work with a simplified set of data focused on invitations.</p>
<p>The invites table holds details of when people were sent a letter or message inviting them to take part, how many times they were invited and how the person responded.</p>
<p>The people eligible for the programme are identified up-front and are represented by a unique ID with one row per person. Let’s assume each person receives at least one invitation to take part. Each invitation can have one of three outcomes:</p>
<ol type="1">
<li><p>They accept the invitation and agree to take part,</p></li>
<li><p>They decline the invitation,</p></li>
<li><p>They do not respond to the invitation.</p></li>
</ol>
<p>If the person doesn’t respond to the first invitation they may be sent a second invitation and could be offered a third invitation if they didn’t respond to the second.</p>
<p>Here is the specification for our simplified invites table:</p>
<table class="caption-top table">
<caption>Invites specification</caption>
<colgroup>
<col style="width: 14%">
<col style="width: 8%">
<col style="width: 75%">
</colgroup>
<thead>
<tr class="header">
<th>Field</th>
<th>Type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Participant ID</td>
<td>Integer</td>
<td>A unique identifier for each person.</td>
</tr>
<tr class="even">
<td>Invite date 1</td>
<td>Date</td>
<td><p>The date the person was first invited to participate.</p>
<p>Every person will have a date in this field.</p></td>
</tr>
<tr class="odd">
<td>Invite date 2</td>
<td>Date</td>
<td>The date a second invitation was sent.</td>
</tr>
<tr class="even">
<td>Invite date 3</td>
<td>Date</td>
<td>The date a third invitation was sent.</td>
</tr>
<tr class="odd">
<td>Invite outcome</td>
<td>Text</td>
<td>The outcome from the invite, one of either ‘Accepted’, ‘Declined’ or ‘No response’.</td>
</tr>
</tbody>
</table>
<p>Everyone receives at least one invite. Assuming a third of these respond (to either accept or decline) then two-thirds receive a follow-up invite. Of these, we assume half respond, meaning the remaining participants receive a third invite.</p>
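<p>As a quick check of what those proportions mean in practice, here is the arithmetic for a cohort of 100 people. Rounding fractional group sizes down is an assumption made for this sketch.</p>

```r
eligible <- 100 # everyone receives a first invite

# assumption: fractional group sizes are rounded down
second_invites <- floor(eligible * 2 / 3)
third_invites <- floor(second_invites * 1 / 2)

c(first = eligible, second = second_invites, third = third_invites)
```

<p>So from 100 first invites we would expect 66 second invites and 33 third invites.</p>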
<p>Here we generate 100 rows of example data to populate our table.</p>
<div class="cell fig-cap-location-top">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" data-cap-location="top" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># set a randomisation seed for reproducibility</span></span>
<span id="cb2-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">set.seed</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">seed =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1234</span>)</span>
<span id="cb2-3"></span>
<span id="cb2-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># define some parameters</span></span>
<span id="cb2-5">start_date <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.Date</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2019-01-01"</span>)</span>
<span id="cb2-6">end_date <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.Date</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"2021-01-01"</span>)</span>
<span id="cb2-7">rows <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span></span>
<span id="cb2-8"></span>
<span id="cb2-9">df_invites_1 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tibble</span>(</span>
<span id="cb2-10">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># create a unique id for each participant</span></span>
<span id="cb2-11">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">participant_id =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span>rows,</span>
<span id="cb2-12"></span>
<span id="cb2-13">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># create a random initial invite date between our start and end dates</span></span>
<span id="cb2-14">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">invite_1_date =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(</span>
<span id="cb2-15">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq</span>(start_date, end_date, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"day"</span>),</span>
<span id="cb2-16">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> rows, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">replace =</span> TRUE</span>
<span id="cb2-17">  ),</span>
<span id="cb2-18"></span>
<span id="cb2-19">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># create a random outcome for this participant</span></span>
<span id="cb2-20">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">invite_outcome =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(</span>
<span id="cb2-21">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Accepted"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Declined"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"No response"</span>),</span>
<span id="cb2-22">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> rows, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">replace =</span> TRUE</span>
<span id="cb2-23">  )</span>
<span id="cb2-24">)</span>
<span id="cb2-25"></span>
<span id="cb2-26"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># take a sample of participants and allocate them a second invite date</span></span>
<span id="cb2-27">df_invites_2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> df_invites_1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb2-28">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># sample two thirds of participants to get a second invite</span></span>
<span id="cb2-29">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">slice_sample</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">prop =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb2-30">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># allocate a date between 10 and 30 days following the first</span></span>
<span id="cb2-31">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(</span>
<span id="cb2-32">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">invite_2_date =</span> invite_1_date <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">n</span>(), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">replace =</span> TRUE)</span>
<span id="cb2-33">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb2-34">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># keep just id and second date</span></span>
<span id="cb2-35">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(participant_id, invite_2_date)</span>
<span id="cb2-36"></span>
<span id="cb2-37"></span>
<span id="cb2-38"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># take a sample of those with a second invite and allocate them a third invite date</span></span>
<span id="cb2-39">df_invites_3 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> df_invites_2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb2-40">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># sample half of these to get a third invite</span></span>
<span id="cb2-41">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">slice_sample</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">prop =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb2-42">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># allocate a date between 10 and 30 days following the second</span></span>
<span id="cb2-43">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(</span>
<span id="cb2-44">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">invite_3_date =</span> invite_2_date <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sample</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">30</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">n</span>(), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">replace =</span> TRUE)</span>
<span id="cb2-45">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb2-46">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># keep just id and third date</span></span>
<span id="cb2-47">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(participant_id, invite_3_date)</span>
<span id="cb2-48"></span>
<span id="cb2-49"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># combine the 2nd and 3rd invites with the first table</span></span>
<span id="cb2-50">df_invites <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> df_invites_1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb2-51">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">left_join</span>(</span>
<span id="cb2-52">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> df_invites_2,</span>
<span id="cb2-53">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"participant_id"</span></span>
<span id="cb2-54">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb2-55">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">left_join</span>(</span>
<span id="cb2-56">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> df_invites_3,</span>
<span id="cb2-57">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"participant_id"</span></span>
<span id="cb2-58">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb2-59">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># move the outcome field after the third invite</span></span>
<span id="cb2-60">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">relocate</span>(invite_outcome, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.after =</span> invite_3_date)</span>
<span id="cb2-61"></span>
<span id="cb2-62"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># housekeeping</span></span>
<span id="cb2-63"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rm</span>(df_invites_1, df_invites_2, df_invites_3, start_date, end_date, rows)</span>
<span id="cb2-64"></span>
<span id="cb2-65"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># view our data</span></span>
<span id="cb2-66">df_invites <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb2-67">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">reactable</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">defaultPageSize =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span></code></pre></div></div>
</details>
<div class="cell-output-display">
<div class="reactable html-widget html-fill-item" id="htmlwidget-58b100ebed7fb947b986" style="width:auto;height:auto;"></div>
<script type="application/json" data-for="htmlwidget-58b100ebed7fb947b986">{"x":{"tag":{"name":"Reactable","attribs":{"data":{"participant_id":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100],"invite_1_date":["2019-10-11","2019-04-11","2020-09-14","2020-10-06","2020-02-04","2019-04-08","2019-04-13","2020-12-26","2020-08-24","2019-11-22","2019-03-20","2019-09-27","2020-01-17","2019-07-03","2020-07-27","2019-01-04","2020-10-22","2020-07-05","2019-07-31","2019-07-14","2020-05-25","2020-04-23","2020-08-27","2020-09-25","2020-07-31","2020-05-24","2020-11-17","2020-02-28","2020-01-14","2019-04-18","2019-05-11","2019-12-09","2019-02-10","2020-09-18","2019-10-25","2019-09-15","2020-09-20","2019-03-20","2019-07-01","2019-11-01","2019-12-24","2020-11-26","2019-11-03","2019-11-01","2019-08-09","2020-07-14","2019-11-09","2019-05-16","2019-05-25","2019-05-03","2019-08-22","2020-08-30","2020-05-09","2020-06-17","2019-10-24","2019-07-27","2019-05-11","2020-07-22","2020-06-05","2019-09-05","2019-12-31","2020-10-26","2020-10-04","2020-08-17","2020-03-09","2019-08-06","2020-12-27","2020-05-22","2019-10-03","2019-06-18","2019-03-12","2020-07-26","2019-11-22","2020-04-29","2020-10-28","2020-04-04","2019-03-01","2020-03-24","2020-07-01","2019-01-19","2020-10-10","2020-09-29","2020-10-31","2019-11-15","2019-04-26","2019-05-25","2019-04-12","2019-08-02","2020-01-25","2020-08-19","2020-08-19","2019-06-09","2019-03-18","2020-06-12","2019-05-06","2019-09-19","2020-03-17","2020-10-03","2019-06-30","2019-06-12"],"invite_2_date":[null,"2019-04-26","2020-10-03",null,"2020-03-05",null,null,null,"2020-09-13","2019-12-14","2019-04-13","2019-10-10","2020-02-04","2019-07-27","2020-08-09",null,null,null,null,"2019-08-10","2020-06-17","2020-05-21","2020-09-10","
2020-10-12","2020-08-23",null,"2020-12-16",null,"2020-02-12","2019-04-29","2019-05-26","2020-01-04","2019-03-10","2020-10-11",null,"2019-09-28","2020-09-30","2019-04-17",null,null,"2020-01-18",null,"2019-11-28","2019-11-14","2019-08-28","2020-08-10","2019-12-07","2019-05-26",null,"2019-05-29","2019-09-10","2020-09-23","2020-05-21",null,"2019-11-15",null,"2019-06-06","2020-08-07","2020-06-16","2019-09-29","2020-01-13",null,null,null,"2020-03-28","2019-09-05","2021-01-21",null,"2019-10-18","2019-07-10",null,null,"2019-12-18","2020-05-23","2020-11-24","2020-04-26","2019-03-11","2020-04-16","2020-07-22",null,"2020-11-09","2020-10-29","2020-11-28",null,null,null,"2019-05-04",null,"2020-02-12","2020-09-07","2020-09-11",null,null,"2020-07-06",null,null,null,"2020-10-20","2019-07-22","2019-07-01"],"invite_3_date":[null,"2019-05-24","2020-10-16",null,"2020-04-01",null,null,null,"2020-10-01","2020-01-13","2019-05-05",null,null,"2019-08-12",null,null,null,null,null,"2019-09-06","2020-07-04","2020-06-14","2020-09-22","2020-11-11","2020-09-18",null,null,null,null,null,null,null,null,null,null,null,null,"2019-05-02",null,null,null,null,"2019-12-09","2019-12-04","2019-09-22",null,null,null,null,null,"2019-09-23",null,null,null,"2019-12-09",null,"2019-06-30",null,"2020-07-16","2019-10-10",null,null,null,null,null,"2019-09-29",null,null,null,null,null,null,"2019-12-31",null,null,null,"2019-03-27","2020-05-12","2020-08-03",null,"2020-11-29",null,"2020-12-22",null,null,null,null,null,"2020-03-12",null,"2020-09-28",null,null,"2020-08-04",null,null,null,"2020-11-10",null,null],"invite_outcome":["Declined","Accepted","Declined","Declined","Declined","Accepted","No response","No response","Accepted","Declined","Declined","Accepted","Accepted","Accepted","Accepted","Declined","Declined","Declined","No response","Declined","Declined","No response","No response","No response","Accepted","Declined","Declined","No response","No response","No response","No response","Accepted","No 
response","No response","No response","Declined","Declined","No response","Declined","Accepted","Accepted","Accepted","No response","No response","No response","Accepted","Accepted","No response","Declined","No response","Accepted","Accepted","No response","Declined","Accepted","No response","No response","Declined","Declined","Declined","Accepted","Accepted","No response","Accepted","Accepted","Accepted","Accepted","No response","No response","Declined","Accepted","Declined","Declined","Declined","No response","Declined","No response","No response","No response","Declined","No response","No response","Accepted","No response","Accepted","No response","No response","Declined","No response","No response","Declined","Declined","No response","Accepted","Declined","Declined","Accepted","No response","Declined","Accepted"]},"columns":[{"id":"participant_id","name":"participant_id","type":"numeric"},{"id":"invite_1_date","name":"invite_1_date","type":"Date"},{"id":"invite_2_date","name":"invite_2_date","type":"Date"},{"id":"invite_3_date","name":"invite_3_date","type":"Date"},{"id":"invite_outcome","name":"invite_outcome","type":"character"}],"defaultPageSize":5,"dataKey":"812760e5c2269ef93c4668c581c72808"},"children":[]},"class":"reactR_markup"},"evals":[],"jsHooks":[]}</script>
<p>Generated invite table</p>
</div>
</div>
</section>
<section id="determine-milestone-outcomes" class="level2">
<h2 class="anchored" data-anchor-id="determine-milestone-outcomes">Determine milestone outcomes</h2>
<p>The next step is to take our source table and convert the data into a series of milestones (and associated outcomes) that represents how our invited participants moved through the pathway.</p>
<p>In our example we have five milestones to represent in our Sankey plot:</p>
<ul>
<li><p>Our eligible population (everyone in our invites table),</p></li>
<li><p>The result from the first invitation,</p></li>
<li><p>The result from the second invitation,</p></li>
<li><p>The result from the third invitation,</p></li>
<li><p>The overall invite outcome.</p></li>
</ul>
<p>Aside from the eligible population, where everyone starts with the same value, participants will have one of several outcomes at each milestone. This step is about naming these milestones and the outcomes.</p>
<p>It is important that each milestone-outcome has a unique value. An outcome of ‘No response’ can be recorded against the first, second and third invite, and we wish to see these outcomes separately represented on the Sankey (rather than just one ‘No response’), so each outcome must be made unique. In this example we prefix the outcome from each invite with the number of the invitation, e.g.&nbsp;‘Invitation 1 No response’.</p>
<p>The reason for this will become clearer when we come to plot the Sankey, but for now we produce these milestone-outcomes from our invites table.</p>
<div class="cell fig-cap-location-top">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" data-cap-location="top" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1">df_milestones <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> df_invites <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb3-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(</span>
<span id="cb3-3">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># everyone starts in the eligible population</span></span>
<span id="cb3-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">start_population =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Eligible population"</span>,</span>
<span id="cb3-5"></span>
<span id="cb3-6">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># work out what happened following the first invite</span></span>
<span id="cb3-7">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">invite_1_outcome =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">case_when</span>(</span>
<span id="cb3-8">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># if a second invite was sent we assume there was no outcome from the first</span></span>
<span id="cb3-9">      <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(invite_2_date) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Invitation 1 No response"</span>,</span>
<span id="cb3-10">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># otherwise the overall outcome resulted from the first invite</span></span>
<span id="cb3-11">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.default =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glue</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Invitation 1 {invite_outcome}"</span>)</span>
<span id="cb3-12">    ),</span>
<span id="cb3-13"></span>
<span id="cb3-14">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># work out what happened following the second invite</span></span>
<span id="cb3-15">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">invite_2_outcome =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">case_when</span>(</span>
<span id="cb3-16">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># if a third invite was sent we assume there was no outcome from the second</span></span>
<span id="cb3-17">      <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(invite_3_date) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Invitation 2 No response"</span>,</span>
<span id="cb3-18">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># if a second invite was sent but no third then the overall outcome resulted from the second invite</span></span>
<span id="cb3-19">      <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(invite_2_date) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glue</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Invitation 2 {invite_outcome}"</span>),</span>
<span id="cb3-20">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># default to NA if neither of the above are true</span></span>
<span id="cb3-21">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.default =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span></span>
<span id="cb3-22">    ),</span>
<span id="cb3-23"></span>
<span id="cb3-24">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># work out what happened following the third invite</span></span>
<span id="cb3-25">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">invite_3_outcome =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">case_when</span>(</span>
<span id="cb3-26">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># if a third invite was sent then the outcome is the overall outcome</span></span>
<span id="cb3-27">      <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(invite_3_date) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glue</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Invitation 3 {invite_outcome}"</span>),</span>
<span id="cb3-28">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># otherwise mark as NA</span></span>
<span id="cb3-29">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.default =</span> <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span></span>
<span id="cb3-30">    )</span>
<span id="cb3-31">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb3-32">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># exclude the dates as they are no longer needed</span></span>
<span id="cb3-33">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">contains</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"_date"</span>)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb3-34">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># move the overall invite outcome to the end</span></span>
<span id="cb3-35">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">relocate</span>(invite_outcome, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.after =</span> invite_3_outcome)</span>
<span id="cb3-36"></span>
<span id="cb3-37"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># view our data</span></span>
<span id="cb3-38">df_milestones <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb3-39">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">reactable</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">defaultPageSize =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span></code></pre></div></div>
</details>
<div class="cell-output-display">
<div class="reactable html-widget html-fill-item" id="htmlwidget-286350ed2b9015d71ef1" style="width:auto;height:auto;"></div>
<script type="application/json" data-for="htmlwidget-286350ed2b9015d71ef1">{"x":{"tag":{"name":"Reactable","attribs":{"data":{"participant_id":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100],"start_population":["Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible 
population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population","Eligible population"],"invite_1_outcome":["Invitation 1 Declined","Invitation 1 No response","Invitation 1 No response","Invitation 1 Declined","Invitation 1 No response","Invitation 1 Accepted","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 Declined","Invitation 1 Declined","Invitation 1 Declined","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 Declined","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 Declined","Invitation 1 Accepted","Invitation 1 No response","Invitation 1 Accepted","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 Declined","Invitation 1 No 
response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 Declined","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 Accepted","Invitation 1 No response","Invitation 1 Accepted","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 Accepted","Invitation 1 Declined","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 Declined","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 Accepted","Invitation 1 No response","Invitation 1 No response","Invitation 1 Declined","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 Declined","Invitation 1 No response","Invitation 1 No response","Invitation 1 Declined","Invitation 1 Declined","Invitation 1 Accepted","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response"],"invite_2_outcome":[null,"Invitation 2 No response","Invitation 2 No response",null,"Invitation 2 No response",null,null,null,"Invitation 2 No response","Invitation 2 No response","Invitation 2 No response","Invitation 2 Accepted","Invitation 2 Accepted","Invitation 2 No response","Invitation 2 Accepted",null,null,null,null,"Invitation 2 No response","Invitation 2 No response","Invitation 2 No response","Invitation 2 No response","Invitation 2 No response","Invitation 2 No response",null,"Invitation 2 Declined",null,"Invitation 2 No response","Invitation 2 No response","Invitation 2 No response","Invitation 2 Accepted","Invitation 2 No response","Invitation 2 No 
response",null,"Invitation 2 Declined","Invitation 2 Declined","Invitation 2 No response",null,null,"Invitation 2 Accepted",null,"Invitation 2 No response","Invitation 2 No response","Invitation 2 No response","Invitation 2 Accepted","Invitation 2 Accepted","Invitation 2 No response",null,"Invitation 2 No response","Invitation 2 No response","Invitation 2 Accepted","Invitation 2 No response",null,"Invitation 2 No response",null,"Invitation 2 No response","Invitation 2 Declined","Invitation 2 No response","Invitation 2 No response","Invitation 2 Accepted",null,null,null,"Invitation 2 Accepted","Invitation 2 No response","Invitation 2 Accepted",null,"Invitation 2 No response","Invitation 2 Declined",null,null,"Invitation 2 No response","Invitation 2 Declined","Invitation 2 No response","Invitation 2 Declined","Invitation 2 No response","Invitation 2 No response","Invitation 2 No response",null,"Invitation 2 No response","Invitation 2 No response","Invitation 2 No response",null,null,null,"Invitation 2 No response",null,"Invitation 2 No response","Invitation 2 No response","Invitation 2 No response",null,null,"Invitation 2 No response",null,null,null,"Invitation 2 No response","Invitation 2 Declined","Invitation 2 Accepted"],"invite_3_outcome":[null,"Invitation 3 Accepted","Invitation 3 Declined",null,"Invitation 3 Declined",null,null,null,"Invitation 3 Accepted","Invitation 3 Declined","Invitation 3 Declined",null,null,"Invitation 3 Accepted",null,null,null,null,null,"Invitation 3 Declined","Invitation 3 Declined","Invitation 3 No response","Invitation 3 No response","Invitation 3 No response","Invitation 3 Accepted",null,null,null,null,null,null,null,null,null,null,null,null,"Invitation 3 No response",null,null,null,null,"Invitation 3 No response","Invitation 3 No response","Invitation 3 No response",null,null,null,null,null,"Invitation 3 Accepted",null,null,null,"Invitation 3 Accepted",null,"Invitation 3 No response",null,"Invitation 3 Declined","Invitation 3 
Declined",null,null,null,null,null,"Invitation 3 Accepted",null,null,null,null,null,null,"Invitation 3 Declined",null,null,null,"Invitation 3 No response","Invitation 3 No response","Invitation 3 No response",null,"Invitation 3 No response",null,"Invitation 3 Accepted",null,null,null,null,null,"Invitation 3 No response",null,"Invitation 3 Declined",null,null,"Invitation 3 Accepted",null,null,null,"Invitation 3 No response",null,null],"invite_outcome":["Declined","Accepted","Declined","Declined","Declined","Accepted","No response","No response","Accepted","Declined","Declined","Accepted","Accepted","Accepted","Accepted","Declined","Declined","Declined","No response","Declined","Declined","No response","No response","No response","Accepted","Declined","Declined","No response","No response","No response","No response","Accepted","No response","No response","No response","Declined","Declined","No response","Declined","Accepted","Accepted","Accepted","No response","No response","No response","Accepted","Accepted","No response","Declined","No response","Accepted","Accepted","No response","Declined","Accepted","No response","No response","Declined","Declined","Declined","Accepted","Accepted","No response","Accepted","Accepted","Accepted","Accepted","No response","No response","Declined","Accepted","Declined","Declined","Declined","No response","Declined","No response","No response","No response","Declined","No response","No response","Accepted","No response","Accepted","No response","No response","Declined","No response","No response","Declined","Declined","No response","Accepted","Declined","Declined","Accepted","No 
response","Declined","Accepted"]},"columns":[{"id":"participant_id","name":"participant_id","type":"numeric"},{"id":"start_population","name":"start_population","type":"character"},{"id":"invite_1_outcome","name":"invite_1_outcome","type":["glue","character"]},{"id":"invite_2_outcome","name":"invite_2_outcome","type":["glue","character"]},{"id":"invite_3_outcome","name":"invite_3_outcome","type":["glue","character"]},{"id":"invite_outcome","name":"invite_outcome","type":"character"}],"defaultPageSize":5,"dataKey":"a450da1debd7915b64a9b7311b7f1d33"},"children":[]},"class":"reactR_markup"},"evals":[],"jsHooks":[]}</script>
<p>Milestone-outcomes for participants</p>
</div>
</div>
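<p>The uniqueness requirement described above can be checked before plotting. The following is a minimal sketch, not part of the original workflow: it assumes a small hypothetical table (<code>df_example</code>) with the same column names as <code>df_milestones</code> but plain character columns, and the names <code>df_labels</code> and <code>df_clashes</code> are illustrative only.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
library(tidyr)

# hypothetical stand-in for df_milestones (assumed data, for illustration only)
df_example &lt;- tibble(
  participant_id = 1:3,
  invite_1_outcome = c(
    "Invitation 1 Declined",
    "Invitation 1 No response",
    "Invitation 1 No response"
  ),
  invite_2_outcome = c(NA, "Invitation 2 Accepted", "Invitation 2 No response"),
  invite_3_outcome = c(NA, NA, "Invitation 3 Declined"),
  invite_outcome = c("Declined", "Accepted", "Declined")
)

# list each label alongside the milestone column it appears in
df_labels &lt;- df_example |&gt;
  pivot_longer(
    cols = ends_with("_outcome"),
    names_to = "milestone",
    values_to = "label",
    values_drop_na = TRUE
  ) |&gt;
  distinct(milestone, label)

# a label used by more than one milestone would merge into a single Sankey node
df_clashes &lt;- df_labels |&gt;
  count(label, name = "n_milestones") |&gt;
  filter(n_milestones &gt; 1)

# number of clashing labels
nrow(df_clashes)</code></pre></div>
<p>If <code>df_clashes</code> has no rows then every label belongs to exactly one milestone, which is the property the prefixing is intended to give us.</p>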
</section>
<section id="calculate-flows" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="calculate-flows">Calculate flows</h2>
<p>Next we take pairs of milestone-outcomes and calculate the number of participants that moved between them.</p>
<p>Here we use <code>dplyr::summarise</code> with its <code>.by</code> argument to group our data by the start and end milestone-outcomes before counting the number of unique participants who move between them.</p>
<p>For invites 2 and 3 we perform two sets of summaries:</p>
<ol type="1">
<li><p>The first where both the <code>from</code> and <code>to</code> fields are populated.</p></li>
<li><p>The second to capture cases where the <code>to</code> destination is missing (<code>NA</code>). This happens when no subsequent invite was sent, for example because the participant responded at the previous invite. In these cases we flow the participant to the overall invite outcome.<sup>2</sup></p></li>
</ol>
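<p>The two kinds of summary described above could be wrapped in a small helper. This is only a sketch of one possible approach, not the function used in practice: the name <code>count_flows</code>, its arguments and the example table <code>df_example</code> are all assumptions made for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)

# hypothetical helper: count participants moving between two milestone columns
# - from_col / to_col: names of the origin and destination columns
# - missing_col: optional column that must be NA (used for the fallback flow)
count_flows &lt;- function(df, from_col, to_col, missing_col = NULL) {
  # the origin must always be known
  df &lt;- df |&gt; filter(!is.na(.data[[from_col]]))

  if (is.null(missing_col)) {
    # direct flow: the destination must also be known
    df &lt;- df |&gt; filter(!is.na(.data[[to_col]]))
  } else {
    # fallback flow: keep participants with no value at the skipped milestone
    df &lt;- df |&gt; filter(is.na(.data[[missing_col]]))
  }

  df |&gt;
    rename(from = all_of(from_col), to = all_of(to_col)) |&gt;
    summarise(
      flow = n_distinct(participant_id),
      .by = c(from, to)
    )
}

# hypothetical example data with plain character columns
df_example &lt;- tibble(
  participant_id = 1:4,
  invite_1_outcome = c(
    "Invitation 1 Declined",
    "Invitation 1 No response",
    "Invitation 1 No response",
    "Invitation 1 No response"
  ),
  invite_2_outcome = c(
    NA,
    "Invitation 2 Accepted",
    "Invitation 2 No response",
    "Invitation 2 Accepted"
  ),
  invite_outcome = c("Declined", "Accepted", "No response", "Accepted")
)

df_flows_example &lt;- bind_rows(
  # invite 1 to invite 2 (where not NA)
  count_flows(df_example, "invite_1_outcome", "invite_2_outcome"),
  # invite 1 to overall outcome (where invite 2 is NA)
  count_flows(
    df_example, "invite_1_outcome", "invite_outcome",
    missing_col = "invite_2_outcome"
  )
)</code></pre></div>
<p>Each pair of milestones then needs one call for the direct flow and one for the fallback to the overall outcome, which removes most of the repetition at the cost of a little indirection.</p>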
<div class="no-row-height column-margin column-container"><div id="fn2"><p><sup>2</sup>&nbsp;If you are thinking there is a lot of repetition here, you’re right. In practice I abstracted both steps into a function, passing in the <code>from</code> and <code>to</code> variables as parameters, which simplified my workflow a little. I’m showing it in plain form here to keep things simple.</p></div></div><div class="cell fig-cap-location-top">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" data-cap-location="top" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1">df_flows <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">bind_rows</span>(</span>
<span id="cb4-2"></span>
<span id="cb4-3">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># flow from population to invite 1</span></span>
<span id="cb4-4">  df_milestones <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-5">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(start_population) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(invite_1_outcome)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-6">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rename</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">from =</span> start_population, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">to =</span> invite_1_outcome) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-7">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(</span>
<span id="cb4-8">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">flow =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">n_distinct</span>(participant_id, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> TRUE),</span>
<span id="cb4-9">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.by =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(from, to)</span>
<span id="cb4-10">    ),</span>
<span id="cb4-11"></span>
<span id="cb4-12">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># flow from invite 1 to invite 2 (where not NA)</span></span>
<span id="cb4-13">  df_milestones <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-14">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(invite_1_outcome) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(invite_2_outcome)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-15">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rename</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">from =</span> invite_1_outcome, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">to =</span> invite_2_outcome) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-16">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(</span>
<span id="cb4-17">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">flow =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">n_distinct</span>(participant_id, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> TRUE),</span>
<span id="cb4-18">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.by =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(from, to)</span>
<span id="cb4-19">    ),</span>
<span id="cb4-20"></span>
<span id="cb4-21">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># flow from invite 1 to overall invite outcome (where invite 2 is NA)</span></span>
<span id="cb4-22">  df_milestones <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-23">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(invite_1_outcome) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(invite_2_outcome)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-24">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rename</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">from =</span> invite_1_outcome, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">to =</span> invite_outcome) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-25">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(</span>
<span id="cb4-26">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">flow =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">n_distinct</span>(participant_id, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> TRUE),</span>
<span id="cb4-27">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.by =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(from, to)</span>
<span id="cb4-28">    ),</span>
<span id="cb4-29"></span>
<span id="cb4-30">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># flow from invite 2 to invite 3 (where not NA)</span></span>
<span id="cb4-31">  df_milestones <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-32">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(invite_2_outcome) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(invite_3_outcome)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-33">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rename</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">from =</span> invite_2_outcome, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">to =</span> invite_3_outcome) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-34">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(</span>
<span id="cb4-35">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">flow =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">n_distinct</span>(participant_id, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> TRUE),</span>
<span id="cb4-36">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.by =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(from, to)</span>
<span id="cb4-37">    ),</span>
<span id="cb4-38"></span>
<span id="cb4-39">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># flow from invite 2 to overall invite outcome (where invite 3 is NA)</span></span>
<span id="cb4-40">  df_milestones <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-41">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(invite_2_outcome) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(invite_3_outcome)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-42">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rename</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">from =</span> invite_2_outcome, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">to =</span> invite_outcome) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-43">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(</span>
<span id="cb4-44">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">flow =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">n_distinct</span>(participant_id, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> TRUE),</span>
<span id="cb4-45">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.by =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(from, to)</span>
<span id="cb4-46">    ),</span>
<span id="cb4-47"></span>
<span id="cb4-48">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># final flow - invite 3 to overall outcome (where both are not NA)</span></span>
<span id="cb4-49">  df_milestones <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-50">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(invite_3_outcome) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(invite_outcome)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-51">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rename</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">from =</span> invite_3_outcome, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">to =</span> invite_outcome) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-52">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(</span>
<span id="cb4-53">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">flow =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">n_distinct</span>(participant_id, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> TRUE),</span>
<span id="cb4-54">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.by =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(from, to)</span>
<span id="cb4-55">    )</span>
<span id="cb4-56">)</span>
<span id="cb4-57"></span>
<span id="cb4-58"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># view our data</span></span>
<span id="cb4-59">df_flows <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-60">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">reactable</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">defaultPageSize =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span></code></pre></div></div>
</details>
<div class="cell-output-display">
<div class="reactable html-widget html-fill-item" id="htmlwidget-71d01034894540eb0211" style="width:auto;height:auto;"></div>
<script type="application/json" data-for="htmlwidget-71d01034894540eb0211">{"x":{"tag":{"name":"Reactable","attribs":{"data":{"from":["Eligible population","Eligible population","Eligible population","Invitation 1 No response","Invitation 1 No response","Invitation 1 No response","Invitation 1 Declined","Invitation 1 Accepted","Invitation 1 No response","Invitation 2 No response","Invitation 2 No response","Invitation 2 No response","Invitation 2 Accepted","Invitation 2 Declined","Invitation 2 No response","Invitation 3 Accepted","Invitation 3 Declined","Invitation 3 No response"],"to":["Invitation 1 Declined","Invitation 1 No response","Invitation 1 Accepted","Invitation 2 No response","Invitation 2 Accepted","Invitation 2 Declined","Declined","Accepted","No response","Invitation 3 Accepted","Invitation 3 Declined","Invitation 3 No response","Accepted","Declined","No response","Accepted","Declined","No response"],"flow":[15,77,8,46,12,8,15,8,11,9,10,14,12,8,13,9,10,14]},"columns":[{"id":"from","name":"from","type":["glue","character"]},{"id":"to","name":"to","type":["glue","character"]},{"id":"flow","name":"flow","type":"numeric"}],"defaultPageSize":5,"dataKey":"742d09c4727207338778e12e06444f4b"},"children":[]},"class":"reactR_markup"},"evals":[],"jsHooks":[]}</script>
<p>Flows of participants between milestones</p>
</div>
</div>
</section>
</section>
<section id="sankey-plot" class="level1">
<h1>Sankey plot</h1>
<p>We now have a neat little summary of the movements of participants between the milestones in our recruitment pathway. However, this ‘tidy’ data isn’t in the format required by <a href="https://plotly.com/r/sankey-diagram/">plotly</a>, so the next step is to prepare it for plotting.</p>
<section id="preparing-for-plotly" class="level2">
<h2 class="anchored" data-anchor-id="preparing-for-plotly">Preparing for plotly</h2>
<p>Plotly expects to be fed two sets of data:</p>
<ol type="1">
<li><p>Nodes - these are the milestones we have in our <code>from</code> and <code>to</code> fields,</p></li>
<li><p>Edges - these are the flows that occur between nodes, the <code>flow</code> in our table.</p></li>
</ol>
<p>It is possible to extract this data by hand, but I found the <a href="https://tidygraph.data-imaginist.com">tidygraph</a> package much easier and more convenient.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">df_sankey <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> df_flows <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb5-2">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># convert our flows data to a tidy graph object</span></span>
<span id="cb5-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as_tbl_graph</span>()</span></code></pre></div></div>
</div>
<p>The tidygraph package splits our data into nodes and edges. We can selectively work on each by ‘activating’ them - here is the nodes list:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1">df_sankey <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb6-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">activate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">what =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"nodes"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb6-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as_tibble</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb6-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">reactable</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">defaultPageSize =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span></code></pre></div></div>
<div class="cell-output-display">
<div class="reactable html-widget html-fill-item" id="htmlwidget-c88b684aa59b7053a197" style="width:auto;height:auto;"></div>
<script type="application/json" data-for="htmlwidget-c88b684aa59b7053a197">{"x":{"tag":{"name":"Reactable","attribs":{"data":{"name":["Eligible population","Invitation 1 No response","Invitation 1 Declined","Invitation 1 Accepted","Invitation 2 No response","Invitation 2 Accepted","Invitation 2 Declined","Invitation 3 Accepted","Invitation 3 Declined","Invitation 3 No response","Declined","Accepted","No response"]},"columns":[{"id":"name","name":"name","type":"character"}],"defaultPageSize":5,"dataKey":"c8916df8221f76b5b09469d3a29a7b47"},"children":[]},"class":"reactR_markup"},"evals":[],"jsHooks":[]}</script>
</div>
</div>
<p>You can see each unique node name listed. The row numbers for these nodes are used as reference IDs in the edges object:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1">df_sankey <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb7-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">activate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">what =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"edges"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb7-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as_tibble</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb7-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">reactable</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">defaultPageSize =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span></code></pre></div></div>
<div class="cell-output-display">
<div class="reactable html-widget html-fill-item" id="htmlwidget-61d01eb1917d40cdfca7" style="width:auto;height:auto;"></div>
<script type="application/json" data-for="htmlwidget-61d01eb1917d40cdfca7">{"x":{"tag":{"name":"Reactable","attribs":{"data":{"from":[1,1,1,2,2,2,3,4,2,5,5,5,6,7,5,8,9,10],"to":[3,2,4,5,6,7,11,12,13,8,9,10,12,11,13,12,11,13],"flow":[15,77,8,46,12,8,15,8,11,9,10,14,12,8,13,9,10,14]},"columns":[{"id":"from","name":"from","type":"numeric"},{"id":"to","name":"to","type":"numeric"},{"id":"flow","name":"flow","type":"numeric"}],"defaultPageSize":5,"dataKey":"a889e9b644c7e0f45f526b0ebde36ebd"},"children":[]},"class":"reactR_markup"},"evals":[],"jsHooks":[]}</script>
</div>
</div>
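<p>For comparison, here is a minimal sketch of how the same nodes and edges might be derived by hand using dplyr. It assumes a small made-up table in the same shape as <code>df_flows</code>; the names <code>df_toy</code>, <code>nodes_manual</code> and <code>edges_manual</code> are hypothetical and the flow values are illustrative only. The node order produced this way is not guaranteed to match the order tidygraph gives.</p>
<div class="cell">
<div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(dplyr)

# a made-up flows table in the same shape as df_flows (illustrative values)
df_toy &lt;- tibble(
  from = c("Eligible population", "Eligible population", "Invitation 1 No response"),
  to = c("Invitation 1 Accepted", "Invitation 1 No response", "Invitation 2 Accepted"),
  flow = c(8, 77, 12)
)

# nodes: every unique milestone name, in order of first appearance
nodes_manual &lt;- tibble(name = unique(c(df_toy$from, df_toy$to)))

# edges: swap each milestone name for the row number of its node
edges_manual &lt;- df_toy |&gt;
  mutate(
    from = match(from, nodes_manual$name),
    to = match(to, nodes_manual$name)
  )</code></pre></div>
</div>
<p>This is manageable for a single table, but tidygraph keeps the nodes and edges together in one object, which is why I preferred it here.</p>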
<p>We now have enough information to generate our Sankey.</p>
<p>First we extract our nodes and edges to separate data frames, then convert the ID values to be zero-based (starting at 0), as this is what plotly expects. This is as simple as subtracting 1 from each ID.</p>
<p>Finally we pass these two data frames to the <code>node</code> and <code>link</code> arguments of plotly’s <code>plot_ly()</code> function to generate the plot.</p>
<div class="cell fig-cap-location-top">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" data-cap-location="top" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># extract the nodes to a dataframe</span></span>
<span id="cb8-2">nodes <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> df_sankey <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb8-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">activate</span>(nodes) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb8-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb8-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(</span>
<span id="cb8-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">id =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">row_number</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb8-7">  )</span>
<span id="cb8-8"></span>
<span id="cb8-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># extract the edges to a dataframe</span></span>
<span id="cb8-10">edges <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> df_sankey <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb8-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">activate</span>(edges) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb8-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb8-13">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(</span>
<span id="cb8-14">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">from =</span> from <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,</span>
<span id="cb8-15">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">to =</span> to <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb8-16">  )</span>
<span id="cb8-17"></span>
<span id="cb8-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># plot our sankey</span></span>
<span id="cb8-19"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot_ly</span>(</span>
<span id="cb8-20">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># setup</span></span>
<span id="cb8-21">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">type =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sankey"</span>,</span>
<span id="cb8-22">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">orientation =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"h"</span>,</span>
<span id="cb8-23">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">arrangement =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"snap"</span>,</span>
<span id="cb8-24"></span>
<span id="cb8-25">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># use our node data</span></span>
<span id="cb8-26">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">node =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(</span>
<span id="cb8-27">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">label =</span> nodes<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>name</span>
<span id="cb8-28">  ),</span>
<span id="cb8-29"></span>
<span id="cb8-30">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># use our link data</span></span>
<span id="cb8-31">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">link =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(</span>
<span id="cb8-32">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">source =</span> edges<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>from,</span>
<span id="cb8-33">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">target =</span> edges<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>to,</span>
<span id="cb8-34">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">value =</span> edges<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>flow</span>
<span id="cb8-35">  )</span>
<span id="cb8-36">)</span></code></pre></div></div>
</details>
<div class="cell-output-display">
<div class="plotly html-widget html-fill-item" id="htmlwidget-81df00dabeaef98ceac6" style="width:672px;height:480px;"></div>
<script type="application/json" data-for="htmlwidget-81df00dabeaef98ceac6">{"x":{"visdat":{"617636d5c3ff":["function () ","plotlyVisDat"]},"cur_data":"617636d5c3ff","attrs":{"617636d5c3ff":{"orientation":"h","arrangement":"snap","node":{"label":["Eligible population","Invitation 1 No response","Invitation 1 Declined","Invitation 1 Accepted","Invitation 2 No response","Invitation 2 Accepted","Invitation 2 Declined","Invitation 3 Accepted","Invitation 3 Declined","Invitation 3 No response","Declined","Accepted","No response"]},"link":{"source":[0,0,0,1,1,1,2,3,1,4,4,4,5,6,4,7,8,9],"target":[2,1,3,4,5,6,10,11,12,7,8,9,11,10,12,11,10,12],"value":[15,77,8,46,12,8,15,8,11,9,10,14,12,8,13,9,10,14]},"alpha_stroke":1,"sizes":[10,100],"spans":[1,20],"type":"sankey"}},"layout":{"margin":{"b":40,"l":60,"t":25,"r":10},"hovermode":"closest","showlegend":false},"source":"A","config":{"modeBarButtonsToAdd":["hoverclosest","hovercompare"],"showSendToCloud":false},"data":[{"orientation":"h","arrangement":"snap","node":{"label":["Eligible population","Invitation 1 No response","Invitation 1 Declined","Invitation 1 Accepted","Invitation 2 No response","Invitation 2 Accepted","Invitation 2 Declined","Invitation 3 Accepted","Invitation 3 Declined","Invitation 3 No response","Declined","Accepted","No response"]},"link":{"source":[0,0,0,1,1,1,2,3,1,4,4,4,5,6,4,7,8,9],"target":[2,1,3,4,5,6,10,11,12,7,8,9,11,10,12,11,10,12],"value":[15,77,8,46,12,8,15,8,11,9,10,14,12,8,13,9,10,14]},"type":"sankey","frame":null}],"highlight":{"on":"plotly_click","persistent":false,"dynamic":false,"selectize":false,"opacityDim":0.20000000000000001,"selected":{"opacity":1},"debounce":0},"shinyEvents":["plotly_hover","plotly_click","plotly_selected","plotly_relayout","plotly_brushed","plotly_brushing","plotly_clickannotation","plotly_doubleclick","plotly_deselect","plotly_afterplot","plotly_sunburstclick"],"base_url":"https://plot.ly"},"evals":[],"jsHooks":[]}</script>
<p>Our first sankey</p>
</div>
</div>
<p>Not bad!</p>
<p>We can see the structure of our Sankey now. Can you see the relative proportions of participants who did or didn’t respond to our first invite? Marvel at how those who responded to the first invite flow into our final outcome. And can you see how those who didn’t respond to the first invitation go on to receive a second invite?</p>
<p>Plotly’s charts are interactive. Try hovering your cursor over the nodes and edges to highlight them and a pop-up box will appear giving you additional details. You can reorder the vertical position of the nodes by dragging them above or below an adjacent node.</p>
<p>This looks functional.</p>
</section>
<section id="styling-our-sankey" class="level2">
<h2 class="anchored" data-anchor-id="styling-our-sankey">Styling our Sankey</h2>
<p>Now that we have the foundations of our Sankey, I’d like to move on to its presentation. Specifically, I’d like to:</p>
<ul>
<li><p>use colour coding to clearly group those who accept or decline the invite,</p></li>
<li><p>improve the readability of the node titles,</p></li>
<li><p>add additional information to the pop-up boxes when you hover over nodes and edges, and</p></li>
<li><p>control the positioning of the nodes in the plot.</p></li>
</ul>
<p>As our <code>nodes</code> and <code>edges</code> objects are dataframes it is straightforward to add this styling information directly to them.</p>
<p>For the <code>nodes</code> object we define colours based on the name of each node and manually position the nodes in the plot:</p>
<div class="cell fig-cap-location-top">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" data-cap-location="top" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># get the eligible population as a single value</span></span>
<span id="cb9-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># NB, will be used to work out % amounts in each node and edge</span></span>
<span id="cb9-3">temp_eligible_pop <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> df_flows <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb9-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(from <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Eligible population"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb9-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">total =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(flow, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> TRUE)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb9-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pull</span>(total)</span>
<span id="cb9-7"></span>
<span id="cb9-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># style our nodes object</span></span>
<span id="cb9-9">nodes <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> nodes <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb9-10">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(</span>
<span id="cb9-11">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># colour ----</span></span>
<span id="cb9-12">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># add colour definitions, green for accepted, red for declined</span></span>
<span id="cb9-13">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">case_when</span>(</span>
<span id="cb9-14">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(name, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Accepted"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#44bd32"</span>,</span>
<span id="cb9-15">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(name, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Declined"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#c23616"</span>,</span>
<span id="cb9-16">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(name, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"No response"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#7f8fa6"</span>,</span>
<span id="cb9-17">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(name, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Eligible population"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"#7f8fa6"</span></span>
<span id="cb9-18">    ),</span>
<span id="cb9-19"></span>
<span id="cb9-20">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># add a semi-transparent colour for the edges based on node colours</span></span>
<span id="cb9-21">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour_fade =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">col2hcl</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">colour =</span> colour, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">alpha =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>),</span>
<span id="cb9-22"></span>
<span id="cb9-23">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># positioning ----</span></span>
<span id="cb9-24">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># NB, I found that to position nodes you need to supply both</span></span>
<span id="cb9-25">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># horizontal and vertical positions</span></span>
<span id="cb9-26">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># NNB, it was a bit of trial and error to get these positions just</span></span>
<span id="cb9-27">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># right</span></span>
<span id="cb9-28"></span>
<span id="cb9-29">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># horizontal positions (0 = left, 1 = right)</span></span>
<span id="cb9-30">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">case_when</span>(</span>
<span id="cb9-31">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(name, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Eligible population"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,</span>
<span id="cb9-32">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(name, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Invitation 1"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,</span>
<span id="cb9-33">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(name, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Invitation 2"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>,</span>
<span id="cb9-34">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(name, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Invitation 3"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,</span>
<span id="cb9-35">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.default =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span></span>
<span id="cb9-36">    ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rescale</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">to =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.001</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9</span>)),</span>
<span id="cb9-37"></span>
<span id="cb9-38">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># vertical position (1 = bottom, 0 = top)</span></span>
<span id="cb9-39">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">case_when</span>(</span>
<span id="cb9-40">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(name, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Eligible population"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,</span>
<span id="cb9-41">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># invite 1</span></span>
<span id="cb9-42">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(name, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Invitation 1 Accepted"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,</span>
<span id="cb9-43">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(name, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Invitation 1 No response"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,</span>
<span id="cb9-44">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(name, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Invitation 1 Declined"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">8.5</span>,</span>
<span id="cb9-45">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># invite 2</span></span>
<span id="cb9-46">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(name, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Invitation 2 Accepted"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>,</span>
<span id="cb9-47">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(name, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Invitation 2 No response"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,</span>
<span id="cb9-48">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(name, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Invitation 2 Declined"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">7.8</span>,</span>
<span id="cb9-49">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># invite 3</span></span>
<span id="cb9-50">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(name, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Invitation 3 Accepted"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.7</span>,</span>
<span id="cb9-51">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(name, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Invitation 3 No response"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">5.8</span>,</span>
<span id="cb9-52">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(name, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Invitation 3 Declined"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">7.2</span>,</span>
<span id="cb9-53">      <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># final outcomes</span></span>
<span id="cb9-54">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(name, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Accepted"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>,</span>
<span id="cb9-55">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(name, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"No response"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>,</span>
<span id="cb9-56">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_detect</span>(name, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Declined"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span>,</span>
<span id="cb9-57">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.default =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span></span>
<span id="cb9-58">    ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">rescale</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">to =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.001</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.999</span>))</span>
<span id="cb9-59">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb9-60">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># add in a custom field to show the percentage flow</span></span>
<span id="cb9-61">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">left_join</span>(</span>
<span id="cb9-62">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> df_flows <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb9-63">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(to) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb9-64">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(</span>
<span id="cb9-65">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">flow =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">sum</span>(flow, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">na.rm =</span> TRUE),</span>
<span id="cb9-66">        <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">flow_perc =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">percent</span>(flow <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> temp_eligible_pop, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">accuracy =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>),</span>
<span id="cb9-67">      ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb9-68">      <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">name =</span> to, flow_perc),</span>
<span id="cb9-69">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span></span>
<span id="cb9-70">  )</span>
<span id="cb9-71"></span>
<span id="cb9-72"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># view our nodes data</span></span>
<span id="cb9-73">nodes <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb9-74">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">reactable</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">defaultPageSize =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span></code></pre></div></div>
</details>
<div class="cell-output-display">
<div class="reactable html-widget html-fill-item" id="htmlwidget-f7768febd5ad0e1a4493" style="width:auto;height:auto;"></div>
<script type="application/json" data-for="htmlwidget-f7768febd5ad0e1a4493">{"x":{"tag":{"name":"Reactable","attribs":{"data":{"name":["Eligible population","Invitation 1 No response","Invitation 1 Declined","Invitation 1 Accepted","Invitation 2 No response","Invitation 2 Accepted","Invitation 2 Declined","Invitation 3 Accepted","Invitation 3 Declined","Invitation 3 No response","Declined","Accepted","No response"],"id":[0,1,2,3,4,5,6,7,8,9,10,11,12],"colour":["#7f8fa6","#7f8fa6","#c23616","#44bd32","#7f8fa6","#44bd32","#c23616","#44bd32","#c23616","#7f8fa6","#c23616","#44bd32","#7f8fa6"],"colour_fade":["#7F8FA64C","#7F8FA64C","#C236164C","#44BD324C","#7F8FA64C","#44BD324C","#C236164C","#44BD324C","#C236164C","#7F8FA64C","#C236164C","#44BD324C","#7F8FA64C"],"x":[0.001,0.22575,0.22575,0.22575,0.4505,0.4505,0.4505,0.67525,0.67525,0.67525,0.9,0.9,0.9],"y":[0.533266666666667,0.533266666666667,0.999,0.001,0.533266666666667,0.134066666666667,0.905853333333333,0.227213333333333,0.826013333333333,0.63972,0.932466666666667,0.001,0.533266666666667],"flow_perc":[null,"77.0%","15.0%","8.0%","46.0%","12.0%","8.0%","9.0%","10.0%","14.0%","33.0%","29.0%","38.0%"]},"columns":[{"id":"name","name":"name","type":["glue","character"]},{"id":"id","name":"id","type":"numeric"},{"id":"colour","name":"colour","type":"character"},{"id":"colour_fade","name":"colour_fade","type":"character"},{"id":"x","name":"x","type":"numeric"},{"id":"y","name":"y","type":"numeric"},{"id":"flow_perc","name":"flow_perc","type":"character"}],"defaultPageSize":5,"dataKey":"846bee5df8dbde8324fb073f5cfd1799"},"children":[]},"class":"reactR_markup"},"evals":[],"jsHooks":[]}</script>
<p>Styling the nodes dataframe</p>
</div>
</div>
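<p>The positions in the table above can be checked by hand. The following is a minimal sketch on a hand-picked subset of the node names, assuming that <code>case_when()</code>, <code>str_detect()</code> and <code>rescale()</code> come from <code>dplyr</code>, <code>stringr</code> and <code>scales</code> (the package-loading code is not shown in this part of the post). It illustrates two things: <code>case_when()</code> uses the first condition that matches, so the specific "Invitation 2 Accepted" rule has to sit above the generic "Accepted" rule, and <code>rescale()</code> then maps the smallest raw value to 0.001 and the largest to 0.999, with everything else placed linearly in between.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
library(stringr)
library(scales)

# a subset of the node names, chosen to include the smallest (1) and
# largest (8.5) raw positions used in the post
name &lt;- c(
  "Eligible population",
  "Invitation 1 Accepted",
  "Invitation 1 Declined",
  "Invitation 2 Accepted",
  "Accepted",
  "Declined"
)

# raw vertical positions: specific rules first, generic rules last,
# because case_when() stops at the first matching condition
y_raw &lt;- case_when(
  str_detect(name, "Eligible population") ~ 5,
  str_detect(name, "Invitation 1 Accepted") ~ 1,
  str_detect(name, "Invitation 1 Declined") ~ 8.5,
  str_detect(name, "Invitation 2 Accepted") ~ 2,
  str_detect(name, "Accepted") ~ 1,
  str_detect(name, "Declined") ~ 8,
  .default = 5
)

# linear rescaling: (value - min) / (max - min) * (0.999 - 0.001) + 0.001
y &lt;- rescale(y_raw, to = c(0.001, 0.999))

round(y, 4)
# 0.5333 0.0010 0.9990 0.1341 0.0010 0.9325</code></pre></div>
<p>Because the rescaling depends on the minimum and maximum raw values, changing either extreme moves every other node, which is one reason the positions take some trial and error.</p>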
<p>Next, we style the edges, which is much simpler:</p>
<div class="cell fig-cap-location-top">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" data-cap-location="top" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1">edges <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> edges <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb10-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(</span>
<span id="cb10-3">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># add a label for each flow to tell us how many people are in each</span></span>
<span id="cb10-4">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">label =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">number</span>(flow, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">big.mark =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">","</span>),</span>
<span id="cb10-5">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># add a percentage flow figure</span></span>
<span id="cb10-6">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">flow_perc =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">percent</span>(flow <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> temp_eligible_pop, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">accuracy =</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.1</span>)</span>
<span id="cb10-7">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb10-8">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># add the faded colour from our nodes object to match the destinations</span></span>
<span id="cb10-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">left_join</span>(</span>
<span id="cb10-10">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> nodes <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">to =</span> id, colour_fade),</span>
<span id="cb10-11">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"to"</span></span>
<span id="cb10-12">  )</span>
<span id="cb10-13"></span>
<span id="cb10-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># view our edges data</span></span>
<span id="cb10-15">edges <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb10-16">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">reactable</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">defaultPageSize =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span></code></pre></div></div>
</details>
<div class="cell-output-display">
<div class="reactable html-widget html-fill-item" id="htmlwidget-c34880ede9aa21e663fa" style="width:auto;height:auto;"></div>
<script type="application/json" data-for="htmlwidget-c34880ede9aa21e663fa">{"x":{"tag":{"name":"Reactable","attribs":{"data":{"from":[0,0,0,1,1,1,2,3,1,4,4,4,5,6,4,7,8,9],"to":[2,1,3,4,5,6,10,11,12,7,8,9,11,10,12,11,10,12],"flow":[15,77,8,46,12,8,15,8,11,9,10,14,12,8,13,9,10,14],"label":["15","77","8","46","12","8","15","8","11","9","10","14","12","8","13","9","10","14"],"flow_perc":["15.0%","77.0%","8.0%","46.0%","12.0%","8.0%","15.0%","8.0%","11.0%","9.0%","10.0%","14.0%","12.0%","8.0%","13.0%","9.0%","10.0%","14.0%"],"colour_fade":["#C236164C","#7F8FA64C","#44BD324C","#7F8FA64C","#44BD324C","#C236164C","#C236164C","#44BD324C","#7F8FA64C","#44BD324C","#C236164C","#7F8FA64C","#44BD324C","#C236164C","#7F8FA64C","#44BD324C","#C236164C","#7F8FA64C"]},"columns":[{"id":"from","name":"from","type":"numeric"},{"id":"to","name":"to","type":"numeric"},{"id":"flow","name":"flow","type":"numeric"},{"id":"label","name":"label","type":"character"},{"id":"flow_perc","name":"flow_perc","type":"character"},{"id":"colour_fade","name":"colour_fade","type":"character"}],"defaultPageSize":5,"dataKey":"adb0410c463383f1a7998a47c33b6495"},"children":[]},"class":"reactR_markup"},"evals":[],"jsHooks":[]}</script>
<p>Styling the edges dataframe</p>
</div>
</div>
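<p>To see what each step contributes, here is a minimal sketch on two flows. The objects <code>nodes_demo</code> and <code>edges_demo</code> are hypothetical stand-ins for the post's <code>nodes</code> and <code>edges</code>, and <code>temp_eligible_pop &lt;- 100</code> is an assumption that is consistent with the percentages in the table above. It also assumes <code>number()</code> and <code>percent()</code> are the <code>scales</code> functions.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)
library(scales)

# hypothetical stand-ins for the objects used in the post
temp_eligible_pop &lt;- 100
nodes_demo &lt;- tibble(
  id = c(0, 1, 2),
  colour_fade = c("#7F8FA64C", "#7F8FA64C", "#C236164C")
)
edges_demo &lt;- tibble(from = c(0, 0), to = c(2, 1), flow = c(15, 77))

edges_demo &lt;- edges_demo |&gt;
  mutate(
    # formatted count, with a thousands separator for larger flows
    label = number(flow, big.mark = ","),
    # share of the eligible population, to one decimal place
    flow_perc = percent(flow / temp_eligible_pop, accuracy = 0.1)
  ) |&gt;
  # renaming id to `to` means each edge picks up its destination's colour
  left_join(
    y = nodes_demo |&gt; select(to = id, colour_fade),
    by = "to"
  )

edges_demo</code></pre></div>
<p>The edge going to node 2 picks up that node's red fade and the edge going to node 1 picks up the grey fade, so each flow is tinted by where it ends rather than where it starts.</p>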
<p>We now have stylised node and edge tables and can bring it all together. Note that <code>customdata</code> and <code>hovertemplate</code> bring additional information and styling into the pop-up boxes that appear when you hover over each flow and node.</p>
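<p>To isolate how these two arguments work before the full chart, here is a stripped-down sketch with made-up labels and numbers (none of these values come from the dataset in this post). Each element of <code>customdata</code> is paired with the node or link at the same position, and <code>%{customdata}</code> in the template is replaced with that element on hover.</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode r"><code class="sourceCode r">library(plotly)

# illustrative values only, not taken from the post's data
p &lt;- plot_ly(
  type = "sankey",
  node = list(
    label = c("Eligible population", "Accepted", "Declined"),
    # one customdata entry per node, in the same order as the labels
    customdata = c("100%", "60%", "40%"),
    hovertemplate = "%{label}&lt;br /&gt;&lt;b&gt;%{customdata}&lt;/b&gt; of eligible population"
  ),
  link = list(
    # source and target are zero-based positions in the node labels
    source = c(0, 0),
    target = c(1, 2),
    value = c(60, 40),
    # one customdata entry per link, in the same order as the values
    customdata = c("60%", "40%"),
    hovertemplate = "%{source.label} → %{target.label}&lt;br /&gt;&lt;b&gt;%{customdata}&lt;/b&gt;"
  )
)

p</code></pre></div>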
<div class="cell fig-cap-location-top">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb11" data-cap-location="top" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb11-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># plot our stylised sankey</span></span>
<span id="cb11-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">plot_ly</span>(</span>
<span id="cb11-3">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># setup</span></span>
<span id="cb11-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">type =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sankey"</span>,</span>
<span id="cb11-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">orientation =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"h"</span>,</span>
<span id="cb11-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">arrangement =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"snap"</span>,</span>
<span id="cb11-7"></span>
<span id="cb11-8">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># use our node data</span></span>
<span id="cb11-9">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">node =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(</span>
<span id="cb11-10">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">label =</span> nodes<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>name,</span>
<span id="cb11-11">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> nodes<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>colour,</span>
<span id="cb11-12">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> nodes<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>x,</span>
<span id="cb11-13">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">y =</span> nodes<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>y,</span>
<span id="cb11-14">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">customdata =</span> nodes<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>flow_perc,</span>
<span id="cb11-15">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">hovertemplate =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"%{label}&lt;br /&gt;&lt;b&gt;%{value}&lt;/b&gt; participants&lt;br /&gt;&lt;b&gt;%{customdata}&lt;/b&gt; of eligible population"</span></span>
<span id="cb11-16">  ),</span>
<span id="cb11-17"></span>
<span id="cb11-18">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># use our edge data</span></span>
<span id="cb11-19">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">link =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(</span>
<span id="cb11-20">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">source =</span> edges<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>from,</span>
<span id="cb11-21">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">target =</span> edges<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>to,</span>
<span id="cb11-22">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">value =</span> edges<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>flow,</span>
<span id="cb11-23">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">label =</span> edges<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>label,</span>
<span id="cb11-24">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> edges<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>colour_fade,</span>
<span id="cb11-25">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">customdata =</span> edges<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>flow_perc,</span>
<span id="cb11-26">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">hovertemplate =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"%{source.label} → %{target.label}&lt;br /&gt;&lt;b&gt;%{value}&lt;/b&gt; participants&lt;br /&gt;&lt;b&gt;%{customdata}&lt;/b&gt; of eligible population"</span></span>
<span id="cb11-27">  )</span>
<span id="cb11-28">) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb11-29">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">layout</span>(</span>
<span id="cb11-30">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">font =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">list</span>(</span>
<span id="cb11-31">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">family =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Arial, Helvetica, sans-serif"</span>,</span>
<span id="cb11-32">      <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">size =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">12</span></span>
<span id="cb11-33">    ),</span>
<span id="cb11-34">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># make the background transparent (also removes the text shadow)</span></span>
<span id="cb11-35">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">paper_bgcolor =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"rgba(0,0,0,0)"</span></span>
<span id="cb11-36">  ) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb11-37">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">config</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">responsive =</span> TRUE)</span></code></pre></div></div>
</details>
<div class="cell-output-display">
<div class="plotly html-widget html-fill-item" id="htmlwidget-5915d3e9ad0239c93ab1" style="width:672px;height:480px;"></div>
<script type="application/json" data-for="htmlwidget-5915d3e9ad0239c93ab1">{"x":{"visdat":{"6176529af01b":["function () ","plotlyVisDat"]},"cur_data":"6176529af01b","attrs":{"6176529af01b":{"orientation":"h","arrangement":"snap","node":{"label":["Eligible population","Invitation 1 No response","Invitation 1 Declined","Invitation 1 Accepted","Invitation 2 No response","Invitation 2 Accepted","Invitation 2 Declined","Invitation 3 Accepted","Invitation 3 Declined","Invitation 3 No response","Declined","Accepted","No response"],"color":["#7f8fa6","#7f8fa6","#c23616","#44bd32","#7f8fa6","#44bd32","#c23616","#44bd32","#c23616","#7f8fa6","#c23616","#44bd32","#7f8fa6"],"x":[0.001,0.22575000000000001,0.22575000000000001,0.22575000000000001,0.45050000000000001,0.45050000000000001,0.45050000000000001,0.67525000000000002,0.67525000000000002,0.67525000000000002,0.90000000000000002,0.90000000000000002,0.90000000000000002],"y":[0.53326666666666667,0.53326666666666667,0.999,0.001,0.53326666666666667,0.13406666666666667,0.90585333333333329,0.22721333333333335,0.82601333333333338,0.63972000000000007,0.93246666666666667,0.001,0.53326666666666667],"customdata":[null,"77.0%","15.0%","8.0%","46.0%","12.0%","8.0%","9.0%","10.0%","14.0%","33.0%","29.0%","38.0%"],"hovertemplate":"%{label}<br /><b>%{value}<\/b> participants<br /><b>%{customdata}<\/b> of eligible 
population"},"link":{"source":[0,0,0,1,1,1,2,3,1,4,4,4,5,6,4,7,8,9],"target":[2,1,3,4,5,6,10,11,12,7,8,9,11,10,12,11,10,12],"value":[15,77,8,46,12,8,15,8,11,9,10,14,12,8,13,9,10,14],"label":["15","77","8","46","12","8","15","8","11","9","10","14","12","8","13","9","10","14"],"color":["#C236164C","#7F8FA64C","#44BD324C","#7F8FA64C","#44BD324C","#C236164C","#C236164C","#44BD324C","#7F8FA64C","#44BD324C","#C236164C","#7F8FA64C","#44BD324C","#C236164C","#7F8FA64C","#44BD324C","#C236164C","#7F8FA64C"],"customdata":["15.0%","77.0%","8.0%","46.0%","12.0%","8.0%","15.0%","8.0%","11.0%","9.0%","10.0%","14.0%","12.0%","8.0%","13.0%","9.0%","10.0%","14.0%"],"hovertemplate":"%{source.label} → %{target.label}<br /><b>%{value}<\/b> participants<br /><b>%{customdata}<\/b> of eligible population"},"alpha_stroke":1,"sizes":[10,100],"spans":[1,20],"type":"sankey"}},"layout":{"margin":{"b":40,"l":60,"t":25,"r":10},"font":{"family":"Arial, Helvetica, sans-serif","size":12},"paper_bgcolor":"rgba(0,0,0,0)","hovermode":"closest","showlegend":false},"source":"A","config":{"modeBarButtonsToAdd":["hoverclosest","hovercompare"],"showSendToCloud":false,"responsive":true},"data":[{"orientation":"h","arrangement":"snap","node":{"label":["Eligible population","Invitation 1 No response","Invitation 1 Declined","Invitation 1 Accepted","Invitation 2 No response","Invitation 2 Accepted","Invitation 2 Declined","Invitation 3 Accepted","Invitation 3 Declined","Invitation 3 No response","Declined","Accepted","No 
response"],"color":["#7f8fa6","#7f8fa6","#c23616","#44bd32","#7f8fa6","#44bd32","#c23616","#44bd32","#c23616","#7f8fa6","#c23616","#44bd32","#7f8fa6"],"x":[0.001,0.22575000000000001,0.22575000000000001,0.22575000000000001,0.45050000000000001,0.45050000000000001,0.45050000000000001,0.67525000000000002,0.67525000000000002,0.67525000000000002,0.90000000000000002,0.90000000000000002,0.90000000000000002],"y":[0.53326666666666667,0.53326666666666667,0.999,0.001,0.53326666666666667,0.13406666666666667,0.90585333333333329,0.22721333333333335,0.82601333333333338,0.63972000000000007,0.93246666666666667,0.001,0.53326666666666667],"customdata":[null,"77.0%","15.0%","8.0%","46.0%","12.0%","8.0%","9.0%","10.0%","14.0%","33.0%","29.0%","38.0%"],"hovertemplate":"%{label}<br /><b>%{value}<\/b> participants<br /><b>%{customdata}<\/b> of eligible population"},"link":{"source":[0,0,0,1,1,1,2,3,1,4,4,4,5,6,4,7,8,9],"target":[2,1,3,4,5,6,10,11,12,7,8,9,11,10,12,11,10,12],"value":[15,77,8,46,12,8,15,8,11,9,10,14,12,8,13,9,10,14],"label":["15","77","8","46","12","8","15","8","11","9","10","14","12","8","13","9","10","14"],"color":["#C236164C","#7F8FA64C","#44BD324C","#7F8FA64C","#44BD324C","#C236164C","#C236164C","#44BD324C","#7F8FA64C","#44BD324C","#C236164C","#7F8FA64C","#44BD324C","#C236164C","#7F8FA64C","#44BD324C","#C236164C","#7F8FA64C"],"customdata":["15.0%","77.0%","8.0%","46.0%","12.0%","8.0%","15.0%","8.0%","11.0%","9.0%","10.0%","14.0%","12.0%","8.0%","13.0%","9.0%","10.0%","14.0%"],"hovertemplate":"%{source.label} → %{target.label}<br /><b>%{value}<\/b> participants<br /><b>%{customdata}<\/b> of eligible 
population"},"type":"sankey","frame":null}],"highlight":{"on":"plotly_click","persistent":false,"dynamic":false,"selectize":false,"opacityDim":0.20000000000000001,"selected":{"opacity":1},"debounce":0},"shinyEvents":["plotly_hover","plotly_click","plotly_selected","plotly_relayout","plotly_brushed","plotly_brushing","plotly_clickannotation","plotly_doubleclick","plotly_deselect","plotly_afterplot","plotly_sunburstclick"],"base_url":"https://plot.ly"},"evals":[],"jsHooks":[]}</script>
<p>A stylish Sankey</p>
</div>
</div>
</section>
</section>
<section id="conclusion" class="level1">
<h1>Conclusion</h1>
<p>Creating Sankey plots in R using plotly is an effective way to visualise patient pathways.</p>
<p>In our project we embedded Sankey plots within an interactive <a href="https://shiny.posit.co/">Shiny</a> app with filters that update the resulting plot. This allowed us to quickly compare the effects of different models of delivering the screening programme, geography, deprivation levels, patient demographics, or any combination of these.</p>
<p>Their use has helped the programme team better understand patient flows through the pathway, identify the points of drop-off, and compare the effects of different models of delivering the screening programme on patient engagement.</p>
<p>Feedback from external stakeholders has been positive too, noting how easy it is to engage with and understand this style of presentation.</p>
<p>In this blog post we have wrangled a dataset to describe how people flow between steps in a process and then produced a Sankey diagram with some stylistic touches to make an effective visualisation.</p>
<p>I hope this post helps you feel better prepared to use Sankeys in your work.</p>


</section>


 ]]></description>
  <category>learning</category>
  <category>tutorial</category>
  <category>visualisation</category>
  <category>R</category>
  <guid>https://the-strategy-unit.github.io/data_science/blogs/posts/2024-02-28_sankey_plot/</guid>
  <pubDate>Wed, 28 Feb 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Nearest neighbour imputation</title>
  <dc:creator>Jacqueline Grout</dc:creator>
  <link>https://the-strategy-unit.github.io/data_science/blogs/posts/2024-01-17_nearest_neighbour/</link>
  <description><![CDATA[ 





<p>Recently I have been gathering data by GP practice from a variety of sources. The ultimate purpose of my project is to be able to report at an ICB/sub-ICB level<sup>1</sup>. The datasets cover different time periods, so changes in GP practices over time have left me with mismatched datasets.</p>
<div class="no-row-height column-margin column-container"><div id="fn1"><p><sup>1</sup>&nbsp;An ICB (Integrated Care Board) is a statutory NHS organisation responsible for planning health services for its local population</p></div></div><p>My approach has been to take a recent GP List as the basis of my project. Later in the project I want to perform calculations at GP practice level based on an underlying health need. The data for this need is a CHD (coronary heart disease) prevalence value from a dataset that is around 8 years old, and for which there is no update or alternative. When I match my recent list of 6454 practices to the need dataset, I am left with 151 practices without a value for need. Removing these practices could distort the analysis by sub-ICB, since a group of practices in the same area is often subject to changes, mergers and reorganisation together.</p>
<p>Here are the packages and some demo objects used to create an example in which two practices have no need data:</p>
<div class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Packages</span></span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidyverse)</span>
<span id="cb1-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(sf)</span>
<span id="cb1-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidygeocoder)</span>
<span id="cb1-5"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(leaflet)</span>
<span id="cb1-6"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(viridisLite)</span>
<span id="cb1-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(gt)</span>
<span id="cb1-8"></span>
<span id="cb1-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Create some data with two practices with no need data</span></span>
<span id="cb1-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># and a selection of practices locally with need data</span></span>
<span id="cb1-11">practices <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tribble</span>(</span>
<span id="cb1-12">  <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>practice_code, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>postcode, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>has_orig_need, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>value,</span>
<span id="cb1-13">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"P1"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CV1 4FS"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span>,</span>
<span id="cb1-14">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"P2"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CV1 3GB"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">7.3</span>,</span>
<span id="cb1-15">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"P3"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CV11 5TW"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">6.9</span>,</span>
<span id="cb1-16">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"P4"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CV6 3HZ"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">7.1</span>,</span>
<span id="cb1-17">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"P5"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CV6 1HS"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">7.7</span>,</span>
<span id="cb1-18">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"P6"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CV6 5DF"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">8.2</span>,</span>
<span id="cb1-19">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"P7"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CV6 3FA"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">7.9</span>,</span>
<span id="cb1-20">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"P8"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CV1 2DL"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">7.5</span>,</span>
<span id="cb1-21">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"P9"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CV1 4JH"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">7.7</span>,</span>
<span id="cb1-22">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"P10"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CV10 0GQ"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">7.5</span>,</span>
<span id="cb1-23">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"P11"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CV10 0JH"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">7.8</span>,</span>
<span id="cb1-24">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"P12"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CV11 5QT"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="cn" style="color: #8f5902;
background-color: null;
font-style: inherit;">NA</span>,</span>
<span id="cb1-25">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"P13"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CV11 6AB"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">7.6</span>,</span>
<span id="cb1-26">  <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"P14"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"CV6 4DD"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">7.9</span></span>
<span id="cb1-27">)</span>
<span id="cb1-28"></span>
<span id="cb1-29"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># get domain of the has_orig_need flag (0 = no need data, 1 = need data)</span></span>
<span id="cb1-30">(domain <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">range</span>(practices<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>has_orig_need))</span>
<span id="cb1-31"></span>
<span id="cb1-32"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># make a colour palette</span></span>
<span id="cb1-33">pal <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">colorNumeric</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">palette =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">viridis</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">domain =</span> domain)</span></code></pre></div></div>
</details>
</div>
<p>To provide a suitable estimate of need for the newer practices without values, all the practices in the dataset were geocoded<sup>2</sup> using the <code>geocode</code> function from the {tidygeocoder} package.</p>
<div class="no-row-height column-margin column-container"><div id="fn2"><p><sup>2</sup>&nbsp;Geocoding is the process of converting addresses (often the postcode) into geographic coordinates (such as latitude and longitude) that can be plotted on a map.</p></div></div><div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1">practices <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> practices <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb2-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">id =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">row_number</span>()) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb2-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geocode</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">postalcode =</span> postcode) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb2-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">st_as_sf</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">coords =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"long"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lat"</span>), <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">crs =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4326</span>)</span></code></pre></div></div>
</div>
<div class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1">practices <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb3-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gt</span>()</span></code></pre></div></div>
</details>
<div class="cell-output-display">
<div id="slddrbfmwz" style="padding-left:0px;padding-right:0px;padding-top:10px;padding-bottom:10px;overflow-x:auto;overflow-y:auto;width:auto;height:auto;">
<style>#slddrbfmwz table {
  font-family: system-ui, 'Segoe UI', Roboto, Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol', 'Noto Color Emoji';
  -webkit-font-smoothing: antialiased;
  -moz-osx-font-smoothing: grayscale;
}

#slddrbfmwz thead, #slddrbfmwz tbody, #slddrbfmwz tfoot, #slddrbfmwz tr, #slddrbfmwz td, #slddrbfmwz th {
  border-style: none;
}

#slddrbfmwz p {
  margin: 0;
  padding: 0;
}

#slddrbfmwz .gt_table {
  display: table;
  border-collapse: collapse;
  line-height: normal;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#slddrbfmwz .gt_caption {
  padding-top: 4px;
  padding-bottom: 4px;
}

#slddrbfmwz .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  padding-left: 5px;
  padding-right: 5px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#slddrbfmwz .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 3px;
  padding-bottom: 5px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#slddrbfmwz .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#slddrbfmwz .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#slddrbfmwz .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#slddrbfmwz .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#slddrbfmwz .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#slddrbfmwz .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#slddrbfmwz .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#slddrbfmwz .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 5px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#slddrbfmwz .gt_spanner_row {
  border-bottom-style: hidden;
}

#slddrbfmwz .gt_group_heading {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  text-align: left;
}

#slddrbfmwz .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#slddrbfmwz .gt_from_md > :first-child {
  margin-top: 0;
}

#slddrbfmwz .gt_from_md > :last-child {
  margin-bottom: 0;
}

#slddrbfmwz .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#slddrbfmwz .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 5px;
  padding-right: 5px;
}

#slddrbfmwz .gt_stub_row_group {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 5px;
  padding-right: 5px;
  vertical-align: top;
}

#slddrbfmwz .gt_row_group_first td {
  border-top-width: 2px;
}

#slddrbfmwz .gt_row_group_first th {
  border-top-width: 2px;
}

#slddrbfmwz .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#slddrbfmwz .gt_first_summary_row {
  border-top-style: solid;
  border-top-color: #D3D3D3;
}

#slddrbfmwz .gt_first_summary_row.thick {
  border-top-width: 2px;
}

#slddrbfmwz .gt_last_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#slddrbfmwz .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#slddrbfmwz .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#slddrbfmwz .gt_last_grand_summary_row_top {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-bottom-style: double;
  border-bottom-width: 6px;
  border-bottom-color: #D3D3D3;
}

#slddrbfmwz .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#slddrbfmwz .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#slddrbfmwz .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#slddrbfmwz .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding-top: 4px;
  padding-bottom: 4px;
  padding-left: 5px;
  padding-right: 5px;
}

#slddrbfmwz .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#slddrbfmwz .gt_sourcenote {
  font-size: 90%;
  padding-top: 4px;
  padding-bottom: 4px;
  padding-left: 5px;
  padding-right: 5px;
}

#slddrbfmwz .gt_left {
  text-align: left;
}

#slddrbfmwz .gt_center {
  text-align: center;
}

#slddrbfmwz .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#slddrbfmwz .gt_font_normal {
  font-weight: normal;
}

#slddrbfmwz .gt_font_bold {
  font-weight: bold;
}

#slddrbfmwz .gt_font_italic {
  font-style: italic;
}

#slddrbfmwz .gt_super {
  font-size: 65%;
}

#slddrbfmwz .gt_footnote_marks {
  font-size: 75%;
  vertical-align: 0.4em;
  position: initial;
}

#slddrbfmwz .gt_asterisk {
  font-size: 100%;
  vertical-align: 0;
}

#slddrbfmwz .gt_indent_1 {
  text-indent: 5px;
}

#slddrbfmwz .gt_indent_2 {
  text-indent: 10px;
}

#slddrbfmwz .gt_indent_3 {
  text-indent: 15px;
}

#slddrbfmwz .gt_indent_4 {
  text-indent: 20px;
}

#slddrbfmwz .gt_indent_5 {
  text-indent: 25px;
}

#slddrbfmwz .katex-display {
  display: inline-flex !important;
  margin-bottom: 0.75em !important;
}

#slddrbfmwz div.Reactable > div.rt-table > div.rt-thead > div.rt-tr.rt-tr-group-header > div.rt-th-group:after {
  height: 0px !important;
}
</style>

<table class="gt_table caption-top table table-sm table-striped small" data-quarto-bootstrap="false">
<thead>
<tr class="gt_col_headings header">
<th id="practice_code" class="gt_col_heading gt_columns_bottom_border gt_left" data-quarto-table-cell-role="th" scope="col">practice_code</th>
<th id="postcode" class="gt_col_heading gt_columns_bottom_border gt_left" data-quarto-table-cell-role="th" scope="col">postcode</th>
<th id="has_orig_need" class="gt_col_heading gt_columns_bottom_border gt_right" data-quarto-table-cell-role="th" scope="col">has_orig_need</th>
<th id="value" class="gt_col_heading gt_columns_bottom_border gt_right" data-quarto-table-cell-role="th" scope="col">value</th>
<th id="id" class="gt_col_heading gt_columns_bottom_border gt_right" data-quarto-table-cell-role="th" scope="col">id</th>
<th id="geometry" class="gt_col_heading gt_columns_bottom_border gt_center" data-quarto-table-cell-role="th" scope="col">geometry</th>
</tr>
</thead>
<tbody class="gt_table_body">
<tr class="odd">
<td class="gt_row gt_left" headers="practice_code">P1</td>
<td class="gt_row gt_left" headers="postcode">CV1 4FS</td>
<td class="gt_row gt_right" headers="has_orig_need">0</td>
<td class="gt_row gt_right" headers="value">NA</td>
<td class="gt_row gt_right" headers="id">1</td>
<td class="gt_row gt_center" headers="geometry">c(-1.50672203333333, 52.4140662333333)</td>
</tr>
<tr class="even">
<td class="gt_row gt_left" headers="practice_code">P2</td>
<td class="gt_row gt_left" headers="postcode">CV1 3GB</td>
<td class="gt_row gt_right" headers="has_orig_need">1</td>
<td class="gt_row gt_right" headers="value">7.3</td>
<td class="gt_row gt_right" headers="id">2</td>
<td class="gt_row gt_center" headers="geometry">c(-1.51888, 52.4034199)</td>
</tr>
<tr class="odd">
<td class="gt_row gt_left" headers="practice_code">P3</td>
<td class="gt_row gt_left" headers="postcode">CV11 5TW</td>
<td class="gt_row gt_right" headers="has_orig_need">1</td>
<td class="gt_row gt_right" headers="value">6.9</td>
<td class="gt_row gt_right" headers="id">3</td>
<td class="gt_row gt_center" headers="geometry">c(-1.46746, 52.519)</td>
</tr>
<tr class="even">
<td class="gt_row gt_left" headers="practice_code">P4</td>
<td class="gt_row gt_left" headers="postcode">CV6 3HZ</td>
<td class="gt_row gt_right" headers="has_orig_need">1</td>
<td class="gt_row gt_right" headers="value">7.1</td>
<td class="gt_row gt_right" headers="id">4</td>
<td class="gt_row gt_center" headers="geometry">c(-1.52231, 52.42367)</td>
</tr>
<tr class="odd">
<td class="gt_row gt_left" headers="practice_code">P5</td>
<td class="gt_row gt_left" headers="postcode">CV6 1HS</td>
<td class="gt_row gt_right" headers="has_orig_need">1</td>
<td class="gt_row gt_right" headers="value">7.7</td>
<td class="gt_row gt_right" headers="id">5</td>
<td class="gt_row gt_center" headers="geometry">c(-1.52542, 52.41989)</td>
</tr>
<tr class="even">
<td class="gt_row gt_left" headers="practice_code">P6</td>
<td class="gt_row gt_left" headers="postcode">CV6 5DF</td>
<td class="gt_row gt_right" headers="has_orig_need">1</td>
<td class="gt_row gt_right" headers="value">8.2</td>
<td class="gt_row gt_right" headers="id">6</td>
<td class="gt_row gt_center" headers="geometry">c(-1.498344825, 52.4250186)</td>
</tr>
<tr class="odd">
<td class="gt_row gt_left" headers="practice_code">P7</td>
<td class="gt_row gt_left" headers="postcode">CV6 3FA</td>
<td class="gt_row gt_right" headers="has_orig_need">1</td>
<td class="gt_row gt_right" headers="value">7.9</td>
<td class="gt_row gt_right" headers="id">7</td>
<td class="gt_row gt_center" headers="geometry">c(-1.51787, 52.43135)</td>
</tr>
<tr class="even">
<td class="gt_row gt_left" headers="practice_code">P8</td>
<td class="gt_row gt_left" headers="postcode">CV1 2DL</td>
<td class="gt_row gt_right" headers="has_orig_need">1</td>
<td class="gt_row gt_right" headers="value">7.5</td>
<td class="gt_row gt_right" headers="id">8</td>
<td class="gt_row gt_center" headers="geometry">c(-1.49105, 52.40582)</td>
</tr>
<tr class="odd">
<td class="gt_row gt_left" headers="practice_code">P9</td>
<td class="gt_row gt_left" headers="postcode">CV1 4JH</td>
<td class="gt_row gt_right" headers="has_orig_need">1</td>
<td class="gt_row gt_right" headers="value">7.7</td>
<td class="gt_row gt_right" headers="id">9</td>
<td class="gt_row gt_center" headers="geometry">c(-1.5069566, 52.4191685)</td>
</tr>
<tr class="even">
<td class="gt_row gt_left" headers="practice_code">P10</td>
<td class="gt_row gt_left" headers="postcode">CV10 0GQ</td>
<td class="gt_row gt_right" headers="has_orig_need">1</td>
<td class="gt_row gt_right" headers="value">7.5</td>
<td class="gt_row gt_right" headers="id">10</td>
<td class="gt_row gt_center" headers="geometry">c(-1.52197, 52.54074)</td>
</tr>
<tr class="odd">
<td class="gt_row gt_left" headers="practice_code">P11</td>
<td class="gt_row gt_left" headers="postcode">CV10 0JH</td>
<td class="gt_row gt_right" headers="has_orig_need">1</td>
<td class="gt_row gt_right" headers="value">7.8</td>
<td class="gt_row gt_right" headers="id">11</td>
<td class="gt_row gt_center" headers="geometry">c(-1.5163199, 52.53723)</td>
</tr>
<tr class="even">
<td class="gt_row gt_left" headers="practice_code">P12</td>
<td class="gt_row gt_left" headers="postcode">CV11 5QT</td>
<td class="gt_row gt_right" headers="has_orig_need">0</td>
<td class="gt_row gt_right" headers="value">NA</td>
<td class="gt_row gt_right" headers="id">12</td>
<td class="gt_row gt_center" headers="geometry">c(-1.46927, 52.51899)</td>
</tr>
<tr class="odd">
<td class="gt_row gt_left" headers="practice_code">P13</td>
<td class="gt_row gt_left" headers="postcode">CV11 6AB</td>
<td class="gt_row gt_right" headers="has_orig_need">1</td>
<td class="gt_row gt_right" headers="value">7.6</td>
<td class="gt_row gt_right" headers="id">13</td>
<td class="gt_row gt_center" headers="geometry">c(-1.45822, 52.52682)</td>
</tr>
<tr class="even">
<td class="gt_row gt_left" headers="practice_code">P14</td>
<td class="gt_row gt_left" headers="postcode">CV6 4DD</td>
<td class="gt_row gt_right" headers="has_orig_need">1</td>
<td class="gt_row gt_right" headers="value">7.9</td>
<td class="gt_row gt_right" headers="id">14</td>
<td class="gt_row gt_center" headers="geometry">c(-1.50832, 52.44104)</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<p>This map shows the practices: purple markers are practices with no need data, and yellow markers are practices with need data available.</p>
<div class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># make map to display practices</span></span>
<span id="cb4-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">leaflet</span>(practices) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">addTiles</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb4-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">addCircleMarkers</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pal</span>(has_orig_need))</span></code></pre></div></div>
</details>
<div class="cell-output-display">
<div class="leaflet html-widget html-fill-item" id="htmlwidget-3f87eed0d4201a3dc2d8" style="width:672px;height:480px;"></div>
<script type="application/json" data-for="htmlwidget-3f87eed0d4201a3dc2d8">{"x":{"options":{"crs":{"crsClass":"L.CRS.EPSG3857","code":null,"proj4def":null,"projectedBounds":null,"options":{}}},"calls":[{"method":"addTiles","args":["https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png",null,null,{"minZoom":0,"maxZoom":18,"tileSize":256,"subdomains":"abc","errorTileUrl":"","tms":false,"noWrap":false,"zoomOffset":0,"zoomReverse":false,"opacity":1,"zIndex":1,"detectRetina":false,"attribution":"&copy; <a href=\"https://openstreetmap.org/copyright/\">OpenStreetMap<\/a>,  <a href=\"https://opendatacommons.org/licenses/odbl/\">ODbL<\/a>"}]},{"method":"addCircleMarkers","args":[[52.41406623333333,52.4034199,52.519,52.42367,52.41989,52.4250186,52.43135,52.40582,52.4191685,52.54074,52.53723,52.51899,52.52682,52.44104],[-1.506722033333333,-1.51888,-1.46746,-1.52231,-1.52542,-1.498344825,-1.51787,-1.49105,-1.5069566,-1.52197,-1.5163199,-1.46927,-1.45822,-1.50832],10,null,null,{"interactive":true,"className":"","stroke":true,"color":["#440154","#FDE725","#FDE725","#FDE725","#FDE725","#FDE725","#FDE725","#FDE725","#FDE725","#FDE725","#FDE725","#440154","#FDE725","#FDE725"],"weight":5,"opacity":0.5,"fill":true,"fillColor":["#440154","#FDE725","#FDE725","#FDE725","#FDE725","#FDE725","#FDE725","#FDE725","#FDE725","#FDE725","#FDE725","#440154","#FDE725","#FDE725"],"fillOpacity":0.2},null,null,null,null,null,{"interactive":false,"permanent":false,"direction":"auto","opacity":1,"offset":[0,0],"textsize":"10px","textOnly":false,"className":"","sticky":true},null]}],"limits":{"lat":[52.4034199,52.54074],"lng":[-1.52542,-1.45822]}},"evals":[],"jsHooks":[]}</script>
</div>
</div>
<p>The data was split into practices with, and without, a value for need. <code>st_join</code> from the {sf} package was then used to join the practices without a value to those with one, using the geometry to find all practices within 1500m (1.5km).</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1">no_need <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> practices <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb5-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(has_orig_need <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb5-3"></span>
<span id="cb5-4">with_need <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> practices <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb5-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(has_orig_need <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb5-6"></span>
<span id="cb5-7"></span>
<span id="cb5-8">neighbours <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> no_need <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb5-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">no_need_postcode =</span> postcode, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">no_need_prac_code =</span> practice_code) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb5-10">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">st_join</span>(with_need, st_is_within_distance, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1500</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb5-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">st_drop_geometry</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb5-12">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(id, no_need_postcode, no_need_prac_code) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb5-13">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">inner_join</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> with_need, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">join_by</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"id"</span>))</span></code></pre></div></div>
</div>
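<p>As a rough, self-contained sketch of what the distance predicate does, the snippet below checks two candidate neighbours of P12 using coordinates taken from the practice table. It assumes the points are longitude/latitude in EPSG:4326 (the CRS is not stated here), in which case {sf} treats <code>dist</code> as metres. The object names <code>pts</code> and <code>is_near</code> are illustrative only.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r">library(sf)

# Three practices from the table above; the CRS (EPSG:4326) is an assumption
pts &lt;- st_as_sf(
  data.frame(
    practice_code = c("P12", "P13", "P10"),
    lon = c(-1.46927, -1.45822, -1.52197),
    lat = c(52.51899, 52.52682, 52.54074)
  ),
  coords = c("lon", "lat"),
  crs = 4326
)

# Distances from P12 to P13 and to P10
st_distance(pts[1, ], pts[2:3, ])

# Logical matrix: is each candidate within 1500m of P12?
is_near &lt;- st_is_within_distance(pts[1, ], pts[2:3, ], dist = 1500, sparse = FALSE)
is_near</code></pre></div></div>
</div>
<p>In the results of this post, P13 appears among the neighbours and P10 does not, so <code>is_near</code> would be expected to be <code>TRUE</code> for P13 and <code>FALSE</code> for P10.</p>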
<div class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">leaflet</span>(neighbours) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb6-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">addTiles</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb6-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">addCircleMarkers</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">color =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"purple"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb6-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">addMarkers</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.50686326666667</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">52.4141089666667</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">popup =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Practice with no data"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb6-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">addCircles</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.50686326666667</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">52.4141089666667</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">radius =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1500</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb6-6">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">addMarkers</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.46927</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">52.51899</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">popup =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Practice with no data"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb6-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">addCircles</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.46927</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">52.51899</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">radius =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1500</span>)</span></code></pre></div></div>
</details>
<div class="cell-output-display">
<div class="leaflet html-widget html-fill-item" id="htmlwidget-f13f8eb74f3d4edabaad" style="width:672px;height:480px;"></div>
<script type="application/json" data-for="htmlwidget-f13f8eb74f3d4edabaad">{"x":{"options":{"crs":{"crsClass":"L.CRS.EPSG3857","code":null,"proj4def":null,"projectedBounds":null,"options":{}}},"calls":[{"method":"addTiles","args":["https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png",null,null,{"minZoom":0,"maxZoom":18,"tileSize":256,"subdomains":"abc","errorTileUrl":"","tms":false,"noWrap":false,"zoomOffset":0,"zoomReverse":false,"opacity":1,"zIndex":1,"detectRetina":false,"attribution":"&copy; <a href=\"https://openstreetmap.org/copyright/\">OpenStreetMap<\/a>,  <a href=\"https://opendatacommons.org/licenses/odbl/\">ODbL<\/a>"}]},{"method":"addCircleMarkers","args":[[52.4034199,52.519,52.41989,52.4250186,52.40582,52.4191685,52.52682],[-1.51888,-1.46746,-1.52542,-1.498344825,-1.49105,-1.5069566,-1.45822],10,null,null,{"interactive":true,"className":"","stroke":true,"color":"purple","weight":5,"opacity":0.5,"fill":true,"fillColor":"purple","fillOpacity":0.2},null,null,null,null,null,{"interactive":false,"permanent":false,"direction":"auto","opacity":1,"offset":[0,0],"textsize":"10px","textOnly":false,"className":"","sticky":true},null]},{"method":"addMarkers","args":[52.4141089666667,-1.50686326666667,null,null,null,{"interactive":true,"draggable":false,"keyboard":true,"title":"","alt":"","zIndexOffset":0,"opacity":1,"riseOnHover":false,"riseOffset":250},"Practice with no 
data",null,null,null,null,{"interactive":false,"permanent":false,"direction":"auto","opacity":1,"offset":[0,0],"textsize":"10px","textOnly":false,"className":"","sticky":true},null]},{"method":"addCircles","args":[52.4141089666667,-1.50686326666667,1500,null,null,{"interactive":true,"className":"","stroke":true,"color":"#03F","weight":5,"opacity":0.5,"fill":true,"fillColor":"#03F","fillOpacity":0.2},null,null,null,{"interactive":false,"permanent":false,"direction":"auto","opacity":1,"offset":[0,0],"textsize":"10px","textOnly":false,"className":"","sticky":true},null,null]},{"method":"addMarkers","args":[52.51899,-1.46927,null,null,null,{"interactive":true,"draggable":false,"keyboard":true,"title":"","alt":"","zIndexOffset":0,"opacity":1,"riseOnHover":false,"riseOffset":250},"Practice with no data",null,null,null,null,{"interactive":false,"permanent":false,"direction":"auto","opacity":1,"offset":[0,0],"textsize":"10px","textOnly":false,"className":"","sticky":true},null]},{"method":"addCircles","args":[52.51899,-1.46927,1500,null,null,{"interactive":true,"className":"","stroke":true,"color":"#03F","weight":5,"opacity":0.5,"fill":true,"fillColor":"#03F","fillOpacity":0.2},null,null,null,{"interactive":false,"permanent":false,"direction":"auto","opacity":1,"offset":[0,0],"textsize":"10px","textOnly":false,"className":"","sticky":true},null,null]}],"limits":{"lat":[52.4034199,52.52682],"lng":[-1.52542,-1.45822]}},"evals":[],"jsHooks":[]}</script>
</div>
</div>
<p>The “neighbours” data was grouped by the practice code of the practice without need data, and the mean value of its neighbours was calculated to give an estimated value for each of those practices.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1">neighbours_estimate <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> neighbours <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb7-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">group_by</span>(no_need_prac_code) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb7-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarise</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">need_est =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mean</span>(value)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb7-4">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">st_drop_geometry</span>()</span></code></pre></div></div>
</div>
<p>The estimates for the “neighbours” were then joined back onto the original practice data.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1">practices_with_neighbours_estimate <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> practices <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb8-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">left_join</span>(neighbours_estimate, <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">join_by</span>(practice_code <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> no_need_prac_code)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb8-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">st_drop_geometry</span>()</span></code></pre></div></div>
</div>
<div class="cell">
<details class="code-fold">
<summary>Code</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1">practices_with_neighbours_estimate <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb9-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>has_orig_need, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>id) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb9-3">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">gt</span>()</span></code></pre></div></div>
</details>
<div class="cell-output-display">
<div id="hoquacyrox" style="padding-left:0px;padding-right:0px;padding-top:10px;padding-bottom:10px;overflow-x:auto;overflow-y:auto;width:auto;height:auto;">
<style>#hoquacyrox table {
  font-family: system-ui, 'Segoe UI', Roboto, Helvetica, Arial, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol', 'Noto Color Emoji';
  -webkit-font-smoothing: antialiased;
  -moz-osx-font-smoothing: grayscale;
}

#hoquacyrox thead, #hoquacyrox tbody, #hoquacyrox tfoot, #hoquacyrox tr, #hoquacyrox td, #hoquacyrox th {
  border-style: none;
}

#hoquacyrox p {
  margin: 0;
  padding: 0;
}

#hoquacyrox .gt_table {
  display: table;
  border-collapse: collapse;
  line-height: normal;
  margin-left: auto;
  margin-right: auto;
  color: #333333;
  font-size: 16px;
  font-weight: normal;
  font-style: normal;
  background-color: #FFFFFF;
  width: auto;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #A8A8A8;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #A8A8A8;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
}

#hoquacyrox .gt_caption {
  padding-top: 4px;
  padding-bottom: 4px;
}

#hoquacyrox .gt_title {
  color: #333333;
  font-size: 125%;
  font-weight: initial;
  padding-top: 4px;
  padding-bottom: 4px;
  padding-left: 5px;
  padding-right: 5px;
  border-bottom-color: #FFFFFF;
  border-bottom-width: 0;
}

#hoquacyrox .gt_subtitle {
  color: #333333;
  font-size: 85%;
  font-weight: initial;
  padding-top: 3px;
  padding-bottom: 5px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-color: #FFFFFF;
  border-top-width: 0;
}

#hoquacyrox .gt_heading {
  background-color: #FFFFFF;
  text-align: center;
  border-bottom-color: #FFFFFF;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#hoquacyrox .gt_bottom_border {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#hoquacyrox .gt_col_headings {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
}

#hoquacyrox .gt_col_heading {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 6px;
  padding-left: 5px;
  padding-right: 5px;
  overflow-x: hidden;
}

#hoquacyrox .gt_column_spanner_outer {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: normal;
  text-transform: inherit;
  padding-top: 0;
  padding-bottom: 0;
  padding-left: 4px;
  padding-right: 4px;
}

#hoquacyrox .gt_column_spanner_outer:first-child {
  padding-left: 0;
}

#hoquacyrox .gt_column_spanner_outer:last-child {
  padding-right: 0;
}

#hoquacyrox .gt_column_spanner {
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: bottom;
  padding-top: 5px;
  padding-bottom: 5px;
  overflow-x: hidden;
  display: inline-block;
  width: 100%;
}

#hoquacyrox .gt_spanner_row {
  border-bottom-style: hidden;
}

#hoquacyrox .gt_group_heading {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  text-align: left;
}

#hoquacyrox .gt_empty_group_heading {
  padding: 0.5px;
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  vertical-align: middle;
}

#hoquacyrox .gt_from_md > :first-child {
  margin-top: 0;
}

#hoquacyrox .gt_from_md > :last-child {
  margin-bottom: 0;
}

#hoquacyrox .gt_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  margin: 10px;
  border-top-style: solid;
  border-top-width: 1px;
  border-top-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 1px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 1px;
  border-right-color: #D3D3D3;
  vertical-align: middle;
  overflow-x: hidden;
}

#hoquacyrox .gt_stub {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 5px;
  padding-right: 5px;
}

#hoquacyrox .gt_stub_row_group {
  color: #333333;
  background-color: #FFFFFF;
  font-size: 100%;
  font-weight: initial;
  text-transform: inherit;
  border-right-style: solid;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
  padding-left: 5px;
  padding-right: 5px;
  vertical-align: top;
}

#hoquacyrox .gt_row_group_first td {
  border-top-width: 2px;
}

#hoquacyrox .gt_row_group_first th {
  border-top-width: 2px;
}

#hoquacyrox .gt_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#hoquacyrox .gt_first_summary_row {
  border-top-style: solid;
  border-top-color: #D3D3D3;
}

#hoquacyrox .gt_first_summary_row.thick {
  border-top-width: 2px;
}

#hoquacyrox .gt_last_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#hoquacyrox .gt_grand_summary_row {
  color: #333333;
  background-color: #FFFFFF;
  text-transform: inherit;
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
}

#hoquacyrox .gt_first_grand_summary_row {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-top-style: double;
  border-top-width: 6px;
  border-top-color: #D3D3D3;
}

#hoquacyrox .gt_last_grand_summary_row_top {
  padding-top: 8px;
  padding-bottom: 8px;
  padding-left: 5px;
  padding-right: 5px;
  border-bottom-style: double;
  border-bottom-width: 6px;
  border-bottom-color: #D3D3D3;
}

#hoquacyrox .gt_striped {
  background-color: rgba(128, 128, 128, 0.05);
}

#hoquacyrox .gt_table_body {
  border-top-style: solid;
  border-top-width: 2px;
  border-top-color: #D3D3D3;
  border-bottom-style: solid;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
}

#hoquacyrox .gt_footnotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#hoquacyrox .gt_footnote {
  margin: 0px;
  font-size: 90%;
  padding-top: 4px;
  padding-bottom: 4px;
  padding-left: 5px;
  padding-right: 5px;
}

#hoquacyrox .gt_sourcenotes {
  color: #333333;
  background-color: #FFFFFF;
  border-bottom-style: none;
  border-bottom-width: 2px;
  border-bottom-color: #D3D3D3;
  border-left-style: none;
  border-left-width: 2px;
  border-left-color: #D3D3D3;
  border-right-style: none;
  border-right-width: 2px;
  border-right-color: #D3D3D3;
}

#hoquacyrox .gt_sourcenote {
  font-size: 90%;
  padding-top: 4px;
  padding-bottom: 4px;
  padding-left: 5px;
  padding-right: 5px;
}

#hoquacyrox .gt_left {
  text-align: left;
}

#hoquacyrox .gt_center {
  text-align: center;
}

#hoquacyrox .gt_right {
  text-align: right;
  font-variant-numeric: tabular-nums;
}

#hoquacyrox .gt_font_normal {
  font-weight: normal;
}

#hoquacyrox .gt_font_bold {
  font-weight: bold;
}

#hoquacyrox .gt_font_italic {
  font-style: italic;
}

#hoquacyrox .gt_super {
  font-size: 65%;
}

#hoquacyrox .gt_footnote_marks {
  font-size: 75%;
  vertical-align: 0.4em;
  position: initial;
}

#hoquacyrox .gt_asterisk {
  font-size: 100%;
  vertical-align: 0;
}

#hoquacyrox .gt_indent_1 {
  text-indent: 5px;
}

#hoquacyrox .gt_indent_2 {
  text-indent: 10px;
}

#hoquacyrox .gt_indent_3 {
  text-indent: 15px;
}

#hoquacyrox .gt_indent_4 {
  text-indent: 20px;
}

#hoquacyrox .gt_indent_5 {
  text-indent: 25px;
}

#hoquacyrox .katex-display {
  display: inline-flex !important;
  margin-bottom: 0.75em !important;
}

#hoquacyrox div.Reactable > div.rt-table > div.rt-thead > div.rt-tr.rt-tr-group-header > div.rt-th-group:after {
  height: 0px !important;
}
</style>

<table class="gt_table caption-top table table-sm table-striped small" data-quarto-bootstrap="false">
<thead>
<tr class="gt_col_headings header">
<th id="practice_code" class="gt_col_heading gt_columns_bottom_border gt_left" data-quarto-table-cell-role="th" scope="col">practice_code</th>
<th id="postcode" class="gt_col_heading gt_columns_bottom_border gt_left" data-quarto-table-cell-role="th" scope="col">postcode</th>
<th id="value" class="gt_col_heading gt_columns_bottom_border gt_right" data-quarto-table-cell-role="th" scope="col">value</th>
<th id="need_est" class="gt_col_heading gt_columns_bottom_border gt_right" data-quarto-table-cell-role="th" scope="col">need_est</th>
</tr>
</thead>
<tbody class="gt_table_body">
<tr class="odd">
<td class="gt_row gt_left" headers="practice_code">P1</td>
<td class="gt_row gt_left" headers="postcode">CV1 4FS</td>
<td class="gt_row gt_right" headers="value">NA</td>
<td class="gt_row gt_right" headers="need_est">7.68</td>
</tr>
<tr class="even">
<td class="gt_row gt_left" headers="practice_code">P2</td>
<td class="gt_row gt_left" headers="postcode">CV1 3GB</td>
<td class="gt_row gt_right" headers="value">7.3</td>
<td class="gt_row gt_right" headers="need_est">NA</td>
</tr>
<tr class="odd">
<td class="gt_row gt_left" headers="practice_code">P3</td>
<td class="gt_row gt_left" headers="postcode">CV11 5TW</td>
<td class="gt_row gt_right" headers="value">6.9</td>
<td class="gt_row gt_right" headers="need_est">NA</td>
</tr>
<tr class="even">
<td class="gt_row gt_left" headers="practice_code">P4</td>
<td class="gt_row gt_left" headers="postcode">CV6 3HZ</td>
<td class="gt_row gt_right" headers="value">7.1</td>
<td class="gt_row gt_right" headers="need_est">NA</td>
</tr>
<tr class="odd">
<td class="gt_row gt_left" headers="practice_code">P5</td>
<td class="gt_row gt_left" headers="postcode">CV6 1HS</td>
<td class="gt_row gt_right" headers="value">7.7</td>
<td class="gt_row gt_right" headers="need_est">NA</td>
</tr>
<tr class="even">
<td class="gt_row gt_left" headers="practice_code">P6</td>
<td class="gt_row gt_left" headers="postcode">CV6 5DF</td>
<td class="gt_row gt_right" headers="value">8.2</td>
<td class="gt_row gt_right" headers="need_est">NA</td>
</tr>
<tr class="odd">
<td class="gt_row gt_left" headers="practice_code">P7</td>
<td class="gt_row gt_left" headers="postcode">CV6 3FA</td>
<td class="gt_row gt_right" headers="value">7.9</td>
<td class="gt_row gt_right" headers="need_est">NA</td>
</tr>
<tr class="even">
<td class="gt_row gt_left" headers="practice_code">P8</td>
<td class="gt_row gt_left" headers="postcode">CV1 2DL</td>
<td class="gt_row gt_right" headers="value">7.5</td>
<td class="gt_row gt_right" headers="need_est">NA</td>
</tr>
<tr class="odd">
<td class="gt_row gt_left" headers="practice_code">P9</td>
<td class="gt_row gt_left" headers="postcode">CV1 4JH</td>
<td class="gt_row gt_right" headers="value">7.7</td>
<td class="gt_row gt_right" headers="need_est">NA</td>
</tr>
<tr class="even">
<td class="gt_row gt_left" headers="practice_code">P10</td>
<td class="gt_row gt_left" headers="postcode">CV10 0GQ</td>
<td class="gt_row gt_right" headers="value">7.5</td>
<td class="gt_row gt_right" headers="need_est">NA</td>
</tr>
<tr class="odd">
<td class="gt_row gt_left" headers="practice_code">P11</td>
<td class="gt_row gt_left" headers="postcode">CV10 0JH</td>
<td class="gt_row gt_right" headers="value">7.8</td>
<td class="gt_row gt_right" headers="need_est">NA</td>
</tr>
<tr class="even">
<td class="gt_row gt_left" headers="practice_code">P12</td>
<td class="gt_row gt_left" headers="postcode">CV11 5QT</td>
<td class="gt_row gt_right" headers="value">NA</td>
<td class="gt_row gt_right" headers="need_est">7.25</td>
</tr>
<tr class="odd">
<td class="gt_row gt_left" headers="practice_code">P13</td>
<td class="gt_row gt_left" headers="postcode">CV11 6AB</td>
<td class="gt_row gt_right" headers="value">7.6</td>
<td class="gt_row gt_right" headers="need_est">NA</td>
</tr>
<tr class="even">
<td class="gt_row gt_left" headers="practice_code">P14</td>
<td class="gt_row gt_left" headers="postcode">CV6 4DD</td>
<td class="gt_row gt_right" headers="value">7.9</td>
<td class="gt_row gt_right" headers="need_est">NA</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<p>Finally, an updated data frame of the need data was created, using the actual need for the practice where available and the estimated need otherwise.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1">practices_with_neighbours_estimate <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> practices_with_neighbours_estimate <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb10-2">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">need_to_use =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">case_when</span>(value <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> value,</span>
<span id="cb10-3">    <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">.default =</span> need_est</span>
<span id="cb10-4">  )) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">|&gt;</span></span>
<span id="cb10-5">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(practice_code, need_to_use)</span></code></pre></div></div>
</div>
<div class="cell">
<div class="cell-output-display">
<div id="ponhwjqynj" style="padding-left:0px;padding-right:0px;padding-top:10px;padding-bottom:10px;overflow-x:auto;overflow-y:auto;width:auto;height:auto;">

<table class="gt_table caption-top table table-sm table-striped small" data-quarto-bootstrap="false">
<thead>
<tr class="gt_col_headings header">
<th id="practice_code" class="gt_col_heading gt_columns_bottom_border gt_left" data-quarto-table-cell-role="th" scope="col">practice_code</th>
<th id="need_to_use" class="gt_col_heading gt_columns_bottom_border gt_right" data-quarto-table-cell-role="th" scope="col">need_to_use</th>
</tr>
</thead>
<tbody class="gt_table_body">
<tr class="odd">
<td class="gt_row gt_left" headers="practice_code">P1</td>
<td class="gt_row gt_right" headers="need_to_use">7.68</td>
</tr>
<tr class="even">
<td class="gt_row gt_left" headers="practice_code">P2</td>
<td class="gt_row gt_right" headers="need_to_use">7.30</td>
</tr>
<tr class="odd">
<td class="gt_row gt_left" headers="practice_code">P3</td>
<td class="gt_row gt_right" headers="need_to_use">6.90</td>
</tr>
<tr class="even">
<td class="gt_row gt_left" headers="practice_code">P4</td>
<td class="gt_row gt_right" headers="need_to_use">7.10</td>
</tr>
<tr class="odd">
<td class="gt_row gt_left" headers="practice_code">P5</td>
<td class="gt_row gt_right" headers="need_to_use">7.70</td>
</tr>
<tr class="even">
<td class="gt_row gt_left" headers="practice_code">P6</td>
<td class="gt_row gt_right" headers="need_to_use">8.20</td>
</tr>
<tr class="odd">
<td class="gt_row gt_left" headers="practice_code">P7</td>
<td class="gt_row gt_right" headers="need_to_use">7.90</td>
</tr>
<tr class="even">
<td class="gt_row gt_left" headers="practice_code">P8</td>
<td class="gt_row gt_right" headers="need_to_use">7.50</td>
</tr>
<tr class="odd">
<td class="gt_row gt_left" headers="practice_code">P9</td>
<td class="gt_row gt_right" headers="need_to_use">7.70</td>
</tr>
<tr class="even">
<td class="gt_row gt_left" headers="practice_code">P10</td>
<td class="gt_row gt_right" headers="need_to_use">7.50</td>
</tr>
<tr class="odd">
<td class="gt_row gt_left" headers="practice_code">P11</td>
<td class="gt_row gt_right" headers="need_to_use">7.80</td>
</tr>
<tr class="even">
<td class="gt_row gt_left" headers="practice_code">P12</td>
<td class="gt_row gt_right" headers="need_to_use">7.25</td>
</tr>
<tr class="odd">
<td class="gt_row gt_left" headers="practice_code">P13</td>
<td class="gt_row gt_right" headers="need_to_use">7.60</td>
</tr>
<tr class="even">
<td class="gt_row gt_left" headers="practice_code">P14</td>
<td class="gt_row gt_right" headers="need_to_use">7.90</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
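<p>For readers working in Python, the same fallback can be sketched without any packages. This is a minimal illustration rather than the code used in the project: it reuses three rows from the table above, and the <code>rows</code> list and <code>need_to_use</code> helper are names assumed for the example.</p>

```python
# Three rows copied from the table above; None stands in for NA.
rows = [
    {"practice_code": "P1", "value": None, "need_est": 7.68},
    {"practice_code": "P2", "value": 7.3, "need_est": None},
    {"practice_code": "P12", "value": None, "need_est": 7.25},
]


def need_to_use(row):
    """Return the actual need where available, otherwise the estimate."""
    return row["value"] if row["value"] is not None else row["need_est"]


# Keep only the practice code and the value chosen for it.
result = [
    {"practice_code": row["practice_code"], "need_to_use": need_to_use(row)}
    for row in rows
]

for row in result:
    print(row["practice_code"], row["need_to_use"])
```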
<p>For my project, using a 1.5 km radius, this method generated a prevalence for 125 of the 151 practices without a need value, leaving just 26 practices without one. Each use case involves a trade-off: a smaller radius gives a more accurate estimate but fewer matches, while a larger radius gives more matches but a less accurate estimate.</p>
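<p>The radius trade-off can be illustrated with a small Python sketch. Everything in it is hypothetical: the coordinates, the need values, and the assumption that the estimate is a simple mean of the neighbouring values, which may differ from the method used in the project. The <code>haversine_km</code> and <code>estimate_need</code> functions are names made up for the example.</p>

```python
from math import asin, cos, radians, sin, sqrt
from statistics import mean


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))


# Hypothetical practices with a known need: (latitude, longitude, need).
known = [
    (52.410, -1.510, 7.3),
    (52.420, -1.510, 7.9),
    (52.450, -1.510, 8.2),
]

# Hypothetical practice with no need value.
target = (52.408, -1.510)


def estimate_need(target, known, radius_km):
    """Return (mean need of neighbours within the radius, number of neighbours)."""
    nearby = [
        need
        for lat, lon, need in known
        if radius_km >= haversine_km(target[0], target[1], lat, lon)
    ]
    return (mean(nearby) if nearby else None), len(nearby)


# A wider radius finds more neighbours, but they are further away.
for radius_km in (0.5, 1.5, 5.0):
    estimate, matches = estimate_need(target, known, radius_km)
    print(radius_km, matches, estimate)
```

<p>In this sketch each increase in radius brings one more neighbour into the estimate. A practice with no neighbours inside the radius gets no estimate at all.</p>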
<p>This approach could be replicated for other similar purposes. A topical example from an SU project is the need to assign population prevalence for hypertension and compare it to current QOF<sup>3</sup> data. Again, the prevalence data is a few years old, so we have to map the historical data onto current practices. This leaves missing data that can be estimated using this method.</p>


<div class="no-row-height column-margin column-container"><div id="fn3"><p><sup>3</sup>&nbsp;QOF (Quality and Outcomes Framework) is a voluntary annual reward and incentive programme for all GP practices in England, detailing practice achievement results.</p></div></div>

 ]]></description>
  <category>learning</category>
  <guid>https://the-strategy-unit.github.io/data_science/blogs/posts/2024-01-17_nearest_neighbour/</guid>
  <pubDate>Wed, 17 Jan 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Advent of Code and Test Driven Development</title>
  <dc:creator>YiWen Hon</dc:creator>
  <link>https://the-strategy-unit.github.io/data_science/blogs/posts/2024-01-10_advent-of-code-and-test-driven-development/</link>
  <description><![CDATA[ 





<p><a href="https://adventofcode.com/">Advent of Code</a> is an annual event, where daily coding puzzles are released from 1st – 25th December. We ran one of our fortnightly Coffee &amp; Coding sessions introducing Advent of Code to people who code in the Strategy Unit, as well as the concept of test-driven development as a potential way of approaching the puzzles.</p>
<p><a href="https://developer.ibm.com/articles/5-steps-of-test-driven-development/">Test-driven development</a> (TDD) is an approach to coding which involves writing the test for a function BEFORE we write the function. This might seem quite counterintuitive, but it makes it easier to identify bugs 🐛 when they are introduced to our code, and ensures that our functions meet all necessary criteria. In my experience, this takes quite a long time to implement and can be quite tedious, but it is definitely worth it overall, especially as your project develops. Testing is also recommended in the <a href="https://nhsdigital.github.io/rap-community-of-practice/introduction_to_RAP/levels_of_RAP/">NHS Reproducible Analytical Pipeline (RAP)</a> guidelines.</p>
<p>An interesting thing to note about TDD is that we’re always expecting our first test to fail, and indeed failing tests are useful and important! If we wrote tests that just passed all the time, this would not be useful at all for our code.</p>
<p>The way that Advent of Code is structured, with test data for each puzzle and an expected test result, makes it very amenable to a test-driven approach. In order to support this, Matt and I created template repositories for a test-driven approach to Advent of Code, in <a href="https://github.com/yiwen-h/aoc_python_template">Python</a> and in <a href="https://github.com/matt-dray/aoc.rstats.template">R</a>.</p>
<p>Our goal when setting this up was to introduce others in the Strategy Unit to both TDD and Advent of Code. Advent of Code can be challenging and I personally struggle to get past the first week, but it encourages creative (and maybe even fun?!) approaches to coding problems. I’m glad that we had the chance to explore some of the puzzles together in Coffee &amp; Coding – it was interesting to see so many different approaches to the same problem, and hopefully it also gave us all the chance to practise writing tests.</p>



 ]]></description>
  <category>learning</category>
  <guid>https://the-strategy-unit.github.io/data_science/blogs/posts/2024-01-10_advent-of-code-and-test-driven-development/</guid>
  <pubDate>Wed, 10 Jan 2024 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Reinstalling R Packages</title>
  <dc:creator>Tom Jemmett</dc:creator>
  <link>https://the-strategy-unit.github.io/data_science/blogs/posts/2023-04-26_reinstalling-r-packages/</link>
  <description><![CDATA[ 





<p><a href="https://stat.ethz.ch/pipermail/r-announce/2023/000691.html">R 4.3.0 was released</a> last week. Any time you update R you will probably find yourself in the position where no packages are installed. This is by design - the packages that you have installed may need to be updated and recompiled to work under new versions of R.</p>
<p>You may find yourself wanting to have all of the packages that you previously used, so one approach that some people take is to copy the previous library folder to the new version’s folder. This isn’t a good idea and could potentially break your R install.</p>
<p>Another approach is to export the list of installed packages before updating and then use that list after you have updated R. This can cause issues, though, if you install from places other than CRAN, e.g.&nbsp;Bioconductor or GitHub.</p>
<p>Some of these approaches are discussed on the <a href="https://community.rstudio.com/t/reinstalling-packages-on-new-version-of-r/7670/4">RStudio Community Forum</a>. I prefer to have a “spring clean” instead, installing only the packages that I know I need.</p>
<p>I maintain a <a href="https://gist.github.com/tomjemmett/c105d3e0fbea7558088f68c65e68e1ed/">list of the packages that I use</a> as a gist. Using this, I can then simply run this script on any new R install. In fact, if you click the “raw” button on the gist and copy that URL, you can simply run</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">source</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"https://gist.githubusercontent.com/tomjemmett/c105d3e0fbea7558088f68c65e68e1ed/raw/a1db4b5fa0d24562d16d3f57fe8c25fb0d8aa53e/setup.R"</span>)</span></code></pre></div></div>
<p>Generally, sourcing a URL is a bad idea: if it’s not a link that you control, someone could update the contents and run arbitrary code on your machine. In this case, I’m happy to run this as it’s my own gist, but you should be mindful if running it yourself!</p>
<p>If you look at the script, I first install a number of packages from CRAN, then I install packages that only exist on GitHub.</p>



 ]]></description>
  <category>git</category>
  <category>tutorial</category>
  <guid>https://the-strategy-unit.github.io/data_science/blogs/posts/2023-04-26_reinstalling-r-packages/</guid>
  <pubDate>Wed, 26 Apr 2023 00:00:00 GMT</pubDate>
</item>
</channel>
</rss>
