Travels with R and Python

The power of data science in healthcare

Aug 2, 2023

What is data science?

  • “A data scientist knows more about computer science than the average statistician, and more about statistics than the average computer scientist”

(Josh Wills, a former head of data engineering at Slack)

Drew Conway’s famous Venn diagram

[Figure: data science Venn diagram]

What are the skills of data science?

  • Analysis
    • ML
    • Stats
    • Data viz
  • Software engineering
    • Programming
    • SQL/data
    • DevOps
    • RAP

  • Domain knowledge
    • Communication
    • Problem formulation
    • Dashboards and reports

Stats and data viz

  • ML leans a bit more towards atheoretical prediction
  • Stats leans a bit more towards inference (but they both do both)
  • Data scientists may use different visualisations
    • Interactive web based tools
    • Dashboard based visualisers e.g. {stminsights}

Software engineering

  • Programming
    • No-code/low-code data science?
  • SQL/data
    • Tend to use reproducible automated processes
  • DevOps
    • Plan, code, build, test, release, deploy, operate, monitor
  • RAP
    • I will come back to this

Domain knowledge

  • Do stuff that matters
    • “The best minds of my generation are thinking about how to make people click ads. That sucks.” (Jeffrey Hammerbacher)
  • Convince other people that it matters
  • This is the hardest part of data science

RAP

  • Data science isn’t RAP
  • RAP isn’t data science
  • They are firm friends

Reproducibility

What is RAP?

  • Reproducible Analytical Pipelines (RAP): a process in which code is used to minimise manual, undocumented steps, and a clear, properly documented process is produced in code which can reliably give the same result from the same dataset
  • RAP should be “the core working practice that must be supported by all platforms and teams; make this a core focus of NHS analyst training”
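
As a rough illustration of "the same result from the same dataset", the Python sketch below (standard library only) runs a small analytical step twice and compares a hash of the output. The function names, site codes and counts are all hypothetical and made up for illustration; this is not taken from any real NHS pipeline.

```python
import hashlib
import json


def summarise_attendances(records):
    """Total attendances per site; a stand-in for a real analytical step."""
    counts = {}
    for row in records:
        counts[row["site"]] = counts.get(row["site"], 0) + row["attendances"]
    # Sort by site so the output does not depend on the order of the input rows
    return dict(sorted(counts.items()))


def fingerprint(result):
    """Hash a JSON serialisation of the result so two runs can be compared."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Made-up example data (hypothetical sites and counts)
records = [
    {"site": "B", "attendances": 12},
    {"site": "A", "attendances": 30},
    {"site": "B", "attendances": 8},
]

first_run = summarise_attendances(records)
second_run = summarise_attendances(list(reversed(records)))

# The same dataset should give the same fingerprint on every run
assert fingerprint(first_run) == fingerprint(second_run)
```

Because every step is in code, a check like this can be rerun by anyone with access to the same data, which is the point of removing manual, undocumented steps.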

Levels of RAP: Baseline

  • Data produced by code in an open-source language (e.g., Python, R, SQL)
  • Code is version controlled
  • Repository includes a README.md file that clearly details steps a user must follow to reproduce the code
  • Code has been peer reviewed
  • Code is published in the open and linked to & from accompanying publication (if relevant)

Levels of RAP: Silver

  • Code is well-documented…
  • Code is well-organised following standard directory format
  • Reusable functions and/or classes are used where appropriate
  • Pipeline includes a testing framework
  • Repository includes dependency information (e.g. requirements.txt, Pipfile, environment.yml)
  • Data is handled and output in a Tidy data format
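
Three of the Silver criteria (reusable functions, a testing framework, tidy output) can be sketched together in a few lines of Python. This is only one possible shape, assuming a hypothetical wide table of monthly admissions per site; the function name `to_tidy`, its parameters and the sample values are invented for illustration.

```python
def to_tidy(wide_rows, id_col, var_name, value_name):
    """Reshape wide rows into tidy rows: one observation per row."""
    tidy = []
    for row in wide_rows:
        for column, value in row.items():
            if column == id_col:
                continue  # the identifier is repeated on each tidy row instead
            tidy.append({id_col: row[id_col], var_name: column, value_name: value})
    return tidy


def test_to_tidy():
    """A small unit test; the data here is made up."""
    wide = [{"site": "A", "2023-06": 10, "2023-07": 12}]
    expected = [
        {"site": "A", "month": "2023-06", "admissions": 10},
        {"site": "A", "month": "2023-07", "admissions": 12},
    ]
    assert to_tidy(wide, "site", "month", "admissions") == expected


test_to_tidy()
```

In a real repository the test would normally live in its own file and be collected by a test runner such as pytest rather than called directly.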

Levels of RAP: Gold

  • Code is fully packaged
  • Repository automatically runs tests etc. via CI/CD or a different integration/deployment tool e.g. GitHub Actions
  • Process runs based on event-based triggers (e.g., new data in database) or on a schedule
  • Changes to the RAP are clearly signposted, e.g. a changelog in the package, releases etc. (see gov.uk info on Semantic Versioning)
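
To make the Semantic Versioning point concrete, here is a minimal Python sketch of how a MAJOR.MINOR.PATCH number moves with each kind of change. The helper `bump_version` is hypothetical and handles plain three-part versions only (no pre-release or build suffixes).

```python
def bump_version(version, change):
    """Return the next MAJOR.MINOR.PATCH version for a given kind of change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":  # breaking change, e.g. an output column is renamed
        return f"{major + 1}.0.0"
    if change == "minor":  # backwards-compatible addition, e.g. a new output
        return f"{major}.{minor + 1}.0"
    if change == "patch":  # backwards-compatible bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")


next_release = bump_version("1.4.2", "minor")
```

Each bump would normally be paired with a changelog entry and a tagged release, so that users of the pipeline can see what changed and when.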

Data science in healthcare

  • Forecasting
    • Stats versus ML
  • Text mining
    • R versus Python
  • Demand modelling
    • DevOps as a way of life

Get involved!

  • NHS-R community
    • Webinars, training, conference, Slack
  • NHS-Pycom
    • ditto…
  • MLCSU GitHub?
  • Build links with the other CSUs

Contact