Travels with R and Python
the power of data science in healthcare
Aug 2, 2023
Drew Conway’s famous Venn diagram
What are the skills of data science?
- Analysis
- Software engineering
  - Programming
  - SQL/data
  - DevOps
  - RAP
What are the skills of data science?
- Domain knowledge
- Communication
- Problem formulation
- Dashboards and reports
Stats and data viz
- ML leans a bit more towards atheoretical prediction
- Stats leans a bit more towards inference (but they both do both)
- Data scientists may use different visualisations
  - Interactive web-based tools
  - Dashboard-based visualisers, e.g. {stminsights}
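The difference in emphasis between inference and prediction can be sketched with one model asked two questions. This is a minimal illustration using only the Python standard library; the data values and the function names (`fit_line`, `slope_standard_error`, `holdout_rmse`) are hypothetical and not taken from the talk.

```python
import math

# Hypothetical toy data (illustrative values only).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def slope_standard_error(xs, ys, intercept, slope):
    """Inference-style question: how uncertain is the estimated slope?"""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    residual_ss = sum(
        (yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(xs, ys)
    )
    return math.sqrt(residual_ss / (n - 2) / sxx)


def holdout_rmse(xs, ys, n_train):
    """Prediction-style question: how well do we predict unseen points?"""
    intercept, slope = fit_line(xs[:n_train], ys[:n_train])
    squared_errors = [
        (yi - (intercept + slope * xi)) ** 2
        for xi, yi in zip(xs[n_train:], ys[n_train:])
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


intercept, slope = fit_line(x, y)
se = slope_standard_error(x, y, intercept, slope)
rmse = holdout_rmse(x, y, n_train=6)
print(f"slope = {slope:.2f} (standard error {se:.3f})")
print(f"hold-out RMSE = {rmse:.2f}")
```

The same fitted line serves both purposes, which is the point of "they both do both": the standard error speaks to inference, the hold-out error to prediction.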
Software engineering
- Programming
  - No-code/low-code data science?
- SQL/data
  - Tend to use reproducible automated processes
- DevOps
  - Plan, code, build, test, release, deploy, operate, monitor
- RAP
Domain knowledge
- Do stuff that matters
  - “The best minds of my generation are thinking about how to make people click ads. That sucks.” (Jeffrey Hammerbacher)
- Convince other people that it matters
  - This is the hardest part of data science
RAP
- Data science isn’t RAP
- RAP isn’t data science
- They are firm friends
What is RAP
- a process in which code is used to minimise manual, undocumented steps, and a clear, properly documented process is produced in code which can reliably give the same result from the same dataset
- RAP should be the core working practice that must be supported by all platforms and teams; make this a core focus of NHS analyst training
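One way to check "the same result from the same dataset" is to run the coded process twice and compare a fingerprint of the output. The sketch below assumes a hypothetical `run_pipeline` function and made-up input values; it seeds its random step because unseeded randomness is a common reason two runs disagree.

```python
import hashlib
import json
import random

# Hypothetical input data (illustrative values only).
DATASET = [112, 98, 105, 120, 99, 101, 117]


def run_pipeline(data, seed=2023):
    """Every step lives in code, and the random step is seeded."""
    rng = random.Random(seed)  # local seeded generator, no hidden global state
    resampled_means = [
        sum(rng.choices(data, k=len(data))) / len(data) for _ in range(200)
    ]
    return {
        "mean": round(sum(data) / len(data), 3),
        "bootstrap_mean": round(sum(resampled_means) / len(resampled_means), 3),
    }


def fingerprint(result):
    """Stable hash of the output so two runs can be compared exactly."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


first = fingerprint(run_pipeline(DATASET))
second = fingerprint(run_pipeline(DATASET))
print("reproducible:", first == second)
```

A check like this does not make a process a RAP on its own, but it is a cheap way to notice when a manual or unseeded step has crept in.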
Levels of RAP: Baseline
- Data produced by code in an open-source language (e.g., Python, R, SQL)
- Code is version controlled
- Repository includes a README.md file that clearly details steps a user must follow to reproduce the code
- Code has been peer reviewed
- Code is published in the open and linked to & from accompanying publication (if relevant)
Levels of RAP: Silver
- Code is well-documented…
- Code is well-organised following standard directory format
- Reusable functions and/or classes are used where appropriate
- Pipeline includes a testing framework
- Repository includes dependency information (e.g. requirements.txt, Pipfile, environment.yml)
- Data is handled and output in a Tidy data format
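Three of the Silver criteria (a reusable function, a test, and tidy output) can be shown together in a few lines. This is a sketch only: the table, the column names and the `to_tidy` helper are hypothetical, and the test is written in the style that a framework such as pytest would collect, although the slides do not name a framework.

```python
# Hypothetical wide table: one row per site, one column per month.
WIDE_ROWS = [
    {"site": "A", "2023-01": 120, "2023-02": 135},
    {"site": "B", "2023-01": 80, "2023-02": 95},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Reshape wide rows into tidy form: one observation per row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on each output row
            tidy.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return tidy


def test_to_tidy():
    """Checks the shape and content of the reshaped data."""
    tidy = to_tidy(WIDE_ROWS, "site", "month", "attendances")
    assert len(tidy) == 4
    assert tidy[0] == {"site": "A", "month": "2023-01", "attendances": 120}
    assert {row["month"] for row in tidy} == {"2023-01", "2023-02"}


test_to_tidy()
```

In a real project the function would live in a module and the test in a separate test file, so the pipeline and its checks can be run independently.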
Levels of RAP: Gold
- Code is fully packaged
- Repository automatically runs tests etc. via CI/CD or a different integration/deployment tool e.g. GitHub Actions
- Process runs based on event-based triggers (e.g., new data in database) or on a schedule
- Changes to the RAP are clearly signposted. E.g. a changelog in the package, releases etc. (See gov.uk info on Semantic Versioning)
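An event-based trigger can be as simple as "run only when the data has changed". The sketch below is one possible approach, not a prescribed one: it compares a hash of the data file with the hash recorded at the last run. The file names and the `run_if_changed` helper are hypothetical, and in practice a scheduler or CI tool would be the thing calling it.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_digest(path):
    """SHA-256 of the file contents, used to detect new data."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def run_if_changed(data_path, state_path, pipeline):
    """Run `pipeline` only when the data differs from the last recorded run."""
    state_path = Path(state_path)
    previous = json.loads(state_path.read_text()) if state_path.exists() else {}
    current = file_digest(data_path)
    if previous.get("digest") == current:
        return False  # nothing new, skip this run
    pipeline(data_path)
    state_path.write_text(json.dumps({"digest": current}))
    return True


# Usage with temporary files so the sketch is self-contained.
runs = []
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "admissions.csv"  # hypothetical data file
    state = Path(tmp) / "last_run.json"  # hypothetical state file
    data.write_text("date,count\n2023-08-01,42\n")
    results = [run_if_changed(data, state, runs.append)]  # first sight of the data
    results.append(run_if_changed(data, state, runs.append))  # data unchanged
    data.write_text("date,count\n2023-08-01,42\n2023-08-02,40\n")
    results.append(run_if_changed(data, state, runs.append))  # new rows arrived
print(results)
```

Recording the digest only after the pipeline succeeds means a failed run is retried the next time the check fires.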
Data science in healthcare
- Forecasting
- Text mining
- Demand modelling