Data Science @ The Strategy Unit – Identifying patients at risk

AI could help identify high-risk heart patients¹

The University of Leeds has helped train an AI system called Optimise, that looked at health records of more than two million people.

…

Of those two million records that were scanned, more than 400,000 people were identified as being high risk for the likes of heart failure, stroke and diabetes.

How it works

The input: Health records
Health records can be structured or unstructured
- Structured: can be stored in a table
- Unstructured: can’t be stored in a table, different shapes/sizes (e.g. text, audio, images)

Example of structured data in health records

ID	BMI	Age	IMD Decile	Smoker	Blood Pressure
1	17	49	3	1	110/70
2	25	67	1	1	129/70
3	20	39	8	0	140/90
4	28	81	6	0	130/85
5	29	41	4	0	120/80

Data is consistent within each column in the table.

Example of unstructured data in health records

ID	Notes
1	Shortness of breath
2	Patient attended clinic following one week of fever, vomiting, and abdominal pain.

The length of each sentence is different - data not consistent.

A simple approach to classifying data: KNN

Clustering algorithms like K Nearest Neighbours (KNN) are on the more basic end of the scale, requiring very little computational power.

Example of k-nearest neighbour classification, with red triangles representing one class, blue squares representing another, and a new datapoint as a green circle which has two close red triangle neighbours and one blue square neighbour. ¹

A simple approach to classifying data: Decision Tree

Example of a decision tree ¹

There are many different models out there! 🥴

Complex diagram showing a decision tree for choosing the right estimator for different machine learning problems ¹

What makes a model simple or complex?

There are dozens of different algorithms out there
Each algorithm has different strengths and weaknesses
What makes a model simple or complex is the amount of computational power required and how much the model needs to “learn” - how many parameters there are

Is the input or the computation complex?

“We used UK primary care EHR data from 2,081,139 individuals aged ≥ 30 years…

We trained a random forest classifier using age, sex, ethnicity and comorbidities (OPTIMISE).”¹

Pros and cons of simple “A.I.” approaches

Pros:

Simple models are more easily explained
Can sometimes find new patterns in the data

Cons:

The quality of the data determines the quality of the model
Not able to handle very complex tasks

🚩 Issues to look out for 🚩

How complex is the input, or the computational approach?
How is the model’s performance measured?
Does the model get updated?
Where did the data come from?
Have issues of bias or ethics been considered?