Project outcomes: Model performance

Performance of final trained models

Thanks to the 21000 comments painstakingly and expertly labelled as part of this project, we had a very rich dataset with which to train our machine learning models. The best performing models were the Support Vector Classifier, Gradient Boosted Decision Trees (XGBoost), and Distilbert, a transformer-based large language model. For the multilabel categories, we found that these three models were largely comparable to one another but that they produced the best results when combined together. The Distilbert model outperformed the others for predicting text sentiment (i.e., positive/negative/neutral).

Please see below the final performance metrics for the models that we have trained and which are available via the API. These are also available in alternative formats to download from our github repository.

Multilabel ensemble model

Precision Recall F1 Score Support (Label count in test data) Label count in full dataset True Negative False Negative True Positive False Positive Average Precision Score
Activities & access to fresh air 0.76 0.76 0.76 54 284 3940 13 41 13 0.841
Appointment arrangements 0.73 0.69 0.71 261 1282 3679 80 181 67 0.751
Appointment method 0.65 0.65 0.65 31 188 3965 11 20 11 0.638
Being kept informed, clarity & consistency of information 0.56 0.66 0.6 183 840 3730 63 120 94 0.571
Cleanliness, tidiness & infection control 0.91 0.86 0.88 107 475 3891 15 92 9 0.918
Competence & training 0.74 0.57 0.64 164 790 3810 71 93 33 0.607
Contacting services 0.72 0.65 0.68 100 516 3882 35 65 25 0.76
Continuity of care 0.65 0.58 0.61 290 1448 3626 122 168 91 0.644
Discharge 0.8 0.52 0.63 46 140 3955 22 24 6 0.582
Electronic entertainment 0.94 0.65 0.77 23 92 3983 8 15 1 0.84
Environment, facilities & equipment 0.71 0.67 0.69 202 887 3750 67 135 55 0.751
Feeling safe 0.69 0.87 0.77 23 98 3975 3 20 9 0.793
Food & drink provision & facilities 0.87 0.85 0.86 106 594 3888 16 90 13 0.892
Funding & use of financial resources 0.64 0.64 0.64 25 102 3973 9 16 9 0.684
Information directly from staff during care 0.73 0.8 0.76 390 1944 3504 79 311 113 0.855
Information provision & guidance 0.64 0.51 0.57 90 458 3891 44 46 26 0.605
Interaction with family/ carers 0.59 0.45 0.51 123 579 3845 68 55 39 0.556
Labelling not possible 1 1 1 238 1234 3768 0 238 1 1
Mental Health Act 0.75 0.23 0.35 13 56 3993 10 3 1 0.383
Organisation & efficiency 0.49 0.75 0.59 102 560 3826 26 76 79 0.667
Pain management 0.86 0.72 0.78 43 181 3959 12 31 5 0.832
Parking 0.94 0.94 0.94 18 129 3988 1 17 1 0.944
Positive experience & gratitude 0.8 0.91 0.85 938 4722 2856 86 852 213 0.931
Sensory experience 0.8 0.79 0.8 67 281 3927 14 53 13 0.82
Service location 0.85 0.59 0.7 86 475 3912 35 51 9 0.779
Staff listening, understanding & involving patients 0.64 0.75 0.69 361 1743 3492 91 270 154 0.793
Staff manner & personal attributes 0.9 0.9 0.9 1431 7131 2430 142 1289 146 0.965
Staffing levels & responsiveness 0.55 0.56 0.56 194 1027 3726 86 108 87 0.591
Supplying & understanding medication 0.73 0.69 0.71 59 365 3933 18 41 15 0.687
Timeliness of care 0.67 0.82 0.74 529 2773 3266 97 432 212 0.789
Transport to/ from services 0.74 0.64 0.68 78 413 3911 28 50 18 0.686
Unspecified communication 0.61 0.64 0.62 36 172 3956 13 23 15 0.696

Distilbert sentiment model

The model is trained to predict to a five-point sentiment scale. However, we would recommend mapping this to a three-point sentiment scale, as advised by the qualitative analysts at NHS England.

Score Five point sentiment scale Three point sentiment scale
1 Very positive Positive
2 Positive Positive
3 Neutral / mixed Neutral / mixed
4 Negative Negative
5 Very negative Negative

Performance metrics for both are shown below.

Three point sentiment scale

Accuracy: 0.85

Precision Recall F1 Score Support
Positive 0.95 0.9 0.92 2587
Neutral / mixed 0.52 0.71 0.6 551
Negative 0.87 0.81 0.84 805

Confusion matrix using counts

Confusion matrix using percentages

Five point sentiment scale

Accuracy: 0.7

Precision Recall F1 Score Support
Very positive 0.8 0.79 0.8 1746
Positive 0.63 0.52 0.57 841
Neutral / mixed 0.52 0.71 0.6 551
Negative 0.79 0.68 0.73 639
Very negative 0.52 0.64 0.57 166

Confusion matrix using counts

Confusion matrix using percentages