Project outcomes: Key takeaways
What didn’t work, and potential developments
Multinomial Naive Bayes, Random Forest Classifier, and K Nearest Neighbours models showed very poor performance.
For the sklearn models, we only tried the spaCy and TF-IDF vectorizers; of the two, TF-IDF produced the better results. Perhaps another vectorizer would have been more effective? Previous work has shown that pretrained embeddings generally perform poorly on this type of text due to its specialised vocabulary. However, it would have been interesting to try other ways of vectorizing the text - perhaps using the custom embeddings generated by the first few layers of the DistilBERT model.
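As a rough illustration of that last idea, here is a minimal sketch of extracting mean-pooled hidden states from an early DistilBERT layer and feeding them into an sklearn classifier. The layer index, pooling choice and toy data are assumptions made for illustration, not part of the project's pipeline.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased", output_hidden_states=True)

def embed(texts, layer=2):
    """Mean-pool the hidden states of an early DistilBERT layer to get one
    fixed-length vector per comment, usable as sklearn features."""
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden_states = model(**inputs).hidden_states  # tuple: embeddings + 6 layers
    return hidden_states[layer].mean(dim=1).numpy()    # mean-pool over tokens

# Toy multilabel example, purely illustrative
X_train = embed(["the nurse was very kind", "toilets were disgusting"])
y_train = np.array([[1, 0], [0, 1]])
clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)
```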
Due to a lack of hardware resources, we only tried the DistilBERT pretrained model, one of the smallest transformer models. Perhaps another transformer model would be more effective? There have been big developments in LLMs in recent months; it would be worth trying Llama 2, for example. In future iterations of this project, more resources should be committed to the hardware side - ideally a VM with a GPU - if we want to make use of these larger models.
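For the curious, swapping in a different transformer is largely a matter of changing the checkpoint name in the Hugging Face transformers setup (and having the hardware to train it). This is a hedged sketch rather than the project's actual training code; the checkpoint and number of labels below are placeholders.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint: swap for a larger model when the hardware allows.
checkpoint = "roberta-base"
num_labels = 10  # placeholder: set to the number of categories in the framework

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=num_labels,
    problem_type="multi_label_classification",  # multilabel setup, as in this project
)
```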
Other things that we tried which did not make it into the final productionised version of the model:
Rules-based boosting of probabilities. This idea was suggested to us by Dan Schofield from NHS England. It works by boosting the predicted probability of a category if specific words are found in the text (see the first sketch after this list). We did implement it, and it did improve performance in a few categories! However, our stakeholders were not keen on the idea from a qualitative perspective, so it is not used in the final project outcome, although the functionality remains available in the package.
Adjusting the classification threshold. We had productive discussions with our stakeholders about the precision-recall tradeoff, and found that on the whole trusts prioritised recall over precision: they were more concerned about missing labels than about incorrectly labelled text. We tried to maintain a balance by also tracking the F1 score. The code includes the ability to manually adjust the threshold used for each category, or to use a precision-recall curve to identify the threshold producing the best F1 score (see the second sketch after this list). Ultimately we haven't used this in the final model, as we didn't find it that useful (and we also ran out of time).
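A minimal sketch of the rules-based boosting idea, assuming the model outputs an array of predicted probabilities per category; the keyword lists, category names and boost amount are illustrative, not those used in the package:

```python
import numpy as np

# Illustrative keyword rules: category name -> trigger words
RULES = {
    "Food": ["food", "meal", "breakfast"],
    "Parking": ["parking", "car park"],
}

def boost_probabilities(texts, probs, category_names, boost=0.3):
    """Boost the predicted probability of a category when any of its
    trigger words appears in the comment text.

    probs is assumed to be an (n_samples, n_categories) numpy array.
    """
    probs = probs.copy()
    for i, text in enumerate(texts):
        lowered = text.lower()
        for category, keywords in RULES.items():
            if category in category_names and any(k in lowered for k in keywords):
                j = category_names.index(category)
                probs[i, j] = min(1.0, probs[i, j] + boost)
    return probs
```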
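And a sketch of choosing a per-category threshold from a precision-recall curve by maximising F1, assuming held-out true labels and predicted probabilities as (n_samples, n_categories) arrays; this is not the package's actual implementation:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_thresholds(y_true, y_prob):
    """Return, for each category, the threshold that maximises F1 on the
    held-out data."""
    thresholds_out = []
    for j in range(y_true.shape[1]):
        precision, recall, thresholds = precision_recall_curve(y_true[:, j], y_prob[:, j])
        f1 = 2 * precision * recall / np.clip(precision + recall, 1e-9, None)
        # precision/recall have one more element than thresholds; drop the last point
        thresholds_out.append(thresholds[np.argmax(f1[:-1])])
    return thresholds_out
```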
Other things to try in future
Andreas, who worked on phase 1 of the project, shared this paper with us - it would be worth dedicating a couple of weeks to it, to see if the approach works! https://aclanthology.org/2023.findings-acl.426/
Reinforcement learning: it would be cool if users of the dashboard could correct the labels given by the model, and if this feedback were then used to retrain it!
Few shot approaches: We have been using the framework developed by NHS England - but what if trusts want to use their own frameworks? We could create a pipeline that enables trusts to provide their own framework, with a few examples for each category, and train models in this way (a rough sketch of one possible few-shot approach is given after this list).
Model interpretability: I think we should use SHAP to highlight which words contributed most to the labels given to a text, to help people understand how the models work (see the SHAP sketch after this list).
More nuanced approach to sentiment and labelling: One issue we had was that very long comments were treated in exactly the same way as very short ones. Comments with mixed sentiment (e.g. “the nurse was very nice! Toilets were disgusting”) just came out as “neutral / mixed”, putting them in the same category as a truly neutral comment like “everything ok”. Ideally we should explore splitting comments into sentences, or highlighting exactly which elements of a sentence are positive or negative in sentiment, and perhaps linking this with the multilabel categories (see the sentence-splitting sketch after this list).
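On the few-shot idea, one possible approach (a sketch under assumptions, not a worked-out design) would be for a trust to supply a handful of example comments per category, and for new comments to be classified by similarity to those examples using sentence embeddings. The sentence-transformers model and the framework dictionary below are placeholders.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical trust-supplied framework: category -> a few example comments
framework = {
    "Staff attitude": ["The nurse was very kind", "Doctors were rude and dismissive"],
    "Environment": ["The ward was filthy", "Toilets were spotless"],
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

# Build one centroid embedding per category from the few examples provided
centroids = {
    label: model.encode(examples).mean(axis=0)
    for label, examples in framework.items()
}

def classify(text: str) -> str:
    """Assign the category whose centroid is closest by cosine similarity."""
    emb = model.encode(text)
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(centroids, key=lambda label: cos(emb, centroids[label]))

print(classify("The consultant didn't listen to me at all"))
```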
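For the interpretability point, SHAP has a text explainer that works with Hugging Face pipelines; a minimal sketch is below. The sentiment checkpoint is just a publicly available stand-in for the project's own model, and the exact wrapper details may need adjusting.

```python
import shap
from transformers import pipeline

# Stand-in text classifier; returns scores for all labels
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    top_k=None,
)

explainer = shap.Explainer(classifier)
shap_values = explainer(["The nurse was very nice! Toilets were disgusting"])
shap.plots.text(shap_values)  # word-level contributions, per label
```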
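Finally, a sketch of the sentence-splitting idea: break each comment into sentences (spaCy's sentence segmenter would do) and score each sentence separately, so mixed comments are no longer collapsed into a single “neutral / mixed” label. score_sentiment here is a stand-in for whatever sentence-level model would be used.

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def sentence_sentiments(comment, score_sentiment):
    """Split a comment into sentences and score each one separately."""
    doc = nlp(comment)
    return [(sent.text, score_sentiment(sent.text)) for sent in doc.sents]

# e.g. sentence_sentiments("The nurse was very nice! Toilets were disgusting", model.predict)
# would give one (sentence, sentiment) pair per sentence rather than one
# "neutral / mixed" label for the whole comment.
```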