Project outcomes: Key takeaways

What didn’t work, and potential developments

The Multinomial Naive Bayes, Random Forest, and K-Nearest Neighbours classifiers all showed very poor performance.

For the sklearn models, we only tried the spaCy and TF-IDF vectorizers; of the two, TF-IDF produced the better results. Perhaps another vectorizer would have been more effective? Previous work has shown that pretrained embeddings generally perform poorly on this type of text because of its specialised vocabulary. Still, it would have been interesting to try other ways of vectorizing the text - for example, the custom embeddings produced by the first few layers of the DistilBERT model.
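As a rough illustration of that last idea, here is a minimal sketch of using DistilBERT's own hidden states as document vectors for the sklearn classifiers, in place of TF-IDF. For simplicity it mean-pools the final hidden layer (an earlier layer could be picked via `output_hidden_states=True`); the checkpoint name and the downstream classifier are illustrative, not what we actually ran.

```python
# Sketch: use DistilBERT's hidden states as document vectors for a sklearn classifier.
# Checkpoint and classifier are illustrative, not the productionized setup.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(texts: list[str]) -> np.ndarray:
    """Mean-pool the last hidden layer to get one fixed-length vector per comment."""
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state      # (batch, tokens, 768)
        mask = batch["attention_mask"].unsqueeze(-1)     # ignore padding tokens
        pooled = (hidden * mask).sum(1) / mask.sum(1)    # mean over real tokens only
    return pooled.numpy()

# Drop-in replacement for the TF-IDF step: vectorize, then fit any sklearn model.
X_train = embed(["The nurse was very nice", "Parking was impossible"])
clf = LogisticRegression(max_iter=1000).fit(X_train, [1, 0])
```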

We only tried the DistilBERT pretrained model, one of the smallest transformer models, due to a lack of hardware resources. Perhaps another transformer model would be more effective? There have been big developments in LLMs in recent months; it would be worth trying Llama 2, for example. In future iterations of this project, more resources should be committed to the hardware side - we need a VM, ideally with a GPU, if we want to utilise these exciting new technologies.
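One way to keep that experiment cheap is to make the fine-tuning code checkpoint-agnostic, so that swapping in a bigger model is mostly a hardware question. A hedged sketch, with an illustrative checkpoint name and label count:

```python
# Sketch: checkpoint-agnostic model loading, so a larger transformer can be swapped in
# once GPU-backed hardware is available. Checkpoint and num_labels are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "roberta-base"  # e.g. replace "distilbert-base-uncased" with a larger model
device = "cuda" if torch.cuda.is_available() else "cpu"  # a GPU-backed VM would land here

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=10  # illustrative: one output per framework category
).to(device)
```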

Other things that we tried, which did not make it into the final productionized version of the model:

Other things to try in future

  • Andreas, who worked on phase 1 of the project, shared this paper with us - it would be worth dedicating a couple of weeks to it, to see if the approach works: https://aclanthology.org/2023.findings-acl.426/

  • Reinforcement learning: it would be cool if users of the dashboard could correct the labels given by the model, and these corrections were then fed back into the model to retrain it! (A minimal sketch of such a feedback loop follows this list.)

  • Few-shot approaches: we have been using the framework developed by NHS England - but what if trusts want to use their own frameworks? We could build a pipeline that lets a trust supply its own framework, with a handful of example comments per category, and trains a model from those examples (see the zero-/few-shot sketch after this list).

  • Model interpretability: I think we should use SHAP to highlight which words contributed most to the labels given to a text, to help people understand how the models work (see the SHAP sketch after this list).

  • More nuanced approach to sentiment and labelling: one issue we had was that very long comments were treated in exactly the same way as very short ones. As a result, comments with mixed sentiment (e.g. “the nurse was very nice! Toilets were disgusting”) came out as “neutral / mixed”, putting them in the same category as a truly neutral comment like “everything ok”. We should explore splitting comments into sentences, or highlighting exactly which parts of a comment are positive or negative, and perhaps linking this to the multilabel categories - a sentence-level sketch follows this list.
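On the feedback idea: what we describe is closer to periodic human-in-the-loop retraining than reinforcement learning in the strict sense. A minimal sketch, assuming corrections arrive from the dashboard as a dataframe with the same `comment` and `label` columns as the training data; the column names and the TF-IDF + logistic regression pipeline are illustrative.

```python
# Sketch of the feedback loop: dashboard corrections are appended to the training data
# and the model is periodically refitted. Column names and pipeline are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def retrain_with_corrections(train_df: pd.DataFrame, corrections_df: pd.DataFrame):
    """Merge user-corrected labels into the training set and refit from scratch."""
    combined = pd.concat([train_df, corrections_df], ignore_index=True)
    # Corrections should win where the same comment appears twice.
    combined = combined.drop_duplicates(subset="comment", keep="last")
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(combined["comment"], combined["label"])
    return model
```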
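On the few-shot idea: one way to let a trust bring its own framework is zero-/few-shot classification, where the category names become configuration rather than something baked into a trained model. A sketch using the Hugging Face zero-shot pipeline; the checkpoint and the example categories are illustrative.

```python
# Sketch: a trust supplies its own category names; a zero-shot model scores each
# comment against them with no task-specific training data required up front.
from transformers import pipeline

trust_framework = ["staff attitude", "waiting times", "food", "cleanliness"]

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "The nurse was very nice! Toilets were disgusting",
    candidate_labels=trust_framework,
    multi_label=True,  # a comment can touch several categories at once
)
print(sorted(zip(result["labels"], result["scores"]), key=lambda x: -x[1]))
```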
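On interpretability: SHAP can wrap a transformers text-classification pipeline and show how much each token pushed the prediction towards each label. A sketch, using an off-the-shelf sentiment checkpoint as a stand-in for our fine-tuned model:

```python
# Sketch: wrap the trained text-classification pipeline in a SHAP explainer and render
# token-level contributions per label. The checkpoint is a stand-in for our own model.
import shap
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    top_k=None,  # return scores for every label, which SHAP needs
)

explainer = shap.Explainer(clf)
shap_values = explainer(["The nurse was very nice! Toilets were disgusting"])

# Highlights each token's contribution, per label, in a notebook.
shap.plots.text(shap_values)
```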
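On the sentence-level idea: a simple first step would be to split each comment into sentences (e.g. with spaCy's rule-based sentencizer, since we already use spaCy) and score sentiment per sentence, so a mixed comment keeps both its positive and negative parts. A sketch, where `sentiment_model` is a placeholder for whichever sentence-level sentiment scorer we plug in:

```python
# Sketch: score sentiment per sentence rather than per comment, so mixed comments are
# not flattened into "neutral / mixed". `sentiment_model` is a hypothetical scorer.
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # lightweight rule-based sentence splitting

def sentence_level_sentiment(comment: str, sentiment_model) -> list[tuple[str, str]]:
    """Return (sentence, sentiment) pairs instead of a single label per comment."""
    sentences = [sent.text.strip() for sent in nlp(comment).sents]
    return [(s, sentiment_model(s)) for s in sentences]

# e.g. "The nurse was very nice! Toilets were disgusting" ->
# [("The nurse was very nice!", "positive"), ("Toilets were disgusting", "negative")]
```

The same per-sentence output could then be linked back to the multilabel categories, so each framework label is supported by the specific sentence that triggered it.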