Text mining
What it can and can’t do, and what it could do
27 November 2024
Intro
- Range of possible applications
- There is a range of caveats, depending on the use case
- Some of them are more relevant than others
- Some of them are “free”, and some aren’t
What is text mining?
- A variety of supervised and unsupervised methods
- The uses we’ll discuss today have no “understanding” of text (no “intelligence”)
- They therefore all have the potential to be very inaccurate (e.g. missing negation: treating “not clean” the same as “clean”)
- But they can be a low cost way of gathering insight, properly applied
“I just want a bit of help sorting”
- This is one of the safer things to do
- There is a range of methods
- Some will give you some sense of the “theme” of the piles, and some won’t
- Great for: sifting, looking at relative size of piles, novel suggestions for themes for text
- Bad for: accuracy, control over what’s in the piles
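A minimal sketch of the sorting idea, assuming scikit-learn is installed; the comments, the choice of KMeans over TF-IDF vectors, and the number of piles are all illustrative, not a recommendation:

```python
# Minimal sketch: cluster free-text comments into "piles" and peek at the
# top terms near each cluster centre as a rough "theme". Data is made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

comments = [
    "The nurses were kind and helpful",
    "Waited four hours to be seen",
    "Staff explained everything clearly",
    "Parking was impossible to find",
    "Very long wait in the corridor",
    "Clear explanation of my medication",
]

vectoriser = TfidfVectorizer(stop_words="english")
X = vectoriser.fit_transform(comments)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

terms = vectoriser.get_feature_names_out()
for i, centre in enumerate(km.cluster_centers_):
    top = centre.argsort()[-3:][::-1]  # highest-weighted terms in this pile
    print(f"Pile {i}: {', '.join(terms[t] for t in top)}")
```

Note the caveat above in action: you get piles and rough themes, but no control over what lands in each pile.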
“I just want to find useful examples”
- Also quite a safe application, and one we implemented for patient experience
- The algorithm is just helping you to find things with a particular theme or sentiment
- You bring the understanding
- Many ways of achieving this, from easy to difficult
- Unsupervised and supervised
- Searching for strings vs semantic search
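A sketch of the two search styles, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model (downloaded on first use); the comments and query are invented:

```python
# Two ways to pull out examples on a theme. String matching is exact but
# brittle; embedding search can catch paraphrases.
from sentence_transformers import SentenceTransformer, util

comments = [
    "No parking anywhere near the entrance",
    "The car park was full when we arrived",
    "Food was cold by the time it reached the ward",
]

# 1. Searching for strings: finds "parking" but misses "car park"
print([c for c in comments if "parking" in c.lower()])

# 2. Semantic search: rank comments by cosine similarity to a query
model = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = model.encode("problems with parking", convert_to_tensor=True)
comment_vecs = model.encode(comments, convert_to_tensor=True)
scores = util.cos_sim(query_vec, comment_vecs)[0]
for score, comment in sorted(zip(scores.tolist(), comments), reverse=True):
    print(f"{score:.2f}  {comment}")
```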
Generating summary statistics
- This needs to be done with care
- You can potentially lose a lot of nuance and meaning
- Even the best model is probably only around 80% accurate
- Useful for monitoring and making rough estimates about the size of things
- Not suitable for anything that needs accuracy (e.g. safeguarding)
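A toy sketch of the idea, assuming scikit-learn; the labels, comments, and tiny training set are invented, and real use would need far more labelled data plus a proper accuracy check:

```python
# Toy sketch: train a classifier on labelled comments, then count predicted
# themes in new text. With ~80% accuracy in practice, treat the resulting
# counts as rough estimates, never exact figures.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "Waited hours to be seen", "The queue was enormous",
    "Staff were rude to my mother", "The nurse was dismissive",
    "Lovely clean ward", "Spotless and tidy room",
]
train_labels = ["waiting", "waiting", "staff", "staff",
                "environment", "environment"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

new_comments = ["Waited ages in the queue", "The room was spotless"]
print(Counter(clf.predict(new_comments)))  # rough theme counts
```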
“Free”
- Unsupervised learning is “free”: no labelling necessary
- Arguably you may as well run it speculatively on everything
- However there are lots of models and parameters to set
- So “free” is not really “free”
What’s going on?
- Text models are only as sensible as their inputs
- We call the algorithm that turns text to numbers a “vectoriser”
Bag of words vs TF-IDF
- Bag of words just counts the number of times each word appears
- Crude but effective
- TF-IDF works similarly but down-weights words that appear in many documents, so rarer, more distinctive words count for more, which can help with topic modelling and classification
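A minimal comparison on a toy corpus, assuming scikit-learn; the documents are invented:

```python
# The same toy corpus through both vectorisers. Bag of words gives raw
# counts; TF-IDF down-weights "the", which appears in every document.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the staff were kind", "the wait was long", "the staff were rude"]

bow = CountVectorizer()
counts = bow.fit_transform(docs)
print(bow.get_feature_names_out())
print(counts.toarray())                              # plain word counts

tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))  # "the" scores lower
```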
Word embeddings
[Figure: word embeddings, from Sutor et al., reproduced under fair use]
Word embeddings cont.
- There are some smallish static ones (like GloVe) and some huge contextual ones (like BERT)
- Of the approaches here, embeddings are the only ones that encode meaning and context rather than just counting words
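A quick sketch of what the smaller embeddings look like in practice, assuming the gensim package; the pre-trained GloVe model downloads on first use:

```python
# Poke at smallish pre-trained GloVe vectors. Assumes the gensim package;
# the model (~66 MB) downloads on first use.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")   # 50-dimensional word vectors

print(glove["doctor"][:5])                   # a word is just a list of numbers
print(glove.most_similar("doctor", topn=3))  # nearby vectors are related words
```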
The future
- Several possible directions suggest themselves
- Use of topic models to explore and search
- Training of a supervised model for a particular project
The dream
- A zero-shot model with human-in-the-loop learning
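A sketch of the zero-shot half, assuming the transformers package and the facebook/bart-large-mnli checkpoint; the comment and candidate labels are invented, and the human-in-the-loop part would sit around this, feeding corrections back as training data:

```python
# Sketch of the zero-shot half: label a comment against themes the model was
# never trained on. The checkpoint is a large download on first use.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")
result = classifier(
    "We waited three hours and nobody told us what was happening",
    candidate_labels=["waiting times", "staff attitude", "communication"],
)
print(result["labels"][0], round(result["scores"][0], 2))

# The human-in-the-loop half (not shown): route low-confidence predictions
# to a person, and keep their corrections as labelled training data.
```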