Text mining

What it can and can’t do, and what it could do

27 November 2024

Intro

  • Range of possible applications
  • The caveats vary depending on the use case
  • Some caveats are more relevant than others
  • Some of the methods are “free”, and some aren’t

What is text mining?

  • A variety of supervised and unsupervised methods
  • The uses we’ll discuss today have no “understanding” of text (no “intelligence”)
  • They therefore have the potential to be very inaccurate (e.g. mishandling negation, as in the sketch below)
  • But, properly applied, they can be a low-cost way of gathering insight
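A minimal sketch of the negation problem: a naive keyword counter (the lexicon here is invented for illustration, not taken from any real tool) scores “not good” as positive.

    # Naive keyword sentiment: counts positive and negative words, blind to negation
    POSITIVE = {"good", "great", "helpful"}
    NEGATIVE = {"bad", "poor", "rude"}

    def naive_sentiment(text: str) -> int:
        """Score text by counting positive minus negative words."""
        words = text.lower().split()
        return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

    print(naive_sentiment("the staff were not good"))   # 1: wrongly positive
    print(naive_sentiment("the staff were very good"))  # 1: same score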

“I just want a bit of help sorting”

  • This is one of the safer things to do
  • There are a range of methods
  • Some will give you some sense of the “theme” of the piles, and some won’t
  • Great for: sifting, comparing the relative sizes of piles, and suggesting novel themes in the text (see the sketch after this list)
  • Bad for: accuracy, and control over what ends up in each pile
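A minimal sketch of pile sorting, assuming scikit-learn: k-means clusters TF-IDF vectors, and the top terms of each cluster centre give a rough “theme”. The sample comments and the choice of three piles are illustrative assumptions.

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    comments = [
        "Parking was impossible and the car park was full",
        "Could not find anywhere to park",
        "The nurses were kind and helpful",
        "Staff were friendly and caring",
        "Waited three hours to be seen",
        "The waiting time was far too long",
    ]

    vectoriser = TfidfVectorizer(stop_words="english")
    X = vectoriser.fit_transform(comments)
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

    # The highest-weighted terms in each cluster centre hint at the pile's theme
    terms = vectoriser.get_feature_names_out()
    for i, centre in enumerate(km.cluster_centers_):
        print(f"Pile {i}:", [terms[j] for j in centre.argsort()[::-1][:3]])

Rerunning with a different random_state or number of clusters can produce quite different piles, which is the lack of control mentioned above.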

“I just want to find useful examples”

  • Also quite a safe application, and one we implemented for patient experience
  • The algorithm is just helping you to find things with a particular theme or sentiment
  • You bring the understanding
  • Many ways of achieving this, from easy to difficult
    • Unsupervised and supervised
    • Searching for strings vs semantic search (contrasted in the sketch below)
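A minimal sketch of the contrast, assuming the sentence-transformers library; the model name "all-MiniLM-L6-v2" and the comments are illustrative choices.

    from sentence_transformers import SentenceTransformer, util

    comments = [
        "Nowhere to leave the car, drove round for ages",
        "Parking was fine",
        "Lovely staff, very caring",
    ]

    # String search misses the first comment, which never uses the word "parking"
    print([c for c in comments if "parking" in c.lower()])

    # Semantic search ranks by meaning, so the first comment scores highly too
    model = SentenceTransformer("all-MiniLM-L6-v2")
    query = model.encode("problems with parking", convert_to_tensor=True)
    docs = model.encode(comments, convert_to_tensor=True)
    scores = util.cos_sim(query, docs)[0].tolist()
    for score, comment in sorted(zip(scores, comments), reverse=True):
        print(f"{score:.2f}  {comment}")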

Generating summary statistics

  • This needs to be done with care
  • You can potentially lose a lot of nuance and meaning
  • Even the best model is probably only around 80% accurate, which distorts counts (see the worked example below)
  • Useful for monitoring and for making rough estimates of the size of things
  • Not suitable for anything that needs accuracy (e.g. safeguarding)
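A worked example of how imperfect accuracy distorts counts, with assumed numbers: a theme truly appears in 10% of comments, and the classifier has 80% sensitivity and 80% specificity.

    def apparent_rate(true_rate: float, sensitivity: float, specificity: float) -> float:
        """Share of comments the classifier will flag as having the theme."""
        return sensitivity * true_rate + (1 - specificity) * (1 - true_rate)

    print(apparent_rate(0.10, 0.80, 0.80))  # 0.26: the theme looks 2.6x bigger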

“Free”

  • Unsupervised learning is “free”: no labelling is necessary
  • Arguably, then, you may as well apply it speculatively to everything
  • However there are lots of models and parameters to set
  • So “free” is not really “free”

What’s going on?

  • Text models are only as sensible as their inputs
  • We call the algorithm that turns text to numbers a “vectoriser”

Bag of words vs TF-IDF

  • Bag of words just counts the number of times each word appears
  • Crude but effective
  • TF-IDF works similarly, but down-weights words that appear in many documents so that rarer words count for more, which can help with topic modelling and classification (both vectorisers are sketched below)
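A minimal sketch of both vectorisers using scikit-learn; the toy documents are illustrative assumptions.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["the ward was clean", "the ward was noisy", "parking was awful"]

    # Bag of words: raw counts of each term in each document
    bow = CountVectorizer()
    print(bow.fit_transform(docs).toarray())
    print(bow.get_feature_names_out())

    # TF-IDF: counts reweighted so terms that appear everywhere ("the", "was")
    # matter less and rarer, more telling terms matter more
    tfidf = TfidfVectorizer()
    print(tfidf.fit_transform(docs).toarray().round(2))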

Word embeddings

[Figure: word embedding illustration from Sutor et al., reproduced under fair use]

Word embeddings cont.

  • There are some smallish ones (like GloVe) and some huge ones (like BERT)
  • Embedding vectors can capture the meaning and context of words in a way that simple counts cannot (see the sketch below)
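A minimal sketch of querying a small pre-trained embedding via gensim's downloader; "glove-wiki-gigaword-50" is a real dataset name there, but treat the printed neighbours as illustrative.

    import gensim.downloader

    glove = gensim.downloader.load("glove-wiki-gigaword-50")  # ~65 MB download
    print(glove.most_similar("nurse", topn=3))
    print(glove.similarity("doctor", "nurse"))    # high: related meanings
    print(glove.similarity("doctor", "parking"))  # low: unrelated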

The future

  • A number of different things suggest themselves
  • Use of topic models to explore and search
  • Training of a supervised model for a particular project

The dream

  • A zero-shot model with human-in-the-loop learning (sketched below)
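A minimal sketch of that dream, assuming the Hugging Face transformers library; the model name is a commonly used zero-shot example, not a specific recommendation.

    from transformers import pipeline

    classifier = pipeline("zero-shot-classification",
                          model="facebook/bart-large-mnli")

    comment = "I waited four hours and nobody told me what was happening"
    labels = ["waiting times", "communication", "parking", "staff attitude"]

    # A human in the loop would confirm or correct these suggested labels,
    # and the corrections could later train a cheaper supervised model
    result = classifier(comment, candidate_labels=labels, multi_label=True)
    for label, score in zip(result["labels"], result["scores"]):
        print(f"{score:.2f}  {label}")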