Text mining

Some practical examples

26 November 2024

Text vectorisation: Turning words into numbers

  • Computers cannot do statistics with words as raw text
  • The foundation of all Natural Language Processing!
  • Huge range in complexity

Bag of words: Each word is represented by 1 number

  • ‘I love to run’
  • ‘the cat does not eat fruit’
  • ‘run to the cat’
  • ‘I love to eat fruit. fruit fruit fruit fruit’
A table showing the counts of each of the words
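
The counts for the four sentences above can be reproduced with a short sketch. It uses only the Python standard library. The `tokenize` helper and its lowercase, letters-only rule are assumptions made for this illustration, not necessarily what the session's own code does.

```python
import re
from collections import Counter

# The four example sentences from the slide.
documents = [
    "I love to run",
    "the cat does not eat fruit",
    "run to the cat",
    "I love to eat fruit. fruit fruit fruit fruit",
]

def tokenize(text):
    """Lowercase the text and keep runs of letters as words (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())

# Count the words in each sentence separately.
counts = [Counter(tokenize(doc)) for doc in documents]

# The vocabulary is every distinct word, sorted so the columns have a fixed order.
vocabulary = sorted(set(word for c in counts for word in c))

# One row per sentence, one column per vocabulary word.
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each sentence becomes a row of ten numbers, one per vocabulary word. The last sentence has a 5 in the 'fruit' column, and word order is lost entirely.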

Word embeddings (50-300 numbers per word)

A chart. Each word is represented by a dot on the chart. Words relating to fish are clustered on the right, and words relating to food are clustered at the top.¹
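
Words that sit close together on a chart like this have similar vectors. A common way to measure that closeness is cosine similarity. The sketch below uses made-up 3-number vectors purely for illustration; real embeddings are learned from large amounts of text.

```python
import math

# Made-up 3-dimensional vectors, for illustration only. Real embeddings
# are learned from text and have 50-300 numbers per word.
embeddings = {
    "salmon": [0.9, 0.1, 0.3],
    "trout":  [0.8, 0.2, 0.4],
    "bread":  [0.1, 0.9, 0.2],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1 means 'pointing the same way'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

fish_pair = cosine_similarity(embeddings["salmon"], embeddings["trout"])
mixed_pair = cosine_similarity(embeddings["salmon"], embeddings["bread"])

print(f"salmon vs trout: {fish_pair:.2f}")
print(f"salmon vs bread: {mixed_pair:.2f}")
```

With these toy vectors the two fish words score much higher against each other than a fish word does against a food word.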

Attention mechanism: 768 × 3 numbers per word (a query, a key and a value vector)

  • Basis of Large Language Models!
  • Attempts to capture the relationship between words
  • Huge computational power required
A diagram with the words 'Sally loved reading'. Each of the words has three arrows pointing from it, pointing at the letters Q, K and V
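
The Q, K and V in the diagram stand for query, key and value. A minimal sketch of scaled dot-product attention, in plain Python, shows how they combine. The 2-number vectors here are invented for illustration; in a real model they come from multiplying each word's embedding by learned weight matrices, and there are hundreds of numbers per vector.

```python
import math

def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this word's query with every word's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Blend the value vectors using the attention weights.
        output = [sum(w * v[j] for w, v in zip(weights, values))
                  for j in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up Q, K and V vectors for 'Sally', 'loved', 'reading'.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = attention(Q, K, V)
for word, row in zip(["Sally", "loved", "reading"], weights):
    print(word, [round(w, 2) for w in row])
```

Each printed row shows how much one word attends to every word in the sentence. The weights in a row always sum to 1.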

Text classification

  • Supervised learning - we need examples that have already been labelled
  • Sentiment analysis - whether a review is positive or negative
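
To make the idea concrete, here is a minimal sketch of a sentiment classifier: a small naive Bayes model written with the standard library only. The six labelled reviews are invented for this illustration, and naive Bayes is just one possible choice of model, not necessarily the one used in the session's code.

```python
import math
import re
from collections import Counter

# Hypothetical labelled reviews, invented for illustration.
training_data = [
    ("I loved this film, it was great", "positive"),
    ("great acting and a lovely story", "positive"),
    ("what a wonderful, fun evening", "positive"),
    ("I hated this film, it was awful", "negative"),
    ("boring story and terrible acting", "negative"),
    ("what a dull, awful evening", "negative"),
]

def tokenize(text):
    """Lowercase the text and keep runs of letters as words."""
    return re.findall(r"[a-z]+", text.lower())

# Count how often each word appears under each label.
word_counts = {"positive": Counter(), "negative": Counter()}
label_counts = Counter()
for text, label in training_data:
    word_counts[label].update(tokenize(text))
    label_counts[label] += 1

vocabulary = set(word for c in word_counts.values() for word in c)

def predict(text):
    """Return the label with the highest naive Bayes log-probability."""
    best_label, best_score = None, -math.inf
    for label in word_counts:
        # Start from the prior: how common this label is in the training data.
        score = math.log(label_counts[label] / sum(label_counts.values()))
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids log(0) for words unseen under this label.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("a great and fun film"))
print(predict("terrible and boring"))
```

The model only learns from the labels it is given. With six reviews it knows very few words, which is why real classifiers need many labelled examples.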

Over to the code!

How do we know how good a model is?

We use performance metrics like:

  • Accuracy
  • Precision
  • Recall
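
All three metrics can be computed from four counts: true positives (tp), false positives (fp), true negatives (tn) and false negatives (fn). The sketch below uses hypothetical counts chosen only to show the arithmetic.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    """Of everything predicted positive, the share that really was positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Of everything that really was positive, the share the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts for a model tested on 100 examples.
tp, fp, tn, fn = 40, 10, 45, 5

print(f"accuracy  = {accuracy(tp, fp, tn, fn):.2f}")
print(f"precision = {precision(tp, fp):.2f}")
print(f"recall    = {recall(tp, fn):.2f}")
```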

²

What’s the accuracy for this model?

The model’s accuracy is 91%!

But is there something wrong with this model?
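
One way a high accuracy can hide a problem is class imbalance. The sketch below is a hypothetical illustration, not necessarily the model shown on the slide: a "model" that always predicts negative on a test set where only 9 of 100 examples are positive.

```python
# Hypothetical test set: 9 positive cases among 100 examples.
true_labels = [1] * 9 + [0] * 91

# A "model" that ignores its input and always predicts negative.
predictions = [0] * 100

correct = sum(t == p for t, p in zip(true_labels, predictions))
true_positives = sum(t == 1 and p == 1 for t, p in zip(true_labels, predictions))
actual_positives = sum(true_labels)

accuracy = correct / len(true_labels)
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.91
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

The accuracy looks good, yet the model never finds a single positive case.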

🐟 Different metrics for different purposes³

Recall

  • A model for cancer screening (positive = potential cancer)
  • Cost of false negative is higher than cost of false positive
  • 🥅 Fishing with a net (more fish, some rocks are ok)

Precision

  • A model for identifying safe seatbelts (positive = safe)
  • Cost of false positive is higher than cost of false negative
  • 🎣 Fishing with a spear (fewer fish, but fewer rocks too)
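
The net and the spear can be sketched as two decision thresholds applied to the same model scores. The scores and labels below are hypothetical, and the two threshold values are arbitrary choices for the illustration.

```python
# Hypothetical model scores (probability of "positive") with their true labels.
scored_examples = [
    (0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 1),
    (0.50, 0), (0.40, 1), (0.30, 0), (0.20, 0), (0.10, 0),
]

def precision_recall(threshold):
    """Label everything scoring at or above the threshold as positive."""
    tp = sum(1 for score, label in scored_examples if score >= threshold and label == 1)
    fp = sum(1 for score, label in scored_examples if score >= threshold and label == 0)
    fn = sum(1 for score, label in scored_examples if score < threshold and label == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

net = precision_recall(0.35)    # low threshold: catch everything, accept some rocks
spear = precision_recall(0.85)  # high threshold: only strike when very sure

print(f"net:   precision = {net[0]:.2f}, recall = {net[1]:.2f}")
print(f"spear: precision = {spear[0]:.2f}, recall = {spear[1]:.2f}")
```

Lowering the threshold raises recall at the cost of precision. Raising it does the opposite.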

Topic Modelling

  • Unsupervised learning - the model has no labelled examples to learn from

Over to the code!

Topic Modelling pros and cons

  • How do you evaluate the performance of a topic model?
  • Can work well sometimes
  • Black box

Conclusion

  • Our code examples today were very basic
  • Text mining is not magic
  • Fancier models = Fancier tasks!

Footnotes

  1. Openclassrooms.com

  2. Evidently AI

  3. pxtextmining