Text mining

Some practical examples

26 November 2024

Text vectorisation: Turning words into numbers

Computers cannot do statistics on raw text; the words must first be turned into numbers
The basic foundation of all Natural Language Processing!
Methods range hugely in complexity, from simple word counts to neural networks

Bag of words: each word is represented by one number (its count)

‘I love to run’
‘the cat does not eat fruit’
‘run to the cat’
‘I love to eat fruit. fruit fruit fruit fruit’

A table showing the counts of each of the words
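The word counts for the four example sentences can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library. The `tokenize` helper is an assumption (it lowercases the text and drops punctuation), so the exact columns may differ from the table on the slide.

```python
import re
from collections import Counter

documents = [
    "I love to run",
    "the cat does not eat fruit",
    "run to the cat",
    "I love to eat fruit. fruit fruit fruit fruit",
]

def tokenize(text):
    # Assumption: lowercase everything and keep only runs of letters,
    # so "fruit." and "fruit" count as the same word.
    return re.findall(r"[a-z]+", text.lower())

# One Counter per document: word -> number of occurrences
counts = [Counter(tokenize(doc)) for doc in documents]

# The vocabulary is every distinct word across all documents, sorted
vocabulary = sorted({word for c in counts for word in c})

# One row per document, one column per vocabulary word
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each row is the "bag" for one sentence: word order is lost and only the counts remain.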

Word embeddings (50-300 numbers per word)

A chart. Each word is represented by a dot on the chart. Words relating to fish are clustered on the right, words relating to food are clustered at the top. ¹
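"Clustered together" on the chart corresponds to vectors that point in similar directions, which is commonly measured with cosine similarity. The sketch below uses hypothetical 3-number vectors invented for illustration; real embeddings have 50-300 numbers per word and are learned from data, not typed in by hand.

```python
import math

# Hypothetical toy vectors, invented for illustration only.
embeddings = {
    "salmon": [0.9, 0.3, 0.1],
    "trout": [0.8, 0.2, 0.2],
    "bread": [0.1, 0.9, 0.2],
}

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

fish_pair = cosine_similarity(embeddings["salmon"], embeddings["trout"])
mixed_pair = cosine_similarity(embeddings["salmon"], embeddings["bread"])

print(f"salmon vs trout: {fish_pair:.2f}")
print(f"salmon vs bread: {mixed_pair:.2f}")
```

With these toy vectors the two fish words score higher than the fish and food pair, which is the pattern the chart illustrates.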

Attention mechanism: 768 × 3 numbers per word (a query, a key and a value vector)

Basis of Large Language Models!
Attempts to capture the relationship between words
Huge computational power required

A diagram with the words 'Sally loved reading'. Each of the words has three arrows pointing from it, pointing at the letters Q, K and V
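Q, K and V in the diagram usually stand for query, key and value vectors. The sketch below shows one attention step (scaled dot-product attention, single head) for the three words in the diagram. It is an illustration under loud assumptions: the input vectors and weight matrices are random, not trained, and each vector has 4 numbers, not 768, so the output is easy to read.

```python
import math
import random

random.seed(0)

words = ["Sally", "loved", "reading"]
d_model = 4  # Assumption: tiny stand-in for the 768 numbers on the slide

def random_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    # Plain matrix multiplication on lists of lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    # Subtract the maximum for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical word vectors and projection weights (random, untrained).
x = random_matrix(len(words), d_model)
w_q, w_k, w_v = (random_matrix(d_model, d_model) for _ in range(3))

# Each word gets three vectors: a query, a key and a value.
q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)

# scores[i][j] measures how strongly word i attends to word j.
k_transposed = [list(col) for col in zip(*k)]
scores = matmul(q, k_transposed)
weights = [softmax([s / math.sqrt(d_model) for s in row]) for row in scores]

# Each output vector is a weighted mix of all the value vectors.
output = matmul(weights, v)

for word, row in zip(words, weights):
    print(word, [round(w, 2) for w in row])
```

Each printed row sums to 1 and shows how one word's attention is spread across all three words.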

Text classification

Supervised learning - we need examples that have already been labelled
Sentiment analysis - for example, predicting whether a review is positive or negative
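To make "learning from labelled examples" concrete, here is a minimal sketch of a sentiment classifier built on word counts. It uses a Naive Bayes classifier written with the standard library only; this is a swapped-in technique chosen for brevity, not necessarily the model used in the talk's code. The reviews are hypothetical and invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical labelled reviews, invented for illustration only.
training_data = [
    ("I loved this film, it was great", "positive"),
    ("great acting and a wonderful story", "positive"),
    ("what a wonderful, fun evening", "positive"),
    ("I hated this film, it was boring", "negative"),
    ("terrible acting and a boring story", "negative"),
    ("what a terrible waste of an evening", "negative"),
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    word_counts = {}          # label -> Counter of words seen with that label
    label_counts = Counter()  # label -> number of training examples
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocabulary = {word for c in word_counts.values() for word in c}
    return word_counts, label_counts, vocabulary

def predict(text, model):
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Log prior plus log likelihood of each known word (add-one smoothing).
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(counts.values())
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(training_data)
print(predict("a great and wonderful film", model))
print(predict("boring and terrible", model))
```

With only six training examples this is a toy: it illustrates the supervised workflow (label, train, predict), not a usable model.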

Over to the code!

How do we know how good a model is?

We use performance metrics like:

Accuracy
Precision
Recall

What’s the accuracy for this model?

The model’s accuracy is 91%!

But is there something wrong with this model?
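One way a model can reach 91% accuracy and still be a poor model is when the positive class is rare. The model's actual confusion matrix is not reproduced here, so the counts below are hypothetical, chosen only so that the accuracy comes out at 91%.

```python
# Hypothetical confusion-matrix counts, invented to give 91% accuracy.
true_positives = 1    # positives the model found
false_negatives = 8   # positives the model missed
false_positives = 1   # negatives wrongly flagged as positive
true_negatives = 90   # negatives correctly left alone

total = true_positives + false_negatives + false_positives + true_negatives

# Accuracy: share of all predictions that were correct.
accuracy = (true_positives + true_negatives) / total

# Precision: of everything flagged positive, how much really was positive?
precision = true_positives / (true_positives + false_positives)

# Recall: of everything really positive, how much did the model find?
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy:  {accuracy:.2f}")
print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
```

In this made-up case the model misses 8 of the 9 real positives, yet accuracy still looks high because almost everything is negative.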

🐟 Different metrics for different purposes³

Recall

A model for cancer screening (positive = potential cancer)
Cost of false negative is higher than cost of false positive
🥅 Fishing with a net (more fish, some rocks are ok)

Precision

A model for identifying safe seatbelts (positive = safe)
Cost of false positive is higher than cost of false negative
🎣 Fishing with a spear (fewer fish, but fewer rocks too)
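The net and the spear can be sketched as two decision thresholds applied to the same model scores. The scores and labels below are hypothetical, invented for illustration; the thresholds 0.3 and 0.8 are likewise arbitrary choices.

```python
# Hypothetical (model score, true label) pairs; label 1 = fish, 0 = rock.
examples = [
    (0.95, 1), (0.85, 1), (0.75, 0), (0.65, 1), (0.55, 0),
    (0.45, 1), (0.35, 0), (0.25, 0), (0.15, 1), (0.05, 0),
]

def precision_recall(examples, threshold):
    # Anything scoring at or above the threshold is predicted positive.
    tp = sum(1 for score, label in examples if score >= threshold and label == 1)
    fp = sum(1 for score, label in examples if score >= threshold and label == 0)
    fn = sum(1 for score, label in examples if score < threshold and label == 1)
    # Assumes at least one positive prediction and one true positive label.
    return tp / (tp + fp), tp / (tp + fn)

# Low threshold = net: catch most fish, accept some rocks.
net_precision, net_recall = precision_recall(examples, threshold=0.3)

# High threshold = spear: few catches, but almost no rocks.
spear_precision, spear_recall = precision_recall(examples, threshold=0.8)

print(f"net:   precision={net_precision:.2f} recall={net_recall:.2f}")
print(f"spear: precision={spear_precision:.2f} recall={spear_recall:.2f}")
```

Lowering the threshold trades precision for recall, and raising it does the reverse; which trade is acceptable depends on which kind of error costs more.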

Topic Modelling

Unsupervised learning - the model has no labelled examples to learn from

Over to the code!

Topic Modelling pros and cons

How do you evaluate the performance of a topic model?
Can work well sometimes
Black box

Conclusion

Our code examples today were very basic
Text mining is not magic
Fancier models = Fancier tasks!

Footnotes