Project development: Performance metrics
How do we know if the model works?
When we code machine learning models, we need to be able to measure how well they are performing. Performance metrics are the ‘scoring systems’ that we use to measure a model’s predictions. We keep aside a subset of the data, called the ‘test set’, that the model has never seen during training, and compare the model’s predicted labels with the real labels for that data.
We would choose different performance metrics depending on our aims and objectives for the model. For this project, we are classifying patient comments, trying to label them by topic. Each comment can be referred to as a ‘sample’. A single-label classification model would assign each sample, or patient comment, to only one corresponding topic. In single-label classification, the most commonly used performance metrics are: accuracy, recall, precision, and f1 score.
Common terminology in classification models
The key terminology to understand is the concept of true positives, true negatives, false positives, and false negatives.
True positive is when the sample has the label, and the model correctly predicts this label. An example is given in Table 1.
| Comment | Labels given by qualitative analyst | Labels given by hypothetical model |
|---|---|---|
| “I really enjoyed dinner” | Food & diet | Food & diet |
In this example, the model has also correctly predicted that the comment is NOT any other label, such as Medication. The comment in question has not been tagged as Medication, and the model was able to correctly identify this, predicting only Food & diet as the label instead. Because it predicted a negative for the label Medication for the comment ‘I really enjoyed dinner’, and this is the same as the real labelling for the comment, this is called a true negative for that label.
In contrast with true positives and true negatives, false positives and false negatives are when the model has made a mistake. For the example in Table 2, the model has labelled the comment with Medication, although this is not a real label for the data. This means that it has made a false positive prediction for the Medication label. It has also missed that the real label is Food & diet, meaning that it has made a false negative prediction for the label Food & diet.
| Comment | Labels given by qualitative analyst | Labels given by hypothetical model |
|---|---|---|
| “I really enjoyed dinner” | Food & diet | Medication |
The total number of true positives, true negatives, false positives and false negatives are usually put together in a table called a confusion matrix.
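As a rough illustration, here is how a confusion matrix could be produced with scikit-learn. The labels below are made up purely for the sake of the example, with 1 meaning a comment has a particular label (e.g. Food & diet) and 0 meaning it does not:

```python
from sklearn.metrics import confusion_matrix

# Made-up binary labels: 1 = the comment has the label, 0 = it does not
y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # labels given by the qualitative analyst
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # labels predicted by a hypothetical model

# Rows are the actual classes, columns are the predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_true, y_pred))
```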
Single label performance metrics
Accuracy
Accuracy is usually the simplest concept to grasp. It is the total number of correct predictions divided by the total number of predictions. Its range is between 0 and 1; the closer to 1, the better.
However, it’s not always the best metric to use, particularly with imbalanced datasets. In Table 3 below, the model mostly predicts negatives. Of the 10 real positive values, it only predicts 1 correctly. The accuracy score is therefore high (0.91), but the model is not very good at capturing the positive class when it does occur.
|  | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | True Negatives: 90 | False Positives: 0 |
| Actual Positive | False Negatives: 9 | True Positives: 1 |
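Reproducing the counts from Table 3 with scikit-learn gives the same figure; this is just a sketch using dummy lists constructed to match the table:

```python
from sklearn.metrics import accuracy_score

# 90 actual negatives followed by 10 actual positives, matching Table 3
y_true = [0] * 90 + [1] * 10
# The model predicts negative for everything except the final sample
y_pred = [0] * 99 + [1]

print(accuracy_score(y_true, y_pred))  # 0.91 - high, despite missing 9 of the 10 positives
```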
Recall
Recall measures the ability of the model to detect occurrences of a class. For the example above, the model which had an accuracy of 0.91 only has a recall of 0.1. Recall is best used when it is important to identify as many occurrences of a class as possible, reducing false negatives but potentially increasing false positives. Its range is between 0 and 1; the higher the better. The mathematical formula for calculating recall is:
True Positives / (True Positives + False Negatives)
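Using the same made-up data as the accuracy sketch above, the recall for the Table 3 example can be calculated like this:

```python
from sklearn.metrics import recall_score

y_true = [0] * 90 + [1] * 10  # the Table 3 example again
y_pred = [0] * 99 + [1]

# true positives / (true positives + false negatives) = 1 / (1 + 9)
print(recall_score(y_true, y_pred))  # 0.1
```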
I like to think of optimising for recall as fishing with a net: you are happy to get false positives (e.g. stones, debris) in your catch so that you can capture more fish overall.
Scenarios where a model might be optimised for recall include where you are trying to predict fraudulent bank transactions. It might be safer to flag more transactions as fraudulent (increasing potential false positives) and check them, in order to avoid false negatives.
In the context of pxtextmining, a model that is optimised for recall would perhaps assign more ‘false positives’. So you would be likely to get lots of labels which might not actually be applicable for a particular text. For a comment like ‘I really enjoyed dinner’, you might get the real label Food & diet, but also incorrect false positive labels such as Medication and Communication as well. The model will be more indiscriminate when assigning labels, so as not to accidentally miss a label.
Precision
Precision measures the ability of a model to avoid false alarms for a class, or the confidence of a model when predicting a specific class. It is best used when it is important to be correct when identifying a class, reducing false positives but potentially increasing false negatives. Its range is between 0 and 1; the higher the better. For the example in Table 3, the precision would be 1:
True Positives / (True Positives + False Positives)
= 1 / (1 + 0)
= 1
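The same calculation in scikit-learn, again using the made-up Table 3 data:

```python
from sklearn.metrics import precision_score

y_true = [0] * 90 + [1] * 10
y_pred = [0] * 99 + [1]

# true positives / (true positives + false positives) = 1 / (1 + 0)
print(precision_score(y_true, y_pred))  # 1.0
```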
The model did not produce any false positives at all, although it did have 9 false negatives, missing 9 out of 10 of the targets of interest. As we know, the recall was poor (0.1).
A model with high precision but low recall is very good at avoiding false positives, but unfortunately it also makes a lot of false negative predictions. The analogy I like to use for understanding optimising for precision is that it is like trying to fish with a spear. You want to be sure that when you catch something, it is definitely a fish. You are trying to minimise false positives, or anything that is not a fish, but you may miss some fish in doing so (potentially increasing false negatives).
Scenarios where a model might be optimised for precision include where you are trying to predict the safety of seatbelts. It is important to avoid false positives in this scenario. The cost of throwing away a potentially good seatbelt is relatively low, compared to the cost of equipping a car with a faulty seatbelt.
In the context of pxtextmining, a model that is optimised for precision may try to be more certain about a label before predicting it for a comment. So for a comment like ‘I really enjoyed dinner and the ward was comfortable’, which has the real labels Food & diet and Environment, the model may only be very certain that it is about Food & diet and label it as such. This means that it might miss the Environment label. This model would therefore produce more false negatives.
Precision and recall cannot generally both be increased at the same time; there is a trade-off to be made between the two.
F1 score
The F1 score is the harmonic mean of recall and precision. It is an attempt to combine the two into a single metric. Its range is between 0 and 1; the higher the better.
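For the Table 3 example, the perfect precision (1.0) cannot compensate for the poor recall (0.1), and the F1 score comes out low. A sketch with the same made-up data:

```python
from sklearn.metrics import f1_score

y_true = [0] * 90 + [1] * 10
y_pred = [0] * 99 + [1]

# harmonic mean: 2 * (precision * recall) / (precision + recall)
# = 2 * (1.0 * 0.1) / (1.0 + 0.1), roughly 0.18
print(f1_score(y_true, y_pred))
```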
In phase 1 of the pxtextmining project, the metric that was optimised was class balance accuracy (not to be confused with balanced accuracy score).[1] This was a type of averaged accuracy score obtained across the different classes.
Multilabel performance metrics
In a multilabel model, one sample can have more than one label. The exact number of labels assigned to the sample can vary. For example, we could label films like this:
| Film | Labels |
|---|---|
| The Mummy | Action, Adventure, Comedy |
| Shrek | Comedy, Animation |
Performance measurement for models that perform multilabel classification is tricky. There is little consensus in the literature about which metric to select. This is because, as with the single-label example, it largely depends on what we’re trying to accomplish. The performance of the multi-label learning algorithm should therefore be tested on a broad range of metrics instead of only on the one being optimized.[2], [3] The current popular methods for reporting the performance of a multi-label classifier include the Hamming loss, precision, recall, and F-score for each class with the micro, macro, and weighted average of these scores from all classes.[4]
| Film | Real Labels | Predicted Labels |
|---|---|---|
| The Mummy | Action, Adventure, Comedy | Action, Adventure |
| Shrek | Comedy, Animation | Comedy, Animation |
Hamming loss
The Hamming loss is the fraction of individual labels that are incorrectly predicted. It ranges between 0 and 1; the lower the better. In the example in Table 4 above, with four possible labels (Action, Adventure, Comedy, Animation) across two samples, the model got only 1 of the 8 label predictions wrong, giving a Hamming loss of 1/8 = 0.125.
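A sketch of this calculation with scikit-learn, assuming the Table 4 label set is encoded as binary columns in the order Action, Adventure, Comedy, Animation:

```python
import numpy as np
from sklearn.metrics import hamming_loss

# Columns: [Action, Adventure, Comedy, Animation]
y_true = np.array([[1, 1, 1, 0],    # The Mummy: Action, Adventure, Comedy
                   [0, 0, 1, 1]])   # Shrek: Comedy, Animation
y_pred = np.array([[1, 1, 0, 0],    # The Mummy: Comedy was missed
                   [0, 0, 1, 1]])   # Shrek: both labels predicted correctly

print(hamming_loss(y_true, y_pred))  # 0.125 - 1 wrong prediction out of 8 label slots
```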
Jaccard score
Jaccard similarity index is the size of the intersection of the predicted labels and the true labels divided by the size of the union of the predicted and true labels. It ranges from 0 to 1, and 1 is the perfect score. However, if there are no true or predicted labels, the sklearn implementation of Jaccard will return an error.
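For the Table 4 example, a per-sample Jaccard score could be sketched like this (again assuming the four-label encoding used above):

```python
import numpy as np
from sklearn.metrics import jaccard_score

# Columns: [Action, Adventure, Comedy, Animation]
y_true = np.array([[1, 1, 1, 0], [0, 0, 1, 1]])
y_pred = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])

# 'samples' averaging computes |intersection| / |union| per sample, then averages:
# The Mummy: 2/3, Shrek: 2/2, so (0.667 + 1.0) / 2 is roughly 0.83
print(jaccard_score(y_true, y_pred, average='samples'))
```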
Averaging single label metrics
All of the usual classification metrics (recall, precision, F1 score) can also be used in multilabel classification. For example, you could get a recall, precision, and F1 score for every label, using a one-vs-rest approach.
In multilabel classification, accuracy is sometimes also called ‘exact match accuracy’ or ‘subset accuracy’. This is quite a harsh metric, as all of the labels for a sample must be correct for that sample to count. In the example in Table 4, the subset accuracy would be only 0.5: the model did not predict the full set of correct labels for the first sample, even though it only got one individual label wrong.
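In scikit-learn, accuracy_score applied to multilabel data gives exactly this subset accuracy; a sketch with the Table 4 encoding used above:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Columns: [Action, Adventure, Comedy, Animation]
y_true = np.array([[1, 1, 1, 0], [0, 0, 1, 1]])
y_pred = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])

# A sample only counts as correct if every one of its labels matches exactly
print(accuracy_score(y_true, y_pred))  # 0.5 - only Shrek's label set matches in full
```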
You could also obtain generalised recall, precision, and F1 scores across all the labels. The way this works is that the recall, precision, and F1 scores are obtained for each class, and then averaged. They can be averaged in three ways:
- Micro-averaged: all samples contribute equally to the final averaged metric. The true positives, false negatives and false positives are pooled across classes, so in an imbalanced dataset with many more of Class A than Class B, the larger class dominates the score.
- Macro-averaged: all classes contribute equally to the final averaged metric. In an imbalanced dataset with many more of Class A than Class B, both classes are weighted equally when calculating the average.
- Weighted-averaged: each class’s contribution to the average is weighted by support (the number of true instances for each label).
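A sketch of the three averaging strategies for the Table 4 example, using scikit-learn’s f1_score (the same approach works for recall_score and precision_score):

```python
import numpy as np
from sklearn.metrics import f1_score

# Columns: [Action, Adventure, Comedy, Animation]
y_true = np.array([[1, 1, 1, 0], [0, 0, 1, 1]])
y_pred = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])

print(f1_score(y_true, y_pred, average='micro'))     # pools all label decisions together
print(f1_score(y_true, y_pred, average='macro'))     # unweighted mean of the per-class F1 scores
print(f1_score(y_true, y_pred, average='weighted'))  # per-class F1 weighted by class support
print(f1_score(y_true, y_pred, average=None))        # per-class F1, one score per label
```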
Conclusion: which metrics to use for pxtextmining Phase 2?
Ultimately, there is no single magic metric that will work for the pxtextmining project. As the dataset is quite imbalanced, and we do want to capture the less well-represented classes, the macro-averaged scores are likely to be helpful. I expect to use the macro-averaged F1 score as a very rough indicator in the beginning stages of the project.
However, per-class statistics will likely be most important for performance measurement, and will be particularly helpful during the model tuning stage for indicating labels that are proving hard to capture.[5] The Hamming and Jaccard scores will also be calculated to give an overall indicator of performance, although I anticipate that these will not be the primary focus of model tuning. Unfortunately, the ‘class balance accuracy’ metric which was utilised in phase 1 of the pxtextmining project cannot be used for the multilabel models in phase 2.
Do note that there are other metrics available. The ones selected here are the ones that have been most commonly mentioned in the literature, as it makes sense to utilise what has been established as best practice elsewhere for generalisability and interpretability.
A note on baseline/dummy models
When prototyping machine learning models, we first need to establish a baseline of performance: what is the most basic level that we should expect from our model? In the case of an imbalanced dataset, a model which just predicts the mode (the most frequently occurring class) regardless of the input would be a suitable dummy. Other approaches to ‘dummy’ models include a model that just outputs random predictions.
It is useful to know what performance a dummy model would achieve when considering the performance of any machine learning model, as it gives us something to compare our outputs against.
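scikit-learn provides a DummyClassifier for exactly this purpose; here is a minimal sketch with made-up, imbalanced data:

```python
from sklearn.dummy import DummyClassifier

# Made-up data: the features are ignored by the dummy, and the classes are imbalanced
X = [[i] for i in range(10)]
y = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

dummy = DummyClassifier(strategy='most_frequent')  # always predicts the mode
dummy.fit(X, y)

print(dummy.predict(X))   # all zeros
print(dummy.score(X, y))  # 0.9 accuracy without learning anything useful
```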
Bibliography
- L. Mosley, ‘A balanced approach to the multi-class imbalance problem’, Doctor of Philosophy, Iowa State University, Digital Repository, Ames, 2013. doi: 10.31274/etd-180810-3375.
- M.-L. Zhang and Z.-H. Zhou, ‘A Review on Multi-Label Learning Algorithms’, IEEE Trans. Knowl. Data Eng., vol. 26, no. 8, pp. 1819–1837, Aug. 2014, doi: 10.1109/TKDE.2013.39.
- X.-Z. Wu and Z.-H. Zhou, ‘A Unified View of Multi-Label Performance Measures’, in Proceedings of the 34th International Conference on Machine Learning, Jul. 2017, pp. 3780–3788. Accessed: Jan. 20, 2023. [Online]. Available: https://proceedings.mlr.press/v70/wu17a.html
- M. Heydarian, T. E. Doyle, and R. Samavi, ‘MLCM: Multi-Label Confusion Matrix’, IEEE Access, vol. 10, pp. 19083–19095, 2022, doi: 10.1109/ACCESS.2022.3151048.
- S. Henning, W. H. Beluch, A. Fraser, and A. Friedrich, ‘A Survey of Methods for Addressing Class Imbalance in Deep-Learning Based Natural Language Processing’. arXiv, Oct. 10, 2022. Accessed: Jan. 20, 2023. [Online]. Available: http://arxiv.org/abs/2210.04675