Classification in the Wild

Let's dive into classification metrics and discuss a few tricks that could boost your classification pipeline's performance.

Photo by Patrick Tomasso on Unsplash

Hi, I am Sergey, and I have been working on ML-based projects for the last 5+ years. During my career, I have worked on different projects, in startups and big companies, won a few competitions, and written a few papers. I have also launched Catalyst, a high-level framework on top of PyTorch, to boost my productivity as a deep learning practitioner. Along the way, I recently decided to write a series of posts about the "general things" of deep learning, starting with ML evaluation techniques and metrics: how to understand them, and when to use them. So today, I would like to dive into classification metrics in deep learning and discuss a few tricks that could boost your classification pipeline's performance.

You can find all the examples below in this Colab notebook. The original text of the blog post is available here.

Exp 01: typical classification

The classification task is well known to any deep learning practitioner. Long story short, we have some labeled data in (some-data, label) format and want to create a model that can map new, unseen data to a label for us. As an example, let's review a simple CIFAR10 classification pipeline:
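The original code cell is not included in this mirror of the post, so here is a minimal plain-PyTorch sketch of the same setup, assuming torchvision's resnet18 (with num_classes=10) as a stand-in for the post's resnet9 model; the post itself wires this up with Catalyst.

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# CIFAR10 loaders (ToTensor only, to keep the sketch short)
transform = transforms.ToTensor()
train_ds = datasets.CIFAR10("./data", train=True, download=True, transform=transform)
valid_ds = datasets.CIFAR10("./data", train=False, download=True, transform=transform)
loaders = {
    "train": DataLoader(train_ds, batch_size=128, shuffle=True),
    "valid": DataLoader(valid_ds, batch_size=128),
}

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet18(num_classes=10).to(device)  # stand-in for the post's resnet9
criterion = nn.CrossEntropyLoss()                   # CE loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)                       # Adam
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[6, 8])  # MultiStep

for epoch in range(10):
    model.train()
    for x, y in loaders["train"]:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()

    # accuracy as the validation metric
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in loaders["valid"]:
            x, y = x.to(device), y.to(device)
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    print(f"epoch {epoch}: accuracy={correct / total:.4f}")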

To sum up the example above, we:

  • create a resnet9 network,
  • train it on CIFAR10 for 10 epochs,
  • with CE loss, the Adam optimizer, and a MultiStep scheduler,
  • and use accuracy as the validation metric.

This is a very common classification pipeline. Can we do better? Let's check it out!

Exp 02: focal loss

Starting with a simple improvement, let's use FocalLoss instead of CE. For a long review, please read the original paper; for a short one: thanks to per-sample loss reweighting based on the difference between true and predicted probabilities, FocalLoss handles class imbalance better, focusing on poorly distinguished classes. As a result, it gives better performance for classification tasks with heavy class imbalance (a real-world case, not the CIFAR one). More importantly, it doesn't introduce any additional complexity into your pipeline. So let's check it out:
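The focal-loss cell is also missing from this mirror, so below is a minimal, illustrative multi-class focal loss; Catalyst ships its own implementation, and this stand-alone version only shows the idea.

import torch
from torch import nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Cross-entropy down-weighted by (1 - p_t) ** gamma, so easy samples
    contribute less and hard (poorly distinguished) samples dominate."""

    def __init__(self, gamma: float = 2.0):
        super().__init__()
        self.gamma = gamma

    def forward(self, logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        ce = F.cross_entropy(logits, targets, reduction="none")  # per-sample -log(p_t)
        p_t = torch.exp(-ce)                                     # probability of the true class
        return ((1.0 - p_t) ** self.gamma * ce).mean()

# drop-in replacement for nn.CrossEntropyLoss in the pipeline above
criterion = FocalLoss(gamma=2.0)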

While we haven't significantly improved the CIFAR10 results, FocalLoss usually helps with more practical cases. Finally, a short trick I also want to mention is multi-criterion usage:
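Again, the original snippet is not included here, so this is only a sketch of what a multi-criterion setup can look like, reusing the FocalLoss sketch above; the 0.5 / 0.5 weights are illustrative, not the post's values.

from torch import nn

class MultiCriterion(nn.Module):
    """Weighted sum of plain CE and FocalLoss; the weights are hyperparameters."""

    def __init__(self, ce_weight: float = 0.5, focal_weight: float = 0.5):
        super().__init__()
        self.ce = nn.CrossEntropyLoss()
        self.focal = FocalLoss(gamma=2.0)  # from the sketch above
        self.ce_weight = ce_weight
        self.focal_weight = focal_weight

    def forward(self, logits, targets):
        return (self.ce_weight * self.ce(logits, targets)
                + self.focal_weight * self.focal(logits, targets))

criterion = MultiCriterion()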

This approach gives you a way to balance straightforward classification (with CE loss) and an imbalance-focused one (with FocalLoss).

Exp 03: classification metrics

Okay, as we have seen, we could "improve" our accuracy a bit thanks to FocalLoss. But there are a few additional classification metrics that help you understand your model better:

  • precision - shows the model's confidence in its label predictions. For example, if precision is high and the model predicts label L for some input I, then there is a high probability that I actually is L.
  • recall - shows the model's ability to find all instances of a class in the data stream. Of course, high recall does not mean that all model predictions will be accurate enough, but it gives us a high probability of covering all class instances.
  • f-score - the harmonic mean of precision and recall. Hence, it can be seen as a unified score for understanding the model's ability to find all relevant classes in the data stream (recall) and label them correctly (precision). Moreover, the f-score's beta parameter lets us weight the aggregation towards precision or recall (see the small sketch after this list).
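For reference, the standard F-beta formula behind that beta parameter fits in a few lines; the precision/recall numbers below are made up, only to show the effect of beta.

def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta: beta > 1 favors recall, beta < 1 favors precision, beta = 1 is F1."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.8, 0.5, beta=1.0))  # ~0.615, the balanced F1
print(f_beta(0.8, 0.5, beta=2.0))  # ~0.541, pulled towards the lower recall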

Additionally, there are two more things that are essential to check during model training and prediction:

  • support - simply the number of samples per class. It seems obvious, but the more data points you have, the more confident an insight you can draw. And sometimes, even if you have a large dataset with a wide variety of classes, there can be classes with only a few examples, leading to unpredictable results during training and evaluation. Adding a support metric to your pipeline gives you a simple way to "validate" the dataset during training.
  • confusion matrix - an easy-to-follow summary of your classification model's ability to distinguish different classes. While it obviously helps you analyze model correctness (the confusion matrix diagonal), it also gives you important insight into class distribution and labeling. There were several cases in my practice when the confusion matrix helped to find incorrect labeling during a dataset update, just by reviewing anomalies in class interactions on the confusion matrix.

Let’s add them to our pipeline:
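The corresponding code cell is not included in this mirror; in Catalyst these metrics are typically added as callbacks, roughly as sketched below. Argument names can differ between Catalyst versions, so treat this as an outline rather than copy-paste code.

from catalyst import dl

runner = dl.SupervisedRunner(
    input_key="features", output_key="logits", target_key="targets", loss_key="loss"
)
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=loaders,
    num_epochs=10,
    callbacks=[
        dl.AccuracyCallback(input_key="logits", target_key="targets"),
        # per-class and aggregated precision / recall / F1 / support
        dl.PrecisionRecallF1SupportCallback(
            input_key="logits", target_key="targets", num_classes=10
        ),
        dl.ConfusionMatrixCallback(input_key="logits", target_key="targets", num_classes=10),
    ],
    logdir="./logs",
    valid_loader="valid",
    valid_metric="accuracy",
    minimize_valid_metric=False,
)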

There are a few crucial things to keep in mind with these metrics. All of them can be computed either "per-class" or "aggregated" over the classes. "Per-class" results are crucial for understanding model performance, because there are many cases when your model performs "well" in general but "worse than ever before" on the most important classes. For example, consider a text classification model that works great on greeting intents but fails at predicting toxic ones, which can be much more valuable from a business perspective. "Aggregated" results are essential if you want to quickly review model performance in only a few numbers. There are three common aggregation strategies (illustrated in the sketch after the list below):

  • micro: all samples equally contribute to the final averaged metric,
  • macro: all classes equally contribute to the final averaged metric,
  • weighted: each class's contribution is weighted by its size during averaging.
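A tiny framework-agnostic illustration (using scikit-learn, which is not necessarily what the original pipeline uses) of how the three averaging modes differ on the same imbalanced predictions:

from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2]   # class 0 dominates, class 2 is rare
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0]

for average in ("micro", "macro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=average, zero_division=0
    )
    print(f"{average:>8}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# per-class values (average=None) are what the aggregations are built from
print(precision_recall_fscore_support(y_true, y_pred, average=None, zero_division=0))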

Another important note about these metrics: all of their results are dataset-based, which means you cannot simply average batch-based micro-metrics to get dataset-based micro statistics.
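A small made-up example of why: precision has to be computed from true/false positive counts accumulated over the whole dataset, not from averaged per-batch values.

# two batches with very different numbers of positive predictions
batches = [
    {"tp": 9, "fp": 1},   # 10 positive predictions, precision 0.90
    {"tp": 0, "fp": 1},   #  1 positive prediction,  precision 0.00
]

mean_of_batch_precisions = sum(b["tp"] / (b["tp"] + b["fp"]) for b in batches) / len(batches)
tp = sum(b["tp"] for b in batches)
fp = sum(b["fp"] for b in batches)
dataset_precision = tp / (tp + fp)

print(round(mean_of_batch_precisions, 2))  # 0.45 - misleading
print(round(dataset_precision, 2))         # 0.82 - the actual micro value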

Tensorboard

Since we now have a large variety of different metrics, it's much easier to use TensorBoard to watch them all:

tensorboard --logdir ./logs
Accuracy metric for the pipelines above. (Image by author)

Inference & report

There is also a way to represent all the above metrics in a user-friendly form to review the final model performance:
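The report shown below is produced with a Catalyst utility; a rough scikit-learn equivalent, reusing the model and loaders from the pipeline sketch above, could look like this:

import torch
from sklearn.metrics import classification_report

model.eval()
y_true, y_pred = [], []
with torch.no_grad():
    for x, y in loaders["valid"]:
        logits = model(x.to(device))
        y_pred.extend(logits.argmax(dim=1).cpu().tolist())
        y_true.extend(y.tolist())

# per-class precision / recall / F1 / support plus the aggregated averages
print(classification_report(y_true, y_pred, digits=3))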

Classification report results. (Image by author)

With such a classification report, it's much easier to draw conclusions about the model's final performance.

Thresholds

The last critical step I would like to mention in this tutorial is thresholds. While they are not fancy deep learning models, they give you a way to tune those models for your production cases without any additional training. For example, you could set a threshold of 1.0 for some poorly performing class to stop the model from predicting it at all. Since this is an essential practice in production deep learning, it is also included in Catalyst:
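The Catalyst snippet itself is not included in this mirror; below is an illustrative plain-PyTorch sketch of what per-class thresholds boil down to at inference time (the helper name and the 0.5 / 1.0 threshold values are made up for the example).

import torch

def predict_with_thresholds(logits: torch.Tensor, thresholds: torch.Tensor) -> torch.Tensor:
    """logits: [batch, num_classes]; thresholds: [num_classes], values in [0, 1]."""
    probs = logits.softmax(dim=1)
    # hide classes whose probability does not clear their threshold
    masked = torch.where(probs >= thresholds, probs, torch.full_like(probs, -1.0))
    best = masked.max(dim=1)
    # samples where no class clears its threshold get label -1 ("reject")
    return torch.where(best.values > 0, best.indices, torch.full_like(best.indices, -1))

thresholds = torch.full((10,), 0.5)
thresholds[3] = 1.0  # stop predicting a poorly working class entirely
print(predict_with_thresholds(torch.randn(4, 10), thresholds))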

As you can see, we were able to slightly improve our model performance even in such a simple setup as CIFAR (learning CIFAR is quite easy for a model). Of course, there is a data leak in this benchmark because we tuned and evaluated thresholds on the same test set, so in a real-world example you have to split your dataset into train, valid, and test parts to prevent any data leaks. Nevertheless, even with such a strict evaluation setup, threshold usage usually gives a critical 2–8% improvement in your metric of interest, which is huge.
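For completeness, one simple way such thresholds can be tuned on a held-out validation split is a greedy per-class grid search over the predicted probabilities; this is only a sketch of the idea, not Catalyst's actual tuning procedure.

import numpy as np

def tune_thresholds(probs: np.ndarray, y_true: np.ndarray, grid=np.linspace(0.0, 1.0, 21)):
    """Greedily pick, class by class, the threshold that maximizes accuracy."""
    num_classes = probs.shape[1]
    thresholds = np.zeros(num_classes)
    for c in range(num_classes):
        best_acc, best_t = -1.0, 0.0
        for t in grid:
            trial = thresholds.copy()
            trial[c] = t
            preds = np.where(probs >= trial, probs, -1.0).argmax(axis=1)
            acc = (preds == y_true).mean()
            if acc > best_acc:
                best_acc, best_t = acc, t
        thresholds[c] = best_t
    return thresholds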

Conclusion

So, to sum up, the main topics for this blog post:

  • the classification problem is still an open area for improvement (especially in the closed vs. open set domain, but that is another post),
  • try FocalLoss in your next experiment if you have class imbalance in your data (a pretty common case),
  • use PrecisionRecallF1Support and ConfusionMatrix to analyze your model performance during training,
  • use the classification report to understand your final model's classification performance,
  • try thresholds during model deployment to tune the model for your special cases and improve the final performance.

If you want to dive deeper into this classification example, you could run the Colab notebook linked above, or tune it for whatever classification problem you want ;)

Those were all the important classification steps for this blog post. If you would like to see more deep learning best practices, follow scitator & catalyst-team. Thanks for reading, and stay tuned for more!

