Image credit: MaxPixel (https://ift.tt/4owJFUe)

NN or XGBoost: The Guide

A short guide to help you choose the model that suits you best

Part 1 - Intro

Inspired by the success of Deep Learning in Computer Vision tasks, many people debate whether NNs can “win the game” for Tabular Data. Some argue in favor of NNs’ almost infinite potential, while others point to XGBoost’s achievements in Kaggle competitions. Some have even performed head-to-head comparisons between the two model families (e.g. Firefly.ai, MLJar, Ravid et al.).

Choose not the best model, but the model that suits you best

Fortunately, the entire question of “which ML model is best for tabular data” is completely irrelevant to the data scientist’s work. This is because the data scientist focuses on their own specific dataset/task and, especially with tabular data, finds other datasets/tasks less relevant. Furthermore, the model’s accuracy plays only a partial role in choosing the right model. Thus, the right question is “which model is best for my needs?”

Choosing the right model is not always trivial as there are many aspects to consider. To make this process easier, the guide below provides 3 tables of considerations and the respective model recommendations. The logic behind these recommendations is explained in the third section.

The tables and analysis below were created for a Multi-Layer-Perceptron (MLP) architecture. In recent years, however, many other NN architectures for tabular data have been introduced, such as LSTM, TabNet, and NODE. These architectures, while providing superior performance, share many characteristics with the basic MLP architecture. Thus, the analysis below is still applicable, to some extent, to these architectures as well.

Part 2 - The guide

The following tables were created to assist you in recognizing which model has the potential to meet your needs.

Trees → XGBoost, Random Forest, LightGBM, CatBoost, etc.
NN → Multi-Layer-Perceptron and, to some extent, other architectures

Performance Question

There is no way to know in advance which model will give better performance. It would be best to try all the models, including hyperparameter tuning, and construct an ensemble from the trained models. In other words, it’s best to use an AutoML system with plenty of credits. Unfortunately, using such a system is often not possible, so the model type must be chosen manually. In such cases, the characteristics of the data and the task might indicate which model will achieve the best performance.
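
For illustration, here is a minimal sketch of the “try both families and ensemble them” idea, assuming a generic tabular regression task; the data, split, and hyper-parameters below are placeholders rather than a recommended setup:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

# Placeholder tabular data; substitute your own features/target here
X, y = np.random.rand(1000, 10), np.random.rand(1000)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Trees-based candidate
trees = XGBRegressor(n_estimators=300, learning_rate=0.05)
trees.fit(X_train, y_train)

# NN candidate (feature scaling matters for NNs, unlike for Trees)
nn = make_pipeline(StandardScaler(),
                   MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500))
nn.fit(X_train, y_train)

# Simple ensemble: average the two predictions
pred = 0.5 * trees.predict(X_val) + 0.5 * nn.predict(X_val)
print(mean_absolute_error(y_val, pred))
```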

Aspects of Training and Inference

In Kaggle competitions, performance is everything. But in real life, practical considerations, such as ease of use and training time/cost, are often just as important.

Other Considerations

Often the work does not end with a trained model, and follow-up features are required. Such features may dictate choosing a sub-par model in order to obtain a more useful overall solution.

Part 3 — The Reasoning

Trees vs NN — High-level differences

Let’s examine some key differences between an NN-based solution and a Trees-based solution.

When it comes to ease of use, there is no question that it’s much easier to create a Trees-based solution. It’s almost a plug-and-play solution: three lines of code and a negligible learning curve (see the sketch below). In contrast, to get good results with NNs you need to know what you are doing, both the theory and the TensorFlow/PyTorch packages. Not only that, but the time it takes to train a Trees-based model is also much shorter. On top of that, Trees-based models are relatively insensitive to the few hyper-parameters (HPs) they have. NN models, on the other hand, have many hyper-parameters and are also very sensitive to their values. As a result, NN models require thorough HP tuning to make them work.
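
As an illustrative example of those “three lines”, assuming xgboost is installed and using a toy dataset as a stand-in for your own tabular data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The "three lines" themselves: no scaling or encoding of numeric features,
# and the default hyper-parameters usually give a decent baseline
model = XGBClassifier()
model.fit(X_train, y_train)
preds = model.predict(X_test)
```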

All these points add up to Trees-based models having a significantly shorter time-to-model.

One of the reasons Trees-based models are easy to use is that they do not provide much room to maneuver. This is a double-edged sword. While it enables the novice to get high-end results, it prevents the expert from advancing further and building a tailor-made solution. In this sense, NN models are just the opposite. With sufficient expertise, it’s possible to tailor the solution to the problem and the business case, enabling a customized State-of-the-Art solution.

However, this does not come for free. Not only is expertise required, but also increased R&D time, dedicated hardware, and longer model training with HP tuning. In the industry, all of these translate to a lot of money. Without question, creating an NN solution is far more expensive.

Putting together “customizable”, “State-of-the-Art”, and “expensive” yields significantly more PR and interest in the technology (i.e. Deep Learning), which in turn translates into fundraising, product marketing, and academic research.

Trees vs Neural Networks — Technical Analysis

Each model type, Trees-based or NN-based, has its characteristics which can be translated to strengths and weaknesses. These, in turn, make each model more suitable to specific datasets.

Neural Networks

Let’s examine the Multi-Layer-Perceptron (MLP) architecture. This architecture uses the concept of neurons and layers to conveniently represent the NN model as a composition of activation functions, where each intermediate function is a weighted sum of several activation functions. Thus, most of the activation function’s characteristics are passed on to the composed function, which is the NN model. With the common activation functions (e.g. Linear, ReLU, Tanh, or Logistic) this implies the following:

  • Parametric function space: The NN architecture defines a function space in which the NN model F(x) lies. The parameters of F(x), the weights and the biases, are optimized so that F(x) best represents the underlying function from which the data samples were drawn. Since F(x) is limited to this function space, it is easier to obtain a good approximation of underlying functions that belong to the same function space and have similar characteristics. This means less data is required to achieve higher accuracy compared to other methods. Another advantage is easy access to the model’s derivatives, should one require them.
  • Lipschitz continuity: The gradients of the NN model are bounded. This makes it difficult for an NN to approximate an underlying function with strong gradients, such as discontinuous functions (e.g. categorical values), unbounded functions (e.g. 1/x, log(x)), or functions with unbounded gradients (e.g. sqrt(x)). In such cases, more data and model complexity would be required to achieve the desired accuracy.
  • Not periodic: The common activation functions are not periodic. This characteristic may make it challenging for an NN to approximate a periodic function.
  • Capability to extrapolate: When x goes to infinity, F(x) becomes either constant (for Tanh, Logistic) or linear (for ReLU). This allows some capability to extrapolate. Moreover, in regression tasks, the model may output values lower/higher than the minimum/maximum of the target in the training data.
  • Flexible: It’s possible to play with the architecture to adapt the function space to the task. For instance, multi-label models.
  • Optimization process: The optimization process allows converging to a better solution by incorporating custom metrics, regularization, and constraints. Additionally, it allows advanced features like online learning and transfer learning.
  • Network architecture: The use of several hidden layers causes the network to embed the data in a relatively small N-dimensional space during training. This embedding is often useful for other applications/models. It also opens up many options such as transfer learning (in which the embedding is re-used), using images/text (by concatenation of the embeddings), or even for generative purposes (by querying the embedding’s latent space).
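
To make the last two points concrete, here is a minimal PyTorch sketch of an MLP for tabular data whose penultimate layer doubles as a reusable embedding; the layer sizes and names are illustrative assumptions, not a prescribed architecture:

```python
import torch
import torch.nn as nn

class TabularMLP(nn.Module):
    def __init__(self, n_features: int, emb_dim: int = 16):
        super().__init__()
        # Hidden layers compress the input into a small latent space
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, emb_dim), nn.ReLU(),
        )
        # Task-specific head (here: single-output regression)
        self.head = nn.Linear(emb_dim, 1)

    def forward(self, x):
        return self.head(self.encoder(x))

    def embed(self, x):
        # The learned embedding can be reused for transfer learning,
        # concatenated with image/text embeddings, or queried later
        return self.encoder(x)

model = TabularMLP(n_features=10)
x = torch.randn(4, 10)  # placeholder batch of 4 samples
print(model(x).shape, model.embed(x).shape)
```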

Trees

The underlying function is approximated by a weighted sum of piecewise constant functions, where each piece is an orthotope (hyperrectangle). Thus, while each model (RandomForest, XGBoost, etc.) converges to a different function, they are all contained within the same function space.

  • Non-parametric: The model can approximate any underlying function.
  • Piecewise constant: The model’s accuracy is dictated by the number of partitions versus the underlying function’s gradients. Since the partitions are dictated by the distribution of the data points, the model’s accuracy will be lower in areas with high gradients or a low density of samples. This, of course, is relevant mostly for numeric features.
  • Decision making: The basic building block, the decision tree, is easy to explain. As a result, Trees-based models are considered significantly more explainable. Additionally, Shapley values can be easily calculated for such models without the need for expensive computations or surrogate models.
  • Interpolation only: Due to the binary decision-making structure, Trees-based models cannot provide reliable predictions outside the extrema of the features’ values, i.e. they cannot extrapolate (see the sketch after this list). Again, this is mostly relevant for numeric features. It makes Trees-based models unsuitable for features like “time from epoch” unless an appropriate countermeasure is taken. Similarly, Trees-based models’ predictions are limited to the target range seen in the training set, so target labels such as “number of customers” should be avoided.
  • Locality: The Trees-based model is very local, as each leaf is responsible for a specific area of the input space. As a result, the generalization quality in one area of the input space is relatively independent of the quality in other subspaces.
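
To illustrate the extrapolation limitation, here is a small sketch with made-up data in which a Trees-based model is trained on a single numeric feature and its predictions saturate outside the training range:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(500, 1))                       # feature in [0, 10]
y_train = 3 * X_train.ravel() + rng.normal(scale=0.5, size=500)   # target roughly y = 3x

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Inside the training range the fit is fine; outside, predictions saturate:
# roughly [15, ~30, ~30]. The prediction at x=20 stays near the maximum
# target value seen during training instead of continuing the trend to ~60.
print(model.predict([[5.0], [10.0], [20.0]]))
```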

Key Takeaways

There is no SOTA ML model for a tabular dataset

Tabular data encompasses all datasets that can be put into tables. But that’s it. Unlike other types of datasets, such as images or text, there is no other common characteristic. As a result, different models are more suitable for different tasks.

Compare the needs of your task with each model’s pros and cons

There are plenty of readily available models out there. The right model is not the one optimized for the best performance, but the one that provides the best balance between the overall business value and the solution’s cost and effort.

This balance is unique to the specific challenge one is facing. By analyzing the needs of the task versus the models’ strengths and weaknesses, one is able to choose the right path to take.

Choose the model that suits you best

I’ve created this guide to encourage you to ask the right questions and assist you in making the right choices. The rest is up to you.


NN or XGBoost: The Guide was originally published in Towards Data Science on Medium.


