
Crafting One Pipeline for Machine Learning Steps


From input transformation to grid search with scikit-learn

“One Pipeline to rule them all, One Pipeline to find them, One Pipeline to bring them all and in the brightness fit them.”

Photo by Rodion Kutsaiev on Unsplash

When we look at the “Table of Contents” of a machine learning book on the market (e.g. Géron, 2019), we see that after getting the data and visualizing it to gain insights, there are, broadly, steps such as cleaning the data, transforming and handling data attributes, scaling features, training a model and then fine-tuning it. Data scientists’ beloved module, scikit-learn, has an amazing class to handle these steps in a streamlined way: Pipeline.

Whilst exploring the best use of Pipelines online, I’ve come across great implementations. Luvsandorj nicely explained what they are (2020) and showed how to customize a simpler one (2022). Géron (2019, p.71–72) gave an example of writing our “own custom transformer for tasks such as custom clean-up operations or combining specific attributes”. Miles (2021) showed how to run a grid search on a pipeline with a single classifier, while Batista (2018) presented how to include many classifiers in a grid search without a pipeline.

Introduction

In this post, I will combine these sources to come up with the ultimate ML pipeline, one that can handle the majority of ML tasks: (i) feature cleaning, (ii) handling missing values, (iii) scaling and encoding features, (iv) dimensionality reduction and (v) running many classifiers with different combinations of parameters (grid search), as the following diagram shows.

image by author

Table of Contents

I- Libraries and Data Set

II- Setting Classes

III- Setting Grid Search Parameters

IV- Building and Fitting Pipeline

V- Calculating Scores

Conclusion

I- Libraries and Data Set

For simplicity, let’s use the Titanic data set, which can easily be loaded from the seaborn library.
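A minimal sketch of this step (the data set ships with seaborn under the name “titanic”):

```python
import seaborn as sns

# Load the Titanic data set bundled with seaborn
df = sns.load_dataset("titanic")

print(df.shape)  # (891, 15)
df.head()
```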

image by author

II- Setting Classes

We need to create the following classes for this process:

  • “FeatureTransformer” to manipulate pandas dataframe columns. For instance, although it has no impact on the model, I added a “strlowercase” parameter which can be applied to a (list of) column(s) to transform the data (inspired by Géron, 2019, p.71–72).
  • “Imputer” to handle missing values (similar to sklearn’s SimpleImputer class).
  • “Scaler” to scale numerical features (similar to sklearn’s StandardScaler class).
  • “Encoder” to encode (categorical or ordinal) features (inspired by Luvsandorj, 2022).
  • “ClassifierSwitcher” to switch between classifiers in the grid search step (inspired by Miles, 2021). A sketch of all five classes follows below.
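The original implementations accompany the post as embedded code; the following is a rough sketch of how such classes can be written, all following scikit-learn’s fit/transform convention. Only the class names and the “strlowercase” parameter come from the post; the exact bodies are my assumptions:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class FeatureTransformer(BaseEstimator, TransformerMixin):
    """Custom clean-up operations on selected dataframe columns."""

    def __init__(self, strlowercase=None):
        self.strlowercase = strlowercase  # list of columns to lower-case

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        X = X.copy()
        for col in (self.strlowercase or []):
            X[col] = X[col].str.lower()
        return X


class Imputer(BaseEstimator, TransformerMixin):
    """SimpleImputer wrapper that keeps pandas column names."""

    def __init__(self, strategy="median"):
        self.strategy = strategy

    def fit(self, X, y=None):
        self.imputer_ = SimpleImputer(strategy=self.strategy).fit(X)
        return self

    def transform(self, X):
        return pd.DataFrame(self.imputer_.transform(X),
                            columns=X.columns, index=X.index)


class Scaler(BaseEstimator, TransformerMixin):
    """StandardScaler wrapper that keeps pandas column names."""

    def fit(self, X, y=None):
        self.scaler_ = StandardScaler().fit(X)
        return self

    def transform(self, X):
        return pd.DataFrame(self.scaler_.transform(X),
                            columns=X.columns, index=X.index)


class Encoder(BaseEstimator, TransformerMixin):
    """One-hot encodes every column it receives."""

    def fit(self, X, y=None):
        # on scikit-learn < 1.2 use sparse=False instead of sparse_output
        self.encoder_ = OneHotEncoder(handle_unknown="ignore",
                                      sparse_output=False).fit(X)
        return self

    def transform(self, X):
        cols = self.encoder_.get_feature_names_out(X.columns)
        return pd.DataFrame(self.encoder_.transform(X),
                            columns=cols, index=X.index)


class ClassifierSwitcher(BaseEstimator):
    """Placeholder final step; the grid search swaps real estimators in."""

    def __init__(self, estimator=None):
        self.estimator = estimator

    def fit(self, X, y=None, **fit_params):
        self.estimator.fit(X, y, **fit_params)
        return self

    def predict(self, X):
        return self.estimator.predict(X)

    def predict_proba(self, X):
        return self.estimator.predict_proba(X)
```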

III- Setting Grid Search Parameters

We create two dictionaries (inspired by Batista, 2018):

1- models_for_gridsearch = classifier names as keys, classifier objects as values

2- params_for_models = classifier names as keys, classifier hyperparameters as values, which are:

  • Either empty dictionaries (such as the LogisticRegression row, meaning the classifier will be used with its default parameters),
  • Dictionaries of lists (the KNeighborsClassifier or RandomForestClassifier rows), or
  • Lists of dictionaries (such as the SVC row).

Note: I commented out the other classifier objects for simplicity.

We aim to create a list of dictionaries, each pairing a classifier with one of its parameter choices, which will then be used in the grid search.
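Since the original dictionaries appear as a screenshot, here is a rough equivalent. The specific hyperparameter values and the grids_for_search name are my illustrative assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

models_for_gridsearch = {
    "LogisticRegression": LogisticRegression(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "RandomForestClassifier": RandomForestClassifier(),
    "SVC": SVC(probability=True),  # probability=True so ROC-AUC works later
    # other classifiers commented out for simplicity
}

params_for_models = {
    "LogisticRegression": {},                              # defaults only
    "KNeighborsClassifier": {"n_neighbors": [3, 5, 11]},   # dict of lists
    "RandomForestClassifier": {"n_estimators": [100, 300],
                               "max_depth": [None, 5]},
    "SVC": [{"kernel": ["linear"], "C": [1, 10]},          # list of dicts
            {"kernel": ["rbf"], "C": [1, 10], "gamma": ["scale", 0.1]}],
}

# Flatten the two dictionaries into the list of grids GridSearchCV expects.
# Each grid swaps the pipeline's final "clf" step for a concrete classifier
# and prefixes that classifier's hyperparameters with "clf__".
grids_for_search = []
for name, model in models_for_gridsearch.items():
    param_sets = params_for_models[name]
    if not isinstance(param_sets, list):
        param_sets = [param_sets]
    for params in param_sets:
        grid = {f"clf__{key}": values for key, values in params.items()}
        grid["clf"] = [model]
        grids_for_search.append(grid)
```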

image by author

IV- Building and Fitting Pipeline

We have created the classes we need for our pipeline. It’s time to chain them in order and fit the result.

We have a pipeline that

  • Assigns data types to our columns,
  • Does basic data transformations on some of the columns,
  • Imputes missing values in the numerical columns and scales them,
  • Encodes categorical columns,
  • Reduces dimensions,
  • Passes the result to whichever classifier it is given (see the sketch after this list).
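Since the original wiring ships as a screenshot, here is one plausible reconstruction using the classes from Section II. The column lists, the ColumnTransformer layout and the PCA setting are my assumptions, not necessarily the author’s:

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

# Assumed column split; ColumnTransformer also pins down the dtypes
num_cols = ["age", "fare", "pclass", "sibsp", "parch"]
cat_cols = ["sex", "embark_town"]

preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", Imputer(strategy="median")),
                      ("scale", Scaler())]), num_cols),
    ("cat", Pipeline([("impute", Imputer(strategy="most_frequent")),
                      ("lowercase", FeatureTransformer(strlowercase=["embark_town"])),
                      ("encode", Encoder())]), cat_cols),
])

pipeline = Pipeline([
    ("prep", preprocessor),
    ("pca", PCA(n_components=0.95)),  # keep ~95% of the variance
    ("clf", ClassifierSwitcher()),    # replaced during the grid search
])
```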

After splitting the data into training and test sets, we feed this pipeline into GridSearchCV alongside the grid search parameters from Section III to find the best-scoring model and fit it on our training set.
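A minimal sketch of that step, building on the grids from Section III. The split ratio, cv=5 and scoring="roc_auc" are my assumptions (ROC-AUC matches the scores reported in the next section):

```python
from sklearn.model_selection import GridSearchCV, train_test_split

X = df.drop(columns=["survived", "alive"])  # "alive" duplicates the target
y = df["survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

pipeline_gridsearch = GridSearchCV(pipeline, grids_for_search,
                                   scoring="roc_auc", cv=5, n_jobs=-1)
pipeline_gridsearch.fit(X_train, y_train)

print("Best parameters:", pipeline_gridsearch.best_params_)
print(f"Best score: {pipeline_gridsearch.best_score_:.4f}")
```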

Best parameters: {'clf': LogisticRegression()}

Best score: 0.8573

Grid search model ranking (image by author)

V- Calculating Scores

We can use the fitted pipeline (pipeline_gridsearch) to calculate scores or to find the probability of each instance belonging to our target class.
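A sketch of this step (a fitted GridSearchCV delegates predict_proba to the refitted best pipeline):

```python
from sklearn.metrics import roc_auc_score

# Probability of the positive class for every instance
train_proba = pipeline_gridsearch.predict_proba(X_train)[:, 1]
test_proba = pipeline_gridsearch.predict_proba(X_test)[:, 1]

print(f"Train ROC-AUC: {roc_auc_score(y_train, train_proba):.4f}")
print(f"Test ROC-AUC: {roc_auc_score(y_test, test_proba):.4f}")
```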

Train ROC-AUC: 0.8649

Test ROC-AUC: 0.8281

image by author

One can see that the embark_town values are lower case due to our FeatureTransformer step. It adds no modelling value; the point was to demonstrate that we can transform a feature within the pipeline.

Conclusion

Customizing sklearn classes so that we can (i) transform and pre-process our features and (ii) fit multiple ML models over a variety of hyperparameters helps increase the readability of our code and gives us better control over the ML steps. This method does require sequential computing (please see the image below), as opposed to the parallelization that multiple separate pipelines could offer. Despite that marginal time loss, however, building one pipeline is more beneficial in an environment where we need to loop through many conditions, parameters and pre-processing steps to see their impact on the model.

image by author
