Three Techniques for Scaling Features for Machine Learning

Scaling features on the modern data stack

Photo by Julentto Photography on Unsplash

While scaling numeric features is not always necessary (as Praveen Thenraj explains in his Towards Data Science post, tree-based machine learning techniques do not need it), it does benefit linear and logistic regression, support vector machines, and neural networks. Many data scientists have built modeling scripts that automate the construction and testing of many different types of algorithms, allowing them to select the best-performing model for the data. Feature scaling is a common step in these scripts.

Two scaling techniques are most common. The first, standard scaling (or z-scaling), subtracts the mean from each value and divides by the standard deviation. The second, min-max scaling, subtracts the minimum value and divides by the difference between the maximum and minimum values.
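
To make the formulas concrete, here is a minimal sketch of both calculations on a small numpy array (the values are purely illustrative):

import numpy as np

# Toy feature with illustrative values
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Standard (z-) scaling: zero mean, unit standard deviation
z = (x - x.mean()) / x.std()

# Min-max scaling: maps values onto the [0, 1] interval
m = (x - x.min()) / (x.max() - x.min())

print(z)  # [-1.41421356 -0.70710678  0.          0.70710678  1.41421356]
print(m)  # [0.   0.25 0.5  0.75 1.  ]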

Scikit-learn-based scaling

The standard scaler can be applied to a list of columns scale_columns by importing StandardScaler from the preprocessing module and applying it to the dataframe as

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[scale_columns] = scaler.fit_transform(df[scale_columns])
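
One caveat worth noting: in a modeling workflow the scaler should be fit on the training data only and then reused on the test data, so that no information leaks from the test set. A short sketch, where train_df and test_df are hypothetical dataframes from an earlier split:

scaler = StandardScaler()
# Learn the means and standard deviations from the training data only
train_df[scale_columns] = scaler.fit_transform(train_df[scale_columns])
# Apply the training statistics to the test data without refitting
test_df[scale_columns] = scaler.transform(test_df[scale_columns])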

Similarly, the min-max scaler can be applied to the same list of columns as

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[scale_columns] = scaler.fit_transform(df[scale_columns])
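
By default, MinMaxScaler maps each feature onto the interval [0, 1]. If a different range is needed, for example [-1, 1], its feature_range parameter accepts the target interval:

scaler = MinMaxScaler(feature_range=(-1, 1))
df[scale_columns] = scaler.fit_transform(df[scale_columns])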

Pandas-only scaling

As scikit-learn is a rather large dependency to import just for scaling, it can be easier to do the scaling in pandas alone. To scale each feature by its mean and standard deviation, call

df[scale_columns] = (df[scale_columns] - df[scale_columns].mean()) / df[scale_columns].std()

Note that this does not reproduce the results from StandardScaler. This is because scikit-learn uses numpy's std function, which defaults to zero degrees of freedom (ddof=0), while pandas' std defaults to the unbiased estimator (ddof=1). To get the same values as StandardScaler, pass ddof=0:

df[scale_columns] = (df[scale_columns] - df[scale_columns].mean()) / df[scale_columns].std(ddof=0)
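
As a quick sanity check (assuming df and scale_columns as above), the pandas result with ddof=0 can be compared against StandardScaler directly:

import numpy as np
from sklearn.preprocessing import StandardScaler

pandas_scaled = (df[scale_columns] - df[scale_columns].mean()) / df[scale_columns].std(ddof=0)
sklearn_scaled = StandardScaler().fit_transform(df[scale_columns])

# Both now use ddof=0, so the values agree to floating-point precision
print(np.allclose(pandas_scaled.to_numpy(), sklearn_scaled))  # True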

Similarly, to apply min-max scaling, run

df[scale_columns] = (df[scale_columns] - df[scale_columns].min()) / (df[scale_columns].max() - df[scale_columns].min())
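
One caveat with these pandas one-liners: a constant column has zero standard deviation and a maximum equal to its minimum, so either formula divides by zero and produces NaN or inf values. A simple guard, sketched here, is to scale only the columns that actually vary:

# Skip constant columns, which would otherwise divide by zero
varying = [col for col in scale_columns if df[col].nunique() > 1]
df[varying] = (df[varying] - df[varying].min()) / (df[varying].max() - df[varying].min())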

RasgoQL scaling in the data warehouse

Instead of extracting the data from the data warehouse, the open-source Python package RasgoQL can create the scaled variables directly in the warehouse. First, this saves the time spent extracting the data and pushing it back to the warehouse after scaling. Second, by leveraging the power of the warehouse, much larger datasets can be transformed at once.
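
The snippets below assume a dataset handle pointing at a warehouse table has already been created. As a rough sketch (the credential helper and the fully qualified table name here are illustrative, and the exact setup depends on your warehouse), connecting to Snowflake looks something like:

import rasgoql

# Illustrative: read Snowflake credentials from environment variables
creds = rasgoql.SnowflakeCredentials.from_env()
rql = rasgoql.connect(creds)

# Hypothetical fully qualified table name in the warehouse
dataset = rql.dataset('DB.SCHEMA.DS_WEATHER')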

The standard scaler can be applied as

scaledset = dataset.standard_scaler(
    columns_to_scale=['DS_DAILY_HIGH_TEMP',
                      'DS_DAILY_LOW_TEMP',
                      'DS_DAILY_TEMP_VARIATION',
                      'DS_DAILY_TOTAL_RAINFALL'])
scaledset.save(table_name='DS_STANDARD_SCALED')

Similarly, the min-max scaler can be applied as

scaledset = dataset.min_max_scaler(
    columns_to_scale=['DS_DAILY_HIGH_TEMP',
                      'DS_DAILY_LOW_TEMP',
                      'DS_DAILY_TEMP_VARIATION',
                      'DS_DAILY_TOTAL_RAINFALL'])
scaledset.save(table_name='DS_MIN_MAX_SCALED')

These scaled features are now available for use, either joined back to the original data or combined with any other data. They can also be used in modeling pipelines, data visualizations, and production prediction pipelines. The scaled data can be downloaded into a pandas dataframe for modeling by calling to_df:

df = scaledset.to_df()

If you’d like to check the SQL RasgoQL uses to create this table or view, run sql:

print(scaledset.sql())

Finally, if you use dbt in production and would like to export this work as a dbt model for use by your data engineering team, call to_dbt:

scaledset.to_dbt(project_directory='<path to dbt project>')

When working with small amounts of data on a single machine, the scikit-learn and pandas approaches work well for scaling features before modeling. However, if the data is large or already stored in a database, performing the scaling in the database (using RasgoQL or SQL) can save significant time during data preparation. More importantly, the same code can easily be put into a production workflow, potentially saving weeks of effort when moving models into production.

If you want to check out RasgoQL, its documentation and repository are available online.


Three Techniques for Scaling Features for Machine Learning was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


