Scaling features on the modern data stack
While scaling numeric features is not always needed (as Praveen Thenraj explains in his Towards Data Science post, tree-based machine learning techniques do not require it), it does benefit linear and logistic regression, support vector machines, and neural networks. Many data scientists have built modeling scripts that automate the construction and testing of many different types of modeling algorithms, allowing them to select the best-performing model for the data at hand. Feature scaling is a common step in these scripts.
There are two primary scaling techniques. The first, standard scaling (or z-scaling), subtracts the mean and divides by the standard deviation. The second, min-max scaling, subtracts the minimum value and divides by the difference between the maximum and minimum values.
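As a quick numeric illustration of the two formulas, here is a minimal sketch using numpy on a made-up list of values (the numbers are purely illustrative):
import numpy as np

values = np.array([10.0, 12.0, 14.0, 20.0])  # illustrative data

# Standard (z-) scaling: subtract the mean, divide by the standard deviation
z_scaled = (values - values.mean()) / values.std()

# Min-max scaling: subtract the minimum, divide by the range
min_max_scaled = (values - values.min()) / (values.max() - values.min())

print(z_scaled)        # centered around zero
print(min_max_scaled)  # mapped onto the interval [0, 1]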
Scikit-learn based scaling
The standard scaler can be applied to scale a list of columns scale_columns by importing StandardScaler from the preprocessing module and applying it to the dataframe as
from sklearn.preprocessing import StandardScaler

# Create the scaler, then fit it on the selected columns and replace them with z-scaled values
scaler = StandardScaler()
df[scale_columns] = scaler.fit_transform(df[scale_columns])
Similarly, the min-max scaler can be applied to the same list of columns as
from sklearn.preprocessing import MinMaxScaler

# Create the scaler, then fit it on the selected columns and rescale them to the range [0, 1]
scaler = MinMaxScaler()
df[scale_columns] = scaler.fit_transform(df[scale_columns])
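In a typical train/test workflow, the scaler is fit on the training data only and then reused, via transform, on any held-out data so that the same means and ranges are applied everywhere. A minimal sketch, assuming hypothetical train_df and test_df dataframes that both contain the scale_columns:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Learn the minimum and maximum from the training data only
train_df[scale_columns] = scaler.fit_transform(train_df[scale_columns])

# Reuse the fitted scaler on the test data (no re-fitting)
test_df[scale_columns] = scaler.transform(test_df[scale_columns])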
Pandas only based scaling
As scikit-learn is a rather large dependency, if it is only being used to scale features, it can be easier to do the scaling directly in pandas. To scale each feature by its mean and standard deviation, call
df[scale_columns] = (df[scale_columns] - df[scale_columns].mean()) / df[scale_columns].std()
Note that this does not exactly reproduce the results from StandardScaler. Scikit-learn uses numpy’s std function, which defaults to zero degrees of freedom (the population standard deviation), while pandas’ std defaults to the unbiased estimator (ddof=1). To get the same values as StandardScaler, pass ddof=0:
df[scale_columns] = (df[scale_columns] - df[scale_columns].mean()) / df[scale_columns].std(ddof=0)
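As a quick sanity check that the pandas version now matches scikit-learn (assuming df[scale_columns] still holds the original, unscaled values), the two results can be compared with numpy:
import numpy as np
from sklearn.preprocessing import StandardScaler

pandas_scaled = (df[scale_columns] - df[scale_columns].mean()) / df[scale_columns].std(ddof=0)
sklearn_scaled = StandardScaler().fit_transform(df[scale_columns])

# Should print True: both divide by the population (ddof=0) standard deviation
print(np.allclose(pandas_scaled.to_numpy(), sklearn_scaled))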
Similarly, to apply min-max scaling, run
df[scale_columns] = (df[scale_columns] - df[scale_columns].min()) / (df[scale_columns].max() - df[scale_columns].min())
RasgoQL scaling in the data warehouse
Instead of extracting the data from the data warehouse, the open-source Python package RasgoQL can create the scaled features directly in the warehouse. First, this saves the time spent extracting the data and pushing it back to the warehouse after scaling. Second, by leveraging the power of the warehouse, much larger datasets can be transformed at once.
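The snippets below operate on a dataset handle that points at a table in the warehouse. As a rough sketch of that setup, assuming a Snowflake warehouse with credentials stored in environment variables (the fully qualified table name is a placeholder, and the exact credential helper depends on your warehouse and RasgoQL version):
import rasgoql

# Build warehouse credentials from environment variables (assumed Snowflake setup)
creds = rasgoql.SnowflakeCredentials.from_env()
rql = rasgoql.connect(creds)

# Get a handle to an existing table in the warehouse (placeholder table name)
dataset = rql.dataset(fqtn='DATABASE.SCHEMA.DAILY_WEATHER')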
The standard scaler can be applied as
scaledset = dataset.standard_scaler(
    columns_to_scale=['DS_DAILY_HIGH_TEMP',
                      'DS_DAILY_LOW_TEMP',
                      'DS_DAILY_TEMP_VARIATION',
                      'DS_DAILY_TOTAL_RAINFALL'])
scaledset.save(table_name='DS_STANDARD_SCALED')
Similarly, the min-max scaler can be applied as
scaledset = dataset.min_max_scaler(
    columns_to_scale=['DS_DAILY_HIGH_TEMP',
                      'DS_DAILY_LOW_TEMP',
                      'DS_DAILY_TEMP_VARIATION',
                      'DS_DAILY_TOTAL_RAINFALL'])
scaledset.save(table_name='DS_MIN_MAX_SCALED')
These scaled features are now available for use, either joined back to the original data or combined with any other data. They can also be used in modeling pipelines, data visualizations, and production prediction pipelines. The scaled data can be downloaded into pandas for use in modeling by calling to_df:
df = scaledset.to_df()
If you’d like to check the SQL RasgoQL uses to create this table or view, run sql:
print(scaledset.sql())
Finally, if you use dbt in production and would like to export this work as a dbt model for use by your data engineering team, call to_dbt:
scaledset.to_dbt(project_directory='<path to dbt project>')
When working with small amounts of data on a single machine, the scikit-learn and pandas approaches work well for scaling features before modeling. However, when the data is large or already stored in a database, performing the scaling in the database (using RasgoQL or SQL) can save significant time during data preparation. More importantly, the code used to scale the data can easily be put into a production workflow, potentially saving weeks of effort when moving models into production.
If you want to check out RasgoQL, the documentation can be found here and the repository here.
Three Techniques for Scaling Features for Machine Learning was originally published in Towards Data Science on Medium.