One-hot encoding in Python and on the data warehouse
While most machine learning algorithms only work with numeric values, many important real-world features are not numeric but categorical. Categorical features take on a fixed set of levels, or values. These can be natural categories such as age group, state, or customer type. Alternatively, they can be created by binning an underlying numeric feature, such as grouping individuals into age ranges (e.g., 0–10, 11–18, 19–30, 31–50, etc.). Finally, they can be numeric identifiers where the relationship between the values is not meaningful. ZIP codes are a common example: two ZIP codes that are close numerically may be farther apart geographically than two that are distant numerically.
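As a rough illustration of the binning case, the sketch below maps an integer age to a range label using only the standard library. The function name `age_to_bin`, the non-overlapping boundaries, and the open-ended final bin `51+` are assumptions made for this example, not part of the original analysis.

```python
from bisect import bisect_left

# Inclusive upper bound of each closed age range.
# The open-ended last bin ("51+") is an assumption for this sketch.
UPPER_BOUNDS = [10, 18, 30, 50]
LABELS = ["0-10", "11-18", "19-30", "31-50", "51+"]


def age_to_bin(age: int) -> str:
    """Map a non-negative integer age to its age-range label."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_left returns the index of the first bound >= age,
    # which is also the index of the matching label.
    return LABELS[bisect_left(UPPER_BOUNDS, age)]


ages = [4, 18, 19, 30, 31, 72]
binned = [age_to_bin(a) for a in ages]
```

The resulting labels are ordinary categorical levels and can be encoded like any other categorical feature.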
Since these categorical features cannot be directly used in most machine learning algorithms, the categorical features need to be transformed into numerical features. While numerous techniques exist to transform these features, the most common technique is one-hot encoding.
In one-hot encoding, a categorical variable is converted into a set of binary indicators, one per category that appears in the dataset. So for a feature with the seven levels clear, partly cloudy, rain, wind, snow, cloudy, and fog, seven new variables are created, each containing either 1 or 0. Then, for each observation, the variable that matches its category is set to 1 and all the others are set to 0.
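To make the mechanics concrete, here is a minimal pure-Python sketch of the idea. The helper `one_hot_encode` and the sample data are hypothetical and only illustrate the transformation. Note that only levels actually present in the data receive an indicator.

```python
def one_hot_encode(values):
    """Return (levels, rows): the sorted distinct levels and one 0/1 row per value."""
    levels = sorted(set(values))
    position = {level: i for i, level in enumerate(levels)}
    rows = []
    for value in values:
        row = [0] * len(levels)      # start with every indicator off
        row[position[value]] = 1     # switch on the indicator for this level
        rows.append(row)
    return levels, rows


weather = ["clear", "rain", "fog", "clear", "snow"]
levels, encoded = one_hot_encode(weather)

for value, row in zip(weather, encoded):
    print(value, dict(zip(levels, row)))
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.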
Encoding with scikit-learn
One-hot encoding is easy to perform with the scikit-learn preprocessing library. In this code, we will one-hot encode the column stored in the variable column from the dataframe df.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
onehotarray = encoder.fit_transform(df[[column]]).toarray()
items = [f'{column}_{item}' for item in encoder.categories_[0]]
df[items] = onehotarray
This adds the columns created by the one-hot encoder back into the original dataframe and names each added column with the original column name followed by the category level.
One-hot encoding in the database
Suppose the data already exists in a data warehouse. Instead of extracting the data into pandas on your system, converting it, and pushing the result back to the warehouse, the open-source Python package RasgoQL can perform this encoding directly on the warehouse without moving the data at all. This approach saves the time and expense of moving the data, can work with datasets too large to fit into memory on a single machine, and automatically generates the encoded variables as new data arrives in the warehouse. This means the encoding works not just for existing data and modeling pipelines but is automatically available for production pipelines.
To perform the one-hot encoding using RasgoQL, the following code can be used.
onehotset = dataset.one_hot_encode(column=column)
Again, the column to be encoded is stored in the Python variable column.
This data can be downloaded to the Python environment as a sample of ten rows in a dataframe by running preview:
onehot_df = onehotset.preview()
Or the full data can be downloaded as:
onehot_df = onehotset.to_df()
To make this data available to everyone as a view on the data warehouse, it can be published using save:
onehotset.save(table_name='<One hot Tablename>',
table_type='view')
Alternatively, to save this as a table, change table_type from ‘view’ to ‘table’. If you’d like to check the SQL RasgoQL uses to create this table or view, run sql:
print(onehotset.sql())
Finally, if you use dbt in production and would like to export this work as a dbt model for use by your data engineering team, call to_dbt:
onehotset.to_dbt(project_directory='<path to dbt project>')
One-hot encoding is one of the most common feature engineering techniques and is applied in many machine learning pipelines. It is easy to apply using scikit-learn, but unless the production system is running Python, the step will have to be rewritten to migrate it to production. In the modern data stack, this migration often involves rewriting the code in SQL.
To perform one-hot encoding in SQL, one must first identify all of the levels in the categorical variable and then generate CASE statements to populate a binary indicator variable for each level. This can be very tedious. Using RasgoQL, the SQL code is generated automatically and saved into the data warehouse. This means that while still working in Python and pandas, the data scientist has already generated code that can be used directly in production.
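The two manual steps (find the levels, then write one CASE expression per level) can be sketched with Python's built-in sqlite3 module standing in for the warehouse. This is only an illustration of the technique and is not the SQL that RasgoQL generates. The function `build_one_hot_sql`, the `observations` table, and its contents are hypothetical, and the sketch assumes trusted table and column names and level values without double quotes.

```python
import sqlite3


def build_one_hot_sql(conn, table, column):
    """Build a SELECT that adds one 0/1 indicator column per level of `column`."""
    # Step 1: identify every non-null level in the categorical column.
    level_query = (
        f'SELECT DISTINCT "{column}" FROM "{table}" '
        f'WHERE "{column}" IS NOT NULL ORDER BY 1'
    )
    levels = [row[0] for row in conn.execute(level_query)]

    # Step 2: generate one CASE expression per level.
    cases = []
    for level in levels:
        literal = str(level).replace("'", "''")            # escape quotes in the literal
        alias = f"{column}_{level}".replace(" ", "_")      # e.g. weather_partly_cloudy
        cases.append(
            f"CASE WHEN \"{column}\" = '{literal}' THEN 1 ELSE 0 END AS \"{alias}\""
        )
    return "SELECT *,\n  " + ",\n  ".join(cases) + f'\nFROM "{table}"'


# Toy in-memory table standing in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (id INTEGER, weather TEXT)")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?)",
    [(1, "clear"), (2, "rain"), (3, "partly cloudy"), (4, "rain")],
)

sql = build_one_hot_sql(conn, "observations", "weather")
cursor = conn.execute(sql)
column_names = [d[0] for d in cursor.description]
rows = sorted(cursor.fetchall())
print(sql)
```

Because the list of levels is baked into the generated query, the SQL has to be regenerated whenever a new level appears in the data.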
The process of converting the code developed during modeling into production-ready code can be among the most time-consuming steps of the entire data science lifecycle. Using a tool that makes it easy to transfer the modeling code into production can be a huge time saver. In addition, by moving the data preparation tasks to the data warehouse, the data scientist can work with the full data stored in the warehouse and is no longer limited by the time it takes to transfer the data to their computational environment or by the size limit imposed by that environment's memory.
If you want to check out RasgoQL, the documentation can be found here and the repository here.
Categorical Variables for Machine Learning Algorithms was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.