Imputation on the Modern Data Stack

Imputation at scale

Photo by Sigmund on Unsplash

Since the majority of machine learning algorithms do not handle null values, missing value imputation is a standard part of feature engineering pipelines. There are multiple techniques for this imputation ranging from the very simple to the creation of complicated models to infer what the value should be.

The simplest method is to impute a single value for the missing value, most commonly the mean or median of the feature. For tree-based methods, imputing an extreme value (larger than the maximum or smaller than the minimum for the feature) also works. Often, data scientists feel that these simple techniques do not adequately replace the missing value. In these cases, a model can be created to impute the value based on the other features for that observation. This can range from relatively simple (kNN models) to complicated models similar in scope to the original machine learning effort. While this model-based approach can be appealing, the power of modern machine learning algorithms means that imputing a single value and creating a missing indicator can perform just as well with much lower effort. In this blog, we will impute the mean, but simple changes to the code would allow us to impute other values instead.
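
As an aside, a minimal pandas sketch of the extreme-value approach for a tree-based model might look like this (the column name and the sentinel value of -999 are illustrative, not a recommendation):

import pandas as pd
import numpy as np
df = pd.DataFrame({'varname': [1.0, 2.0, np.nan, 4.0]})
# Fill missing values with a sentinel well outside the observed range
df['varname'] = df['varname'].fillna(-999)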

Imputation in Python

Within Python, there are two common ways to perform this imputation. First, it can be done directly with pandas:

df['varname'] = df['varname'].fillna(df['varname'].mean())

To include a missing indicator, it must be created before the missing values are filled in:

df['varname_mi'] = df['varname'].isna()
df['varname'] = df['varname'].fillna(df['varname'].mean())

Often, this is done with scikit-learn as part of the feature engineering pipeline. After using train_test_split to create the training and test partitions, a SimpleImputer can be created to replace missing values:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1066)
# Fit the imputer on the training data, then apply it to both partitions
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X_train = imp.fit_transform(X_train)
X_test = imp.transform(X_test)

Unlike with pandas, SimpleImputer can simultaneously create a missing value indicator by setting the parameter add_indicator to True.

imp = SimpleImputer(missing_values=np.nan,
                    strategy='mean',
                    add_indicator=True)

Challenges

Both of these methods work for datasets that are manageable on an individual workstation. For problems where the datasets contain hundreds of gigabytes or terabytes of data, this approach no longer works. While packages such as Dask allow you to process these massive datasets, you still need to move the data to your machine or cluster. With datasets of this size, the time to move the data can be significant. Even with manageable data sizes, these Python-based approaches can still cause problems when the final model goes into production, as the imputations will need to be duplicated in the production code. This can be challenging when the production environment isn't in Python or doesn't support loading Python objects.

Imputation using the power of the cloud data warehouse

Using RasgoQL can alleviate both of these issues. RasgoQL performs the imputation within the data warehouse and allows you to save all feature engineering steps so they can be rerun in the warehouse at the time of prediction.

To use RasgoQL’s transforms, first connect to the system with RasgoQL.

import rasgoql
creds = rasgoql.SnowflakeCredentials.from_env()
rql = rasgoql.connect(creds)

Then get the dataset (a table or view in the data warehouse) that needs missing value imputation performed on it with rql.dataset(fqtn=<Fully Qualified Table Name>). All supported transforms can be found in the documentation or by listing them programmatically:

transforms = rql.list_transforms()
print(transforms)
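
For example, retrieving a dataset might look like this (the fully qualified table name is a placeholder for your own table):

dataset = rql.dataset(fqtn='DB.SCHEMA.TABLE_WITH_MISSING_VALUES')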

To transform the data, call the transform name on the dataset.

newset = dataset.impute(imputations={'VARIABLE_1': 'mean',
                                     'VARIABLE_2': 'mean'},
                        flag_missing_vals=False)

Note that the imputation transformation takes a dictionary of the fields that need imputation and allows a different method for each field: mean, median, mode, or a single value. In addition, setting flag_missing_vals to True will cause a missing indicator feature to be created for each field in the imputation dictionary.
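
For example, a sketch mixing strategies and requesting missing indicators (the column names, and the use of 0 as a literal fill value, are illustrative):

newset = dataset.impute(imputations={'VARIABLE_1': 'median',
                                     'VARIABLE_2': 'mode',
                                     'VARIABLE_3': 0},
                        flag_missing_vals=True)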

Transformation chaining

Multiple transformations can be chained to one another, either by appending multiple calls to transformation functions or by running a transformation function on the output of a prior transformation. This allows you to build up more complicated transformations from relatively simple ones and create your final dataset with a single SQL call. The results of the chain of transformations can be viewed as a dataframe with:

df_transform = newset.preview()
print(df_transform)

By default, preview returns ten rows. To download the entire dataset as a dataframe, call to_df.
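
For example, a minimal sketch of chaining a second imputation onto the first and retrieving the results (VARIABLE_3 is an illustrative column name):

chained = newset.impute(imputations={'VARIABLE_3': 'median'})
df_preview = chained.preview()   # first ten rows of the chained result
df_full = chained.to_df()        # full result as a dataframe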

Since these transformations run on the cloud data warehouse, they execute as SQL. The SQL statements for the transformation chain can be viewed by running:

print(newset.sql())

While this code creates the transformation chain, it has not yet been executed and the results have not been saved in the warehouse. To save the results, call save on the chain to create a new view or table in the warehouse.

newset.save(table_name='<NEW TABLE NAME>',
            table_type='view')

This saves the data as a view. To save it as a table instead, change table_type to 'table'.

At this point, Rasgo has run the missing value imputation against the full data stored in the data warehouse, and the results can be returned as a dataframe to any Python environment for modeling or visualization with standard tools. In addition, if the result was saved as a view, the data preparation phase doesn't need to be rerun when new data is added to the underlying table: the imputed data is automatically kept current and available for download.

The fact that this preparation runs automatically simplifies putting the data preparation pipeline into production. Instead of packaging up the Python code or rewriting it in the language used by the production system, the data is automatically kept up to date in the warehouse, and the relevant rows can be extracted with the imputation already applied.

Now, instead of losing weeks or months of time rewriting (or waiting for software engineering to rewrite) and testing the output of the code, the data is immediately available.

Finally, because this imputation is immediately available in the data warehouse, other data scientists and analysts no longer need to extract the original data and prepare it for their own use. Instead, the final, transformed data can be extracted directly from the warehouse and used for modeling, visualization, or reporting.

To understand the power of these tools, a dataset was created that included missing data in several key fields and uploaded to our data warehouse. The mean was imputed for the missing fields both with and without creating a missing indicator. Finally, a CatBoost classifier was trained for both cases and the log loss and AUC were both calculated.

import numpy as np
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split
y = transformed_df[target]
X = transformed_df.drop(columns=[target])
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=.2,
                                                    random_state=1066)
# Treat the non-numeric columns as categorical features
categorical_features = X_train.select_dtypes(exclude=[np.number]).columns.tolist()
train_pool = Pool(X_train, y_train, cat_features=categorical_features)
test_pool = Pool(X_test, y_test, cat_features=categorical_features)
model = CatBoostClassifier(iterations=1000,
                           max_depth=5,
                           learning_rate=0.05,
                           random_seed=1066,
                           logging_level='Silent',
                           loss_function='Logloss')
# Train on the training pool with early stopping against the evaluation pool
model.fit(train_pool,
          eval_set=test_pool,
          use_best_model=True,
          early_stopping_rounds=50)
y_predict = model.predict_proba(X_test)
logloss = log_loss(y_test, y_predict[:, 1])
auc = roc_auc_score(y_test, y_predict[:, 1])
print(logloss, auc)

This gives the final results:

  • Without missing indicator: AUC: 0.6881, Log loss: 0.2574
  • With missing indicator: AUC: 0.7045, Log loss: 0.2563

In this case, using CatBoost, the model with the missing indicator slightly outperformed the model without the indicator.

If you want to check out RasgoQL, the documentation can be found here and the repository here.

