Introduction to Anomaly Detection in Python with PyCaret

A step-by-step, beginner-friendly tutorial for unsupervised anomaly detection tasks using PyCaret

Photo by Luka Slapnicar on Unsplash

1. Introduction

PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that speeds up the experiment cycle exponentially and makes you more productive.

Compared with other open-source machine learning libraries, PyCaret is a low-code alternative that can replace hundreds of lines of code with only a few. This makes experiments dramatically faster and more efficient. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and a few more.

The design and simplicity of PyCaret are inspired by the emerging role of citizen data scientists, a term first used by Gartner. Citizen Data Scientists are power users who can perform both simple and moderately sophisticated analytical tasks that would previously have required more technical expertise.

To learn more about PyCaret, you can check the official website or GitHub.

2. Objective of Tutorial

In this tutorial we will learn:

  • Getting Data: How to import data from the PyCaret repository.
  • Setting up Environment: How to set up an unsupervised anomaly detection experiment in PyCaret.
  • Create Model: How to create a model and assign anomaly labels to the dataset for analysis.
  • Plot Model: How to analyze model performance using various plots.
  • Predict Model: How to assign anomaly labels to new/unseen data based on the trained model.
  • Save / Load Model: How to save and load the model for future use.

3. Installing PyCaret

Installation is easy and will only take a few minutes. PyCaret’s default installation from pip only installs hard dependencies as listed in the requirements.txt file.

pip install pycaret

To install the full version:

pip install pycaret[full]
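
On some shells, such as zsh, the square brackets are treated as glob characters and need quoting:

pip install "pycaret[full]"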

4. What is Anomaly Detection?

Anomaly Detection is the task of identifying the rare items, events, or observations that raise suspicions by differing significantly from the majority of the data. Typically the anomalous items will translate to some kind of problems such as bank fraud, a structural defect, medical problems, or errors in a text. There are three broad categories of anomaly detection techniques that exist:

  • Unsupervised anomaly detection: Unsupervised anomaly detection techniques detect anomalies in an unlabeled dataset by looking for the instances that fit the rest of the data least well, under the assumption that the majority of instances are normal.
  • Supervised anomaly detection: This technique requires a dataset that has been labeled as “normal” and “abnormal” and involves training a classifier.
  • Semi-supervised anomaly detection: This technique constructs a model representing normal behavior from a given normal training dataset, and then tests the likelihood of a test instance to be generated by the learned model.

5. Overview of Anomaly Detection Module in PyCaret

PyCaret’s anomaly detection module (pycaret.anomaly) is an unsupervised machine learning module that performs the task of identifying rare items, events, or observations that raise suspicions by differing significantly from the majority of the data.

PyCaret's anomaly detection module provides several pre-processing features that can be configured when initializing the experiment through the setup function. It has over 12 algorithms and a few plots to analyze the results of anomaly detection. PyCaret's anomaly detection module also implements a unique function, tune_model, that allows you to tune the hyperparameters of the anomaly detection model to optimize a supervised learning objective, such as AUC for classification or R2 for regression.
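
As a rough sketch of that workflow (hedged: this follows the PyCaret 2.x API, assumes setup has already been run, and 'class' is an assumed label column in the data rather than something created in this tutorial):

# tune the fraction hyperparameter of an anomaly detector against a
# supervised objective; 'class' is an assumed label column (hypothetical here)
tuned_iforest = tune_model(model = 'iforest', supervised_target = 'class')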

6. Dataset for the Tutorial

In this tutorial, we will use a dataset from UCI called Mice Protein Expression. The data set consists of the expression levels of 77 proteins that produced detectable signals in the nuclear fraction of the cortex. The dataset contains a total of 1080 measurements per protein. Each measurement can be considered as an independent sample (mouse).

Dataset Citation:

Higuera C, Gardiner KJ, Cios KJ (2015) Self-Organizing Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down Syndrome. PLoS ONE 10(6): e0129126. https://doi.org/10.1371/journal.pone.0129126

You can download the data from the original UCI source and load it using pandas, or you can use PyCaret’s data repository to load the data using the get_data() function (this requires an internet connection).
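
For the pandas route, a minimal sketch looks like this (the file name matches the UCI distribution, but treat the path as a placeholder for wherever you saved the download):

import pandas as pd

# load a local copy of the Mice Protein Expression data from UCI
# (the UCI distribution is an Excel file)
dataset = pd.read_excel('Data_Cortex_Nuclear.xls')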

License:

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

This allows for the sharing and adaptation of the dataset for any purpose, provided that appropriate credit is given.

from pycaret.datasets import get_data
dataset = get_data('mice')
# check the shape of data
dataset.shape
>>> (1080, 82)

In order to demonstrate the use of the predict_model function on unseen data, a sample of 5% (54 records) has been withheld from the original dataset to be used for predictions at the end of the experiment.

data = dataset.sample(frac=0.95, random_state=786)
data_unseen = dataset.drop(data.index)

data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
>>> Data for Modeling: (1026, 82)
>>> Unseen Data For Predictions: (54, 82)

7. Setting up Environment in PyCaret

The setup function in PyCaret initializes the environment and creates the transformation pipeline for modeling and deployment. setup must be called before executing any other function in PyCaret. It takes only one mandatory parameter: a pandas dataframe. All other parameters are optional and can be used to customize the preprocessing pipeline.

When setup is executed, PyCaret's inference algorithm automatically infers the data types for all features based on certain properties. The data types are usually inferred correctly, but this is not always the case. To handle this, PyCaret displays a prompt asking for data type confirmation once you execute setup. You can press enter if all data types are correct or type quit to exit the setup.

Ensuring that the data types are correct is really important in PyCaret as it automatically performs multiple type-specific preprocessing tasks which are imperative for machine learning models.

Alternatively, you can also use numeric_features and categorical_features parameters in the setup to pre-define the data types.
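
For example, a sketch of pre-defining types (the column names are illustrative choices from this dataset, not a prescription, and the call assumes setup has been imported from pycaret.anomaly as in the next cell):

# pre-define data types instead of relying on type inference
# (column names here are illustrative)
exp = setup(data,
            categorical_features = ['Genotype', 'Treatment', 'Behavior'],
            ignore_features = ['MouseID'])

For this tutorial, we rely on the inferred types and initialize the actual experiment below: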

from pycaret.anomaly import *

exp_ano101 = setup(data, normalize = True,
                   ignore_features = ['MouseID'],
                   session_id = 123)

Once setup has been successfully executed, it displays an information grid containing some important details about the experiment. Most of the information relates to the pre-processing pipeline constructed when setup is executed. The majority of these features are out of scope for this tutorial; however, a few important things to note are:

  • session_id: A pseudo-random number used as a seed in all functions for later reproducibility. If no session_id is passed, a random number is automatically generated and distributed to all functions. In this experiment, session_id is set to 123 for later reproducibility.
  • Missing Values: When there are missing values in the original data, this shows as True. Notice that Missing Values in the information grid above is True, as the data contains missing values, which are automatically imputed using the mean for numeric features and a constant for categorical features. The method of imputation can be changed using the numeric_imputation and categorical_imputation parameters in setup (see the sketch after this list).
  • Original Data: Displays the original shape of the dataset. In this experiment (1026, 82) means 1026 samples and 82 features.
  • Transformed Data: Displays the shape of the transformed dataset. Notice that the shape of the original dataset (1026, 82) is transformed into (1026, 91). The number of features has increased due to the encoding of categorical features in the dataset.
  • Numeric Features: Number of features inferred as numeric. In this dataset, 77 out of 82 features are inferred as numeric.
  • Categorical Features: Number of features inferred as categorical. In this dataset, 5 out of 82 features are inferred as categorical. Also, notice that we have ignored one categorical feature, MouseID, using the ignore_features parameter.
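
As a minimal sketch of changing the imputation strategy mentioned above ('median' and 'mode' are the documented alternatives in PyCaret 2.x; the call is illustrative and not part of this experiment):

# re-run setup with median imputation for numeric features and
# mode imputation for categorical features (illustrative only)
exp = setup(data, normalize = True,
            numeric_imputation = 'median',
            categorical_imputation = 'mode',
            ignore_features = ['MouseID'],
            session_id = 123)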

Notice how a few tasks that are imperative to perform modeling are automatically handled such as missing value imputation, categorical encoding, etc. Most of the parameters in the setup function are optional and used for customizing the pre-processing pipeline. These parameters are out of scope for this tutorial but I will write more about them later.

8. Create a Model

Creating an anomaly detection model in PyCaret is simple and similar to how you would create a model in the supervised modules of PyCaret. An anomaly detection model is created using the create_model function, which takes one mandatory parameter: the name of the model as a string. This function returns a trained model object. See the example below:

iforest = create_model('iforest')
print(iforest)
>>> OUTPUT
IForest(behaviour='new', bootstrap=False, contamination=0.05, max_features=1.0, max_samples='auto', n_estimators=100, n_jobs=-1, random_state=123, verbose=0)

We have created an Isolation Forest model using create_model. Notice that the contamination parameter is set to 0.05, which is the default value when you do not pass the fraction parameter. The fraction parameter determines the proportion of outliers in the dataset. In the example below, we will create a One-Class Support Vector Machine model with a fraction of 0.025.

svm = create_model('svm', fraction = 0.025)
print(svm)
>>> OUTPUT
OCSVM(cache_size=200, coef0=0.0, contamination=0.025, degree=3, gamma='auto', kernel='rbf', max_iter=-1, nu=0.5, shrinking=True, tol=0.001, verbose=False)

To see the complete list of models available in the model library, please check the documentation or use the models function.

models()

9. Assign a Model

Now that we have created a model, we would like to assign the anomaly labels to our training dataset (1026 samples) to analyze the results. We will achieve this using the assign_model function.

iforest_results = assign_model(iforest)
iforest_results.head()

Notice that two columns, Anomaly and Anomaly_Score, are added at the end. 0 stands for inliers and 1 for outliers/anomalies. Anomaly_Score contains the values computed by the algorithm; outliers are assigned larger anomaly scores. Notice that iforest_results also includes the MouseID feature that we dropped during setup. It wasn't used by the model and is only appended to the dataset when you use assign_model.
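
A quick way to pull out only the flagged rows for inspection, most anomalous first (a small pandas sketch using the columns just described):

# filter the flagged anomalies and sort by anomaly score, descending
anomalies = iforest_results[iforest_results['Anomaly'] == 1]
anomalies.sort_values('Anomaly_Score', ascending=False).head()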

10. Plot a Model

The plot_model function can be used to analyze the anomaly detection model from different aspects. This function takes a trained model object and returns a plot.

10.1 T-distributed Stochastic Neighbor Embedding (t-SNE)

plot_model(iforest, plot = 'tsne')

10.2 Uniform Manifold Approximation and Projection (UMAP)

plot_model(iforest, plot = 'umap')

11. Predict on Unseen Data

The predict_model function is used to assign anomaly labels to a new unseen dataset. We will now use our trained iforest model to predict the data stored in data_unseen. This variable was created at the beginning of the tutorial and contains 54 samples from the original dataset that were never exposed to PyCaret.

unseen_predictions = predict_model(iforest, data=data_unseen)
unseen_predictions.head()

The Anomaly column indicates the outlier status (1 = outlier, 0 = inlier). Anomaly_Score contains the values computed by the algorithm; outliers are assigned larger anomaly scores. You can also use the predict_model function to label the training data.

data_predictions = predict_model(iforest, data = data)
data_predictions.head()

12. Saving the model

We have now finished the experiment by using our iforest model to predict labels on unseen data.

This brings us to the end of our experiment, but one question is still to be asked: What happens when you have more new data to predict? Do you have to go through the entire experiment again? The answer is no, PyCaret’s inbuilt function save_model allows you to save the model along with the entire transformation pipeline for later use.

save_model(iforest, 'Final IForest Model 25Nov2020')

To load a saved model at a future date in the same or an alternative environment, we would use PyCaret’s load_model function and then easily apply the saved model on new unseen data for prediction.

saved_iforest = load_model('Final IForest Model 25Nov2020')
new_prediction = predict_model(saved_iforest, data=data_unseen)
new_prediction.head()

13. Wrap-up / Next Steps

We have only covered the basics of PyCaret’s Anomaly Detection Module. In the following tutorials, we will go deeper into advanced pre-processing techniques that allow you to fully customize your machine learning pipeline and are a must-know for any data scientist.

Thank you for reading 🙏

Important Links

Tutorials New to PyCaret? Check out our official notebooks!
📋 Example Notebooks created by the community.
📙 Blog Tutorials and articles by contributors.
📚 Documentation The detailed API docs of PyCaret
📺 Video Tutorials Our video tutorial from various events.
📢 Discussions Have questions? Engage with community and contributors.
🛠️ Changelog Changes and version history.
🌳 Roadmap PyCaret’s software and community development plan.

Author:

I write about PyCaret and its use cases in the real world. If you would like to be notified automatically, you can follow me on Medium, LinkedIn, and Twitter.

