Unsupervised Anomaly Detection in Python

https://ift.tt/33frH23

A beginner’s guide

Photo by Mulyadi on Unsplash

What is anomaly detection?

Anomaly detection, also known as outlier detection, is the process of finding data points within a dataset that differ from the rest. Common applications of anomaly detection include fraud detection in financial transactions, fault detection and predictive maintenance.

Broadly speaking, anomaly detection can be categorized into supervised and unsupervised approaches. Supervised anomaly detection requires a labelled dataset that indicates whether each record is “normal” or “abnormal”. Unsupervised anomaly detection works on an unlabeled dataset. It assumes that the majority of data points in the unlabeled dataset are “normal” and looks for data points that differ from them.
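To make the unsupervised assumption concrete, here is a minimal sketch that is separate from PyCaret: it lets the bulk of a one-dimensional sample define what “normal” is and flags values that sit far from the median. The function name flag_outliers, the cutoff of 3.5 and the sample readings are illustrative assumptions, not something taken from the library.

```python
from statistics import median


def flag_outliers(values, cutoff=3.5):
    """Return a list of 0/1 flags, with 1 marking values far from the median.

    Uses a modified z-score based on the median absolute deviation (MAD).
    No labels are needed: the majority of points define what is "normal".
    """
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        # At least half of the points equal the median, so the score is
        # undefined; this sketch simply flags nothing in that case.
        return [0 for _ in values]
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [1 if 0.6745 * abs(v - centre) / mad > cutoff else 0 for v in values]


# Hypothetical sensor readings with one obvious outlier.
readings = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 25.0, 10.0]
flags = flag_outliers(readings)
print(flags)
```

Only the reading of 25.0 is flagged here. Real detectors such as the ones used later in this article work on many features at once, but the underlying idea is the same.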

In this article, we will use PyCaret to detect anomalies. PyCaret is an Automated Machine Learning (AutoML) tool that can be used for both supervised and unsupervised learning.

Unsupervised Anomaly Detection

Let’s start by installing PyCaret.

pip install pycaret==2.3.5
pip install scipy==1.4.1

Import the necessary modules

from pycaret.anomaly import *
from sklearn.datasets import load_breast_cancer

Load the dataset

We will be using the Wisconsin Breast Cancer (Diagnostic) dataset[1] from the UCI Machine Learning Repository[2], which contains features computed from a digitized image of a fine needle aspirate of a breast mass, together with the diagnosis of whether the mass is benign (B) or malignant (M). This dataset is commonly used for demonstrating supervised machine learning, where a model is trained to predict the diagnosis. For the purpose of demonstrating unsupervised anomaly detection, we will ignore the diagnosis.

We split the data into a training set and reserve a small “unseen” set of 10 records for scoring.

df = load_breast_cancer(as_frame=True)['data']
df_train = df.iloc[:-10]
df_unseen = df.tail(10)

Set up PyCaret

anom = setup(data = df_train,
             silent = True)

Setting the silent parameter to True automatically confirms the inferred data types when setup is executed. If silent is set to False, PyCaret asks the user to confirm the inferred data types manually, as shown in the image below.

Image by Author

PyCaret will infer the data types if they are not explicitly defined. We can also define which columns contain categorical, numeric, ordinal and date features using the categorical_features, numeric_features, ordinal_features and date_features parameters. Defining the data types affects how each column is pre-processed. For example, the categorical_imputation parameter determines how missing categorical values are handled; the available options are constant and mode. Similarly, the numeric_imputation parameter determines how missing numeric values are handled; the available options are mean, median or zero.
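As a rough illustration of what these imputation options mean, the plain-Python sketch below fills missing entries (represented as None) in a single column. It is not PyCaret's implementation; the helper names impute_numeric and impute_categorical and the placeholder string "missing" are assumptions made for this example.

```python
from statistics import mean, median, mode


def impute_numeric(values, strategy="mean"):
    """Fill None entries in a numeric column using mean, median or zero."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "zero":
        fill = 0
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


def impute_categorical(values, strategy="mode", constant="missing"):
    """Fill None entries in a categorical column using the mode or a constant."""
    observed = [v for v in values if v is not None]
    if strategy == "mode":
        fill = mode(observed)  # most frequent observed category
    elif strategy == "constant":
        fill = constant  # hypothetical placeholder label
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


print(impute_numeric([1.0, None, 3.0, 8.0], "mean"))
print(impute_numeric([1.0, None, 3.0, 8.0], "median"))
print(impute_categorical(["a", None, "b", "a"], "mode"))
print(impute_categorical(["a", None, "b", "a"], "constant"))
```

The choice matters for anomaly detection: filling with zero can itself create unusual-looking records if zero is far from the typical value of the column.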

There is a wide range of other pre-processing options such as normalization, dimension reduction and even custom transformation. Check out the documentation for more details.

Check available models

Let’s check which anomaly detection models are available in PyCaret.

models()

The reference column indicates which source package each model comes from. PyOD (Python Outlier Detection) is a package that offers a wide variety of outlier detection models.

Image by Author

Train the anomaly detection model

anom_model = create_model(model = 'iforest', fraction = 0.05)

We train an anomaly detection model by passing the model’s ID. The fraction parameter indicates the proportion of outliers expected in the dataset. The default value is 0.05, which means that 5% of the records are assumed to be outliers.
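One way to picture the fraction parameter is as a cut-off on ranked anomaly scores: the highest-scoring share of records is labelled anomalous. The sketch below is a simplified illustration of that idea, not the code PyCaret or PyOD actually runs; the function name label_by_fraction and the sample scores are hypothetical.

```python
def label_by_fraction(scores, fraction=0.05):
    """Label the highest-scoring `fraction` of records as anomalies (1).

    Returns the labels and the score used as the cut-off.
    """
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    # Always flag at least one record so the cut-off is defined.
    n_outliers = max(1, round(len(scores) * fraction))
    # The cut-off is the n-th largest score; records at or above it are flagged.
    cutoff = sorted(scores, reverse=True)[n_outliers - 1]
    labels = [1 if s >= cutoff else 0 for s in scores]
    return labels, cutoff


# Twenty hypothetical anomaly scores (higher means more unusual).
scores = [-0.10, -0.09, -0.12, 0.06, -0.03, -0.07, -0.11, -0.02, -0.05, -0.08,
          -0.04, -0.13, -0.06, -0.01, -0.14, -0.15, -0.16, 0.01, -0.17, -0.18]

labels_5, cutoff_5 = label_by_fraction(scores, fraction=0.05)
labels_10, cutoff_10 = label_by_fraction(scores, fraction=0.10)
print(sum(labels_5), cutoff_5)
print(sum(labels_10), cutoff_10)
```

With 20 records, a fraction of 0.05 flags one record and 0.10 flags two. The model's scores do not change; only the cut-off moves, so choosing the fraction is effectively choosing how many records you are prepared to review.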

Score the dataset

results = assign_model(anom_model)

assign_model scores the training dataset using the trained model and returns the model’s predictions concatenated with the training data. The Anomaly column is binary, where 1 indicates that the record is anomalous and 0 indicates that it is normal. The Anomaly_Score column gives the raw score for the record, where a negative value indicates that the record is normal.

Image by Author

Plot model

The plot_model function helps to visualize high-dimensional data using t-SNE or UMAP.

plot_model(anom_model, plot = 'tsne')

Gif by Author

plot_model(anom_model, plot = 'umap')

Image by Author

Save the model

save_model saves the transformation pipeline and the trained model as a pickle file. The pickle file contains a scikit-learn pipeline object.

save_model(model = anom_model, model_name = 'iforest_model')

Load the model

load_model loads the saved transformation pipeline and trained model as a scikit-learn pipeline object.

loaded_model = load_model('iforest_model')
type(loaded_model)
>> sklearn.pipeline.Pipeline

Score on unseen data

The predict method returns a binary output of 1 or 0, where 1 represents an abnormal record and 0 represents a normal record.

loaded_model.predict(df_unseen)
>> array([0, 0, 0, 0, 0, 1, 0, 0, 1, 0])

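The predict_proba method returns the probability of a record being classified as 0 or 1. Note that in the output below some records have a class 1 value above 0.5 and are still labelled 0 by predict, so these values should not be read as a 0.5 cut-off on the predicted label.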

loaded_model.predict_proba(df_unseen)
>> array([[0.81527091, 0.18472909],
          [0.88219213, 0.11780787],
          [0.4789605 , 0.5210395 ],
          [0.46277058, 0.53722942],
          [0.47638085, 0.52361915],
          [0.4421625 , 0.5578375 ],
          [0.68153289, 0.31846711],
          [0.75207438, 0.24792562],
          [0.27120702, 0.72879298],
          [0.54821746, 0.45178254]])

The decision_function method returns the raw anomaly score for each record.

loaded_model.decision_function(df_unseen)
>> array([-0.11826784, -0.14024322, -0.00783128, -0.00251488, -0.00698418,
           0.00425233, -0.07435137, -0.09751556,  0.06039016, -0.03057366])
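The three outputs above can be cross-checked against each other. The sketch below copies the printed values by hand and tests two things: whether a positive score coincides with a label of 1, and whether the class 1 values from predict_proba lie on a straight line when plotted against the scores. Both are observations about these ten records only; how the underlying detector derives its values is not confirmed here.

```python
# Values copied from the predict, predict_proba (second column) and
# decision_function outputs shown above.
labels = [0, 0, 0, 0, 0, 1, 0, 0, 1, 0]
outlier_proba = [0.18472909, 0.11780787, 0.5210395, 0.53722942, 0.52361915,
                 0.5578375, 0.31846711, 0.24792562, 0.72879298, 0.45178254]
scores = [-0.11826784, -0.14024322, -0.00783128, -0.00251488, -0.00698418,
          0.00425233, -0.07435137, -0.09751556, 0.06039016, -0.03057366]

# 1) Does the sign of the score reproduce the predicted label?
labels_from_sign = [1 if s > 0 else 0 for s in scores]
sign_matches_label = labels_from_sign == labels

# 2) Fit a straight line through two records (the first and the ninth) and
#    measure how far the remaining records fall from it.
slope = (outlier_proba[8] - outlier_proba[0]) / (scores[8] - scores[0])
intercept = outlier_proba[0] - slope * scores[0]
max_error = max(abs(slope * s + intercept - p)
                for s, p in zip(scores, outlier_proba))

print(sign_matches_label)
print(max_error)
```

Both checks hold for the records shown: the labels follow the sign of the score, and the class 1 values are a linear rescaling of the scores. That is consistent with the earlier observation that a class 1 value above 0.5 does not imply a label of 1.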

Conclusion

In this article, we explored how to quickly train an unsupervised anomaly detection model using PyCaret, visualize the high-dimensional features using t-SNE or UMAP, and use the trained model to score unseen data.

Reference

[1] Breast Cancer Wisconsin (Diagnostic) Data Set

Creators:

1. Dr. William H. Wolberg, General Surgery Dept.
University of Wisconsin, Clinical Sciences Center
Madison, WI 53792
wolberg ‘@’ eagle.surgery.wisc.edu

2. W. Nick Street, Computer Sciences Dept.
University of Wisconsin, 1210 West Dayton St., Madison, WI 53706
street ‘@’ cs.wisc.edu 608–262–6619

3. Olvi L. Mangasarian, Computer Sciences Dept.
University of Wisconsin, 1210 West Dayton St., Madison, WI 53706
olvi ‘@’ cs.wisc.edu

Donor: Nick Street

License: CC BY 4.0

[2] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [https://ift.tt/SdYWce]. Irvine, CA: University of California, School of Information and Computer Science.


Unsupervised Anomaly Detection in Python was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.



from Towards Data Science - Medium https://ift.tt/31MMgCU
via RiYo Analytics
