https://ift.tt/3qRfKJA How to fool a 27M-parameter model with a bit of Python Photo by George Lebada from Pexels Do you think it is i...
How to fool a 27M-parameter model with a bit of Python
Do you think it is impossible to fool the vision system of a self-driving Tesla car?
Or that machine learning models used in malware detection software are too good to be evaded by hackers?
Or that face recognition systems in airports are bulletproof?
Like any of us machine learning enthusiasts, you might fall into the trap of thinking that deep models used out there are perfect.
Well, you are WRONG.
There are easy ways to build adversarial examples that can fool any deep learning model and create security issues. In this post, we will cover the following:
- What are adversarial examples?
- How do you generate adversarial examples?
- Hands-on example: let’s break Inception 3
- How to defend your models against adversarial examples
Let’s start!
1. What are adversarial examples? 😈
In the last 10 years, deep learning models have left the academic kindergarten, become big boys, and transformed many industries. This is especially true for computer vision models. When AlexNet hit the charts in 2012, the deep learning era officially started.
Nowadays, computer vision models are as good or better than human vision. You can find them in myriad places, including…
- self-driving cars
- face recognition
- medical diagnosis
- surveillance systems
- malware detection
- …
Until recently, researchers trained and tested machine learning models in a laboratory environment, such as machine learning competitions and academic papers. Nowadays, when deployed in real-world scenarios, security vulnerabilities coming from model errors have become a real concern.
Imagine for a moment that the state-of-the-art-super-fancy-deep learning vision system of your self-driving car was not able to recognize this stop sign as a stop sign.
Well, this is exactly what happened. This stop sign image is an adversarial example. Think of it as an optical illusion for the model.
Let us look at another example. Below you have two images of a dog that are indistinguishable to the human’s eye.
The image on the left is an original picture of a puppy taken by Dominika Roseclay. The one on the right is a slight modification of the first, that I created by adding the noise vector in the central image.
Inception v3 correctly classifies the original image as a dog breed (a redbone). However, this same model thinks, with very high confidence, that the modified image I created is a paper towel.
In other words, I created an adversarial example. And you will too, as this is the example we will work on later in section 3.
An adversarial example for a computer vision model is an input image with small perturbations, imperceptible to the human eye, that causes a wrong model prediction.
Do not think these 2 examples are rare edge-case examples found after spending tons of time and computing resources. There are easy ways to generate adversarial examples, and this opens the door to serious vulnerabilities of machine learning systems in production.
Let’s see how you can generate an adversarial example and fool a state-of-the-art image classification model.
2. How to generate adversarial examples?
Adversarial examples 😈 are generated by taking a clean image 👼 that the model correctly classifies, and finding a small perturbation that causes the new image to be misclassified by the ML model.
White-box scenario
Let’s suppose you have complete information about the model you want to fool. In this case, you can compute the loss function of the model
where
- X is the input image
- y is the output class,
- and θ is the vector of the network parameters.
This loss function is typically the negative log-likelihood for classification methods.
Your goal is to find a new image X’ that is close to the original X and that produces a big change in the value of the loss function.
Imagine you are inside the space of all possible input images, sitting on top of the original image X. This space has dimensions width x height x channels, so I will excuse you if you cannot visualize it well 😜.
To find an adversarial example, you need to walk a little bit in some direction in this space until you find another image X’ with a remarkably different loss. You want to choose the direction that maximizes the change in the loss function J for a fixed small step epsilon.
Now, if you refresh a bit your Maths Calculus course, the direction in the X space where the loss function changes the most is precisely the gradient of J with respect to the X.
The gradient of a function with respect to one of its variables is precisely the direction of maximal change. And by the way, this is the reason people train Neural Networks using Stochastic Gradient Descend and not Stochastic Random-Direction Descend.
Fast gradient sign method
An easy way to formalize this intuition is as follows:
We take only the sign of the gradient and scale it using a small parameter epsilon, to guarantee that the distortion between X and X is small enough to be imperceptible to the human eye. This method is called the fast gradient sign method.
Black-box attack
In most scenarios, it is very likely you will not have complete information about the model. Hence, the previous method is not useful as you cannot compute the gradient.
However, there exists a remarkable property called transferability of adversarial examples that malicious agents can exploit to break a model even if they do not know its internal architecture and parameters.
Researchers have repeatedly observed that adversarial examples transfer quite well between models, meaning that they can be designed for a target model A, but end up being effective against any other model trained on a similar dataset.
Adversarial examples can be generated as follows:
- Query the targeted model with inputs X_i for i = 1 … n and store the outputs y_i.
- Use the training data (X_i, y_i) to build another model, called the substitute model.
- Use a white-box algorithm like the fast gradient sign to generate adversarial examples for the substitute model. Many of them are going to transfer successfully and become adversarial examples for the target model as well.
A successful application of this strategy against a commercial Machine learning model is presented in this Computer Vision Foundation paper.
3. Hands-on example: let’s break Inception 3
Let’s get our hands dirty and implement a few attacks using Python and the great library PyTorch. It always comes in handy to know how the attacker thinks.
You can find the complete code in this Github repo.
Our target model is going to be Inception V3, a powerful image classification model developed by Google, which has around 27M parameters and was trained on around 14M images belonging to 20k categories.
We download the list of classes the model was trained on, and build an auxiliary dictionary that maps classes ids to labels.
Let us take the image of an innocent redbone dog as the starting image we will carefully modify to build adversarial examples:
Inception V3 expects images with dimensions 299 x 299 and normalized pixel ranges between -1 and 1.
Let us preprocess our beautiful dog image:
and check that the model correctly classifies this image.
Good. The model works as expected and the redbone dog is classified as a redbone dog :-).
Let’s move to the fun part and generate adversarial examples using the Fast Gradient Sign method.
I have created an auxiliary function to visualize both the original and the adversarial image. You can see the full implementation in this GitHub repository.
Well. It is interesting how the model prediction changed for the new image, which is almost indistinguishable from the original one. The new prediction is a bloodhound, which is another dog breed with very similar skin color and big ears. As the puppy in question could be a mixed breed, the model mistake seems to be small, so we want to work further to really break this model.
One possibility is to play with different values of epsilon and try to find one that clearly gives a wrong prediction. Let’s try this.
As epsilon increases the change in the image becomes visible. However, the model predictions are still other dog breeds: bloodhound and basset. We need to be smarter than this to break the model.
Remember the intuition behind the Fast gradient sign method, i.e. imagine yourself inside the space of all possible images (with dimension 299 x 299 x 3), right where the original image X is. The gradient of the loss function tells you the direction you need to step to increase its value and make the model less certain about the right prediction. The size of the step is epsilon. You step and you check if you have are now sitting on an adversarial example.
An extension of this would be to take many smaller steps, instead of a single one. After each step, you re-evaluate the gradient and decide the new direction you are going to walk towards
This method is called the Iterative Fast Gradient method. What an original name!
More formally:
Where X0 = X, and Clip X,ϵ denotes clipping of the input in the range of [X−ϵ, X+ϵ].
An implementation in PyTorch is the following:
Now, let us try again to generate a good adversarial example starting from our innocent puppy.
Step 1: bloodhound again
Step 2: beagle again
Step 3: mousetrap? Interesting. However, the model confidence on this prediction is only 16%. Let’s go further.
Step 4: One more dog breed, boring…
Step 5: beagle again..
Step 6:
Step 7: redbone again. Keep calm and continue walking in the image space.
Step 8:
Step 9: BINGO! 🔥🔥🔥
What an exotic paper towel. And the model is pretty confident about its prediction, almost 99%.
If you compare the initial and the final image we found in step 9
you see they are essentially the same, a puppy, but for the Inception V3 model they are two very different things.
I was very surprised the first time I saw such a thing. How small modifications to an image can cause such model misbehavior. This example is funny, but if you think about the implications for a self-driving car vision you might start to worry a bit.
Deep learning models deployed in critical tasks need to properly handle adversarial examples, but how?
4. How to defend your models against adversarial examples?
As of November 2021, there is no universal defense strategy you can follow to protect against adversarial examples. In other words, attacking a model with adversarial examples is easier than protecting it.
The best universal strategy, that can protect against any known attack, is to use adversarial examples when training the model.
If the model sees adversarial examples during training, its performance at prediction time will be better for adversarial examples generated in the same way. This technique is called adversarial training.
You generate them on the fly, as you train the model, and adjust the loss function to look both at clean and adversarial inputs.
For example, we could eliminate the adversarial examples we found in the previous section if we added the 10+ examples to the training set and labeled all of them as redbone.
This defense is very effective against attacks that use the Fast Gradient Sign method. However, there exist more powerful attacks, that we did not cover in this post, that can bypass this defense. If you wanna know more I encourage you to read this amazing paper by Nicholas Carlini and David Wagner.
This is the reason why adversarial defense is an open problem in Machine Learning and cybersecurity.
Conclusion
Adversarial examples are a fascinating topic at the intersection of cybersecurity and machine learning. However, it is still an open problem waiting to be solved.
In this post, I gave a practical introduction to the topic with code examples. If you would like to work further I suggest cleverhans, a Python library for adversarial machine learning developed by the great Ian Goodfellow and Nicolas Papernot, and currently maintained by the University of Toronto.
All the source code I shown in this article is in this Github repo.
Do you want to become an (even) better data scientist, and access top courses on machine learning and data science?
👉🏽 Subscribe to the datamachines newsletter.
👉🏽 Give me lots of claps 👏🏿👏🏽👏 below
👉🏽 Follow me on Medium.
Have a great day 🧡
Pau
Adversarial Examples to Break Deep Learning Models was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
from Towards Data Science - Medium https://ift.tt/3CgF9i8
via RiYo Analytics
No comments