Boo-AI — Master Artificial Intelligence by Building from Scratch

Introduction

Evasion attacks are the most widely studied class of adversarial ML attacks. The core idea is deceptively simple: add carefully calculated perturbations to an input so that a human sees no difference, but the model produces an entirely wrong output. A stop sign remains a stop sign to human eyes, but a self-driving car's neural network reads it as a speed limit sign.

These attacks exploit the fact that deep neural networks, despite their impressive accuracy, learn decision boundaries that can be fragile and counterintuitive. Small shifts in input space—often imperceptible to humans—can push a sample across a decision boundary and into the wrong class.

Fast Gradient Sign Method (FGSM)

The Fast Gradient Sign Method, introduced by Goodfellow et al. in 2014, is the foundational evasion attack. FGSM computes the gradient of the loss function with respect to the input, then shifts every input value by epsilon in the direction of that gradient's sign, which is the direction that locally increases the loss. Because no value moves by more than epsilon, a small epsilon keeps the perturbation imperceptible.
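Written as an equation, the single FGSM step looks roughly like this. The symbols are notation assumed here for illustration, not taken from the text above: x is the clean input, y its true label, θ the model parameters, J the loss, and the clip keeps values in the valid [0, 1] pixel range.

```latex
x_{\mathrm{adv}} = \operatorname{clip}_{[0,1]}\bigl( x + \epsilon \cdot \operatorname{sign}\bigl( \nabla_{x} J(\theta, x, y) \bigr) \bigr),
\qquad
\lVert x_{\mathrm{adv}} - x \rVert_{\infty} \le \epsilon
```

The bound on the right is why epsilon directly controls how visible the perturbation can be.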

FGSM's main strength is efficiency: it needs only one forward and one backward pass through the network, which makes it fast enough for real-time attacks. Despite its simplicity, it can sharply reduce the accuracy of an undefended model; with a large enough epsilon, a classifier that is 99% accurate on clean inputs can fall below 10%.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, epsilon):
    """Generate adversarial examples using FGSM."""
    # Work on a detached copy so the caller's tensor is not modified
    images = images.clone().detach().requires_grad_(True)
    outputs = model(images)
    loss = F.cross_entropy(outputs, labels)
    model.zero_grad()
    loss.backward()

    # Step by epsilon in the direction of the sign of the input gradient
    perturbation = epsilon * images.grad.sign()
    adversarial_images = images + perturbation
    # Keep pixel values in the valid [0, 1] range
    adversarial_images = torch.clamp(adversarial_images, 0, 1)
    return adversarial_images.detach()
```
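To see how accuracy responds to epsilon, one option is to sweep several values and record accuracy at each. The sketch below is illustrative only: `accuracy_under_fgsm`, `toy_model`, and the random data are hypothetical stand-ins, and the labels are simply the toy model's own clean predictions so that accuracy starts at 100% by construction. It reuses one gradient for every epsilon, since the FGSM direction does not depend on the step size.

```python
import torch
import torch.nn.functional as F

def accuracy_under_fgsm(model, images, labels, epsilons):
    """Return {epsilon: accuracy} for single-step FGSM at each epsilon."""
    model.eval()  # note: switches the model to eval mode as a side effect
    x = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), labels)
    grad, = torch.autograd.grad(loss, x)
    direction = grad.sign()  # the same direction is reused for every epsilon

    results = {}
    with torch.no_grad():
        for eps in epsilons:
            adv = torch.clamp(images + eps * direction, 0, 1)
            preds = model(adv).argmax(dim=1)
            results[eps] = (preds == labels).float().mean().item()
    return results

# Toy usage with a hypothetical stand-in model and random "images"
torch.manual_seed(0)
toy_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
toy_images = torch.rand(16, 1, 28, 28)
# Use the model's own clean predictions as labels (an assumption for this demo)
toy_labels = toy_model(toy_images).argmax(dim=1)
report = accuracy_under_fgsm(toy_model, toy_images, toy_labels, [0.0, 0.05, 0.1, 0.2])
for eps, acc in report.items():
    print(f"epsilon={eps:.2f}  accuracy={acc:.2%}")
```

On a real trained classifier the same loop would be run over a held-out test set with true labels; the numbers from this toy setup say nothing about any particular model.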

Key Insight: FGSM demonstrates a fundamental tension in neural networks: the same gradients used for learning can be weaponized to craft inputs that systematically mislead the model. Defense requires breaking this symmetry.

Projected Gradient Descent (PGD)

Projected Gradient Descent extends FGSM by applying it iteratively. Rather than taking a single large step, PGD takes many small steps, projecting the result back onto the allowed perturbation ball after each iteration. This produces stronger adversarial examples that are more likely to fool robust models.

PGD is widely considered the strongest first-order attack and serves as the standard benchmark for evaluating adversarial robustness. If a model can withstand PGD attacks with sufficient iterations, it is considered reasonably robust against gradient-based evasion.

Step size: Each iteration applies a small perturbation, typically much smaller than the FGSM epsilon
Projection: After each step, the perturbation is clipped to remain within the epsilon-ball around the original input
Random restarts: Multiple random starting points inside the epsilon-ball help the attack escape poor local optima of the loss and find stronger adversarial examples

The trade-off is computational cost. While FGSM requires one gradient computation, a typical PGD attack uses 40 to 200 iterations, each requiring a full forward and backward pass. For real-time attacks, this cost can be prohibitive, but for offline attack preparation, PGD is the method of choice.
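The step size, projection, and random start described in this section can be sketched as follows for an L-infinity perturbation ball. This is a minimal illustration, not a reference implementation: the function name `pgd_attack` and the parameters `step_size`, `num_steps`, and `random_start` are names chosen here, and inputs are assumed to lie in [0, 1].

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, images, labels, epsilon, step_size, num_steps, random_start=True):
    """L-infinity PGD sketch: repeated signed-gradient steps with projection."""
    original = images.clone().detach()
    adv = original.clone()

    if random_start:
        # Begin from a uniformly random point inside the epsilon-ball
        adv = adv + torch.empty_like(adv).uniform_(-epsilon, epsilon)
        adv = torch.clamp(adv, 0, 1)

    for _ in range(num_steps):
        adv.requires_grad_(True)
        loss = F.cross_entropy(model(adv), labels)
        grad, = torch.autograd.grad(loss, adv)

        # Small FGSM-style step that increases the loss
        adv = adv.detach() + step_size * grad.sign()
        # Projection: clip the total perturbation back into [-epsilon, epsilon]
        adv = original + torch.clamp(adv - original, -epsilon, epsilon)
        # Keep pixel values valid
        adv = torch.clamp(adv, 0, 1)

    return adv.detach()
```

Random restarts would wrap this function: call it several times with `random_start=True` and keep, per input, whichever result the model misclassifies. Each call costs `num_steps` forward and backward passes, which is where the cost discussed above comes from.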

Physical-World Adversarial Examples

The most alarming development in evasion attacks is their demonstrated effectiveness in the physical world. Researchers have shown that adversarial perturbations can survive the transition from digital images to physical objects, maintaining their ability to fool models even when captured by cameras under varying lighting conditions, angles, and distances.

Autonomous vehicle perception systems have been a primary target. Studies have demonstrated that carefully designed stickers applied to stop signs can cause classification models to misidentify them as yield signs or speed limit signs. These perturbations are designed to be robust across viewing angles and weather conditions, making them practical threats in real deployments.
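The text does not say how the cited studies made their perturbations survive changes in viewpoint and lighting. One approach from the literature, usually called Expectation over Transformation, averages the loss over random transformations of the input before taking each gradient step. The sketch below is a simplified illustration of that idea under stated assumptions: `random_transform` is a hypothetical stand-in that models only brightness changes and sensor noise on 4-D image batches, and `transform_robust_step` is a name chosen here.

```python
import torch
import torch.nn.functional as F

def random_transform(x):
    """Hypothetical stand-in for physical variation: per-image brightness plus sensor noise."""
    brightness = torch.empty(x.shape[0], 1, 1, 1, device=x.device).uniform_(0.7, 1.3)
    noise = 0.02 * torch.randn_like(x)
    return torch.clamp(x * brightness + noise, 0, 1)

def transform_robust_step(model, adv, original, labels, epsilon, step_size, num_samples=8):
    """One signed-gradient step on the loss averaged over random transformations."""
    adv = adv.clone().detach().requires_grad_(True)
    # Average the loss over several randomly transformed views of the same input
    loss = torch.stack([
        F.cross_entropy(model(random_transform(adv)), labels)
        for _ in range(num_samples)
    ]).mean()
    grad, = torch.autograd.grad(loss, adv)

    stepped = adv.detach() + step_size * grad.sign()
    # Project back into the epsilon-ball around the original, then into valid pixel range
    stepped = original + torch.clamp(stepped - original, -epsilon, epsilon)
    return torch.clamp(stepped, 0, 1)
```

A real physical attack would use a much richer transformation set (rotation, perspective, distance, printing error); this version only shows where the averaging enters the optimization.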

Face recognition systems face similar vulnerabilities. Adversarial eyeglass frames, makeup patterns, and even infrared LED arrays have been shown to defeat facial recognition at security checkpoints. These attacks are particularly concerning because they target systems that millions of people rely on for physical security.

Why This Matters: Physical-world adversarial examples demonstrate that adversarial ML is not a theoretical curiosity—it is a practical threat to safety-critical systems. Any AI system operating in the physical world must be designed with adversarial robustness as a core requirement.

Adversarial Patches

Adversarial patches take physical attacks a step further by removing the constraint of imperceptibility. Instead of subtle perturbations spread across an entire image, a patch is a small, visible pattern that can be printed and placed in the physical world. When the camera feeding a model captures the patch, the patch dominates the model's prediction and forces a misclassification.

Patches have been demonstrated to make objects invisible to object detection systems. A person holding a printed adversarial patch can become undetectable to surveillance cameras running YOLO or similar detectors. This has obvious implications for both privacy and security.

Universal patches: A single patch that causes misclassification regardless of the underlying image
Targeted patches: Patches designed to cause classification as a specific target class
Detection-suppressing patches: Patches that cause object detectors to fail to detect the patched object entirely
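As a rough illustration of a patch that is both universal and targeted, the sketch below optimizes one patch over a batch of images so that a classifier is pushed toward a chosen class. Everything here is an assumption made for the example: the names `apply_patch` and `train_targeted_patch`, the square patch, the random placement, and the use of Adam. It covers classification only, not the detection-suppressing case.

```python
import torch
import torch.nn.functional as F

def apply_patch(images, patch, top, left):
    """Paste the same patch onto every image with its top-left corner at (top, left)."""
    patched = images.clone()
    h, w = patch.shape[-2:]
    patched[:, :, top:top + h, left:left + w] = patch
    return patched

def train_targeted_patch(model, images, target_class, patch_size=8, steps=50, lr=0.1):
    """Optimize one patch so that patched images are classified as target_class."""
    channels, height, width = images.shape[1:]
    patch = torch.rand(channels, patch_size, patch_size,
                       device=images.device, requires_grad=True)
    targets = torch.full((images.shape[0],), target_class,
                         dtype=torch.long, device=images.device)
    optimizer = torch.optim.Adam([patch], lr=lr)

    for _ in range(steps):
        # Random placement each step so the patch does not rely on one location
        top = torch.randint(0, height - patch_size + 1, (1,)).item()
        left = torch.randint(0, width - patch_size + 1, (1,)).item()
        logits = model(apply_patch(images, patch, top, left))
        # Minimizing cross-entropy against the target pulls predictions toward it
        loss = F.cross_entropy(logits, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            patch.clamp_(0, 1)  # keep the patch printable as valid pixel values

    return patch.detach()
```

Unlike FGSM and PGD, there is no epsilon constraint here: the patch pixels may take any valid value, and only the patch's size and location limit the attack.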

Defending against adversarial patches requires different strategies than defending against imperceptible perturbations. Techniques such as digital watermarking, patch detection networks, and attention-based filtering have shown promise, but no defense is yet considered fully robust against determined adversaries.