:::tip 🎮 Interactive Playground Visualize this concept: Try the Adversarial Prompts demo on the EngineersOfAI Playground - no code required. :::
Adversarial Examples
The Stop Sign That Wasn't
The autonomous vehicle's perception system processed 30 frames per second. At each frame, it ran a neural network classifier across detected objects. At a distance of 40 meters, it correctly identified the stop sign as a stop sign. At 35 meters. At 30 meters.
Then at 25 meters - with the vehicle traveling at 45 mph - the classifier returned "Speed Limit 45."
The stop sign looked unremarkable to every human who passed it that day. But it carried a few small, carefully placed stickers: black and white rectangles arranged in a specific pattern. To a person, they resembled ordinary graffiti; to the neural network, they shifted the classifier's decision from "STOP" to "Speed Limit 45." The researchers who designed the attack (Eykholt et al., 2017) called it a physical-world adversarial attack. They showed that adversarial perturbations weren't just a theoretical curiosity confined to digital pixel manipulation - they could be printed, physically affixed to real objects, and reliably fool deployed systems.
This is what makes adversarial examples so important to understand: they reveal that neural networks learn different features than humans do. A model that classifies stop signs correctly 99.9% of the time in normal operation can be made to fail almost every time with targeted, inconspicuous modifications. The implications for any safety-critical AI system are severe.
What Are Adversarial Examples?
An adversarial example is an input that has been specifically crafted to cause a model to make a wrong prediction. The key property: the modification is constrained - typically imperceptible to humans - while the effect on the model is dramatic.
For images: adversarial examples add small perturbations (often imperceptible) to pixel values. For text: they change a few words, characters, or punctuation marks. For audio: they add inaudible noise that causes speech recognizers to transcribe a different message. For tabular data: they modify feature values within plausible ranges to flip model decisions.
The existence of adversarial examples reveals a fundamental property of neural networks: they make decisions based on features that are statistically correlated with labels in training data - but these features may not align with human-intuitive features. A slight perturbation along a non-robust feature direction can drastically change the prediction while leaving human-relevant features unchanged.
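The effect of a small per-feature perturbation can be made concrete with a linear score function. The sketch below is illustrative only: the weights, the input dimension (3072, roughly a flattened 32x32 RGB image), and the helper names are assumptions, not part of any real model. It shows that a perturbation bounded by epsilon in every coordinate can move a linear score by epsilon times the L1 norm of the weights, which grows with the input dimension.

```python
import random


def score(w, x):
    """Linear score: the sign of this value is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x))


def l1_norm(w):
    return sum(abs(wi) for wi in w)


def worst_case_perturbation(w, epsilon, push_up):
    """Move every coordinate by +/- epsilon in the direction that changes the score most."""
    direction = 1.0 if push_up else -1.0
    return [direction * epsilon * (1.0 if wi > 0 else -1.0) for wi in w]


random.seed(0)
dim = 3072                                    # hypothetical input dimension
w = [random.choice((-0.01, 0.01)) for _ in range(dim)]   # small illustrative weights
x = [random.random() for _ in range(dim)]     # "pixels" in [0, 1]
epsilon = 0.03

clean_score = score(w, x)
# Push the score toward the opposite sign (range clipping to [0, 1] is ignored here)
delta = worst_case_perturbation(w, epsilon, push_up=clean_score < 0)
adv_score = score(w, [xi + di for xi, di in zip(x, delta)])
shift = abs(adv_score - clean_score)

print(f"clean score: {clean_score:+.3f}")
print(f"adversarial score: {adv_score:+.3f}")
print(f"shift: {shift:.4f} (epsilon * ||w||_1 = {epsilon * l1_norm(w):.4f})")
```

Each coordinate changes by at most 0.03, yet the score moves by about 0.92 because thousands of tiny changes add up in the same direction. Deep networks are not linear, but the same accumulation effect is one common intuition for why small perturbations can cross a decision boundary.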
Why Non-Robust Features Exist
The Ilyas et al. (2019) paper "Adversarial Examples Are Not Bugs, They Are Features" provides a compelling theoretical framework: adversarial examples exist because neural networks learn to use all statistically predictive features, including ones that humans don't recognize as meaningful (non-robust features). These non-robust features are genuinely predictive in normal data but change dramatically under small perturbations. Models that use these features are accurate but brittle; models that restrict themselves to robust features are safer but less accurate.
This explains the robustness-accuracy tradeoff: a model that discards non-robust features gains adversarial robustness but gives up the predictive signal those features carried, so some clean accuracy is lost.
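A toy simulation can illustrate the tradeoff. The sketch below is in the spirit of the simple data model analyzed by Tsipras et al. (2019): one "robust" feature that agrees with the label 95% of the time, plus many weakly correlated "non-robust" features. All constants (feature count, correlation strength, sample size) are assumptions chosen for illustration.

```python
import math
import random


def sign(value):
    return 1 if value >= 0 else -1


def evaluate(n=2000, d=100, p_robust=0.95, seed=0):
    """Compare a robust-feature classifier with a non-robust-feature classifier."""
    rng = random.Random(seed)
    eta = 3.0 / math.sqrt(d)      # mean shift of each weak feature
    epsilon = 2.0 * eta           # L-infinity budget of the attacker
    hits = {"robust_clean": 0, "weak_clean": 0, "weak_adv": 0}
    for _ in range(n):
        y = rng.choice((-1, 1))
        # Robust feature: equals the label with probability p_robust
        robust = y if rng.random() < p_robust else -y
        # Non-robust features: each only slightly correlated with the label
        weak = [rng.gauss(eta * y, 1.0) for _ in range(d)]

        hits["robust_clean"] += robust == y
        hits["weak_clean"] += sign(sum(weak)) == y
        # Attack: shift every weak feature by epsilon against the label
        shifted = [v - epsilon * y for v in weak]
        hits["weak_adv"] += sign(sum(shifted)) == y
    return {name: count / n for name, count in hits.items()}


results = evaluate()
for name, accuracy in results.items():
    print(f"{name}: {accuracy:.3f}")
```

The classifier that averages the weak features is more accurate on clean data than the robust feature alone, but the same small shift that a human would ignore drives its accuracy close to zero. The robust feature takes values of +1 or -1, so a perturbation of this size cannot flip its sign, and its accuracy stays near 95%.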
Attack Taxonomy
White-Box Attacks (Full Model Access)
Fast Gradient Sign Method (FGSM) - Goodfellow et al., 2014:
The simplest gradient-based attack. Take one step in the direction that increases loss:
import torch
import torch.nn.functional as F


def fgsm_attack(
    model: torch.nn.Module,
    images: torch.Tensor,
    labels: torch.Tensor,
    epsilon: float = 0.03
) -> torch.Tensor:
    """
    Fast Gradient Sign Method (FGSM) adversarial attack.

    One-shot gradient attack. Fast but weak - doesn't find worst case.
    Good for generating training data for adversarial training;
    not a reliable evaluation benchmark (use PGD for evaluation).

    Args:
        model: Target model
        images: Input images [B, C, H, W], values in [0, 1]
        labels: True labels
        epsilon: Perturbation magnitude (L-infinity bound)

    Returns:
        Adversarial examples with same shape as images
    """
    images = images.clone().detach().requires_grad_(True)

    # Forward pass
    model.eval()
    outputs = model(images)
    loss = F.cross_entropy(outputs, labels)

    # Compute gradients with respect to input
    model.zero_grad()
    loss.backward()

    # FGSM step: move in gradient sign direction
    with torch.no_grad():
        perturbation = epsilon * images.grad.sign()
        adversarial = images + perturbation
        adversarial = torch.clamp(adversarial, 0, 1)

    return adversarial.detach()

def evaluate_fgsm_robustness(
    model: torch.nn.Module,
    test_loader,
    epsilon_values: list[float] = [0.01, 0.03, 0.05, 0.1]
) -> dict:
    """
    Evaluate model robustness across FGSM epsilon values.
    Shows the clean accuracy / adversarial accuracy tradeoff curve.
    """
    model.eval()
    results = {}

    for epsilon in epsilon_values:
        correct_clean = 0
        correct_adversarial = 0
        total = 0

        for images, labels in test_loader:
            total += labels.size(0)

            with torch.no_grad():
                outputs = model(images)
                correct_clean += (outputs.argmax(1) == labels).sum().item()

            # The attack needs gradients, so it runs outside no_grad()
            adversarial = fgsm_attack(model, images, labels, epsilon)

            with torch.no_grad():
                adv_outputs = model(adversarial)
                correct_adversarial += (adv_outputs.argmax(1) == labels).sum().item()

        results[epsilon] = {
            "clean_accuracy": correct_clean / total,
            "adversarial_accuracy": correct_adversarial / total,
            "accuracy_drop": (correct_clean - correct_adversarial) / total,
            # Note: also counts inputs the model already misclassified when clean
            "attack_success_rate": 1 - (correct_adversarial / total)
        }

    return results
Projected Gradient Descent (PGD) - Madry et al., 2018:
Widely treated as the strongest first-order attack. It takes multiple FGSM-style steps, projecting back onto the epsilon-ball after each one:
import torch
import torch.nn.functional as F


def pgd_attack(
    model: torch.nn.Module,
    images: torch.Tensor,
    labels: torch.Tensor,
    epsilon: float = 0.03,
    alpha: float = 0.007,
    n_steps: int = 40,
    random_start: bool = True,
    targeted: bool = False,
    target_labels: torch.Tensor = None
) -> torch.Tensor:
    """
    Projected Gradient Descent (PGD) adversarial attack.

    The strongest "first-order" attack. If a model is robust to PGD,
    it's considered adversarially robust to gradient-based attacks.
    PGD finds a local maximum of loss within the epsilon-ball.

    Args:
        model: Target model
        images: Clean images [B, C, H, W]
        labels: True labels
        epsilon: L-infinity perturbation bound
        alpha: Per-step perturbation magnitude (typically epsilon / 4)
        n_steps: Number of gradient steps (40-100 for evaluation)
        random_start: Start from random point in epsilon-ball (recommended)
        targeted: If True, minimize loss for target_labels
        target_labels: Target labels for targeted attack

    Returns:
        Adversarial examples
    """
    # Note: this leaves the model in eval mode; callers that train must
    # switch back with model.train()
    model.eval()

    # Initialize perturbation
    if random_start:
        delta = torch.zeros_like(images).uniform_(-epsilon, epsilon)
    else:
        delta = torch.zeros_like(images)
    delta.requires_grad_(True)

    for step in range(n_steps):
        adversarial = torch.clamp(images + delta, 0, 1)

        # Forward pass and loss
        outputs = model(adversarial)
        if targeted and target_labels is not None:
            # Targeted: minimize loss for target class (move toward target)
            loss = -F.cross_entropy(outputs, target_labels)
        else:
            # Untargeted: maximize loss for true class
            loss = F.cross_entropy(outputs, labels)

        # Backward pass
        loss.backward()

        with torch.no_grad():
            # Gradient step
            delta_grad = delta.grad.sign()
            delta.data = delta.data + alpha * delta_grad
            # Project back to epsilon-ball (L-infinity)
            delta.data = torch.clamp(delta.data, -epsilon, epsilon)
            # Ensure valid image range
            delta.data = torch.clamp(images + delta.data, 0, 1) - images

        if delta.grad is not None:
            delta.grad.zero_()

    return (images + delta).detach()

def pgd_with_restarts(
    model: torch.nn.Module,
    images: torch.Tensor,
    labels: torch.Tensor,
    epsilon: float = 0.03,
    alpha: float = 0.007,
    n_steps: int = 40,
    n_restarts: int = 5
) -> torch.Tensor:
    """
    PGD with multiple random restarts for stronger adversarial examples.
    Random restarts are a common way to make PGD-based evaluation more
    reliable; evaluation suites such as AutoAttack support them as well.
    """
    model.eval()
    # Keep the tracker on the same device as the inputs; start at -inf so the
    # first restart always replaces the clean image
    best_loss = torch.full(
        (images.shape[0],), float('-inf'), device=images.device
    )
    best_adversarial = images.clone()

    for _ in range(n_restarts):
        adversarial = pgd_attack(
            model, images, labels,
            epsilon=epsilon, alpha=alpha, n_steps=n_steps,
            random_start=True
        )

        with torch.no_grad():
            outputs = model(adversarial)
            loss = F.cross_entropy(outputs, labels, reduction='none')

        # Track best (worst-case) adversarial per example
        improved = loss > best_loss
        best_adversarial[improved] = adversarial[improved]
        best_loss[improved] = loss[improved]

    return best_adversarial
Carlini-Wagner (C&W) Attack - Carlini & Wagner, 2017:
One of the strongest optimization-based attacks. It minimizes the size of the perturbation while still achieving misclassification:
import torch
import torch.optim as optim
import torch.nn.functional as F


def cw_l2_attack(
    model: torch.nn.Module,
    images: torch.Tensor,
    labels: torch.Tensor,
    confidence: float = 0.0,
    lr: float = 0.01,
    n_iterations: int = 1000,
    binary_search_steps: int = 9,
    c_upper: float = 1e4
) -> torch.Tensor:
    """
    Carlini-Wagner L2 Attack (C&W), untargeted variant.

    Minimizes L2 perturbation while achieving misclassification.
    Often finds smaller perturbations than PGD.
    Slower than PGD; historically it broke defenses (such as defensive
    distillation) that had resisted weaker attacks.

    Formulation:
        minimize ||x' - x||_2^2 + c * f(x')
        where f(x') = max(Z_y - max_{i != y} Z_i, -confidence)
        and Z are the logits; f(x') <= 0 once x' is misclassified
    """
    model.eval()
    batch_size = images.shape[0]
    device = images.device

    # Work in tanh space to handle box constraints naturally
    # x = (tanh(w) + 1) / 2 maps to [0, 1]
    # Clamp first: atanh is infinite at pixel values of exactly 0 or 1
    w = torch.atanh((images * 2 - 1).clamp(-1 + 1e-6, 1 - 1e-6))

    best_adversarial = images.clone()
    best_l2 = torch.full((batch_size,), float('inf'), device=device)

    # Binary search over constant c
    c_lower = torch.zeros(batch_size, device=device)
    c_upper_vals = torch.full((batch_size,), c_upper, device=device)
    c = torch.ones(batch_size, device=device)

    for _ in range(binary_search_steps):
        # Optimize perturbation for current c
        w_opt = w.clone().detach().requires_grad_(True)
        optimizer = optim.Adam([w_opt], lr=lr)

        for iteration in range(n_iterations):
            optimizer.zero_grad()

            # Map back to image space
            x_adv = (torch.tanh(w_opt) + 1) / 2

            # Squared L2 distance per example
            l2_loss = ((x_adv - images) ** 2).sum(dim=(1, 2, 3))

            # Classification loss (CW objective)
            outputs = model(x_adv)
            # Gather correct class logits (squeeze(1) keeps the batch dim for B=1)
            correct_logits = outputs.gather(1, labels.unsqueeze(1)).squeeze(1)
            # Max other class logits
            mask = torch.ones_like(outputs, dtype=torch.bool)
            mask.scatter_(1, labels.unsqueeze(1), False)
            other_logits = outputs[mask].view(batch_size, -1).max(dim=1).values

            # Minimizing (correct - other) pushes the true class below the
            # best wrong class; the clamp stops once the margin is reached
            f_loss = torch.clamp(correct_logits - other_logits, min=-confidence)

            total_loss = (l2_loss + c * f_loss).mean()
            total_loss.backward()
            optimizer.step()

        # Check results and update binary search
        with torch.no_grad():
            x_final = (torch.tanh(w_opt) + 1) / 2
            final_outputs = model(x_final)
            final_preds = final_outputs.argmax(1)
            final_l2 = ((x_final - images) ** 2).sum(dim=(1, 2, 3)).sqrt()

            for i in range(batch_size):
                if final_preds[i] != labels[i]:
                    # Success: keep the smallest perturbation, then try a smaller c
                    if final_l2[i] < best_l2[i]:
                        best_l2[i] = final_l2[i]
                        best_adversarial[i] = x_final[i]
                    c_upper_vals[i] = c[i]
                else:
                    # Failure: the misclassification term needs more weight
                    c_lower[i] = c[i]

            c = (c_lower + c_upper_vals) / 2

    return best_adversarial
Black-Box Attacks (API Access Only)
Transfer Attack: Generate adversarial examples on a locally trained surrogate model. These examples often transfer to the target model because the two models share non-robust features.
Score-Based Attack (NES): Estimate gradients by querying with small perturbations:
import torch
from typing import Callable


def nes_black_box_attack(
    model_query_fn: Callable[[torch.Tensor], float],
    image: torch.Tensor,
    label: int,
    epsilon: float = 0.03,
    n_queries: int = 500,
    mu: float = 0.01,
    step_size: float = 0.01,
    n_directions: int = 20
) -> tuple[torch.Tensor, int]:
    """
    Natural Evolution Strategies (NES) black-box attack.

    Estimates gradients using finite differences from model outputs.
    Doesn't need model weights - only API access to output scores.

    Args:
        model_query_fn: Function(image_tensor) -> margin score (float).
            The score must be positive once the input is misclassified and
            grow as it becomes "more misclassified" (for example, best
            wrong-class score minus true-class score). A raw loss does not
            work with the success check below because it is always positive.
        image: Clean image to perturb [C, H, W]
        label: True label (not used directly; model_query_fn is expected
            to be built around it)
        epsilon: L-infinity perturbation bound
        n_queries: Maximum API queries allowed
        mu: Gradient estimation smoothing factor
        step_size: Gradient step size
        n_directions: Random directions per gradient estimate
    """
    delta = torch.zeros_like(image)
    queries_used = 0

    # Each iteration costs 2 queries per direction plus 1 success check
    for _ in range(n_queries // (2 * n_directions + 1)):
        # Sample random directions for gradient estimation
        directions = torch.randn(n_directions, *image.shape)

        # Estimate gradient via antithetic finite differences
        grad_estimate = torch.zeros_like(image)
        for d in directions:
            d_norm = d / (d.norm() + 1e-8)
            pos_query = torch.clamp(image + delta + mu * d_norm, 0, 1)
            neg_query = torch.clamp(image + delta - mu * d_norm, 0, 1)

            pos_score = model_query_fn(pos_query)  # Higher = more misclassified
            neg_score = model_query_fn(neg_query)
            queries_used += 2

            grad_estimate += (pos_score - neg_score) / (2 * mu) * d_norm

        grad_estimate /= n_directions

        # Step in gradient direction, then project to the epsilon-ball and [0, 1]
        delta = delta + step_size * grad_estimate.sign()
        delta = torch.clamp(delta, -epsilon, epsilon)
        delta = torch.clamp(image + delta, 0, 1) - image

        # Check if attack succeeded (this check is also a query)
        current_score = model_query_fn(image + delta)
        queries_used += 1
        if current_score > 0:  # Model now predicts wrong class
            print(f"Attack succeeded after {queries_used} queries")
            break

    return (image + delta).detach(), queries_used
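The gradient estimator at the core of the attack above can be checked in isolation. The sketch below is a minimal stdlib-only illustration, assuming a hypothetical linear score function whose true gradient is known, so the estimate can be compared against it. The function and variable names are illustrative.

```python
import math
import random


def nes_gradient_estimate(f, x, mu=0.01, n_directions=200, seed=0):
    """Estimate the gradient of f at x from function values only."""
    rng = random.Random(seed)
    dim = len(x)
    grad = [0.0] * dim
    for _ in range(n_directions):
        # Random unit direction
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in d)) + 1e-8
        u = [v / norm for v in d]
        # Antithetic pair of queries: f(x + mu*u) and f(x - mu*u)
        f_pos = f([xi + mu * ui for xi, ui in zip(x, u)])
        f_neg = f([xi - mu * ui for xi, ui in zip(x, u)])
        slope = (f_pos - f_neg) / (2 * mu)
        for i in range(dim):
            grad[i] += slope * u[i] / n_directions
    return grad


def cosine(a, b):
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)


# Hypothetical black-box score: linear, so its true gradient is known
true_grad = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 1.0, -2.0, 0.25, 0.75]


def black_box_score(x):
    return sum(g * xi for g, xi in zip(true_grad, x))


estimate = nes_gradient_estimate(black_box_score, [0.0] * len(true_grad))
similarity = cosine(estimate, true_grad)
print(f"cosine similarity to true gradient: {similarity:.3f}")
```

The estimate is scaled differently from the true gradient, but the attack only uses its sign, so direction is what matters. This run uses 400 queries for a 10-dimensional input; for image-sized inputs the estimate is much noisier per query, which is why black-box attacks need large query budgets.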
Text Adversarial Attacks
For NLP models, adversarial attacks work at the word, character, or sentence level:
import anthropic
import random
import re
import difflib

client = anthropic.Anthropic()


class TextAdversarialAttacker:
    """
    Generate adversarial text examples that fool NLP classifiers.

    Text attacks must be semantics-preserving (humans should reach the same
    conclusion) while flipping the model's prediction.

    Key constraint: humans must not notice the modification.
    """

    def __init__(self):
        # Homoglyph map: Latin chars replaced with visually identical Unicode
        self.homoglyph_map = {
            'a': 'а',  # Cyrillic а (U+0430)
            'e': 'е',  # Cyrillic е (U+0435)
            'o': 'о',  # Cyrillic о (U+043E)
            'p': 'р',  # Cyrillic р (U+0440)
            'c': 'с',  # Cyrillic с (U+0441)
            'x': 'х',  # Cyrillic х (U+0445)
        }

        # Synonym substitution map
        self.synonym_map = {
            "good": ["excellent", "great", "wonderful", "superb", "fine"],
            "bad": ["poor", "terrible", "awful", "dreadful", "inferior"],
            "big": ["large", "huge", "enormous", "massive", "substantial"],
            "small": ["tiny", "little", "minor", "petite", "compact"],
            "fast": ["quick", "rapid", "swift", "speedy", "brisk"],
            "important": ["crucial", "significant", "critical", "vital"],
            "show": ["demonstrate", "reveal", "display", "exhibit"],
            "use": ["utilize", "employ", "apply", "deploy"],
            "make": ["create", "produce", "generate", "construct"],
            "need": ["require", "necessitate", "demand"],
        }

    def character_substitution_attack(
        self, text: str, rate: float = 0.03
    ) -> str:
        """
        Attack by substituting characters with visually similar Unicode.

        Fools models trained on ASCII but evades human visual detection.
        Used to bypass content filters (the modified input looks clean).
        """
        result = []
        for char in text:
            # Only lowercase letters are in the map; matching on char.lower()
            # would turn 'A' into a lowercase Cyrillic letter and visibly
            # change the text
            if char in self.homoglyph_map and random.random() < rate:
                result.append(self.homoglyph_map[char])
            else:
                result.append(char)
        return ''.join(result)

    def word_substitution_attack(
        self,
        text: str,
        max_substitutions: int = 3
    ) -> str:
        """
        Simplified synonym substitution in the style of TextFooler
        (Jin et al., 2020).

        TextFooler ranks words by importance and checks the classifier after
        each swap; this version only substitutes random synonyms from a fixed
        map, so it is a baseline rather than a full implementation.
        """
        words = text.split()
        n_substitutions = 0
        punctuation = '.,!?;:"\''

        for i, word in enumerate(words):
            if n_substitutions >= max_substitutions:
                break

            core = word.strip(punctuation)
            lower_word = core.lower()
            if lower_word in self.synonym_map:
                synonyms = self.synonym_map[lower_word]
                chosen_synonym = random.choice(synonyms)
                if core[0].isupper():
                    chosen_synonym = chosen_synonym.capitalize()
                # Replace only the word itself so surrounding punctuation survives
                words[i] = word.replace(core, chosen_synonym, 1)
                n_substitutions += 1

        return ' '.join(words)

    def typo_insertion_attack(
        self, text: str, rate: float = 0.03
    ) -> str:
        """
        Insert plausible typos that fool models but are readable to humans.
        Swap adjacent characters in words of length > 4.
        """
        words = text.split()
        for i, word in enumerate(words):
            if len(word) > 4 and random.random() < rate:
                # Pick an interior position so the first letter is never moved
                pos = random.randint(1, len(word) - 2)
                word_list = list(word)
                word_list[pos], word_list[pos+1] = word_list[pos+1], word_list[pos]
                words[i] = ''.join(word_list)
        return ' '.join(words)

    def llm_paraphrase_attack(
        self,
        text: str,
        target_label: str,
        classifier_fn: callable
    ) -> dict:
        """
        Use an LLM to generate adversarial paraphrases.

        More powerful than rule-based approaches - can find
        semantic changes that fool classifiers while preserving meaning.

        Note: target_label is currently unused; any change in prediction
        counts as success (untargeted attack).
        """
        original_pred = classifier_fn(text)

        prompt = f"""Generate 5 paraphrases of the following text that:
1. Preserve the exact meaning and information
2. Use different words and sentence structure
3. Read naturally as human text

Original text:
{text}

Return as a JSON array of 5 paraphrased strings."""

        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=1000,
            messages=[{"role": "user", "content": prompt}]
        )

        import json
        candidates = []
        try:
            json_match = re.search(r'\[.*\]', response.content[0].text, re.DOTALL)
            if json_match:
                candidates = json.loads(json_match.group())
        except (json.JSONDecodeError, IndexError, AttributeError):
            # Unparseable response: fall through with no candidates
            pass

        # Try each candidate
        for candidate in candidates:
            candidate_pred = classifier_fn(candidate)
            if candidate_pred != original_pred:
                # difflib measures character-level overlap, not meaning; it is
                # only a rough proxy for semantic similarity
                similarity = difflib.SequenceMatcher(None, text, candidate).ratio()
                return {
                    "attack_succeeded": True,
                    "adversarial_text": candidate,
                    "original_prediction": original_pred,
                    "adversarial_prediction": candidate_pred,
                    "semantic_similarity": similarity
                }

        return {
            "attack_succeeded": False,
            "candidates_tried": len(candidates),
            "original_prediction": original_pred
        }

    def evaluate_attack(
        self,
        original_text: str,
        adversarial_text: str,
        classifier_fn: callable
    ) -> dict:
        """Evaluate if an adversarial text successfully flips prediction."""
        original_pred = classifier_fn(original_text)
        adversarial_pred = classifier_fn(adversarial_text)

        similarity = difflib.SequenceMatcher(None, original_text, adversarial_text).ratio()

        return {
            "original_prediction": original_pred,
            "adversarial_prediction": adversarial_pred,
            "attack_succeeded": original_pred != adversarial_pred,
            "text_similarity": similarity,
            # Heuristic threshold on surface similarity
            "human_imperceptible": similarity > 0.92,
            # Position-wise word comparison; assumes the word count is unchanged
            "modification_count": sum(
                1 for a, b in zip(original_text.split(), adversarial_text.split())
                if a != b
            )
        }
Transferability: Why This Matters for Defense
One of the most important (and alarming) properties of adversarial examples: they transfer across models.
An adversarial example created to fool Model A often also fools Model B - even when A and B have different architectures and were trained independently. This means:
- Attackers don't need access to your specific model - they can attack a locally trained surrogate
- Ensemble defenses (using multiple models) provide less protection than expected
- The adversarial vulnerability is partly intrinsic to the learning paradigm
import torch


def measure_transferability(
    source_model: torch.nn.Module,
    target_models: list[torch.nn.Module],
    test_images: torch.Tensor,
    test_labels: torch.Tensor,
    epsilon: float = 0.03
) -> dict:
    """
    Measure how well adversarial examples transfer across models.

    High transferability indicates shared non-robust features -
    harder to defend with simple model diversity.
    """
    # Generate adversarial examples on source model
    # (reuses pgd_attack defined earlier, which also puts the model in eval mode)
    adversarial_examples = pgd_attack(
        source_model, test_images, test_labels,
        epsilon=epsilon, n_steps=40
    )

    results = {}

    # Source model performance (baseline: should be high fooling rate)
    with torch.no_grad():
        source_adv_preds = source_model(adversarial_examples).argmax(1)
        source_fooling_rate = (source_adv_preds != test_labels).float().mean().item()

    results["source_model"] = {
        "fooling_rate": source_fooling_rate,
        "is_transfer": False,
        "interpretation": "Baseline attack success on source model"
    }

    # Test on each target model
    for i, target_model in enumerate(target_models):
        target_model.eval()
        with torch.no_grad():
            # Adversarial accuracy on target
            target_adv_preds = target_model(adversarial_examples).argmax(1)
            target_fooling_rate = (target_adv_preds != test_labels).float().mean().item()

            # Clean accuracy on target
            target_clean_preds = target_model(test_images).argmax(1)
            target_clean_accuracy = (target_clean_preds == test_labels).float().mean().item()

        # Transfer rate above clean error baseline
        natural_error_rate = 1 - target_clean_accuracy
        transfer_rate = max(0, target_fooling_rate - natural_error_rate)

        results[f"target_model_{i}"] = {
            "clean_accuracy": target_clean_accuracy,
            "adversarial_fooling_rate": target_fooling_rate,
            "transfer_rate": transfer_rate,
            "is_transfer": True,
            "interpretation": f"{'High' if transfer_rate > 0.3 else 'Low'} transferability"
        }

    # Average over target models only (the source entry has is_transfer=False)
    avg_transfer = sum(
        v["transfer_rate"] for k, v in results.items() if v["is_transfer"]
    ) / max(sum(1 for v in results.values() if v["is_transfer"]), 1)

    results["summary"] = {
        "source_fooling_rate": source_fooling_rate,
        "average_transfer_rate": avg_transfer,
        "high_transferability": avg_transfer > 0.3,
        "defense_implication": (
            "Ensemble defense provides limited protection - shared non-robust features"
            if avg_transfer > 0.3
            else "Models have different failure modes - diversity provides some protection"
        )
    }

    return results
Defenses
1. Adversarial Training (Madry et al., 2018)
The most effective empirically verified defense: train on adversarial examples alongside clean examples.
import torch
import torch.nn.functional as F


def adversarial_training_step(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    images: torch.Tensor,
    labels: torch.Tensor,
    epsilon: float = 0.03,
    pgd_steps: int = 7,
    alpha: float = 0.01,
    clean_data_fraction: float = 0.0  # Mix in clean data (helps with accuracy)
) -> dict:
    """
    Single adversarial training step.

    For each batch:
    1. Generate adversarial examples using PGD (7 steps is standard)
    2. Train on adversarial examples (and optionally some clean examples)

    Cost: roughly 7-8x slower than standard training (each PGD step is one
    forward-backward pass, plus the training pass itself).
    This is the dominant cost and explains why adversarially trained models are expensive.

    Args:
        clean_data_fraction: If > 0, mix clean examples into adversarial training
            This trades off robustness for clean accuracy
    """
    # Generate adversarial examples (reuses pgd_attack defined earlier)
    adversarial = pgd_attack(
        model, images, labels,
        epsilon=epsilon,
        alpha=alpha,
        n_steps=pgd_steps,
        random_start=True
    )

    # pgd_attack() switches the model to eval mode, so restore training mode
    # before the weight update (matters for BatchNorm and dropout)
    model.train()

    # Train on adversarial examples; zero_grad also clears the parameter
    # gradients accumulated while crafting the attack
    optimizer.zero_grad()

    if clean_data_fraction > 0:
        # Mixed batch: the first n_clean examples stay clean, the rest are
        # adversarial. Order is preserved so inputs stay aligned with labels.
        n_clean = int(len(images) * clean_data_fraction)
        train_inputs = torch.cat([images[:n_clean], adversarial[n_clean:]])
    else:
        # Pure adversarial training (Madry et al.)
        train_inputs = adversarial

    outputs = model(train_inputs)
    loss = F.cross_entropy(outputs, labels)
    loss.backward()

    # Gradient clipping for training stability
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

    with torch.no_grad():
        clean_outputs = model(images)
        clean_accuracy = (clean_outputs.argmax(1) == labels).float().mean().item()
        adv_accuracy = (model(adversarial).argmax(1) == labels).float().mean().item()

    return {
        "loss": loss.item(),
        "clean_accuracy": clean_accuracy,
        "adversarial_accuracy": adv_accuracy
    }

def trades_loss(
    model: torch.nn.Module,
    images: torch.Tensor,
    labels: torch.Tensor,
    epsilon: float = 0.03,
    alpha: float = 0.007,
    n_steps: int = 10,
    beta: float = 6.0  # Beta controls robustness-accuracy tradeoff
) -> torch.Tensor:
    """
    TRADES (TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization)
    Zhang et al., 2019.

    Decomposes adversarial risk into:
    1. Natural risk (clean loss)
    2. Boundary risk (KL divergence between clean and adversarial predictions)

    Loss = CE(f(x), y) + beta * KL(f(x) || f(x_adv))

    TRADES often achieves better robustness-accuracy tradeoff than pure Madry AT.
    """
    criterion = torch.nn.CrossEntropyLoss()

    # Craft the attack in eval mode so the attack iterations do not update
    # BatchNorm statistics; training mode is restored before the loss
    model.eval()

    # Generate adversarial examples maximizing KL divergence
    adversarial = images.clone().detach()
    adversarial += torch.zeros_like(adversarial).uniform_(-epsilon, epsilon)
    adversarial = torch.clamp(adversarial, 0, 1)

    with torch.no_grad():
        natural_outputs = model(images)
        natural_probs = F.softmax(natural_outputs, dim=1)

    for _ in range(n_steps):
        adversarial.requires_grad_(True)
        adv_outputs = model(adversarial)

        # KL divergence: KL(natural || adversarial)
        kl_loss = F.kl_div(
            F.log_softmax(adv_outputs, dim=1),
            natural_probs,
            reduction='batchmean'
        )
        # autograd.grad returns the input gradient without writing into the
        # model's parameter gradients, which the caller's backward() relies on
        grad = torch.autograd.grad(kl_loss, adversarial)[0]

        with torch.no_grad():
            adversarial = adversarial + alpha * grad.sign()
            adversarial = torch.clamp(adversarial, images - epsilon, images + epsilon)
            adversarial = torch.clamp(adversarial, 0, 1)

    adversarial = adversarial.detach()
    model.train()

    # TRADES loss
    natural_outputs = model(images)
    natural_loss = criterion(natural_outputs, labels)

    adv_outputs = model(adversarial)
    boundary_loss = F.kl_div(
        F.log_softmax(adv_outputs, dim=1),
        F.softmax(natural_outputs.detach(), dim=1),
        reduction='batchmean'
    )

    total_loss = natural_loss + beta * boundary_loss
    return total_loss
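The two training objectives implemented above can be written compactly. The notation here is an assumption introduced for illustration: $f_\theta$ is the model, $\mathcal{D}$ the data distribution, $\mathcal{L}$ the cross-entropy loss, and $\epsilon$ the L-infinity budget. The TRADES form mirrors the loss stated in the docstring above, with $\beta$ as the tradeoff weight.

```latex
% Adversarial training (Madry et al.): minimize the worst-case loss
\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}}
  \left[ \max_{\|\delta\|_\infty \le \epsilon}
         \mathcal{L}\big(f_\theta(x + \delta),\, y\big) \right]

% TRADES (Zhang et al.): clean loss plus a weighted boundary term
\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}}
  \left[ \mathcal{L}\big(f_\theta(x),\, y\big)
         + \beta \max_{\|\delta\|_\infty \le \epsilon}
           \mathrm{KL}\big(f_\theta(x) \,\|\, f_\theta(x + \delta)\big) \right]
```

In both cases the inner maximization is what PGD approximates, and the outer minimization is the ordinary optimizer step. The difference is what the inner attack maximizes: the label loss in adversarial training, and the disagreement with the clean prediction in TRADES.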
2. Randomized Smoothing (Certified Defense)
One of the few defenses with provable robustness guarantees against L2-bounded attacks:
import torch
import numpy as np
from scipy.stats import norm, binom


class RandomizedSmoothing:
    """
    Randomized smoothing: certifiably robust predictions (Cohen et al., 2019).

    For a base classifier f, the smoothed classifier g(x) predicts the
    most likely class when x is perturbed with Gaussian noise:
        g(x) = argmax_c P[f(x + N(0, σ²I)) = c]

    Key result: if the top class c_A has probability at least p_A under the
    noise distribution (with p_A > 0.5), g(x) is certified robust for L2
    perturbations of radius:
        r = σ * Φ^{-1}(p_A)    where Φ^{-1} is the inverse normal CDF
    (this form bounds the runner-up class probability by 1 - p_A)

    Trade-off: clean accuracy decreases with σ (more noise = less accuracy).
        σ = 0.12 → high clean accuracy, low robustness
        σ = 0.50 → lower accuracy, higher robustness radius

    IMPORTANT: This provides certified robustness against L2 attacks only.
    L-infinity certification is a separate (harder) problem.
    """

    def __init__(self, base_model: torch.nn.Module, sigma: float = 0.25, n_classes: int = 10):
        self.base_model = base_model
        self.sigma = sigma
        self.n_classes = n_classes

    def predict(
        self,
        x: torch.Tensor,
        n_samples: int = 100,
        alpha: float = 0.001
    ) -> tuple[int, float]:
        """
        Make a certifiably robust prediction.

        Simplification: the same noise samples select the top class and
        estimate its probability. Cohen et al. use two separate sample sets,
        which is needed for the statistical guarantee to hold exactly.

        Args:
            x: Input tensor [C, H, W]
            n_samples: Monte Carlo samples for estimating probabilities
            alpha: Failure probability for Clopper-Pearson bound

        Returns:
            (predicted_class, certified_radius)
            Returns (-1, 0.0) if prediction confidence insufficient
        """
        self.base_model.eval()

        # Sample noisy predictions
        x_expanded = x.unsqueeze(0).expand(n_samples, -1, -1, -1)
        noise = torch.randn_like(x_expanded) * self.sigma
        noisy_inputs = x_expanded + noise

        with torch.no_grad():
            predictions = self.base_model(noisy_inputs).argmax(1)

        # Vote for each class
        votes = torch.zeros(self.n_classes)
        for pred in predictions:
            if pred.item() < self.n_classes:
                votes[pred.item()] += 1

        top_class = int(votes.argmax().item())
        top_count = int(votes.max().item())

        # One-sided Clopper-Pearson lower confidence bound on P[top class].
        # It is a Beta quantile; binom.ppf would return a count, not a probability.
        from scipy.stats import beta as beta_dist
        p_lower = (
            float(beta_dist.ppf(alpha, top_count, n_samples - top_count + 1))
            if top_count > 0 else 0.0
        )

        if p_lower <= 0.5:
            return -1, 0.0  # ABSTAIN - prediction not certifiable

        # Certified radius: r = σ * Φ^{-1}(p_lower)
        radius = self.sigma * float(norm.ppf(p_lower))

        return top_class, radius

    def certify_dataset(
        self,
        test_loader,
        n_samples: int = 1000,
        radii: list[float] = [0.0, 0.25, 0.5, 0.75, 1.0]
    ) -> dict:
        """
        Compute certified accuracy at multiple L2 radii.
        Standard evaluation for randomized smoothing defenses.
        """
        certified = {r: 0 for r in radii}
        abstained = 0
        total = 0

        for images, labels in test_loader:
            for image, label in zip(images, labels):
                pred_class, cert_radius = self.predict(image, n_samples=n_samples)
                total += 1

                if pred_class == -1:
                    abstained += 1
                    continue

                # Count as certified at radius r only if the prediction is
                # correct and the certified radius reaches r
                if pred_class == label.item():
                    for r in radii:
                        if cert_radius >= r:
                            certified[r] += 1

        return {
            "total": total,
            "abstained": abstained,
            "abstain_rate": abstained / total,
            "certified_accuracy": {
                f"r={r:.2f}": certified[r] / total
                for r in radii
            },
            "sigma": self.sigma,
        }
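The certification arithmetic can be worked through without a model. The sketch below is a stdlib-only illustration under stated assumptions: the vote counts are made up, and the Clopper-Pearson lower bound is found by bisection on the binomial tail instead of a Beta quantile. The helper names are hypothetical.

```python
import math
from statistics import NormalDist


def binomial_tail(n: int, k: int, p: float) -> float:
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, i) * p ** i * (1 - p) ** (n - i)
        for i in range(k, n + 1)
    )


def clopper_pearson_lower(k: int, n: int, alpha: float) -> float:
    """One-sided lower bound: the p at which observing >= k votes has probability alpha."""
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):            # bisection; the tail is increasing in p
        mid = (lo + hi) / 2
        if binomial_tail(n, k, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return lo


def certified_radius(k: int, n: int, sigma: float, alpha: float = 0.001):
    """Return the certified L2 radius, or None when the prediction must abstain."""
    p_lower = clopper_pearson_lower(k, n, alpha)
    if p_lower <= 0.5:
        return None
    return sigma * NormalDist().inv_cdf(p_lower)


# Hypothetical vote counts for the top class out of 100 noisy samples
unanimous = certified_radius(k=100, n=100, sigma=0.25)
narrow = certified_radius(k=55, n=100, sigma=0.25)

print(f"100/100 votes: radius = {unanimous:.3f}")
print(f" 55/100 votes: radius = {narrow}")
```

Even a unanimous vote over 100 samples only certifies a radius of about 0.375 at sigma = 0.25, because the confidence bound on the top-class probability stays below 1. A narrow 55/100 majority cannot be certified at all. Larger radii require more samples or a larger sigma, which is why certification is expensive at inference time.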
3. Input Preprocessing Defenses
Detect and neutralize adversarial perturbations before model inference:
import torch
import torch.nn.functional as F
import numpy as np


class InputPreprocessingDefense:
    """
    Input preprocessing defenses against adversarial perturbations.

    These defenses don't provide formal guarantees but serve as useful
    additional layers. WARNING: Many are broken by adaptive attacks -
    i.e., attacks that optimize against the preprocessor. Never rely
    solely on preprocessing.
    """

    def jpeg_compression(self, image: torch.Tensor, quality: int = 75) -> torch.Tensor:
        """
        JPEG compression removes high-frequency adversarial perturbations.

        Why it works: adversarial perturbations often use high-frequency
        components that JPEG's DCT quantization discards.

        Why it fails (adaptively): attacker can add perturbations that
        survive JPEG compression by restricting to low-frequency components.

        Expects an RGB image [3, H, W] with values in [0, 1].
        """
        try:
            from PIL import Image
            import io

            # Move to CPU and detach before converting to a uint8 HWC array
            img_np = (image.detach().cpu().permute(1, 2, 0).numpy() * 255).astype(np.uint8)
            pil_image = Image.fromarray(img_np)

            buffer = io.BytesIO()
            pil_image.save(buffer, format='JPEG', quality=quality)
            buffer.seek(0)

            decompressed = Image.open(buffer)
            result = torch.tensor(
                np.array(decompressed) / 255.0
            ).permute(2, 0, 1).float()

            return result
        except ImportError:
            # Pillow is not installed: the input is returned unchanged,
            # which means this defense is silently disabled
            return image

    def gaussian_smoothing(
        self, image: torch.Tensor, kernel_size: int = 3, sigma: float = 1.0
    ) -> torch.Tensor:
        """Apply Gaussian blur to remove adversarial noise."""
        channels = image.shape[0]

        # Build a normalized 1D Gaussian, then take the outer product for 2D
        x = torch.arange(kernel_size, dtype=torch.float32) - kernel_size // 2
        kernel_1d = torch.exp(-0.5 * (x / sigma) ** 2)
        kernel_1d = kernel_1d / kernel_1d.sum()
        kernel_2d = kernel_1d.unsqueeze(0) * kernel_1d.unsqueeze(1)
        kernel_2d = kernel_2d.unsqueeze(0).unsqueeze(0)
        kernel_2d = kernel_2d.expand(channels, 1, -1, -1)

        # Depthwise convolution: each channel is blurred independently
        image_batch = image.unsqueeze(0)
        padding = kernel_size // 2
        smoothed = F.conv2d(image_batch, kernel_2d, padding=padding, groups=channels)
        return smoothed.squeeze(0)

    def feature_squeezing(self, image: torch.Tensor, bit_depth: int = 4) -> torch.Tensor:
        """
        Feature squeezing: reduce bit depth to remove adversarial perturbations.

        Adversarial perturbations often rely on precise pixel values.
        Reducing bit depth destroys this precision.
        """
        max_val = 2 ** bit_depth - 1
        squeezed = torch.round(image * max_val) / max_val
        return squeezed
def detect_adversarial_by_consistency(
self,
model: torch.nn.Module,
image: torch.Tensor,
threshold: float = 0.15
) -> dict:
"""
Detect adversarial inputs by comparing predictions on
original vs. preprocessed versions (Feature Squeezing approach).
A clean input's prediction should be stable under preprocessing.
An adversarial input's prediction often changes significantly -
because the perturbation pushes the input near a decision boundary.
"""
model.eval()
with torch.no_grad():
# Original prediction
original_output = F.softmax(model(image.unsqueeze(0)), dim=1)
original_pred = original_output.argmax(1).item()
# Gaussian smoothing
smoothed = self.gaussian_smoothing(image)
smoothed_output = F.softmax(model(smoothed.unsqueeze(0)), dim=1)
# Feature squeezing
squeezed = self.feature_squeezing(image)
squeezed_output = F.softmax(model(squeezed.unsqueeze(0)), dim=1)
# Compute prediction disagreement (L1 distance in probability space)
smooth_distance = (original_output - smoothed_output).abs().max().item()
squeeze_distance = (original_output - squeezed_output).abs().max().item()
max_distance = max(smooth_distance, squeeze_distance)
is_adversarial = max_distance > threshold
return {
"original_prediction": original_pred,
"preprocessing_distance": max_distance,
"is_adversarial": is_adversarial,
"confidence": "high" if max_distance > threshold * 2 else "medium" if is_adversarial else "low",
"smooth_distance": smooth_distance,
"squeeze_distance": squeeze_distance
}
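
To see why bit-depth reduction can erase a small perturbation, and why it is not a guarantee, here is a minimal pure-Python sketch of the same rounding rule applied to single pixel values. It assumes pixels scaled to [0, 1]; the helper name `squeeze_pixel` and the sample values are illustrative and not part of the class above.

```python
def squeeze_pixel(value: float, bit_depth: int = 4) -> float:
    """Quantize one pixel in [0, 1] to 2**bit_depth levels (same rounding rule as above)."""
    max_val = 2 ** bit_depth - 1
    return round(value * max_val) / max_val


# Illustrative perturbation, smaller than half a quantization step (1/30 at 4 bits).
eps = 0.03

# Case 1: the clean pixel sits in the middle of a quantization bin,
# so the clean and perturbed values collapse to the same level.
clean, perturbed = 0.40, 0.40 + eps
print(squeeze_pixel(clean), squeeze_pixel(perturbed))

# Case 2: the clean pixel sits near a bin edge, so the same perturbation
# crosses into the next bin and survives as a full quantization step.
clean, perturbed = 0.43, 0.43 + eps
print(squeeze_pixel(clean), squeeze_pixel(perturbed))
```

The second case is the reason squeezing works better as a detection signal than as a standalone defense: an attacker who knows the bit depth can aim perturbations at pixels near bin edges.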
Production Considerations
| Defense | Clean Accuracy | Adversarial Accuracy (ε=0.03) | Compute Overhead | Formal Guarantee |
|---|---|---|---|---|
| No defense | ~95% | ~5% | 1x | None |
| JPEG preprocessing | ~93% | ~40% | 1.1x | None (breakable) |
| Feature squeezing | ~92% | ~35% | 1.1x | None (breakable) |
| Adversarial training (FGSM) | ~93% | ~50% | 5x | None |
| Adversarial training (PGD-7) | ~90% | ~55% | 7x | None |
| Adversarial training (PGD-40) | ~87% | ~60% | 40x | None |
| TRADES (β=6) | ~84% | ~56% | 10x | None |
| Randomized smoothing (σ=0.25) | ~76% | ~61%* | 3x | L2 (r≤0.5) |
| Randomized smoothing (σ=0.50) | ~67% | ~54%* | 3x | L2 (r≤1.0) |
*At L2 radius corresponding to the σ value
The fundamental robustness-accuracy tradeoff: robust models are less accurate on clean inputs. This is not an engineering failure - it reflects the underlying geometry of high-dimensional feature spaces. Tsipras et al. (2019) prove that for certain data distributions the tradeoff is unavoidable: no classifier can be both optimally accurate and optimally robust.
:::danger Mistake 1: Using Only Preprocessing Defenses
Input preprocessing (JPEG compression, smoothing) can be broken by adaptive attacks that specifically optimize against the preprocessor. An adversary who knows you're applying JPEG compression will craft perturbations that survive it. Never rely solely on preprocessing - it's a speed bump, not a barrier. Use adversarial training for meaningful robustness.
:::
:::warning Mistake 2: Evaluating Only on FGSM
FGSM is a weak, single-step attack. A model that appears "FGSM-robust" may be completely vulnerable to PGD or C&W. Always evaluate against PGD-40 (minimum) and preferably AutoAttack (Croce & Hein, 2020) - the standard benchmark for honest robustness evaluation.
:::
:::warning Mistake 3: Ignoring Adaptive Attacks
If you design a defense, evaluate it against an adversary who knows the defense and optimizes against it. Many defenses that appear strong against standard attacks are immediately broken by adaptive attacks. The correct evaluation is "how does this defense perform when the attacker knows everything about it?"
:::
:::tip Best Practice: Defense-in-Depth
Combine adversarial training (for core robustness) + input preprocessing (for cheap first-pass filtering) + anomaly detection (for flagging suspicious inputs) + monitoring (for detecting attack campaigns). Budget for the compute overhead of adversarial training if robustness is critical for your use case. For safety-critical applications (medical imaging, autonomous systems), accept the accuracy cost and prioritize robustness.
:::
Interview Questions and Answers
Q1: What is an adversarial example and why do they exist?
An adversarial example is an input that has been specifically crafted to cause a model to make a wrong prediction, while appearing nearly identical to a clean input. They exist because neural networks learn non-robust features - statistical correlations with labels that don't align with human perceptions of the input. The Ilyas et al. (2019) paper argues these non-robust features are genuinely predictive in the training distribution but change dramatically under small perturbations. A slight perturbation in the direction of a non-robust feature (found via gradient of loss with respect to input) can dramatically change the model's prediction while leaving human-relevant features unchanged. They're a symptom of the gap between what models learn and what we intend them to learn.
Q2: What is the difference between FGSM and PGD attacks?
FGSM (Fast Gradient Sign Method) takes a single step of size epsilon along the sign of the loss gradient with respect to the input. It's fast but weak: one step rarely finds the worst-case perturbation within the epsilon-ball. PGD (Projected Gradient Descent) starts from a random point in the epsilon-ball, takes many small signed-gradient steps, and projects the perturbation back onto the epsilon-ball after each step. PGD approximates a local maximum of the loss within the epsilon-ball and is widely regarded as the strongest first-order attack, so a model that is robust to PGD is considered adversarially robust to first-order attacks. In practice: use FGSM when you need cheap adversarial training, use PGD with a 7-step inner loop (PGD-7) for stronger adversarial training, and use PGD-40 or AutoAttack for evaluation.
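
The mechanics of the two attacks can be sketched in pure Python on a toy logistic-regression model, where the input gradient has a closed form. This is an illustrative sketch, not a reference implementation: the weights, input, and hyperparameters below are made up, and the input is not clamped to a valid pixel range. For a linear model the gradient sign never changes, so both attacks land on the same corner of the epsilon-ball; on a deep network the loss surface is nonlinear, and that is where PGD's iterated steps typically find a higher-loss point than FGSM's single step.

```python
import math
import random


def loss_and_grad(w, b, x, y):
    """Logistic loss log(1 + exp(-y * (w.x + b))) and its gradient w.r.t. the input x."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    loss = math.log1p(math.exp(-margin))
    scale = -y / (1.0 + math.exp(margin))  # derivative of the loss w.r.t. (w.x + b)
    return loss, [scale * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """FGSM: one signed-gradient step of size eps."""
    _, grad = loss_and_grad(w, b, x, y)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


def pgd(w, b, x, y, eps, alpha, steps, rng):
    """PGD: random start, then small signed-gradient steps with projection."""
    delta = [rng.uniform(-eps, eps) for _ in x]
    for _ in range(steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        _, grad = loss_and_grad(w, b, x_adv, y)
        # Ascent step, then project back onto the L-infinity ball of radius eps.
        delta = [max(-eps, min(eps, di + alpha * sign(g))) for di, g in zip(delta, grad)]
    return [xi + di for xi, di in zip(x, delta)]


# Hypothetical toy model and input; labels are +1 / -1.
w, b = [2.0, -1.0, 0.5], 0.1
x, y = [0.5, 0.2, 0.8], 1
eps = 0.1

x_fgsm = fgsm(w, b, x, y, eps)
x_pgd = pgd(w, b, x, y, eps, alpha=eps / 4, steps=10, rng=random.Random(0))

clean_loss, _ = loss_and_grad(w, b, x, y)
fgsm_loss, _ = loss_and_grad(w, b, x_fgsm, y)
pgd_loss, _ = loss_and_grad(w, b, x_pgd, y)
print(f"clean={clean_loss:.4f}  fgsm={fgsm_loss:.4f}  pgd={pgd_loss:.4f}")
```

The structural difference is visible in the two functions: `fgsm` evaluates the gradient once, while `pgd` re-evaluates it at every step and clips the accumulated perturbation so it never leaves the epsilon-ball.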
Q3: Why does adversarial training reduce clean accuracy?
This is a fundamental tradeoff rooted in geometry. Adversarial training teaches the model to maintain consistent predictions in a larger region around each input (the epsilon-ball). But in high-dimensional spaces, epsilon-balls of different classes can overlap - the model must learn simpler, more conservative decision boundaries to remain robust. Simpler boundaries classify some clean inputs incorrectly. Tsipras et al. (2019) proved that for L-infinity robustness this tradeoff can be unavoidable: in certain data distributions, a classifier can be optimally clean-accurate or optimally adversarially robust, but not both simultaneously. In practice, models trained with adversarial training (PGD-40) typically lose 5-10 percentage points of clean accuracy while achieving 55-60% adversarial accuracy vs. ~5% without defense.
Q4: What is randomized smoothing and what guarantee does it provide?
Randomized smoothing (Cohen et al., 2019) creates a smoothed classifier g(x) = argmax_c P[f(x + δ) = c] with δ ~ N(0, σ²I) - the class that the base classifier f predicts most often when Gaussian noise is added to x. The key certified result: if the top class has probability at least p_A > 1/2 under the noise distribution, the smoothed classifier g is provably robust to L2 perturbations of radius r = σ × Φ⁻¹(p_A), where Φ⁻¹ is the inverse standard normal CDF. It is one of the few defenses with provable L2 robustness guarantees that scales to large models. The tradeoffs: the certification is only for the L2 norm (not L-infinity), and larger σ provides a larger certified radius but lower clean accuracy. A typical operating point, matching the table above: σ=0.25 certifies radii up to about 0.5 at roughly 76% clean accuracy; exact figures depend on the dataset and base classifier.
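
The radius formula is easy to evaluate directly. The sketch below uses only the Python standard library; the function name `certified_radius` and the sample probabilities are illustrative. It assumes `p_a_lower` is a lower bound on the top-class probability strictly below 1, which is what a confidence interval over a finite number of noisy samples gives you.

```python
from statistics import NormalDist


def certified_radius(sigma: float, p_a_lower: float) -> float:
    """L2 radius sigma * Phi^{-1}(p_A); returns 0.0 (abstain) when p_A <= 1/2."""
    if p_a_lower <= 0.5:
        return 0.0
    # inv_cdf requires 0 < p < 1, so p_a_lower must stay strictly below 1.
    return sigma * NormalDist().inv_cdf(p_a_lower)


for sigma in (0.25, 0.50):
    for p_a in (0.60, 0.90, 0.977):
        radius = certified_radius(sigma, p_a)
        print(f"sigma={sigma:.2f}  p_A>={p_a:.3f}  radius={radius:.3f}")
```

With σ=0.25, reaching a radius of about 0.5 requires the top class to win on roughly 97.7% of noisy samples, since Φ⁻¹(0.977) ≈ 2. Doubling σ doubles the radius for the same p_A, but the base classifier usually struggles to keep p_A that high under heavier noise, which is where the clean-accuracy cost comes from.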
Q5: How do adversarial attacks affect NLP systems and what defenses work?
For NLP, adversarial attacks work at character level (typos, homoglyphs, punctuation), word level (synonym substitution, paraphrase), and sentence level (reformulation while preserving meaning). The goal is to flip a classifier's prediction while preserving semantic meaning. What works in defense: (1) data augmentation with adversarial examples during training, so the model has already seen perturbed variants of its inputs; (2) input canonicalization - normalize homoglyphs, fix typos, detect and flag anomalous Unicode; (3) ensemble voting and abstention - abstain when classifiers disagree strongly; (4) certified defenses adapted for discrete text (though certification is harder for discrete inputs than continuous). The key challenge: text is discrete, so gradient-based attacks don't directly apply - attacks must search combinatorially, which is harder but still practical for guided synonym substitution.
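
Input canonicalization, defense (2) above, can be sketched with the standard `unicodedata` module. This is a minimal sketch under stated assumptions: the `HOMOGLYPHS` table is a tiny hand-written sample rather than a complete confusables list, the function name `canonicalize` is hypothetical, and folding Cyrillic letters into Latin only makes sense for a classifier that expects English text.

```python
import unicodedata
from typing import Tuple

# Tiny illustrative homoglyph map (Cyrillic -> Latin). A real deployment would
# need a far larger table and must not mangle legitimate non-Latin text.
HOMOGLYPHS = {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p", "\u0441": "c"}


def canonicalize(text: str) -> Tuple[str, bool]:
    """Return (canonical_text, was_modified)."""
    # NFKC folds compatibility forms, such as fullwidth letters, into plain ones.
    normalized = unicodedata.normalize("NFKC", text)
    cleaned = []
    for ch in normalized:
        if unicodedata.category(ch) == "Cf":
            continue  # drop invisible format characters such as zero-width spaces
        cleaned.append(HOMOGLYPHS.get(ch, ch))
    canonical = "".join(cleaned)
    return canonical, canonical != text


samples = ["free", "fr\u0435e", "fr\u200bee", "\uff46\uff52\uff45\uff45"]
for sample in samples:
    print(repr(sample), "->", canonicalize(sample))
```

The `was_modified` flag is useful beyond cleaning: a spike in inputs that needed canonicalization is itself a signal worth feeding into monitoring.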
Q6: How would you decide whether to invest in adversarial robustness for a production AI system?
Ask three questions: (1) Who would benefit from attacking this system? Content classifiers, fraud detectors, and safety filters have clear adversaries with economic incentives to evade them. Product recommendation systems generally don't. The answer determines how sophisticated the adversary is. (2) What is the impact of adversarial failures? Autonomous vehicles and medical imaging have life-safety implications - accept accuracy cost and prioritize robustness. Customer service chatbots have reputational risk - lightweight defenses plus monitoring may suffice. (3) What is the adversary's access? Physical-world adversarial attacks require physical access to the environment. API-based text classifiers face black-box transfer attacks. Full white-box attacks require model access. Design defenses matched to the realistic threat model, not the worst case in research papers. Then layer: preprocessing for cheap first-pass filtering, adversarial training for core robustness, monitoring for detecting ongoing attack campaigns.
