What are adversarial examples?

Crafting inputs that reliably cause model failures - attack techniques, transferability, and robust defense strategies for production AI systems.

Adversarial Examples

Reading time: ~28 min | Interview relevance: High | Target roles: AI Engineer, ML Security Engineer, Computer Vision Engineer, NLP Engineer

The Stop Sign That Wasn't

The autonomous vehicle's perception system processed 30 frames per second. At each frame, it ran a neural network classifier across detected objects. At a distance of 40 meters, it correctly identified the stop sign as a stop sign. At 35 meters. At 30 meters.

Then at 25 meters - with the vehicle traveling at 45 mph - the classifier returned "Speed Limit 45."

The stop sign looked unremarkable to every human who passed it that day. But it carried a few carefully placed stickers - black and white rectangles arranged in a specific pattern that a passerby would take for graffiti. To the neural network, these inconspicuous modifications shifted the classifier's decision from "STOP" to "Speed Limit 45." The researchers who designed the attack (Eykholt et al., 2017) called it a physical-world adversarial attack. They showed that adversarial perturbations weren't just a theoretical curiosity confined to digital pixel manipulation - they could be printed, physically affixed to real objects, and reliably fool a road-sign classifier in lab and drive-by tests.

This is what makes adversarial examples so important to understand: they reveal that neural networks learn different features than humans do. A model that classifies stop signs correctly 99.9% of the time in normal operation can be made to fail almost every time with targeted, inconspicuous modifications. The implications for any safety-critical AI system are severe.

What Are Adversarial Examples?

An adversarial example is an input that has been specifically crafted to cause a model to make a wrong prediction. The key property: the modification is constrained - typically imperceptible to humans - while the effect on the model is dramatic.
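
Stated a little more formally (a common formulation; the norm $p$ and the budget $\epsilon$ are choices that depend on the threat model, and the notation is assumed here to match the FGSM formula used in this lesson), an adversarial example for a classifier $f$ at an input $x$ with true label $y$ is any $x_{adv} = x + \delta$ such that:

$$\|\delta\|_p \le \epsilon \quad \text{and} \quad f(x + \delta) \ne y$$

With $p = \infty$ the constraint bounds the change to every individual pixel or feature by $\epsilon$, which is the setting used by the FGSM and PGD code in this lesson.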

For images: adversarial examples add small perturbations (often imperceptible) to pixel values. For text: they change a few words, characters, or punctuation marks. For audio: they add inaudible noise that causes speech recognizers to transcribe a different message. For tabular data: they modify feature values within plausible ranges to flip model decisions.

The existence of adversarial examples reveals a fundamental property of neural networks: they make decisions based on features that are statistically correlated with labels in training data - but these features may not align with human-intuitive features. A slight perturbation along a non-robust feature direction can drastically change the prediction while leaving human-relevant features unchanged.
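
One way to build intuition for how a tiny per-feature change can move a prediction so far is the linear argument from Goodfellow et al.: for a linear score, the worst-case shift under an L-infinity budget grows with the number of input features. The sketch below uses randomly drawn weights (a hypothetical model, not a trained one) purely to show the scaling.

```python
import random


def worst_case_logit_shift(dim: int, epsilon: float, seed: int = 0) -> float:
    """Largest change in a linear score w.x under an L-infinity budget epsilon."""
    rng = random.Random(seed)
    # Hypothetical weights of a linear classifier, drawn at random for illustration
    weights = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # The worst-case perturbation moves every feature by epsilon in the
    # direction of its weight's sign, so the score shifts by epsilon * ||w||_1
    delta = [epsilon if w > 0 else -epsilon for w in weights]
    return sum(w * d for w, d in zip(weights, delta))


for dim in (10, 1_000, 100_000):
    shift = worst_case_logit_shift(dim, epsilon=0.01)
    print(f"dim={dim:>7}  per-feature change=0.01  score shift={shift:.2f}")
```

The per-feature change stays at 0.01 in every row, yet the shift in the score grows roughly linearly with the input dimension. Image-sized inputs have enough dimensions for an imperceptible change to cross a decision boundary.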

Why Non-Robust Features Exist

The Ilyas et al. (2019) paper "Adversarial Examples Are Not Bugs, They Are Features" provides a compelling theoretical framework: adversarial examples exist because neural networks learn to use all statistically predictive features, including ones that humans don't recognize as meaningful (non-robust features). These non-robust features are genuinely predictive in normal data but change dramatically under small perturbations. Models that use these features are accurate but brittle; models that restrict themselves to robust features are safer but less accurate.

This helps explain the robustness-accuracy tradeoff seen in practice: forcing a model to ignore predictive but non-robust features typically buys adversarial robustness at the cost of some clean accuracy.

Attack Taxonomy

White-Box Attacks (Full Model Access)

Fast Gradient Sign Method (FGSM) - Goodfellow et al., 2014:

The simplest gradient-based attack. Take one step in the direction that increases loss:

$x_{adv} = x + \epsilon \cdot \text{sign}(\nabla_x \mathcal{L}(f(x), y))$

import torch
import torch.nn.functional as F

def fgsm_attack(
    model: torch.nn.Module,
    images: torch.Tensor,
    labels: torch.Tensor,
    epsilon: float = 0.03
) -> torch.Tensor:
    """
    Fast Gradient Sign Method (FGSM) adversarial attack.

    One-shot gradient attack. Fast but weak - doesn't find worst case.
    Good for generating training data for adversarial training;
    not a reliable evaluation benchmark (use PGD for evaluation).

    Args:
        model: Target model
        images: Input images [B, C, H, W], values in [0, 1]
        labels: True labels
        epsilon: Perturbation magnitude (L-infinity bound)

    Returns:
        Adversarial examples with same shape as images
    """
    images = images.clone().detach().requires_grad_(True)

    # Forward pass
    model.eval()
    outputs = model(images)
    loss = F.cross_entropy(outputs, labels)

    # Compute gradients with respect to input
    model.zero_grad()
    loss.backward()

    # FGSM step: move in gradient sign direction
    with torch.no_grad():
        perturbation = epsilon * images.grad.sign()
        adversarial = images + perturbation
        adversarial = torch.clamp(adversarial, 0, 1)

    return adversarial.detach()


def evaluate_fgsm_robustness(
    model: torch.nn.Module,
    test_loader,
    epsilon_values: list[float] = [0.01, 0.03, 0.05, 0.1]
) -> dict:
    """
    Evaluate model robustness across FGSM epsilon values.
    Shows the clean accuracy / adversarial accuracy tradeoff curve.
    """
    model.eval()
    results = {}

    for epsilon in epsilon_values:
        correct_clean = 0
        correct_adversarial = 0
        total = 0

        for images, labels in test_loader:
            total += labels.size(0)

            with torch.no_grad():
                outputs = model(images)
                correct_clean += (outputs.argmax(1) == labels).sum().item()

            adversarial = fgsm_attack(model, images, labels, epsilon)
            with torch.no_grad():
                adv_outputs = model(adversarial)
                correct_adversarial += (adv_outputs.argmax(1) == labels).sum().item()

        results[epsilon] = {
            "clean_accuracy": correct_clean / total,
            "adversarial_accuracy": correct_adversarial / total,
            "accuracy_drop": (correct_clean - correct_adversarial) / total,
            "attack_success_rate": 1 - (correct_adversarial / total)
        }

    return results
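
The FGSM update can also be checked by hand on a model small enough to differentiate analytically. The sketch below applies one step to a three-feature logistic regression; the weights, input, and epsilon are hypothetical values picked so the arithmetic is easy to follow, and the helper names are not part of any library.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, w, b) -> float:
    """p(y=1 | x) for a logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fgsm_logistic(x, y, w, b, epsilon):
    """One FGSM step against logistic regression with cross-entropy loss."""
    p = predict(x, w, b)
    # For this model the input gradient is analytic: dL/dx = (p - y) * w
    grad = [(p - y) * wi for wi in w]
    sign = [(g > 0) - (g < 0) for g in grad]
    # x_adv = x + epsilon * sign(grad), clipped to the valid range [0, 1]
    return [min(1.0, max(0.0, xi + epsilon * si)) for xi, si in zip(x, sign)]


# Hypothetical toy model and input, chosen only to make the arithmetic visible
w, b = [4.0, -3.0, 2.0], -1.0
x, y = [0.6, 0.3, 0.4], 1
x_adv = fgsm_logistic(x, y, w, b, epsilon=0.2)

print("clean p(y=1):", round(predict(x, w, b), 3))
print("adv   p(y=1):", round(predict(x_adv, w, b), 3))
print("x_adv:", [round(v, 2) for v in x_adv])
```

Every feature moves by exactly epsilon, in whichever direction raises the loss. For a linear model that single step is already the worst case inside the L-infinity ball, which is why FGSM is only a weak attack on nonlinear networks.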

Projected Gradient Descent (PGD) - Madry et al., 2018:

A strong first-order attack and the standard baseline for robustness evaluation. It applies multiple FGSM-style steps, projecting back onto the epsilon-ball after each one:

import torch
import torch.nn.functional as F

def pgd_attack(
    model: torch.nn.Module,
    images: torch.Tensor,
    labels: torch.Tensor,
    epsilon: float = 0.03,
    alpha: float = 0.007,
    n_steps: int = 40,
    random_start: bool = True,
    targeted: bool = False,
    target_labels: torch.Tensor = None
) -> torch.Tensor:
    """
    Projected Gradient Descent (PGD) adversarial attack.

    A strong first-order attack and the standard evaluation baseline.
    Robustness to PGD is necessary but not sufficient: defenses that mask
    gradients can resist PGD and still fall to other attacks.
    PGD searches for a local maximum of the loss within the epsilon-ball.

    Args:
        model: Target model
        images: Clean images [B, C, H, W]
        labels: True labels
        epsilon: L-infinity perturbation bound
        alpha: Per-step perturbation magnitude (typically epsilon / 4)
        n_steps: Number of gradient steps (40-100 for evaluation)
        random_start: Start from random point in epsilon-ball (recommended)
        targeted: If True, minimize loss for target_labels
        target_labels: Target labels for targeted attack

    Returns:
        Adversarial examples
    """
    model.eval()
    batch_size = images.shape[0]

    # Initialize perturbation
    if random_start:
        delta = torch.zeros_like(images).uniform_(-epsilon, epsilon)
    else:
        delta = torch.zeros_like(images)

    delta.requires_grad_(True)

    for step in range(n_steps):
        adversarial = torch.clamp(images + delta, 0, 1)

        # Forward pass and loss
        outputs = model(adversarial)

        if targeted and target_labels is not None:
            # Targeted: minimize loss for target class (move toward target)
            loss = -F.cross_entropy(outputs, target_labels)
        else:
            # Untargeted: maximize loss for true class
            loss = F.cross_entropy(outputs, labels)

        # Backward pass
        loss.backward()

        with torch.no_grad():
            # Gradient step
            delta_grad = delta.grad.sign()
            delta.data = delta.data + alpha * delta_grad

            # Project back to epsilon-ball (L-infinity)
            delta.data = torch.clamp(delta.data, -epsilon, epsilon)

            # Ensure valid image range
            delta.data = torch.clamp(images + delta.data, 0, 1) - images

        if delta.grad is not None:
            delta.grad.zero_()

    return (images + delta).detach()


def pgd_with_restarts(
    model: torch.nn.Module,
    images: torch.Tensor,
    labels: torch.Tensor,
    epsilon: float = 0.03,
    alpha: float = 0.007,
    n_steps: int = 40,
    n_restarts: int = 5
) -> torch.Tensor:
    """
    PGD with multiple random restarts for stronger adversarial examples.
    AutoAttack uses this strategy for reliable evaluation.
    """
    model.eval()
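    # Start at -inf on the input's device so the first restart always registers
    best_loss = torch.full((images.shape[0],), float('-inf'), device=images.device)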
    best_adversarial = images.clone()

    for _ in range(n_restarts):
        adversarial = pgd_attack(
            model, images, labels,
            epsilon=epsilon, alpha=alpha, n_steps=n_steps,
            random_start=True
        )

        with torch.no_grad():
            outputs = model(adversarial)
            loss = F.cross_entropy(outputs, labels, reduction='none')

            # Track best (worst-case) adversarial per example
            improved = loss > best_loss
            best_adversarial[improved] = adversarial[improved]
            best_loss[improved] = loss[improved]

    return best_adversarial
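
The projection is what actually enforces the perturbation budget: with the defaults above, 40 steps of size 0.007 could travel 0.28 in total, far more than epsilon = 0.03. A minimal single-pixel sketch of the two projections follows, assuming the gradient sign never changes (the worst case for drift); `pgd_step_1d` and `run` are illustrative helpers, not part of the attack code above.

```python
def pgd_step_1d(x: float, delta: float, grad_sign: int,
                alpha: float, epsilon: float) -> float:
    """One signed-gradient step on a single pixel, followed by both projections."""
    delta = delta + alpha * grad_sign
    # Projection 1: stay inside the L-infinity ball of radius epsilon
    delta = max(-epsilon, min(epsilon, delta))
    # Projection 2: keep the perturbed pixel inside the valid range [0, 1]
    delta = max(0.0, min(1.0, x + delta)) - x
    return delta


def run(x: float, n_steps: int = 40, alpha: float = 0.007, epsilon: float = 0.03) -> float:
    delta = 0.0
    for _ in range(n_steps):
        # Assumption: the gradient keeps pointing the same way on every step
        delta = pgd_step_1d(x, delta, grad_sign=+1, alpha=alpha, epsilon=epsilon)
    return delta


mid_pixel = run(x=0.50)     # limited by the epsilon-ball
bright_pixel = run(x=0.99)  # limited by the pixel range before the ball matters
print(f"mid pixel delta:    {mid_pixel:.3f}")
print(f"bright pixel delta: {bright_pixel:.3f}")
```

The mid-gray pixel saturates at the epsilon bound after a handful of steps, while the near-white pixel is stopped earlier by the [0, 1] clamp.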

Carlini-Wagner (C&W) Attack - Carlini & Wagner, 2017:

A strong optimization-based attack. It minimizes the size of the perturbation while still achieving misclassification:

import torch
import torch.optim as optim
import torch.nn.functional as F

def cw_l2_attack(
    model: torch.nn.Module,
    images: torch.Tensor,
    labels: torch.Tensor,
    confidence: float = 0.0,
    lr: float = 0.01,
    n_iterations: int = 1000,
    binary_search_steps: int = 9,
    c_upper: float = 1e4
) -> torch.Tensor:
    """
    Carlini-Wagner L2 Attack (C&W).

    Minimizes L2 perturbation while achieving misclassification.
    Often finds smaller perturbations than PGD.
    Slower than PGD but stronger - breaks many defenses PGD doesn't.

    Formulation (untargeted):
    minimize ||x' - x||_2^2 + c * f(x')
    where f(x') <= 0 means x' is misclassified with margin 'confidence'
    """
    model.eval()
    batch_size = images.shape[0]

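    device = images.device

    # Work in tanh space to handle box constraints naturally
    # x = (tanh(w) + 1) / 2 maps to [0, 1]
    # Clamp first: atanh(+/-1) is infinite for pixels exactly at 0 or 1
    w = torch.atanh((images * 2 - 1).clamp(-1 + 1e-6, 1 - 1e-6))

    best_adversarial = images.clone()
    best_l2 = torch.full((batch_size,), float('inf'), device=device)

    # Binary search over constant c (all trackers live on the input's device)
    c_lower = torch.zeros(batch_size, device=device)
    c_upper_vals = torch.full((batch_size,), c_upper, device=device)
    c = torch.ones(batch_size, device=device)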

    for _ in range(binary_search_steps):
        # Optimize perturbation for current c
        w_opt = w.clone().detach().requires_grad_(True)
        optimizer = optim.Adam([w_opt], lr=lr)

        for iteration in range(n_iterations):
            optimizer.zero_grad()

            # Map back to image space
            x_adv = (torch.tanh(w_opt) + 1) / 2

            # L2 loss
            l2_loss = ((x_adv - images) ** 2).sum(dim=(1, 2, 3))

            # Classification loss (CW objective)
            outputs = model(x_adv)

            # Gather correct class logits (squeeze only dim 1 so batch size 1 still works)
            correct_logits = outputs.gather(1, labels.unsqueeze(1)).squeeze(1)

            # Max other class logits
            mask = torch.ones_like(outputs, dtype=torch.bool)
            mask.scatter_(1, labels.unsqueeze(1), False)
            other_logits = outputs[mask].view(batch_size, -1).max(dim=1).values

            # CW objective (untargeted): f = max(correct - other, -confidence).
            # Minimizing f pushes the true-class logit below the best other class.
            f_loss = torch.clamp(correct_logits - other_logits, min=-confidence)

            # Sum over the batch so each example's gradient does not depend on batch size
            total_loss = (l2_loss + c * f_loss).sum()
            total_loss.backward()
            optimizer.step()

        # Check results and update binary search
        with torch.no_grad():
            x_final = (torch.tanh(w_opt) + 1) / 2
            final_outputs = model(x_final)
            final_preds = final_outputs.argmax(1)
            final_l2 = ((x_final - images) ** 2).sum(dim=(1, 2, 3)).sqrt()

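            for i in range(batch_size):
                if final_preds[i] != labels[i]:
                    # Success: keep the smallest perturbation found, then try a smaller c
                    if final_l2[i] < best_l2[i]:
                        best_l2[i] = final_l2[i]
                        best_adversarial[i] = x_final[i]
                    c_upper_vals[i] = c[i]
                else:
                    # Failure: the misclassification term needs more weight
                    c_lower[i] = c[i]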

        c = (c_lower + c_upper_vals) / 2

    return best_adversarial
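
The two ingredients of the C&W formulation, the tanh change of variables and the logit-margin term, can be checked in isolation. This is a scalar sketch in plain Python; the function names are illustrative and the logits are made-up values.

```python
import math


def to_tanh_space(x: float, margin: float = 1e-6) -> float:
    """Map a pixel in [0, 1] to an unconstrained w with x = (tanh(w) + 1) / 2."""
    # Clamp away from the endpoints, where atanh diverges
    z = min(1.0 - margin, max(-1.0 + margin, 2.0 * x - 1.0))
    return math.atanh(z)


def from_tanh_space(w: float) -> float:
    """Inverse map: any real w lands back inside [0, 1]."""
    return (math.tanh(w) + 1.0) / 2.0


def cw_margin(logits, true_label: int, confidence: float = 0.0) -> float:
    """Untargeted C&W term: max(Z_y - max_{i != y} Z_i, -confidence)."""
    best_other = max(z for i, z in enumerate(logits) if i != true_label)
    return max(logits[true_label] - best_other, -confidence)


# The box constraint holds no matter how far an optimizer pushes w
for w in (-50.0, -1.0, 0.0, 2.5, 50.0):
    assert 0.0 <= from_tanh_space(w) <= 1.0

print("round trip of 0.3:", round(from_tanh_space(to_tanh_space(0.3)), 6))
# Positive margin: the true class still has the largest logit
print("correctly classified:", cw_margin([2.0, 0.5, -1.0], true_label=0))
# Margin clamped at -confidence: misclassified with the requested margin
print("misclassified:", cw_margin([0.5, 2.0, -1.0], true_label=0, confidence=1.0))
```

Because the optimizer works on `w` rather than on pixels, no clipping step is needed during the attack, and the clamp at `-confidence` stops the optimizer from spending perturbation budget once the example is already misclassified by the requested margin.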

Black-Box Attacks (API Access Only)

Transfer Attack: Generate adversarial examples on a locally trained surrogate model. Often transfer to the target model due to shared non-robust features.

Score-Based Attack (NES): Estimate gradients by querying with small perturbations:

import torch

def nes_black_box_attack(
    model_query_fn: callable,
    image: torch.Tensor,
    label: int,
    epsilon: float = 0.03,
    n_queries: int = 500,
    mu: float = 0.01,
    step_size: float = 0.01,
    n_directions: int = 20
) -> tuple[torch.Tensor, int]:
    """
    Natural Evolution Strategy (NES) black-box attack.

    Estimates gradients using finite differences from model outputs.
    Doesn't need model weights - only API access to output scores.

    Args:
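        model_query_fn: Function(image_tensor) → float score, assumed positive once the
            model predicts a wrong class (e.g. best wrong-class score minus true-class score)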
        image: Clean image to perturb [C, H, W]
        label: True label
        epsilon: L-infinity perturbation bound
        n_queries: Maximum API queries allowed
        mu: Gradient estimation smoothing factor
        step_size: Gradient step size
        n_directions: Random directions per gradient estimate
    """
    delta = torch.zeros_like(image)
    queries_used = 0

    for _ in range(n_queries // (2 * n_directions)):
        # Sample random directions for gradient estimation
        directions = torch.randn(n_directions, *image.shape)

        # Estimate gradient via finite differences
        grad_estimate = torch.zeros_like(image)

        for d in directions:
            d_norm = d / (d.norm() + 1e-8)
            pos_query = torch.clamp(image + delta + mu * d_norm, 0, 1)
            neg_query = torch.clamp(image + delta - mu * d_norm, 0, 1)

            pos_score = model_query_fn(pos_query)  # Higher = more misclassified
            neg_score = model_query_fn(neg_query)
            queries_used += 2

            grad_estimate += (pos_score - neg_score) / (2 * mu) * d_norm

        grad_estimate /= n_directions

        # Step in gradient direction
        delta = delta + step_size * grad_estimate.sign()
        delta = torch.clamp(delta, -epsilon, epsilon)
        delta = torch.clamp(image + delta, 0, 1) - image

        # Check if attack succeeded (this check also costs one query)
        current_score = model_query_fn(torch.clamp(image + delta, 0, 1))
        queries_used += 1
        if current_score > 0:  # Positive score: model now predicts a wrong class
            print(f"Attack succeeded after {queries_used} queries")
            break

    return (image + delta).detach(), queries_used
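
To see that paired queries really do recover a usable gradient, the estimator can be tested on a function whose gradient is known. The sketch below is a plain-Python version of the antithetic estimate; unlike the code above it leaves the Gaussian directions unnormalized, and the linear `score` function is a hypothetical stand-in for a black-box model.

```python
import math
import random


def nes_gradient(f, x, mu: float, n_directions: int, seed: int = 0):
    """Antithetic NES estimate of grad f(x) using only function evaluations."""
    rng = random.Random(seed)
    estimate = [0.0] * len(x)
    for _ in range(n_directions):
        d = [rng.gauss(0.0, 1.0) for _ in x]
        plus = f([xi + mu * di for xi, di in zip(x, d)])
        minus = f([xi - mu * di for xi, di in zip(x, d)])
        # Each pair of queries contributes a finite difference along d
        scale = (plus - minus) / (2.0 * mu)
        estimate = [e + scale * di for e, di in zip(estimate, d)]
    return [e / n_directions for e in estimate]


def cosine(a, b) -> float:
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)


# Hypothetical black-box score with a known gradient, used only as ground truth
true_grad = [1.0, -2.0, 3.0, -4.0, 5.0]


def score(x) -> float:
    return sum(g * xi for g, xi in zip(true_grad, x))


for n in (5, 50, 4000):
    est = nes_gradient(score, [0.0] * 5, mu=0.01, n_directions=n)
    print(f"{n:>5} directions ({2 * n} queries): cosine = {cosine(est, true_grad):.3f}")
```

The estimate is noisy with few directions and aligns with the true gradient as the query budget grows, which is the trade-off the `n_directions` and `n_queries` parameters control.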

Text Adversarial Attacks

For NLP models, adversarial attacks work at the word, character, or sentence level:

import anthropic
import random
import re
import difflib

client = anthropic.Anthropic()

class TextAdversarialAttacker:
    """
    Generate adversarial text examples that fool NLP classifiers.

    Text attacks must be semantics-preserving (humans should reach the same
    conclusion) while flipping the model's prediction.

    Key constraint: humans must not notice the modification.
    """

    def __init__(self):
        # Homoglyph map: Latin chars replaced with visually identical Unicode
        self.homoglyph_map = {
            'a': 'а',  # Cyrillic а (U+0430)
            'e': 'е',  # Cyrillic е (U+0435)
            'o': 'о',  # Cyrillic о (U+043E)
            'p': 'р',  # Cyrillic р (U+0440)
            'c': 'с',  # Cyrillic с (U+0441)
            'x': 'х',  # Cyrillic х (U+0445)
        }

        # Synonym substitution map
        self.synonym_map = {
            "good": ["excellent", "great", "wonderful", "superb", "fine"],
            "bad": ["poor", "terrible", "awful", "dreadful", "inferior"],
            "big": ["large", "huge", "enormous", "massive", "substantial"],
            "small": ["tiny", "little", "minor", "petite", "compact"],
            "fast": ["quick", "rapid", "swift", "speedy", "brisk"],
            "important": ["crucial", "significant", "critical", "vital"],
            "show": ["demonstrate", "reveal", "display", "exhibit"],
            "use": ["utilize", "employ", "apply", "deploy"],
            "make": ["create", "produce", "generate", "construct"],
            "need": ["require", "necessitate", "demand"],
        }

    def character_substitution_attack(
        self, text: str, rate: float = 0.03
    ) -> str:
        """
        Attack by substituting characters with visually similar Unicode.
        Fools models trained on ASCII but evades human visual detection.
        Used to bypass content filters (the original input looks clean).
        """
        result = []
        for char in text:
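            # Only lowercase letters are mapped; looking up char.lower() would
            # silently replace an uppercase "A" with a lowercase Cyrillic "а"
            if char in self.homoglyph_map and random.random() < rate:
                homoglyph = self.homoglyph_map[char]
                result.append(homoglyph)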
            else:
                result.append(char)
        return ''.join(result)

    def word_substitution_attack(
        self,
        text: str,
        max_substitutions: int = 3
    ) -> str:
        """
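        Simplified synonym substitution in the spirit of TextFooler (Jin et al., 2020).
        This sketch swaps the first few words found in a fixed synonym table;
        TextFooler itself ranks words by importance and queries the classifier
        to choose substitutions that flip the prediction.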
        """
        words = text.split()
        n_substitutions = 0

        for i, word in enumerate(words):
            if n_substitutions >= max_substitutions:
                break

            punct = '.,!?;:"\''
            core = word.strip(punct)
            lower_word = core.lower()
            if core and lower_word in self.synonym_map:
                synonyms = self.synonym_map[lower_word]
                chosen_synonym = random.choice(synonyms)
                if core[0].isupper():
                    chosen_synonym = chosen_synonym.capitalize()
                # Replace only the word itself so surrounding punctuation survives
                words[i] = word.replace(core, chosen_synonym, 1)
                n_substitutions += 1

        return ' '.join(words)

    def typo_insertion_attack(
        self, text: str, rate: float = 0.03
    ) -> str:
        """
        Insert plausible typos that fool models but are readable to humans.
        Swap adjacent characters in words of length > 4.
        """
        words = text.split()
        for i, word in enumerate(words):
            if len(word) > 4 and random.random() < rate:
                pos = random.randint(1, len(word) - 2)
                word_list = list(word)
                word_list[pos], word_list[pos+1] = word_list[pos+1], word_list[pos]
                words[i] = ''.join(word_list)
        return ' '.join(words)

    def llm_paraphrase_attack(
        self,
        text: str,
        target_label: str,
        classifier_fn: callable
    ) -> dict:
        """
        Use an LLM to generate adversarial paraphrases.
        More powerful than rule-based approaches - can find
        semantic changes that fool classifiers while preserving meaning.
        """
        original_pred = classifier_fn(text)

        prompt = f"""Generate 5 paraphrases of the following text that:
1. Preserve the exact meaning and information
2. Use different words and sentence structure
3. Read naturally as human text

Original text:
{text}

Return as a JSON array of 5 paraphrased strings."""

        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=1000,
            messages=[{"role": "user", "content": prompt}]
        )

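        import json
        candidates = []
        try:
            json_match = re.search(r'\[.*\]', response.content[0].text, re.DOTALL)
            if json_match:
                candidates = json.loads(json_match.group())
        except (json.JSONDecodeError, IndexError, AttributeError):
            # Empty or malformed response: continue with no candidates
            candidates = []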

        # Try each candidate
        for candidate in candidates:
            candidate_pred = classifier_fn(candidate)

            if candidate_pred != original_pred:
                # Character-level overlap only: a cheap proxy, not a true semantic measure
                similarity = difflib.SequenceMatcher(None, text, candidate).ratio()
                return {
                    "attack_succeeded": True,
                    "adversarial_text": candidate,
                    "original_prediction": original_pred,
                    "adversarial_prediction": candidate_pred,
                    "semantic_similarity": similarity
                }

        return {
            "attack_succeeded": False,
            "candidates_tried": len(candidates),
            "original_prediction": original_pred
        }

    def evaluate_attack(
        self,
        original_text: str,
        adversarial_text: str,
        classifier_fn: callable
    ) -> dict:
        """Evaluate if an adversarial text successfully flips prediction."""
        original_pred = classifier_fn(original_text)
        adversarial_pred = classifier_fn(adversarial_text)
        similarity = difflib.SequenceMatcher(None, original_text, adversarial_text).ratio()

        return {
            "original_prediction": original_pred,
            "adversarial_prediction": adversarial_pred,
            "attack_succeeded": original_pred != adversarial_pred,
            "text_similarity": similarity,
            "human_imperceptible": similarity > 0.92,
            "modification_count": sum(
                1 for a, b in zip(original_text.split(), adversarial_text.split())
                if a != b
            )
        }
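
A small end-to-end sketch makes the character-level attack concrete. The keyword filter below is a deliberately naive stand-in for a classifier (an assumption for illustration), and the substitution is applied to every mapped letter so that the result is deterministic.

```python
# Hypothetical homoglyph table (Latin -> Cyrillic look-alikes)
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "p": "\u0440", "c": "\u0441"}
BLOCKLIST = {"free", "prize"}


def keyword_filter(text: str) -> str:
    """Toy classifier: flags text that contains a blocked word."""
    words = {w.strip(".,!?").lower() for w in text.split()}
    return "spam" if words & BLOCKLIST else "ok"


def homoglyph_all(text: str) -> str:
    """Deterministic variant: replace every mapped lowercase letter."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


original = "Claim your free prize now!"
adversarial = homoglyph_all(original)

print(keyword_filter(original), "->", keyword_filter(adversarial))
print("strings equal:", original == adversarial)
print("non-ASCII chars:", sum(1 for ch in adversarial if ord(ch) > 127))
```

The two strings render almost identically, but they differ at the code-point level, so an exact-match filter no longer fires. Counting non-ASCII characters, as the last line does, hints at one cheap detection signal for text that is expected to be ASCII.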

Transferability: Why This Matters for Defense

One of the most important (and alarming) properties of adversarial examples: they transfer across models.

An adversarial example created to fool Model A often also fools Model B - even when A and B have different architectures and were trained independently. This means:

Attackers don't need access to your specific model - they can attack a locally trained surrogate
Ensemble defenses (using multiple models) provide less protection than expected
The adversarial vulnerability is partly intrinsic to the learning paradigm

import torch

def measure_transferability(
    source_model: torch.nn.Module,
    target_models: list[torch.nn.Module],
    test_images: torch.Tensor,
    test_labels: torch.Tensor,
    epsilon: float = 0.03
) -> dict:
    """
    Measure how well adversarial examples transfer across models.

    High transferability indicates shared non-robust features -
    harder to defend with simple model diversity.
    """
    # Generate adversarial examples on source model
    adversarial_examples = pgd_attack(
        source_model, test_images, test_labels,
        epsilon=epsilon, n_steps=40
    )

    results = {}

    # Source model performance (baseline: should be high fooling rate)
    with torch.no_grad():
        source_adv_preds = source_model(adversarial_examples).argmax(1)
        source_fooling_rate = (source_adv_preds != test_labels).float().mean().item()

    results["source_model"] = {
        "fooling_rate": source_fooling_rate,
        "is_transfer": False,
        "interpretation": "Baseline attack success on source model"
    }

    # Test on each target model
    for i, target_model in enumerate(target_models):
        target_model.eval()
        with torch.no_grad():
            # Adversarial accuracy on target
            target_adv_preds = target_model(adversarial_examples).argmax(1)
            target_fooling_rate = (target_adv_preds != test_labels).float().mean().item()

            # Clean accuracy on target
            target_clean_preds = target_model(test_images).argmax(1)
            target_clean_accuracy = (target_clean_preds == test_labels).float().mean().item()

        # Transfer rate above clean error baseline
        natural_error_rate = 1 - target_clean_accuracy
        transfer_rate = max(0, target_fooling_rate - natural_error_rate)

        results[f"target_model_{i}"] = {
            "clean_accuracy": target_clean_accuracy,
            "adversarial_fooling_rate": target_fooling_rate,
            "transfer_rate": transfer_rate,
            "is_transfer": True,
            "interpretation": f"{'High' if transfer_rate > 0.3 else 'Low'} transferability"
        }

    avg_transfer = sum(
        v["transfer_rate"] for k, v in results.items() if v["is_transfer"]
    ) / max(sum(1 for v in results.values() if v["is_transfer"]), 1)

    results["summary"] = {
        "source_fooling_rate": source_fooling_rate,
        "average_transfer_rate": avg_transfer,
        "high_transferability": avg_transfer > 0.3,
        "defense_implication": (
            "Ensemble defense provides limited protection - shared non-robust features"
            if avg_transfer > 0.3
            else "Models have different failure modes - diversity provides some protection"
        )
    }

    return results
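
The transfer rate used above subtracts the target model's natural error rate, so that examples the target would have misclassified anyway are not counted as transferred. A worked instance of that arithmetic, using hypothetical numbers rather than measurements:

```python
def transfer_rate(target_clean_accuracy: float, target_fooling_rate: float) -> float:
    """Fooling rate on the target in excess of its natural error rate."""
    natural_error_rate = 1.0 - target_clean_accuracy
    return max(0.0, target_fooling_rate - natural_error_rate)


# Hypothetical numbers: the target is 94% accurate on clean data and
# misclassifies 51% of the adversarial examples crafted on the source model
rate = transfer_rate(target_clean_accuracy=0.94, target_fooling_rate=0.51)
print(f"transfer rate: {rate:.2f} -> {'High' if rate > 0.3 else 'Low'} transferability")
```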

Defenses

1. Adversarial Training (Madry et al., 2018)

The most effective empirically verified defense: train on adversarial examples alongside clean examples.
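
Madry et al. frame this as a saddle-point problem. Written out (with $\theta$ the model parameters and $\mathcal{D}$ the data distribution, notation assumed here to match the FGSM formula earlier in this lesson):

$$\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \max_{\|\delta\|_\infty \le \epsilon} \mathcal{L}(f_\theta(x + \delta), y) \right]$$

The inner maximization is what PGD approximates for each batch; the outer minimization is the ordinary weight update on the resulting examples.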

import torch
import torch.nn.functional as F

def adversarial_training_step(
    model: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    images: torch.Tensor,
    labels: torch.Tensor,
    epsilon: float = 0.03,
    pgd_steps: int = 7,
    alpha: float = 0.01,
    clean_data_fraction: float = 0.0  # Mix in clean data (helps with accuracy)
) -> dict:
    """
    Single adversarial training step.

    For each batch:
    1. Generate adversarial examples using PGD (7 steps is standard)
    2. Train on adversarial examples (and optionally some clean examples)

    Cost: roughly (pgd_steps + 1)x standard training, so about 8x with 7 steps
    (each PGD step is one forward-backward pass, plus one pass for the weight update).
    This is the dominant cost and explains why adversarially trained models are expensive.

    Args:
        clean_data_fraction: If > 0, mix clean examples into adversarial training
                             This trades off robustness for clean accuracy
    """
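    # Generate adversarial examples (pgd_attack switches the model to eval mode)
    adversarial = pgd_attack(
        model, images, labels,
        epsilon=epsilon,
        alpha=alpha,
        n_steps=pgd_steps,
        random_start=True
    )

    # Restore train mode before the weight update; otherwise BatchNorm and
    # Dropout would stay in inference behavior for the rest of training
    model.train()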

    # Train on adversarial examples
    optimizer.zero_grad()

    if clean_data_fraction > 0:
        # Mix clean and adversarial examples within the batch.
        # (This is plain data mixing, not TRADES, which adds a KL regularizer.)
        # Keeping the original order keeps the inputs aligned with `labels`.
        n_clean = int(len(images) * clean_data_fraction)
        mixed_inputs = torch.cat([images[:n_clean], adversarial[n_clean:]])
        outputs = model(mixed_inputs)
    else:
        # Pure adversarial training (Madry et al.)
        outputs = model(adversarial)

    loss = F.cross_entropy(outputs, labels)
    loss.backward()

    # Gradient clipping for training stability
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

    with torch.no_grad():
        clean_outputs = model(images)
        clean_accuracy = (clean_outputs.argmax(1) == labels).float().mean().item()
        adv_accuracy = (model(adversarial).argmax(1) == labels).float().mean().item()

    return {
        "loss": loss.item(),
        "clean_accuracy": clean_accuracy,
        "adversarial_accuracy": adv_accuracy
    }


def trades_loss(
    model: torch.nn.Module,
    images: torch.Tensor,
    labels: torch.Tensor,
    epsilon: float = 0.03,
    alpha: float = 0.007,
    n_steps: int = 10,
    beta: float = 6.0  # Beta controls robustness-accuracy tradeoff
) -> torch.Tensor:
    """
    TRADES (TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization)
    Zhang et al., 2019.

    Decomposes adversarial risk into:
    1. Natural risk (clean loss)
    2. Boundary risk (KL divergence between clean and adversarial predictions)

    Loss = CE(f(x), y) + beta * KL(f(x) || f(x_adv))

    TRADES often achieves better robustness-accuracy tradeoff than pure Madry AT.
    """
    criterion = torch.nn.CrossEntropyLoss()

    # Craft the attack in eval mode so BatchNorm statistics are not updated
    model.eval()
    with torch.no_grad():
        natural_probs = F.softmax(model(images), dim=1)

    # Random start inside the epsilon-ball, kept in the valid pixel range
    adversarial = images.detach() + torch.empty_like(images).uniform_(-epsilon, epsilon)
    adversarial = torch.clamp(adversarial, 0, 1)

    for _ in range(n_steps):
        adversarial.requires_grad_(True)

        # KL(natural || adversarial): kl_div takes log-probs first, probs second
        kl_loss = F.kl_div(
            F.log_softmax(model(adversarial), dim=1),
            natural_probs,
            reduction='batchmean'
        )

        # autograd.grad avoids accumulating attack gradients into model parameters
        grad = torch.autograd.grad(kl_loss, adversarial)[0]
        with torch.no_grad():
            adversarial = adversarial + alpha * grad.sign()
            adversarial = torch.clamp(adversarial, images - epsilon, images + epsilon)
            adversarial = torch.clamp(adversarial, 0, 1)

    adversarial = adversarial.detach()

    # Back to train mode for the loss that will be backpropagated
    model.train()

    # TRADES loss
    natural_outputs = model(images)
    natural_loss = criterion(natural_outputs, labels)

    adv_outputs = model(adversarial)
    boundary_loss = F.kl_div(
        F.log_softmax(adv_outputs, dim=1),
        F.softmax(natural_outputs.detach(), dim=1),
        reduction='batchmean'
    )

    total_loss = natural_loss + beta * boundary_loss
    return total_loss
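
The boundary term is easier to read with concrete numbers. The sketch below computes KL(p || q) for a few hand-picked logit vectors (hypothetical values) in plain Python. In the PyTorch code above, `F.kl_div(log_q, p)` computes the same quantity, with the log-probabilities passed first and the reference distribution second.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q) -> float:
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits for one input before and after a perturbation
clean = softmax([3.0, 1.0, 0.2])
barely_moved = softmax([2.9, 1.1, 0.2])
flipped = softmax([1.0, 3.0, 0.2])

print(f"KL(clean || clean)        = {kl_divergence(clean, clean):.4f}")
print(f"KL(clean || barely_moved) = {kl_divergence(clean, barely_moved):.4f}")
print(f"KL(clean || flipped)      = {kl_divergence(clean, flipped):.4f}")
```

The term is zero when the perturbation leaves the prediction unchanged and grows as the adversarial prediction drifts away from the clean one, so `beta` directly sets how heavily that drift is penalized relative to clean accuracy.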

2. Randomized Smoothing (Certified Defense)

Randomized smoothing is one of the few certified defenses that scales to large networks, and it gives provable robustness guarantees against L2-bounded attacks.
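
The certificate itself is a short calculation: count how often the top class wins under noise, turn that count into a one-sided lower confidence bound on its probability, and map the bound through the inverse normal CDF. The sketch below does this with the standard library only, finding the Clopper-Pearson bound by bisection; it is the same bound as the `alpha`-quantile of a Beta(k, n - k + 1) distribution. The vote counts are hypothetical.

```python
import math
from statistics import NormalDist


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """P[X >= k] for X ~ Binomial(n, p), computed in log space for stability."""
    total = 0.0
    for i in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return total


def clopper_pearson_lower(k: int, n: int, alpha: float) -> float:
    """One-sided lower bound p_L solving P[X >= k | p_L] = alpha, via bisection."""
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if binom_upper_tail(k, n, mid) < alpha:
            lo = mid  # tail still too small: the bound lies at a larger p
        else:
            hi = mid
    return lo


def certified_radius(k: int, n: int, sigma: float, alpha: float = 0.001):
    """Return sigma * Phi^{-1}(p_L), or None to abstain when p_L <= 0.5."""
    p_lower = clopper_pearson_lower(k, n, alpha)
    if p_lower <= 0.5:
        return None
    return sigma * NormalDist().inv_cdf(p_lower)


# Hypothetical vote counts for the top class out of n = 100 noisy samples
print("99/100 votes:", certified_radius(k=99, n=100, sigma=0.25))
print("55/100 votes:", certified_radius(k=55, n=100, sigma=0.25))
```

A near-unanimous vote yields a positive radius, while a narrow majority cannot be separated from 0.5 at this confidence level and leads to abstention. More samples tighten the bound, which is why certification typically uses far more noise draws than prediction.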

import torch
import numpy as np
from scipy.stats import norm, binom

class RandomizedSmoothing:
    """
    Randomized smoothing: certifiably robust predictions (Cohen et al., 2019).

    For a base classifier f, the smoothed classifier g(x) predicts the
    most likely class when x is perturbed with Gaussian noise:

    g(x) = argmax_c P[f(x + N(0, σ²I)) = c]

    Key result: if the top class c_A has probability p_A under the noise
    distribution, g(x) is certified robust for L2 perturbations of radius:

    r = σ * Φ^{-1}(p_A)   where Φ^{-1} is the inverse normal CDF

    Trade-off: clean accuracy decreases with σ (more noise = less accuracy).
    σ = 0.12 → high clean accuracy, low robustness
    σ = 0.50 → lower accuracy, higher robustness radius

    IMPORTANT: This provides certified robustness against L2 attacks only.
    L-infinity certification is a separate (harder) problem.
    """

    def __init__(self, base_model: torch.nn.Module, sigma: float = 0.25, n_classes: int = 10):
        self.base_model = base_model
        self.sigma = sigma
        self.n_classes = n_classes

    def predict(
        self,
        x: torch.Tensor,
        n_samples: int = 100,
        alpha: float = 0.001
    ) -> tuple[int, float]:
        """
        Make a certifiably robust prediction.

        Args:
            x: Input tensor [C, H, W]
            n_samples: Monte Carlo samples for estimating probabilities
            alpha: Failure probability for Clopper-Pearson bound

        Returns:
            (predicted_class, certified_radius)
            Returns (-1, 0.0) if prediction confidence insufficient
        """
        self.base_model.eval()

        # Sample noisy predictions
        x_expanded = x.unsqueeze(0).expand(n_samples, -1, -1, -1)
        noise = torch.randn_like(x_expanded) * self.sigma
        noisy_inputs = x_expanded + noise

        with torch.no_grad():
            predictions = self.base_model(noisy_inputs).argmax(1)

        # Vote for each class
        votes = torch.zeros(self.n_classes)
        for pred in predictions:
            if pred.item() < self.n_classes:
                votes[pred.item()] += 1

        top_class = int(votes.argmax().item())
        top_count = int(votes.max().item())

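        # Clopper-Pearson lower confidence bound on P[top class]:
        # the alpha-quantile of Beta(k, n - k + 1) for k top-class votes out of n.
        # NOTE: Cohen et al. pick the top class from a separate, smaller batch of
        # samples; reusing the same samples here is a simplification.
        from scipy.stats import beta
        p_lower = (
            float(beta.ppf(alpha, top_count, n_samples - top_count + 1))
            if top_count > 0 else 0.0
        )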

        if p_lower <= 0.5:
            return -1, 0.0  # ABSTAIN - prediction not certifiable

        # Certified radius: r = σ * Φ^{-1}(p_lower)
        radius = self.sigma * norm.ppf(p_lower)

        return top_class, radius

    def certify_dataset(
        self,
        test_loader,
        n_samples: int = 1000,
        radii: tuple[float, ...] = (0.0, 0.25, 0.5, 0.75, 1.0)
    ) -> dict:
        """
        Compute certified accuracy at multiple L2 radii.
        Standard evaluation for randomized smoothing defenses.
        """
        certified = {r: 0 for r in radii}
        abstained = 0
        total = 0

        for images, labels in test_loader:
            for image, label in zip(images, labels):
                pred_class, cert_radius = self.predict(image, n_samples=n_samples)
                total += 1

                if pred_class == -1:
                    abstained += 1
                    continue

                if pred_class == label.item():
                    for r in radii:
                        if cert_radius >= r:
                            certified[r] += 1

        return {
            "total": total,
            "abstained": abstained,
            "abstain_rate": abstained / total,
            "certified_accuracy": {
                f"r={r:.2f}": certified[r] / total
                for r in radii
            },
            "sigma": self.sigma,
        }
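
As a worked example of the certification arithmetic, the sketch below computes a one-sided Clopper-Pearson lower bound and the resulting L2 radius using only the standard library. The function names and the vote counts are hypothetical. The bound is found by bisection on the binomial tail; with SciPy the same quantity should be available as `scipy.stats.beta.ppf(alpha, k, n - k + 1)`.

```python
import math
from statistics import NormalDist

def binomial_upper_tail(k, n, p):
    """P[X >= k] for X ~ Binomial(n, p), with each term computed in log space."""
    if k <= 0 or p >= 1.0:
        return 1.0
    if p <= 0.0:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_coeff = math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        total += math.exp(log_coeff + i * log_p + (n - i) * log_q)
    return min(total, 1.0)

def clopper_pearson_lower(k, n, alpha):
    """Largest p (up to bisection precision) with P[Bin(n, p) >= k] < alpha."""
    if k == 0:
        return 0.0
    lo, hi = 0.0, k / n
    for _ in range(60):
        mid = (lo + hi) / 2
        if binomial_upper_tail(k, n, mid) < alpha:
            lo = mid   # mid is still a valid lower bound, move up
        else:
            hi = mid
    return lo

def certified_radius(top_count, n_samples, sigma, alpha=0.001):
    """Return sigma * Phi^{-1}(p_lower), or 0.0 when the prediction must abstain."""
    p_lower = clopper_pearson_lower(top_count, n_samples, alpha)
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)

# Hypothetical vote counts out of 100 noisy samples, sigma = 0.25.
for votes in (55, 90, 100):
    print(votes, round(certified_radius(votes, 100, sigma=0.25), 3))
```

Even a unanimous vote yields a bounded radius, because the lower confidence bound stays below 1 for any finite number of samples; certifying larger radii requires more samples or a larger σ.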

3. Input Preprocessing Defenses

Detect and neutralize adversarial perturbations before model inference:

import torch
import torch.nn.functional as F
import numpy as np

class InputPreprocessingDefense:
    """
    Input preprocessing defenses against adversarial perturbations.

    These defenses don't provide formal guarantees but serve as useful
    additional layers. WARNING: Many are broken by adaptive attacks -
    i.e., attacks that optimize against the preprocessor. Never rely
    solely on preprocessing.
    """

    def jpeg_compression(self, image: torch.Tensor, quality: int = 75) -> torch.Tensor:
        """
        JPEG compression removes high-frequency adversarial perturbations.

        Why it works: adversarial perturbations often use high-frequency
        components that JPEG's DCT quantization discards.

        Why it fails (adaptively): attacker can add perturbations that
        survive JPEG compression by restricting to low-frequency components.
        """
        try:
            from PIL import Image
            import io
            import numpy as np

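            # Detach, move to CPU, and clamp to [0, 1] so the uint8 conversion cannot wrap around
            img_np = (image.detach().cpu().clamp(0, 1).permute(1, 2, 0).numpy() * 255).round().astype(np.uint8)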
            pil_image = Image.fromarray(img_np)

            buffer = io.BytesIO()
            pil_image.save(buffer, format='JPEG', quality=quality)
            buffer.seek(0)
            decompressed = Image.open(buffer)

            result = torch.tensor(
                np.array(decompressed) / 255.0
            ).permute(2, 0, 1).float()
            return result
        except ImportError:
            return image

    def gaussian_smoothing(
        self, image: torch.Tensor, kernel_size: int = 3, sigma: float = 1.0
    ) -> torch.Tensor:
        """Apply Gaussian blur to remove adversarial noise."""
        channels = image.shape[0]
        x = torch.arange(kernel_size, dtype=torch.float32) - kernel_size // 2
        kernel_1d = torch.exp(-0.5 * (x / sigma) ** 2)
        kernel_1d = kernel_1d / kernel_1d.sum()
        kernel_2d = kernel_1d.unsqueeze(0) * kernel_1d.unsqueeze(1)
        kernel_2d = kernel_2d.unsqueeze(0).unsqueeze(0)
        kernel_2d = kernel_2d.expand(channels, 1, -1, -1)

        image_batch = image.unsqueeze(0)
        padding = kernel_size // 2
        smoothed = F.conv2d(image_batch, kernel_2d, padding=padding, groups=channels)
        return smoothed.squeeze(0)

    def feature_squeezing(self, image: torch.Tensor, bit_depth: int = 4) -> torch.Tensor:
        """
        Feature squeezing: reduce bit depth to remove adversarial perturbations.

        Adversarial perturbations often rely on precise pixel values.
        Reducing bit depth destroys this precision.
        """
        max_val = 2 ** bit_depth - 1
        squeezed = torch.round(image * max_val) / max_val
        return squeezed

    def detect_adversarial_by_consistency(
        self,
        model: torch.nn.Module,
        image: torch.Tensor,
        threshold: float = 0.15
    ) -> dict:
        """
        Detect adversarial inputs by comparing predictions on
        original vs. preprocessed versions (Feature Squeezing approach).

        A clean input's prediction should be stable under preprocessing.
        An adversarial input's prediction often changes significantly -
        because the perturbation pushes the input near a decision boundary.
        """
        model.eval()
        with torch.no_grad():
            # Original prediction
            original_output = F.softmax(model(image.unsqueeze(0)), dim=1)
            original_pred = original_output.argmax(1).item()

            # Gaussian smoothing
            smoothed = self.gaussian_smoothing(image)
            smoothed_output = F.softmax(model(smoothed.unsqueeze(0)), dim=1)

            # Feature squeezing
            squeezed = self.feature_squeezing(image)
            squeezed_output = F.softmax(model(squeezed.unsqueeze(0)), dim=1)

        # Compute prediction disagreement (largest per-class change in probability, i.e. L-infinity distance)
        smooth_distance = (original_output - smoothed_output).abs().max().item()
        squeeze_distance = (original_output - squeezed_output).abs().max().item()
        max_distance = max(smooth_distance, squeeze_distance)

        is_adversarial = max_distance > threshold

        return {
            "original_prediction": original_pred,
            "preprocessing_distance": max_distance,
            "is_adversarial": is_adversarial,
            "confidence": "high" if max_distance > threshold * 2 else "medium" if is_adversarial else "low",
            "smooth_distance": smooth_distance,
            "squeeze_distance": squeeze_distance
        }
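
To see why bit-depth reduction helps and where it stops helping, here is a small scalar sketch of the same rounding used in `feature_squeezing`. The `squeeze` helper and the pixel values are illustrative assumptions.

```python
def squeeze(value, bit_depth=4):
    """Quantize a pixel value in [0, 1] to 2**bit_depth levels."""
    levels = 2 ** bit_depth - 1
    return round(value * levels) / levels

clean_pixel = 0.40
small_perturbation = 0.41   # stays inside the same quantization bin
large_perturbation = 0.45   # crosses into the next bin

print("clean:", squeeze(clean_pixel))
print("small:", squeeze(small_perturbation))
print("large:", squeeze(large_perturbation))
```

A perturbation that stays within one quantization bin is erased, while one that crosses a bin boundary survives squeezing. An adaptive attacker can aim for exactly those boundary crossings, which is one reason squeezing alone gives no guarantee.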

Production Considerations

| Defense | Clean Accuracy | Adversarial Accuracy (ε=0.03) | Compute Overhead | Formal Guarantee |
|---|---|---|---|---|
| No defense | ~95% | ~5% | 1x | None |
| JPEG preprocessing | ~93% | ~40% | 1.1x | None (breakable) |
| Feature squeezing | ~92% | ~35% | 1.1x | None (breakable) |
| Adversarial training (FGSM) | ~93% | ~50% | 5x | None |
| Adversarial training (PGD-7) | ~90% | ~55% | 7x | None |
| Adversarial training (PGD-40) | ~87% | ~60% | 40x | None |
| TRADES (β=6) | ~84% | ~56% | 10x | None |
| Randomized smoothing (σ=0.25) | ~76% | ~61%* | 3x | L2 (r≤0.5) |
| Randomized smoothing (σ=0.50) | ~67% | ~54%* | 3x | L2 (r≤1.0) |

*At L2 radius corresponding to the σ value

The fundamental robustness-accuracy tradeoff: robust models tend to be less accurate on clean inputs. This is not an engineering failure - it reflects the geometry of high-dimensional feature spaces. Tsipras et al. (2019) prove that for certain data distributions the tradeoff is intrinsic: no classifier can be both optimally accurate and optimally robust.

:::danger Mistake 1: Using Only Preprocessing Defenses
Input preprocessing (JPEG compression, smoothing) can be broken by adaptive attacks that specifically optimize against the preprocessor. An adversary who knows you're applying JPEG compression will craft perturbations that survive it. Never rely solely on preprocessing - it's a speed bump, not a barrier. Use adversarial training for meaningful robustness.
:::

:::warning Mistake 2: Evaluating Only on FGSM
FGSM is a weak, single-step attack. A model that appears "FGSM-robust" may be completely vulnerable to PGD or C&W. Always evaluate against PGD-40 (minimum) and preferably AutoAttack (Croce & Hein, 2020) - the standard benchmark for honest robustness evaluation.
:::

:::warning Mistake 3: Ignoring Adaptive Attacks
If you design a defense, evaluate it against an adversary who knows the defense and optimizes against it. Many defenses that appear strong against standard attacks are immediately broken by adaptive attacks. The correct evaluation is "how does this defense perform when the attacker knows everything about it?"
:::

:::tip Best Practice: Defense-in-Depth
Combine adversarial training (for core robustness) + input preprocessing (for cheap first-pass filtering) + anomaly detection (for flagging suspicious inputs) + monitoring (for detecting attack campaigns). Budget for the compute overhead of adversarial training if robustness is critical for your use case. For safety-critical applications (medical imaging, autonomous systems), accept the accuracy cost and prioritize robustness.
:::

Interview Questions and Answers

Q1: What is an adversarial example and why do they exist?

An adversarial example is an input that has been specifically crafted to cause a model to make a wrong prediction, while appearing nearly identical to a clean input. They exist because neural networks learn non-robust features - statistical correlations with labels that don't align with human perceptions of the input. The Ilyas et al. (2019) paper argues these non-robust features are genuinely predictive in the training distribution but change dramatically under small perturbations. A slight perturbation in the direction of a non-robust feature (found via gradient of loss with respect to input) can dramatically change the model's prediction while leaving human-relevant features unchanged. They're a symptom of the gap between what models learn and what we intend them to learn.

Q2: What is the difference between FGSM and PGD attacks?

FGSM (Fast Gradient Sign Method) takes a single gradient step of size epsilon in the direction that maximizes loss. It's fast but weak - it rarely finds the worst-case perturbation within the epsilon-ball. PGD (Projected Gradient Descent) starts from a random point in the epsilon-ball, takes many small gradient steps, and projects the perturbation back onto the ball after each step. PGD finds a local maximum of the loss within the epsilon-ball and is widely considered the strongest first-order attack, so a model that's robust to PGD is considered adversarially robust to first-order attacks. FGSM suits fast adversarial training data augmentation; PGD is the evaluation standard. In practice: train with FGSM (single step) or PGD-7 (7-step inner loop); evaluate with PGD-40 or AutoAttack.

Q3: Why does adversarial training reduce clean accuracy?

This is a fundamental tradeoff rooted in geometry. Adversarial training teaches the model to maintain consistent predictions in a larger region around each input (the epsilon-ball). But in high-dimensional spaces, epsilon-balls of different classes can overlap - the model must learn simpler, more conservative decision boundaries to remain robust. Simpler boundaries classify some clean inputs incorrectly. Tsipras et al. (2019) proved this tradeoff is fundamental for L-infinity robustness: in certain data distributions, a classifier can be optimally clean-accurate or optimally adversarially robust, but not both simultaneously. In practice, models trained with adversarial training (PGD-40) typically see 5-10% clean accuracy drops while achieving 55-60% adversarial accuracy vs. ~5% without defense.

Q4: What is randomized smoothing and what guarantee does it provide?

Randomized smoothing (Cohen et al., 2019) creates a smoothed classifier g(x) = argmax_c P[f(x + N(0, σ²I)) = c] - the class that the base classifier f predicts most often when Gaussian noise is added to x. The key certified result: if the top class has probability at least p_A > 0.5 under the noise distribution, the smoothed classifier g is provably robust to L2 perturbations of radius r = σ × Φ⁻¹(p_A), where Φ⁻¹ is the inverse normal CDF. It is one of the few defenses with provable L2 robustness guarantees that scales to large models. The tradeoffs: larger σ gives a larger certified radius but lower clean accuracy; the certificate covers only the L2 norm (not L-infinity); and each prediction needs many noisy forward passes. Typical: σ=0.25 gives radius ≈ 0.5 at clean accuracy ≈ 76%, though exact figures depend on the dataset and base classifier.

Q5: How do adversarial attacks affect NLP systems and what defenses work?

For NLP, adversarial attacks work at character level (typos, homoglyphs, punctuation), word level (synonym substitution, paraphrase), and sentence level (reformulation while preserving meaning). The goal is to flip a classifier's prediction while preserving semantic meaning. What works in defense: (1) data augmentation with adversarial examples during training - prepare the model for perturbation variations; (2) input canonicalization - normalize homoglyphs, fix typos, detect and flag anomalous Unicode; (3) ensemble voting and abstention - abstain when classifiers disagree strongly; (4) certified defenses adapted for discrete text (though certification is harder for discrete inputs than continuous). The key challenge: text is discrete, so gradient-based attacks don't directly apply - attacks must search combinatorially, which is harder but still practical for smart synonym substitution.
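
The input canonicalization step mentioned above can be sketched with the standard library alone. The function names are hypothetical, and the mixed-script check is a rough heuristic based on Unicode character names rather than a full confusables database; it will also flag legitimate multilingual text, so treat a hit as a signal for review, not proof of an attack.

```python
import unicodedata

def canonicalize(text):
    """NFKC-normalize and strip invisible format characters (category Cf)."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(ch for ch in normalized if unicodedata.category(ch) != "Cf")

def scripts_used(text):
    """Approximate each letter's script by the first word of its Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

def inspect_input(text):
    """Canonicalize the text and flag it when letters come from several scripts."""
    cleaned = canonicalize(text)
    scripts = scripts_used(cleaned)
    return {
        "text": cleaned,
        "scripts": sorted(scripts),
        "suspicious": len(scripts) > 1,
    }

# "\u0430" is CYRILLIC SMALL LETTER A, visually close to the Latin "a".
print(inspect_input("p\u0430ypal"))
# Fullwidth letters and a zero-width space collapse to plain ASCII.
print(inspect_input("\uff46\uff52\u200bee"))
```

NFKC folds compatibility forms such as fullwidth letters into their plain equivalents, but it does not map look-alike letters across scripts, which is why the script check runs as a separate step.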

Q6: How would you decide whether to invest in adversarial robustness for a production AI system?

Ask three questions: (1) Who would benefit from attacking this system? Content classifiers, fraud detectors, and safety filters have clear adversaries with economic incentives to evade them. Product recommendation systems generally don't. The answer determines how sophisticated the adversary is. (2) What is the impact of adversarial failures? Autonomous vehicles and medical imaging have life-safety implications - accept accuracy cost and prioritize robustness. Customer service chatbots have reputational risk - lightweight defenses plus monitoring may suffice. (3) What is the adversary's access? Physical-world adversarial attacks require physical access to the environment. API-based text classifiers face black-box transfer attacks. Full white-box attacks require model access. Design defenses matched to the realistic threat model, not the worst case in research papers. Then layer: preprocessing for cheap first-pass filtering, adversarial training for core robustness, monitoring for detecting ongoing attack campaigns.
