Saliency Maps for Vision - What Your CNN Is Actually Seeing

Reading time: 52 min | Interview relevance: Very High - saliency maps appear in every vision ML interview; GradCAM is the must-know algorithm | Target roles: ML Engineer, Computer Vision Engineer, AI Engineer, Research Engineer

The Photo Quality Classifier That Was Looking in the Wrong Place

It is 2019. An Airbnb engineering team has built a listing photo quality classifier. The model's job is to score listing photos on a scale of 0–100, helping hosts understand which photos will attract more bookings and flagging low-quality images for review. After six months of development and tuning, the model achieves 91% accuracy on the held-out test set, validated against human ratings from a trained photo review team.

Then a host in Seattle calls support. Her listing photos are professionally shot - hardwood floors, natural lighting, carefully staged furniture. Her quality scores are consistently in the bottom quartile. She follows all the recommendations. Nothing changes.

A machine learning engineer looks at the case and applies GradCAM to the flagged photos. The resulting heatmaps are unambiguous: the model's activation is concentrated in the corner of each image, not on the room composition at all. That corner happens to contain a small watermark - the logo of the professional photography service the host used. The model has learned: photos from professional services get high ratings, professional services add watermarks, watermark equals high-quality image. But the host's photos were post-processed to remove the watermark. Without it, the model assigns low quality scores regardless of the actual photo composition.

Without GradCAM, this debugging would have been nearly impossible. The model performed well in aggregate. The training data was correct. The accuracy metric showed no problem. Only by visualizing what spatial regions the model was responding to did the failure mode become visible - and actionable.

This pattern recurs constantly. A chest X-ray classifier attends to hospital labels embedded in the corner of images rather than to pathological features. A bird species identifier activates on the photographer's watermark rather than the bird's plumage. A vehicle damage classifier focuses on the claim form visible in some photos rather than the dents. In every case, the model found a shortcut that accuracy metrics could not detect - and only saliency maps revealed it.

This is why saliency maps matter. Not as an academic exercise in interpretability, but as a practical engineering tool for debugging vision models that behave correctly on metrics but incorrectly on the thing that actually matters.

Why Gradient-Based Explanations for Vision?

Before discussing the methods, it is worth understanding why the problem is hard and why gradients are the natural tool.

A CNN classifying a 224×224 image has 150,528 input pixels. Each pixel contributes to the classification decision through a cascade of convolutions, normalizations, pooling operations, and fully connected layers. The question "which pixels caused this classification?" has no simple answer - every pixel technically participates in every computation through the distributed nature of the network.

Gradient-based methods sidestep the combinatorial problem by using calculus: the gradient $\frac{\partial y^c}{\partial x_{hw}}$ measures how much the classification score for class $c$ would change if you made a small change to the pixel at position $(h, w)$. This is an approximation (the gradient is local; the network is highly nonlinear), but it is computationally efficient and mathematically principled. Computing gradients for all pixels simultaneously takes one forward and one backward pass, about the cost of one training step.

The field of saliency maps is the story of iteratively improving this basic gradient idea. Vanilla gradients are noisy. Guided backpropagation cleans them up but loses faithfulness. GradCAM sacrifices resolution for class-specificity. SmoothGrad reduces noise by averaging. Integrated Gradients provides formal guarantees. Each method makes different tradeoffs between faithfulness, spatial resolution, and computational cost.

The Family Tree of Saliency Methods

Method 1: Vanilla Gradients

The simplest approach: compute the gradient of the class score with respect to every input pixel.

$S_i = \left|\frac{\partial f_c}{\partial x_i}\right|$

where $f_c$ is the model's unnormalized score for class $c$ and $x_i$ is the pixel value at position $i$. For RGB images, $S_i$ is a 3-vector (one gradient per color channel). The saliency is typically taken as the maximum or L2 norm across channels.

Interpretation: $S_i$ measures the local sensitivity of the class score to pixel $i$. High absolute gradient means the model's confidence in class $c$ changes a lot if you perturb that pixel. The intuition: important pixels have high sensitivity.
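To make the channel reduction concrete, here is a minimal NumPy sketch. The function name `reduce_channels` and the hand-picked gradient values are illustrative assumptions, not part of any library; a real gradient would come from a backward pass.

```python
import numpy as np

def reduce_channels(grad: np.ndarray, how: str = "max") -> np.ndarray:
    """Collapse a (C, H, W) gradient into an (H, W) saliency map."""
    if how == "max":
        return np.abs(grad).max(axis=0)          # largest |gradient| over channels
    if how == "l2":
        return np.sqrt((grad ** 2).sum(axis=0))  # Euclidean norm over channels
    raise ValueError(f"unknown reduction: {how}")

# Hypothetical gradient for a 1x2 RGB image (values chosen by hand).
grad = np.array([
    [[3.0, 0.0]],    # red channel
    [[-4.0, 0.5]],   # green channel
    [[0.0, -0.5]],   # blue channel
])
print(reduce_channels(grad, "max"))  # per-pixel max of |gradient|
print(reduce_channels(grad, "l2"))   # per-pixel L2 norm
```

Both reductions discard the sign of the gradient, so the resulting map says where the score is sensitive, not in which direction.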

Problem 1 - Gradient saturation: a ReLU passes a gradient of 1 when its input is positive and exactly 0 when its input is negative, so a unit that has been driven negative is completely off. A pixel that clearly belongs to the subject of interest can therefore receive a near-zero gradient: the activations it feeds have already saturated, so a small perturbation changes nothing even though removing the pixel entirely would change the prediction. The pixel matters; the gradient does not reflect that.
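A one-input toy function shows the saturation problem in isolation. This is a sketch under stated assumptions: `f` and `grad_f` are hypothetical stand-ins for a network and its gradient, chosen so the derivative can be written by hand.

```python
def f(x: float) -> float:
    """Toy one-input 'network': rises with x, then saturates at 1."""
    return 1.0 - max(0.0, 1.0 - x)

def grad_f(x: float) -> float:
    """Analytic derivative of f: 1 below the saturation point, 0 above it."""
    return 1.0 if x < 1.0 else 0.0

x, baseline = 2.0, 0.0
output_change = f(x) - f(baseline)   # how much the input mattered overall
local_gradient = grad_f(x)           # what a vanilla gradient reports at x
print(output_change, local_gradient)
```

The input moves the output by a full unit relative to the baseline, yet the local gradient at `x = 2.0` is zero, which is the mismatch described above.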

Problem 2 - Visual noise: vanilla gradients highlight texture and high-frequency edges, not semantic objects. The network is locally sensitive to high-frequency patterns, and these patterns appear in the gradient even when the global classification is driven by coarser semantic features.

Problem 3 - Limited class-discriminativity: the gradient at a pixel can look similar across classes if the early convolutional layers respond similarly to that pixel regardless of class. Class-specific information only emerges later in the network.

Despite these limitations, vanilla gradients are fast (one backward pass) and useful as a debugging baseline. Every team doing vision explainability should have a working vanilla gradient implementation before moving to more sophisticated methods.

Method 2: SmoothGrad

Smilkov et al. (2017) proposed SmoothGrad to reduce gradient noise without sacrificing faithfulness. The idea: the gradient at a single point $x$ is a noisy local estimate. Average over many slightly perturbed inputs:

$\bar{S}(x) = \frac{1}{n}\sum_{k=1}^n \nabla_{x} f_c(x + \epsilon_k), \quad \epsilon_k \sim \mathcal{N}(0, \sigma^2 I)$

The noise standard deviation $\sigma$ is typically 10–20% of the input range (for normalized images, around 0.1–0.2). The number of samples $n$ is typically 50–300.

Why this works: the class score, viewed as a function of the input, has high-frequency fluctuations at the pixel level that create noise in individual gradient estimates. By averaging over a neighborhood, SmoothGrad smooths these fluctuations while preserving the lower-frequency structure that corresponds to semantically meaningful regions. This is standard kernel smoothing applied to the gradient field.

Tradeoff: SmoothGrad requires $n$ forward and backward passes. At $n = 50$, that is 50× the cost of vanilla gradients. On GPU with batching, several noisy copies can be processed in a single forward/backward call, reducing wall-clock overhead significantly.
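The smoothing effect can be checked on a toy function whose gradient is known in closed form. In this sketch, `grad_f` and `smooth_grad_toy` are hypothetical names, and the function is an assumption chosen for illustration: a smooth trend ($x^2$) plus a small high-frequency ripple.

```python
import numpy as np

def grad_f(x: np.ndarray) -> np.ndarray:
    """Analytic gradient of f(x) = sum(x**2 + 0.01 * sin(100 * x))."""
    return 2.0 * x + np.cos(100.0 * x)

def smooth_grad_toy(x, sigma=0.1, n=2000, seed=0):
    """Average the gradient over n Gaussian-perturbed copies of x."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n,) + x.shape)
    return grad_f(x[None, :] + noise).mean(axis=0)

x = np.array([0.0, 0.5, 1.0])
trend = 2.0 * x                                    # gradient of the smooth x**2 part
raw_error = np.abs(grad_f(x) - trend)              # dominated by the ripple term
smooth_error = np.abs(smooth_grad_toy(x) - trend)  # ripple mostly averaged away
print(raw_error, smooth_error)
```

The ripple contributes almost nothing to the function value but dominates the pointwise gradient; averaging over the noise neighborhood recovers the gradient of the underlying trend.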

Method 3: Guided Backpropagation

Springenberg et al. (2015) proposed guided backpropagation to address the noise problem. During the backward pass through each ReLU layer, apply a second mask based on the gradient sign.

Standard backpropagation through ReLU:

$\frac{\partial L}{\partial h_j^{(l)}} = \mathbb{1}\!\left[h_j^{(l)} > 0\right] \cdot \frac{\partial L}{\partial h_j^{(l+1)}}$

Guided backpropagation adds a second condition: also mask where the upstream gradient is negative:

$\frac{\partial L}{\partial h_j^{(l)}} = \mathbb{1}\!\left[h_j^{(l)} > 0\right] \cdot \mathbb{1}\!\left[\frac{\partial L}{\partial h_j^{(l+1)}} > 0\right] \cdot \frac{\partial L}{\partial h_j^{(l+1)}}$

Only units that were active in the forward pass and that receive a positive gradient in the backward pass contribute to the saliency. The resulting maps are much cleaner, sharper, and more visually appealing.
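The two backward rules differ by a single mask, which a short NumPy sketch makes visible. The function names and the four-element arrays are illustrative assumptions; in a real network these would be the ReLU layer's inputs and the gradients arriving from the layer above.

```python
import numpy as np

def relu_backward(pre_activation, upstream_grad):
    """Standard rule: pass gradient only where the ReLU input was positive."""
    return (pre_activation > 0) * upstream_grad

def guided_relu_backward(pre_activation, upstream_grad):
    """Guided rule: additionally drop negative upstream gradients."""
    return (pre_activation > 0) * (upstream_grad > 0) * upstream_grad

h = np.array([1.5, -0.3, 2.0, 0.7])    # hypothetical ReLU inputs
g = np.array([0.4, 0.9, -1.2, 0.1])    # hypothetical upstream gradients
print(relu_backward(h, g))
print(guided_relu_backward(h, g))
```

The third unit is active but carries negative evidence: standard backpropagation keeps it, guided backpropagation zeroes it. Discarding that negative signal is what makes the maps look clean.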

Danger: Guided backpropagation fails the model parameter randomization test (Adebayo et al. 2018). Do not use it to explain model decisions. The maps look clean and interpretable because they are edge detectors - they reflect the image's intensity structure, not the model's learned behavior. Maps barely change when you randomize all model weights.

Method 4: Integrated Gradients - The Theoretically Correct Method

Sundararajan, Taly, and Yan (2017) introduced Integrated Gradients (IG). Of the methods covered here, it is the only one that satisfies a formal set of axioms for faithful neural network attribution.

Choose a baseline input $x'$ - typically the black image (all zeros), a blurred version of the input, or the mean training image. The attribution for pixel $i$ is:

$\text{IG}_i(x) = (x_i - x_i') \int_0^1 \frac{\partial f^c\!\left(x' + \alpha(x - x')\right)}{\partial x_i} d\alpha$

Interpretation: walk in a straight line from the baseline $x'$ to the actual input $x$, parameterized by $\alpha \in [0, 1]$. At each point along the path, compute the gradient with respect to pixel $i$. The integral accumulates these gradients - the total gradient flow through pixel $i$ as the image transitions from baseline to actual. Multiply by $(x_i - x_i')$ to convert accumulated gradient into an attribution in the same units as the prediction difference.

The integral is approximated by a Riemann sum with $m$ steps:

$\text{IG}_i(x) \approx (x_i - x_i') \cdot \frac{1}{m} \sum_{k=1}^{m} \frac{\partial f^c\!\left(x' + \frac{k}{m}(x - x')\right)}{\partial x_i}$
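The Riemann sum can be verified on a toy model with an analytic gradient. This is a sketch under stated assumptions: the model `f`, its weights `w`, and the helper `integrated_gradients_toy` are hypothetical, chosen so the exact attribution ($w_i \tanh(x_i)$ for a zero baseline) is known.

```python
import numpy as np

w = np.array([1.0, 2.0, -1.0])   # hypothetical weights of a toy model

def f(x):
    """Toy differentiable model: f(x) = sum_i w_i * tanh(x_i)."""
    return float(np.sum(w * np.tanh(x)))

def grad_f(x):
    """Analytic gradient of f with respect to x (works row-wise on a batch)."""
    return w * (1.0 - np.tanh(x) ** 2)

def integrated_gradients_toy(x, baseline, m=1000):
    """Riemann-sum approximation of IG along the straight path baseline -> x."""
    alphas = np.arange(1, m + 1) / m                      # k/m for k = 1..m
    path = baseline + alphas[:, None] * (x - baseline)    # shape (m, n_features)
    avg_grad = grad_f(path).mean(axis=0)                  # mean gradient along path
    return (x - baseline) * avg_grad

x = np.array([2.0, -1.0, 0.5])
baseline = np.zeros_like(x)
ig = integrated_gradients_toy(x, baseline)
gap = abs(ig.sum() - (f(x) - f(baseline)))   # attributions vs. output difference
print(ig, gap)
```

The gap between the summed attributions and $f(x) - f(x')$ shrinks as `m` grows, which is why the sum is a convenient check on the step count.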

The Four IG Axioms

Sensitivity (Axiom 1): if the input and the baseline differ in exactly one feature and the model output differs between them, that feature must receive nonzero attribution. A feature that demonstrably matters cannot be assigned zero importance.

Formally: if $x_j = x_j'$ for all $j \neq i$, $x_i \neq x_i'$, and $f(x) \neq f(x')$, then $\text{IG}_i(x) \neq 0$. Combined with completeness, this also implies that whenever $f(x) \neq f(x')$, at least one pixel has nonzero attribution.

Completeness (Axiom 2): the sum of all pixel attributions equals the difference in model output between the actual input and the baseline:

$\sum_i \text{IG}_i(x) = f(x) - f(x')$

This is the accountability guarantee. Every unit of prediction increase from baseline is assigned to specific pixels, and the total adds up exactly. In practice, this serves as an implementation correctness test: compute the sum and verify it matches the prediction difference.

Implementation Invariance (Axiom 3): if two networks are functionally equivalent (same input-output behavior), they receive the same attributions. This means IG attributions depend on the function, not on internal implementation details.

Linearity (Axiom 4): if the network is a linear combination of two sub-networks, IG attributions are the corresponding linear combination of the sub-networks' attributions. This makes IG composable.

Vanilla gradients violate sensitivity (saturated neurons). Guided backpropagation fails both sensitivity and implementation invariance. GradCAM fails completeness (the spatial aggregation discards fine-grained attribution).

Method 5: GradCAM - The Industry Standard

Selvaraju et al. (2017) introduced GradCAM (Gradient-weighted Class Activation Mapping). It is the most widely used saliency method in production computer vision. The key insight: compute gradients not at the pixel level but at the level of the CNN's last convolutional feature maps.

Step 1: compute the importance weight $\alpha_k^c$ for each feature map channel $k$ for class $c$:

$\alpha_k^c = \frac{1}{Z}\sum_{i,j} \frac{\partial y^c}{\partial A_{ij}^k}$

where $A^k \in \mathbb{R}^{H' \times W'}$ is the $k$-th feature map of the last convolutional layer and $Z = H' \times W'$ is the number of spatial positions. This is the global average pooling of the gradient - the mean gradient of the class score with respect to each spatial position of feature map $k$.

$\alpha_k^c$ answers: "How much, on average across all spatial positions, does feature map $k$ contribute to class $c$?"

Step 2: compute the class activation map as a ReLU of the weighted sum of feature maps:

$L^c_{\text{Grad-CAM}} = \text{ReLU}\!\left(\sum_k \alpha_k^c A^k\right)$

The ReLU retains only features with a positive influence on the class score. The result $L^c \in \mathbb{R}^{H' \times W'}$ is bilinearly upsampled to the input image size.
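The two steps reduce to a few array operations once the feature maps and their gradients are in hand. The sketch below uses hand-written 2×2 arrays as hypothetical stand-ins for a real layer's output; `grad_cam_from_arrays` is an illustrative name, and the upsampling step is omitted.

```python
import numpy as np

def grad_cam_from_arrays(activations, gradients):
    """activations, gradients: (K, H', W') arrays from the chosen conv layer."""
    alpha = gradients.mean(axis=(1, 2))                   # Step 1: one weight per channel
    cam = np.tensordot(alpha, activations, axes=(0, 0))   # Step 2: sum_k alpha_k * A^k
    return np.maximum(cam, 0.0)                           # ReLU keeps positive evidence

# Two hypothetical 2x2 feature maps and their gradients.
A = np.array([[[1.0, 0.0], [0.0, 0.0]],     # channel 0 fires top-left
              [[0.0, 0.0], [0.0, 1.0]]])    # channel 1 fires bottom-right
dY_dA = np.array([[[0.8, 0.8], [0.8, 0.8]],       # channel 0 supports the class
                  [[-0.4, -0.4], [-0.4, -0.4]]])  # channel 1 opposes it
print(grad_cam_from_arrays(A, dY_dA))
```

Channel 1 gets a negative weight, so its bottom-right activation is pushed below zero and removed by the ReLU; only the top-left region survives in the map.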

GradCAM++ (Chattopadhay et al., 2018)

GradCAM++ improves on GradCAM for class-discriminative localization, especially when multiple instances of the same class appear in an image. The improvement is in the weight computation:

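$\alpha_k^c = \sum_{i,j} w_{ij}^{kc} \cdot \text{ReLU}\!\left(\frac{\partial y^c}{\partial A_{ij}^k}\right), \quad w_{ij}^{kc} = \frac{\frac{\partial^2 y^c}{(\partial A_{ij}^k)^2}}{2\frac{\partial^2 y^c}{(\partial A_{ij}^k)^2} + \sum_{a,b} A_{ab}^k \frac{\partial^3 y^c}{(\partial A_{ij}^k)^3}}$

The per-position coefficients $w_{ij}^{kc}$ are built from second- and third-order derivatives, so each spatial position is weighted individually rather than averaged globally. For an image with two dogs, GradCAM might highlight a region between them; GradCAM++ produces more localized activations for each instance.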

Score-CAM (Wang et al., 2020)

Score-CAM removes the gradient entirely, instead using each feature map as a spatial mask:

1. Upsample each feature map $A^k$ to input size
2. Normalize to [0, 1] and use as a perturbation mask: $M^k = x \odot \text{norm}(A^k) + x' \odot (1 - \text{norm}(A^k))$
3. The channel weight is the model's confidence change when using this mask: $\omega_k = f^c(M^k) - f^c(x')$
4. Final map: $L^c_{\text{Score-CAM}} = \text{ReLU}(\sum_k \omega_k A^k)$

Score-CAM is gradient-free (immune to gradient saturation) and produces sharper, more precise maps than GradCAM. The cost: $K$ forward passes (one per feature map channel) rather than one backward pass. For ResNet-50's last layer, $K = 2048$ - expensive but feasible for offline analysis.
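The Score-CAM steps can be sketched without any gradients at all. Everything here is a toy assumption: `score_fn` stands in for the model's class score, the image is single-channel, the upsampling is nearest-neighbour via `np.kron`, and the weights are used as raw score differences as in the steps above.

```python
import numpy as np

def score_cam_toy(score_fn, image, feature_maps, baseline):
    """Gradient-free CAM: weight each feature map by the score it preserves."""
    scale = image.shape[0] // feature_maps.shape[1]   # assumes square, integer ratio
    base_score = score_fn(baseline)
    cam = np.zeros(feature_maps.shape[1:])
    for fmap in feature_maps:
        up = np.kron(fmap, np.ones((scale, scale)))   # nearest-neighbour upsample
        mask = (up - up.min()) / (up.max() - up.min() + 1e-8)
        masked = image * mask + baseline * (1.0 - mask)
        cam += (score_fn(masked) - base_score) * fmap  # weight times feature map
    return np.maximum(cam, 0.0)

def score_fn(img):
    """Hypothetical 'model': score is the mean brightness of the top-left quadrant."""
    return float(img[:2, :2].mean())

image = np.ones((4, 4))
baseline = np.zeros((4, 4))
feature_maps = np.array([[[1.0, 0.0], [0.0, 0.0]],    # looks at top-left
                         [[0.0, 0.0], [0.0, 1.0]]])   # looks at bottom-right
print(score_cam_toy(score_fn, image, feature_maps, baseline))
```

Only the feature map that preserves the region the toy model actually reads gets a nonzero weight. With a real network, the loop body is one forward pass per channel, which is where the $K$-pass cost comes from.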

Layer-CAM (Jiang et al., 2021)

Layer-CAM computes GradCAM-style maps for intermediate layers, not just the last convolutional layer:

$L^c_{\text{Layer-CAM}} = \text{ReLU}\!\left(\sum_k \text{ReLU}\!\left(\frac{\partial y^c}{\partial A^k}\right) \odot A^k\right)$

The key difference: use element-wise multiplication (Hadamard product) of the feature map with the ReLU'd gradient, rather than global average pooling the gradient. This preserves spatial structure and produces finer-grained localization for intermediate layers.

Earlier layers have higher spatial resolution (e.g., 56×56 vs 7×7 for ResNet-50) but less semantic specificity. Layer-CAM lets you trade between resolution and class-discriminativity by choosing which layer to analyze.
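A minimal NumPy sketch of the Layer-CAM weighting, summing the weighted maps over channels as in the Layer-CAM paper. The function name and the single hand-written feature map are illustrative assumptions.

```python
import numpy as np

def layer_cam_from_arrays(activations, gradients):
    """activations, gradients: (K, H', W') arrays from any conv layer."""
    weights = np.maximum(gradients, 0.0)          # per-position weights, no pooling
    cam = (weights * activations).sum(axis=0)     # element-wise product, summed over k
    return np.maximum(cam, 0.0)

A = np.array([[[2.0, 1.0], [0.0, 3.0]]])          # one hypothetical feature map
dY_dA = np.array([[[0.5, -1.0], [0.2, 0.1]]])     # its gradient, position by position
print(layer_cam_from_arrays(A, dY_dA))
```

Because each position keeps its own weight, a strongly activated position with a small gradient (bottom-right here) and a position with a negative gradient (top-right) are treated differently, which global average pooling would blur together.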

Faithfulness Evaluation: Pixel Flipping and ROAR

How do you know whether a saliency map accurately reflects feature importance? Two principled evaluation protocols:

Pixel Flipping Test

Algorithm:

1. Rank pixels by saliency score (highest attribution first)
2. Iteratively replace pixels with baseline values (black or mean), starting with the most important
3. Track how the model's confidence drops as more "important" pixels are removed

A faithful saliency method should cause rapid confidence drop early in the removal process - the most salient pixels should be the ones that matter most. Plot confidence vs. fraction of pixels removed; faithful methods show earlier, steeper drops.
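One way to turn the plotted curve into a single number is the area under it: a lower area means confidence collapsed earlier. The sketch below is an assumption about how to summarize the curve, not something prescribed by the test itself; `deletion_auc` is a hypothetical helper and both confidence curves are made-up values.

```python
def deletion_auc(fractions, confidences):
    """Trapezoidal area under the confidence-vs-fraction-removed curve."""
    area = 0.0
    for i in range(1, len(fractions)):
        width = fractions[i] - fractions[i - 1]
        area += width * (confidences[i] + confidences[i - 1]) / 2.0
    return area

fractions = [0.0, 0.25, 0.5, 0.75, 1.0]
faithful = [0.9, 0.3, 0.1, 0.05, 0.0]     # hypothetical: confidence collapses early
unfaithful = [0.9, 0.8, 0.7, 0.4, 0.0]    # hypothetical: confidence lingers
print(deletion_auc(fractions, faithful), deletion_auc(fractions, unfaithful))
```

Comparing this area across saliency methods on the same images gives a ranking that does not depend on reading plots by eye.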

ROAR (RemOve And Retrain, Hooker et al., 2019)

Pixel flipping has a problem: replacing pixels with baseline values creates out-of-distribution inputs that the model was not trained on, making it behave unpredictably. ROAR addresses this by retraining:

1. Remove the top-$k\%$ most salient pixels from all training and test images
2. Retrain the model from scratch on the degraded images
3. Measure test accuracy

A faithful saliency method identifies the truly important pixels - removing them should cause the largest accuracy drop compared to removing the same number of random pixels. ROAR is computationally expensive (requires full model retraining per saliency method per removal fraction) but is the gold standard for faithfulness evaluation.
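The removal step is the only part of ROAR that touches the saliency map; the rest is ordinary retraining. A minimal sketch of that step follows, assuming a (C, H, W) image, an (H, W) saliency map, and replacement by the per-channel mean of the image (the replacement value is an assumption made here for illustration). `remove_top_k` is a hypothetical helper.

```python
import numpy as np

def remove_top_k(image, saliency, k_percent):
    """Replace the top k% most salient pixels with the per-channel image mean."""
    n_remove = int(round(saliency.size * k_percent / 100.0))
    degraded = image.copy()
    if n_remove == 0:
        return degraded
    top = np.argsort(saliency.ravel())[::-1][:n_remove]    # most salient first
    rows, cols = np.unravel_index(top, saliency.shape)
    channel_mean = image.mean(axis=(1, 2))                 # shape (C,)
    degraded[:, rows, cols] = channel_mean[:, None]
    return degraded

image = np.arange(12, dtype=float).reshape(3, 2, 2)        # hypothetical 3x2x2 image
saliency = np.array([[0.9, 0.1], [0.2, 0.3]])
out = remove_top_k(image, saliency, k_percent=25)          # removes 1 of 4 pixels
print(out[:, 0, 0], out[:, 0, 1])
```

Applying this function to every training and test image, retraining, and repeating for each method and each value of $k$ is what makes the protocol expensive.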

Adversarial Saliency Attacks

Heo et al. (2019) and Dombrowski et al. (2019) showed that saliency maps can be manipulated while predictions stay essentially the same. Heo et al. fine-tune the model itself so that it produces misleading saliency maps with almost no change in accuracy; Dombrowski et al. perturb the input imperceptibly so that the explanation matches an arbitrary target map.

The mechanism in the model-manipulation case: adjust the weights so that the forward pass (and hence the predictions) barely changes, while the backward-pass gradients change dramatically. For GradCAM, you can make the heatmap highlight any region of the image - including regions completely unrelated to the classification.

This has serious implications for regulatory compliance. A company could claim their model is explainable (based on GradCAM) while having specifically tuned the model to show plausible-looking but misleading explanations. For high-stakes applications, saliency maps alone are insufficient; they should be paired with behavioral testing (accuracy on perturbed inputs, counterfactual testing) that does not rely on gradient computation.

Medical Imaging: GradCAM for Chest X-Ray Localization

A concrete production case illustrates both the power and the limits of GradCAM. Rajpurkar et al. (2017) trained a CNN (CheXNet, DenseNet-121) to detect pneumonia from chest X-rays. Using GradCAM, they showed that the model's attention concentrated in regions that radiologists identified as consistent with pneumonia - consolidation, infiltrates, effusions.

The evaluation methodology:

1. Collect radiologist bounding box annotations for pneumonia regions in 420 X-rays
2. Threshold GradCAM heatmaps at 50% of maximum value
3. Measure intersection-over-union (IoU) between GradCAM regions and radiologist annotations
4. Compare IoU to the agreement between pairs of radiologists

The result: CheXNet's GradCAM achieved IoU comparable to inter-radiologist agreement for pneumonia localization - without ever being trained with localization supervision. The model learned to localize pneumonia from classification labels alone, and GradCAM revealed this.
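The thresholding and IoU steps of this methodology can be sketched in a few lines. This is an illustrative sketch, not the study's code: `heatmap_iou` is a hypothetical helper, the box format `(row0, col0, row1, col1)` with exclusive end indices is an assumption, and the 4×4 heatmap is made up.

```python
import numpy as np

def heatmap_iou(heatmap, box, threshold=0.5):
    """IoU between the thresholded heatmap and a (row0, col0, row1, col1) box."""
    pred = heatmap >= threshold * heatmap.max()
    truth = np.zeros_like(pred, dtype=bool)
    r0, c0, r1, c1 = box
    truth[r0:r1, c0:c1] = True                     # end indices are exclusive
    union = np.logical_or(pred, truth).sum()
    return float(np.logical_and(pred, truth).sum() / union) if union else 0.0

heatmap = np.zeros((4, 4))
heatmap[0:2, 0:3] = 1.0          # hypothetical hot region: 2 rows x 3 columns
box = (0, 0, 2, 2)               # hypothetical annotation: 2 rows x 2 columns
print(heatmap_iou(heatmap, box))
```

Here the hot region covers the whole box plus two extra pixels, so the intersection is 4 pixels and the union is 6.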

However, the same study found failure cases: some correct classifications had GradCAM highlights on the heart, bones, or image metadata rather than the lung parenchyma. This is exactly where radiologist review of saliency maps catches problems - not by trusting every heatmap, but by flagging the cases where the highlighted region is clinically implausible.

Tip: For medical imaging, do not use the black-image baseline for Integrated Gradients. Medical images (CT scans, X-rays) have specific intensity distributions where black is "no tissue" - an implausible baseline that can create artifacts in the path integral. Use the mean training image or a blurred version of the scan as the baseline instead.
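A blurred baseline can be produced with any low-pass filter. The sketch below uses a simple box blur on a single-channel image as a stand-in; the helper `box_blur`, the radius, and the edge padding are assumptions for illustration, and a Gaussian blur from an imaging library would serve the same purpose.

```python
import numpy as np

def box_blur(image, radius=1):
    """Blur a (H, W) image by averaging each pixel with its neighbours."""
    padded = np.pad(image, radius, mode="edge")    # repeat border pixels outward
    out = np.zeros_like(image, dtype=float)
    size = 2 * radius + 1
    for dr in range(size):                         # sum all shifted copies
        for dc in range(size):
            out += padded[dr:dr + image.shape[0], dc:dc + image.shape[1]]
    return out / size ** 2

scan = np.zeros((5, 5))
scan[2, 2] = 9.0                 # hypothetical scan with one bright spot
baseline = box_blur(scan)        # same overall intensity, fine detail removed
print(baseline)
```

The blurred image keeps the scan's overall intensity distribution while removing the fine detail, so the path from baseline to input stays closer to plausible images than a path starting from all zeros.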

Full Code: GradCAM, SmoothGrad, and Integrated Gradients

import torch
import torch.nn.functional as F
import numpy as np
from torchvision import models, transforms
from PIL import Image
from typing import Optional, Tuple, List

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ─── SETUP ───────────────────────────────────────────────────────────────────

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model = model.to(device).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def preprocess_image(path: str) -> torch.Tensor:
    img = Image.open(path).convert("RGB")
    return preprocess(img).unsqueeze(0).to(device)

# ─── VANILLA GRADIENTS ────────────────────────────────────────────────────────

def vanilla_gradients(
    model: torch.nn.Module,
    input_tensor: torch.Tensor,
    target_class: Optional[int] = None,
) -> np.ndarray:
    """
    Compute |∂f_c / ∂x| for each pixel.
    Returns: saliency map (H, W), normalized to [0, 1].

    Interpretation: local sensitivity of class score to each pixel.
    Limitation: noisy, sensitive to gradient saturation.
    """
    input_tensor = input_tensor.clone().requires_grad_(True)
    logits = model(input_tensor)
    if target_class is None:
        target_class = logits.argmax(1).item()

    model.zero_grad()
    logits[0, target_class].backward()

    saliency = input_tensor.grad.abs()[0]   # (C, H, W)
    saliency = saliency.max(dim=0)[0]       # (H, W) - max across channels
    saliency = saliency.cpu().numpy()
    saliency = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
    return saliency

# ─── SMOOTHGRAD ───────────────────────────────────────────────────────────────

def smooth_grad(
    model: torch.nn.Module,
    input_tensor: torch.Tensor,
    target_class: Optional[int] = None,
    n_samples: int = 50,
    noise_level: float = 0.15,
) -> np.ndarray:
    """
    SmoothGrad (Smilkov et al. 2017).
    S̄(x) = (1/n) Σ ∇_x f_c(x + ε_k), ε_k ~ N(0, σ²I)

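    Average vanilla gradients over Gaussian-noised copies of the input.
    Smooths high-frequency gradient noise while preserving semantic structure.
    Note: this implementation averages |gradient| rather than the signed
    gradient in the formula above, so the result is a magnitude map.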

    Args:
        n_samples: 50 is fast, 300 is research-grade
        noise_level: σ = noise_level × (max - min) of input
    """
    if target_class is None:
        with torch.no_grad():
            target_class = model(input_tensor).argmax(1).item()

    sigma = noise_level * (input_tensor.max() - input_tensor.min()).item()
    accumulated = torch.zeros_like(input_tensor)

    for _ in range(n_samples):
        noise = torch.randn_like(input_tensor) * sigma
        noisy = (input_tensor + noise).requires_grad_(True)
        model.zero_grad()
        model(noisy)[0, target_class].backward()
        accumulated += noisy.grad.abs()

    avg_grad = (accumulated / n_samples)[0]
    saliency = avg_grad.max(dim=0)[0].detach().cpu().numpy()
    saliency = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
    return saliency

# GPU-efficient batched SmoothGrad
def smooth_grad_batched(
    model: torch.nn.Module,
    input_tensor: torch.Tensor,
    target_class: Optional[int] = None,
    n_samples: int = 50,
    noise_level: float = 0.15,
    batch_size: int = 10,
) -> np.ndarray:
    """
    Batched SmoothGrad - process multiple noisy inputs simultaneously.
    5-10x faster than sequential on GPU.
    """
    if target_class is None:
        with torch.no_grad():
            target_class = model(input_tensor).argmax(1).item()

    sigma = noise_level * (input_tensor.max() - input_tensor.min()).item()
    C, H, W = input_tensor.shape[1:]
    # Allocate on the input's device instead of relying on the global `device`
    accumulated = torch.zeros(C, H, W, device=input_tensor.device)

    for batch_start in range(0, n_samples, batch_size):
        batch_n = min(batch_size, n_samples - batch_start)
        noise = torch.randn(batch_n, C, H, W, device=input_tensor.device) * sigma
        batch = input_tensor.expand(batch_n, -1, -1, -1) + noise
        batch = batch.requires_grad_(True)
        logits = model(batch)
        targets = logits[:, target_class].sum()
        model.zero_grad()
        targets.backward()
        accumulated += batch.grad.abs().sum(dim=0)

    avg_grad = (accumulated / n_samples)
    saliency = avg_grad.max(dim=0)[0].detach().cpu().numpy()
    saliency = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-8)
    return saliency

# ─── GRADCAM ──────────────────────────────────────────────────────────────────

class GradCAM:
    """
    GradCAM (Selvaraju et al. 2017).

    Derivation:
      α_k^c = (1/Z) Σ_{i,j} (∂y^c / ∂A_{ij}^k)   [importance weight]
      L^c = ReLU(Σ_k α_k^c A^k)                   [class activation map]

    Uses forward hook to capture A^k and backward hook to capture ∂y^c/∂A^k.
    """

    def __init__(
        self,
        model: torch.nn.Module,
        target_layer: torch.nn.Module,
    ):
        self.model = model
        self.gradients: Optional[torch.Tensor] = None
        self.activations: Optional[torch.Tensor] = None

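        # Forward hook: capture feature map A^k
        # Backward hook: capture gradients ∂y^c/∂A^k
        # Keep the handles so the hooks can be detached with remove_hooks().
        self._handles = [
            target_layer.register_forward_hook(
                lambda m, inp, out: setattr(self, "activations", out.detach())
            ),
            target_layer.register_full_backward_hook(
                lambda m, gi, go: setattr(self, "gradients", go[0].detach())
            ),
        ]

    def remove_hooks(self) -> None:
        """Detach both hooks so the wrapped model is left unmodified."""
        for handle in self._handles:
            handle.remove()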

    def __call__(
        self,
        input_tensor: torch.Tensor,
        target_class: Optional[int] = None,
    ) -> np.ndarray:
        self.model.zero_grad()
        logits = self.model(input_tensor)

        if target_class is None:
            target_class = logits.argmax(1).item()

        logits[0, target_class].backward()

        # α_k^c = global average pool of gradients
        # gradients shape: (1, K, H', W')
        alpha = self.gradients[0].mean(dim=(1, 2))  # (K,)

        # Weighted sum of feature maps + ReLU
        # activations shape: (1, K, H', W')
        cam = torch.einsum("k,khw->hw", alpha, self.activations[0])
        cam = F.relu(cam)

        # Normalize and upsample to input size
        cam_np = cam.cpu().numpy()
        cam_np = (cam_np - cam_np.min()) / (cam_np.max() - cam_np.min() + 1e-8)

        H, W = input_tensor.shape[2], input_tensor.shape[3]
        cam_tensor = torch.from_numpy(cam_np)[None, None]
        cam_upsampled = F.interpolate(
            cam_tensor, size=(H, W), mode="bilinear", align_corners=False
        )
        return cam_upsampled.squeeze().numpy()

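# Usage
# input_tensor = preprocess_image("chest_xray.jpg")
# gradcam = GradCAM(model, model.layer4[-1])  # last conv block in ResNet-50
# heatmap = gradcam(input_tensor)  # explains the model's top-1 ImageNet class
# Note: the ImageNet-trained ResNet-50 above has no pneumonia class. For a
# chest X-ray model, pass that model's own class index as target_class.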

# ─── INTEGRATED GRADIENTS ─────────────────────────────────────────────────────

def integrated_gradients(
    model: torch.nn.Module,
    input_tensor: torch.Tensor,
    baseline: Optional[torch.Tensor] = None,
    target_class: Optional[int] = None,
    n_steps: int = 50,
) -> Tuple[np.ndarray, float]:
    """
    Integrated Gradients (Sundararajan et al. 2017).

    IG_i(x) = (x_i - x'_i) * ∫₀¹ (∂f^c(x' + α(x-x')) / ∂x_i) dα

    Satisfies:
    - Sensitivity: if f(x)≠f(x'), at least one pixel has nonzero attribution
    - Completeness: Σ IG_i(x) = f(x) - f(x')
    - Implementation invariance: equivalent models → same attributions
    - Linearity: linear combinations preserved

    Args:
        baseline: reference input (default: black image, all zeros)
        n_steps: 50 fast, 300 research-grade

    Returns:
        ig_map: (H, W) attribution map, can be negative
        completeness_error: |Σ IG - (f(x) - f(x'))| - should be small
    """
    if baseline is None:
        baseline = torch.zeros_like(input_tensor)

    if target_class is None:
        with torch.no_grad():
            target_class = model(input_tensor).argmax(1).item()

    with torch.no_grad():
        f_x = model(input_tensor)[0, target_class].item()
        f_baseline = model(baseline)[0, target_class].item()
    prediction_diff = f_x - f_baseline

    # Accumulate gradients along the straight-line path from baseline to input
    accumulated_grad = torch.zeros_like(input_tensor)

    for step in range(1, n_steps + 1):
        alpha = step / n_steps
        interp = (baseline + alpha * (input_tensor - baseline)).requires_grad_(True)
        model.zero_grad()
        model(interp)[0, target_class].backward()
        accumulated_grad += interp.grad.detach()

    # IG_i = (x_i - x'_i) * avg_gradient_i
    avg_grad = accumulated_grad / n_steps
    ig = (input_tensor - baseline) * avg_grad  # (1, C, H, W)
    ig_map = ig[0].sum(dim=0).cpu().numpy()    # (H, W), summed across channels

    # Verify completeness axiom
    total_attribution = ig_map.sum()
    completeness_error = abs(total_attribution - prediction_diff)
    relative_error = completeness_error / (abs(prediction_diff) + 1e-8)

    if relative_error > 0.05:
        print(f"WARNING: Completeness error {relative_error:.2%} > 5%. "
              f"Increase n_steps (currently {n_steps}).")
    else:
        print(f"Completeness check PASSED: error = {relative_error:.4%}")

    return ig_map, completeness_error

# ─── PIXEL FLIPPING FAITHFULNESS TEST ─────────────────────────────────────────

def pixel_flipping_test(
    model: torch.nn.Module,
    input_tensor: torch.Tensor,
    saliency_map: np.ndarray,
    target_class: Optional[int] = None,
    n_steps: int = 20,
    baseline_value: float = 0.0,
) -> Tuple[List[float], List[float]]:
    """
    Pixel flipping faithfulness test.

    Iteratively remove the most salient pixels and track model confidence.
    Faithful methods: confidence drops rapidly (important pixels removed first).

    Returns:
        fractions: fraction of pixels removed at each step
        confidences: model confidence at each step
    """
    if target_class is None:
        with torch.no_grad():
            target_class = model(input_tensor).argmax(1).item()

    H, W = saliency_map.shape
    total_pixels = H * W

    # Sort pixel indices by saliency (highest first)
    flat_saliency = saliency_map.flatten()
    sorted_indices = np.argsort(flat_saliency)[::-1]

    fractions = []
    confidences = []

    modified = input_tensor.clone()

    for step in range(n_steps + 1):
        fraction = step / n_steps
        # Remove (fraction * total_pixels) most salient pixels
        n_remove = int(fraction * total_pixels)

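        if step > 0:
            # Remove the next batch of pixels in one vectorized assignment.
            # Note: baseline_value is applied in normalized space, so 0.0 is
            # the dataset mean colour here, not black.
            batch_start = int((step - 1) / n_steps * total_pixels)
            batch_end = n_remove
            idx = torch.as_tensor(
                sorted_indices[batch_start:batch_end].copy(), device=modified.device
            )
            rows = torch.div(idx, W, rounding_mode="floor")
            cols = idx % W
            modified[0, :, rows, cols] = baseline_value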

        with torch.no_grad():
            logits = model(modified)
            conf = torch.softmax(logits, dim=-1)[0, target_class].item()

        fractions.append(fraction)
        confidences.append(conf)

    return fractions, confidences

# ─── ADEBAYO SANITY CHECK ────────────────────────────────────────────────────

def model_randomization_check(
    saliency_fn,
    input_tensor: torch.Tensor,
    model: torch.nn.Module,
) -> float:
    """
    Sanity check (Adebayo et al. 2018): compare saliency with trained vs random model.
    Passing method: LOW correlation (maps change with randomization).
    Failing method (guided backprop): HIGH correlation.

    Returns: Spearman rank correlation (≈1.0 = FAILED, ≈0 = PASSED).
    """
    from scipy.stats import spearmanr
    import copy

    with torch.no_grad():
        target_class = model(input_tensor).argmax(1).item()
    original = saliency_fn(model, input_tensor, target_class)

    # Randomize every parameter at once. This is full randomization, the end
    # point of the cascading test in Adebayo et al.; it does not step through
    # the layers one at a time.
    random_model = copy.deepcopy(model)
    for param in random_model.parameters():
        torch.nn.init.normal_(param, mean=0.0, std=0.01)
    random_model.eval()

    randomized = saliency_fn(random_model, input_tensor, target_class)

    corr, _ = spearmanr(original.flatten(), randomized.flatten())
    status = "FAILED (maps not model-specific)" if corr > 0.7 else "PASSED"
    print(f"Randomization check {status}: Spearman r = {corr:.4f}")
    return corr

# ─── CAPTUM PRODUCTION INTEGRATION ───────────────────────────────────────────

def captum_ig_gradcam_example():
    """
    Production-grade attribution using Captum (Meta's XAI library).
    Validated implementation - numerically stable, well-tested.
    Install: pip install captum
    """
    # from captum.attr import IntegratedGradients, GuidedGradCam, Saliency
    # from captum.attr import visualization as viz
    #
    # ─── Integrated Gradients ───────────────────────────────────────────────
    # ig = IntegratedGradients(model)
    # baseline = torch.zeros_like(input_tensor)
    # attrs, delta = ig.attribute(
    #     input_tensor,
    #     baselines=baseline,
    #     target=target_class,
    #     n_steps=300,
    #     return_convergence_delta=True,
    # )
    # print(f"Convergence delta: {delta.item():.6f}")  # should be ≈ 0
    #
    # ─── GuidedGradCAM (GradCAM × Guided Backprop) ───────────────────────────
    # guided_gc = GuidedGradCam(model, model.layer4[-1])
    # gc_attrs = guided_gc.attribute(input_tensor, target=target_class)
    # # Note: GuidedGradCAM inherits guided backprop's unfaithfulness issues
    # # Use only for qualitative visualization, not faithfulness claims
    #
    # ─── Visualize ────────────────────────────────────────────────────────────
    # attr_np = attrs.squeeze().permute(1, 2, 0).detach().cpu().numpy()
    # img_np = input_tensor.squeeze().permute(1, 2, 0).detach().cpu().numpy()
    # _ = viz.visualize_image_attr(
    #     attr_np, img_np,
    #     method="heat_map",
    #     sign="absolute_value",
    #     show_colorbar=True,
    #     title="Integrated Gradients Attribution"
    # )
    pass

Method Comparison

Method	Resolution	Class-Discriminative	Completeness Axiom	Sanity Check	Compute
Vanilla Gradients	Full (224×224)	Moderate	No	Pass	1 backward pass
SmoothGrad	Full (224×224)	Moderate	No	Pass	50–300 passes
Guided Backprop	Full (224×224)	Low	No	Fail	1 backward pass
GradCAM	Low (7×7 up)	Yes	No	Pass	1 backward pass
GradCAM++	Low (7×7 up)	Better	No	Pass	1 backward pass
Score-CAM	Medium	Yes	No	Pass	K forward passes
Layer-CAM	Tunable	Tunable	No	Pass	1 backward pass
Integrated Gradients	Full (224×224)	Moderate	Yes	Pass	50–300 passes

The "Fail" for Guided Backpropagation comes from the model randomization test in Adebayo et al. (2018). GradCAM and IG are the two methods with the best overall profiles for production use.

Production Engineering Notes

Make GradCAM part of every model validation checklist. Before any vision model ships, inspect GradCAM maps for 20–50 correctly and incorrectly classified examples from each class. The time cost is minimal; the debugging value is enormous. The watermark problem is caught in hours with GradCAM, or discovered by angry users six months later.

Use Integrated Gradients for compliance documentation. When you need to formally attribute a decision - regulatory review, model audit, user-facing explanation in high-stakes domains - IG's completeness guarantee is essential. Always report: (1) baseline choice and why, (2) number of integration steps, (3) numerical completeness error.

Layer selection for GradCAM matters. Using the last convolutional layer is a good default, but for multi-scale features, try different layers. Earlier layers give higher-resolution but less semantic maps; later layers are coarser but more class-discriminative. For detection models, intermediate layers often produce better localization.

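Batch SmoothGrad for throughput. Process multiple noisy copies in a single GPU forward/backward call instead of looping over them one at a time. This is typically several times faster than sequential evaluation (roughly 5–10×, depending on batch size and GPU memory).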
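A minimal sketch of the batching idea, assuming a PyTorch classifier that takes `(N, C, H, W)` input and returns logits. The function name `smoothgrad_batched` and the parameters `noise_frac` and `batch_size` are illustrative, not taken from the lesson's full code listing.

```python
import torch

def smoothgrad_batched(model, x, target_class, n_samples=50,
                       noise_frac=0.15, batch_size=25):
    """SmoothGrad with noisy copies evaluated in batches.

    x: tensor of shape (1, C, H, W). Returns a (C, H, W) averaged gradient map.
    noise_frac: noise std as a fraction of the input range (max - min).
    """
    model.eval()  # eval mode keeps samples in a batch independent (BN uses running stats)
    sigma = noise_frac * (x.max() - x.min()).item()
    total = torch.zeros_like(x[0])
    done = 0
    while done < n_samples:
        b = min(batch_size, n_samples - done)
        noisy = x.detach().repeat(b, 1, 1, 1)
        noisy = noisy + sigma * torch.randn_like(noisy)
        noisy.requires_grad_(True)
        scores = model(noisy)[:, target_class]
        # Each score depends only on its own sample, so the gradient of the
        # summed scores holds every sample's individual gradient.
        grads, = torch.autograd.grad(scores.sum(), noisy)
        total += grads.sum(dim=0)
        done += b
    return total / n_samples
```

`batch_size` trades memory for speed: larger batches mean fewer forward/backward calls but hold more activations on the GPU at once.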

:::danger Score-CAM requires $K$ forward passes where $K$ is the number of channels in the target layer. For ResNet-50's last layer, $K = 2048$ - that is 2048 forward passes per image. Use Score-CAM for offline analysis and qualitative comparison, not for production real-time attribution. :::

Common Mistakes

:::danger Common Mistake 1: Trusting guided backpropagation for model explanation Guided backpropagation produces clean, visually appealing maps that fail the model randomization sanity check. The maps reflect the input image's edge structure, not the model's learned behavior. If you have existing work using guided backpropagation for model explanation, redo it with GradCAM or Integrated Gradients. :::

:::danger Common Mistake 2: Not verifying the completeness axiom for Integrated Gradients IG only satisfies completeness when implemented correctly. Bugs in the interpolation, wrong number of steps, or incorrect gradient accumulation can violate completeness by a large margin. Always compute $|\sum_i \text{IG}_i(x) - (f(x) - f(x'))|$ and verify it is below 5% of $|f(x) - f(x')|$ . :::
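The completeness check can be sketched as follows. This is an illustration under stated assumptions: `integrated_gradients` is a plain midpoint Riemann-sum approximation written for this example, and the "model" `f` is a hypothetical scalar function standing in for a class score.

```python
import torch

def integrated_gradients(f, x, baseline, n_steps=50):
    """Midpoint Riemann-sum approximation of IG for a scalar-valued f."""
    alphas = (torch.arange(n_steps, dtype=x.dtype) + 0.5) / n_steps
    total = torch.zeros_like(x)
    for a in alphas:
        point = (baseline + a * (x - baseline)).detach().requires_grad_(True)
        grad, = torch.autograd.grad(f(point), point)
        total += grad
    return (x - baseline) * total / n_steps

def completeness_error(attributions, f_x, f_baseline):
    """Relative gap |sum(IG) - (f(x) - f(x'))| / |f(x) - f(x')|."""
    gap = f_x - f_baseline
    return abs(attributions.sum().item() - gap) / max(abs(gap), 1e-12)

# Hypothetical stand-in for a class score: a smooth nonlinear function
w = torch.tensor([0.5, -1.0, 2.0, 0.25])
f = lambda z: torch.tanh((w * z).sum())

x = torch.tensor([1.0, -0.5, 0.8, 2.0])
baseline = torch.zeros_like(x)
attrs = integrated_gradients(f, x, baseline, n_steps=50)
err = completeness_error(attrs, f(x).item(), f(baseline).item())
print(f"relative completeness error: {err:.4%}")
```

If the relative error stays above your tolerance as `n_steps` grows, suspect the interpolation or the gradient accumulation rather than the step count.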

:::warning Common Mistake 3: Interpreting GradCAM upsampled resolution as true pixel precision GradCAM heatmaps are upsampled from 7×7 to 224×224 using bilinear interpolation. The smooth heatmap looks precise but the underlying resolution is one activation per 32×32 input region. A heatmap highlighting "the eye of the dog" with apparent pixel precision is actually a coarse blob that happens to align with the eye. :::
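The arithmetic behind that warning, as a small sketch. The helper name `cell_to_input_region` is made up for this example, and the box is only the nominal grid cell: the true receptive field of a layer4 activation is much larger than 32×32.

```python
def cell_to_input_region(row, col, map_size=7, input_size=224):
    """Nominal input-pixel box (top, left, bottom, right) for one CAM cell."""
    stride = input_size // map_size  # 224 // 7 = 32 input pixels per cell
    return (row * stride, col * stride, (row + 1) * stride, (col + 1) * stride)

print(cell_to_input_region(3, 4))  # (96, 128, 128, 160)
```

Any detail finer than one such box in the displayed heatmap comes from interpolation, not from the model.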

:::warning Common Mistake 4: Using a single saliency map to draw definitive conclusions Different saliency methods measure different things. If GradCAM and Integrated Gradients disagree on which regions are important, that disagreement is informative - it means the model's behavior is not fully captured by either method alone. Always use at least two methods and investigate disagreements. :::

YouTube Resources

Resource	Creator	Focus
GradCAM Explained Step by Step	Aladdin Persson	PyTorch GradCAM from scratch
Integrated Gradients - Paper Explained	Yannic Kilcher	IG axioms and completeness proof
Sanity Checks for Saliency Maps	Interpretability Research	Adebayo et al. paper breakdown
Captum Tutorial - PyTorch XAI	PyTorch Official	Production-ready IG and GradCAM
Score-CAM Paper Explained	Papers Explained	Score-CAM vs GradCAM comparison

Interview Q&A

Q1: Walk me through GradCAM step by step, including the math.

GradCAM works in two mathematical steps. Step one: compute the importance weight $\alpha_k^c = \frac{1}{Z}\sum_{i,j} \frac{\partial y^c}{\partial A_{ij}^k}$ for each feature map channel $k$ . This is the global average pooled gradient - how much, on average across all spatial positions in feature map $k$ , does the class score $y^c$ respond to changes in that feature map. You get these by running a backward pass and hooking the gradients at the target convolutional layer. Step two: compute $L^c = \text{ReLU}(\sum_k \alpha_k^c A^k)$ - a weighted sum of the feature maps, ReLU'd to keep only positive contributions. Then bilinearly upsample $L^c$ from the feature map resolution (7×7) to the input image resolution (224×224). Implementation requires two PyTorch hooks: one forward hook to capture the feature maps $A^k$ , one backward hook to capture the gradients $\partial y^c / \partial A^k$ .
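The two steps can be sketched in a few lines of PyTorch. This is a minimal illustration, not the lesson's full implementation. It assumes the target layer returns a single tensor, and it captures the gradient with a tensor hook registered inside the forward hook, which is one of several ways to get $\partial y^c / \partial A^k$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gradcam(model, target_layer, x, target_class):
    """Minimal GradCAM: (H, W) heatmap scaled to [0, 1] for x of shape (1, C, H, W)."""
    store = {}

    def fwd_hook(module, inputs, output):
        store["A"] = output                                   # feature maps A^k
        output.register_hook(lambda g: store.update(dA=g))    # dy^c / dA^k

    handle = target_layer.register_forward_hook(fwd_hook)
    try:
        model.eval()
        model.zero_grad()
        x = x.detach().requires_grad_(True)   # ensures the graph reaches the layer
        score = model(x)[0, target_class]     # class score y^c (pre-softmax)
        score.backward()
    finally:
        handle.remove()

    alpha = store["dA"].mean(dim=(2, 3), keepdim=True)            # step 1: pooled gradients
    cam = F.relu((alpha * store["A"]).sum(dim=1, keepdim=True))   # step 2: weighted sum + ReLU
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
    cam = cam[0, 0].detach()
    return cam / (cam.max() + 1e-8)

# Toy network standing in for a real CNN (hypothetical architecture)
torch.manual_seed(0)
net = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 5),
)
heatmap = gradcam(net, net[4], torch.rand(1, 3, 16, 16), target_class=1)
print(heatmap.shape)  # torch.Size([16, 16])
```

For a torchvision ResNet-50 the usual choice of `target_layer` would be `model.layer4[-1]`, as used elsewhere in this lesson.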

Q2: What are the four axioms of Integrated Gradients and why does each matter?

Sensitivity: if the input and baseline differ in a single feature and their predictions differ, that feature must receive nonzero attribution - no feature that changes the output can be silently ignored. This is violated by vanilla gradients when neurons are saturated. Completeness: the sum of all attributions equals $f(x) - f(x')$ , the prediction difference from baseline. Every unit of prediction change is accounted for by specific pixels, making the explanation accountable. This also serves as a correctness test: if your implementation violates completeness by more than 5%, you have a bug. Implementation invariance: two functionally equivalent networks (same input-output behavior, different internal implementations) receive the same attributions. Plain gradients satisfy this because the chain rule depends only on the function being computed, but methods that replace gradients with discrete differences, such as DeepLIFT and LRP, can assign different attributions to equivalent networks. Linearity: if the network is a linear combination of two sub-networks, attributions are the corresponding linear combination. This makes IG composable for analyzing ensemble models.

Q3: Why did Adebayo et al. (2018) conclude guided backpropagation is not a reliable explanation method?

They proposed two sanity checks. The model parameter randomization test: reinitialize all model weights randomly, compute guided backpropagation maps, compare to the trained-model maps using Spearman rank correlation. A reliable explanation method should give very different maps after randomization (the model's knowledge is destroyed). For guided backpropagation, correlation stayed high - the maps barely changed. The data randomization test: retrain the model on randomly shuffled labels, recompute maps - same result. The reason is structural: guided backpropagation's double-positive mask implements an edge detector based on the image's intensity structure, largely independent of the weights. The maps look interpretable because they detect edges, not because they reflect learned features.

Q4: You need to explain a CNN's classification decision to a medical regulatory body. What would you use and how would you document it?

I would use Integrated Gradients with the following documentation: Baseline - the mean training image, not black/zero, because medical images have specific intensity distributions where black is "no tissue" and creates artifacts in the path integral. Integration steps - 300 (regulatory applications warrant higher numerical precision than the 50 needed for debugging). Completeness verification - for each explained decision, compute $|\sum_i \text{IG}_i(x) - (f(x) - f(x'))|$ and require it to be less than 1% of $|f(x) - f(x')|$ . Attribution aggregation - sum IG across color channels and report per-pixel contribution to classification confidence change. Clinical validation - have a radiologist verify that highlighted regions align with clinically plausible features for a representative sample. Model randomization sanity check - verify that IG maps change substantially when model weights are randomized, confirming they reflect learned behavior, not image structure.

Q5: GradCAM highlights a coarse region of the image. How would you get higher-resolution, pixel-level explanation?

Three options in increasing precision: (1) Layer-CAM on intermediate layers - earlier layers have higher resolution (for a 224×224 input, 56×56 at ResNet-50 layer1 and 28×28 at layer2, vs 7×7 at layer4) but are less class-discriminative. Layer-CAM uses element-wise multiplication of the feature map with ReLU'd gradients, preserving spatial structure better than GradCAM's global average pooling. (2) Score-CAM uses each channel as a perturbation mask and scores it by the model's confidence change - produces sharper maps than GradCAM at the cost of $K$ forward passes. (3) Integrated Gradients on input pixels - full 224×224 resolution attribution with no spatial aggregation, satisfying the completeness axiom. The tradeoff is 50–300 forward/backward passes versus 1 for GradCAM. For production debugging, GradCAM's coarse resolution is usually sufficient to identify the failure mode. For compliance documentation requiring pixel-level attribution, use IG.
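The difference between the two weighting schemes in option (1) is easiest to see on raw tensors. A sketch, assuming `activations` and `gradients` have already been captured at the chosen layer (for example with hooks); the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def layercam_from_tensors(activations, gradients, out_size):
    """Layer-CAM: per-location weights ReLU(grad), multiplied element-wise with A."""
    weights = F.relu(gradients)                                # shape (1, K, h, w)
    cam = F.relu((weights * activations).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=out_size, mode="bilinear", align_corners=False)
    return cam[0, 0]

def gradcam_from_tensors(activations, gradients, out_size):
    """GradCAM: one weight per channel, the spatial mean of the gradient."""
    weights = gradients.mean(dim=(2, 3), keepdim=True)         # shape (1, K, 1, 1)
    cam = F.relu((weights * activations).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=out_size, mode="bilinear", align_corners=False)
    return cam[0, 0]

# Constant activations, gradient concentrated at a single location
A = torch.ones(1, 2, 4, 4)
G = torch.zeros(1, 2, 4, 4)
G[0, :, 1, 2] = 1.0

lc = layercam_from_tensors(A, G, out_size=(4, 4))
gc = gradcam_from_tensors(A, G, out_size=(4, 4))
print(lc)  # nonzero only at position (1, 2)
print(gc)  # the same value at every position
```

In this toy case Layer-CAM keeps the location of the gradient signal, while GradCAM's pooling spreads it uniformly over the map.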

Q6: What is an adversarial saliency attack and why does it matter for compliance?

Heo et al. (2019) showed that you can fine-tune a model to produce manipulated saliency maps while keeping its predictions nearly unchanged. The mechanism: fine-tune with an extra loss term that steers the backward-pass gradients (saliency) toward a chosen target, while the usual classification loss keeps the forward pass (predictions) essentially where it was. For GradCAM, the heatmap can be made to highlight any region - including completely unrelated regions. This matters for compliance because a company could claim explainability (based on GradCAM) while having specifically tuned the model to show misleading-but-plausible explanations to regulators. Defenses: (1) pair saliency maps with behavioral testing that does not rely on gradients - pixel flipping test, ROAR, counterfactual testing; (2) use multiple independent attribution methods and require consistency; (3) apply the Adebayo randomization test to verify saliency reflects learned behavior, not gradient manipulation.

Historical Context: From Simple Gradients to the Modern Zoo

The first gradient-based saliency map was proposed by Simonyan, Vedaldi, and Zisserman (2014) - the same year as VGGNet. Their observation was straightforward: the gradient $\partial f_c / \partial x$ is already computed during backpropagation; why not visualize it? The resulting maps were noisy but opened the field.

Springenberg et al. (2015) proposed guided backpropagation and demonstrated that deconvolution (Zeiler & Fergus, 2014) was a special case of the same two-sided gradient masking operation. These methods produced visually compelling maps, and for two years were considered state of the art.

GradCAM (Selvaraju et al., 2017) shifted the focus from pixel-level gradients to feature map gradients, producing coarser but more class-discriminative maps. The key insight - average pool the gradients over the feature map spatial dimensions - is simple and effective. GradCAM became the industry standard within a year of publication.

The sanity checks paper (Adebayo et al., 2018) was the field's reckoning. It demonstrated that some popular methods - notably guided backpropagation and guided GradCAM - were largely insensitive to the model's learned parameters. This forced the field to distinguish between methods that explain the model and methods that explain the image.

Integrated Gradients (Sundararajan et al., 2017) provided the formal foundation. The axiomatic approach - defining what a faithful attribution must satisfy and proving that IG satisfies those axioms - set the standard for rigorous evaluation. Captum (Kokhlikyan et al., 2020) brought production-ready IG to the PyTorch ecosystem.

Score-CAM (Wang et al., 2020) replaced gradient-based channel weights with forward-pass confidence scores, and Layer-CAM (Jiang et al., 2021) kept gradients but applied them element-wise, which gives sharper maps from earlier layers. RISE (Petsiuk et al., 2018) took a perturbation-based approach, masking random image regions and measuring the model's response - slower but gradient-free.

The current state: for production debugging, GradCAM remains the default. For formal attribution, IG is the standard. The field has converged but not stagnated - new methods continue to address specific weaknesses (resolution, faithfulness, efficiency).

Key Takeaways

Saliency maps are the primary debugging tool for vision models. GradCAM is the right default: fast, class-discriminative, passes sanity checks, good enough resolution for most debugging tasks. Use Integrated Gradients when you need formal guarantees (completeness, sensitivity) for compliance or auditing. Never use guided backpropagation for model explanation. Always run the Adebayo sanity check for any new saliency method you adopt. Make saliency visualization part of your standard model validation pipeline - the watermark problem is caught in hours with GradCAM, or discovered by angry users six months later.

The hierarchy by use case:

Debugging and development: GradCAM (one backward pass, immediate feedback)
Multiple object localization: GradCAM++ (better for multi-instance images)
High-resolution debugging: Layer-CAM on intermediate layers or SmoothGrad
Formal compliance documentation: Integrated Gradients with completeness verification
Faithfulness benchmarking: ROAR (pixel removal with retraining)
Adversarial robustness checking: pixel flipping test + model randomization sanity check

Production Deployment Checklist

Saliency explanation systems fail in production for reproducible reasons. This checklist addresses each failure mode.

Step	What to Check	How to Verify
Model sanity	Does the saliency method pass Adebayo randomization?	Randomize model weights; map must change drastically
Class discrimination	Does GradCAM highlight different regions for different classes?	Run on same image with top-3 classes; compare maps
IG completeness	Does the IG attribution sum equal the prediction gap?	Verify $|\sum_i \text{IG}_i - (f(x) - f(x'))|$ is within your tolerance (e.g. below 1–5% of $|f(x) - f(x')|$ )
Resolution adequacy	Is the map fine-grained enough for the use case?	Use Layer-CAM on intermediate layers for higher resolution
Computational budget	Does the method meet latency requirements?	Profile on production hardware; GradCAM ≈ 1 backward pass
Stability	Do similar inputs produce similar maps?	Measure correlation across 5 minor augmentations of same image
Documentation	Are explanations logged with input and model version?	Store input hash + map version + model version together

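import hashlib
from datetime import datetime, timezone

import torch

def log_saliency_explanation(
    image: torch.Tensor,
    saliency_map: torch.Tensor,
    predicted_class: int,
    model_version: str,
    method: str = "gradcam"
) -> dict:
    """
    Log a saliency explanation for audit purposes.
    Returns a structured, JSON-serializable record suitable for database storage.
    """
    # detach + cpu so hashing also works for CUDA tensors and tensors with grad
    image_cpu = image.detach().cpu().contiguous()
    map_cpu = saliency_map.detach().cpu().contiguous()
    image_hash = hashlib.sha256(image_cpu.numpy().tobytes()).hexdigest()[:16]
    map_hash = hashlib.sha256(map_cpu.numpy().tobytes()).hexdigest()[:16]

    stats_map = map_cpu.float()  # quantile() requires a floating-point tensor

    return {
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "image_hash": image_hash,
        "map_hash": map_hash,
        "predicted_class": predicted_class,
        "method": method,
        "model_version": model_version,
        "map_stats": {
            "mean": float(stats_map.mean()),
            "max": float(stats_map.max()),
            # Share of pixels strictly above the 95th percentile: about 0.05
            # for a typical map, lower for constant or mostly-zero maps
            "top_5pct_fraction": float(
                (stats_map > stats_map.quantile(0.95)).float().mean()
            ),
        }
    }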

When Saliency Maps Mislead You

Saliency maps are powerful but can produce systematic false impressions. Knowing the failure modes is as important as knowing the methods.

The Texture vs. Shape Debate: ImageNet-trained CNNs are strongly biased toward texture (Geirhos et al., 2019). GradCAM highlights the texture region even on a correctly classified image, which may not correspond to the semantically meaningful feature a human would identify. Shape-biased models (style-transfer-trained or ViTs) produce maps that better align with human intuition.

Adversarial Saliency Manipulation: Dombrowski et al. (2019) showed that adversarial perturbations to the input can produce arbitrary saliency maps while keeping the prediction unchanged and the perturbation imperceptible. A displayed map could therefore point to legitimate features while the model is actually using a hidden shortcut. Mitigation: use multiple independent saliency methods and check agreement.

Gradient Saturation: In very deep networks, gradients can saturate (approaching zero) even for important features. Vanilla saliency maps go blank in these regions. IG avoids this by integrating over the entire path from baseline to input - even if the gradient is zero at the input, it may be non-zero along the path.
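A one-dimensional toy case makes the saturation argument concrete. The function below is a hypothetical saturating unit, not a real network: it rises linearly until $x = 1$ and is flat afterwards, so the gradient at $x = 2$ is zero even though the output differs from the baseline output.

```python
import torch

def f(x):
    # Saturating unit: rises linearly until x = 1, then stays flat at 1
    return 1 - torch.relu(1 - x)

# Vanilla gradient at the saturated input x = 2
x = torch.tensor(2.0, requires_grad=True)
grad_at_x, = torch.autograd.grad(f(x), x)

# Integrated Gradients from baseline 0 via a midpoint Riemann sum
baseline, x_val, n_steps = 0.0, 2.0, 100
total = 0.0
for i in range(n_steps):
    a = (i + 0.5) / n_steps
    p = torch.tensor(baseline + a * (x_val - baseline), requires_grad=True)
    g, = torch.autograd.grad(f(p), p)
    total += g.item()
ig = (x_val - baseline) * total / n_steps

print(grad_at_x.item(), ig)  # 0.0 1.0
```

The vanilla gradient reports no importance, while IG recovers the full output change $f(2) - f(0) = 1$ because half of the path lies in the unsaturated region.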

Batch Normalization Artifacts: BatchNorm's running statistics make saliency maps sensitive to whether the model is in train vs eval mode. Always set model.eval() before computing saliency maps.

# Anti-pattern: computing saliency in training mode
model.train()  # WRONG: BN uses batch stats, gradients are noisier

# Correct pattern
model.eval()   # RIGHT: BN uses running stats, consistent gradients
# Note: for saliency you need gradients - do NOT use torch.no_grad()
# But DO use model.eval() for consistent BN behavior

def cross_validate_attributions(model, input_tensor, target_class):
    """
    Compare multiple attribution methods to build confidence.
    High agreement between IG and GradCAM is a confidence signal.
    Low agreement suggests conflicting signals worth investigating.
    """
    import torch.nn.functional as F
    from captum.attr import IntegratedGradients, LayerGradCam, Saliency

    model.eval()
    saliency = Saliency(model)
    vanilla = saliency.attribute(input_tensor, target=target_class)

    ig = IntegratedGradients(model)
    integrated = ig.attribute(input_tensor, target=target_class, n_steps=50)

    target_layer = model.layer4[-1]
    layer_gc = LayerGradCam(model, target_layer)
    gradcam = layer_gc.attribute(input_tensor, target=target_class)

    # Bring both maps onto the same (H, W) grid before comparing:
    # collapse IG over color channels, upsample the coarse GradCAM map
    ig_map = integrated.detach().abs().sum(dim=1)[0]
    gc_map = F.interpolate(
        F.relu(gradcam.detach()), size=input_tensor.shape[-2:],
        mode="bilinear", align_corners=False,
    )[0, 0]

    # Measure top-10% pixel agreement between IG and upsampled GradCAM
    k = int(ig_map.numel() * 0.1)
    top_ig = set(ig_map.flatten().topk(k).indices.tolist())
    top_gc = set(gc_map.flatten().topk(k).indices.tolist())
    jaccard = len(top_ig & top_gc) / len(top_ig | top_gc)
    print(f"IG vs GradCAM Jaccard@10%: {jaccard:.3f}")
    return vanilla, integrated, gradcam

These failure modes reinforce the core principle: saliency maps are hypotheses about model behavior, not ground truth. Always validate with sanity checks and never rely on a single method alone.

