Evaluating the Quality of ML Explanations - Faithfulness, Robustness, and Human Studies

Reading time: 45 min | Interview relevance: Very High - distinguishing faithful from plausible explanations is a senior-level interview topic; expected for ML Researcher, Applied Scientist, AI Safety Engineer | Target roles: ML Researcher, Applied Scientist, ML Engineer, AI Engineer


Two Explanations That Both Look Reasonable

A hospital's radiology department is piloting an AI diagnostic tool for chest X-rays. The system flags images for pneumonia, COVID-19, and lung nodules. The clinical AI team builds two explanation systems to help radiologists understand and trust the model's decisions.

The first explanation uses SHAP values on image patches - it highlights regions of the X-ray with the highest positive SHAP contribution, drawing attention to the lower-left lobe where opacity appears. Radiologists look at the highlighted region and nod: "Yes, that is exactly where I would look." The second explanation uses attention weights from the transformer backbone - it shows a heatmap of where the model attended, with bright spots on the lung parenchyma and a few on the costophrenic angle.

Both explanations look reasonable to the radiologists. Both pass the informal eyeball test. The clinical AI team is about to recommend the attention-based explanation because radiologists find it more visually intuitive. Then the lead ML researcher runs a standard evaluation: she randomizes the model weights completely - all parameters reset to random noise, so the model is predicting at chance - and recomputes the attention heatmaps. The heatmaps barely change. The attention maps are generated by patterns in the input image that have nothing to do with the model's decision. They are visually plausible explanations of a random model. The SHAP explanations, when the same test is run, change dramatically when weights are randomized.

The attention maps passed the human eyeball test and failed the machine faithfulness test. The SHAP explanations passed both. This is the core challenge of explanation evaluation: human judgment alone is insufficient and often misleading. We need systematic, quantitative methods to evaluate whether an explanation actually reflects what the model is doing.


The Evaluation Challenge: No Oracle

Evaluating explanations is harder than evaluating predictions. For predictions, you have ground truth labels: the model said 1, the label is 1, it is right. For explanations, there is no ground truth. You cannot look up "the true explanation for why this model assigned this prediction to this input." The model's internal computation is the ground truth, and that is exactly what we are trying to explain.

This creates what Doshi-Velez and Kim (2017) called the "evaluation problem" of interpretable machine learning: we cannot directly verify whether an explanation is correct. We must instead evaluate proxy properties:

  • Faithfulness: Does the explanation accurately reflect the model's actual computation?
  • Robustness: Is the explanation stable under small perturbations of the input?
  • Completeness: Does the explanation account for the model's full output?
  • Human utility: Does the explanation help a human user accomplish a task more effectively?

These properties can sometimes conflict. An explanation that is fully faithful (shows exactly what the model uses) may be incomprehensible to humans. An explanation that is easy for humans to understand may simplify the model's decision to the point of unfaithfulness. The Pareto frontier between faithfulness and human utility is real, and navigating it is an engineering and design challenge.


Faithfulness: Does the Explanation Reflect the Model?

Faithfulness is the most fundamental property. An explanation is faithful if it accurately reflects the model's actual decision process - not what a human thinks the model does, not what the model should do, but what it actually does.

Sufficiency

Definition: The top-$k$ features identified by the explanation are sufficient to reproduce the model's prediction.

Formal statement: Let $\phi(x)$ be the explanation, and let $S_k$ be the set of the top-$k$ features by explanation magnitude. Sufficiency at $k$ is:

$$\text{Sufficiency}(k) = \mathbb{E}_x\left[ \left| f(x) - f(x_{S_k}) \right| \right]$$

where $x_{S_k}$ is $x$ with all features outside $S_k$ replaced by their baseline (e.g., feature mean or zero). A lower value means the top-$k$ features suffice - the model's prediction barely changes when all other features are masked.

Interpretation: Sufficiency tests whether the explanation correctly identifies which features matter. If the explanation says "features A, B, C are most important" and the model's prediction hardly changes when all other features are removed (leaving only A, B, C), the explanation has identified the features that carry the prediction.

Comprehensiveness

Definition: Removing the top-$k$ features from the input causes a significant change in the model's prediction.

$$\text{Comprehensiveness}(k) = \mathbb{E}_x\left[ \left| f(x) - f(x_{\bar{S}_k}) \right| \right]$$

where $x_{\bar{S}_k}$ is $x$ with the top-$k$ explanation features masked out (set to baseline). A higher value means removing the identified important features hurts more - they are genuinely important.

Together, sufficiency and comprehensiveness define faithfulness: the explanation is faithful if the top-$k$ features are both necessary (comprehensiveness is high: removing them hurts) and sufficient (sufficiency is low: keeping only them preserves the prediction).
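To make the two definitions concrete, here is a minimal sketch on a hypothetical four-feature linear model. The weights, the all-zero baseline, and the helper names are illustrative assumptions, not part of any library; for a linear model the per-feature contribution is known exactly, which makes the numbers easy to check by hand.

```python
# Toy linear model f(x) = w . x; every number here is illustrative.
WEIGHTS = [3.0, 2.0, 0.5, 0.1]
BASELINE = [0.0, 0.0, 0.0, 0.0]  # assumed baseline: the all-zero input


def f(x):
    return sum(w * xi for w, xi in zip(WEIGHTS, x))


def attribution(x):
    """Exact contribution of each feature relative to the baseline."""
    return [w * (xi - bi) for w, xi, bi in zip(WEIGHTS, x, BASELINE)]


def top_k(phi, k):
    """Indices of the k features with the largest |attribution|."""
    order = sorted(range(len(phi)), key=lambda j: abs(phi[j]), reverse=True)
    return set(order[:k])


def sufficiency(x, phi, k):
    """|f(x) - f(x with only the top-k features kept)|; lower is better."""
    keep = top_k(phi, k)
    x_keep = [xi if j in keep else BASELINE[j] for j, xi in enumerate(x)]
    return abs(f(x) - f(x_keep))


def comprehensiveness(x, phi, k):
    """|f(x) - f(x with the top-k features masked)|; higher is better."""
    drop = top_k(phi, k)
    x_drop = [BASELINE[j] if j in drop else xi for j, xi in enumerate(x)]
    return abs(f(x) - f(x_drop))


x = [1.0, 1.0, 1.0, 1.0]
phi = attribution(x)
suff = sufficiency(x, phi, k=2)        # features 2 and 3 are masked
comp = comprehensiveness(x, phi, k=2)  # features 0 and 1 are masked
print(f"sufficiency@2={suff:.2f}  comprehensiveness@2={comp:.2f}")
```

With these numbers the full prediction is about 5.6, keeping only the top two features gives 5.0 (sufficiency of roughly 0.6), and masking them leaves 0.6 (comprehensiveness of roughly 5.0): the low-sufficiency, high-comprehensiveness pattern expected of a faithful ranking.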

The Deletion Game

The deletion game evaluates explanation quality by sequentially removing features in order of their explanation rank and observing how the model's prediction changes.

Algorithm:

  1. Start with the full input $x$
  2. Order features from most to least important according to the explanation
  3. At each step, remove the next feature (replace with baseline)
  4. Record the model's prediction at each step

A faithful explanation should cause the prediction (the score of the class the model predicts on the full input) to drop rapidly at first, since we remove the most important features first, and then plateau. Mathematically, if we rank features $j_1, j_2, \ldots, j_M$ by importance:

$$\text{DeletionAUC} = \frac{1}{M}\sum_{k=1}^{M} f(x_{\bar{S}_k})$$

where the sum is over the model's prediction as each feature is sequentially removed. A lower AUC indicates the explanation identified the most damaging features to remove first - a sign of faithfulness.

The Insertion Game

The insertion game is the complement: start with a masked input (all features at baseline) and insert features in order of their explanation rank, from most to least important.

$$\text{InsertionAUC} = \frac{1}{M}\sum_{k=1}^{M} f(x_{S_k})$$

A faithful explanation should cause the prediction to rise rapidly at first (inserting the most important features first recovers the prediction quickly). A higher AUC indicates faithfulness.

Together: a faithful explanation maximizes InsertionAUC and minimizes DeletionAUC relative to a random ordering of features. The gap between the faithful explanation's curve and a random baseline is the explanation's "faithfulness score."
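The comparison against a random ordering can be sketched in a few lines. This is an illustrative example only: the logistic model, its weights, the zero baseline, and the function names are assumptions chosen so that the effect is easy to see.

```python
import math
import random

# Hypothetical logistic model; weights and bias are illustrative.
WEIGHTS = [4.0, 2.0, 1.0, 0.5, 0.25, 0.1]
BIAS = -3.0


def f(x):
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))


def masked(x, keep):
    """Copy of x with every feature outside `keep` set to the zero baseline."""
    return [xi if j in keep else 0.0 for j, xi in enumerate(x)]


def deletion_insertion_auc(x, order):
    """Mean prediction while deleting / inserting features in the given order."""
    m = len(x)
    deletion, insertion = [], []
    for k in range(1, m + 1):
        top = set(order[:k])
        deletion.append(f(masked(x, keep=set(range(m)) - top)))
        insertion.append(f(masked(x, keep=top)))
    return sum(deletion) / m, sum(insertion) / m


x = [1.0] * len(WEIGHTS)
phi = [w * xi for w, xi in zip(WEIGHTS, x)]  # exact linear attribution
explained_order = sorted(range(len(x)), key=lambda j: abs(phi[j]), reverse=True)
exp_del, exp_ins = deletion_insertion_auc(x, explained_order)

# Random-ordering baseline: average the two AUCs over many shuffles.
rng = random.Random(0)
dels, inss = [], []
for _ in range(200):
    order = list(range(len(x)))
    rng.shuffle(order)
    d, i = deletion_insertion_auc(x, order)
    dels.append(d)
    inss.append(i)
rand_del, rand_ins = sum(dels) / len(dels), sum(inss) / len(inss)

print(f"deletion AUC:  explanation={exp_del:.3f}  random={rand_del:.3f}")
print(f"insertion AUC: explanation={exp_ins:.3f}  random={rand_ins:.3f}")
```

Because every weight here is positive, ordering by attribution removes (or restores) the largest contributions first, so the explanation's deletion AUC sits below the random baseline and its insertion AUC above it.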


The ROAR Benchmark (Hooker et al. 2019)

The deletion/insertion game has a problem: when you remove features (replace with baseline), the resulting input may be out-of-distribution - the model was never trained on partially masked inputs. A model might behave arbitrarily on these unnatural inputs, making the deletion game unreliable for measuring faithfulness.

Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, and Been Kim (2019) introduced ROAR: Remove and Retrain, which solves this problem by retraining the model on perturbed data.

ROAR Algorithm:

  1. Train the original model $f$ on full data $\mathcal{D}$
  2. Use explanation method $E$ to rank features by importance for each sample
  3. Create a modified dataset $\mathcal{D}_k$ where, for each sample, the top-$k\%$ of features (by importance) are replaced with an uninformative value (Hooker et al. use the per-channel mean)
  4. Retrain a new model $f_k$ on $\mathcal{D}_k$
  5. Measure the accuracy degradation on test data modified the same way: $\Delta_{\text{acc}}(k) = \text{acc}(f) - \text{acc}(f_k)$

Key insight: If the explanation correctly identifies the most important features, removing them should cause a large accuracy drop when the model is retrained. The feature importance ranking is validated against model performance on data from which those features are truly absent.

Comparison: Different explanation methods are compared by plotting $\Delta_{\text{acc}}(k)$ as a function of $k$. The method that causes the largest accuracy drop for the smallest $k$ (removing the fewest features) has the most faithful feature importance ranking.

KAR (Keep and Retrain): The complement - keep only the top-$k\%$ features, replace all others with an uninformative value, retrain. A faithful method should achieve high accuracy with only the top features kept.

$$\text{ROAR score} = \sum_{k \in K} \Delta_{\text{acc}}(k) \cdot \Delta k$$

The area under the accuracy-degradation curve is the ROAR score. Higher ROAR score = more faithful explanation method.

Limitation of ROAR: It requires retraining the model for each value of kk and each explanation method, which is computationally expensive - potentially hundreds of GPU-hours for a large neural network. It is a benchmark-level evaluation tool, not a production monitoring tool. Use it to select explanation methods during model development, not continuously in production.
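The remove-and-retrain loop can be sketched on a toy problem. Everything below is an assumption made for illustration: a synthetic dataset where only feature 0 carries the label, a nearest-centroid classifier standing in for the network (its decision score decomposes exactly per feature), and the training mean as the fill value. It is not the setup used by Hooker et al.

```python
import random

N_FEATURES = 4


def make_data(n, rng):
    """Feature 0 carries the label; features 1-3 are pure noise."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        signal = (1.0 if label else -1.0) + rng.gauss(0, 0.3)
        X.append([signal] + [rng.gauss(0, 0.3) for _ in range(N_FEATURES - 1)])
        y.append(label)
    return X, y


def fit_centroids(X, y):
    cents = []
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        cents.append([sum(col) / len(rows) for col in zip(*rows)])
    return cents


def predict(cents, x):
    d = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in cents]
    return 0 if d[0] <= d[1] else 1


def accuracy(cents, X, y):
    return sum(predict(cents, x) == t for x, t in zip(X, y)) / len(y)


def centroid_ranking(cents, x):
    """Rank features by |contribution| to the decision score d0 - d1."""
    contrib = [(xi - a) ** 2 - (xi - b) ** 2
               for xi, a, b in zip(x, cents[0], cents[1])]
    return sorted(range(len(x)), key=lambda j: abs(contrib[j]), reverse=True)


def random_ranking(rng, x):
    order = list(range(len(x)))
    rng.shuffle(order)
    return order


def remove_top_k(X, rank_fn, k, fill):
    """Replace each sample's top-k ranked features with the fill value."""
    out = []
    for x in X:
        row = list(x)
        for j in rank_fn(x)[:k]:
            row[j] = fill[j]
        out.append(row)
    return out


def roar_drop(rank_fn, k, train, test, base_acc, fill):
    """Retrain on the modified training set, score on the modified test set."""
    (X_tr, y_tr), (X_te, y_te) = train, test
    retrained = fit_centroids(remove_top_k(X_tr, rank_fn, k, fill), y_tr)
    return base_acc - accuracy(retrained, remove_top_k(X_te, rank_fn, k, fill), y_te)


rng = random.Random(0)
train, test = make_data(400, rng), make_data(200, rng)
model = fit_centroids(*train)
base_acc = accuracy(model, *test)
fill = [sum(col) / len(col) for col in zip(*train[0])]  # per-feature training mean

drop_explained = roar_drop(lambda x: centroid_ranking(model, x), 1, train, test, base_acc, fill)
drop_random = roar_drop(lambda x: random_ranking(rng, x), 1, train, test, base_acc, fill)
print(f"accuracy drop at k=1: explanation={drop_explained:.2f}  random={drop_random:.2f}")
```

Removing the single feature ranked first by the model-aware ranking strips the signal from every sample, so the retrained model falls to roughly chance. A random ranking hits the informative feature only about a quarter of the time, so its accuracy drop is much smaller.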


Robustness and Stability

Lipschitz Continuity

A robust explainer should satisfy approximate Lipschitz continuity: similar inputs should produce similar explanations. If two inputs $x$ and $x'$ are close:

$$\|\phi(x) - \phi(x')\|_2 \leq L \cdot \|x - x'\|_2$$

A large $L$ (Lipschitz constant) means the explainer is sensitive to small input changes - the explanation can change dramatically for nearly identical inputs. This is problematic: if a user can shift one feature by a tiny amount and get a completely different explanation, the explanation cannot be trusted.

Practical measurement: Sample pairs of inputs $(x_i, x_j)$ with $\|x_i - x_j\|_2 < \epsilon$, compute explanations for both, and estimate:

$$\hat{L} = \max_{i,j : \|x_i - x_j\|_2 < \epsilon} \frac{\|\phi(x_i) - \phi(x_j)\|_2}{\|x_i - x_j\|_2}$$

LIME tends to have high Lipschitz estimates (unstable) because it fits its surrogate on random samples, so even repeated runs on the same input can disagree. SHAP tends to be more stable. TreeSHAP is deterministic: the same input always produces the same explanation, so it has zero run-to-run variance. That is not the same as $L = 0$; its explanations still change across different inputs and can jump when an input crosses a tree split threshold.
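A minimal sketch of the estimate follows, under stated assumptions: the explainer is the exact attribution of a hypothetical linear model, and neighbours are generated by stepping a random distance below $\epsilon$ away from each sampled point rather than searching the dataset for close pairs. For this explainer the true constant is $\max_j |w_j|$, which gives a bound to check the estimate against.

```python
import math
import random

WEIGHTS = [3.0, -1.0, 0.5]  # illustrative linear model


def explainer(x):
    """Exact attribution of the linear model: phi_j = w_j * x_j."""
    return [w * xi for w, xi in zip(WEIGHTS, x)]


def l2(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def lipschitz_estimate(explain, points, rng, epsilon=0.05, n_pairs=200):
    """Largest observed ratio ||phi(x) - phi(x')|| / ||x - x'|| over nearby pairs."""
    worst = 0.0
    for _ in range(n_pairs):
        x = rng.choice(points)
        step = [rng.gauss(0.0, 1.0) for _ in x]
        step_norm = math.sqrt(sum(s * s for s in step))
        radius = rng.uniform(0.1 * epsilon, epsilon)  # so ||x - x'|| < epsilon
        x_near = [xi + radius * s / step_norm for xi, s in zip(x, step)]
        worst = max(worst, l2(explain(x), explain(x_near)) / radius)
    return worst


rng = random.Random(0)
points = [[rng.gauss(0.0, 1.0) for _ in WEIGHTS] for _ in range(50)]
l_hat = lipschitz_estimate(explainer, points, rng)
print(f"estimated L = {l_hat:.3f}, analytic bound = {max(abs(w) for w in WEIGHTS)}")
```

The estimate is a lower bound on the true constant, since it only sees the sampled pairs, so here it can approach but not exceed 3.0. Synthetic neighbours guarantee that every pair lies within $\epsilon$, at the cost of possibly stepping slightly off the data manifold.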

Perturbation Test

For each test input $x$, add small noise $\delta \sim \mathcal{N}(0, \sigma^2 I)$ and compare explanations:

$$\text{StabilityScore}(x) = 1 - \frac{\mathbb{E}_\delta[\|\phi(x) - \phi(x+\delta)\|_2]}{\|\phi(x)\|_2 + \epsilon}$$

StabilityScore close to 1 means the explanation is robust to noise. Near 0 means the expected change in the explanation is as large as the explanation itself. Here $\epsilon$ is a small constant that prevents division by zero.
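The score is straightforward to compute by Monte Carlo. The sketch below contrasts a deterministic explainer with a hypothetical sampling-based one that adds fresh noise on every call, a stand-in for the run-to-run variance of perturbation-based methods. The model, the noise levels, and all function names are illustrative assumptions.

```python
import math
import random

WEIGHTS = [3.0, -1.0, 0.5]  # illustrative linear model


def deterministic_explainer(x):
    return [w * xi for w, xi in zip(WEIGHTS, x)]


def make_sampling_explainer(rng, sampling_std=0.5):
    """Same attribution plus fresh sampling noise on every call."""
    def explain(x):
        return [w * xi + rng.gauss(0.0, sampling_std) for w, xi in zip(WEIGHTS, x)]
    return explain


def norm(v):
    return math.sqrt(sum(a * a for a in v))


def stability_score(explain, x, rng, sigma=0.05, n_draws=100, eps=1e-8):
    """1 - E[||phi(x) - phi(x + delta)||] / (||phi(x)|| + eps)."""
    phi = explain(x)
    total_change = 0.0
    for _ in range(n_draws):
        x_noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        phi_noisy = explain(x_noisy)
        total_change += norm([a - b for a, b in zip(phi, phi_noisy)])
    return 1.0 - (total_change / n_draws) / (norm(phi) + eps)


rng = random.Random(0)
x = [1.0, 1.0, 1.0]
stable = stability_score(deterministic_explainer, x, rng)
unstable = stability_score(make_sampling_explainer(rng), x, rng)
print(f"deterministic: {stable:.3f}   sampling-based: {unstable:.3f}")
```

The deterministic explainer changes only as much as the input noise forces it to, so its score stays close to 1. The sampling-based explainer scores visibly lower even though both describe the same model.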


Sanity Checks (Adebayo et al. 2018)

Julius Adebayo and colleagues (2018) published "Sanity Checks for Saliency Maps" - arguably the most important evaluation paper for explanation methods. Their insight: if an explanation method is truly reflecting the model's computation, then:

  1. Model randomization test: Scramble all model weights (reset to random). The model now predicts at chance. A faithful explanation method should produce completely different explanations - because the model is doing something completely different. An explanation method that produces similar-looking explanations for the original and randomized model is not faithful to the model.

  2. Data randomization test (label randomization): Retrain the model on randomly shuffled labels (no meaningful relationship between inputs and labels). The model has learned a meaningless function. An explanation method should produce explanations that look different from the explanation of the model trained on true labels.

What they found: Plain gradients and Grad-CAM pass the checks - their explanations change dramatically when the model is randomized. Guided Backpropagation and Guided Grad-CAM fail - their maps stay largely unchanged and mostly reflect input structure, behaving much like an edge detector. Methods that multiply by the input (Gradient * Input, Integrated Gradients) can keep visible input structure after randomization, so visual inspection alone can overstate how much they depend on the model. Attention maps were not part of the original study, but the same model randomization test applies to them, and failing it is exactly what the radiologist scenario illustrated.

The cascade randomization test: Instead of randomizing all layers at once, randomize one layer at a time from top to bottom. Plot how the explanation changes as each layer is randomized. A faithful method's explanation should degrade progressively as more layers are randomized.
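The model randomization test reduces to comparing explanations before and after the weights are replaced, for example with a rank correlation. The sketch below is a toy version under stated assumptions: a linear model whose "trained" weights are themselves random stand-ins, one model-aware explainer (gradient times input), and one deliberately broken explainer that looks only at the input. All names are hypothetical.

```python
import random

N_FEATURES = 20


def grad_times_input(weights, x):
    """For a linear model the gradient is the weight vector itself."""
    return [w * xi for w, xi in zip(weights, x)]


def input_only_saliency(weights, x):
    """Ignores the model entirely, like an edge detector on an image."""
    return [abs(xi) for xi in x]


def ranks(values):
    order = sorted(range(len(values)), key=lambda j: values[j])
    result = [0] * len(values)
    for rank, j in enumerate(order):
        result[j] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation (assumes no ties, true for continuous values)."""
    ra, rb = ranks(a), ranks(b)
    mean = (len(a) - 1) / 2
    cov = sum((p - mean) * (q - mean) for p, q in zip(ra, rb))
    var = sum((p - mean) ** 2 for p in ra)
    return cov / var


def randomization_similarity(explain, rng, n_trials=50):
    """Mean rank correlation between explanations of a model and of its
    weight-randomized copy. Near 1 means the method ignores the model."""
    total = 0.0
    for _ in range(n_trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(N_FEATURES)]
        trained = [rng.gauss(0.0, 1.0) for _ in range(N_FEATURES)]     # stand-in
        randomized = [rng.gauss(0.0, 1.0) for _ in range(N_FEATURES)]  # re-initialized
        total += spearman(explain(trained, x), explain(randomized, x))
    return total / n_trials


rng = random.Random(0)
sim_grad = randomization_similarity(grad_times_input, rng)
sim_input_only = randomization_similarity(input_only_saliency, rng)
print(f"gradient*input: {sim_grad:.2f}   input-only saliency: {sim_input_only:.2f}")
```

A method that passes the check shows a correlation near zero after randomization; the input-only method stays at 1.0 because nothing it computes depends on the weights.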


Human Evaluation: Doshi-Velez and Kim (2017) Taxonomy

Finale Doshi-Velez and Been Kim (2017) proposed a three-tier taxonomy for human evaluation of explanations:

Tier 1: Application-Grounded Evaluation

Evaluate the explanation in the real-world application context. Have domain experts use the model with and without explanations to perform real tasks. Measure task performance (accuracy, speed, error rate) as the primary outcome.

Example: Give radiologists access to the chest X-ray AI with explanation A vs explanation B vs no explanation. Measure diagnostic accuracy over 500 cases. The explanation that most improves diagnostic accuracy is best in this tier.

Gold standard - but expensive. Requires domain experts, real tasks, and often IRB approval for medical contexts. Results are specific to the application and may not generalize.

Tier 2: Human-Grounded Evaluation

Evaluate using humans (not domain experts) on simplified tasks that proxy the real task. Faster and cheaper than application-grounded evaluation, but less externally valid.

Simulatability task (most common): Show a human the explanation (but not the model). Ask them to predict what the model would output for a new input. Measure agreement between human predictions and model predictions. An explanation that allows humans to accurately simulate the model's behavior is interpretable.

Forward simulation: "Given this explanation and these new feature values, what would the model predict? Circle: High, Medium, Low."

Trust calibration task: Show the human both the model prediction and the explanation. Show them a new case and ask: "Would you follow the model's recommendation or override it?" Then show them the true outcome. Measure whether explanations help humans correctly identify when to trust and when to override the model.

Tier 3: Functionally-Grounded Evaluation

Evaluate without humans at all, using a proxy metric (formal definition of interpretability) to substitute for human studies. The metrics described in this lesson (faithfulness, robustness, ROAR) are all functionally-grounded. Cheapest to run, but validity depends on how well the proxy metric correlates with actual human utility.


The ERASER Benchmark (DeYoung et al. 2020)

For NLP models, Jay DeYoung and colleagues (2020) introduced ERASER (Evaluating Rationales and Simple English Reasoning), a benchmark for evaluating natural language explanations (rationales - spans of text that justify a prediction).

Task: Given an NLP model that classifies text (sentiment analysis, fact verification, NLI), can the model's selected rationale (highlighted text span) justify the prediction?

Two evaluation axes:

  1. Sufficiency: Does the rationale alone (without the rest of the document) produce the same prediction as the full document? $\text{Sufficiency} = \text{score}(f(\text{full doc}), \hat{y}) - \text{score}(f(\text{rationale}), \hat{y})$. Values near zero (or negative) indicate the rationale is sufficient: the model is about as confident on the rationale alone as on the full document.

  2. Comprehensiveness: Does removing the rationale from the full document harm the prediction more than removing a random span? $\text{Comprehensiveness} = \text{score}(f(\text{full doc}), \hat{y}) - \text{score}(f(\text{doc} \setminus \text{rationale}), \hat{y})$. Higher values indicate the rationale was genuinely necessary.

ERASER evaluates seven NLP datasets (MultiRC, FEVER, e-SNLI, CoS-E, etc.) and benchmarks methods including attention, LIME-based rationales, and trained rationale extractors.
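The two rationale metrics can be illustrated with a toy bag-of-words sentiment scorer. The word weights, the example sentence, and the function names are hypothetical, and the sketch takes both metrics as the full-document score minus the score on the modified input, which is the sign convention assumed here for the ERASER paper.

```python
import math

# Hypothetical word weights standing in for a trained sentiment classifier.
WORD_WEIGHTS = {"brilliant": 2.0, "moving": 1.5, "dull": -2.0, "slow": -1.0}


def prob_positive(tokens):
    z = sum(WORD_WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-z))


def score(tokens, label):
    """Model probability assigned to `label`."""
    p = prob_positive(tokens)
    return p if label == "positive" else 1.0 - p


def rationale_metrics(tokens, rationale_idx, label):
    """Return (sufficiency, comprehensiveness) for a rationale given as token indices."""
    rationale = [t for i, t in enumerate(tokens) if i in rationale_idx]
    remainder = [t for i, t in enumerate(tokens) if i not in rationale_idx]
    full = score(tokens, label)
    sufficiency = full - score(rationale, label)        # near zero or below: sufficient
    comprehensiveness = full - score(remainder, label)  # high: necessary
    return sufficiency, comprehensiveness


doc = "the film was brilliant and moving despite a slow start".split()
label = "positive" if prob_positive(doc) >= 0.5 else "negative"

good_suff, good_comp = rationale_metrics(doc, {3, 5}, label)  # "brilliant", "moving"
bad_suff, bad_comp = rationale_metrics(doc, {0, 1}, label)    # "the", "film"
print(f"good rationale: sufficiency={good_suff:.3f}  comprehensiveness={good_comp:.3f}")
print(f"bad rationale:  sufficiency={bad_suff:.3f}  comprehensiveness={bad_comp:.3f}")
```

The rationale made of the two sentiment-bearing words reproduces the prediction on its own and wrecks it when removed. The rationale made of function words does neither, so its comprehensiveness is zero and its sufficiency gap is large.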


Full Python Implementation: Explanation Evaluator

import warnings
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

import numpy as np


@dataclass
class FaithfulnessResult:
    """Results from faithfulness evaluation."""
    sufficiency_k1: float
    sufficiency_k3: float
    sufficiency_k5: float
    comprehensiveness_k1: float
    comprehensiveness_k3: float
    comprehensiveness_k5: float
    deletion_auc: float
    insertion_auc: float
    faithfulness_score: float  # composite: insertion_auc - deletion_auc


@dataclass
class RobustnessResult:
    """Results from robustness/stability evaluation."""
    mean_stability_score: float
    p10_stability_score: float
    p90_stability_score: float
    lipschitz_estimate: float
    sanity_check_model_rand: float  # relative explanation change after randomization
    passed_sanity_check: Optional[bool]  # None when the sanity check was not run


@dataclass
class EvaluationReport:
    """Complete evaluation report for an explanation method."""
    method_name: str
    n_samples: int
    faithfulness: FaithfulnessResult
    robustness: RobustnessResult
    summary: str


class ExplanationEvaluator:
    """
    Quantitative evaluation pipeline for any explanation method.

    Evaluates:
    1. Faithfulness: sufficiency, comprehensiveness, deletion/insertion game
    2. Robustness: Lipschitz estimate, perturbation stability
    3. Sanity check: randomization test (Adebayo et al. 2018)

    Usage:
        evaluator = ExplanationEvaluator(model, X_test, baseline="mean")
        report = evaluator.evaluate(
            explainer_fn=shap_explainer,
            method_name="TreeSHAP",
            n_samples=200,
        )
    """

    def __init__(
        self,
        model,
        X: np.ndarray,
        baseline: str = "mean",  # "mean", "zero", "median"
        task: str = "binary_classification",  # or "regression"
    ):
        self.model = model
        self.X = X
        self.task = task

        # Compute baseline (feature marginals)
        if baseline == "mean":
            self.baseline = X.mean(axis=0)
        elif baseline == "zero":
            self.baseline = np.zeros(X.shape[1])
        elif baseline == "median":
            self.baseline = np.median(X, axis=0)
        else:
            raise ValueError(f"Unknown baseline: {baseline}")

    def _predict(self, X: np.ndarray) -> np.ndarray:
        """Return scalar prediction (probability for class 1 or regression output)."""
        if self.task == "binary_classification":
            return self.model.predict_proba(X)[:, 1]
        else:
            return self.model.predict(X)

    def _mask_features(
        self,
        x: np.ndarray,
        feature_indices: List[int],
        mode: str = "remove",  # "remove" = replace with baseline, "keep" = keep only these
    ) -> np.ndarray:
        """Mask features either by removing selected or keeping only selected."""
        if mode == "remove":
            # astype returns a float copy, so baseline values are not truncated
            x_masked = x.astype(float)
            x_masked[feature_indices] = self.baseline[feature_indices]
        elif mode == "keep":
            x_masked = self.baseline.astype(float)
            x_masked[feature_indices] = x[feature_indices]
        else:
            raise ValueError(f"Unknown mode: {mode}")
        return x_masked

    def _rank_features(self, shap_values: np.ndarray) -> List[int]:
        """Return feature indices sorted by absolute SHAP value (most to least important)."""
        return list(np.argsort(np.abs(shap_values))[::-1])

    # ── FAITHFULNESS ────────────────────────────────────────────────────────────

    def sufficiency(
        self,
        x: np.ndarray,
        shap_values: np.ndarray,
        k: int,
    ) -> float:
        """
        Sufficiency at k: how much does the prediction change when we keep
        only the top-k features (remove all others)?
        Lower = more sufficient (top-k features capture the prediction well).
        """
        ranked = self._rank_features(shap_values)
        top_k = ranked[:k]
        x_topk = self._mask_features(x, top_k, mode="keep")

        pred_full = self._predict(x.reshape(1, -1))[0]
        pred_topk = self._predict(x_topk.reshape(1, -1))[0]
        return abs(pred_full - pred_topk)

    def comprehensiveness(
        self,
        x: np.ndarray,
        shap_values: np.ndarray,
        k: int,
    ) -> float:
        """
        Comprehensiveness at k: how much does the prediction change when we
        remove the top-k features?
        Higher = more comprehensive (top-k features are genuinely important).
        """
        ranked = self._rank_features(shap_values)
        top_k = ranked[:k]
        x_no_topk = self._mask_features(x, top_k, mode="remove")

        pred_full = self._predict(x.reshape(1, -1))[0]
        pred_no_topk = self._predict(x_no_topk.reshape(1, -1))[0]
        return abs(pred_full - pred_no_topk)

    def deletion_insertion_auc(
        self,
        x: np.ndarray,
        shap_values: np.ndarray,
    ) -> Tuple[float, float]:
        """
        Compute deletion AUC and insertion AUC.
        Deletion: remove features sequentially from most to least important.
        Insertion: add features sequentially from most to least important.

        Lower deletion AUC = more faithful.
        Higher insertion AUC = more faithful.
        """
        n_features = len(x)
        ranked = self._rank_features(shap_values)

        # Track the score of the class predicted on the full input, so that
        # "deleting important features lowers the score" also holds for
        # samples the model assigns to class 0.
        pred_full = self._predict(x.reshape(1, -1))[0]
        flip = self.task == "binary_classification" and pred_full < 0.5

        def score(x_masked: np.ndarray) -> float:
            p = self._predict(x_masked.reshape(1, -1))[0]
            return 1.0 - p if flip else p

        deletion_preds = []
        insertion_preds = []

        for k in range(1, n_features + 1):
            top_k = ranked[:k]

            # Deletion: start with full, progressively remove
            deletion_preds.append(score(self._mask_features(x, top_k, mode="remove")))

            # Insertion: start with baseline, progressively add
            insertion_preds.append(score(self._mask_features(x, top_k, mode="keep")))

        deletion_auc = float(np.mean(deletion_preds))
        insertion_auc = float(np.mean(insertion_preds))
        return deletion_auc, insertion_auc

    # ── ROBUSTNESS ───────────────────────────────────────────────────────────────

    def stability_score(
        self,
        x: np.ndarray,
        explainer_fn: Callable,
        n_perturbations: int = 20,
        noise_std: float = 0.05,
    ) -> float:
        """
        Stability: compute explanation for x, then for x + small noise.
        StabilityScore = 1 - normalized explanation change.
        Higher = more stable.
        """
        shap_original = explainer_fn(x)
        scale = np.linalg.norm(shap_original) + 1e-8
        changes = []

        for _ in range(n_perturbations):
            noise = np.random.normal(0, noise_std, size=x.shape)
            shap_perturbed = explainer_fn(x + noise)
            changes.append(np.linalg.norm(shap_original - shap_perturbed) / scale)

        mean_change = float(np.mean(changes))
        return max(0.0, 1.0 - mean_change)

    def lipschitz_estimate(
        self,
        explainer_fn: Callable,
        n_pairs: int = 100,
        epsilon: float = 0.1,
    ) -> float:
        """
        Estimate the Lipschitz constant of the explainer from pairs with
        ||x_i - x_j|| < epsilon.

        Two random rows of X are almost never that close once there are more
        than a few features, so each pair is a sampled row plus a synthetic
        neighbour at a random distance below epsilon.
        """
        ratios = []

        for _ in range(n_pairs):
            x_i = self.X[np.random.randint(len(self.X))]
            direction = np.random.normal(size=x_i.shape)
            direction /= np.linalg.norm(direction) + 1e-12
            radius = np.random.uniform(0.1 * epsilon, epsilon)
            x_j = x_i + radius * direction

            expl_dist = np.linalg.norm(explainer_fn(x_i) - explainer_fn(x_j))
            ratios.append(expl_dist / radius)

        # 95th percentile instead of the max: less sensitive to one outlier pair
        return float(np.percentile(ratios, 95)) if ratios else 0.0

    # ── SANITY CHECK ─────────────────────────────────────────────────────────────

    def sanity_check_model_randomization(
        self,
        x: np.ndarray,
        original_shap: np.ndarray,
        explainer_fn_factory: Callable,  # factory(model) -> explainer_fn
    ) -> Tuple[float, bool]:
        """
        Randomization sanity check in the spirit of Adebayo et al. (2018).

        scikit-learn estimators have no generic "reset the weights" operation,
        so this uses the data (label) randomization variant:

        1. Refit a clone of the model on random labels
        2. Compute the explanation using the refitted model
        3. Measure how much the explanation changes

        A faithful method should show large explanation change.
        Returns (relative_change, passed_check) where passed_check = True
        if the explanation changed significantly (>50% relative change).

        Note: relies on sklearn.base.clone, so the model must be a
        scikit-learn estimator. The random labels are binary, matching the
        "binary_classification" task.
        """
        try:
            from sklearn.base import clone

            # Clone and fit to scrambled labels
            scrambled_model = clone(self.model)
            scrambled_labels = np.random.randint(0, 2, size=len(self.X))
            scrambled_model.fit(self.X, scrambled_labels)

            # Compute explanation with scrambled model
            scrambled_explainer_fn = explainer_fn_factory(scrambled_model)
            scrambled_shap = scrambled_explainer_fn(x)

            # Measure relative change
            original_norm = np.linalg.norm(original_shap)
            change = np.linalg.norm(original_shap - scrambled_shap)
            relative_change = change / (original_norm + 1e-8)

            # Pass sanity check if explanation changed by > 50%
            passed = bool(relative_change > 0.50)
            return float(relative_change), passed

        except Exception as e:
            warnings.warn(f"Sanity check could not be run: {e}")
            return 0.0, False

    # ── FULL EVALUATION PIPELINE ─────────────────────────────────────────────────

    def evaluate(
        self,
        explainer_fn: Callable,
        method_name: str,
        n_samples: int = 100,
        run_sanity_check: bool = False,
        explainer_fn_factory: Optional[Callable] = None,
        top_ks: Tuple[int, ...] = (1, 3, 5),
    ) -> EvaluationReport:
        """
        Full evaluation pipeline.

        explainer_fn: fn(x: np.ndarray) -> shap_values: np.ndarray
            Returns SHAP values for a single sample.
        explainer_fn_factory: fn(model) -> explainer_fn
            Only needed for sanity check.
        """
        sample_indices = np.random.choice(
            len(self.X), size=min(n_samples, len(self.X)), replace=False
        )

        # Faithfulness metrics
        suffix_k = {k: [] for k in top_ks}
        compre_k = {k: [] for k in top_ks}
        del_aucs = []
        ins_aucs = []
        stability_scores = []

        print(f"Evaluating {method_name} on {len(sample_indices)} samples...")
        for idx_num, idx in enumerate(sample_indices):
            x = self.X[idx]
            try:
                shap_vals = explainer_fn(x)
            except Exception as e:
                warnings.warn(f"Explanation failed for sample {idx}: {e}")
                continue

            # Sufficiency and comprehensiveness
            for k in top_ks:
                if k <= len(shap_vals):
                    suffix_k[k].append(self.sufficiency(x, shap_vals, k))
                    compre_k[k].append(self.comprehensiveness(x, shap_vals, k))

            # Deletion/insertion AUC (expensive: skip for >50 features)
            if self.X.shape[1] <= 50:
                del_auc, ins_auc = self.deletion_insertion_auc(x, shap_vals)
                del_aucs.append(del_auc)
                ins_aucs.append(ins_auc)

            # Stability (only every 5th sample to save time)
            if idx_num % 5 == 0:
                stab = self.stability_score(x, explainer_fn)
                stability_scores.append(stab)

            if idx_num % 20 == 0:
                print(f"  Progress: {idx_num}/{len(sample_indices)}")

        # Sanity check (optional - computationally expensive)
        sanity_change = 0.0
        sanity_passed = None  # stays None unless the check is actually run
        if run_sanity_check and explainer_fn_factory is not None:
            x_test = self.X[sample_indices[0]]
            shap_test = explainer_fn(x_test)
            sanity_change, sanity_passed = self.sanity_check_model_randomization(
                x_test, shap_test, explainer_fn_factory
            )

        # Lipschitz estimate
        lipschitz = self.lipschitz_estimate(explainer_fn, n_pairs=50, epsilon=0.15)

        def mean_or_zero(values) -> float:
            """Mean of a list, or 0.0 when nothing was collected."""
            return float(np.mean(values)) if len(values) else 0.0

        faithfulness = FaithfulnessResult(
            sufficiency_k1=mean_or_zero(suffix_k.get(1, [])),
            sufficiency_k3=mean_or_zero(suffix_k.get(3, [])),
            sufficiency_k5=mean_or_zero(suffix_k.get(5, [])),
            comprehensiveness_k1=mean_or_zero(compre_k.get(1, [])),
            comprehensiveness_k3=mean_or_zero(compre_k.get(3, [])),
            comprehensiveness_k5=mean_or_zero(compre_k.get(5, [])),
            deletion_auc=mean_or_zero(del_aucs),
            insertion_auc=mean_or_zero(ins_aucs),
            faithfulness_score=mean_or_zero(ins_aucs) - mean_or_zero(del_aucs),
        )

        robustness = RobustnessResult(
            mean_stability_score=mean_or_zero(stability_scores),
            p10_stability_score=float(np.percentile(stability_scores, 10)) if stability_scores else 0.0,
            p90_stability_score=float(np.percentile(stability_scores, 90)) if stability_scores else 0.0,
            lipschitz_estimate=lipschitz,
            sanity_check_model_rand=sanity_change,
            passed_sanity_check=sanity_passed,
        )

        summary = (
            f"{method_name}: faithfulness={faithfulness.faithfulness_score:.3f}, "
            f"sufficiency@3={faithfulness.sufficiency_k3:.3f}, "
            f"comprehensiveness@3={faithfulness.comprehensiveness_k3:.3f}, "
            f"stability={robustness.mean_stability_score:.3f}, "
            f"Lipschitz={robustness.lipschitz_estimate:.3f}, "
            f"sanity_check_passed={robustness.passed_sanity_check}"
        )

        print(f"\nResult: {summary}")
        return EvaluationReport(
            method_name=method_name,
            n_samples=len(sample_indices),
            faithfulness=faithfulness,
            robustness=robustness,
            summary=summary,
        )


# ─── DEMO ─────────────────────────────────────────────────────────────────────

def run_evaluation_demo():
    import shap
    from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    np.random.seed(42)

    # Dataset
    X, y = make_classification(
        n_samples=2000, n_features=12, n_informative=6,
        n_redundant=3, random_state=42
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    # Train two models
    gbm = GradientBoostingClassifier(n_estimators=200, max_depth=4, random_state=42)
    gbm.fit(X_train, y_train)
    print(f"GBM accuracy: {gbm.score(X_test, y_test):.4f}")

    rf = RandomForestClassifier(n_estimators=200, max_depth=6, random_state=42)
    rf.fit(X_train, y_train)
    print(f"RF accuracy: {rf.score(X_test, y_test):.4f}")

    def positive_class_shap(explainer, x: np.ndarray) -> np.ndarray:
        """Return class-1 SHAP values for one sample as a 1-D array."""
        vals = explainer.shap_values(x.reshape(1, -1))
        if isinstance(vals, list):   # older shap releases: one array per class
            vals = vals[1]
        vals = np.asarray(vals)
        if vals.ndim == 3:           # newer releases: (n_samples, n_features, n_classes)
            vals = vals[:, :, 1]
        return vals[0]

    def make_tree_explainer_fn(model) -> Callable:
        """Build the TreeExplainer once and reuse it for every sample."""
        explainer = shap.TreeExplainer(model)
        return lambda x: positive_class_shap(explainer, x)

    # Build TreeSHAP explainer functions
    gbm_explainer_fn = make_tree_explainer_fn(gbm)
    rf_explainer_fn = make_tree_explainer_fn(rf)

    # Feature importance explainer (baseline comparison - less faithful)
    def feature_importance_fn(x: np.ndarray) -> np.ndarray:
        """Use global feature importances as explanation - not faithful per-sample."""
        importances = gbm.feature_importances_
        return importances * np.sign(x - X_train.mean(axis=0))

    # Build evaluator
    evaluator = ExplanationEvaluator(
        model=gbm, X=X_test, baseline="mean", task="binary_classification"
    )

    # Evaluate TreeSHAP (GBM)
    print("\n" + "=" * 70)
    print("EVALUATING: TreeSHAP on GBM")
    print("=" * 70)
    report_shap = evaluator.evaluate(
        explainer_fn=gbm_explainer_fn,
        method_name="TreeSHAP-GBM",
        n_samples=100,
        run_sanity_check=True,
        explainer_fn_factory=make_tree_explainer_fn,
    )

    # Evaluate Feature Importance (baseline - expected to be less faithful)
    print("\n" + "=" * 70)
    print("EVALUATING: Global Feature Importance (baseline)")
    print("=" * 70)
    report_fi = evaluator.evaluate(
        explainer_fn=feature_importance_fn,
        method_name="GlobalFeatureImportance",
        n_samples=100,
        run_sanity_check=False,
    )

    # Compare
    print("\n" + "=" * 70)
    print("COMPARISON TABLE")
    print("=" * 70)
    print(f"{'Metric':<38} {'TreeSHAP':>12} {'GlobalFI':>12} {'Better':<10}")
    print("-" * 76)

    metrics = [
        ("Sufficiency@3 (lower=better)",
         report_shap.faithfulness.sufficiency_k3,
         report_fi.faithfulness.sufficiency_k3, "lower"),
        ("Comprehensiveness@3 (higher=better)",
         report_shap.faithfulness.comprehensiveness_k3,
         report_fi.faithfulness.comprehensiveness_k3, "higher"),
        ("Faithfulness Score (higher=better)",
         report_shap.faithfulness.faithfulness_score,
         report_fi.faithfulness.faithfulness_score, "higher"),
        ("Stability (higher=better)",
         report_shap.robustness.mean_stability_score,
         report_fi.robustness.mean_stability_score, "higher"),
        ("Lipschitz (lower=better)",
         report_shap.robustness.lipschitz_estimate,
         report_fi.robustness.lipschitz_estimate, "lower"),
    ]

    for name, shap_val, fi_val, better_dir in metrics:
        if better_dir == "lower":
            winner = "TreeSHAP" if shap_val < fi_val else "GlobalFI"
        else:
            winner = "TreeSHAP" if shap_val > fi_val else "GlobalFI"
        print(f"{name:<38} {shap_val:>12.4f} {fi_val:>12.4f} {winner:<10}")

    return report_shap, report_fi


if __name__ == "__main__":
    run_evaluation_demo()

Faithfulness vs Plausibility: The Core Distinction

The radiologist scenario illustrates the most important conceptual distinction in explanation evaluation:

Faithfulness: Does the explanation reflect what the model actually uses to make the prediction?

Plausibility: Does the explanation look reasonable to a human familiar with the domain?

These can come apart dramatically. An attention-based explanation might highlight clinically relevant regions of an X-ray (plausible) but be generated by a model that actually makes decisions based on scanner artifacts (unfaithful). A SHAP-based explanation might highlight features that the model genuinely uses (faithful) but those features might be spurious correlates (e.g., hospital ID encoded in DICOM metadata) rather than clinically meaningful signals (implausible from a medical perspective).

The danger of optimizing for plausibility alone: you can build explanation systems that humans love but that provide no reliable information about model behavior. A sufficiently sophisticated explanation system can always generate a plausible-looking story for any prediction, even a random one.

This does not mean plausibility is unimportant. An explanation that is faithful but incomprehensible fails the human utility criterion. But faithfulness must be the baseline requirement. Plausibility is then optimized within the constraint of faithfulness, not instead of it.

:::tip

When comparing explanation methods, always run sanity checks before human studies. A plausible-looking explanation that fails the model randomization test provides no reliable information about model behavior, regardless of how much radiologists like it. Run Adebayo et al. (2018) sanity checks first; human evaluation second.

:::
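To make the model randomization test concrete, here is a minimal sketch on a toy linear model. Everything in it is an illustrative assumption rather than the evaluator used elsewhere in this lesson: the weights are made up, `grad_times_input` stands in for a model-dependent explanation, and `input_magnitude` stands in for an explanation that only echoes the input.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation (Pearson correlation of ranks; assumes no ties)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

def grad_times_input(w, x):
    """Model-dependent attribution for a linear logit w.x: gradient (w) times input."""
    return w * x

def input_magnitude(w, x):
    """Deliberately unfaithful 'explanation': ignores the model and echoes the input."""
    return np.abs(x)

def randomization_similarity(explain_fn, w_trained, X, rng, n_random=20):
    """Mean rank similarity between explanations of the trained and randomized models."""
    sims = []
    for _ in range(n_random):
        w_random = rng.normal(size=w_trained.shape)  # stand-in for re-initialized weights
        for x in X:
            original = np.abs(explain_fn(w_trained, x))
            randomized = np.abs(explain_fn(w_random, x))
            sims.append(spearman(original, randomized))
    return float(np.mean(sims))

rng = np.random.default_rng(0)
w_trained = np.array([3.0, -2.0, 0.1, 0.01, 1.5, -0.2, 0.05, 2.5])  # hypothetical weights
X = rng.normal(size=(50, 8))

sim_model_aware = randomization_similarity(grad_times_input, w_trained, X, rng)
sim_input_only = randomization_similarity(input_magnitude, w_trained, X, rng)
print(f"Gradient x Input, similarity after randomization: {sim_model_aware:.2f}")
print(f"Input-only map, similarity after randomization:   {sim_input_only:.2f}")
```

A similarity near 1 means the explanation did not notice that the model was destroyed, which is the failure mode the tip warns about. Note that Gradient × Input still shares the input term with its randomized counterpart, so its similarity is lower but not necessarily near zero.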


Practical Evaluation Checklist

Before deploying any explanation system, verify these 10 items:

  1. Sufficiency@5 < 0.1: Top-5 features preserve at least 90% of the prediction when all other features are masked.
  2. Comprehensiveness@3 > 0.2: Removing the top-3 features causes at least a 0.2 change in prediction probability.
  3. Insertion AUC > Deletion AUC + 0.1: The faithfulness gap is positive and meaningful.
  4. Stability score > 0.8: Explanation changes by less than 20% under noise $\sigma = 0.05$.
  5. Lipschitz estimate < 5.0: Explanation changes no more than 5x as fast as the input.
  6. Sanity check passed: Explanation changes significantly (>50% relative) when model weights are randomized.
  7. Human simulatability > random baseline: Users with the explanation predict model outputs better than without.
  8. ROAR accuracy drop: Removing the method's top-ranked features and retraining causes a larger accuracy drop than removing features chosen by a random ranking.
  9. No protected attribute exposure: Verify that immutable or protected features do not appear in top-k explanations unless they are genuinely model inputs.
  10. Explanation coverage = 100%: Every prediction has an associated explanation stored in the audit trail.
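Items 1 and 2 can be checked with a few lines of code. The sketch below assumes ERASER-style definitions (sufficiency is the probability drop when only the top-k features are kept, comprehensiveness is the drop when they are removed), a toy logistic model with made-up weights, and an all-zeros baseline for masked features. Swap in your own model and baseline.

```python
import numpy as np

def predict_proba(w, b, x):
    """Positive-class probability of a logistic model (stand-in for any classifier)."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def sufficiency(w, b, x, attributions, baseline, k):
    """p(x) - p(only top-k features kept); other features are set to the baseline."""
    top_k = np.argsort(-np.abs(attributions))[:k]
    kept = baseline.copy()
    kept[top_k] = x[top_k]
    return float(predict_proba(w, b, x) - predict_proba(w, b, kept))

def comprehensiveness(w, b, x, attributions, baseline, k):
    """p(x) - p(top-k features removed); removed features are set to the baseline."""
    top_k = np.argsort(-np.abs(attributions))[:k]
    removed = x.copy()
    removed[top_k] = baseline[top_k]
    return float(predict_proba(w, b, x) - predict_proba(w, b, removed))

# Hypothetical model, instance, and baseline (assumptions for illustration only).
w = np.array([2.0, -1.5, 0.1, 0.05, 1.0, 0.02])
b = -1.0
x = np.array([1.5, -1.0, 0.5, 0.3, 1.0, 0.2])
baseline = np.zeros(6)
attributions = w * (x - baseline)  # exact contributions to the logit of a linear model

suff_at_5 = sufficiency(w, b, x, attributions, baseline, k=5)
comp_at_3 = comprehensiveness(w, b, x, attributions, baseline, k=3)
print(f"Sufficiency@5       = {suff_at_5:.4f}  (checklist target: < 0.1)")
print(f"Comprehensiveness@3 = {comp_at_3:.4f}  (checklist target: > 0.2)")
```

In practice these values are averaged over a held-out sample rather than read off a single instance.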

The Accuracy vs Interpretability Pareto Frontier

Complex models (deep networks, large ensembles) are often more accurate but harder to explain faithfully. Simpler models (linear models, shallow trees) are more interpretable but less accurate. The Pareto frontier between accuracy and interpretability is the engineering reality you must navigate.

For any given task, the right position on this frontier depends on the application:

  • High-stakes regulated decisions (credit, employment, medical diagnosis): interpretability may outweigh some accuracy. A linear model that can be fully explained may be preferable to a neural network that cannot.
  • Low-stakes recommendations (product suggestions, playlist curation): accuracy dominates. Users do not need (or want) explanations for every recommendation.
  • Human-in-the-loop systems (radiologist AI assist): need both - high accuracy and sufficient faithfulness for the human to correctly decide when to trust and when to override.

The accuracy gap between a fully interpretable model and a black box is often smaller than assumed. For tabular structured data, a logistic regression with careful feature engineering often comes within 2–3% of an XGBoost model's accuracy, although the gap varies by dataset and is worth measuring rather than assuming. That 2–3% accuracy cost may be the right tradeoff for a credit scoring model operating under regulatory scrutiny.


Common Mistakes

:::danger Mistake 1: Using human approval as the sole evaluation criterion

"The radiologists liked the attention maps" is not a faithfulness evaluation. Human approval is a plausibility test, not a faithfulness test. Humans often approve of explanations that look domain-appropriate regardless of whether they reflect model behavior. Always run quantitative faithfulness metrics (sufficiency, comprehensiveness, sanity checks) before human evaluation. If an explanation fails sanity checks, do not run a human study - you are wasting time evaluating a meaningless explanation.

:::

:::danger Mistake 2: Ignoring the baseline replacement strategy in faithfulness metrics

The deletion/insertion game results are highly sensitive to what you replace masked features with. If you replace with zero, you may be creating out-of-distribution inputs that cause model behavior artifacts. If you replace with mean, correlated features may still carry signal. The best baseline depends on the model and data. For tree models, TreeSHAP handles the baseline correctly internally. For faithfulness games, use the training data mean as default and report sensitivity to baseline choice.

:::
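A small sketch of how much the baseline choice can matter. The model weights, the feature scales, and the instance are all hypothetical. The only point is that the same feature ranking yields very different deletion scores under a zero baseline and under a training-mean baseline when features sit far from zero.

```python
import numpy as np

def predict_proba(w, b, x):
    """Positive-class probability of a logistic model (stand-in for any classifier)."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def comprehensiveness(w, b, x, ranking, baseline, k):
    """Probability drop when the top-k ranked features are replaced by the baseline."""
    masked = x.copy()
    masked[ranking[:k]] = baseline[ranking[:k]]
    return float(predict_proba(w, b, x) - predict_proba(w, b, masked))

rng = np.random.default_rng(0)
# Hypothetical training data whose features are far from zero.
X_train = rng.normal(loc=[40.0, 60.0, 5.0], scale=[10.0, 15.0, 2.0], size=(500, 3))
w = np.array([0.08, 0.03, -0.2])
b = -4.0
x = np.array([55.0, 80.0, 3.0])

mean_baseline = X_train.mean(axis=0)
zero_baseline = np.zeros(3)

# One fixed ranking, so any difference below comes from the baseline alone.
ranking = np.argsort(-np.abs(w * (x - mean_baseline)))

comp_mean = comprehensiveness(w, b, x, ranking, mean_baseline, k=1)
comp_zero = comprehensiveness(w, b, x, ranking, zero_baseline, k=1)
print(f"Comprehensiveness@1, mean baseline: {comp_mean:.3f}")
print(f"Comprehensiveness@1, zero baseline: {comp_zero:.3f}")
```

Setting the first feature to zero pushes the input far outside the training range, so the zero-baseline score mostly measures that out-of-distribution jump. Reporting both numbers side by side is one simple way to report sensitivity to baseline choice.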

:::warning Mistake 3: Conflating comprehensiveness with feature importance

High comprehensiveness means removing the explanation's top features hurts the model. But a feature can be comprehensive (its removal hurts) for the wrong reason - because the model uses a spurious correlation. Comprehensiveness measures faithfulness to the model, not faithfulness to ground truth. A feature can be genuinely important to a biased model's decision without being causally relevant to the outcome you care about.

:::

:::warning Mistake 4: Running ROAR only once instead of across multiple k values

ROAR evaluated at a single $k$ value gives an incomplete picture. The relative ordering of explanation methods can change at different $k$. A method might be most faithful for the top-3 features but less faithful for top-10. Always plot the full ROAR curve across $k \in \{5\%, 10\%, 20\%, 50\%\}$ of features removed and report the AUC of that curve.

:::


YouTube Resources

| Resource | Creator | Focus |
| --- | --- | --- |
| Evaluating Explanations - ICML 2020 Tutorial | Been Kim | Comprehensive evaluation methods, Doshi-Velez framework |
| Sanity Checks for Saliency Maps | Julius Adebayo | The 2018 sanity check paper explained |
| ERASER: Evaluating Rationales in NLP | DeYoung et al. | ERASER benchmark, NLP rationale evaluation |
| Interpretable ML: Human Studies and Evaluation | Christoph Molnar | Practical human evaluation design |
| The ROAR Benchmark for Explanation Methods | Sara Hooker | Remove and Retrain explained |

Interview Q&A

Q1: What is the distinction between faithfulness and plausibility in ML explanations, and why does it matter?

Faithfulness means the explanation accurately reflects what the model actually uses to make its prediction. Plausibility means the explanation looks reasonable to a domain expert, whether or not it reflects model behavior. The distinction matters because these can come apart completely. An attention map might highlight clinically relevant regions of an X-ray (plausible) while being generated by a process unrelated to the model's actual decision (unfaithful). Adebayo et al. (2018) showed that many popular saliency methods pass the human plausibility test but fail the model randomization sanity check - their explanations look similar whether the model weights are meaningful or random. If you optimize for plausibility without checking faithfulness, you can build an explanation system that users love but that provides no reliable information about model behavior. In practice: always establish faithfulness first (sanity checks, sufficiency, comprehensiveness), then optimize for human comprehensibility within the faithfulness constraint.

Q2: Describe the ROAR benchmark. What problem does it solve that the deletion game does not?

ROAR (Remove and Retrain, Hooker et al. 2019) evaluates explanation faithfulness by: (1) using an explanation method to rank features by importance, (2) creating modified training and test datasets where the top-$k\%$ of features (per sample) are replaced with an uninformative value (the per-channel mean in the original paper), (3) retraining a new model on the modified training set, and (4) measuring the accuracy degradation on the modified test set. The key insight: by retraining, ROAR avoids the out-of-distribution problem of the deletion game. In the deletion game, you replace features with a baseline on a model trained on full data - the model sees input patterns it was never trained on, leading to unpredictable behavior that conflates out-of-distribution sensitivity with feature importance. ROAR measures what happens when those features are genuinely absent from training, producing a more reliable signal. The limitation of ROAR: it requires training a new model for every explanation method and every $k$ value, which is computationally expensive. It is a research benchmark tool for method development, not a continuous monitoring tool.
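The procedure can be sketched end to end on synthetic data. This is a simplified illustration under stated assumptions, not the benchmark itself: the model is a ridge least-squares classifier, the explanation under test is a single global ranking by absolute weight (ROAR proper ranks features per sample), removed features are filled with the training mean, and the random-ranking baseline is averaged over 20 permutations.

```python
import numpy as np

def fit_linear(X, y, lam=1.0):
    """Ridge least-squares classifier on +/-1 labels (stand-in for any retrainable model)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ y)

def accuracy(w, X, y):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return float(np.mean(np.sign(Xb @ w) == y))

def roar_curve(ranking, X_tr, y_tr, X_te, y_te, fractions):
    """Retrained test accuracy after removing the top fraction of ranked features."""
    fill = X_tr.mean(axis=0)  # uninformative replacement value
    accs = []
    for frac in fractions:
        removed = ranking[: int(round(frac * X_tr.shape[1]))]
        X_tr_mod, X_te_mod = X_tr.copy(), X_te.copy()
        X_tr_mod[:, removed] = fill[removed]
        X_te_mod[:, removed] = fill[removed]
        accs.append(accuracy(fit_linear(X_tr_mod, y_tr), X_te_mod, y_te))  # retrain each time
    return np.array(accs)

def curve_auc(fractions, accs):
    """Trapezoidal area under the accuracy-vs-fraction-removed curve."""
    f = np.asarray(fractions)
    return float(np.sum((f[1:] - f[:-1]) * (accs[1:] + accs[:-1]) / 2.0))

rng = np.random.default_rng(0)
n, d = 3000, 20
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:5] = [3.0, 2.5, 2.0, 1.5, 1.0]  # only the first five features carry signal
y = np.sign(X @ true_w + rng.normal(scale=0.5, size=n))
X_tr, y_tr, X_te, y_te = X[:2000], y[:2000], X[2000:], y[2000:]

fractions = [0.05, 0.10, 0.20, 0.50]
base_model = fit_linear(X_tr, y_tr)
importance_ranking = np.argsort(-np.abs(base_model[:d]))  # explanation under test
roar_importance = roar_curve(importance_ranking, X_tr, y_tr, X_te, y_te, fractions)
roar_random = np.mean(
    [roar_curve(rng.permutation(d), X_tr, y_tr, X_te, y_te, fractions) for _ in range(20)],
    axis=0,
)
auc_importance = curve_auc(fractions, roar_importance)
auc_random = curve_auc(fractions, roar_random)
print("accuracy, explanation ranking:", np.round(roar_importance, 3))
print("accuracy, random ranking:     ", np.round(roar_random, 3))
print(f"curve AUC: explanation={auc_importance:.3f}, random={auc_random:.3f}")
```

A lower area under the curve means accuracy collapses faster when the method's top features are removed, which is the signature of a faithful ranking. The cost is visible even here: every point on every curve is a fresh training run.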

Q3: How would you design a human evaluation study to measure whether an explanation method helps radiologists correctly decide when to trust or override an AI diagnosis?

This is a trust calibration study - a Tier 1 (application-grounded) evaluation in the Doshi-Velez taxonomy. Design: (1) Select a balanced set of 200 chest X-ray cases: 100 where the AI is correct, 100 where it is wrong, covering both common cases and edge cases. (2) Randomize radiologists into three conditions: AI prediction only (no explanation), AI prediction plus SHAP saliency map, AI prediction plus attention map. (3) Task: for each case, the radiologist sees the X-ray, the AI's prediction, and (in treatment conditions) the explanation. They must decide: "Accept AI recommendation" or "Override with my own judgment." (4) Primary outcome: calibrated trust - the fraction of AI-wrong cases the radiologist correctly overrides, without over-correcting on AI-correct cases. The ideal explanation increases overrides when the AI is wrong and does not decrease agreement when the AI is right. (5) Secondary outcomes: time to decision, confidence ratings, radiologist satisfaction. (6) Statistical analysis: mixed-effects model controlling for radiologist experience, case difficulty, and AI confidence level. IRB approval is required for a study involving medical data and clinical decision-making.
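The primary outcome in step (4) reduces to two rates per study condition. The sketch below shows one way to compute them; the record format, the metric names, and the eight toy records per condition are invented for illustration and are not study results. The mixed-effects analysis from step (6) is not shown.

```python
from collections import defaultdict

def trust_calibration(records):
    """Per-condition override rates, split by whether the AI was right.

    Each record is a tuple (condition, ai_correct, overrode).
    """
    counts = defaultdict(lambda: {"wrong": 0, "wrong_overridden": 0,
                                  "right": 0, "right_overridden": 0})
    for condition, ai_correct, overrode in records:
        c = counts[condition]
        if ai_correct:
            c["right"] += 1
            c["right_overridden"] += int(overrode)
        else:
            c["wrong"] += 1
            c["wrong_overridden"] += int(overrode)

    summary = {}
    for condition, c in counts.items():
        # Overrides on AI-wrong cases are desirable; overrides on AI-correct cases are not.
        appropriate = c["wrong_overridden"] / c["wrong"] if c["wrong"] else float("nan")
        unwarranted = c["right_overridden"] / c["right"] if c["right"] else float("nan")
        summary[condition] = {
            "appropriate_override_rate": appropriate,
            "unwarranted_override_rate": unwarranted,
            "calibration_gap": appropriate - unwarranted,
        }
    return summary

# Toy records (made up): 4 AI-wrong and 4 AI-correct cases per condition.
records = (
    [("no_explanation", False, o) for o in (True, False, False, False)]
    + [("no_explanation", True, o) for o in (True, False, False, False)]
    + [("shap", False, o) for o in (True, True, True, False)]
    + [("shap", True, o) for o in (True, False, False, False)]
)
summary = trust_calibration(records)
for condition, stats in summary.items():
    print(condition, stats)
```

An explanation helps calibration when it raises the appropriate override rate without raising the unwarranted one, so the gap between the two is a convenient single number to compare across conditions.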

Q4: What are the Adebayo et al. (2018) sanity checks, and what do they reveal about attention-based explanations?

Adebayo and colleagues proposed two sanity checks. The model parameter randomization test: replace the trained weights with random values, so the model now predicts at chance, and recompute the saliency map. A faithful explanation method should produce a substantially different explanation, because the model is computing a different function. A method that produces similar-looking explanations for the original and randomized model is capturing input structure rather than model behavior. The data randomization test: retrain the model on randomly permuted labels, so it can only memorize a meaningless mapping. Explanations for the original (meaningful) and label-randomized (meaningless) model should differ. What they found: the paper evaluated saliency methods on image classifiers rather than attention. Guided Backprop and Guided GradCAM were largely insensitive to both randomizations, behaving more like edge detectors than explanations, while plain Gradients and GradCAM passed. Methods that multiply by the input (Gradient × Input, Integrated Gradients) can still look visually similar after randomization because the input term dominates the picture, so compare maps with a quantitative similarity metric rather than by eye. What this implies for attention: the same tests apply directly. Attention maps can partially reflect input structure (edges, textures, salient tokens) rather than the learned decision function, and when that is the case they barely change after weight randomization. The practical implication: before using attention as an explanation in a high-stakes system, run these sanity checks. If attention maps look similar for the original and randomized model, they are measuring something about the input, not the model.

Q5: How do you evaluate explanations for a ranking model (e.g., a document ranking or recommendation system)?

Ranking models are harder to evaluate than classification or regression because the prediction is a ranked list, not a scalar. Several adaptations: (1) Sufficiency/comprehensiveness at the item level: does masking the top-$k$ features of the query + document change the document's rank? Measure rank change (Kendall's tau) rather than prediction change. (2) For listwise explanations (why is item A ranked above item B?), the explanation must account for relative differences between items. Contrastive explanations ("item A ranks higher than item B because it has a higher review score and lower price") are more natural for ranking than individual SHAP values. (3) NDCG degradation: compute NDCG on the original ranking, then rerank using only the top-$k$ explanation features. If the explanation is faithful, the NDCG should degrade minimally (the top-$k$ features capture the ranking signal). (4) For human evaluation, use the forward simulation task: "Given these feature values and this explanation, which item do you think the model would rank higher?" Measure accuracy of human simulation. A faithful explanation should enable humans to accurately predict ranking decisions.
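Adaptation (1) can be sketched with a toy linear ranker. The weights, the 30 synthetic documents, and the importance score (absolute weight times feature spread) are assumptions for illustration. Masked features are replaced by their mean over the candidate list, and Kendall's tau is implemented directly to keep the example dependency-free.

```python
import numpy as np

def kendall_tau(a, b):
    """Kendall's tau-a between two score vectors (assumes no tied scores)."""
    n = len(a)
    concordance = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            concordance += np.sign(a[i] - a[j]) * np.sign(b[i] - b[j])
    return float(concordance / (n * (n - 1) / 2))

rng = np.random.default_rng(0)
docs = rng.normal(size=(30, 6))                    # 30 candidate documents, 6 features
w = np.array([2.0, -1.5, 1.0, 0.1, -0.05, 0.02])   # hypothetical linear ranker
scores = docs @ w

# Explanation under test: global importance = |weight| x feature spread in this list.
importance = np.abs(w) * docs.std(axis=0)
top_k = np.argsort(-importance)[:3]
col_mean = docs.mean(axis=0)

# Sufficiency view: keep only the top-k features, set the rest to the list mean.
only_top = np.tile(col_mean, (len(docs), 1))
only_top[:, top_k] = docs[:, top_k]

# Comprehensiveness view: mask the top-k features, keep the rest.
without_top = docs.copy()
without_top[:, top_k] = col_mean[top_k]

tau_only_top = kendall_tau(scores, only_top @ w)
tau_without_top = kendall_tau(scores, without_top @ w)
print(f"tau(original, top-3 only)    = {tau_only_top:.2f}")
print(f"tau(original, top-3 removed) = {tau_without_top:.2f}")
```

For a faithful explanation the first tau stays close to 1 (the top features reproduce the ranking) and the second falls toward 0 (the ranking collapses without them). The same pattern extends to NDCG once relevance labels are available.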


Key Takeaways

Evaluating explanations requires moving beyond the eyeball test. Faithfulness metrics (sufficiency, comprehensiveness, deletion/insertion AUC) measure whether the explanation identifies what the model actually uses. ROAR validates explanation faithfulness by measuring accuracy degradation when important features are removed and the model is retrained. Robustness metrics (Lipschitz estimate, perturbation stability) measure whether explanations are consistent across similar inputs. The Adebayo sanity checks reveal whether an explanation reflects model computation or input structure - always run these before a human study. Human evaluation (simulatability, trust calibration, application-grounded studies) measures whether explanations help users accomplish real tasks, but human approval is not sufficient proof of faithfulness. The most important conceptual distinction: faithfulness is not the same as plausibility. An explanation can pass the human eyeball test and fail the sanity check. Always establish faithfulness first.

