Counterfactual Explanations - What Would Have to Change for a Different Decision?
Reading time: 48 min | Interview relevance: High - increasingly required in finance, HR, and healthcare; GDPR makes this legally significant | Target roles: ML Engineer, AI Engineer, Data Scientist, Applied Scientist
The Loan Applicant Who Just Wanted to Know What to Do
Maria has been turned down for a mortgage. She makes $72,000 a year, has been at her job for three years, and has a credit score of 648. She is not a high-risk borrower by any common-sense standard. But the bank's automated system rejected her application, and the letter she received says only: "We were unable to approve your application at this time."
She calls the bank. The customer service representative opens the case in the system, which shows a SHAP waterfall chart - feature contributions from the underlying gradient-boosted model. The rep reads the values out loud: "Your income contributed negative 0.3 to the score. Your credit utilization contributed negative 0.6. Your length of credit history contributed negative 0.2." Maria listens politely and then asks: "So what should I do to be approved?"
The representative has no answer. SHAP tells you what contributed to the past decision, not what to change for a different future decision. The attribution values are accurate. They are completely useless for Maria's actual problem.
Counterfactual explanations are the answer to the question SHAP cannot answer. Instead of saying "your credit utilization contributed -0.6 to the score," a counterfactual says: "If your credit utilization were below 28% and your credit score were 670, you would have been approved." That is actionable. That is specific. That gives Maria something she can actually do.
Under the EU GDPR, individuals subject to solely automated decisions that significantly affect them (Article 22) are entitled to "meaningful information about the logic involved" (Articles 13-15). Some legal scholars read this as calling for exactly this kind of actionable explanation - not attribution scores, but a specification of what the person could do differently - and the UK ICO (Information Commissioner's Office) guidance on explaining AI decisions points in a similar direction. The financial sector is moving in this direction: some UK lenders now generate counterfactual explanations automatically for every declined application.
This lesson covers the complete technical landscape: formal optimization objectives, actionability constraints, the four main algorithmic families, causal versus statistical counterfactuals, evaluation metrics, and production implementation including the DiCE library.
Why SHAP Explanations Are Informative but Not Actionable
SHAP values answer: "How much did feature $i$ contribute to this prediction, relative to the average prediction?" This is the attribution question. It decomposes the prediction into additive contributions:
$$f(x) = \phi_0 + \sum_{i=1}^{p} \phi_i$$
where $\phi_0$ is the base value (average prediction) and $\phi_i$ is feature $i$'s contribution. The problem with attribution for decision support: attribution is retrospective. It explains what happened in the past decision. It does not tell you what to change to get a different decision in the future.
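The additive decomposition can be checked by hand on a model simple enough to have closed-form SHAP values. The sketch below is illustrative only: the weights, the background data, and the query point are made up, and it assumes a linear model with independent features, for which each contribution is the weight times the feature's deviation from its mean.

```python
import numpy as np

# Hypothetical linear scoring model f(x) = w . x + b (illustrative values only).
w = np.array([0.5, -2.0, 1.0])
b = 0.1

def f(X):
    return X @ w + b

# Small background dataset standing in for the training data.
background = np.array([
    [1.0, 0.2, 3.0],
    [2.0, 0.4, 1.0],
    [3.0, 0.6, 2.0],
])
x = np.array([2.5, 0.9, 1.5])

# For a linear model with independent features, the SHAP values are
# phi_i = w_i * (x_i - E[x_i]) and the base value is the mean prediction.
phi_0 = f(background).mean()
phi = w * (x - background.mean(axis=0))

# Additivity: base value plus contributions reconstructs the prediction.
reconstructed = phi_0 + phi.sum()
prediction = f(x[None, :])[0]
print("contributions:", phi)
print("base + sum:", reconstructed, "prediction:", prediction)
```

The contributions sum exactly to the prediction, yet nothing in them says how far any feature would have to move to cross a decision threshold, which is the gap counterfactuals fill.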
A person with a 648 credit score who knows "credit score contributed -0.4" now knows credit score was important, but not:
- What credit score threshold would flip the decision?
- Is it necessary to change credit score, or is there another combination that works?
- Would changing credit utilization alone be sufficient?
- Are these changes realistic given the person's financial situation?
- What is the most efficient path (minimum effort) to approval?
Counterfactual explanations are designed to answer exactly these questions.
The Formal Optimization Objective
A counterfactual explanation finds an alternative input $x'$ that minimizes a combined objective:
$$x^{*} = \arg\min_{x'} \; \lambda \, d(x, x') + \ell\big(f(x'), y'\big)$$
subject to constraints:
- $x'_j = x_j$ for all immutable features $j$ (actionability)
- $l_j \le x'_j \le u_j$ for all features $j$ (plausibility bounds)
- $x'_j \in \mathcal{V}_j$ for categorical features $j$ (valid values only)
The two terms in the objective:
- Proximity term $d(x, x')$: minimize distance from original to counterfactual
- Prediction loss $\ell(f(x'), y')$: minimize the loss between counterfactual prediction and target class $y'$
The hyperparameter $\lambda$ controls the tradeoff: large $\lambda$ prioritizes proximity (minimal changes) over prediction validity; small $\lambda$ prioritizes getting the right prediction over staying close to the original.
The distance function $d$ has several standard choices:
L1 (Manhattan) distance: $d(x, x') = \sum_j w_j \, |x_j - x'_j|$ - encourages sparsity (few features change), handles heterogeneous scales via weights $w_j$.
Gower distance: for mixed-type features (continuous + categorical), normalize continuous features by range and use Hamming distance for categorical. Each feature contributes equally regardless of scale:
$$d_G(x, x') = \frac{1}{p} \sum_{j=1}^{p} \delta_j(x_j, x'_j)$$
where $\delta_j = |x_j - x'_j| / R_j$ for continuous features (with $R_j$ the range of feature $j$) and $\delta_j = \mathbb{1}[x_j \neq x'_j]$ for categorical features.
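A minimal sketch of the Gower distance for one pair of rows with both continuous and categorical features. The feature layout (income, credit score, a hypothetical housing-status category) and the ranges are assumptions chosen for illustration.

```python
import numpy as np

def gower_mixed(x, x_prime, is_categorical, ranges):
    """Gower distance between two rows with mixed feature types.

    x, x_prime:      1-D object arrays (numbers or category labels)
    is_categorical:  boolean mask, True where the feature is categorical
    ranges:          max - min per continuous feature (ignored for categorical)
    """
    per_feature = np.zeros(len(x), dtype=float)
    for j in range(len(x)):
        if is_categorical[j]:
            # Hamming component: 0 if equal, 1 if different
            per_feature[j] = 0.0 if x[j] == x_prime[j] else 1.0
        else:
            # Range-normalized absolute difference
            per_feature[j] = abs(float(x[j]) - float(x_prime[j])) / ranges[j]
    return per_feature.mean()

# Hypothetical applicant and counterfactual: [income, credit_score, housing]
x = np.array([72000, 648, "renter"], dtype=object)
x_cf = np.array([72000, 670, "owner"], dtype=object)
is_cat = np.array([False, False, True])
ranges = np.array([180000.0, 550.0, 1.0])

d = gower_mixed(x, x_cf, is_cat, ranges)
print(f"Gower distance: {d:.4f}")
```

Each feature contributes a value in [0, 1], so a category flip counts as much as moving a continuous feature across its entire range. That is worth keeping in mind when categorical changes are cheap for the applicant.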
Actionability and Immutability Constraints
Not all features can be changed. A credit scoring model might use age, national origin, or credit history length - none of which can be changed quickly or at all. A good counterfactual explanation system must encode this.
Immutable features: race, sex, date of birth, national origin, disability status. These must never appear as suggested changes, both for practical reasons (you cannot change them) and legal ones (ECOA, GDPR, and the EU AI Act restrict decisions based on protected characteristics, and suggesting changes to them in explanations implicates the same concerns).
Mutable but directional features: age can only increase (you cannot get younger). A counterfactual suggesting "if your age were 25 instead of 45" is invalid. These are monotonicity constraints: $x'_j \ge x_j$ for features that can only increase.
Mutable with plausibility constraints: income can change, but a counterfactual that asks for an income many times the applicant's current $72,000 is implausible. Plausibility is enforced through range constraints derived from the training data distribution: each counterfactual value must fall between chosen lower and upper percentiles of the training distribution for that feature.
Actionable within time horizon: some changes are achievable within a specified time period. Reducing credit utilization from 45% to 28% can be done in one month by paying down a credit card. Increasing credit score from 648 to 720 takes 12–18 months. A good counterfactual system can incorporate time-horizon constraints to generate "short-term" and "long-term" paths to a different decision.
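The constraint types above can be expressed as a small validation function that runs on every generated counterfactual before it is shown to a user. This is a sketch under assumed names: `check_actionability`, the feature names, and the bounds are hypothetical, not part of any library.

```python
from typing import Dict, List, Tuple

def check_actionability(
    x: Dict[str, float],
    x_cf: Dict[str, float],
    immutable: List[str],
    increase_only: List[str],
    bounds: Dict[str, Tuple[float, float]],
) -> List[str]:
    """Return human-readable violations; an empty list means actionable."""
    violations = []
    for name in immutable:
        if x_cf[name] != x[name]:
            violations.append(f"{name} is immutable but was changed")
    for name in increase_only:
        if x_cf[name] < x[name]:
            violations.append(f"{name} can only increase")
    for name, (lo, hi) in bounds.items():
        if not lo <= x_cf[name] <= hi:
            violations.append(f"{name}={x_cf[name]} is outside [{lo}, {hi}]")
    return violations

# Hypothetical applicant; national_origin is an integer code used only as a stand-in.
x = {"age": 45, "national_origin": 3, "credit_utilization": 0.45, "income": 72000}
good_cf = {**x, "credit_utilization": 0.28}
bad_cf = {**x, "age": 25, "income": 500000}

rules = dict(
    immutable=["national_origin"],
    increase_only=["age"],
    bounds={"income": (20000, 200000), "credit_utilization": (0.0, 1.0)},
)
good_violations = check_actionability(x, good_cf, **rules)
bad_violations = check_actionability(x, bad_cf, **rules)
print(good_violations)
print(bad_violations)
```

Keeping the rules in one data structure makes them easy to review and to reuse in tests.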
Wachter et al. (2017): The Foundational Formulation
Wachter, Mittelstadt, and Russell (2017) published the first systematic formulation of counterfactual explanations for ML in "Counterfactual Explanations without Opening the Black Box: Automated Decisions and the GDPR" (Harvard Journal of Law & Technology). Their formulation connects to the GDPR: they argued that counterfactual explanations give data subjects meaningful, usable information about an automated decision without disclosing the model's internals, whether or not the GDPR legally requires an explanation.
Their loss function:
$$L(x, x', y', \lambda) = \lambda \, \big(f(x') - y'\big)^2 + d(x, x')$$
Here $\lambda$ weights the prediction term, so a larger $\lambda$ pushes harder toward the target prediction $y'$. Optimization uses gradient descent (for differentiable models) or Nelder-Mead simplex (for black-box models). The parameter $\lambda$ is tuned by starting small and increasing until $x'$ satisfies the prediction constraint.
Known limitations of Wachter et al.: (1) generates a single counterfactual - does not offer alternative paths; (2) can produce implausible counterfactuals that lie far from the data manifold; (3) the two-term objective requires careful $\lambda$ tuning for each instance; (4) no diversity - multiple runs with different initializations may converge to the same solution.
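The "start small and increase" tuning loop is easiest to see on a differentiable toy model. The sketch below assumes a hypothetical three-feature logistic model on range-normalized inputs, an L1 distance, and a simple doubling schedule. It is one way to realize the procedure described above, not the authors' reference implementation.

```python
import numpy as np

# Hypothetical logistic model on three range-normalized features.
w = np.array([1.5, -2.0, 0.5])
b = -0.5

def f(x):
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def descend(x, lam, y_target=1.0, lr=0.05, steps=2000):
    """Gradient descent on lam * (f(x') - y')^2 + ||x' - x||_1 for a fixed lam."""
    x_cf = x.copy()
    for _ in range(steps):
        p = f(x_cf)
        # Chain rule through the sigmoid: d/dx' (p - y')^2 = 2 (p - y') p (1 - p) w
        grad_pred = 2.0 * lam * (p - y_target) * p * (1.0 - p) * w
        grad_dist = np.sign(x_cf - x)  # subgradient of the L1 distance
        x_cf = x_cf - lr * (grad_pred + grad_dist)
    return x_cf

def wachter_search(x, lam=0.1, threshold=0.5, max_doublings=15):
    """Start with a small lam and double it until the counterfactual is valid."""
    for _ in range(max_doublings):
        x_cf = descend(x, lam)
        if f(x_cf) >= threshold:
            return x_cf, lam
        lam *= 2.0
    return None, lam

x = np.array([0.0, 0.8, 0.0])  # rejected applicant: f(x) is well below 0.5
x_cf, lam_used = wachter_search(x)
print("f(x)    =", round(float(f(x)), 3))
print("x_cf    =", None if x_cf is None else np.round(x_cf, 3))
print("lambda  =", lam_used)
```

With a small `lam` the distance term wins and the point stays at `x`. The counterfactual only crosses the threshold once `lam` is large enough for the prediction gradient to outweigh the unit-sized L1 subgradient.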
DiCE: Diverse Counterfactual Explanations
Mothilal, Sharma, and Tan (2020) introduced DiCE at ACM FAccT 2020, addressing the diversity limitation of Wachter et al. DiCE generates $k$ counterfactuals simultaneously, optimized to be both individually valid and mutually diverse.
The DiCE loss function:
$$C(x) = \arg\min_{c_1, \dots, c_k} \; \frac{1}{k} \sum_{i=1}^{k} \text{yloss}\big(f(c_i), y\big) + \frac{\lambda_1}{k} \sum_{i=1}^{k} \text{dist}(c_i, x) - \lambda_2 \, \det(K)$$
where:
- First term: each counterfactual must flip the decision (prediction validity)
- Second term: each counterfactual stays close to the original (proximity)
- Third term: the Determinantal Point Process (DPP) diversity term, subtracted so that minimizing the loss maximizes diversity
The DPP diversity term: $K$ is the kernel matrix where $K_{ij} = \frac{1}{1 + \text{dist}(c_i, c_j)}$ measures similarity between counterfactuals. The determinant $\det(K)$ is maximized when its rows are orthogonal - geometrically, when the counterfactuals are spread out in feature space. Maximizing $\det(K)$ (equivalently, minimizing $-\det(K)$) pushes the counterfactuals to explore different directions from the original point.
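A quick numerical check makes the geometric claim concrete. The sketch below assumes the inverse-distance kernel from the DiCE paper, with entry (i, j) equal to 1 / (1 + dist(c_i, c_j)), and uses a plain L1 distance on made-up two-feature counterfactuals.

```python
import numpy as np

def dpp_diversity(cfs: np.ndarray) -> float:
    """det(K), where K[i, j] = 1 / (1 + L1 distance between counterfactuals i and j)."""
    dists = np.abs(cfs[:, None, :] - cfs[None, :, :]).sum(axis=-1)
    K = 1.0 / (1.0 + dists)
    return float(np.linalg.det(K))

# Three near-duplicate counterfactuals versus three that differ in distinct features.
clustered = np.array([[0.30, 0.70], [0.31, 0.70], [0.30, 0.71]])
spread = np.array([[0.30, 0.70], [0.90, 0.70], [0.30, 0.10]])

div_clustered = dpp_diversity(clustered)
div_spread = dpp_diversity(spread)
print(f"clustered set: det(K) = {div_clustered:.5f}")
print(f"spread set:    det(K) = {div_spread:.5f}")
```

Near-duplicates make the rows of the kernel matrix almost identical, so the determinant collapses toward zero. A set that varies different features keeps the rows distinct and the determinant well above zero.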
DiCE supports:
- Feature constraints: immutable features (frozen at original value), range constraints (per-feature bounds), permitted feature changes (allowed categorical values)
- Multiple backends: sklearn, TensorFlow, PyTorch
- Multiple generation methods: random (fast), genetic algorithm (better quality), KD-tree (structured problems)
Growing Spheres Algorithm
Laugel et al. (2017) proposed Growing Spheres - a model-agnostic method that works by expanding a sphere around the query point until a decision boundary is found, then finding the nearest point on that boundary.
Algorithm:
- Start with a small sphere of radius $r$ centered at $x$
- Sample $n$ points uniformly inside the sphere
- If any sample has the target class $y'$: return the nearest such sample as the counterfactual
- Otherwise: double $r$ and repeat
This approach is simple, truly model-agnostic (requires only label queries), and naturally finds near-boundary counterfactuals. Its limitation: the resulting counterfactual may not satisfy actionability or plausibility constraints, requiring post-processing.
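The four steps above can be sketched in a few lines. This follows the simplified description in this lesson rather than every detail of the original paper. The toy classifier, starting radius, and sample count are assumptions chosen so the example runs quickly.

```python
import numpy as np

def growing_spheres(predict, x, target, r0=0.05, n_samples=500,
                    max_doublings=12, seed=0):
    """Double the search radius until a sample with the target label appears."""
    rng = np.random.default_rng(seed)
    r = r0
    for _ in range(max_doublings):
        # Uniform samples inside a ball: random direction, radius ~ r * U^(1/d)
        directions = rng.normal(size=(n_samples, x.size))
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        radii = r * rng.random(n_samples) ** (1.0 / x.size)
        candidates = x + directions * radii[:, None]

        hits = candidates[predict(candidates) == target]
        if len(hits) > 0:
            # Return the hit closest to the query point
            return hits[np.argmin(np.linalg.norm(hits - x, axis=1))]
        r *= 2.0
    return None

def toy_classifier(X):
    # Hypothetical black box: approve when the two features sum above 1
    return (X[:, 0] + X[:, 1] > 1.0).astype(int)

x = np.array([0.2, 0.3])
cf = growing_spheres(toy_classifier, x, target=1)
print("counterfactual:", cf)
```

Only label queries are used, which is what makes the method model-agnostic. Actionability checks would still have to be applied to the returned point afterwards.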
MACE: Model-Agnostic Counterfactual Explanations
Karimi et al. (2020) introduced MACE (Model-Agnostic Counterfactual Explanations) - the first method with a formal optimality guarantee. MACE formulates counterfactual generation as a Satisfiability Modulo Theories (SMT) problem.
The SMT formulation encodes:
- The model's decision boundary as a logical formula (for tree-based models, this is exact; for neural networks, it is approximated)
- Actionability constraints as logical clauses
- Minimality as the optimization objective
The SMT solver (e.g., Z3) then finds a provably minimal counterfactual - one that cannot be improved by removing any change. This is different from gradient-descent methods that find locally minimal solutions.
Advantage: provable minimality - MACE guarantees the found counterfactual has the smallest possible distance from the original among all valid counterfactuals satisfying the constraints.
Disadvantage: SMT solving is computationally expensive and does not scale to high-dimensional inputs or complex neural networks. Practical for tree-based models with moderate depth; impractical for deep networks.
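To illustrate what a global guarantee means without requiring an SMT solver, the sketch below swaps in a simpler exact technique that only works for a single small decision tree: enumerate every leaf that predicts the target class, project the query onto each leaf's box, and keep the cheapest projection. This is not MACE itself. The tree, its thresholds, and the discretization (integer credit scores, utilization in steps of 0.01) are hypothetical.

```python
import numpy as np

# Approve-leaves of a tiny hypothetical tree, as (lower, upper) boxes over
# [credit_utilization, credit_score]:
#   utilization <= 0.30 and score > 620  -> approve
#   utilization >  0.30 and score > 700  -> approve
approve_leaves = [
    (np.array([0.00, 621.0]), np.array([0.30, 850.0])),
    (np.array([0.31, 701.0]), np.array([1.00, 850.0])),
]
ranges = np.array([1.0, 550.0])  # used to normalize the L1 cost

def exact_tree_counterfactual(x, leaves, ranges):
    """Globally minimal range-normalized L1 counterfactual for a single tree."""
    best_cf, best_cost = None, np.inf
    for lower, upper in leaves:
        candidate = np.clip(x, lower, upper)  # nearest point of this box to x
        cost = (np.abs(candidate - x) / ranges).sum()
        if cost < best_cost:
            best_cf, best_cost = candidate, cost
    return best_cf, best_cost

x = np.array([0.45, 648.0])
cf, cost = exact_tree_counterfactual(x, approve_leaves, ranges)
print("counterfactual:", cf, "cost:", round(float(cost), 4))
```

Because every approving leaf is examined, no cheaper valid counterfactual can exist under this cost. A gradient-based search, by contrast, might settle on whichever leaf it happened to reach first.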
Algorithmic Recourse: Causal Counterfactuals
Standard counterfactual explanations are statistical: find a nearby point in feature space where the model predicts differently. But features are not independent - they have causal structure.
The problem with statistical counterfactuals: a statistical counterfactual might say "if your income were $120,000 and your education level were 'no degree,' you would be approved." In the causal world, income and education are causally related - high income with no degree is implausible and ignores the structural relationships.
More critically: a counterfactual might suggest "if your credit score were 720." But credit score is determined by upstream factors - payment history, credit utilization, age of accounts, credit mix, new inquiries. Telling someone to "have a 720 credit score" without telling them which upstream factors to change is like telling someone to "have a lower body temperature" without mentioning that they should see a doctor.
Algorithmic recourse (Ustun et al. 2019, Karimi et al. 2021) frames the problem causally. Given a Structural Causal Model (SCM):
- Nodes: features $X_1, \dots, X_p$
- Edges: causal dependencies; $X_i \to X_j$ means $X_i$ causally influences $X_j$
- Structural equations: $X_j = f_j(\text{pa}(X_j), U_j)$, where $\text{pa}(X_j)$ are the causal parents of $X_j$ and $U_j$ is exogenous noise
A causal recourse intervention:
- Must target root-cause nodes (features with no causal parents, or direct causes)
- Must propagate through the SCM (changing $X_i$ also changes all descendants of $X_i$)
- Must respect the causal ordering (you cannot set a downstream variable directly)
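A two-node toy SCM shows how an intervention propagates. The structural equation below (credit score falling linearly with utilization, plus an individual noise term) is entirely hypothetical and far simpler than any real scoring formula. It only illustrates the abduction, action, prediction steps.

```python
def scm_credit_score(utilization: float, u_score: float) -> float:
    """Hypothetical structural equation: credit_score := f(utilization, U)."""
    return 750.0 - 250.0 * utilization + u_score

# Observed applicant.
utilization_obs = 0.45
score_obs = 648.0

# 1. Abduction: recover the individual's noise term from the observation.
u_score = score_obs - (750.0 - 250.0 * utilization_obs)

# 2. Action: intervene on the root cause the applicant can actually change.
utilization_new = 0.28

# 3. Prediction: push the intervention through the structural equation.
score_new = scm_credit_score(utilization_new, u_score)

print(f"noise term U       = {u_score:.1f}")
print(f"statistical view   : utilization 0.28, score stays {score_obs:.1f}")
print(f"causal view        : utilization 0.28, score becomes {score_new:.1f}")
```

A statistical counterfactual would hold the credit score fixed while changing utilization. The causal version recognizes that paying down the card also moves the score, so one action can satisfy two feature-level requirements.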
The CARLA framework (Pawelczyk et al. 2021) provides a benchmark library for counterfactual explanation and recourse methods. It implements multiple methods, including ones that account for causal structure, and allows comparison across different causal assumptions.
Practical middle ground: building a full SCM is usually infeasible because the causal structure is unknown and contested. Instead: (1) use domain knowledge to add plausibility constraints that approximate causal plausibility (income and employment tenure should be positively correlated); (2) present the counterfactual in terms of actionable root causes ("reduce credit utilization by paying down Card X") rather than in terms of model features ("increase credit score"); (3) acknowledge that the counterfactual is approximate and provides a direction rather than a precise path.
Evaluation Metrics for Counterfactual Explanations
A counterfactual explanation system should be evaluated on four dimensions, not just whether the prediction flips:
Proximity: average Gower distance from original to counterfactual. Smaller is better. Report separately for each feature type (continuous vs categorical).
Sparsity (feature-level): number of features changed. A counterfactual changing 2 features is far more useful than one changing 20. Also report the fraction of instances where a 1-feature or 2-feature counterfactual exists.
Feasibility (data manifold): the fraction of generated counterfactuals that lie within the data manifold. Operationalized as: the $k$-nearest-neighbor distance from the counterfactual to the training set (if this is much larger than the mean pairwise distance in the training set, the counterfactual is implausible).
Diversity (set-level): for methods that generate $k$ counterfactuals, the average pairwise distance between counterfactuals. Higher diversity means the set offers more distinct paths to the target decision.
Additional metric: validity rate - what fraction of generated counterfactuals actually achieve the target prediction. A method is only useful if validity is high (ideally 100%).
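For methods that return a set of counterfactuals, the set-level numbers can be computed together. The helper below is a sketch with assumed names (`evaluate_cf_set`, `toy_predict`) and a made-up two-feature decision rule. All features are treated as continuous and range-normalized.

```python
import numpy as np

def evaluate_cf_set(x, cfs, predict, ranges, target=1):
    """Validity rate, mean proximity, mean sparsity, and pairwise diversity."""
    cfs = np.asarray(cfs, dtype=float)
    normalized = np.abs(cfs - x) / ranges  # per-feature normalized changes
    pairwise = [
        np.mean(np.abs(cfs[i] - cfs[j]) / ranges)
        for i in range(len(cfs)) for j in range(i + 1, len(cfs))
    ]
    return {
        "validity_rate": float(np.mean(predict(cfs) == target)),
        "proximity": float(normalized.mean()),
        "sparsity": float((normalized > 1e-8).sum(axis=1).mean()),
        "diversity": float(np.mean(pairwise)) if pairwise else 0.0,
    }

def toy_predict(X):
    # Hypothetical rule: approve if utilization <= 0.30 or credit score >= 700
    return ((X[:, 0] <= 0.30) | (X[:, 1] >= 700)).astype(int)

x = np.array([0.45, 648.0])             # [credit_utilization, credit_score]
cfs = [[0.28, 648.0], [0.45, 705.0], [0.40, 660.0]]
ranges = np.array([1.0, 550.0])

report = evaluate_cf_set(x, cfs, toy_predict, ranges)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

The third candidate changes two features and still fails to flip the decision, which shows up as a validity rate below 1 and as higher mean sparsity.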
| Method | Proximity | Sparsity | Feasibility | Diversity | Compute | Optimality |
|---|---|---|---|---|---|---|
| Wachter (2017) | Good | Moderate | Poor | Single | Fast | Local |
| Growing Spheres | Good | Low | Moderate | Single | Moderate | Local |
| DiCE (2020) | Good | Moderate | Moderate | Excellent | Moderate | Local |
| MACE (2020) | Excellent | Excellent | N/A | Single | Slow | Global |
| CARLA (causal) | Moderate | Moderate | Excellent | Moderate | Slow | Local |
Full Code: Wachter Counterfactuals and DiCE
```python
import numpy as np
import pandas as pd
import torch
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from typing import List, Dict, Optional, Tuple
import warnings

warnings.filterwarnings("ignore")

# ─── DATASET: SYNTHETIC LOAN APPLICATION ─────────────────────────────────────
np.random.seed(42)
n = 2000
income = np.random.normal(65000, 20000, n).clip(20000, 200000)
credit_score = np.random.normal(650, 80, n).clip(300, 850)
utilization = np.random.beta(2, 5, n)
debt_to_income = np.random.beta(2, 4, n)
years_employed = np.random.exponential(3, n).clip(0, 30)

X = pd.DataFrame({
    "income": income,
    "credit_score": credit_score,
    "credit_utilization": utilization,
    "debt_to_income": debt_to_income,
    "years_employed": years_employed,
})

# Ground-truth approval rule used to label the synthetic data
z = (
    0.8 * (credit_score - 650) / 80
    + 0.6 * (income - 65000) / 20000
    - 1.2 * (utilization - 0.3) / 0.2
    - 0.5 * (debt_to_income - 0.3) / 0.15
    + 0.4 * (years_employed - 3) / 3
)
prob_approve = 1 / (1 + np.exp(-z))
y = (prob_approve > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = GradientBoostingClassifier(
    n_estimators=200, max_depth=4, learning_rate=0.1, random_state=42
)
clf.fit(X_train, y_train)
print(f"Model accuracy: {clf.score(X_test, y_test):.4f}")

feature_names = list(X.columns)
feature_ranges = X.max().values - X.min().values
feature_bounds = {
    "income": (20000, 200000),
    "credit_score": (300, 850),
    "credit_utilization": (0.0, 1.0),
    "debt_to_income": (0.0, 1.0),
    "years_employed": (0.0, 30.0),
}


# ─── GOWER DISTANCE ──────────────────────────────────────────────────────────
def gower_distance(
    x: np.ndarray,
    x_prime: np.ndarray,
    feature_ranges: np.ndarray,
) -> float:
    """
    Gower distance restricted to continuous features (all features in this
    dataset are continuous, so no Hamming component is needed).
    Normalizes each feature by its range, then averages the absolute differences.
    """
    normalized_diffs = np.abs(x - x_prime) / (feature_ranges + 1e-8)
    return normalized_diffs.mean()
```
```python
# ─── WACHTER ET AL. (2017) COUNTERFACTUAL ────────────────────────────────────
def wachter_counterfactual(
    model,
    x: np.ndarray,
    feature_names: List[str],
    feature_ranges: np.ndarray,
    target_class: int = 1,
    target_prob: float = 0.5,
    immutable_features: Optional[List[str]] = None,
    directional_features: Optional[Dict[str, str]] = None,
    feature_bounds: Optional[Dict] = None,
    max_iter: int = 1000,
    learning_rate: float = 0.01,
    lambda_val: float = 0.1,
) -> Tuple[np.ndarray, float, int]:
    """
    Wachter et al. (2017) counterfactual via gradient descent.
    Minimizes: λ·(f(x') - y_target)² + gower_distance(x, x')

    Gradients are estimated by finite differences. Tree ensembles are piecewise
    constant, so the step is a fraction of each feature's range (a tiny step
    gives a zero gradient almost everywhere). The search can still fail; in
    that case the original point and its prediction are returned.

    Args:
        immutable_features: feature names that cannot be changed
        directional_features: dict of {feature_name: 'increase' or 'decrease'}
            e.g., {'age': 'increase'} - age can only increase
    Returns:
        x_cf: counterfactual input
        final_pred: model prediction for counterfactual
        n_iter: iterations used
    """
    if immutable_features is None:
        immutable_features = []
    if directional_features is None:
        directional_features = {}

    x_cf = x.copy().astype(float)

    # Mutability mask: 1 = can change, 0 = immutable
    mutable_mask = np.array([
        0.0 if name in immutable_features else 1.0
        for name in feature_names
    ])

    best_cf = x_cf.copy()
    best_dist = float("inf")
    best_pred = model.predict_proba([x_cf])[0, target_class]

    # Finite-difference step: 5% of each feature's range (heuristic choice)
    fd_step = 0.05 * feature_ranges + 1e-8

    for iteration in range(max_iter):
        # Numerical gradient of prediction w.r.t. x_cf
        grad_pred = np.zeros_like(x_cf)
        for j in range(len(x_cf)):
            if mutable_mask[j] == 0:
                continue
            x_plus = x_cf.copy()
            x_plus[j] += fd_step[j]
            x_minus = x_cf.copy()
            x_minus[j] -= fd_step[j]
            pred_plus = model.predict_proba([x_plus])[0, target_class]
            pred_minus = model.predict_proba([x_minus])[0, target_class]
            grad_pred[j] = (pred_plus - pred_minus) / (2 * fd_step[j])

        current_pred = model.predict_proba([x_cf])[0, target_class]

        # Gradient of Gower distance w.r.t. x_cf
        grad_dist = np.sign(x_cf - x) / (feature_ranges * len(x_cf) + 1e-8)

        # Total gradient: prediction term + distance term
        pred_term = 2 * lambda_val * (current_pred - target_prob) * grad_pred
        total_grad = (pred_term + grad_dist) * mutable_mask

        # Step in range-normalized coordinates so features on very different
        # scales (income vs. utilization) move at comparable rates
        x_cf = x_cf - learning_rate * (feature_ranges ** 2) * total_grad

        # Apply feature bounds
        if feature_bounds:
            for j, name in enumerate(feature_names):
                if name in feature_bounds:
                    lb, ub = feature_bounds[name]
                    x_cf[j] = np.clip(x_cf[j], lb, ub)

        # Apply directional constraints
        for j, name in enumerate(feature_names):
            if name in directional_features:
                direction = directional_features[name]
                if direction == "increase":
                    x_cf[j] = max(x_cf[j], x[j])  # can only increase
                elif direction == "decrease":
                    x_cf[j] = min(x_cf[j], x[j])  # can only decrease

        # Track best feasible counterfactual
        current_pred = model.predict_proba([x_cf])[0, target_class]
        current_dist = gower_distance(x, x_cf, feature_ranges)
        if current_pred >= target_prob and current_dist < best_dist:
            best_cf = x_cf.copy()
            best_dist = current_dist
            best_pred = current_pred
        if current_pred >= target_prob and current_dist < 0.05:
            return best_cf, best_pred, iteration

    return best_cf, best_pred, max_iter


# ─── APPLY TO MARIA'S CASE ───────────────────────────────────────────────────
maria = np.array([72000, 648, 0.45, 0.35, 3.0])
maria_pred = clf.predict_proba([maria])[0]
print(f"\nMaria's prediction: P(approve)={maria_pred[1]:.4f} -> REJECTED")

cf, cf_pred, n_iter = wachter_counterfactual(
    model=clf,
    x=maria,
    feature_names=feature_names,
    feature_ranges=feature_ranges,
    target_class=1,
    target_prob=0.52,
    immutable_features=[],
    directional_features={"years_employed": "increase"},
    feature_bounds=feature_bounds,
    max_iter=2000,
    learning_rate=0.005,
    lambda_val=0.5,
)

print(f"\nCounterfactual found (iter={n_iter}):")
print(f"{'Feature':<25} {'Original':>12} {'Counterfactual':>15} {'Change':>10}")
print("-" * 65)
for name, orig_val, cf_val in zip(feature_names, maria, cf):
    change = cf_val - orig_val
    change_str = f"{change:+.4f}" if abs(change) > 0.001 else "no change"
    print(f"{name:<25} {orig_val:>12.4f} {cf_val:>15.4f} {change_str:>10}")
print(f"\nCounterfactual P(approve): {cf_pred:.4f}")
if cf_pred < 0.52:
    print("No valid counterfactual found; try a larger lambda_val.")
```
```python
# ─── GRADIENT-BASED COUNTERFACTUAL SEARCH (NN models) ────────────────────────
def nn_counterfactual(
    model_fn,  # callable: x -> probability (differentiable, scalar output)
    x: torch.Tensor,
    target_prob: float = 0.5,
    lambda_val: float = 0.5,
    immutable_mask: Optional[torch.Tensor] = None,
    max_iter: int = 500,
    lr: float = 0.01,
) -> torch.Tensor:
    """
    Gradient-based counterfactual for differentiable models (PyTorch).
    Uses autograd to compute exact gradients - much faster than numerical diff.
    Loss = λ·(f(x') - y_target)² + mean(|x - x'|)

    Note on immutable_mask: despite the name, 1.0 marks features that may
    change and 0.0 marks frozen features (it multiplies the gradient).
    """
    x_cf = x.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([x_cf], lr=lr)

    if immutable_mask is None:
        immutable_mask = torch.ones_like(x)  # default: every feature may change

    for step in range(max_iter):
        optimizer.zero_grad()
        pred = model_fn(x_cf)
        proximity = (x_cf - x).abs().mean()
        pred_loss = lambda_val * (pred - target_prob) ** 2
        loss = pred_loss + proximity
        loss.backward()

        # Zero gradient for immutable features
        with torch.no_grad():
            x_cf.grad *= immutable_mask

        optimizer.step()

        # pred was computed before this step; stop once it reached the target
        if pred.item() >= target_prob:
            break

    return x_cf.detach()
```
```python
# ─── EVALUATION METRICS ───────────────────────────────────────────────────────
def evaluate_counterfactual(
    x_orig: np.ndarray,
    x_cf: np.ndarray,
    model,
    training_data: np.ndarray,
    feature_names: List[str],
    feature_ranges: np.ndarray,
    target_class: int = 1,
    k_neighbors: int = 5,
) -> Dict:
    """
    Evaluate a single counterfactual on four dimensions:
    1. Validity: does it achieve the target prediction?
    2. Proximity: Gower distance from original
    3. Sparsity: number of features changed
    4. Feasibility: k-NN distance to training set (data manifold)
    """
    pred_cf = model.predict_proba([x_cf])[0, target_class]
    validity = pred_cf >= 0.5

    proximity = gower_distance(x_orig, x_cf, feature_ranges)

    changes = [abs(x_cf[j] - x_orig[j]) > 1e-4 for j in range(len(x_orig))]
    sparsity = sum(changes)
    changed_features = [feature_names[j] for j in range(len(x_orig)) if changes[j]]

    # Feasibility: average distance to k nearest training examples
    train_dists = [
        gower_distance(x_cf, row, feature_ranges)
        for row in training_data[:200]  # subsample for speed
    ]
    feasibility_score = np.sort(train_dists)[:k_neighbors].mean()

    return {
        "valid": validity,
        "p_target": pred_cf,
        "proximity": proximity,
        "sparsity": sparsity,
        "changed_features": changed_features,
        "feasibility_score": feasibility_score,
    }


metrics = evaluate_counterfactual(
    maria, cf, clf,
    training_data=X_train.values,
    feature_names=feature_names,
    feature_ranges=feature_ranges,
)
print("\nCounterfactual evaluation:")
for key, val in metrics.items():
    print(f"  {key}: {val}")
```
```python
# ─── DICE-ML PRODUCTION USAGE ─────────────────────────────────────────────────
def dice_ml_example():
    """
    Production DiCE usage with the dice-ml library.
    pip install dice-ml
    """
    # import dice_ml
    # from dice_ml import Dice
    #
    # # Step 1: Define data interface
    # d = dice_ml.Data(
    #     dataframe=X_train.assign(approved=y_train),
    #     continuous_features=[
    #         "income", "credit_score", "credit_utilization",
    #         "debt_to_income", "years_employed"
    #     ],
    #     outcome_name="approved",
    # )
    #
    # # Step 2: Define model interface (sklearn)
    # m = dice_ml.Model(model=clf, backend="sklearn")
    #
    # # Step 3: Initialize DiCE with desired method
    # #   "random":  fast, good for exploration
    # #   "genetic": better quality, slower
    # #   "kdtree":  fastest for structured problems
    # exp = Dice(d, m, method="random")
    #
    # # Step 4: Generate diverse counterfactuals
    # query = pd.DataFrame([maria], columns=feature_names)
    # dice_exp = exp.generate_counterfactuals(
    #     query_instances=query,
    #     total_CFs=3,
    #     desired_class="opposite",       # flip from rejected to approved
    #     features_to_vary=[              # only actionable features
    #         "credit_score", "credit_utilization", "income", "years_employed"
    #     ],
    #     permitted_range={               # plausibility bounds
    #         "credit_score": [600, 850],
    #         "credit_utilization": [0.05, 0.5],
    #     },
    # )
    #
    # # Step 5: Display results
    # dice_exp.visualize_as_dataframe(show_only_changes=True)
    # # Returns 3 diverse counterfactual paths, each showing which features
    # # changed and by how much
    pass
```
```python
# ─── GDPR COMPLIANCE AUDIT LOG ───────────────────────────────────────────────
import json
from datetime import datetime, timezone


def log_decision_with_explanation(
    applicant_id: str,
    input_features: Dict,
    prediction: float,
    decision: str,
    counterfactuals: List[Dict],
    model_version: str,
    explanation_method: str = "dice_v1",
) -> Dict:
    """
    Create an audit log entry intended to support GDPR documentation.
    Store the (input, prediction, decision, explanation) record for the audit trail.
    Requirements:
    - Immutable features must not appear in counterfactual changes
    - Multiple counterfactuals (at least 3) covering different paths
    - GDPR Article 22 rights notice included
    - Counterfactual validity timestamp (explanations may expire as model updates)
    """
    log_entry = {
        # Timezone-aware UTC timestamp, rendered with a trailing "Z"
        "timestamp": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
        "applicant_id": applicant_id,
        "model_version": model_version,
        "input_features": input_features,
        "prediction_score": round(prediction, 6),
        "decision": decision,
        "explanation": {
            "method": explanation_method,
            "counterfactuals": counterfactuals,
            "validity_window_days": 90,  # explanation valid for 90 days
            "actionable_next_steps": [
                f"Option {i+1}: " + ", ".join(
                    f"{k}: {v}" for k, v in cf.items()
                )
                for i, cf in enumerate(counterfactuals)
            ],
        },
        "gdpr_rights_notice": (
            "Under GDPR Article 22, you have the right to request human "
            "intervention, to express your point of view, and to contest "
            "this decision."
        ),
    }
    print(json.dumps(log_entry, indent=2))
    return log_entry


sample_cfs = [
    {"credit_utilization": "reduce from 45% to 27%",
     "credit_score": "increase from 648 to 670"},
    {"income": "increase from 72k to 82k",
     "credit_utilization": "reduce from 45% to 32%"},
    {"years_employed": "increase to 5 years",
     "credit_score": "increase from 648 to 665"},
]

log_entry = log_decision_with_explanation(
    applicant_id="MARIA-2024-001",
    input_features=dict(zip(feature_names, maria.tolist())),
    prediction=maria_pred[1],
    decision="REJECTED",
    counterfactuals=sample_cfs,
    model_version="loan-gbt-v2.3.1",
)
```
Production Engineering Notes
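Counterfactual generation has latency implications. Wachter's gradient descent approach takes 100–2000 iterations. For tree models (XGBoost, LightGBM), each prediction call is fast (often sub-millisecond), but finite-difference gradients need two calls per mutable feature per iteration, so the total call count rather than per-call latency sets the budget. Tree ensembles are also piecewise constant, so the finite-difference step must be large enough to cross split thresholds or the estimated gradient is zero. For neural networks, use analytical gradients (autograd) rather than numerical differences: one forward and backward pass replaces roughly $2d$ forward calls for $d$ features.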
Pre-compute counterfactuals for common profiles. In loan underwriting, applications cluster around a few hundred common feature profiles. Pre-compute counterfactuals for representative cluster centers and cache them. At request time, find the nearest cached counterfactual and fine-tune if needed.
Monitor for counterfactual drift. As the model is retrained, the counterfactuals it generates will change. If a user was told "increase your credit score to 670" six months ago and now applies with a 670 score and is still rejected, the explanation has become misleading. Track counterfactual consistency across model versions.
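One way to monitor this is to replay stored counterfactuals against each new model version before it ships. The sketch below is a minimal illustration: the record layout, `find_stale_explanations`, and the two rule-based stand-in models are all hypothetical.

```python
from typing import Callable, Dict, List

def find_stale_explanations(
    stored: List[Dict],
    predict_new: Callable[[Dict[str, float]], int],
    target: int = 1,
) -> List[str]:
    """Return ids of stored counterfactuals that no longer reach the target."""
    return [
        record["explanation_id"]
        for record in stored
        if predict_new(record["counterfactual"]) != target
    ]

# Stand-ins for two model versions (a real system would load trained models).
def model_v1(a: Dict[str, float]) -> int:
    return int(a["credit_score"] >= 670 and a["credit_utilization"] <= 0.30)

def model_v2(a: Dict[str, float]) -> int:
    return int(a["credit_score"] >= 690 and a["credit_utilization"] <= 0.30)

stored = [
    {"explanation_id": "exp-001", "model_version": "v1",
     "counterfactual": {"credit_score": 670, "credit_utilization": 0.28}},
    {"explanation_id": "exp-002", "model_version": "v1",
     "counterfactual": {"credit_score": 700, "credit_utilization": 0.25}},
]

stale = find_stale_explanations(stored, model_v2)
print("explanations invalidated by v2:", stale)
```

The resulting list identifies which users received guidance that the new model would no longer honor, which is the input a notification or grandfathering policy needs.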
Always enforce actionability constraints. A counterfactual suggesting changes to immutable features is legally dangerous. Encode the feature mutability mask in system configuration, make it explicit in code reviews, and test that immutable features have zero change across all generated counterfactuals.
DiCE is the production-ready library for diverse counterfactuals. It handles all major backends (sklearn, TensorFlow, PyTorch), supports categorical and continuous features, and provides three generation methods: random (fast), genetic algorithm (better quality), and KD-tree (fastest for structured problems). Start with the random method and upgrade to genetic if quality is insufficient.
Common Mistakes
:::danger Common Mistake 1: Generating counterfactuals that violate feature correlations
A counterfactual saying "if your income were $150,000 and your years of employment were 0" is implausible. Statistical methods that treat features independently generate these implausible suggestions. Enforce plausibility through prototype constraints or training-data-range constraints at minimum.
:::
:::danger Common Mistake 2: Not enforcing actionability on immutable features
Suggesting that a person change their age, sex, or national origin is illegal in many jurisdictions. Always define and enforce feature actionability masks. Test that immutable features are never modified. Include this in your model card and compliance documentation.
:::
:::warning Common Mistake 3: Providing only one counterfactual
A single counterfactual implies there is one path to a different decision. Different people have different capacities to change different features. Always provide at least 3 diverse counterfactuals covering different feature combinations. DiCE makes this straightforward.
:::
:::warning Common Mistake 4: Treating causal and statistical counterfactuals as equivalent
Statistical counterfactuals find nearby points in feature space. Causal counterfactuals find interventions that would, following the causal graph, produce a different outcome. "Increase credit score" is statistical - it does not tell you which upstream actions produce the score increase. For truly actionable recourse, trace the counterfactual to actionable root causes.
:::
YouTube Resources
| Resource | Creator | Focus |
|---|---|---|
| Counterfactual Explanations - Wachter Paper Explained | Yannic Kilcher | Wachter et al. paper review and GDPR connection |
| DiCE: Diverse Counterfactual Explanations | Microsoft Research | DiCE library walkthrough and DPP diversity |
| Algorithmic Recourse and Causal Counterfactuals | Bernhard Scholkopf | SCM-based causal framing of recourse |
| GDPR and Machine Learning Explanations | AI Ethics Institute | Legal perspective on explanation rights |
| CARLA: Counterfactual and Recourse Library | CARLA Authors | Framework for causal recourse evaluation |
Interview Q&A
Q1: What is the difference between a SHAP explanation and a counterfactual explanation? When would you use each?
SHAP answers the attribution question: "How much did each feature contribute to this prediction, relative to the average prediction?" It decomposes the model output into additive feature contributions - retrospective, explaining the past decision. Counterfactual explanations answer the recourse question: "What would need to change for a different decision?" They find a nearby point in feature space where the model predicts differently - prospective, providing actionable guidance. Use SHAP when: you need to understand model behavior, debug training data issues, or explain a decision to a technical audience. Use counterfactuals when: you need to give a user actionable guidance, when complying with GDPR Article 22 recourse requirements, or when the user's question is "what can I do about this?" rather than "why did this happen?" In practice, a complete explanation system uses both: SHAP for the "why" and counterfactuals for the "what now."
Q2: Write out the formal optimization objective for counterfactual explanations and explain each term.
The objective is $\arg\min_{x'} \lambda \, d(x, x') + \ell(f(x'), y')$, subject to immutability, range, and validity constraints. The first term is the proximity penalty - it keeps the counterfactual close to the original. We want minimal changes so the explanation is actionable. The Gower distance is preferred for mixed feature types: it normalizes continuous features by range and uses Hamming distance for categorical, so all features contribute equally. The second term is the prediction loss - it penalizes counterfactuals that do not achieve the target class. Typically this is the squared difference or cross-entropy. The hyperparameter $\lambda$ controls the tradeoff: large $\lambda$ prioritizes proximity, small $\lambda$ prioritizes achieving the target prediction. In practice, $\lambda$ is tuned per instance: with this form, start large and decrease it until the prediction constraint is satisfied. Wachter et al. instead place $\lambda$ on the prediction loss, in which case it is started small and increased.
Q3: What is DiCE and how does its DPP diversity term work?
DiCE (Diverse Counterfactual Explanations, Mothilal et al. 2020) generates $k$ counterfactuals simultaneously using a combined objective that includes proximity, prediction validity, and a Determinantal Point Process (DPP) diversity term. The DPP term is $\det(K)$, where $K$ is a kernel matrix measuring similarity between the generated counterfactuals (in the DiCE paper, $K_{ij} = 1 / (1 + \mathrm{dist}(c_i, c_j))$). The determinant of a matrix is maximized when its rows are orthogonal - geometrically, when the counterfactuals are spread out in feature space. By minimizing $-\det(K)$, DiCE penalizes sets where counterfactuals are similar to each other, pushing them to explore different directions from the original point. The result is a set of counterfactuals that collectively cover different paths to the target decision - one might change credit utilization only, another might change income and employment tenure, a third might change credit score and debt-to-income ratio.
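To make the diversity term concrete, the sketch below computes the determinant of a similarity kernel for two candidate sets. It uses the kernel from the DiCE paper, $K_{ij} = 1 / (1 + \mathrm{dist}(c_i, c_j))$, with a plain L1 distance and assumes features are already scaled to $[0, 1]$. The `dpp_diversity` helper and the sample points are illustrative, not the `dice-ml` implementation.

```python
import numpy as np


def dpp_diversity(counterfactuals):
    """det(K) with K[i, j] = 1 / (1 + L1 distance between c_i and c_j)."""
    c = np.asarray(counterfactuals, dtype=float)
    # Pairwise L1 distances between all counterfactuals, shape (k, k)
    dists = np.abs(c[:, None, :] - c[None, :, :]).sum(axis=2)
    kernel = 1.0 / (1.0 + dists)
    return float(np.linalg.det(kernel))


# Three near-identical counterfactuals vs. three that move in different directions
clustered = [[0.30, 0.50], [0.31, 0.50], [0.30, 0.51]]
spread = [[0.10, 0.50], [0.60, 0.20], [0.30, 0.90]]

print(dpp_diversity(clustered) < dpp_diversity(spread))
```

The clustered set produces a kernel whose rows are almost identical, so its determinant is close to zero. The spread set has smaller off-diagonal similarities and a clearly larger determinant, which is what the diversity term rewards.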
Q4: A user was told "increase your credit score to 670 and you will be approved." They do it. They apply again. They are still rejected. What went wrong?
This is counterfactual validity decay - the decision boundary shifted between when the counterfactual was generated and when the user acted on it. Causes: model was retrained on new data, model was updated to fix a bug, regulatory recalibration. Prevention: (1) version your models and attach each explanation to the specific model version that generated it; (2) implement explanation monitoring - validate that counterfactuals from version N are still valid under version N+1; (3) set explicit validity windows on counterfactuals ("this path is valid for 90 days"); (4) when a model update changes the decision boundary, proactively notify users who received counterfactual guidance under the old model; (5) consider a "counterfactual guarantee" - if a user achieves the stated conditions within 90 days, the model decision is honored regardless of model updates.
Q5: What is algorithmic recourse and how does it differ from standard counterfactual explanations?
Standard counterfactual explanations are statistical: they find a nearby point in feature space where the model predicts differently, treating features as independent and ignoring causal relationships. Algorithmic recourse (Ustun et al. 2019, Karimi et al. 2021) is causal: it uses a Structural Causal Model (SCM) to find interventions that would, through the causal graph, produce a different outcome. The difference is concrete: a statistical counterfactual might say "if your credit score were 720." An algorithmic recourse counterfactual traces through the SCM and says "if you reduced your credit utilization to 20% and made all payments on time for 12 months, your credit score would increase to 720 and your application would be approved." The recourse specifies actionable root causes, not downstream model features. The practical challenge: building a full SCM requires knowing the causal structure, which is often unknown. The CARLA framework benchmarks causal recourse methods under different assumptions about the SCM. For most production systems, the middle ground is to add plausibility constraints that approximate causal plausibility without requiring a full SCM.
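The difference can be sketched with a toy structural causal model. Everything below is hypothetical: the linear equation `credit_score := 800 - 250 * utilization + noise`, the approval threshold of 670, and the function names are assumptions chosen only to show the abduction, intervention, and prediction steps. A real SCM would have to be elicited from domain knowledge or estimated from data.

```python
def abduct_noise(utilization, credit_score):
    """Abduction: recover this individual's noise term from observed values."""
    return credit_score - (800 - 250 * utilization)


def credit_score_after(utilization, noise):
    """Structural equation (assumed): credit_score := 800 - 250 * utilization + noise."""
    return 800 - 250 * utilization + noise


def approve(credit_score):
    """Stand-in for the model's decision; the threshold is hypothetical."""
    return credit_score >= 670


observed = {"utilization": 0.55, "credit_score": 648}
noise = abduct_noise(**observed)

# Intervention: the user lowers utilization to 40%; the score changes downstream
new_score = credit_score_after(0.40, noise)
print(approve(observed["credit_score"]), approve(new_score), round(new_score, 1))
```

A purely statistical counterfactual would report only the target credit score. The recourse view reports the upstream action (lower utilization) and lets the causal model carry its effect through to the feature the classifier actually sees.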
Q6: How do you evaluate the quality of a set of counterfactuals generated by different methods?
Evaluate on five dimensions. Validity: fraction of generated counterfactuals that actually achieve the target prediction - a method is useless if this is not close to 100%. Proximity: average Gower distance from original to counterfactual - smaller is better; a counterfactual requiring minimal changes is more actionable. Sparsity: average number of features changed - a 2-feature counterfactual is far more useful than a 15-feature one. Feasibility: k-NN distance from counterfactual to training data - counterfactuals far from the training distribution are implausible (age -5, income $10M). Diversity: for sets of k counterfactuals, average pairwise distance between them - diverse sets give users more options and respect different user capacities. As an additional sanity check, run a pixel-flipping-style ablation: actually remove the features specified in the counterfactual from inputs and measure prediction change - this validates that the counterfactual identified genuinely important features, not noise.
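A minimal sketch of these metrics for one instance follows. It assumes all features are continuous (so the range-normalized absolute gap stands in for Gower distance), uses 1-nearest-neighbor distance for feasibility, and relies on a hypothetical rule-based `predict` function. The function name and data are illustrative.

```python
import numpy as np


def evaluate_counterfactuals(x, cfs, predict, target, train, ranges):
    """Score a set of counterfactuals (k, n_features) for one original input x."""
    x, cfs, train, ranges = (np.asarray(a, dtype=float) for a in (x, cfs, train, ranges))

    def gap(a, b):
        # Range-normalized absolute difference per feature
        return np.abs(a - b) / ranges

    validity = float(np.mean([predict(c) == target for c in cfs]))
    proximity = float(gap(cfs, x).mean())
    sparsity = float((cfs != x).sum(axis=1).mean())
    # Feasibility: distance from each counterfactual to its nearest training row
    feasibility = float(np.mean([gap(train, c).mean(axis=1).min() for c in cfs]))
    # Diversity: mean pairwise distance within the set
    k = len(cfs)
    pairs = [gap(cfs[i], cfs[j]).mean() for i in range(k) for j in range(i + 1, k)]
    diversity = float(np.mean(pairs)) if pairs else 0.0
    return {"validity": validity, "proximity": proximity, "sparsity": sparsity,
            "feasibility": feasibility, "diversity": diversity}


def predict(c):
    # Hypothetical decision rule over [credit_utilization, credit_score]
    return "approved" if c[1] >= 670 or c[0] <= 0.20 else "rejected"


x = [0.45, 648]
cfs = [[0.25, 670], [0.45, 700], [0.30, 648]]
train = [[0.20, 720], [0.35, 690], [0.50, 640], [0.28, 670]]
scores = evaluate_counterfactuals(x, cfs, predict, "approved", train, ranges=[1.0, 550.0])
print(scores)
```

The third candidate does not flip the decision under this rule, so validity comes out below 1.0, which is exactly the kind of failure the validity metric is meant to surface before an explanation reaches a user.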
GDPR Compliance Checklist
Under GDPR Article 22, individuals have the right not to be subject to solely automated decisions that "significantly affect" them. The ICO guidance suggests this implies a right to an actionable explanation.
| Requirement | Implementation |
|---|---|
| What decision was made | State the decision and the model's output score |
| Why the decision was made | SHAP or LIME for attribution (retrospective) |
| What to do for a different decision | Counterfactual explanation (prospective, actionable) |
| Only changeable features suggested | Feature actionability masks |
| Suggestions are realistic | Data manifold / range constraints |
| Multiple paths offered | DiCE diverse counterfactuals (k ≥ 3) |
| Protected characteristics excluded | Immutability constraints on protected features |
| Explanation is auditable | Log (input, decision, explanation) triplets |
| Validity window stated | Attach model version and expiry date to each explanation |
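The actionability, range, and protected-feature rows can be enforced with a simple filter applied before any counterfactual is shown to a user. This is a sketch only: the `IMMUTABLE` set, the `BOUNDS` values, and the `violations` helper are hypothetical names and numbers chosen for illustration.

```python
IMMUTABLE = {"age", "gender"}  # hypothetical protected or unchangeable features
BOUNDS = {  # hypothetical plausible ranges for actionable features
    "credit_utilization": (0.0, 1.0),
    "credit_score": (300, 850),
}


def violations(original, counterfactual):
    """Return the reasons a counterfactual must not be shown to a user."""
    problems = []
    for name, new_value in counterfactual.items():
        if new_value == original[name]:
            continue  # unchanged features cannot violate anything
        if name in IMMUTABLE:
            problems.append(f"{name} is immutable")
        elif name in BOUNDS:
            low, high = BOUNDS[name]
            if not low <= new_value <= high:
                problems.append(f"{name}={new_value} outside [{low}, {high}]")
    return problems


original = {"age": 41, "credit_utilization": 0.45, "credit_score": 648}
bad = {"age": 35, "credit_utilization": 0.28, "credit_score": 900}
good = {"age": 41, "credit_utilization": 0.28, "credit_score": 670}

print(violations(original, bad))
print(violations(original, good))
```

Filtering after generation is the simplest option. Passing the same constraints into the generator itself avoids wasting search effort on candidates that will be discarded.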
GDPR compliance is a legal question that requires legal counsel. This checklist represents ML engineering best practice for explanation systems, not legal advice. Always consult a qualified lawyer before claiming GDPR compliance.
Key Takeaways
Counterfactual explanations are the most actionable form of ML explanation. SHAP tells you what contributed to a past decision; counterfactuals tell you what to change for a different future decision. The Wachter (2017) framework provides the foundational formulation; DiCE extends it to generate diverse options covering different paths; MACE adds provable optimality for tree-based models; CARLA and algorithmic recourse add causal soundness.
The five evaluation dimensions - validity, proximity, sparsity, feasibility, diversity - are your quality criteria. No single method dominates on all five; the right choice depends on your latency budget, model type, and regulatory requirements.
Always: enforce actionability constraints (immutable features frozen), set validity windows on explanations, monitor for counterfactual drift as models are updated, and provide at least three diverse counterfactuals rather than a single path. Do this, and your explanation system will actually help users rather than just satisfying a compliance checkbox.
Historical Context: From Post-Hoc Attribution to Actionable Recourse
The first wave of ML explainability methods - saliency maps, LIME (Ribeiro et al., 2016), SHAP (Lundberg & Lee, 2017) - focused on attribution: which features drove this decision? These methods are retrospective. They tell you what happened, not what to do.
Wachter, Mittelstadt, and Russell (2017) identified the gap. Their paper, published in the same year as SHAP, argued that attribution explanations fail the test of actionability - they cannot tell a loan applicant what to change to get approved. The counterfactual framing shifted the question from "why?" to "what if?", making explanations prospective and actionable.
The timing was not accidental. The EU's General Data Protection Regulation was adopted in 2016 and became applicable in May 2018. Article 22's "right to explanation" for automated decision-making created legal pressure to produce explanations that actually helped individuals understand their options - not just documentation for data scientists. Counterfactual explanations were positioned as the technical answer to a legal requirement.
DiCE (Mothilal et al., 2020) from Microsoft Research extended the Wachter framework by introducing diversity as a first-class objective. Single counterfactuals are fragile - a slight change in model parameters makes them invalid. Providing three to five diverse paths hedges against this instability and gives users meaningful choice.
The algorithmic recourse direction (Ustun et al., 2019; Karimi et al., 2021) added causal structure. Statistical counterfactuals may suggest changes that are statistically plausible but causally impossible - you cannot "decrease age" and expect the same effect as someone who is actually younger. CARLA and causal recourse methods enforce do-calculus constraints, ensuring the suggested changes produce the predicted outcome through a realistic causal chain.
Choosing the Right Method: Decision Guide
Decision rules in plain language:
- Tree-based model + need provably minimal changes: MACE (SMT-solver, runs in milliseconds on sklearn trees, guarantees the fewest feature changes needed).
- Neural network + need diverse options for users: DiCE with DPP diversity (generates 3–10 counterfactuals covering different change paths, gradient-based optimization).
- Any model + causal structure available: CARLA with a structural causal model (prevents causally impossible suggestions like "lower your age").
- Rapid prototyping or latency-critical path: Wachter optimization (straightforward gradient descent, single counterfactual, reasonable quality).
The default for production credit/insurance/hiring systems: DiCE with at least $k = 3$ counterfactuals, actionability constraints enforced, and a GDPR audit log. The default for tree-based internal tools: MACE for its provable minimality guarantee.
Monitoring Counterfactual Validity Over Time
Counterfactual explanations have a shelf life. When a model is retrained or its feature distribution shifts, explanations generated under the old model may no longer be valid - the counterfactual now crosses the new decision boundary in the wrong direction, or the changed features are no longer sufficient for approval.
Counterfactual drift is a real production problem. A user receives a counterfactual explanation ("increase your savings by $5,000"), acts on it over three months, and then discovers the model has been updated and the action is no longer sufficient. This creates legal and reputational risk - the user has a reasonable expectation that the explanation was actionable.
```python
from datetime import datetime, timedelta, timezone

import numpy as np


class CounterfactualExplanationStore:
    """
    Stores counterfactual explanations with metadata for drift monitoring.
    """

    def __init__(self, validity_days: int = 90):
        self.validity_days = validity_days
        self.records = []

    def store(self, user_id: str, original_input: dict,
              counterfactual: dict, model_version: str,
              decision: str) -> str:
        """Store a counterfactual with expiry metadata.

        `decision` is the target outcome the counterfactual is expected to produce.
        """
        # Take one timestamp so the id, creation time, and expiry all agree
        now = datetime.now(timezone.utc)
        record = {
            "explanation_id": f"cf_{user_id}_{now.strftime('%Y%m%d%H%M%S')}",
            "user_id": user_id,
            "timestamp": now.isoformat(),
            "valid_until": (now + timedelta(days=self.validity_days)).isoformat(),
            "model_version": model_version,
            "original_input": original_input,
            "counterfactual": counterfactual,
            # Only the features that differ between original and counterfactual
            "changes_required": {
                k: {"from": original_input.get(k), "to": v}
                for k, v in counterfactual.items()
                if original_input.get(k) != v
            },
            "decision": decision,
            "status": "active",
        }
        self.records.append(record)
        return record["explanation_id"]

    def check_validity(self, explanation_id: str,
                       current_model_version: str,
                       current_model) -> dict:
        """
        Check if a stored counterfactual is still valid under the current model.
        Returns validity status and recommended action.
        """
        record = next((r for r in self.records if r["explanation_id"] == explanation_id), None)
        if record is None:
            return {"valid": False, "reason": "explanation_not_found"}

        # Check expiry
        valid_until = datetime.fromisoformat(record["valid_until"])
        if datetime.now(timezone.utc) > valid_until:
            return {"valid": False, "reason": "explanation_expired",
                    "expired_at": record["valid_until"]}

        # Check model version
        if record["model_version"] != current_model_version:
            # Rerun the counterfactual through the current model.
            # Assumes the dict's insertion order matches the model's feature order
            # and that predict() returns labels comparable to record["decision"].
            cf_input = record["counterfactual"]
            try:
                cf_array = np.array(list(cf_input.values())).reshape(1, -1)
                pred = current_model.predict(cf_array)[0]
                still_valid = bool(pred == record["decision"])
            except Exception:
                # Treat any scoring failure as invalid so the explanation is regenerated
                still_valid = False
            return {
                "valid": still_valid,
                "reason": "model_updated",
                "original_model_version": record["model_version"],
                "current_model_version": current_model_version,
                "recommendation": (
                    "Counterfactual still valid under new model - user action remains actionable."
                    if still_valid else
                    "Counterfactual invalid under new model - regenerate explanation and notify user."
                ),
            }

        return {"valid": True, "reason": "explanation_current"}


# Usage in a production audit workflow
store = CounterfactualExplanationStore(validity_days=90)
explanation_id = store.store(
    user_id="user_12345",
    original_input={"income": 45000, "savings": 2000, "credit_score": 680},
    counterfactual={"income": 45000, "savings": 7000, "credit_score": 680},
    model_version="credit_model_v2.3",
    decision="approved",
)
print(f"Explanation stored: {explanation_id}")
```
Drift monitoring protocol: when a model is updated, run all active counterfactual explanations through the new model. Flag any that are no longer valid (i.e., applying the suggested changes to the original input no longer produces the target outcome). Send proactive notifications to affected users offering updated explanations. This is not just good practice - in regulated industries, it may be required to avoid claims of misleading explanation.
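The protocol above could be run as a batch sweep after each model release. The sketch below is one way to do it under stated assumptions: records are plain dicts with the same keys the store uses (`status`, `counterfactual`, `decision`, `user_id`, `explanation_id`), and `predict` is a hypothetical callable standing in for the updated model.

```python
def revalidate_active(records, predict, feature_order):
    """Re-score active counterfactuals under a new model; return the ones that broke.

    predict: callable taking a list of feature values and returning a decision label.
    feature_order: explicit column order expected by the model.
    """
    invalidated = []
    for record in records:
        if record["status"] != "active":
            continue
        # Build the row in an explicit order instead of relying on dict ordering
        row = [record["counterfactual"][name] for name in feature_order]
        if predict(row) != record["decision"]:
            record["status"] = "invalidated"
            invalidated.append((record["user_id"], record["explanation_id"]))
    return invalidated


records = [
    {"explanation_id": "cf_1", "user_id": "user_12345", "status": "active",
     "decision": "approved",
     "counterfactual": {"income": 45000, "savings": 7000, "credit_score": 680}},
    {"explanation_id": "cf_2", "user_id": "user_67890", "status": "active",
     "decision": "approved",
     "counterfactual": {"income": 45000, "savings": 9000, "credit_score": 680}},
]


def new_model_predict(row):
    # Hypothetical updated model: the savings threshold moved up to 8,000
    income, savings, credit_score = row
    return "approved" if savings >= 8000 else "rejected"


to_notify = revalidate_active(records, new_model_predict,
                              ["income", "savings", "credit_score"])
print(to_notify)
```

The returned list is the notification queue: each entry identifies a user whose stored guidance no longer leads to the target outcome and who should receive a regenerated explanation.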
Interview Q&A: Advanced Questions
Q7: How do you handle non-differentiable models (decision trees, rule systems) in counterfactual generation?
Gradient-based methods (Wachter, DiCE with gradients) require a differentiable loss function - they compute the gradient $\nabla_{x'} \mathcal{L}(x')$ of the objective with respect to the candidate $x'$. Decision trees and rule systems are not differentiable. Options: (1) MACE: SMT-solver approach that directly reasons about the tree's decision rules as logical constraints. Provably finds the minimal change with no differentiability requirement. Works natively with sklearn trees. (2) Model distillation: train a locally differentiable surrogate (logistic regression or neural net) in the region of $x$, then apply gradient-based CF search to the surrogate. Less precise but widely applicable. (3) DiCE's genetic algorithm mode: uses evolutionary search (mutation + selection) rather than gradients. Slower but model-agnostic. (4) Growing Spheres: perturbation-based, no gradients needed. Expands spheres around $x$ until a valid CF is found. For tree ensembles (XGBoost, random forests), MACE + feature actionability constraints is the gold standard.
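The perturbation-based option can be sketched in a few lines. This is a simplified take on the Growing Spheres idea, not the published algorithm: it samples in expanding shells around the input and returns the closest sample that reaches the target, and it skips the sparsity post-processing step. The rule-based `predict` function, the feature scaling, and all parameter defaults are assumptions for illustration.

```python
import numpy as np


def growing_spheres(x, predict, target, step=0.05, samples=500, max_radius=5.0, seed=0):
    """Sample in expanding shells around x until some point is classified as target.

    Only calls predict(), so it works for trees, rules, or any black box.
    Returns the closest valid sample from the first successful shell, or None.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    inner = 0.0
    while inner < max_radius:
        outer = inner + step
        # Random unit directions, with radii drawn inside the current shell
        directions = rng.normal(size=(samples, x.size))
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        radii = rng.uniform(inner, outer, size=(samples, 1))
        candidates = x + directions * radii
        valid = [c for c in candidates if predict(c) == target]
        if valid:
            return min(valid, key=lambda c: np.linalg.norm(c - x))
        inner = outer
    return None


def predict(c):
    # Hypothetical non-differentiable rule over [utilization, credit_score / 1000]
    return "approved" if c[0] <= 0.28 and c[1] >= 0.67 else "rejected"


x = np.array([0.45, 0.648])
cf = growing_spheres(x, predict, target="approved")
print(cf, predict(cf))
```

Because the search only needs class labels, the same function can wrap a scikit-learn tree, an XGBoost model, or a hand-written rule system. The cost is many more model calls than a gradient-based search would need.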
Q8: What is the difference between a counterfactual explanation and algorithmic recourse?
Counterfactual explanation (Wachter et al., 2017): finds a nearby input $x'$ such that $f(x') = y'$. Focuses on the model's decision boundary - purely statistical. It tells you "if your input were $x'$, the model would predict $y'$." It does not guarantee that actions to change $x$ to $x'$ are (a) causal - does increasing your savings actually cause approval, or is savings correlated with another factor? - or (b) achievable in practice. Algorithmic recourse (Ustun et al., 2019): extends counterfactuals with causal structure. Uses a structural causal model (SCM) to ensure the suggested interventions are causally valid. Also adds cost functions based on the effort required to make each change (changing income is hard; changing address is medium; changing marital status is easy). Recourse focuses on finding the minimum-effort causally valid path to a different outcome - a stronger and more useful guarantee for the individual. In practice: most production systems use statistical counterfactuals (faster, no SCM required) but add actionability constraints as a proxy for causal validity.
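The effort-based cost function is what separates "closest" from "easiest". The sketch below ranks two already-valid candidates by plain normalized distance and by effort-weighted cost. The `EFFORT` weights, the feature ranges, and the helper names are hypothetical values chosen for illustration.

```python
EFFORT = {"income": 5.0, "savings": 1.0}  # hypothetical: income is 5x harder to change
RANGES = {"income": 200000, "savings": 50000}  # hypothetical training-data ranges


def proximity(original, candidate):
    """Unweighted, range-normalized distance (what a plain counterfactual minimizes)."""
    return sum(abs(candidate[k] - original[k]) / RANGES[k] for k in candidate)


def recourse_cost(original, candidate):
    """Same distance, with each feature scaled by the effort needed to change it."""
    return sum(EFFORT[k] * abs(candidate[k] - original[k]) / RANGES[k] for k in candidate)


original = {"income": 45000, "savings": 2000}
candidates = {
    "raise_income": {"income": 55000, "savings": 2000},
    "raise_savings": {"income": 45000, "savings": 7000},
}

closest = min(candidates, key=lambda name: proximity(original, candidates[name]))
cheapest = min(candidates, key=lambda name: recourse_cost(original, candidates[name]))
print(closest, cheapest)
```

Under these assumed weights the two rankings disagree: the income change is smaller in normalized distance, but the savings change is cheaper once effort is taken into account.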
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Counterfactual Explanations demo on the EngineersOfAI Playground - no code required.
:::
