Interpretability vs Explainability - Clearing Up the Confusion

Reading time: 35 min | Interview relevance: High - regulatory and system design questions | Target roles: ML Engineer, AI Engineer, Data Scientist, Research Engineer


The Loan Denial That Failed the Audit

It is a Tuesday morning in Frankfurt when the trouble begins. A major bank's automated loan-decision system has been flagged by EU regulators under GDPR Article 22. A customer was denied a mortgage. The customer filed a complaint. The regulator sent an auditor.

The ML team is prepared. They have SHAP values. They pull up a waterfall plot showing exactly which features contributed to the denial: debt-to-income ratio (+0.34), recent credit inquiries (+0.21), employment duration (-0.18). The colors are clean. The numbers add up to the model's output. It looks airtight.

The auditor is not satisfied. "You've shown me why this specific customer was denied," she says, "but I need to understand your model. What does it do with borderline cases? What is the decision boundary? Can you show me that the model does not systematically disadvantage applicants from certain postal codes?" She picks up a pen. "I'm not asking about this decision. I'm asking about your system."

The ML team goes quiet. SHAP explains individual predictions. It does not explain the model. They have a tool that answers the wrong question.

This gap - between explaining a specific output and understanding a system - is the subject of this lesson. It is not just semantic. Regulators, doctors, and judges need different things. Building the wrong type of explanation for the wrong stakeholder is a failure mode that costs organizations real money and real trust.


Why the Terminology Matters

The field uses "interpretability" and "explainability" interchangeably. Most blog posts do. Many papers do. This is a mistake with practical consequences. When a clinical team asks for "model interpretability" and you deliver SHAP plots, you may be providing the wrong artifact entirely. When a regulator asks for "explainability" and you respond with model weights, you have misunderstood the request.

The distinction has been formalized by several researchers. Doshi-Velez and Kim (2017) define interpretability as the ability "to explain or to present in understandable terms to a human," and Miller (2017) defines it as "the degree to which a human can understand the cause of a decision." Lipton (2016) distinguishes transparency (intrinsic model properties - simulatability, decomposability, algorithmic transparency) from post-hoc interpretability (explanations generated after the fact). The cleaner framing, now widely used in practice:

Interpretability is a property of the model itself - how much a human can understand the model's internal mechanisms just by examining it.

Explainability is a property of an explanation artifact - how well a generated explanation (SHAP values, LIME coefficients, saliency maps) communicates why the model made a specific prediction.

A logistic regression is interpretable. You read its coefficients and immediately understand the decision function. An XGBoost model is not interpretable - there are hundreds of trees with complex interactions. But an XGBoost model can be made explainable: SHAP values can tell you, for this prediction, which features pushed the output up and by how much.

The bank's mistake was having explainability without interpretability and not knowing the difference.


Historical Context - How the Field Got Here

The modern field of explainable AI (XAI) has roots in two separate traditions that eventually collided.

The first tradition is rule-based AI. Expert systems in the 1980s (MYCIN, R1/XCON) were interpretable by construction. They encoded domain knowledge as explicit if-then rules. A clinician could read MYCIN's reasoning chain and agree or disagree with each step. Interpretability was free - it came from the model architecture itself.

The second tradition is statistical machine learning. Linear models, developed over decades, remained interpretable. But as the 1990s progressed and SVMs, then boosted trees, then neural networks achieved better accuracy, the mechanisms became opaque. The community made an implicit trade: give up interpretability to gain performance.

The crisis became visible around 2016. ProPublica published an analysis of COMPAS - a recidivism prediction tool used in US courts to determine bail and sentencing. The tool was a black box. No one could explain why it rated a defendant as high-risk. An investigation found systematic racial bias baked into the predictions. The tool had been in use for years. It could not be audited because it could not be interpreted.

The same year, Ribeiro et al. published LIME. A year later, Lundberg and Lee published SHAP. The field of post-hoc explainability was born - methods that could generate explanations for black-box models after the fact, without requiring the model to be interpretable.

DARPA launched its XAI program in 2017, funding $70M in research. The EU's GDPR took effect in 2018, with Article 22 implying (though not explicitly granting) a "right to explanation." The field accelerated.

In 2019, Cynthia Rudin published "Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead" - arguably the most important paper in the field. Her argument: for tabular data (which covers most high-stakes domains), there is often no accuracy-interpretability tradeoff. The choice of black box is a choice, not a necessity. Use interpretable models and you do not need post-hoc explanations at all.


Core Concepts

Definitions

Interpretability - the degree to which a human can understand the internal mechanisms of a model. An interpretable model is sometimes called a "glass box" or a "transparent model." You understand how it works by looking at it.

  • A linear regression with 10 features: interpretable. The coefficients are the model.
  • A decision tree with depth 4: interpretable. You can draw the tree and trace any prediction by hand.
  • A logistic regression: interpretable. Coefficients give log-odds contributions of each feature.
  • An XGBoost model with 500 trees: not interpretable. You cannot hold 500 trees in your head.
  • A transformer with 110M parameters: not interpretable.

Explainability - the ability to provide a post-hoc explanation for a specific model output. An explainability method takes a trained model, an input, and produces a human-readable artifact describing why the model produced that output.

  • SHAP values: explain what each feature contributed to this prediction
  • LIME: fit a local linear model around this prediction
  • Attention weights: show which input tokens the model attended to
  • Saliency maps: highlight image regions that most influenced the output

A model can be:

  • Interpretable with no need for post-hoc explanation (linear regression)
  • Not interpretable but explainable via post-hoc methods (XGBoost + SHAP)
  • Not interpretable and poorly explained (a neural network where you show attention weights that don't actually correspond to reasoning)

Taxonomy

Intrinsic vs post-hoc: Intrinsic interpretability means the model itself is understandable - no extra step required. Post-hoc explanations are generated after the model has made a prediction.

Global vs local: Global explanations describe overall model behavior. "Which features does this model rely on most, across all predictions?" Local explanations describe a single prediction. "Why was this specific loan application denied?"

Model-agnostic vs model-specific: Model-agnostic methods work with any model (LIME, KernelSHAP - they treat the model as a black box). Model-specific methods exploit the model's internal structure for efficiency or accuracy (TreeSHAP for tree ensembles, Grad-CAM for CNNs).
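To make the model-agnostic idea concrete, here is a minimal sketch using scikit-learn's permutation importance, which only calls the fitted model's scoring interface and never looks inside it. The synthetic dataset and the choice of a random forest are assumptions for illustration; any fitted estimator could be swapped in.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative): with shuffle=False the 3 informative
# features occupy columns 0-2 and columns 3-5 are pure noise.
X, y = make_classification(n_samples=2000, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Model-agnostic: the method shuffles one column at a time and measures the
# drop in model.score(); it never inspects the trees themselves.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for j in np.argsort(result.importances_mean)[::-1]:
    print(f"feature_{j}: {result.importances_mean[j]:+.4f}")
```

Because nothing here depends on the model being a forest, the same call would work on a logistic regression or a neural network wrapper. That generality is what model-specific methods such as TreeSHAP trade away in exchange for speed and exactness.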

Ante-hoc vs post-hoc: Ante-hoc methods design interpretability in from the start - monotone constraints, attention bottlenecks, prototype networks. Post-hoc methods are applied to an already-trained model.

The Interpretability-Accuracy Tradeoff - Is It Real?

The conventional wisdom: complex models (neural nets, boosted trees) are more accurate; simple models (linear regression, shallow decision trees) are more interpretable; you trade one for the other.

Cynthia Rudin's 2019 paper challenges this on tabular data. Her argument:

  1. For most structured/tabular prediction problems, carefully engineered interpretable models (optimal decision trees, logistic regression with good feature engineering, rule sets) match or approach the accuracy of black-box models.
  2. The accuracy gap is usually due to bad feature engineering, not the model class.
  3. Post-hoc explanations of black-box models are themselves approximations - they may not faithfully represent what the black box actually does.
  4. Therefore: for high-stakes tabular decisions, use interpretable models. The tradeoff is a choice, not a necessity.

Where the tradeoff is real: unstructured data (images, text, audio). State-of-the-art image classifiers are deep CNNs or ViTs - there is no interpretable model that matches their accuracy on ImageNet. For NLP tasks, transformer-based models dominate. In these domains, post-hoc explanation is necessary because the interpretable-model alternative does not exist at competitive accuracy.

The practical implication: before defaulting to XGBoost + SHAP, ask whether a logistic regression or shallow decision tree would achieve acceptable accuracy. If yes, use it. You get explanation for free, and the explanation is guaranteed to be faithful.
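A minimal sketch of that baseline check, assuming synthetic data and using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost. The `MAX_ACCEPTABLE_GAP` threshold is a hypothetical value; in practice it would come from the business and compliance context.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                           random_state=0)

candidates = {
    "logistic_regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Same folds and same metric for both candidates
scores = {name: cross_val_score(est, X, y, cv=5, scoring="roc_auc").mean()
          for name, est in candidates.items()}
gap = scores["gradient_boosting"] - scores["logistic_regression"]

MAX_ACCEPTABLE_GAP = 0.01  # hypothetical tolerance, set by the use case
for name, auc in scores.items():
    print(f"{name:22s} ROC AUC: {auc:.3f}")
choice = "logistic_regression" if gap <= MAX_ACCEPTABLE_GAP else "gradient_boosting"
print(f"gap = {gap:+.3f} -> start from: {choice}")
```

On real tabular data the interpretable baseline deserves the same feature engineering effort as the black box before the gap is taken at face value.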

Global vs Local Explanations in Depth

Global explanations answer: what does this model do, in general?

  • Feature importance (which features does the model use most?)
  • Partial dependence plots (how does the model's output change as feature $x_j$ varies?)
  • Model distillation (approximate the model with a simpler, interpretable surrogate)
  • Rule extraction (extract a set of if-then rules that approximate the model)
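The distillation item above can be sketched as a global surrogate: a shallow decision tree trained to imitate a black box's predictions, with fidelity reported on held-out data. The dataset, the random forest "black box," and the depth of 3 are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, _ = train_test_split(X, y, test_size=0.25, random_state=0)

black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# The surrogate learns the black box's *predictions*, not the true labels
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X_tr, black_box.predict(X_tr))

# Fidelity: how often the surrogate agrees with the black box on unseen data
fidelity = (surrogate.predict(X_te) == black_box.predict(X_te)).mean()
print(f"surrogate fidelity: {fidelity:.3f}")
print(export_text(surrogate, feature_names=[f"x{j}" for j in range(5)]))
```

The printed tree is only as trustworthy as its fidelity score. A surrogate that agrees with the black box 80% of the time says nothing reliable about the other 20%.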

Local explanations answer: why did the model make this specific prediction?

  • SHAP values for a single instance
  • LIME explanation for a single instance
  • Counterfactual: what is the smallest change to this input that would change the output?
  • Influential training examples: which training instances most affected this prediction?

tip

In practice, you usually need both. Use global explanations for model auditing and debugging ("is the model relying on spurious features?"). Use local explanations for individual decision accountability ("why was this applicant denied?").
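The counterfactual item in the list of local explanations can be made concrete for a linear model, where the smallest (Euclidean) change that flips the prediction has a closed form: project the input onto the decision boundary and step slightly past it. This is a sketch on synthetic data; the 5% overshoot is an arbitrary choice, and real counterfactual tools add constraints (for example, only changing features the applicant can actually act on).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

x = X[0]
w, b = clf.coef_[0], clf.intercept_[0]
margin = w @ x + b  # signed logit; its sign determines the predicted class

# Move along w just far enough to cross the boundary (5% overshoot)
delta = -(1.05 * margin) / (w @ w) * w
x_cf = x + delta

print("original prediction:      ", clf.predict(x.reshape(1, -1))[0])
print("counterfactual prediction:", clf.predict(x_cf.reshape(1, -1))[0])
print("change per feature:       ", np.round(delta, 3))
```

For a non-linear model there is no closed form, and the counterfactual has to be found by search or optimization against the model's predictions.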


The Interpretability Spectrum

Left side = intrinsically interpretable. Right side = black box, requires post-hoc methods. Each step right increases model capacity and typically accuracy on complex tasks, but reduces transparency.


Code: Logistic Regression vs XGBoost - Two Kinds of "Explanation"

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import xgboost as xgb
import shap

# Generate a synthetic dataset: 6 features, binary classification
X, y = make_classification(
    n_samples=5000,
    n_features=6,
    n_informative=4,
    n_redundant=1,
    n_repeated=1,
    random_state=42
)

feature_names = ["credit_score", "income", "debt_ratio",
                 "employment_years", "recent_inquiries", "age"]
X_df = pd.DataFrame(X, columns=feature_names)

X_train, X_test, y_train, y_test = train_test_split(
    X_df, y, test_size=0.2, random_state=42
)

# ─── LOGISTIC REGRESSION - intrinsic interpretability ────────────────────────

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

lr = LogisticRegression(random_state=42)
lr.fit(X_train_scaled, y_train)

print("=== LOGISTIC REGRESSION - MODEL IS THE EXPLANATION ===")
print(f"Accuracy: {lr.score(X_test_scaled, y_test):.3f}")
print()
print("Feature coefficients (log-odds contributions):")
for name, coef in sorted(zip(feature_names, lr.coef_[0]),
                         key=lambda x: abs(x[1]), reverse=True):
    direction = "increases" if coef > 0 else "decreases"
    print(f" {name:20s} {coef:+.4f} ({direction} log-odds of approval)")

# For a single instance, the explanation is exact and derivable from the model
instance = X_test_scaled[0:1]
instance_df = pd.DataFrame(instance, columns=feature_names)
contributions = instance_df.values[0] * lr.coef_[0]
print()
print("Local explanation for instance 0 (feature × coefficient):")
for name, contrib in sorted(zip(feature_names, contributions),
                            key=lambda x: abs(x[1]), reverse=True):
    print(f" {name:20s} contribution: {contrib:+.4f}")

# ─── XGBOOST - post-hoc explainability required ──────────────────────────────

xgb_model = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=5,
    learning_rate=0.1,
    random_state=42,
    eval_metric="logloss",
    verbosity=0
)
xgb_model.fit(X_train, y_train)

print()
print("=== XGBOOST - POST-HOC SHAP REQUIRED ===")
print(f"Accuracy: {xgb_model.score(X_test, y_test):.3f}")

# SHAP values for XGBoost - TreeSHAP is exact and fast
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test)

print()
print("Global feature importance (mean |SHAP value|):")
mean_abs_shap = np.abs(shap_values).mean(axis=0)
for name, importance in sorted(zip(feature_names, mean_abs_shap),
                               key=lambda x: x[1], reverse=True):
    print(f" {name:20s} mean |SHAP|: {importance:.4f}")

print()
print("Local explanation for instance 0 (SHAP values):")
instance_shap = shap_values[0]
for name, sv in sorted(zip(feature_names, instance_shap),
                       key=lambda x: abs(x[1]), reverse=True):
    direction = "pushes toward approval" if sv > 0 else "pushes toward denial"
    print(f" {name:20s} SHAP: {sv:+.4f} ({direction})")

note

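For logistic regression, the explanation is the model: the coefficients are the same for every prediction, and each feature's local contribution is simply its coefficient times its (scaled) value. For XGBoost, you need an external tool (SHAP) to generate an explanation. TreeSHAP computes exact Shapley values for the tree ensemble rather than sampling-based estimates, but the result is still an additive summary of what the 200 trees collectively computed, not the decision function itself. The logistic regression explanation is faithful by construction. The SHAP explanation is an exact attribution, but it is not a description of the mechanism.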
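The "faithful by construction" claim for logistic regression can be checked directly: the intercept plus the per-feature contributions reproduces the model's log-odds output exactly. The sketch below uses its own small synthetic dataset so it runs standalone.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

x = X[0]
contributions = x * clf.coef_[0]  # one log-odds term per feature
logit_from_parts = clf.intercept_[0] + contributions.sum()
logit_from_model = clf.decision_function(x.reshape(1, -1))[0]

print("per-feature contributions:", np.round(contributions, 4))
print(f"intercept + sum of parts:  {logit_from_parts:.6f}")
print(f"model decision_function:   {logit_from_model:.6f}")
```

SHAP values satisfy an analogous additivity property (base value plus the sum of attributions equals the model output), but for a tree ensemble the individual terms are computed by an attribution algorithm rather than read off the model's parameters.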


The Regulatory Landscape

GDPR Article 22 - Right to Explanation

The General Data Protection Regulation (EU, 2018) contains a provision with major ML implications. Article 22 states that data subjects have the right not to be subject to a decision based solely on automated processing that produces legal or similarly significant effects.

When automated decisions are permitted (for example, with explicit consent or contractual necessity), Articles 13(2)(f), 14(2)(g), and 15(1)(h) require that organizations provide "meaningful information about the logic involved," and Article 22(3) requires safeguards including the right to obtain human intervention and to contest the decision. The exact scope of the "right to explanation" has been debated in legal scholarship. The practical interpretation adopted by most EU regulators: if an automated system denies a loan, refuses insurance, or makes a hiring decision, the organization must be able to explain to the affected individual why.

This is a local explanation requirement - why was this specific decision made for this specific person. A global "our model uses these features" statement does not satisfy Article 22.

EU AI Act (2024)

The EU AI Act, which entered into force in 2024, takes a risk-based approach. Systems in high-risk categories (credit scoring, employment, education admissions, law enforcement, medical devices, critical infrastructure) are subject to strict requirements:

  • Technical documentation including model architecture, training data, and accuracy metrics
  • Logging of system operations
  • Transparency to users - users must be informed they are interacting with an AI system
  • Human oversight - high-risk systems must allow human review and override
  • Accuracy, robustness, and cybersecurity standards

The AI Act does not mandate SHAP values or any specific explanation method. It mandates that organizations can explain their systems and allow human oversight. The practical implication: black-box models in high-risk domains require robust post-hoc explanation infrastructure, audit trails, and human review workflows.

FDA Guidance - AI in Medical Devices

The FDA's 2021 action plan for AI/ML-based Software as a Medical Device (SaMD) calls for developers to maintain transparency about model changes, monitor performance post-deployment, and provide algorithmic transparency to clinicians. The FDA expects that clinicians who act on AI recommendations can understand, at minimum, what signals drive the AI's output.

FINRA - Financial AI

FINRA Regulatory Notice 21-06 covers the use of AI in broker-dealer operations. It requires that firms using AI for recommendations maintain records that allow review of the AI's decision logic, monitor for model drift, and ensure the AI's outputs are auditable. A one-off permutation importance plot is unlikely to satisfy a FINRA examiner. A documented, reproducible explanation pipeline (for example, SHAP-based) is far more defensible.


When Interpretability Is Non-Negotiable

Healthcare

A radiologist using an AI diagnostic tool needs to know not just the prediction ("80% probability of malignancy") but what drove it. If the AI is correct, the radiologist needs to understand why to corroborate the finding. If the AI is wrong, the radiologist needs to catch the error. Showing a saliency map that highlights the relevant lesion region allows the radiologist to agree or disagree. Showing a number with no explanation forces the radiologist to either blindly trust or blindly ignore - neither is medically acceptable.

Finance

Credit models determine access to housing, small business loans, and education. Regulatory and ethical requirements demand that applicants who are denied can receive a meaningful explanation and challenge incorrect information. "Your SHAP value for debt-to-income ratio is +0.34" is not a meaningful explanation to a loan applicant. A well-designed system translates SHAP values into plain-language summaries: "Your debt-to-income ratio of 45% exceeds our typical threshold of 36%."
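A minimal sketch of that translation layer, assuming attribution values have already been computed upstream. The attribution numbers reuse the figures from the opening story. The feature values, template wording, and the `top_reasons` helper are hypothetical.

```python
# Hypothetical attributions for one denied applicant
# (positive = pushed the decision toward denial).
attributions = {"debt_to_income": 0.34, "recent_inquiries": 0.21,
                "employment_years": -0.18}
feature_values = {"debt_to_income": 0.45, "recent_inquiries": 6,
                  "employment_years": 9}

# Hypothetical templates; in a real system these would be written and
# reviewed by compliance and legal staff, not generated on the fly.
TEMPLATES = {
    "debt_to_income": "Your debt-to-income ratio of {value:.0%} is higher than we typically approve.",
    "recent_inquiries": "You have {value} recent credit inquiries, which raised the assessed risk.",
    "employment_years": "Your employment history of {value} years raised the assessed risk.",
}

def top_reasons(attributions, feature_values, k=2):
    """Return plain-language reasons for the k features that pushed hardest toward denial."""
    adverse = [(f, a) for f, a in attributions.items() if a > 0]
    adverse.sort(key=lambda fa: fa[1], reverse=True)
    return [TEMPLATES[f].format(value=feature_values[f]) for f, _ in adverse[:k]]

reasons = top_reasons(attributions, feature_values)
for reason in reasons:
    print("-", reason)
```

Only features that pushed toward denial are reported. A feature with a negative attribution (here, employment duration) helped the applicant and would be misleading as a denial reason.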

Hiring

Automated resume screening is a high-stakes domain. A 2018 Reuters investigation revealed that an experimental Amazon resume screener systematically downweighted resumes containing the word "women's" (as in "women's chess club"). The model was trained on historical hiring data that reflected existing gender bias. According to the report, Amazon's engineers identified the problem internally and the project was eventually abandoned. Systematic attribution analysis (for example, aggregating SHAP values across a test set) is one way to surface such biases during testing, before deployment.

Criminal Justice - The COMPAS Controversy

COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is a black-box risk assessment tool used in US courts to inform bail, sentencing, and parole decisions. In 2016, ProPublica found that the tool was twice as likely to incorrectly flag Black defendants as future criminals compared to white defendants. The tool's developers refused to release the model or its logic, citing trade secrets.

The COMPAS case illustrates the intersection of interpretability, accountability, and justice. When a model affects liberty - whether someone is detained or released - the inability to audit or explain that model is not just a technical problem. It is a justice problem.


Common Mistakes

danger

Presenting SHAP plots to regulators and calling it "model interpretability"

SHAP values are post-hoc attributions, computed one prediction at a time. They explain a specific prediction. Even when aggregated into global importance rankings, they do not describe the model's decision function. A regulator asking for model interpretability may require you to demonstrate that the model's decision boundary is free of proxy discrimination, that it is monotone in certain features, or that it generalizes correctly to edge cases. SHAP values alone cannot show these things.
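One system-level check of the kind described above is to compare outcomes across groups directly, independent of any per-prediction attribution. The sketch below uses simulated decisions and hypothetical group labels. What ratio counts as acceptable is a legal and policy question that the code does not answer.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical audit sample: a postal-code group per applicant and the
# model's decision (1 = approved), simulated with different base rates.
group = rng.choice(["region_a", "region_b"], size=n)
approved = np.where(group == "region_a",
                    rng.random(n) < 0.60,
                    rng.random(n) < 0.45).astype(int)

# Approval rate per group, and the ratio of the lowest to the highest rate
rates = {str(g): approved[group == g].mean() for g in np.unique(group)}
ratio = min(rates.values()) / max(rates.values())

for g, r in rates.items():
    print(f"{g}: approval rate {r:.3f}")
print(f"lowest/highest approval-rate ratio: {ratio:.3f}")
```

A gap in approval rates does not by itself prove discrimination, since the groups may differ in legitimate risk factors. It does flag where a deeper audit of the decision function is needed.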

danger

Using post-hoc explanations as a substitute for model validation

A common pattern: train an XGBoost model, generate SHAP plots, show them to stakeholders, declare the model "explainable," and deploy. The SHAP plots may look reasonable while the model has subtle failures (extrapolating badly to out-of-distribution inputs, relying on spurious correlations that appear in the training data but not in deployment). Explanation is not validation.

warning

Confusing feature importance with causality

SHAP values measure each feature's contribution to the model's prediction. This is a statistical attribution, not a causal statement. If debt-to-income ratio has the highest SHAP value for loan denials, it means the model relies heavily on that feature - not that reducing your debt ratio will cause the model to approve you (it often will, but only if the model's response to that feature is monotone and no correlated feature offsets the change). Always distinguish correlation-based attribution from causal inference.

warning

Selecting the model class based on habit rather than stakes

Many teams default to gradient boosted trees because they usually win on tabular data. But "usually win" means "on benchmark accuracy metrics." In high-stakes regulated domains, an interpretable model with 1% lower accuracy may be the correct engineering choice when you factor in regulatory compliance cost, audit preparation time, and the ability to certify that the model is free of prohibited proxies.


Production Engineering Notes

Build explanation infrastructure before launch. For any model in a regulated domain, explanation must be part of the system design, not retrofitted. This means: explanation APIs alongside prediction APIs, explanation caching (SHAP values for TreeSHAP are cheap; for KernelSHAP they are expensive), explanation storage for audit trails, and a UI layer that translates technical explanations into plain-language outputs.

Version your explanations alongside your models. When a model is retrained, the explanations change - even for the same input. If an auditor asks why a loan was denied 18 months ago, you need the explanation from the model version active at that time. Store model checkpoints with their explanation configurations.
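One way to make this concrete is to store an immutable explanation record next to every decision. The sketch below is an assumption about what such a record might contain; the field names, identifiers, and version strings are hypothetical, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class ExplanationRecord:
    decision_id: str        # ties the explanation to one decision
    model_version: str      # the model that was live at decision time
    explainer: str          # which explanation method produced the values
    explainer_config: dict  # settings needed to reproduce the explanation
    attributions: dict      # feature name -> attribution value
    prediction: float
    created_at: str         # ISO 8601 timestamp, UTC

record = ExplanationRecord(
    decision_id="loan-000123",
    model_version="credit-risk-2.3.1",
    explainer="TreeSHAP",
    explainer_config={"output_space": "log-odds"},
    attributions={"debt_to_income": 0.34, "recent_inquiries": 0.21,
                  "employment_years": -0.18},
    prediction=0.81,
    created_at=datetime.now(timezone.utc).isoformat(),
)

# Serialize to one JSON line for an append-only audit log, then read it back
line = json.dumps(asdict(record), sort_keys=True)
restored = ExplanationRecord(**json.loads(line))
print(line)
```

Storing the attributions themselves, rather than recomputing them later, avoids depending on the old model and explainer versions still being runnable when the auditor asks.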

Monitor explanation drift. If the SHAP value for a given feature suddenly increases in magnitude across predictions, that is a signal that model behavior has shifted - possibly due to input distribution shift. Explanation drift monitoring is a complement to performance monitoring.
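A minimal sketch of such a monitor: compare mean absolute attribution per feature between a reference window and the current window, and flag large relative changes. The attribution matrices here are simulated, and the 50% alert threshold is a hypothetical value.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_names = ["debt_ratio", "credit_score", "income"]

# Hypothetical attribution matrices (rows = predictions, columns = features).
# In production these would be stored SHAP values from two time windows.
reference = rng.normal(0.0, [0.30, 0.20, 0.10], size=(5000, 3))
current = rng.normal(0.0, [0.30, 0.20, 0.25], size=(5000, 3))  # income grew

ref_importance = np.abs(reference).mean(axis=0)
cur_importance = np.abs(current).mean(axis=0)
relative_change = (cur_importance - ref_importance) / ref_importance

THRESHOLD = 0.5  # hypothetical: alert on a >50% change in mean |attribution|
drifted = [name for name, change in zip(feature_names, relative_change)
           if abs(change) > THRESHOLD]

for name, change in zip(feature_names, relative_change):
    print(f"{name:14s} relative change in mean |attribution|: {change:+.2f}")
print("drifted features:", drifted)
```

An alert from this monitor does not say why behavior shifted. It is a prompt to check the input distribution and recent retraining for the flagged feature.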

Choose the right explanation type for the audience. A data scientist wants SHAP waterfall plots. A loan officer wants plain-language summaries ("The primary reason for denial was your debt-to-income ratio"). A regulator wants a documented explanation methodology with validation evidence. Build all three.


YouTube Resources

  • "Interpretable Machine Learning" - Christoph Molnar (PyData): The author of the textbook walks through the full taxonomy. Excellent for building the complete mental model.
  • "Stop Explaining Black Box ML" - Cynthia Rudin (ML street talk): Rudin makes her case directly. Best technical argument against reflexive use of black-box models.
  • "The Mythos of Model Interpretability" - Zachary Lipton: The 2016 paper that formalized the interpretability taxonomy. Lipton's talks are widely referenced.
  • "GDPR and AI" - European Data Protection Board: Official guidance on Article 22 implementation.

Interview Q&A

Q: What is GDPR Article 22 and what does it require of ML systems?

Article 22 gives EU data subjects the right not to be subject to solely automated decisions that produce legal or similarly significant effects. When such decisions are permitted, organizations must provide "meaningful information about the logic involved" (Articles 13-15) along with safeguards such as human intervention and the right to contest the decision (Article 22(3)). In practice, this is usually read as requiring per-prediction explanations of why a specific decision was made for a specific individual. Global feature importance alone is unlikely to be sufficient. A logistic regression whose coefficients are documented, or an XGBoost model with a SHAP explanation pipeline, gives you the technical basis for meeting this requirement. A black-box model with no explanation capability does not.

Q: Cynthia Rudin argues we should stop using black-box models for high-stakes decisions. Summarize her argument and evaluate it.

Rudin's argument (2019): for tabular/structured data, the accuracy gap between interpretable models and black-box models is small and often zero. The gap is typically due to inferior feature engineering for the interpretable model, not an inherent limitation of interpretable model classes. Post-hoc explanations (SHAP, LIME) are themselves approximations - they can be unfaithful, unstable, or misleading. For high-stakes decisions where errors affect people's lives, using an interpretable model gives you a guarantee: the explanation is the model. Post-hoc explanations give you an approximation of an approximation.

The argument is strong for tabular data. It does not apply to unstructured data (images, text) where interpretable models cannot approach black-box accuracy. In practice: for tabular credit, hiring, or medical record tasks, evaluate whether a logistic regression or optimal decision tree achieves acceptable accuracy before defaulting to XGBoost.

Q: What is the difference between global and local explanations, and when do you need each?

A global explanation describes overall model behavior - which features the model relies on, how the model's output changes as a feature varies, what the model's decision boundary looks like. Tools: feature importance, partial dependence plots, model distillation, rule extraction. Use global explanations for model debugging ("is the model relying on a proxy for protected attributes?"), model auditing, and model documentation.

A local explanation describes a single prediction - which features drove this specific output. Tools: SHAP values for an instance, LIME for an instance, counterfactuals. Use local explanations for per-decision accountability ("why was this applicant denied?"), debugging unusual predictions, and regulatory compliance for individual cases.

Q: What are the four categories in the interpretability taxonomy?

  1. Intrinsic (transparent) vs post-hoc: Intrinsic means the model itself is understandable. Post-hoc means an explanation is generated after the fact.
  2. Global vs local: global explains overall behavior; local explains a single prediction.
  3. Model-agnostic vs model-specific: agnostic methods (LIME, KernelSHAP) treat the model as a black box; specific methods (TreeSHAP, Grad-CAM) exploit model structure.
  4. Ante-hoc vs post-hoc (sometimes used): ante-hoc means interpretability is designed in (monotone constraints, sparse models); post-hoc means explanation is generated after training.

Q: In what domains is interpretability non-negotiable, and why?

Healthcare: Clinicians must be able to agree or disagree with an AI recommendation based on clinical reasoning. A black-box diagnosis without explanation cannot be properly overridden or validated.

Finance and credit: Regulatory requirements (GDPR, Equal Credit Opportunity Act, FINRA) mandate explainability for automated decisions. Denied applicants have the right to know why.

Hiring: Automated screening tools must not discriminate based on protected characteristics. Without interpretability, proxy discrimination is undetectable until harm has occurred at scale.

Criminal justice: Decisions affecting liberty (bail, sentencing, parole) carry due process implications. An unauditable algorithm making such decisions cannot be challenged, even when wrong.

Scientific ML: When ML is used to generate scientific hypotheses or inform policy, the mechanism must be understandable to domain experts who evaluate and trust the findings.


Designing for Interpretability from the Start

Most teams reach for interpretability tools after a model is already deployed and a problem has emerged. The better approach is ante-hoc design: making deliberate architectural choices during model development that preserve interpretability.

Monotone Constraints

For many real-world problems, you know the direction of certain feature effects from domain knowledge. A credit model should predict higher risk for higher debt-to-income ratios. A medical risk model should predict higher mortality for lower blood oxygen saturation. These domain constraints can be encoded directly into gradient boosted tree models:

import xgboost as xgb
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic credit dataset
# Features: [credit_score, debt_ratio, income, employment_years, recent_inquiries]
# Domain knowledge: credit_score negatively associated with default (higher score = lower risk)
# debt_ratio positively associated with default (higher = more risk)
# income negatively associated (higher = lower risk)

np.random.seed(42)
n = 3000
credit_score = np.random.uniform(300, 850, n)
debt_ratio = np.random.uniform(0.1, 0.9, n)
income = np.random.uniform(20000, 200000, n)
employment_years = np.random.uniform(0, 30, n)
recent_inquiries = np.random.randint(0, 10, n)

# True underlying risk (logistic)
logit = (-0.005 * credit_score + 2.0 * debt_ratio
         - 0.000008 * income - 0.03 * employment_years
         + 0.15 * recent_inquiries)
prob_default = 1 / (1 + np.exp(-logit))
y = (np.random.rand(n) < prob_default).astype(int)

X = np.column_stack([credit_score, debt_ratio, income,
                     employment_years, recent_inquiries])
feature_names = ["credit_score", "debt_ratio", "income",
                 "employment_years", "recent_inquiries"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Model WITHOUT monotone constraints - may learn perverse relationships
model_unconstrained = xgb.XGBClassifier(
    n_estimators=200, max_depth=4, random_state=42,
    eval_metric="logloss", verbosity=0
)
model_unconstrained.fit(X_train, y_train)

# Model WITH monotone constraints
# monotone_constraints: +1 = positive monotone, -1 = negative monotone, 0 = unconstrained
# Order matches feature order: credit_score, debt_ratio, income, employment_years, recent_inquiries
model_constrained = xgb.XGBClassifier(
    n_estimators=200, max_depth=4, random_state=42,
    eval_metric="logloss", verbosity=0,
    monotone_constraints=(-1, +1, -1, -1, +1)
    # credit_score: higher score → lower risk (-1)
    # debt_ratio: higher ratio → higher risk (+1)
    # income: higher income → lower risk (-1)
    # employment_years: more years → lower risk (-1)
    # recent_inquiries: more inquiries → higher risk (+1)
)
model_constrained.fit(X_train, y_train)

print("Unconstrained accuracy:", model_unconstrained.score(X_test, y_test))
print("Constrained accuracy:", model_constrained.score(X_test, y_test))

# Verify monotonicity: sweep credit_score while holding the other features fixed
credit_range = np.linspace(300, 850, 100)
test_instance_base = np.array([600, 0.3, 60000, 5, 2], dtype=float)

# One row per credit_score value; all other columns stay at the base values
sweep = np.tile(test_instance_base, (len(credit_range), 1))
sweep[:, 0] = credit_range

preds_u = model_unconstrained.predict_proba(sweep)[:, 1]
preds_c = model_constrained.predict_proba(sweep)[:, 1]

print("\nMonotonicity check (credit_score, should be decreasing = lower risk):")
# Use the same tolerance for both models so the comparison is fair
print(f" Unconstrained: monotone decreasing? {bool(np.all(np.diff(preds_u) <= 1e-6))}")
print(f" Constrained: monotone decreasing? {bool(np.all(np.diff(preds_c) <= 1e-6))}")

Monotone constraints improve interpretability, usually at little or no cost in accuracy when the constrained direction matches the true relationship. They also prevent the model from learning spurious reversals (e.g., "having a very high credit score slightly increases default risk") that may be artifacts of training data noise.

Sparsity and Feature Selection

Sparse models - models that use few features - are more interpretable because there are fewer moving parts to understand. Lasso regression enforces sparsity through L1 regularization:

from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# L1-penalized (lasso) logistic regression: C is the inverse regularization
# strength, so smaller C means stronger regularization and more zeroed coefficients
lasso_lr = LogisticRegressionCV(
    penalty="l1",
    solver="saga",
    Cs=[0.001, 0.01, 0.1, 1.0, 10.0],
    cv=5,
    max_iter=5000,  # saga can need more than the default 100 iterations to converge
    random_state=42
)
lasso_lr.fit(X_train_s, y_train)

print("\nLasso Logistic Regression:")
print(f"Best C: {lasso_lr.C_[0]:.4f}")
print(f"Accuracy: {lasso_lr.score(X_test_s, y_test):.4f}")
print("\nFeature coefficients (zero = removed by lasso):")
for name, coef in zip(feature_names, lasso_lr.coef_[0]):
    status = "ACTIVE" if abs(coef) > 1e-6 else "zeroed out"
    print(f"  {name:20s} {coef:+.4f} ({status})")
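To see why an L1 penalty produces exact zeros rather than merely small coefficients, it helps to look at the soft-thresholding operator. For least squares with an orthonormal design, the lasso solution is the ordinary least-squares coefficient passed through this operator. The sketch below is a simplified illustration of that special case, not the logistic model fitted above, and the coefficient values are made up.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; anything within [-lam, lam] becomes exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Hypothetical unpenalized coefficients (illustrative values only)
ols_coefs = {
    "debt_ratio": 0.90,
    "credit_score": -0.60,
    "income": -0.08,
    "employment_years": 0.03,
}

active_by_lam = {}
for lam in (0.0, 0.05, 0.5):
    shrunk = {name: soft_threshold(c, lam) for name, c in ols_coefs.items()}
    active_by_lam[lam] = [name for name, c in shrunk.items() if c != 0.0]
    print(f"lambda={lam}: active features = {active_by_lam[lam]}")
```

As the penalty grows, weak features drop out first, which is the behavior the `Cs` grid above explores (recall that `C` is the inverse of the penalty strength).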

Generalized Additive Models (GAMs)

GAMs are a class of models that are more expressive than linear models but remain interpretable. A GAM models the target as:

$$
g(E[y]) = \beta_0 + \sum_{j=1}^{p} f_j(x_j)
$$

where each $f_j$ is a smooth (but nonlinear) function of a single feature. Because each feature has its own shape function $f_j$, you can visualize and understand the model's behavior one feature at a time. There are no interaction terms (unless explicitly added as $f_{ij}(x_i, x_j)$ pairwise interactions).
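A small worked example may make the additive structure concrete. The sketch below assumes a logit link for $g$ and uses three hand-written shape functions. The function names, the intercept, and every numeric value are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical shape functions, one per feature (values are made up)
def f_credit_score(score):
    return -0.01 * (score - 650)  # risk falls roughly linearly with score

def f_debt_ratio(ratio):
    return 0.0 if ratio < 0.35 else 6.0 * (ratio - 0.35)  # flat, then rising

def f_inquiries(n):
    return 0.3 * min(n, 4)  # effect saturates after four inquiries

BETA_0 = -1.0  # intercept on the log-odds scale

def gam_predict(score, ratio, inquiries):
    """Sum the per-feature contributions, then invert the logit link."""
    contributions = {
        "credit_score": f_credit_score(score),
        "debt_ratio": f_debt_ratio(ratio),
        "recent_inquiries": f_inquiries(inquiries),
    }
    log_odds = BETA_0 + sum(contributions.values())
    probability = 1 / (1 + math.exp(-log_odds))
    return probability, contributions

prob, parts = gam_predict(score=720, ratio=0.52, inquiries=6)
for name, value in parts.items():
    print(f"{name:18s} {value:+.2f} log-odds")
print(f"Predicted default probability: {prob:.3f}")
```

Each contribution depends on one feature only, so the per-feature numbers printed here are the model itself rather than a post-hoc approximation of it.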

The interpret library implements Explainable Boosting Machine (EBM), which is a GA2M (GAM with pairwise interactions):

# pip install interpret
try:
    from interpret.glassbox import ExplainableBoostingClassifier
    from interpret import show

    ebm = ExplainableBoostingClassifier(random_state=42)
    ebm.fit(X_train, y_train)
    print(f"\nEBM accuracy: {ebm.score(X_test, y_test):.4f}")

    # Global explanation: shape functions for each feature
    ebm_global = ebm.explain_global()
    # show(ebm_global)  # launches interactive visualization

    # Local explanation: contribution of each feature for a specific instance
    ebm_local = ebm.explain_local(X_test[:5], y_test[:5])
    # show(ebm_local)  # shows waterfall-like plot for each instance

    print("EBM shape functions available for each feature - interpretable by design")
    print("EBM is a top alternative to SHAP+XGBoost in regulated domains")

except ImportError:
    print("interpret library not installed. Install with: pip install interpret")
    print("EBM (Explainable Boosting Machine) = GAM with boosted trees for shape functions")
    print("Key property: each feature has an independent shape function, visualizable as a curve")

GAMs and EBMs represent the best of both worlds for tabular data: they can model complex nonlinear feature effects while remaining fully interpretable - each feature's effect is a single curve you can inspect.


Translating Technical Explanations for Non-Technical Audiences

Technical explanations (SHAP waterfall plots, LIME coefficients) satisfy data scientists and regulators with technical backgrounds. They do not satisfy loan applicants, patients, or hiring managers. Building effective explainability systems requires translating technical outputs into plain-language summaries.

Plain-Language Explanation Templates

def generate_plain_language_explanation(shap_values, feature_names,
                                        feature_values, prediction,
                                        domain="credit"):
    """
    Convert SHAP values into plain-language explanations for different audiences.

    Assumes positive SHAP values push toward the positive class
    (prediction == 1: denial for credit, high risk for medical).
    """
    # Sort features by absolute SHAP value
    sorted_features = sorted(
        zip(feature_names, shap_values, feature_values),
        key=lambda t: abs(t[1]), reverse=True
    )

    # Templates by domain
    templates = {
        "credit": {
            "denial_primary": "The primary reason your application was not approved is {reason}.",
            "denial_secondary": "Additional factors include {reasons}.",
            "approval_primary": "Your application was approved primarily because of {reason}.",
        },
        "medical": {
            "high_risk": "The model flagged elevated risk based primarily on {reason}.",
            "low_risk": "Risk assessment was low, primarily due to {reason}.",
        }
    }
    if domain not in templates:
        raise ValueError(f"Unsupported domain: {domain}")

    def describe(name, shap_value, value):
        """Describe one feature using its own SHAP value and raw value."""
        favorable = shap_value < 0  # negative SHAP lowers the predicted risk
        descriptions = {
            "debt_ratio": f"a {'low' if favorable else 'high'} debt-to-income ratio ({value:.1%})",
            "credit_score": f"a {'strong' if favorable else 'weak'} credit score ({value:.0f})",
            "income": f"{'sufficient' if favorable else 'insufficient'} income (${value:,.0f}/year)",
            "employment_years": f"{'stable' if favorable else 'limited'} employment history ({value:.1f} years)",
            "recent_inquiries": f"{value:.0f} recent credit inquiries",
        }
        return descriptions.get(name, name)

    # Cite only features that push toward the predicted outcome
    adverse = prediction == 1
    drivers = [t for t in sorted_features if (t[1] > 0) == adverse] or sorted_features

    top_desc = describe(*drivers[0])
    secondary_descs = [
        describe(f, sv, v)
        for f, sv, v in drivers[1:3]
        if abs(sv) > 0.05  # only include material contributors
    ]

    if domain == "credit":
        key = "denial_primary" if adverse else "approval_primary"
    else:
        key = "high_risk" if adverse else "low_risk"
    explanation = templates[domain][key].format(reason=top_desc)
    if domain == "credit" and adverse and secondary_descs:
        explanation += " " + templates["credit"]["denial_secondary"].format(
            reasons=", ".join(secondary_descs)
        )

    return {
        "plain_language": explanation,
        "technical_shap": {f: float(sv) for f, sv, _ in sorted_features[:5]},
        "appeal_guidance": (
            "To discuss this decision or provide additional information, "
            "contact our lending team within 60 days."
        ) if adverse else None
    }

# Example usage: names, SHAP values, and raw values must share the same order
example_features = ["debt_ratio", "credit_score", "income",
                    "employment_years", "recent_inquiries"]
example_shap = [0.45, -0.12, 0.08, -0.03, 0.22]
example_values = [0.52, 720, 58000, 4.5, 6]

explanation = generate_plain_language_explanation(
    example_shap, example_features, example_values,
    prediction=1, domain="credit"
)
print("\nPlain-language explanation:")
print(f"  {explanation['plain_language']}")
if explanation['appeal_guidance']:
    print(f"  {explanation['appeal_guidance']}")

The plain-language layer is essential for regulatory compliance and for treating affected individuals with dignity. A SHAP waterfall plot is not a meaningful explanation for a loan applicant. A clear, accurate sentence is.


Faithfulness - Can You Trust the Explanation?

The most important question about any explanation method is often left unasked: is the explanation actually faithful to the model?

A faithful explanation is one that accurately describes what the model does. An unfaithful explanation looks reasonable but describes something different from the model's actual computation.

LIME explanations may be unfaithful: LIME fits a linear model locally, but the local $R^2$ of the linear approximation may be low - the linear model does not accurately capture the black box's behavior in the neighborhood. Low $R^2$ means the LIME explanation is not a reliable description of the model.

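SHAP explanations from TreeSHAP are exact for tree ensembles: TreeSHAP computes exact Shapley values rather than sampling, so the base value plus the attributions sums exactly to the model's output (by default the raw log-odds margin for boosted classifiers, not the probability). KernelSHAP and DeepSHAP are approximations, so their attributions carry estimation error.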

Attention weights in transformers may be unfaithful: Jain and Wallace (2019) showed that attention weights do not always correlate with gradient-based importance measures. High attention weight on a token does not necessarily mean the model's prediction depends on that token.
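The local-fidelity concern raised for LIME above can be made concrete with a hand-rolled, one-dimensional sketch. It does not use the `lime` library: it fits a kernel-weighted linear surrogate around a point and reports the weighted $R^2$. The black-box function, the deterministic perturbation grid (LIME itself samples randomly), and the name `local_linear_r2` are assumptions made for illustration.

```python
import math

def local_linear_r2(f, x0, kernel_width, n_points=201):
    """Fit a Gaussian-weighted linear surrogate to f around x0; return (slope, weighted R^2)."""
    # Deterministic perturbations spanning +/- 3 kernel widths around x0
    xs = [x0 + kernel_width * (-3 + 6 * i / (n_points - 1)) for i in range(n_points)]
    ys = [f(x) for x in xs]
    ws = [math.exp(-((x - x0) ** 2) / (2 * kernel_width ** 2)) for x in xs]

    sw = sum(ws)
    x_mean = sum(w * x for w, x in zip(ws, xs)) / sw
    y_mean = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - x_mean) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_mean) * (y - y_mean) for w, x, y in zip(ws, xs, ys))

    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    ss_res = sum(w * (y - (intercept + slope * x)) ** 2 for w, x, y in zip(ws, xs, ys))
    ss_tot = sum(w * (y - y_mean) ** 2 for w, y in zip(ws, ys))
    return slope, 1 - ss_res / ss_tot

def black_box(x):
    return x ** 2  # stand-in for a nonlinear model

slope_narrow, r2_narrow = local_linear_r2(black_box, x0=1.0, kernel_width=0.1)
slope_wide, r2_wide = local_linear_r2(black_box, x0=1.0, kernel_width=2.0)
print(f"Narrow neighborhood: slope={slope_narrow:.2f}, R^2={r2_narrow:.3f}")
print(f"Wide neighborhood:   slope={slope_wide:.2f}, R^2={r2_wide:.3f}")
```

With a narrow kernel the surrogate tracks the curve closely; with a wide one the same linear form fits poorly, and its coefficient should not be read as a description of the model.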

Always validate your explanation method:

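def check_explanation_faithfulness(model, explainer, X_sample, n_instances=100,
                                   output_space="probability", tol=1e-4):
    """
    Check SHAP additivity (the efficiency axiom): the base value plus the sum
    of SHAP values should reproduce the model output for each instance.

    output_space must match what the explainer attributes: "probability", or
    "log_odds" for explainers that work on the raw margin (the TreeExplainer
    default for XGBoost classifiers). This does not measure the local fit of
    a surrogate such as LIME; check that separately.
    """
    X = X_sample[:n_instances]
    predictions = model.predict_proba(X)[:, 1]
    if output_space == "log_odds":
        p = np.clip(predictions, 1e-12, 1 - 1e-12)
        predictions = np.log(p / (1 - p))

    shap_vals = explainer.shap_values(X)
    # Older shap releases return one array per class; newer ones may return
    # a single (instances, features, classes) array
    if isinstance(shap_vals, list):
        shap_vals = shap_vals[1]
    elif shap_vals.ndim == 3:
        shap_vals = shap_vals[:, :, 1]

    expected_val = np.ravel(explainer.expected_value)
    expected_val = expected_val[1] if expected_val.size > 1 else expected_val[0]

    # Efficiency axiom: E[f] + sum(SHAP) should equal model output
    reconstructed = expected_val + shap_vals.sum(axis=1)

    # For tree models, the error should be close to floating-point precision
    # For KernelSHAP, some approximation error is expected
    max_error = np.abs(reconstructed - predictions).max()
    mean_error = np.abs(reconstructed - predictions).mean()

    print(f"Faithfulness check ({len(predictions)} instances, {output_space} space):")
    print(f"  Max reconstruction error: {max_error:.6f}")
    print(f"  Mean reconstruction error: {mean_error:.6f}")
    print(f"  {'PASS (additive within tolerance)' if max_error < tol else 'APPROXIMATE (check nsamples or output_space)'}")

    return {
        "max_error": float(max_error),
        "mean_error": float(mean_error),
        "is_faithful": bool(max_error < tol)
    }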

Faithfulness checks should be part of your model evaluation pipeline, not an afterthought. If your explanation method is not faithful, you are providing a false sense of understanding.


The Human Factors of Interpretability

Building technically correct explanations is necessary but not sufficient. Explanations must also be useful to the humans who receive them. The field of human-computer interaction has studied this carefully, and the findings are important for ML engineers.

Cognitive Load and Explanation Complexity

Miller (2019) reviewed the psychological literature on explanations and identified several key findings:

Humans prefer contrastive explanations. When asked "why did the model deny my loan?", the most useful answer is not "here are all the factors" but "what is the minimum difference from an approved case?" This is why counterfactual explanations (lesson 07) are so powerful: they naturally answer the contrastive question.

Humans have limited working memory. An explanation with 15 features is less useful than one with 3-5. Research suggests that more than 5-7 features in an explanation increases cognitive load without improving understanding. This motivates sparse explanations.

Social explanations beat statistical ones. "Your income is below our typical threshold for this loan size" is processed more easily than "SHAP value for income = +0.23 log-odds toward denial." Statistical literacy is not uniformly distributed. Design explanation systems for the actual audience, not the data science team.

The Right Level of Explanation Detail

Different stakeholders need different explanation depth:

| Stakeholder | What they need | Format |
|---|---|---|
| Loan applicant | Why was I denied? What can I change? | 1-3 plain-language sentences |
| Loan officer | What are the key risk factors? | Top-5 SHAP drivers with context |
| Compliance team | Is the model using prohibited factors? | Feature importance analysis, bias metrics |
| Regulator | Can the model be audited? | Full technical documentation, methodology |
| Data scientist | Where is the model failing? | Full SHAP plots, ICE plots, residual analysis |

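Build explanation pipelines that derive the attribution-based views from a single underlying SHAP computation, so that every audience sees a consistent story. Bias metrics and methodology documentation are produced alongside it rather than from SHAP alone.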
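One way to keep the views consistent is to compute the attributions once and derive each audience's view from the same vector. The sketch below is a minimal illustration: `build_views` and the view names are hypothetical, the SHAP values are made up, and it assumes positive values push toward denial.

```python
def build_views(feature_names, shap_values, top_k=3):
    """Derive several explanation views from one SHAP vector.

    Assumes positive SHAP values push toward denial.
    """
    ranked = sorted(
        zip(feature_names, shap_values),
        key=lambda t: abs(t[1]), reverse=True
    )
    return {
        # Applicant: only the main factors that counted against them
        "applicant": [name for name, sv in ranked if sv > 0][:top_k],
        # Loan officer: top-5 drivers in either direction, with magnitudes
        "loan_officer": ranked[:5],
        # Data scientist: the full attribution vector
        "data_scientist": dict(zip(feature_names, shap_values)),
    }

names = ["debt_ratio", "credit_score", "income",
         "employment_years", "recent_inquiries"]
shap_row = [0.45, -0.12, 0.08, -0.03, 0.22]  # illustrative values

views = build_views(names, shap_row)
print("Applicant view:   ", views["applicant"])
print("Loan officer view:", views["loan_officer"])
```

Because every view is a projection of the same numbers, the applicant's letter and the loan officer's screen cannot disagree about which factor mattered most.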


Model Documentation - The "Model Card" Pattern

Google's Model Cards for Model Reporting (Mitchell et al. 2019) and Hugging Face's model card standard provide a structured way to document model behavior, limitations, and intended use. Model cards are the highest-level form of global interpretability: they communicate what a model does (and does not do) in plain language.

A model card for a credit scoring system might include:

## Model Card: Credit Default Risk Classifier v2.1

### Intended Use
- Primary use: automated pre-screening for mortgage applications at $200K-$800K
- In-scope: US applicants with at least 12 months of credit history
- Out-of-scope: business loans, international applicants, thin-file applicants

### Performance
- AUC: 0.89 (holdout), 0.87 (90-day validation)
- False positive rate: 12% at 0.5 threshold (12% of safe applicants flagged as risky)
- False negative rate: 8% at 0.5 threshold (8% of risky applicants approved)

### Key Feature Contributions (SHAP global)
1. Debt-to-income ratio (mean |SHAP| = 0.42)
2. Credit score (mean |SHAP| = 0.38)
3. Employment duration (mean |SHAP| = 0.21)
4. Recent credit inquiries (mean |SHAP| = 0.18)
5. Income (mean |SHAP| = 0.15)

### Fairness Analysis
- Model does not use race, gender, religion, or national origin
- Zip code is excluded (proxy for race)
- Approval rate by demographic group: [table]
- SHAP analysis: no protected-group proxy variables in top-10 features

### Known Limitations
- Training data: 2018-2023 (does not reflect post-2023 economic conditions)
- Thin-file applicants (< 12 months credit history) may be systematically underserved
- Model performance degrades for loan amounts outside $200K-$800K range

### Human Oversight
- All denials reviewed by loan officer before communication to applicant
- Applicants may request human review of automated decisions
- Appeals process: [link]

Model cards are increasingly required by regulators and institutional users. They operationalize the distinction between interpretability (what the model does overall) and explainability (what it did for this specific decision).
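The "Key Feature Contributions" section of such a card can be generated rather than typed by hand. The sketch below computes mean |SHAP| per feature from a matrix of per-instance SHAP values and renders the ranked list. The helper names and the tiny matrix are hypothetical; in practice the matrix would come from your explainer.

```python
def global_importance(feature_names, shap_matrix):
    """Rank features by mean absolute SHAP value across instances."""
    n = len(shap_matrix)
    means = [
        sum(abs(row[j]) for row in shap_matrix) / n
        for j in range(len(feature_names))
    ]
    return sorted(zip(feature_names, means), key=lambda t: t[1], reverse=True)

def render_model_card_section(ranking):
    """Format the ranking in the style of the model card above."""
    lines = ["### Key Feature Contributions (SHAP global)"]
    for i, (name, value) in enumerate(ranking, start=1):
        lines.append(f"{i}. {name} (mean |SHAP| = {value:.2f})")
    return "\n".join(lines)

# Illustrative SHAP matrix: 3 instances x 3 features (made-up numbers)
card_features = ["credit_score", "debt_ratio", "income"]
shap_matrix = [
    [-0.3, 0.4, 0.1],
    [0.2, -0.5, 0.0],
    [-0.4, 0.3, 0.2],
]

ranking = global_importance(card_features, shap_matrix)
section = render_model_card_section(ranking)
print(section)
```

Generating this section from the same SHAP run that feeds local explanations keeps the card in sync with the deployed model version.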


Practice Problems

  1. A healthcare company trains an XGBoost model on patient data to predict 30-day readmission. The model uses 47 features including patient age, diagnosis codes, lab values, and hospital ID. A clinician asks you: "Why did the model flag this specific patient as high-risk?" What tool do you use, what information do you present, and how do you translate it for a non-technical clinician?

  2. Your team's loan denial model is being audited under GDPR Article 22. The auditor requests a demonstration that the model does not use postal code as a proxy for race. Describe the analysis you would conduct using SHAP and what results would satisfy the auditor.

  3. You have a dataset with 100 features, including several groups of highly correlated financial metrics. You need to select the 10 most important features for a new, simpler model. Compare the results you would expect from MDI, permutation importance, and SHAP global importance. Which would you trust, and why?

  4. A random forest model achieves 95% accuracy on a medical imaging classification task. A researcher claims this high accuracy demonstrates the model understands the disease pathology. What experiments would you run with explainability tools to test this claim? What would a "right for the right reasons" vs "right for the wrong reasons" result look like?

  5. Your company wants to deploy a model for automated resume screening. Legal counsel has reviewed the system and is satisfied it does not explicitly use protected attributes. How would you use interpretability methods to identify potential proxy discrimination that the legal review missed?


Summary - Interpretability vs Explainability

The bank's Frankfurt audit failure is the right mental model to keep. Explainability (SHAP for a specific decision) and interpretability (understanding the model's overall logic) are different things that satisfy different stakeholders. Conflating them leads to delivering the wrong artifact to the wrong audience.

The key distinctions to carry:

Interpretability = a property of the model. Linear regression, shallow decision trees, and GAMs are interpretable. Neural networks and boosted tree ensembles are not.

Explainability = a property of an explanation artifact. SHAP, LIME, and saliency maps are post-hoc explanations. They describe what the model did for a specific input - they do not make the model interpretable.

The tradeoff is real for unstructured data (images, text) but overstated for tabular data. For tabular high-stakes decisions, always evaluate whether an interpretable model achieves acceptable accuracy before choosing a black box that will require explanation infrastructure.

The regulatory landscape is moving toward requiring both. The EU AI Act imposes transparency, documentation, and human-oversight requirements on AI systems in high-risk domains. This is not just a compliance checkbox - it reflects a genuine societal need. Models that affect people's access to credit, healthcare, employment, and liberty should be understandable to the humans they affect.

The rest of this module builds the technical toolkit: SHAP for feature attribution (lesson 02), LIME for local approximation (lesson 03), feature importance methods (lesson 04), attention and saliency for deep learning (lessons 05-06), counterfactuals for actionable recourse (lesson 07), production infrastructure (lesson 08), and evaluation of explanation quality (lesson 09).


Key Papers and Resources

Foundational papers:

  • Lipton, Z. (2016). "The Mythos of Model Interpretability." Queue. The first systematic taxonomy of interpretability concepts. Defines transparency, decomposability, simulatability, and post-hoc interpretation.
  • Doshi-Velez, F. & Kim, B. (2017). "Towards a Rigorous Science of Interpretable Machine Learning." Defines interpretability rigorously and proposes evaluation frameworks.
  • Rudin, C. (2019). "Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead." Nature Machine Intelligence. The most cited argument for interpretable models in tabular domains.

Regulatory references:

  • GDPR Article 22 and Recitals 71, 86, 91: official text of the automated decision-making provisions.
  • EU AI Act (Regulation EU 2024/1689): high-risk AI system requirements in Annex III and Articles 9-15.
  • FDA (2021). "Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) Action Plan."
  • FINRA (2021). "Regulatory Notice 21-06: Artificial Intelligence."

Books:

  • Molnar, C. (2022). Interpretable Machine Learning (2nd ed.). Free at christophm.github.io/interpretable-ml-book. The definitive reference covering every method in this module.
  • Biecek, P. & Burzykowski, T. (2021). Explanatory Model Analysis. Free at ema.drwhy.ai. Strong on the evaluation and comparison of explanation methods.

Libraries:

  • shap: pip install shap - the primary library for SHAP values, all algorithms
  • lime: pip install lime - Ribeiro's original LIME implementation
  • alibi: pip install alibi - Anchors, ALE, integrated gradients, counterfactuals
  • interpret: pip install interpret - EBM/GAM models, interactive visualizations
  • alepython: pip install alepython - standalone ALE plots


© 2026 EngineersOfAI. All rights reserved.