What is feature importance?

Permutation importance, impurity-based importance, partial dependence plots, ALE, H-statistics, Sobol indices, and production monitoring - the complete toolkit for understanding which features drive your model's decisions, and when each method lies to you.

How does permutation importance work in practice?

Feature Importance Methods - Beyond SHAP covers feature importance, permutation importance, MDI from first principles with code examples. Free lesson at https://engineersofai.com/docs/ml/explainability-and-interpretability/feature-importance-methods

What is the difference between feature importance and MDI?

See the full breakdown at https://engineersofai.com/docs/ml/explainability-and-interpretability/feature-importance-methods

Feature Importance Methods - Beyond SHAP

Reading time: 55 min | Interview relevance: High - feature selection and model debugging questions | Target roles: ML Engineer, Data Scientist, AI Engineer, MLOps Engineer

The ICU Model That Got the Wrong Answer for the Right Reasons

It is 2015. A healthcare data scientist at a major hospital presents a random forest model for predicting ICU mortality risk. The model achieves an AUC of 0.91 on the validation set - state of the art for the dataset. The clinical team is called in to review feature importance.

The top feature: patient age. The coefficient is large, the bar is long, the display is convincing. The clinical team is uncomfortable. "Age should not matter this much," the intensivist says. "Once you control for severity of illness - APACHE score, lactate levels - age adds very little independent prognostic value. Are you sure this is right?"

The data scientist goes back to the analysis. The feature importance was computed using Gini importance, also called Mean Decrease in Impurity (MDI). Age was stored as a continuous float with many unique values. Gini importance is systematically biased toward high-cardinality features - features with many unique values tend to generate more node splits and accumulate more total impurity reduction, even when their actual predictive contribution is modest.

A re-analysis using permutation importance tells a different story. APACHE score and lactate level are the dominant features. Age drops to fifth place. The clinical team nods: that matches their domain knowledge. The original result was not wrong because the model was wrong - the model was fine. It was wrong because the importance measure was biased.

Six months later, the same team discovers another problem. They are monitoring feature importances across weekly model retraining runs and notice that sodium_level - a feature that had always ranked 4th - suddenly ranks 9th. No code change. The model's AUC is unchanged. The ETL pipeline had introduced a subtle normalization change that altered the numeric scale of sodium values. The importance drift detected a data quality issue before any downstream metrics flagged it.

This lesson is about understanding exactly when each feature importance method is correct, when it lies, what to use instead, and how to use importance signals in production monitoring pipelines.

Why "Feature Importance" Is Not One Thing

The phrase "feature importance" sounds like it describes a single concept. It does not. Different methods answer different questions:

MDI (Gini importance): how much did this feature reduce impurity in the training trees?
Permutation importance (MDA): how much does model performance drop when this feature's values are randomly shuffled?
SHAP global importance: what is the average magnitude of this feature's contribution to individual predictions?
PDP: how does the model's average prediction change as this feature varies, holding others at their marginal distributions?
ICE: how does the model's prediction change for each individual instance as this feature varies?
ALE: how does the model's prediction change as this feature varies, conditioning on the feature's actual data distribution?
H-statistic: how much of the model's variance is explained by feature interactions?
Sobol indices: how much of the output variance is attributable to each input feature or feature interaction?

These can and do give different rankings. Before choosing a method, ask: what question am I actually trying to answer?

Historical Context

Feature importance methods have been evolving since the early 2000s alongside ensemble methods. Leo Breiman introduced the permutation importance idea in his 2001 random forests paper - the original "out-of-bag" importance. The Gini/MDI importance was embedded in scikit-learn from its early versions as a fast built-in metric.

The systematic bias in MDI was documented conclusively by Strobl et al. (2007) in "Bias in Random Forest Variable Importance Measures." The academic community accepted this result quickly; the practitioner community took over a decade to act on it. As late as 2019, most Kaggle solutions and production ML systems were using MDI as the default without question.

ALE (Accumulated Local Effects) was introduced by Apley and Zhu (2020) in a JRSS-B paper that directly compared ALE to PDP, proving the extrapolation bias in PDP and providing the theoretical foundation for ALE's correctness under feature correlation. The H-statistic for measuring interactions was developed by Friedman and Popescu (2008). Sobol sensitivity indices come from the uncertainty quantification literature, adapted for ML by several groups in the 2015–2020 period.

MDI - Mean Decrease in Impurity

How It Works

For decision tree ensembles, the Gini importance (MDI) of feature $j$ is the total decrease in node impurity attributed to splits on feature $j$ , weighted by the proportion of samples reaching each node, averaged across all trees:

$\text{MDI}(f_j) = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{n \in \text{tree}_t \\ \text{split on } f_j}} p(n) \cdot \Delta\text{impurity}(n)$

where $p(n) = n_{\text{samples at node}} / n_{\text{total samples}}$ is the proportion of samples reaching node $n$ , and $\Delta\text{impurity}(n)$ is the decrease in Gini impurity (or MSE for regression) from the split.

For Gini impurity with class probabilities $p_k$ at a node:

$\text{Gini}(n) = 1 - \sum_k p_k^2$

After a split into left child $n_L$ and right child $n_R$ :

$\Delta\text{Gini}(n) = \text{Gini}(n) - \frac{|n_L|}{|n|}\text{Gini}(n_L) - \frac{|n_R|}{|n|}\text{Gini}(n_R)$

MDI sums these decrements across all nodes where feature $j$ was the split variable.

The Bias Problem

MDI is systematically biased toward high-cardinality features. A continuous feature with 1,000 unique values can create 999 possible binary splits. A binary feature can create only 1 split. High-cardinality features have more opportunities to be selected for splitting and accumulate more total impurity decrease - even if their actual predictive value is equal to or less than low-cardinality features.

Strobl et al. (2007) demonstrated this conclusively: in a dataset with informative binary features and uninformative continuous noise features, MDI ranks the noise features above the informative binary features. The bias can produce completely misleading rankings in realistic datasets.

Additional problem: MDI is computed on the training data. A feature that is overfitted to the training set will have high MDI even if it has zero generalization value.

MDI vs Permutation Importance - Key Differences:

Property	MDI	Permutation Importance
Bias toward high cardinality	Yes - severe	No
Uses training or test data	Training	Test (recommended)
Handles overfitting	No	Yes (if using test set)
Speed	Instant (computed during training)	O(n_features × n_repeats) extra passes
Model type	Tree ensembles only	Any model
Accounts for correlation	No	Partial (undervalues correlated pairs)

When to Use MDI

MDI is useful when: you want a fast, approximate ranking and understand the bias. It requires no additional computation (it is computed during tree training). It is a reasonable starting point for exploratory analysis, not for rigorous feature selection or regulatory reporting.

Permutation Importance - Mean Decrease in Accuracy

The Algorithm

Permutation importance (Breiman 2001, the "out-of-bag" version) measures the drop in model performance when feature $j$ 's values are randomly shuffled:

$\text{MDA}(f_j) = \text{score}(f) - \mathbb{E}_\pi[\text{score}(f_{\pi_j})]$

where $f_{\pi_j}$ is the model evaluated with feature $j$ 's values replaced by a random permutation of those values.

Step-by-step algorithm:

Evaluate the model on the test set, record baseline score $S_0$ (e.g., AUC, F1, R²)
For each feature $j$ $j$ from 1 to $p$ $p$ :
- For $b = 1$ $b = 1$ to $B$ $B$ (typically 5–10 repeats):
  - Create $\tilde{X}$ = copy of $X_{\text{test}}$ with column $j$ randomly shuffled
  - Evaluate model on $\tilde{X}$ , record $S_{j,b}$
- $\text{MDA}(f_j) = S_0 - \frac{1}{B}\sum_b S_{j,b}$
Features with high MDA are important; features with MDA near zero (or negative) are unimportant

The key insight: shuffling breaks the relationship between feature $j$ and the target, while preserving its marginal distribution and all other feature relationships.

$\text{MDA}(f_j) = \frac{1}{B} \sum_{b=1}^{B} \left[\text{score}(f) - \text{score}(f_{\pi_j^{(b)}})\right]$

Confidence Intervals for Permutation Importance

Because we run $B$ repeats, we can compute standard errors:

$\text{SE}(f_j) = \frac{\text{std}(\{S_0 - S_{j,b}\}_{b=1}^B)}{\sqrt{B}}$

A feature is statistically not important if its 95% CI includes zero: $\text{MDA}(f_j) \pm 1.96 \cdot \text{SE}(f_j)$ . This is a rigorous criterion for feature elimination - not just "low importance" but "statistically indistinguishable from zero."

The Correlation Problem

Permutation importance has its own failure mode: correlated features are undervalued. If features $f_i$ and $f_j$ are highly correlated, shuffling $f_i$ does not actually remove the model's access to the information in $f_i$ - the model can partially recover it from $f_j$ . The measured performance drop understates the true importance of both features.

Example: in a credit model, income and net_worth are highly correlated. Shuffling income barely affects performance because the model uses net_worth as a backup. MDI might rank both highly. Permutation importance ranks both low. Neither result is "correct" - they answer different questions under different assumptions.

For a definitive correct answer on correlated features: use SHAP, which distributes importance symmetrically among correlated features by the Shapley axiom of symmetry.

Train vs Test Permutation Importance

Compute permutation importance on the test set, not the training set. Training set importance is inflated by overfitting. A feature that was memorized from training will show high training-set importance but low test-set importance. Test-set permutation importance measures generalization value, which is what you care about.

Partial Dependence Plots (PDP)

The Marginal Effect Derivation

A Partial Dependence Plot shows the marginal effect of feature $f_j$ on the model's prediction, averaging over the distribution of all other features:

$\hat{f}_j(x_j) = \mathbb{E}_{X_{-j}}\left[f(x_j, X_{-j})\right] = \int f(x_j, x_{-j}) \, p(x_{-j}) \, dx_{-j}$

This integral is approximated empirically:

$\hat{f}_j(x_j) = \frac{1}{n} \sum_{i=1}^{n} f(x_j, x_{-j}^{(i)})$

For each value of $x_j$ , we fix that value, use all other features from the training data, and average the model's predictions. The result shows the average relationship between $x_j$ and the prediction.

For a feature with range $[a, b]$ , evaluate at $G$ grid points $x_j^{(1)}, \ldots, x_j^{(G)}$ . For each grid point:

Fix $x_j = x_j^{(g)}$
Replace the $j$ -th column of the entire training set with this value
Run all $n$ samples through the model
The PDP value at $x_j^{(g)}$ is the mean prediction

The total computational cost: $G \times n$ forward passes through the model. For $G=50$ grid points and $n=50{,}000$ samples, this is 2.5 million inference calls - significant for deep learning models.

The Extrapolation Problem

PDPs have a critical failure mode: they average over the marginal distribution of other features, not the conditional distribution. When features are correlated, this means the PDP evaluates the model at feature combinations that never occur in the training data.

Example: feature $x_1$ = education level, feature $x_2$ = income. These are positively correlated - high education predicts high income. A PDP for $x_1$ would average over all values of $x_2$ , including low-income values combined with high-education levels. This combination is rare in reality. The model has never seen it during training and may extrapolate unpredictably. The PDP reports this extrapolation as if it were meaningful.

This is not a minor technical detail. In healthcare, financial, or social science models, correlated features are the norm. PDPs can be deeply misleading when features are correlated.

ICE Plots - Individual Conditional Expectation

ICE plots solve one of PDP's problems: they show the PDP for each individual instance, rather than averaging.

For each training instance $i$ , the ICE curve is:

$\hat{f}_j^{(i)}(x_j) = f(x_j, x_{-j}^{(i)})$

The PDP is the average of all ICE curves: $\hat{f}_j(x_j) = \frac{1}{n} \sum_i \hat{f}_j^{(i)}(x_j)$ .

Why ICE is more informative: when the effect of $x_j$ is heterogeneous - different for different subgroups - the PDP average hides this heterogeneity. ICE plots reveal it. If most ICE curves slope upward but a cluster slopes downward, there is a subgroup interaction that the PDP would wash out.

c-ICE (centered ICE): subtract each curve's value at a fixed reference point $x_j^{(0)}$ to make all curves start at zero. This removes the effect of different baseline prediction levels and makes the shape of each curve's response visible:

$\hat{f}_j^{(i), c}(x_j) = \hat{f}_j^{(i)}(x_j) - \hat{f}_j^{(i)}(x_j^{(0)})$

ICE does not solve the extrapolation problem - it still evaluates the model at potentially unrealistic feature combinations. For that, you need ALE.

ALE - Accumulated Local Effects

ALE (Apley and Zhu, 2020) solves the extrapolation problem by conditioning on the actual data distribution:

$\hat{f}_j^{\text{ALE}}(x_j) = \int_{z_{0,j}}^{x_j} \mathbb{E}\left[\frac{\partial f(x_j', X_{-j})}{\partial x_j'} \,\Big|\, x_j' = z\right] dz - \text{constant}$

Instead of averaging over the full distribution of $x_{-j}$ , ALE computes the expected partial derivative of the model's output with respect to $x_j$ , conditioned on $x_j' = z$ . This uses only the actual data points near each value of $x_j$ , avoiding extrapolation. The constant centers the ALE at zero.

Discrete ALE computation algorithm:

Divide feature $j$ 's range into $K$ quantile-based intervals (bins): $[q_0, q_1], [q_1, q_2], \ldots, [q_{K-1}, q_K]$
For each bin $k$ $k$ :
- Find all instances $\mathcal{N}_k = \{i : x_j^{(i)} \in [q_{k-1}, q_k]\}$
- For each instance $i \in \mathcal{N}_k$ $i \in N_{k}$ :
  - Create $x^{(i,\text{high})}$ : copy of $x^{(i)}$ with feature $j$ set to $q_k$ (upper bound)
  - Create $x^{(i,\text{low})}$ : copy of $x^{(i)}$ with feature $j$ set to $q_{k-1}$ (lower bound)
- Local effect in bin $k$ : $\Delta_k = \frac{1}{|\mathcal{N}_k|} \sum_{i \in \mathcal{N}_k} \left[f(x^{(i,\text{high})}) - f(x^{(i,\text{low})})\right]$
Accumulate: $\hat{f}_j^{\text{ALE}}(x_j) = \sum_{k=1}^{K_j} \Delta_k - \frac{1}{n}\sum_{i=1}^n \hat{f}_j^{\text{ALE}}(x_j^{(i)})$

The final subtraction centers the ALE plot at zero.

ALE is the correct replacement for PDP when features are correlated. ALE is unbiased; PDP is biased under correlation.

H-Statistic - Measuring Interaction Effects

Friedman and Popescu (2008) defined the H-statistic to measure how much of a model's variance is explained by interactions between features.

For two features $j$ and $k$ , the two-way H-statistic is:

$H_{jk}^2 = \frac{\sum_{i=1}^n \left[\text{PD}_{jk}(x_j^{(i)}, x_k^{(i)}) - \text{PD}_j(x_j^{(i)}) - \text{PD}_k(x_k^{(i)})\right]^2}{\sum_{i=1}^n \left[\text{PD}_{jk}(x_j^{(i)}, x_k^{(i)})\right]^2}$

where:

$\text{PD}_{jk}(x_j, x_k)$ : 2D partial dependence (average over all other features)
$\text{PD}_j(x_j)$ : 1D partial dependence for feature $j$
$\text{PD}_k(x_k)$ : 1D partial dependence for feature $k$

Interpretation: $H_{jk}^2 \in [0, 1]$ .

$H_{jk}^2 = 0$ : no interaction - the joint effect is the sum of marginal effects
$H_{jk}^2 = 1$ : pure interaction - the joint effect cannot be decomposed into individual effects

A total H-statistic for feature $j$ measures its interaction with all other features:

$H_j^2 = \frac{\sum_{i=1}^n \left[f(x^{(i)}) - \text{PD}_j(x_j^{(i)}) - \text{PD}_{-j}(x_{-j}^{(i)})\right]^2}{\sum_{i=1}^n \left[f(x^{(i)})\right]^2}$

Computational cost: the H-statistic requires computing multiple PD functions and is $O(n^2)$ for pairwise interactions. For large datasets, subsample to 500–1000 instances for estimation.

Sobol Sensitivity Indices - Variance-Based Attribution

Sobol indices (Sobol 1993, widely adopted for ML circa 2015–2020) decompose the variance of the model's output:

$\text{Var}(Y) = \sum_j V_j + \sum_{j < k} V_{jk} + \ldots + V_{12\ldots p}$

where $V_j = \text{Var}_{x_j}\left[\mathbb{E}_{X_{-j}}[f(X) \mid x_j]\right]$ is the variance of the conditional expectation - how much the expected output changes as feature $j$ varies.

First-order Sobol index (main effect):

$S_j = \frac{V_j}{\text{Var}(Y)} = \frac{\text{Var}_{x_j}\left[\mathbb{E}_{X_{-j}}[f(X) \mid x_j]\right]}{\text{Var}(Y)}$

$S_j$ ranges from 0 to 1 and represents the fraction of output variance explained by feature $j$ alone.

Total Sobol index (includes all interactions involving feature $j$ ):

$S_j^T = 1 - \frac{\text{Var}_{X_{-j}}\left[\mathbb{E}_{x_j}[f(X) \mid X_{-j}]\right]}{\text{Var}(Y)}$

If $S_j^T - S_j$ is large, feature $j$ participates in significant interactions. Sobol indices require Monte Carlo estimation (usually 1,000–10,000 model evaluations per feature) and are most commonly used in uncertainty quantification and safety-critical systems where variance decomposition is required by regulation.

Code: MDI vs Permutation vs SHAP

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
import shap

# ─── DATASET WITH HIGH-CARDINALITY BIAS ──────────────────────────────────────

np.random.seed(42)
n = 2000
X_informative, y = make_classification(
    n_samples=n, n_features=4, n_informative=4,
    n_redundant=0, n_classes=2, random_state=42
)

# Add 2 high-cardinality noise features (random continuous)
noise_hc = np.random.randn(n, 2) * 10  # high variance, completely random

X_full = np.hstack([X_informative, noise_hc])
feature_names = ["informative_1", "informative_2",
                 "informative_3", "informative_4",
                 "noise_continuous_1", "noise_continuous_2"]

X_train, X_test, y_train, y_test = train_test_split(
    X_full, y, test_size=0.2, random_state=42
)

# ─── TRAIN RANDOM FOREST ──────────────────────────────────────────────────────

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
print(f"Random Forest accuracy: {rf.score(X_test, y_test):.4f}")

# ─── MDI IMPORTANCE (built-in, biased) ───────────────────────────────────────

mdi_importance = pd.Series(rf.feature_importances_, index=feature_names)
print("\nMDI (Gini) Importance:")
print(mdi_importance.sort_values(ascending=False).to_string())
print("NOTE: noise features may appear inflated due to high-cardinality bias")

# ─── PERMUTATION IMPORTANCE (test set, unbiased) ─────────────────────────────

perm_imp = permutation_importance(
    rf, X_test, y_test,
    n_repeats=10,          # average over 10 random permutations
    random_state=42,
    n_jobs=-1
)

perm_importance = pd.Series(
    perm_imp.importances_mean,
    index=feature_names
)
perm_std = pd.Series(perm_imp.importances_std, index=feature_names)

print("\nPermutation Importance (mean ± std on test set):")
for feat in perm_importance.sort_values(ascending=False).index:
    imp = perm_importance[feat]
    std = perm_std[feat]
    sig = "SIGNIFICANT" if imp > 1.96 * std else "not significant"
    print(f"  {feat:25s}  {imp:.4f} ± {std:.4f}  [{sig}]")

# ─── SHAP GLOBAL IMPORTANCE (theoretically correct) ──────────────────────────

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)

if isinstance(shap_values, list):
    sv = shap_values[1]  # class 1
else:
    sv = shap_values

mean_abs_shap = pd.Series(
    np.abs(sv).mean(axis=0),
    index=feature_names
)

print("\nSHAP Global Importance (mean |SHAP|):")
print(mean_abs_shap.sort_values(ascending=False).to_string())

# ─── COMPARISON TABLE ─────────────────────────────────────────────────────────

comparison = pd.DataFrame({
    "MDI_rank": mdi_importance.rank(ascending=False).astype(int),
    "Permutation_rank": perm_importance.rank(ascending=False).astype(int),
    "SHAP_rank": mean_abs_shap.rank(ascending=False).astype(int),
    "MDI": mdi_importance.round(4),
    "Permutation": perm_importance.round(4),
    "SHAP": mean_abs_shap.round(4)
})

print("\nComparison (ranks) - noise features should rank low in Permutation and SHAP:")
print(comparison.sort_values("SHAP_rank").to_string())

Code: PDP, ICE, and ALE

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.inspection import PartialDependenceDisplay

# ─── DATASET WITH CORRELATED FEATURES ────────────────────────────────────────

np.random.seed(42)
n = 2000

x1 = np.random.randn(n)
x2 = x1 + 0.5 * np.random.randn(n)  # correlated with x1
x3 = np.random.randn(n)              # independent noise

X = np.column_stack([x1, x2, x3])
y = 2 * x1 + 0.5 * x3 + np.random.randn(n) * 0.5  # x2 has no real effect

feature_names = ["x1 (informative)", "x2 (correlated with x1)", "x3 (weak signal)"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gbm = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=42)
gbm.fit(X_train, y_train)
print(f"GBM R² on test set: {gbm.score(X_test, y_test):.4f}")

# ─── PDP - WARNING: EXTRAPOLATES UNDER CORRELATION ───────────────────────────

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
PartialDependenceDisplay.from_estimator(
    gbm, X_test, [0, 1, 2],
    feature_names=feature_names,
    ax=axes, kind="average"  # kind="average" = PDP
)
fig.suptitle("PDP - x2 will show false positive slope due to correlation", y=1.02)
plt.tight_layout()

# ─── ICE - SHOWS HETEROGENEITY ───────────────────────────────────────────────

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
PartialDependenceDisplay.from_estimator(
    gbm, X_test[:100], [0, 1, 2],
    feature_names=feature_names,
    ax=axes, kind="both"  # PDP bold + ICE thin
)
fig.suptitle("PDP + ICE - check for heterogeneity (diverging ICE lines)", y=1.02)
plt.tight_layout()

# ─── ALE - CORRECT UNDER CORRELATION ─────────────────────────────────────────

def manual_ale_1d(model, X, feature_idx, K=20):
    """
    Manual ALE computation for 1D feature (Apley & Zhu 2020).
    Uses quantile bins and local finite differences.
    x2's ALE should be near zero even though PDP shows a positive slope.
    """
    feat_vals = X[:, feature_idx]
    quantiles = np.quantile(feat_vals, np.linspace(0, 1, K + 1))
    quantiles = np.unique(quantiles)

    ale_vals = []
    bin_centers = []
    bin_counts = []

    for k in range(len(quantiles) - 1):
        low, high = quantiles[k], quantiles[k + 1]
        mask = (feat_vals >= low) & (feat_vals < high)
        if mask.sum() < 2:
            continue

        X_low = X[mask].copy()
        X_high = X[mask].copy()
        X_low[:, feature_idx] = low
        X_high[:, feature_idx] = high

        # Local finite difference within bin - only uses nearby points
        local_effect = (model(X_high) - model(X_low)).mean()
        ale_vals.append(local_effect)
        bin_centers.append((low + high) / 2)
        bin_counts.append(mask.sum())

    # Accumulate local effects
    ale_cumsum = np.cumsum(ale_vals)
    # Center: subtract weighted mean
    weights = np.array(bin_counts, dtype=float)
    weights /= weights.sum()
    ale_cumsum -= (ale_cumsum * weights).sum()

    return np.array(bin_centers), ale_cumsum

print("\nALE vs PDP comparison for correlated features:")
for i, name in enumerate(feature_names):
    centers, ale = manual_ale_1d(gbm.predict, X_test, feature_idx=i)
    ale_range = ale.max() - ale.min()
    print(f"\n  '{name}':")
    print(f"    ALE range: {ale_range:.3f}")
    if i == 1:
        print("    Expected: NEAR ZERO - x2 has no real effect; ALE conditions correctly")
        print("    PDP would show: POSITIVE slope (due to correlation with x1)")

LightGBM + SHAP Integration

import lightgbm as lgb
import shap
import numpy as np
import pandas as pd

# ─── LIGHTGBM TRAINING ────────────────────────────────────────────────────────

# LightGBM has native SHAP support via TreeSHAP (exact, O(TLD²) complexity)
np.random.seed(42)
n, p = 5000, 20
X = np.random.randn(n, p)
# Non-linear target with interactions
y = X[:, 0]**2 + X[:, 1] * X[:, 2] + np.sin(X[:, 3]) + np.random.randn(n) * 0.5

feature_names = [f"feature_{i}" for i in range(p)]
dtrain = lgb.Dataset(X, label=y, feature_name=feature_names)

params = {
    "objective": "regression",
    "num_leaves": 31,
    "n_estimators": 200,
    "learning_rate": 0.05,
    "verbosity": -1,
}
model = lgb.train(params, dtrain, num_boost_round=200)

# ─── LightGBM BUILT-IN IMPORTANCE ─────────────────────────────────────────────

# LightGBM provides two built-in importance types
gain_importance = pd.Series(
    model.feature_importance(importance_type="gain"),
    index=feature_names
)
split_importance = pd.Series(
    model.feature_importance(importance_type="split"),
    index=feature_names
)

print("LightGBM Gain Importance (similar to MDI):")
print(gain_importance.sort_values(ascending=False).head(5))

print("\nLightGBM Split Importance (number of times feature used):")
print(split_importance.sort_values(ascending=False).head(5))

# ─── SHAP TREESHAP ────────────────────────────────────────────────────────────

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

mean_abs_shap = pd.Series(
    np.abs(shap_values).mean(axis=0),
    index=feature_names
)

print("\nSHAP Global Importance (mean |SHAP value|):")
print(mean_abs_shap.sort_values(ascending=False).head(5))

# ─── SHAP INTERACTION VALUES ─────────────────────────────────────────────────

# Detect interactions: feature_1 * feature_2 should show as top interaction
shap_interaction = explainer.shap_interaction_values(X[:500])  # subsample for speed
# Shape: (n_samples, n_features, n_features)
# Diagonal: main effects; off-diagonal: interaction effects

# Sum absolute interaction values per pair
interaction_matrix = np.abs(shap_interaction).mean(axis=0)
np.fill_diagonal(interaction_matrix, 0)  # remove main effects from diagonal

# Top interactions
interaction_df = pd.DataFrame(
    interaction_matrix,
    index=feature_names,
    columns=feature_names
)

# Flatten and find top pairs
pairs = []
for i in range(p):
    for j in range(i+1, p):
        pairs.append((feature_names[i], feature_names[j], interaction_matrix[i, j]))

top_interactions = sorted(pairs, key=lambda x: x[2], reverse=True)[:5]
print("\nTop 5 Feature Interactions (SHAP interaction values):")
for f1, f2, strength in top_interactions:
    print(f"  {f1} × {f2}: {strength:.4f}")

# Expected: feature_1 × feature_2 should be near the top (y contains x1*x2 term)

Feature Selection Pipeline Using Importance

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
import warnings
warnings.filterwarnings("ignore")

def importance_based_selection(
    X: np.ndarray,
    y: np.ndarray,
    feature_names: list,
    n_select: int = None,
    threshold_pct: float = 0.01,
    n_repeats: int = 10,
    random_state: int = 42
) -> tuple:
    """
    Feature selection pipeline using permutation importance.

    Strategy:
    1. Train baseline model on all features
    2. Compute permutation importance on validation set
    3. Eliminate features with importance <= threshold (or statistically zero)
    4. Retrain on selected features, compare CV score

    Returns: selected_features, importance_df, cv_scores
    """
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )

    # Step 1: Train on all features
    model = RandomForestClassifier(n_estimators=200, random_state=random_state)
    model.fit(X_train, y_train)
    baseline_score = model.score(X_val, y_val)
    print(f"Baseline (all {len(feature_names)} features): val accuracy = {baseline_score:.4f}")

    # Step 2: Permutation importance on validation set
    perm_result = permutation_importance(
        model, X_val, y_val,
        n_repeats=n_repeats,
        random_state=random_state,
        n_jobs=-1
    )

    importance_df = pd.DataFrame({
        "feature": feature_names,
        "importance_mean": perm_result.importances_mean,
        "importance_std": perm_result.importances_std,
    })
    importance_df["ci_lower"] = (
        importance_df["importance_mean"] - 1.96 * importance_df["importance_std"]
    )
    importance_df["is_significant"] = importance_df["ci_lower"] > 0
    importance_df = importance_df.sort_values("importance_mean", ascending=False)

    # Step 3: Select features
    if n_select:
        selected = importance_df.head(n_select)["feature"].tolist()
    else:
        # Keep features that are statistically significant AND above threshold
        max_imp = importance_df["importance_mean"].max()
        selected = importance_df[
            importance_df["is_significant"] &
            (importance_df["importance_mean"] >= threshold_pct * max_imp)
        ]["feature"].tolist()

    print(f"\nSelected {len(selected)} / {len(feature_names)} features:")
    print(importance_df[["feature", "importance_mean", "is_significant"]].to_string())

    # Step 4: Retrain on selected features
    feat_indices = [feature_names.index(f) for f in selected]
    X_selected = X[:, feat_indices]

    cv_scores_full = cross_val_score(
        RandomForestClassifier(n_estimators=200, random_state=random_state),
        X, y, cv=5, scoring="accuracy"
    )
    cv_scores_selected = cross_val_score(
        RandomForestClassifier(n_estimators=200, random_state=random_state),
        X_selected, y, cv=5, scoring="accuracy"
    )

    print(f"\nCV accuracy (all features):      {cv_scores_full.mean():.4f} ± {cv_scores_full.std():.4f}")
    print(f"CV accuracy (selected features): {cv_scores_selected.mean():.4f} ± {cv_scores_selected.std():.4f}")

    return selected, importance_df, {"full": cv_scores_full, "selected": cv_scores_selected}

Production Monitoring Using Importance Drift

Feature importance drift is one of the most underused signals in production ML monitoring. When a feature's importance changes significantly across time windows, it often indicates:

Data pipeline change: a normalization, encoding, or aggregation change altered the feature's scale or distribution
Distribution shift: the real-world relationship between the feature and the target has changed
Feature interaction change: a correlated feature changed, causing the model to rely more/less on this feature as a compensatory mechanism
Data quality issue: missing value handling changed, or a new source of missingness appeared

import numpy as np
import pandas as pd
from datetime import datetime, timedelta
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def monitor_importance_drift(
    model,
    X_windows: dict,  # {"2024-01": X_jan, "2024-02": X_feb, ...}
    y_windows: dict,
    feature_names: list,
    n_repeats: int = 10,
    drift_threshold: float = 0.3,  # 30% relative change flags a drift
) -> pd.DataFrame:
    """
    Compute permutation importance across time windows.
    Flag features where importance has drifted significantly.

    drift_threshold: flag if |pct_change| > drift_threshold
    """
    results = {}

    for window_key, X_w in X_windows.items():
        y_w = y_windows[window_key]
        perm = permutation_importance(
            model, X_w, y_w,
            n_repeats=n_repeats,
            n_jobs=-1
        )
        results[window_key] = perm.importances_mean

    # Build importance time series
    imp_df = pd.DataFrame(results, index=feature_names).T
    imp_df.index.name = "window"

    # Compute drift: compare each window to the first window (baseline)
    baseline = imp_df.iloc[0]
    drift_df = pd.DataFrame()

    for window in imp_df.index[1:]:
        current = imp_df.loc[window]
        # Relative change from baseline (avoid div by zero)
        pct_change = (current - baseline) / (baseline.abs() + 1e-8)
        drifted = pct_change.abs() > drift_threshold

        drift_row = pd.DataFrame({
            "window": window,
            "feature": feature_names,
            "baseline_importance": baseline.values,
            "current_importance": current.values,
            "pct_change": pct_change.values,
            "is_drifted": drifted.values
        })
        drift_df = pd.concat([drift_df, drift_row], ignore_index=True)

    # Summarize
    drifted_features = drift_df[drift_df["is_drifted"]]["feature"].value_counts()

    print(f"\nFeature Importance Drift Report:")
    print(f"Windows analyzed: {list(X_windows.keys())}")
    print(f"Drift threshold: {drift_threshold:.0%} relative change")
    print(f"\nFeatures flagged as drifted (count of windows with drift):")
    print(drifted_features.to_string())

    if len(drifted_features) > 0:
        print("\nAction items:")
        for feat in drifted_features.index[:3]:
            print(f"  - Investigate '{feat}': check data pipeline, distribution, missingness rate")

    return imp_df, drift_df


# ─── EXAMPLE USAGE ────────────────────────────────────────────────────────────

np.random.seed(42)
n, p = 1000, 10
feature_names = [f"f{i}" for i in range(p)]

# Simulate distribution shift in feature f3 starting in month 3
X_base = np.random.randn(n, p)
y_base = (X_base[:, 0] + X_base[:, 2] > 0).astype(int)

X_shifted = X_base.copy()
X_shifted[:, 3] *= 5  # scale change in f3 simulating pipeline bug

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_base, y_base)

X_windows = {"month_1": X_base, "month_2": X_base, "month_3": X_shifted}
y_windows = {"month_1": y_base, "month_2": y_base, "month_3": y_base}

imp_df, drift_df = monitor_importance_drift(rf, X_windows, y_windows, feature_names)

Correlated Feature Problems in Depth

When two correlated features $f_i$ and $f_j$ with correlation $\rho$ are both predictive:

MDI: inflates both if they are high-cardinality; may favor whichever was split first more often
Permutation importance: the model compensates for the shuffled feature using the other, so both appear less important than they truly are - the amount of undervaluation increases with $|\rho|$
SHAP: the Shapley symmetry axiom distributes importance equally between perfectly symmetric features. With correlation $\rho \approx 1$ , SHAP gives roughly equal importance to both; the total combined importance is correct
PDP: shows the marginal effect of each feature, but the marginal distribution includes impossible $(f_i, f_j)$ combinations, inflating one or both
ALE: correct - conditions on the joint distribution, avoids impossible combinations

Choosing the Right Method

Production Engineering Notes

LightGBM in production: LightGBM's gain importance is equivalent to MDI and has the same high-cardinality bias. Always pair LightGBM importance with SHAP (which LightGBM supports natively via lgb.Booster.predict(X, pred_contrib=True) or the shap library's TreeExplainer).

TreeSHAP complexity: TreeSHAP runs in $O(TLD^2)$ where $T$ = number of trees, $L$ = maximum number of leaves per tree, $D$ = maximum depth. For LightGBM/XGBoost with 200 trees, 31 leaves, depth 6 - TreeSHAP is fast enough for real-time scoring (typically 1–10 ms). KernelSHAP runs in $O(n^2 \cdot p)$ and is not suitable for real-time use.

Importance drift monitoring: set up automated checks comparing feature importance week-over-week. A 20–30% relative change in a top-5 feature's importance should trigger a data quality investigation. This is cheaper than waiting for model performance to degrade.

ALE in production: compute ALE on a rolling window of recent data to monitor whether feature relationships have changed. If ALE for a key feature changes shape (e.g., the slope reverses), the model's learned relationship may no longer be valid.

warning

Permutation importance computed on the training set measures memorization, not generalization. Always use the test set or a held-out validation set. This is the most common implementation mistake in practice.

danger

Never use MDI as the sole feature importance method for regulatory compliance, model auditing, or feature selection in mixed-cardinality datasets. Always cross-validate with permutation importance or SHAP. MDI is a fast heuristic, not a principled measure.

YouTube Resources

Resource	Creator	Focus
Feature Importance for Machine Learning	StatQuest (Josh Starmer)	MDI vs permutation, step-by-step
Partial Dependence Plots and ICE Explained	Christoph Molnar	PDP and ICE with visual examples
ALE Plots - Better PDPs	PyData Global	Apley & Zhu ALE derivation and demo
SHAP Interactions and Feature Effects	Scott Lundberg	SHAP interaction values and beeswarm plots
Feature Selection in Practice	Abhishek Thakur	Practical Kaggle-style feature selection

Common Mistakes

:::danger Common Mistake 1: Using MDI when features have different cardinalities If your dataset contains a mix of categorical features (low cardinality) and continuous features (high cardinality), MDI will systematically overrank the continuous features. A continuous age feature will typically dominate a binary gender feature even if gender is more predictive. Always validate MDI rankings with permutation importance or SHAP when you have mixed feature types. :::

:::danger Common Mistake 2: Using PDP when features are correlated PDP averages over the marginal distribution of all other features. When two features are correlated, this creates impossible combinations: the PDP for income will evaluate the model at (low_income, high_education) pairs that essentially don't exist in the real world. The model may have learned arbitrary behavior in these unrealistic regions. Use ALE instead - it conditions on the actual data distribution. :::

:::warning Common Mistake 3: Computing permutation importance on the training set Training set permutation importance is inflated by overfitting. A memorized feature will show high importance on training data but near-zero importance on held-out data. Always compute permutation importance on the test set (or a held-out validation set). :::

:::warning Common Mistake 4: Dropping correlated features based on importance rank without checking performance If features $A$ and $B$ are highly correlated and both predictive, they will share importance under SHAP (symmetry axiom). If you drop the lower-SHAP one, you may degrade performance because the remaining feature is less measured or noisier. Always check the correlation matrix alongside importance scores before dropping features. :::

:::warning Common Mistake 5: Interpreting ICE lines as causal effects ICE plots show how the model's prediction would change if you moved a single feature while holding all others fixed. This is not the same as asking "if I changed this feature for this person, what would happen?" The model was trained on observational data with correlations. Moving one feature while holding others fixed may move you to a region where the model extrapolates poorly. :::

:::danger Common Mistake 6: Ignoring importance drift in production Feature importance is often computed once during model development and then forgotten. In production, importance can shift dramatically due to data pipeline changes, seasonality, or concept drift. Set up automated drift monitoring that alerts when any top-5 feature's importance changes by more than 25% week-over-week. :::

Interview Q&A

Q1: What is MDI and why is it biased? When would you still use it?

MDI (Mean Decrease in Impurity, or Gini importance) measures the total reduction in node impurity attributable to each feature across all trees in the ensemble. The formula sums the impurity decrease at every node where feature $j$ was selected as the split variable, weighted by the proportion of samples at that node. The bias arises because features with more unique values have more potential split points - a continuous feature with 1,000 unique values has 999 candidate binary splits; a binary feature has 1. The tree-building algorithm searches over all candidate splits, so high-cardinality features are naturally selected more often, accumulating more impurity decrease. Strobl et al. (2007) demonstrated that uninformative continuous noise features can outrank genuinely predictive binary features purely due to cardinality. MDI is also computed on training data, so it conflates memorization with importance. When would I still use it? For a quick first-pass exploration when I understand the bias and all features are continuous with similar cardinality. Never for regulatory reporting, mixed-cardinality datasets, or any situation where the result matters.

Q2: How does permutation importance work and what are its failure modes?

Permutation importance: evaluate the model on the test set, record the baseline score. Shuffle the values of feature $j$ (breaking its relationship with the target while preserving its marginal distribution), re-evaluate, measure the performance drop. Repeat for 5-10 permutations and average. A feature with high average drop was important; near zero means the model doesn't need that feature. Failure mode 1 - correlated features: when features $i$ and $j$ are highly correlated, shuffling $i$ doesn't fully remove its information because the model can partially compensate using $j$ . Both features appear less important than they truly are. The severity increases with the correlation coefficient. Failure mode 2 - extrapolation: permuting feature $j$ while keeping all other features at their original values creates unnatural feature combinations that didn't appear in training data, potentially putting the model in an out-of-distribution regime. The fix for correlated features is SHAP (symmetry axiom). Always use the test set, not training set.

Q3: What is the difference between PDP and ALE, and when does it matter?

PDP shows $\hat{f}_j(x_j) = \frac{1}{n}\sum_i f(x_j, x_{-j}^{(i)})$ - the model's average prediction as $x_j$ varies, averaged over the full marginal distribution of all other features. ALE accumulates local derivatives, conditioning on the actual data distribution near each value of $x_j$ . The difference matters precisely when features are correlated, which is the norm in real datasets. For a correlated feature pair (education, income), the PDP for education evaluates the model at combinations like (high education, low income) that barely exist in the data. The model may produce arbitrary output in these regions. ALE avoids this by only using the data points that are actually near each value of education when computing the local effect. For the case where x2 is correlated with x1 but has no true independent effect, PDP will show a spurious positive slope for x2, while ALE correctly shows near-zero effect. Always use ALE when any two features have |correlation| > 0.3.

Q4: What is the H-statistic and how would you use it in practice?

The H-statistic (Friedman and Popescu, 2008) measures how much of the model's predicted variance is explained by interactions between features rather than by their individual main effects. For two features $j$ and $k$ , $H_{jk}^2$ compares the joint partial dependence $\text{PD}_{jk}(x_j, x_k)$ to the sum of the individual partial dependences $\text{PD}_j(x_j) + \text{PD}_k(x_k)$ . If the joint effect equals the sum of individual effects, $H_{jk}^2 = 0$ - no interaction. If the joint effect cannot be decomposed into individual effects at all, $H_{jk}^2 = 1$ . In practice: compute the total H-statistic for each feature first (cheap) to identify which features interact with anything. Then compute pairwise H-statistics only for the top 3-5 features (expensive, since each requires computing 2D PDs). This tells you which interactions are real and important - critical for deciding whether a linear model can approximate the black-box or whether interaction terms are needed.

Q5: How would you design a feature importance monitoring system for a production credit scoring model?

I would set up a weekly pipeline with three components. First, recompute permutation importance on a rolling 4-week window of labeled data (where labels are available with a delay). Store the importance vector for each week in a time-series database. Second, compute the relative change in importance for each feature week-over-week and trigger alerts for features whose importance changes by more than 25% in one week or 40% cumulatively over 4 weeks. Third, when a drift alert fires, automatically generate a drift diagnosis report: check the feature's raw distribution (PSI), check for new missingness rates, check for scale changes (mean, std, percentiles). The alert distinguishes between "the feature became less important because the relationship changed" (genuine concept drift - may require retraining) versus "the feature's data quality changed" (data pipeline issue - fix the pipeline). I'd also track the consistency of importance rankings: if the top-3 features change order, that warrants a manual investigation even if individual importance values haven't changed much.

Q6: You have 150 features in a regression model. Walk me through a feature selection strategy using importance methods.

Start with a quick MDI ranking to identify obviously useless features (the bottom 30% by MDI can often be eliminated safely when feature types are homogeneous). Then compute permutation importance on the test set for the remaining ~100 features, with 10 repeats, and compute 95% confidence intervals. Eliminate features whose CI includes zero - they are statistically indistinguishable from noise. Now you likely have 30–60 features. Check the correlation matrix of the remaining features. For any correlated cluster (|r| > 0.8), use SHAP interaction values to determine which feature in the cluster is the primary carrier of the information, and drop the others. Rerun cross-validation to ensure no performance degradation. Finally, for the top 15-20 remaining features, compute ALE plots to ensure each feature has a meaningful, interpretable effect curve. Features with ALE plots that are essentially flat despite having non-zero importance are often picking up noise or interaction effects that are better represented by other features. This process typically gets you to a well-understood 20–40 feature set with documented effect curves - which is what regulators and business stakeholders actually want.

Key Takeaways

Method	Handles Correlation	Requires Tree Model	Computes On	Best For
MDI	No (cardinality bias)	Yes	Training	Quick exploration only
Permutation	Partial (undervalues)	No	Test set	General global ranking
SHAP global	Yes (symmetry axiom)	No (TreeSHAP for trees)	Any	Rigorous global ranking
PDP	No (extrapolates)	No	Any (many passes)	Independent features only
ICE	No (extrapolates)	No	Any (many passes)	Heterogeneity detection
ALE	Yes	No	Any (efficient)	Correlated features - best default
H-statistic	N/A	No	Any (expensive)	Interaction discovery
Sobol	Yes	No	Any (MC intensive)	Variance decomposition, safety-critical

The practitioner's rule: use ALE for effect curves, SHAP for rankings, and permutation importance for model-agnostic validation. Treat MDI as a fast heuristic, not ground truth.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the SHAP Values demo on the EngineersOfAI Playground - no code required.

:::

The ICU Model That Got the Wrong Answer for the Right Reasons​

Why "Feature Importance" Is Not One Thing​

Historical Context​

MDI - Mean Decrease in Impurity​

How It Works​

The Bias Problem​

When to Use MDI​

Permutation Importance - Mean Decrease in Accuracy​

The Algorithm​

Confidence Intervals for Permutation Importance​

The Correlation Problem​

Train vs Test Permutation Importance​

Partial Dependence Plots (PDP)​

The Marginal Effect Derivation​

The Extrapolation Problem​

ICE Plots - Individual Conditional Expectation​

ALE - Accumulated Local Effects​

H-Statistic - Measuring Interaction Effects​

Sobol Sensitivity Indices - Variance-Based Attribution​

Code: MDI vs Permutation vs SHAP​

Code: PDP, ICE, and ALE​

LightGBM + SHAP Integration​

Feature Selection Pipeline Using Importance​

Production Monitoring Using Importance Drift​

Correlated Feature Problems in Depth​

Choosing the Right Method​

Production Engineering Notes​

YouTube Resources​

Common Mistakes​

Interview Q&A​

Key Takeaways​