Skip to main content

Pruning and Depth Control

The Production Scenario

The ML engineering team at a large telecom company has been tasked with reducing churn. A colleague trained a decision tree on 500,000 customer records - six months of usage data, billing history, support tickets, device age, and contract status. The model looks spectacular on paper: 99.1% training accuracy. The team deploys it to a holdout validation set and watches the number drop to 71.3%.

You open the trained tree to investigate. It has 847 leaves. There is a leaf for customers who called support exactly three times in the last 90 days, have a device older than 26 months, are on a legacy unlimited plan, and made their last payment between 12 and 17 days ago. That leaf contains exactly two training customers. The model has memorized the training set - every quirk, every noise pattern, every coincidental correlation. It has learned the data rather than the phenomenon.

The fix is not a different algorithm. It is pruning - a family of techniques that constrain a tree's growth so it generalizes rather than memorizes. There are two broad approaches: pre-pruning stops splits before they become too specific during training, and post-pruning grows the full tree first, then works backward to remove branches that do not justify their complexity. Both address the same root cause: decision trees have unlimited capacity and will exploit every degree of freedom if you let them.

This lesson covers both approaches rigorously. You will understand the bias-variance tradeoff through the lens of tree depth, derive the cost-complexity pruning objective used by sklearn, implement the complete alpha selection path in Python, visualize decision boundaries at different depths, and leave with practical guidance for tuning depth on very large datasets where full cross-validation is too slow.

Why Unpruned Trees Overfit: Bias-Variance for Trees

A decision tree grown to full depth will partition the training set until every leaf contains a single example, or until all examples in a leaf share the same label. At that point it achieves zero training error. But this perfect fit to training data comes at a cost: every decision boundary is drawn to accommodate noise in the training set, not the underlying signal.

The bias-variance decomposition clarifies this. For any prediction model:

Expected Test Error=Bias2underfitting+Varianceoverfitting+σ2irreducible\text{Expected Test Error} = \underbrace{\text{Bias}^2}_{\text{underfitting}} + \underbrace{\text{Variance}}_{\text{overfitting}} + \underbrace{\sigma^2}_{\text{irreducible}}

For decision trees:

  • Shallow tree (high max_depth restriction): High bias - the tree lacks the capacity to fit complex boundaries, systematically underfitting. Training and validation accuracy are both low and similar.
  • Deep tree (no restriction): Near-zero bias on training data but extreme variance - small changes in the training set produce radically different trees with completely different boundary placements. Training accuracy is near perfect, validation accuracy collapses.

The goal of pruning is to find the depth at which validation accuracy is maximized - the "sweet spot" where the tree captures real signal without memorizing noise.

Training vs Validation Accuracy as Tree Depth Increases

Accuracy
99% | Training ──────────────────────────── (memorizes)
| /
90% | Validation <- SWEET SPOT near depth 5-7
| / \
80% | / \
| / \________ (overfitting gap grows)
70% | /
| / <- underfitting region
60% | /
+--+----+----+----+----+----+----+-> max_depth
1 2 3 4 5 6 7 8

Pre-Pruning: Stopping Growth Early

Pre-pruning parameters stop the tree from growing further when certain conditions are met. These are applied during the tree-building phase, as each potential split is evaluated.

max_depth

The most direct control. A tree of depth dd can represent at most 2d2^d distinct decision regions (leaves). Depth 1 is a "decision stump" - a single split. Depth 5 allows up to 32 leaves. Depth 10 allows up to 1024.

The key insight: depth grows exponentially but training data grows linearly. At depth 10, you would need over 1024 leaves to fill the tree, but each leaf receives only n/1024n/1024 training examples on average. With n=10,000n = 10{,}000 samples, that is fewer than 10 examples per leaf - deeply unreliable estimates.

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(
n_samples=5000, n_features=20, n_informative=8,
n_redundant=4, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
X, y, test_size=0.25, random_state=42
)

for depth in [2, 3, 5, 7, 10, None]:
clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
clf.fit(X_train, y_train)
train_acc = clf.score(X_train, y_train)
val_acc = clf.score(X_val, y_val)
n_leaves = clf.get_n_leaves()
print(f"depth={str(depth):4s} leaves={n_leaves:4d} "
f"train={train_acc:.3f} val={val_acc:.3f} "
f"gap={train_acc - val_acc:.3f}")

:::tip Practical starting point For tabular data with 10-50 features, start with max_depth between 3 and 8. Depths beyond 10 almost always overfit on datasets under 1 million rows. For standalone decision trees (not in an ensemble), max_depth=5 is a reasonable first guess. :::

min_samples_split

The minimum number of training samples required at a node for a split to be attempted. If a node contains fewer samples than min_samples_split, it becomes a leaf regardless of its impurity.

  • Default: 2 (split as long as there are at least 2 samples - effectively no constraint).
  • Setting min_samples_split=20 prevents splits based on tiny customer groups.
  • Can be a fraction of total samples: min_samples_split=0.01 means at least 1% of training data must reach the node before splitting.

The formula sklearn uses internally: min_samples_split = max(min_samples_split, 1) (integer) or ceil(min_samples_split * n_samples) (float).

min_samples_leaf

The minimum number of samples that must remain in each child after a split. A split is only made if both resulting children contain at least min_samples_leaf samples.

This is often more useful than min_samples_split because it directly controls leaf size - leaf predictions are averages (regression) or majority votes (classification), and very small leaves produce unreliable estimates. Two customers is not a reliable basis for a "high-churn" prediction.

Effect of min_samples_leaf on the telecom tree (n=500,000):

min_samples_leaf=1 → 847 leaves (original overfit)
min_samples_leaf=50 → ~320 leaves
min_samples_leaf=200 → ~130 leaves
min_samples_leaf=500 → ~55 leaves (each leaf: 500+ customers)

The stat-motivated default: set min_samples_leaf to at least ntrain\sqrt{n_{\text{train}}} or 0.5% of the training set - enough examples in each leaf to produce statistically meaningful estimates.

min_impurity_decrease

A split is only made if the impurity reduction it achieves meets a minimum threshold. Sklearn's weighted formula:

ΔGtmin_impurity_decreasentn\Delta G_t \geq \text{min\_impurity\_decrease} \cdot \frac{n_t}{n}

where ntn_t is the number of samples at node tt and nn is total training size. The nt/nn_t / n weighting has a subtle and important effect:

  • Splits near the root (where ntnn_t \approx n): must achieve nearly the full min_impurity_decrease in absolute terms - the bar is high.
  • Splits near the leaves (where ntnn_t \ll n): face a near-zero effective threshold - almost any reduction qualifies.

This design ensures that high-level splits (affecting most of the data) are held to a higher standard than low-level splits (affecting small subsets). In practice, min_impurity_decrease is most useful for preventing low-gain splits near the root that would spawn many unnecessary descendants.

max_leaf_nodes

A hard cap on the total number of leaves the tree can have. If specified, the tree is built greedily using best-first search rather than depth-first - at each step, it expands the leaf with the highest impurity reduction.

# max_leaf_nodes=20 gives at most 20 leaves, selected greedily by impurity gain
clf_limited = DecisionTreeClassifier(max_leaf_nodes=20, random_state=42)
clf_limited.fit(X_train, y_train)
print(f"Leaves: {clf_limited.get_n_leaves()}") # Will be <= 20

This is useful when you have a specific business constraint: "the model must be understandable as a flowchart with at most 30 decision regions."

Pre-Pruning Parameters Summary

ParameterWhat it controlsDefaultPractical range
max_depthMaximum tree depthNone (unlimited)3-10 for standalone trees
min_samples_splitMin samples to attempt any split25-50 or 0.005-0.02
min_samples_leafMin samples in each child after split15-50 or 0.005-0.01
min_impurity_decreaseMin weighted impurity reduction required0.00.0001-0.01
max_leaf_nodesCap on total number of leavesNone10-200 for interpretable trees
max_featuresFeatures considered per splitNone'sqrt', 'log2' (for ensembles)

Post-Pruning: Cost-Complexity Pruning

Post-pruning grows the full tree first, then selectively removes subtrees that do not contribute enough predictive power to justify their added complexity. The key insight is that measuring complexity after the tree is built is more principled than guessing stopping thresholds before.

Derivation of the Cost-Complexity Objective

Let TT be a tree and T|T| the number of its leaves. Define the training error rate of the tree as R(T)R(T) - the fraction of training examples misclassified (for classification) or total MSE (for regression).

A more complex tree always achieves lower (or equal) R(T)R(T) on training data - but we want to penalize complexity. The cost-complexity criterion introduces a regularization penalty:

Rα(T)=R(T)+αTR_\alpha(T) = R(T) + \alpha |T|

Here α0\alpha \geq 0 is the complexity parameter (called ccp_alpha in sklearn). For α=0\alpha = 0, we get the full unpruned tree with the lowest training error. As α\alpha increases, the penalty for extra leaves grows, and the optimal tree under this objective gets progressively smaller.

The key theorem (Breiman et al., 1984): as α\alpha increases from 0, the optimal subtrees form a finite nested sequence T0T1T2TrootT_0 \supset T_1 \supset T_2 \supset \cdots \supset T_{\text{root}}, where each TkT_k is obtained from Tk1T_{k-1} by pruning the weakest link - the internal node whose removal causes the smallest increase in training error per leaf removed.

For an internal node tt with subtree TtT_t, define its effective alpha - the value of α\alpha at which it becomes worth pruning this node (turning it into a leaf):

αt=R(t)R(Tt)Tt1\alpha_t^* = \frac{R(t) - R(T_t)}{|T_t| - 1}

where:

  • R(t)R(t) is the error of turning node tt into a leaf (predicting majority class)
  • R(Tt)R(T_t) is the error of the full subtree rooted at tt
  • Tt1|T_t| - 1 is the number of leaves removed by collapsing this subtree to a leaf

The numerator R(t)R(Tt)R(t) - R(T_t) is the error cost of pruning - how much worse the predictions become. The denominator Tt1|T_t| - 1 is the complexity benefit - how many leaves we eliminate. αt\alpha_t^* is the "price per leaf" of keeping this subtree.

Weakest link pruning algorithm:

1. Start with full tree T_0
2. Compute α_t* for every internal node t in T_0
3. Prune the node with the smallest α_t* → gives T_1, records α_1 = min(α_t*)
4. Repeat on T_1: compute new α_t* values, prune weakest link → T_2, records α_2
5. Continue until only the root remains
6. This gives a sequence: T_0 ⊃ T_1 ⊃ ... ⊃ T_root
with corresponding α values: 0 = α_0 < α_1 < α_2 < ...
7. Use cross-validation to find which α gives best validation performance

The Alpha Path

As alpha increases, the optimal tree shrinks (telecom churn example, n=500,000):

alpha=0.000 → 847 leaves (full tree, 99.1% train accuracy, 71.3% val)
alpha=0.001 → 420 leaves
alpha=0.003 → 180 leaves
alpha=0.008 → 62 leaves
alpha=0.010 → 45 leaves <- validation accuracy peaks ~87% here
alpha=0.020 → 18 leaves
alpha=0.050 → 6 leaves (major churn drivers only)
alpha=0.100 → 1 leaf (predict majority class always, 64% accuracy)

The job of cross-validation is to find the α\alpha value on this path that maximizes held-out performance.

Python: The Full Pruning Path

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import make_classification

# ── Generate synthetic churn-like dataset ──────────────────────────────────
X, y = make_classification(
n_samples=5000, n_features=20, n_informative=8,
n_redundant=4, n_clusters_per_class=2, random_state=42,
)
X_train, X_val, y_train, y_val = train_test_split(
X, y, test_size=0.25, random_state=42
)

# ── Step 1: Compute the pruning path ───────────────────────────────────────
full_tree = DecisionTreeClassifier(random_state=42)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas # monotonically increasing alpha values
impurities = path.impurities # corresponding total leaf impurity

print(f"Number of alpha values on path: {len(ccp_alphas)}")
print(f"Alpha range: {ccp_alphas[0]:.6f} to {ccp_alphas[-1]:.4f}")
print(f"Full tree: {full_tree.get_n_leaves()} leaves (before pruning)")

# ── Step 2: Train one tree per alpha value ─────────────────────────────────
trees = []
for alpha in ccp_alphas:
clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
clf.fit(X_train, y_train)
trees.append(clf)

# Extract metrics along the path
n_leaves = [clf.get_n_leaves() for clf in trees]
train_accs = [clf.score(X_train, y_train) for clf in trees]
val_accs = [clf.score(X_val, y_val) for clf in trees]
gaps = [tr - va for tr, va in zip(train_accs, val_accs)]

# ── Step 3: Plot the pruning path ──────────────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

axes[0].plot(ccp_alphas, n_leaves, marker="o", markersize=3, color="#2563eb")
axes[0].set_xlabel("ccp_alpha")
axes[0].set_ylabel("Number of leaves")
axes[0].set_title("Tree complexity vs alpha")
axes[0].set_xscale("log")
axes[0].grid(alpha=0.3)

axes[1].plot(ccp_alphas, train_accs, label="Train", marker="o", markersize=3, color="#16a34a")
axes[1].plot(ccp_alphas, val_accs, label="Validation", marker="o", markersize=3, color="#dc2626")
axes[1].set_xlabel("ccp_alpha")
axes[1].set_ylabel("Accuracy")
axes[1].set_title("Accuracy vs alpha")
axes[1].legend()
axes[1].set_xscale("log")
axes[1].grid(alpha=0.3)

axes[2].plot(ccp_alphas, gaps, marker="o", markersize=3, color="#ea580c")
axes[2].axhline(0.05, linestyle="--", color="gray", alpha=0.5, label="5% gap threshold")
axes[2].set_xlabel("ccp_alpha")
axes[2].set_ylabel("Train - Val accuracy gap")
axes[2].set_title("Overfitting gap vs alpha")
axes[2].legend()
axes[2].set_xscale("log")
axes[2].grid(alpha=0.3)

plt.tight_layout()
plt.savefig("pruning_path.png", dpi=150)
plt.show()

# ── Step 4: Find optimal alpha via cross-validation ────────────────────────
# Exclude the last alpha (collapses to just the root node)
valid_alphas = ccp_alphas[:-1]

cv_scores = []
for alpha in valid_alphas:
clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
scores = cross_val_score(clf, X_train, y_train, cv=5, scoring="accuracy")
cv_scores.append(scores.mean())

best_alpha_idx = np.argmax(cv_scores)
best_alpha = valid_alphas[best_alpha_idx]
print(f"\nBest ccp_alpha (5-fold CV): {best_alpha:.6f}")
print(f"CV accuracy at best alpha: {max(cv_scores):.4f}")

# ── Step 5: Train final model with best alpha ──────────────────────────────
final_clf = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42)
final_clf.fit(X_train, y_train)
print(f"Final tree leaves: {final_clf.get_n_leaves()}")
print(f"Final train accuracy: {final_clf.score(X_train, y_train):.4f}")
print(f"Final validation accuracy: {final_clf.score(X_val, y_val):.4f}")
print(f"Train-val gap: {final_clf.score(X_train, y_train) - final_clf.score(X_val, y_val):.4f}")

Grid Search Over Alpha Path

from sklearn.model_selection import GridSearchCV

# Grid search over the pruning path - prefer roc_auc for imbalanced problems
param_grid = {"ccp_alpha": ccp_alphas[:-1]}
grid_search = GridSearchCV(
DecisionTreeClassifier(random_state=42),
param_grid,
cv=5,
scoring="roc_auc", # use AUC for imbalanced churn data
n_jobs=-1,
verbose=0,
)
grid_search.fit(X_train, y_train)
print(f"Best alpha (AUC): {grid_search.best_params_['ccp_alpha']:.6f}")
print(f"Best CV AUC: {grid_search.best_score_:.4f}")

Decision Boundary Visualization at Different Depths

One of the most instructive ways to see overfitting in action is to visualize decision boundaries at different depth levels. Each additional depth level can add sharp jagged boundaries that fit training noise.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_moons

# 2D dataset for clean visualization
X_2d, y_2d = make_moons(n_samples=300, noise=0.25, random_state=42)
X_tr2, X_val2, y_tr2, y_val2 = train_test_split(
X_2d, y_2d, test_size=0.3, random_state=42
)

depths_to_show = [1, 3, 5, 7]
fig, axes = plt.subplots(1, 4, figsize=(20, 4))

x_min, x_max = X_2d[:, 0].min() - 0.3, X_2d[:, 0].max() + 0.3
y_min, y_max = X_2d[:, 1].min() - 0.3, X_2d[:, 1].max() + 0.3
xx, yy = np.meshgrid(
np.linspace(x_min, x_max, 200),
np.linspace(y_min, y_max, 200),
)

for ax, depth in zip(axes, depths_to_show):
clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
clf.fit(X_tr2, y_tr2)

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

ax.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.RdBu)
ax.scatter(X_tr2[:, 0], X_tr2[:, 1], c=y_tr2, cmap=plt.cm.RdBu,
edgecolors='k', s=20, label="Train")
ax.scatter(X_val2[:, 0], X_val2[:, 1], c=y_val2, cmap=plt.cm.RdBu,
marker='*', s=60, alpha=0.6, label="Val")

train_acc = clf.score(X_tr2, y_tr2)
val_acc = clf.score(X_val2, y_val2)
ax.set_title(
f"depth={depth} leaves={clf.get_n_leaves()}\n"
f"train={train_acc:.2f} val={val_acc:.2f}",
fontsize=10,
)
ax.axis("off")

axes[0].legend(loc="lower right", fontsize=8)
plt.suptitle("Decision Boundaries at Increasing Depth", fontsize=13, y=1.02)
plt.tight_layout()
plt.savefig("decision_boundaries_by_depth.png", dpi=150)
plt.show()

The depth=1 boundary is a single straight line - high bias, unable to separate the moons correctly. The depth=3 boundary captures the general crescent shapes with moderate complexity. The depth=5 boundary is nearly perfect on training data. The depth=7 boundary has jagged edges that perfectly surround individual training points - those jagged edges are pure noise memorization.

Minimum Description Length (MDL) Principle

Cost-complexity pruning has a deep connection to information theory through the Minimum Description Length (MDL) principle. MDL says: the best model is the one that provides the shortest total description of the data.

The description has two parts:

  1. Model description length: How many bits to describe the tree structure itself? More leaves = more bits to specify.
  2. Data description length given the model: How many bits to describe the training data's labels given the tree's predictions? A tree that makes more errors requires more bits to describe the residuals.

MDL(T)=L(model)+L(datamodel)MDL(T) = L(\text{model}) + L(\text{data} | \text{model})

This is equivalent to minimizing Rα(T)=R(T)+αTR_\alpha(T) = R(T) + \alpha |T| where α\alpha is determined by the encoding scheme. MDL provides the theoretical justification for why penalizing tree size is the principled approach - larger trees describe themselves, and if that self-description cost outweighs the data compression benefit, pruning wins.

Learning Curves: Diagnosing Fit Quality

The learning curve plots performance metrics as a function of a regularization parameter. Reading it correctly is essential for diagnosing whether a model is underfitting or overfitting.

from sklearn.model_selection import learning_curve
import numpy as np
import matplotlib.pyplot as plt

# ── Learning curve vs training set size (for a fixed depth) ────────────────
clf_fixed = DecisionTreeClassifier(max_depth=5, random_state=42)

train_sizes, train_scores, val_scores = learning_curve(
clf_fixed, X, y,
train_sizes=np.linspace(0.1, 1.0, 10),
cv=5,
scoring="accuracy",
n_jobs=-1,
)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Learning curve (train size vs accuracy)
axes[0].plot(train_sizes, train_scores.mean(1), label="Train", color="#16a34a", marker="o")
axes[0].fill_between(train_sizes,
train_scores.mean(1) - train_scores.std(1),
train_scores.mean(1) + train_scores.std(1),
alpha=0.2, color="#16a34a")
axes[0].plot(train_sizes, val_scores.mean(1), label="Validation", color="#dc2626", marker="o")
axes[0].fill_between(train_sizes,
val_scores.mean(1) - val_scores.std(1),
val_scores.mean(1) + val_scores.std(1),
alpha=0.2, color="#dc2626")
axes[0].set_xlabel("Training set size")
axes[0].set_ylabel("Accuracy")
axes[0].set_title("Learning Curve (max_depth=5)")
axes[0].legend()
axes[0].grid(alpha=0.3)

# Plot 2: Depth vs accuracy gap
depths = range(1, 15)
train_scores_d, val_scores_d = [], []
for depth in depths:
clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
clf.fit(X_train, y_train)
train_scores_d.append(clf.score(X_train, y_train))
val_scores_d.append(clf.score(X_val, y_val))

axes[1].plot(depths, train_scores_d, label="Train", color="#16a34a", marker="o")
axes[1].plot(depths, val_scores_d, label="Validation", color="#dc2626", marker="o")
best_depth = depths[np.argmax(val_scores_d)]
axes[1].axvline(best_depth, linestyle="--", color="gray", alpha=0.7,
label=f"Best depth={best_depth}")
axes[1].set_xlabel("max_depth")
axes[1].set_ylabel("Accuracy")
axes[1].set_title("Validation Accuracy vs Tree Depth")
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig("learning_curves.png", dpi=150)
plt.show()

# Print detailed table
print(f"\n{'Depth':>6} {'Train':>8} {'Val':>8} {'Gap':>8} {'Diagnosis':>20}")
print("-" * 54)
for d, tr, va in zip(depths, train_scores_d, val_scores_d):
gap = tr - va
diagnosis = "underfitting" if va < 0.75 else ("overfitting" if gap > 0.10 else "good fit")
print(f"{d:6d} {tr:8.3f} {va:8.3f} {gap:8.3f} {diagnosis:>20}")

:::tip Reading the learning curve

  • Both training and validation low: Underfitting - increase depth or relax min_samples_leaf.
  • Training high, validation low: Overfitting - reduce depth or increase ccp_alpha.
  • Both high and similar: Good generalization - this is the target state.
  • Validation peaks then declines: Clear overfitting signal - pick the depth at the peak.
  • Large gap even at high training size: The model form is wrong, not just regularization. :::

Pre-Pruning vs Post-Pruning: Trade-offs

DimensionPre-PruningPost-Pruning (CCP)
When appliedDuring tree buildingAfter full tree is built
Computation costLower - tree never fully builtHigher - full tree built first
Hyperparameter interpretationDirect and intuitiveAbstract (alpha has no natural unit)
Risk of stopping too earlyYes - may miss useful deep splitsNo - starts with full tree
Interaction between parametersMultiple can conflictSingle alpha controls all complexity
Best forVery large datasets where full tree is expensiveStandard to large datasets with CV budget
sklearn parametersmax_depth, min_samples_split, min_samples_leafccp_alpha
ReproducibilityMay vary with feature orderDeterministic given full tree

:::warning Pre-pruning parameters interact If you set both max_depth=5 and min_samples_leaf=50, the effective constraint is whichever fires first at each node. On datasets where most leaves naturally contain fewer than 50 samples at depth 5, min_samples_leaf will dominate. Test parameters individually to understand which is actually binding before combining them. :::

Combining Pre-Pruning and Post-Pruning

In practice, using both is sensible. Set a conservative max_depth to prevent the full tree from becoming enormous (which would slow CCP path computation), then use ccp_alpha for fine-tuned regularization.

from sklearn.model_selection import GridSearchCV

# First limit depth to keep the pruning path tractable
param_grid = {
"max_depth": [8, 10, 12, None],
"ccp_alpha": [0.0, 0.0005, 0.001, 0.002, 0.005],
"min_samples_leaf": [1, 5, 10],
}

grid = GridSearchCV(
DecisionTreeClassifier(random_state=42),
param_grid,
cv=5,
scoring="roc_auc",
n_jobs=-1,
verbose=1,
)
grid.fit(X_train, y_train)

print(f"Best params: {grid.best_params_}")
print(f"Best CV AUC: {grid.best_score_:.4f}")
print(f"Val AUC: {grid.score(X_val, y_val):.4f}")

# Analyze what each parameter combination is actually doing
import pandas as pd
results = pd.DataFrame(grid.cv_results_)
top_results = (results
.sort_values("mean_test_score", ascending=False)
.head(10)[["param_max_depth", "param_ccp_alpha",
"param_min_samples_leaf", "mean_test_score"]]
)
print(f"\nTop 10 parameter combinations:\n{top_results.to_string(index=False)}")

Production: Depth Tuning Without Full Cross-Validation

On very large datasets (5M+ rows), 5-fold cross-validation over a grid of alpha values is prohibitively slow. Practical alternatives:

1. Holdout Validation with a Single Split

Use an 80/10/10 train/validation/test split. Tune on the validation set. Reserve the test set for final reporting only. This is 5x faster than cross-validation with only slightly higher variance in the estimate.

X_train_full, X_temp, y_train_full, y_temp = train_test_split(
X, y, test_size=0.2, random_state=42
)
X_val_hold, X_test_hold, y_val_hold, y_test_hold = train_test_split(
X_temp, y_temp, test_size=0.5, random_state=42
)

# Tune on val_hold, never touch test_hold until final evaluation
best_depth, best_val_score = None, 0.0
for depth in [3, 4, 5, 6, 7, 8, 10]:
clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
clf.fit(X_train_full, y_train_full)
score = clf.score(X_val_hold, y_val_hold)
if score > best_val_score:
best_val_score = score
best_depth = depth
print(f" depth={depth} val_acc={score:.4f}")

print(f"\nBest depth: {best_depth}, val_acc: {best_val_score:.4f}")

2. Subsample the Alpha Path

Compute the full pruning path on a 20% subsample, identify the region where validation accuracy peaks, then run a finer grid in that region on the full dataset.

from sklearn.utils import resample

# Subsample for path discovery (fast)
X_sub, y_sub = resample(X_train, y_train, n_samples=10_000, random_state=42)
path_sub = DecisionTreeClassifier().cost_complexity_pruning_path(X_sub, y_sub)
alphas_sub = path_sub.ccp_alphas

# Evaluate every 10th alpha to find the rough best region
sampled_alphas = alphas_sub[::10]
val_scores_sub = []
for alpha in sampled_alphas:
clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
clf.fit(X_train, y_train)
val_scores_sub.append(clf.score(X_val, y_val))

best_rough_alpha = sampled_alphas[np.argmax(val_scores_sub)]
print(f"Rough best alpha: {best_rough_alpha:.6f}")

# Fine search around the rough best (10x finer grid)
fine_alphas = np.linspace(best_rough_alpha * 0.4, best_rough_alpha * 2.5, 25)
fine_scores = []
for alpha in fine_alphas:
clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
clf.fit(X_train, y_train)
fine_scores.append(clf.score(X_val, y_val))

final_alpha = fine_alphas[np.argmax(fine_scores)]
print(f"Final alpha: {final_alpha:.6f}, val_acc: {max(fine_scores):.4f}")

3. Rule of Thumb Caps

When speed matters most and you want a defensible default without tuning:

n_train = len(X_train)

# Logarithmic depth relative to dataset size
depth_cap = int(np.log2(n_train))

# At least 0.1% of data per leaf
leaf_cap = max(5, int(0.001 * n_train))

clf_heuristic = DecisionTreeClassifier(
max_depth=depth_cap,
min_samples_leaf=leaf_cap,
random_state=42,
)
clf_heuristic.fit(X_train, y_train)
print(f"Heuristic: max_depth={depth_cap}, min_samples_leaf={leaf_cap}")
print(f"Leaves: {clf_heuristic.get_n_leaves()}")
print(f"Val accuracy: {clf_heuristic.score(X_val, y_val):.4f}")

These heuristics will not find the optimal tree but reliably avoid the worst overfitting on most tabular datasets.

:::note When pruning is not enough If a fully pruned decision tree still underperforms a linear model on your validation set, the problem is likely that your target function is smooth - continuous and roughly linear - which decision trees approximate poorly with axis-aligned step functions. In that case, switch to gradient boosted trees (which stack many shallow trees to approximate smooth functions) or a linear model. More pruning will not fix a fundamental model mismatch. :::

Production Engineering Notes

The Churn Model Fix

Returning to the telecom scenario: the 847-leaf tree with 99.1% train / 71.3% validation accuracy is a textbook overfit. The 27.8-point train-validation gap is the fingerprint.

The remediation path:

  1. Compute the CCP alpha path from the training data.
  2. Use 5-fold CV (or holdout validation if 5-fold is too slow) to select optimal alpha.
  3. Retrain the final model with that alpha on the full training set.
  4. Verify that the validation accuracy gap (training - validation) is below 5%. If not, the tree is still overfit - increase alpha further.
  5. For a 500,000-row dataset, expect the optimal tree to have 50-200 leaves. 847 leaves means roughly one leaf per 590 customers - far too granular for generalizable churn patterns.

Monitoring Trees in Production

A pruned tree that performed well at deployment can drift as customer behavior evolves:

from sklearn.tree import DecisionTreeClassifier
import numpy as np

def monitor_tree_health(model, X_score, X_train, y_train, alert_threshold=0.05):
"""
Monitor key indicators of production tree health.
Run this on each new scoring batch.
"""
# 1. Leaf sample distribution (has the traffic per leaf changed?)
leaf_ids_current = model.apply(X_score)
leaf_ids_train = model.apply(X_train)

train_leaf_counts = np.bincount(leaf_ids_train, minlength=model.tree_.node_count)
current_leaf_counts = np.bincount(leaf_ids_current, minlength=model.tree_.node_count)

# Leaves used in production scoring
active_leaves_train = set(leaf_ids_train.tolist())
active_leaves_current = set(leaf_ids_current.tolist())

new_leaves_pct = len(active_leaves_current - active_leaves_train) / len(active_leaves_current)
print(f"New leaves (never seen in training): {new_leaves_pct:.1%}")
if new_leaves_pct > alert_threshold:
print(f"ALERT: {new_leaves_pct:.1%} of traffic goes to leaves "
f"not seen during training - possible distribution shift")

return {"new_leaves_pct": new_leaves_pct}

Track the fraction of scoring traffic that reaches leaves with very few training examples. If this fraction grows, the model is encountering customers it was not trained on.

YouTube Resources

VideoChannelWhy Watch It
Decision Tree Pruning - Cost ComplexityStatQuest with Josh StarmerBest intuitive walkthrough of CCP alpha path
Overfitting in Machine LearningStatQuest with Josh StarmerBias-variance tradeoff explained visually
sklearn Decision Trees - Full TutorialPatrick LoeberHands-on pruning with sklearn from scratch
Decision Tree Post-PruningNormalized NerdWeakest-link pruning algorithm explained step-by-step
Regularization for TreesTowards Data ScienceWhen and how to tune all pre-pruning parameters

Interview Questions and Answers

Q1: What is the cost-complexity criterion and how does it control tree size?

The cost-complexity criterion is Rα(T)=R(T)+αTR_\alpha(T) = R(T) + \alpha |T|, where R(T)R(T) is training error, T|T| is the number of leaves, and α0\alpha \geq 0 is the complexity parameter. For α=0\alpha = 0, the full tree minimizes this objective. As α\alpha increases, each additional leaf costs more in the objective function. The optimal tree under RαR_\alpha gets progressively smaller as α\alpha grows, producing a nested sequence of trees. Cross-validation finds the α\alpha that maximizes held-out performance. In sklearn, this is the ccp_alpha parameter. The key insight is that measuring complexity after building the full tree is more principled than guessing stopping thresholds before building it.

Q2: What is weakest-link pruning and why does it produce a nested sequence of optimal trees?

Weakest-link pruning iteratively removes the internal node whose removal causes the smallest increase in training error per leaf eliminated. For each internal node tt with subtree TtT_t, the effective alpha is αt=(R(t)R(Tt))/(Tt1)\alpha_t^* = (R(t) - R(T_t)) / (|T_t| - 1) - the "price per leaf" of keeping the subtree. The node with the minimum αt\alpha_t^* is pruned first, producing tree T1T_1. This process repeats, recording the alpha at each pruning step. The result is a nested sequence T0T1T_0 \supset T_1 \supset \cdots with corresponding α0<α1<\alpha_0 < \alpha_1 < \cdots. For any value of α\alpha, the optimal tree is the one in this sequence whose recorded alpha is closest to α\alpha from below. This theorem guarantees we only need to evaluate a polynomial number of trees, not the exponential space of all possible subtrees.

Q3: Explain the difference between min_samples_split and min_samples_leaf. Which is more useful in practice?

min_samples_split requires a minimum number of samples at a node before a split is even attempted. min_samples_leaf requires that both children must have at least this many samples after any split - it is a post-split constraint on the children, not a pre-split constraint on the parent. min_samples_leaf is usually more useful because it directly controls leaf size, which determines prediction reliability. A leaf with 2 samples produces an unreliable majority-vote or average. min_samples_leaf=50 ensures every leaf has at least 50 training examples backing its prediction. min_samples_split is weaker - a node with 4 samples might still split into a leaf of 3 and a leaf of 1 if min_samples_split=4.

Q4: A tree with max_depth=5 has worse validation accuracy than one with max_depth=4. What could explain this?

This is normal and expected near the optimal depth. The depth-5 tree has enough extra capacity to fit some noise that the depth-4 tree ignores. The splits added at depth 5 may be capturing real noise rather than real signal. This is exactly what the bias-variance tradeoff predicts: as depth increases from 4 to 5, you reduce bias (fit the training data better) but increase variance (become more sensitive to specific training examples). If depth-4 validates better, that is the tree's optimal depth for this dataset - use it. If you want more fine-grained control, switch to ccp_alpha which can produce trees between depth-4 and depth-5 complexity.

Q5: How would you prune a decision tree on a dataset with 50 million rows where cross-validation over the full pruning path is too slow?

Three practical approaches: (1) Subsample 5-10% of the data to compute the CCP alpha path, use it to identify the rough best-alpha region (where validation accuracy peaks), then run a finer grid search in that region on the full dataset. (2) Use an 80/10/10 train/validation/test split instead of 5-fold CV - this is 5x faster and for 50M rows has very low variance even without cross-validation. (3) Apply rule-of-thumb caps as a starting point: max_depth=int(log2(n_samples)) and min_samples_leaf=max(5, int(0.001*n_samples)), then refine only if the validation-training gap remains above 5%. For 50M rows, log2(50M) ≈ 25.5, so start at depth 20-25 with min_samples_leaf around 50,000.

Q6: Why does the MDL principle support cost-complexity pruning?

The Minimum Description Length (MDL) principle states the best model minimizes the total description length: model description length plus data description length given the model. For decision trees: more leaves require more bits to describe the tree structure (model description cost). Each training error requires extra bits to encode the residual (data description cost). Minimizing total description length is equivalent to minimizing R(T)+αTR(T) + \alpha|T| where α\alpha reflects the bit cost per leaf relative to the bit cost per training error. MDL provides the information-theoretic justification for why penalizing tree complexity is the principled approach - it is not an arbitrary regularizer but a direct application of Occam's Razor in information-theoretic terms.

Production Pruning Workflow: Complete Reference

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

def production_pruning_workflow(X_train, y_train, X_test, y_test,
cv_folds: int = 5):
"""
Complete production pruning workflow:
1. Compute CCP alpha path
2. Cross-validate each alpha
3. Select best alpha with 1-SE rule
4. Retrain on full training set
5. Report final metrics
"""
# ── Step 1: Full tree → CCP alpha path ────────────────────────────────────
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)

path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alphas = path.ccp_alphas[:-1] # exclude final (0-leaf degenerate tree)
impurities = path.impurities[:-1]

print(f"Full tree: {full_tree.get_n_leaves()} leaves, depth {full_tree.get_depth()}")
print(f"CCP alpha path: {len(alphas)} candidate values, "
f"range [{alphas.min():.5f}, {alphas.max():.5f}]")

# ── Step 2: Cross-validate each alpha ─────────────────────────────────────
cv = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=42)
cv_mean_scores = []
cv_std_scores = []

for alpha in alphas:
tree_a = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
scores = cross_val_score(tree_a, X_train, y_train, cv=cv, scoring='roc_auc')
cv_mean_scores.append(scores.mean())
cv_std_scores.append(scores.std())

cv_mean_scores = np.array(cv_mean_scores)
cv_std_scores = np.array(cv_std_scores)

# ── Step 3: 1-SE rule - choose simplest tree within 1 std of max ──────────
best_idx_raw = np.argmax(cv_mean_scores)
best_score = cv_mean_scores[best_idx_raw]
threshold = best_score - cv_std_scores[best_idx_raw]

# Among all alphas that score above threshold, pick the largest alpha
# (largest alpha = simplest / most pruned tree)
eligible = cv_mean_scores >= threshold
best_idx = np.where(eligible)[0][-1] # rightmost (largest alpha) eligible

best_alpha = alphas[best_idx]

print(f"\nBest alpha (max CV AUC): {alphas[best_idx_raw]:.6f} "
f"(AUC={cv_mean_scores[best_idx_raw]:.4f})")
print(f"1-SE rule alpha: {best_alpha:.6f} "
f"(AUC={cv_mean_scores[best_idx]:.4f})")

# ── Step 4: Retrain on full training set with best_alpha ──────────────────
final_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42)
final_tree.fit(X_train, y_train)

test_score = final_tree.score(X_test, y_test)
print(f"\nFinal tree: {final_tree.get_n_leaves()} leaves, "
f"depth {final_tree.get_depth()}, test accuracy={test_score:.4f}")

# ── Step 5: Visualization ─────────────────────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(20, 5))

# Alpha path vs n_leaves
n_leaves = []
for alpha in alphas:
t = DecisionTreeClassifier(ccp_alpha=alpha).fit(X_train, y_train)
n_leaves.append(t.get_n_leaves())

axes[0].semilogx(alphas, n_leaves, 'o-', ms=3, color='#3b82f6')
axes[0].axvline(best_alpha, color='#dc2626', linestyle='--',
label=f'Best alpha={best_alpha:.4f}')
axes[0].set_xlabel("ccp_alpha (log scale)")
axes[0].set_ylabel("Number of leaves")
axes[0].set_title("Pruning Path: Leaves vs Alpha")
axes[0].legend()

# CV scores vs alpha
axes[1].semilogx(alphas, cv_mean_scores, 'o-', ms=3, color='#7c3aed')
axes[1].fill_between(alphas, cv_mean_scores - cv_std_scores,
cv_mean_scores + cv_std_scores, alpha=0.2, color='#7c3aed')
axes[1].axvline(alphas[best_idx_raw], color='#16a34a', linestyle='--',
label=f'Max CV (alpha={alphas[best_idx_raw]:.4f})')
axes[1].axvline(best_alpha, color='#dc2626', linestyle='--',
label=f'1-SE rule (alpha={best_alpha:.4f})')
axes[1].axhline(threshold, color='gray', linestyle=':', alpha=0.7, label='1-SE threshold')
axes[1].set_xlabel("ccp_alpha (log scale)")
axes[1].set_ylabel(f"{cv_folds}-fold CV AUC")
axes[1].set_title("CV Score vs Alpha (1-SE Rule)")
axes[1].legend(fontsize=8)

# Impurity vs alpha (the actual CCP objective)
axes[2].semilogx(alphas, impurities, 'o-', ms=3, color='#ea580c')
axes[2].axvline(best_alpha, color='#dc2626', linestyle='--',
label=f'Best alpha={best_alpha:.4f}')
axes[2].set_xlabel("ccp_alpha (log scale)")
axes[2].set_ylabel("Total Impurity (R(T))")
axes[2].set_title("CCP: Impurity vs Alpha")
axes[2].legend()

plt.suptitle("Cost-Complexity Pruning: Production Workflow", fontsize=13)
plt.tight_layout()
plt.show()

return final_tree, best_alpha


# Run on a classification dataset
X, y = make_classification(n_samples=5000, n_features=20, n_informative=12,
n_redundant=4, random_state=42)
scaler = StandardScaler()
X_train, X_test = X[:4000], X[4000:]
y_train, y_test = y[:4000], y[4000:]
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

final_tree, best_alpha = production_pruning_workflow(
X_train_sc, y_train, X_test_sc, y_test, cv_folds=5
)

Decision Boundary Visualization at Different Depths

Understanding how depth controls the complexity of learned boundaries is essential for explaining bias-variance tradeoff in interviews:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score

X_vis, y_vis = make_moons(n_samples=500, noise=0.25, random_state=42)
depths = [1, 2, 3, 5, 7, None]

fig, axes = plt.subplots(2, 3, figsize=(18, 11))
xx, yy = np.meshgrid(
np.linspace(X_vis[:,0].min()-0.5, X_vis[:,0].max()+0.5, 200),
np.linspace(X_vis[:,1].min()-0.5, X_vis[:,1].max()+0.5, 200),
)
grid = np.c_[xx.ravel(), yy.ravel()]

for ax, depth in zip(axes.flat, depths):
dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
dt.fit(X_vis, y_vis)

cv_auc = cross_val_score(dt, X_vis, y_vis, cv=5, scoring='roc_auc').mean()
train_acc = dt.score(X_vis, y_vis)

Z = dt.predict_proba(grid)[:, 1].reshape(xx.shape)

ax.contourf(xx, yy, Z, levels=20, cmap='RdBu', alpha=0.6)
ax.scatter(X_vis[y_vis==0, 0], X_vis[y_vis==0, 1], c='#3b82f6', s=10, alpha=0.7)
ax.scatter(X_vis[y_vis==1, 0], X_vis[y_vis==1, 1], c='#ef4444', s=10, alpha=0.7)

depth_str = str(depth) if depth is not None else "∞"
leaves = dt.get_n_leaves()
ax.set_title(f"depth={depth_str} leaves={leaves}\n"
f"train={train_acc:.3f} cv_auc={cv_auc:.3f}", fontsize=10)
ax.axis('off')

plt.suptitle("Decision Boundary Complexity vs Depth\n"
"(Left: high bias, low variance | Right: low bias, high variance)",
fontsize=12)
plt.tight_layout()
plt.show()

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Decision Tree Splits demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.