Skip to main content

Random Forests

The Production Scenario

It is Tuesday morning at a mid-sized fintech lender. The credit risk team has been running a decision tree classifier in production for eighteen months. It has a training accuracy of 99% and was considered a triumph when it launched. But the model review committee pulled the quarterly numbers and the test-set accuracy has been sitting at 71% for three quarters in a row. Loan default predictions are wrong nearly one time in three.

The lead data scientist swaps in a Random Forest overnight. The next morning she calls the team together. The training accuracy has dropped to 94%. Several engineers are uncomfortable - the model performs worse on data it has already seen. But the test accuracy is now 89%. The model has gone from wrong 29% of the time to wrong 11% of the time on unseen loans.

A senior engineer raises his hand: "I don't get it. We built the forest out of trees that individually perform worse than the original tree - some of them are deliberately limited in what features they can see. How does averaging a bunch of worse models produce a better result?"

This question cuts to the heart of ensemble learning. The answer is not intuitive until you see the math. By the end of this lesson you will be able to explain it precisely - and you will understand why the single decision tree's 99% training accuracy was a warning sign, not a feature.

Why Single Trees Have High Variance

Recall from Lessons 01-03 that a fully grown decision tree memorizes training data. It partitions the feature space into smaller and smaller regions until nearly every leaf contains a single training point. This is high variance: small changes in the training data cause large changes in the learned tree structure.

Formally, the expected test error decomposes as:

Expected Test Error=Bias2systematic error+Variancesensitivity to training data+σ2irreducible\text{Expected Test Error} = \underbrace{\text{Bias}^2}_{\text{systematic error}} + \underbrace{\text{Variance}}_{\text{sensitivity to training data}} + \underbrace{\sigma^2}_{\text{irreducible}}

A fully grown decision tree has near-zero bias on training data but enormous variance. The credit risk tree's 99% train / 71% test accuracy shows a 28-point gap - that gap is the variance term made visible.

Pruning reduces variance but increases bias. Neither option alone yields consistently strong generalization. We need a different mechanism.

Bagging: Bootstrap Aggregating

Leo Breiman introduced bagging in 1994 as a general variance-reduction technique. The idea is elegant and powerful:

  1. Draw BB bootstrap samples from training set D\mathcal{D} - sample nn points with replacement.
  2. Fit a full decision tree f^b\hat{f}^b on each bootstrap sample Db\mathcal{D}^b.
  3. Aggregate predictions by majority vote (classification) or averaging (regression):

f^B(x)=1Bb=1Bf^b(x)\hat{f}^B(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^b(x)

The Variance Reduction Formula: Where the Math Lives

Let X1,X2,,XBX_1, X_2, \ldots, X_B be identically distributed random variables (one per tree), each with variance σ2\sigma^2 and pairwise correlation ρ\rho. Their average Xˉ=1BXb\bar{X} = \frac{1}{B}\sum X_b has variance:

Var(Xˉ)=ρσ2+1ρBσ2\text{Var}(\bar{X}) = \rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2

This is the most important equation in ensemble learning. Examine each term carefully:

  • First term ρσ2\rho \sigma^2: as BB \to \infty, this term remains fixed. No matter how many trees you add, correlated errors do not average away.
  • Second term (1ρ)σ2B\frac{(1-\rho)\sigma^2}{B}: as BB \to \infty, this term 0\to 0.

The correlation ρ\rho between trees is the ceiling on improvement. If every tree makes the same mistakes (high ρ\rho), averaging them changes nothing. Pure bagging on decision trees already reduces variance because bootstrap samples differ - but trees trained on bootstrap samples of the same dataset are still correlated, because they all tend to split on the same strong predictors.

Variance decomposition for different correlation levels:

ρ = 0.0 (independent trees): Var(avg) = 0 + σ²/B → collapses to 0
ρ = 0.5 (bagging): Var(avg) = 0.5σ² + σ²/(2B) → 0.5σ²
ρ = 0.9 (highly correlated): Var(avg) = 0.9σ² + 0.1σ²/B → 0.9σ²

Lesson: adding trees helps only for the uncorrelated component.
The correlated floor ρσ² is unreachable by adding more trees.

:::note The bootstrap sample statistics Each bootstrap sample draws nn points with replacement from nn training points. By the coupon collector's argument, on average each sample contains about 1(11/n)n1e163.2%1 - (1-1/n)^n \approx 1 - e^{-1} \approx 63.2\% of unique training examples. The remaining ~36.8% are out-of-bag (OOB) - never selected for that tree. :::

Feature Randomization: The Random Forest Innovation

Breiman's key insight, published in 2001, was that bagging alone still produces highly correlated trees. If one feature is a very strong predictor (say, credit_score in a loan default model), nearly every tree will split on it near the root. Those root-level splits make all trees structurally similar and highly correlated.

The fix: at each split, randomly sample mm features from the full dd features and only consider those mm features as split candidates.

The canonical defaults:

TaskDefault mtryIntuition
Classificationm=dm = \lfloor\sqrt{d}\rfloor~22% of features for d=20
Regressionm=d/3m = \lfloor d/3 \rfloor~33% of features for d=20

This forces different trees to use different features for their splits, decorrelating them. In the variance formula:

Var(Xˉ)=ρσ2+1ρBσ2\text{Var}(\bar{X}) = \rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2

Feature randomization drives ρ\rho toward zero. Each individual tree is slightly worse (higher bias, because it cannot always use the best feature at each split), but the ensemble is dramatically better because averaging near-independent predictions collapses the variance.

Out-of-Bag Error: The Free Validation Set

Because each tree is trained on a bootstrap sample, roughly 37% of training points are out-of-bag for each tree. This means we can evaluate every training point using only the trees that never saw it during training - a rigorous held-out evaluation without requiring a separate validation set.

Algorithm for OOB error estimation:

For each training point (x_i, y_i):
OOB_trees_i = all trees b where i is NOT in D_b
Collect predictions from OOB_trees_i only
Aggregate: ŷ_i_OOB = majority_vote(OOB_trees_i predictions)

OOB error = mean(ŷ_i_OOB ≠ y_i) over all i

This is a nearly unbiased estimate of test error:
- Each prediction uses only trees that never saw that point
- No separate validation split needed
- Every training point is used in both fitting and validation
Bootstrap sample illustration (n=10, B=3 trees):

Training set: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
(index, not value)

Bootstrap D1: [2, 2, 5, 7, 1, 9, 3, 5, 8, 2] OOB for tree 1: {4, 6, 10}
Bootstrap D2: [4, 1, 6, 6, 3, 9, 2, 8, 4, 7] OOB for tree 2: {5, 10}
Bootstrap D3: [9, 3, 5, 2, 8, 1, 7, 5, 3, 6] OOB for tree 3: {4, 10}

Point 4: OOB for trees 1 and 3 → predicted by trees 1, 3 only
Point 10: OOB for all 3 trees → predicted by all three trees
Point 5: OOB for tree 2 only → predicted by tree 2 only

:::tip When to use OOB error vs cross-validation OOB error is faster to compute and uses all training data for both fitting and validation. It is a good default for hyperparameter searches on Random Forests specifically. Use full cross-validation when comparing across different model families (e.g., RF vs. GBM vs. logistic regression), because OOB error is only naturally available for bagging-based models. :::

Full scikit-learn RF with OOB Evaluation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# ── Generate credit risk dataset ──────────────────────────────────────────────
X, y = make_classification(
n_samples=5000, n_features=20, n_informative=10,
n_redundant=5, n_clusters_per_class=2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)

# ── Train Random Forest with OOB score ────────────────────────────────────────
rf = RandomForestClassifier(
n_estimators=200,
max_features="sqrt", # = int(sqrt(20)) = 4 features per split
max_depth=None, # fully grown trees
min_samples_leaf=1,
bootstrap=True,
oob_score=True, # compute OOB accuracy
n_jobs=-1, # all CPU cores
random_state=42,
)
rf.fit(X_train, y_train)

test_acc = accuracy_score(y_test, rf.predict(X_test))
test_auc = roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1])

print(f"Test accuracy : {test_acc:.4f}")
print(f"Test AUC : {test_auc:.4f}")
print(f"OOB accuracy : {rf.oob_score_:.4f} (free validation estimate)")
print(f"OOB error : {1 - rf.oob_score_:.4f}")
print(f"Train-test gap : {rf.score(X_train, y_train) - test_acc:.4f}")

# ── OOB error vs n_estimators (convergence curve) ─────────────────────────────
oob_errors = []
for n_trees in range(1, 201):
rf_partial = RandomForestClassifier(
n_estimators=n_trees, oob_score=True,
max_features="sqrt", random_state=42, n_jobs=-1
)
rf_partial.fit(X_train, y_train)
oob_errors.append(1 - rf_partial.oob_score_)

plt.figure(figsize=(10, 4))
plt.plot(range(1, 201), oob_errors, color="#2563eb", linewidth=1.5)
plt.xlabel("Number of trees")
plt.ylabel("OOB error rate")
plt.title("OOB Error Convergence - Random Forest")
plt.axvline(50, linestyle="--", color="gray", alpha=0.5, label="50 trees")
plt.axvline(100, linestyle="--", color="orange", alpha=0.5, label="100 trees")
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig("oob_convergence.png", dpi=150)
plt.show()

Extra-Trees: Even More Decorrelation

ExtraTreesClassifier (Extremely Randomized Trees) takes decorrelation one step further - it randomizes the split threshold itself, not just the feature subset.

Comparison:

  • Random Forest: sample mm features per split, find the optimal threshold among those mm features via exhaustive search.
  • Extra Trees: sample mm features per split, pick the split threshold uniformly at random within each feature's observed range.
Decision boundary comparison (conceptual):

Random Forest Extra Trees
┌──────────────────────┐ ┌──────────────────────┐
│ ○ ○ ○ │ ● ● ● │ │ ○ ○│○ ● ● ● │
│ ○ ○ │ ● ● │ │ ○ │ ○ ● ● │
│ ○ ○ ○ │ ● ● │ │ ○ │○ ● ● │
└──────────────────────┘ └──────────────────────┘
Optimal threshold Random threshold
(slower: exhaustive search) (faster: O(1) per split)

The result:

  • Individual Extra Trees are weaker (higher bias from random thresholds)
  • But the ensemble is more decorrelated (lower ρ\rho in the variance formula)
  • Training is significantly faster (no exhaustive threshold search)

Extra Trees wins on high-dimensional noisy datasets where random thresholds act as additional regularization. On clean low-noise datasets with strong signal, Random Forests with optimal thresholds usually win. Always benchmark both.

from sklearn.ensemble import ExtraTreesClassifier

et = ExtraTreesClassifier(
n_estimators=200,
max_features="sqrt",
n_jobs=-1,
random_state=42,
)
et.fit(X_train, y_train)
et_auc = roc_auc_score(y_test, et.predict_proba(X_test)[:, 1])
print(f"Extra Trees AUC: {et_auc:.4f} (vs RF AUC: {test_auc:.4f})")

Bias-Variance Decomposition for Ensembles

The key theoretical insight about Random Forests:

Bagging (Random Forest) reduces variance, not bias:

Bias(fˉ)Bias(fb)(averaging does not reduce bias)\text{Bias}(\bar{f}) \approx \text{Bias}(f^b) \quad \text{(averaging does not reduce bias)}

Var(fˉ)=ρσ2+(1ρ)σ2BVar(fb)\text{Var}(\bar{f}) = \rho \sigma^2 + \frac{(1-\rho)\sigma^2}{B} \ll \text{Var}(f^b)

This means:

  • If your single tree is systematically wrong in certain regions (high bias), the forest will be similarly wrong. Averaging cannot correct systematic errors.
  • If your single tree is inconsistent (high variance), the forest dramatically reduces this inconsistency.

This is why the credit risk scenario works: the single tree's 99% training accuracy reflects extreme variance (it memorized training quirks), not genuine signal. The forest reduces that variance while approximately preserving the bias.

Bias-variance view for RF vs GBM:

Method | Bias effect | Variance effect | Training
───────────────|─────────────────|──────────────────────|──────────
Random Forest | Unchanged ≈ 0 | Reduced by 1/(B(1-ρ))| Parallel
Gradient Boost | Reduced each M | Can grow with M | Sequential

RF wins when: variance is the problem (single tree overfit)
GBM wins when: bias is the problem (need to correct systematic errors)

Feature Importance: MDI vs Permutation

Random Forests provide two importance measures that often disagree substantially.

Mean Decrease in Impurity (MDI)

MDI accumulates the weighted impurity reduction contributed by each feature across all trees:

MDI(fj)=1Bb=1BnTbn splits on fjntnΔG(n)\text{MDI}(f_j) = \frac{1}{B} \sum_{b=1}^{B} \sum_{\substack{n \in T_b \\ n \text{ splits on } f_j}} \frac{n_t}{n} \cdot \Delta G(n)

where p(n)=nt/np(n) = n_t/n is the fraction of training samples reaching node nn and ΔG(n)\Delta G(n) is the Gini reduction at that node.

MDI bias with high-cardinality features: A continuous feature with 1000 unique values has 999 threshold candidates; a binary feature has 1. The continuous feature gets more opportunities to contribute to impurity reduction just by having more splits. MDI systematically inflates the importance of high-cardinality features.

Permutation Importance (MDA - Mean Decrease Accuracy)

For each feature, randomly shuffle its values in the validation set and measure the drop in accuracy:

PI(fj)=AccbaselineEshuffles[Accshuffled(fj)]\text{PI}(f_j) = \text{Acc}_{\text{baseline}} - \mathbb{E}_{\text{shuffles}}[\text{Acc}_{\text{shuffled}(f_j)}]

Permutation importance is unbiased with respect to cardinality because it measures actual predictive contribution, not internal tree structure. The cost: it requires many model evaluations and can be misleading for correlated features (permuting one of two correlated features leaves the other to carry the signal, making both look unimportant).

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Create dataset with features of very different cardinalities
rng = np.random.default_rng(42)
n = 3000

X_feat = np.column_stack([
rng.normal(0, 1, n), # f0: continuous (many unique values)
rng.normal(0, 1, n), # f1: continuous (many unique values)
rng.integers(0, 2, n).astype(float), # f2: binary (2 unique values)
rng.integers(0, 5, n).astype(float), # f3: low-cardinality (5 unique)
rng.normal(0, 0.01, n), # f4: nearly constant (many unique, no signal)
])
# f0 and f1 are the real predictors; others are noise
y_feat = (rng.normal(0, 1, n) + 2*X_feat[:, 0] + 1.5*X_feat[:, 1] > 0).astype(int)

X_f_tr, X_f_val, y_f_tr, y_f_val = train_test_split(
X_feat, y_feat, test_size=0.3, random_state=42
)

rf_fi = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf_fi.fit(X_f_tr, y_f_tr)

# ── MDI importance ─────────────────────────────────────────────────────────────
mdi = rf_fi.feature_importances_

# ── Permutation importance ─────────────────────────────────────────────────────
perm = permutation_importance(
rf_fi, X_f_val, y_f_val,
n_repeats=20, random_state=42, scoring="roc_auc"
)
perm_mean = perm.importances_mean
perm_std = perm.importances_std

# ── Plot comparison ────────────────────────────────────────────────────────────
feature_names = ["f0 (cont)", "f1 (cont)", "f2 (binary)", "f3 (low-card)", "f4 (const)"]
x_pos = np.arange(len(feature_names))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.bar(x_pos, mdi, color="#6366f1", alpha=0.8)
ax1.set_xticks(x_pos)
ax1.set_xticklabels(feature_names, rotation=30, ha="right")
ax1.set_title("MDI Feature Importance\n(biased toward high-cardinality)")
ax1.set_ylabel("MDI Score")

ax2.bar(x_pos, perm_mean, color="#16a34a", alpha=0.8, yerr=perm_std, capsize=4)
ax2.set_xticks(x_pos)
ax2.set_xticklabels(feature_names, rotation=30, ha="right")
ax2.set_title("Permutation Importance\n(unbiased, measures actual predictive contribution)")
ax2.set_ylabel("AUC decrease when permuted")

plt.suptitle("MDI vs Permutation Importance - Cardinality Bias Demo", fontsize=12)
plt.tight_layout()
plt.savefig("mdi_vs_permutation.png", dpi=150)
plt.show()

print("MDI ranking: ", [feature_names[i] for i in np.argsort(mdi)[::-1]])
print("Permutation ranking: ", [feature_names[i] for i in np.argsort(perm_mean)[::-1]])

Proximity Matrix for Visualization

The Random Forest proximity matrix captures how often any pair of training examples ends up in the same leaf, averaged across all trees:

Proximity(i,j)=1Bb=1B1[same leaf in tree b]\text{Proximity}(i, j) = \frac{1}{B} \sum_{b=1}^B \mathbf{1}[\text{same leaf in tree } b]

Examples with high proximity are "similar" according to the forest's learned representation. This enables cluster discovery, outlier detection, and visualization - without requiring distance metric specification.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def compute_proximity_matrix(forest, X: np.ndarray) -> np.ndarray:
"""
Compute the RF proximity matrix.
prox[i, j] = fraction of trees where samples i and j fall in the same leaf.

Note: O(n^2 * B) memory and time - use only for small-medium datasets.
"""
n = len(X)
# Get leaf node IDs for all samples in all trees: shape (n_estimators, n)
leaf_ids = forest.apply(X) # shape: (n, n_estimators) in sklearn
leaf_ids = leaf_ids.T # shape: (n_estimators, n) for convenience

prox = np.zeros((n, n))
for tree_leaves in leaf_ids:
# For each tree, add 1 to (i, j) if they share a leaf
for leaf_id in np.unique(tree_leaves):
idx = np.where(tree_leaves == leaf_id)[0]
# All pairs in this leaf
prox[np.ix_(idx, idx)] += 1

prox /= forest.n_estimators
np.fill_diagonal(prox, 1.0) # self-proximity is 1.0
return prox


# Usage: visualize proximity structure
rf_small = RandomForestClassifier(n_estimators=50, random_state=42)
rf_small.fit(X_train[:500], y_train[:500]) # subset for memory

prox = compute_proximity_matrix(rf_small, X_train[:500])

# Use proximity as a distance for further analysis
from scipy.spatial.distance import squareform
distance_matrix = 1.0 - prox # high proximity = low distance
np.fill_diagonal(distance_matrix, 0.0)

print(f"Proximity matrix shape: {prox.shape}")
print(f"Mean proximity (same class): "
f"{prox[y_train[:500]==1][:, y_train[:500]==1].mean():.4f}")
print(f"Mean proximity (diff class): "
f"{prox[y_train[:500]==0][:, y_train[:500]==1].mean():.4f}")

Key Hyperparameters Reference

ParameterWhat it controlsPractical advice
n_estimatorsNumber of treesStart at 100-200. More trees never hurt accuracy, only compute.
max_featuresFeatures sampled per split"sqrt" for classification, d/3 for regression. Increase if trees are too weak.
max_depthMaximum tree depthNone (fully grown) is the standard default. Set for memory/latency constraints.
min_samples_leafMinimum samples in a leafIncrease to 5-20 for noisy data; acts as regularization.
bootstrapWhether to bootstrapAlways True for standard RF.
oob_scoreCompute OOB accuracySet True during hyperparameter search to avoid a separate validation fold.
n_jobsParallelism-1 uses all available cores. Trees are fully independent - parallelism is perfect.
max_samplesSamples per bootstrapDefault is n (full dataset). Reduce for very large datasets.

:::warning The n_estimators plateau Adding more trees always reduces variance - or keeps it flat. It never hurts accuracy. But marginal benefit decreases rapidly: going from 10 to 100 trees yields dramatic improvement; from 500 to 1000 is barely perceptible. In production, balance accuracy against inference latency and memory. 200 trees is a reasonable ceiling for most problems before hitting serious diminishing returns. :::

Random Forest vs Gradient Boosting: When to Choose RF

DimensionRandom ForestGradient Boosting
BiasHigher (parallel averaging doesn't reduce bias)Lower (sequential correction reduces bias)
VarianceLower (decorrelated tree averaging)Higher (can overfit with too many rounds)
Training speedFast (fully parallel)Slower (sequential by design)
Hyperparameter sensitivityLow - robust defaults work wellHigher - learning rate, depth, rounds interact
Noisy labelsRobustSensitive (up-weights hard examples)
Many irrelevant featuresExcellent - feature sampling helpsGood, but needs careful regularization
Raw tabular benchmarksStrong baselineUsually wins final leaderboard
InterpretabilityMDI + permutation + proximitySame + smooth partial dependence

Rule of thumb: Start with Random Forest. It is forgiving, robust, and trains fast. Move to gradient boosting (Lesson 05) when you need the last few percentage points of performance and are willing to tune carefully.

Production Engineering Notes

Memory Footprint

A forest of BB fully grown trees on a wide dataset consumes substantial RAM. Each tree stores split features, thresholds, and leaf values for every node. For B=200B = 200 fully grown trees on a 30-feature dataset with 50,000 samples, RAM usage can reach 2-4 GB.

import joblib
import sys

# Measure model size in memory
rf_size_bytes = sys.getsizeof(rf)
print(f"Approximate in-memory size: {rf_size_bytes / 1024**2:.1f} MB")

# Save with compression - cuts file size ~50% with minimal overhead
joblib.dump(rf, "random_forest_credit_v2.joblib", compress=3)

# Load in serving process
rf_loaded = joblib.load("random_forest_credit_v2.joblib")

# Check that loaded model matches original
assert np.allclose(rf.predict_proba(X_test[:10]),
rf_loaded.predict_proba(X_test[:10]))
print("Model load verified successfully")

Mitigations for memory:

  • Set max_depth (e.g., 15-20) to cap tree size.
  • Set min_samples_leaf (e.g., 5-10) to merge small leaves.
  • Use max_leaf_nodes to directly cap per-tree size.
  • Use n_estimators=100 instead of 500 - the last 400 trees add very little.

Prediction Latency

Each prediction traverses BB trees. For a forest with 200 trees of depth 20, that is up to 4,000 node comparisons per prediction. In Python this is fast enough for batch scoring (thousands of rows per second). For real-time serving at sub-10ms SLAs:

# Option 1: Reduce model complexity
rf_fast = RandomForestClassifier(
n_estimators=50, # 50 instead of 200
max_depth=10, # cap depth
n_jobs=-1,
random_state=42,
)
rf_fast.fit(X_train, y_train)

# Check speed
import time
X_batch = X_test[:1000]
start = time.perf_counter()
_ = rf_fast.predict(X_batch)
elapsed = (time.perf_counter() - start) * 1000
print(f"1000 predictions in {elapsed:.1f}ms ({elapsed/1000*1000:.3f}ms per row)")

# Option 2: Export to treelite for C-compiled inference (~10-100x speedup)
# pip install treelite treelite-runtime
# import treelite
# model_treelite = treelite.sklearn.import_model(rf)
# model_treelite.export_lib(toolchain='gcc', libpath='rf_model.so', verbose=True)

Monitoring OOB Error in Production

from sklearn.ensemble import RandomForestClassifier

ALERT_THRESHOLD = 0.15 # alert if OOB error exceeds 15%

def retrain_and_monitor(X_new, y_new, n_estimators=200):
"""Retrain RF and check for data drift via OOB error."""
rf_new = RandomForestClassifier(
n_estimators=n_estimators,
oob_score=True,
n_jobs=-1,
random_state=42,
)
rf_new.fit(X_new, y_new)
oob_error = 1.0 - rf_new.oob_score_

print(f"OOB error: {oob_error:.4f}")
if oob_error > ALERT_THRESHOLD:
raise RuntimeError(
f"OOB error {oob_error:.3f} exceeds threshold {ALERT_THRESHOLD} "
f"- check for data drift or label quality issues"
)
return rf_new

# Track OOB error across retraining runs as a drift signal
# A sudden jump indicates distribution shift, not a code regression

:::tip The production lesson from the credit risk scenario The original single tree's 99% training accuracy was a red flag, not a success metric. Overfitting to training data is invisible until production. Random Forest's lower training accuracy (94%) paired with much higher test accuracy (89%) is the healthy pattern. In production ML, a small train-test gap is worth far more than a high training score. The job of a model is to generalize, not to memorize. :::

YouTube Resources

VideoChannelWhy Watch It
Random Forests Part 1 - Building and UsingStatQuest with Josh StarmerBest intuitive explanation of bagging and feature randomization
Random Forests Part 2 - Missing Data and ClusteringStatQuest with Josh StarmerProximity matrix, missing data, and OOB error explained visually
Ensemble Methods - Bagging, Boosting, StackingKrish NaikHigh-level overview of all ensemble methods with sklearn
Feature Importance in Random ForestTowards Data ScienceMDI bias and permutation importance comparison
Bias-Variance Tradeoff in EnsemblesCoding LaneMathematical derivation of variance reduction formula

Interview Questions and Answers

Q1: Explain why Random Forest's training accuracy is lower than a single decision tree's, yet its test accuracy is higher.

A single decision tree grown to full depth memorizes the training set - it creates so many leaves that nearly every training example gets its own leaf with zero error. This is extreme variance: change any training example and the tree changes dramatically. The training accuracy of 99% is not a measure of learning, it is a measure of memorization. Random Forest introduces two mechanisms that deliberately prevent memorization: (1) each tree is trained on a bootstrap sample (different data) so different trees specialize on different parts of the space, and (2) at each split, only d\sqrt{d} randomly selected features are considered, forcing different trees to find different signal paths. Both mechanisms reduce correlation between trees. By the variance formula Var(Xˉ)=ρσ2+(1ρ)σ2/B\text{Var}(\bar{X}) = \rho\sigma^2 + (1-\rho)\sigma^2/B, lower correlation ρ\rho means averaging delivers large variance reduction. The lower training accuracy reflects that each individual tree is now slightly weaker - but the ensemble of 200 weaker, decorrelated trees dramatically outperforms one memorizer.

Q2: Derive the variance of the average of B correlated predictions. What does this tell you about the limits of ensemble learning?

Let X1,,XBX_1, \ldots, X_B be identically distributed with variance σ2\sigma^2 and pairwise correlation ρ\rho. Then Var(Xˉ)=Cov(Xˉ,Xˉ)=1B2i,jCov(Xi,Xj)=1B2[Bσ2+B(B1)ρσ2]=ρσ2+(1ρ)σ2B\text{Var}(\bar{X}) = \text{Cov}(\bar{X}, \bar{X}) = \frac{1}{B^2}\sum_{i,j}\text{Cov}(X_i, X_j) = \frac{1}{B^2}[B\sigma^2 + B(B-1)\rho\sigma^2] = \rho\sigma^2 + \frac{(1-\rho)\sigma^2}{B}. As BB \to \infty, the second term vanishes but the first term ρσ2\rho\sigma^2 remains fixed. This is the fundamental limit: correlation is the ceiling of ensemble improvement. No matter how many trees you add, if they are all making the same correlated errors (high ρ\rho), the variance floor remains at ρσ2\rho\sigma^2. Feature randomization directly attacks ρ\rho - it prevents all trees from using the same features at the root, breaking correlation. Without feature randomization (plain bagging), ρ\rho stays relatively high because strong features dominate all trees. With m=dm = \sqrt{d}, correlation drops dramatically.

Q3: What is Out-of-Bag error and why is it a nearly unbiased estimate of test error?

Each Random Forest tree is trained on a bootstrap sample that includes approximately 63.2% of unique training examples. The remaining ~36.8% are out-of-bag for that tree - never seen during its training. For each training example ii, OOB error uses only the predictions from trees where ii was OOB. Because those trees never saw ii, their predictions for ii are genuinely held-out predictions. Aggregating these across all examples gives an estimate equivalent to a leave-one-out cross-validation done on approximately 0.632B0.632 \cdot B trees per example. It is "nearly unbiased" (not exactly unbiased) because the trees used for OOB estimation are trained on slightly less data (63.2% instead of 100%), making them slightly weaker than the final forest - so OOB error slightly overestimates test error. In practice this bias is negligible.

Q4: How does MDI feature importance differ from permutation importance, and when does each mislead?

MDI (Mean Decrease Impurity) accumulates the weighted Gini reduction contributed by each feature across all nodes and trees. It is computed from the tree structure and requires no additional model evaluations. Its key bias: continuous or high-cardinality features get more split opportunities, inflating their MDI scores even if they are only marginally more predictive. Example: a patient ID column with unique values per row gets an enormous MDI score despite being useless. Permutation importance shuffles each feature's values in a held-out validation set and measures the resulting drop in model score. It is unbiased with respect to cardinality but has a different problem: when features are correlated, permuting one feature leaves another to carry the signal, making both look unimportant. The right approach: use permutation importance as the primary importance measure, group correlated features and permute them together, and use MDI only to understand tree structure.

Q5: What are Extra-Trees and when do they outperform Random Forest?

Extra-Trees (Extremely Randomized Trees) randomize both the feature subset AND the split threshold at each node. Random Forest picks the optimal threshold among mm randomly selected features. Extra-Trees picks a random threshold within the observed feature range. This additional randomization further reduces correlation between trees (lower ρ\rho), at the cost of higher individual tree bias (suboptimal thresholds). Extra-Trees trains faster because there is no exhaustive threshold search. Extra-Trees tends to outperform RF on: (1) very high-dimensional noisy datasets where random thresholds act as strong regularization, (2) datasets with many nearly-equivalent features where optimal threshold selection overfits to noise, (3) when training speed is critical. RF tends to outperform Extra-Trees on: (1) clean datasets with few strong features where finding optimal thresholds matters, (2) datasets with strong signal and low noise. Always benchmark both - the performance difference is often less than 0.5% AUC.

Q6: A colleague argues that adding a 501st tree to a 500-tree forest can only hurt performance because the model becomes more complex. Is this correct?

No, this is incorrect. Adding trees to a Random Forest can never decrease accuracy - it can only maintain or slightly improve it. The variance formula Var(Xˉ)=ρσ2+(1ρ)σ2/B\text{Var}(\bar{X}) = \rho\sigma^2 + (1-\rho)\sigma^2/B is monotonically decreasing in BB. More trees means lower variance, which means lower expected test error. However, the marginal benefit decreases rapidly - going from 500 to 501 trees changes performance by less than 0.0001%. The practical concern is not accuracy but compute: memory usage and prediction latency grow linearly with BB. The right tradeoff is to find the number of trees beyond which OOB error stops improving meaningfully (typically around 100-200 for most datasets), and use that as your n_estimators. The statement "more complex means worse" applies to model capacity (depth, features) but not to ensemble size.

Hyperparameter Tuning Reference

The most impactful Random Forest hyperparameters, in order of importance:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_samples=10000, n_features=30,
n_informative=15, n_redundant=5, random_state=42)

# Parameter grid in priority order (most → least impactful)
param_dist = {
# Tier 1: MOST important
'n_estimators': [100, 200, 300, 500], # more is better, diminishing returns after 200
'max_features': ['sqrt', 'log2', 0.3, 0.5], # sqrt for classification, 1/3 for regression
'max_depth': [None, 10, 20, 30, 50], # None = full tree; tune to reduce overfitting

# Tier 2: important regularizers
'min_samples_leaf': [1, 3, 5, 10, 20], # larger = more regularization
'min_samples_split': [2, 5, 10, 20],
'max_leaf_nodes': [None, 100, 500, 1000], # alternative to max_depth

# Tier 3: less critical
'bootstrap': [True, False], # False = paste, True = bagging
'class_weight': [None, 'balanced'], # important for imbalanced classes
}

rf_tuned = RandomizedSearchCV(
RandomForestClassifier(n_jobs=-1, oob_score=True, random_state=42),
param_distributions=param_dist,
n_iter=60, # number of random configurations to try
cv=StratifiedKFold(5, shuffle=True, random_state=42),
scoring='roc_auc',
n_jobs=-1,
random_state=42,
verbose=1,
)
rf_tuned.fit(X, y)

print(f"Best AUC: {rf_tuned.best_score_:.4f}")
print("Best params:")
for k, v in rf_tuned.best_params_.items():
print(f" {k}: {v}")

# Inspect OOB score of best model
best_rf = rf_tuned.best_estimator_
print(f"\nOOB AUC: {best_rf.oob_score_:.4f}")

Quick-start defaults that work 80% of the time:

  • n_estimators=200 - covers most datasets
  • max_features='sqrt' for classification, max_features=0.33 for regression
  • min_samples_leaf=5 - mild regularization, safe default
  • n_jobs=-1 - use all cores

Production Monitoring: Forest Health Checks

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

def check_forest_health(rf: RandomForestClassifier,
X_train: np.ndarray, y_train: np.ndarray,
X_val: np.ndarray, y_val: np.ndarray,
feature_names: list = None) -> dict:
"""
Comprehensive Random Forest health check for production models.

Checks:
1. OOB vs validation score - large gap suggests distribution shift
2. Tree depth distribution - pathologically deep trees signal poor tuning
3. MDI vs permutation importance agreement - large disagreement suggests cardinality bias
4. Feature importance stability - high variance across trees signals brittle features
"""
n_features = X_train.shape[1]
if feature_names is None:
feature_names = [f"f{i}" for i in range(n_features)]

diagnostics = {}

# 1. OOB vs validation score
oob_score = rf.oob_score_ if hasattr(rf, 'oob_score_') and rf.oob_score_ else None
val_score = rf.score(X_val, y_val)
diagnostics['oob_score'] = oob_score
diagnostics['val_score'] = val_score
if oob_score:
gap = abs(oob_score - val_score)
diagnostics['oob_val_gap'] = gap
if gap > 0.03:
print(f"WARNING: OOB-val gap={gap:.4f} (>0.03) - possible distribution shift")

# 2. Tree depth distribution
depths = [t.get_depth() for t in rf.estimators_]
diagnostics['mean_tree_depth'] = np.mean(depths)
diagnostics['max_tree_depth'] = np.max(depths)
diagnostics['std_tree_depth'] = np.std(depths)
if np.mean(depths) > 25:
print(f"WARNING: Mean tree depth {np.mean(depths):.1f} - consider max_depth or min_samples_leaf")

# 3. MDI vs permutation importance
mdi_importance = rf.feature_importances_
perm_result = permutation_importance(rf, X_val, y_val, n_repeats=10,
random_state=42, n_jobs=-1)
perm_mean = perm_result.importances_mean
perm_std = perm_result.importances_std

# Rank correlation between MDI and permutation importance
from scipy.stats import spearmanr
rank_corr, _ = spearmanr(mdi_importance, perm_mean)
diagnostics['mdi_perm_rank_corr'] = rank_corr
if rank_corr < 0.7:
print(f"WARNING: MDI-permutation rank correlation={rank_corr:.3f} (<0.7) "
f"- possible cardinality bias in MDI")

# 4. Importance stability across trees
per_tree_importances = np.array([t.feature_importances_ for t in rf.estimators_])
importance_cv = per_tree_importances.std(axis=0) / (per_tree_importances.mean(axis=0) + 1e-10)
diagnostics['mean_importance_cv'] = importance_cv.mean()

# Plots
fig, axes = plt.subplots(2, 2, figsize=(16, 11))

# Tree depth histogram
axes[0, 0].hist(depths, bins=20, color='#3b82f6', alpha=0.7)
axes[0, 0].axvline(np.mean(depths), color='#dc2626', linestyle='--',
label=f'Mean={np.mean(depths):.1f}')
axes[0, 0].set_xlabel("Tree Depth")
axes[0, 0].set_ylabel("Count")
axes[0, 0].set_title("Tree Depth Distribution")
axes[0, 0].legend()

# Top features: MDI
top_idx = np.argsort(mdi_importance)[-15:]
axes[0, 1].barh([feature_names[i] for i in top_idx], mdi_importance[top_idx],
color='#7c3aed', alpha=0.7)
axes[0, 1].set_xlabel("MDI Importance")
axes[0, 1].set_title("Top 15 Features: MDI")

# Top features: Permutation
top_perm_idx = np.argsort(perm_mean)[-15:]
axes[1, 0].barh([feature_names[i] for i in top_perm_idx], perm_mean[top_perm_idx],
xerr=perm_std[top_perm_idx], color='#16a34a', alpha=0.7)
axes[1, 0].set_xlabel("Permutation Importance")
axes[1, 0].set_title("Top 15 Features: Permutation (with std)")

# MDI vs Permutation scatter
axes[1, 1].scatter(mdi_importance, perm_mean, s=20, alpha=0.6, color='#ea580c')
axes[1, 1].set_xlabel("MDI Importance")
axes[1, 1].set_ylabel("Permutation Importance")
axes[1, 1].set_title(f"MDI vs Permutation (Spearman r={rank_corr:.3f})")
for i, name in enumerate(feature_names):
if mdi_importance[i] > np.percentile(mdi_importance, 90):
axes[1, 1].annotate(name, (mdi_importance[i], perm_mean[i]), fontsize=7)

plt.suptitle("Random Forest Health Check", fontsize=13)
plt.tight_layout()
plt.show()

return diagnostics

## OOB Error Convergence and Optimal Tree Count

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def find_optimal_n_estimators(X_train, y_train, X_val, y_val,
max_trees: int = 500,
step: int = 10) -> dict:
"""
Find the optimal number of trees by monitoring OOB error convergence.
Returns: dict with optimal_n and the OOB curve data.
"""
n_values = list(range(10, max_trees + 1, step))
oob_errors = []
val_aucs = []

for n in n_values:
rf = RandomForestClassifier(
n_estimators=n,
oob_score=True,
max_features='sqrt',
min_samples_leaf=5,
n_jobs=-1,
random_state=42,
)
rf.fit(X_train, y_train)
oob_errors.append(1 - rf.oob_score_)
val_aucs.append(roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1]))

oob_errors = np.array(oob_errors)
val_aucs = np.array(val_aucs)

# Find elbow: where OOB improvement drops below 0.001 per 10 trees
improvements = np.diff(oob_errors)
elbow_idx = next((i for i, imp in enumerate(improvements) if abs(imp) < 0.001), -1)
optimal_n = n_values[elbow_idx] if elbow_idx >= 0 else n_values[np.argmin(oob_errors)]

print(f"OOB error range: [{oob_errors.min():.4f}, {oob_errors.max():.4f}]")
print(f"Val AUC range: [{val_aucs.min():.4f}, {val_aucs.max():.4f}]")
print(f"Optimal n_estimators (elbow): {optimal_n}")
print(f"OOB error at optimal: {oob_errors[n_values.index(optimal_n)]:.4f}")

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(n_values, oob_errors, 'o-', ms=3, color='#dc2626', linewidth=1.5)
axes[0].axvline(optimal_n, color='#1e293b', linestyle='--',
label=f'Optimal n={optimal_n}')
axes[0].set_xlabel("Number of Trees")
axes[0].set_ylabel("OOB Error Rate")
axes[0].set_title("OOB Error Convergence")
axes[0].legend()

axes[1].plot(n_values, val_aucs, 'o-', ms=3, color='#3b82f6', linewidth=1.5)
axes[1].axvline(optimal_n, color='#1e293b', linestyle='--',
label=f'Optimal n={optimal_n}')
axes[1].set_xlabel("Number of Trees")
axes[1].set_ylabel("Validation AUC")
axes[1].set_title("Validation AUC vs n_estimators")
axes[1].legend()

plt.suptitle("Finding Optimal n_estimators", fontsize=12)
plt.tight_layout()
plt.show()

return {'optimal_n': optimal_n, 'oob_errors': oob_errors, 'val_aucs': val_aucs}


# Usage
X, y = make_classification(n_samples=8000, n_features=25,
n_informative=12, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
result = find_optimal_n_estimators(X_tr, y_tr, X_val, y_val, max_trees=400, step=10)

Proximity Matrix: Similarity from Tree Structure

The Random Forest proximity matrix measures how often pairs of data points land in the same leaf across all trees. This is a powerful unsupervised measure of similarity that is invariant to scale and captures non-linear relationships:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

def compute_proximity_matrix(rf: RandomForestClassifier,
X: np.ndarray,
normalize: bool = True) -> np.ndarray:
"""
Compute RF proximity matrix.
proximity[i,j] = fraction of trees where samples i and j land in same leaf.
Range: [0, 1] - higher means more similar.

O(n² * n_estimators * n_nodes) - use only for n < 5000.
"""
n = X.shape[0]
n_trees = len(rf.estimators_)

# Get leaf indices for each sample, each tree
# leaf_indices: (n_samples, n_estimators)
leaf_indices = rf.apply(X)

# Count co-occurrences
proximity = np.zeros((n, n), dtype=np.float32)
for tree_idx in range(n_trees):
leaves = leaf_indices[:, tree_idx]
# Vectorized: points in same leaf have proximity += 1
same_leaf = leaves[:, None] == leaves[None, :] # (n, n) boolean
proximity += same_leaf.astype(np.float32)

if normalize:
proximity /= n_trees

return proximity


# Usage: compare samples within/across classes
X_prox, y_prox = make_classification(n_samples=300, n_features=10,
n_informative=6, random_state=42)
rf_prox = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_prox, y_prox)
P = compute_proximity_matrix(rf_prox, X_prox)

# Sort by class for visualization
sort_idx = np.argsort(y_prox)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

axes[0].imshow(P[sort_idx][:, sort_idx], cmap='viridis', aspect='auto')
axes[0].set_title("RF Proximity Matrix (sorted by class)")
axes[0].set_xlabel("Sample index")
axes[0].set_ylabel("Sample index")
plt.colorbar(axes[0].images[0], ax=axes[0])

# Within-class vs between-class proximity
within_same = P[y_prox==0][:, y_prox==0][np.triu_indices(y_prox.sum()==0, k=1)].mean()
within_diff = P[y_prox==0][:, y_prox==1].mean()
axes[1].bar(['Same class', 'Different class'],
[P[y_prox[sort_idx[:150]]==y_prox[sort_idx[:150]]].mean(),
P[y_prox[sort_idx[:150]]!=y_prox[sort_idx[150:]]].mean()
if len(sort_idx) > 150 else 0],
color=['#3b82f6', '#ef4444'], alpha=0.8)
axes[1].set_ylabel("Mean Proximity")
axes[1].set_title("Within-class vs Between-class Proximity")

plt.tight_layout()
plt.show()

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Random Forest Ensemble demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.