Gradient Boosting From Scratch
The Production Scenario
A fintech company has been running a Random Forest for loan default prediction. AUC sits at 0.84 - solid, but the model committee has set a 0.90 target before approving the model for an expanded product line. A new ML engineer proposes switching to gradient boosting. She runs the experiment over a weekend. Same features, same training data. The gradient boosting model hits 0.91 AUC.
The senior engineer calls a design review. He is skeptical of "magic" and asks the team to explain precisely why gradient boosting outperforms the Random Forest on this dataset. Nobody gives a satisfying answer. One engineer says: "it just uses better trees." Another says: "it has more hyperparameters so it fits better." Both answers are wrong in the ways that matter.
The meeting ends with an action item: understand gradient boosting deeply enough to defend the model choice. This lesson provides that understanding. By the end, you will be able to explain gradient boosting from the loss function up - including why it specifically reduces bias while Random Forest reduces variance, and when each approach wins.
The Additive Model Framework
Gradient boosting builds an additive model: the final prediction is a sum of many simple models, each contributing a small correction.
where:
- is a constant (the initial guess, typically the mean for regression)
- for are shallow regression trees
- is the learning rate (shrinkage)
- is the number of boosting rounds
The model is built iteratively:
The key question: what should each be, and how should we choose it? This is where functional gradient descent comes in.
AdaBoost: Intuition Before the Math
Before gradient boosting, Freund and Schapire introduced AdaBoost in 1997. It is worth understanding because it crystallizes the sequential correction concept:
- Start with uniform sample weights .
- Fit a weak classifier on the weighted training set.
- Compute weighted error .
- Compute classifier weight .
- Increase weights on misclassified samples, decrease on correct ones.
- Final prediction: .
AdaBoost's key idea: each new classifier focuses on the mistakes of the previous ensemble by up-weighting hard examples. The sequential correction is explicit in the sample weights.
AdaBoost is clever but fragile - it is sensitive to noisy labels (up-weighting outliers repeatedly hurts), and the weighting trick is specific to binary classification. Gradient boosting is the principled generalization that works for any differentiable loss function.
Gradient Boosting as Functional Gradient Descent
Gradient descent in parameter space moves parameters in the direction of the negative gradient of the loss. Gradient boosting applies the same idea in function space - we are not optimizing a parameter vector, we are finding a function .
We want to find that minimizes the total loss:
If were a free vector of real numbers (one per training example), we would update it as:
The gradient with respect to each prediction is:
We cannot take an unconstrained step in function space (that would just set for every training point, memorizing the data). Instead, we fit a tree to the negative gradients, then add a shrunken version of that tree to the ensemble:
The negative gradients are called pseudo-residuals:
Each tree is fit to the pseudo-residuals . This is gradient descent in function space: instead of updating a parameter vector, we add a function (tree) that best approximates the gradient direction.
Pseudo-Residuals for Common Loss Functions
The power of this formulation: pseudo-residuals automatically adapt to any differentiable loss function.
Mean Squared Error - Regression
For MSE, pseudo-residuals are actual residuals. This is the special case that makes gradient boosting regression intuitive: just fit trees to the residuals and add them up.
Log-Loss - Binary Classification
The pseudo-residuals are the difference between the true label and the current predicted probability. This is identical in form to logistic regression's gradient - the connection to optimization theory is explicit and not coincidental.
Mean Absolute Error - Robust Regression
Pseudo-residuals are just signs. This makes gradient boosting with MAE loss more robust to outliers than with MSE loss - outliers contribute the same magnitude signal () as regular examples, rather than quadratically larger signals.
Pseudo-Residual Summary
Loss function | Pseudo-residual r_i | Sensitivity to outliers
──────────────────|──────────────────────────────|─────────────────────────
MSE (1/2)(y-F)² | y_i - F(x_i) (residuals) | High (quadratic)
MAE |y-F| | sign(y_i - F(x_i)) | Low (bounded ±1)
Log-loss | y_i - σ(F(x_i)) | Moderate
Huber | mix of MSE/MAE by threshold | Low for large residuals
The Gradient Boosting Algorithm
Algorithm: Gradient Boosting
─────────────────────────────────────────────────────────────────────────────
Input: training data {(x_i, y_i)}, loss L, M rounds, learning rate ν
Step 1 - Initialise with a constant:
F_0(x) = argmin_γ Σ L(y_i, γ)
(For MSE: F_0 = mean(y))
(For log-loss: F_0 = log(p̄/(1-p̄)) where p̄ = mean(y))
Step 2 - For m = 1, 2, ..., M:
a. Compute pseudo-residuals at each training point:
r_im = -[∂L(y_i, F(x_i)) / ∂F(x_i)] evaluated at F = F_{m-1}
b. Fit a regression tree h_m to {(x_i, r_im)}
(tree predicts pseudo-residuals, not original targets)
c. Compute optimal leaf weights γ_jm for each leaf j:
γ_jm = argmin_γ Σ_{x_i in leaf j} L(y_i, F_{m-1}(x_i) + γ)
(one-dimensional line search per leaf)
d. Update the ensemble:
F_m(x) = F_{m-1}(x) + ν · h_m(x)
Output: F_M(x)
─────────────────────────────────────────────────────────────────────────────
Step 2c (computing optimal leaf weights) is an important detail. After the tree structure is fixed (which feature and threshold at each node), the leaf values are re-optimized directly against the loss function - not just set to the average pseudo-residual in that leaf. Sklearn's implementation does this; our scratch implementation below uses pseudo-residual averages (simpler and nearly equivalent for MSE loss).
NumPy GBM From Scratch
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, log_loss
class GradientBoostingRegressorScratch:
"""
Gradient boosting for regression with MSE loss.
For MSE: pseudo-residuals = actual residuals.
Implements bootstrap row sampling (stochastic boosting).
"""
def __init__(
self,
n_estimators: int = 100,
learning_rate: float = 0.1,
max_depth: int = 3,
subsample: float = 1.0,
random_state: int | None = None,
):
self.n_estimators = n_estimators
self.learning_rate = learning_rate
self.max_depth = max_depth
self.subsample = subsample
self.random_state = random_state
self.trees_: list = []
self.F0_: float = 0.0
self.train_losses_: list = []
def fit(self, X: np.ndarray, y: np.ndarray) -> "GradientBoostingRegressorScratch":
rng = np.random.default_rng(self.random_state)
n_samples = X.shape[0]
# Step 1: initialise with mean
self.F0_ = float(np.mean(y))
F = np.full(n_samples, self.F0_)
self.trees_ = []
self.train_losses_ = [float(np.mean((y - F) ** 2))]
for m in range(self.n_estimators):
# Step 2a: pseudo-residuals (= residuals for MSE)
residuals = y - F
# Stochastic gradient boosting: row subsampling
if self.subsample < 1.0:
n_sub = max(1, int(n_samples * self.subsample))
sample_idx = rng.choice(n_samples, size=n_sub, replace=False)
else:
sample_idx = np.arange(n_samples)
X_sub = X[sample_idx]
r_sub = residuals[sample_idx]
# Step 2b: fit regression tree to pseudo-residuals
tree = DecisionTreeRegressor(
max_depth=self.max_depth,
random_state=int(rng.integers(1_000_000)),
)
tree.fit(X_sub, r_sub)
self.trees_.append(tree)
# Step 2d: update ensemble on ALL samples
F = F + self.learning_rate * tree.predict(X)
self.train_losses_.append(float(np.mean((y - F) ** 2)))
return self
def predict(self, X: np.ndarray) -> np.ndarray:
F = np.full(X.shape[0], self.F0_)
for tree in self.trees_:
F = F + self.learning_rate * tree.predict(X)
return F
def staged_predict(self, X: np.ndarray):
"""Yield predictions after each boosting round - useful for manual early stopping."""
F = np.full(X.shape[0], self.F0_)
yield F.copy()
for tree in self.trees_:
F = F + self.learning_rate * tree.predict(X)
yield F.copy()
class GradientBoostingClassifierScratch:
"""
Gradient boosting for binary classification with log-loss.
Pseudo-residuals: r_i = y_i - sigmoid(F_{m-1}(x_i)).
"""
def __init__(
self,
n_estimators: int = 100,
learning_rate: float = 0.1,
max_depth: int = 3,
subsample: float = 1.0,
random_state: int | None = None,
):
self.n_estimators = n_estimators
self.learning_rate = learning_rate
self.max_depth = max_depth
self.subsample = subsample
self.random_state = random_state
self.trees_: list = []
self.F0_: float = 0.0
self.train_losses_: list = []
@staticmethod
def _sigmoid(x: np.ndarray) -> np.ndarray:
return 1.0 / (1.0 + np.exp(-x))
def fit(self, X: np.ndarray, y: np.ndarray) -> "GradientBoostingClassifierScratch":
rng = np.random.default_rng(self.random_state)
n_samples = X.shape[0]
y = y.astype(float)
# Step 1: initialise with log-odds of mean
p_mean = np.clip(np.mean(y), 1e-9, 1 - 1e-9)
self.F0_ = float(np.log(p_mean / (1 - p_mean)))
F = np.full(n_samples, self.F0_)
self.trees_ = []
initial_prob = self._sigmoid(F)
self.train_losses_ = [float(log_loss(y, initial_prob))]
for m in range(self.n_estimators):
# Step 2a: pseudo-residuals for log-loss = y_i - p_i
p = self._sigmoid(F)
residuals = y - p
# Stochastic gradient boosting
if self.subsample < 1.0:
n_sub = max(1, int(n_samples * self.subsample))
sample_idx = rng.choice(n_samples, size=n_sub, replace=False)
else:
sample_idx = np.arange(n_samples)
X_sub = X[sample_idx]
r_sub = residuals[sample_idx]
# Step 2b: fit tree to pseudo-residuals
tree = DecisionTreeRegressor(
max_depth=self.max_depth,
random_state=int(rng.integers(1_000_000)),
)
tree.fit(X_sub, r_sub)
self.trees_.append(tree)
# Step 2d: update
F = F + self.learning_rate * tree.predict(X)
current_prob = self._sigmoid(F)
self.train_losses_.append(float(log_loss(y, current_prob)))
return self
def predict_proba(self, X: np.ndarray) -> np.ndarray:
F = np.full(X.shape[0], self.F0_)
for tree in self.trees_:
F = F + self.learning_rate * tree.predict(X)
p = self._sigmoid(F)
return np.column_stack([1 - p, p])
def predict(self, X: np.ndarray) -> np.ndarray:
return (self.predict_proba(X)[:, 1] >= 0.5).astype(int)
Usage and Loss Curve
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt
# ── Regression example ────────────────────────────────────────────────────────
X_reg, y_reg = make_regression(n_samples=2000, n_features=15, noise=30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)
gbm_reg = GradientBoostingRegressorScratch(
n_estimators=200, learning_rate=0.1, max_depth=3, subsample=0.8, random_state=42
)
gbm_reg.fit(X_tr, y_tr)
train_mse = mean_squared_error(y_tr, gbm_reg.predict(X_tr))
test_mse = mean_squared_error(y_te, gbm_reg.predict(X_te))
print(f"Regression - Train MSE: {train_mse:.2f} | Test MSE: {test_mse:.2f}")
# ── Classification example ────────────────────────────────────────────────────
X_cls, y_cls = make_classification(
n_samples=3000, n_features=20, n_informative=10, random_state=42
)
X_ctr, X_cte, y_ctr, y_cte = train_test_split(X_cls, y_cls, test_size=0.2, random_state=42)
gbm_cls = GradientBoostingClassifierScratch(
n_estimators=200, learning_rate=0.1, max_depth=3, subsample=0.8, random_state=42
)
gbm_cls.fit(X_ctr, y_ctr)
train_auc = roc_auc_score(y_ctr, gbm_cls.predict_proba(X_ctr)[:, 1])
test_auc = roc_auc_score(y_cte, gbm_cls.predict_proba(X_cte)[:, 1])
print(f"Classification - Train AUC: {train_auc:.4f} | Test AUC: {test_auc:.4f}")
# ── Loss curve visualization ──────────────────────────────────────────────────
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
ax1.plot(gbm_reg.train_losses_, color="#2563eb", linewidth=1.5)
ax1.set_xlabel("Boosting round")
ax1.set_ylabel("Train MSE")
ax1.set_title("Gradient Boosting - Regression Loss Curve")
ax1.grid(alpha=0.3)
ax2.plot(gbm_cls.train_losses_, color="#dc2626", linewidth=1.5)
ax2.set_xlabel("Boosting round")
ax2.set_ylabel("Train Log-Loss")
ax2.set_title("Gradient Boosting - Classification Loss Curve")
ax2.grid(alpha=0.3)
plt.tight_layout()
plt.savefig("gbm_loss_curves.png", dpi=150)
plt.show()
Shrinkage (Learning Rate): Why It Helps
The learning rate scales each tree's contribution before adding it to the ensemble:
Shrinkage is the primary regularization mechanism in gradient boosting.
Effect of learning rate on bias-variance tradeoff:
ν = 1.0 │ Fast learning │ Few trees needed │ High overfitting risk
ν = 0.1 │ Moderate │ ~10x more trees │ Good generalization (default)
ν = 0.01 │ Slow learning │ ~100x more trees │ Very robust, computationally expensive
The deep intuition: smaller takes smaller steps in function space. Smaller steps are less likely to overshoot a minimum and are more likely to find a better path through the function space landscape. It is the same reason SGD with a small learning rate often finds better minima than with a large learning rate.
The fundamental tradeoff: smaller requires more boosting rounds to achieve the same training loss, but produces better generalization. The relationship is approximately:
for a fixed dataset and fixed target performance. Halving the learning rate roughly doubles the optimal number of rounds.
:::tip The learning rate - n_estimators interaction
Never reduce learning rate without also increasing n_estimators. If you set learning_rate=0.01 and keep n_estimators=100, the model will massively underfit. Set n_estimators=1000 or use early stopping to find the right number of rounds automatically.
:::
Stochastic Gradient Boosting: Row and Column Subsampling
Friedman (1999) proposed stochastic gradient boosting: at each round, fit the tree on a random subsample of the training data (without replacement), instead of the full dataset.
# In our scratch implementation:
if self.subsample < 1.0:
n_sub = max(1, int(n_samples * self.subsample))
sample_idx = rng.choice(n_samples, size=n_sub, replace=False)
Benefits of row subsampling (subsample < 1.0):
- Variance reduction: different trees see different data subsets - similar to bagging.
- Faster training: each tree fits on fewer samples.
- Improved generalization: randomness acts as regularization.
Modern implementations also support column subsampling (colsample_bytree, colsample_bylevel): like Random Forest's feature randomization, randomly selecting features before building each tree or at each level. This further decorrelates trees and provides additional regularization.
Typical subsampling settings for production:
Parameter │ Typical range │ Effect
───────────────────|───────────────|──────────────────────────────────
subsample │ 0.7-0.9 │ Row-level subsampling per tree
colsample_bytree │ 0.7-0.9 │ Column subsampling per tree (XGBoost)
colsample_bylevel │ 0.7-0.9 │ Column subsampling per level (XGBoost)
min_child_weight │ 1-20 │ Min sum of instance weights in leaf
Sklearn GradientBoostingClassifier with Early Stopping
from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score
import numpy as np
X, y = make_classification(
n_samples=8000, n_features=20, n_informative=12,
n_redundant=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
# ── Option 1: sklearn GradientBoostingClassifier ─────────────────────────────
gbm_clf = GradientBoostingClassifier(
n_estimators=500, # large upper bound - early stopping will trim
learning_rate=0.05, # small learning rate
max_depth=4, # shallow trees (3-6 standard for boosting)
subsample=0.8, # stochastic boosting
min_samples_leaf=20, # avoid tiny leaves (regularization)
max_features="sqrt", # column subsampling
n_iter_no_change=20, # early stopping: stop if no improvement for 20 rounds
validation_fraction=0.1, # internal validation set (from training data)
tol=1e-4,
random_state=42,
)
gbm_clf.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, gbm_clf.predict_proba(X_te)[:, 1])
print(f"GBM Test AUC : {auc:.4f}")
print(f"Rounds used : {gbm_clf.n_estimators_} (of 500 max)")
print(f"Train deviance : {gbm_clf.train_score_[-1]:.4f}")
# ── Option 2: HistGradientBoostingClassifier (10-100x faster for large n) ────
hgb = HistGradientBoostingClassifier(
max_iter=500,
learning_rate=0.05,
max_depth=6,
early_stopping=True,
n_iter_no_change=20,
random_state=42,
)
hgb.fit(X_tr, y_tr)
hgb_auc = roc_auc_score(y_te, hgb.predict_proba(X_te)[:, 1])
print(f"HistGBM Test AUC : {hgb_auc:.4f}")
print(f"HistGBM rounds : {hgb.n_iter_}")
Manual Early Stopping with staged_predict_proba
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
import numpy as np
# Train without built-in early stopping
gbm = GradientBoostingClassifier(
n_estimators=300, learning_rate=0.05, max_depth=3, random_state=42
)
gbm.fit(X_tr, y_tr)
# Evaluate at each round on held-out test set
best_round, best_loss = 0, float("inf")
test_losses = []
for m, y_prob in enumerate(gbm.staged_predict_proba(X_te)):
loss = log_loss(y_te, y_prob)
test_losses.append(loss)
if loss < best_loss:
best_loss = loss
best_round = m + 1
print(f"Best round : {best_round}")
print(f"Best test log-loss: {best_loss:.4f}")
# Visualize overfitting
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(test_losses, color="#dc2626", linewidth=1.5, label="Test log-loss")
ax.plot(gbm.train_score_, color="#16a34a", linewidth=1.5, label="Train deviance")
ax.axvline(best_round, linestyle="--", color="gray", alpha=0.7,
label=f"Optimal round={best_round}")
ax.set_xlabel("Boosting round")
ax.set_ylabel("Loss")
ax.set_title("Gradient Boosting - Overfitting Signature")
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig("gbm_overfitting.png", dpi=150)
plt.show()
Why Boosting Has Lower Bias Than Bagging
This is the question from the production scenario. The precise answer:
Bagging (Random Forest): Each tree is a full-depth model with low individual bias. Averaging reduces variance. But the bias of the ensemble approximates the bias of a single tree - averaging identical systematic errors does not correct them.
Gradient boosting: Each round fits a tree to the error of the current ensemble. If the current model is consistently underpredicting certain regions (systematic bias in those regions), the next tree is explicitly trained to fix those regions. Boosting iteratively reduces bias by construction.
Bias-variance view:
Method │ Mechanism │ Bias effect │ Variance effect
───────────────|───────────────────|──────────────────|──────────────────
Random Forest │ Decorrelated avg │ Unchanged │ Reduced ~1/(B(1-ρ))
Gradient Boost │ Sequential corr. │ Reduced each m │ Can grow with m
RF wins when variance is the bottleneck (overfit single tree).
GBM wins when bias is the bottleneck (systematic errors remain).
The credit risk scenario: the Random Forest at AUC 0.84 had reached its bias floor - after extensive hyperparameter tuning, systematic errors in certain risk profiles remained uncorrected. Gradient boosting's sequential error correction reduced that bias by fitting new trees directly to the regions the ensemble was getting wrong.
Scratch vs Sklearn Comparison
from sklearn.ensemble import GradientBoostingRegressor
# Scratch implementation
gbm_scratch = GradientBoostingRegressorScratch(
n_estimators=200, learning_rate=0.1, max_depth=3, subsample=0.8, random_state=42
)
gbm_scratch.fit(X_tr, y_tr)
scratch_mse = mean_squared_error(y_te, gbm_scratch.predict(X_te))
# Sklearn implementation
gbm_sk = GradientBoostingRegressor(
n_estimators=200, learning_rate=0.1, max_depth=3, subsample=0.8, random_state=42
)
gbm_sk.fit(X_tr, y_tr)
sklearn_mse = mean_squared_error(y_te, gbm_sk.predict(X_te))
print(f"Scratch MSE : {scratch_mse:.4f}")
print(f"Sklearn MSE : {sklearn_mse:.4f}")
print(f"Gap : {abs(scratch_mse - sklearn_mse):.4f}")
# Scratch is slightly worse: sklearn uses optimal leaf weights (line search),
# while our scratch uses pseudo-residual averages in each leaf.
# For MSE loss the difference is small; for classification losses it's larger.
Key Hyperparameters Reference
| Parameter | Effect | Typical range |
|---|---|---|
n_estimators | Number of boosting rounds | 100-2000 (use early stopping) |
learning_rate | Shrinkage per round | 0.01-0.3 (lower = more rounds needed) |
max_depth | Tree depth per round | 3-6 (shallow trees are standard) |
subsample | Row sampling per round | 0.6-0.9 (stochastic boosting) |
min_samples_leaf | Minimum leaf size | 10-50 (key regularizer) |
max_features | Column sampling per split | "sqrt" or 0.5-0.9 |
n_iter_no_change | Early stopping patience | 10-50 rounds |
validation_fraction | Internal validation split | 0.1 (10%) |
:::note Shallow trees are standard in gradient boosting
Random Forest uses fully grown trees (default max_depth=None). Gradient boosting uses shallow trees, typically max_depth=3-6. Deep trees in gradient boosting overfit quickly because each tree is already correcting the ensemble's mistakes - a deep correction tree can easily memorize individual training points. Start with max_depth=3 and increase only if needed.
:::
:::warning Gradient boosting and noisy labels
Because boosting focuses correction on examples the ensemble gets wrong, it can over-focus on mislabeled or outlier examples. Random Forest is more robust to noisy labels. If your training data has significant label noise (>5%): (1) use robust loss functions (Huber for regression, or MAE-based losses), (2) set aggressive shrinkage (), (3) use strong min_samples_leaf regularization to avoid fitting individual mislabeled points.
:::
Production Engineering Notes
Training Speed
Gradient boosting trains trees sequentially. Training time is approximately - linear in . This is slower than Random Forest for equivalent tree counts, but the sequential nature means each tree is more informative (it targets specific errors), so far fewer rounds are needed.
For large datasets (), always use histogram-based implementations:
# sklearn HistGradientBoostingClassifier: 10-100x faster than GradientBoostingClassifier
# Uses the same algorithm as LightGBM internally
from sklearn.ensemble import HistGradientBoostingClassifier
hgb = HistGradientBoostingClassifier(
max_iter=1000,
learning_rate=0.05,
max_depth=6,
early_stopping=True,
n_iter_no_change=20,
random_state=42,
)
hgb.fit(X_tr, y_tr)
# For production: XGBoost or LightGBM are preferred over sklearn's implementation
# XGBoost: pip install xgboost
# LightGBM: pip install lightgbm
# Both have C++ backends, GPU support, and native distributed training
Deployment
import joblib
# Serialize with compression
joblib.dump(gbm_clf, "gbm_loan_default_v3.joblib", compress=3)
gbm_loaded = joblib.load("gbm_loan_default_v3.joblib")
# Verify round-trip
import numpy as np
assert np.allclose(gbm_clf.predict_proba(X_te[:5]),
gbm_loaded.predict_proba(X_te[:5]))
print("Model serialization verified")
For sub-millisecond inference, export to ONNX or use XGBoost/LightGBM's native serialization formats, which support C++ inference libraries with 10-100x speedup over Python.
When GBM Wins vs Random Forest
The specific conditions where GBM outperforms RF:
- The Random Forest's test performance has plateaued after thorough hyperparameter tuning
- The residual errors show systematic patterns (model is wrong in consistent ways)
- Training data is large enough that GBM's sequential correction converges reliably
- Labels are clean (GBM is sensitive to label noise)
- You have engineering capacity to tune 3-4 hyperparameters carefully
When to stay with Random Forest:
- Training budget is tight (RF trains 5-10x faster)
- Noisy labels (RF is more robust)
- Need OOB error without separate validation set
- Simple deployment requirements (RF is easier to explain to stakeholders)
YouTube Resources
| Video | Channel | Why Watch It |
|---|---|---|
| Gradient Boost Part 1 - Regression Main Ideas | StatQuest with Josh Starmer | Best visual explanation of pseudo-residuals and additive model |
| Gradient Boost Part 2 - Regression Details | StatQuest with Josh Starmer | Mathematical derivation of optimal leaf weights |
| Gradient Boost Part 3 - Classification | StatQuest with Josh Starmer | Log-loss pseudo-residuals and log-odds prediction |
| Gradient Boosting From Scratch Python | Andrej Karpathy-style tutorial | Building GBM in pure NumPy, step by step |
| XGBoost vs LightGBM vs CatBoost | Krish Naik | Production comparison of major GBM libraries |
Interview Questions and Answers
Q1: Explain gradient boosting as functional gradient descent. Why do we fit trees to pseudo-residuals?
Gradient boosting minimizes by iteratively fitting trees to the negative gradient of the loss in function space. The negative gradient at training point is the pseudo-residual: . By fitting a tree to these pseudo-residuals and adding it to the ensemble, we take a step in the direction of steepest descent in function space - similar to gradient descent in parameter space. The key constraint is that we cannot take an unconstrained step (that would just set , memorizing training data). Instead, we project the gradient direction onto the space of regression trees, which forces generalization. The learning rate controls step size, preventing overshoot.
Q2: What are pseudo-residuals for log-loss, and why are they identical in form to logistic regression gradients?
For binary log-loss where , the pseudo-residual is . The predicted probability acts as the "current estimate" and the pseudo-residual is the residual between the true label and this estimate. This is identical to logistic regression's gradient , because both derive from the same log-loss objective. The connection is not coincidental - gradient boosting with log-loss in function space is equivalent to gradient descent in logistic regression's parameter space, except that GBM approximates the gradient direction with a tree rather than with a linear function.
Q3: Explain shrinkage (learning rate) in gradient boosting. Why does a smaller learning rate improve generalization?
Shrinkage scales each tree's contribution: where . A smaller takes smaller steps in function space. This is beneficial because: (1) the function space landscape is not perfectly smooth - large steps can overshoot local optima; (2) smaller steps average over more diverse trees, reducing variance; (3) the approximation error from projecting onto tree space accumulates less with smaller steps. The cost is computational: roughly constant, so halving doubles the number of rounds needed. In practice, to with early stopping is standard. Never reduce without increasing n_estimators proportionally.
Q4: Why does gradient boosting have lower bias than Random Forest, but potentially higher variance?
Random Forest reduces variance by averaging decorrelated trees, but the ensemble's bias approximately equals a single tree's bias - averaging identical systematic errors does not correct them. Gradient boosting explicitly corrects systematic errors: each round fits a tree to what the current ensemble gets wrong. If the model consistently underpredicts high-risk loans, the next tree directly targets that underprediction. This sequential correction reduces bias below what any single tree or bagging ensemble can achieve. However, each additional boosting round also increases the model's capacity to fit training noise, so variance can grow with the number of rounds. This is why early stopping is essential for GBM but not for RF - you can never add too many trees to a Random Forest, but you can definitely overfit by adding too many boosting rounds.
Q5: Your gradient boosting model has AUC 0.94 on training and 0.81 on validation. What are your next steps?
The 0.13 AUC gap signals overfitting. Systematic approach: (1) Reduce learning rate and add early stopping: set learning_rate=0.02, n_estimators=2000, n_iter_no_change=30 - let the model find its own optimal round count. (2) Reduce tree depth: try max_depth=3 instead of deeper - shallow trees in boosting are a key regularizer. (3) Add subsampling: set subsample=0.8 and colsample_bytree=0.8 - stochastic boosting reduces overfitting. (4) Increase min_samples_leaf: try 20-50, preventing the model from fitting individual training examples. (5) Check data quality: examine pseudo-residuals for systematic outliers - a few mislabeled examples that the model repeatedly gets wrong can cause severe overfitting. (6) Reduce n_estimators after tuning other parameters via cross-validation.
Q6: How does stochastic gradient boosting compare to bagging, and why can combining both improve performance further?
Stochastic gradient boosting (Friedman, 1999) samples rows without replacement at each boosting round (controlled by subsample). This is similar in spirit to bagging but with key differences: (1) it samples without replacement (bagging samples with replacement), (2) each sample is used for one sequential correction round (not a separate full model), (3) the update step applies to all samples, not just the subsample. The benefit is variance reduction similar to bagging, plus computational savings. Modern GBM implementations combine both: subsample for row-level randomness (like bagging) and colsample_bytree for feature randomness (like Random Forest). Together they drive (correlation between trees) toward zero while still maintaining the sequential bias-correction property of boosting. This combination of bagging-style variance reduction with boosting-style bias reduction is part of why XGBoost/LightGBM consistently outperform both pure Random Forests and pure gradient boosting on tabular benchmarks.
XGBoost vs LightGBM vs sklearn GBM: When to Use What
from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier
import numpy as np
# ── sklearn GradientBoostingClassifier ────────────────────────────────────────
# Pure Python + C extensions. Slow on large datasets (n > 50K).
# Good for: small datasets, exact comparison, learning the algorithm.
# Key params: learning_rate, n_estimators, max_depth, subsample, min_samples_leaf
from sklearn.ensemble import GradientBoostingClassifier
gbm_sklearn = GradientBoostingClassifier(
n_estimators=300,
learning_rate=0.05,
max_depth=4,
subsample=0.8,
min_samples_leaf=20,
random_state=42,
)
# ── HistGradientBoostingClassifier (sklearn) ──────────────────────────────────
# Histogram-based like LightGBM. Much faster (100x on large datasets).
# Handles missing values natively. Categorical feature support.
# Good for: medium datasets (50K–1M), sklearn pipeline compatibility.
from sklearn.ensemble import HistGradientBoostingClassifier
hgbm = HistGradientBoostingClassifier(
max_iter=300,
learning_rate=0.05,
max_depth=4,
min_samples_leaf=20,
l2_regularization=0.1,
random_state=42,
categorical_features='from_dtype', # auto-detect category dtype columns
)
# ── XGBoost (if installed) ────────────────────────────────────────────────────
try:
import xgboost as xgb
xgb_model = xgb.XGBClassifier(
n_estimators=300,
learning_rate=0.05,
max_depth=4,
subsample=0.8,
colsample_bytree=0.8, # feature subsampling
reg_alpha=0.1, # L1 regularization on leaf weights
reg_lambda=1.0, # L2 regularization on leaf weights
use_label_encoder=False,
eval_metric='auc',
random_state=42,
n_jobs=-1,
)
except ImportError:
print("XGBoost not installed - use HistGradientBoostingClassifier as fallback")
# ── LightGBM (if installed) ───────────────────────────────────────────────────
try:
import lightgbm as lgb
lgbm_model = lgb.LGBMClassifier(
n_estimators=300,
learning_rate=0.05,
max_depth=4,
num_leaves=31, # LightGBM uses num_leaves, not max_depth (leaf-wise)
subsample=0.8,
colsample_bytree=0.8,
reg_alpha=0.1,
reg_lambda=1.0,
min_child_samples=20, # equivalent to min_samples_leaf
random_state=42,
n_jobs=-1,
)
except ImportError:
print("LightGBM not installed")
When to choose which:
| Dataset size | Choice | Reason |
|---|---|---|
| < 10K rows | sklearn GBM or HGBM | Either works; GBM for exact comparison |
| 10K–1M rows | HistGBM or LightGBM | Histogram-based - 10-100x faster |
| > 1M rows | LightGBM | GOSS + EFB: fastest at scale |
| Missing values | HGBM or LightGBM | Native NaN support |
| GPU available | XGBoost with device='cuda' | Best GPU implementation |
| sklearn Pipeline | HGBM | Full sklearn API compatibility |
Early Stopping and Monitoring Training
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
import matplotlib.pyplot as plt
def train_with_early_stopping(X_train, y_train, X_val, y_val,
max_iter: int = 1000,
learning_rate: float = 0.05,
patience: int = 30) -> tuple:
"""
Gradient boosting with early stopping using validation set.
Stops when AUC on validation set doesn't improve for `patience` rounds.
"""
model = HistGradientBoostingClassifier(
max_iter=max_iter,
learning_rate=learning_rate,
max_depth=4,
min_samples_leaf=20,
early_stopping=True,
validation_fraction=None, # we provide our own val set
n_iter_no_change=patience,
scoring='roc_auc',
random_state=42,
verbose=0,
)
# HistGBM with validation_fraction=None uses internal OOB
# For external validation, use a different approach:
model_external = HistGradientBoostingClassifier(
max_iter=max_iter,
learning_rate=learning_rate,
max_depth=4,
min_samples_leaf=20,
early_stopping=True,
validation_fraction=0.1, # hold out 10% of training data for early stopping
n_iter_no_change=patience,
scoring='roc_auc',
random_state=42,
)
model_external.fit(X_train, y_train)
n_iter_used = model_external.n_iter_
val_score = model_external.score(X_val, y_val)
print(f"Stopped at iteration: {n_iter_used}/{max_iter}")
print(f"Validation accuracy: {val_score:.4f}")
return model_external, n_iter_used
## Hyperparameter Interaction Guide
The two most important and interacting hyperparameters in gradient boosting are `learning_rate` and `n_estimators`. They are coupled: lower `learning_rate` requires proportionally more `n_estimators` to achieve similar performance. The standard approach is to fix `learning_rate=0.05` and tune `n_estimators` via early stopping:
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=5000, n_features=20,
n_informative=12, n_redundant=4, random_state=42)
# Learning rate vs n_estimators tradeoff
lr_n_configs = [
(0.5, 50, 'LR=0.5, n=50 (fast, coarse)'),
(0.1, 200, 'LR=0.1, n=200 (balanced, default)'),
(0.05, 400, 'LR=0.05, n=400 (fine-grained)'),
(0.01, 1000,'LR=0.01, n=1000 (slow, high accuracy)'),
]
cv = StratifiedKFold(5, shuffle=True, random_state=42)
results = []
for lr, n, label in lr_n_configs:
model = HistGradientBoostingClassifier(
learning_rate=lr, max_iter=n,
max_depth=4, min_samples_leaf=20, random_state=42
)
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc', n_jobs=-1)
results.append({
'label': label, 'lr': lr, 'n': n,
'mean_auc': scores.mean(), 'std_auc': scores.std()
})
print(f"{label}: AUC={scores.mean():.4f} ± {scores.std():.4f}")
# Other key hyperparameters and their effects:
params_guide = {
'max_depth': ('3-6', 'Depth of each tree. Shallow=less variance, more bias. Start: 4'),
'min_samples_leaf': ('10-50', 'Minimum leaf size. Larger=more regularization. Start: 20'),
'subsample': ('0.7-0.9', 'Row subsampling. Reduces overfitting. Start: 0.8'),
'colsample_bytree': ('0.7-0.9', 'Feature subsampling. Start: 0.8 (XGBoost/LightGBM)'),
'l2_regularization': ('0.0-10.0', 'L2 on leaf weights. Start: 1.0'),
}
print("\nHyperparameter reference:")
for param, (range_str, description) in params_guide.items():
print(f" {param:20s} range={range_str:10s} - {description}")
Pseudo-Residuals by Loss Function: Full Reference
| Loss function | Use case | Pseudo-residual |
|---|---|---|
| L2 (squared error) | Regression, robust to small errors | |
| L1 (absolute error) | Regression, robust to outliers | |
| Huber | Regression, combines L1+L2 | L2 near 0, L1 far from 0 |
| Log-loss (binary) | Binary classification | |
| Softmax (multi-class) | Multi-class classification | |
| Poisson deviance | Count data (e.g., claims) | |
| Gamma deviance | Right-skewed positive targets |
from sklearn.ensemble import HistGradientBoostingRegressor
# Loss function selection guide:
reg_losses = {
'squared_error': HistGradientBoostingRegressor(loss='squared_error'), # default, fast
'absolute_error': HistGradientBoostingRegressor(loss='absolute_error'), # robust to outliers
'poisson': HistGradientBoostingRegressor(loss='poisson'), # for count data (y >= 0)
'gamma': HistGradientBoostingRegressor(loss='gamma'), # for positive right-skewed data
}
# Use gamma loss for insurance claim amounts (right-skewed, always positive)
# Use poisson for insurance claim counts, web page views, event counts
# Use absolute_error when target has extreme outliers you don't want to fit perfectly
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Gradient Boosting & Residuals demo on the EngineersOfAI Playground - no code required.
:::
