Module 03: Tree Models and Ensembles

Why Tree Models Dominate Tabular ML

Open any Kaggle leaderboard for a structured-data competition. Scroll through the top ten solutions. You will find XGBoost, LightGBM, or CatBoost in nearly every one of them - often stacked on top of each other. A widely cited survey of Kaggle solutions found that tree-based ensembles appear in over 80% of winning entries on tabular datasets. This is not a coincidence, and it is not going away.

Tree models succeed where neural networks struggle. They require no feature scaling. They handle mixed data types - numeric, ordinal, and categorical - without preprocessing pipelines. They are robust to outliers because splits are based on rank order, not absolute values. They capture non-linear interactions and high-order feature relationships automatically, without manual feature engineering. And they train in minutes on datasets that would take hours to fine-tune a transformer on.

But there is a deeper reason tree models dominate: they compose beautifully. A single decision tree is a weak learner - high variance, prone to memorizing noise. Stack hundreds of them, each correcting the errors of the last, and you get gradient boosting: one of the most powerful learning algorithms ever invented. The theory behind why this works - bias-variance decomposition, ensemble diversity, the connection to functional gradient descent - is what this module is about.

Understanding tree models deeply is not optional for production ML engineers. When a fraud model at a payments company misses a new attack pattern, the answer is almost always in how the tree was built and where it overfit. When a feature matters in XGBoost but SHAP says it is irrelevant, you need to understand split-based vs. permutation-based importance. When LightGBM trains 10x faster than XGBoost on your dataset, the reason is leaf-wise growth and histogram binning - both covered here.

Module Map

Lesson Table

#	Lesson	Core Concept	You Will Build
01	Decision Trees Internals	Recursive binary splitting, CART, impurity	Decision tree from scratch in NumPy
02	Information Gain, Gini, Entropy	Split evaluation metrics, ID3 vs CART	Impurity function comparisons
03	Pruning and Depth Control	Pre-pruning, post-pruning, cost-complexity	Pruning pipeline + validation curves
04	Random Forests	Bagging, feature subsampling, OOB error	Random forest from scratch
05	Gradient Boosting From Scratch	Functional gradient descent, residual fitting	GBM in 60 lines of NumPy
06	XGBoost Deep Dive	Second-order Taylor expansion, regularized objectives	XGBoost vs sklearn GBM benchmark
07	LightGBM and CatBoost	Histogram binning, leaf-wise growth, target encoding	Speed comparison across 1M rows
08	Feature Importance and SHAP	Split importance, permutation importance, Shapley values	SHAP waterfall plots for a production model
09	Stacking and Blending	Meta-learners, cross-val stacking, blending	Full stacking pipeline

Key Concepts at a Glance

Recursive Binary Splitting

Every decision tree is built by the same fundamental algorithm: at each node, find the feature and threshold that best separates the data, split, and recurse. The result is a tree of axis-aligned binary decisions:

                    [amount > $500?]
                   /               \
            [country == US?]    [merchant_category == travel?]
           /        \                /               \
       LEGIT      FRAUD          FRAUD             LEGIT

The key question is what "best separates" means. For classification, this is measured by impurity - how mixed the classes are in a node. For regression, it is the reduction in variance. The two dominant impurity measures are entropy (used by ID3, C4.5) and Gini index (used by CART and sklearn).

Ensemble Methods: Bagging vs Boosting

Two fundamentally different strategies for turning weak trees into strong learners:

Property	Bagging (Random Forest)	Boosting (GBM, XGBoost)
Trees trained	Independently, in parallel	Sequentially, each corrects the last
Each tree fits	A bootstrap sample of the data	The residuals/pseudo-residuals of the ensemble
Variance reduction	High - averaging decorrelated trees	Moderate - each tree is deliberately constrained
Bias reduction	Low - individual trees are deep	High - boosting reduces bias with each round
Overfit risk	Low - hard to overfit with enough trees	High - needs early stopping, regularization
Training speed	Fast (parallelizable)	Slower (sequential)
Best use case	Noisy data, high-variance settings	Low-noise data, competition-grade accuracy

The Bias-Variance Connection

The reason bagging and boosting work so differently comes down to bias-variance decomposition. For any model's expected squared error on a new point $x$ :

$\mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{\text{Bias}^2[\hat{f}(x)]}_{\text{systematic error}} + \underbrace{\text{Var}[\hat{f}(x)]}_{\text{sensitivity to training data}} + \underbrace{\sigma^2}_{\text{irreducible noise}}$

A single deep decision tree has low bias (it can fit complex patterns) but high variance (it memorizes noise and changes drastically with different training samples). Bagging directly attacks variance by averaging many high-variance trees trained on different bootstrap samples - if the trees are decorrelated, averaging $B$ trees reduces variance by a factor of $B$ . Boosting attacks bias by fitting a sequence of shallow trees, each correcting the previous ensemble's errors.

Feature Importance and SHAP

Tree models produce two kinds of feature importance:

Split-based importance (default in sklearn, XGBoost): count how many times a feature was used to split, weighted by impurity reduction. Fast to compute, but biased toward high-cardinality features.

SHAP (SHapley Additive exPlanations): for each prediction, compute each feature's exact contribution using cooperative game theory. Consistent, model-agnostic, and interpretable. The gold standard for production explainability.

$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!(|F|-|S|-1)!}{|F|!} [f(S \cup \{i\}) - f(S)]$

Where $\phi_i$ is the Shapley value for feature $i$ , $F$ is the set of all features, and $f(S)$ is the model's prediction using only the feature subset $S$ .

Historical Context

The tree model family has a 40-year history. Understanding it reveals why modern methods are designed the way they are.

Year	Algorithm	Key Innovation	Limitation
1986	ID3 (Quinlan)	Information gain for categorical splits	No regression, no pruning, no missing values
1993	C4.5 (Quinlan)	Gain ratio, continuous features, pruning	Memory-intensive, slower than ID3
1984	CART (Breiman et al.)	Gini index, regression trees, cost-complexity pruning	Single tree = high variance
1996	Bagging (Breiman)	Bootstrap aggregation, variance reduction	Trees are still correlated
2001	Random Forests (Breiman)	Random feature subsets at each split → decorrelation	Slower inference, less interpretable
2001	Gradient Boosting (Friedman)	Functional gradient descent, generalized loss functions	Slow training, sensitive to hyperparameters
2016	XGBoost (Chen & Guestrin)	Second-order gradients, regularized objective, parallelism	Memory-heavy for very large datasets
2017	LightGBM (Microsoft)	Histogram binning, leaf-wise growth, GOSS/EFB	Can overfit small datasets
2017	CatBoost (Yandex)	Ordered boosting, native categorical encoding	Slower training than LightGBM
2017	SHAP (Lundberg & Lee)	Exact Shapley values for trees in polynomial time	Requires shapley computation infrastructure

:::note CART is the foundation Most modern tree libraries - sklearn, XGBoost, LightGBM - implement variants of CART, not ID3 or C4.5. CART uses Gini index for classification and MSE reduction for regression, supports continuous splits on both features and targets, and uses cost-complexity pruning. When someone says "decision tree" in a production context, they almost always mean CART. :::

Prerequisites

Before starting this module, you should be comfortable with:

Module 01 - ML Foundations: training vs. test error, bias-variance tradeoff, cross-validation, hyperparameter tuning
Module 02 - Linear Models: gradient descent, loss functions, regularization (the intuition transfers directly)
Python: NumPy arrays and indexing, basic pandas, sklearn fit/predict pattern
Math: logarithms (for entropy), basic probability (class frequencies as probabilities)

You do not need calculus for the first three lessons (trees are built with greedy search, not gradients). Gradient boosting (Lesson 05) reintroduces calculus - Taylor expansion and partial derivatives.

Learning Objectives

By the end of this module, you will be able to:

Implement a decision tree from scratch using recursive binary splitting and Gini impurity
Explain exactly why a tree splits where it does - and debug a tree that is memorizing training data
Build a random forest from first principles and understand why OOB error is an unbiased estimate
Derive gradient boosting as functional gradient descent and implement it in 60 lines of NumPy
Explain the second-order Taylor approximation that makes XGBoost's objective function analytically solvable
Benchmark LightGBM vs XGBoost on a large dataset and explain the performance difference
Compute and interpret SHAP values for both global and local explanations
Build a stacking ensemble with proper cross-validation to prevent leakage

Production Context

Tree models appear in production in ways that define entire industry verticals:

Fraud detection: Nearly every payments company (Stripe, PayPal, Adyen) uses gradient boosted trees as the core of their fraud scoring system. Trees naturally handle the imbalanced, high-dimensional, mix of transaction features. SHAP values allow compliance teams to explain why a transaction was flagged.

Ad ranking: Real-time bidding systems use LightGBM because it can score millions of candidate ads in milliseconds. The histogram-based inference is cache-friendly and branch-predictable - crucial at nanosecond latency budgets.

Credit scoring: Logistic regression is required by regulation in some markets, but tree models are used to discover the features and interactions that then get engineered into the linear model. CatBoost's native categorical handling is particularly valuable for credit bureau features.

Clinical decision support: Random forests are used in clinical risk scoring (e.g., predicting ICU readmission) because their OOB error provides an honest estimate of generalization without a separate test set - important when data is scarce.

Feature discovery: Even when the final model is a neural network, XGBoost is often trained first. Its SHAP values reveal which raw features matter, guiding neural architecture design and embedding choices.

This module will make you fluent in the full tree model family - from the first split in a CART tree to SHAP explanations in a 1000-tree XGBoost ensemble.

Why Tree Models Dominate Tabular ML​

Module Map​

Lesson Table​

Key Concepts at a Glance​

Recursive Binary Splitting​

Ensemble Methods: Bagging vs Boosting​

The Bias-Variance Connection​

Feature Importance and SHAP​

Historical Context​

Prerequisites​

Learning Objectives​

Production Context​