Module 03: Tree Models and Ensembles
Why Tree Models Dominate Tabular ML
Open any Kaggle leaderboard for a structured-data competition. Scroll through the top ten solutions. You will find XGBoost, LightGBM, or CatBoost in nearly every one of them - often stacked on top of each other. A widely cited survey of Kaggle solutions found that tree-based ensembles appear in over 80% of winning entries on tabular datasets. This is not a coincidence, and it is not going away.
Tree models succeed where neural networks struggle. They require no feature scaling. They handle mixed data types - numeric, ordinal, and categorical - without preprocessing pipelines. They are robust to outliers because splits are based on rank order, not absolute values. They capture non-linear interactions and high-order feature relationships automatically, without manual feature engineering. And they train in minutes on datasets that would take hours to fine-tune a transformer on.
But there is a deeper reason tree models dominate: they compose beautifully. A single decision tree is a weak learner - high variance, prone to memorizing noise. Stack hundreds of them, each correcting the errors of the last, and you get gradient boosting: one of the most powerful learning algorithms ever invented. The theory behind why this works - bias-variance decomposition, ensemble diversity, the connection to functional gradient descent - is what this module is about.
Understanding tree models deeply is not optional for production ML engineers. When a fraud model at a payments company misses a new attack pattern, the answer is almost always in how the tree was built and where it overfit. When a feature matters in XGBoost but SHAP says it is irrelevant, you need to understand split-based vs. permutation-based importance. When LightGBM trains 10x faster than XGBoost on your dataset, the reason is leaf-wise growth and histogram binning - both covered here.
Module Map
Lesson Table
| # | Lesson | Core Concept | You Will Build |
|---|---|---|---|
| 01 | Decision Trees Internals | Recursive binary splitting, CART, impurity | Decision tree from scratch in NumPy |
| 02 | Information Gain, Gini, Entropy | Split evaluation metrics, ID3 vs CART | Impurity function comparisons |
| 03 | Pruning and Depth Control | Pre-pruning, post-pruning, cost-complexity | Pruning pipeline + validation curves |
| 04 | Random Forests | Bagging, feature subsampling, OOB error | Random forest from scratch |
| 05 | Gradient Boosting From Scratch | Functional gradient descent, residual fitting | GBM in 60 lines of NumPy |
| 06 | XGBoost Deep Dive | Second-order Taylor expansion, regularized objectives | XGBoost vs sklearn GBM benchmark |
| 07 | LightGBM and CatBoost | Histogram binning, leaf-wise growth, target encoding | Speed comparison across 1M rows |
| 08 | Feature Importance and SHAP | Split importance, permutation importance, Shapley values | SHAP waterfall plots for a production model |
| 09 | Stacking and Blending | Meta-learners, cross-val stacking, blending | Full stacking pipeline |
Key Concepts at a Glance
Recursive Binary Splitting
Every decision tree is built by the same fundamental algorithm: at each node, find the feature and threshold that best separates the data, split, and recurse. The result is a tree of axis-aligned binary decisions:
[amount > $500?]
/ \
[country == US?] [merchant_category == travel?]
/ \ / \
LEGIT FRAUD FRAUD LEGIT
The key question is what "best separates" means. For classification, this is measured by impurity - how mixed the classes are in a node. For regression, it is the reduction in variance. The two dominant impurity measures are entropy (used by ID3, C4.5) and Gini index (used by CART and sklearn).
Ensemble Methods: Bagging vs Boosting
Two fundamentally different strategies for turning weak trees into strong learners:
| Property | Bagging (Random Forest) | Boosting (GBM, XGBoost) |
|---|---|---|
| Trees trained | Independently, in parallel | Sequentially, each corrects the last |
| Each tree fits | A bootstrap sample of the data | The residuals/pseudo-residuals of the ensemble |
| Variance reduction | High - averaging decorrelated trees | Moderate - each tree is deliberately constrained |
| Bias reduction | Low - individual trees are deep | High - boosting reduces bias with each round |
| Overfit risk | Low - hard to overfit with enough trees | High - needs early stopping, regularization |
| Training speed | Fast (parallelizable) | Slower (sequential) |
| Best use case | Noisy data, high-variance settings | Low-noise data, competition-grade accuracy |
The Bias-Variance Connection
The reason bagging and boosting work so differently comes down to bias-variance decomposition. For any model's expected squared error on a new point :
A single deep decision tree has low bias (it can fit complex patterns) but high variance (it memorizes noise and changes drastically with different training samples). Bagging directly attacks variance by averaging many high-variance trees trained on different bootstrap samples - if the trees are decorrelated, averaging trees reduces variance by a factor of . Boosting attacks bias by fitting a sequence of shallow trees, each correcting the previous ensemble's errors.
Feature Importance and SHAP
Tree models produce two kinds of feature importance:
Split-based importance (default in sklearn, XGBoost): count how many times a feature was used to split, weighted by impurity reduction. Fast to compute, but biased toward high-cardinality features.
SHAP (SHapley Additive exPlanations): for each prediction, compute each feature's exact contribution using cooperative game theory. Consistent, model-agnostic, and interpretable. The gold standard for production explainability.
Where is the Shapley value for feature , is the set of all features, and is the model's prediction using only the feature subset .
Historical Context
The tree model family has a 40-year history. Understanding it reveals why modern methods are designed the way they are.
| Year | Algorithm | Key Innovation | Limitation |
|---|---|---|---|
| 1986 | ID3 (Quinlan) | Information gain for categorical splits | No regression, no pruning, no missing values |
| 1993 | C4.5 (Quinlan) | Gain ratio, continuous features, pruning | Memory-intensive, slower than ID3 |
| 1984 | CART (Breiman et al.) | Gini index, regression trees, cost-complexity pruning | Single tree = high variance |
| 1996 | Bagging (Breiman) | Bootstrap aggregation, variance reduction | Trees are still correlated |
| 2001 | Random Forests (Breiman) | Random feature subsets at each split → decorrelation | Slower inference, less interpretable |
| 2001 | Gradient Boosting (Friedman) | Functional gradient descent, generalized loss functions | Slow training, sensitive to hyperparameters |
| 2016 | XGBoost (Chen & Guestrin) | Second-order gradients, regularized objective, parallelism | Memory-heavy for very large datasets |
| 2017 | LightGBM (Microsoft) | Histogram binning, leaf-wise growth, GOSS/EFB | Can overfit small datasets |
| 2017 | CatBoost (Yandex) | Ordered boosting, native categorical encoding | Slower training than LightGBM |
| 2017 | SHAP (Lundberg & Lee) | Exact Shapley values for trees in polynomial time | Requires shapley computation infrastructure |
:::note CART is the foundation Most modern tree libraries - sklearn, XGBoost, LightGBM - implement variants of CART, not ID3 or C4.5. CART uses Gini index for classification and MSE reduction for regression, supports continuous splits on both features and targets, and uses cost-complexity pruning. When someone says "decision tree" in a production context, they almost always mean CART. :::
Prerequisites
Before starting this module, you should be comfortable with:
- Module 01 - ML Foundations: training vs. test error, bias-variance tradeoff, cross-validation, hyperparameter tuning
- Module 02 - Linear Models: gradient descent, loss functions, regularization (the intuition transfers directly)
- Python: NumPy arrays and indexing, basic pandas, sklearn fit/predict pattern
- Math: logarithms (for entropy), basic probability (class frequencies as probabilities)
You do not need calculus for the first three lessons (trees are built with greedy search, not gradients). Gradient boosting (Lesson 05) reintroduces calculus - Taylor expansion and partial derivatives.
Learning Objectives
By the end of this module, you will be able to:
- Implement a decision tree from scratch using recursive binary splitting and Gini impurity
- Explain exactly why a tree splits where it does - and debug a tree that is memorizing training data
- Build a random forest from first principles and understand why OOB error is an unbiased estimate
- Derive gradient boosting as functional gradient descent and implement it in 60 lines of NumPy
- Explain the second-order Taylor approximation that makes XGBoost's objective function analytically solvable
- Benchmark LightGBM vs XGBoost on a large dataset and explain the performance difference
- Compute and interpret SHAP values for both global and local explanations
- Build a stacking ensemble with proper cross-validation to prevent leakage
Production Context
Tree models appear in production in ways that define entire industry verticals:
Fraud detection: Nearly every payments company (Stripe, PayPal, Adyen) uses gradient boosted trees as the core of their fraud scoring system. Trees naturally handle the imbalanced, high-dimensional, mix of transaction features. SHAP values allow compliance teams to explain why a transaction was flagged.
Ad ranking: Real-time bidding systems use LightGBM because it can score millions of candidate ads in milliseconds. The histogram-based inference is cache-friendly and branch-predictable - crucial at nanosecond latency budgets.
Credit scoring: Logistic regression is required by regulation in some markets, but tree models are used to discover the features and interactions that then get engineered into the linear model. CatBoost's native categorical handling is particularly valuable for credit bureau features.
Clinical decision support: Random forests are used in clinical risk scoring (e.g., predicting ICU readmission) because their OOB error provides an honest estimate of generalization without a separate test set - important when data is scarce.
Feature discovery: Even when the final model is a neural network, XGBoost is often trained first. Its SHAP values reveal which raw features matter, guiding neural architecture design and embedding choices.
This module will make you fluent in the full tree model family - from the first split in a CART tree to SHAP explanations in a 1000-tree XGBoost ensemble.
