Decision Trees Internals

The Production Scenario

It is 9 AM on a Monday and the on-call alert fires: fraud losses at the payments company have spiked 40% overnight. The fraud team pulls up the decision tree model that has been running in production for three months. On paper it looks excellent - 99.1% accuracy on the holdout set, precision above 0.94. But overnight a new attack pattern emerged: a coordinated wave of low-value transactions from newly created accounts at a specific merchant category, structured to stay below every threshold the tree learned.

The first thing the senior ML engineer does is export the tree and look at its depth: 20 levels. She counts the number of leaf nodes: 847. She checks the training set size: 180,000 transactions. A tree with 847 leaves and 180,000 samples means many leaves have only a handful of training examples - the tree has memorized the specific fraud patterns it saw in training down to individual merchant IDs and exact dollar amounts. The new fraud pattern is structurally different, so the tree routes it to "legitimate" leaves it built around similar-looking genuine transactions.

This is textbook overfitting, and it is the most common failure mode of decision trees in production. The fix - constraining depth, requiring minimum samples per leaf, or pruning - requires understanding exactly how the tree was built in the first place. You cannot debug a tool you do not understand from the inside.

This lesson walks through the complete internals of decision tree construction: how splits are chosen, why impurity metrics work the way they do, how the recursion terminates, and what each hyperparameter actually controls. By the end you will have built a working tree from scratch in NumPy and will know exactly where that fraud model went wrong.

Recursive Binary Splitting: The Core Algorithm

A decision tree partitions the feature space into axis-aligned rectangular regions. Each region gets a single prediction value - the majority class for classification, the mean for regression. The partition is found greedily: at each node, find the single best split, recurse on each half, and stop when a termination condition is met.

The Algorithm in Pseudocode

function BuildTree(X, y, depth=0):
    # Termination conditions
    if depth >= max_depth:
        return LeafNode(prediction=majority_class(y))
    if len(y) < min_samples_split:
        return LeafNode(prediction=majority_class(y))
    if all samples in y are the same class:
        return LeafNode(prediction=y[0])

    best_feature, best_threshold = FindBestSplit(X, y)

    if no split improves impurity:
        return LeafNode(prediction=majority_class(y))

    left_mask  = X[:, best_feature] <= best_threshold
    right_mask = ~left_mask

    left_child  = BuildTree(X[left_mask],  y[left_mask],  depth + 1)
    right_child = BuildTree(X[right_mask], y[right_mask], depth + 1)

    return InternalNode(feature=best_feature,
                        threshold=best_threshold,
                        left=left_child,
                        right=right_child)


function FindBestSplit(X, y):
    best_gain = 0
    best_feature = None
    best_threshold = None

    for each feature j in [0, d):
        for each candidate threshold t in sorted(unique(X[:, j])):
            left_y  = y[X[:, j] <= t]
            right_y = y[X[:, j] >  t]

            gain = ImpurityReduction(y, left_y, right_y)

            if gain > best_gain:
                best_gain = gain
                best_feature = j
                best_threshold = t

    return best_feature, best_threshold

The outer loop over features and thresholds is the computational bottleneck. For a dataset with $n$ samples and $d$ features, FindBestSplit evaluates up to $O(nd)$ candidate splits, each requiring $O(n)$ work to compute the impurity - giving $O(n^2 d)$ per node naively. Sorting the features first reduces this to $O(nd \log n)$ , which is why sklearn always sorts before scanning.

How a Split Is Evaluated: Impurity Reduction

The central question is: how do we measure whether a split is "good"? The answer is impurity reduction - also called information gain.

Impurity Before and After

Given a node containing samples $y$ with class distribution $p_1, p_2, \ldots, p_K$ (where $p_k$ is the fraction of samples in class $k$ ), the impurity of the node is a function $H(p_1, \ldots, p_K)$ .

After splitting on feature $j$ at threshold $t$ , we have:

Left child: $n_L$ samples with class distribution $(p_1^L, \ldots, p_K^L)$
Right child: $n_R$ samples with class distribution $(p_1^R, \ldots, p_K^R)$

The weighted impurity after the split is:

$H_{\text{after}} = \frac{n_L}{n} H(p_1^L, \ldots, p_K^L) + \frac{n_R}{n} H(p_1^R, \ldots, p_K^R)$

The impurity reduction (information gain) is:

$\Delta H = H_{\text{before}} - H_{\text{after}} = H(p_1, \ldots, p_K) - \left[\frac{n_L}{n} H^L + \frac{n_R}{n} H^R\right]$

We choose the split that maximizes $\Delta H$ . A split is only made if $\Delta H > 0$ , i.e., it must actually reduce impurity.

The Two Dominant Impurity Measures

Gini Index (used by CART, sklearn):

$\text{Gini}(p_1, \ldots, p_K) = 1 - \sum_{k=1}^{K} p_k^2 = \sum_{k=1}^{K} p_k(1 - p_k)$

Gini is the probability that a randomly chosen sample from the node would be misclassified if its label were assigned at random according to the class distribution. A pure node (one class only) has Gini = 0. A maximally impure binary node ( $p_1 = p_2 = 0.5$ ) has Gini = 0.5.

Entropy (used by ID3, C4.5):

$H(p_1, \ldots, p_K) = -\sum_{k=1}^{K} p_k \log_2 p_k$

Entropy is the information-theoretic measure of uncertainty. A pure node has entropy = 0. A maximally impure binary node has entropy = 1 bit. The information gain from a split is $IG = H_{\text{before}} - H_{\text{after}}$ .

For regression (CART only), impurity is variance:

$\text{Var}(y) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2$

The split maximizes variance reduction - the weighted sum of child variances subtracted from parent variance. The leaf prediction is the mean of $y$ values in that leaf.

Comparing Gini vs. Entropy in Practice

Property	Gini Index	Entropy
Formula	$1 - \sum p_k^2$	$-\sum p_k \log_2 p_k$
Range (binary)	[0, 0.5]	[0, 1.0]
Computational cost	Cheaper (no log)	Slightly more expensive
Behavior	Slightly prefers larger partitions	More sensitive to class probability changes
Used by	CART, sklearn, XGBoost, LightGBM	ID3, C4.5, old textbooks
Practical difference	Almost identical in practice	Almost identical in practice

:::note Gini vs Entropy: does it matter? In practice, the choice between Gini and entropy rarely affects the final tree structure significantly. Both metrics produce similar splits because they are both concave functions of the class probabilities that reach their minimum at purity. The more impactful decisions are max_depth, min_samples_leaf, and whether you use pruning. :::

The CART Algorithm

CART (Classification and Regression Trees), introduced by Breiman, Friedman, Olshen, and Stone in 1984, is the specific algorithm that underlies virtually all modern tree implementations.

CART makes four key design choices:

Binary splits only: Every split divides a node into exactly two children. This is less memory-efficient than multi-way splits (C4.5 can split into many children at once), but it produces a simpler, more balanced tree structure.
Gini for classification, MSE reduction for regression: Both are computed in $O(n)$ after sorting.
Cost-complexity pruning: After growing the full tree, CART prunes it back using a regularization parameter $\alpha$ that penalizes tree complexity. This is more principled than pre-pruning with arbitrary depth limits.
Handles both categorical and continuous features: For continuous features, thresholds are chosen from the midpoints between sorted unique values. For categorical features, CART finds the best binary partition of the category set (for $k$ categories, there are $2^{k-1} - 1$ possible binary partitions).

The CART split criterion for a node $t$ with $n_t$ samples is:

$\Delta I(t) = I(t) - \frac{n_{t_L}}{n_t} I(t_L) - \frac{n_{t_R}}{n_t} I(t_R)$

where $I$ is the impurity function (Gini or variance), and $t_L$ , $t_R$ are the left and right children.

Decision Boundary Visualization

Decision trees produce axis-aligned decision boundaries. For a 2D feature space, this looks like a recursive rectangular partition:

Feature x2
  ^
1 |   FRAUD  |              LEGIT
  |          |
  |          | split: x1 <= 250
  |          |
  |----------|------------------------------
  |          |  LEGIT      |    FRAUD
  |  LEGIT   |             |
  |          | split: x2 <= 0.4
  |          |             |
  +----------+-------------+---------------> Feature x1
  0        250            250
             ^
             split: x1 <= 250 (root)

Each rectangular region in the feature space corresponds to exactly one leaf in the tree. The model's prediction for any new point is determined by which region it falls into.

:::warning Axis-aligned splits and rotated structure Decision trees cannot express diagonal or curved decision boundaries efficiently. A boundary like $x_1 + x_2 > 1$ (a diagonal line) requires many axis-aligned splits to approximate. This is why trees sometimes need many more leaves than a linear model would need coefficients for the same problem. Random forests partially mitigate this by averaging many trees with different axis-aligned boundaries. :::

Python Implementation From Scratch

The Node and Tree Classes

import numpy as np
from collections import Counter


class Node:
    """Represents a single node in a decision tree."""

    def __init__(
        self,
        feature=None,
        threshold=None,
        left=None,
        right=None,
        *,
        value=None,
    ):
        self.feature = feature        # Feature index used to split
        self.threshold = threshold    # Threshold value for the split
        self.left = left              # Left child (samples <= threshold)
        self.right = right            # Right child (samples > threshold)
        self.value = value            # Leaf prediction (None for internal nodes)

    def is_leaf(self):
        return self.value is not None


class DecisionTreeClassifier:
    """
    CART decision tree for classification.
    Splits on Gini impurity reduction.
    Supports max_depth, min_samples_split, min_samples_leaf.
    """

    def __init__(
        self,
        max_depth=None,
        min_samples_split=2,
        min_samples_leaf=1,
    ):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.root = None
        self.n_classes_ = None
        self.n_features_ = None

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.n_features_ = X.shape[1]
        self.n_classes_ = len(np.unique(y))
        self.root = self._build(X, y, depth=0)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return np.array([self._traverse(x, self.root) for x in X])

    # ------------------------------------------------------------------ #
    #  Internal methods                                                     #
    # ------------------------------------------------------------------ #

    def _build(self, X, y, depth):
        n_samples = len(y)

        # Termination conditions
        all_same = len(np.unique(y)) == 1
        too_small = n_samples < self.min_samples_split
        too_deep = self.max_depth is not None and depth >= self.max_depth

        if all_same or too_small or too_deep:
            return Node(value=self._leaf_value(y))

        feature, threshold, gain = self._best_split(X, y)

        if gain == 0:
            return Node(value=self._leaf_value(y))

        left_mask = X[:, feature] <= threshold
        right_mask = ~left_mask

        # Enforce min_samples_leaf
        if left_mask.sum() < self.min_samples_leaf or right_mask.sum() < self.min_samples_leaf:
            return Node(value=self._leaf_value(y))

        left = self._build(X[left_mask], y[left_mask], depth + 1)
        right = self._build(X[right_mask], y[right_mask], depth + 1)
        return Node(feature=feature, threshold=threshold, left=left, right=right)

    def _best_split(self, X, y):
        best_gain = 0.0
        best_feature = None
        best_threshold = None
        parent_impurity = self._gini(y)

        for feature_idx in range(self.n_features_):
            col = X[:, feature_idx]
            sorted_vals = np.unique(col)

            # Candidate thresholds: midpoints between consecutive unique values
            thresholds = (sorted_vals[:-1] + sorted_vals[1:]) / 2.0

            for threshold in thresholds:
                left_y = y[col <= threshold]
                right_y = y[col > threshold]

                if len(left_y) == 0 or len(right_y) == 0:
                    continue

                n, n_l, n_r = len(y), len(left_y), len(right_y)
                weighted_impurity = (
                    (n_l / n) * self._gini(left_y)
                    + (n_r / n) * self._gini(right_y)
                )
                gain = parent_impurity - weighted_impurity

                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature_idx
                    best_threshold = threshold

        return best_feature, best_threshold, best_gain

    @staticmethod
    def _gini(y):
        """Gini impurity of a label array."""
        n = len(y)
        if n == 0:
            return 0.0
        counts = np.bincount(y)
        probs = counts / n
        return 1.0 - np.sum(probs ** 2)

    @staticmethod
    def _leaf_value(y):
        """Majority class."""
        return int(Counter(y).most_common(1)[0][0])

    def _traverse(self, x, node):
        if node.is_leaf():
            return node.value
        if x[node.feature] <= node.threshold:
            return self._traverse(x, node.left)
        return self._traverse(x, node.right)

    def get_depth(self):
        """Return the maximum depth of the fitted tree."""
        def _depth(node):
            if node is None or node.is_leaf():
                return 0
            return 1 + max(_depth(node.left), _depth(node.right))
        return _depth(self.root)

    def count_leaves(self):
        """Return the number of leaf nodes."""
        def _count(node):
            if node is None or node.is_leaf():
                return 1
            return _count(node.left) + _count(node.right)
        return _count(self.root)

Training and Evaluating

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Generate synthetic data
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Unconstrained tree (will overfit)
tree_deep = DecisionTreeClassifier()
tree_deep.fit(X_train, y_train)
print(f"Unconstrained tree depth:  {tree_deep.get_depth()}")
print(f"Unconstrained tree leaves: {tree_deep.count_leaves()}")
print(f"Train accuracy: {accuracy_score(y_train, tree_deep.predict(X_train)):.4f}")
print(f"Test  accuracy: {accuracy_score(y_test,  tree_deep.predict(X_test)):.4f}")

print()

# Constrained tree
tree_pruned = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=20,
    min_samples_leaf=10,
)
tree_pruned.fit(X_train, y_train)
print(f"Constrained tree depth:  {tree_pruned.get_depth()}")
print(f"Constrained tree leaves: {tree_pruned.count_leaves()}")
print(f"Train accuracy: {accuracy_score(y_train, tree_pruned.predict(X_train)):.4f}")
print(f"Test  accuracy: {accuracy_score(y_test,  tree_pruned.predict(X_test)):.4f}")

Expected Output

Unconstrained tree depth:  17
Unconstrained tree leaves: 312
Train accuracy: 1.0000
Test  accuracy: 0.8350

Constrained tree depth:  5
Constrained tree leaves: 23
Train accuracy: 0.8956
Test  accuracy: 0.8700

The unconstrained tree memorizes training data perfectly (1.0 train accuracy) but generalizes worse. The constrained tree gives up some training accuracy but achieves better generalization - exactly the fraud model failure from the opening scenario.

Comparing to sklearn

from sklearn.tree import DecisionTreeClassifier as SklearnTree

sk_tree = SklearnTree(
    max_depth=5,
    min_samples_split=20,
    min_samples_leaf=10,
    criterion="gini",
    random_state=42,
)
sk_tree.fit(X_train, y_train)
print(f"sklearn test accuracy: {accuracy_score(y_test, sk_tree.predict(X_test)):.4f}")

# Verify our implementation matches
our_preds = tree_pruned.predict(X_test)
sk_preds = sk_tree.predict(X_test)
agreement = np.mean(our_preds == sk_preds)
print(f"Agreement with sklearn: {agreement:.4f}")
# Typically > 0.97 - minor differences due to tie-breaking at equal-gain splits

:::tip Why sklearn is faster Our NumPy implementation evaluates candidate thresholds in a Python loop - $O(nd)$ threshold evaluations per node, each in interpreted Python. sklearn's tree builder uses Cython with presorted feature arrays, reducing the inner loop to vectorized C operations. For 10,000 samples sklearn is typically 50–200x faster, which is why you always use sklearn (or XGBoost/LightGBM) in production. :::

Hyperparameters: What They Actually Control

`max_depth`

Controls the maximum number of levels in the tree from root to any leaf. This is the single most impactful regularization parameter.

max_depth=1:  A "decision stump" - one split, two leaves.
              High bias, low variance. Useful in AdaBoost.

max_depth=5:  ~32 leaves maximum. Good for most production use cases.
              The fraud tree above needed max_depth ≤ 6.

max_depth=None: Grow until all leaves are pure or too small.
               Almost always overfits unless min_samples_leaf is large.

How to tune: Validation curve over [2, 3, 4, 5, 6, 8, 10, 15, None]. The optimal depth is where test accuracy plateaus - adding more depth gives diminishing returns or hurts.

`min_samples_split`

The minimum number of samples required to split an internal node. A node with fewer than min_samples_split samples becomes a leaf regardless of its impurity.

# Effect: prevents splitting on very small groups
# min_samples_split=2  -> any node with 2+ samples can be split (default, risky)
# min_samples_split=20 -> a node needs 20 samples to attempt a split
# min_samples_split=50 -> aggressive regularization, suitable for noisy data

:::warning min_samples_split vs min_samples_leaf These are often confused. min_samples_split controls whether a node attempts a split. min_samples_leaf controls whether a proposed split is accepted - it requires both resulting children to have at least min_samples_leaf samples. For regularization, min_samples_leaf is often more effective because it directly controls leaf size. :::

`min_samples_leaf`

The minimum number of samples that must be in a leaf after any split. If a candidate split would produce a leaf with fewer than min_samples_leaf samples, that split is not made.

# Practical guidance (for classification):
# min_samples_leaf=1   -> allow single-sample leaves (memorization)
# min_samples_leaf=5   -> minimum meaningful group size
# min_samples_leaf=20  -> production-safe for most datasets
# min_samples_leaf=50  -> high-regularization for noisy/imbalanced data

# For class imbalance: set proportionally to minority class size
minority_size = y_train.sum()
min_leaf = max(5, minority_size // 100)  # at least 1% of minority class per leaf

Hyperparameter Interaction

max_depth	min_samples_leaf	Effect
None	1	Full overfit - memorizes training data
None	50	Moderate tree, bounded by leaf size
5	1	Bounded by depth, small leaves allowed
5	20	Double constraint - usually the sweet spot
2	1	Decision stump - high bias

How Trees Handle Special Cases

Categorical Features

CART handles categorical features by finding the best binary partition of the category set. For a feature with $k$ distinct categories, there are $2^{k-1} - 1$ possible binary partitions. For small $k$ (≤ 10), CART enumerates them. For large $k$ , sklearn converts categories to integers and treats them as ordinal - which is wrong for unordered categories.

# sklearn DOES NOT natively handle categorical features
# You must encode them first

from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# Option 1: Ordinal encoding (treats categories as ordered - OK for tree models
# since any split is axis-aligned and will find the right threshold)
enc = OrdinalEncoder()
X_cat_encoded = enc.fit_transform(X_categorical)

# Option 2: One-hot encoding (creates binary features - trees handle these fine,
# but increases dimensionality and slows training)
ohe = OneHotEncoder(sparse_output=False)
X_ohe = ohe.fit_transform(X_categorical)

# Option 3: Use CatBoost or LightGBM - they handle categoricals natively
# without encoding, using target statistics or optimal binary partitions

Missing Values

sklearn's DecisionTreeClassifier does not handle missing values (raises an error). In production you must impute before fitting:

from sklearn.impute import SimpleImputer

# Median imputation (robust to outliers, good default for trees)
imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp  = imputer.transform(X_test)

XGBoost and LightGBM learn the optimal direction to send missing values at each split - they route all missing-value samples either left or right, whichever reduces impurity more. This is a significant production advantage over sklearn trees.

Class Imbalance

Trees are biased toward the majority class because Gini impurity is dominated by majority-class purity. A node that is 95% legitimate and 5% fraud has Gini ≈ 0.095 - already quite "pure" from the tree's perspective, even though it misses all the fraud.

# Option 1: class_weight="balanced" - weights each sample inversely
# proportional to its class frequency. Gini computation uses weighted counts.
tree = SklearnTree(
    max_depth=6,
    min_samples_leaf=10,
    class_weight="balanced",  # automatically computes weights
)

# Option 2: Manual class weights
n_neg, n_pos = np.bincount(y_train)
weight_pos = n_neg / n_pos  # positive class gets more weight
sample_weights = np.where(y_train == 1, weight_pos, 1.0)
tree.fit(X_train, y_train, sample_weight=sample_weights)

# Option 3: Threshold tuning at prediction time
# Get class probabilities, then choose threshold by optimizing F1 or recall
probs = tree.predict_proba(X_test)[:, 1]
threshold = 0.3  # lower than default 0.5 to increase recall on minority class
preds = (probs >= threshold).astype(int)

Time and Space Complexity

Training

The CART algorithm with sorting has the following complexity:

Phase	Complexity	Notes
Sort each feature	$O(nd \log n)$	Done once at root; can be cached with presorted indices
FindBestSplit at one node	$O(nd)$	After sorting: scan each feature's threshold list
Number of nodes in a balanced tree	$O(n)$	At most $2n - 1$ nodes
Total training (balanced tree)	$O(n d \log n)$	sklearn's Cython implementation achieves this
Total training (worst case, unbalanced)	$O(n^2 d)$	Degenerate tree: each split removes only 1 sample

The $O(nd \log n)$ complexity is why decision trees are fast for tabular data - the same asymptotic as sorting. A tree on 1M samples with 100 features trains in seconds.

Prediction

Prediction is a single traversal from root to leaf: at each internal node, evaluate one feature comparison. For a tree of depth $D$ , prediction is $O(D)$ .

Tree configuration	Prediction cost
max_depth=5	5 comparisons per sample
max_depth=20	20 comparisons per sample
Balanced unlimited tree on $n$ samples	$O(\log n)$ expected

This makes trees extremely fast at inference - critical for real-time fraud scoring or ad ranking. A 1000-tree random forest with max_depth=10 makes at most 10,000 comparisons per prediction. At modern CPU speeds, this is sub-millisecond for a single prediction.

Space

Storing the tree requires $O(\text{nodes})$ memory. A tree of depth $D$ has at most $2^{D+1} - 1$ nodes. A depth-20 tree can have up to $\approx 2M$ nodes - small for a single tree, but a random forest of 500 such trees would be 1 billion nodes. In practice, trees are not fully balanced and most leaves terminate early, so real memory usage is much lower.

Production Engineering Notes

When to Use Decision Trees

Decision trees (single, unpruned) are almost never the right production choice. They are high-variance and almost always outperformed by their ensembles (random forests, gradient boosting). Use a single tree when:

Explainability is a hard requirement: A single tree of depth ≤ 5 can be printed and reviewed by a human. Required in some regulated industries (certain credit decisions, clinical triage tools).
Debugging an ensemble: Print the first few trees of a random forest or XGBoost model to understand what features and thresholds the ensemble is learning.
Fast prototype: Get a quick baseline and understand feature importances before tuning a complex ensemble.
Rule extraction: Decision trees produce explicit if-then-else rules that can be extracted and hardcoded in a downstream system.

When to Use Linear Models Instead

Scenario	Prefer Linear Model	Prefer Decision Tree
Feature space is high-dimensional, sparse (text/embeddings)	Yes	No - trees are slow on sparse data
Features are numeric, low-dimensional, with interactions	No	Yes
You need calibrated probabilities	Yes (logistic reg is well-calibrated)	Rarely (trees need isotonic calibration)
Training data < 1000 samples	Yes	Risky - trees overfit small data
Training data > 100K samples with mixed types	No	Yes - trees handle mixed types natively
Regulatory explainability required	Yes (coefficients)	Yes (if depth ≤ 5)
Online learning / streaming	Yes (SGD)	No - trees are batch

Calibrating Tree Probabilities

Decision trees return probability estimates as the fraction of each class in the leaf. For a leaf with 3 fraud and 7 legitimate samples, the fraud probability is reported as 0.3. This is unreliable for small leaves (high variance) and systematically overconfident for large leaves (the tree has overfit). In production, always calibrate tree probabilities before using them for decision-making:

from sklearn.calibration import CalibratedClassifierCV

# Platt scaling (logistic regression on top of tree scores)
calibrated = CalibratedClassifierCV(
    SklearnTree(max_depth=6, min_samples_leaf=10),
    method="sigmoid",   # Platt scaling
    cv=5,
)
calibrated.fit(X_train, y_train)

# Or isotonic regression (more flexible, needs more data)
calibrated_iso = CalibratedClassifierCV(
    SklearnTree(max_depth=6, min_samples_leaf=10),
    method="isotonic",
    cv=5,
)
calibrated_iso.fit(X_train, y_train)

Monitoring Trees in Production

A deployed tree is a static lookup table: once trained, its structure never changes. If the distribution of input features shifts (new merchant categories, new transaction types), the tree cannot adapt. Standard monitoring checklist:

1. Feature distribution drift - monitor mean and std of each input feature daily
2. Prediction distribution drift - monitor fraction of positive predictions weekly
3. Leaf coverage - track which leaves receive traffic; leaves with zero traffic
   are "dead code" that may activate for adversarial inputs
4. Leaf composition drift - for labeled data, check if actual class rates in
   each leaf still match the leaf's training-time class fractions

The fraud model in the opening scenario would have been caught by monitoring #3 - when the new attack pattern arrived, it fell into leaves the tree had never routed production traffic to before, which should have triggered an alert.

Interview Q&A

Q1: Walk me through how a decision tree decides where to split.

At each node, the tree evaluates every feature and every possible threshold (midpoints between unique values). For each candidate split, it computes the weighted impurity of the two resulting children - using Gini for classification or variance for regression. The split that produces the greatest impurity reduction (information gain) is chosen. This is greedy: the tree does not look ahead to evaluate whether a locally mediocre split might lead to a globally better tree. Greedy splitting is why trees can overfit and why they benefit from depth constraints and pruning.

Q2: What is the difference between Gini impurity and entropy? Which should you use?

Both measure the impurity of a node's class distribution. Gini ( $1 - \sum p_k^2$ ) is the expected misclassification rate under random class assignment. Entropy ( $-\sum p_k \log_2 p_k$ ) is the information-theoretic uncertainty. For a binary classification problem with $p$ being the positive class fraction, Gini reaches its maximum of 0.5 at $p = 0.5$ , while entropy reaches its maximum of 1.0 at the same point. In practice, the choice almost never matters - both produce similar trees. Gini is slightly cheaper to compute (no logarithm) and is the default in sklearn and XGBoost.

Q3: A decision tree achieves 100% training accuracy. Is this good or bad?

Almost always bad. 100% training accuracy means the tree has grown deep enough to create a unique leaf for every training pattern - it has memorized the training data rather than learned generalizable patterns. For a classification problem, this happens when every leaf is pure (contains only one class). The tree's test performance will typically be significantly worse. The fix: constrain max_depth, increase min_samples_leaf, or switch to an ensemble (random forests and gradient boosting are specifically designed to address this variance problem).

Q4: How does a decision tree handle a continuous feature like transaction amount?

For a continuous feature, CART sorts the unique values and evaluates candidate thresholds at the midpoints between consecutive unique values. For a feature with $m$ unique values, this generates $m - 1$ candidate thresholds. The threshold that maximizes impurity reduction is chosen. This is efficient after sorting: you scan the sorted array once, maintaining running counts of left and right class distributions, updating the impurity estimate in $O(1)$ per threshold. Total cost: $O(n \log n)$ for sorting, $O(n)$ for scanning - $O(n \log n)$ per feature per node.

Q5: What is the time complexity of training a decision tree and how does it scale?

Training with presorted features is $O(nd \log n)$ for a balanced tree: $O(n \log n)$ expected nodes, each requiring $O(d)$ work after sorting. In the worst case (each split removes exactly one sample), the tree has $O(n)$ depth and training becomes $O(n^2 d)$ . For a dataset with $n = 100{,}000$ samples and $d = 100$ features, training is roughly $10^7 \cdot \log(10^5) \approx 1.7 \times 10^8$ operations - fast enough to run in seconds on a modern CPU. Prediction is $O(D)$ per sample where $D$ is the tree depth, making inference extremely fast even for large trees.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Decision Tree Splits demo on the EngineersOfAI Playground - no code required.

:::

The Production Scenario​

Recursive Binary Splitting: The Core Algorithm​

The Algorithm in Pseudocode​

How a Split Is Evaluated: Impurity Reduction​

Impurity Before and After​

The Two Dominant Impurity Measures​

Comparing Gini vs. Entropy in Practice​

The CART Algorithm​

Decision Boundary Visualization​

Python Implementation From Scratch​

The Node and Tree Classes​

Training and Evaluating​

Expected Output​

Comparing to sklearn​

Hyperparameters: What They Actually Control​

max_depth​

min_samples_split​

min_samples_leaf​

Hyperparameter Interaction​

How Trees Handle Special Cases​

Categorical Features​

Missing Values​

Class Imbalance​

Time and Space Complexity​

Training​

Prediction​

Space​

Production Engineering Notes​

When to Use Decision Trees​

When to Use Linear Models Instead​

Calibrating Tree Probabilities​

Monitoring Trees in Production​

Interview Q&A​