Decision Trees Internals
The Production Scenario
It is 9 AM on a Monday and the on-call alert fires: fraud losses at the payments company have spiked 40% overnight. The fraud team pulls up the decision tree model that has been running in production for three months. On paper it looks excellent - 99.1% accuracy on the holdout set, precision above 0.94. But overnight a new attack pattern emerged: a coordinated wave of low-value transactions from newly created accounts at a specific merchant category, structured to stay below every threshold the tree learned.
The first thing the senior ML engineer does is export the tree and look at its depth: 20 levels. She counts the number of leaf nodes: 847. She checks the training set size: 180,000 transactions. A tree with 847 leaves and 180,000 samples means many leaves have only a handful of training examples - the tree has memorized the specific fraud patterns it saw in training down to individual merchant IDs and exact dollar amounts. The new fraud pattern is structurally different, so the tree routes it to "legitimate" leaves it built around similar-looking genuine transactions.
This is textbook overfitting, and it is the most common failure mode of decision trees in production. The fix - constraining depth, requiring minimum samples per leaf, or pruning - requires understanding exactly how the tree was built in the first place. You cannot debug a tool you do not understand from the inside.
This lesson walks through the complete internals of decision tree construction: how splits are chosen, why impurity metrics work the way they do, how the recursion terminates, and what each hyperparameter actually controls. By the end you will have built a working tree from scratch in NumPy and will know exactly where that fraud model went wrong.
Recursive Binary Splitting: The Core Algorithm
A decision tree partitions the feature space into axis-aligned rectangular regions. Each region gets a single prediction value - the majority class for classification, the mean for regression. The partition is found greedily: at each node, find the single best split, recurse on each half, and stop when a termination condition is met.
The Algorithm in Pseudocode
function BuildTree(X, y, depth=0):
# Termination conditions
if depth >= max_depth:
return LeafNode(prediction=majority_class(y))
if len(y) < min_samples_split:
return LeafNode(prediction=majority_class(y))
if all samples in y are the same class:
return LeafNode(prediction=y[0])
best_feature, best_threshold = FindBestSplit(X, y)
if no split improves impurity:
return LeafNode(prediction=majority_class(y))
left_mask = X[:, best_feature] <= best_threshold
right_mask = ~left_mask
left_child = BuildTree(X[left_mask], y[left_mask], depth + 1)
right_child = BuildTree(X[right_mask], y[right_mask], depth + 1)
return InternalNode(feature=best_feature,
threshold=best_threshold,
left=left_child,
right=right_child)
function FindBestSplit(X, y):
best_gain = 0
best_feature = None
best_threshold = None
for each feature j in [0, d):
for each candidate threshold t in sorted(unique(X[:, j])):
left_y = y[X[:, j] <= t]
right_y = y[X[:, j] > t]
gain = ImpurityReduction(y, left_y, right_y)
if gain > best_gain:
best_gain = gain
best_feature = j
best_threshold = t
return best_feature, best_threshold
The outer loop over features and thresholds is the computational bottleneck. For a dataset with samples and features, FindBestSplit evaluates up to candidate splits, each requiring work to compute the impurity - giving per node naively. Sorting the features first reduces this to , which is why sklearn always sorts before scanning.
How a Split Is Evaluated: Impurity Reduction
The central question is: how do we measure whether a split is "good"? The answer is impurity reduction - also called information gain.
Impurity Before and After
Given a node containing samples with class distribution (where is the fraction of samples in class ), the impurity of the node is a function .
After splitting on feature at threshold , we have:
- Left child: samples with class distribution
- Right child: samples with class distribution
The weighted impurity after the split is:
The impurity reduction (information gain) is:
We choose the split that maximizes . A split is only made if , i.e., it must actually reduce impurity.
The Two Dominant Impurity Measures
Gini Index (used by CART, sklearn):
Gini is the probability that a randomly chosen sample from the node would be misclassified if its label were assigned at random according to the class distribution. A pure node (one class only) has Gini = 0. A maximally impure binary node () has Gini = 0.5.
Entropy (used by ID3, C4.5):
Entropy is the information-theoretic measure of uncertainty. A pure node has entropy = 0. A maximally impure binary node has entropy = 1 bit. The information gain from a split is .
For regression (CART only), impurity is variance:
The split maximizes variance reduction - the weighted sum of child variances subtracted from parent variance. The leaf prediction is the mean of values in that leaf.
Comparing Gini vs. Entropy in Practice
| Property | Gini Index | Entropy |
|---|---|---|
| Formula | ||
| Range (binary) | [0, 0.5] | [0, 1.0] |
| Computational cost | Cheaper (no log) | Slightly more expensive |
| Behavior | Slightly prefers larger partitions | More sensitive to class probability changes |
| Used by | CART, sklearn, XGBoost, LightGBM | ID3, C4.5, old textbooks |
| Practical difference | Almost identical in practice | Almost identical in practice |
:::note Gini vs Entropy: does it matter? In practice, the choice between Gini and entropy rarely affects the final tree structure significantly. Both metrics produce similar splits because they are both concave functions of the class probabilities that reach their minimum at purity. The more impactful decisions are max_depth, min_samples_leaf, and whether you use pruning. :::
The CART Algorithm
CART (Classification and Regression Trees), introduced by Breiman, Friedman, Olshen, and Stone in 1984, is the specific algorithm that underlies virtually all modern tree implementations.
CART makes four key design choices:
-
Binary splits only: Every split divides a node into exactly two children. This is less memory-efficient than multi-way splits (C4.5 can split into many children at once), but it produces a simpler, more balanced tree structure.
-
Gini for classification, MSE reduction for regression: Both are computed in after sorting.
-
Cost-complexity pruning: After growing the full tree, CART prunes it back using a regularization parameter that penalizes tree complexity. This is more principled than pre-pruning with arbitrary depth limits.
-
Handles both categorical and continuous features: For continuous features, thresholds are chosen from the midpoints between sorted unique values. For categorical features, CART finds the best binary partition of the category set (for categories, there are possible binary partitions).
The CART split criterion for a node with samples is:
where is the impurity function (Gini or variance), and , are the left and right children.
Decision Boundary Visualization
Decision trees produce axis-aligned decision boundaries. For a 2D feature space, this looks like a recursive rectangular partition:
Feature x2
^
1 | FRAUD | LEGIT
| |
| | split: x1 <= 250
| |
|----------|------------------------------
| | LEGIT | FRAUD
| LEGIT | |
| | split: x2 <= 0.4
| | |
+----------+-------------+---------------> Feature x1
0 250 250
^
split: x1 <= 250 (root)
Each rectangular region in the feature space corresponds to exactly one leaf in the tree. The model's prediction for any new point is determined by which region it falls into.
:::warning Axis-aligned splits and rotated structure Decision trees cannot express diagonal or curved decision boundaries efficiently. A boundary like (a diagonal line) requires many axis-aligned splits to approximate. This is why trees sometimes need many more leaves than a linear model would need coefficients for the same problem. Random forests partially mitigate this by averaging many trees with different axis-aligned boundaries. :::
Python Implementation From Scratch
The Node and Tree Classes
import numpy as np
from collections import Counter
class Node:
"""Represents a single node in a decision tree."""
def __init__(
self,
feature=None,
threshold=None,
left=None,
right=None,
*,
value=None,
):
self.feature = feature # Feature index used to split
self.threshold = threshold # Threshold value for the split
self.left = left # Left child (samples <= threshold)
self.right = right # Right child (samples > threshold)
self.value = value # Leaf prediction (None for internal nodes)
def is_leaf(self):
return self.value is not None
class DecisionTreeClassifier:
"""
CART decision tree for classification.
Splits on Gini impurity reduction.
Supports max_depth, min_samples_split, min_samples_leaf.
"""
def __init__(
self,
max_depth=None,
min_samples_split=2,
min_samples_leaf=1,
):
self.max_depth = max_depth
self.min_samples_split = min_samples_split
self.min_samples_leaf = min_samples_leaf
self.root = None
self.n_classes_ = None
self.n_features_ = None
def fit(self, X, y):
X = np.asarray(X, dtype=float)
y = np.asarray(y)
self.n_features_ = X.shape[1]
self.n_classes_ = len(np.unique(y))
self.root = self._build(X, y, depth=0)
return self
def predict(self, X):
X = np.asarray(X, dtype=float)
return np.array([self._traverse(x, self.root) for x in X])
# ------------------------------------------------------------------ #
# Internal methods #
# ------------------------------------------------------------------ #
def _build(self, X, y, depth):
n_samples = len(y)
# Termination conditions
all_same = len(np.unique(y)) == 1
too_small = n_samples < self.min_samples_split
too_deep = self.max_depth is not None and depth >= self.max_depth
if all_same or too_small or too_deep:
return Node(value=self._leaf_value(y))
feature, threshold, gain = self._best_split(X, y)
if gain == 0:
return Node(value=self._leaf_value(y))
left_mask = X[:, feature] <= threshold
right_mask = ~left_mask
# Enforce min_samples_leaf
if left_mask.sum() < self.min_samples_leaf or right_mask.sum() < self.min_samples_leaf:
return Node(value=self._leaf_value(y))
left = self._build(X[left_mask], y[left_mask], depth + 1)
right = self._build(X[right_mask], y[right_mask], depth + 1)
return Node(feature=feature, threshold=threshold, left=left, right=right)
def _best_split(self, X, y):
best_gain = 0.0
best_feature = None
best_threshold = None
parent_impurity = self._gini(y)
for feature_idx in range(self.n_features_):
col = X[:, feature_idx]
sorted_vals = np.unique(col)
# Candidate thresholds: midpoints between consecutive unique values
thresholds = (sorted_vals[:-1] + sorted_vals[1:]) / 2.0
for threshold in thresholds:
left_y = y[col <= threshold]
right_y = y[col > threshold]
if len(left_y) == 0 or len(right_y) == 0:
continue
n, n_l, n_r = len(y), len(left_y), len(right_y)
weighted_impurity = (
(n_l / n) * self._gini(left_y)
+ (n_r / n) * self._gini(right_y)
)
gain = parent_impurity - weighted_impurity
if gain > best_gain:
best_gain = gain
best_feature = feature_idx
best_threshold = threshold
return best_feature, best_threshold, best_gain
@staticmethod
def _gini(y):
"""Gini impurity of a label array."""
n = len(y)
if n == 0:
return 0.0
counts = np.bincount(y)
probs = counts / n
return 1.0 - np.sum(probs ** 2)
@staticmethod
def _leaf_value(y):
"""Majority class."""
return int(Counter(y).most_common(1)[0][0])
def _traverse(self, x, node):
if node.is_leaf():
return node.value
if x[node.feature] <= node.threshold:
return self._traverse(x, node.left)
return self._traverse(x, node.right)
def get_depth(self):
"""Return the maximum depth of the fitted tree."""
def _depth(node):
if node is None or node.is_leaf():
return 0
return 1 + max(_depth(node.left), _depth(node.right))
return _depth(self.root)
def count_leaves(self):
"""Return the number of leaf nodes."""
def _count(node):
if node is None or node.is_leaf():
return 1
return _count(node.left) + _count(node.right)
return _count(self.root)
Training and Evaluating
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
# Generate synthetic data
X, y = make_classification(
n_samples=2000,
n_features=10,
n_informative=5,
n_redundant=2,
random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Unconstrained tree (will overfit)
tree_deep = DecisionTreeClassifier()
tree_deep.fit(X_train, y_train)
print(f"Unconstrained tree depth: {tree_deep.get_depth()}")
print(f"Unconstrained tree leaves: {tree_deep.count_leaves()}")
print(f"Train accuracy: {accuracy_score(y_train, tree_deep.predict(X_train)):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, tree_deep.predict(X_test)):.4f}")
print()
# Constrained tree
tree_pruned = DecisionTreeClassifier(
max_depth=5,
min_samples_split=20,
min_samples_leaf=10,
)
tree_pruned.fit(X_train, y_train)
print(f"Constrained tree depth: {tree_pruned.get_depth()}")
print(f"Constrained tree leaves: {tree_pruned.count_leaves()}")
print(f"Train accuracy: {accuracy_score(y_train, tree_pruned.predict(X_train)):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, tree_pruned.predict(X_test)):.4f}")
Expected Output
Unconstrained tree depth: 17
Unconstrained tree leaves: 312
Train accuracy: 1.0000
Test accuracy: 0.8350
Constrained tree depth: 5
Constrained tree leaves: 23
Train accuracy: 0.8956
Test accuracy: 0.8700
The unconstrained tree memorizes training data perfectly (1.0 train accuracy) but generalizes worse. The constrained tree gives up some training accuracy but achieves better generalization - exactly the fraud model failure from the opening scenario.
Comparing to sklearn
from sklearn.tree import DecisionTreeClassifier as SklearnTree
sk_tree = SklearnTree(
max_depth=5,
min_samples_split=20,
min_samples_leaf=10,
criterion="gini",
random_state=42,
)
sk_tree.fit(X_train, y_train)
print(f"sklearn test accuracy: {accuracy_score(y_test, sk_tree.predict(X_test)):.4f}")
# Verify our implementation matches
our_preds = tree_pruned.predict(X_test)
sk_preds = sk_tree.predict(X_test)
agreement = np.mean(our_preds == sk_preds)
print(f"Agreement with sklearn: {agreement:.4f}")
# Typically > 0.97 - minor differences due to tie-breaking at equal-gain splits
:::tip Why sklearn is faster Our NumPy implementation evaluates candidate thresholds in a Python loop - threshold evaluations per node, each in interpreted Python. sklearn's tree builder uses Cython with presorted feature arrays, reducing the inner loop to vectorized C operations. For 10,000 samples sklearn is typically 50–200x faster, which is why you always use sklearn (or XGBoost/LightGBM) in production. :::
Hyperparameters: What They Actually Control
max_depth
Controls the maximum number of levels in the tree from root to any leaf. This is the single most impactful regularization parameter.
max_depth=1: A "decision stump" - one split, two leaves.
High bias, low variance. Useful in AdaBoost.
max_depth=5: ~32 leaves maximum. Good for most production use cases.
The fraud tree above needed max_depth ≤ 6.
max_depth=None: Grow until all leaves are pure or too small.
Almost always overfits unless min_samples_leaf is large.
How to tune: Validation curve over [2, 3, 4, 5, 6, 8, 10, 15, None]. The optimal depth is where test accuracy plateaus - adding more depth gives diminishing returns or hurts.
min_samples_split
The minimum number of samples required to split an internal node. A node with fewer than min_samples_split samples becomes a leaf regardless of its impurity.
# Effect: prevents splitting on very small groups
# min_samples_split=2 -> any node with 2+ samples can be split (default, risky)
# min_samples_split=20 -> a node needs 20 samples to attempt a split
# min_samples_split=50 -> aggressive regularization, suitable for noisy data
:::warning min_samples_split vs min_samples_leaf
These are often confused. min_samples_split controls whether a node attempts a split. min_samples_leaf controls whether a proposed split is accepted - it requires both resulting children to have at least min_samples_leaf samples. For regularization, min_samples_leaf is often more effective because it directly controls leaf size.
:::
min_samples_leaf
The minimum number of samples that must be in a leaf after any split. If a candidate split would produce a leaf with fewer than min_samples_leaf samples, that split is not made.
# Practical guidance (for classification):
# min_samples_leaf=1 -> allow single-sample leaves (memorization)
# min_samples_leaf=5 -> minimum meaningful group size
# min_samples_leaf=20 -> production-safe for most datasets
# min_samples_leaf=50 -> high-regularization for noisy/imbalanced data
# For class imbalance: set proportionally to minority class size
minority_size = y_train.sum()
min_leaf = max(5, minority_size // 100) # at least 1% of minority class per leaf
Hyperparameter Interaction
| max_depth | min_samples_leaf | Effect |
|---|---|---|
| None | 1 | Full overfit - memorizes training data |
| None | 50 | Moderate tree, bounded by leaf size |
| 5 | 1 | Bounded by depth, small leaves allowed |
| 5 | 20 | Double constraint - usually the sweet spot |
| 2 | 1 | Decision stump - high bias |
How Trees Handle Special Cases
Categorical Features
CART handles categorical features by finding the best binary partition of the category set. For a feature with distinct categories, there are possible binary partitions. For small (≤ 10), CART enumerates them. For large , sklearn converts categories to integers and treats them as ordinal - which is wrong for unordered categories.
# sklearn DOES NOT natively handle categorical features
# You must encode them first
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
# Option 1: Ordinal encoding (treats categories as ordered - OK for tree models
# since any split is axis-aligned and will find the right threshold)
enc = OrdinalEncoder()
X_cat_encoded = enc.fit_transform(X_categorical)
# Option 2: One-hot encoding (creates binary features - trees handle these fine,
# but increases dimensionality and slows training)
ohe = OneHotEncoder(sparse_output=False)
X_ohe = ohe.fit_transform(X_categorical)
# Option 3: Use CatBoost or LightGBM - they handle categoricals natively
# without encoding, using target statistics or optimal binary partitions
Missing Values
sklearn's DecisionTreeClassifier does not handle missing values (raises an error). In production you must impute before fitting:
from sklearn.impute import SimpleImputer
# Median imputation (robust to outliers, good default for trees)
imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)
XGBoost and LightGBM learn the optimal direction to send missing values at each split - they route all missing-value samples either left or right, whichever reduces impurity more. This is a significant production advantage over sklearn trees.
Class Imbalance
Trees are biased toward the majority class because Gini impurity is dominated by majority-class purity. A node that is 95% legitimate and 5% fraud has Gini ≈ 0.095 - already quite "pure" from the tree's perspective, even though it misses all the fraud.
# Option 1: class_weight="balanced" - weights each sample inversely
# proportional to its class frequency. Gini computation uses weighted counts.
tree = SklearnTree(
max_depth=6,
min_samples_leaf=10,
class_weight="balanced", # automatically computes weights
)
# Option 2: Manual class weights
n_neg, n_pos = np.bincount(y_train)
weight_pos = n_neg / n_pos # positive class gets more weight
sample_weights = np.where(y_train == 1, weight_pos, 1.0)
tree.fit(X_train, y_train, sample_weight=sample_weights)
# Option 3: Threshold tuning at prediction time
# Get class probabilities, then choose threshold by optimizing F1 or recall
probs = tree.predict_proba(X_test)[:, 1]
threshold = 0.3 # lower than default 0.5 to increase recall on minority class
preds = (probs >= threshold).astype(int)
Time and Space Complexity
Training
The CART algorithm with sorting has the following complexity:
| Phase | Complexity | Notes |
|---|---|---|
| Sort each feature | Done once at root; can be cached with presorted indices | |
| FindBestSplit at one node | After sorting: scan each feature's threshold list | |
| Number of nodes in a balanced tree | At most nodes | |
| Total training (balanced tree) | sklearn's Cython implementation achieves this | |
| Total training (worst case, unbalanced) | Degenerate tree: each split removes only 1 sample |
The complexity is why decision trees are fast for tabular data - the same asymptotic as sorting. A tree on 1M samples with 100 features trains in seconds.
Prediction
Prediction is a single traversal from root to leaf: at each internal node, evaluate one feature comparison. For a tree of depth , prediction is .
| Tree configuration | Prediction cost |
|---|---|
| max_depth=5 | 5 comparisons per sample |
| max_depth=20 | 20 comparisons per sample |
| Balanced unlimited tree on samples | expected |
This makes trees extremely fast at inference - critical for real-time fraud scoring or ad ranking. A 1000-tree random forest with max_depth=10 makes at most 10,000 comparisons per prediction. At modern CPU speeds, this is sub-millisecond for a single prediction.
Space
Storing the tree requires memory. A tree of depth has at most nodes. A depth-20 tree can have up to nodes - small for a single tree, but a random forest of 500 such trees would be 1 billion nodes. In practice, trees are not fully balanced and most leaves terminate early, so real memory usage is much lower.
Production Engineering Notes
When to Use Decision Trees
Decision trees (single, unpruned) are almost never the right production choice. They are high-variance and almost always outperformed by their ensembles (random forests, gradient boosting). Use a single tree when:
- Explainability is a hard requirement: A single tree of depth ≤ 5 can be printed and reviewed by a human. Required in some regulated industries (certain credit decisions, clinical triage tools).
- Debugging an ensemble: Print the first few trees of a random forest or XGBoost model to understand what features and thresholds the ensemble is learning.
- Fast prototype: Get a quick baseline and understand feature importances before tuning a complex ensemble.
- Rule extraction: Decision trees produce explicit if-then-else rules that can be extracted and hardcoded in a downstream system.
When to Use Linear Models Instead
| Scenario | Prefer Linear Model | Prefer Decision Tree |
|---|---|---|
| Feature space is high-dimensional, sparse (text/embeddings) | Yes | No - trees are slow on sparse data |
| Features are numeric, low-dimensional, with interactions | No | Yes |
| You need calibrated probabilities | Yes (logistic reg is well-calibrated) | Rarely (trees need isotonic calibration) |
| Training data < 1000 samples | Yes | Risky - trees overfit small data |
| Training data > 100K samples with mixed types | No | Yes - trees handle mixed types natively |
| Regulatory explainability required | Yes (coefficients) | Yes (if depth ≤ 5) |
| Online learning / streaming | Yes (SGD) | No - trees are batch |
Calibrating Tree Probabilities
Decision trees return probability estimates as the fraction of each class in the leaf. For a leaf with 3 fraud and 7 legitimate samples, the fraud probability is reported as 0.3. This is unreliable for small leaves (high variance) and systematically overconfident for large leaves (the tree has overfit). In production, always calibrate tree probabilities before using them for decision-making:
from sklearn.calibration import CalibratedClassifierCV
# Platt scaling (logistic regression on top of tree scores)
calibrated = CalibratedClassifierCV(
SklearnTree(max_depth=6, min_samples_leaf=10),
method="sigmoid", # Platt scaling
cv=5,
)
calibrated.fit(X_train, y_train)
# Or isotonic regression (more flexible, needs more data)
calibrated_iso = CalibratedClassifierCV(
SklearnTree(max_depth=6, min_samples_leaf=10),
method="isotonic",
cv=5,
)
calibrated_iso.fit(X_train, y_train)
Monitoring Trees in Production
A deployed tree is a static lookup table: once trained, its structure never changes. If the distribution of input features shifts (new merchant categories, new transaction types), the tree cannot adapt. Standard monitoring checklist:
1. Feature distribution drift - monitor mean and std of each input feature daily
2. Prediction distribution drift - monitor fraction of positive predictions weekly
3. Leaf coverage - track which leaves receive traffic; leaves with zero traffic
are "dead code" that may activate for adversarial inputs
4. Leaf composition drift - for labeled data, check if actual class rates in
each leaf still match the leaf's training-time class fractions
The fraud model in the opening scenario would have been caught by monitoring #3 - when the new attack pattern arrived, it fell into leaves the tree had never routed production traffic to before, which should have triggered an alert.
Interview Q&A
Q1: Walk me through how a decision tree decides where to split.
At each node, the tree evaluates every feature and every possible threshold (midpoints between unique values). For each candidate split, it computes the weighted impurity of the two resulting children - using Gini for classification or variance for regression. The split that produces the greatest impurity reduction (information gain) is chosen. This is greedy: the tree does not look ahead to evaluate whether a locally mediocre split might lead to a globally better tree. Greedy splitting is why trees can overfit and why they benefit from depth constraints and pruning.
Q2: What is the difference between Gini impurity and entropy? Which should you use?
Both measure the impurity of a node's class distribution. Gini () is the expected misclassification rate under random class assignment. Entropy () is the information-theoretic uncertainty. For a binary classification problem with being the positive class fraction, Gini reaches its maximum of 0.5 at , while entropy reaches its maximum of 1.0 at the same point. In practice, the choice almost never matters - both produce similar trees. Gini is slightly cheaper to compute (no logarithm) and is the default in sklearn and XGBoost.
Q3: A decision tree achieves 100% training accuracy. Is this good or bad?
Almost always bad. 100% training accuracy means the tree has grown deep enough to create a unique leaf for every training pattern - it has memorized the training data rather than learned generalizable patterns. For a classification problem, this happens when every leaf is pure (contains only one class). The tree's test performance will typically be significantly worse. The fix: constrain max_depth, increase min_samples_leaf, or switch to an ensemble (random forests and gradient boosting are specifically designed to address this variance problem).
Q4: How does a decision tree handle a continuous feature like transaction amount?
For a continuous feature, CART sorts the unique values and evaluates candidate thresholds at the midpoints between consecutive unique values. For a feature with unique values, this generates candidate thresholds. The threshold that maximizes impurity reduction is chosen. This is efficient after sorting: you scan the sorted array once, maintaining running counts of left and right class distributions, updating the impurity estimate in per threshold. Total cost: for sorting, for scanning - per feature per node.
Q5: What is the time complexity of training a decision tree and how does it scale?
Training with presorted features is for a balanced tree: expected nodes, each requiring work after sorting. In the worst case (each split removes exactly one sample), the tree has depth and training becomes . For a dataset with samples and features, training is roughly operations - fast enough to run in seconds on a modern CPU. Prediction is per sample where is the tree depth, making inference extremely fast even for large trees.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Decision Tree Splits demo on the EngineersOfAI Playground - no code required.
:::
