Information Gain, Gini Impurity, and Entropy
The Production Scenario
You are a data scientist at a large regional hospital network. The clinical team has tasked you with building a 30-day readmission prediction model - if you can flag high-risk patients before discharge, care coordinators can intervene and reduce costly readmissions. You pull together a dataset: 40,000 discharge records, features like age, primary diagnosis, number of medications, length of stay, prior admissions, and insurance type.
You train a DecisionTreeClassifier and inspect the resulting tree. Something immediately stands out: the root split is on num_medications > 11, the second level splits are both on num_medications > 8 and num_medications > 14, and the third level still leans heavily on medications. The tree is almost entirely built around a single feature. The clinical team asks the obvious question: is this medically meaningful, or is the algorithm infatuated with one variable?
To answer that question, you need to understand exactly what the tree optimizer is doing when it picks a split. It is not guessing, and it is not random. It computes a score for every possible split on every possible feature and threshold, then picks the one with the highest score. That score is either information gain (if criterion='entropy') or Gini impurity reduction (if criterion='gini', the default). Understanding how those scores are computed tells you precisely why num_medications keeps winning - it has many distinct values, giving it more threshold candidates and more opportunities to find a clean separation.
This lesson builds the full mathematical foundation. You will derive entropy and Gini from first principles, compute them by hand on a toy example, implement them in NumPy from scratch, and understand their practical differences. You will also learn about the multi-valued feature bias - the root cause of your medication feature's dominance - and the Gain Ratio fix introduced by the C4.5 algorithm. By the end, you can look at any tree split and explain exactly why the algorithm chose it.
Why This Exists: The Problem of "Purity"
Before information gain, early tree algorithms like CHAID (1980) used statistical tests like chi-square to evaluate splits - a sensible approach, but one that made it hard to build trees consistently and efficiently. The ID3 algorithm by Ross Quinlan in 1986 introduced a cleaner idea: frame the split selection problem as an information-theoretic optimization. Instead of asking "is this split statistically significant?", ask "how much uncertainty does this split resolve?"
The key insight is that a pure node - one where all examples belong to the same class - is maximally predictable and has zero uncertainty. An impure node - where classes are mixed - is uncertain and we need more splits to resolve it. The splitting criterion measures impurity, and we pick the split that reduces it the most.
This framing works beautifully. It gives us a single scalar to optimize, it handles any number of classes naturally, and it connects to the deep mathematical theory of information. The only question is: which impurity measure should we use?
Historical Context
1948 - Shannon Entropy. Claude Shannon's landmark paper "A Mathematical Theory of Communication" defined entropy as a measure of information content in a probability distribution. Shannon was thinking about communication channels, not decision trees - but the mathematics transferred perfectly. His entropy formula became the foundation for information-theoretic measures in machine learning.
1986 - ID3 (Quinlan). Ross Quinlan introduced the ID3 algorithm, which used Shannon entropy and information gain as the splitting criterion. ID3 was the first practical tree algorithm that could handle multi-way splits on categorical features efficiently. The "aha moment" was realizing that information gain directly measures how much uncertainty a split eliminates.
1993 - C4.5 (Quinlan). Quinlan improved ID3 with C4.5, which added support for continuous features, missing values, and - critically - the Gain Ratio correction that addresses information gain's bias toward high-cardinality features. C4.5 became the algorithm behind the iconic J48 implementation still used today.
1984 - CART (Breiman et al.). Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone introduced CART: Classification and Regression Trees. CART introduced Gini impurity as an alternative to entropy, and mandated binary splits only (never multi-way). This design makes CART faster and better suited for ensembles. Sklearn's DecisionTreeClassifier implements CART.
1997 - Friedman MSE. For regression trees, Jerome Friedman proposed a weighted MSE criterion that finds splits more efficiently on small datasets by separately weighting left and right children - a small but meaningful practical improvement.
Shannon Entropy: Measuring Impurity as Surprise
The Information-Theoretic Root
Shannon's entropy captures the intuition that surprise = information. If you know a coin always lands heads, observing "heads" gives you zero information - you already knew. If the coin is fair, each flip tells you one full bit. Entropy captures this: high entropy means high uncertainty, means more information per observation.
For a node containing examples from classes, where is the fraction of examples belonging to class :
The convention handles empty classes - a class with zero examples contributes nothing to the sum. Think of it as a limit: as , .
Entropy Properties - Derived from First Principles
Minimum (zero): When all examples belong to one class, for one and for all others. Then . A perfectly pure node has zero entropy - no uncertainty, no information needed.
Maximum for binary case: For a binary problem, is maximized at , where bit. A perfectly mixed 50-50 node requires exactly one bit to specify which class an example belongs to.
Maximum for classes: With classes, the maximum entropy is , achieved when all classes are equally likely ( for all ). With 4 classes equally balanced, maximum entropy is bits.
Concavity: Entropy is a strictly concave function of the class probability vector. This is the key mathematical property that guarantees splitting always helps: by Jensen's inequality, the weighted average entropy of children is always at most the parent's entropy. Splitting cannot increase uncertainty.
Binary Entropy H(p) vs class probability p
H(p)
1.0 | * <- maximum at p=0.5 (1.0 bit)
| * *
0.8 | * *
| * *
0.6 | * *
| * *
0.4 | * *
0.2 |* *
0.0 +--+---+---+---+---+---+-> p
0 0.2 0.4 0.6 0.8 1.0
(pure) (mixed: max H) (pure)
Worked Example: Entropy at a Node
Suppose a node has 10 patients: 3 were readmitted (positive), 7 were not (negative).
This node is fairly impure - a classifier at this node can only do somewhat better than random. Compare to a perfectly pure node ( or ): bits. Or a maximally impure node (): bit.
Information Gain: The Splitting Criterion
Derivation from First Principles
A split on feature partitions the dataset into subsets - one per distinct value of . For continuous features (as in CART), this means exactly two subsets: threshold and threshold. The weighted average entropy after the split is:
The weights reflect how many samples flow into each child node - a child receiving 80% of the data matters 4x more than one receiving 20%.
Information Gain is the reduction in entropy that the split achieves:
The tree greedily picks the feature and threshold that maximizes . By Jensen's inequality applied to the concave entropy function, always - a split cannot increase expected entropy.
Full Numerical Example
Consider 14 training examples structured as a binary split on medications > 10:
| Subset | Count | Readmitted (Yes) | Not Readmitted (No) |
|---|---|---|---|
| Root | 14 | 7 | 7 |
| Left (meds > 10) | 6 | 4 | 2 |
| Right (meds ≤ 10) | 8 | 3 | 5 |
Root entropy:
Left child (4 Yes, 2 No):
Right child (3 Yes, 5 No):
Information Gain:
This split delivers 0.062 bits of gain. The tree compares this against every other feature and threshold and picks the maximum.
Gini Impurity: The Computationally Efficient Alternative
Derivation from Misclassification Probability
Gini impurity has a clean probabilistic interpretation: the probability of misclassifying a randomly chosen element if it were labeled according to the class distribution in the node.
Pick an element at random - it belongs to class with probability . Assign it a label by sampling from the same distribution - you assign class with probability . The probability of a mismatch is:
For a binary problem with and :
Maximized at (value 0.5) and zero at or - the same boundary behavior as entropy.
Gini Reduction: The Actual Splitting Criterion
The algorithm maximizes the Gini impurity reduction:
Numerical Example: Gini on the Same Dataset
Root node ():
Left child (4 Yes, 2 No, ):
Right child (3 Yes, 5 No, ):
Gini Reduction:
Shape Comparison: Entropy vs Gini
Impurity vs class probability p (binary case)
1.0 | * <- Entropy peak (1.0 bit)
| * *
0.8 | * * Entropy: -p log2(p) - (1-p) log2(1-p)
| * *
0.6 | * *
0.5 | * <- Gini peak (0.5)
| * * * *
| * * Gini: 2p(1-p)
0.0 +--+---+---+---+---> p
0 0.2 0.4 0.6 1.0
Both curves are symmetric around and zero at the extremes. Entropy penalizes impure nodes slightly more strongly - it curves more steeply near 0.5. In practice the two criteria agree on the same best split well over 99% of the time.
:::note Why sklearn defaults to Gini Gini requires no logarithm computation - just squaring and summing. For large datasets with millions of candidate splits, this compute difference is meaningful. Additionally, both criteria produce nearly identical trees on most real-world datasets, so sklearn correctly treats Gini as the default and faster choice. :::
The CART Splitting Algorithm
CART (Classification and Regression Trees) defines a precise algorithm for choosing splits. It always uses binary splits, works with continuous features via threshold search, and exhaustively evaluates every feature-threshold combination at every node.
CART Split Selection Pseudocode
─────────────────────────────────────────────────────────────────
For each feature f in {1, ..., d}:
Sort training instances by feature f value
For each adjacent pair of unique values (v_i, v_{i+1}):
threshold t = (v_i + v_{i+1}) / 2 <- midpoint candidate
Left = {x : x[f] <= t}
Right = {x : x[f] > t}
score = Impurity(parent) - weighted_impurity(Left, Right)
If score > best_score:
best_score = score
best_feature = f
best_threshold = t
Return best_feature, best_threshold
Key design decisions in CART:
- Always binary splits (left/right only) - simpler trees, better suited for ensembles
- Midpoint thresholds between adjacent unique values - avoids boundary condition bugs
- All features evaluated at every node - exhaustive but guaranteed locally optimal
- No look-ahead - CART is greedy; it picks the best split now, ignoring future levels
Full NumPy Implementation: All Splits for a Feature
The following implementation computes all candidate splits for every feature and reports the best one under both Gini and entropy. This mirrors what sklearn does internally.
import numpy as np
from typing import Optional, Tuple, Dict
# ── Impurity functions ──────────────────────────────────────────────────────
def compute_entropy(y: np.ndarray) -> float:
"""Shannon entropy of a label array (bits)."""
n = len(y)
if n == 0:
return 0.0
counts = np.bincount(y)
probs = counts[counts > 0] / n # drop zero-count classes
return float(-np.sum(probs * np.log2(probs)))
def compute_gini(y: np.ndarray) -> float:
"""Gini impurity of a label array."""
n = len(y)
if n == 0:
return 0.0
counts = np.bincount(y)
probs = counts / n
return float(1.0 - np.sum(probs ** 2))
def weighted_impurity(
y_left: np.ndarray,
y_right: np.ndarray,
criterion: str = "gini",
) -> float:
"""Weighted average impurity after a binary split."""
n = len(y_left) + len(y_right)
fn = compute_gini if criterion == "gini" else compute_entropy
return (len(y_left) / n) * fn(y_left) + (len(y_right) / n) * fn(y_right)
def impurity_reduction(
y: np.ndarray,
y_left: np.ndarray,
y_right: np.ndarray,
criterion: str = "gini",
) -> float:
"""Impurity reduction achieved by a split."""
fn = compute_gini if criterion == "gini" else compute_entropy
return fn(y) - weighted_impurity(y_left, y_right, criterion)
# ── All splits for a single feature ─────────────────────────────────────────
def all_splits_for_feature(
X: np.ndarray,
y: np.ndarray,
feature_idx: int,
criterion: str = "gini",
) -> list:
"""
Compute impurity reduction for every candidate threshold on one feature.
Returns a sorted list of (threshold, gain) pairs descending by gain.
"""
col = X[:, feature_idx]
unique_vals = np.unique(col)
if len(unique_vals) == 1:
return []
# Candidate thresholds: midpoints between consecutive sorted unique values
thresholds = (unique_vals[:-1] + unique_vals[1:]) / 2.0
results = []
for t in thresholds:
mask = col <= t
y_left, y_right = y[mask], y[~mask]
if len(y_left) == 0 or len(y_right) == 0:
continue
gain = impurity_reduction(y, y_left, y_right, criterion)
results.append((float(t), float(gain)))
return sorted(results, key=lambda x: x[1], reverse=True)
# ── Best split search across all features (CART algorithm) ──────────────────
def best_split(
X: np.ndarray,
y: np.ndarray,
criterion: str = "gini",
) -> Tuple[Optional[int], Optional[float], float]:
"""
Exhaustive search for the feature and threshold that maximise
impurity reduction.
Returns
-------
best_feature : int - column index of the winning feature
best_threshold: float - split threshold (left gets <= threshold)
best_gain : float - impurity reduction achieved
"""
n_samples, n_features = X.shape
best_feature, best_threshold, best_gain = None, None, 0.0
for feature_idx in range(n_features):
splits = all_splits_for_feature(X, y, feature_idx, criterion)
if splits and splits[0][1] > best_gain:
best_gain = splits[0][1]
best_threshold = splits[0][0]
best_feature = feature_idx
return best_feature, best_threshold, best_gain
# ── Side-by-side comparison across criteria ──────────────────────────────────
def compare_splits(X: np.ndarray, y: np.ndarray, feature_names=None) -> None:
"""Print the best split for each criterion side-by-side."""
print(f"{'Criterion':<12} {'Feature':<18} {'Threshold':<12} {'Gain':<10}")
print("-" * 56)
for crit in ["gini", "entropy"]:
feat, thresh, gain = best_split(X, y, criterion=crit)
fname = feature_names[feat] if feature_names and feat is not None else str(feat)
print(f"{crit:<12} {fname:<18} {thresh:<12.4f} {gain:<10.4f}")
# ── Demonstration ─────────────────────────────────────────────────────────────
if __name__ == "__main__":
rng = np.random.default_rng(42)
# 300 patients; feature 0=num_medications (0-20), feature 1=age (0-90)
X = np.column_stack([
rng.integers(0, 21, size=300).astype(float), # num_medications
rng.integers(30, 90, size=300).astype(float), # age
])
# Readmitted if many medications AND older age
y = ((X[:, 0] > 12) & (X[:, 1] > 60)).astype(int)
print(f"Class distribution: {np.bincount(y)}")
print(f"Root entropy: {compute_entropy(y):.4f} bits")
print(f"Root Gini: {compute_gini(y):.4f}")
print()
compare_splits(X, y, feature_names=["num_medications", "age"])
print("\n--- All splits for num_medications (top 5) ---")
for thresh, gain in all_splits_for_feature(X, y, 0, "gini")[:5]:
print(f" threshold={thresh:.1f} gini_reduction={gain:.4f}")
Sklearn Equivalent
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import make_classification
X, y = make_classification(
n_samples=500, n_features=4, n_informative=2, random_state=42
)
for criterion in ["gini", "entropy"]:
clf = DecisionTreeClassifier(max_depth=3, criterion=criterion, random_state=42)
clf.fit(X, y)
print(f"\n=== {criterion.upper()} ===")
print(export_text(clf, feature_names=[f"feature_{i}" for i in range(4)]))
Continuous Feature Splitting: Sort + Midpoints
A critical implementation detail: for a feature with unique values, there are exactly candidate thresholds, placed at the midpoints between consecutive sorted values. This avoids boundary condition bugs and ensures the threshold correctly separates adjacent groups.
def continuous_feature_thresholds(values: np.ndarray) -> np.ndarray:
"""
Compute midpoint thresholds between sorted unique values.
For values [1, 3, 5, 8], returns [2.0, 4.0, 6.5].
"""
unique_sorted = np.unique(values)
return (unique_sorted[:-1] + unique_sorted[1:]) / 2.0
# Why midpoints work:
# - Threshold at 2.0: splits {1} from {3, 5, 8} - clean boundary
# - Threshold at 3.0: might ambiguously classify value=3
# - Midpoints guarantee: all <=t go left, all >t go right, no ties at threshold
# Example with medical data:
meds = np.array([4, 4, 6, 8, 8, 11, 12, 14, 15, 18], dtype=float)
print("Candidate thresholds:", continuous_feature_thresholds(meds))
# Output: [ 5. 7. 9.5 11.5 13. 14.5 16.5]
# 7 thresholds for 8 unique values
Regression Trees: Variance Reduction
For continuous targets, there are no class proportions. A node is "pure" when all target values are identical - meaning zero variance. The splitting criterion becomes variance reduction (equivalent to MSE reduction):
The leaf prediction is the mean of all training targets that reach that leaf, which minimizes the within-leaf MSE.
def variance_reduction(
y: np.ndarray,
y_left: np.ndarray,
y_right: np.ndarray,
) -> float:
"""Variance reduction for a regression tree split (= MSE reduction)."""
n = len(y)
def weighted_var(subset: np.ndarray) -> float:
return (len(subset) / n) * float(np.var(subset)) if len(subset) > 0 else 0.0
return float(np.var(y)) - weighted_var(y_left) - weighted_var(y_right)
def best_regression_split(X: np.ndarray, y: np.ndarray) -> Dict:
"""CART split for regression: maximize variance reduction."""
n_features = X.shape[1]
best: Dict = {"feature": None, "threshold": None, "gain": 0.0}
for fi in range(n_features):
col = X[:, fi]
thresholds = continuous_feature_thresholds(col)
for t in thresholds:
mask = col <= t
yl, yr = y[mask], y[~mask]
if len(yl) == 0 or len(yr) == 0:
continue
gain = variance_reduction(y, yl, yr)
if gain > best["gain"]:
best = {"feature": fi, "threshold": float(t), "gain": float(gain)}
return best
# Example: predict hospital length of stay (continuous)
rng = np.random.default_rng(42)
X_reg = rng.random((200, 3))
y_reg = 3.0 * X_reg[:, 0] + 1.5 * X_reg[:, 1] + rng.normal(0, 0.2, 200)
result = best_regression_split(X_reg, y_reg)
print(f"Best regression split: feature={result['feature']}, "
f"threshold={result['threshold']:.4f}, "
f"variance_reduction={result['gain']:.4f}")
In sklearn, DecisionTreeRegressor uses criterion='squared_error' by default. The option criterion='friedman_mse' applies Friedman's improvement, which weights the MSE by the harmonic mean of node sizes and can find better splits on smaller datasets.
The Multi-Valued Feature Bias
This is the root cause of why num_medications dominates your healthcare tree. Information gain has a structural preference for features with many distinct values.
Consider a pathological extreme: a patient ID feature with one unique value per row. A split on patient ID can route single patients to individual leaves, perfectly zeroing out entropy in each child. Every split on patient ID achieves maximum IG. Yet patient ID is completely useless for generalization.
Continuous numeric features with many unique values suffer a milder version of the same bias. A feature with 20 possible values has roughly 19 threshold candidates; a binary feature has just one. Given more candidates, the high-cardinality feature has more opportunities to stumble onto a threshold that looks good in-sample.
Feature candidates available for splitting:
age (0-80, step 5) → ~15 thresholds to try
num_medications (0-25) → ~24 thresholds to try <- more shots at a good split
insurance_type (binary) → 1 threshold to try
More threshold candidates = more chances to find an
in-sample split that doesn't generalize.
The optimizer is not cheating - it is doing its job.
The bias is structural, not a bug.
:::warning This bias is not fixed by Gini Gini impurity reduction suffers from the same multi-valued feature bias as information gain. Both take the maximum gain across all candidate thresholds - so features with more candidates have an inherent advantage. The fix requires normalizing by the feature's own entropy, which is what Gain Ratio does. :::
Gain Ratio: The C4.5 Fix
Quinlan's C4.5 algorithm (1993) normalizes information gain by the intrinsic information (split entropy) of the feature itself. The intuition: a feature that creates many balanced subsets is doing something expensive - it is using a lot of the tree's branching budget. That cost should be charged against its IG.
This is the entropy of the distribution of examples across the feature's values. For a binary split that sends 50% left and 50% right, bit. For a split that sends 90% one way and 10% the other, bits.
A feature that creates many small, balanced subsets has high - its raw IG gets divided by a large number, penalizing it relative to simpler features. A binary feature with a balanced split has - minimal penalty.
def intrinsic_information(
y: np.ndarray,
y_left: np.ndarray,
y_right: np.ndarray,
) -> float:
"""
Entropy of the split itself (sizes of children, not their labels).
Called 'SplitInfo' in C4.5.
"""
n = len(y)
p_left = len(y_left) / n
p_right = len(y_right) / n
split_entropy = 0.0
for p in [p_left, p_right]:
if p > 0:
split_entropy -= p * np.log2(p)
return split_entropy
def gain_ratio(
X: np.ndarray,
y: np.ndarray,
feature_idx: int,
threshold: float,
) -> float:
"""Gain Ratio for a single binary split on a continuous feature."""
col = X[:, feature_idx]
mask = col <= threshold
y_left, y_right = y[mask], y[~mask]
if len(y_left) == 0 or len(y_right) == 0:
return 0.0
ig = impurity_reduction(y, y_left, y_right, criterion="entropy")
split_info = intrinsic_information(y, y_left, y_right)
# Avoid division by zero (perfectly imbalanced split)
return ig / split_info if split_info > 0 else 0.0
def best_split_gain_ratio(X: np.ndarray, y: np.ndarray) -> Dict:
"""Find the split that maximizes Gain Ratio across all features/thresholds."""
n_features = X.shape[1]
best: Dict = {"feature": None, "threshold": None, "gr": 0.0}
for fi in range(n_features):
col = X[:, fi]
thresholds = continuous_feature_thresholds(col)
for t in thresholds:
gr = gain_ratio(X, y, fi, t)
if gr > best["gr"]:
best = {"feature": fi, "threshold": float(t), "gr": float(gr)}
return best
# Compare IG vs GR on the medication dataset
rng = np.random.default_rng(42)
X_demo = np.column_stack([
rng.integers(0, 21, size=300).astype(float), # num_medications (high cardinality)
rng.integers(0, 3, size=300).astype(float), # insurance_type (low cardinality)
rng.integers(30, 90, size=300).astype(float), # age (medium cardinality)
])
y_demo = ((X_demo[:, 0] > 12) & (X_demo[:, 2] > 60)).astype(int)
print("=== Information Gain ranking ===")
compare_splits(X_demo, y_demo, feature_names=["medications", "insurance", "age"])
print("\n=== Gain Ratio ranking ===")
best_gr = best_split_gain_ratio(X_demo, y_demo)
print(f"Best GR split: feature={best_gr['feature']}, "
f"threshold={best_gr['threshold']:.1f}, GR={best_gr['gr']:.4f}")
:::note sklearn does not implement Gain Ratio
DecisionTreeClassifier provides raw Gini or entropy, not Gain Ratio. If multi-valued feature bias concerns you in production, address it through feature engineering (bin continuous features into ordinal categories) or use Random Forests, where random feature subsampling at each node indirectly reduces the bias by ensuring high-cardinality features do not always compete.
:::
Mean Decrease Impurity (MDI) Feature Importance
sklearn computes feature importance as the total impurity reduction contributed by a feature across all nodes where it is used, weighted by the fraction of samples passing through each node:
where is the number of samples at node and is the impurity reduction at that node. Nodes near the root receive far higher weight because there.
from sklearn.tree import DecisionTreeClassifier
from sklearn.inspection import permutation_importance
import numpy as np
rng = np.random.default_rng(42)
X_imp = np.column_stack([
rng.integers(0, 21, size=500).astype(float), # medications (high card)
rng.integers(30, 90, size=500).astype(float), # age
rng.integers(0, 3, size=500).astype(float), # insurance (low card)
])
y_imp = ((X_imp[:, 0] > 12) & (X_imp[:, 1] > 60)).astype(int)
# Split
from sklearn.model_selection import train_test_split
X_tr, X_val, y_tr, y_val = train_test_split(X_imp, y_imp, test_size=0.3, random_state=42)
clf = DecisionTreeClassifier(max_depth=6, random_state=42)
clf.fit(X_tr, y_tr)
# MDI (built-in, biased)
print("MDI Feature Importance (biased by cardinality):")
for i, imp in enumerate(clf.feature_importances_):
print(f" Feature {i}: {imp:.4f}")
# Permutation importance (unbiased, measured on validation set)
perm_result = permutation_importance(clf, X_val, y_val, n_repeats=15, random_state=42)
print("\nPermutation Importance (unbiased):")
for i, (mean, std) in enumerate(zip(perm_result.importances_mean,
perm_result.importances_std)):
print(f" Feature {i}: {mean:.4f} +/- {std:.4f}")
Criterion Comparison Table
| Property | Entropy (IG) | Gini Impurity | Gain Ratio | Variance Reduction |
|---|---|---|---|---|
| Primary use | Classification | Classification | Classification | Regression |
| Formula | ||||
| Requires log | Yes | No | Yes | No |
| Speed | Slower | Faster | Slowest | Fast |
| Multi-valued feature bias | Yes | Yes | No (corrected) | Yes |
| sklearn default | No | Yes (classification) | No | Yes (regression) |
| Penalizes near-0.5 impurity | More strongly | Less strongly | More strongly | N/A |
| Max value (binary problem) | 1.0 bit | 0.5 | ~1.0 (normalized) | dataset variance |
| Used in | ID3, C4.5 | CART, sklearn | C4.5 | CART, sklearn |
| Split agreement (vs Gini) | ~99% | - | ~95% | N/A |
Production Engineering Notes
Class Imbalance and Impurity Metrics
In imbalanced datasets - hospital readmission data is commonly 85-90% negative - both Gini and entropy can be misleading. A node that predicts "no readmission" for every patient achieves low impurity simply because the majority class dominates. The algorithm will enthusiastically create "pure" leaves that are pure only because they contain almost exclusively the majority class.
from sklearn.tree import DecisionTreeClassifier
from sklearn.inspection import permutation_importance
# Fix 1: Balanced class weights
# Internally multiplies minority samples by n/(K * n_minority)
clf_balanced = DecisionTreeClassifier(
criterion="gini",
max_depth=5,
class_weight="balanced", # upweights minority class in impurity
random_state=42,
)
clf_balanced.fit(X_tr, y_tr)
# Fix 2: Custom sample weights
# Use when imbalance varies across subgroups or time periods
sample_weights = np.where(y_tr == 1, 5.0, 1.0) # 5x weight on positives
clf_custom = DecisionTreeClassifier(max_depth=5, random_state=42)
clf_custom.fit(X_tr, y_tr, sample_weight=sample_weights)
Practical Criterion Selection
In production, the criterion choice rarely changes final model performance by more than 0.5%. The actionable guidance:
- Start with
criterion='gini'(sklearn default) - faster, equivalent results in almost all cases. - If your dataset has very many classes (), entropy may handle boundaries slightly better due to its steeper penalization of impurity near .
- For regression,
criterion='squared_error'is almost always correct. Trycriterion='friedman_mse'on datasets smaller than 10,000 samples. - If a single feature dominates importance scores, investigate multi-valued bias before blaming the criterion. Check that feature's cardinality and compare MDI against permutation importance.
- Never tune the criterion as your primary hyperparameter. Tune
max_depth,min_samples_leaf, andccp_alpha- those levers control overfitting and matter far more.
:::danger Never report MDI importance without checking for cardinality bias
Mean Decrease Impurity (sklearn's feature_importances_) is systematically biased toward high-cardinality features. Before presenting feature importance to stakeholders, always cross-check with permutation importance on a held-out validation set. A feature that looks 3x more important than others in MDI may actually be only 1.1x more important by permutation - the rest is cardinality inflation.
:::
YouTube Resources
| Video | Channel | Why Watch It |
|---|---|---|
| Decision Trees - Information Gain, Gini Impurity | StatQuest with Josh Starmer | Best visual intuition for entropy and Gini, 20 min, essential starting point |
| Decision Tree Classification in Python | Krish Naik | End-to-end sklearn with impurity comparison and real dataset |
| Information Theory Basics | Crash Course AI | Shannon entropy from first principles, accessible and visual |
| CART vs ID3 vs C4.5 | Normalized Nerd | Compares the three foundational tree algorithms side by side |
| Bias in Feature Importance | Towards Data Science | MDI bias with high-cardinality features explained clearly |
Interview Questions and Answers
Q1: Explain the difference between information gain and Gini impurity. When would you use each?
Information gain measures entropy reduction: where . It requires computing logarithms. Gini impurity measures expected misclassification probability: , requiring only squaring. Both criteria identify the same best split approximately 99% of the time on real-world data. Use Gini (sklearn's default) for speed. Use entropy when you want to align with information-theoretic analysis or when the dataset has very many classes () where entropy's steeper curve near may slightly improve boundary placement. In practice, always tune depth and regularization parameters rather than the criterion.
Q2: Why does information gain have a bias toward high-cardinality features, and how does Gain Ratio fix it?
Information gain rewards splits that produce many small, pure subsets. A feature with 100 unique values has 99 threshold candidates to try, while a binary feature has one. The high-cardinality feature has many more chances to find a split that looks good in-sample by pure statistical chance, even if it does not generalize. Gain Ratio corrects this by dividing IG by the intrinsic information of the split itself: , where is large when the feature creates many balanced subsets. Features that exploit cardinality are penalized proportionally. sklearn does not implement Gain Ratio - address this bias with feature binning or use Random Forests where per-node random feature sampling reduces the bias indirectly.
Q3: Prove that information gain is always non-negative.
Entropy is a concave function of the class probability vector. Jensen's inequality states that for a concave function : . The parent entropy is the entropy of the mixture distribution over all child subsets. The weighted average of children's entropies is the expected entropy after conditioning on feature . By Jensen's inequality applied to the concave entropy function: , so . Splitting can only reduce expected entropy, never increase it.
Q4: What happens to Gini impurity in a perfectly balanced multi-class problem with K classes?
For equally likely classes, for all . Then . For : . For : . For : . Gini approaches 1.0 as the number of equally probable classes grows. The entropy in this case is , which also grows but logarithmically. For multi-class problems with many balanced classes, both metrics behave sensibly but they have different scales - Gini approaches 1.0 linearly while entropy grows logarithmically.
Q5: In the hospital readmission example, num_medications dominates the tree. How would you investigate and fix this?
First, compare MDI feature importance (sklearn's feature_importances_) against permutation importance (sklearn.inspection.permutation_importance on a held-out validation set). If permutation importance for num_medications is much lower than MDI importance, that confirms the multi-valued bias - the feature dominated split selection because it had many threshold candidates, not because it is genuinely more predictive. Fix options: (1) Bin num_medications into 4-5 ordinal categories (low/medium/high/very high) to reduce cardinality while preserving information; (2) Use Random Forest with max_features='sqrt' - at each node, only a random subset of features compete, so num_medications cannot dominate every split; (3) Compare SHAP values - SHAP correctly attributes credit for interactions rather than rewarding features for appearing at many nodes.
Q6: How does CART handle continuous features differently from the original ID3 algorithm?
ID3 was designed for categorical features and created multi-way splits - one branch per category value. For a feature with 20 values, ID3 creates 20 branches at that node, making trees extremely wide and prone to overfitting (essentially memorizing training data category by category). CART always makes binary splits: it finds the single threshold that maximizes the impurity reduction, creating exactly two child nodes (left: , right: ). Binary splits work naturally with continuous features, allow the same feature to be reused at different thresholds at different depths, and produce more compact trees. The CART approach is what sklearn implements and is the foundation for all modern gradient boosting libraries.
Q7: A colleague claims that switching from Gini to entropy significantly improved their model. What is actually going on?
It is almost certainly a coincidence from randomness in evaluation (different train/test splits, stochastic training, or hyperparameter interactions). The two criteria agree on the same split approximately 99% of the time. Any observed improvement from switching criteria is almost certainly less than what would be gained from proper cross-validation, tuning max_depth, adjusting min_samples_leaf, or adding better features. If a team is spending time tuning the criterion, they are optimizing the wrong lever. The correct response is to run a proper cross-validated comparison with many random seeds and verify the difference is statistically significant - it almost never is.
Production: Full Split-Finding Implementation
The following is a production-quality implementation of CART split finding with correct handling of both classification and regression targets, multiple impurity measures, and categorical feature support:
import numpy as np
from dataclasses import dataclass
from typing import Optional, Tuple
@dataclass
class SplitResult:
feature_idx: int
threshold: float
impurity_decrease: float
left_mask: np.ndarray
right_mask: np.ndarray
def gini(y: np.ndarray) -> float:
"""Gini impurity for a node."""
if len(y) == 0:
return 0.0
classes, counts = np.unique(y, return_counts=True)
probs = counts / len(y)
return 1.0 - np.sum(probs ** 2)
def entropy(y: np.ndarray) -> float:
"""Shannon entropy for a node."""
if len(y) == 0:
return 0.0
classes, counts = np.unique(y, return_counts=True)
probs = counts / len(y)
probs = probs[probs > 0]
return -np.sum(probs * np.log2(probs))
def mse_impurity(y: np.ndarray) -> float:
"""Variance (MSE from mean) - used for regression trees."""
if len(y) == 0:
return 0.0
return float(np.var(y))
def find_best_split(X: np.ndarray, y: np.ndarray,
criterion: str = 'gini',
min_samples_leaf: int = 1) -> Optional[SplitResult]:
"""
CART split finding: exhaustively search all features and thresholds.
Returns the split with highest weighted impurity decrease.
criterion: 'gini' | 'entropy' | 'mse'
"""
impurity_fn = {'gini': gini, 'entropy': entropy, 'mse': mse_impurity}[criterion]
n_samples, n_features = X.shape
parent_impurity = impurity_fn(y)
best = None
best_decrease = 0.0
for feat_idx in range(n_features):
col = X[:, feat_idx]
# Candidate thresholds: midpoints between consecutive unique sorted values
unique_vals = np.unique(col)
if len(unique_vals) < 2:
continue # no valid split for this feature
thresholds = (unique_vals[:-1] + unique_vals[1:]) / 2.0
for t in thresholds:
left_mask = col <= t
right_mask = ~left_mask
n_left = left_mask.sum()
n_right = right_mask.sum()
# Enforce min_samples_leaf constraint
if n_left < min_samples_leaf or n_right < min_samples_leaf:
continue
# Weighted impurity decrease
left_imp = impurity_fn(y[left_mask])
right_imp = impurity_fn(y[right_mask])
weighted = (n_left * left_imp + n_right * right_imp) / n_samples
decrease = parent_impurity - weighted
if decrease > best_decrease:
best_decrease = decrease
best = SplitResult(
feature_idx=feat_idx,
threshold=t,
impurity_decrease=decrease,
left_mask=left_mask,
right_mask=right_mask,
)
return best
def compute_gain_ratio(X_col: np.ndarray, y: np.ndarray,
threshold: float) -> Tuple[float, float]:
"""
Gain ratio (C4.5): corrects ID3's bias toward high-cardinality features.
GainRatio(A) = InformationGain(A) / SplitInformation(A)
SplitInformation penalizes features that split data into many small groups.
"""
n = len(y)
left = y[X_col <= threshold]
right = y[X_col > threshold]
# Information gain
parent_h = entropy(y)
child_h = (len(left) * entropy(left) + len(right) * entropy(right)) / n
ig = parent_h - child_h
# Split information: entropy of the split itself
p_left = len(left) / n
p_right = len(right) / n
split_info = 0.0
if p_left > 0: split_info -= p_left * np.log2(p_left)
if p_right > 0: split_info -= p_right * np.log2(p_right)
gain_ratio = ig / split_info if split_info > 1e-10 else 0.0
return ig, gain_ratio
# ── Example: compare all criteria on the same dataset ────────────────────────
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
iris = load_iris()
X_iris = StandardScaler().fit_transform(iris.data)
y_iris = iris.target
print("Split search on Iris dataset:")
for criterion in ['gini', 'entropy']:
result = find_best_split(X_iris, y_iris, criterion=criterion, min_samples_leaf=5)
if result:
feat_name = iris.feature_names[result.feature_idx]
print(f" {criterion:8s}: best_feature='{feat_name}', "
f"threshold={result.threshold:.3f}, "
f"impurity_decrease={result.impurity_decrease:.4f}")
# Regression tree
from sklearn.datasets import load_diabetes
X_diab, y_diab = load_diabetes(return_X_y=True)
result_reg = find_best_split(X_diab, y_diab, criterion='mse', min_samples_leaf=10)
if result_reg:
print(f"\n MSE split: feature_idx={result_reg.feature_idx}, "
f"threshold={result_reg.threshold:.4f}, "
f"variance_reduction={result_reg.impurity_decrease:.2f}")
Impurity Criteria Visualization
import numpy as np
import matplotlib.pyplot as plt
def plot_impurity_comparison():
"""Plot Gini, entropy, and misclassification error as functions of p (binary class)."""
p = np.linspace(0, 1, 200)
p = np.clip(p, 1e-10, 1 - 1e-10)
gini_vals = 2 * p * (1 - p) # = 1 - p² - (1-p)²
entropy_vals = -(p * np.log2(p) + (1-p) * np.log2(1-p)) # Shannon entropy
misclass_vals = 1 - np.maximum(p, 1-p) # 1 - max(p, 1-p)
# Normalize entropy to [0, 1] for comparison (max entropy = 1 bit for binary)
# entropy range is already [0, 1] for binary
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].plot(p, gini_vals, label='Gini: $2p(1-p)$', linewidth=2, color='#3b82f6')
axes[0].plot(p, entropy_vals, label='Entropy: $H(p)$ (bits)', linewidth=2, color='#7c3aed')
axes[0].plot(p, misclass_vals, label='Misclassification: $1-\max(p,1-p)$', linewidth=2,
color='#dc2626', linestyle='--')
axes[0].set_xlabel("Proportion of class 1 (p)", fontsize=12)
axes[0].set_ylabel("Impurity Measure", fontsize=12)
axes[0].set_title("Impurity Measures for Binary Classification", fontsize=13)
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)
# Ratio: entropy / Gini - shows when they differ
ratio = entropy_vals / (gini_vals + 1e-10)
axes[1].plot(p, ratio, linewidth=2, color='#16a34a')
axes[1].axhline(y=ratio.mean(), color='gray', linestyle='--',
label=f'Mean ratio = {ratio.mean():.3f}')
axes[1].set_xlabel("p", fontsize=12)
axes[1].set_ylabel("Entropy / Gini ratio", fontsize=12)
axes[1].set_title("When Entropy and Gini Disagree", fontsize=13)
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.suptitle("Splitting Criteria Comparison - All Agree Near p=0 and p=1", fontsize=12)
plt.tight_layout()
plt.show()
# Key observation: entropy / Gini ≈ constant across most of [0,1]
# Maximum disagreement near p=0.2 and p=0.8
print(f"Max entropy/gini ratio: {ratio.max():.4f} at p = {p[np.argmax(ratio)]:.3f}")
print(f"Min entropy/gini ratio: {ratio.min():.4f} at p = {p[np.argmin(ratio)]:.3f}")
print("→ The two criteria differ by at most ~43% in raw value,")
print(" but almost always rank splits the same way.")
plot_impurity_comparison()
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Entropy Explorer demo on the EngineersOfAI Playground - no code required.
:::
