Skip to main content

Information Gain, Gini Impurity, and Entropy

The Production Scenario

You are a data scientist at a large regional hospital network. The clinical team has tasked you with building a 30-day readmission prediction model - if you can flag high-risk patients before discharge, care coordinators can intervene and reduce costly readmissions. You pull together a dataset: 40,000 discharge records, features like age, primary diagnosis, number of medications, length of stay, prior admissions, and insurance type.

You train a DecisionTreeClassifier and inspect the resulting tree. Something immediately stands out: the root split is on num_medications > 11, the second level splits are both on num_medications > 8 and num_medications > 14, and the third level still leans heavily on medications. The tree is almost entirely built around a single feature. The clinical team asks the obvious question: is this medically meaningful, or is the algorithm infatuated with one variable?

To answer that question, you need to understand exactly what the tree optimizer is doing when it picks a split. It is not guessing, and it is not random. It computes a score for every possible split on every possible feature and threshold, then picks the one with the highest score. That score is either information gain (if criterion='entropy') or Gini impurity reduction (if criterion='gini', the default). Understanding how those scores are computed tells you precisely why num_medications keeps winning - it has many distinct values, giving it more threshold candidates and more opportunities to find a clean separation.

This lesson builds the full mathematical foundation. You will derive entropy and Gini from first principles, compute them by hand on a toy example, implement them in NumPy from scratch, and understand their practical differences. You will also learn about the multi-valued feature bias - the root cause of your medication feature's dominance - and the Gain Ratio fix introduced by the C4.5 algorithm. By the end, you can look at any tree split and explain exactly why the algorithm chose it.

Why This Exists: The Problem of "Purity"

Before information gain, early tree algorithms like CHAID (1980) used statistical tests like chi-square to evaluate splits - a sensible approach, but one that made it hard to build trees consistently and efficiently. The ID3 algorithm by Ross Quinlan in 1986 introduced a cleaner idea: frame the split selection problem as an information-theoretic optimization. Instead of asking "is this split statistically significant?", ask "how much uncertainty does this split resolve?"

The key insight is that a pure node - one where all examples belong to the same class - is maximally predictable and has zero uncertainty. An impure node - where classes are mixed - is uncertain and we need more splits to resolve it. The splitting criterion measures impurity, and we pick the split that reduces it the most.

This framing works beautifully. It gives us a single scalar to optimize, it handles any number of classes naturally, and it connects to the deep mathematical theory of information. The only question is: which impurity measure should we use?

Historical Context

1948 - Shannon Entropy. Claude Shannon's landmark paper "A Mathematical Theory of Communication" defined entropy as a measure of information content in a probability distribution. Shannon was thinking about communication channels, not decision trees - but the mathematics transferred perfectly. His entropy formula became the foundation for information-theoretic measures in machine learning.

1986 - ID3 (Quinlan). Ross Quinlan introduced the ID3 algorithm, which used Shannon entropy and information gain as the splitting criterion. ID3 was the first practical tree algorithm that could handle multi-way splits on categorical features efficiently. The "aha moment" was realizing that information gain directly measures how much uncertainty a split eliminates.

1993 - C4.5 (Quinlan). Quinlan improved ID3 with C4.5, which added support for continuous features, missing values, and - critically - the Gain Ratio correction that addresses information gain's bias toward high-cardinality features. C4.5 became the algorithm behind the iconic J48 implementation still used today.

1984 - CART (Breiman et al.). Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone introduced CART: Classification and Regression Trees. CART introduced Gini impurity as an alternative to entropy, and mandated binary splits only (never multi-way). This design makes CART faster and better suited for ensembles. Sklearn's DecisionTreeClassifier implements CART.

1997 - Friedman MSE. For regression trees, Jerome Friedman proposed a weighted MSE criterion that finds splits more efficiently on small datasets by separately weighting left and right children - a small but meaningful practical improvement.

Shannon Entropy: Measuring Impurity as Surprise

The Information-Theoretic Root

Shannon's entropy captures the intuition that surprise = information. If you know a coin always lands heads, observing "heads" gives you zero information - you already knew. If the coin is fair, each flip tells you one full bit. Entropy captures this: high entropy means high uncertainty, means more information per observation.

For a node SS containing examples from KK classes, where pkp_k is the fraction of examples belonging to class kk:

H(S)=k=1Kpklog2pkH(S) = -\sum_{k=1}^{K} p_k \log_2 p_k

The convention 0log20=00 \cdot \log_2 0 = 0 handles empty classes - a class with zero examples contributes nothing to the sum. Think of it as a limit: as p0p \to 0, plogp0p \log p \to 0.

Entropy Properties - Derived from First Principles

Minimum (zero): When all examples belong to one class, pk=1p_k = 1 for one kk and pj=0p_j = 0 for all others. Then H=(1log21)=(10)=0H = -(1 \cdot \log_2 1) = -(1 \cdot 0) = 0. A perfectly pure node has zero entropy - no uncertainty, no information needed.

Maximum for binary case: For a binary problem, HH is maximized at p=0.5p = 0.5, where H=(0.5log20.5+0.5log20.5)=(0.5(1)+0.5(1))=1.0H = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = -(0.5 \cdot (-1) + 0.5 \cdot (-1)) = 1.0 bit. A perfectly mixed 50-50 node requires exactly one bit to specify which class an example belongs to.

Maximum for KK classes: With KK classes, the maximum entropy is log2K\log_2 K, achieved when all classes are equally likely (pk=1/Kp_k = 1/K for all kk). With 4 classes equally balanced, maximum entropy is log24=2\log_2 4 = 2 bits.

Concavity: Entropy is a strictly concave function of the class probability vector. This is the key mathematical property that guarantees splitting always helps: by Jensen's inequality, the weighted average entropy of children is always at most the parent's entropy. Splitting cannot increase uncertainty.

Binary Entropy H(p) vs class probability p
H(p)
1.0 | * <- maximum at p=0.5 (1.0 bit)
| * *
0.8 | * *
| * *
0.6 | * *
| * *
0.4 | * *
0.2 |* *
0.0 +--+---+---+---+---+---+-> p
0 0.2 0.4 0.6 0.8 1.0
(pure) (mixed: max H) (pure)

Worked Example: Entropy at a Node

Suppose a node has 10 patients: 3 were readmitted (positive), 7 were not (negative).

p+=0.3,p=0.7p_+ = 0.3, \quad p_- = 0.7

H(S)=(0.3log20.3+0.7log20.7)H(S) = -(0.3 \log_2 0.3 + 0.7 \log_2 0.7)

=(0.3×(1.737)+0.7×(0.515))= -(0.3 \times (-1.737) + 0.7 \times (-0.515))

=(0.5210.360)=0.881 bits= -(-0.521 - 0.360) = 0.881 \text{ bits}

This node is fairly impure - a classifier at this node can only do somewhat better than random. Compare to a perfectly pure node (p+=0p_+ = 0 or p+=1p_+ = 1): H=0H = 0 bits. Or a maximally impure node (p+=0.5p_+ = 0.5): H=1.0H = 1.0 bit.

Information Gain: The Splitting Criterion

Derivation from First Principles

A split on feature AA partitions the dataset SS into subsets SvS_v - one per distinct value vv of AA. For continuous features (as in CART), this means exactly two subsets: \leq threshold and >> threshold. The weighted average entropy after the split is:

HA(S)=vSvSH(Sv)H_A(S) = \sum_{v} \frac{|S_v|}{|S|} H(S_v)

The weights SvS\frac{|S_v|}{|S|} reflect how many samples flow into each child node - a child receiving 80% of the data matters 4x more than one receiving 20%.

Information Gain is the reduction in entropy that the split achieves:

IG(S,A)=H(S)vSvSH(Sv)IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)

The tree greedily picks the feature and threshold that maximizes IGIG. By Jensen's inequality applied to the concave entropy function, IG0IG \geq 0 always - a split cannot increase expected entropy.

Full Numerical Example

Consider 14 training examples structured as a binary split on medications > 10:

SubsetCountReadmitted (Yes)Not Readmitted (No)
Root1477
Left (meds > 10)642
Right (meds ≤ 10)835

Root entropy:

H(S)=(0.5log20.5+0.5log20.5)=1.0 bitH(S) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1.0 \text{ bit}

Left child (4 Yes, 2 No):

p+=4/6=0.667,p=2/6=0.333p_+ = 4/6 = 0.667, \quad p_- = 2/6 = 0.333

H(SL)=(0.667log20.667+0.333log20.333)=0.918 bitsH(S_L) = -(0.667 \log_2 0.667 + 0.333 \log_2 0.333) = 0.918 \text{ bits}

Right child (3 Yes, 5 No):

p+=3/8=0.375,p=5/8=0.625p_+ = 3/8 = 0.375, \quad p_- = 5/8 = 0.625

H(SR)=(0.375log20.375+0.625log20.625)=0.954 bitsH(S_R) = -(0.375 \log_2 0.375 + 0.625 \log_2 0.625) = 0.954 \text{ bits}

Information Gain:

IG=1.0(614×0.918+814×0.954)IG = 1.0 - \left(\frac{6}{14} \times 0.918 + \frac{8}{14} \times 0.954\right)

=1.0(0.393+0.545)=1.00.938=0.062 bits= 1.0 - (0.393 + 0.545) = 1.0 - 0.938 = 0.062 \text{ bits}

This split delivers 0.062 bits of gain. The tree compares this against every other feature and threshold and picks the maximum.

Gini Impurity: The Computationally Efficient Alternative

Derivation from Misclassification Probability

Gini impurity has a clean probabilistic interpretation: the probability of misclassifying a randomly chosen element if it were labeled according to the class distribution in the node.

Pick an element at random - it belongs to class kk with probability pkp_k. Assign it a label by sampling from the same distribution - you assign class jj with probability pjp_j. The probability of a mismatch is:

G(S)=kpkjkpj=kpk(1pk)=1kpk2G(S) = \sum_{k} p_k \sum_{j \neq k} p_j = \sum_{k} p_k (1 - p_k) = 1 - \sum_k p_k^2

For a binary problem with p+=pp_+ = p and p=1pp_- = 1-p:

G=1(p2+(1p)2)=1p21+2pp2=2p(1p)G = 1 - (p^2 + (1-p)^2) = 1 - p^2 - 1 + 2p - p^2 = 2p(1-p)

Maximized at p=0.5p = 0.5 (value 0.5) and zero at p=0p = 0 or p=1p = 1 - the same boundary behavior as entropy.

Gini Reduction: The Actual Splitting Criterion

The algorithm maximizes the Gini impurity reduction:

ΔG(S,A)=G(S)vSvSG(Sv)\Delta G(S, A) = G(S) - \sum_{v} \frac{|S_v|}{|S|} G(S_v)

Numerical Example: Gini on the Same Dataset

Root node (p+=0.5p_+ = 0.5):

G(S)=1(0.52+0.52)=10.5=0.5G(S) = 1 - (0.5^2 + 0.5^2) = 1 - 0.5 = 0.5

Left child (4 Yes, 2 No, p+=2/3p_+ = 2/3):

G(SL)=1((46)2+(26)2)=1(0.444+0.111)=0.444G(S_L) = 1 - \left(\left(\frac{4}{6}\right)^2 + \left(\frac{2}{6}\right)^2\right) = 1 - (0.444 + 0.111) = 0.444

Right child (3 Yes, 5 No, p+=3/8p_+ = 3/8):

G(SR)=1((38)2+(58)2)=1(0.141+0.391)=0.469G(S_R) = 1 - \left(\left(\frac{3}{8}\right)^2 + \left(\frac{5}{8}\right)^2\right) = 1 - (0.141 + 0.391) = 0.469

Gini Reduction:

ΔG=0.5(614×0.444+814×0.469)=0.5(0.190+0.268)=0.042\Delta G = 0.5 - \left(\frac{6}{14} \times 0.444 + \frac{8}{14} \times 0.469\right) = 0.5 - (0.190 + 0.268) = 0.042

Shape Comparison: Entropy vs Gini

Impurity vs class probability p (binary case)

1.0 | * <- Entropy peak (1.0 bit)
| * *
0.8 | * * Entropy: -p log2(p) - (1-p) log2(1-p)
| * *
0.6 | * *
0.5 | * <- Gini peak (0.5)
| * * * *
| * * Gini: 2p(1-p)
0.0 +--+---+---+---+---> p
0 0.2 0.4 0.6 1.0

Both curves are symmetric around p=0.5p = 0.5 and zero at the extremes. Entropy penalizes impure nodes slightly more strongly - it curves more steeply near 0.5. In practice the two criteria agree on the same best split well over 99% of the time.

:::note Why sklearn defaults to Gini Gini requires no logarithm computation - just squaring and summing. For large datasets with millions of candidate splits, this compute difference is meaningful. Additionally, both criteria produce nearly identical trees on most real-world datasets, so sklearn correctly treats Gini as the default and faster choice. :::

The CART Splitting Algorithm

CART (Classification and Regression Trees) defines a precise algorithm for choosing splits. It always uses binary splits, works with continuous features via threshold search, and exhaustively evaluates every feature-threshold combination at every node.

CART Split Selection Pseudocode
─────────────────────────────────────────────────────────────────
For each feature f in {1, ..., d}:
Sort training instances by feature f value
For each adjacent pair of unique values (v_i, v_{i+1}):
threshold t = (v_i + v_{i+1}) / 2 <- midpoint candidate
Left = {x : x[f] <= t}
Right = {x : x[f] > t}
score = Impurity(parent) - weighted_impurity(Left, Right)
If score > best_score:
best_score = score
best_feature = f
best_threshold = t

Return best_feature, best_threshold

Key design decisions in CART:

  • Always binary splits (left/right only) - simpler trees, better suited for ensembles
  • Midpoint thresholds between adjacent unique values - avoids boundary condition bugs
  • All features evaluated at every node - exhaustive but guaranteed locally optimal
  • No look-ahead - CART is greedy; it picks the best split now, ignoring future levels

Full NumPy Implementation: All Splits for a Feature

The following implementation computes all candidate splits for every feature and reports the best one under both Gini and entropy. This mirrors what sklearn does internally.

import numpy as np
from typing import Optional, Tuple, Dict

# ── Impurity functions ──────────────────────────────────────────────────────

def compute_entropy(y: np.ndarray) -> float:
"""Shannon entropy of a label array (bits)."""
n = len(y)
if n == 0:
return 0.0
counts = np.bincount(y)
probs = counts[counts > 0] / n # drop zero-count classes
return float(-np.sum(probs * np.log2(probs)))


def compute_gini(y: np.ndarray) -> float:
"""Gini impurity of a label array."""
n = len(y)
if n == 0:
return 0.0
counts = np.bincount(y)
probs = counts / n
return float(1.0 - np.sum(probs ** 2))


def weighted_impurity(
y_left: np.ndarray,
y_right: np.ndarray,
criterion: str = "gini",
) -> float:
"""Weighted average impurity after a binary split."""
n = len(y_left) + len(y_right)
fn = compute_gini if criterion == "gini" else compute_entropy
return (len(y_left) / n) * fn(y_left) + (len(y_right) / n) * fn(y_right)


def impurity_reduction(
y: np.ndarray,
y_left: np.ndarray,
y_right: np.ndarray,
criterion: str = "gini",
) -> float:
"""Impurity reduction achieved by a split."""
fn = compute_gini if criterion == "gini" else compute_entropy
return fn(y) - weighted_impurity(y_left, y_right, criterion)


# ── All splits for a single feature ─────────────────────────────────────────

def all_splits_for_feature(
X: np.ndarray,
y: np.ndarray,
feature_idx: int,
criterion: str = "gini",
) -> list:
"""
Compute impurity reduction for every candidate threshold on one feature.
Returns a sorted list of (threshold, gain) pairs descending by gain.
"""
col = X[:, feature_idx]
unique_vals = np.unique(col)
if len(unique_vals) == 1:
return []

# Candidate thresholds: midpoints between consecutive sorted unique values
thresholds = (unique_vals[:-1] + unique_vals[1:]) / 2.0
results = []

for t in thresholds:
mask = col <= t
y_left, y_right = y[mask], y[~mask]
if len(y_left) == 0 or len(y_right) == 0:
continue
gain = impurity_reduction(y, y_left, y_right, criterion)
results.append((float(t), float(gain)))

return sorted(results, key=lambda x: x[1], reverse=True)


# ── Best split search across all features (CART algorithm) ──────────────────

def best_split(
X: np.ndarray,
y: np.ndarray,
criterion: str = "gini",
) -> Tuple[Optional[int], Optional[float], float]:
"""
Exhaustive search for the feature and threshold that maximise
impurity reduction.

Returns
-------
best_feature : int - column index of the winning feature
best_threshold: float - split threshold (left gets <= threshold)
best_gain : float - impurity reduction achieved
"""
n_samples, n_features = X.shape
best_feature, best_threshold, best_gain = None, None, 0.0

for feature_idx in range(n_features):
splits = all_splits_for_feature(X, y, feature_idx, criterion)
if splits and splits[0][1] > best_gain:
best_gain = splits[0][1]
best_threshold = splits[0][0]
best_feature = feature_idx

return best_feature, best_threshold, best_gain


# ── Side-by-side comparison across criteria ──────────────────────────────────

def compare_splits(X: np.ndarray, y: np.ndarray, feature_names=None) -> None:
"""Print the best split for each criterion side-by-side."""
print(f"{'Criterion':<12} {'Feature':<18} {'Threshold':<12} {'Gain':<10}")
print("-" * 56)
for crit in ["gini", "entropy"]:
feat, thresh, gain = best_split(X, y, criterion=crit)
fname = feature_names[feat] if feature_names and feat is not None else str(feat)
print(f"{crit:<12} {fname:<18} {thresh:<12.4f} {gain:<10.4f}")


# ── Demonstration ─────────────────────────────────────────────────────────────

if __name__ == "__main__":
rng = np.random.default_rng(42)

# 300 patients; feature 0=num_medications (0-20), feature 1=age (0-90)
X = np.column_stack([
rng.integers(0, 21, size=300).astype(float), # num_medications
rng.integers(30, 90, size=300).astype(float), # age
])
# Readmitted if many medications AND older age
y = ((X[:, 0] > 12) & (X[:, 1] > 60)).astype(int)

print(f"Class distribution: {np.bincount(y)}")
print(f"Root entropy: {compute_entropy(y):.4f} bits")
print(f"Root Gini: {compute_gini(y):.4f}")
print()
compare_splits(X, y, feature_names=["num_medications", "age"])

print("\n--- All splits for num_medications (top 5) ---")
for thresh, gain in all_splits_for_feature(X, y, 0, "gini")[:5]:
print(f" threshold={thresh:.1f} gini_reduction={gain:.4f}")

Sklearn Equivalent

from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import make_classification

X, y = make_classification(
n_samples=500, n_features=4, n_informative=2, random_state=42
)

for criterion in ["gini", "entropy"]:
clf = DecisionTreeClassifier(max_depth=3, criterion=criterion, random_state=42)
clf.fit(X, y)
print(f"\n=== {criterion.upper()} ===")
print(export_text(clf, feature_names=[f"feature_{i}" for i in range(4)]))

Continuous Feature Splitting: Sort + Midpoints

A critical implementation detail: for a feature with nn unique values, there are exactly n1n-1 candidate thresholds, placed at the midpoints between consecutive sorted values. This avoids boundary condition bugs and ensures the threshold correctly separates adjacent groups.

def continuous_feature_thresholds(values: np.ndarray) -> np.ndarray:
"""
Compute midpoint thresholds between sorted unique values.
For values [1, 3, 5, 8], returns [2.0, 4.0, 6.5].
"""
unique_sorted = np.unique(values)
return (unique_sorted[:-1] + unique_sorted[1:]) / 2.0


# Why midpoints work:
# - Threshold at 2.0: splits {1} from {3, 5, 8} - clean boundary
# - Threshold at 3.0: might ambiguously classify value=3
# - Midpoints guarantee: all <=t go left, all >t go right, no ties at threshold

# Example with medical data:
meds = np.array([4, 4, 6, 8, 8, 11, 12, 14, 15, 18], dtype=float)
print("Candidate thresholds:", continuous_feature_thresholds(meds))
# Output: [ 5. 7. 9.5 11.5 13. 14.5 16.5]
# 7 thresholds for 8 unique values

Regression Trees: Variance Reduction

For continuous targets, there are no class proportions. A node is "pure" when all target values are identical - meaning zero variance. The splitting criterion becomes variance reduction (equivalent to MSE reduction):

MSE(S)=1SiS(yiyˉS)2\text{MSE}(S) = \frac{1}{|S|} \sum_{i \in S} (y_i - \bar{y}_S)^2

ΔMSE(S,A)=MSE(S)vSvSMSE(Sv)\Delta \text{MSE}(S, A) = \text{MSE}(S) - \sum_{v} \frac{|S_v|}{|S|} \text{MSE}(S_v)

The leaf prediction is the mean of all training targets that reach that leaf, which minimizes the within-leaf MSE.

def variance_reduction(
y: np.ndarray,
y_left: np.ndarray,
y_right: np.ndarray,
) -> float:
"""Variance reduction for a regression tree split (= MSE reduction)."""
n = len(y)
def weighted_var(subset: np.ndarray) -> float:
return (len(subset) / n) * float(np.var(subset)) if len(subset) > 0 else 0.0
return float(np.var(y)) - weighted_var(y_left) - weighted_var(y_right)


def best_regression_split(X: np.ndarray, y: np.ndarray) -> Dict:
"""CART split for regression: maximize variance reduction."""
n_features = X.shape[1]
best: Dict = {"feature": None, "threshold": None, "gain": 0.0}

for fi in range(n_features):
col = X[:, fi]
thresholds = continuous_feature_thresholds(col)
for t in thresholds:
mask = col <= t
yl, yr = y[mask], y[~mask]
if len(yl) == 0 or len(yr) == 0:
continue
gain = variance_reduction(y, yl, yr)
if gain > best["gain"]:
best = {"feature": fi, "threshold": float(t), "gain": float(gain)}

return best


# Example: predict hospital length of stay (continuous)
rng = np.random.default_rng(42)
X_reg = rng.random((200, 3))
y_reg = 3.0 * X_reg[:, 0] + 1.5 * X_reg[:, 1] + rng.normal(0, 0.2, 200)

result = best_regression_split(X_reg, y_reg)
print(f"Best regression split: feature={result['feature']}, "
f"threshold={result['threshold']:.4f}, "
f"variance_reduction={result['gain']:.4f}")

In sklearn, DecisionTreeRegressor uses criterion='squared_error' by default. The option criterion='friedman_mse' applies Friedman's improvement, which weights the MSE by the harmonic mean of node sizes and can find better splits on smaller datasets.

The Multi-Valued Feature Bias

This is the root cause of why num_medications dominates your healthcare tree. Information gain has a structural preference for features with many distinct values.

Consider a pathological extreme: a patient ID feature with one unique value per row. A split on patient ID can route single patients to individual leaves, perfectly zeroing out entropy in each child. Every split on patient ID achieves maximum IG. Yet patient ID is completely useless for generalization.

Continuous numeric features with many unique values suffer a milder version of the same bias. A feature with 20 possible values has roughly 19 threshold candidates; a binary feature has just one. Given more candidates, the high-cardinality feature has more opportunities to stumble onto a threshold that looks good in-sample.

Feature candidates available for splitting:

age (0-80, step 5) → ~15 thresholds to try
num_medications (0-25) → ~24 thresholds to try <- more shots at a good split
insurance_type (binary) → 1 threshold to try

More threshold candidates = more chances to find an
in-sample split that doesn't generalize.

The optimizer is not cheating - it is doing its job.
The bias is structural, not a bug.

:::warning This bias is not fixed by Gini Gini impurity reduction suffers from the same multi-valued feature bias as information gain. Both take the maximum gain across all candidate thresholds - so features with more candidates have an inherent advantage. The fix requires normalizing by the feature's own entropy, which is what Gain Ratio does. :::

Gain Ratio: The C4.5 Fix

Quinlan's C4.5 algorithm (1993) normalizes information gain by the intrinsic information (split entropy) of the feature itself. The intuition: a feature that creates many balanced subsets is doing something expensive - it is using a lot of the tree's branching budget. That cost should be charged against its IG.

H(A)=vSvSlog2SvSH(A) = -\sum_{v} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}

This is the entropy of the distribution of examples across the feature's values. For a binary split that sends 50% left and 50% right, H(A)=1.0H(A) = 1.0 bit. For a split that sends 90% one way and 10% the other, H(A)0.47H(A) \approx 0.47 bits.

GR(S,A)=IG(S,A)H(A)\text{GR}(S, A) = \frac{IG(S, A)}{H(A)}

A feature that creates many small, balanced subsets has high H(A)H(A) - its raw IG gets divided by a large number, penalizing it relative to simpler features. A binary feature with a balanced split has H(A)=1.0H(A) = 1.0 - minimal penalty.

def intrinsic_information(
y: np.ndarray,
y_left: np.ndarray,
y_right: np.ndarray,
) -> float:
"""
Entropy of the split itself (sizes of children, not their labels).
Called 'SplitInfo' in C4.5.
"""
n = len(y)
p_left = len(y_left) / n
p_right = len(y_right) / n
split_entropy = 0.0
for p in [p_left, p_right]:
if p > 0:
split_entropy -= p * np.log2(p)
return split_entropy


def gain_ratio(
X: np.ndarray,
y: np.ndarray,
feature_idx: int,
threshold: float,
) -> float:
"""Gain Ratio for a single binary split on a continuous feature."""
col = X[:, feature_idx]
mask = col <= threshold
y_left, y_right = y[mask], y[~mask]

if len(y_left) == 0 or len(y_right) == 0:
return 0.0

ig = impurity_reduction(y, y_left, y_right, criterion="entropy")
split_info = intrinsic_information(y, y_left, y_right)

# Avoid division by zero (perfectly imbalanced split)
return ig / split_info if split_info > 0 else 0.0


def best_split_gain_ratio(X: np.ndarray, y: np.ndarray) -> Dict:
"""Find the split that maximizes Gain Ratio across all features/thresholds."""
n_features = X.shape[1]
best: Dict = {"feature": None, "threshold": None, "gr": 0.0}

for fi in range(n_features):
col = X[:, fi]
thresholds = continuous_feature_thresholds(col)
for t in thresholds:
gr = gain_ratio(X, y, fi, t)
if gr > best["gr"]:
best = {"feature": fi, "threshold": float(t), "gr": float(gr)}

return best


# Compare IG vs GR on the medication dataset
rng = np.random.default_rng(42)
X_demo = np.column_stack([
rng.integers(0, 21, size=300).astype(float), # num_medications (high cardinality)
rng.integers(0, 3, size=300).astype(float), # insurance_type (low cardinality)
rng.integers(30, 90, size=300).astype(float), # age (medium cardinality)
])
y_demo = ((X_demo[:, 0] > 12) & (X_demo[:, 2] > 60)).astype(int)

print("=== Information Gain ranking ===")
compare_splits(X_demo, y_demo, feature_names=["medications", "insurance", "age"])

print("\n=== Gain Ratio ranking ===")
best_gr = best_split_gain_ratio(X_demo, y_demo)
print(f"Best GR split: feature={best_gr['feature']}, "
f"threshold={best_gr['threshold']:.1f}, GR={best_gr['gr']:.4f}")

:::note sklearn does not implement Gain Ratio DecisionTreeClassifier provides raw Gini or entropy, not Gain Ratio. If multi-valued feature bias concerns you in production, address it through feature engineering (bin continuous features into ordinal categories) or use Random Forests, where random feature subsampling at each node indirectly reduces the bias by ensuring high-cardinality features do not always compete. :::

Mean Decrease Impurity (MDI) Feature Importance

sklearn computes feature importance as the total impurity reduction contributed by a feature across all nodes where it is used, weighted by the fraction of samples passing through each node:

MDI(fj)=1Treestreesn:split on jntnΔI(n)\text{MDI}(f_j) = \frac{1}{|\text{Trees}|}\sum_{\text{trees}}\sum_{\substack{n:\,\text{split on }j}} \frac{n_t}{n} \cdot \Delta I(n)

where ntn_t is the number of samples at node tt and ΔI(n)\Delta I(n) is the impurity reduction at that node. Nodes near the root receive far higher weight because ntnn_t \approx n there.

from sklearn.tree import DecisionTreeClassifier
from sklearn.inspection import permutation_importance
import numpy as np

rng = np.random.default_rng(42)
X_imp = np.column_stack([
rng.integers(0, 21, size=500).astype(float), # medications (high card)
rng.integers(30, 90, size=500).astype(float), # age
rng.integers(0, 3, size=500).astype(float), # insurance (low card)
])
y_imp = ((X_imp[:, 0] > 12) & (X_imp[:, 1] > 60)).astype(int)

# Split
from sklearn.model_selection import train_test_split
X_tr, X_val, y_tr, y_val = train_test_split(X_imp, y_imp, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(max_depth=6, random_state=42)
clf.fit(X_tr, y_tr)

# MDI (built-in, biased)
print("MDI Feature Importance (biased by cardinality):")
for i, imp in enumerate(clf.feature_importances_):
print(f" Feature {i}: {imp:.4f}")

# Permutation importance (unbiased, measured on validation set)
perm_result = permutation_importance(clf, X_val, y_val, n_repeats=15, random_state=42)
print("\nPermutation Importance (unbiased):")
for i, (mean, std) in enumerate(zip(perm_result.importances_mean,
perm_result.importances_std)):
print(f" Feature {i}: {mean:.4f} +/- {std:.4f}")

Criterion Comparison Table

PropertyEntropy (IG)Gini ImpurityGain RatioVariance Reduction
Primary useClassificationClassificationClassificationRegression
Formulapklog2pk-\sum p_k \log_2 p_k1pk21 - \sum p_k^2IG/H(A)IG / H(A)MSE(S)weighted MSE\text{MSE}(S) - \text{weighted MSE}
Requires logYesNoYesNo
SpeedSlowerFasterSlowestFast
Multi-valued feature biasYesYesNo (corrected)Yes
sklearn defaultNoYes (classification)NoYes (regression)
Penalizes near-0.5 impurityMore stronglyLess stronglyMore stronglyN/A
Max value (binary problem)1.0 bit0.5~1.0 (normalized)dataset variance
Used inID3, C4.5CART, sklearnC4.5CART, sklearn
Split agreement (vs Gini)~99%-~95%N/A

Production Engineering Notes

Class Imbalance and Impurity Metrics

In imbalanced datasets - hospital readmission data is commonly 85-90% negative - both Gini and entropy can be misleading. A node that predicts "no readmission" for every patient achieves low impurity simply because the majority class dominates. The algorithm will enthusiastically create "pure" leaves that are pure only because they contain almost exclusively the majority class.

from sklearn.tree import DecisionTreeClassifier
from sklearn.inspection import permutation_importance

# Fix 1: Balanced class weights
# Internally multiplies minority samples by n/(K * n_minority)
clf_balanced = DecisionTreeClassifier(
criterion="gini",
max_depth=5,
class_weight="balanced", # upweights minority class in impurity
random_state=42,
)
clf_balanced.fit(X_tr, y_tr)

# Fix 2: Custom sample weights
# Use when imbalance varies across subgroups or time periods
sample_weights = np.where(y_tr == 1, 5.0, 1.0) # 5x weight on positives
clf_custom = DecisionTreeClassifier(max_depth=5, random_state=42)
clf_custom.fit(X_tr, y_tr, sample_weight=sample_weights)

Practical Criterion Selection

In production, the criterion choice rarely changes final model performance by more than 0.5%. The actionable guidance:

  1. Start with criterion='gini' (sklearn default) - faster, equivalent results in almost all cases.
  2. If your dataset has very many classes (K>10K > 10), entropy may handle boundaries slightly better due to its steeper penalization of impurity near p=1/Kp = 1/K.
  3. For regression, criterion='squared_error' is almost always correct. Try criterion='friedman_mse' on datasets smaller than 10,000 samples.
  4. If a single feature dominates importance scores, investigate multi-valued bias before blaming the criterion. Check that feature's cardinality and compare MDI against permutation importance.
  5. Never tune the criterion as your primary hyperparameter. Tune max_depth, min_samples_leaf, and ccp_alpha - those levers control overfitting and matter far more.

:::danger Never report MDI importance without checking for cardinality bias Mean Decrease Impurity (sklearn's feature_importances_) is systematically biased toward high-cardinality features. Before presenting feature importance to stakeholders, always cross-check with permutation importance on a held-out validation set. A feature that looks 3x more important than others in MDI may actually be only 1.1x more important by permutation - the rest is cardinality inflation. :::

YouTube Resources

VideoChannelWhy Watch It
Decision Trees - Information Gain, Gini ImpurityStatQuest with Josh StarmerBest visual intuition for entropy and Gini, 20 min, essential starting point
Decision Tree Classification in PythonKrish NaikEnd-to-end sklearn with impurity comparison and real dataset
Information Theory BasicsCrash Course AIShannon entropy from first principles, accessible and visual
CART vs ID3 vs C4.5Normalized NerdCompares the three foundational tree algorithms side by side
Bias in Feature ImportanceTowards Data ScienceMDI bias with high-cardinality features explained clearly

Interview Questions and Answers

Q1: Explain the difference between information gain and Gini impurity. When would you use each?

Information gain measures entropy reduction: IG(S,A)=H(S)vSvSH(Sv)IG(S,A) = H(S) - \sum_v \frac{|S_v|}{|S|} H(S_v) where H(S)=pklog2pkH(S) = -\sum p_k \log_2 p_k. It requires computing logarithms. Gini impurity measures expected misclassification probability: G(S)=1kpk2G(S) = 1 - \sum_k p_k^2, requiring only squaring. Both criteria identify the same best split approximately 99% of the time on real-world data. Use Gini (sklearn's default) for speed. Use entropy when you want to align with information-theoretic analysis or when the dataset has very many classes (K>10K > 10) where entropy's steeper curve near p=1/Kp = 1/K may slightly improve boundary placement. In practice, always tune depth and regularization parameters rather than the criterion.

Q2: Why does information gain have a bias toward high-cardinality features, and how does Gain Ratio fix it?

Information gain rewards splits that produce many small, pure subsets. A feature with 100 unique values has 99 threshold candidates to try, while a binary feature has one. The high-cardinality feature has many more chances to find a split that looks good in-sample by pure statistical chance, even if it does not generalize. Gain Ratio corrects this by dividing IG by the intrinsic information of the split itself: GR=IG(S,A)/H(A)GR = IG(S,A) / H(A), where H(A)=vSvSlog2SvSH(A) = -\sum_v \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|} is large when the feature creates many balanced subsets. Features that exploit cardinality are penalized proportionally. sklearn does not implement Gain Ratio - address this bias with feature binning or use Random Forests where per-node random feature sampling reduces the bias indirectly.

Q3: Prove that information gain is always non-negative.

Entropy is a concave function of the class probability vector. Jensen's inequality states that for a concave function ff: f(E[X])E[f(X)]f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)]. The parent entropy H(S)H(S) is the entropy of the mixture distribution over all child subsets. The weighted average of children's entropies vSvSH(Sv)\sum_v \frac{|S_v|}{|S|} H(S_v) is the expected entropy after conditioning on feature AA. By Jensen's inequality applied to the concave entropy function: H(S)vSvSH(Sv)H(S) \geq \sum_v \frac{|S_v|}{|S|} H(S_v), so IG=H(S)HA(S)0IG = H(S) - H_A(S) \geq 0. Splitting can only reduce expected entropy, never increase it.

Q4: What happens to Gini impurity in a perfectly balanced multi-class problem with K classes?

For KK equally likely classes, pk=1/Kp_k = 1/K for all kk. Then G=1k=1K(1/K)2=1K(1/K2)=11/KG = 1 - \sum_{k=1}^K (1/K)^2 = 1 - K \cdot (1/K^2) = 1 - 1/K. For K=2K=2: G=0.5G = 0.5. For K=10K=10: G=0.9G = 0.9. For K=100K=100: G=0.99G = 0.99. Gini approaches 1.0 as the number of equally probable classes grows. The entropy in this case is log2K\log_2 K, which also grows but logarithmically. For multi-class problems with many balanced classes, both metrics behave sensibly but they have different scales - Gini approaches 1.0 linearly while entropy grows logarithmically.

Q5: In the hospital readmission example, num_medications dominates the tree. How would you investigate and fix this?

First, compare MDI feature importance (sklearn's feature_importances_) against permutation importance (sklearn.inspection.permutation_importance on a held-out validation set). If permutation importance for num_medications is much lower than MDI importance, that confirms the multi-valued bias - the feature dominated split selection because it had many threshold candidates, not because it is genuinely more predictive. Fix options: (1) Bin num_medications into 4-5 ordinal categories (low/medium/high/very high) to reduce cardinality while preserving information; (2) Use Random Forest with max_features='sqrt' - at each node, only a random subset of features compete, so num_medications cannot dominate every split; (3) Compare SHAP values - SHAP correctly attributes credit for interactions rather than rewarding features for appearing at many nodes.

Q6: How does CART handle continuous features differently from the original ID3 algorithm?

ID3 was designed for categorical features and created multi-way splits - one branch per category value. For a feature with 20 values, ID3 creates 20 branches at that node, making trees extremely wide and prone to overfitting (essentially memorizing training data category by category). CART always makes binary splits: it finds the single threshold tt that maximizes the impurity reduction, creating exactly two child nodes (left: xtx \leq t, right: x>tx > t). Binary splits work naturally with continuous features, allow the same feature to be reused at different thresholds at different depths, and produce more compact trees. The CART approach is what sklearn implements and is the foundation for all modern gradient boosting libraries.

Q7: A colleague claims that switching from Gini to entropy significantly improved their model. What is actually going on?

It is almost certainly a coincidence from randomness in evaluation (different train/test splits, stochastic training, or hyperparameter interactions). The two criteria agree on the same split approximately 99% of the time. Any observed improvement from switching criteria is almost certainly less than what would be gained from proper cross-validation, tuning max_depth, adjusting min_samples_leaf, or adding better features. If a team is spending time tuning the criterion, they are optimizing the wrong lever. The correct response is to run a proper cross-validated comparison with many random seeds and verify the difference is statistically significant - it almost never is.

Production: Full Split-Finding Implementation

The following is a production-quality implementation of CART split finding with correct handling of both classification and regression targets, multiple impurity measures, and categorical feature support:

import numpy as np
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class SplitResult:
feature_idx: int
threshold: float
impurity_decrease: float
left_mask: np.ndarray
right_mask: np.ndarray

def gini(y: np.ndarray) -> float:
"""Gini impurity for a node."""
if len(y) == 0:
return 0.0
classes, counts = np.unique(y, return_counts=True)
probs = counts / len(y)
return 1.0 - np.sum(probs ** 2)

def entropy(y: np.ndarray) -> float:
"""Shannon entropy for a node."""
if len(y) == 0:
return 0.0
classes, counts = np.unique(y, return_counts=True)
probs = counts / len(y)
probs = probs[probs > 0]
return -np.sum(probs * np.log2(probs))

def mse_impurity(y: np.ndarray) -> float:
"""Variance (MSE from mean) - used for regression trees."""
if len(y) == 0:
return 0.0
return float(np.var(y))

def find_best_split(X: np.ndarray, y: np.ndarray,
criterion: str = 'gini',
min_samples_leaf: int = 1) -> Optional[SplitResult]:
"""
CART split finding: exhaustively search all features and thresholds.
Returns the split with highest weighted impurity decrease.

criterion: 'gini' | 'entropy' | 'mse'
"""
impurity_fn = {'gini': gini, 'entropy': entropy, 'mse': mse_impurity}[criterion]

n_samples, n_features = X.shape
parent_impurity = impurity_fn(y)

best = None
best_decrease = 0.0

for feat_idx in range(n_features):
col = X[:, feat_idx]
# Candidate thresholds: midpoints between consecutive unique sorted values
unique_vals = np.unique(col)
if len(unique_vals) < 2:
continue # no valid split for this feature

thresholds = (unique_vals[:-1] + unique_vals[1:]) / 2.0

for t in thresholds:
left_mask = col <= t
right_mask = ~left_mask

n_left = left_mask.sum()
n_right = right_mask.sum()

# Enforce min_samples_leaf constraint
if n_left < min_samples_leaf or n_right < min_samples_leaf:
continue

# Weighted impurity decrease
left_imp = impurity_fn(y[left_mask])
right_imp = impurity_fn(y[right_mask])
weighted = (n_left * left_imp + n_right * right_imp) / n_samples

decrease = parent_impurity - weighted

if decrease > best_decrease:
best_decrease = decrease
best = SplitResult(
feature_idx=feat_idx,
threshold=t,
impurity_decrease=decrease,
left_mask=left_mask,
right_mask=right_mask,
)

return best


def compute_gain_ratio(X_col: np.ndarray, y: np.ndarray,
threshold: float) -> Tuple[float, float]:
"""
Gain ratio (C4.5): corrects ID3's bias toward high-cardinality features.
GainRatio(A) = InformationGain(A) / SplitInformation(A)

SplitInformation penalizes features that split data into many small groups.
"""
n = len(y)
left = y[X_col <= threshold]
right = y[X_col > threshold]

# Information gain
parent_h = entropy(y)
child_h = (len(left) * entropy(left) + len(right) * entropy(right)) / n
ig = parent_h - child_h

# Split information: entropy of the split itself
p_left = len(left) / n
p_right = len(right) / n
split_info = 0.0
if p_left > 0: split_info -= p_left * np.log2(p_left)
if p_right > 0: split_info -= p_right * np.log2(p_right)

gain_ratio = ig / split_info if split_info > 1e-10 else 0.0
return ig, gain_ratio


# ── Example: compare all criteria on the same dataset ────────────────────────
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_iris = StandardScaler().fit_transform(iris.data)
y_iris = iris.target

print("Split search on Iris dataset:")
for criterion in ['gini', 'entropy']:
result = find_best_split(X_iris, y_iris, criterion=criterion, min_samples_leaf=5)
if result:
feat_name = iris.feature_names[result.feature_idx]
print(f" {criterion:8s}: best_feature='{feat_name}', "
f"threshold={result.threshold:.3f}, "
f"impurity_decrease={result.impurity_decrease:.4f}")

# Regression tree
from sklearn.datasets import load_diabetes
X_diab, y_diab = load_diabetes(return_X_y=True)
result_reg = find_best_split(X_diab, y_diab, criterion='mse', min_samples_leaf=10)
if result_reg:
print(f"\n MSE split: feature_idx={result_reg.feature_idx}, "
f"threshold={result_reg.threshold:.4f}, "
f"variance_reduction={result_reg.impurity_decrease:.2f}")

Impurity Criteria Visualization

import numpy as np
import matplotlib.pyplot as plt

def plot_impurity_comparison():
"""Plot Gini, entropy, and misclassification error as functions of p (binary class)."""
p = np.linspace(0, 1, 200)
p = np.clip(p, 1e-10, 1 - 1e-10)

gini_vals = 2 * p * (1 - p) # = 1 - p² - (1-p)²
entropy_vals = -(p * np.log2(p) + (1-p) * np.log2(1-p)) # Shannon entropy
misclass_vals = 1 - np.maximum(p, 1-p) # 1 - max(p, 1-p)

# Normalize entropy to [0, 1] for comparison (max entropy = 1 bit for binary)
# entropy range is already [0, 1] for binary

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(p, gini_vals, label='Gini: $2p(1-p)$', linewidth=2, color='#3b82f6')
axes[0].plot(p, entropy_vals, label='Entropy: $H(p)$ (bits)', linewidth=2, color='#7c3aed')
axes[0].plot(p, misclass_vals, label='Misclassification: $1-\max(p,1-p)$', linewidth=2,
color='#dc2626', linestyle='--')
axes[0].set_xlabel("Proportion of class 1 (p)", fontsize=12)
axes[0].set_ylabel("Impurity Measure", fontsize=12)
axes[0].set_title("Impurity Measures for Binary Classification", fontsize=13)
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Ratio: entropy / Gini - shows when they differ
ratio = entropy_vals / (gini_vals + 1e-10)
axes[1].plot(p, ratio, linewidth=2, color='#16a34a')
axes[1].axhline(y=ratio.mean(), color='gray', linestyle='--',
label=f'Mean ratio = {ratio.mean():.3f}')
axes[1].set_xlabel("p", fontsize=12)
axes[1].set_ylabel("Entropy / Gini ratio", fontsize=12)
axes[1].set_title("When Entropy and Gini Disagree", fontsize=13)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.suptitle("Splitting Criteria Comparison - All Agree Near p=0 and p=1", fontsize=12)
plt.tight_layout()
plt.show()

# Key observation: entropy / Gini ≈ constant across most of [0,1]
# Maximum disagreement near p=0.2 and p=0.8
print(f"Max entropy/gini ratio: {ratio.max():.4f} at p = {p[np.argmax(ratio)]:.3f}")
print(f"Min entropy/gini ratio: {ratio.min():.4f} at p = {p[np.argmin(ratio)]:.3f}")
print("→ The two criteria differ by at most ~43% in raw value,")
print(" but almost always rank splits the same way.")

plot_impurity_comparison()

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Entropy Explorer demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.