Entropy and Information
The Scenario That Motivates This Lesson
You are a senior ML engineer at a company building a medical diagnosis system. Your model outputs a probability distribution over 10 possible diagnoses. Before shipping, your tech lead asks two questions:
"How uncertain is the model on average? And when we split a decision tree node, how do we know which feature reduces uncertainty the most?"
Both questions have the same answer: entropy. The model's average uncertainty is its entropy over the output distribution. The best split is the feature that maximally reduces entropy - called information gain.
Entropy is also the foundation of every loss function in this module. Understanding it deeply means understanding why ML systems work.
What Is Information?
Start with a thought experiment. You receive two messages:
- "The sun rose in the east this morning."
- "A magnitude 9.0 earthquake just struck downtown Tokyo."
Which message contains more information? Intuitively, the second - because it is more surprising. The first was nearly certain to happen.
Shannon formalized this intuition. The self-information (or surprisal) of an event with probability is:
Properties of self-information:
- Rare events have high information: if , then
- Certain events have zero information: if , then
- Additive for independent events: when
The choice of logarithm base determines the unit:
- Base 2 → bits (most common in information theory)
- Base e → nats (used in ML/PyTorch,
torch.nn.CrossEntropyLoss) - Base 10 → hartleys (rare)
:::note Why the logarithm? The logarithm is the unique function (up to scaling) that is: (1) continuous, (2) monotonically decreasing in p, (3) zero for p=1, and (4) additive for independent events. These four requirements uniquely determine , following from Shannon's original axioms. :::
Shannon Entropy: Expected Surprise
Entropy is the expected self-information over a distribution - the average surprise you receive when drawing from that distribution:
By convention, (since ).
Intuitive Interpretation
Entropy measures uncertainty or spread in a distribution:
- High entropy = distribution is spread out, outcome is unpredictable
- Low entropy = distribution is concentrated, outcome is predictable
Entropy of coin flip distributions (base 2, in bits):
Fair coin (p=0.5): H = 1.000 bit ████████████████████
Biased coin (p=0.9): H = 0.469 bits █████████
Very biased (p=0.99): H = 0.081 bits ██
Deterministic (p=1.0): H = 0.000 bits (no uncertainty)
Computing Entropy: Examples
Fair coin ():
Biased coin ():
Uniform distribution over outcomes:
The uniform distribution has maximum entropy for a given number of outcomes. This is why the uniform prior is the "least informative" prior - it encodes maximum uncertainty.
The Binary Entropy Function
For a binary random variable with , the entropy is:
h(p) - Binary Entropy Function
1.0 | *****
| ** **
| ** **
0.5 | * *
| * *
| * *
| * *
0.0 |*_____________________*
0 0.5 1.0
p
Maximum at p = 0.5 → H = 1 bit
Zero at p = 0 and p = 1 → deterministic = no uncertainty
Key properties:
- - symmetric about
- Maximum bit at
- Concave everywhere on
Python: Computing Entropy
import numpy as np
from scipy.stats import entropy as scipy_entropy
def entropy(p: np.ndarray, base: float = 2) -> float:
"""
Compute Shannon entropy of a probability distribution.
Args:
p: probability array (must sum to 1)
base: logarithm base (2 = bits, np.e = nats)
Returns:
entropy value
"""
p = np.asarray(p, dtype=float)
assert np.isclose(p.sum(), 1.0), f"Probabilities must sum to 1, got {p.sum():.6f}"
assert np.all(p >= 0), "Probabilities must be non-negative"
# Handle p=0 terms: 0 * log(0) = 0 by convention
mask = p > 0
log_p = np.log(p[mask]) / np.log(base)
return float(-np.sum(p[mask] * log_p))
# --- Example 1: Compare distributions ---
distributions = {
"Uniform (4 outcomes)": [0.25, 0.25, 0.25, 0.25],
"Peaked (one dominant)": [0.70, 0.10, 0.10, 0.10],
"Near-deterministic": [0.97, 0.01, 0.01, 0.01],
"Deterministic": [1.00, 0.00, 0.00, 0.00],
}
print(f"{'Distribution':<30} | {'Entropy (bits)':>14}")
print("-" * 50)
for name, p in distributions.items():
h = entropy(p, base=2)
print(f"{name:<30} | {h:>14.4f}")
# Output:
# Uniform (4 outcomes) | 2.0000
# Peaked (one dominant) | 1.3568
# Near-deterministic | 0.2423
# Deterministic | 0.0000
# --- Example 2: Binary entropy function ---
p_values = np.linspace(1e-6, 1 - 1e-6, 1000)
h_binary = -p_values * np.log2(p_values) - (1 - p_values) * np.log2(1 - p_values)
print(f"\nBinary entropy at p=0.5: {entropy([0.5, 0.5]):.6f} bits")
print(f"Binary entropy at p=0.1: {entropy([0.1, 0.9]):.6f} bits")
print(f"Binary entropy at p=0.9: {entropy([0.9, 0.1]):.6f} bits")
# Notice h(0.1) == h(0.9) - the function is symmetric
# --- Example 3: Entropy of typical model outputs ---
def model_entropy_analysis():
"""Compare entropy for different prediction confidence levels."""
examples = {
"Confident correct prediction": [0.95, 0.03, 0.01, 0.01],
"Uncertain prediction": [0.30, 0.28, 0.22, 0.20],
"Uniform uncertainty": [0.25, 0.25, 0.25, 0.25],
"Confused (two candidates)": [0.49, 0.49, 0.01, 0.01],
}
max_h = np.log2(4) # 4 classes → max = 2 bits
print(f"\n{'Scenario':<35} {'Entropy':>10} {'% of max':>10}")
print("-" * 57)
for name, p in examples.items():
h = entropy(p)
print(f"{name:<35} {h:>10.4f} {100 * h / max_h:>9.1f}%")
model_entropy_analysis()
Entropy of Common Probability Distributions
Discrete Distributions
| Distribution | Parameters | Entropy (nats) |
|---|---|---|
| Bernoulli() | ||
| Uniform over | ||
| Geometric() | ||
| Poisson() | for large | |
| Categorical() | -vector |
Maximum Entropy Principle
Among all discrete distributions over outcomes, the uniform distribution maximizes entropy:
with equality if and only if is uniform. This follows from the concavity of and Jensen's inequality.
This principle is widely used: the maximum-entropy distribution subject to known constraints is the least biased distribution consistent with the constraints. It underlies:
- Choice of uniform priors in Bayesian inference
- Softmax's role as the maximum-entropy classifier
- The normalization in attention mechanisms
Differential Entropy: Continuous Random Variables
For a continuous random variable with probability density function , the differential entropy is:
:::warning Differential Entropy Can Be Negative Unlike discrete entropy, differential entropy can be negative. A very narrow Gaussian has negative differential entropy. This is because differential entropy is defined relative to the Lebesgue measure - it measures spread relative to a continuous uniform distribution, not an absolute bit count. :::
Differential Entropy of Key Distributions
Gaussian :
The Gaussian has maximum differential entropy among all distributions with a given variance. This is why the Gaussian arises as the maximum-entropy distribution under a second-moment constraint.
Exponential :
Uniform :
import numpy as np
from scipy import stats
def differential_entropy_gaussian(sigma: float) -> float:
"""Differential entropy of N(mu, sigma^2) in nats."""
return 0.5 * np.log(2 * np.pi * np.e * sigma**2)
def differential_entropy_exponential(lam: float) -> float:
"""Differential entropy of Exp(lambda) in nats."""
return 1 - np.log(lam)
def differential_entropy_uniform(a: float, b: float) -> float:
"""Differential entropy of Uniform(a, b) in nats."""
return np.log(b - a)
# Gaussian entropy grows logarithmically with variance
print("Gaussian Differential Entropy vs. Width:")
print(f"{'Sigma':>8} | {'h(X) nats':>12} | {'h(X) bits':>12}")
print("-" * 38)
for sigma in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]:
h_nats = differential_entropy_gaussian(sigma)
h_bits = h_nats / np.log(2)
marker = " ← negative!" if h_nats < 0 else ""
print(f"{sigma:>8.1f} | {h_nats:>12.4f} | {h_bits:>12.4f}{marker}")
# Note: sigma < 1/sqrt(2*pi*e) ≈ 0.242 gives negative entropy
# Verify against scipy
rv = stats.norm(loc=0, scale=2.0)
print(f"\nScipy entropy for N(0, 4): {rv.entropy():.6f} nats")
print(f"Formula result: {differential_entropy_gaussian(2.0):.6f} nats")
Joint Entropy and Conditional Entropy
Joint Entropy
For two random variables and :
Chain rule of entropy:
Conditional Entropy
Key inequality - conditioning reduces entropy:
Conditioning on can only reduce (or keep equal) the entropy of . Knowing more can never increase average uncertainty. This is the foundation of information gain.
Venn diagram of entropy quantities:
H(X, Y)
┌──────────────────────────────────────────────┐
│ │
│ H(X) │ H(Y) │
│ ┌─────────────┐ │ ┌─────────────┐ │
│ │ │ │ │ │ │
│ │ H(X|Y) │ I(X;Y) │ H(Y|X) │ │
│ │ │ │ │ │ │
│ └─────────────┴───┴───┴─────────────┘ │
│ │
└──────────────────────────────────────────────┘
H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
I(X;Y) = H(X) + H(Y) - H(X,Y) [mutual information, Lesson 04]
ML Connection: Decision Trees and Information Gain
:::info ML Connection - Decision Trees Decision trees select the feature that maximizes information gain - the reduction in label entropy after the split. This is the ID3 and C4.5 algorithm's core criterion. :::
Information Gain of splitting dataset by feature :
where is the entropy of the class label distribution in set , and is the subset where .
import numpy as np
def dataset_entropy(labels: np.ndarray) -> float:
"""Shannon entropy (bits) of a set of class labels."""
_, counts = np.unique(labels, return_counts=True)
probs = counts / counts.sum()
mask = probs > 0
return float(-np.sum(probs[mask] * np.log2(probs[mask])))
def information_gain(labels: np.ndarray, feature: np.ndarray) -> float:
"""
Information gain of splitting by a categorical feature.
Args:
labels: array of class labels
feature: array of feature values (categorical)
Returns:
information gain in bits
"""
parent_h = dataset_entropy(labels)
values, counts = np.unique(feature, return_counts=True)
n = len(labels)
weighted_child_h = sum(
(count / n) * dataset_entropy(labels[feature == v])
for v, count in zip(values, counts)
)
return parent_h - weighted_child_h
# Classic "Play Tennis" dataset
# Labels: 1=play, 0=don't play
labels = np.array([1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0])
weather = np.array([0, 0, 1, 2, 2, 2, 1, 0, 0, 2, 0, 1, 1, 2]) # 0=sunny,1=overcast,2=rain
wind = np.array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1]) # 0=weak,1=strong
humidity= np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0]) # 0=high,1=normal
parent_h = dataset_entropy(labels)
ig_weather = information_gain(labels, weather)
ig_wind = information_gain(labels, wind)
ig_humidity = information_gain(labels, humidity)
print(f"Parent entropy: {parent_h:.4f} bits")
print(f"IG(Weather): {ig_weather:.4f} bits")
print(f"IG(Wind): {ig_wind:.4f} bits")
print(f"IG(Humidity): {ig_humidity:.4f} bits")
features = {"Weather": ig_weather, "Wind": ig_wind, "Humidity": ig_humidity}
best = max(features, key=features.get)
print(f"\nBest first split: {best} (IG = {features[best]:.4f} bits)")
# Output:
# Parent entropy: 0.9403 bits
# IG(Weather): 0.2467 bits
# IG(Wind): 0.0481 bits
# IG(Humidity): 0.1518 bits
# Best first split: Weather (IG = 0.2467 bits)
Comparison - Entropy vs. Gini Impurity:
Criterion | Formula | Properties
-----------------+----------------------------+----------------------------
Entropy (ID3/C4.5)| -Σ p_k log p_k | Theoretically grounded in IT
Gini (CART) | 1 - Σ p_k² | Faster to compute, similar splits
Log Loss (XGBoost)| Cross-entropy gradient | Differentiable, gradient boosting
Entropy is more sensitive to rare class impurity; Gini is slightly cheaper to compute. In practice, they produce nearly identical trees.
Properties of Entropy
Key Inequalities and Laws
| Property | Statement | Implication |
|---|---|---|
| Non-negativity | Entropy is always non-negative | |
| Maximum entropy | Uniform is maximally uncertain | |
| Conditioning reduces entropy | Information never hurts | |
| Chain rule | Joint = marginal + conditional | |
| Subadditivity | Equality iff | |
| Data processing inequality | Processing cannot create info |
Entropy and Compression
Shannon's source coding theorem: the minimum average number of bits to encode samples from using an optimal code equals bits per symbol. You cannot do better.
This connects entropy to the fundamental limits of compression - the topic of Lesson 05.
Entropy Across the ML Stack
ML Task | How Entropy Is Used
----------------------------+-----------------------------------------------
Decision trees | Splitting criterion - maximize information gain
Language models | Perplexity = 2^H(test data | model)
Uncertainty quantification | Model entropy = prediction confidence
Active learning | Query maximum entropy samples
MaxEnt RL (SAC) | Entropy bonus encourages exploration
VAE encoder | Entropy of q(z|x) in ELBO bound
Calibration | Well-calibrated models: entropy matches freq.
Feature selection | Mutual information (entropy-based, Lesson 04)
Clustering purity | Cluster label entropy = cluster quality
MaxEnt Reinforcement Learning
In Soft Actor-Critic (SAC), the objective augments reward with an entropy bonus:
The entropy term rewards the policy for taking diverse actions. This prevents premature exploitation, stabilizes training, and encourages exploration of the state space.
Entropy Regularization in PyTorch
import torch
import torch.nn.functional as F
def entropy_regularization(
logits: torch.Tensor,
alpha: float = 0.01
) -> torch.Tensor:
"""
Entropy regularization term for MaxEnt RL and semi-supervised learning.
Encourages high-entropy (uncertain) predictions when added to the loss.
Args:
logits: raw model outputs, shape (batch, num_classes)
alpha: regularization strength
Returns:
negative mean entropy (add to loss to maximize entropy)
"""
log_probs = F.log_softmax(logits, dim=-1)
probs = torch.exp(log_probs)
# H = -sum(p * log_p) per sample
per_sample_entropy = -torch.sum(probs * log_probs, dim=-1)
mean_entropy = per_sample_entropy.mean()
return -alpha * mean_entropy # negative because we minimize losses
# Demonstrate: confident vs uncertain outputs
torch.manual_seed(42)
confident_logits = torch.tensor([[5.0, -2.0, -2.0, -2.0]] * 4)
uncertain_logits = torch.tensor([[0.1, 0.1, -0.1, -0.1]] * 4)
def compute_entropy(logits):
probs = F.softmax(logits, dim=-1)
log_probs = F.log_softmax(logits, dim=-1)
return -torch.sum(probs * log_probs, dim=-1).mean().item()
print(f"Entropy of confident model: {compute_entropy(confident_logits):.4f} nats")
print(f"Entropy of uncertain model: {compute_entropy(uncertain_logits):.4f} nats")
print(f"Max entropy (4 classes): {torch.log(torch.tensor(4.0)).item():.4f} nats")
Interview Questions and Answers
Q1: What is entropy in information theory, and why do we use it in decision trees?
Entropy measures the average uncertainty (expected surprise) in a probability distribution. It is maximized when the distribution is uniform (maximum uncertainty) and zero when deterministic (no uncertainty).
In decision trees, we use entropy to measure label impurity at a node. A pure node (all labels identical) has entropy 0. A maximally mixed node has entropy for classes. We select the feature maximizing information gain - the reduction in weighted average entropy after splitting. This greedily reduces label uncertainty most efficiently.
Q2: What is the relationship between entropy and the minimum description length of a message?
Shannon's source coding theorem states that the minimum average number of bits to encode a message from source using an optimal lossless code equals bits per symbol. No code can do better on average. Huffman coding achieves within 1 bit of this bound per symbol; arithmetic coding can approach the bound arbitrarily closely.
For ML: a language model with cross-entropy nats per token needs at least nats per token to encode text. Perplexity quantifies compression efficiency - lower perplexity = better compression = better language model.
Q3: Why is the uniform distribution the maximum entropy distribution?
Among all distributions over discrete outcomes, the uniform maximizes entropy. Proof by Lagrange multipliers: maximize subject to . The Lagrangian optimality condition for all forces all to be equal.
Intuitively: the uniform distribution makes the fewest assumptions - it encodes maximum uncertainty. In Bayesian inference, the uniform prior is the least informative choice.
Q4: Can entropy be negative? What about differential entropy?
Discrete entropy is always non-negative: . Each term because implies .
Differential entropy (continuous distributions) can be negative. For example, , which is negative for . Differential entropy is defined relative to the Lebesgue measure - it measures spread relative to the continuous uniform distribution rather than an absolute information count.
Q5: What is the data processing inequality, and what does it mean for ML?
The data processing inequality states: if is a Markov chain (Z depends on X only through Y), then . Equivalently, - processing data cannot increase the information it contains about the original input.
For ML: every layer of a neural network can only reduce (or preserve) the information about the input. You cannot create information by processing data - only extract, reorganize, or discard it. This is the basis of the information bottleneck theory of deep learning (Lesson 04): deep networks are hypothesized to progressively compress input information while retaining task-relevant information. It also explains why feature engineering (domain knowledge) matters: it lets you preserve the right information before it is lost.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Entropy Explorer demo on the EngineersOfAI Playground - no code required.
:::
