Skip to main content

Entropy and Information

The Scenario That Motivates This Lesson

You are a senior ML engineer at a company building a medical diagnosis system. Your model outputs a probability distribution over 10 possible diagnoses. Before shipping, your tech lead asks two questions:

"How uncertain is the model on average? And when we split a decision tree node, how do we know which feature reduces uncertainty the most?"

Both questions have the same answer: entropy. The model's average uncertainty is its entropy over the output distribution. The best split is the feature that maximally reduces entropy - called information gain.

Entropy is also the foundation of every loss function in this module. Understanding it deeply means understanding why ML systems work.

What Is Information?

Start with a thought experiment. You receive two messages:

  1. "The sun rose in the east this morning."
  2. "A magnitude 9.0 earthquake just struck downtown Tokyo."

Which message contains more information? Intuitively, the second - because it is more surprising. The first was nearly certain to happen.

Shannon formalized this intuition. The self-information (or surprisal) of an event xx with probability p(x)p(x) is:

I(x)=logp(x)I(x) = -\log p(x)

Properties of self-information:

  • Rare events have high information: if p(x)0p(x) \to 0, then I(x)I(x) \to \infty
  • Certain events have zero information: if p(x)=1p(x) = 1, then I(x)=0I(x) = 0
  • Additive for independent events: I(x1,x2)=I(x1)+I(x2)I(x_1, x_2) = I(x_1) + I(x_2) when x1x2x_1 \perp x_2

The choice of logarithm base determines the unit:

  • Base 2 → bits (most common in information theory)
  • Base e → nats (used in ML/PyTorch, torch.nn.CrossEntropyLoss)
  • Base 10 → hartleys (rare)

:::note Why the logarithm? The logarithm is the unique function (up to scaling) that is: (1) continuous, (2) monotonically decreasing in p, (3) zero for p=1, and (4) additive for independent events. These four requirements uniquely determine I(x)=logp(x)I(x) = -\log p(x), following from Shannon's original axioms. :::

Shannon Entropy: Expected Surprise

Entropy is the expected self-information over a distribution - the average surprise you receive when drawing from that distribution:

H(X)=E[logp(X)]=xXp(x)logp(x)H(X) = \mathbb{E}[-\log p(X)] = -\sum_{x \in \mathcal{X}} p(x) \log p(x)

By convention, 0log0=00 \log 0 = 0 (since limp0plogp=0\lim_{p \to 0} p \log p = 0).

Intuitive Interpretation

Entropy measures uncertainty or spread in a distribution:

  • High entropy = distribution is spread out, outcome is unpredictable
  • Low entropy = distribution is concentrated, outcome is predictable
Entropy of coin flip distributions (base 2, in bits):

Fair coin (p=0.5): H = 1.000 bit ████████████████████
Biased coin (p=0.9): H = 0.469 bits █████████
Very biased (p=0.99): H = 0.081 bits ██
Deterministic (p=1.0): H = 0.000 bits (no uncertainty)

Computing Entropy: Examples

Fair coin (p=0.5p = 0.5): H=0.5log2(0.5)0.5log2(0.5)=2×0.5×(1)=1 bitH = -0.5\log_2(0.5) - 0.5\log_2(0.5) = -2 \times 0.5 \times (-1) = 1 \text{ bit}

Biased coin (p(Heads)=0.9p(\text{Heads}) = 0.9): H=0.9log2(0.9)0.1log2(0.1)=0.137+0.332=0.469 bitsH = -0.9\log_2(0.9) - 0.1\log_2(0.1) = 0.137 + 0.332 = 0.469 \text{ bits}

Uniform distribution over KK outcomes: H=i=1K1Klog21K=log2KH = -\sum_{i=1}^{K} \frac{1}{K}\log_2\frac{1}{K} = \log_2 K

The uniform distribution has maximum entropy for a given number of outcomes. This is why the uniform prior is the "least informative" prior - it encodes maximum uncertainty.

The Binary Entropy Function

For a binary random variable with P(X=1)=pP(X=1) = p, the entropy is:

h(p)=plog2p(1p)log2(1p)h(p) = -p\log_2 p - (1-p)\log_2(1-p)

h(p) - Binary Entropy Function

1.0 | *****
| ** **
| ** **
0.5 | * *
| * *
| * *
| * *
0.0 |*_____________________*
0 0.5 1.0
p

Maximum at p = 0.5 → H = 1 bit
Zero at p = 0 and p = 1 → deterministic = no uncertainty

Key properties:

  • h(p)=h(1p)h(p) = h(1-p) - symmetric about p=0.5p = 0.5
  • Maximum h(p)=1h(p) = 1 bit at p=0.5p = 0.5
  • h(0)=h(1)=0h(0) = h(1) = 0
  • Concave everywhere on [0,1][0, 1]

Python: Computing Entropy

import numpy as np
from scipy.stats import entropy as scipy_entropy


def entropy(p: np.ndarray, base: float = 2) -> float:
"""
Compute Shannon entropy of a probability distribution.

Args:
p: probability array (must sum to 1)
base: logarithm base (2 = bits, np.e = nats)

Returns:
entropy value
"""
p = np.asarray(p, dtype=float)
assert np.isclose(p.sum(), 1.0), f"Probabilities must sum to 1, got {p.sum():.6f}"
assert np.all(p >= 0), "Probabilities must be non-negative"

# Handle p=0 terms: 0 * log(0) = 0 by convention
mask = p > 0
log_p = np.log(p[mask]) / np.log(base)
return float(-np.sum(p[mask] * log_p))


# --- Example 1: Compare distributions ---
distributions = {
"Uniform (4 outcomes)": [0.25, 0.25, 0.25, 0.25],
"Peaked (one dominant)": [0.70, 0.10, 0.10, 0.10],
"Near-deterministic": [0.97, 0.01, 0.01, 0.01],
"Deterministic": [1.00, 0.00, 0.00, 0.00],
}

print(f"{'Distribution':<30} | {'Entropy (bits)':>14}")
print("-" * 50)
for name, p in distributions.items():
h = entropy(p, base=2)
print(f"{name:<30} | {h:>14.4f}")

# Output:
# Uniform (4 outcomes) | 2.0000
# Peaked (one dominant) | 1.3568
# Near-deterministic | 0.2423
# Deterministic | 0.0000


# --- Example 2: Binary entropy function ---
p_values = np.linspace(1e-6, 1 - 1e-6, 1000)
h_binary = -p_values * np.log2(p_values) - (1 - p_values) * np.log2(1 - p_values)

print(f"\nBinary entropy at p=0.5: {entropy([0.5, 0.5]):.6f} bits")
print(f"Binary entropy at p=0.1: {entropy([0.1, 0.9]):.6f} bits")
print(f"Binary entropy at p=0.9: {entropy([0.9, 0.1]):.6f} bits")
# Notice h(0.1) == h(0.9) - the function is symmetric


# --- Example 3: Entropy of typical model outputs ---
def model_entropy_analysis():
"""Compare entropy for different prediction confidence levels."""
examples = {
"Confident correct prediction": [0.95, 0.03, 0.01, 0.01],
"Uncertain prediction": [0.30, 0.28, 0.22, 0.20],
"Uniform uncertainty": [0.25, 0.25, 0.25, 0.25],
"Confused (two candidates)": [0.49, 0.49, 0.01, 0.01],
}

max_h = np.log2(4) # 4 classes → max = 2 bits

print(f"\n{'Scenario':<35} {'Entropy':>10} {'% of max':>10}")
print("-" * 57)
for name, p in examples.items():
h = entropy(p)
print(f"{name:<35} {h:>10.4f} {100 * h / max_h:>9.1f}%")

model_entropy_analysis()

Entropy of Common Probability Distributions

Discrete Distributions

DistributionParametersEntropy (nats)
Bernoulli(pp)p(0,1)p \in (0,1)plnp(1p)ln(1p)-p\ln p - (1-p)\ln(1-p)
Uniform over KKKKlnK\ln K
Geometric(pp)pp(1p)ln(1p)plnpp\frac{-(1-p)\ln(1-p) - p\ln p}{p}
Poisson(λ\lambda)λ>0\lambda > 012ln(2πeλ)\approx \frac{1}{2}\ln(2\pi e\lambda) for large λ\lambda
Categorical(p\mathbf{p})KK-vectorkpklnpk-\sum_k p_k \ln p_k

Maximum Entropy Principle

Among all discrete distributions over KK outcomes, the uniform distribution maximizes entropy:

H(X)logKH(X) \leq \log K

with equality if and only if XX is uniform. This follows from the concavity of log-\log and Jensen's inequality.

This principle is widely used: the maximum-entropy distribution subject to known constraints is the least biased distribution consistent with the constraints. It underlies:

  • Choice of uniform priors in Bayesian inference
  • Softmax's role as the maximum-entropy classifier
  • The normalization in attention mechanisms

Differential Entropy: Continuous Random Variables

For a continuous random variable XX with probability density function f(x)f(x), the differential entropy is:

h(X)=f(x)logf(x)dxh(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx

:::warning Differential Entropy Can Be Negative Unlike discrete entropy, differential entropy can be negative. A very narrow Gaussian has negative differential entropy. This is because differential entropy is defined relative to the Lebesgue measure - it measures spread relative to a continuous uniform distribution, not an absolute bit count. :::

Differential Entropy of Key Distributions

Gaussian XN(μ,σ2)X \sim \mathcal{N}(\mu, \sigma^2):

h(X)=12ln(2πeσ2)h(X) = \frac{1}{2}\ln(2\pi e \sigma^2)

The Gaussian has maximum differential entropy among all distributions with a given variance. This is why the Gaussian arises as the maximum-entropy distribution under a second-moment constraint.

Exponential XExp(λ)X \sim \text{Exp}(\lambda):

h(X)=1lnλh(X) = 1 - \ln\lambda

Uniform XUniform(a,b)X \sim \text{Uniform}(a, b):

h(X)=ln(ba)h(X) = \ln(b - a)

import numpy as np
from scipy import stats


def differential_entropy_gaussian(sigma: float) -> float:
"""Differential entropy of N(mu, sigma^2) in nats."""
return 0.5 * np.log(2 * np.pi * np.e * sigma**2)


def differential_entropy_exponential(lam: float) -> float:
"""Differential entropy of Exp(lambda) in nats."""
return 1 - np.log(lam)


def differential_entropy_uniform(a: float, b: float) -> float:
"""Differential entropy of Uniform(a, b) in nats."""
return np.log(b - a)


# Gaussian entropy grows logarithmically with variance
print("Gaussian Differential Entropy vs. Width:")
print(f"{'Sigma':>8} | {'h(X) nats':>12} | {'h(X) bits':>12}")
print("-" * 38)
for sigma in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]:
h_nats = differential_entropy_gaussian(sigma)
h_bits = h_nats / np.log(2)
marker = " ← negative!" if h_nats < 0 else ""
print(f"{sigma:>8.1f} | {h_nats:>12.4f} | {h_bits:>12.4f}{marker}")

# Note: sigma < 1/sqrt(2*pi*e) ≈ 0.242 gives negative entropy

# Verify against scipy
rv = stats.norm(loc=0, scale=2.0)
print(f"\nScipy entropy for N(0, 4): {rv.entropy():.6f} nats")
print(f"Formula result: {differential_entropy_gaussian(2.0):.6f} nats")

Joint Entropy and Conditional Entropy

Joint Entropy

For two random variables XX and YY:

H(X,Y)=x,yp(x,y)logp(x,y)H(X, Y) = -\sum_{x, y} p(x, y) \log p(x, y)

Chain rule of entropy:

H(X,Y)=H(X)+H(YX)=H(Y)+H(XY)H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)

Conditional Entropy

H(YX)=xp(x)yp(yx)logp(yx)=EX[H(YX=x)]H(Y|X) = -\sum_{x} p(x) \sum_{y} p(y|x) \log p(y|x) = \mathbb{E}_X\bigl[H(Y|X=x)\bigr]

Key inequality - conditioning reduces entropy:

H(YX)H(Y)H(Y|X) \leq H(Y)

Conditioning on XX can only reduce (or keep equal) the entropy of YY. Knowing more can never increase average uncertainty. This is the foundation of information gain.

Venn diagram of entropy quantities:

H(X, Y)
┌──────────────────────────────────────────────┐
│ │
│ H(X) │ H(Y) │
│ ┌─────────────┐ │ ┌─────────────┐ │
│ │ │ │ │ │ │
│ │ H(X|Y) │ I(X;Y) │ H(Y|X) │ │
│ │ │ │ │ │ │
│ └─────────────┴───┴───┴─────────────┘ │
│ │
└──────────────────────────────────────────────┘

H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
I(X;Y) = H(X) + H(Y) - H(X,Y) [mutual information, Lesson 04]

ML Connection: Decision Trees and Information Gain

:::info ML Connection - Decision Trees Decision trees select the feature that maximizes information gain - the reduction in label entropy after the split. This is the ID3 and C4.5 algorithm's core criterion. :::

Information Gain of splitting dataset SS by feature AA:

IG(S,A)=H(S)vvalues(A)SvSH(Sv)\text{IG}(S, A) = H(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} H(S_v)

where H(S)H(S) is the entropy of the class label distribution in set SS, and SvS_v is the subset where A=vA = v.

import numpy as np


def dataset_entropy(labels: np.ndarray) -> float:
"""Shannon entropy (bits) of a set of class labels."""
_, counts = np.unique(labels, return_counts=True)
probs = counts / counts.sum()
mask = probs > 0
return float(-np.sum(probs[mask] * np.log2(probs[mask])))


def information_gain(labels: np.ndarray, feature: np.ndarray) -> float:
"""
Information gain of splitting by a categorical feature.

Args:
labels: array of class labels
feature: array of feature values (categorical)

Returns:
information gain in bits
"""
parent_h = dataset_entropy(labels)

values, counts = np.unique(feature, return_counts=True)
n = len(labels)

weighted_child_h = sum(
(count / n) * dataset_entropy(labels[feature == v])
for v, count in zip(values, counts)
)

return parent_h - weighted_child_h


# Classic "Play Tennis" dataset
# Labels: 1=play, 0=don't play
labels = np.array([1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0])
weather = np.array([0, 0, 1, 2, 2, 2, 1, 0, 0, 2, 0, 1, 1, 2]) # 0=sunny,1=overcast,2=rain
wind = np.array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1]) # 0=weak,1=strong
humidity= np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0]) # 0=high,1=normal

parent_h = dataset_entropy(labels)
ig_weather = information_gain(labels, weather)
ig_wind = information_gain(labels, wind)
ig_humidity = information_gain(labels, humidity)

print(f"Parent entropy: {parent_h:.4f} bits")
print(f"IG(Weather): {ig_weather:.4f} bits")
print(f"IG(Wind): {ig_wind:.4f} bits")
print(f"IG(Humidity): {ig_humidity:.4f} bits")

features = {"Weather": ig_weather, "Wind": ig_wind, "Humidity": ig_humidity}
best = max(features, key=features.get)
print(f"\nBest first split: {best} (IG = {features[best]:.4f} bits)")

# Output:
# Parent entropy: 0.9403 bits
# IG(Weather): 0.2467 bits
# IG(Wind): 0.0481 bits
# IG(Humidity): 0.1518 bits
# Best first split: Weather (IG = 0.2467 bits)

Comparison - Entropy vs. Gini Impurity:

Criterion | Formula | Properties
-----------------+----------------------------+----------------------------
Entropy (ID3/C4.5)| -Σ p_k log p_k | Theoretically grounded in IT
Gini (CART) | 1 - Σ p_k² | Faster to compute, similar splits
Log Loss (XGBoost)| Cross-entropy gradient | Differentiable, gradient boosting

Entropy is more sensitive to rare class impurity; Gini is slightly cheaper to compute. In practice, they produce nearly identical trees.

Properties of Entropy

Key Inequalities and Laws

PropertyStatementImplication
Non-negativityH(X)0H(X) \geq 0Entropy is always non-negative
Maximum entropyH(X)logXH(X) \leq \log \|\mathcal{X}\|Uniform is maximally uncertain
Conditioning reduces entropyH(XY)H(X)H(X\|Y) \leq H(X)Information never hurts
Chain ruleH(X,Y)=H(X)+H(YX)H(X,Y) = H(X) + H(Y\|X)Joint = marginal + conditional
SubadditivityH(X,Y)H(X)+H(Y)H(X,Y) \leq H(X) + H(Y)Equality iff XYX \perp Y
Data processing inequalityH(f(X))H(X)H(f(X)) \leq H(X)Processing cannot create info

Entropy and Compression

Shannon's source coding theorem: the minimum average number of bits to encode samples from XX using an optimal code equals H(X)H(X) bits per symbol. You cannot do better.

This connects entropy to the fundamental limits of compression - the topic of Lesson 05.

Entropy Across the ML Stack

ML Task | How Entropy Is Used
----------------------------+-----------------------------------------------
Decision trees | Splitting criterion - maximize information gain
Language models | Perplexity = 2^H(test data | model)
Uncertainty quantification | Model entropy = prediction confidence
Active learning | Query maximum entropy samples
MaxEnt RL (SAC) | Entropy bonus encourages exploration
VAE encoder | Entropy of q(z|x) in ELBO bound
Calibration | Well-calibrated models: entropy matches freq.
Feature selection | Mutual information (entropy-based, Lesson 04)
Clustering purity | Cluster label entropy = cluster quality

MaxEnt Reinforcement Learning

In Soft Actor-Critic (SAC), the objective augments reward with an entropy bonus:

J(π)=tE(st,at)ρπ[r(st,at)+αH(π(st))]J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[r(s_t, a_t) + \alpha \, H(\pi(\cdot|s_t))\right]

The entropy term αH(π(st))\alpha H(\pi(\cdot|s_t)) rewards the policy for taking diverse actions. This prevents premature exploitation, stabilizes training, and encourages exploration of the state space.

Entropy Regularization in PyTorch

import torch
import torch.nn.functional as F


def entropy_regularization(
logits: torch.Tensor,
alpha: float = 0.01
) -> torch.Tensor:
"""
Entropy regularization term for MaxEnt RL and semi-supervised learning.
Encourages high-entropy (uncertain) predictions when added to the loss.

Args:
logits: raw model outputs, shape (batch, num_classes)
alpha: regularization strength

Returns:
negative mean entropy (add to loss to maximize entropy)
"""
log_probs = F.log_softmax(logits, dim=-1)
probs = torch.exp(log_probs)

# H = -sum(p * log_p) per sample
per_sample_entropy = -torch.sum(probs * log_probs, dim=-1)
mean_entropy = per_sample_entropy.mean()

return -alpha * mean_entropy # negative because we minimize losses


# Demonstrate: confident vs uncertain outputs
torch.manual_seed(42)

confident_logits = torch.tensor([[5.0, -2.0, -2.0, -2.0]] * 4)
uncertain_logits = torch.tensor([[0.1, 0.1, -0.1, -0.1]] * 4)

def compute_entropy(logits):
probs = F.softmax(logits, dim=-1)
log_probs = F.log_softmax(logits, dim=-1)
return -torch.sum(probs * log_probs, dim=-1).mean().item()

print(f"Entropy of confident model: {compute_entropy(confident_logits):.4f} nats")
print(f"Entropy of uncertain model: {compute_entropy(uncertain_logits):.4f} nats")
print(f"Max entropy (4 classes): {torch.log(torch.tensor(4.0)).item():.4f} nats")

Interview Questions and Answers

Q1: What is entropy in information theory, and why do we use it in decision trees?

Entropy H(X)=xp(x)logp(x)H(X) = -\sum_x p(x)\log p(x) measures the average uncertainty (expected surprise) in a probability distribution. It is maximized when the distribution is uniform (maximum uncertainty) and zero when deterministic (no uncertainty).

In decision trees, we use entropy to measure label impurity at a node. A pure node (all labels identical) has entropy 0. A maximally mixed node has entropy log2K\log_2 K for KK classes. We select the feature maximizing information gain - the reduction in weighted average entropy after splitting. This greedily reduces label uncertainty most efficiently.

Q2: What is the relationship between entropy and the minimum description length of a message?

Shannon's source coding theorem states that the minimum average number of bits to encode a message from source XX using an optimal lossless code equals H(X)H(X) bits per symbol. No code can do better on average. Huffman coding achieves within 1 bit of this bound per symbol; arithmetic coding can approach the bound arbitrarily closely.

For ML: a language model with cross-entropy HH nats per token needs at least HH nats per token to encode text. Perplexity =eH= e^H quantifies compression efficiency - lower perplexity = better compression = better language model.

Q3: Why is the uniform distribution the maximum entropy distribution?

Among all distributions over KK discrete outcomes, the uniform p(x)=1/Kp(x) = 1/K maximizes entropy. Proof by Lagrange multipliers: maximize H=kpklnpkH = -\sum_k p_k \ln p_k subject to kpk=1\sum_k p_k = 1. The Lagrangian optimality condition lnpk1=λ-\ln p_k - 1 = \lambda for all kk forces all pkp_k to be equal.

Intuitively: the uniform distribution makes the fewest assumptions - it encodes maximum uncertainty. In Bayesian inference, the uniform prior is the least informative choice.

Q4: Can entropy be negative? What about differential entropy?

Discrete entropy is always non-negative: H(X)0H(X) \geq 0. Each term p(x)logp(x)0-p(x)\log p(x) \geq 0 because 0p(x)10 \leq p(x) \leq 1 implies logp(x)0\log p(x) \leq 0.

Differential entropy (continuous distributions) can be negative. For example, h(X)=12ln(2πeσ2)h(X) = \frac{1}{2}\ln(2\pi e \sigma^2), which is negative for σ<12πe0.242\sigma < \frac{1}{\sqrt{2\pi e}} \approx 0.242. Differential entropy is defined relative to the Lebesgue measure - it measures spread relative to the continuous uniform distribution rather than an absolute information count.

Q5: What is the data processing inequality, and what does it mean for ML?

The data processing inequality states: if XYZX \to Y \to Z is a Markov chain (Z depends on X only through Y), then H(Z)H(Y)H(X)H(Z) \leq H(Y) \leq H(X). Equivalently, I(X;Z)I(X;Y)I(X; Z) \leq I(X; Y) - processing data cannot increase the information it contains about the original input.

For ML: every layer of a neural network can only reduce (or preserve) the information about the input. You cannot create information by processing data - only extract, reorganize, or discard it. This is the basis of the information bottleneck theory of deep learning (Lesson 04): deep networks are hypothesized to progressively compress input information while retaining task-relevant information. It also explains why feature engineering (domain knowledge) matters: it lets you preserve the right information before it is lost.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Entropy Explorer demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.