Entropy and Information

The Scenario That Motivates This Lesson

You are a senior ML engineer at a company building a medical diagnosis system. Your model outputs a probability distribution over 10 possible diagnoses. Before shipping, your tech lead asks two questions:

"How uncertain is the model on average? And when we split a decision tree node, how do we know which feature reduces uncertainty the most?"

Both questions have the same answer: entropy. The model's average uncertainty is its entropy over the output distribution. The best split is the feature that maximally reduces entropy - called information gain.

Entropy is also the foundation of every loss function in this module. Understanding it deeply means understanding why ML systems work.

What Is Information?

Start with a thought experiment. You receive two messages:

"The sun rose in the east this morning."
"A magnitude 9.0 earthquake just struck downtown Tokyo."

Which message contains more information? Intuitively, the second - because it is more surprising. The first was nearly certain to happen.

Shannon formalized this intuition. The self-information (or surprisal) of an event $x$ with probability $p(x)$ is:

$I(x) = -\log p(x)$

Properties of self-information:

Rare events have high information: if $p(x) \to 0$ , then $I(x) \to \infty$
Certain events have zero information: if $p(x) = 1$ , then $I(x) = 0$
Additive for independent events: $I(x_1, x_2) = I(x_1) + I(x_2)$ when $x_1 \perp x_2$

The choice of logarithm base determines the unit:

Base 2 → bits (most common in information theory)
Base e → nats (used in ML/PyTorch, torch.nn.CrossEntropyLoss)
Base 10 → hartleys (rare)

:::note Why the logarithm? The logarithm is the unique function (up to scaling) that is: (1) continuous, (2) monotonically decreasing in p, (3) zero for p=1, and (4) additive for independent events. These four requirements uniquely determine $I(x) = -\log p(x)$ , following from Shannon's original axioms. :::

Shannon Entropy: Expected Surprise

Entropy is the expected self-information over a distribution - the average surprise you receive when drawing from that distribution:

$H(X) = \mathbb{E}[-\log p(X)] = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$

By convention, $0 \log 0 = 0$ (since $\lim_{p \to 0} p \log p = 0$ ).

Intuitive Interpretation

Entropy measures uncertainty or spread in a distribution:

High entropy = distribution is spread out, outcome is unpredictable
Low entropy = distribution is concentrated, outcome is predictable

Entropy of coin flip distributions (base 2, in bits):

Fair coin (p=0.5):          H = 1.000 bit    ████████████████████
Biased coin (p=0.9):        H = 0.469 bits   █████████
Very biased  (p=0.99):      H = 0.081 bits   ██
Deterministic (p=1.0):      H = 0.000 bits   (no uncertainty)

Computing Entropy: Examples

Fair coin ( $p = 0.5$ ): $H = -0.5\log_2(0.5) - 0.5\log_2(0.5) = -2 \times 0.5 \times (-1) = 1 \text{ bit}$

Biased coin ( $p(\text{Heads}) = 0.9$ ): $H = -0.9\log_2(0.9) - 0.1\log_2(0.1) = 0.137 + 0.332 = 0.469 \text{ bits}$

Uniform distribution over $K$ outcomes: $H = -\sum_{i=1}^{K} \frac{1}{K}\log_2\frac{1}{K} = \log_2 K$

The uniform distribution has maximum entropy for a given number of outcomes. This is why the uniform prior is the "least informative" prior - it encodes maximum uncertainty.

The Binary Entropy Function

For a binary random variable with $P(X=1) = p$ , the entropy is:

$h(p) = -p\log_2 p - (1-p)\log_2(1-p)$

h(p) - Binary Entropy Function

1.0 |         *****
    |       **     **
    |     **         **
0.5 |    *             *
    |   *               *
    |  *                 *
    | *                   *
0.0 |*_____________________*
    0        0.5           1.0
              p

Maximum at p = 0.5 → H = 1 bit
Zero at p = 0 and p = 1 → deterministic = no uncertainty

Key properties:

$h(p) = h(1-p)$ - symmetric about $p = 0.5$
Maximum $h(p) = 1$ bit at $p = 0.5$
$h(0) = h(1) = 0$
Concave everywhere on $[0, 1]$

Python: Computing Entropy

import numpy as np
from scipy.stats import entropy as scipy_entropy


def entropy(p: np.ndarray, base: float = 2) -> float:
    """
    Compute Shannon entropy of a probability distribution.

    Args:
        p:    probability array (must sum to 1)
        base: logarithm base (2 = bits, np.e = nats)

    Returns:
        entropy value
    """
    p = np.asarray(p, dtype=float)
    assert np.isclose(p.sum(), 1.0), f"Probabilities must sum to 1, got {p.sum():.6f}"
    assert np.all(p >= 0), "Probabilities must be non-negative"

    # Handle p=0 terms: 0 * log(0) = 0 by convention
    mask = p > 0
    log_p = np.log(p[mask]) / np.log(base)
    return float(-np.sum(p[mask] * log_p))


# --- Example 1: Compare distributions ---
distributions = {
    "Uniform (4 outcomes)":  [0.25, 0.25, 0.25, 0.25],
    "Peaked (one dominant)": [0.70, 0.10, 0.10, 0.10],
    "Near-deterministic":    [0.97, 0.01, 0.01, 0.01],
    "Deterministic":         [1.00, 0.00, 0.00, 0.00],
}

print(f"{'Distribution':<30} | {'Entropy (bits)':>14}")
print("-" * 50)
for name, p in distributions.items():
    h = entropy(p, base=2)
    print(f"{name:<30} | {h:>14.4f}")

# Output:
# Uniform (4 outcomes)           |         2.0000
# Peaked (one dominant)          |         1.3568
# Near-deterministic             |         0.2423
# Deterministic                  |         0.0000


# --- Example 2: Binary entropy function ---
p_values = np.linspace(1e-6, 1 - 1e-6, 1000)
h_binary = -p_values * np.log2(p_values) - (1 - p_values) * np.log2(1 - p_values)

print(f"\nBinary entropy at p=0.5:  {entropy([0.5, 0.5]):.6f} bits")
print(f"Binary entropy at p=0.1:  {entropy([0.1, 0.9]):.6f} bits")
print(f"Binary entropy at p=0.9:  {entropy([0.9, 0.1]):.6f} bits")
# Notice h(0.1) == h(0.9) - the function is symmetric


# --- Example 3: Entropy of typical model outputs ---
def model_entropy_analysis():
    """Compare entropy for different prediction confidence levels."""
    examples = {
        "Confident correct prediction": [0.95, 0.03, 0.01, 0.01],
        "Uncertain prediction":          [0.30, 0.28, 0.22, 0.20],
        "Uniform uncertainty":           [0.25, 0.25, 0.25, 0.25],
        "Confused (two candidates)":     [0.49, 0.49, 0.01, 0.01],
    }

    max_h = np.log2(4)  # 4 classes → max = 2 bits

    print(f"\n{'Scenario':<35} {'Entropy':>10} {'% of max':>10}")
    print("-" * 57)
    for name, p in examples.items():
        h = entropy(p)
        print(f"{name:<35} {h:>10.4f} {100 * h / max_h:>9.1f}%")

model_entropy_analysis()

Entropy of Common Probability Distributions

Discrete Distributions

Distribution	Parameters	Entropy (nats)
Bernoulli( $p$ )	$p \in (0,1)$	$-p\ln p - (1-p)\ln(1-p)$
Uniform over $K$	$K$	$\ln K$
Geometric( $p$ )	$p$	$\frac{-(1-p)\ln(1-p) - p\ln p}{p}$
Poisson( $\lambda$ )	$\lambda > 0$	$\approx \frac{1}{2}\ln(2\pi e\lambda)$ for large $\lambda$
Categorical( $\mathbf{p}$ )	$K$ -vector	$-\sum_k p_k \ln p_k$

Maximum Entropy Principle

Among all discrete distributions over $K$ outcomes, the uniform distribution maximizes entropy:

$H(X) \leq \log K$

with equality if and only if $X$ is uniform. This follows from the concavity of $-\log$ and Jensen's inequality.

This principle is widely used: the maximum-entropy distribution subject to known constraints is the least biased distribution consistent with the constraints. It underlies:

Choice of uniform priors in Bayesian inference
Softmax's role as the maximum-entropy classifier
The normalization in attention mechanisms

Differential Entropy: Continuous Random Variables

For a continuous random variable $X$ with probability density function $f(x)$ , the differential entropy is:

$h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx$

:::warning Differential Entropy Can Be Negative Unlike discrete entropy, differential entropy can be negative. A very narrow Gaussian has negative differential entropy. This is because differential entropy is defined relative to the Lebesgue measure - it measures spread relative to a continuous uniform distribution, not an absolute bit count. :::

Differential Entropy of Key Distributions

Gaussian $X \sim \mathcal{N}(\mu, \sigma^2)$ :

$h(X) = \frac{1}{2}\ln(2\pi e \sigma^2)$

The Gaussian has maximum differential entropy among all distributions with a given variance. This is why the Gaussian arises as the maximum-entropy distribution under a second-moment constraint.

Exponential $X \sim \text{Exp}(\lambda)$ :

$h(X) = 1 - \ln\lambda$

Uniform $X \sim \text{Uniform}(a, b)$ :

$h(X) = \ln(b - a)$

import numpy as np
from scipy import stats


def differential_entropy_gaussian(sigma: float) -> float:
    """Differential entropy of N(mu, sigma^2) in nats."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2)


def differential_entropy_exponential(lam: float) -> float:
    """Differential entropy of Exp(lambda) in nats."""
    return 1 - np.log(lam)


def differential_entropy_uniform(a: float, b: float) -> float:
    """Differential entropy of Uniform(a, b) in nats."""
    return np.log(b - a)


# Gaussian entropy grows logarithmically with variance
print("Gaussian Differential Entropy vs. Width:")
print(f"{'Sigma':>8} | {'h(X) nats':>12} | {'h(X) bits':>12}")
print("-" * 38)
for sigma in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]:
    h_nats = differential_entropy_gaussian(sigma)
    h_bits = h_nats / np.log(2)
    marker = " ← negative!" if h_nats < 0 else ""
    print(f"{sigma:>8.1f} | {h_nats:>12.4f} | {h_bits:>12.4f}{marker}")

# Note: sigma < 1/sqrt(2*pi*e) ≈ 0.242 gives negative entropy

# Verify against scipy
rv = stats.norm(loc=0, scale=2.0)
print(f"\nScipy entropy for N(0, 4): {rv.entropy():.6f} nats")
print(f"Formula result:             {differential_entropy_gaussian(2.0):.6f} nats")

Joint Entropy and Conditional Entropy

Joint Entropy

For two random variables $X$ and $Y$ :

$H(X, Y) = -\sum_{x, y} p(x, y) \log p(x, y)$

Chain rule of entropy:

$H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$

Conditional Entropy

$H(Y|X) = -\sum_{x} p(x) \sum_{y} p(y|x) \log p(y|x) = \mathbb{E}_X\bigl[H(Y|X=x)\bigr]$

Key inequality - conditioning reduces entropy:

$H(Y|X) \leq H(Y)$

Conditioning on $X$ can only reduce (or keep equal) the entropy of $Y$ . Knowing more can never increase average uncertainty. This is the foundation of information gain.

Venn diagram of entropy quantities:

H(X, Y)
┌──────────────────────────────────────────────┐
│                                              │
│    H(X)            │         H(Y)            │
│  ┌─────────────┐   │   ┌─────────────┐       │
│  │             │   │   │             │       │
│  │   H(X|Y)   │ I(X;Y) │   H(Y|X)   │       │
│  │             │   │   │             │       │
│  └─────────────┴───┴───┴─────────────┘       │
│                                              │
└──────────────────────────────────────────────┘

H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
I(X;Y) = H(X) + H(Y) - H(X,Y)   [mutual information, Lesson 04]

ML Connection: Decision Trees and Information Gain

:::info ML Connection - Decision Trees Decision trees select the feature that maximizes information gain - the reduction in label entropy after the split. This is the ID3 and C4.5 algorithm's core criterion. :::

Information Gain of splitting dataset $S$ by feature $A$ :

$\text{IG}(S, A) = H(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} H(S_v)$

where $H(S)$ is the entropy of the class label distribution in set $S$ , and $S_v$ is the subset where $A = v$ .

import numpy as np


def dataset_entropy(labels: np.ndarray) -> float:
    """Shannon entropy (bits) of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    mask = probs > 0
    return float(-np.sum(probs[mask] * np.log2(probs[mask])))


def information_gain(labels: np.ndarray, feature: np.ndarray) -> float:
    """
    Information gain of splitting by a categorical feature.

    Args:
        labels:  array of class labels
        feature: array of feature values (categorical)

    Returns:
        information gain in bits
    """
    parent_h = dataset_entropy(labels)

    values, counts = np.unique(feature, return_counts=True)
    n = len(labels)

    weighted_child_h = sum(
        (count / n) * dataset_entropy(labels[feature == v])
        for v, count in zip(values, counts)
    )

    return parent_h - weighted_child_h


# Classic "Play Tennis" dataset
# Labels: 1=play, 0=don't play
labels  = np.array([1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0])
weather = np.array([0, 0, 1, 2, 2, 2, 1, 0, 0, 2, 0, 1, 1, 2])  # 0=sunny,1=overcast,2=rain
wind    = np.array([0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1])  # 0=weak,1=strong
humidity= np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0])  # 0=high,1=normal

parent_h = dataset_entropy(labels)
ig_weather  = information_gain(labels, weather)
ig_wind     = information_gain(labels, wind)
ig_humidity = information_gain(labels, humidity)

print(f"Parent entropy:              {parent_h:.4f} bits")
print(f"IG(Weather):                 {ig_weather:.4f} bits")
print(f"IG(Wind):                    {ig_wind:.4f} bits")
print(f"IG(Humidity):                {ig_humidity:.4f} bits")

features = {"Weather": ig_weather, "Wind": ig_wind, "Humidity": ig_humidity}
best = max(features, key=features.get)
print(f"\nBest first split: {best} (IG = {features[best]:.4f} bits)")

# Output:
# Parent entropy:              0.9403 bits
# IG(Weather):                 0.2467 bits
# IG(Wind):                    0.0481 bits
# IG(Humidity):                0.1518 bits
# Best first split: Weather (IG = 0.2467 bits)

Comparison - Entropy vs. Gini Impurity:

Criterion        | Formula                    | Properties
-----------------+----------------------------+----------------------------
Entropy (ID3/C4.5)| -Σ p_k log p_k            | Theoretically grounded in IT
Gini (CART)      | 1 - Σ p_k²                 | Faster to compute, similar splits
Log Loss (XGBoost)| Cross-entropy gradient     | Differentiable, gradient boosting

Entropy is more sensitive to rare class impurity; Gini is slightly cheaper to compute. In practice, they produce nearly identical trees.

Properties of Entropy

Key Inequalities and Laws

Property	Statement	Implication
Non-negativity	$H(X) \geq 0$	Entropy is always non-negative
Maximum entropy	$H(X) \leq \log \\|\mathcal{X}\\|$	Uniform is maximally uncertain
Conditioning reduces entropy	$H(X\\|Y) \leq H(X)$	Information never hurts
Chain rule	$H(X,Y) = H(X) + H(Y\\|X)$	Joint = marginal + conditional
Subadditivity	$H(X,Y) \leq H(X) + H(Y)$	Equality iff $X \perp Y$
Data processing inequality	$H(f(X)) \leq H(X)$	Processing cannot create info

Entropy and Compression

Shannon's source coding theorem: the minimum average number of bits to encode samples from $X$ using an optimal code equals $H(X)$ bits per symbol. You cannot do better.

This connects entropy to the fundamental limits of compression - the topic of Lesson 05.

Entropy Across the ML Stack

ML Task                     | How Entropy Is Used
----------------------------+-----------------------------------------------
Decision trees              | Splitting criterion - maximize information gain
Language models             | Perplexity = 2^H(test data | model)
Uncertainty quantification  | Model entropy = prediction confidence
Active learning             | Query maximum entropy samples
MaxEnt RL (SAC)             | Entropy bonus encourages exploration
VAE encoder                 | Entropy of q(z|x) in ELBO bound
Calibration                 | Well-calibrated models: entropy matches freq.
Feature selection           | Mutual information (entropy-based, Lesson 04)
Clustering purity           | Cluster label entropy = cluster quality

MaxEnt Reinforcement Learning

In Soft Actor-Critic (SAC), the objective augments reward with an entropy bonus:

$J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\left[r(s_t, a_t) + \alpha \, H(\pi(\cdot|s_t))\right]$

The entropy term $\alpha H(\pi(\cdot|s_t))$ rewards the policy for taking diverse actions. This prevents premature exploitation, stabilizes training, and encourages exploration of the state space.

Entropy Regularization in PyTorch

import torch
import torch.nn.functional as F


def entropy_regularization(
    logits: torch.Tensor,
    alpha: float = 0.01
) -> torch.Tensor:
    """
    Entropy regularization term for MaxEnt RL and semi-supervised learning.
    Encourages high-entropy (uncertain) predictions when added to the loss.

    Args:
        logits: raw model outputs, shape (batch, num_classes)
        alpha:  regularization strength

    Returns:
        negative mean entropy (add to loss to maximize entropy)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = torch.exp(log_probs)

    # H = -sum(p * log_p) per sample
    per_sample_entropy = -torch.sum(probs * log_probs, dim=-1)
    mean_entropy = per_sample_entropy.mean()

    return -alpha * mean_entropy  # negative because we minimize losses


# Demonstrate: confident vs uncertain outputs
torch.manual_seed(42)

confident_logits = torch.tensor([[5.0, -2.0, -2.0, -2.0]] * 4)
uncertain_logits = torch.tensor([[0.1, 0.1, -0.1, -0.1]] * 4)

def compute_entropy(logits):
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    return -torch.sum(probs * log_probs, dim=-1).mean().item()

print(f"Entropy of confident model: {compute_entropy(confident_logits):.4f} nats")
print(f"Entropy of uncertain model: {compute_entropy(uncertain_logits):.4f} nats")
print(f"Max entropy (4 classes):    {torch.log(torch.tensor(4.0)).item():.4f} nats")

Interview Questions and Answers

Q1: What is entropy in information theory, and why do we use it in decision trees?

Entropy $H(X) = -\sum_x p(x)\log p(x)$ measures the average uncertainty (expected surprise) in a probability distribution. It is maximized when the distribution is uniform (maximum uncertainty) and zero when deterministic (no uncertainty).

In decision trees, we use entropy to measure label impurity at a node. A pure node (all labels identical) has entropy 0. A maximally mixed node has entropy $\log_2 K$ for $K$ classes. We select the feature maximizing information gain - the reduction in weighted average entropy after splitting. This greedily reduces label uncertainty most efficiently.

Q2: What is the relationship between entropy and the minimum description length of a message?

Shannon's source coding theorem states that the minimum average number of bits to encode a message from source $X$ using an optimal lossless code equals $H(X)$ bits per symbol. No code can do better on average. Huffman coding achieves within 1 bit of this bound per symbol; arithmetic coding can approach the bound arbitrarily closely.

For ML: a language model with cross-entropy $H$ nats per token needs at least $H$ nats per token to encode text. Perplexity $= e^H$ quantifies compression efficiency - lower perplexity = better compression = better language model.

Q3: Why is the uniform distribution the maximum entropy distribution?

Among all distributions over $K$ discrete outcomes, the uniform $p(x) = 1/K$ maximizes entropy. Proof by Lagrange multipliers: maximize $H = -\sum_k p_k \ln p_k$ subject to $\sum_k p_k = 1$ . The Lagrangian optimality condition $-\ln p_k - 1 = \lambda$ for all $k$ forces all $p_k$ to be equal.

Intuitively: the uniform distribution makes the fewest assumptions - it encodes maximum uncertainty. In Bayesian inference, the uniform prior is the least informative choice.

Q4: Can entropy be negative? What about differential entropy?

Discrete entropy is always non-negative: $H(X) \geq 0$ . Each term $-p(x)\log p(x) \geq 0$ because $0 \leq p(x) \leq 1$ implies $\log p(x) \leq 0$ .

Differential entropy (continuous distributions) can be negative. For example, $h(X) = \frac{1}{2}\ln(2\pi e \sigma^2)$ , which is negative for $\sigma < \frac{1}{\sqrt{2\pi e}} \approx 0.242$ . Differential entropy is defined relative to the Lebesgue measure - it measures spread relative to the continuous uniform distribution rather than an absolute information count.

Q5: What is the data processing inequality, and what does it mean for ML?

The data processing inequality states: if $X \to Y \to Z$ is a Markov chain (Z depends on X only through Y), then $H(Z) \leq H(Y) \leq H(X)$ . Equivalently, $I(X; Z) \leq I(X; Y)$ - processing data cannot increase the information it contains about the original input.

For ML: every layer of a neural network can only reduce (or preserve) the information about the input. You cannot create information by processing data - only extract, reorganize, or discard it. This is the basis of the information bottleneck theory of deep learning (Lesson 04): deep networks are hypothesized to progressively compress input information while retaining task-relevant information. It also explains why feature engineering (domain knowledge) matters: it lets you preserve the right information before it is lost.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Entropy Explorer demo on the EngineersOfAI Playground - no code required.

:::

The Scenario That Motivates This Lesson​

What Is Information?​

Shannon Entropy: Expected Surprise​

Intuitive Interpretation​

Computing Entropy: Examples​

The Binary Entropy Function​

Python: Computing Entropy​

Entropy of Common Probability Distributions​

Discrete Distributions​

Maximum Entropy Principle​

Differential Entropy: Continuous Random Variables​

Differential Entropy of Key Distributions​

Joint Entropy and Conditional Entropy​

Joint Entropy​

Conditional Entropy​

ML Connection: Decision Trees and Information Gain​

Properties of Entropy​

Key Inequalities and Laws​

Entropy and Compression​

Entropy Across the ML Stack​

MaxEnt Reinforcement Learning​

Entropy Regularization in PyTorch​

Interview Questions and Answers​