
Module 05 - Information Theory

Why Information Theory Is the Hidden Foundation of ML

Every time you train a neural network with cross-entropy loss, tune a VAE's KL penalty, select features by mutual information, or evaluate a language model with perplexity - you are doing applied information theory. Yet most ML engineers learn these tools as isolated recipes without understanding the unified theory that connects them.

Information theory, pioneered by Claude Shannon in his 1948 paper "A Mathematical Theory of Communication," answers a deceptively simple question: how much information is in a message? The answers Shannon derived turned out to be exactly the right mathematical framework for:

  • Measuring uncertainty in probability distributions (entropy)
  • Comparing two probability distributions (KL divergence, cross-entropy)
  • Quantifying how much two variables share (mutual information)
  • Understanding why simpler models generalize better (MDL, compression)
  • Training generative models to match data distributions (VAEs, GANs)

This module builds that unified understanding - from first principles through cutting-edge applications in deep learning.

Module Map

Lesson Overview

| # | Lesson | Core Concept | ML Application |
|---|---|---|---|
| 01 | Entropy and Information | H(X) = -Σ p(x) log p(x) | Decision tree splitting, uncertainty quantification |
| 02 | KL Divergence | D_KL(P ‖ Q) | VAE training, PPO policy updates |
| 03 | Cross-Entropy and Loss Functions | H(P,Q) = H(P) + D_KL(P ‖ Q) | The standard loss for classification models |
| 04 | Mutual Information | I(X;Y) | Feature selection, information bottleneck |
| 05 | Data Compression Fundamentals | Source coding theorem | Perplexity, model evaluation, LLMs as compressors |
| 06 | Information Geometry | Fisher information matrix | Natural gradient, K-FAC, second-order optimization |
| 07 | Minimum Description Length | Best model = shortest description | Regularization theory, Occam's razor formalized |

The Fundamental Insight: Information = Surprise

Shannon's key insight was that information content is inversely related to probability. An event that almost certainly happens (p ≈ 1) tells you almost nothing - you already knew it was coming. An extremely rare event (p ≈ 0) carries a lot of information - it was surprising.

This is captured in the self-information of an event:

$$I(x) = -\log_2 p(x) \quad \text{(bits)}$$

| Event | Probability | Information |
|---|---|---|
| Flipping heads | 0.5 | 1 bit |
| Rolling a 6 on a fair die | 1/6 | 2.58 bits |
| Drawing ace of spades | 1/52 | 5.70 bits |
| Predicting tomorrow's exact stock price | ~0 | ∞ bits |
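
The finite rows of the table can be reproduced with a few lines of NumPy. This is a minimal sketch; the helper name `self_information` is illustrative and not part of any library.

```python
import numpy as np

def self_information(p, base=2.0):
    """Self-information -log_base(p) of an event with probability p."""
    p = np.asarray(p, dtype=float)
    # Change of base: log_b(p) = ln(p) / ln(b)
    return -np.log(p) / np.log(base)

events = {
    "Flipping heads": 0.5,
    "Rolling a 6 on a fair die": 1 / 6,
    "Drawing ace of spades": 1 / 52,
}

for name, p in events.items():
    print(f"{name}: {self_information(p):.2f} bits")
```

As p shrinks toward 0 the value grows without bound, which is why the last row of the table is listed as infinite.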

Entropy is the expected information content - the average surprise over a distribution:

$$H(X) = \mathbb{E}[-\log p(X)] = -\sum_x p(x) \log p(x)$$

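This single formula and its extensions (cross-entropy, KL divergence, mutual information) underlie the loss functions and evaluation metrics covered in the rest of this module.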
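
As a quick illustration of entropy as average surprise, the sketch below computes H(X) in bits for a few small distributions. The function name `entropy` and the example distributions are assumptions chosen for illustration.

```python
import numpy as np

def entropy(p, base=2.0):
    """Shannon entropy of a discrete distribution p (entries sum to 1)."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0  # convention: 0 * log(0) contributes 0
    return float(-np.sum(p[nonzero] * np.log(p[nonzero])) / np.log(base))

fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
fair_die = np.full(6, 1 / 6)
certain = [1.0, 0.0]

for name, dist in [("fair coin", fair_coin), ("biased coin", biased_coin),
                   ("fair die", fair_die), ("certain outcome", certain)]:
    print(f"H({name}) = {entropy(dist):.3f} bits")
```

The biased coin has lower entropy than the fair coin because its outcome is easier to predict, and a certain outcome has zero entropy.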

How Information Theory Connects ML Concepts

┌─────────────────────────────────────────────────────────────┐
│                     INFORMATION THEORY                      │
│                                                             │
│  H(X) ───────────────────────────────── Entropy             │
│   │                                     (uncertainty)       │
│   │   H(P,Q) = H(P) + D_KL(P||Q)                            │
│   ├──────────────────────────────────── Cross-Entropy       │
│   │            │                        (loss function)     │
│   │            └─────────────────────── KL Divergence       │
│   │                                     (distribution gap)  │
│   │                                                         │
│   ├── I(X;Y) = H(X)+H(Y)-H(X,Y) ─────── Mutual Information  │
│   │                                     (feature relevance) │
│   │                                                         │
│   └── Shannon's Source Coding ───────── Compression         │
│         │                               (perplexity)        │
│         └── Fisher Info ─────────────── Geometry            │
│               │                         (natural grad)      │
│               └── MDL ───────────────── Regularization      │
└─────────────────────────────────────────────────────────────┘
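
The two identities in the diagram can be checked numerically. The sketch below uses small made-up distributions with strictly positive entries (an assumption that keeps the logarithms finite); the function names are illustrative.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; assumes every entry of p is > 0."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log2(p)))

def cross_entropy(p, q):
    """H(P,Q) = -sum_x p(x) log2 q(x)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log2(q)))

def kl_divergence(p, q):
    """D_KL(P||Q) = sum_x p(x) log2(p(x) / q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(p / q)))

# Identity 1: H(P,Q) = H(P) + D_KL(P||Q)
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
ce_direct = cross_entropy(p, q)
ce_decomposed = entropy(p) + kl_divergence(p, q)

# Identity 2: I(X;Y) = H(X) + H(Y) - H(X,Y), using a 2x2 joint distribution
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])
p_x = joint.sum(axis=1)  # marginal of X (rows)
p_y = joint.sum(axis=0)  # marginal of Y (columns)
mi_from_entropies = entropy(p_x) + entropy(p_y) - entropy(joint.ravel())
# Equivalent definition: KL between the joint and the product of marginals
mi_from_kl = kl_divergence(joint.ravel(), np.outer(p_x, p_y).ravel())

print(ce_direct, ce_decomposed)
print(mi_from_entropies, mi_from_kl)
```

Each printed pair should agree up to floating-point error.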

Prerequisites

Before diving into this module, you should be comfortable with:

  • Probability theory (Module 03): probability distributions, expectation, conditional probability, Bayes' theorem
  • Calculus (Module 02): derivatives, integrals, chain rule (especially for entropy gradients)
  • Statistics (Module 04): maximum likelihood estimation (MLE connects directly to cross-entropy)
  • NumPy/Python: we write code throughout - familiarity with np.log, np.sum, distributions

:::tip If you haven't done MLE yet
The connection between cross-entropy minimization and maximum likelihood estimation is one of the most important in all of ML. If you haven't studied MLE, read Lesson 01 of Module 04 before Lesson 03 of this module.
:::

Learning Objectives

By the end of this module, you will be able to:

  1. Compute and interpret entropy for any discrete or continuous distribution, and explain why decision trees use entropy for splitting
  2. Derive the KL divergence and explain why it is not symmetric, giving geometric intuition for forward vs. reverse KL
  3. Implement cross-entropy loss from scratch and prove why minimizing it is equivalent to maximum likelihood estimation
  4. Use mutual information for feature selection and explain the information bottleneck principle
  5. Compute perplexity and explain why it measures language model quality as a compression ratio
  6. Explain the Fisher information matrix and why natural gradient descent is geometrically principled
  7. Apply the MDL principle to explain regularization and model selection

The Big Picture: Why These Tools Are Used in ML

Cross-Entropy Loss

When you train a classifier, you minimize:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log \hat{p}_{ic}$$

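This is the cross-entropy between the true label distribution y_i (one-hot for hard labels) and the predicted distribution p̂_i, averaged over N examples. It usually works better than MSE for classification because it directly measures the "distributional gap" between the two distributions - exactly what information theory quantifies - and it penalizes confident wrong predictions heavily.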
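
A minimal NumPy sketch of this loss, assuming the model outputs raw logits and the labels are integer class indices. The helper names `log_softmax` and `cross_entropy_loss` are illustrative; the result is in nats because the natural logarithm is used.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the class axis (axis=1)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cross_entropy_loss(logits, labels):
    """Mean of -log p_hat[i, labels[i]] over the batch."""
    log_probs = log_softmax(logits)
    n = logits.shape[0]
    return float(-log_probs[np.arange(n), labels].mean())

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
labels = np.array([0, 2])

loss = cross_entropy_loss(logits, labels)

# Same value from the double-sum formula with one-hot labels y_ic
one_hot = np.eye(logits.shape[1])[labels]
loss_from_formula = float(-np.sum(one_hot * log_softmax(logits)) / len(labels))

print(loss, loss_from_formula)
```

With one-hot labels the inner sum over classes keeps only the true class, so the index lookup and the double sum give the same number.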

VAE ELBO

The variational autoencoder loss is the negative evidence lower bound (ELBO):

$$\mathcal{L}_{\text{VAE}} = \underbrace{-\mathbb{E}_{q(z|x)}[\log p(x|z)]}_{\text{reconstruction}} + \underbrace{D_{\text{KL}}(q(z|x) \| p(z))}_{\text{regularization}}$$

The KL term is literally the information-theoretic divergence between the encoder's approximate posterior q(z|x) and the prior p(z) - you are penalizing the model for encoding more information about x in z than necessary.
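
For the common case of a diagonal Gaussian encoder and a standard normal prior (an assumption, since the text above does not fix the distributions), the KL term has the closed form 0.5 · Σ_j (μ_j² + σ_j² - 1 - log σ_j²). A sketch with an illustrative function name:

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats, per example.

    mu and log_var have shape (batch, latent_dim).
    """
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

# Hypothetical encoder outputs for two inputs, latent dimension 2
mu = np.array([[0.0, 0.0],
               [1.0, -1.0]])
log_var = np.zeros((2, 2))  # unit variance in every latent dimension

kl_per_example = gaussian_kl_to_standard_normal(mu, log_var)
print(kl_per_example)
```

The first example matches the prior exactly and pays no penalty; the second shifts its mean away from zero and pays for the extra information it encodes.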

Language Model Perplexity

$$\text{PPL} = 2^{H(P_{\text{test}}, P_{\text{model}})}$$

Perplexity is exponentiated cross-entropy, here measured in bits per token so that the base is 2. A perplexity of 50 means the model is, on average, as uncertain as if it had to choose uniformly among 50 equally likely next tokens. Because cross-entropy is also the expected code length per token, perplexity is a direct information-theoretic measure of how well the model has compressed language.
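
A small sketch of the computation, assuming we already have the probability the model assigned to each observed token in a test sequence. The function name and the probabilities are made up for illustration.

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity from the model's probability for each observed token."""
    token_probs = np.asarray(token_probs, dtype=float)
    # Empirical cross-entropy in bits per token
    cross_entropy_bits = -np.mean(np.log2(token_probs))
    return float(2.0 ** cross_entropy_bits)

# A model that always spreads its mass uniformly over 50 tokens
uniform_over_50 = np.full(1000, 1 / 50)
print(perplexity(uniform_over_50))

# A hypothetical, more confident model on a short sequence
confident = [0.9, 0.8, 0.95, 0.7]
print(perplexity(confident))
```

The uniform case recovers a perplexity of 50 (up to floating-point error), matching the interpretation above, while the confident model lands close to 1.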

PPO's KL Constraint

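Proximal Policy Optimization keeps each update close to the previous policy. Its predecessor, TRPO, enforces this as an explicit constraint, which PPO approximates with a clipped objective or a KL penalty:

$$\mathbb{E}_s\left[D_{\text{KL}}\left(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\right)\right] \leq \delta$$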

This prevents the new policy from deviating too far from the old one - measured in information-theoretic terms, not just parameter space.
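
To make the constraint concrete, the sketch below measures the KL divergence between two hypothetical discrete policies over three actions in two states. The action probabilities and the threshold `delta` are made-up values, and both KL directions are computed because the divergence is not symmetric.

```python
import numpy as np

def categorical_kl(p, q):
    """KL(p || q) in nats for each row of two (states, actions) arrays."""
    return np.sum(p * np.log(p / q), axis=-1)

# Hypothetical action probabilities: rows are states, columns are actions
pi_old = np.array([[0.5, 0.3, 0.2],
                   [0.1, 0.6, 0.3]])
pi_new = np.array([[0.4, 0.4, 0.2],
                   [0.2, 0.5, 0.3]])

# Average over states, in each direction
kl_old_new = float(categorical_kl(pi_old, pi_new).mean())
kl_new_old = float(categorical_kl(pi_new, pi_old).mean())

delta = 0.05  # illustrative trust-region size
within_trust_region = kl_old_new <= delta

print(kl_old_new, kl_new_old, within_trust_region)
```

Two parameter vectors that are close in Euclidean distance can still produce very different action distributions, which is why the check is done on the distributions themselves.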

Historical Context

| Year | Milestone | Impact on ML |
|---|---|---|
| 1948 | Shannon's "A Mathematical Theory of Communication" | Founded the entire field |
| 1951 | Huffman coding | Optimal prefix codes |
| 1961 | Rényi entropy | Generalized entropy family |
| 1965 | Kolmogorov complexity | MDL principle foundations |
| 1978 | MDL principle (Rissanen) | Formal basis for model selection |
| 1998 | Amari's Information Geometry | Natural gradient for neural nets |
| 2013 | VAEs (Kingma & Welling) | KL divergence in deep generative models |
| 2015 | Information Bottleneck for DNNs | Tishby's theory of deep learning |
| 2017 | PPO (Schulman et al.) | KL-regularized policy updates in RL |
| 2020+ | LLMs evaluated by perplexity | Information theory as model benchmark |

Notation Reference

Throughout this module we use:

| Symbol | Meaning |
|---|---|
| $H(X)$ | Entropy of random variable X |
| $H(P, Q)$ | Cross-entropy between distributions P and Q |
| $D_{\text{KL}}(P \parallel Q)$ | KL divergence from Q to P |
| $I(X;Y)$ | Mutual information between X and Y |
| $h(X)$ | Differential entropy (continuous) |
| $p(x)$ | Probability mass/density of x |
| $\mathbb{E}_p[\cdot]$ | Expectation under distribution p |
| $\mathcal{F}$ | Fisher information matrix |
| $K(x)$ | Kolmogorov complexity of x |
| nats | Entropy in base-e logarithms |
| bits | Entropy in base-2 logarithms |

:::note Bits vs. Nats
ML frameworks (PyTorch, TensorFlow) use the natural logarithm (base e), so loss values are in nats. Information theory texts often use base-2 logarithms, giving bits. The conversion is 1 nat = log₂(e) ≈ 1.443 bits. Both are valid - just be consistent.
:::
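
The conversion is a one-line rescaling. The sketch below assumes a hypothetical loss value of 2.0 nats and shows that perplexity comes out the same in either unit, provided the exponent base matches the logarithm base.

```python
import numpy as np

NATS_TO_BITS = np.log2(np.e)  # about 1.443

def nats_to_bits(value_in_nats):
    return value_in_nats * NATS_TO_BITS

def bits_to_nats(value_in_bits):
    return value_in_bits / NATS_TO_BITS  # same as multiplying by ln(2)

loss_nats = 2.0  # hypothetical cross-entropy reported in nats
loss_bits = nats_to_bits(loss_nats)

# Perplexity is unit-independent when base and logarithm agree
ppl_from_nats = np.exp(loss_nats)
ppl_from_bits = 2.0 ** loss_bits

print(loss_bits, ppl_from_nats, ppl_from_bits)
```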

How to Use This Module

For ML engineers building production systems: focus on Lessons 01–04 for the day-to-day tools (entropy, KL, cross-entropy, MI), then Lesson 05 for understanding perplexity and model evaluation.

For researchers working on generative models or RL: Lessons 02 and 06 are critical - KL divergence and information geometry underpin VAEs, diffusion models, and policy optimization.

For those preparing for ML interviews: every lesson has an Interview Q&A section. The most commonly tested topics are cross-entropy vs. MSE (Lesson 03), KL divergence asymmetry (Lesson 02), and entropy in decision trees (Lesson 01).

For deep learning theorists: Lessons 04 and 07 connect to fundamental questions about why deep networks generalize - the information bottleneck and MDL perspectives.

Let's begin.

© 2026 EngineersOfAI. All rights reserved.