Module 05 - Information Theory
Why Information Theory Is the Hidden Foundation of ML
Every time you train a neural network with cross-entropy loss, tune a VAE's KL penalty, select features by mutual information, or evaluate a language model with perplexity - you are doing applied information theory. Yet most ML engineers learn these tools as isolated recipes without understanding the unified theory that connects them.
Information theory, pioneered by Claude Shannon in his 1948 paper "A Mathematical Theory of Communication," answers a deceptively simple question: how much information is in a message? The answers Shannon derived turned out to be exactly the right mathematical framework for:
- Measuring uncertainty in probability distributions (entropy)
- Comparing two probability distributions (KL divergence, cross-entropy)
- Quantifying how much two variables share (mutual information)
- Understanding why simpler models generalize better (MDL, compression)
- Training generative models to match data distributions (VAEs, GANs)
This module builds that unified understanding - from first principles through cutting-edge applications in deep learning.
Module Map
Lesson Overview
| # | Lesson | Core Concept | ML Application |
|---|---|---|---|
| 01 | Entropy and Information | H(X) = -Σ p(x) log p(x) | Decision tree splitting, uncertainty quantification |
| 02 | KL Divergence | D_KL(P||Q) | VAE training, PPO policy updates |
| 03 | Cross-Entropy and Loss Functions | H(P,Q) = H(P) + D_KL(P||Q) | Standard loss for classification models |
| 04 | Mutual Information | I(X;Y) | Feature selection, information bottleneck |
| 05 | Data Compression Fundamentals | Source coding theorem | Perplexity, model evaluation, LLMs as compressors |
| 06 | Information Geometry | Fisher information matrix | Natural gradient, K-FAC, second-order optimization |
| 07 | Minimum Description Length | Best model = shortest description | Regularization theory, Occam's razor formalized |
The Fundamental Insight: Information = Surprise
Shannon's key insight was that information content is inversely related to probability. An event that almost certainly happens (p ≈ 1) tells you almost nothing - you already knew it was coming. An extremely rare event (p ≈ 0) carries a lot of information - it was surprising.
This is captured in the self-information of an event x with probability p(x), measured in bits when the logarithm is base 2:
$$I(x) = -\log_2 p(x)$$
| Event | Probability | Information |
|---|---|---|
| Flipping heads | 0.5 | 1 bit |
| Rolling a 6 on a fair die | 1/6 | 2.58 bits |
| Drawing ace of spades | 1/52 | 5.70 bits |
| Predicting tomorrow's exact stock price | ≈ 0 | Unbounded (→ ∞ as p → 0) |
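The finite rows of the table can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `self_information` is our own for illustration, not part of any library.

```python
import math

def self_information(p: float) -> float:
    """Self-information in bits of an event with probability p, 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

# Probabilities taken from the table above.
events = {
    "Flipping heads": 0.5,
    "Rolling a 6 on a fair die": 1 / 6,
    "Drawing ace of spades": 1 / 52,
}

for name, p in events.items():
    print(f"{name}: {self_information(p):.2f} bits")
```

Halving the probability of an event always adds exactly one bit, which is why the logarithm is the natural scale for surprise.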
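Entropy is the expected information content - the average surprise over a distribution:
$$H(X) = -\sum_{x} p(x) \log p(x)$$
This single formula and its extensions (cross-entropy, KL divergence, mutual information) underlie most of the loss functions and evaluation metrics used in modern ML.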
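As a rough illustration, entropy H(X) = -Σ p(x) log p(x) for a discrete distribution can be computed with NumPy as below. The function name `entropy` and its `base` parameter are illustrative choices, and zero-probability outcomes are skipped under the usual convention that 0 log 0 = 0.

```python
import numpy as np

def entropy(p, base=2.0):
    """Shannon entropy of a discrete distribution p (defaults to bits)."""
    p = np.asarray(p, dtype=float)
    if np.any(p < 0) or not np.isclose(p.sum(), 1.0):
        raise ValueError("p must be a valid probability distribution")
    nonzero = p[p > 0]  # convention: 0 * log(0) contributes nothing
    return float(-np.sum(nonzero * np.log(nonzero)) / np.log(base))

fair_coin = entropy([0.5, 0.5])      # maximal uncertainty for two outcomes
biased_coin = entropy([0.9, 0.1])    # more predictable, so lower entropy
fair_die = entropy(np.full(6, 1 / 6))

print(f"fair coin:   {fair_coin:.3f} bits")
print(f"biased coin: {biased_coin:.3f} bits")
print(f"fair die:    {fair_die:.3f} bits")
```

A uniform distribution over n outcomes always gives the maximum value, log n, which is why the fair die scores higher than either coin.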
How Information Theory Connects ML Concepts
┌─────────────────────────────────────────────────────────────┐
│                     INFORMATION THEORY                      │
│                                                             │
│  H(X) ──────────────────────────────── Entropy              │
│   │                                    (uncertainty)        │
│   │   H(P,Q) = H(P) + D_KL(P||Q)                            │
│   ├─────────────────────────────────── Cross-Entropy        │
│   │      │                             (loss function)      │
│   │      └──────────────────────────── KL Divergence        │
│   │                                    (distribution gap)   │
│   │                                                         │
│   ├── I(X;Y) = H(X)+H(Y)-H(X,Y) ────── Mutual Information   │
│   │                                    (feature relevance)  │
│   │                                                         │
│   └── Shannon's Source Coding ──────── Compression          │
│       │                                (perplexity)         │
│       └── Fisher Info ──────────────── Geometry             │
│           │                            (natural grad)       │
│           └── MDL ──────────────────── Regularization       │
└─────────────────────────────────────────────────────────────┘
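The two identities named in the diagram, H(P,Q) = H(P) + D_KL(P||Q) and I(X;Y) = H(X) + H(Y) - H(X,Y), can be checked numerically. The sketch below uses made-up distributions; the helper names are our own, and the KL and cross-entropy helpers assume Q is nonzero wherever P is nonzero.

```python
import numpy as np

def entropy(p):
    """Entropy in bits; accepts a vector or a joint probability table."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def cross_entropy(p, q):
    """H(P,Q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log2(q[mask])))

def kl_divergence(p, q):
    """D_KL(P||Q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) from a joint table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint)

# Hypothetical example distributions.
P = [0.7, 0.2, 0.1]
Q = [0.5, 0.3, 0.2]
print(np.isclose(cross_entropy(P, Q), entropy(P) + kl_divergence(P, Q)))

correlated = np.array([[0.4, 0.1], [0.1, 0.4]])
independent = np.outer([0.5, 0.5], [0.3, 0.7])
print(f"I(X;Y) correlated:  {mutual_information(correlated):.4f} bits")
print(f"I(X;Y) independent: {mutual_information(independent):.4f} bits")
```

Mutual information is zero exactly when the joint table factorizes into its marginals, which is what makes it usable as a feature relevance score.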
Prerequisites
Before diving into this module, you should be comfortable with:
- Probability theory (Module 03): probability distributions, expectation, conditional probability, Bayes' theorem
- Calculus (Module 02): derivatives, integrals, chain rule (especially for entropy gradients)
- Statistics (Module 04): maximum likelihood estimation (MLE connects directly to cross-entropy)
- NumPy/Python: we write code throughout - familiarity with `np.log`, `np.sum`, and distributions
:::tip If you haven't done MLE yet
The connection between cross-entropy minimization and maximum likelihood estimation is one of the most important in all of ML. If you haven't studied MLE, read Lesson 01 of Module 04 before Lesson 03 of this module.
:::
Learning Objectives
By the end of this module, you will be able to:
- Compute and interpret entropy for any discrete or continuous distribution, and explain why decision trees use entropy for splitting
- Derive the KL divergence and explain why it is not symmetric, giving geometric intuition for forward vs. reverse KL
- Implement cross-entropy loss from scratch and prove why minimizing it is equivalent to maximum likelihood estimation
- Use mutual information for feature selection and explain the information bottleneck principle
- Compute perplexity and explain why it measures language model quality as a compression ratio
- Explain the Fisher information matrix and why natural gradient descent is geometrically principled
- Apply the MDL principle to explain regularization and model selection
The Big Picture: Why These Tools Are Used in ML
Cross-Entropy Loss
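When you train a classifier with one-hot labels y and predicted class probabilities ŷ over C classes, you minimize:
$$\mathcal{L} = -\sum_{c=1}^{C} y_c \log \hat{y}_c$$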
This is cross-entropy. It works better than MSE for classification because it directly measures the "distributional gap" between the true label distribution and the predicted distribution - exactly what information theory quantifies.
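A from-scratch version of this loss might look like the sketch below. It assumes the model outputs raw logits and that labels are integer class indices; the function names are illustrative, and the log-sum-exp shift is a standard numerical-stability trick rather than something specific to this module.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def cross_entropy_loss(logits, labels):
    """Mean cross-entropy in nats between one-hot labels and softmax(logits)."""
    log_probs = log_softmax(logits)
    n = logits.shape[0]
    # With one-hot targets, only the log-probability of the true class survives.
    return float(-log_probs[np.arange(n), labels].mean())

# Hypothetical batch: 2 examples, 3 classes.
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
labels = np.array([0, 2])

loss = cross_entropy_loss(logits, labels)
print(f"cross-entropy loss: {loss:.4f} nats")
```

Because only the true-class term survives, the loss is the average negative log-likelihood of the labels, which is the link to maximum likelihood estimation.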
VAE ELBO
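The variational autoencoder loss (the negative ELBO) for an input x with latent variable z is:
$$\mathcal{L}(\theta, \phi; x) = -\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] + D_{KL}\left(q_\phi(z \mid x) \parallel p(z)\right)$$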
The KL term is literally the information-theoretic divergence between the encoder's posterior and the prior - you are penalizing the model for encoding more information than necessary.
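For the common case of a diagonal Gaussian encoder and a standard normal prior (an assumption here, since the text above does not fix either choice), the KL term has the closed form 0.5 Σ (μ² + σ² - 1 - log σ²). A minimal sketch, with an illustrative function name:

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats.

    Sums over the last axis (latent dimensions), so a batch of shape
    (batch, latent_dim) yields one KL value per example.
    """
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

# Hypothetical encoder outputs for two inputs with a 2-D latent space.
mu = np.array([[0.0, 0.0], [1.0, 0.0]])
log_var = np.array([[0.0, 0.0], [0.0, 0.0]])

print(gaussian_kl_to_standard_normal(mu, log_var))
```

The first example matches the prior exactly and pays no penalty; the second shifts one latent mean away from zero and pays for the extra information it encodes.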
Language Model Perplexity
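Perplexity is the exponentiated cross-entropy: with the per-token cross-entropy H measured in nats, perplexity is exp(H) (or 2^H when H is in bits). A language model with perplexity 50 is, on average, as uncertain as if it had to choose uniformly among 50 equally likely next tokens. Lower perplexity means fewer bits per token, so it is a direct information-theoretic measure of how well the model compresses language.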
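The "50 equally likely tokens" reading can be checked directly. The sketch below assumes you already have the probability the model assigned to each observed token; the function name and the sample values are hypothetical.

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity from the probabilities assigned to the observed tokens."""
    token_probs = np.asarray(token_probs, dtype=float)
    avg_nll = -np.mean(np.log(token_probs))  # per-token cross-entropy in nats
    return float(np.exp(avg_nll))

# A model that spreads its mass uniformly over 50 tokens at every step.
uniform_probs = np.full(10, 1 / 50)
ppl = perplexity(uniform_probs)
print(f"perplexity: {ppl:.1f}, bits per token: {np.log2(ppl):.3f}")

# A more confident (hypothetical) model scores lower.
print(f"perplexity: {perplexity([0.5, 0.25, 0.8, 0.4]):.2f}")
```

Taking log₂ of the perplexity gives the average number of bits an ideal code based on the model would need per token, which is the compression view of the same quantity.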
PPO's KL Constraint
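Proximal Policy Optimization limits how far each update can move the policy. In its KL-based form, which follows the trust-region constraint of its predecessor TRPO, the average KL divergence between the old and new policies is kept below a small threshold δ:
$$\mathbb{E}_t\left[D_{KL}\left(\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \parallel \pi_\theta(\cdot \mid s_t)\right)\right] \le \delta$$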
This prevents the new policy from deviating too far from the old one - measured in information-theoretic terms, not just parameter space.
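To make the idea concrete, the sketch below measures the average KL divergence between an old and a new policy over a batch of states. It assumes discrete actions with strictly positive probabilities (as softmax policies produce); the function name, the example policies, and the threshold `delta` are all hypothetical.

```python
import numpy as np

def mean_policy_kl(old_probs, new_probs):
    """Average D_KL(old || new) in nats over a batch of states.

    Each row is the categorical action distribution for one state.
    Assumes every probability is strictly positive.
    """
    old_probs = np.asarray(old_probs, dtype=float)
    new_probs = np.asarray(new_probs, dtype=float)
    per_state = np.sum(old_probs * np.log(old_probs / new_probs), axis=1)
    return float(per_state.mean())

# Hypothetical action distributions for two states, before and after an update.
old = np.array([[0.50, 0.30, 0.20],
                [0.10, 0.80, 0.10]])
new = np.array([[0.45, 0.35, 0.20],
                [0.15, 0.75, 0.10]])

delta = 0.01  # hypothetical threshold; real values are tuned per task
kl = mean_policy_kl(old, new)
print(f"mean KL = {kl:.4f} nats, within threshold: {kl <= delta}")
```

Two policies can be close in parameter space yet far apart in KL (or the reverse), which is why the divergence is measured on the action distributions themselves.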
Historical Context
| Year | Milestone | Impact on ML |
|---|---|---|
| 1948 | Shannon's "A Mathematical Theory of Communication" | Founded the entire field |
| 1951 | Huffman coding | Optimal prefix codes |
| 1961 | Rényi entropy | Generalized entropy family |
| 1965 | Kolmogorov complexity | MDL principle foundations |
| 1978 | MDL principle (Rissanen) | Formal basis for model selection |
| 1998 | Amari's Information Geometry | Natural gradient for neural nets |
| 2013 | VAEs (Kingma & Welling) | KL divergence in deep generative models |
| 2015 | Information Bottleneck for DNNs | Tishby's theory of deep learning |
| 2017 | PPO (Schulman et al.) | KL constraint in RL |
| 2020+ | LLMs evaluated by perplexity | Information theory as model benchmark |
Notation Reference
Throughout this module we use:
| Symbol | Meaning |
|---|---|
| $H(X)$ | Entropy of random variable X |
| $H(P, Q)$ | Cross-entropy between distributions P and Q |
| $D_{KL}(P \parallel Q)$ | KL divergence from Q to P |
| $I(X; Y)$ | Mutual information between X and Y |
| $h(X)$ | Differential entropy (continuous) |
| $p(x)$ | Probability mass/density of x |
| $\mathbb{E}_p[\cdot]$ | Expectation under distribution p |
| $F(\theta)$ | Fisher information matrix |
| $K(x)$ | Kolmogorov complexity of x |
| nats | Entropy in base-e logarithms |
| bits | Entropy in base-2 logarithms |
:::note Bits vs. Nats
ML frameworks (PyTorch, TensorFlow) use natural logarithm (base e), so loss values are in nats. Information theory texts often use base-2 logarithms, giving bits. The conversion is 1 nat = log₂(e) ≈ 1.443 bits. Both are valid - just be consistent.
:::
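The conversion is a single multiplication. A small sketch, with helper names chosen for illustration:

```python
import math

NATS_TO_BITS = math.log2(math.e)  # about 1.4427 bits per nat

def nats_to_bits(value_in_nats: float) -> float:
    """Convert an entropy or loss value from nats to bits."""
    return value_in_nats * NATS_TO_BITS

def bits_to_nats(value_in_bits: float) -> float:
    """Convert an entropy or loss value from bits to nats."""
    return value_in_bits * math.log(2)

# A fair coin has entropy ln(2) nats, which should come out as 1 bit.
coin_entropy_nats = math.log(2)
print(f"{coin_entropy_nats:.4f} nats = {nats_to_bits(coin_entropy_nats):.4f} bits")
```

This is handy when comparing a framework's reported loss (in nats) against figures quoted in bits per token or bits per character.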
How to Use This Module
For ML engineers building production systems: focus on Lessons 01–04 for the day-to-day tools (entropy, KL, cross-entropy, MI), then Lesson 05 for understanding perplexity and model evaluation.
For researchers working on generative models or RL: Lessons 02 and 06 are critical - KL divergence and information geometry underpin VAEs, diffusion models, and policy optimization.
For those preparing for ML interviews: every lesson has an Interview Q&A section. The most commonly tested topics are cross-entropy vs. MSE (Lesson 03), KL divergence asymmetry (Lesson 02), and entropy in decision trees (Lesson 01).
For deep learning theorists: Lessons 04 and 07 connect to fundamental questions about why deep networks generalize - the information bottleneck and MDL perspectives.
Let's begin.
