Module 05 - Information Theory

Why Information Theory Is the Hidden Foundation of ML

Every time you train a neural network with cross-entropy loss, tune a VAE's KL penalty, select features by mutual information, or evaluate a language model with perplexity - you are doing applied information theory. Yet most ML engineers learn these tools as isolated recipes without understanding the unified theory that connects them.

Information theory, pioneered by Claude Shannon in his 1948 paper "A Mathematical Theory of Communication," answers a deceptively simple question: how much information is in a message? The answers Shannon derived turned out to be exactly the right mathematical framework for:

Measuring uncertainty in probability distributions (entropy)
Comparing two probability distributions (KL divergence, cross-entropy)
Quantifying how much two variables share (mutual information)
Understanding why simpler models generalize better (MDL, compression)
Training generative models to match data distributions (VAEs, GANs)

This module builds that unified understanding - from first principles through cutting-edge applications in deep learning.

Module Map

Lesson Overview

#	Lesson	Core Concept	ML Application
01	Entropy and Information	H(X) = -Σ p(x) log p(x)	Decision tree splitting, uncertainty quantification
02	KL Divergence	D_KL(P||Q)	VAE training, PPO policy updates
03	Cross-Entropy and Loss Functions	H(P,Q) = H(P) + D_KL(P||Q)	Standard loss for classification models
04	Mutual Information	I(X;Y)	Feature selection, information bottleneck
05	Data Compression Fundamentals	Source coding theorem	Perplexity, model evaluation, LLMs as compressors
06	Information Geometry	Fisher information matrix	Natural gradient, K-FAC, second-order optimization
07	Minimum Description Length	Best model = shortest description	Regularization theory, Occam's razor formalized

The Fundamental Insight: Information = Surprise

Shannon's key insight was that information content is inversely related to probability. An event that almost certainly happens (p ≈ 1) tells you almost nothing - you already knew it was coming. An extremely rare event (p ≈ 0) carries a lot of information - it was surprising.

This is captured in the self-information of an event:

$I(x) = -\log_2 p(x) \quad \text{(bits)}$

Event	Probability	Information
Flipping heads	0.5	1 bit
Rolling a 6 on a fair die	1/6	2.58 bits
Drawing ace of spades	1/52	5.70 bits
Predicting tomorrow's exact stock price	~0	∞ bits
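The finite rows of the table can be reproduced with a few lines of Python. This is a minimal sketch; the function name `self_information_bits` is chosen here for illustration and is not part of any library.

```python
import math


def self_information_bits(p: float) -> float:
    """Self-information I(x) = -log2 p(x), in bits, for an event of probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)


# Probabilities taken from the table above.
events = {
    "Flipping heads": 1 / 2,
    "Rolling a 6 on a fair die": 1 / 6,
    "Drawing ace of spades": 1 / 52,
}

for name, p in events.items():
    print(f"{name}: {self_information_bits(p):.2f} bits")
```

As p shrinks toward 0 the value grows without bound, which is why the last row of the table is listed as ∞.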

Entropy is the expected information content - the average surprise over a distribution:

$H(X) = \mathbb{E}[-\log p(X)] = -\sum_x p(x) \log p(x)$

This single formula and its extensions underlie most of the loss functions and evaluation metrics covered in this module.
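To make the "average surprise" reading concrete, here is a small NumPy sketch of the entropy formula. The helper name `entropy` and the example distributions are illustrative choices, and the convention 0 · log 0 = 0 is applied by dropping zero-probability outcomes.

```python
import numpy as np


def entropy(p, base=2.0):
    """Shannon entropy of a discrete distribution p (base 2 gives bits)."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]  # 0 * log 0 is treated as 0
    return float(-np.sum(nonzero * np.log(nonzero)) / np.log(base))


fair_coin = [0.5, 0.5]
biased_coin = [0.9, 0.1]
fair_die = [1 / 6] * 6

for name, dist in [("fair coin", fair_coin),
                   ("biased coin", biased_coin),
                   ("fair die", fair_die)]:
    print(f"{name}: {entropy(dist):.3f} bits")
```

The fair coin attains the maximum of 1 bit for two outcomes, while the biased coin is more predictable and therefore has lower entropy.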

How Information Theory Connects ML Concepts

┌─────────────────────────────────────────────────────────────┐
│                    INFORMATION THEORY                        │
│                                                             │
│  H(X) ──────────────────────────────────── Entropy         │
│    │                                        (uncertainty)   │
│    │   H(P,Q) = H(P) + D_KL(P||Q)                         │
│    ├──────────────────────────────────── Cross-Entropy      │
│    │                 │                    (loss function)   │
│    │                 └───────────────── KL Divergence       │
│    │                                    (distribution gap)  │
│    │                                                        │
│    ├── I(X;Y) = H(X)+H(Y)-H(X,Y) ─── Mutual Information   │
│    │                                    (feature relevance) │
│    │                                                        │
│    └── Shannon's Source Coding ──────── Compression        │
│                  │                      (perplexity)        │
│                  └── Fisher Info ──────── Geometry         │
│                              │            (natural grad)    │
│                              └── MDL ──── Regularization   │
└─────────────────────────────────────────────────────────────┘
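The two identities in the diagram can be checked numerically. The sketch below uses made-up distributions `p`, `q` and a made-up 2×2 joint table; the helper names are illustrative, and the KL and cross-entropy helpers assume `q` is positive wherever `p` is.

```python
import numpy as np


def entropy(p):
    """Entropy in bits; zero-probability outcomes contribute nothing."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]
    return float(-np.sum(nonzero * np.log2(nonzero)))


def cross_entropy(p, q):
    """H(P, Q) in bits, assuming q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log2(q[mask])))


def kl_divergence(p, q):
    """D_KL(P || Q) in bits, assuming q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))


# Identity 1: H(P, Q) = H(P) + D_KL(P || Q)
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))

# Identity 2: I(X; Y) = H(X) + H(Y) - H(X, Y)
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])      # rows index x, columns index y
px, py = joint.sum(axis=1), joint.sum(axis=0)
mi_from_entropies = entropy(px) + entropy(py) - entropy(joint.ravel())
# Equivalent definition: KL between the joint and the product of marginals.
mi_direct = kl_divergence(joint.ravel(), np.outer(px, py).ravel())
print(mi_from_entropies, mi_direct)
```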

Prerequisites

Before diving into this module, you should be comfortable with:

Probability theory (Module 03): probability distributions, expectation, conditional probability, Bayes' theorem
Calculus (Module 02): derivatives, integrals, chain rule (especially for entropy gradients)
Statistics (Module 04): maximum likelihood estimation (MLE connects directly to cross-entropy)
NumPy/Python: we write code throughout - familiarity with np.log, np.sum, distributions

:::tip If you haven't done MLE yet
The connection between cross-entropy minimization and maximum likelihood estimation is one of the most important in all of ML. If you haven't studied MLE, read Lesson 01 of Module 04 before Lesson 03 of this module.
:::

Learning Objectives

By the end of this module, you will be able to:

Compute and interpret entropy for any discrete or continuous distribution, and explain why decision trees use entropy for splitting
Derive the KL divergence and explain why it is not symmetric, giving geometric intuition for forward vs. reverse KL
Implement cross-entropy loss from scratch and prove why minimizing it is equivalent to maximum likelihood estimation
Use mutual information for feature selection and explain the information bottleneck principle
Compute perplexity and explain why it measures language model quality as a compression ratio
Explain the Fisher information matrix and why natural gradient descent is geometrically principled
Apply the MDL principle to explain regularization and model selection

The Big Picture: Why These Tools Are Used in ML

Cross-Entropy Loss

When you train a classifier, you minimize:

$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log \hat{p}_{ic}$

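This is the cross-entropy between the true label distribution $y_i$ (usually one-hot) and the predicted distribution $\hat{p}_i$, averaged over the N examples. It typically works better than MSE for classification because it directly measures the "distributional gap" between those two distributions, which is exactly what information theory quantifies.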
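A from-scratch NumPy sketch of this loss is shown below. It assumes integer class labels (so the inner sum over classes picks out a single term) and adds a small `eps` inside the log for numerical safety; the function names and example logits are illustrative.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def cross_entropy_loss(probs, labels, eps=1e-12):
    """Mean cross-entropy in nats for integer labels of shape (N,)."""
    n = probs.shape[0]
    # With one-hot targets, sum_c y_ic * log p_ic keeps only the true class.
    true_class_probs = probs[np.arange(n), labels]
    return float(-np.mean(np.log(true_class_probs + eps)))


logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
labels = np.array([0, 2])

probs = softmax(logits)
loss = cross_entropy_loss(probs, labels)

# Same quantity written as the explicit double sum with one-hot targets.
one_hot = np.eye(probs.shape[1])[labels]
loss_double_sum = float(-np.mean(np.sum(one_hot * np.log(probs + 1e-12), axis=1)))
print(loss, loss_double_sum)
```

A model that predicts the uniform distribution over C classes scores log C nats, which is a useful sanity check for an untrained classifier.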

VAE ELBO

The variational autoencoder loss, which is the negative evidence lower bound (ELBO), is:

$\mathcal{L}_{\text{VAE}} = \underbrace{-\mathbb{E}_{q(z|x)}[\log p(x|z)]}_{\text{reconstruction}} + \underbrace{D_{\text{KL}}(q(z|x) \| p(z))}_{\text{regularization}}$

The KL term is literally the information-theoretic divergence between the encoder's approximate posterior $q(z|x)$ and the prior $p(z)$: you are penalizing the model for encoding more information about x in z than necessary.
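The text above does not fix the form of the encoder or the prior. Under the common assumption of a diagonal Gaussian encoder $q(z|x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and a standard normal prior, the KL term has the closed form $\tfrac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$. The sketch below implements that special case only; the function name and the `log_var` parameterization are illustrative choices.

```python
import numpy as np


def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats.

    Assumes a diagonal Gaussian posterior and a standard normal prior.
    The sum runs over the last axis (the latent dimensions).
    """
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var, axis=-1)


# Posterior equal to the prior: no extra information encoded, KL is zero.
print(gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))

# Shifting one latent mean by 1 costs 0.5 nats.
print(gaussian_kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))
```

The penalty grows as the posterior mean moves away from 0 or the variance moves away from 1, which matches the reading of the KL term as a cost on information stored in z.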

Language Model Perplexity

$\text{PPL} = 2^{H(P_{\text{test}}, P_{\text{model}})}$

Perplexity is exponentiated cross-entropy, with the cross-entropy measured in bits per token to match the base of 2. A perplexity of 50 means the model is, on average, as uncertain as if it had to choose uniformly among 50 equally likely next tokens. Lower perplexity means fewer bits per token, so it is a direct information-theoretic measure of how well the model has compressed language.
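In practice the cross-entropy is estimated from the probabilities a model assigns to the tokens that actually occur in the test text. The sketch below assumes those per-token probabilities are already available as an array; `perplexity` is an illustrative helper, not a library function.

```python
import numpy as np


def perplexity(token_probs):
    """Perplexity from the model's probability for each observed token."""
    token_probs = np.asarray(token_probs, dtype=float)
    cross_entropy_bits = -np.mean(np.log2(token_probs))  # bits per token
    return float(2.0 ** cross_entropy_bits)


# A model that always assigns probability 1/50 to the correct token
# behaves like a uniform choice among 50 options.
uniform_like = np.full(10, 1 / 50)
print(perplexity(uniform_like))

# Working in nats gives the same number, as long as the base matches.
nats = -np.mean(np.log(uniform_like))
print(float(np.exp(nats)))
```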

PPO's KL Constraint

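Trust Region Policy Optimization (TRPO), the predecessor of Proximal Policy Optimization (PPO), constrains each policy update by:

$D_{\text{KL}}(\pi_{\theta_{\text{old}}} \| \pi_\theta) \leq \delta$

PPO approximates this constraint with a clipped surrogate objective or a KL penalty term. Either way, the new policy is prevented from deviating too far from the old one, with the distance measured between action distributions in information-theoretic terms, not just in parameter space.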
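For discrete action spaces, a KL check of this kind can be sketched as follows. The policies, the threshold `delta`, and the helper `mean_kl` are all hypothetical values chosen for illustration; the argument order fixes the direction of the divergence, and the code assumes every action probability is strictly positive.

```python
import numpy as np


def mean_kl(p, q):
    """Mean over states of KL(p(.|s) || q(.|s)) in nats.

    p and q have shape (num_states, num_actions); each row is a
    distribution over actions with strictly positive entries.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.mean(np.sum(p * np.log(p / q), axis=1)))


# Hypothetical action distributions for two states.
old_policy = np.array([[0.60, 0.30, 0.10],
                       [0.20, 0.50, 0.30]])
new_policy = np.array([[0.55, 0.33, 0.12],
                       [0.25, 0.45, 0.30]])

delta = 0.01  # hypothetical KL budget
kl = mean_kl(old_policy, new_policy)
within_budget = kl <= delta
print(f"mean KL = {kl:.4f} nats, within budget: {within_budget}")
```

Two parameter vectors that are close in Euclidean distance can still produce very different action distributions, which is why the check is done on the distributions themselves.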

Historical Context

Year	Milestone	Impact on ML
1948	Shannon's "A Mathematical Theory of Communication"	Founded the entire field
1952	Huffman coding	Optimal prefix codes
1961	Rényi entropy	Generalized entropy family
1965	Kolmogorov complexity	MDL principle foundations
1978	MDL principle (Rissanen)	Formal basis for model selection
1998	Amari's Information Geometry	Natural gradient for neural nets
2013	VAEs (Kingma & Welling)	KL divergence in deep generative models
2015	Information Bottleneck for DNNs	Tishby's theory of deep learning
2017	PPO (Schulman et al.)	KL constraint in RL
2020+	LLMs evaluated by perplexity	Information theory as model benchmark

Notation Reference

Throughout this module we use:

Symbol	Meaning
$H(X)$	Entropy of random variable X
$H(P, Q)$	Cross-entropy between distributions P and Q
$D_{\text{KL}}(P \| Q)$	KL divergence from Q to P
$I(X;Y)$	Mutual information between X and Y
$h(X)$	Differential entropy (continuous)
$p(x)$	Probability mass/density of x
$\mathbb{E}_p[\cdot]$	Expectation under distribution p
$\mathcal{F}$	Fisher information matrix
$K(x)$	Kolmogorov complexity of x
nats	Entropy in base-e logarithms
bits	Entropy in base-2 logarithms

:::note Bits vs. Nats
ML frameworks (PyTorch, TensorFlow) use natural logarithm (base e), so loss values are in nats. Information theory texts often use base-2 logarithms, giving bits. The conversion is 1 nat = log₂(e) ≈ 1.443 bits. Both are valid - just be consistent.
:::
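The conversion is a single multiplication. A minimal sketch, with helper names chosen here for illustration:

```python
import math

NATS_TO_BITS = math.log2(math.e)  # about 1.4427 bits per nat


def nats_to_bits(value_nats: float) -> float:
    """Convert an entropy or loss value from nats to bits."""
    return value_nats * NATS_TO_BITS


def bits_to_nats(value_bits: float) -> float:
    """Convert an entropy or loss value from bits to nats."""
    return value_bits * math.log(2)


# A fair coin has entropy ln(2) nats, which is exactly 1 bit.
fair_coin_nats = math.log(2)
print(nats_to_bits(fair_coin_nats))

# A framework loss reported as 2.0 nats, expressed in bits.
print(f"{nats_to_bits(2.0):.3f}")
```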

How to Use This Module

For ML engineers building production systems: focus on Lessons 01–04 for the day-to-day tools (entropy, KL, cross-entropy, MI), then Lesson 05 for understanding perplexity and model evaluation.

For researchers working on generative models or RL: Lessons 02 and 06 are critical - KL divergence and information geometry underpin VAEs, diffusion models, and policy optimization.

For those preparing for ML interviews: every lesson has an Interview Q&A section. The most commonly tested topics are cross-entropy vs. MSE (Lesson 03), KL divergence asymmetry (Lesson 02), and entropy in decision trees (Lesson 01).

For deep learning theorists: Lessons 04 and 07 connect to fundamental questions about why deep networks generalize - the information bottleneck and MDL perspectives.

Let's begin.
