Mutual Information
The Scenario That Motivates This Lesson
You are building a fraud detection model with 500 features. A naive approach is to train a model on all of them. But your data scientist says:
"Let's rank features by mutual information with the fraud label and take the top 50."
Or perhaps you are trying to explain why word2vec works - why "king" and "queen" end up close in embedding space. The answer involves pointwise mutual information (PMI): the mathematical relationship between a word and its context.
Or you are reading a paper on deep learning generalization that claims:
"Deep networks learn to maximize mutual information between inputs and task-relevant features while compressing irrelevant information."
All three situations require understanding mutual information - the most general measure of statistical dependence between two random variables.
Definition: Mutual Information
The mutual information between random variables and is:
For continuous distributions:
Interpretation: The amount of information that knowing provides about (and vice versa, since MI is symmetric). It equals zero iff and are independent.
Alternative Expressions
Using entropy:
Using KL divergence:
This last form shows that MI measures how far the joint distribution is from the product of marginals - i.e., how much they deviate from independence.
Entropy Diagram: The Relationships
H(X, Y)
┌──────────────────────────────────────────────────────────┐
│ │
│ H(X) │ H(Y) │
│ ┌──────────────────┐ │ ┌──────────────────┐ │
│ │ │ │ │ │ │
│ │ H(X|Y) │ I(X;Y) │ H(Y|X) │ │
│ │ │ │ │ │ │
│ └──────────────────┴──────┴───┴──────────────────┘ │
│ │
└──────────────────────────────────────────────────────────┘
H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
I(X;Y) = H(X) + H(Y) - H(X,Y)
I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
Key Properties of Mutual Information
Symmetry
Knowing tells you as much about as knowing tells you about .
Non-Negativity
Since .
Upper Bounds
You cannot learn more about from than the total uncertainty in .
Self-Information
A variable contains exactly bits of information about itself.
Mutual Information vs. Correlation
Correlation measures only linear dependence. Mutual information captures any dependence - linear or nonlinear.
import numpy as np
from scipy import stats
def estimate_mi_binning(x: np.ndarray, y: np.ndarray, bins: int = 20) -> float:
"""
Estimate mutual information using histogram binning.
Simple but works well for continuous variables.
"""
# Joint histogram
c_xy, _, _ = np.histogram2d(x, y, bins=bins)
# Marginal histograms
c_x = c_xy.sum(axis=1)
c_y = c_xy.sum(axis=0)
n = c_xy.sum()
# Convert to probabilities
p_xy = c_xy / n
p_x = c_x / n
p_y = c_y / n
# MI = sum p(x,y) * log(p(x,y) / (p(x)*p(y)))
mi = 0.0
for i in range(bins):
for j in range(bins):
if p_xy[i, j] > 0 and p_x[i] > 0 and p_y[j] > 0:
mi += p_xy[i, j] * np.log(p_xy[i, j] / (p_x[i] * p_y[j]))
return float(mi)
np.random.seed(42)
n = 2000
# Case 1: Linear relationship
x_lin = np.random.randn(n)
y_lin = 2 * x_lin + 0.5 * np.random.randn(n)
# Case 2: Nonlinear relationship (x^2) - correlation ≈ 0 but dependent!
x_nl = np.random.uniform(-3, 3, n)
y_nl = x_nl**2 + 0.5 * np.random.randn(n)
# Case 3: Independent
x_ind = np.random.randn(n)
y_ind = np.random.randn(n)
print(f"{'Relationship':<25} | {'Pearson r':>10} | {'MI (nats)':>10}")
print("-" * 50)
for name, x, y in [
("Linear (strong)", x_lin, y_lin),
("Quadratic (x²)", x_nl, y_nl),
("Independent", x_ind, y_ind),
]:
r, _ = stats.pearsonr(x, y)
mi = estimate_mi_binning(x, y)
print(f"{name:<25} | {r:>10.4f} | {mi:>10.4f}")
# Output demonstrates:
# Linear relationship: high r, high MI
# Quadratic x^2: r ≈ 0 (linear corr fails!), but high MI (dependent!)
# Independent: r ≈ 0, MI ≈ 0
This is why MI-based feature selection finds features that have any relationship with the target, not just linear ones.
Pointwise Mutual Information (PMI)
Pointwise mutual information measures the association between two specific events and :
: and co-occur more than expected under independence : and co-occur less than expected : and are independent
Mutual information is the expected PMI.
PMI in Word Embeddings
:::info ML Connection - word2vec and PMI The word2vec Skip-gram model was shown by Levy & Goldberg (2014) to implicitly factorize the PMI matrix of words and their contexts.
Define the shifted PMI matrix:
where is the number of negative samples. Word2vec's word vectors and context vectors approximate:
"king" - "man" + "woman" ≈ "queen" works because the word vectors encode PMI relationships, and PMI of "king" and "royal contexts" minus PMI of "man contexts" plus PMI of "woman contexts" points to "queen." :::
import numpy as np
from collections import Counter
def compute_pmi_matrix(
corpus: list,
window_size: int = 2,
min_count: int = 1,
k: int = 1, # negative samples (for SPMI)
) -> tuple:
"""
Compute PMI matrix from a text corpus.
Returns:
vocab: list of words
pmi_matrix: (|V| x |V|) PMI values
"""
# Count word and co-occurrence frequencies
word_counts = Counter()
cooccur_counts = Counter()
for i, word in enumerate(corpus):
word_counts[word] += 1
start = max(0, i - window_size)
end = min(len(corpus), i + window_size + 1)
for j in range(start, end):
if i != j:
cooccur_counts[(word, corpus[j])] += 1
# Filter by min_count
vocab = [w for w, c in word_counts.items() if c >= min_count]
vocab_idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)
total_words = sum(word_counts.values())
total_cooccur = sum(cooccur_counts.values())
# Build PMI matrix
pmi = np.zeros((V, V))
for (w1, w2), count in cooccur_counts.items():
if w1 in vocab_idx and w2 in vocab_idx:
i, j = vocab_idx[w1], vocab_idx[w2]
p_ij = count / total_cooccur
p_i = word_counts[w1] / total_words
p_j = word_counts[w2] / total_words
if p_ij > 0 and p_i > 0 and p_j > 0:
# Shifted PMI
pmi[i, j] = max(np.log(p_ij / (p_i * p_j)) - np.log(k), 0)
return vocab, pmi
# Simple example corpus
corpus = "the cat sat on the mat the cat ate the rat the mat is flat".split()
vocab, pmi_mat = compute_pmi_matrix(corpus, window_size=2)
print("PMI Matrix (top co-occurrences):")
print(f"Vocab: {vocab}\n")
# Show strongest associations
associations = []
V = len(vocab)
for i in range(V):
for j in range(V):
if i != j and pmi_mat[i, j] > 0:
associations.append((vocab[i], vocab[j], pmi_mat[i, j]))
associations.sort(key=lambda x: -x[2])
print(f"{'Word 1':<8} {'Word 2':<8} {'PMI':>8}")
print("-" * 28)
for w1, w2, pmi in associations[:10]:
print(f"{w1:<8} {w2:<8} {pmi:>8.4f}")
Feature Selection with Mutual Information
MI-based feature selection ranks features by where is the target variable. This captures any form of statistical dependence.
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.datasets import make_classification
import numpy as np
def mi_feature_selection(
X: np.ndarray,
y: np.ndarray,
task: str = "classification",
top_k: int = 10,
) -> list:
"""
Rank features by mutual information with the target.
Args:
X: feature matrix (n_samples, n_features)
y: target array
task: 'classification' or 'regression'
top_k: return top k features
Returns:
list of (feature_index, mi_score) sorted by MI descending
"""
if task == "classification":
mi_scores = mutual_info_classif(X, y, random_state=42)
else:
mi_scores = mutual_info_regression(X, y, random_state=42)
ranked = sorted(enumerate(mi_scores), key=lambda x: -x[1])
return ranked[:top_k]
# Create dataset with mix of relevant and irrelevant features
X, y = make_classification(
n_samples=1000,
n_features=20,
n_informative=5, # only 5 truly informative features
n_redundant=5,
n_repeated=0,
n_classes=2,
random_state=42,
)
top_features = mi_feature_selection(X, y, task="classification", top_k=10)
print("Feature Selection by Mutual Information:")
print(f"{'Feature Index':>15} | {'MI Score':>10} | {'Informative?':>14}")
print("-" * 45)
for feat_idx, mi_score in top_features:
# sklearn's informative features are indices 0-4
is_informative = "YES" if feat_idx < 5 else "no"
print(f"{feat_idx:>15} | {mi_score:>10.4f} | {is_informative:>14}")
Advantages and Limitations
Advantages of MI feature selection:
✓ Captures any statistical dependency (linear + nonlinear)
✓ Model-agnostic - doesn't assume linear relationships
✓ Fast to compute (compared to wrapper methods)
✓ Theoretically grounded: maximally informative features
Limitations:
✗ Does not capture feature interactions (redundant features can both score high)
✗ Estimation is challenging for continuous variables (binning approximation)
✗ Ignores conditional dependencies: a feature might be informative only jointly
✗ MRMR (Maximum Relevance Minimum Redundancy) addresses the redundancy issue
MRMR: Maximum Relevance Minimum Redundancy
Standard MI selection can pick many redundant features (e.g., height in cm and height in inches). MRMR addresses this:
Select the feature with the highest relevance to the target minus the average redundancy with already-selected features .
Conditional Mutual Information
The conditional mutual information of and given :
Expanding using the chain rule of mutual information:
This is used in the information bottleneck principle.
The Information Bottleneck Principle
The information bottleneck (Tishby, Pereira & Bialek, 1999) frames learning as a compression problem:
Given an input and target , find a representation (the "bottleneck") that:
- Compresses : minimize - discard irrelevant information
- Preserves : maximize - keep task-relevant information
The IB objective:
where controls the trade-off between compression and accuracy.
:::info ML Connection - Information Bottleneck in Deep Learning Tishby & Schwartz-Ziv (2017) proposed that deep neural networks implicitly implement the information bottleneck:
- Fitting phase: layers first increase (learn to predict) while grows (memorize input)
- Compression phase: decreases (compress irrelevant info) while is maintained
This would explain why deep networks generalize: they discard input variations irrelevant to the task.
Note: This theory is debated - Saxe et al. (2018) showed the compression phase may not occur with non-saturating activations. But the information bottleneck framework remains a useful conceptual tool for understanding representation learning. :::
Information Bottleneck Trade-off Curve:
I(T;Y)
(task info)
1.0 | ****
| ***
| ***
0.5 | ****
| ***
| **
0.0 |___***________________
0 0.5 1.0 I(T;X)
(input compression)
The optimal representations lie on this "information curve."
Points above the curve are unachievable.
Points below are suboptimal.
β controls where you land on the curve.
Computing MI for Discrete Variables
import numpy as np
from itertools import product
def mutual_information_discrete(
x: np.ndarray,
y: np.ndarray,
) -> float:
"""
Exact mutual information for discrete variables.
Args:
x: discrete observations of X
y: discrete observations of Y (same length as x)
Returns:
I(X;Y) in nats
"""
assert len(x) == len(y)
n = len(x)
x_vals = np.unique(x)
y_vals = np.unique(y)
# Count joint occurrences
joint_counts = {}
for xi, yi in zip(x, y):
joint_counts[(xi, yi)] = joint_counts.get((xi, yi), 0) + 1
# Convert to probabilities
p_joint = {k: v / n for k, v in joint_counts.items()}
p_x = {xi: np.sum(x == xi) / n for xi in x_vals}
p_y = {yi: np.sum(y == yi) / n for yi in y_vals}
mi = 0.0
for (xi, yi), pxy in p_joint.items():
if pxy > 0:
mi += pxy * np.log(pxy / (p_x[xi] * p_y[yi]))
return mi
# Example: MI between a feature and binary label
np.random.seed(42)
n = 1000
# Feature 1: highly informative
x1 = np.random.choice([0, 1, 2], size=n, p=[0.4, 0.3, 0.3])
y = np.where(x1 == 0, np.random.choice([0, 1], p=[0.8, 0.2]),
np.where(x1 == 1, np.random.choice([0, 1], p=[0.5, 0.5]),
np.random.choice([0, 1], p=[0.2, 0.8])))
# Feature 2: independent of label
x2 = np.random.choice([0, 1, 2], size=n, p=[0.4, 0.3, 0.3])
mi_x1_y = mutual_information_discrete(x1, y)
mi_x2_y = mutual_information_discrete(x2, y)
h_y = -np.sum([0.5 * np.log(0.5)] * 2) # entropy of balanced binary target
print(f"MI(X1, Y) = {mi_x1_y:.4f} nats ← informative feature")
print(f"MI(X2, Y) = {mi_x2_y:.4f} nats ← random feature (should be ≈ 0)")
print(f"H(Y) = {h_y:.4f} nats (maximum possible MI)")
print(f"\nX1 explains {100*mi_x1_y/h_y:.1f}% of Y's entropy")
MI in Representation Learning: InfoNCE
Modern self-supervised learning methods (SimCLR, MoCo, CLIP) use MI-based objectives. The InfoNCE loss provides a lower bound on mutual information:
where:
Here is the positive (matching) sample for , and are samples including one positive and negatives. The function is a similarity score (often dot product of normalized embeddings).
import torch
import torch.nn.functional as F
def infonce_loss(
embeddings_a: torch.Tensor,
embeddings_b: torch.Tensor,
temperature: float = 0.07,
) -> torch.Tensor:
"""
InfoNCE / NT-Xent loss (used in SimCLR, CLIP).
Maximizes mutual information between paired views.
Args:
embeddings_a: (batch, dim) - view 1 embeddings, L2-normalized
embeddings_b: (batch, dim) - view 2 embeddings, L2-normalized
temperature: τ - controls sharpness of distribution
Returns:
InfoNCE loss (scalar)
"""
batch_size = embeddings_a.shape[0]
# Normalize embeddings
a = F.normalize(embeddings_a, dim=-1)
b = F.normalize(embeddings_b, dim=-1)
# Similarity matrix: (batch, batch)
sim_matrix = torch.matmul(a, b.T) / temperature
# Labels: diagonal entries are the matching pairs
labels = torch.arange(batch_size, device=a.device)
# Symmetric loss (as in SimCLR)
loss_a = F.cross_entropy(sim_matrix, labels) # a → b
loss_b = F.cross_entropy(sim_matrix.T, labels) # b → a
return (loss_a + loss_b) / 2
# Demo: random embeddings (before training)
torch.manual_seed(42)
batch, dim = 8, 64
emb_a = torch.randn(batch, dim)
emb_b = torch.randn(batch, dim) # independent (not paired yet)
emb_b_paired = emb_a + 0.1 * torch.randn(batch, dim) # similar to a
loss_random = infonce_loss(emb_a, emb_b)
loss_paired = infonce_loss(emb_a, emb_b_paired)
print(f"InfoNCE loss (random pairs, low MI): {loss_random.item():.4f}")
print(f"InfoNCE loss (correlated pairs, high MI): {loss_paired.item():.4f}")
print(f"MI lower bound from random pairs: {np.log(batch) - loss_random.item():.4f}")
print(f"MI lower bound from paired: {np.log(batch) - loss_paired.item():.4f}")
Summary: Mutual Information Relationships
I(X;Y) Definition
────────────────────────────────────────────────────────────
Via joint/marginals: Σ p(x,y) log [p(x,y) / p(x)p(y)]
Via entropies: H(X) + H(Y) - H(X,Y)
Via conditional: H(X) - H(X|Y) = H(Y) - H(Y|X)
Via KL: D_KL(p(x,y) || p(x)p(y))
Applications
────────────────────────────────────────────────────────────
Feature selection: Rank by I(feature; target)
Word embeddings: Skip-gram implicitly factorizes PMI matrix
Self-supervised: InfoNCE maximizes lower bound on I(view1; view2)
IB principle: Find T minimizing I(T;X) - β·I(T;Y)
Causal discovery: CI tests use I(X;Y|Z) = 0 for independence
Interview Questions and Answers
Q1: What is mutual information, and how does it differ from correlation?
Mutual information measures the total statistical dependence between and - including any nonlinear relationship. It equals zero iff and are independent.
Pearson correlation measures only linear dependence. A classic example: if and , the Pearson correlation is 0 (no linear relationship), but the mutual information is large ( is completely determined by ). MI-based feature selection finds features that are predictive in any way, not just linearly.
Q2: Explain the information bottleneck principle and its connection to deep learning.
The information bottleneck seeks a compressed representation of input that maximally preserves information about target . Formally: minimize while maximizing , with a trade-off controlled by . The optimal retains only the task-relevant information from , discarding everything else.
Tishby & Schwartz-Ziv (2017) hypothesized that deep neural networks implicitly solve this problem: after an initial fitting phase (increasing ), networks undergo a compression phase (decreasing while maintaining ). This compression is what enables generalization - the network discards input variations that don't affect the output. The theory is contested but provides useful intuition for why depth and regularization help.
Q3: How does word2vec relate to pointwise mutual information?
Levy & Goldberg (2014) showed that the skip-gram word2vec model with negative sampling, trained to predict context words from target words, implicitly factorizes the shifted PMI matrix:
where is the number of negative samples and PMI.
This is why word2vec captures semantic relationships: words that appear in similar contexts have high PMI with those contexts, and the embedding geometry reflects these PMI values. The famous analogy "king - man + woman = queen" works because the embedding arithmetic approximates PMI arithmetic.
Q4: Why can mutual information be hard to estimate for continuous variables?
For discrete variables, MI estimation is straightforward: count co-occurrences and compute the sum. For continuous variables, we cannot count exact probabilities - we need density estimation, which is hard in high dimensions.
Common approaches:
- Binning/histogram: Discretize the continuous variables. Simple but depends heavily on bin width.
- k-NN estimators (Kraskov et al.): Use the distance to the k-th nearest neighbor. Consistent and widely used.
- Neural MI estimation (MINE): Train a network to estimate the KL divergence between joint and marginal distributions. Scales to high dimensions but has high variance.
- InfoNCE bound: A lower bound on MI that is easily optimized with contrastive learning.
All methods have biases and limitations. This is why feature selection MI estimators (like scikit-learn's) use smoothing and careful implementation rather than naive binning.
Q5: What is the data processing inequality, and what does it say about neural network representations?
The data processing inequality states: for a Markov chain , we have . Processing to get cannot increase the information about .
For neural networks, this means: each layer can only reduce or maintain the information from the input layer. Layer cannot have more information about the input than layer . This has several implications:
- The softmax output layer has at most as much information about as any hidden layer
- You cannot add information to intermediate representations that was not in the input
- The task of each layer is to extract and organize task-relevant information, not to create new information
In the information bottleneck view, deep networks ideally increase in early layers (feature extraction) and decrease in later layers (compression of irrelevant input variations) - all while respecting that can never increase from one layer to the next.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Mutual Information demo on the EngineersOfAI Playground - no code required.
:::
