Mutual Information

The Scenario That Motivates This Lesson

You are building a fraud detection model with 500 features. A naive approach is to train a model on all of them. But your data scientist says:

"Let's rank features by mutual information with the fraud label and take the top 50."

Or perhaps you are trying to explain why word2vec works - why "king" and "queen" end up close in embedding space. The answer involves pointwise mutual information (PMI): the mathematical relationship between a word and its context.

Or you are reading a paper on deep learning generalization that claims:

"Deep networks learn to maximize mutual information between inputs and task-relevant features while compressing irrelevant information."

All three situations require understanding mutual information - the most general measure of statistical dependence between two random variables.

Definition: Mutual Information

The mutual information between random variables $X$ and $Y$ is:

$I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}$

For continuous distributions:

$I(X; Y) = \int\int p(x, y) \log \frac{p(x, y)}{p(x)p(y)} \, dx \, dy$

Interpretation: The amount of information that knowing $X$ provides about $Y$ (and vice versa, since MI is symmetric). It equals zero iff $X$ and $Y$ are independent.

Alternative Expressions

Using entropy:

$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$

$I(X; Y) = H(X) + H(Y) - H(X, Y)$

Using KL divergence:

$I(X; Y) = D_{\text{KL}}\bigl(p(x, y) \| p(x)p(y)\bigr)$

This last form shows that MI measures how far the joint distribution is from the product of marginals - i.e., how much they deviate from independence.

Entropy Diagram: The Relationships

H(X, Y)
┌──────────────────────────────────────────────────────────┐
│                                                          │
│      H(X)                  │         H(Y)               │
│  ┌──────────────────┐      │   ┌──────────────────┐      │
│  │                  │      │   │                  │      │
│  │     H(X|Y)       │  I(X;Y)  │     H(Y|X)       │      │
│  │                  │      │   │                  │      │
│  └──────────────────┴──────┴───┴──────────────────┘      │
│                                                          │
└──────────────────────────────────────────────────────────┘

H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
I(X;Y) = H(X) + H(Y) - H(X,Y)
I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

Key Properties of Mutual Information

Symmetry

$I(X; Y) = I(Y; X)$

Knowing $X$ tells you as much about $Y$ as knowing $Y$ tells you about $X$ .

Non-Negativity

$I(X; Y) \geq 0 \quad \text{with equality iff } X \perp Y$

Since $I(X;Y) = D_{\text{KL}}(p(x,y) \| p(x)p(y)) \geq 0$ .

Upper Bounds

$I(X; Y) \leq H(X) \quad \text{and} \quad I(X; Y) \leq H(Y)$

$I(X; Y) \leq \min(H(X), H(Y))$

You cannot learn more about $X$ from $Y$ than the total uncertainty in $X$ .

Self-Information

$I(X; X) = H(X)$

A variable contains exactly $H(X)$ bits of information about itself.

Mutual Information vs. Correlation

Correlation measures only linear dependence. Mutual information captures any dependence - linear or nonlinear.

import numpy as np
from scipy import stats


def estimate_mi_binning(x: np.ndarray, y: np.ndarray, bins: int = 20) -> float:
    """
    Estimate mutual information using histogram binning.
    Simple but works well for continuous variables.
    """
    # Joint histogram
    c_xy, _, _ = np.histogram2d(x, y, bins=bins)
    # Marginal histograms
    c_x = c_xy.sum(axis=1)
    c_y = c_xy.sum(axis=0)

    n = c_xy.sum()
    # Convert to probabilities
    p_xy = c_xy / n
    p_x  = c_x / n
    p_y  = c_y / n

    # MI = sum p(x,y) * log(p(x,y) / (p(x)*p(y)))
    mi = 0.0
    for i in range(bins):
        for j in range(bins):
            if p_xy[i, j] > 0 and p_x[i] > 0 and p_y[j] > 0:
                mi += p_xy[i, j] * np.log(p_xy[i, j] / (p_x[i] * p_y[j]))
    return float(mi)


np.random.seed(42)
n = 2000

# Case 1: Linear relationship
x_lin = np.random.randn(n)
y_lin = 2 * x_lin + 0.5 * np.random.randn(n)

# Case 2: Nonlinear relationship (x^2) - correlation ≈ 0 but dependent!
x_nl = np.random.uniform(-3, 3, n)
y_nl = x_nl**2 + 0.5 * np.random.randn(n)

# Case 3: Independent
x_ind = np.random.randn(n)
y_ind = np.random.randn(n)

print(f"{'Relationship':<25} | {'Pearson r':>10} | {'MI (nats)':>10}")
print("-" * 50)

for name, x, y in [
    ("Linear (strong)", x_lin, y_lin),
    ("Quadratic (x²)", x_nl, y_nl),
    ("Independent", x_ind, y_ind),
]:
    r, _ = stats.pearsonr(x, y)
    mi = estimate_mi_binning(x, y)
    print(f"{name:<25} | {r:>10.4f} | {mi:>10.4f}")

# Output demonstrates:
# Linear relationship:  high r, high MI
# Quadratic x^2:        r ≈ 0 (linear corr fails!), but high MI (dependent!)
# Independent:          r ≈ 0, MI ≈ 0

This is why MI-based feature selection finds features that have any relationship with the target, not just linear ones.

Pointwise Mutual Information (PMI)

Pointwise mutual information measures the association between two specific events $x$ and $y$ :

$\text{PMI}(x, y) = \log \frac{p(x, y)}{p(x) \cdot p(y)}$

$\text{PMI}(x, y) > 0$ : $x$ and $y$ co-occur more than expected under independence $\text{PMI}(x, y) < 0$ : $x$ and $y$ co-occur less than expected $\text{PMI}(x, y) = 0$ : $x$ and $y$ are independent

$I(X; Y) = \mathbb{E}_{x,y \sim p(x,y)}[\text{PMI}(x, y)]$

Mutual information is the expected PMI.

PMI in Word Embeddings

:::info ML Connection - word2vec and PMI The word2vec Skip-gram model was shown by Levy & Goldberg (2014) to implicitly factorize the PMI matrix of words and their contexts.

Define the shifted PMI matrix:

$M_{ij} = \max\left(\text{PMI}(w_i, c_j) - \log k, \, 0\right)$

where $k$ is the number of negative samples. Word2vec's word vectors $W$ and context vectors $C$ approximate:

$W \cdot C^T \approx M_{\text{SPMI}}$

"king" - "man" + "woman" ≈ "queen" works because the word vectors encode PMI relationships, and PMI of "king" and "royal contexts" minus PMI of "man contexts" plus PMI of "woman contexts" points to "queen." :::

import numpy as np
from collections import Counter


def compute_pmi_matrix(
    corpus: list,
    window_size: int = 2,
    min_count: int = 1,
    k: int = 1,  # negative samples (for SPMI)
) -> tuple:
    """
    Compute PMI matrix from a text corpus.

    Returns:
        vocab: list of words
        pmi_matrix: (|V| x |V|) PMI values
    """
    # Count word and co-occurrence frequencies
    word_counts = Counter()
    cooccur_counts = Counter()

    for i, word in enumerate(corpus):
        word_counts[word] += 1
        start = max(0, i - window_size)
        end = min(len(corpus), i + window_size + 1)
        for j in range(start, end):
            if i != j:
                cooccur_counts[(word, corpus[j])] += 1

    # Filter by min_count
    vocab = [w for w, c in word_counts.items() if c >= min_count]
    vocab_idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    total_words = sum(word_counts.values())
    total_cooccur = sum(cooccur_counts.values())

    # Build PMI matrix
    pmi = np.zeros((V, V))
    for (w1, w2), count in cooccur_counts.items():
        if w1 in vocab_idx and w2 in vocab_idx:
            i, j = vocab_idx[w1], vocab_idx[w2]
            p_ij = count / total_cooccur
            p_i  = word_counts[w1] / total_words
            p_j  = word_counts[w2] / total_words
            if p_ij > 0 and p_i > 0 and p_j > 0:
                # Shifted PMI
                pmi[i, j] = max(np.log(p_ij / (p_i * p_j)) - np.log(k), 0)

    return vocab, pmi


# Simple example corpus
corpus = "the cat sat on the mat the cat ate the rat the mat is flat".split()
vocab, pmi_mat = compute_pmi_matrix(corpus, window_size=2)

print("PMI Matrix (top co-occurrences):")
print(f"Vocab: {vocab}\n")

# Show strongest associations
associations = []
V = len(vocab)
for i in range(V):
    for j in range(V):
        if i != j and pmi_mat[i, j] > 0:
            associations.append((vocab[i], vocab[j], pmi_mat[i, j]))
associations.sort(key=lambda x: -x[2])

print(f"{'Word 1':<8} {'Word 2':<8} {'PMI':>8}")
print("-" * 28)
for w1, w2, pmi in associations[:10]:
    print(f"{w1:<8} {w2:<8} {pmi:>8.4f}")

Feature Selection with Mutual Information

MI-based feature selection ranks features $X_i$ by $I(X_i; Y)$ where $Y$ is the target variable. This captures any form of statistical dependence.

from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.datasets import make_classification
import numpy as np


def mi_feature_selection(
    X: np.ndarray,
    y: np.ndarray,
    task: str = "classification",
    top_k: int = 10,
) -> list:
    """
    Rank features by mutual information with the target.

    Args:
        X:    feature matrix (n_samples, n_features)
        y:    target array
        task: 'classification' or 'regression'
        top_k: return top k features

    Returns:
        list of (feature_index, mi_score) sorted by MI descending
    """
    if task == "classification":
        mi_scores = mutual_info_classif(X, y, random_state=42)
    else:
        mi_scores = mutual_info_regression(X, y, random_state=42)

    ranked = sorted(enumerate(mi_scores), key=lambda x: -x[1])
    return ranked[:top_k]


# Create dataset with mix of relevant and irrelevant features
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=5,     # only 5 truly informative features
    n_redundant=5,
    n_repeated=0,
    n_classes=2,
    random_state=42,
)

top_features = mi_feature_selection(X, y, task="classification", top_k=10)

print("Feature Selection by Mutual Information:")
print(f"{'Feature Index':>15} | {'MI Score':>10} | {'Informative?':>14}")
print("-" * 45)
for feat_idx, mi_score in top_features:
    # sklearn's informative features are indices 0-4
    is_informative = "YES" if feat_idx < 5 else "no"
    print(f"{feat_idx:>15} | {mi_score:>10.4f} | {is_informative:>14}")

Advantages and Limitations

Advantages of MI feature selection:
✓ Captures any statistical dependency (linear + nonlinear)
✓ Model-agnostic - doesn't assume linear relationships
✓ Fast to compute (compared to wrapper methods)
✓ Theoretically grounded: maximally informative features

Limitations:
✗ Does not capture feature interactions (redundant features can both score high)
✗ Estimation is challenging for continuous variables (binning approximation)
✗ Ignores conditional dependencies: a feature might be informative only jointly
✗ MRMR (Maximum Relevance Minimum Redundancy) addresses the redundancy issue

MRMR: Maximum Relevance Minimum Redundancy

Standard MI selection can pick many redundant features (e.g., height in cm and height in inches). MRMR addresses this:

$\text{Score}(X_i) = I(X_i; Y) - \frac{1}{|S|}\sum_{X_j \in S} I(X_i; X_j)$

Select the feature with the highest relevance to the target $Y$ minus the average redundancy with already-selected features $S$ .

Conditional Mutual Information

The conditional mutual information of $X$ and $Y$ given $Z$ :

$I(X; Y | Z) = H(X|Z) - H(X|Y, Z) = \mathbb{E}_Z\bigl[I(X; Y | Z=z)\bigr]$

Expanding using the chain rule of mutual information:

$I(X; Y, Z) = I(X; Z) + I(X; Y | Z)$

This is used in the information bottleneck principle.

The Information Bottleneck Principle

The information bottleneck (Tishby, Pereira & Bialek, 1999) frames learning as a compression problem:

Given an input $X$ and target $Y$ , find a representation $T$ (the "bottleneck") that:

Compresses $X$ : minimize $I(T; X)$ - discard irrelevant information
Preserves $Y$ : maximize $I(T; Y)$ - keep task-relevant information

The IB objective:

$\min_{p(t|x)} \left[ I(T; X) - \beta \cdot I(T; Y) \right]$

where $\beta$ controls the trade-off between compression and accuracy.

:::info ML Connection - Information Bottleneck in Deep Learning Tishby & Schwartz-Ziv (2017) proposed that deep neural networks implicitly implement the information bottleneck:

Fitting phase: layers first increase $I(T; Y)$ (learn to predict) while $I(T; X)$ grows (memorize input)
Compression phase: $I(T; X)$ decreases (compress irrelevant info) while $I(T; Y)$ is maintained

This would explain why deep networks generalize: they discard input variations irrelevant to the task.

Note: This theory is debated - Saxe et al. (2018) showed the compression phase may not occur with non-saturating activations. But the information bottleneck framework remains a useful conceptual tool for understanding representation learning. :::

Information Bottleneck Trade-off Curve:

I(T;Y)
(task info)
1.0 |                      ****
    |                   ***
    |                ***
0.5 |            ****
    |         ***
    |      **
0.0 |___***________________
    0       0.5       1.0     I(T;X)
                           (input compression)

The optimal representations lie on this "information curve."
Points above the curve are unachievable.
Points below are suboptimal.
β controls where you land on the curve.

Computing MI for Discrete Variables

import numpy as np
from itertools import product


def mutual_information_discrete(
    x: np.ndarray,
    y: np.ndarray,
) -> float:
    """
    Exact mutual information for discrete variables.

    Args:
        x: discrete observations of X
        y: discrete observations of Y (same length as x)

    Returns:
        I(X;Y) in nats
    """
    assert len(x) == len(y)
    n = len(x)

    x_vals = np.unique(x)
    y_vals = np.unique(y)

    # Count joint occurrences
    joint_counts = {}
    for xi, yi in zip(x, y):
        joint_counts[(xi, yi)] = joint_counts.get((xi, yi), 0) + 1

    # Convert to probabilities
    p_joint = {k: v / n for k, v in joint_counts.items()}
    p_x = {xi: np.sum(x == xi) / n for xi in x_vals}
    p_y = {yi: np.sum(y == yi) / n for yi in y_vals}

    mi = 0.0
    for (xi, yi), pxy in p_joint.items():
        if pxy > 0:
            mi += pxy * np.log(pxy / (p_x[xi] * p_y[yi]))

    return mi


# Example: MI between a feature and binary label
np.random.seed(42)
n = 1000

# Feature 1: highly informative
x1 = np.random.choice([0, 1, 2], size=n, p=[0.4, 0.3, 0.3])
y  = np.where(x1 == 0, np.random.choice([0, 1], p=[0.8, 0.2]),
     np.where(x1 == 1, np.random.choice([0, 1], p=[0.5, 0.5]),
                        np.random.choice([0, 1], p=[0.2, 0.8])))

# Feature 2: independent of label
x2 = np.random.choice([0, 1, 2], size=n, p=[0.4, 0.3, 0.3])

mi_x1_y = mutual_information_discrete(x1, y)
mi_x2_y = mutual_information_discrete(x2, y)
h_y = -np.sum([0.5 * np.log(0.5)] * 2)  # entropy of balanced binary target

print(f"MI(X1, Y) = {mi_x1_y:.4f} nats  ← informative feature")
print(f"MI(X2, Y) = {mi_x2_y:.4f} nats  ← random feature (should be ≈ 0)")
print(f"H(Y)      = {h_y:.4f} nats   (maximum possible MI)")
print(f"\nX1 explains {100*mi_x1_y/h_y:.1f}% of Y's entropy")

MI in Representation Learning: InfoNCE

Modern self-supervised learning methods (SimCLR, MoCo, CLIP) use MI-based objectives. The InfoNCE loss provides a lower bound on mutual information:

$I(X; Y) \geq \log N - \mathcal{L}_{\text{InfoNCE}}$

where:

$\mathcal{L}_{\text{InfoNCE}} = -\mathbb{E}\left[\log \frac{e^{f(x, y^+)}}{\sum_{j=1}^N e^{f(x, y_j)}}\right]$

Here $y^+$ is the positive (matching) sample for $x$ , and $y_1, ..., y_N$ are $N$ samples including one positive and $N-1$ negatives. The function $f(x, y)$ is a similarity score (often dot product of normalized embeddings).

import torch
import torch.nn.functional as F


def infonce_loss(
    embeddings_a: torch.Tensor,
    embeddings_b: torch.Tensor,
    temperature: float = 0.07,
) -> torch.Tensor:
    """
    InfoNCE / NT-Xent loss (used in SimCLR, CLIP).

    Maximizes mutual information between paired views.

    Args:
        embeddings_a: (batch, dim) - view 1 embeddings, L2-normalized
        embeddings_b: (batch, dim) - view 2 embeddings, L2-normalized
        temperature:  τ - controls sharpness of distribution

    Returns:
        InfoNCE loss (scalar)
    """
    batch_size = embeddings_a.shape[0]

    # Normalize embeddings
    a = F.normalize(embeddings_a, dim=-1)
    b = F.normalize(embeddings_b, dim=-1)

    # Similarity matrix: (batch, batch)
    sim_matrix = torch.matmul(a, b.T) / temperature

    # Labels: diagonal entries are the matching pairs
    labels = torch.arange(batch_size, device=a.device)

    # Symmetric loss (as in SimCLR)
    loss_a = F.cross_entropy(sim_matrix, labels)          # a → b
    loss_b = F.cross_entropy(sim_matrix.T, labels)        # b → a

    return (loss_a + loss_b) / 2


# Demo: random embeddings (before training)
torch.manual_seed(42)
batch, dim = 8, 64

emb_a = torch.randn(batch, dim)
emb_b = torch.randn(batch, dim)  # independent (not paired yet)
emb_b_paired = emb_a + 0.1 * torch.randn(batch, dim)  # similar to a

loss_random = infonce_loss(emb_a, emb_b)
loss_paired = infonce_loss(emb_a, emb_b_paired)

print(f"InfoNCE loss (random pairs, low MI):   {loss_random.item():.4f}")
print(f"InfoNCE loss (correlated pairs, high MI): {loss_paired.item():.4f}")
print(f"MI lower bound from random pairs: {np.log(batch) - loss_random.item():.4f}")
print(f"MI lower bound from paired:       {np.log(batch) - loss_paired.item():.4f}")

Summary: Mutual Information Relationships

I(X;Y) Definition
────────────────────────────────────────────────────────────
Via joint/marginals:  Σ p(x,y) log [p(x,y) / p(x)p(y)]
Via entropies:        H(X) + H(Y) - H(X,Y)
Via conditional:      H(X) - H(X|Y) = H(Y) - H(Y|X)
Via KL:               D_KL(p(x,y) || p(x)p(y))

Applications
────────────────────────────────────────────────────────────
Feature selection:    Rank by I(feature; target)
Word embeddings:      Skip-gram implicitly factorizes PMI matrix
Self-supervised:      InfoNCE maximizes lower bound on I(view1; view2)
IB principle:         Find T minimizing I(T;X) - β·I(T;Y)
Causal discovery:     CI tests use I(X;Y|Z) = 0 for independence

Interview Questions and Answers

Q1: What is mutual information, and how does it differ from correlation?

Mutual information $I(X;Y) = D_{\text{KL}}(p(x,y) \| p(x)p(y))$ measures the total statistical dependence between $X$ and $Y$ - including any nonlinear relationship. It equals zero iff $X$ and $Y$ are independent.

Pearson correlation measures only linear dependence. A classic example: if $Y = X^2$ and $X \sim \text{Uniform}(-1, 1)$ , the Pearson correlation is 0 (no linear relationship), but the mutual information is large ( $Y$ is completely determined by $X$ ). MI-based feature selection finds features that are predictive in any way, not just linearly.

Q2: Explain the information bottleneck principle and its connection to deep learning.

The information bottleneck seeks a compressed representation $T$ of input $X$ that maximally preserves information about target $Y$ . Formally: minimize $I(T;X)$ while maximizing $I(T;Y)$ , with a trade-off controlled by $\beta$ . The optimal $T$ retains only the task-relevant information from $X$ , discarding everything else.

Tishby & Schwartz-Ziv (2017) hypothesized that deep neural networks implicitly solve this problem: after an initial fitting phase (increasing $I(T;Y)$ ), networks undergo a compression phase (decreasing $I(T;X)$ while maintaining $I(T;Y)$ ). This compression is what enables generalization - the network discards input variations that don't affect the output. The theory is contested but provides useful intuition for why depth and regularization help.

Q3: How does word2vec relate to pointwise mutual information?

Levy & Goldberg (2014) showed that the skip-gram word2vec model with negative sampling, trained to predict context words from target words, implicitly factorizes the shifted PMI matrix:

$W_w \cdot W_c^T = M_{wc} \approx \text{SPMI}(w, c) = \max\left(\text{PMI}(w,c) - \log k, 0\right)$

where $k$ is the number of negative samples and PMI $(w,c) = \log \frac{p(w,c)}{p(w)p(c)}$ .

This is why word2vec captures semantic relationships: words that appear in similar contexts have high PMI with those contexts, and the embedding geometry reflects these PMI values. The famous analogy "king - man + woman = queen" works because the embedding arithmetic approximates PMI arithmetic.

Q4: Why can mutual information be hard to estimate for continuous variables?

For discrete variables, MI estimation is straightforward: count co-occurrences and compute the sum. For continuous variables, we cannot count exact probabilities - we need density estimation, which is hard in high dimensions.

Common approaches:

Binning/histogram: Discretize the continuous variables. Simple but depends heavily on bin width.
k-NN estimators (Kraskov et al.): Use the distance to the k-th nearest neighbor. Consistent and widely used.
Neural MI estimation (MINE): Train a network to estimate the KL divergence between joint and marginal distributions. Scales to high dimensions but has high variance.
InfoNCE bound: A lower bound on MI that is easily optimized with contrastive learning.

All methods have biases and limitations. This is why feature selection MI estimators (like scikit-learn's) use smoothing and careful implementation rather than naive binning.

Q5: What is the data processing inequality, and what does it say about neural network representations?

The data processing inequality states: for a Markov chain $X \to Y \to Z$ , we have $I(X;Z) \leq I(X;Y)$ . Processing $Y$ to get $Z$ cannot increase the information about $X$ .

For neural networks, this means: each layer can only reduce or maintain the information from the input layer. Layer $L+1$ cannot have more information about the input $X$ than layer $L$ . This has several implications:

The softmax output layer has at most as much information about $X$ as any hidden layer
You cannot add information to intermediate representations that was not in the input
The task of each layer is to extract and organize task-relevant information, not to create new information

In the information bottleneck view, deep networks ideally increase $I(\text{layer};Y)$ in early layers (feature extraction) and decrease $I(\text{layer};X)$ in later layers (compression of irrelevant input variations) - all while respecting that $I(\text{layer};X)$ can never increase from one layer to the next.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Mutual Information demo on the EngineersOfAI Playground - no code required.

:::

The Scenario That Motivates This Lesson​

Definition: Mutual Information​

Alternative Expressions​

Entropy Diagram: The Relationships​

Key Properties of Mutual Information​

Symmetry​

Non-Negativity​

Upper Bounds​

Self-Information​

Mutual Information vs. Correlation​

Pointwise Mutual Information (PMI)​

PMI in Word Embeddings​

Feature Selection with Mutual Information​

Advantages and Limitations​

MRMR: Maximum Relevance Minimum Redundancy​

Conditional Mutual Information​

The Information Bottleneck Principle​

Computing MI for Discrete Variables​

MI in Representation Learning: InfoNCE​

Summary: Mutual Information Relationships​

Interview Questions and Answers​