Data Compression Fundamentals
The Scenario That Motivates This Lesson
A colleague asks why evaluating language models with perplexity makes sense. You know the formula: , but can you explain the deeper reason?
Here it is: a language model is a compression algorithm. If a model assigns high probability to the next token, it needs few bits to encode that token. A model with lower perplexity can compress text more efficiently - it has better understood the structure of language.
This connection between language modeling and compression is not a metaphor. Shannon's source coding theorem makes it precise: the minimum bits needed to encode messages from a source equals the source's entropy. A better language model = a better compressor = lower entropy = lower perplexity.
Understanding data compression gives you a principled way to think about what language models, generative models, and representation learning are actually doing.
Shannon's Source Coding Theorem
Theorem (Shannon, 1948): For a discrete memoryless source with entropy , the minimum average code length of any uniquely decodable code satisfies:
You cannot do better than bits per symbol on average. And you can always get within 1 bit of this bound using an optimal code (Huffman coding).
Implication: Entropy is the fundamental limit of lossless compression. No algorithm can compress a source below its entropy rate.
Why the Lower Bound Holds
Suppose you have a code that assigns length to message . For the code to be uniquely decodable (you can always decode the original message), it must satisfy the Kraft inequality:
The average code length is . Minimizing this subject to Kraft's inequality (using Lagrange multipliers) gives the optimal . The minimum average length is then:
Prefix-Free Codes and Code Trees
A prefix-free code (or prefix code) has the property that no codeword is a prefix of any other. This allows unique decoding without delimiters.
Example: Code tree for 4 symbols
Codewords: Binary tree:
A → 0 [root]
B → 10 0/ \1
C → 110 [A] [ ]
D → 111 0/ \1
[B] [ ]
0/ \1
[C] [D]
Average length with P(A)=0.5, P(B)=0.25, P(C)=P(D)=0.125:
L = 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75 bits
H(X) = 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75 bits ← matches perfectly!
When probabilities are powers of 2, Huffman coding achieves exactly .
Huffman Coding: Optimal Prefix Codes
Huffman coding constructs the optimal prefix-free code greedily:
Algorithm:
- Create a leaf node for each symbol with its frequency
- Build a min-heap (priority queue) of nodes
- While more than one node remains: a. Remove the two nodes with lowest probability b. Create a new internal node with their combined probability c. Assign "0" to the left child and "1" to the right child d. Insert the new node back into the heap
- The remaining node is the root of the Huffman tree
- Codewords = paths from root to each leaf
import heapq
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional
@dataclass(order=True)
class HuffmanNode:
"""Node in a Huffman tree."""
prob: float
symbol: Optional[str] = field(default=None, compare=False)
left: Optional['HuffmanNode'] = field(default=None, compare=False)
right: Optional['HuffmanNode'] = field(default=None, compare=False)
def build_huffman_tree(freqs: dict) -> HuffmanNode:
"""Build a Huffman tree from symbol frequencies."""
total = sum(freqs.values())
heap = [HuffmanNode(count / total, symbol)
for symbol, count in freqs.items()]
heapq.heapify(heap)
while len(heap) > 1:
left = heapq.heappop(heap)
right = heapq.heappop(heap)
parent = HuffmanNode(
prob=left.prob + right.prob,
left=left,
right=right,
)
heapq.heappush(heap, parent)
return heap[0]
def extract_codes(
node: HuffmanNode,
prefix: str = "",
codes: dict = None,
) -> dict:
"""Extract codewords from the Huffman tree via DFS."""
if codes is None:
codes = {}
if node.symbol is not None:
codes[node.symbol] = prefix or "0" # single symbol case
else:
extract_codes(node.left, prefix + "0", codes)
extract_codes(node.right, prefix + "1", codes)
return codes
def huffman_encode_decode(text: str) -> dict:
"""Full Huffman encoding and decoding example."""
# Build codes
freqs = Counter(text)
root = build_huffman_tree(freqs)
codes = extract_codes(root)
# Encode
encoded = "".join(codes[c] for c in text)
# Decode
decoded = []
node = root
for bit in encoded:
node = node.left if bit == "0" else node.right
if node.symbol is not None:
decoded.append(node.symbol)
node = root
# Analysis
original_bits = len(text) * 8 # ASCII: 8 bits/char
encoded_bits = len(encoded)
entropy = sum(-p * (p > 0) * __import__('math').log2(p)
for p in (c / len(text) for c in freqs.values()))
return {
"codes": codes,
"encoded": encoded[:40] + "...",
"decoded_matches": "".join(decoded) == text,
"original_bits": original_bits,
"encoded_bits": encoded_bits,
"compression_ratio": original_bits / encoded_bits,
"entropy": entropy,
"avg_code_length": encoded_bits / len(text),
}
# Example 1: English text
text = "the quick brown fox jumps over the lazy dog"
result = huffman_encode_decode(text)
print("=== Huffman Coding ===")
print("\nCodewords:")
for symbol, code in sorted(result["codes"].items(), key=lambda x: len(x[1])):
freq = text.count(symbol) / len(text)
print(f" '{symbol}' ({freq:.3f}): {code}")
print(f"\nOriginal: {result['original_bits']} bits ({result['original_bits']/8:.0f} bytes)")
print(f"Huffman: {result['encoded_bits']} bits")
print(f"Compression ratio: {result['compression_ratio']:.2f}x")
print(f"Entropy: {result['entropy']:.4f} bits/symbol")
print(f"Avg length: {result['avg_code_length']:.4f} bits/symbol")
print(f"Overhead: {result['avg_code_length'] - result['entropy']:.4f} bits/symbol")
print(f"Decoded correctly: {result['decoded_matches']}")
Arithmetic Coding: Approaching Entropy Exactly
Huffman coding assigns integer-length codewords - wasteful when is not an integer. Arithmetic coding can approach arbitrarily closely.
Core idea: Encode an entire message as a single real number in .
- Start with the interval
- For each symbol, subdivide the current interval proportionally to each symbol's probability
- Advance to the sub-interval corresponding to the next symbol
- Output any binary number within the final interval
Example: Encode "AB" with P(A)=0.7, P(B)=0.3
Initial: [0.0, 1.0)
Encode 'A': interval for A = [0.0, 0.7)
new range = [0.0, 0.7)
Encode 'B': within [0.0, 0.7), B occupies the top 30%
new range = [0.7*0.7, 0.7] = [0.49, 0.7)
Output any number in [0.49, 0.7): e.g., 0.5
Binary: 0.1 → code = "1" (1 bit for message "AB")
Entropy: H ≈ 2 symbols * (0.7*log2(1/0.7) + 0.3*log2(1/0.3)) ≈ 1.69 bits
Arithmetic coding gets close to this; Huffman would need 2 bits (1 per symbol).
def arithmetic_encode(message: str, probabilities: dict) -> float:
"""
Arithmetic encoding - encode a message as a float in [0, 1).
Args:
message: string to encode
probabilities: dict of {symbol: probability}
Returns:
encoded value in [0, 1)
"""
# Build cumulative distribution
symbols = sorted(probabilities.keys())
cumulative = {}
cumul = 0.0
for s in symbols:
cumulative[s] = cumul
cumul += probabilities[s]
low, high = 0.0, 1.0
for symbol in message:
range_size = high - low
high = low + range_size * (cumulative[symbol] + probabilities[symbol])
low = low + range_size * cumulative[symbol]
return (low + high) / 2 # midpoint
def arithmetic_decode(
encoded: float,
probabilities: dict,
message_length: int,
) -> str:
"""Arithmetic decoding - recover message from float."""
symbols = sorted(probabilities.keys())
cumulative = {}
cumul = 0.0
for s in symbols:
cumulative[s] = cumul
cumul += probabilities[s]
message = []
value = encoded
for _ in range(message_length):
for s in symbols:
low = cumulative[s]
high = low + probabilities[s]
if low <= value < high:
message.append(s)
value = (value - low) / probabilities[s]
break
return "".join(message)
# Demo
probs = {"A": 0.7, "B": 0.2, "C": 0.1}
message = "AABACAA"
encoded_val = arithmetic_encode(message, probs)
decoded = arithmetic_decode(encoded_val, probs, len(message))
# Theoretical entropy
import math
H = -sum(p * math.log2(p) for p in probs.values())
print(f"Source entropy: {H:.4f} bits/symbol")
print(f"Message: '{message}' ({len(message)} symbols)")
print(f"Theoretical minimum: {H * len(message):.4f} bits")
print(f"Encoded value: {encoded_val:.10f}")
print(f"Decoded: '{decoded}'")
print(f"Correct: {decoded == message}")
Lossless vs. Lossy Compression
Lossless Compression
────────────────────────────────────────────────────────────
Goal: recover exact original data
Theoretical limit: H(X) bits per symbol
Methods: Huffman, arithmetic, LZ77/LZ78 (ZIP/DEFLATE), LZMA
ML use case: compressing model weights, exact data storage
Lossy Compression
────────────────────────────────────────────────────────────
Goal: approximate original, trading accuracy for size
Theoretical limit: rate-distortion theory R(D)
Methods: JPEG, MP3, WebP, neural codecs
ML use case: model quantization, neural image compression
Rate-Distortion Tradeoff:
- R(D) = minimum bits needed to achieve distortion ≤ D
- Lower distortion tolerance → more bits required
- R(D=0) = H(X) (lossless limit)
Rate-Distortion Theory
For a source and distortion measure , the rate-distortion function gives the minimum description rate (bits/symbol) to achieve expected distortion at most :
This is the information-theoretic limit of lossy compression. Modern neural image compression (learned codecs) approaches this bound.
ML Connection: Language Models Are Compressors
:::info ML Connection - LLMs as Compressors A language model is implicitly a lossless compressor of text:
Compression procedure:
- For each token , use to define a code
- Assign shorter codes to high-probability tokens
- Total bits needed ≈
The cross-entropy loss is exactly the average compression cost per token. Perplexity is - the "effective vocabulary size" at each step.
Implications:
- A GPT-4-level model compressing English text is evidence it has modeled the statistical structure of language
- Hutter Prize: $50,000 for best compression of 100MB Wikipedia. Better AI = better compression.
- DeepMind's Chinchilla compresses web text better than zip utilities on text benchmarks :::
import numpy as np
import gzip
import io
def language_model_compression_bits(
token_log_probs: np.ndarray,
) -> dict:
"""
Compute compression statistics for a language model.
Args:
token_log_probs: array of log P(token | context) in nats
Returns:
compression stats
"""
n = len(token_log_probs)
total_nats = -np.sum(token_log_probs)
total_bits = total_nats / np.log(2)
avg_bits_per_token = total_bits / n
perplexity = np.exp(total_nats / n)
return {
"n_tokens": n,
"total_nats": total_nats,
"total_bits": total_bits,
"bits_per_token": avg_bits_per_token,
"perplexity": perplexity,
"compression_ratio_vs_uniform_50k": np.log2(50000) / avg_bits_per_token,
}
# Simulate different quality language models
np.random.seed(42)
n_tokens = 100
# Simulate token-level log probs for different model qualities
def simulate_log_probs(avg_perplexity: float, n: int) -> np.ndarray:
"""Generate realistic token log probs with given average perplexity."""
avg_log_p = -np.log(avg_perplexity)
# Add variance to simulate real distribution
return np.random.normal(avg_log_p, 1.0, n)
print("=== Language Model Compression Stats ===\n")
print(f"{'Model':<20} | {'PPL':>8} | {'Bits/tok':>10} | {'Compression vs rand':>20}")
print("-" * 65)
models = [
("Random (PPL=50k)", 50000),
("GPT-2 small", 50),
("GPT-3", 20),
("GPT-4 (est.)", 10),
("Human (est.)", 7),
]
for model_name, target_ppl in models:
log_probs = simulate_log_probs(target_ppl, n_tokens)
stats = language_model_compression_bits(log_probs)
print(
f"{model_name:<20} | {stats['perplexity']:>8.1f} | "
f"{stats['bits_per_token']:>10.2f} | "
f"{stats['compression_ratio_vs_uniform_50k']:>20.2f}x"
)
# Compare LM compression vs gzip on real text
sample_text = b"the cat sat on the mat and the cat ate the rat that sat on the mat" * 10
gzip_compressed = gzip.compress(sample_text)
lm_equiv_bits = len(sample_text) * 8 / 3.0 # simulate 3x compression by a language model
print(f"\nText size (raw): {len(sample_text)} bytes ({len(sample_text)*8} bits)")
print(f"After gzip: {len(gzip_compressed)} bytes ({len(gzip_compressed)*8} bits)")
print(f"gzip compression: {len(sample_text)/len(gzip_compressed):.2f}x")
print(f"(A good LM would achieve even better compression by exploiting long-range context)")
Kolmogorov Complexity and Compression as Intelligence
Kolmogorov complexity of a string is the length of the shortest program (on a universal Turing machine) that outputs .
This is the theoretical minimum description length of - independent of any particular code.
Key properties:
- : the string itself is always a valid program
- is not computable (it's undecidable in general)
- For most strings: - random strings are incompressible
- For structured strings: - "the 1000th digit of π" is short
Compression as intelligence: The Hutter Prize ($50,000 for best compression of 100MB of Wikipedia) is based on the insight that intelligence = compression. Understanding text means being able to predict it efficiently = compress it. A system that truly understands the world should be able to describe it concisely.
Intelligence ↔ Compression (informal):
- "Paris is the capital of France" can be compressed if you know geography
- "The stock opened at 142.37" can't be compressed well without prediction
- A model that understands language, science, and history
can compress text better than any syntactic tool
Kolmogorov complexity captures the theoretical limit.
Language model perplexity approximates achievable compression.
Entropy Coding in Practice
Modern lossless compressors (ZIP, FLAC, PNG) use entropy coding as a final step:
Typical Compression Pipeline:
Entropy
Raw Data → Modeling → Residuals → Prediction → Coding → Bitstream
│ │ │ │ │
│ (predict (what's (prob model Huffman/
│ next left) for residual) Arithmetic)
│ value)
│
└── Goal: drive residuals to be small and predictable
so entropy of residuals is low
FLAC (audio): fits a linear predictor to audio samples, then entropy-codes the small residuals.
PNG (images): applies a filter (predict each pixel from neighbors), entropy-codes the prediction errors.
LZ77 (ZIP): replaces repeated substrings with back-references (explicit compression of redundancy).
In all cases, the entropy of the residuals is the bottleneck - and Shannon's theorem is the fundamental limit.
Normalized Compression Distance
The normalized compression distance (NCD) is a practical, parameter-free similarity measure based on compression:
where is the compressed size of and is the concatenation of and .
Intuition: if and are similar, (the compressor reuses patterns from when compressing ). NCD is near 0 for similar strings and near 1 for unrelated strings.
import gzip
def ncd(x: bytes, y: bytes) -> float:
"""
Normalized Compression Distance using gzip.
Values in [0, 1]: 0 = identical, 1 = completely different.
"""
cx = len(gzip.compress(x))
cy = len(gzip.compress(y))
cxy = len(gzip.compress(x + y))
return (cxy - min(cx, cy)) / max(cx, cy)
# Example: text similarity
texts = {
"Python intro": b"Python is a high-level programming language for AI and ML.",
"Python advanced": b"Python generators and decorators enable powerful metaprogramming.",
"French text": b"Le chat est assis sur le tapis et mange une souris.",
"Random bytes": bytes(range(50)), # incompressible
}
print("=== Normalized Compression Distance ===\n")
keys = list(texts.keys())
print(f"{'Text A':<18} vs {'Text B':<18} | {'NCD':>6}")
print("-" * 50)
for i in range(len(keys)):
for j in range(i+1, len(keys)):
d = ncd(texts[keys[i]], texts[keys[j]])
print(f"{keys[i]:<18} vs {keys[j]:<18} | {d:>6.4f}")
Interview Questions and Answers
Q1: State Shannon's source coding theorem and explain its significance for ML.
Shannon's theorem: for a discrete memoryless source with entropy , the minimum average code length of any uniquely decodable code satisfies bits per symbol. This is a fundamental limit - no algorithm can compress a source below its entropy rate.
For ML, this means: the cross-entropy loss measures how close the model's compression is to the entropy lower bound. Perplexity is the effective "branching factor" at each prediction step. A perfect model with cross-entropy equal to would achieve optimal compression. Language models are competing to approach Shannon's theoretical limit for human language.
Q2: Explain Huffman coding and when it is suboptimal.
Huffman coding builds an optimal prefix-free code by greedily combining the two least probable symbols. It achieves average code length satisfying bits per symbol.
It is suboptimal (wastes up to 1 bit per symbol) when probabilities are not powers of 2. For example, if , the optimal code length is bits, but Huffman assigns 1 bit - 6x too many. For a sequence of symbols, block coding (Huffman coding blocks of symbols) reduces overhead to less than bits per symbol, approaching .
Arithmetic coding and ANS (asymmetric numeral systems, used in modern codecs) solve this by achieving arbitrarily close to without the integer-length constraint.
Q3: Why is a better language model equivalent to a better text compressor?
A language model can be used as an arithmetic coder: assign each token a code length proportional to . The total compressed size of a document with tokens is bits - which is exactly the cross-entropy loss multiplied by .
A lower perplexity model assigns higher probability to the actual tokens → needs fewer bits per token → is a better compressor. Conversely, any good text compressor implicitly represents a probabilistic model of language. They are two sides of the same coin. The Hutter Prize operationalizes this: cash prizes for any algorithm that better compresses 100MB of Wikipedia, which is simultaneously a prize for better intelligence.
Q4: What is Kolmogorov complexity, and why is it uncomputable?
Kolmogorov complexity is the length of the shortest program on a universal Turing machine that outputs and halts. It is the absolute minimum description length of - the theoretical compression limit independent of any encoding.
It is uncomputable because determining whether there exists a shorter program than a given one requires solving the halting problem (can this program be proven to halt and output ?). For any string, you can enumerate programs of increasing length and check if they output , but you cannot know when to stop - some very long programs might be running and about to output . This fundamental undecidability means we can never compute in general.
In practice, we use computable approximations (zip file size, Lempel-Ziv complexity) and language model perplexity.
Q5: What is the connection between rate-distortion theory and model quantization?
Rate-distortion theory describes the fundamental trade-off: to represent a source with distortion at most , you need at least bits per symbol. Lower distortion tolerance requires more bits.
Model quantization is a direct application: when you quantize neural network weights from float32 (32 bits/weight) to int8 (8 bits/weight), you are moving along the rate-distortion curve. You use fewer bits ( decreases) but introduce approximation errors in the weights (distortion increases), which increases model error.
The optimal quantization strategy minimizes for a given - this is the theory behind methods like GPTQ, AWQ, and learned quantization. Mixed-precision quantization (different bit-widths for different layers) tries to stay as close as possible to the rate-distortion frontier: use fewer bits where the model is robust to quantization error, more bits for sensitive layers (embedding layers, first/last layers). The theoretical framework predicts that model accuracy should degrade smoothly as you reduce bits, which is empirically confirmed.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Huffman Coding demo on the EngineersOfAI Playground - no code required.
:::
