Skip to main content

Data Compression Fundamentals

The Scenario That Motivates This Lesson

A colleague asks why evaluating language models with perplexity makes sense. You know the formula: PPL=eLCE\text{PPL} = e^{\mathcal{L}_{\text{CE}}}, but can you explain the deeper reason?

Here it is: a language model is a compression algorithm. If a model assigns high probability to the next token, it needs few bits to encode that token. A model with lower perplexity can compress text more efficiently - it has better understood the structure of language.

This connection between language modeling and compression is not a metaphor. Shannon's source coding theorem makes it precise: the minimum bits needed to encode messages from a source equals the source's entropy. A better language model = a better compressor = lower entropy = lower perplexity.

Understanding data compression gives you a principled way to think about what language models, generative models, and representation learning are actually doing.

Shannon's Source Coding Theorem

Theorem (Shannon, 1948): For a discrete memoryless source XX with entropy H(X)H(X), the minimum average code length LL of any uniquely decodable code satisfies:

H(X)L<H(X)+1(bits per symbol)H(X) \leq L < H(X) + 1 \quad \text{(bits per symbol)}

You cannot do better than H(X)H(X) bits per symbol on average. And you can always get within 1 bit of this bound using an optimal code (Huffman coding).

Implication: Entropy is the fundamental limit of lossless compression. No algorithm can compress a source below its entropy rate.

Why the Lower Bound Holds

Suppose you have a code that assigns length lil_i to message ii. For the code to be uniquely decodable (you can always decode the original message), it must satisfy the Kraft inequality:

i2li1\sum_{i} 2^{-l_i} \leq 1

The average code length is L=ipiliL = \sum_i p_i l_i. Minimizing this subject to Kraft's inequality (using Lagrange multipliers) gives the optimal li=log2pil_i^* = -\log_2 p_i. The minimum average length is then:

L=ipi(log2pi)=H(X)L^* = \sum_i p_i \cdot (-\log_2 p_i) = H(X)

Prefix-Free Codes and Code Trees

A prefix-free code (or prefix code) has the property that no codeword is a prefix of any other. This allows unique decoding without delimiters.

Example: Code tree for 4 symbols

Codewords: Binary tree:
A → 0 [root]
B → 10 0/ \1
C → 110 [A] [ ]
D → 111 0/ \1
[B] [ ]
0/ \1
[C] [D]

Average length with P(A)=0.5, P(B)=0.25, P(C)=P(D)=0.125:
L = 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75 bits
H(X) = 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75 bits ← matches perfectly!

When probabilities are powers of 2, Huffman coding achieves exactly H(X)H(X).

Huffman Coding: Optimal Prefix Codes

Huffman coding constructs the optimal prefix-free code greedily:

Algorithm:

  1. Create a leaf node for each symbol with its frequency
  2. Build a min-heap (priority queue) of nodes
  3. While more than one node remains: a. Remove the two nodes with lowest probability b. Create a new internal node with their combined probability c. Assign "0" to the left child and "1" to the right child d. Insert the new node back into the heap
  4. The remaining node is the root of the Huffman tree
  5. Codewords = paths from root to each leaf
import heapq
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass(order=True)
class HuffmanNode:
"""Node in a Huffman tree."""
prob: float
symbol: Optional[str] = field(default=None, compare=False)
left: Optional['HuffmanNode'] = field(default=None, compare=False)
right: Optional['HuffmanNode'] = field(default=None, compare=False)


def build_huffman_tree(freqs: dict) -> HuffmanNode:
"""Build a Huffman tree from symbol frequencies."""
total = sum(freqs.values())
heap = [HuffmanNode(count / total, symbol)
for symbol, count in freqs.items()]
heapq.heapify(heap)

while len(heap) > 1:
left = heapq.heappop(heap)
right = heapq.heappop(heap)
parent = HuffmanNode(
prob=left.prob + right.prob,
left=left,
right=right,
)
heapq.heappush(heap, parent)

return heap[0]


def extract_codes(
node: HuffmanNode,
prefix: str = "",
codes: dict = None,
) -> dict:
"""Extract codewords from the Huffman tree via DFS."""
if codes is None:
codes = {}
if node.symbol is not None:
codes[node.symbol] = prefix or "0" # single symbol case
else:
extract_codes(node.left, prefix + "0", codes)
extract_codes(node.right, prefix + "1", codes)
return codes


def huffman_encode_decode(text: str) -> dict:
"""Full Huffman encoding and decoding example."""
# Build codes
freqs = Counter(text)
root = build_huffman_tree(freqs)
codes = extract_codes(root)

# Encode
encoded = "".join(codes[c] for c in text)

# Decode
decoded = []
node = root
for bit in encoded:
node = node.left if bit == "0" else node.right
if node.symbol is not None:
decoded.append(node.symbol)
node = root

# Analysis
original_bits = len(text) * 8 # ASCII: 8 bits/char
encoded_bits = len(encoded)
entropy = sum(-p * (p > 0) * __import__('math').log2(p)
for p in (c / len(text) for c in freqs.values()))

return {
"codes": codes,
"encoded": encoded[:40] + "...",
"decoded_matches": "".join(decoded) == text,
"original_bits": original_bits,
"encoded_bits": encoded_bits,
"compression_ratio": original_bits / encoded_bits,
"entropy": entropy,
"avg_code_length": encoded_bits / len(text),
}


# Example 1: English text
text = "the quick brown fox jumps over the lazy dog"
result = huffman_encode_decode(text)

print("=== Huffman Coding ===")
print("\nCodewords:")
for symbol, code in sorted(result["codes"].items(), key=lambda x: len(x[1])):
freq = text.count(symbol) / len(text)
print(f" '{symbol}' ({freq:.3f}): {code}")

print(f"\nOriginal: {result['original_bits']} bits ({result['original_bits']/8:.0f} bytes)")
print(f"Huffman: {result['encoded_bits']} bits")
print(f"Compression ratio: {result['compression_ratio']:.2f}x")
print(f"Entropy: {result['entropy']:.4f} bits/symbol")
print(f"Avg length: {result['avg_code_length']:.4f} bits/symbol")
print(f"Overhead: {result['avg_code_length'] - result['entropy']:.4f} bits/symbol")
print(f"Decoded correctly: {result['decoded_matches']}")

Arithmetic Coding: Approaching Entropy Exactly

Huffman coding assigns integer-length codewords - wasteful when log2pi-\log_2 p_i is not an integer. Arithmetic coding can approach H(X)H(X) arbitrarily closely.

Core idea: Encode an entire message as a single real number in [0,1)[0, 1).

  1. Start with the interval [0,1)[0, 1)
  2. For each symbol, subdivide the current interval proportionally to each symbol's probability
  3. Advance to the sub-interval corresponding to the next symbol
  4. Output any binary number within the final interval
Example: Encode "AB" with P(A)=0.7, P(B)=0.3

Initial: [0.0, 1.0)

Encode 'A': interval for A = [0.0, 0.7)
new range = [0.0, 0.7)

Encode 'B': within [0.0, 0.7), B occupies the top 30%
new range = [0.7*0.7, 0.7] = [0.49, 0.7)

Output any number in [0.49, 0.7): e.g., 0.5
Binary: 0.1 → code = "1" (1 bit for message "AB")

Entropy: H ≈ 2 symbols * (0.7*log2(1/0.7) + 0.3*log2(1/0.3)) ≈ 1.69 bits
Arithmetic coding gets close to this; Huffman would need 2 bits (1 per symbol).
def arithmetic_encode(message: str, probabilities: dict) -> float:
"""
Arithmetic encoding - encode a message as a float in [0, 1).

Args:
message: string to encode
probabilities: dict of {symbol: probability}

Returns:
encoded value in [0, 1)
"""
# Build cumulative distribution
symbols = sorted(probabilities.keys())
cumulative = {}
cumul = 0.0
for s in symbols:
cumulative[s] = cumul
cumul += probabilities[s]

low, high = 0.0, 1.0

for symbol in message:
range_size = high - low
high = low + range_size * (cumulative[symbol] + probabilities[symbol])
low = low + range_size * cumulative[symbol]

return (low + high) / 2 # midpoint


def arithmetic_decode(
encoded: float,
probabilities: dict,
message_length: int,
) -> str:
"""Arithmetic decoding - recover message from float."""
symbols = sorted(probabilities.keys())
cumulative = {}
cumul = 0.0
for s in symbols:
cumulative[s] = cumul
cumul += probabilities[s]

message = []
value = encoded

for _ in range(message_length):
for s in symbols:
low = cumulative[s]
high = low + probabilities[s]
if low <= value < high:
message.append(s)
value = (value - low) / probabilities[s]
break

return "".join(message)


# Demo
probs = {"A": 0.7, "B": 0.2, "C": 0.1}
message = "AABACAA"

encoded_val = arithmetic_encode(message, probs)
decoded = arithmetic_decode(encoded_val, probs, len(message))

# Theoretical entropy
import math
H = -sum(p * math.log2(p) for p in probs.values())
print(f"Source entropy: {H:.4f} bits/symbol")
print(f"Message: '{message}' ({len(message)} symbols)")
print(f"Theoretical minimum: {H * len(message):.4f} bits")
print(f"Encoded value: {encoded_val:.10f}")
print(f"Decoded: '{decoded}'")
print(f"Correct: {decoded == message}")

Lossless vs. Lossy Compression

Lossless Compression
────────────────────────────────────────────────────────────
Goal: recover exact original data
Theoretical limit: H(X) bits per symbol
Methods: Huffman, arithmetic, LZ77/LZ78 (ZIP/DEFLATE), LZMA
ML use case: compressing model weights, exact data storage

Lossy Compression
────────────────────────────────────────────────────────────
Goal: approximate original, trading accuracy for size
Theoretical limit: rate-distortion theory R(D)
Methods: JPEG, MP3, WebP, neural codecs
ML use case: model quantization, neural image compression

Rate-Distortion Tradeoff:
- R(D) = minimum bits needed to achieve distortion ≤ D
- Lower distortion tolerance → more bits required
- R(D=0) = H(X) (lossless limit)

Rate-Distortion Theory

For a source XX and distortion measure d(x,x^)d(x, \hat{x}), the rate-distortion function R(D)R(D) gives the minimum description rate (bits/symbol) to achieve expected distortion at most DD:

R(D)=minp(x^x):E[d(X,X^)]DI(X;X^)R(D) = \min_{p(\hat{x}|x): \mathbb{E}[d(X,\hat{X})] \leq D} I(X; \hat{X})

This is the information-theoretic limit of lossy compression. Modern neural image compression (learned codecs) approaches this bound.

ML Connection: Language Models Are Compressors

:::info ML Connection - LLMs as Compressors A language model pθ(wtw<t)p_\theta(w_t | w_{<t}) is implicitly a lossless compressor of text:

Compression procedure:

  1. For each token wtw_t, use pθ(w<t)p_\theta(\cdot | w_{<t}) to define a code
  2. Assign shorter codes to high-probability tokens
  3. Total bits needed ≈ tlog2pθ(wtw<t)-\sum_t \log_2 p_\theta(w_t | w_{<t})

The cross-entropy loss is exactly the average compression cost per token. Perplexity is 2average bits/token2^{\text{average bits/token}} - the "effective vocabulary size" at each step.

Implications:

  • A GPT-4-level model compressing English text is evidence it has modeled the statistical structure of language
  • Hutter Prize: $50,000 for best compression of 100MB Wikipedia. Better AI = better compression.
  • DeepMind's Chinchilla compresses web text better than zip utilities on text benchmarks :::
import numpy as np
import gzip
import io


def language_model_compression_bits(
token_log_probs: np.ndarray,
) -> dict:
"""
Compute compression statistics for a language model.

Args:
token_log_probs: array of log P(token | context) in nats

Returns:
compression stats
"""
n = len(token_log_probs)
total_nats = -np.sum(token_log_probs)
total_bits = total_nats / np.log(2)
avg_bits_per_token = total_bits / n
perplexity = np.exp(total_nats / n)

return {
"n_tokens": n,
"total_nats": total_nats,
"total_bits": total_bits,
"bits_per_token": avg_bits_per_token,
"perplexity": perplexity,
"compression_ratio_vs_uniform_50k": np.log2(50000) / avg_bits_per_token,
}


# Simulate different quality language models
np.random.seed(42)
n_tokens = 100

# Simulate token-level log probs for different model qualities
def simulate_log_probs(avg_perplexity: float, n: int) -> np.ndarray:
"""Generate realistic token log probs with given average perplexity."""
avg_log_p = -np.log(avg_perplexity)
# Add variance to simulate real distribution
return np.random.normal(avg_log_p, 1.0, n)

print("=== Language Model Compression Stats ===\n")
print(f"{'Model':<20} | {'PPL':>8} | {'Bits/tok':>10} | {'Compression vs rand':>20}")
print("-" * 65)

models = [
("Random (PPL=50k)", 50000),
("GPT-2 small", 50),
("GPT-3", 20),
("GPT-4 (est.)", 10),
("Human (est.)", 7),
]

for model_name, target_ppl in models:
log_probs = simulate_log_probs(target_ppl, n_tokens)
stats = language_model_compression_bits(log_probs)
print(
f"{model_name:<20} | {stats['perplexity']:>8.1f} | "
f"{stats['bits_per_token']:>10.2f} | "
f"{stats['compression_ratio_vs_uniform_50k']:>20.2f}x"
)


# Compare LM compression vs gzip on real text
sample_text = b"the cat sat on the mat and the cat ate the rat that sat on the mat" * 10
gzip_compressed = gzip.compress(sample_text)
lm_equiv_bits = len(sample_text) * 8 / 3.0 # simulate 3x compression by a language model

print(f"\nText size (raw): {len(sample_text)} bytes ({len(sample_text)*8} bits)")
print(f"After gzip: {len(gzip_compressed)} bytes ({len(gzip_compressed)*8} bits)")
print(f"gzip compression: {len(sample_text)/len(gzip_compressed):.2f}x")
print(f"(A good LM would achieve even better compression by exploiting long-range context)")

Kolmogorov Complexity and Compression as Intelligence

Kolmogorov complexity K(x)K(x) of a string xx is the length of the shortest program (on a universal Turing machine) that outputs xx.

K(x)=min{p:U(p)=x}K(x) = \min\{|p| : U(p) = x\}

This is the theoretical minimum description length of xx - independent of any particular code.

Key properties:

  • K(x)x+O(1)K(x) \leq |x| + O(1): the string itself is always a valid program
  • K(x)K(x) is not computable (it's undecidable in general)
  • For most strings: K(x)xK(x) \approx |x| - random strings are incompressible
  • For structured strings: K(x)xK(x) \ll |x| - "the 1000th digit of π" is short

Compression as intelligence: The Hutter Prize ($50,000 for best compression of 100MB of Wikipedia) is based on the insight that intelligence = compression. Understanding text means being able to predict it efficiently = compress it. A system that truly understands the world should be able to describe it concisely.

Intelligence ↔ Compression (informal):
- "Paris is the capital of France" can be compressed if you know geography
- "The stock opened at 142.37" can't be compressed well without prediction
- A model that understands language, science, and history
can compress text better than any syntactic tool

Kolmogorov complexity captures the theoretical limit.
Language model perplexity approximates achievable compression.

Entropy Coding in Practice

Modern lossless compressors (ZIP, FLAC, PNG) use entropy coding as a final step:

Typical Compression Pipeline:
Entropy
Raw Data → Modeling → Residuals → Prediction → Coding → Bitstream
│ │ │ │ │
│ (predict (what's (prob model Huffman/
│ next left) for residual) Arithmetic)
│ value)

└── Goal: drive residuals to be small and predictable
so entropy of residuals is low

FLAC (audio): fits a linear predictor to audio samples, then entropy-codes the small residuals.

PNG (images): applies a filter (predict each pixel from neighbors), entropy-codes the prediction errors.

LZ77 (ZIP): replaces repeated substrings with back-references (explicit compression of redundancy).

In all cases, the entropy of the residuals is the bottleneck - and Shannon's theorem is the fundamental limit.

Normalized Compression Distance

The normalized compression distance (NCD) is a practical, parameter-free similarity measure based on compression:

NCD(x,y)=C(xy)min(C(x),C(y))max(C(x),C(y))\text{NCD}(x, y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))}

where C(x)C(x) is the compressed size of xx and xyxy is the concatenation of xx and yy.

Intuition: if xx and yy are similar, C(xy)C(x)C(xy) \approx C(x) (the compressor reuses patterns from xx when compressing yy). NCD is near 0 for similar strings and near 1 for unrelated strings.

import gzip


def ncd(x: bytes, y: bytes) -> float:
"""
Normalized Compression Distance using gzip.
Values in [0, 1]: 0 = identical, 1 = completely different.
"""
cx = len(gzip.compress(x))
cy = len(gzip.compress(y))
cxy = len(gzip.compress(x + y))
return (cxy - min(cx, cy)) / max(cx, cy)


# Example: text similarity
texts = {
"Python intro": b"Python is a high-level programming language for AI and ML.",
"Python advanced": b"Python generators and decorators enable powerful metaprogramming.",
"French text": b"Le chat est assis sur le tapis et mange une souris.",
"Random bytes": bytes(range(50)), # incompressible
}

print("=== Normalized Compression Distance ===\n")
keys = list(texts.keys())
print(f"{'Text A':<18} vs {'Text B':<18} | {'NCD':>6}")
print("-" * 50)
for i in range(len(keys)):
for j in range(i+1, len(keys)):
d = ncd(texts[keys[i]], texts[keys[j]])
print(f"{keys[i]:<18} vs {keys[j]:<18} | {d:>6.4f}")

Interview Questions and Answers

Q1: State Shannon's source coding theorem and explain its significance for ML.

Shannon's theorem: for a discrete memoryless source with entropy H(X)H(X), the minimum average code length of any uniquely decodable code satisfies H(X)L<H(X)+1H(X) \leq L < H(X) + 1 bits per symbol. This is a fundamental limit - no algorithm can compress a source below its entropy rate.

For ML, this means: the cross-entropy loss 1Nlog2pθ(xi)-\frac{1}{N}\sum \log_2 p_\theta(x_i) measures how close the model's compression is to the entropy lower bound. Perplexity =2cross-entropy= 2^{\text{cross-entropy}} is the effective "branching factor" at each prediction step. A perfect model with cross-entropy equal to H(X)H(X) would achieve optimal compression. Language models are competing to approach Shannon's theoretical limit for human language.

Q2: Explain Huffman coding and when it is suboptimal.

Huffman coding builds an optimal prefix-free code by greedily combining the two least probable symbols. It achieves average code length LL satisfying H(X)L<H(X)+1H(X) \leq L < H(X) + 1 bits per symbol.

It is suboptimal (wastes up to 1 bit per symbol) when probabilities are not powers of 2. For example, if p(A)=0.9p(A) = 0.9, the optimal code length is log2(0.9)0.152-\log_2(0.9) \approx 0.152 bits, but Huffman assigns 1 bit - 6x too many. For a sequence of symbols, block coding (Huffman coding blocks of kk symbols) reduces overhead to less than 1/k1/k bits per symbol, approaching H(X)H(X).

Arithmetic coding and ANS (asymmetric numeral systems, used in modern codecs) solve this by achieving arbitrarily close to H(X)H(X) without the integer-length constraint.

Q3: Why is a better language model equivalent to a better text compressor?

A language model pθ(wtw<t)p_\theta(w_t | w_{<t}) can be used as an arithmetic coder: assign each token a code length proportional to log2pθ(wtw<t)-\log_2 p_\theta(w_t | w_{<t}). The total compressed size of a document with NN tokens is t=1Nlog2pθ(wtw<t)-\sum_{t=1}^N \log_2 p_\theta(w_t | w_{<t}) bits - which is exactly the cross-entropy loss multiplied by NN.

A lower perplexity model assigns higher probability to the actual tokens → needs fewer bits per token → is a better compressor. Conversely, any good text compressor implicitly represents a probabilistic model of language. They are two sides of the same coin. The Hutter Prize operationalizes this: cash prizes for any algorithm that better compresses 100MB of Wikipedia, which is simultaneously a prize for better intelligence.

Q4: What is Kolmogorov complexity, and why is it uncomputable?

Kolmogorov complexity K(x)K(x) is the length of the shortest program on a universal Turing machine that outputs xx and halts. It is the absolute minimum description length of xx - the theoretical compression limit independent of any encoding.

It is uncomputable because determining whether there exists a shorter program than a given one requires solving the halting problem (can this program be proven to halt and output xx?). For any string, you can enumerate programs of increasing length and check if they output xx, but you cannot know when to stop - some very long programs might be running and about to output xx. This fundamental undecidability means we can never compute K(x)K(x) in general.

In practice, we use computable approximations (zip file size, Lempel-Ziv complexity) and language model perplexity.

Q5: What is the connection between rate-distortion theory and model quantization?

Rate-distortion theory describes the fundamental trade-off: to represent a source with distortion at most DD, you need at least R(D)R(D) bits per symbol. Lower distortion tolerance requires more bits.

Model quantization is a direct application: when you quantize neural network weights from float32 (32 bits/weight) to int8 (8 bits/weight), you are moving along the rate-distortion curve. You use fewer bits (RR decreases) but introduce approximation errors in the weights (distortion DD increases), which increases model error.

The optimal quantization strategy minimizes DD for a given RR - this is the theory behind methods like GPTQ, AWQ, and learned quantization. Mixed-precision quantization (different bit-widths for different layers) tries to stay as close as possible to the rate-distortion frontier: use fewer bits where the model is robust to quantization error, more bits for sensitive layers (embedding layers, first/last layers). The theoretical framework predicts that model accuracy should degrade smoothly as you reduce bits, which is empirically confirmed.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Huffman Coding demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.