Minimum Description Length
The Scenario That Motivates This Lesson
You are debating model complexity with a colleague:
"We could fit this dataset with a degree-20 polynomial or a degree-2 polynomial. The degree-20 model has lower training loss. Why do we prefer the degree-2 model?"
The usual answer: "Occam's razor - prefer simpler models." But this is not a justification; it is a name for the observation. The real question is: why do simpler models generalize better?
The Minimum Description Length (MDL) principle gives a precise, information-theoretic answer:
The best model is the one that gives the shortest total description of both the model and the data given the model.
This is not a heuristic - it is a theorem connecting model selection, regularization, and compression. It formalizes exactly why you should prefer the degree-2 polynomial: it requires fewer bits to describe both the model and the residuals.
The Core Idea: Description Length
Consider two models and competing to describe data :
- : bits to describe the model
- : bits to describe the data given the model (residuals)
MDL Principle: Choose the model that minimizes total description length.
A simple example: fitting a noisy line
Data: 100 points roughly on y = 2x + 3 (with noise)
Model 1: degree-2 polynomial (3 parameters: a, b, c)
L(M₁) = bits to describe 3 parameters ≈ 3 × 32 bits = 96 bits
L(D|M₁) = bits to describe residuals ≈ small (good fit) = 200 bits
Total = 296 bits
Model 2: degree-10 polynomial (11 parameters)
L(M₂) = bits to describe 11 parameters ≈ 11 × 32 bits = 352 bits
L(D|M₂) = bits to describe residuals ≈ very small (overfits) = 50 bits
Total = 402 bits
MDL prefers Model 1: it compresses the data more efficiently.
The extra complexity of M₂ doesn't pay off in shorter residuals.
Connection to Kolmogorov Complexity
The ideal MDL measure uses Kolmogorov complexity - the length of the shortest program that outputs :
This is the shortest possible description of the data, combining model and data. A dataset generated by a simple process (polynomial, sinusoid) has low Kolmogorov complexity. A random dataset has high Kolmogorov complexity - it cannot be compressed.
:::warning Kolmogorov Complexity Is Uncomputable As discussed in Lesson 05, is not computable. MDL in practice uses computable approximations: two-part MDL, normalized maximum likelihood, or Bayesian MDL (which connects to marginal likelihood). :::
Two-Part MDL (Crude MDL)
Two-part MDL is a practical approximation: encode the model and the residuals separately.
where is the maximum likelihood estimate of parameters in model .
is the negative log-likelihood under the fitted model - exactly the training loss!
So:
This is exactly regularized loss minimization:
- = regularization term
- = negative log-likelihood
MDL says: regularization penalizes models that require many bits to describe. This is not a heuristic - it is the information-theoretically optimal trade-off between fit and complexity.
MDL and Regularization
L2 Regularization (Weight Decay)
Standard L2-regularized training minimizes:
MDL interpretation: Assume model parameters are encoded using a Gaussian prior . The bits needed to describe a parameter value under this prior is .
Therefore:
L2 regularization is MDL with a Gaussian code for parameters. The regularization strength determines how many bits you "spend" to encode each unit of parameter norm.
L1 Regularization
L1 regularization corresponds to a Laplace prior :
Under MDL: encoding a parameter with a Laplace prior costs .
Laplace priors encourage sparsity (many parameters exactly zero) because they have heavier tails than Gaussian - encoding exactly zero costs near-nothing, but large values are penalized proportionally. MDL says: the sparsest model that describes the data well is the best model.
Dropout as MDL
Dropout can be interpreted as a variational approximation to MDL. Each dropped weight effectively costs zero bits to encode. A model that is robust to random zeroing of weights is one that does not rely on any single parameter - it has learned a more compressible representation.
MDL and Bayesian Model Selection
Refined MDL (using normalized maximum likelihood or Bayesian model evidence) is equivalent to Bayesian model comparison:
This is the marginal likelihood or model evidence - the probability of the data averaged over all parameter settings under the prior.
The marginal likelihood naturally penalizes complexity: a complex model (many parameters) must spread its probability over a large parameter space, so it assigns lower probability to the specific data than a simple model does (assuming both fit the data similarly well).
Bayes factor analogy:
P(D | M₁)
Bayes factor = ─────────
P(D | M₂)
If M₁ is simpler but fits the data nearly as well as M₂,
the Bayes factor favors M₁.
This is the quantitative form of Occam's razor:
simpler models that explain the data are preferred,
not just as a heuristic, but because they are more
"surprised" by the data when they do fit - and
therefore the data provides stronger evidence for them.
Practical MDL Criteria
BIC (Bayesian Information Criterion)
where = number of parameters, = number of data points.
BIC approximates (the marginal likelihood) using Laplace's approximation. It penalizes model complexity logarithmically in the number of parameters.
AIC (Akaike Information Criterion)
AIC approximates the out-of-sample prediction error (not the description length). It penalizes complexity less severely than BIC for large .
import numpy as np
from scipy import stats
def bic(log_likelihood: float, k: int, n: int) -> float:
"""
Bayesian Information Criterion.
Args:
log_likelihood: maximized log P(D | theta_hat)
k: number of parameters
n: number of data points
Returns:
BIC (lower is better)
"""
return -2 * log_likelihood + k * np.log(n)
def aic(log_likelihood: float, k: int) -> float:
"""
Akaike Information Criterion.
Args:
log_likelihood: maximized log P(D | theta_hat)
k: number of parameters
Returns:
AIC (lower is better)
"""
return -2 * log_likelihood + 2 * k
def mdl_model_selection_demo():
"""
Demonstrate MDL-based model selection for polynomial regression.
True model: y = 2x + 3 + noise
"""
np.random.seed(42)
n = 50
x = np.linspace(-2, 2, n)
y_true = 2 * x + 3
y = y_true + np.random.normal(0, 0.5, n)
results = []
for degree in range(1, 12):
# Fit polynomial
coeffs = np.polyfit(x, y, degree)
y_pred = np.polyval(coeffs, x)
# Log-likelihood: assume Gaussian residuals
residuals = y - y_pred
sigma_hat = np.std(residuals)
if sigma_hat < 1e-10:
sigma_hat = 1e-10
log_lik = np.sum(stats.norm.logpdf(residuals, scale=sigma_hat))
k = degree + 1 # polynomial coefficients
bic_val = bic(log_lik, k, n)
aic_val = aic(log_lik, k)
results.append({
'degree': degree,
'k': k,
'log_lik': log_lik,
'bic': bic_val,
'aic': aic_val,
'rmse': np.sqrt(np.mean(residuals**2)),
})
print("=== MDL/BIC/AIC Model Selection for Polynomial Regression ===")
print(f"True model: degree 1 | n={n} samples\n")
print(f"{'Degree':>6} | {'#Params':>7} | {'Log-Lik':>10} | {'BIC':>10} | {'AIC':>10} | {'RMSE':>8}")
print("-" * 65)
for r in results:
best_bic = "*" if r['bic'] == min(res['bic'] for res in results) else " "
best_aic = "*" if r['aic'] == min(res['aic'] for res in results) else " "
print(
f"{r['degree']:>6} | {r['k']:>7} | {r['log_lik']:>10.2f} | "
f"{r['bic']:>9.2f}{best_bic} | {r['aic']:>9.2f}{best_aic} | "
f"{r['rmse']:>8.4f}"
)
print("\n* = selected by criterion (lower is better)")
mdl_model_selection_demo()
Two-Part MDL in Practice: Quantized Parameters
A concrete two-part MDL analysis for neural networks uses the idea that parameters must be encoded in finite precision.
If we encode each parameter to precision , the cost per parameter is bits, where is the range of values.
The total description length is:
This motivates precision-aware model selection: more parameters at lower precision vs. fewer parameters at higher precision. The optimal model minimizes this sum.
def two_part_mdl(
n_params: int,
log_likelihood: float,
param_range: float = 10.0,
precision: float = 0.001,
) -> float:
"""
Two-part MDL: L(model) + L(data | model).
L(model) = k * log2(param_range / precision) bits
L(data | model) = -log2_likelihood
Args:
n_params: number of model parameters k
log_likelihood: maximized log P(data | theta) in nats
param_range: range of parameter values (±param_range)
precision: encoding precision
Returns:
total description length in bits
"""
# Model description: k params, each taking log2(range/precision) bits
model_bits = n_params * np.log2(2 * param_range / precision)
# Data description: -log2(likelihood)
data_bits = -log_likelihood / np.log(2)
return model_bits + data_bits
# Demonstrate on linear vs polynomial
np.random.seed(0)
n = 100
x = np.linspace(0, 5, n)
y = 3.0 * x + 2.0 + np.random.normal(0, 0.5, n)
print("\n=== Two-Part MDL for Model Selection ===")
print(f"{'Model':<20} | {'Model bits':>12} | {'Data bits':>12} | {'Total bits':>12}")
print("-" * 60)
for degree in [1, 2, 3, 5, 10]:
coeffs = np.polyfit(x, y, degree)
y_pred = np.polyval(coeffs, x)
residuals = y - y_pred
sigma = np.std(residuals) + 1e-10
log_lik = np.sum(stats.norm.logpdf(residuals, scale=sigma))
k = degree + 1
model_bits = k * np.log2(2 * 10 / 0.001)
data_bits = -log_lik / np.log(2)
total = model_bits + data_bits
print(f"Degree {degree:<2} ({k} params) | {model_bits:>12.1f} | {data_bits:>12.1f} | {total:>12.1f}")
MDL and Regularization: The Formal Equivalence
The following equivalences hold exactly under mild assumptions:
Regularization Method | MDL Interpretation | Prior Distribution
-------------------------+-----------------------------+--------------------
L2 (weight decay) | Gaussian parameter code | N(0, 1/λ)
L1 (lasso) | Laplace parameter code | Laplace(0, 1/λ)
Dropout | Bernoulli parameter mask | Bernoulli(1-p_drop)
Early stopping | Bound on description length | Implicit complexity
Max-norm constraint | Bounded parameter code | Uniform(-r, r)
Variational inference | Refined MDL (stochastic) | q(θ) ≈ p(θ|D)
Evidence maximization | Normalized max likelihood | Full marginal p(D|M)
Every regularization method has an MDL interpretation: it corresponds to a particular code for model parameters. The "best" regularization is the one whose implied prior best matches the true distribution of useful parameter values.
MDL in Neural Network Pruning
MDL provides a principled framework for neural network pruning. Instead of arbitrary sparsity targets, we ask:
How many bits does each weight contribute to the description?
For a weight with encoding precision :
- Zero weight: costs 0 bits (just a single indicator bit)
- Non-zero weight : costs approximately bits + sign bit
A weight should be kept only if the improvement in data description exceeds the bits needed to describe it.
def mdl_pruning_score(
weight: float,
data_improvement: float,
precision: float = 1e-3,
) -> float:
"""
MDL score for keeping vs. pruning a weight.
Positive = keep weight (it reduces total description length).
Negative = prune weight (it costs more to describe than it saves).
Args:
weight: current weight value
data_improvement: bits saved in data description by this weight
precision: encoding precision
Returns:
net MDL benefit of keeping this weight (in bits)
"""
if abs(weight) < precision:
# Treat as zero: costs ~1 bit to describe (just a flag)
return data_improvement - 1.0
# Bits needed to encode this weight value
weight_bits = np.log2(abs(weight) / precision) + 1 # +1 for sign bit
return data_improvement - weight_bits
# Example: should we prune these weights?
print("\n=== MDL Pruning Decision ===")
print(f"{'Weight value':>15} | {'Data improvement':>17} | {'Weight bits':>12} | {'Decision':>10}")
print("-" * 65)
weights_and_improvements = [
(0.001, 0.5), # tiny weight, small improvement
(0.1, 5.0), # small weight, moderate improvement
(1.0, 20.0), # large weight, large improvement
(2.5, 3.0), # large weight, small improvement
(0.01, 0.05), # very small weight, very small improvement
]
for w, delta_bits in weights_and_improvements:
if abs(w) < 1e-3:
weight_bits = 1.0
else:
weight_bits = np.log2(abs(w) / 1e-3) + 1
net_benefit = mdl_pruning_score(w, delta_bits)
decision = "KEEP" if net_benefit > 0 else "PRUNE"
print(f"{w:>15.4f} | {delta_bits:>17.2f} | {weight_bits:>12.2f} | {decision:>10}")
MDL and Generalization: Why Compression Implies Generalization
The connection between MDL and generalization is given by a fundamental theorem:
MDL Generalization Bound: If a hypothesis has description length bits, then with probability at least :
Interpretation: Models with shorter descriptions generalize better, with the generalization gap bounded by .
This is a formal statement of Occam's razor: among models with the same training error, shorter descriptions → smaller generalization gap → better test performance.
It also explains why:
- Overparameterized models can still generalize: if the effective description length (implicit regularization, early stopping, dropout) is small, the bound holds
- Regularization improves generalization: it reduces
- More data helps more for complex models: the bound tightens as grows, making complexity less costly
Interview Questions and Answers
Q1: What is the Minimum Description Length principle, and how does it relate to model selection?
MDL states: among all models, choose the one that minimizes the total number of bits needed to describe both the model and the data given the model. Formally, minimize where is the cost to encode the model and is the cost to encode the data residuals.
For model selection, MDL operationalizes Occam's razor: a model that achieves slightly better fit but requires many more bits to describe may not be preferred. The degree-2 polynomial beats the degree-20 polynomial if the polynomial coefficients required to describe the degree-20 model cost more bits than the improvement in data fit saves. MDL quantifies this trade-off precisely using information theory rather than intuition.
Q2: How does L2 regularization relate to MDL?
L2 regularization adds to the loss, which equals for a Gaussian prior . The full L2 objective is:
This is exactly the two-part MDL objective where we encode parameters using a Gaussian code. Larger = fewer bits per parameter = simpler model description = stronger Occam's razor preference. The optimal (by cross-validation) is the one where the total description length is minimized.
Q3: Explain the MDL generalization bound and what it says about overparameterized neural networks.
The MDL generalization bound states that test error training error + where is the description length of the hypothesis and is the training set size. Models with shorter descriptions have tighter generalization bounds.
For overparameterized neural networks (more parameters than data points), the naive description length (all parameters at full precision) gives a vacuous bound. But neural networks trained with SGD, weight decay, and early stopping have a much shorter effective description length - they implicitly find solutions in low-dimensional regions of parameter space. This is related to the implicit regularization of SGD (it finds flat minima with low Fisher information), which corresponds to short effective description lengths. The overparameterization is not a problem if the learned solution is compressible.
Q4: What is the difference between BIC and AIC, and when do you use each?
BIC: - approximates the marginal likelihood using Laplace's method. Penalizes each parameter by bits, which grows with dataset size.
AIC: - approximates the expected out-of-sample prediction error (via an information-theoretic argument by Akaike). Penalizes each parameter by 2 bits regardless of .
When to use each:
- BIC: when you believe there is a true model of finite complexity and you have enough data to identify it. BIC is consistent - it selects the correct model as . Use for explanatory modeling (which model generated the data?).
- AIC: when you want the model with best predictive performance on new data. AIC minimizes predictive error asymptotically. Use for predictive modeling (which model makes the best predictions?).
- For small , use AICc (corrected AIC): .
Q5: How does MDL relate to the bias-variance trade-off?
The MDL total description length corresponds exactly to the bias-variance trade-off:
- is the variance term: complex models have high variance (many parameters to describe), so they tend to overfit. A short means few parameters = low variance.
- is the bias term: simple models have high bias (they cannot fit the data well), so the residuals are large and require many bits. A short means good fit = low bias.
Minimizing finds the optimal bias-variance trade-off from an information-theoretic perspective. Underfitting (high bias, high , low ) and overfitting (low bias, low , high ) both increase total description length. The optimal model minimizes the sum - this is the quantitative justification for the bias-variance trade-off.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Entropy Explorer demo on the EngineersOfAI Playground - no code required.
:::
