Skip to main content

Minimum Description Length

The Scenario That Motivates This Lesson

You are debating model complexity with a colleague:

"We could fit this dataset with a degree-20 polynomial or a degree-2 polynomial. The degree-20 model has lower training loss. Why do we prefer the degree-2 model?"

The usual answer: "Occam's razor - prefer simpler models." But this is not a justification; it is a name for the observation. The real question is: why do simpler models generalize better?

The Minimum Description Length (MDL) principle gives a precise, information-theoretic answer:

The best model is the one that gives the shortest total description of both the model and the data given the model.

This is not a heuristic - it is a theorem connecting model selection, regularization, and compression. It formalizes exactly why you should prefer the degree-2 polynomial: it requires fewer bits to describe both the model and the residuals.

The Core Idea: Description Length

Consider two models M1M_1 and M2M_2 competing to describe data DD:

Description length=L(M)+L(DM)\text{Description length} = L(M) + L(D | M)

  • L(M)L(M): bits to describe the model
  • L(DM)L(D | M): bits to describe the data given the model (residuals)

MDL Principle: Choose the model that minimizes total description length.

A simple example: fitting a noisy line

Data: 100 points roughly on y = 2x + 3 (with noise)

Model 1: degree-2 polynomial (3 parameters: a, b, c)
L(M₁) = bits to describe 3 parameters ≈ 3 × 32 bits = 96 bits
L(D|M₁) = bits to describe residuals ≈ small (good fit) = 200 bits
Total = 296 bits

Model 2: degree-10 polynomial (11 parameters)
L(M₂) = bits to describe 11 parameters ≈ 11 × 32 bits = 352 bits
L(D|M₂) = bits to describe residuals ≈ very small (overfits) = 50 bits
Total = 402 bits

MDL prefers Model 1: it compresses the data more efficiently.
The extra complexity of M₂ doesn't pay off in shorter residuals.

Connection to Kolmogorov Complexity

The ideal MDL measure uses Kolmogorov complexity K(x)K(x) - the length of the shortest program that outputs xx:

K(D)=minM[L(M)+L(DM)]K(D) = \min_M [L(M) + L(D | M)]

This is the shortest possible description of the data, combining model and data. A dataset generated by a simple process (polynomial, sinusoid) has low Kolmogorov complexity. A random dataset has high Kolmogorov complexity - it cannot be compressed.

:::warning Kolmogorov Complexity Is Uncomputable As discussed in Lesson 05, K(x)K(x) is not computable. MDL in practice uses computable approximations: two-part MDL, normalized maximum likelihood, or Bayesian MDL (which connects to marginal likelihood). :::

Two-Part MDL (Crude MDL)

Two-part MDL is a practical approximation: encode the model and the residuals separately.

Ltwo-part(M,D)=L(M)+L(Dθ^M,M)L_{\text{two-part}}(M, D) = L(M) + L(D | \hat{\theta}_M, M)

where θ^M\hat{\theta}_M is the maximum likelihood estimate of parameters in model MM.

L(Dθ^M,M)L(D | \hat{\theta}_M, M) is the negative log-likelihood under the fitted model - exactly the training loss!

So:

Total code length=L(M)model complexity+(logp(Dθ^M))training loss (negative log-likelihood)\text{Total code length} = \underbrace{L(M)}_{\text{model complexity}} + \underbrace{(-\log p(\mathcal{D}|\hat{\theta}_M))}_{\text{training loss (negative log-likelihood)}}

This is exactly regularized loss minimization:

  • L(M)L(M) = regularization term
  • logp(Dθ^M)-\log p(\mathcal{D}|\hat{\theta}_M) = negative log-likelihood

MDL says: regularization penalizes models that require many bits to describe. This is not a heuristic - it is the information-theoretically optimal trade-off between fit and complexity.

MDL and Regularization

L2 Regularization (Weight Decay)

Standard L2-regularized training minimizes:

Lreg=logp(Dθ)+λ2θ2\mathcal{L}_{\text{reg}} = -\log p(\mathcal{D}|\theta) + \frac{\lambda}{2}\|\theta\|^2

MDL interpretation: Assume model parameters θ\theta are encoded using a Gaussian prior p(θ)=N(0,1/λ)p(\theta) = \mathcal{N}(0, 1/\lambda). The bits needed to describe a parameter value under this prior is logp(θi)=λ2θi2+const-\log p(\theta_i) = \frac{\lambda}{2}\theta_i^2 + \text{const}.

Therefore:

L2 loss=logp(Dθ)data description bits+λ2θ2model description bits (Gaussian prior)\text{L2 loss} = \underbrace{-\log p(\mathcal{D}|\theta)}_{\text{data description bits}} + \underbrace{\frac{\lambda}{2}\|\theta\|^2}_{\text{model description bits (Gaussian prior)}}

L2 regularization is MDL with a Gaussian code for parameters. The regularization strength λ\lambda determines how many bits you "spend" to encode each unit of parameter norm.

L1 Regularization

L1 regularization corresponds to a Laplace prior p(θi)eλθip(\theta_i) \propto e^{-\lambda|\theta_i|}:

LL1=logp(Dθ)+λθ1\mathcal{L}_{\text{L1}} = -\log p(\mathcal{D}|\theta) + \lambda\|\theta\|_1

Under MDL: encoding a parameter with a Laplace prior costs logp(θi)=λθi+const-\log p(\theta_i) = \lambda|\theta_i| + \text{const}.

Laplace priors encourage sparsity (many parameters exactly zero) because they have heavier tails than Gaussian - encoding exactly zero costs near-nothing, but large values are penalized proportionally. MDL says: the sparsest model that describes the data well is the best model.

Dropout as MDL

Dropout can be interpreted as a variational approximation to MDL. Each dropped weight effectively costs zero bits to encode. A model that is robust to random zeroing of weights is one that does not rely on any single parameter - it has learned a more compressible representation.

MDL and Bayesian Model Selection

Refined MDL (using normalized maximum likelihood or Bayesian model evidence) is equivalent to Bayesian model comparison:

MDL score(M)=logp(DM)=logp(Dθ)p(θM)dθ\text{MDL score}(M) = -\log p(\mathcal{D} | M) = -\log \int p(\mathcal{D}|\theta) p(\theta|M) d\theta

This is the marginal likelihood or model evidence - the probability of the data averaged over all parameter settings under the prior.

The marginal likelihood naturally penalizes complexity: a complex model (many parameters) must spread its probability over a large parameter space, so it assigns lower probability to the specific data D\mathcal{D} than a simple model does (assuming both fit the data similarly well).

Bayes factor analogy:

P(D | M₁)
Bayes factor = ─────────
P(D | M₂)

If M₁ is simpler but fits the data nearly as well as M₂,
the Bayes factor favors M₁.

This is the quantitative form of Occam's razor:
simpler models that explain the data are preferred,
not just as a heuristic, but because they are more
"surprised" by the data when they do fit - and
therefore the data provides stronger evidence for them.

Practical MDL Criteria

BIC (Bayesian Information Criterion)

BIC(M)=2logp(Dθ^M)+klogN\text{BIC}(M) = -2\log p(\mathcal{D}|\hat{\theta}_M) + k\log N

where kk = number of parameters, NN = number of data points.

BIC approximates 2logp(DM)-2\log p(\mathcal{D}|M) (the marginal likelihood) using Laplace's approximation. It penalizes model complexity logarithmically in the number of parameters.

AIC (Akaike Information Criterion)

AIC(M)=2logp(Dθ^M)+2k\text{AIC}(M) = -2\log p(\mathcal{D}|\hat{\theta}_M) + 2k

AIC approximates the out-of-sample prediction error (not the description length). It penalizes complexity less severely than BIC for large NN.

import numpy as np
from scipy import stats


def bic(log_likelihood: float, k: int, n: int) -> float:
"""
Bayesian Information Criterion.

Args:
log_likelihood: maximized log P(D | theta_hat)
k: number of parameters
n: number of data points

Returns:
BIC (lower is better)
"""
return -2 * log_likelihood + k * np.log(n)


def aic(log_likelihood: float, k: int) -> float:
"""
Akaike Information Criterion.

Args:
log_likelihood: maximized log P(D | theta_hat)
k: number of parameters

Returns:
AIC (lower is better)
"""
return -2 * log_likelihood + 2 * k


def mdl_model_selection_demo():
"""
Demonstrate MDL-based model selection for polynomial regression.
True model: y = 2x + 3 + noise
"""
np.random.seed(42)
n = 50
x = np.linspace(-2, 2, n)
y_true = 2 * x + 3
y = y_true + np.random.normal(0, 0.5, n)

results = []
for degree in range(1, 12):
# Fit polynomial
coeffs = np.polyfit(x, y, degree)
y_pred = np.polyval(coeffs, x)

# Log-likelihood: assume Gaussian residuals
residuals = y - y_pred
sigma_hat = np.std(residuals)
if sigma_hat < 1e-10:
sigma_hat = 1e-10

log_lik = np.sum(stats.norm.logpdf(residuals, scale=sigma_hat))

k = degree + 1 # polynomial coefficients
bic_val = bic(log_lik, k, n)
aic_val = aic(log_lik, k)

results.append({
'degree': degree,
'k': k,
'log_lik': log_lik,
'bic': bic_val,
'aic': aic_val,
'rmse': np.sqrt(np.mean(residuals**2)),
})

print("=== MDL/BIC/AIC Model Selection for Polynomial Regression ===")
print(f"True model: degree 1 | n={n} samples\n")
print(f"{'Degree':>6} | {'#Params':>7} | {'Log-Lik':>10} | {'BIC':>10} | {'AIC':>10} | {'RMSE':>8}")
print("-" * 65)
for r in results:
best_bic = "*" if r['bic'] == min(res['bic'] for res in results) else " "
best_aic = "*" if r['aic'] == min(res['aic'] for res in results) else " "
print(
f"{r['degree']:>6} | {r['k']:>7} | {r['log_lik']:>10.2f} | "
f"{r['bic']:>9.2f}{best_bic} | {r['aic']:>9.2f}{best_aic} | "
f"{r['rmse']:>8.4f}"
)
print("\n* = selected by criterion (lower is better)")


mdl_model_selection_demo()

Two-Part MDL in Practice: Quantized Parameters

A concrete two-part MDL analysis for neural networks uses the idea that parameters must be encoded in finite precision.

If we encode each parameter to precision δ\delta, the cost per parameter is log2(R/δ)\log_2(R/\delta) bits, where RR is the range of values.

The total description length is:

L=klog2(R/δ)model: k parameters+i=1Nlog2pθ(yixi)data: negative log-likelihoodL = \underbrace{k \cdot \log_2(R/\delta)}_{\text{model: k parameters}} + \underbrace{-\sum_{i=1}^N \log_2 p_\theta(y_i|x_i)}_{\text{data: negative log-likelihood}}

This motivates precision-aware model selection: more parameters at lower precision vs. fewer parameters at higher precision. The optimal model minimizes this sum.

def two_part_mdl(
n_params: int,
log_likelihood: float,
param_range: float = 10.0,
precision: float = 0.001,
) -> float:
"""
Two-part MDL: L(model) + L(data | model).

L(model) = k * log2(param_range / precision) bits
L(data | model) = -log2_likelihood

Args:
n_params: number of model parameters k
log_likelihood: maximized log P(data | theta) in nats
param_range: range of parameter values (±param_range)
precision: encoding precision

Returns:
total description length in bits
"""
# Model description: k params, each taking log2(range/precision) bits
model_bits = n_params * np.log2(2 * param_range / precision)

# Data description: -log2(likelihood)
data_bits = -log_likelihood / np.log(2)

return model_bits + data_bits


# Demonstrate on linear vs polynomial
np.random.seed(0)
n = 100
x = np.linspace(0, 5, n)
y = 3.0 * x + 2.0 + np.random.normal(0, 0.5, n)

print("\n=== Two-Part MDL for Model Selection ===")
print(f"{'Model':<20} | {'Model bits':>12} | {'Data bits':>12} | {'Total bits':>12}")
print("-" * 60)

for degree in [1, 2, 3, 5, 10]:
coeffs = np.polyfit(x, y, degree)
y_pred = np.polyval(coeffs, x)
residuals = y - y_pred
sigma = np.std(residuals) + 1e-10

log_lik = np.sum(stats.norm.logpdf(residuals, scale=sigma))
k = degree + 1

model_bits = k * np.log2(2 * 10 / 0.001)
data_bits = -log_lik / np.log(2)
total = model_bits + data_bits

print(f"Degree {degree:<2} ({k} params) | {model_bits:>12.1f} | {data_bits:>12.1f} | {total:>12.1f}")

MDL and Regularization: The Formal Equivalence

The following equivalences hold exactly under mild assumptions:

Regularization Method | MDL Interpretation | Prior Distribution
-------------------------+-----------------------------+--------------------
L2 (weight decay) | Gaussian parameter code | N(0, 1/λ)
L1 (lasso) | Laplace parameter code | Laplace(0, 1/λ)
Dropout | Bernoulli parameter mask | Bernoulli(1-p_drop)
Early stopping | Bound on description length | Implicit complexity
Max-norm constraint | Bounded parameter code | Uniform(-r, r)
Variational inference | Refined MDL (stochastic) | q(θ) ≈ p(θ|D)
Evidence maximization | Normalized max likelihood | Full marginal p(D|M)

Every regularization method has an MDL interpretation: it corresponds to a particular code for model parameters. The "best" regularization is the one whose implied prior best matches the true distribution of useful parameter values.

MDL in Neural Network Pruning

MDL provides a principled framework for neural network pruning. Instead of arbitrary sparsity targets, we ask:

How many bits does each weight contribute to the description?

For a weight ww with encoding precision δ\delta:

  • Zero weight: costs 0 bits (just a single indicator bit)
  • Non-zero weight ww: costs approximately log2(w/δ)\log_2(|w|/\delta) bits + sign bit

A weight should be kept only if the improvement in data description (Δlogp(Dθ))(-\Delta \log p(\mathcal{D}|\theta)) exceeds the bits needed to describe it.

def mdl_pruning_score(
weight: float,
data_improvement: float,
precision: float = 1e-3,
) -> float:
"""
MDL score for keeping vs. pruning a weight.
Positive = keep weight (it reduces total description length).
Negative = prune weight (it costs more to describe than it saves).

Args:
weight: current weight value
data_improvement: bits saved in data description by this weight
precision: encoding precision

Returns:
net MDL benefit of keeping this weight (in bits)
"""
if abs(weight) < precision:
# Treat as zero: costs ~1 bit to describe (just a flag)
return data_improvement - 1.0

# Bits needed to encode this weight value
weight_bits = np.log2(abs(weight) / precision) + 1 # +1 for sign bit

return data_improvement - weight_bits


# Example: should we prune these weights?
print("\n=== MDL Pruning Decision ===")
print(f"{'Weight value':>15} | {'Data improvement':>17} | {'Weight bits':>12} | {'Decision':>10}")
print("-" * 65)

weights_and_improvements = [
(0.001, 0.5), # tiny weight, small improvement
(0.1, 5.0), # small weight, moderate improvement
(1.0, 20.0), # large weight, large improvement
(2.5, 3.0), # large weight, small improvement
(0.01, 0.05), # very small weight, very small improvement
]

for w, delta_bits in weights_and_improvements:
if abs(w) < 1e-3:
weight_bits = 1.0
else:
weight_bits = np.log2(abs(w) / 1e-3) + 1

net_benefit = mdl_pruning_score(w, delta_bits)
decision = "KEEP" if net_benefit > 0 else "PRUNE"
print(f"{w:>15.4f} | {delta_bits:>17.2f} | {weight_bits:>12.2f} | {decision:>10}")

MDL and Generalization: Why Compression Implies Generalization

The connection between MDL and generalization is given by a fundamental theorem:

MDL Generalization Bound: If a hypothesis hh has description length L(h)L(h) bits, then with probability at least 1δ1 - \delta:

Error(h)Error^(h)+L(h)ln2+ln(1/δ)2N\text{Error}(h) \leq \hat{\text{Error}}(h) + \sqrt{\frac{L(h)\ln 2 + \ln(1/\delta)}{2N}}

Interpretation: Models with shorter descriptions generalize better, with the generalization gap bounded by O(L(h)/N)O(\sqrt{L(h)/N}).

This is a formal statement of Occam's razor: among models with the same training error, shorter descriptions → smaller generalization gap → better test performance.

It also explains why:

  • Overparameterized models can still generalize: if the effective description length (implicit regularization, early stopping, dropout) is small, the bound holds
  • Regularization improves generalization: it reduces L(h)L(h)
  • More data helps more for complex models: the bound tightens as NN grows, making complexity less costly

Interview Questions and Answers

Q1: What is the Minimum Description Length principle, and how does it relate to model selection?

MDL states: among all models, choose the one that minimizes the total number of bits needed to describe both the model and the data given the model. Formally, minimize L(M)+L(DM)L(M) + L(D|M) where L(M)L(M) is the cost to encode the model and L(DM)L(D|M) is the cost to encode the data residuals.

For model selection, MDL operationalizes Occam's razor: a model that achieves slightly better fit but requires many more bits to describe may not be preferred. The degree-2 polynomial beats the degree-20 polynomial if the polynomial coefficients required to describe the degree-20 model cost more bits than the improvement in data fit saves. MDL quantifies this trade-off precisely using information theory rather than intuition.

Q2: How does L2 regularization relate to MDL?

L2 regularization adds λ2θ2\frac{\lambda}{2}\|\theta\|^2 to the loss, which equals logp(θ)-\log p(\theta) for a Gaussian prior p(θ)=N(0,1/λ)p(\theta) = \mathcal{N}(0, 1/\lambda). The full L2 objective is:

logp(Dθ)data description+λ2θ2model description under Gaussian code\underbrace{-\log p(\mathcal{D}|\theta)}_{\text{data description}} + \underbrace{\frac{\lambda}{2}\|\theta\|^2}_{\text{model description under Gaussian code}}

This is exactly the two-part MDL objective where we encode parameters using a Gaussian code. Larger λ\lambda = fewer bits per parameter = simpler model description = stronger Occam's razor preference. The optimal λ\lambda (by cross-validation) is the one where the total description length is minimized.

Q3: Explain the MDL generalization bound and what it says about overparameterized neural networks.

The MDL generalization bound states that test error \leq training error + O(L(h)/N)O(\sqrt{L(h)/N}) where L(h)L(h) is the description length of the hypothesis and NN is the training set size. Models with shorter descriptions have tighter generalization bounds.

For overparameterized neural networks (more parameters than data points), the naive description length (all parameters at full precision) gives a vacuous bound. But neural networks trained with SGD, weight decay, and early stopping have a much shorter effective description length - they implicitly find solutions in low-dimensional regions of parameter space. This is related to the implicit regularization of SGD (it finds flat minima with low Fisher information), which corresponds to short effective description lengths. The overparameterization is not a problem if the learned solution is compressible.

Q4: What is the difference between BIC and AIC, and when do you use each?

BIC: 2logp(Dθ^)+klogN-2\log p(\mathcal{D}|\hat\theta) + k\log N - approximates the marginal likelihood 2logp(DM)-2\log p(\mathcal{D}|M) using Laplace's method. Penalizes each parameter by logN\log N bits, which grows with dataset size.

AIC: 2logp(Dθ^)+2k-2\log p(\mathcal{D}|\hat\theta) + 2k - approximates the expected out-of-sample prediction error (via an information-theoretic argument by Akaike). Penalizes each parameter by 2 bits regardless of NN.

When to use each:

  • BIC: when you believe there is a true model of finite complexity and you have enough data to identify it. BIC is consistent - it selects the correct model as NN \to \infty. Use for explanatory modeling (which model generated the data?).
  • AIC: when you want the model with best predictive performance on new data. AIC minimizes predictive error asymptotically. Use for predictive modeling (which model makes the best predictions?).
  • For small NN, use AICc (corrected AIC): AIC+2k2+2kNk1\text{AIC} + \frac{2k^2 + 2k}{N-k-1}.

Q5: How does MDL relate to the bias-variance trade-off?

The MDL total description length L(M)+L(DM)L(M) + L(D|M) corresponds exactly to the bias-variance trade-off:

  • L(M)L(M) is the variance term: complex models have high variance (many parameters to describe), so they tend to overfit. A short L(M)L(M) means few parameters = low variance.
  • L(DM)L(D|M) is the bias term: simple models have high bias (they cannot fit the data well), so the residuals are large and require many bits. A short L(DM)L(D|M) means good fit = low bias.

Minimizing L(M)+L(DM)L(M) + L(D|M) finds the optimal bias-variance trade-off from an information-theoretic perspective. Underfitting (high bias, high L(DM)L(D|M), low L(M)L(M)) and overfitting (low bias, low L(DM)L(D|M), high L(M)L(M)) both increase total description length. The optimal model minimizes the sum - this is the quantitative justification for the bias-variance trade-off.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Entropy Explorer demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.