Minimum Description Length

The Scenario That Motivates This Lesson

You are debating model complexity with a colleague:

"We could fit this dataset with a degree-20 polynomial or a degree-2 polynomial. The degree-20 model has lower training loss. Why do we prefer the degree-2 model?"

The usual answer: "Occam's razor - prefer simpler models." But this is not a justification; it is a name for the observation. The real question is: why do simpler models generalize better?

The Minimum Description Length (MDL) principle gives a precise, information-theoretic answer:

The best model is the one that gives the shortest total description of both the model and the data given the model.

This is not a heuristic - it is a theorem connecting model selection, regularization, and compression. It formalizes exactly why you should prefer the degree-2 polynomial: it requires fewer bits to describe both the model and the residuals.

The Core Idea: Description Length

Consider two models $M_1$ and $M_2$ competing to describe data $D$ :

$\text{Description length} = L(M) + L(D | M)$

$L(M)$ : bits to describe the model
$L(D | M)$ : bits to describe the data given the model (residuals)

MDL Principle: Choose the model that minimizes total description length.

A simple example: fitting a noisy line

Data: 100 points roughly on y = 2x + 3 (with noise)

Model 1: degree-2 polynomial (3 parameters: a, b, c)
  L(M₁) = bits to describe 3 parameters ≈ 3 × 32 bits = 96 bits
  L(D|M₁) = bits to describe residuals ≈ small (good fit) = 200 bits
  Total = 296 bits

Model 2: degree-10 polynomial (11 parameters)
  L(M₂) = bits to describe 11 parameters ≈ 11 × 32 bits = 352 bits
  L(D|M₂) = bits to describe residuals ≈ very small (overfits) = 50 bits
  Total = 402 bits

MDL prefers Model 1: it compresses the data more efficiently.
The extra complexity of M₂ doesn't pay off in shorter residuals.

Connection to Kolmogorov Complexity

The ideal MDL measure uses Kolmogorov complexity $K(x)$ - the length of the shortest program that outputs $x$ :

$K(D) = \min_M [L(M) + L(D | M)]$

This is the shortest possible description of the data, combining model and data. A dataset generated by a simple process (polynomial, sinusoid) has low Kolmogorov complexity. A random dataset has high Kolmogorov complexity - it cannot be compressed.

:::warning Kolmogorov Complexity Is Uncomputable As discussed in Lesson 05, $K(x)$ is not computable. MDL in practice uses computable approximations: two-part MDL, normalized maximum likelihood, or Bayesian MDL (which connects to marginal likelihood). :::

Two-Part MDL (Crude MDL)

Two-part MDL is a practical approximation: encode the model and the residuals separately.

$L_{\text{two-part}}(M, D) = L(M) + L(D | \hat{\theta}_M, M)$

where $\hat{\theta}_M$ is the maximum likelihood estimate of parameters in model $M$ .

$L(D | \hat{\theta}_M, M)$ is the negative log-likelihood under the fitted model - exactly the training loss!

So:

$\text{Total code length} = \underbrace{L(M)}_{\text{model complexity}} + \underbrace{(-\log p(\mathcal{D}|\hat{\theta}_M))}_{\text{training loss (negative log-likelihood)}}$

This is exactly regularized loss minimization:

$L(M)$ = regularization term
$-\log p(\mathcal{D}|\hat{\theta}_M)$ = negative log-likelihood

MDL says: regularization penalizes models that require many bits to describe. This is not a heuristic - it is the information-theoretically optimal trade-off between fit and complexity.

MDL and Regularization

L2 Regularization (Weight Decay)

Standard L2-regularized training minimizes:

$\mathcal{L}_{\text{reg}} = -\log p(\mathcal{D}|\theta) + \frac{\lambda}{2}\|\theta\|^2$

MDL interpretation: Assume model parameters $\theta$ are encoded using a Gaussian prior $p(\theta) = \mathcal{N}(0, 1/\lambda)$ . The bits needed to describe a parameter value under this prior is $-\log p(\theta_i) = \frac{\lambda}{2}\theta_i^2 + \text{const}$ .

Therefore:

$\text{L2 loss} = \underbrace{-\log p(\mathcal{D}|\theta)}_{\text{data description bits}} + \underbrace{\frac{\lambda}{2}\|\theta\|^2}_{\text{model description bits (Gaussian prior)}}$

L2 regularization is MDL with a Gaussian code for parameters. The regularization strength $\lambda$ determines how many bits you "spend" to encode each unit of parameter norm.

L1 Regularization

L1 regularization corresponds to a Laplace prior $p(\theta_i) \propto e^{-\lambda|\theta_i|}$ :

$\mathcal{L}_{\text{L1}} = -\log p(\mathcal{D}|\theta) + \lambda\|\theta\|_1$

Under MDL: encoding a parameter with a Laplace prior costs $-\log p(\theta_i) = \lambda|\theta_i| + \text{const}$ .

Laplace priors encourage sparsity (many parameters exactly zero) because they have heavier tails than Gaussian - encoding exactly zero costs near-nothing, but large values are penalized proportionally. MDL says: the sparsest model that describes the data well is the best model.

Dropout as MDL

Dropout can be interpreted as a variational approximation to MDL. Each dropped weight effectively costs zero bits to encode. A model that is robust to random zeroing of weights is one that does not rely on any single parameter - it has learned a more compressible representation.

MDL and Bayesian Model Selection

Refined MDL (using normalized maximum likelihood or Bayesian model evidence) is equivalent to Bayesian model comparison:

$\text{MDL score}(M) = -\log p(\mathcal{D} | M) = -\log \int p(\mathcal{D}|\theta) p(\theta|M) d\theta$

This is the marginal likelihood or model evidence - the probability of the data averaged over all parameter settings under the prior.

The marginal likelihood naturally penalizes complexity: a complex model (many parameters) must spread its probability over a large parameter space, so it assigns lower probability to the specific data $\mathcal{D}$ than a simple model does (assuming both fit the data similarly well).

Bayes factor analogy:

              P(D | M₁)
Bayes factor = ─────────
              P(D | M₂)

If M₁ is simpler but fits the data nearly as well as M₂,
the Bayes factor favors M₁.

This is the quantitative form of Occam's razor:
simpler models that explain the data are preferred,
not just as a heuristic, but because they are more
"surprised" by the data when they do fit - and
therefore the data provides stronger evidence for them.

Practical MDL Criteria

BIC (Bayesian Information Criterion)

$\text{BIC}(M) = -2\log p(\mathcal{D}|\hat{\theta}_M) + k\log N$

where $k$ = number of parameters, $N$ = number of data points.

BIC approximates $-2\log p(\mathcal{D}|M)$ (the marginal likelihood) using Laplace's approximation. It penalizes model complexity logarithmically in the number of parameters.

AIC (Akaike Information Criterion)

$\text{AIC}(M) = -2\log p(\mathcal{D}|\hat{\theta}_M) + 2k$

AIC approximates the out-of-sample prediction error (not the description length). It penalizes complexity less severely than BIC for large $N$ .

import numpy as np
from scipy import stats


def bic(log_likelihood: float, k: int, n: int) -> float:
    """
    Bayesian Information Criterion.

    Args:
        log_likelihood: maximized log P(D | theta_hat)
        k: number of parameters
        n: number of data points

    Returns:
        BIC (lower is better)
    """
    return -2 * log_likelihood + k * np.log(n)


def aic(log_likelihood: float, k: int) -> float:
    """
    Akaike Information Criterion.

    Args:
        log_likelihood: maximized log P(D | theta_hat)
        k: number of parameters

    Returns:
        AIC (lower is better)
    """
    return -2 * log_likelihood + 2 * k


def mdl_model_selection_demo():
    """
    Demonstrate MDL-based model selection for polynomial regression.
    True model: y = 2x + 3 + noise
    """
    np.random.seed(42)
    n = 50
    x = np.linspace(-2, 2, n)
    y_true = 2 * x + 3
    y = y_true + np.random.normal(0, 0.5, n)

    results = []
    for degree in range(1, 12):
        # Fit polynomial
        coeffs = np.polyfit(x, y, degree)
        y_pred = np.polyval(coeffs, x)

        # Log-likelihood: assume Gaussian residuals
        residuals = y - y_pred
        sigma_hat = np.std(residuals)
        if sigma_hat < 1e-10:
            sigma_hat = 1e-10

        log_lik = np.sum(stats.norm.logpdf(residuals, scale=sigma_hat))

        k = degree + 1  # polynomial coefficients
        bic_val = bic(log_lik, k, n)
        aic_val = aic(log_lik, k)

        results.append({
            'degree': degree,
            'k': k,
            'log_lik': log_lik,
            'bic': bic_val,
            'aic': aic_val,
            'rmse': np.sqrt(np.mean(residuals**2)),
        })

    print("=== MDL/BIC/AIC Model Selection for Polynomial Regression ===")
    print(f"True model: degree 1  |  n={n} samples\n")
    print(f"{'Degree':>6} | {'#Params':>7} | {'Log-Lik':>10} | {'BIC':>10} | {'AIC':>10} | {'RMSE':>8}")
    print("-" * 65)
    for r in results:
        best_bic = "*" if r['bic'] == min(res['bic'] for res in results) else " "
        best_aic = "*" if r['aic'] == min(res['aic'] for res in results) else " "
        print(
            f"{r['degree']:>6} | {r['k']:>7} | {r['log_lik']:>10.2f} | "
            f"{r['bic']:>9.2f}{best_bic} | {r['aic']:>9.2f}{best_aic} | "
            f"{r['rmse']:>8.4f}"
        )
    print("\n* = selected by criterion (lower is better)")


mdl_model_selection_demo()

Two-Part MDL in Practice: Quantized Parameters

A concrete two-part MDL analysis for neural networks uses the idea that parameters must be encoded in finite precision.

If we encode each parameter to precision $\delta$ , the cost per parameter is $\log_2(R/\delta)$ bits, where $R$ is the range of values.

The total description length is:

$L = \underbrace{k \cdot \log_2(R/\delta)}_{\text{model: k parameters}} + \underbrace{-\sum_{i=1}^N \log_2 p_\theta(y_i|x_i)}_{\text{data: negative log-likelihood}}$

This motivates precision-aware model selection: more parameters at lower precision vs. fewer parameters at higher precision. The optimal model minimizes this sum.

def two_part_mdl(
    n_params: int,
    log_likelihood: float,
    param_range: float = 10.0,
    precision: float = 0.001,
) -> float:
    """
    Two-part MDL: L(model) + L(data | model).

    L(model) = k * log2(param_range / precision) bits
    L(data | model) = -log2_likelihood

    Args:
        n_params: number of model parameters k
        log_likelihood: maximized log P(data | theta) in nats
        param_range: range of parameter values (±param_range)
        precision: encoding precision

    Returns:
        total description length in bits
    """
    # Model description: k params, each taking log2(range/precision) bits
    model_bits = n_params * np.log2(2 * param_range / precision)

    # Data description: -log2(likelihood)
    data_bits = -log_likelihood / np.log(2)

    return model_bits + data_bits


# Demonstrate on linear vs polynomial
np.random.seed(0)
n = 100
x = np.linspace(0, 5, n)
y = 3.0 * x + 2.0 + np.random.normal(0, 0.5, n)

print("\n=== Two-Part MDL for Model Selection ===")
print(f"{'Model':<20} | {'Model bits':>12} | {'Data bits':>12} | {'Total bits':>12}")
print("-" * 60)

for degree in [1, 2, 3, 5, 10]:
    coeffs = np.polyfit(x, y, degree)
    y_pred = np.polyval(coeffs, x)
    residuals = y - y_pred
    sigma = np.std(residuals) + 1e-10

    log_lik = np.sum(stats.norm.logpdf(residuals, scale=sigma))
    k = degree + 1

    model_bits = k * np.log2(2 * 10 / 0.001)
    data_bits  = -log_lik / np.log(2)
    total      = model_bits + data_bits

    print(f"Degree {degree:<2} ({k} params)    | {model_bits:>12.1f} | {data_bits:>12.1f} | {total:>12.1f}")

MDL and Regularization: The Formal Equivalence

The following equivalences hold exactly under mild assumptions:

Regularization Method    | MDL Interpretation          | Prior Distribution
-------------------------+-----------------------------+--------------------
L2 (weight decay)        | Gaussian parameter code     | N(0, 1/λ)
L1 (lasso)               | Laplace parameter code      | Laplace(0, 1/λ)
Dropout                  | Bernoulli parameter mask     | Bernoulli(1-p_drop)
Early stopping           | Bound on description length | Implicit complexity
Max-norm constraint      | Bounded parameter code      | Uniform(-r, r)
Variational inference    | Refined MDL (stochastic)    | q(θ) ≈ p(θ|D)
Evidence maximization    | Normalized max likelihood   | Full marginal p(D|M)

Every regularization method has an MDL interpretation: it corresponds to a particular code for model parameters. The "best" regularization is the one whose implied prior best matches the true distribution of useful parameter values.

MDL in Neural Network Pruning

MDL provides a principled framework for neural network pruning. Instead of arbitrary sparsity targets, we ask:

How many bits does each weight contribute to the description?

For a weight $w$ with encoding precision $\delta$ :

Zero weight: costs 0 bits (just a single indicator bit)
Non-zero weight $w$ : costs approximately $\log_2(|w|/\delta)$ bits + sign bit

A weight should be kept only if the improvement in data description $(-\Delta \log p(\mathcal{D}|\theta))$ exceeds the bits needed to describe it.

def mdl_pruning_score(
    weight: float,
    data_improvement: float,
    precision: float = 1e-3,
) -> float:
    """
    MDL score for keeping vs. pruning a weight.
    Positive = keep weight (it reduces total description length).
    Negative = prune weight (it costs more to describe than it saves).

    Args:
        weight: current weight value
        data_improvement: bits saved in data description by this weight
        precision: encoding precision

    Returns:
        net MDL benefit of keeping this weight (in bits)
    """
    if abs(weight) < precision:
        # Treat as zero: costs ~1 bit to describe (just a flag)
        return data_improvement - 1.0

    # Bits needed to encode this weight value
    weight_bits = np.log2(abs(weight) / precision) + 1  # +1 for sign bit

    return data_improvement - weight_bits


# Example: should we prune these weights?
print("\n=== MDL Pruning Decision ===")
print(f"{'Weight value':>15} | {'Data improvement':>17} | {'Weight bits':>12} | {'Decision':>10}")
print("-" * 65)

weights_and_improvements = [
    (0.001, 0.5),   # tiny weight, small improvement
    (0.1,  5.0),    # small weight, moderate improvement
    (1.0,  20.0),   # large weight, large improvement
    (2.5,  3.0),    # large weight, small improvement
    (0.01, 0.05),   # very small weight, very small improvement
]

for w, delta_bits in weights_and_improvements:
    if abs(w) < 1e-3:
        weight_bits = 1.0
    else:
        weight_bits = np.log2(abs(w) / 1e-3) + 1

    net_benefit = mdl_pruning_score(w, delta_bits)
    decision = "KEEP" if net_benefit > 0 else "PRUNE"
    print(f"{w:>15.4f} | {delta_bits:>17.2f} | {weight_bits:>12.2f} | {decision:>10}")

MDL and Generalization: Why Compression Implies Generalization

The connection between MDL and generalization is given by a fundamental theorem:

MDL Generalization Bound: If a hypothesis $h$ has description length $L(h)$ bits, then with probability at least $1 - \delta$ :

$\text{Error}(h) \leq \hat{\text{Error}}(h) + \sqrt{\frac{L(h)\ln 2 + \ln(1/\delta)}{2N}}$

Interpretation: Models with shorter descriptions generalize better, with the generalization gap bounded by $O(\sqrt{L(h)/N})$ .

This is a formal statement of Occam's razor: among models with the same training error, shorter descriptions → smaller generalization gap → better test performance.

It also explains why:

Overparameterized models can still generalize: if the effective description length (implicit regularization, early stopping, dropout) is small, the bound holds
Regularization improves generalization: it reduces $L(h)$
More data helps more for complex models: the bound tightens as $N$ grows, making complexity less costly

Interview Questions and Answers

Q1: What is the Minimum Description Length principle, and how does it relate to model selection?

MDL states: among all models, choose the one that minimizes the total number of bits needed to describe both the model and the data given the model. Formally, minimize $L(M) + L(D|M)$ where $L(M)$ is the cost to encode the model and $L(D|M)$ is the cost to encode the data residuals.

For model selection, MDL operationalizes Occam's razor: a model that achieves slightly better fit but requires many more bits to describe may not be preferred. The degree-2 polynomial beats the degree-20 polynomial if the polynomial coefficients required to describe the degree-20 model cost more bits than the improvement in data fit saves. MDL quantifies this trade-off precisely using information theory rather than intuition.

Q2: How does L2 regularization relate to MDL?

L2 regularization adds $\frac{\lambda}{2}\|\theta\|^2$ to the loss, which equals $-\log p(\theta)$ for a Gaussian prior $p(\theta) = \mathcal{N}(0, 1/\lambda)$ . The full L2 objective is:

$\underbrace{-\log p(\mathcal{D}|\theta)}_{\text{data description}} + \underbrace{\frac{\lambda}{2}\|\theta\|^2}_{\text{model description under Gaussian code}}$

This is exactly the two-part MDL objective where we encode parameters using a Gaussian code. Larger $\lambda$ = fewer bits per parameter = simpler model description = stronger Occam's razor preference. The optimal $\lambda$ (by cross-validation) is the one where the total description length is minimized.

Q3: Explain the MDL generalization bound and what it says about overparameterized neural networks.

The MDL generalization bound states that test error $\leq$ training error + $O(\sqrt{L(h)/N})$ where $L(h)$ is the description length of the hypothesis and $N$ is the training set size. Models with shorter descriptions have tighter generalization bounds.

For overparameterized neural networks (more parameters than data points), the naive description length (all parameters at full precision) gives a vacuous bound. But neural networks trained with SGD, weight decay, and early stopping have a much shorter effective description length - they implicitly find solutions in low-dimensional regions of parameter space. This is related to the implicit regularization of SGD (it finds flat minima with low Fisher information), which corresponds to short effective description lengths. The overparameterization is not a problem if the learned solution is compressible.

Q4: What is the difference between BIC and AIC, and when do you use each?

BIC: $-2\log p(\mathcal{D}|\hat\theta) + k\log N$ - approximates the marginal likelihood $-2\log p(\mathcal{D}|M)$ using Laplace's method. Penalizes each parameter by $\log N$ bits, which grows with dataset size.

AIC: $-2\log p(\mathcal{D}|\hat\theta) + 2k$ - approximates the expected out-of-sample prediction error (via an information-theoretic argument by Akaike). Penalizes each parameter by 2 bits regardless of $N$ .

When to use each:

BIC: when you believe there is a true model of finite complexity and you have enough data to identify it. BIC is consistent - it selects the correct model as $N \to \infty$ . Use for explanatory modeling (which model generated the data?).
AIC: when you want the model with best predictive performance on new data. AIC minimizes predictive error asymptotically. Use for predictive modeling (which model makes the best predictions?).
For small $N$ , use AICc (corrected AIC): $\text{AIC} + \frac{2k^2 + 2k}{N-k-1}$ .

Q5: How does MDL relate to the bias-variance trade-off?

The MDL total description length $L(M) + L(D|M)$ corresponds exactly to the bias-variance trade-off:

$L(M)$ is the variance term: complex models have high variance (many parameters to describe), so they tend to overfit. A short $L(M)$ means few parameters = low variance.
$L(D|M)$ is the bias term: simple models have high bias (they cannot fit the data well), so the residuals are large and require many bits. A short $L(D|M)$ means good fit = low bias.

Minimizing $L(M) + L(D|M)$ finds the optimal bias-variance trade-off from an information-theoretic perspective. Underfitting (high bias, high $L(D|M)$ , low $L(M)$ ) and overfitting (low bias, low $L(D|M)$ , high $L(M)$ ) both increase total description length. The optimal model minimizes the sum - this is the quantitative justification for the bias-variance trade-off.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Entropy Explorer demo on the EngineersOfAI Playground - no code required.

:::

The Scenario That Motivates This Lesson​

The Core Idea: Description Length​

Connection to Kolmogorov Complexity​

Two-Part MDL (Crude MDL)​

MDL and Regularization​

L2 Regularization (Weight Decay)​

L1 Regularization​

Dropout as MDL​

MDL and Bayesian Model Selection​

Practical MDL Criteria​

BIC (Bayesian Information Criterion)​

AIC (Akaike Information Criterion)​

Two-Part MDL in Practice: Quantized Parameters​

MDL and Regularization: The Formal Equivalence​

MDL in Neural Network Pruning​

MDL and Generalization: Why Compression Implies Generalization​

Interview Questions and Answers​