Generalisation Bounds in Deep Learning

Reading time: ~28 minutes | Level: Statistical Learning Theory → Modern Deep Learning Research

The Paradox That Broke Classical Theory

In 2017, Zhang et al. published "Understanding Deep Learning Requires Rethinking Generalization" - a paper that caused a crisis in theoretical ML. Their key experiment:

They trained a ResNet-50 on CIFAR-10 where the labels were randomly shuffled. The network achieved near-zero training loss. But on any test set, performance was random chance (10% accuracy) - as expected, since the labels were meaningless.

Then they trained the same network on real labels. It achieved ~95% accuracy, generalizing well.

The paradox: The same overparameterized neural network perfectly memorizes random labels (proving its capacity to overfit arbitrarily) yet also generalizes well on real data. Classical theory says this is impossible - a model that can fit any labeling of $n$ training examples has VC dimension $\geq n$ , so the generalization bound is vacuous (greater than 1).

Classical theory predicts: more parameters → more overfitting → worse generalization. Empirical observation: more parameters sometimes → better generalization.

This lesson explains the modern theoretical understanding of why deep learning generalizes despite classical theory saying it shouldn't.

What You Will Learn

Why VC/PAC bounds are vacuous for modern deep networks - the formal argument
Double descent: the empirical phenomenon and its theoretical explanation
Benign overfitting: when interpolating solutions generalize (and when they don't)
Implicit regularisation of SGD: why gradient descent finds minimum-norm solutions
The Neural Tangent Kernel (NTK): the infinite-width limit and its generalization theory
PAC-Bayes bounds: the tightest practical generalization bounds for deep learning
Open questions: what theory still cannot explain about deep learning generalization
Five interview questions with full worked answers

What Classical Theory Gets Wrong

Classical VC/PAC bounds say: with probability $\geq 1-\delta$ ,

$R(h) \leq \hat{R}(h) + O\left(\sqrt{\frac{d_{VC}(\mathcal{H}) + \log(1/\delta)}{n}}\right)$

For a neural network with $p$ parameters, $d_{VC} = O(p \log p)$ (Bartlett & Maass, 2003). For modern networks, $p \gg n$ - the bound is $\gg 1$ , giving zero information.

The assumptions that break down:

Hypothesis class is fixed before training: Classical theory bounds worst-case risk over all $h \in \mathcal{H}$ . But SGD doesn't explore all of $\mathcal{H}$ - it follows a specific trajectory from a specific initialization.
ERM selects the worst-case hypothesis: Classical theory assumes we pick any zero-training-error hypothesis. SGD's implicit bias selects a specific one - often the minimum-norm solution, which generalizes.
No distributional structure: Classical bounds are distribution-free. Real data has massive structure that the network learns to exploit. The Rademacher complexity on real data is much smaller than on adversarial worst-case data.

Double Descent: The New Picture

The double descent curve (Belkin et al., 2019; Bartlett et al., 2020) extends the classical bias-variance U-curve:

Test error

  │      Classical regime          │  Modern ML regime
  │      (underfitting → overfitting)   (overparameterized)
  │
  │     ╱╲
  │    ╱  ╲                            ╲
  │   ╱    ╲                            ╲___
  │  ╱      ╲         ╱╲               ╱
  │ ╱        ╲_______╱  ╲_____________╱
  │
  └─────────────────────────────────────────> # parameters
                     ↑
          interpolation threshold
          (n parameters ≈ n data points)

Three regimes:

1. Classical underfitting region (few parameters, $p \ll n$ ): Bias dominates. More parameters → lower bias → better test performance.

2. Classical overfitting region ( $p \approx n$ ): Near the interpolation threshold, the model has just enough capacity to fit the training data - but doing so requires fitting noise aggressively. Test error spikes.

3. Overparameterized "modern" region ( $p \gg n$ ): Adding more parameters makes it easier to interpolate the training data while generalizing. Test error decreases again. This is counterintuitive but empirically robust.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

def demonstrate_double_descent():
    """
    Double descent in polynomial regression.
    As degree grows past n, test error first spikes then decreases.
    """
    np.random.seed(42)
    n = 30  # training points
    n_test = 5000

    # True function
    true_f = lambda x: np.sin(2*np.pi*x)

    X_train = np.random.uniform(0, 1, n).reshape(-1, 1)
    y_train = true_f(X_train.ravel()) + np.random.randn(n) * 0.1

    X_test = np.linspace(0, 1, n_test).reshape(-1, 1)
    y_test = true_f(X_test.ravel())

    degrees = range(1, 80)
    train_errors, test_errors = [], []

    for d in degrees:
        # Minimum-norm interpolation via pseudoinverse for overdetermined case
        poly = PolynomialFeatures(d)
        X_poly_train = poly.fit_transform(X_train)
        X_poly_test = poly.transform(X_test)

        n_params = X_poly_train.shape[1]

        if n_params <= n:
            # Underdetermined: use ordinary least squares
            coefs = np.linalg.lstsq(X_poly_train, y_train, rcond=None)[0]
        else:
            # Overdetermined: use minimum-norm solution (Moore-Penrose pseudoinverse)
            # This is what gradient descent from zero finds
            coefs = np.linalg.pinv(X_poly_train) @ y_train

        y_pred_train = X_poly_train @ coefs
        y_pred_test = X_poly_test @ coefs

        train_err = np.mean((y_pred_train - y_train)**2)
        test_err = np.mean((y_pred_test - y_test)**2)

        # Clip extreme values for visibility
        train_errors.append(min(train_err, 5.0))
        test_errors.append(min(test_err, 5.0))

    plt.figure(figsize=(12, 5))
    plt.semilogy(list(degrees), train_errors, 'b-', linewidth=2, label='Train error')
    plt.semilogy(list(degrees), test_errors, 'r-', linewidth=2, label='Test error')
    plt.axvline(n, color='green', linestyle='--', linewidth=2,
                label=f'Interpolation threshold (d={n})')
    plt.axhline(0.01, color='gray', linestyle=':', label='Noise floor')
    plt.xlabel('Polynomial degree (# parameters ≈ degree)')
    plt.ylabel('MSE (log scale)')
    plt.title('Double Descent: Polynomial Regression')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('double_descent.png', dpi=150)

    # Find the double descent peak
    peak_deg = list(degrees)[np.argmax(test_errors)]
    print(f"Double descent peak: at degree {peak_deg} (n={n})")
    print(f"Test error at peak: {max(test_errors):.4f}")
    print(f"Test error at high degree (d=75): {test_errors[-1]:.4f}")
    print(f"Test error at low degree (d=5): {test_errors[4]:.4f}")

demonstrate_double_descent()

Implicit Regularisation of SGD

The key to understanding why overparameterized networks generalize: SGD has an implicit bias toward solutions with small norm.

Theorem (for linear models, Gunasekar et al., 2018): For linear regression with gradient descent starting from $\mathbf{w}_0 = \mathbf{0}$ , the solution converges to the minimum L2-norm solution:

$\hat{\mathbf{w}}_{GD} = \arg\min_\mathbf{w} \|\mathbf{w}\|_2 \quad \text{s.t.} \quad X\mathbf{w} = \mathbf{y}$

This is the Moore-Penrose pseudoinverse solution: $\hat{\mathbf{w}} = X^\top(XX^\top)^{-1}\mathbf{y}$ .

Why does the minimum-norm solution generalize? Because minimum-norm linear interpolation is equivalent to kernel regression with the inner product kernel $k(x, x') = x^\top x'$ - a smoothly varying function that doesn't overfit noise.

For neural networks: The implicit bias is more complex but analogous:

Gradient descent with small step size from random initialization tends to find solutions in a "flat minimum" region of the loss landscape
Flat minima (low loss curvature) generalize better than sharp minima (Hochreiter & Schmidhuber, 1997; Keskar et al., 2017)
SGD with small batch size has higher noise → escapes sharp minima → finds flatter solutions → better generalization

import numpy as np

def minimum_norm_interpolation_demo():
    """
    Show that minimum-norm interpolation generalizes better
    than other zero-training-error solutions.
    """
    np.random.seed(42)
    n = 20   # training points
    p = 100  # parameters (> n, so infinitely many solutions)

    # True linear function with sparse weights
    w_true = np.zeros(p)
    w_true[:3] = [2.0, -1.5, 1.0]

    X_train = np.random.randn(n, p)
    y_train = X_train @ w_true + np.random.randn(n) * 0.1

    X_test = np.random.randn(1000, p)
    y_test = X_test @ w_true

    # Minimum-norm interpolating solution (pseudoinverse)
    w_min_norm = np.linalg.pinv(X_train) @ y_train

    # Random interpolating solution (for comparison)
    # Start from a random solution and add minimum-norm component
    w_random_offset = np.random.randn(p) * 2
    # Project to constraint set: X_train @ w = y_train
    # w_feasible = w_random + X_train^T (X_train X_train^T)^{-1} (y_train - X_train @ w_random)
    XXT_inv = np.linalg.inv(X_train @ X_train.T)
    correction = X_train.T @ XXT_inv @ (y_train - X_train @ w_random_offset)
    w_random = w_random_offset + correction

    # Verify both interpolate
    print("Verification (should be near 0):")
    print(f"  Min-norm train error: {np.mean((X_train @ w_min_norm - y_train)**2):.2e}")
    print(f"  Random-init train error: {np.mean((X_train @ w_random - y_train)**2):.2e}")

    print(f"\nGeneralisation:")
    print(f"  Min-norm test MSE: {np.mean((X_test @ w_min_norm - y_test)**2):.4f}")
    print(f"  Random-init test MSE: {np.mean((X_test @ w_random - y_test)**2):.4f}")

    print(f"\nSolution norms:")
    print(f"  ||w_min_norm||: {np.linalg.norm(w_min_norm):.4f}")
    print(f"  ||w_random||: {np.linalg.norm(w_random):.4f}")
    print(f"  ||w_true||: {np.linalg.norm(w_true):.4f}")

minimum_norm_interpolation_demo()

Benign Overfitting

Benign overfitting (Bartlett et al., 2020): perfectly fitting noisy training data can be consistent - the estimator converges to the optimal predictor as $n \to \infty$ - if the feature space is sufficiently high-dimensional and the "effective rank" of the data matrix is large relative to the noise.

Theorem (informal, Bartlett et al., 2020): Consider linear regression with $p$ features, $n$ samples. The minimum-norm interpolating solution has test MSE:

$E[\text{MSE}_{test}] \leq \frac{\sigma^2 (1 + o(1))}{n} + \text{bias terms}$

when $p \gg n$ and the covariance structure satisfies a "effectively low-dimensional" condition. This goes to 0 as $n \to \infty$ - the interpolating solution is consistent!

The key condition: benign overfitting requires the data to have a "spiked" covariance structure - most variance in a few directions (the signal), with the remaining directions having low but nonzero variance (can absorb noise without hurting the signal direction's prediction).

import numpy as np

def benign_overfitting_simulation():
    """
    Demonstrate benign overfitting: interpolating estimator that generalizes.
    Spiked covariance: top k eigenvalues large, rest small.
    """
    np.random.seed(42)
    n = 100
    p = 5000   # p >> n

    # Spiked covariance: k large eigenvalues, p-k small eigenvalues
    k = 5
    sigma_signal = 10.0   # variance in signal directions
    sigma_noise_feature = 0.01  # variance in noise directions

    # Generate covariance matrix
    eigenvalues = np.concatenate([
        np.ones(k) * sigma_signal,         # k large eigenvalues (signal)
        np.ones(p - k) * sigma_noise_feature  # p-k small (noise features)
    ])
    # Generate X with this covariance structure
    X_raw = np.random.randn(n, p) * np.sqrt(eigenvalues)

    # True model: only depends on first k features
    w_true = np.zeros(p)
    w_true[:k] = np.random.randn(k)

    noise_std = 0.3
    y = X_raw @ w_true + np.random.randn(n) * noise_std

    # Test data (n_test = 10000)
    n_test = 10000
    X_test = np.random.randn(n_test, p) * np.sqrt(eigenvalues)
    y_test = X_test @ w_true

    # Minimum-norm interpolating solution
    w_interp = np.linalg.pinv(X_raw) @ y

    # Compare: OLS on only the first k features
    w_oracle = np.zeros(p)
    w_oracle[:k] = np.linalg.lstsq(X_raw[:, :k], y, rcond=None)[0]

    test_mse_interp = np.mean((X_test @ w_interp - y_test)**2)
    test_mse_oracle = np.mean((X_test @ w_oracle - y_test)**2)
    train_mse_interp = np.mean((X_raw @ w_interp - y)**2)

    print(f"Benign overfitting simulation (n={n}, p={p}):")
    print(f"  Train MSE (interpolating): {train_mse_interp:.2e} (near 0)")
    print(f"  Test MSE (interpolating):  {test_mse_interp:.4f}")
    print(f"  Test MSE (oracle k-feature): {test_mse_oracle:.4f}")
    print(f"  Noise floor (sigma^2): {noise_std**2:.4f}")
    print()
    print("Key: interpolating solution nearly matches the oracle (knowing which features matter)")
    print("because spiked covariance absorbs noise in low-variance feature directions")
    print("without contaminating the high-variance signal directions")

benign_overfitting_simulation()

PAC-Bayes Bounds: Modern Generalization Theory for Deep Learning

PAC-Bayes bounds (McAllester, 1999) give generalization guarantees for distributions over hypotheses - directly applicable to overparameterized networks.

Setup: Prior $P$ over $\mathcal{H}$ (before seeing data), posterior $Q$ (after training).

PAC-Bayes Theorem: With probability $\geq 1-\delta$ over the training set $S \sim \mathcal{D}^n$ , for all distributions $Q$ :

$\text{KL}(Q[\hat{R}(h)] \| Q[R(h)]) \leq \frac{\text{KL}(Q \| P) + \ln\frac{n}{\delta}}{n-1}$

The useful form (Maurer 2004 bound):

$E_{h \sim Q}[R(h)] \leq E_{h \sim Q}[\hat{R}(h)] + \sqrt{\frac{\text{KL}(Q \| P) + \ln\frac{2n}{\delta}}{2n}}$

What this says: If the posterior $Q$ (trained network + weight perturbation distribution) is not too different from the prior $P$ (random init), the generalization gap is controlled by $\text{KL}(Q\|P)/n$ .

The deep learning connection: The KL divergence $\text{KL}(Q\|P)$ measures how much the training procedure changed the weights from initialization. SGD with weight decay keeps weights close to initialization (small KL) → small generalization gap. This is the formal version of "flat minima generalize better."

import torch
import torch.nn as nn
import numpy as np

def pac_bayes_bound(kl_qp, n, delta=0.05):
    """
    PAC-Bayes generalization bound.
    kl_qp: KL divergence between posterior Q and prior P
    n: training set size
    Returns the bound on E_Q[R(h)] - E_Q[hat_R(h)]
    """
    return np.sqrt((kl_qp + np.log(2*n/delta)) / (2*n))

# Illustrative example: measuring PAC-Bayes KL for a simple network
def compute_pac_bayes_kl(model_weights, prior_std=0.1):
    """
    KL divergence between Gaussian posterior (trained weights)
    and Gaussian prior N(0, prior_std^2 * I).

    KL(N(mu, sigma^2) || N(0, prior_std^2)) = sum_i [
        0.5 * (sigma_i^2/prior_std^2 + mu_i^2/prior_std^2 - 1 - log(sigma_i^2/prior_std^2))
    ]
    For a point estimate (sigma -> 0), KL -> sum_i (mu_i^2 / (2*prior_std^2)) = ||mu||^2 / (2*prior_std^2)
    """
    total_params = sum(p.numel() for p in model_weights)
    weight_norm_sq = sum(p.norm()**2 for p in model_weights).item()

    # Approximation: treat network as point mass at trained weights
    # KL = ||w||^2 / (2 * prior_std^2)
    kl = weight_norm_sq / (2 * prior_std**2)
    return kl, total_params

# Create simple network
model = nn.Sequential(
    nn.Linear(100, 256),
    nn.ReLU(),
    nn.Linear(256, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)

# Initialize with small random weights (as if training with weight decay)
with torch.no_grad():
    for p in model.parameters():
        p.data.normal_(0, 0.01)  # tight init: small weights

kl_small_init, n_params = compute_pac_bayes_kl(model.parameters(), prior_std=0.1)
bound_small = pac_bayes_bound(kl_small_init, n=50000)

# Large weights (no regularisation, trained longer)
with torch.no_grad():
    for p in model.parameters():
        p.data.normal_(0, 1.0)   # large weights

kl_large_init, _ = compute_pac_bayes_kl(model.parameters(), prior_std=0.1)
bound_large = pac_bayes_bound(kl_large_init, n=50000)

print(f"Network: {n_params:,} parameters")
print(f"Prior: N(0, 0.1^2)")
print()
print(f"Small-norm weights (like strong weight decay):")
print(f"  KL(Q||P) = {kl_small_init:.2f}")
print(f"  PAC-Bayes bound on gen. gap: {bound_small:.4f}")
print()
print(f"Large-norm weights (no regularisation):")
print(f"  KL(Q||P) = {kl_large_init:.2f}")
print(f"  PAC-Bayes bound on gen. gap: {bound_large:.4f}")
print()
print(f"Weight decay directly controls KL(Q||P) -> directly controls the PAC-Bayes bound")

Summary: The Modern Picture of Generalisation in Deep Learning

Phenomenon	Theoretical Explanation
Overparameterized nets generalize	Minimum-norm implicit bias of SGD; PAC-Bayes with small KL from init
Flat minima generalize	Lower Hessian eigenvalues → smaller PAC-Bayes KL → tighter bound
More data helps even for large models	Generalization gap scales as $O(1/\sqrt{n})$ even for min-norm interpolation
Weight decay helps	Directly reduces KL(Q
Double descent	Min-norm interpolation is benign when covariance is spiked; near interpolation threshold, solution is not minimum-norm
Random labels can be memorized	High KL; but real data has lower effective complexity via NTK alignment
Larger batch size hurts generalization	Larger batches → less SGD noise → sharper minima → higher KL

Interview Questions

Q1: What did Zhang et al. (2017) show about neural network generalization, and why was it surprising?

Zhang et al. trained ResNet on CIFAR-10 with randomly permuted labels and showed the network achieved near-zero training loss - perfectly memorizing 50,000 random label assignments. This was surprising because classical learning theory says: a model that can perfectly fit any labeling of $n$ training examples has VC dimension $\geq n$ . For such a model, the generalization bound is $O(\sqrt{n/n}) = O(1)$ - vacuous, telling us nothing about generalization. Yet the same network, trained on real labels, generalizes to 94%+ test accuracy. The result showed that classical theory's measures of model complexity (VC dimension, number of parameters) are insufficient to explain neural network generalization. The explanation requires understanding what the training algorithm (SGD) selects - not just what the model class can represent.

Q2: Explain the double descent phenomenon and why the classical bias-variance picture doesn't capture it.

Classical bias-variance theory predicts a U-shaped test error curve: as model complexity grows, bias decreases and variance increases, with an optimal complexity in the middle. Double descent shows that beyond the interpolation threshold (where the model has enough parameters to exactly fit training data), test error first peaks then decreases again with further overparameterization. The classical picture doesn't capture this because it assumes: (1) the selected hypothesis is the one with minimum training error that "maximally uses" the model class's capacity; (2) models near the interpolation threshold are maximally constrained. But gradient descent from small initialization finds the minimum-norm interpolating solution - not the maximum-complexity zero-error solution. In the overparameterized regime, minimum-norm interpolation is a smooth function that generalizes well (similar to kernel regression), explaining the second descent.

Q3: What is implicit regularisation in gradient descent, and what does it find for linear models?

Implicit regularisation is the bias of the training algorithm (gradient descent) toward solutions with certain properties, without explicit regularisation terms in the loss. For linear regression with gradient descent starting from $\mathbf{w}_0 = \mathbf{0}$ , when the system is underdetermined ( $p > n$ ), GD converges to the minimum Euclidean-norm solution: $\hat{\mathbf{w}} = X^\top(XX^\top)^{-1}\mathbf{y}$ (the Moore-Penrose pseudoinverse solution). This is because the gradient update $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta X^\top(X\mathbf{w}_t - y)$ always stays in the row space of $X$ (since we start at $\mathbf{0}$ and each update is in the row space of $X^\top$ ). The minimum-norm solution is the unique solution in the row space of $X$ . For neural networks, the situation is more complex but analogous - SGD from small-norm initialization with small step size tends toward flat, low-norm solutions.

Q4: What are PAC-Bayes bounds and why are they useful for deep learning?

PAC-Bayes bounds provide generalization guarantees for distributions over hypotheses: $E_{h \sim Q}[R(h)] \leq E_{h \sim Q}[\hat{R}(h)] + \sqrt{(\text{KL}(Q\|P) + \log(2n/\delta))/(2n)}$ . They're useful for deep learning because: (1) the bound applies to stochastic predictors (like MC Dropout or networks with random weight perturbations), which are more natural models for neural networks than point estimates; (2) the KL divergence $\text{KL}(Q\|P)$ between posterior $Q$ (trained weights + perturbation) and prior $P$ (random initialization) is measurable and bounded by the weight norm: $\text{KL} \leq \|\mathbf{w}_{trained} - \mathbf{w}_{init}\|^2 / (2\sigma_{prior}^2)$ ; (3) weight decay directly controls this KL; (4) PAC-Bayes bounds can give non-vacuous guarantees for neural networks (Dziugaite & Roy, 2017 showed bounds around 25% for MNIST where VC bounds give 100%+).

Q5: Why do larger batch sizes often hurt generalization in practice, despite training the same model?

Large-batch SGD tends to converge to sharp minima of the training loss, while small-batch SGD finds flat minima. Sharp minima generalize poorly because they're sensitive to small perturbations of the weights: if the minimum is sharp, nearby points on the test distribution (which are always slightly shifted from training) will have high loss. Flat minima generalize better because they're robust to this shift. The theoretical mechanism: small-batch SGD has high gradient noise, which acts like perturbations that "push" the optimizer away from sharp minima and toward flat ones (essentially doing stochastic exploration). Large-batch SGD has low gradient noise and converges to the nearest sharp minimum. In PAC-Bayes terms: flat minima have smaller Hessian eigenvalues, meaning weight perturbations of a given magnitude cause smaller loss changes - which corresponds to smaller KL between the posterior (perturbed weights) and the unperturbed point estimate. Keskar et al. (2017) quantified this experimentally; Jastrzkebski et al. (2018) showed the noise scale $\eta/B$ (step size / batch size) is the key control parameter.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the PAC Learning demo on the EngineersOfAI Playground - no code required.

:::

The Paradox That Broke Classical Theory​

What You Will Learn​

What Classical Theory Gets Wrong​

Double Descent: The New Picture​

Implicit Regularisation of SGD​

Benign Overfitting​

PAC-Bayes Bounds: Modern Generalization Theory for Deep Learning​

Summary: The Modern Picture of Generalisation in Deep Learning​

Interview Questions​