Statistical Learning Theory

Reading time: ~40 minutes | Level: ML Foundations | Role: MLE, Research Engineer, Data Scientist

A researcher presents a new neural architecture claiming 99% accuracy on a benchmark. A reviewer asks: "But what is the VC dimension of your model? How many samples would you need to guarantee generalisation within 1% of the true error with 95% confidence?"

If this question sounds foreign, you're missing the mathematical bedrock of machine learning. Statistical Learning Theory (SLT) is the branch of mathematics that rigorously answers: when can a machine learning algorithm be guaranteed to generalise, how much data is required, and what properties of the hypothesis class determine learnability. This is the language spoken at ML research frontiers.

What You Will Learn

The PAC learning framework: when and why learning is guaranteed to work
Sample complexity: how many examples are needed to learn with confidence
VC dimension: measuring the expressiveness of a hypothesis class
Sauer's Lemma and the growth function - the combinatorial heart of generalisation
Rademacher complexity: a data-dependent bound on generalisation
The Fundamental Theorem of Statistical Learning
Why regularisation works from a theoretical perspective
Structural Risk Minimisation (SRM) - model selection from first principles
Connection to modern deep learning: why do overparameterised models generalise?
Eight interview Q&As at senior research engineer level

Part 1 - The Generalisation Problem, Formally

Setup

Let:

$\mathcal{X}$ = input space, $\mathcal{Y}$ = label space
$\mathcal{D}$ = unknown data distribution over $\mathcal{X} \times \mathcal{Y}$
Training set $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ drawn i.i.d. from $\mathcal{D}$
Hypothesis class $\mathcal{H}$ = set of all models we can express
Loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}$

True risk (what we care about): $L_{\mathcal{D}}(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(h(x), y)]$

Empirical risk (what we can compute): $\hat{L}_S(h) = \frac{1}{m} \sum_{i=1}^m \ell(h(x_i), y_i)$

The core question: Under what conditions does minimising $\hat{L}_S(h)$ lead to small $L_{\mathcal{D}}(h)$ ?

The generalisation gap is $L_{\mathcal{D}}(h) - \hat{L}_S(h)$ . SLT provides:

Upper bounds on the generalisation gap with high probability
Conditions under which ERM (Empirical Risk Minimisation) is guaranteed to work

Part 2 - PAC Learning

The Framework

PAC stands for Probably Approximately Correct (Valiant, 1984). A learning algorithm $A$ PAC-learns hypothesis class $\mathcal{H}$ if, for any data distribution $\mathcal{D}$ , any $\varepsilon > 0$ (accuracy), and any $\delta > 0$ (confidence), there exists a sample size $m_\mathcal{H}(\varepsilon, \delta)$ such that for any $m \geq m_\mathcal{H}$ :

$\Pr_{S \sim \mathcal{D}^m}\left[L_{\mathcal{D}}(A(S)) \leq \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \varepsilon\right] \geq 1 - \delta$

Translation: With probability at least $1 - \delta$ over the random draw of training data, the algorithm finds a hypothesis within $\varepsilon$ of the best possible in $\mathcal{H}$ .

The PAC learning guarantee in plain English:

"Give me enough data (how much depends on ε and δ),
 and with high probability (≥ 1-δ),
 I'll output a model that's approximately (within ε)
 as good as any model in my hypothesis class."

The two knobs:
  ε (epsilon) - accuracy: how close to optimal?
  δ (delta)   - confidence: how sure are you?

There's always a trade-off:
  Want 99% confidence AND 0.1% error tolerance?
  → You'll need a LOT more data.

Finite Hypothesis Classes - A Concrete Bound

For a finite hypothesis class $|\mathcal{H}| < \infty$ with 0-1 loss:

$m \geq \frac{1}{2\varepsilon^2}\left(\ln |\mathcal{H}| + \ln \frac{1}{\delta}\right) \implies \text{PAC guarantee holds}$

This says: you need samples proportional to the log of the number of hypotheses. This is good news - even if $|\mathcal{H}| = 10^{100}$ hypotheses, $\ln(10^{100}) \approx 230$ , which is manageable.

import numpy as np

def pac_sample_complexity_finite(n_hypotheses: int, epsilon: float, delta: float) -> int:
    """
    Minimum samples for PAC learning a finite hypothesis class.

    Parameters:
        n_hypotheses: size of hypothesis class |H|
        epsilon:      accuracy parameter (tolerance from optimal)
        delta:        confidence parameter (failure probability)

    Returns:
        Minimum sample size m guaranteeing PAC learning
    """
    m = (1 / (2 * epsilon**2)) * (np.log(n_hypotheses) + np.log(1 / delta))
    return int(np.ceil(m))

print("PAC Sample Complexity - Finite Hypothesis Class:")
print(f"\n{'|H|':>15} {'ε':>8} {'δ':>8} {'m (samples)':>15}")
print("-" * 55)

test_cases = [
    (100, 0.05, 0.05),
    (10_000, 0.05, 0.05),
    (1e10, 0.05, 0.05),
    (1e10, 0.01, 0.05),
    (1e10, 0.01, 0.01),
    (2**64, 0.05, 0.05),  # 64-bit decision tree with binary features
]

for nh, eps, delt in test_cases:
    m = pac_sample_complexity_finite(int(nh), eps, delt)
    print(f"{nh:>15.0e} {eps:>8.2f} {delt:>8.2f} {m:>15,}")

print(f"""
Observations:
  - The sample complexity grows as log(|H|), not linearly in |H|
  - Halving ε (double accuracy) requires ~4x more data (1/ε² dependence)
  - Halving δ adds only log(2) ≈ 0.7 to the bound - confidence is "cheap"
  - Even |H| = 2^64 requires only ~650 samples for ε=0.05, δ=0.05
""")

The Union Bound Proof Sketch

The PAC bound for finite classes follows from the Union Bound and Hoeffding's inequality:

Proof sketch:
1. Fix any "bad" hypothesis h with L_D(h) > ε
2. Hoeffding: P[L_hat_S(h) ≤ 0 | L_D(h) > ε] ≤ exp(-2mε²)
   (bad h looks good on training with exponentially small probability)

3. Union bound over all |H| hypotheses:
   P[∃h ∈ H: h is bad but looks good] ≤ |H| · exp(-2mε²)

4. Set this ≤ δ and solve for m:
   |H| · exp(-2mε²) ≤ δ
   m ≥ [ln|H| + ln(1/δ)] / (2ε²)

Part 3 - VC Dimension

Finite hypothesis classes are too limiting. We need to handle infinite hypothesis classes (all possible linear classifiers, all possible neural networks, etc.). VC dimension extends PAC theory to this case.

Shattering

A hypothesis class $\mathcal{H}$ shatters a set $C = \{x_1, \ldots, x_k\}$ if for every possible binary labelling $b = (b_1, \ldots, b_k) \in \{0,1\}^k$ , there exists some $h \in \mathcal{H}$ that correctly classifies all points:

$\forall b \in \{0,1\}^k, \exists h \in \mathcal{H}: h(x_i) = b_i \; \forall i$

VC Dimension of $\mathcal{H}$ is the largest set size that $\mathcal{H}$ can shatter:

$\text{VCdim}(\mathcal{H}) = \max\{k : \exists C \subseteq \mathcal{X}, |C| = k, \mathcal{H} \text{ shatters } C\}$

Computing VC Dimension - Key Examples

Hypothesis Class                    │ VC Dimension     │ Notes
────────────────────────────────────┼──────────────────┼────────────────────
Threshold classifiers on ℝ         │ 1                │ h(x) = 1[x ≥ θ]
Intervals on ℝ                     │ 2                │ h(x) = 1[a ≤ x ≤ b]
Halfspaces in ℝᵈ (linear SVM)      │ d + 1            │ Vapnik-Chervonenkis 1971
Convex k-gons in ℝ²                │ 2k + 1           │
Decision trees (depth d, n feats)   │ O(d log n)       │
Rectangles in ℝ²                   │ 4                │ axis-aligned boxes
Sine functions h(x) = sgn(sin(θx)) │ ∞                │ shatters any finite set!
Neural networks (ReLU, layers L)    │ O(W² L²)         │ W = total weights; loose bound

import numpy as np
import itertools
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def can_shatter(X_candidate: np.ndarray, model_class, n_restarts: int = 10) -> bool:
    """
    Empirically check if a model class can shatter a given set of points.
    Tries all 2^n binary labellings and checks if model achieves 100% accuracy.
    (Uses training accuracy as proxy - works for small n and expressive models)
    """
    n = len(X_candidate)
    for labels in itertools.product([0, 1], repeat=n):
        y = np.array(labels)
        found = False
        for seed in range(n_restarts):
            model = model_class(random_state=seed)
            try:
                model.fit(X_candidate, y)
                if accuracy_score(y, model.predict(X_candidate)) == 1.0:
                    found = True
                    break
            except Exception:
                continue
        if not found:
            return False  # Can't realise this labelling
    return True

# Check: can a logistic regression shatter 3 points in 1D?
from sklearn.linear_model import LogisticRegression
from functools import partial

lr_factory = partial(LogisticRegression, max_iter=1000)

# Test different configurations
np.random.seed(42)

# 2 points in 1D - logistic regression (halfspace) can shatter?
X_2d = np.array([[0.5], [2.0]])
print(f"LR can shatter 2 points in 1D: {can_shatter(X_2d, lr_factory)}")

# 3 collinear points in 1D - should fail (XOR pattern)
X_3_col = np.array([[0.5], [1.0], [2.0]])
print(f"LR can shatter 3 points in 1D: {can_shatter(X_3_col, lr_factory)}")

# 3 points in 2D (general position) - halfspace in 2D has VC dim 3
X_3_2d = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(f"LR can shatter 3 points in 2D: {can_shatter(X_3_2d, lr_factory)}")

# 4 points in 2D - halfspace in 2D can't shatter (VC dim = 3)
X_4_2d = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(f"LR can shatter 4 points in 2D: {can_shatter(X_4_2d, lr_factory)}")

The Sauer-Shelah Lemma

The growth function $\Pi_\mathcal{H}(m)$ counts the maximum number of distinct dichotomies $\mathcal{H}$ can produce on any $m$ points:

$\Pi_\mathcal{H}(m) = \max_{C \subseteq \mathcal{X}, |C|=m} |\{(h(x_1), \ldots, h(x_m)) : h \in \mathcal{H}\}|$

Sauer-Shelah Lemma: If $\text{VCdim}(\mathcal{H}) = d < \infty$ :

$\Pi_\mathcal{H}(m) \leq \sum_{i=0}^{d} \binom{m}{i} \leq \left(\frac{em}{d}\right)^d$

This shows that even for infinite $\mathcal{H}$ , the number of effective behaviours on $m$ points is polynomial (not exponential) in $m$ once $m > d$ .

import numpy as np
from scipy.special import comb

def growth_function_bound(m: int, vc_dim: int) -> float:
    """Sauer-Shelah upper bound on growth function."""
    if m <= vc_dim:
        return 2**m  # can shatter any m points if m ≤ VC dim
    return sum(comb(m, i, exact=False) for i in range(vc_dim + 1))

def growth_function_compact(m: int, vc_dim: int) -> float:
    """Compact form: (em/d)^d"""
    return (np.e * m / vc_dim) ** vc_dim

print("Growth function bounds (Sauer-Shelah):")
print(f"\n{'m':>6} {'VC=2':>12} {'VC=5':>12} {'VC=10':>12} {'Exponential':>14}")
print("-" * 60)
for m in [10, 50, 100, 500, 1000]:
    g2  = growth_function_bound(m, 2)
    g5  = growth_function_bound(m, 5)
    g10 = growth_function_bound(m, 10)
    exp = 2**m
    print(f"{m:>6} {g2:>12.2e} {g5:>12.2e} {g10:>12.2e} {min(exp, 1e300):>14.2e}")

print("\nFor VC dim < ∞, the growth function is POLYNOMIAL in m, not exponential.")
print("This is what makes generalisation possible!")

Part 4 - The Fundamental Theorem of Statistical Learning

Theorem (Fundamental Theorem, Shalev-Shwartz & Ben-David, 2014):

For binary classification, the following four statements are equivalent:

$\mathcal{H}$ has finite VC dimension
$\mathcal{H}$ is PAC learnable
$\mathcal{H}$ is uniformly convergent (empirical risk converges to true risk uniformly)
ERM (Empirical Risk Minimisation) is a valid learning rule for $\mathcal{H}$

Sample complexity from VC dimension:

$m_\mathcal{H}(\varepsilon, \delta) = \Theta\left(\frac{d + \ln(1/\delta)}{\varepsilon^2}\right) \quad \text{where } d = \text{VCdim}(\mathcal{H})$

Upper bound: $m \leq C \cdot \frac{d \ln(d/\varepsilon) + \ln(1/\delta)}{\varepsilon^2}$

Lower bound: $m \geq C' \cdot \frac{d + \ln(1/\delta)}{\varepsilon^2}$

def vc_sample_complexity(vc_dim: int, epsilon: float, delta: float,
                          use_upper_bound: bool = True) -> int:
    """
    Sample complexity from VC dimension.
    Upper bound: m ≤ (8/ε²) * (d*ln(13/ε) + ln(2/δ))
    Lower bound: m ≥ (d-1) / (32*ε²) [simplified]
    """
    if use_upper_bound:
        # Standard upper bound
        m = (8 / epsilon**2) * (vc_dim * np.log(13 / epsilon) + np.log(2 / delta))
    else:
        # Lower bound (must need at least this many)
        m = max(vc_dim, (1 / (32 * epsilon**2)) * np.log(1 / (4 * delta)))
    return int(np.ceil(m))

print("\nVC Dimension → Sample Complexity:")
print(f"\n{'VC dim':>8} {'ε':>6} {'δ':>6} {'Upper bound':>14} {'Lower bound':>14}")
print("-" * 55)

for d in [1, 3, 10, 50, 100]:
    eps, delt = 0.05, 0.05
    upper = vc_sample_complexity(d, eps, delt, use_upper_bound=True)
    lower = vc_sample_complexity(d, eps, delt, use_upper_bound=False)
    print(f"{d:>8} {eps:>6.2f} {delt:>6.2f} {upper:>14,} {lower:>14,}")

# Practical comparison: linear classifiers in different dimensions
print("\nLinear classifiers in d dimensions: VC dim = d+1")
for d in [10, 100, 784, 4096]:
    vc = d + 1
    m = vc_sample_complexity(vc, 0.05, 0.05)
    print(f"  d={d:5d}, VC={vc:5d}: ~{m:,} samples needed")

Part 5 - Rademacher Complexity

VC dimension is a worst-case, combinatorial measure - it doesn't depend on the actual training data distribution. Rademacher complexity is a data-dependent measure that is often tighter.

Empirical Rademacher Complexity:

$\hat{\mathfrak{R}}_S(\mathcal{H}) = \mathbb{E}_{\sigma}\left[\sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^m \sigma_i h(x_i)\right]$

where $\sigma_i \in \{-1, +1\}$ are i.i.d. Rademacher random variables (fair coin flips).

Intuition: How well can hypotheses in $\mathcal{H}$ correlate with random noise? A large Rademacher complexity means the class can fit any pattern - including random labels - which is dangerous for generalisation.

Generalisation Bound via Rademacher Complexity

With probability $\geq 1 - \delta$ over training set $S$ of size $m$ :

$L_{\mathcal{D}}(h) \leq \hat{L}_S(h) + 2\hat{\mathfrak{R}}_S(\mathcal{H}) + \sqrt{\frac{\ln(1/\delta)}{2m}}$

import numpy as np

def estimate_rademacher(X: np.ndarray, hypothesis_class_fn, n_samples: int = 100) -> float:
    """
    Monte Carlo estimate of empirical Rademacher complexity.

    Parameters:
        X:                 training points (m × d)
        hypothesis_class_fn: function(X, sigma) → max correlation achievable
                            (e.g., for linear models: max over ||w||≤1 of w·X·sigma/m)
        n_samples:         number of Rademacher sign vectors to sample

    Returns:
        Estimated empirical Rademacher complexity
    """
    m = X.shape[0]
    correlations = []

    for _ in range(n_samples):
        sigma = np.random.choice([-1.0, 1.0], size=m)
        max_corr = hypothesis_class_fn(X, sigma)
        correlations.append(max_corr)

    return np.mean(correlations)

# Example: Rademacher complexity for linear classifiers with ||w||₂ ≤ 1
# By Cauchy-Schwarz: max_{||w||≤1} w^T X^T σ / m = ||X^T σ||₂ / m
def linear_class_corr(X: np.ndarray, sigma: np.ndarray) -> float:
    """Max correlation over unit-ball linear classifiers."""
    return np.linalg.norm(X.T @ sigma) / len(X)

np.random.seed(42)
m, d = 200, 5
X_data = np.random.randn(m, d) / np.sqrt(d)  # normalised features

rad_hat = estimate_rademacher(X_data, linear_class_corr, n_samples=500)
print(f"Estimated Empirical Rademacher Complexity: {rad_hat:.4f}")

# Theoretical: for linear classifiers with features normalized by 1/sqrt(m),
# Rad(H) ≈ sqrt(max_x ||x||²/m) ≈ 1/sqrt(m) for unit-norm data
theoretical = 1.0 / np.sqrt(m)
print(f"Theoretical bound (1/√m):                 {theoretical:.4f}")

# Bound on generalisation gap
delta = 0.05
confidence_term = np.sqrt(np.log(1/delta) / (2*m))
gen_gap_bound = 2 * rad_hat + confidence_term
print(f"\nGeneralisation gap bound (95% confidence):  {gen_gap_bound:.4f}")
print(f"  = 2 × Rademacher ({2*rad_hat:.4f}) + confidence ({confidence_term:.4f})")

Connection to Model Complexity

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20,
                            n_informative=5, random_state=42)

# Demonstrate: Rademacher complexity grows with model class complexity
# Here we approximate by measuring how well models fit random labels

def measure_random_label_fit(X: np.ndarray, n_trials: int = 20, **model_kwargs) -> float:
    """
    Proxy for Rademacher complexity: accuracy on random labels.
    High random-label accuracy → high Rademacher complexity → poor generalisation bounds.
    """
    n = len(X)
    random_accuracies = []
    for _ in range(n_trials):
        y_random = np.random.randint(0, 2, n)
        model = LogisticRegression(**model_kwargs, max_iter=500, random_state=42)
        model.fit(X, y_random)
        random_accuracies.append(accuracy_score(y_random, model.predict(X)))
    return np.mean(random_accuracies)

print("Random label fit (proxy for Rademacher complexity):")
for C in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]:
    random_fit = measure_random_label_fit(X, C=C)
    real_cv    = cross_val_score(LogisticRegression(C=C, max_iter=500),
                                  X, y, cv=5).mean()
    print(f"  C={C:8.3f}: random_fit={random_fit:.3f}, real_CV={real_cv:.3f} "
          f"{'← overfit risk' if random_fit > 0.65 else ''}")

Part 6 - No Free Lunch Theorem

Theorem (No Free Lunch, Wolpert 1996): For any learning algorithm $A$ and training set size $m < |\mathcal{X}|/2$ , there exists a distribution $\mathcal{D}$ such that:

$\mathbb{E}_{S \sim \mathcal{D}^m}[L_\mathcal{D}(A(S))] \geq \frac{1}{8}$

No single algorithm dominates all others on all distributions. Every learning algorithm that works well on one distribution will fail on some other distribution.

Practical consequence: The choice of algorithm (and its inductive biases) is a claim about which distributions are more likely in your domain. Regularisation, architecture choices, data augmentation - all encode prior beliefs about the structure of the problem.

"""
NFL Theorem: Demonstrated via permutation attack

If I permute the labels in my test set randomly (unknown to you),
no learning algorithm can outperform random guessing.
The NFL theorem formalises this insight.

Practical implication:
  - There is NO universally best ML algorithm
  - Bias is not always bad - it's what makes learning possible!
  - Inductive bias = assumption about which functions are more likely
    Linear models: assume linear relationships
    Neural networks with weight decay: assume smooth functions
    CNNs: assume translational equivariance
"""

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X_real, y_real = make_classification(n_samples=500, n_features=10,
                                      n_informative=5, random_state=42)

# True signal: algorithms that match the data structure perform well
print("CV Accuracy on REAL data (with structure):")
for name, model in [
    ('Logistic Regression', LogisticRegression(max_iter=300)),
    ('Random Forest',       RandomForestClassifier(n_estimators=100, random_state=42)),
    ('SVM (RBF)',           SVC(probability=True)),
]:
    cv = cross_val_score(model, X_real, y_real, cv=5).mean()
    print(f"  {name:<25}: {cv:.4f}")

# NFL: permute labels → all algorithms equally confused
y_random = np.random.permutation(y_real)
print("\nCV Accuracy on RANDOM labels (no structure):")
for name, model in [
    ('Logistic Regression', LogisticRegression(max_iter=300)),
    ('Random Forest',       RandomForestClassifier(n_estimators=100, random_state=42)),
    ('SVM (RBF)',           SVC(probability=True)),
]:
    cv = cross_val_score(model, X_real, y_random, cv=5).mean()
    print(f"  {name:<25}: {cv:.4f}  ← should be ≈ 0.50 (random chance)")

Part 7 - Structural Risk Minimisation (SRM)

SRM provides a principled way to select model complexity by trading off empirical risk and complexity penalty.

$h^* = \arg\min_{n} \left[\min_{h \in \mathcal{H}_n} \hat{L}_S(h) + \text{complexity\_penalty}(n, m, \delta)\right]$

where $\mathcal{H}_1 \subseteq \mathcal{H}_2 \subseteq \ldots$ is a sequence of hypothesis classes of increasing complexity.

SRM in Practice:
─────────────────────────────────────────────────────────────
Hypothesis classes ordered by VC dimension:
  H₁: linear models (VC dim = d+1)
  H₂: degree-2 polynomials (larger VC dim)
  H₃: degree-3 polynomials (even larger)
  ...

For each class Hₙ:
  1. Find best model (minimise empirical risk)
  2. Compute generalisation bound (empirical risk + complexity penalty)
  3. Select the class with the best BOUND, not best training score

This is exactly what regularisation achieves:
  - Small λ → larger effective hypothesis class → lower train error but higher bound
  - Large λ → smaller effective hypothesis class → higher train error but tighter bound
  - Optimal λ → minimises the bound → best generalisation

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

np.random.seed(42)
X_poly = np.random.randn(200, 1)
y_poly = (X_poly**2 + 0.5 * X_poly + np.random.randn(200, 1) * 0.5).ravel()

train_errors = []
val_errors   = []
complexity_penalties = []
degrees = range(1, 12)

for degree in degrees:
    pipe = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('ridge', Ridge(alpha=0.001))
    ])

    # Training error
    pipe.fit(X_poly, y_poly)
    train_err = np.mean((y_poly - pipe.predict(X_poly))**2)

    # Cross-validation error
    cv_scores = -cross_val_score(pipe, X_poly, y_poly,
                                  cv=5, scoring='neg_mean_squared_error')
    val_err = cv_scores.mean()

    # Complexity penalty (crude VC-based approximation)
    vc_dim = degree + 1  # for univariate polynomial of degree d
    m = len(X_poly)
    delta = 0.05
    penalty = np.sqrt(vc_dim * np.log(m) / m + np.log(1/delta) / m)

    train_errors.append(train_err)
    val_errors.append(val_err)
    complexity_penalties.append(train_err + penalty)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.plot(list(degrees), train_errors, 'b-o', label='Train MSE')
ax1.plot(list(degrees), val_errors,   'r-o', label='Validation MSE')
ax1.axvline(np.argmin(val_errors) + 1, color='r', linestyle='--', alpha=0.7)
ax1.set_title('Bias-Variance Tradeoff in Practice\n(min val error = optimal complexity)')
ax1.set_xlabel('Polynomial Degree (≈ model complexity)')
ax1.set_ylabel('MSE')
ax1.legend()
ax1.set_yscale('log')

ax2.plot(list(degrees), complexity_penalties, 'g-o', label='Train MSE + Complexity Penalty (SRM)')
ax2.axvline(np.argmin(complexity_penalties) + 1, color='g', linestyle='--', alpha=0.7)
ax2.set_title('Structural Risk Minimisation\n(min SRM bound = theoretical optimal)')
ax2.set_xlabel('Polynomial Degree')
ax2.set_ylabel('SRM Bound')
ax2.legend()

plt.tight_layout()
print(f"Optimal degree by CV:  {np.argmin(val_errors) + 1}")
print(f"Optimal degree by SRM: {np.argmin(complexity_penalties) + 1}")

Part 8 - Why Deep Learning Challenges Classical Theory

Modern neural networks violate classical SLT predictions in important ways:

The Double Descent Phenomenon

Classical picture (bias-variance trade-off):
  Train more parameters → overfit → test error increases
  Optimal is "just right" complexity

Modern observation (double descent):
  Test error actually decreases AGAIN after the "interpolation threshold"
  (when model has enough capacity to fit training data exactly)

Double Descent Curve:
  Error │
        │    Classical zone
        │   /\
        │  /  \      Modern zone
        │ /    \    _______________
        │/      \  /
        └──────────────────────────→ Model complexity / Training epochs
                  ^
         Interpolation threshold (train error → 0)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

"""
Demonstrate double descent in an overparameterised linear model.
When p (features) > n (samples), the model interpolates the training data,
but test error may still decrease as p → ∞.
"""
np.random.seed(42)
n_train = 100
n_test  = 500
signal_dim = 20  # true signal comes from 20 features

# Generate data with sparse signal
X_full = np.random.randn(n_train + n_test, 1000)
beta_true = np.zeros(1000)
beta_true[:signal_dim] = np.random.randn(signal_dim)
y_full = X_full @ beta_true + np.random.randn(n_train + n_test) * 0.5

X_train, X_test = X_full[:n_train], X_full[n_train:]
y_train, y_test = y_full[:n_train], y_full[n_train:]

train_errors = []
test_errors  = []
p_values     = range(5, 400, 10)  # number of features

for p in p_values:
    X_tr, X_te = X_train[:, :p], X_test[:, :p]

    if p < n_train:
        # Underdetermined: standard Ridge
        model = Ridge(alpha=1e-8).fit(X_tr, y_train)
    else:
        # Overdetermined: minimum norm solution (pseudo-inverse)
        # Ridge with very small alpha approximates min-norm
        model = Ridge(alpha=1e-8).fit(X_tr, y_train)

    train_err = mean_squared_error(y_train, model.predict(X_tr))
    test_err  = mean_squared_error(y_test,  model.predict(X_te))
    train_errors.append(train_err)
    test_errors.append(test_err)

fig, ax = plt.subplots(figsize=(12, 5))
ax.semilogy(list(p_values), train_errors, 'b-', label='Train MSE')
ax.semilogy(list(p_values), test_errors,  'r-', label='Test MSE')
ax.axvline(n_train, color='gray', linestyle='--', alpha=0.7,
           label=f'Interpolation threshold (p={n_train})')
ax.set_xlabel('Number of features p')
ax.set_ylabel('MSE (log scale)')
ax.set_title('Double Descent: Overparameterisation Can Improve Test Performance')
ax.legend()
plt.tight_layout()

Why Overparameterised Models Can Generalise

Classical theory: more parameters → larger VC dimension → worse generalisation bounds.

But modern deep learning defies this. Reasons being actively researched:

1. Implicit Regularisation of SGD:
   Stochastic gradient descent preferentially finds solutions with
   small norm (minimum norm interpolating solutions) - a form of
   implicit regularisation not captured by VC theory.

2. Geometry of the Loss Landscape:
   Deep models have many equally-good minima. SGD's trajectory
   biases toward "flat" minima that generalise better.

3. Inductive Biases of Architecture:
   CNNs assume translation equivariance; Transformers assume
   attention patterns. These architectural biases effectively reduce
   the "true" complexity, even if VC dim is huge.

4. Benign Overfitting:
   In high-dimensional settings, interpolation (fitting all training
   points exactly) can occur "in the noise direction" while the
   signal component is still learned correctly.

Part 9 - Practical Bounds and Their Limitations

Understanding what SLT bounds tell you (and don't tell you) in practice:

import numpy as np

def classical_gen_bound(
    train_error: float,
    vc_dim: int,
    m: int,
    delta: float = 0.05
) -> dict:
    """
    Classical generalisation bound from VC theory.
    true_error ≤ train_error + bound_term
    """
    # VC-based bound (simplified)
    vc_term = np.sqrt((vc_dim * np.log(2 * np.e * m / vc_dim) + np.log(2/delta)) / m)

    return {
        'train_error':   train_error,
        'vc_bound_term': vc_term,
        'upper_bound':   train_error + vc_term,
        'note':          'Worst-case bound - typically very loose for deep networks'
    }

# Compute for various scenarios
print("Classical VC Generalisation Bounds:")
print(f"\n{'Scenario':<30} {'Train err':>10} {'VC bound':>12} {'Upper bound':>13}")
print("-" * 70)

scenarios = [
    ("Linear (d=100)",    100+1,          0.05, 10000),
    ("Linear (d=1000)",   1000+1,         0.05, 10000),
    ("Deep NN (W=1M)",    int(1e6),       0.001, 100000),
    ("Deep NN (W=100M)",  int(1e8),       0.001, 1000000),
]

for name, vc, train_err, m in scenarios:
    bounds = classical_gen_bound(train_err, vc, m)
    print(f"{name:<30} {bounds['train_error']:>10.4f} "
          f"{bounds['vc_bound_term']:>12.4f} {bounds['upper_bound']:>13.4f}")

print("""
IMPORTANT: For deep neural networks, classical VC bounds are loose by orders
of magnitude. A ResNet-50 (25M params) generalises to 75% on ImageNet with
100K training samples, but the VC bound predicts it should fail to generalise
at all (VC dim >> training size). This motivates PAC-Bayes and algorithm-
dependent bounds as more appropriate for deep learning theory.
""")

PAC-Bayes Bounds (Brief Introduction)

PAC-Bayes bounds are tighter for deep learning because they're algorithm-dependent:

$\mathbb{E}_{h \sim Q}[L_\mathcal{D}(h)] \leq \mathbb{E}_{h \sim Q}[\hat{L}_S(h)] + \sqrt{\frac{D_{\text{KL}}(Q \| P) + \ln(m/\delta)}{2(m-1)}}$

where $P$ is a prior (before seeing data), $Q$ is a posterior (after training). The KL term measures how much the posterior changed from the prior - a small KL (model didn't change much from its initialisation) gives a tighter bound.

YouTube Resources

Video	Channel	Focus
VC Dimension	StatQuest	VC dimension visual intuition
PAC Learning	Mutual Information	PAC framework clearly explained
Statistical Learning Theory	MIT OpenCourseWare	Full lecture on SLT foundations
Double Descent	Yannic Kilcher	Double descent phenomenon
No Free Lunch Theorem	ritvikmath	NFL implications for ML

Interview Questions

Q1: What does it mean for a hypothesis class to be PAC learnable? Give the formal condition.

A hypothesis class $\mathcal{H}$ is PAC learnable if there exists a learning algorithm $A$ and a polynomial function $m_\mathcal{H}(\varepsilon, \delta)$ such that: for any $\varepsilon, \delta > 0$ , any distribution $\mathcal{D}$ , and any training set $S$ of size $m \geq m_\mathcal{H}(\varepsilon, \delta)$ drawn i.i.d. from $\mathcal{D}$ , with probability at least $1 - \delta$ over $S$ , $A(S)$ outputs a hypothesis with $L_\mathcal{D}(h) \leq \min_{h^* \in \mathcal{H}} L_\mathcal{D}(h^*) + \varepsilon$ . The key elements: (1) $\varepsilon$ controls accuracy (how close to optimal), (2) $\delta$ controls confidence (how often we succeed), (3) the sample complexity must be polynomial in $1/\varepsilon$ and $1/\delta$ - exponential sample complexity is computationally infeasible.

Q2: What is VC dimension and why does it determine PAC learnability?

The VC (Vapnik-Chervonenkis) dimension of hypothesis class $\mathcal{H}$ is the largest set of points that $\mathcal{H}$ can shatter - i.e., for which $\mathcal{H}$ can realise every possible binary labelling. It determines PAC learnability through the Fundamental Theorem: $\mathcal{H}$ is PAC learnable if and only if its VC dimension is finite. When VC dim = $d < \infty$ , the sample complexity is $\Theta(d/\varepsilon^2)$ . The mechanism: finite VC dim implies the growth function is polynomial (Sauer-Shelah lemma), which bounds the probability that any "bad" hypothesis looks good on training data, enabling uniform convergence of empirical to true risk.

Q3: What is the No Free Lunch theorem and what does it imply for machine learning practice?

The NFL theorem states that for any learning algorithm, there exists a distribution on which it performs no better than random guessing, regardless of training set size. More precisely, when averaged over all possible target functions (uniformly), all learning algorithms perform identically - no algorithm is universally better than any other. Practical implications: (1) "Best algorithm" only makes sense relative to a specific problem domain and distribution; (2) All assumptions baked into an algorithm (linear models assume linear relationships, CNNs assume translational equivariance) are inductive biases that specialise the algorithm for certain distributions at the expense of others; (3) There's no such thing as a "free" algorithm - any bias that helps on one class of problems hurts on another; (4) The goal is to choose an algorithm whose inductive bias matches the structure of your problem.

Q4: Explain Rademacher complexity and how it differs from VC dimension.

Rademacher complexity measures how well hypothesis class $\mathcal{H}$ can fit random noise: $\hat{\mathfrak{R}}_S(\mathcal{H}) = \mathbb{E}_\sigma[\sup_{h \in \mathcal{H}} \frac{1}{m}\sum_i \sigma_i h(x_i)]$ where $\sigma_i$ are fair coin flips. High Rademacher complexity means the class can correlate with random labels - dangerous for generalisation. Key differences from VC dimension: (1) Rademacher is data-dependent - computed from the actual training distribution, giving tighter bounds when data is "easier"; VC is distribution-free (worst case). (2) Rademacher gives tight bounds for many hypothesis classes where VC bounds are loose. (3) Rademacher handles regression and multi-class naturally; VC is fundamentally binary classification. (4) Rademacher complexity decreases as $1/\sqrt{m}$ for bounded hypothesis classes - directly showing how more data tightens generalisation bounds.

Q5: Why does regularisation work, from a theoretical perspective?

From MLE/MAP: L2 regularisation is MAP estimation with a Gaussian prior over parameters. Stronger regularisation (larger $\lambda$ ) encodes a stronger prior belief that parameters are near zero, preventing the model from overfitting to noise. From SLT: regularisation effectively reduces the hypothesis class. L2 regularisation with constraint $\|w\|_2 \leq B$ creates hypothesis class $\mathcal{H}_B = \{x \mapsto w^Tx : \|w\|_2 \leq B\}$ . The Rademacher complexity of this class is $\mathcal{O}(B/\sqrt{m})$ - smaller $B$ (stronger regularisation) gives tighter generalisation bounds. From SRM: regularisation is implicit model selection, choosing the optimal hypothesis class complexity for the training set size. Theoretically, optimal regularisation strength grows as $1/\sqrt{m}$ - which is why you typically need less regularisation as you gather more training data.

Q6: What is the double descent phenomenon and why does it challenge classical bias-variance theory?

Classical bias-variance theory predicts a U-shaped test error curve: increasing model complexity first decreases bias (test error improves), then increases variance (overfitting) past the optimal point. Double descent shows that for overparameterised models (more parameters than training samples), test error can actually decrease again after initially increasing - forming a second descent. The "interpolation threshold" (where the model can fit training data exactly) is the peak. After it, further increasing capacity leads to better generalisation. Classical VC theory can't explain this: more parameters → larger VC dim → looser generalisation bounds → should generalise worse. The explanation comes from implicit regularisation of gradient descent: SGD finds minimum-norm solutions among all solutions that perfectly fit training data. In high dimensions, minimum-norm solutions can still generalise well because the "noise" is spread across many directions while the signal is concentrated.

Q7: What is the Sauer-Shelah lemma and why is it crucial for SLT?

The Sauer-Shelah lemma bounds the growth function - the maximum number of distinct classifications a hypothesis class can produce on $m$ points - by a polynomial when VC dim $d < \infty$ : $\Pi_\mathcal{H}(m) \leq \sum_{i=0}^d \binom{m}{i} \leq (em/d)^d$ . It's crucial because: (1) Without it, infinite hypothesis classes could produce $2^m$ behaviours on $m$ points, making uniform convergence impossible; (2) Polynomial growth means the effective number of "distinct models" grows slowly with $m$ , allowing union bound arguments to control the probability of overfitting across all hypotheses; (3) It's the bridge between the combinatorial property (VC dimension) and the probabilistic guarantee (PAC bound) - finite VC dim → polynomial growth function → uniform convergence → PAC learnability.

Q8: How do PAC-Bayes bounds improve on VC dimension bounds for deep learning?

Classical VC bounds are typically vacuous for deep neural networks - they predict generalisation gaps of hundreds of percent when in practice networks generalise within a few percent. PAC-Bayes bounds take an algorithm-dependent perspective: instead of bounding the worst case over all hypotheses in $\mathcal{H}$ , they bound the expected generalisation of a posterior distribution $Q$ over hypotheses: true error ≤ empirical error + $\sqrt{D_{\text{KL}}(Q\|P)/2(m-1)}$ . The KL term measures how far the posterior (learned model) moved from the prior (initialisation). For networks trained with SGD from small random initialisation, the KL divergence is often much smaller than the VC dimension predicts, giving tighter bounds. Additionally, PAC-Bayes bounds can capture the geometry of the loss landscape - flat minima (small KL from prior) generalise better than sharp minima, which empirically matches the success of large batch vs small batch training and explicit flatness-seeking optimisers like SAM.

:::tip Role-Specific Angles Research Engineer Interview: PAC bounds derivation, VC dimension computation, Rademacher complexity, double descent, PAC-Bayes MLE Interview: NFL theorem implications, SRM for model selection, why regularisation works theoretically, sample complexity planning Data Scientist: Sample size calculations from error tolerance, why VC dim matters for feature selection, practical generalisation bound interpretation AI Engineer: Implicit regularisation in SGD, flat minima and generalisation, why overparameterised models work, inductive biases of architectures :::

:::tip 🎮 Interactive Playground

Visualize this concept: Try the PAC Learning demo on the EngineersOfAI Playground - no code required.

:::

What You Will Learn​

Part 1 - The Generalisation Problem, Formally​

Setup​

Part 2 - PAC Learning​

The Framework​

Finite Hypothesis Classes - A Concrete Bound​

The Union Bound Proof Sketch​

Part 3 - VC Dimension​

Shattering​

Computing VC Dimension - Key Examples​

The Sauer-Shelah Lemma​

Part 4 - The Fundamental Theorem of Statistical Learning​

Part 5 - Rademacher Complexity​

Generalisation Bound via Rademacher Complexity​

Connection to Model Complexity​

Part 6 - No Free Lunch Theorem​

Part 7 - Structural Risk Minimisation (SRM)​

Part 8 - Why Deep Learning Challenges Classical Theory​

The Double Descent Phenomenon​

Why Overparameterised Models Can Generalise​

Part 9 - Practical Bounds and Their Limitations​

PAC-Bayes Bounds (Brief Introduction)​

YouTube Resources​

Interview Questions​