Skip to main content

Statistical Learning Theory

Reading time: ~40 minutes | Level: ML Foundations | Role: MLE, Research Engineer, Data Scientist

A researcher presents a new neural architecture claiming 99% accuracy on a benchmark. A reviewer asks: "But what is the VC dimension of your model? How many samples would you need to guarantee generalisation within 1% of the true error with 95% confidence?"

If this question sounds foreign, you're missing the mathematical bedrock of machine learning. Statistical Learning Theory (SLT) is the branch of mathematics that rigorously answers: when can a machine learning algorithm be guaranteed to generalise, how much data is required, and what properties of the hypothesis class determine learnability. This is the language spoken at ML research frontiers.

What You Will Learn

  • The PAC learning framework: when and why learning is guaranteed to work
  • Sample complexity: how many examples are needed to learn with confidence
  • VC dimension: measuring the expressiveness of a hypothesis class
  • Sauer's Lemma and the growth function - the combinatorial heart of generalisation
  • Rademacher complexity: a data-dependent bound on generalisation
  • The Fundamental Theorem of Statistical Learning
  • Why regularisation works from a theoretical perspective
  • Structural Risk Minimisation (SRM) - model selection from first principles
  • Connection to modern deep learning: why do overparameterised models generalise?
  • Eight interview Q&As at senior research engineer level

Part 1 - The Generalisation Problem, Formally

Setup

Let:

  • X\mathcal{X} = input space, Y\mathcal{Y} = label space
  • D\mathcal{D} = unknown data distribution over X×Y\mathcal{X} \times \mathcal{Y}
  • Training set S={(x1,y1),,(xm,ym)}S = \{(x_1, y_1), \ldots, (x_m, y_m)\} drawn i.i.d. from D\mathcal{D}
  • Hypothesis class H\mathcal{H} = set of all models we can express
  • Loss function :Y×YR0\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\geq 0}

True risk (what we care about): LD(h)=E(x,y)D[(h(x),y)]L_{\mathcal{D}}(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(h(x), y)]

Empirical risk (what we can compute): L^S(h)=1mi=1m(h(xi),yi)\hat{L}_S(h) = \frac{1}{m} \sum_{i=1}^m \ell(h(x_i), y_i)

The core question: Under what conditions does minimising L^S(h)\hat{L}_S(h) lead to small LD(h)L_{\mathcal{D}}(h)?

The generalisation gap is LD(h)L^S(h)L_{\mathcal{D}}(h) - \hat{L}_S(h). SLT provides:

  • Upper bounds on the generalisation gap with high probability
  • Conditions under which ERM (Empirical Risk Minimisation) is guaranteed to work

Part 2 - PAC Learning

The Framework

PAC stands for Probably Approximately Correct (Valiant, 1984). A learning algorithm AA PAC-learns hypothesis class H\mathcal{H} if, for any data distribution D\mathcal{D}, any ε>0\varepsilon > 0 (accuracy), and any δ>0\delta > 0 (confidence), there exists a sample size mH(ε,δ)m_\mathcal{H}(\varepsilon, \delta) such that for any mmHm \geq m_\mathcal{H}:

PrSDm[LD(A(S))minhHLD(h)+ε]1δ\Pr_{S \sim \mathcal{D}^m}\left[L_{\mathcal{D}}(A(S)) \leq \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \varepsilon\right] \geq 1 - \delta

Translation: With probability at least 1δ1 - \delta over the random draw of training data, the algorithm finds a hypothesis within ε\varepsilon of the best possible in H\mathcal{H}.

The PAC learning guarantee in plain English:

"Give me enough data (how much depends on ε and δ),
and with high probability (≥ 1-δ),
I'll output a model that's approximately (within ε)
as good as any model in my hypothesis class."

The two knobs:
ε (epsilon) - accuracy: how close to optimal?
δ (delta) - confidence: how sure are you?

There's always a trade-off:
Want 99% confidence AND 0.1% error tolerance?
→ You'll need a LOT more data.

Finite Hypothesis Classes - A Concrete Bound

For a finite hypothesis class H<|\mathcal{H}| < \infty with 0-1 loss:

m12ε2(lnH+ln1δ)    PAC guarantee holdsm \geq \frac{1}{2\varepsilon^2}\left(\ln |\mathcal{H}| + \ln \frac{1}{\delta}\right) \implies \text{PAC guarantee holds}

This says: you need samples proportional to the log of the number of hypotheses. This is good news - even if H=10100|\mathcal{H}| = 10^{100} hypotheses, ln(10100)230\ln(10^{100}) \approx 230, which is manageable.

import numpy as np

def pac_sample_complexity_finite(n_hypotheses: int, epsilon: float, delta: float) -> int:
"""
Minimum samples for PAC learning a finite hypothesis class.

Parameters:
n_hypotheses: size of hypothesis class |H|
epsilon: accuracy parameter (tolerance from optimal)
delta: confidence parameter (failure probability)

Returns:
Minimum sample size m guaranteeing PAC learning
"""
m = (1 / (2 * epsilon**2)) * (np.log(n_hypotheses) + np.log(1 / delta))
return int(np.ceil(m))

print("PAC Sample Complexity - Finite Hypothesis Class:")
print(f"\n{'|H|':>15} {'ε':>8} {'δ':>8} {'m (samples)':>15}")
print("-" * 55)

test_cases = [
(100, 0.05, 0.05),
(10_000, 0.05, 0.05),
(1e10, 0.05, 0.05),
(1e10, 0.01, 0.05),
(1e10, 0.01, 0.01),
(2**64, 0.05, 0.05), # 64-bit decision tree with binary features
]

for nh, eps, delt in test_cases:
m = pac_sample_complexity_finite(int(nh), eps, delt)
print(f"{nh:>15.0e} {eps:>8.2f} {delt:>8.2f} {m:>15,}")

print(f"""
Observations:
- The sample complexity grows as log(|H|), not linearly in |H|
- Halving ε (double accuracy) requires ~4x more data (1/ε² dependence)
- Halving δ adds only log(2) ≈ 0.7 to the bound - confidence is "cheap"
- Even |H| = 2^64 requires only ~650 samples for ε=0.05, δ=0.05
""")

The Union Bound Proof Sketch

The PAC bound for finite classes follows from the Union Bound and Hoeffding's inequality:

Proof sketch:
1. Fix any "bad" hypothesis h with L_D(h) > ε
2. Hoeffding: P[L_hat_S(h) ≤ 0 | L_D(h) > ε] ≤ exp(-2mε²)
(bad h looks good on training with exponentially small probability)

3. Union bound over all |H| hypotheses:
P[∃h ∈ H: h is bad but looks good] ≤ |H| · exp(-2mε²)

4. Set this ≤ δ and solve for m:
|H| · exp(-2mε²) ≤ δ
m ≥ [ln|H| + ln(1/δ)] / (2ε²)

Part 3 - VC Dimension

Finite hypothesis classes are too limiting. We need to handle infinite hypothesis classes (all possible linear classifiers, all possible neural networks, etc.). VC dimension extends PAC theory to this case.

Shattering

A hypothesis class H\mathcal{H} shatters a set C={x1,,xk}C = \{x_1, \ldots, x_k\} if for every possible binary labelling b=(b1,,bk){0,1}kb = (b_1, \ldots, b_k) \in \{0,1\}^k, there exists some hHh \in \mathcal{H} that correctly classifies all points:

b{0,1}k,hH:h(xi)=bi  i\forall b \in \{0,1\}^k, \exists h \in \mathcal{H}: h(x_i) = b_i \; \forall i

VC Dimension of H\mathcal{H} is the largest set size that H\mathcal{H} can shatter:

VCdim(H)=max{k:CX,C=k,H shatters C}\text{VCdim}(\mathcal{H}) = \max\{k : \exists C \subseteq \mathcal{X}, |C| = k, \mathcal{H} \text{ shatters } C\}

Computing VC Dimension - Key Examples

Hypothesis Class │ VC Dimension │ Notes
────────────────────────────────────┼──────────────────┼────────────────────
Threshold classifiers on ℝ │ 1 │ h(x) = 1[x ≥ θ]
Intervals on ℝ │ 2 │ h(x) = 1[a ≤ x ≤ b]
Halfspaces in ℝᵈ (linear SVM) │ d + 1 │ Vapnik-Chervonenkis 1971
Convex k-gons in ℝ² │ 2k + 1 │
Decision trees (depth d, n feats) │ O(d log n) │
Rectangles in ℝ² │ 4 │ axis-aligned boxes
Sine functions h(x) = sgn(sin(θx)) │ ∞ │ shatters any finite set!
Neural networks (ReLU, layers L) │ O(W² L²) │ W = total weights; loose bound
import numpy as np
import itertools
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def can_shatter(X_candidate: np.ndarray, model_class, n_restarts: int = 10) -> bool:
"""
Empirically check if a model class can shatter a given set of points.
Tries all 2^n binary labellings and checks if model achieves 100% accuracy.
(Uses training accuracy as proxy - works for small n and expressive models)
"""
n = len(X_candidate)
for labels in itertools.product([0, 1], repeat=n):
y = np.array(labels)
found = False
for seed in range(n_restarts):
model = model_class(random_state=seed)
try:
model.fit(X_candidate, y)
if accuracy_score(y, model.predict(X_candidate)) == 1.0:
found = True
break
except Exception:
continue
if not found:
return False # Can't realise this labelling
return True

# Check: can a logistic regression shatter 3 points in 1D?
from sklearn.linear_model import LogisticRegression
from functools import partial

lr_factory = partial(LogisticRegression, max_iter=1000)

# Test different configurations
np.random.seed(42)

# 2 points in 1D - logistic regression (halfspace) can shatter?
X_2d = np.array([[0.5], [2.0]])
print(f"LR can shatter 2 points in 1D: {can_shatter(X_2d, lr_factory)}")

# 3 collinear points in 1D - should fail (XOR pattern)
X_3_col = np.array([[0.5], [1.0], [2.0]])
print(f"LR can shatter 3 points in 1D: {can_shatter(X_3_col, lr_factory)}")

# 3 points in 2D (general position) - halfspace in 2D has VC dim 3
X_3_2d = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(f"LR can shatter 3 points in 2D: {can_shatter(X_3_2d, lr_factory)}")

# 4 points in 2D - halfspace in 2D can't shatter (VC dim = 3)
X_4_2d = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(f"LR can shatter 4 points in 2D: {can_shatter(X_4_2d, lr_factory)}")

The Sauer-Shelah Lemma

The growth function ΠH(m)\Pi_\mathcal{H}(m) counts the maximum number of distinct dichotomies H\mathcal{H} can produce on any mm points:

ΠH(m)=maxCX,C=m{(h(x1),,h(xm)):hH}\Pi_\mathcal{H}(m) = \max_{C \subseteq \mathcal{X}, |C|=m} |\{(h(x_1), \ldots, h(x_m)) : h \in \mathcal{H}\}|

Sauer-Shelah Lemma: If VCdim(H)=d<\text{VCdim}(\mathcal{H}) = d < \infty:

ΠH(m)i=0d(mi)(emd)d\Pi_\mathcal{H}(m) \leq \sum_{i=0}^{d} \binom{m}{i} \leq \left(\frac{em}{d}\right)^d

This shows that even for infinite H\mathcal{H}, the number of effective behaviours on mm points is polynomial (not exponential) in mm once m>dm > d.

import numpy as np
from scipy.special import comb

def growth_function_bound(m: int, vc_dim: int) -> float:
"""Sauer-Shelah upper bound on growth function."""
if m <= vc_dim:
return 2**m # can shatter any m points if m ≤ VC dim
return sum(comb(m, i, exact=False) for i in range(vc_dim + 1))

def growth_function_compact(m: int, vc_dim: int) -> float:
"""Compact form: (em/d)^d"""
return (np.e * m / vc_dim) ** vc_dim

print("Growth function bounds (Sauer-Shelah):")
print(f"\n{'m':>6} {'VC=2':>12} {'VC=5':>12} {'VC=10':>12} {'Exponential':>14}")
print("-" * 60)
for m in [10, 50, 100, 500, 1000]:
g2 = growth_function_bound(m, 2)
g5 = growth_function_bound(m, 5)
g10 = growth_function_bound(m, 10)
exp = 2**m
print(f"{m:>6} {g2:>12.2e} {g5:>12.2e} {g10:>12.2e} {min(exp, 1e300):>14.2e}")

print("\nFor VC dim < ∞, the growth function is POLYNOMIAL in m, not exponential.")
print("This is what makes generalisation possible!")

Part 4 - The Fundamental Theorem of Statistical Learning

Theorem (Fundamental Theorem, Shalev-Shwartz & Ben-David, 2014):

For binary classification, the following four statements are equivalent:

  1. H\mathcal{H} has finite VC dimension
  2. H\mathcal{H} is PAC learnable
  3. H\mathcal{H} is uniformly convergent (empirical risk converges to true risk uniformly)
  4. ERM (Empirical Risk Minimisation) is a valid learning rule for H\mathcal{H}

Sample complexity from VC dimension:

mH(ε,δ)=Θ(d+ln(1/δ)ε2)where d=VCdim(H)m_\mathcal{H}(\varepsilon, \delta) = \Theta\left(\frac{d + \ln(1/\delta)}{\varepsilon^2}\right) \quad \text{where } d = \text{VCdim}(\mathcal{H})

Upper bound: mCdln(d/ε)+ln(1/δ)ε2m \leq C \cdot \frac{d \ln(d/\varepsilon) + \ln(1/\delta)}{\varepsilon^2}

Lower bound: mCd+ln(1/δ)ε2m \geq C' \cdot \frac{d + \ln(1/\delta)}{\varepsilon^2}

def vc_sample_complexity(vc_dim: int, epsilon: float, delta: float,
use_upper_bound: bool = True) -> int:
"""
Sample complexity from VC dimension.
Upper bound: m ≤ (8/ε²) * (d*ln(13/ε) + ln(2/δ))
Lower bound: m ≥ (d-1) / (32*ε²) [simplified]
"""
if use_upper_bound:
# Standard upper bound
m = (8 / epsilon**2) * (vc_dim * np.log(13 / epsilon) + np.log(2 / delta))
else:
# Lower bound (must need at least this many)
m = max(vc_dim, (1 / (32 * epsilon**2)) * np.log(1 / (4 * delta)))
return int(np.ceil(m))

print("\nVC Dimension → Sample Complexity:")
print(f"\n{'VC dim':>8} {'ε':>6} {'δ':>6} {'Upper bound':>14} {'Lower bound':>14}")
print("-" * 55)

for d in [1, 3, 10, 50, 100]:
eps, delt = 0.05, 0.05
upper = vc_sample_complexity(d, eps, delt, use_upper_bound=True)
lower = vc_sample_complexity(d, eps, delt, use_upper_bound=False)
print(f"{d:>8} {eps:>6.2f} {delt:>6.2f} {upper:>14,} {lower:>14,}")

# Practical comparison: linear classifiers in different dimensions
print("\nLinear classifiers in d dimensions: VC dim = d+1")
for d in [10, 100, 784, 4096]:
vc = d + 1
m = vc_sample_complexity(vc, 0.05, 0.05)
print(f" d={d:5d}, VC={vc:5d}: ~{m:,} samples needed")

Part 5 - Rademacher Complexity

VC dimension is a worst-case, combinatorial measure - it doesn't depend on the actual training data distribution. Rademacher complexity is a data-dependent measure that is often tighter.

Empirical Rademacher Complexity:

R^S(H)=Eσ[suphH1mi=1mσih(xi)]\hat{\mathfrak{R}}_S(\mathcal{H}) = \mathbb{E}_{\sigma}\left[\sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^m \sigma_i h(x_i)\right]

where σi{1,+1}\sigma_i \in \{-1, +1\} are i.i.d. Rademacher random variables (fair coin flips).

Intuition: How well can hypotheses in H\mathcal{H} correlate with random noise? A large Rademacher complexity means the class can fit any pattern - including random labels - which is dangerous for generalisation.

Generalisation Bound via Rademacher Complexity

With probability 1δ\geq 1 - \delta over training set SS of size mm:

LD(h)L^S(h)+2R^S(H)+ln(1/δ)2mL_{\mathcal{D}}(h) \leq \hat{L}_S(h) + 2\hat{\mathfrak{R}}_S(\mathcal{H}) + \sqrt{\frac{\ln(1/\delta)}{2m}}

import numpy as np

def estimate_rademacher(X: np.ndarray, hypothesis_class_fn, n_samples: int = 100) -> float:
"""
Monte Carlo estimate of empirical Rademacher complexity.

Parameters:
X: training points (m × d)
hypothesis_class_fn: function(X, sigma) → max correlation achievable
(e.g., for linear models: max over ||w||≤1 of w·X·sigma/m)
n_samples: number of Rademacher sign vectors to sample

Returns:
Estimated empirical Rademacher complexity
"""
m = X.shape[0]
correlations = []

for _ in range(n_samples):
sigma = np.random.choice([-1.0, 1.0], size=m)
max_corr = hypothesis_class_fn(X, sigma)
correlations.append(max_corr)

return np.mean(correlations)

# Example: Rademacher complexity for linear classifiers with ||w||₂ ≤ 1
# By Cauchy-Schwarz: max_{||w||≤1} w^T X^T σ / m = ||X^T σ||₂ / m
def linear_class_corr(X: np.ndarray, sigma: np.ndarray) -> float:
"""Max correlation over unit-ball linear classifiers."""
return np.linalg.norm(X.T @ sigma) / len(X)

np.random.seed(42)
m, d = 200, 5
X_data = np.random.randn(m, d) / np.sqrt(d) # normalised features

rad_hat = estimate_rademacher(X_data, linear_class_corr, n_samples=500)
print(f"Estimated Empirical Rademacher Complexity: {rad_hat:.4f}")

# Theoretical: for linear classifiers with features normalized by 1/sqrt(m),
# Rad(H) ≈ sqrt(max_x ||x||²/m) ≈ 1/sqrt(m) for unit-norm data
theoretical = 1.0 / np.sqrt(m)
print(f"Theoretical bound (1/√m): {theoretical:.4f}")

# Bound on generalisation gap
delta = 0.05
confidence_term = np.sqrt(np.log(1/delta) / (2*m))
gen_gap_bound = 2 * rad_hat + confidence_term
print(f"\nGeneralisation gap bound (95% confidence): {gen_gap_bound:.4f}")
print(f" = 2 × Rademacher ({2*rad_hat:.4f}) + confidence ({confidence_term:.4f})")

Connection to Model Complexity

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20,
n_informative=5, random_state=42)

# Demonstrate: Rademacher complexity grows with model class complexity
# Here we approximate by measuring how well models fit random labels

def measure_random_label_fit(X: np.ndarray, n_trials: int = 20, **model_kwargs) -> float:
"""
Proxy for Rademacher complexity: accuracy on random labels.
High random-label accuracy → high Rademacher complexity → poor generalisation bounds.
"""
n = len(X)
random_accuracies = []
for _ in range(n_trials):
y_random = np.random.randint(0, 2, n)
model = LogisticRegression(**model_kwargs, max_iter=500, random_state=42)
model.fit(X, y_random)
random_accuracies.append(accuracy_score(y_random, model.predict(X)))
return np.mean(random_accuracies)

print("Random label fit (proxy for Rademacher complexity):")
for C in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]:
random_fit = measure_random_label_fit(X, C=C)
real_cv = cross_val_score(LogisticRegression(C=C, max_iter=500),
X, y, cv=5).mean()
print(f" C={C:8.3f}: random_fit={random_fit:.3f}, real_CV={real_cv:.3f} "
f"{'← overfit risk' if random_fit > 0.65 else ''}")

Part 6 - No Free Lunch Theorem

Theorem (No Free Lunch, Wolpert 1996): For any learning algorithm AA and training set size m<X/2m < |\mathcal{X}|/2, there exists a distribution D\mathcal{D} such that:

ESDm[LD(A(S))]18\mathbb{E}_{S \sim \mathcal{D}^m}[L_\mathcal{D}(A(S))] \geq \frac{1}{8}

No single algorithm dominates all others on all distributions. Every learning algorithm that works well on one distribution will fail on some other distribution.

Practical consequence: The choice of algorithm (and its inductive biases) is a claim about which distributions are more likely in your domain. Regularisation, architecture choices, data augmentation - all encode prior beliefs about the structure of the problem.

"""
NFL Theorem: Demonstrated via permutation attack

If I permute the labels in my test set randomly (unknown to you),
no learning algorithm can outperform random guessing.
The NFL theorem formalises this insight.

Practical implication:
- There is NO universally best ML algorithm
- Bias is not always bad - it's what makes learning possible!
- Inductive bias = assumption about which functions are more likely
Linear models: assume linear relationships
Neural networks with weight decay: assume smooth functions
CNNs: assume translational equivariance
"""

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X_real, y_real = make_classification(n_samples=500, n_features=10,
n_informative=5, random_state=42)

# True signal: algorithms that match the data structure perform well
print("CV Accuracy on REAL data (with structure):")
for name, model in [
('Logistic Regression', LogisticRegression(max_iter=300)),
('Random Forest', RandomForestClassifier(n_estimators=100, random_state=42)),
('SVM (RBF)', SVC(probability=True)),
]:
cv = cross_val_score(model, X_real, y_real, cv=5).mean()
print(f" {name:<25}: {cv:.4f}")

# NFL: permute labels → all algorithms equally confused
y_random = np.random.permutation(y_real)
print("\nCV Accuracy on RANDOM labels (no structure):")
for name, model in [
('Logistic Regression', LogisticRegression(max_iter=300)),
('Random Forest', RandomForestClassifier(n_estimators=100, random_state=42)),
('SVM (RBF)', SVC(probability=True)),
]:
cv = cross_val_score(model, X_real, y_random, cv=5).mean()
print(f" {name:<25}: {cv:.4f} ← should be ≈ 0.50 (random chance)")

Part 7 - Structural Risk Minimisation (SRM)

SRM provides a principled way to select model complexity by trading off empirical risk and complexity penalty.

h=argminn[minhHnL^S(h)+complexity_penalty(n,m,δ)]h^* = \arg\min_{n} \left[\min_{h \in \mathcal{H}_n} \hat{L}_S(h) + \text{complexity\_penalty}(n, m, \delta)\right]

where H1H2\mathcal{H}_1 \subseteq \mathcal{H}_2 \subseteq \ldots is a sequence of hypothesis classes of increasing complexity.

SRM in Practice:
─────────────────────────────────────────────────────────────
Hypothesis classes ordered by VC dimension:
H₁: linear models (VC dim = d+1)
H₂: degree-2 polynomials (larger VC dim)
H₃: degree-3 polynomials (even larger)
...

For each class Hₙ:
1. Find best model (minimise empirical risk)
2. Compute generalisation bound (empirical risk + complexity penalty)
3. Select the class with the best BOUND, not best training score

This is exactly what regularisation achieves:
- Small λ → larger effective hypothesis class → lower train error but higher bound
- Large λ → smaller effective hypothesis class → higher train error but tighter bound
- Optimal λ → minimises the bound → best generalisation
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

np.random.seed(42)
X_poly = np.random.randn(200, 1)
y_poly = (X_poly**2 + 0.5 * X_poly + np.random.randn(200, 1) * 0.5).ravel()

train_errors = []
val_errors = []
complexity_penalties = []
degrees = range(1, 12)

for degree in degrees:
pipe = Pipeline([
('poly', PolynomialFeatures(degree=degree)),
('ridge', Ridge(alpha=0.001))
])

# Training error
pipe.fit(X_poly, y_poly)
train_err = np.mean((y_poly - pipe.predict(X_poly))**2)

# Cross-validation error
cv_scores = -cross_val_score(pipe, X_poly, y_poly,
cv=5, scoring='neg_mean_squared_error')
val_err = cv_scores.mean()

# Complexity penalty (crude VC-based approximation)
vc_dim = degree + 1 # for univariate polynomial of degree d
m = len(X_poly)
delta = 0.05
penalty = np.sqrt(vc_dim * np.log(m) / m + np.log(1/delta) / m)

train_errors.append(train_err)
val_errors.append(val_err)
complexity_penalties.append(train_err + penalty)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

ax1.plot(list(degrees), train_errors, 'b-o', label='Train MSE')
ax1.plot(list(degrees), val_errors, 'r-o', label='Validation MSE')
ax1.axvline(np.argmin(val_errors) + 1, color='r', linestyle='--', alpha=0.7)
ax1.set_title('Bias-Variance Tradeoff in Practice\n(min val error = optimal complexity)')
ax1.set_xlabel('Polynomial Degree (≈ model complexity)')
ax1.set_ylabel('MSE')
ax1.legend()
ax1.set_yscale('log')

ax2.plot(list(degrees), complexity_penalties, 'g-o', label='Train MSE + Complexity Penalty (SRM)')
ax2.axvline(np.argmin(complexity_penalties) + 1, color='g', linestyle='--', alpha=0.7)
ax2.set_title('Structural Risk Minimisation\n(min SRM bound = theoretical optimal)')
ax2.set_xlabel('Polynomial Degree')
ax2.set_ylabel('SRM Bound')
ax2.legend()

plt.tight_layout()
print(f"Optimal degree by CV: {np.argmin(val_errors) + 1}")
print(f"Optimal degree by SRM: {np.argmin(complexity_penalties) + 1}")

Part 8 - Why Deep Learning Challenges Classical Theory

Modern neural networks violate classical SLT predictions in important ways:

The Double Descent Phenomenon

Classical picture (bias-variance trade-off):
Train more parameters → overfit → test error increases
Optimal is "just right" complexity

Modern observation (double descent):
Test error actually decreases AGAIN after the "interpolation threshold"
(when model has enough capacity to fit training data exactly)

Double Descent Curve:
Error │
│ Classical zone
│ /\
│ / \ Modern zone
│ / \ _______________
│/ \ /
└──────────────────────────→ Model complexity / Training epochs
^
Interpolation threshold (train error → 0)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

"""
Demonstrate double descent in an overparameterised linear model.
When p (features) > n (samples), the model interpolates the training data,
but test error may still decrease as p → ∞.
"""
np.random.seed(42)
n_train = 100
n_test = 500
signal_dim = 20 # true signal comes from 20 features

# Generate data with sparse signal
X_full = np.random.randn(n_train + n_test, 1000)
beta_true = np.zeros(1000)
beta_true[:signal_dim] = np.random.randn(signal_dim)
y_full = X_full @ beta_true + np.random.randn(n_train + n_test) * 0.5

X_train, X_test = X_full[:n_train], X_full[n_train:]
y_train, y_test = y_full[:n_train], y_full[n_train:]

train_errors = []
test_errors = []
p_values = range(5, 400, 10) # number of features

for p in p_values:
X_tr, X_te = X_train[:, :p], X_test[:, :p]

if p < n_train:
# Underdetermined: standard Ridge
model = Ridge(alpha=1e-8).fit(X_tr, y_train)
else:
# Overdetermined: minimum norm solution (pseudo-inverse)
# Ridge with very small alpha approximates min-norm
model = Ridge(alpha=1e-8).fit(X_tr, y_train)

train_err = mean_squared_error(y_train, model.predict(X_tr))
test_err = mean_squared_error(y_test, model.predict(X_te))
train_errors.append(train_err)
test_errors.append(test_err)

fig, ax = plt.subplots(figsize=(12, 5))
ax.semilogy(list(p_values), train_errors, 'b-', label='Train MSE')
ax.semilogy(list(p_values), test_errors, 'r-', label='Test MSE')
ax.axvline(n_train, color='gray', linestyle='--', alpha=0.7,
label=f'Interpolation threshold (p={n_train})')
ax.set_xlabel('Number of features p')
ax.set_ylabel('MSE (log scale)')
ax.set_title('Double Descent: Overparameterisation Can Improve Test Performance')
ax.legend()
plt.tight_layout()

Why Overparameterised Models Can Generalise

Classical theory: more parameters → larger VC dimension → worse generalisation bounds.

But modern deep learning defies this. Reasons being actively researched:

1. Implicit Regularisation of SGD:
Stochastic gradient descent preferentially finds solutions with
small norm (minimum norm interpolating solutions) - a form of
implicit regularisation not captured by VC theory.

2. Geometry of the Loss Landscape:
Deep models have many equally-good minima. SGD's trajectory
biases toward "flat" minima that generalise better.

3. Inductive Biases of Architecture:
CNNs assume translation equivariance; Transformers assume
attention patterns. These architectural biases effectively reduce
the "true" complexity, even if VC dim is huge.

4. Benign Overfitting:
In high-dimensional settings, interpolation (fitting all training
points exactly) can occur "in the noise direction" while the
signal component is still learned correctly.

Part 9 - Practical Bounds and Their Limitations

Understanding what SLT bounds tell you (and don't tell you) in practice:

import numpy as np

def classical_gen_bound(
train_error: float,
vc_dim: int,
m: int,
delta: float = 0.05
) -> dict:
"""
Classical generalisation bound from VC theory.
true_error ≤ train_error + bound_term
"""
# VC-based bound (simplified)
vc_term = np.sqrt((vc_dim * np.log(2 * np.e * m / vc_dim) + np.log(2/delta)) / m)

return {
'train_error': train_error,
'vc_bound_term': vc_term,
'upper_bound': train_error + vc_term,
'note': 'Worst-case bound - typically very loose for deep networks'
}

# Compute for various scenarios
print("Classical VC Generalisation Bounds:")
print(f"\n{'Scenario':<30} {'Train err':>10} {'VC bound':>12} {'Upper bound':>13}")
print("-" * 70)

scenarios = [
("Linear (d=100)", 100+1, 0.05, 10000),
("Linear (d=1000)", 1000+1, 0.05, 10000),
("Deep NN (W=1M)", int(1e6), 0.001, 100000),
("Deep NN (W=100M)", int(1e8), 0.001, 1000000),
]

for name, vc, train_err, m in scenarios:
bounds = classical_gen_bound(train_err, vc, m)
print(f"{name:<30} {bounds['train_error']:>10.4f} "
f"{bounds['vc_bound_term']:>12.4f} {bounds['upper_bound']:>13.4f}")

print("""
IMPORTANT: For deep neural networks, classical VC bounds are loose by orders
of magnitude. A ResNet-50 (25M params) generalises to 75% on ImageNet with
100K training samples, but the VC bound predicts it should fail to generalise
at all (VC dim >> training size). This motivates PAC-Bayes and algorithm-
dependent bounds as more appropriate for deep learning theory.
""")

PAC-Bayes Bounds (Brief Introduction)

PAC-Bayes bounds are tighter for deep learning because they're algorithm-dependent:

EhQ[LD(h)]EhQ[L^S(h)]+DKL(QP)+ln(m/δ)2(m1)\mathbb{E}_{h \sim Q}[L_\mathcal{D}(h)] \leq \mathbb{E}_{h \sim Q}[\hat{L}_S(h)] + \sqrt{\frac{D_{\text{KL}}(Q \| P) + \ln(m/\delta)}{2(m-1)}}

where PP is a prior (before seeing data), QQ is a posterior (after training). The KL term measures how much the posterior changed from the prior - a small KL (model didn't change much from its initialisation) gives a tighter bound.

YouTube Resources

VideoChannelFocus
VC DimensionStatQuestVC dimension visual intuition
PAC LearningMutual InformationPAC framework clearly explained
Statistical Learning TheoryMIT OpenCourseWareFull lecture on SLT foundations
Double DescentYannic KilcherDouble descent phenomenon
No Free Lunch TheoremritvikmathNFL implications for ML

Interview Questions

Q1: What does it mean for a hypothesis class to be PAC learnable? Give the formal condition.

A hypothesis class H\mathcal{H} is PAC learnable if there exists a learning algorithm AA and a polynomial function mH(ε,δ)m_\mathcal{H}(\varepsilon, \delta) such that: for any ε,δ>0\varepsilon, \delta > 0, any distribution D\mathcal{D}, and any training set SS of size mmH(ε,δ)m \geq m_\mathcal{H}(\varepsilon, \delta) drawn i.i.d. from D\mathcal{D}, with probability at least 1δ1 - \delta over SS, A(S)A(S) outputs a hypothesis with LD(h)minhHLD(h)+εL_\mathcal{D}(h) \leq \min_{h^* \in \mathcal{H}} L_\mathcal{D}(h^*) + \varepsilon. The key elements: (1) ε\varepsilon controls accuracy (how close to optimal), (2) δ\delta controls confidence (how often we succeed), (3) the sample complexity must be polynomial in 1/ε1/\varepsilon and 1/δ1/\delta - exponential sample complexity is computationally infeasible.

Q2: What is VC dimension and why does it determine PAC learnability?

The VC (Vapnik-Chervonenkis) dimension of hypothesis class H\mathcal{H} is the largest set of points that H\mathcal{H} can shatter - i.e., for which H\mathcal{H} can realise every possible binary labelling. It determines PAC learnability through the Fundamental Theorem: H\mathcal{H} is PAC learnable if and only if its VC dimension is finite. When VC dim = d<d < \infty, the sample complexity is Θ(d/ε2)\Theta(d/\varepsilon^2). The mechanism: finite VC dim implies the growth function is polynomial (Sauer-Shelah lemma), which bounds the probability that any "bad" hypothesis looks good on training data, enabling uniform convergence of empirical to true risk.

Q3: What is the No Free Lunch theorem and what does it imply for machine learning practice?

The NFL theorem states that for any learning algorithm, there exists a distribution on which it performs no better than random guessing, regardless of training set size. More precisely, when averaged over all possible target functions (uniformly), all learning algorithms perform identically - no algorithm is universally better than any other. Practical implications: (1) "Best algorithm" only makes sense relative to a specific problem domain and distribution; (2) All assumptions baked into an algorithm (linear models assume linear relationships, CNNs assume translational equivariance) are inductive biases that specialise the algorithm for certain distributions at the expense of others; (3) There's no such thing as a "free" algorithm - any bias that helps on one class of problems hurts on another; (4) The goal is to choose an algorithm whose inductive bias matches the structure of your problem.

Q4: Explain Rademacher complexity and how it differs from VC dimension.

Rademacher complexity measures how well hypothesis class H\mathcal{H} can fit random noise: R^S(H)=Eσ[suphH1miσih(xi)]\hat{\mathfrak{R}}_S(\mathcal{H}) = \mathbb{E}_\sigma[\sup_{h \in \mathcal{H}} \frac{1}{m}\sum_i \sigma_i h(x_i)] where σi\sigma_i are fair coin flips. High Rademacher complexity means the class can correlate with random labels - dangerous for generalisation. Key differences from VC dimension: (1) Rademacher is data-dependent - computed from the actual training distribution, giving tighter bounds when data is "easier"; VC is distribution-free (worst case). (2) Rademacher gives tight bounds for many hypothesis classes where VC bounds are loose. (3) Rademacher handles regression and multi-class naturally; VC is fundamentally binary classification. (4) Rademacher complexity decreases as 1/m1/\sqrt{m} for bounded hypothesis classes - directly showing how more data tightens generalisation bounds.

Q5: Why does regularisation work, from a theoretical perspective?

From MLE/MAP: L2 regularisation is MAP estimation with a Gaussian prior over parameters. Stronger regularisation (larger λ\lambda) encodes a stronger prior belief that parameters are near zero, preventing the model from overfitting to noise. From SLT: regularisation effectively reduces the hypothesis class. L2 regularisation with constraint w2B\|w\|_2 \leq B creates hypothesis class HB={xwTx:w2B}\mathcal{H}_B = \{x \mapsto w^Tx : \|w\|_2 \leq B\}. The Rademacher complexity of this class is O(B/m)\mathcal{O}(B/\sqrt{m}) - smaller BB (stronger regularisation) gives tighter generalisation bounds. From SRM: regularisation is implicit model selection, choosing the optimal hypothesis class complexity for the training set size. Theoretically, optimal regularisation strength grows as 1/m1/\sqrt{m} - which is why you typically need less regularisation as you gather more training data.

Q6: What is the double descent phenomenon and why does it challenge classical bias-variance theory?

Classical bias-variance theory predicts a U-shaped test error curve: increasing model complexity first decreases bias (test error improves), then increases variance (overfitting) past the optimal point. Double descent shows that for overparameterised models (more parameters than training samples), test error can actually decrease again after initially increasing - forming a second descent. The "interpolation threshold" (where the model can fit training data exactly) is the peak. After it, further increasing capacity leads to better generalisation. Classical VC theory can't explain this: more parameters → larger VC dim → looser generalisation bounds → should generalise worse. The explanation comes from implicit regularisation of gradient descent: SGD finds minimum-norm solutions among all solutions that perfectly fit training data. In high dimensions, minimum-norm solutions can still generalise well because the "noise" is spread across many directions while the signal is concentrated.

Q7: What is the Sauer-Shelah lemma and why is it crucial for SLT?

The Sauer-Shelah lemma bounds the growth function - the maximum number of distinct classifications a hypothesis class can produce on mm points - by a polynomial when VC dim d<d < \infty: ΠH(m)i=0d(mi)(em/d)d\Pi_\mathcal{H}(m) \leq \sum_{i=0}^d \binom{m}{i} \leq (em/d)^d. It's crucial because: (1) Without it, infinite hypothesis classes could produce 2m2^m behaviours on mm points, making uniform convergence impossible; (2) Polynomial growth means the effective number of "distinct models" grows slowly with mm, allowing union bound arguments to control the probability of overfitting across all hypotheses; (3) It's the bridge between the combinatorial property (VC dimension) and the probabilistic guarantee (PAC bound) - finite VC dim → polynomial growth function → uniform convergence → PAC learnability.

Q8: How do PAC-Bayes bounds improve on VC dimension bounds for deep learning?

Classical VC bounds are typically vacuous for deep neural networks - they predict generalisation gaps of hundreds of percent when in practice networks generalise within a few percent. PAC-Bayes bounds take an algorithm-dependent perspective: instead of bounding the worst case over all hypotheses in H\mathcal{H}, they bound the expected generalisation of a posterior distribution QQ over hypotheses: true error ≤ empirical error + DKL(QP)/2(m1)\sqrt{D_{\text{KL}}(Q\|P)/2(m-1)}. The KL term measures how far the posterior (learned model) moved from the prior (initialisation). For networks trained with SGD from small random initialisation, the KL divergence is often much smaller than the VC dimension predicts, giving tighter bounds. Additionally, PAC-Bayes bounds can capture the geometry of the loss landscape - flat minima (small KL from prior) generalise better than sharp minima, which empirically matches the success of large batch vs small batch training and explicit flatness-seeking optimisers like SAM.

:::tip Role-Specific Angles Research Engineer Interview: PAC bounds derivation, VC dimension computation, Rademacher complexity, double descent, PAC-Bayes MLE Interview: NFL theorem implications, SRM for model selection, why regularisation works theoretically, sample complexity planning Data Scientist: Sample size calculations from error tolerance, why VC dim matters for feature selection, practical generalisation bound interpretation AI Engineer: Implicit regularisation in SGD, flat minima and generalisation, why overparameterised models work, inductive biases of architectures :::

:::tip 🎮 Interactive Playground

Visualize this concept: Try the PAC Learning demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.