PAC Learning
Reading time: ~25 minutes | Level: Statistical Learning Theory → ML Foundations
The Question Every ML Engineer Should Be Able to Answer
Your manager asks: "How much data do we need to train a reliable model?"
The instinctive answer - "as much as possible" - isn't useful. The real answer depends on:
- What kind of model are you training? (hypothesis class complexity)
- How accurate does it need to be? (error tolerance )
- How confident do you need to be in that accuracy? (confidence parameter )
PAC (Probably Approximately Correct) learning is the formal framework that turns these intuitions into a precise mathematical answer. It was introduced by Leslie Valiant in 1984 (Turing Award, 2010) and remains the foundational framework of statistical learning theory.
What You Will Learn
- The PAC framework: setup, error definitions, and the formal definition of learnability
- Sample complexity for finite hypothesis classes: the derivation using union bound + exponential tail bound
- Why sample complexity scales as , not - the key insight
- The realizable vs agnostic PAC settings and why the sample complexity changes from to
- Empirical Risk Minimization: the canonical learning algorithm with its PAC guarantees
- The No Free Lunch theorem: why inductive bias is not a flaw but a necessity
- Code to compute sample complexity and visualize how it changes with model class size
- Five interview questions with full worked answers
The PAC Framework
Setup
- Input space: (e.g., images, text tokens, feature vectors)
- Label space: (binary classification for simplicity)
- Data distribution: over - unknown, fixed
- Target concept: - the true labeling function
- Hypothesis class: - the set of functions the learner can choose from
- Training set: sampled i.i.d. from
The Error Rate
The true error (generalization error) of hypothesis :
The empirical error (training error):
We want to be small, but we can only observe .
PAC Learnability
Definition: A hypothesis class is PAC learnable if there exists an algorithm and a function such that for any , any distribution over , and any target , if receives a training set of size drawn from , then with probability at least (over the random draw of ):
where is the hypothesis output by algorithm .
In plain English: Given enough data, the algorithm probably outputs an approximately correct hypothesis. "Probably" = with probability . "Approximately correct" = error .
The function is the sample complexity - how much data you need. This is the key quantity.
The Consistent Learner and Finite Hypothesis Classes
Consistent Learner
A consistent learner returns a hypothesis with zero training error: .
This is achievable when (the target is in the hypothesis class) - the realizable case.
Empirical Risk Minimization (ERM) is a consistent learner in the realizable case: it finds the hypothesis with minimum training error (which is 0 if any consistent hypothesis exists).
Sample Complexity for Finite Hypothesis Classes
Theorem: Let be a finite hypothesis class (). If a consistent learner outputs with , then for any :
Setting this and solving for :
Derivation: Using the union bound and an exponential tail bound.
Step 1: Call bad if . We want to bound .
Step 2: If , then each training example is correctly classified by with probability at most . For i.i.d. examples:
Step 3: By the union bound over all bad hypotheses in :
Step 4: Set this and solve for : .
import numpy as np
import matplotlib.pyplot as plt
def pac_sample_complexity_finite(H_size, epsilon, delta):
"""
Sample complexity for PAC learning with finite hypothesis class.
Consistent learner (zero training error required).
Args:
H_size: |H| -- size of hypothesis class
epsilon: error tolerance (e.g., 0.05 = 5% error allowed)
delta: confidence parameter (e.g., 0.05 = 5% failure probability)
Returns:
minimum sample size n
"""
return int(np.ceil((np.log(H_size) + np.log(1/delta)) / epsilon))
# Examples
print("PAC Sample Complexity for Consistent Learner")
print(f"{'|H|':<15} {'epsilon':<10} {'delta':<10} {'n required':<15}")
print("-" * 50)
configs = [
(100, 0.05, 0.05),
(10000, 0.05, 0.05),
(1e6, 0.05, 0.05),
(100, 0.01, 0.05), # tighter error requirement
(100, 0.05, 0.01), # higher confidence requirement
(2**1000, 0.05, 0.05), # boolean functions on 1000 features
]
for H_size, eps, delta in configs:
n = pac_sample_complexity_finite(H_size, eps, delta)
print(f"{H_size:<15.2g} {eps:<10.3f} {delta:<10.3f} {n:<15,}")
print()
print("Key insight: sample complexity grows as log|H|, not |H|.")
print("Even for |H| = 2^1000, we need ~1000/epsilon samples, not 2^1000!")
The crucial insight: Sample complexity grows as , not . This is why learning is feasible even for exponentially large hypothesis classes - a feature space with binary features has possible Boolean functions, but we only need examples to PAC learn (if our hypothesis class is restricted to, say, conjunctions).
From Finite to Infinite Hypothesis Classes
Most ML models have infinite hypothesis classes (all linear classifiers in , all neural networks with bounded weights). The union bound over doesn't directly apply.
The key generalization: replace with a complexity measure that can be defined for infinite classes:
| Complexity Measure | When to Use | Lesson |
|---|---|---|
| $\ln | \mathcal{H} | $ |
| VC dimension | Binary classification, infinite | Lesson 02 |
| Rademacher complexity | Data-dependent, tighter bounds | Lesson 04 |
| PAC-Bayes bounds | Bayesian perspective, deep learning | Lesson 07 |
The Realizable vs Agnostic Cases
Realizable Case (target )
The target concept is in the hypothesis class. A consistent learner can achieve zero training error and the analysis above applies. The sample complexity is .
Agnostic Case (target )
The target may not be in - the best any can achieve is . We want to find such that .
Agnostic PAC sample complexity (uniform convergence):
This requires the generalization gap to be small for ALL simultaneously - uniform convergence. The sample complexity has instead of because we're in the harder agnostic setting.
def pac_sample_complexity_agnostic(H_size, epsilon, delta):
"""
Sample complexity for agnostic PAC learning (no realizability assumption).
Uses Hoeffding + union bound.
"""
return int(np.ceil((np.log(H_size) + np.log(2/delta)) / (2 * epsilon**2)))
def pac_sample_complexity_realizable(H_size, epsilon, delta):
return int(np.ceil((np.log(H_size) + np.log(1/delta)) / epsilon))
print("Realizable vs Agnostic Sample Complexity")
print(f"{'|H|':<12} {'epsilon':<10} {'Realizable n':<16} {'Agnostic n':<14}")
print("-" * 52)
for H_size in [100, 10000, 1e6]:
for eps in [0.1, 0.05]:
delta = 0.05
n_real = pac_sample_complexity_realizable(H_size, eps, delta)
n_agn = pac_sample_complexity_agnostic(H_size, eps, delta)
print(f"{H_size:<12.0g} {eps:<10.3f} {n_real:<16,} {n_agn:<14,}")
print()
print(f"Agnostic requires O(1/eps^2) vs realizable O(1/eps).")
print(f"For eps=0.05: agnostic needs 400x more data!")
ERM: Empirical Risk Minimization
The most natural learning algorithm: find the hypothesis minimizing training error.
Fundamental theorem (for finite , agnostic): ERM achieves sample complexity . It is PAC learnable.
In practice, ERM is just gradient descent on the empirical loss - which is exactly what we do in deep learning (minimizing cross-entropy / MSE on the training set). The theory tells us when this is enough to generalize.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
def learning_curve_simulation(n_train_sizes, model, n_test=5000):
"""
Simulate PAC learning behavior: how training and test error
change as n grows.
"""
np.random.seed(42)
# Generate large pool of data
X_all, y_all = make_classification(n_samples=n_train_sizes[-1] + n_test,
n_features=10, n_informative=5,
random_state=42)
X_test, y_test = X_all[:n_test], y_all[:n_test]
X_pool, y_pool = X_all[n_test:], y_all[n_test:]
train_errors, test_errors = [], []
for n in n_train_sizes:
X_train = X_pool[:n]
y_train = y_pool[:n]
model.fit(X_train, y_train)
train_err = 1 - model.score(X_train, y_train)
test_err = 1 - model.score(X_test, y_test)
train_errors.append(train_err)
test_errors.append(test_err)
return train_errors, test_errors
n_sizes = [10, 20, 50, 100, 200, 500, 1000, 2000, 5000]
# High-capacity model: Decision Tree (can perfectly fit any training set)
tree = DecisionTreeClassifier(max_depth=None)
train_dt, test_dt = learning_curve_simulation(n_sizes, tree)
# Low-capacity model: Logistic Regression (limited hypothesis class)
lr = LogisticRegression(C=1.0, max_iter=1000)
train_lr, test_lr = learning_curve_simulation(n_sizes, lr)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for ax, (train, test, name) in zip(axes, [
(train_dt, test_dt, 'Decision Tree (high capacity)'),
(train_lr, test_lr, 'Logistic Regression (low capacity)')
]):
ax.semilogx(n_sizes, train, 'b-o', label='Training error', linewidth=2)
ax.semilogx(n_sizes, test, 'r-s', label='Test error', linewidth=2)
ax.fill_between(n_sizes,
[min(t, te) for t, te in zip(train, test)],
test,
alpha=0.2, color='orange', label='Generalization gap')
ax.set_xlabel('Training set size n')
ax.set_ylabel('Error rate')
ax.set_title(name)
ax.legend()
ax.grid(True, alpha=0.3)
plt.suptitle('PAC Learning: How Error Changes With Training Set Size', fontsize=13)
plt.tight_layout()
plt.savefig('pac_learning_curves.png', dpi=150)
print("Learning curves saved")
# Print key observations
print(f"\nDecision Tree at n=10: train={train_dt[0]:.3f}, test={test_dt[0]:.3f}")
print(f"Decision Tree at n=5000: train={train_dt[-1]:.3f}, test={test_dt[-1]:.3f}")
print(f"\nLogistic Reg at n=10: train={train_lr[0]:.3f}, test={test_lr[0]:.3f}")
print(f"Logistic Reg at n=5000: train={train_lr[-1]:.3f}, test={test_lr[-1]:.3f}")
print()
print("Key observations:")
print("1. Decision tree: train error = 0 always, test error converges slowly")
print("2. Logistic regression: train error > 0 but test error converges faster")
print("3. Both eventually converge: PAC theory predicts this")
The PAC Bound Visualized
The generalization bound says: with probability ,
Let's visualize how the bound tightens with data:
import numpy as np
import matplotlib.pyplot as plt
def pac_bound_gap(H_log_size, n, delta=0.05):
"""PAC generalization gap (agnostic case)."""
return np.sqrt((H_log_size + np.log(1/delta)) / (2*n))
n_range = np.logspace(1, 5, 300)
H_sizes = {'Small (|H|=100)': np.log(100),
'Medium (|H|=1e6)': np.log(1e6),
'Large (|H|=2^20)': 20 * np.log(2)}
plt.figure(figsize=(10, 5))
for name, log_H in H_sizes.items():
gaps = [pac_bound_gap(log_H, n) for n in n_range]
plt.semilogx(n_range, gaps, linewidth=2, label=name)
plt.axhline(0.05, color='red', linestyle='--', label='epsilon=0.05 target')
plt.xlabel('Training set size n')
plt.ylabel('PAC bound on generalization gap')
plt.title('PAC Generalization Bounds: How Much Data Is Enough?')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('pac_bounds.png', dpi=150)
# When does each model class require <5% gap?
for name, log_H in H_sizes.items():
n_needed = int((log_H + np.log(1/0.05)) / (2 * 0.05**2))
print(f"{name}: need n > {n_needed:,} for gap < 0.05 (delta=0.05)")
Practical Consequences of PAC Theory for ML Engineering
The PAC framework answers several questions that come up constantly in ML practice:
"How much data do I need?" - Use the sample complexity formula. For a logistic regression model with 100 binary features (effective ), the PAC bound says you need examples. For : ~28,000 examples. For : ~700,000. These are upper bounds, but they give order-of-magnitude guidance.
"Is my test set large enough?" - The same bound applies. If you have 1,000 test examples and want to estimate error within 3% with 95% confidence, use: . Your 1,000-example test set gives only ~4.5% precision. For 3% precision: use ~2,000 test examples.
"Why does regularization help?" - Regularization implicitly reduces (by restricting which hypotheses are reachable) or switches from a hard capacity constraint to a soft one. PAC theory quantifies this: a regularized model with effective hypothesis count needs fewer examples to generalize.
"Why does transfer learning work?" - Fine-tuning a pretrained model restricts the hypothesis class to a small neighborhood of the pretrained weights - effectively reducing dramatically. This is why fine-tuned models often outperform from-scratch models with limited labeled data.
import numpy as np
# PAC-guided sample size calculator
def pac_sample_size_required(epsilon, delta, H_description):
"""
Required n for PAC guarantee: error <= epsilon with prob >= 1-delta.
Returns both realizable and agnostic bounds.
"""
results = []
for name, log_H in H_description.items():
n_real = int(np.ceil((log_H + np.log(1/delta)) / epsilon))
n_agn = int(np.ceil((log_H + np.log(2/delta)) / (2 * epsilon**2)))
results.append((name, n_real, n_agn))
return results
print("PAC Sample Size Calculator")
print(f"Target: epsilon=0.05 error, delta=0.05 failure probability")
print()
H_models = {
"Logistic reg, 10 features": 10 * np.log(2),
"Logistic reg, 100 features": 100 * np.log(2),
"Logistic reg, 1000 features": 1000 * np.log(2),
"Decision tree depth=5 (32 leaves)": 100, # ~k log k
"Decision tree depth=10 (1024 leaves)": 3000,
"Linear classifier, 784 features (MNIST)": 784 * np.log(2),
}
print(f"{'Model':<45} {'Realizable n':>15} {'Agnostic n':>14}")
print("-" * 76)
for name, n_real, n_agn in pac_sample_size_required(0.05, 0.05, H_models):
print(f"{name:<45} {n_real:>15,} {n_agn:>14,}")
print()
print("Key: agnostic setting needs ~1/epsilon more samples than realizable.")
print("Most real ML is agnostic (no perfect classifier exists in H).")
:::note Role-specific relevance ML Engineers: The PAC framework directly answers "how much data do I need?" - the most common practical question. Use it to argue for data collection budgets, justify test set sizes, and explain why regularization helps (it reduces effective ).
Research Engineers / Scientists: PAC learning is foundational for understanding ICML/NeurIPS theory papers. The Fundamental Theorem (finite VC dimension ↔ PAC learnability), the extension to infinite classes via VC dimension and Rademacher complexity, and the agnostic-to-realizable gap all appear in modern research on transformers, in-context learning, and LLM generalization.
Data Scientists: The "agnostic vs realizable" distinction maps directly to the question: "Is the perfect classifier for our task achievable by this model family?" If not (almost always), you're in the agnostic setting and need data - 20× more than the optimistic formula. :::
No Free Lunch Theorem
PAC theory has a sobering counterpart: the No Free Lunch (NFL) Theorem.
Theorem (informal): For any learning algorithm , there exists a distribution such that performs no better than random guessing.
What it means: Without assumptions about the data distribution or the hypothesis class, no learning algorithm can generalize. Every learning algorithm implicitly makes assumptions (inductive bias). PAC learning makes this explicit: (the target is learnable by the hypothesis class).
This is why model selection matters. A deep neural network with the right architecture encodes the right inductive biases for image data (local patterns, translation invariance via CNNs) or sequential data (recurrence, attention). Without these biases, the model wouldn't generalize.
Interview Questions
Q1: What does "Probably Approximately Correct" mean, and what are and ?
PAC learning says: a learning algorithm is probably ( probability) approximately (within error) correct. is the error tolerance - how inaccurate the learned hypothesis is allowed to be. is the failure probability - the probability that the algorithm fails to achieve error (due to unlucky training set). The algorithm is PAC learnable if for any , using training examples, it outputs a hypothesis with error with probability . The key: this must hold for any distribution and any target .
Q2: How does the union bound give us the PAC generalization bound for finite hypothesis classes?
We want to bound the probability that the ERM learner outputs a bad hypothesis (true error > ) despite achieving zero training error. Step 1: For a fixed bad hypothesis with , each training example is correctly classified with probability , so the probability all examples are correctly classified is . Step 2: The union bound says the probability that ANY bad hypothesis has zero training error is at most the sum over all bad hypotheses of the probability that specific hypothesis has zero training error: . Setting this gives .
Q3: Why does sample complexity grow as rather than ? What's the intuition?
The logarithm comes from the exponential tail bound on the probability that a bad hypothesis fools the learner. Each new training example multiplies the probability of being fooled by - exponential decay with each example. So after examples, the failure probability decays as . To beat bad hypotheses simultaneously (via union bound), we need , which requires . Intuitively: each training example provides a "test" that rules out bad hypotheses exponentially fast. bits of information are needed to identify the correct hypothesis from candidates - and examples provide roughly bits of useful information about which hypotheses are wrong.
Q4: What is the difference between the realizable and agnostic PAC settings, and why does it matter?
In the realizable setting, the target concept is in the hypothesis class (), so a perfect classifier exists. In the agnostic setting, may not be in , and we aim to find with error where is the best achievable. The sample complexity difference is significant: realizable vs agnostic . The factor in the agnostic case comes from uniform convergence - we need to be close to for all simultaneously (not just at zero), which requires a concentration inequality (Hoeffding's) rather than just an exponential tail bound. In practice, almost all real ML problems are agnostic - the true data distribution is not perfectly captured by any linear model, neural network, or other hypothesis class.
Q5: What is the No Free Lunch theorem and what are its practical implications?
The No Free Lunch theorem states that over all possible data-generating distributions, all learning algorithms have the same expected performance. For any algorithm that generalizes well on distribution , there exists a distribution on which performs no better than random. Proof sketch: any training algorithm that doesn't observe all inputs must make assumptions about unobserved inputs - any assumption that's correct on will be wrong on some . Practical implications: (1) There is no universally best algorithm - you must choose based on your data's structure; (2) Inductive bias (assumptions encoded in the model) is not a flaw - it's necessary for generalization; (3) Claims like "our method is always better than X" are mathematical impossibilities; (4) Domain knowledge matters - CNNs work on images because images have locality structure, not because convolutions are universally optimal. Every successful ML architecture encodes specific assumptions about the data distribution.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the PAC Learning demo on the EngineersOfAI Playground - no code required.
:::
