VC Dimension
Reading time: ~28 minutes | Level: Statistical Learning Theory → ML Foundations
You are designing a text classifier. Your first instinct: use a bag-of-words representation with 50,000 features and a linear classifier. Your colleague pushes back - "50,000 features with only 10,000 training examples? You'll overfit."
Will you? Not necessarily. VC theory tells you that a linear classifier with features has VC dimension - so for 50,000 features, VC dimension is 50,001. The sample complexity bound requires roughly million examples for guaranteed error.
Your colleague is right. 10,000 examples is insufficient to guarantee that a linear classifier in this feature space generalizes.
But this does not mean you will definitely overfit - it means you have no formal guarantee. In practice, with L2 regularization, the effective complexity is much lower. VC theory tells you the ceiling. Rademacher complexity and regularization theory (Lesson 04, 05) tell you the practical bound.
Understanding VC dimension precisely - not just "big VC = bad" but the formal shattering definition, the key examples, and the Fundamental Theorem - is what separates a principled ML practitioner from one guessing at hyperparameters.
What You Will Learn
- The problem with PAC bounds for infinite hypothesis classes
- The shattering concept - formal definition and geometric intuition
- VC dimension defined, with worked examples for threshold functions, intervals, and linear classifiers in
- VC dimensions of common ML models (decision trees, SVMs, neural networks)
- Sauer's Lemma - the mathematical bridge from shattering to generalization
- The Fundamental Theorem of Statistical Learning: VC dimension characterizes PAC learnability
- Margin theory: why SVMs generalize in high-dimensional spaces
- Code to verify VC dimension claims and plot generalization bounds
- Five interview questions with full worked answers
Part 1 - The Problem with Infinite Hypothesis Classes
Recall the PAC generalization bound for a finite hypothesis class :
This bound requires . For finite classes, it gives meaningful guarantees. But every practical ML model has an infinite hypothesis class:
| Model | Hypothesis class | | |---|---|---| | Linear classifier in | All hyperplanes: for | | | Polynomial classifier of degree | All polynomials of degree | | | Neural network (fixed architecture) | All weight assignments in | | | Decision tree (fixed depth) | All splits at all possible thresholds | | | SVM with RBF kernel | All RKHS functions with bounded norm | |
For infinite , and the PAC bound is vacuous - it tells you nothing.
The question: Can we still prove meaningful generalization bounds for infinite hypothesis classes?
The answer: Yes - by replacing with a combinatorial measure of effective complexity that applies to infinite classes. This measure is the VC dimension.
Part 2 - Shattering: The Foundation of VC Theory
Formal Definition
Definition (Shattering): A hypothesis class shatters a finite set if for every possible binary labeling of the points in , there exists some hypothesis in that achieves it:
Equivalently, the set of all labelings can produce on equals all possible labelings:
Definition (VC Dimension): The VC dimension of , denoted or , is the size of the largest set that can shatter:
If can shatter arbitrarily large sets, .
Geometric Intuition
Shattering a set of points means the hypothesis class is expressive enough to produce every possible dichotomy (two-class partition) of those points. Intuitively:
- If you can shatter 3 points in with a line, then for any assignment of +/- labels to those 3 points, a line exists that separates them correctly.
- 3 points not in a line can be shattered by linear classifiers in : VCdim = 3.
- 4 points in cannot always be shattered (the XOR pattern can't be separated by a line): VCdim ≤ 3.
3 points, non-collinear: all 8 labelings achievable by some line
* (+) (-) (-)
/ \ (-) \ (-)
* * → (+) (+) (-) ... all 2³ = 8 patterns
one of 8 patterns shown: line separates * from the rest
4 points, XOR pattern: no line can separate all 4 patterns
(-) (+)
(+) (-) ← this labeling requires a nonlinear boundary
Part 3 - Computing VC Dimension: Key Examples
Example 1: Threshold Functions on
where - classify as positive if .
VCdim = 1:
Can shatter (1 point):
- Label +1: use (threshold below ), so . ✓
- Label 0: use (threshold above ), so . ✓
Cannot shatter with :
- The labeling requires and , which means .
- But , so is impossible. ✗
Therefore VCdim = 1.
Example 2: Intervals on
where .
VCdim = 2:
Can shatter with : All 4 labelings:
- : Use (empty interval). ✓
- : Use . ✓
- : Use . ✓
- : Use . ✓
Cannot shatter with :
- The labeling requires an interval containing and but not .
- Any interval with and must contain all of . ✗
Therefore VCdim = 2.
Example 3: Linear Classifiers in - The Central Result
.
VCdim = - the most important result in classical learning theory.
Proof sketch (VCdim ): The points (standard basis vectors and origin) can be shattered. For any labeling , choose:
This gives = the desired label for each point (with appropriate scaling).
Proof sketch (VCdim , i.e., cannot shatter points): By Radon's Theorem, any points in can be partitioned into two disjoint sets such that the convex hulls and intersect. A hyperplane cannot separate all labelings of - in particular, it cannot separate (labeled +1) from (labeled -1) because their convex hulls intersect.
Concrete values:
| Model | VC Dimension |
|---|---|
| Threshold on | 1 |
| Interval on | 2 |
| Linear classifier in (half-lines) | 2 |
| Linear classifier in (half-planes) | 3 |
| Linear classifier in (hyperplanes) | |
| Rectangle (axis-aligned) in | 4 |
| -nearest neighbor classifier | |
| Decision tree with leaves | |
| Neural net: 1 hidden layer, neurons, sigmoid | |
| Sine classifier | |
| Convex polygon classifier in |
Part 4 - Code: Verifying Shattering with Linear Classifiers
import numpy as np
from scipy.optimize import linprog
from itertools import product
def can_shatter_with_linear(points, tolerance=1e-6):
"""
Check if a set of 2D points can be shattered by a linear classifier.
For each of the 2^m labelings, checks LP feasibility:
∃ w, b: y_i * (w^T x_i + b) >= 1 for all i
Args:
points: list of (x1, x2) tuples
tolerance: LP feasibility tolerance
Returns:
bool: True if all 2^m labelings are achievable
"""
m = len(points)
X = np.array(points, dtype=float)
X_aug = np.column_stack([X, np.ones(m)]) # augment with bias term
d_aug = X_aug.shape[1] # d + 1
for labels in product([1, -1], repeat=m):
labels_arr = np.array(labels, dtype=float)
Y = np.diag(labels_arr)
# Constraint: Y @ X_aug @ w >= 1 ↔ -Y @ X_aug @ w <= -1
A_ub = -Y @ X_aug # shape (m, d+1)
b_ub = -np.ones(m)
# LP: feasibility (minimize 0 subject to constraints)
result = linprog(
c=np.zeros(d_aug),
A_ub=A_ub,
b_ub=b_ub,
bounds=[(-100, 100)] * d_aug,
method='highs'
)
if result.status != 0:
return False # This labeling is not achievable
return True
# --- Verify VCdim(linear in R²) = 3 ---
# 3 non-collinear points: should be shatterable
three_pts_good = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0)] # triangle
print(f"3 non-collinear points shattered by linear: "
f"{can_shatter_with_linear(three_pts_good)}") # True
# 3 collinear points: cannot be shattered (interior point creates problem)
three_pts_collinear = [(-1.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
print(f"3 collinear points shattered by linear: "
f"{can_shatter_with_linear(three_pts_collinear)}") # False
# 4 points: cannot be shattered (regardless of configuration)
four_pts_corners = [(1.0, 1.0), (-1.0, 1.0), (1.0, -1.0), (-1.0, -1.0)]
print(f"4 points (corners) shattered by linear: "
f"{can_shatter_with_linear(four_pts_corners)}") # False
print(f"\nConclusion: VCdim(linear in R²) = 3 = d+1 where d=2 ✓")
# --- Verify VCdim of intervals on R = 2 ---
def can_shatter_with_interval(points):
"""
Check if a set of 1D points can be shattered by intervals [a,b].
An interval labels x as +1 if a <= x <= b, else -1.
"""
points = sorted(points)
m = len(points)
for labels in product([1, -1], repeat=m):
# Labels (1, 0, 1) for consecutive points requires a gap in the interval
# An interval must cover a contiguous range.
# So if label[i] = 1 and label[j] = 0 and label[k] = 1 where i < j < k,
# no interval [a,b] can satisfy this.
achievable = False
# Check all possible intervals [points[i]-eps, points[j]+eps] for i<=j
# and also the empty interval
for start_idx in range(-1, m):
for end_idx in range(start_idx, m):
# Interval covers points[start_idx..end_idx] (if start_idx >= 0)
predicted = []
for k, pt in enumerate(points):
if start_idx == -1: # empty interval
predicted.append(-1)
elif start_idx <= k <= end_idx:
predicted.append(1)
else:
predicted.append(-1)
if tuple(predicted) == labels:
achievable = True
break
if achievable:
break
if not achievable:
return False
return True
# 2 points: should be shatterable
print(f"2 points shattered by interval: "
f"{can_shatter_with_interval([1.0, 3.0])}") # True
# 3 points: (1, 0, 1) pattern impossible → not shatterable
print(f"3 points shattered by interval: "
f"{can_shatter_with_interval([1.0, 2.0, 3.0])}") # False
print(f"\nConclusion: VCdim(intervals on R) = 2 ✓")
Part 5 - Sauer's Lemma: From Shattering to Generalization
The key technical tool that makes VC theory work for infinite hypothesis classes:
Definition (Growth Function): The growth function of measures the maximum number of distinct labelings any points can receive:
For a finite class: .
For an infinite class: can still be finite!
Sauer-Shelah Lemma: If , then:
The growth function is polynomial in once .
Phase transition at :
- For : (class can shatter those points)
- For : (polynomial growth)
Why this enables generalization bounds: The union bound in the PAC proof requires a sum over all hypotheses. For infinite , this sum is infinite. But after restricting to training points, there are at most distinct behaviors. By Sauer's Lemma, this is at most - polynomial! We can apply the union bound over effective hypotheses, giving a log factor of instead of .
import numpy as np
import matplotlib.pyplot as plt
def growth_function_upper_bound(d_vc, m):
"""Sauer-Shelah upper bound on the growth function."""
if d_vc == 0:
return 1
# Sum of C(m, i) for i = 0, ..., d_vc
from scipy.special import comb
return int(sum(comb(m, i, exact=True) for i in range(min(d_vc, m) + 1)))
def growth_function_exponential(m):
"""Exponential (unrestricted): 2^m labelings."""
return 2**m
# Show the transition from exponential to polynomial growth
d_vc = 5
m_vals = range(1, 25)
exp_vals = [growth_function_exponential(m) for m in m_vals]
poly_vals = [growth_function_upper_bound(d_vc, m) for m in m_vals]
print(f"Growth function for d_VC = {d_vc}")
print(f"{'m':<8} {'2^m (unrestricted)':<25} {'Sauer bound':<20} {'Ratio':<10}")
print("-" * 63)
for m in range(1, 20):
exp_v = growth_function_exponential(m)
poly_v = growth_function_upper_bound(d_vc, m)
ratio = poly_v / exp_v
marker = " ← breakpoint" if m == d_vc else ""
print(f"{m:<8} {exp_v:<25} {poly_v:<20} {ratio:.4f}{marker}")
print("\nAfter m = d_VC, growth becomes polynomial (ratio → 0)")
Part 6 - The Fundamental Theorem of Statistical Learning
This theorem completely characterizes when binary classification is PAC learnable:
Theorem (Fundamental Theorem of Statistical Learning): Let be a binary hypothesis class. The following are equivalent:
- is PAC learnable (agnostic setting)
- is PAC learnable (realizable setting)
- satisfies the uniform convergence property
- has finite VC dimension:
Moreover, the sample complexity satisfies:
Upper bound (ERM achieves this with enough data):
Lower bound (no algorithm can do better with fewer samples):
The exact bound using Sauer's Lemma:
with probability .
The key takeaways:
- Sample complexity scales linearly with - the price of model complexity
- For linear classifiers in : need examples - one per dimension
- Infinite VC dimension = NOT PAC learnable - no algorithm can guarantee generalization
- ERM with finite VC dimension achieves optimal sample complexity (up to constants)
import numpy as np
import matplotlib.pyplot as plt
def vc_generalization_bound(d_vc, n, delta=0.05):
"""
VC generalization bound via Sauer's Lemma.
R(h_ERM) - R_hat(h_ERM) <= this value with prob >= 1-delta.
"""
if d_vc == 0:
return np.sqrt(np.log(2/delta) / (2*n))
log_growth = d_vc * np.log(2*np.e*n/d_vc) + np.log(4/delta)
return np.sqrt(8 * log_growth / n)
def vc_sample_complexity(d_vc, epsilon, delta=0.05, max_n=10_000_000):
"""
Minimum n such that the VC bound <= epsilon.
Binary search.
"""
lo, hi = max(d_vc, 10), max_n
while lo < hi:
mid = (lo + hi) // 2
if vc_generalization_bound(d_vc, mid, delta) <= epsilon:
hi = mid
else:
lo = mid + 1
return lo
# Sample complexity for different model types
print("VC Sample Complexity: Required Training Set Size")
print(f"epsilon=0.05 (5% generalization gap), delta=0.05 (95% confidence)")
print()
print(f"{'Model':<45} {'VC dim':>8} {'n required':>15}")
print("-" * 70)
models = [
("Threshold on R (single feature)", 1),
("Interval on R", 2),
("Linear classifier in R^2", 3),
("Linear classifier in R^10 (10 features)", 11),
("Linear classifier in R^100 (100 features)", 101),
("Linear in R^1000 (bag-of-words, 1000 features)", 1001),
("Linear in R^10000 (large bag-of-words)", 10001),
("Decision tree, k=32 leaves", 200), # ~k log k
("SVM, same as linear (d+1 for hard-margin)", 101),
]
for name, d_vc in models:
n = vc_sample_complexity(d_vc, epsilon=0.05, delta=0.05)
print(f"{name:<45} {d_vc:>8,} {n:>15,}")
# Plot: generalization bound vs n for different VC dimensions
print("\n\nGeneralization bounds decay as 1/√n...")
n_range = np.logspace(2, 6, 500)
fig, ax = plt.subplots(figsize=(11, 5))
for d_vc in [1, 5, 20, 100, 500]:
bounds = [vc_generalization_bound(d_vc, n) for n in n_range]
ax.semilogx(n_range, bounds, linewidth=2, label=f'd_VC = {d_vc:,}')
ax.axhline(0.05, color='gray', linestyle='--', linewidth=1.5, label='ε = 0.05')
ax.set_xlabel('Training set size n', fontsize=12)
ax.set_ylabel('Generalization bound', fontsize=12)
ax.set_title('VC Generalization Bounds vs Training Set Size', fontsize=13)
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)
ax.set_ylim(0, 1.5)
plt.tight_layout()
plt.savefig('vc_bounds.png', dpi=150)
print("Plot saved: vc_bounds.png")
Part 7 - VC Dimension and Margin Theory for SVMs
The classical VC dimension of a linear classifier in is , which scales with input dimension. But SVMs have a much better story.
Fat-Shattering Dimension: For a linear classifier with weight norm bound and data radius , the margin- shattering dimension (fat-shattering dimension) is:
where is the geometric margin. Substituting into the VC-type bound:
The profound consequence: This bound is independent of input dimension . An SVM with a large margin in has the same generalization bound as in - what matters is the margin, not the dimension.
Why SVMs generalize in high-dimensional space:
Classical VC bound: O(√(d/n))
- For d = 10,000 features: need ~millions of examples
Margin-based bound: O(√(B²R²/(γ²n))) = O(R/(γ√n))
- For margin γ = 0.5, radius R = 1: bound = O(2/√n)
- Requires only O(4/ε²) examples
- INDEPENDENT of d!
This explains the success of SVMs in text classification (very high-dimensional sparse features) and bioinformatics (gene expression data with thousands of features).
import numpy as np
def margin_generalization_bound(B, R, gamma, n, delta=0.05):
"""
Margin-based generalization bound for linear classifiers.
R(h) <= R_hat(h) + sqrt(B^2 R^2 / (gamma^2 n)) + confidence
For SVMs: B = 1/gamma (from the SVM constraint ||w|| = 1/gamma).
Simplified: bound = O(R / (gamma * sqrt(n)))
"""
fat_vc_proxy = (B * R / gamma)**2 # effective VC dimension
main_term = np.sqrt(fat_vc_proxy / n)
confidence = np.sqrt(np.log(2/delta) / (2*n))
return main_term + confidence
print("Margin-based Bound: Independence from Input Dimension")
print("="*65)
print(f"SVM parameters: margin γ=0.5, data radius R=1.0 (normalized)")
print(f"Training size: n=10,000")
print()
print(f"{'Input dimension d':<25} {'Classical VC bound':<25} {'Margin bound':<20}")
print("-" * 70)
n = 10_000
gamma = 0.5 # margin
R = 1.0 # data radius (normalized)
B = 1/gamma # SVM weight norm constraint: ||w|| = 1/gamma
for d in [10, 100, 1_000, 10_000, 100_000]:
d_vc = d + 1 # classical VC dimension of linear classifier
classical = np.sqrt((d_vc + 10) / n) # simplified classical bound
margin_b = margin_generalization_bound(B, R, gamma, n)
print(f"{d:<25,} {classical:<25.4f} {margin_b:<20.4f}")
print()
print("Margin bound stays constant regardless of d! ✓")
print("Classical bound grows with d - useless for high-d text classification")
Part 8 - VC Dimension of Neural Networks
For neural networks, VC dimension is well-studied:
Result (Goldberg-Jerrum, 1995; Bartlett-Maass, 2003):
For a feedforward neural network with:
- total parameters (weights and biases)
- layers
- Piecewise-linear activations (ReLU, threshold, etc.)
So VC dimension is roughly proportional to the number of parameters (with an extra log factor and depth factor).
For sigmoid/tanh activations (Bartlett-Maass):
For threshold (step function) activations: similar scaling.
Implications:
| Network | Parameters | VC dimension (approx) | Required samples |
|---|---|---|---|
| 2-layer MLP, 100 hidden | ~10K | ~100K | ~billions |
| ResNet-50 | ~25M | ~500M | ~astronomical |
| GPT-2 small | ~117M | ~billions | physically impossible |
These bounds are vacuous for every practical deep network. Modern theory (Rademacher complexity, PAC-Bayes, NTK) provides better tools - but VC dimension gives the conceptual foundation.
:::tip The Double Descent Connection The classical bias-variance tradeoff predicts a "sweet spot" model size - too small (high bias), too large (high variance). VC dimension formalizes this: VC dimension grows with model size, and the VC bound predicts a U-shaped test error curve.
But modern deep networks violate this! Overparameterized models (VC dimension >> n) often generalize well - the "double descent" phenomenon. This shows VC dimension is a worst-case bound: it doesn't account for the implicit regularization of SGD that prevents overparameterized models from reaching the worst-case function. :::
:::note Role-specific relevance ML Engineers: When designing model architecture, VC dimension gives a rough sense of required data: . For a linear model with 1000 features, you need ~400K examples for 5% guaranteed error. This motivates regularization (reduces effective VC dimension) when you have less data.
Research Engineers / Scientists: VC dimension is the foundation for NeurIPS/ICML theory papers. The Fundamental Theorem, Sauer's Lemma, and uniform convergence appear in papers on generalization bounds for transformers, in-context learning, and large language models.
Data Engineers building ML pipelines: Understanding that VC dimension scales with feature count motivates feature selection - reducing from 50K to 5K features reduces sample complexity by 10×. :::
Interview Questions
Q1: What does it mean for a hypothesis class to shatter a set of points?
A hypothesis class shatters a set if for every possible binary labeling of these points (there are such labelings), there exists some hypothesis that assigns exactly those labels. Shattering means the hypothesis class is expressive enough to perfectly explain every possible data pattern on those specific points - no labeling can "surprise" it. The VC dimension is the size of the largest set the class can shatter. A critical subtlety: shattering requires the existence of a shatterable set of that size - not that every set of that size is shatterable. So VCdim = 3 for linear classifiers in means there exist 3 non-collinear points that are shattered - but 3 collinear points are not shattered (the interior point creates a problem). To prove VCdim : exhibit one specific set of points that is shattered. To prove VCdim : show that no set of points can be shattered.
Q2: What is the VC dimension of linear classifiers in , and why?
VCdim = . To prove VCdim : Consider the points (standard basis vectors and origin). For any labeling , define and bias appropriately. With this choice, for each of the points. To prove VCdim (i.e., no set of points can be shattered): By Radon's Theorem, any points in can be partitioned into two sets with intersecting convex hulls. For the labeling that assigns +1 to all points in and -1 to all points in : a hyperplane separating from would have to separate the intersecting convex hulls, which is impossible. Practical implication: for a bag-of-words classifier with 10,000 vocabulary terms, VC dimension is 10,001. Guaranteed 5% error requires ~ million examples. This is why L2 regularization (reducing effective VC dimension) is essential for high-dimensional text classification.
Q3: State the Fundamental Theorem of Statistical Learning and explain its significance.
The Fundamental Theorem states: a binary hypothesis class is PAC learnable if and only if it has finite VC dimension. Moreover, the sample complexity is in the agnostic case (up to log factors in ). Significance: (1) Complete characterization of learnability - you don't need to check infinitely many distributions; VC dimension is the single number that determines whether learning is possible; (2) Optimal sample complexity - ERM achieves the optimal bound ; no algorithm can do better; (3) Infinite classes are learnable - the class of all linear classifiers in is infinite but has VCdim = 1001, so it is PAC learnable with enough data; (4) Impossibility results - the class of all functions has infinite VC dimension (it can shatter any finite set), hence is NOT PAC learnable - there is no algorithm that guarantees generalization without structural assumptions on the hypothesis class. Limitations: The theorem is for binary classification; extensions to multiclass and regression require different complexity measures (Natarajan dimension, fat-shattering dimension). VC bounds are worst-case and often loose for practical deep learning.
Q4: How does the margin of an SVM relate to its generalization, and why does this explain SVM success in high-dimensional spaces?
Classical VC theory says a linear classifier in has VC dimension - bad news for high-dimensional data. But SVMs optimize for the margin - the geometric distance from the decision boundary to the nearest training points. The fat-shattering dimension of a linear classifier with margin is where (SVM weight norm) and (data radius). Substituting: fat-shattering dimension - dimension-free! The SVM generalization bound becomes , independent of input dimension . For text classification with 50,000 TF-IDF features: classical VC bound gives - requires millions of examples. Margin-based SVM bound gives - if the data is linearly separable with a good margin, even 10,000 examples may suffice. The practical implication: maximize the SVM margin (maximize ) by (1) proper feature normalization (reduces ), (2) soft-margin SVM with regularization (increases by ignoring outliers), and (3) using an RBF kernel to map to a space where a large margin exists.
Q5: Give an example of a hypothesis class with infinite VC dimension. What does this imply theoretically and practically?
The class of sinusoid classifiers for has infinite VC dimension. For any points and any labeling, we can choose so that the sinusoid oscillates with period small enough to separate consecutive points - with the sign in each interval matching the desired label. More concretely: choose such that puts each in the positive half (for label +1) or negative half (for label -1) of the sine. By the Fundamental Theorem, infinite VC dimension means the class is not PAC learnable - no algorithm can guarantee generalization on this class without additional assumptions, regardless of how much data is collected. Practical implications: -nearest neighbor (for any ) with : can memorize any training set, VCdim = , and generalizes only through implicit smoothness of natural data (not guaranteed). Decision stumps with unbounded precision: if thresholds are real-valued, the class has VCdim = ; in practice, data has finite precision, effectively limiting VC dimension. Neural networks with the special activation : infinite VC dimension - this "Kolmogorov-Arnold" type activation can represent any function, which is why practical neural networks use bounded activations (ReLU, sigmoid) and weight norm constraints. The resolution in practice: we never use hypothesis classes with truly infinite VC dimension - we always add implicit or explicit regularization that restricts to a finite-VC-dimension subset.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the VC Dimension demo on the EngineersOfAI Playground - no code required.
:::
