Skip to main content

VC Dimension

Reading time: ~28 minutes | Level: Statistical Learning Theory → ML Foundations

You are designing a text classifier. Your first instinct: use a bag-of-words representation with 50,000 features and a linear classifier. Your colleague pushes back - "50,000 features with only 10,000 training examples? You'll overfit."

Will you? Not necessarily. VC theory tells you that a linear classifier with dd features has VC dimension d+1d+1 - so for 50,000 features, VC dimension is 50,001. The sample complexity bound requires roughly O(d/ϵ2)=O(50000/0.0025)=20O(d/\epsilon^2) = O(50000/0.0025) = 20 million examples for guaranteed ϵ=0.05\epsilon = 0.05 error.

Your colleague is right. 10,000 examples is insufficient to guarantee that a linear classifier in this feature space generalizes.

But this does not mean you will definitely overfit - it means you have no formal guarantee. In practice, with L2 regularization, the effective complexity is much lower. VC theory tells you the ceiling. Rademacher complexity and regularization theory (Lesson 04, 05) tell you the practical bound.

Understanding VC dimension precisely - not just "big VC = bad" but the formal shattering definition, the key examples, and the Fundamental Theorem - is what separates a principled ML practitioner from one guessing at hyperparameters.

What You Will Learn

  • The problem with PAC bounds for infinite hypothesis classes
  • The shattering concept - formal definition and geometric intuition
  • VC dimension defined, with worked examples for threshold functions, intervals, and linear classifiers in Rd\mathbb{R}^d
  • VC dimensions of common ML models (decision trees, SVMs, neural networks)
  • Sauer's Lemma - the mathematical bridge from shattering to generalization
  • The Fundamental Theorem of Statistical Learning: VC dimension characterizes PAC learnability
  • Margin theory: why SVMs generalize in high-dimensional spaces
  • Code to verify VC dimension claims and plot generalization bounds
  • Five interview questions with full worked answers

Part 1 - The Problem with Infinite Hypothesis Classes

Recall the PAC generalization bound for a finite hypothesis class H\mathcal{H}:

R(hERM)R^(hERM)+lnH+ln(1/δ)2nR(h_{ERM}) \leq \hat{R}(h_{ERM}) + \sqrt{\frac{\ln|\mathcal{H}| + \ln(1/\delta)}{2n}}

This bound requires H<|\mathcal{H}| < \infty. For finite classes, it gives meaningful guarantees. But every practical ML model has an infinite hypothesis class:

| Model | Hypothesis class | H|\mathcal{H}| | |---|---|---| | Linear classifier in Rd\mathbb{R}^d | All hyperplanes: sign(wx+b)\text{sign}(\mathbf{w}^\top \mathbf{x} + b) for wRd\mathbf{w} \in \mathbb{R}^d | \infty | | Polynomial classifier of degree kk | All polynomials of degree k\leq k | \infty | | Neural network (fixed architecture) | All weight assignments in Rp\mathbb{R}^p | \infty | | Decision tree (fixed depth) | All splits at all possible thresholds | \infty | | SVM with RBF kernel | All RKHS functions with bounded norm | \infty |

For infinite H\mathcal{H}, lnH=\ln|\mathcal{H}| = \infty and the PAC bound is vacuous - it tells you nothing.

The question: Can we still prove meaningful generalization bounds for infinite hypothesis classes?

The answer: Yes - by replacing logH\log|\mathcal{H}| with a combinatorial measure of effective complexity that applies to infinite classes. This measure is the VC dimension.

Part 2 - Shattering: The Foundation of VC Theory

Formal Definition

Definition (Shattering): A hypothesis class H\mathcal{H} shatters a finite set C={x1,,xm}XC = \{x_1, \ldots, x_m\} \subseteq \mathcal{X} if for every possible binary labeling of the points in CC, there exists some hypothesis in H\mathcal{H} that achieves it:

y{0,1}m,hH:h(xi)=yi for all i=1,,m\forall \mathbf{y} \in \{0,1\}^m, \quad \exists h \in \mathcal{H}: \quad h(x_i) = y_i \text{ for all } i = 1, \ldots, m

Equivalently, the set of all labelings H\mathcal{H} can produce on CC equals all 2m2^m possible labelings:

{(h(x1),,h(xm)):hH}=2m|\{(h(x_1), \ldots, h(x_m)) : h \in \mathcal{H}\}| = 2^m

Definition (VC Dimension): The VC dimension of H\mathcal{H}, denoted VCdim(H)\text{VCdim}(\mathcal{H}) or dVCd_{VC}, is the size of the largest set that H\mathcal{H} can shatter:

dVC(H)=max{m:CX,C=m,H shatters C}d_{VC}(\mathcal{H}) = \max\{m : \exists C \subseteq \mathcal{X}, |C| = m, \mathcal{H} \text{ shatters } C\}

If H\mathcal{H} can shatter arbitrarily large sets, dVC(H)=d_{VC}(\mathcal{H}) = \infty.

Geometric Intuition

Shattering a set of mm points means the hypothesis class is expressive enough to produce every possible dichotomy (two-class partition) of those points. Intuitively:

  • If you can shatter 3 points in R2\mathbb{R}^2 with a line, then for any assignment of +/- labels to those 3 points, a line exists that separates them correctly.
  • 3 points not in a line can be shattered by linear classifiers in R2\mathbb{R}^2: VCdim = 3.
  • 4 points in R2\mathbb{R}^2 cannot always be shattered (the XOR pattern can't be separated by a line): VCdim ≤ 3.
3 points, non-collinear: all 8 labelings achievable by some line
* (+) (-) (-)
/ \ (-) \ (-)
* * → (+) (+) (-) ... all 2³ = 8 patterns
one of 8 patterns shown: line separates * from the rest

4 points, XOR pattern: no line can separate all 4 patterns
(-) (+)
(+) (-) ← this labeling requires a nonlinear boundary

Part 3 - Computing VC Dimension: Key Examples

Example 1: Threshold Functions on R\mathbb{R}

H={ht:tR}\mathcal{H} = \{h_t : t \in \mathbb{R}\} where ht(x)=1[xt]h_t(x) = \mathbf{1}[x \geq t] - classify as positive if xtx \geq t.

VCdim = 1:

Can shatter {x1}\{x_1\} (1 point):

  • Label +1: use t=x11t = x_1 - 1 (threshold below x1x_1), so ht(x1)=1h_t(x_1) = 1. ✓
  • Label 0: use t=x1+1t = x_1 + 1 (threshold above x1x_1), so ht(x1)=0h_t(x_1) = 0. ✓

Cannot shatter {x1,x2}\{x_1, x_2\} with x1<x2x_1 < x_2:

  • The labeling (h(x1),h(x2))=(1,0)(h(x_1), h(x_2)) = (1, 0) requires x1tx_1 \geq t and x2<tx_2 < t, which means x1t>x2x_1 \geq t > x_2.
  • But x1<x2x_1 < x_2, so x1t>x2>x1x_1 \geq t > x_2 > x_1 is impossible. ✗

Therefore VCdim = 1.

Example 2: Intervals on R\mathbb{R}

H={h[a,b]:ab}\mathcal{H} = \{h_{[a,b]} : a \leq b\} where h[a,b](x)=1[axb]h_{[a,b]}(x) = \mathbf{1}[a \leq x \leq b].

VCdim = 2:

Can shatter {x1,x2}\{x_1, x_2\} with x1<x2x_1 < x_2: All 4 labelings:

  • (0,0)(0, 0): Use a=b=a = b = -\infty (empty interval). ✓
  • (1,0)(1, 0): Use [x1ϵ,x1+ϵ][x_1 - \epsilon, x_1 + \epsilon]. ✓
  • (0,1)(0, 1): Use [x2ϵ,x2+ϵ][x_2 - \epsilon, x_2 + \epsilon]. ✓
  • (1,1)(1, 1): Use [x1ϵ,x2+ϵ][x_1 - \epsilon, x_2 + \epsilon]. ✓

Cannot shatter {x1,x2,x3}\{x_1, x_2, x_3\} with x1<x2<x3x_1 < x_2 < x_3:

  • The labeling (1,0,1)(1, 0, 1) requires an interval containing x1x_1 and x3x_3 but not x2x_2.
  • Any interval [a,b][a, b] with ax1a \leq x_1 and bx3b \geq x_3 must contain all of [x1,x3]x2[x_1, x_3] \ni x_2. ✗

Therefore VCdim = 2.

Example 3: Linear Classifiers in Rd\mathbb{R}^d - The Central Result

H={sign(wx+b):wRd,bR}\mathcal{H} = \{\text{sign}(\mathbf{w}^\top \mathbf{x} + b) : \mathbf{w} \in \mathbb{R}^d, b \in \mathbb{R}\}.

VCdim = d+1d + 1 - the most important result in classical learning theory.

Proof sketch (VCdim d+1\geq d+1): The d+1d+1 points e1,e2,,ed,0e_1, e_2, \ldots, e_d, \mathbf{0} (standard basis vectors and origin) can be shattered. For any labeling y{+1,1}d+1\mathbf{y} \in \{+1,-1\}^{d+1}, choose:

w=j:yj=+1,jdejandb=12+yd+112\mathbf{w} = \sum_{j : y_j = +1, j \leq d} e_j \quad \text{and} \quad b = -\frac{1}{2} + y_{d+1} \cdot \frac{1}{2}

This gives sign(wej+b)\text{sign}(\mathbf{w}^\top e_j + b) = the desired label for each point (with appropriate scaling).

Proof sketch (VCdim d+1\leq d+1, i.e., cannot shatter d+2d+2 points): By Radon's Theorem, any d+2d+2 points in Rd\mathbb{R}^d can be partitioned into two disjoint sets A,BA, B such that the convex hulls conv(A)\text{conv}(A) and conv(B)\text{conv}(B) intersect. A hyperplane cannot separate all labelings of ABA \cup B - in particular, it cannot separate AA (labeled +1) from BB (labeled -1) because their convex hulls intersect.

Concrete values:

ModelVC Dimension
Threshold on R\mathbb{R}1
Interval on R\mathbb{R}2
Linear classifier in R1\mathbb{R}^1 (half-lines)2
Linear classifier in R2\mathbb{R}^2 (half-planes)3
Linear classifier in Rd\mathbb{R}^d (hyperplanes)d+1d+1
Rectangle (axis-aligned) in R2\mathbb{R}^24
kk-nearest neighbor classifier\infty
Decision tree with kk leavesΘ(klogk)\Theta(k \log k)
Neural net: 1 hidden layer, kk neurons, sigmoidO(k2)O(k^2)
Sine classifier sign(sin(ωx))\text{sign}(\sin(\omega x))\infty
Convex polygon classifier in R2\mathbb{R}^2\infty

Part 4 - Code: Verifying Shattering with Linear Classifiers

import numpy as np
from scipy.optimize import linprog
from itertools import product

def can_shatter_with_linear(points, tolerance=1e-6):
"""
Check if a set of 2D points can be shattered by a linear classifier.
For each of the 2^m labelings, checks LP feasibility:
∃ w, b: y_i * (w^T x_i + b) >= 1 for all i

Args:
points: list of (x1, x2) tuples
tolerance: LP feasibility tolerance

Returns:
bool: True if all 2^m labelings are achievable
"""
m = len(points)
X = np.array(points, dtype=float)
X_aug = np.column_stack([X, np.ones(m)]) # augment with bias term
d_aug = X_aug.shape[1] # d + 1

for labels in product([1, -1], repeat=m):
labels_arr = np.array(labels, dtype=float)
Y = np.diag(labels_arr)

# Constraint: Y @ X_aug @ w >= 1 ↔ -Y @ X_aug @ w <= -1
A_ub = -Y @ X_aug # shape (m, d+1)
b_ub = -np.ones(m)

# LP: feasibility (minimize 0 subject to constraints)
result = linprog(
c=np.zeros(d_aug),
A_ub=A_ub,
b_ub=b_ub,
bounds=[(-100, 100)] * d_aug,
method='highs'
)
if result.status != 0:
return False # This labeling is not achievable
return True


# --- Verify VCdim(linear in R²) = 3 ---

# 3 non-collinear points: should be shatterable
three_pts_good = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0)] # triangle
print(f"3 non-collinear points shattered by linear: "
f"{can_shatter_with_linear(three_pts_good)}") # True

# 3 collinear points: cannot be shattered (interior point creates problem)
three_pts_collinear = [(-1.0, 0.0), (0.0, 0.0), (1.0, 0.0)]
print(f"3 collinear points shattered by linear: "
f"{can_shatter_with_linear(three_pts_collinear)}") # False

# 4 points: cannot be shattered (regardless of configuration)
four_pts_corners = [(1.0, 1.0), (-1.0, 1.0), (1.0, -1.0), (-1.0, -1.0)]
print(f"4 points (corners) shattered by linear: "
f"{can_shatter_with_linear(four_pts_corners)}") # False

print(f"\nConclusion: VCdim(linear in R²) = 3 = d+1 where d=2 ✓")
# --- Verify VCdim of intervals on R = 2 ---

def can_shatter_with_interval(points):
"""
Check if a set of 1D points can be shattered by intervals [a,b].
An interval labels x as +1 if a <= x <= b, else -1.
"""
points = sorted(points)
m = len(points)

for labels in product([1, -1], repeat=m):
# Labels (1, 0, 1) for consecutive points requires a gap in the interval
# An interval must cover a contiguous range.
# So if label[i] = 1 and label[j] = 0 and label[k] = 1 where i < j < k,
# no interval [a,b] can satisfy this.
achievable = False

# Check all possible intervals [points[i]-eps, points[j]+eps] for i<=j
# and also the empty interval
for start_idx in range(-1, m):
for end_idx in range(start_idx, m):
# Interval covers points[start_idx..end_idx] (if start_idx >= 0)
predicted = []
for k, pt in enumerate(points):
if start_idx == -1: # empty interval
predicted.append(-1)
elif start_idx <= k <= end_idx:
predicted.append(1)
else:
predicted.append(-1)

if tuple(predicted) == labels:
achievable = True
break
if achievable:
break

if not achievable:
return False
return True


# 2 points: should be shatterable
print(f"2 points shattered by interval: "
f"{can_shatter_with_interval([1.0, 3.0])}") # True

# 3 points: (1, 0, 1) pattern impossible → not shatterable
print(f"3 points shattered by interval: "
f"{can_shatter_with_interval([1.0, 2.0, 3.0])}") # False

print(f"\nConclusion: VCdim(intervals on R) = 2 ✓")

Part 5 - Sauer's Lemma: From Shattering to Generalization

The key technical tool that makes VC theory work for infinite hypothesis classes:

Definition (Growth Function): The growth function of H\mathcal{H} measures the maximum number of distinct labelings any mm points can receive:

ΠH(m)=maxCX,C=m{(h(x1),,h(xm)):hH}\Pi_\mathcal{H}(m) = \max_{C \subseteq \mathcal{X}, |C|=m} |\{(h(x_1), \ldots, h(x_m)) : h \in \mathcal{H}\}|

For a finite class: ΠH(m)=H\Pi_\mathcal{H}(m) = |\mathcal{H}|.

For an infinite class: ΠH(m)\Pi_\mathcal{H}(m) can still be finite!

Sauer-Shelah Lemma: If VCdim(H)=d\text{VCdim}(\mathcal{H}) = d, then:

ΠH(m)i=0d(mi)(emd)d\Pi_\mathcal{H}(m) \leq \sum_{i=0}^d \binom{m}{i} \leq \left(\frac{em}{d}\right)^d

The growth function is polynomial in mm once m>dVCm > d_{VC}.

Phase transition at m=dVCm = d_{VC}:

  • For mdVCm \leq d_{VC}: ΠH(m)=2m\Pi_\mathcal{H}(m) = 2^m (class can shatter those points)
  • For m>dVCm > d_{VC}: ΠH(m)(em/d)d\Pi_\mathcal{H}(m) \leq (em/d)^d (polynomial growth)

Why this enables generalization bounds: The union bound in the PAC proof requires a sum over all hypotheses. For infinite H\mathcal{H}, this sum is infinite. But after restricting to nn training points, there are at most ΠH(n)\Pi_\mathcal{H}(n) distinct behaviors. By Sauer's Lemma, this is at most (en/d)d(en/d)^d - polynomial! We can apply the union bound over (en/d)d(en/d)^d effective hypotheses, giving a log factor of dlog(en/d)d\log(en/d) instead of logH=\log|\mathcal{H}| = \infty.

import numpy as np
import matplotlib.pyplot as plt

def growth_function_upper_bound(d_vc, m):
"""Sauer-Shelah upper bound on the growth function."""
if d_vc == 0:
return 1
# Sum of C(m, i) for i = 0, ..., d_vc
from scipy.special import comb
return int(sum(comb(m, i, exact=True) for i in range(min(d_vc, m) + 1)))

def growth_function_exponential(m):
"""Exponential (unrestricted): 2^m labelings."""
return 2**m

# Show the transition from exponential to polynomial growth
d_vc = 5
m_vals = range(1, 25)

exp_vals = [growth_function_exponential(m) for m in m_vals]
poly_vals = [growth_function_upper_bound(d_vc, m) for m in m_vals]

print(f"Growth function for d_VC = {d_vc}")
print(f"{'m':<8} {'2^m (unrestricted)':<25} {'Sauer bound':<20} {'Ratio':<10}")
print("-" * 63)
for m in range(1, 20):
exp_v = growth_function_exponential(m)
poly_v = growth_function_upper_bound(d_vc, m)
ratio = poly_v / exp_v
marker = " ← breakpoint" if m == d_vc else ""
print(f"{m:<8} {exp_v:<25} {poly_v:<20} {ratio:.4f}{marker}")

print("\nAfter m = d_VC, growth becomes polynomial (ratio → 0)")

Part 6 - The Fundamental Theorem of Statistical Learning

This theorem completely characterizes when binary classification is PAC learnable:

Theorem (Fundamental Theorem of Statistical Learning): Let H\mathcal{H} be a binary hypothesis class. The following are equivalent:

  1. H\mathcal{H} is PAC learnable (agnostic setting)
  2. H\mathcal{H} is PAC learnable (realizable setting)
  3. H\mathcal{H} satisfies the uniform convergence property
  4. H\mathcal{H} has finite VC dimension: dVC(H)<d_{VC}(\mathcal{H}) < \infty

Moreover, the sample complexity satisfies:

Upper bound (ERM achieves this with enough data):

nH(ϵ,δ)=O(dVC+log(1/δ)ϵ2)n_\mathcal{H}(\epsilon, \delta) = O\left(\frac{d_{VC} + \log(1/\delta)}{\epsilon^2}\right)

Lower bound (no algorithm can do better with fewer samples):

nH(ϵ,δ)=Ω(dVC+log(1/δ)ϵ2)n_\mathcal{H}(\epsilon, \delta) = \Omega\left(\frac{d_{VC} + \log(1/\delta)}{\epsilon^2}\right)

The exact bound using Sauer's Lemma:

R(hERM)R^(hERM)+8n(dVCln2endVC+ln4δ)R(h_{ERM}) \leq \hat{R}(h_{ERM}) + \sqrt{\frac{8}{n}\left(d_{VC}\ln\frac{2en}{d_{VC}} + \ln\frac{4}{\delta}\right)}

with probability 1δ\geq 1 - \delta.

The key takeaways:

  1. Sample complexity scales linearly with dVCd_{VC} - the price of model complexity
  2. For linear classifiers in Rd\mathbb{R}^d: need O(d/ϵ2)O(d/\epsilon^2) examples - one per dimension
  3. Infinite VC dimension = NOT PAC learnable - no algorithm can guarantee generalization
  4. ERM with finite VC dimension achieves optimal sample complexity (up to constants)
import numpy as np
import matplotlib.pyplot as plt


def vc_generalization_bound(d_vc, n, delta=0.05):
"""
VC generalization bound via Sauer's Lemma.
R(h_ERM) - R_hat(h_ERM) <= this value with prob >= 1-delta.
"""
if d_vc == 0:
return np.sqrt(np.log(2/delta) / (2*n))
log_growth = d_vc * np.log(2*np.e*n/d_vc) + np.log(4/delta)
return np.sqrt(8 * log_growth / n)


def vc_sample_complexity(d_vc, epsilon, delta=0.05, max_n=10_000_000):
"""
Minimum n such that the VC bound <= epsilon.
Binary search.
"""
lo, hi = max(d_vc, 10), max_n
while lo < hi:
mid = (lo + hi) // 2
if vc_generalization_bound(d_vc, mid, delta) <= epsilon:
hi = mid
else:
lo = mid + 1
return lo


# Sample complexity for different model types
print("VC Sample Complexity: Required Training Set Size")
print(f"epsilon=0.05 (5% generalization gap), delta=0.05 (95% confidence)")
print()
print(f"{'Model':<45} {'VC dim':>8} {'n required':>15}")
print("-" * 70)

models = [
("Threshold on R (single feature)", 1),
("Interval on R", 2),
("Linear classifier in R^2", 3),
("Linear classifier in R^10 (10 features)", 11),
("Linear classifier in R^100 (100 features)", 101),
("Linear in R^1000 (bag-of-words, 1000 features)", 1001),
("Linear in R^10000 (large bag-of-words)", 10001),
("Decision tree, k=32 leaves", 200), # ~k log k
("SVM, same as linear (d+1 for hard-margin)", 101),
]
for name, d_vc in models:
n = vc_sample_complexity(d_vc, epsilon=0.05, delta=0.05)
print(f"{name:<45} {d_vc:>8,} {n:>15,}")

# Plot: generalization bound vs n for different VC dimensions
print("\n\nGeneralization bounds decay as 1/√n...")
n_range = np.logspace(2, 6, 500)
fig, ax = plt.subplots(figsize=(11, 5))
for d_vc in [1, 5, 20, 100, 500]:
bounds = [vc_generalization_bound(d_vc, n) for n in n_range]
ax.semilogx(n_range, bounds, linewidth=2, label=f'd_VC = {d_vc:,}')

ax.axhline(0.05, color='gray', linestyle='--', linewidth=1.5, label='ε = 0.05')
ax.set_xlabel('Training set size n', fontsize=12)
ax.set_ylabel('Generalization bound', fontsize=12)
ax.set_title('VC Generalization Bounds vs Training Set Size', fontsize=13)
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)
ax.set_ylim(0, 1.5)
plt.tight_layout()
plt.savefig('vc_bounds.png', dpi=150)
print("Plot saved: vc_bounds.png")

Part 7 - VC Dimension and Margin Theory for SVMs

The classical VC dimension of a linear classifier in Rd\mathbb{R}^d is d+1d+1, which scales with input dimension. But SVMs have a much better story.

Fat-Shattering Dimension: For a linear classifier with weight norm bound wB\|\mathbf{w}\| \leq B and data radius xR\|x\| \leq R, the margin-γ\gamma shattering dimension (fat-shattering dimension) is:

dFS(γ)=O(B2R2γ2)d_{FS}(\gamma) = O\left(\frac{B^2 R^2}{\gamma^2}\right)

where γ\gamma is the geometric margin. Substituting into the VC-type bound:

R(hSVM)R^(hSVM)+O(B2R2γ2n)R(h_{SVM}) \leq \hat{R}(h_{SVM}) + O\left(\sqrt{\frac{B^2 R^2}{\gamma^2 n}}\right)

The profound consequence: This bound is independent of input dimension dd. An SVM with a large margin in R100000\mathbb{R}^{100000} has the same generalization bound as in R10\mathbb{R}^{10} - what matters is the margin, not the dimension.

Why SVMs generalize in high-dimensional space:

Classical VC bound: O(√(d/n))
- For d = 10,000 features: need ~millions of examples

Margin-based bound: O(√(B²R²/(γ²n))) = O(R/(γ√n))
- For margin γ = 0.5, radius R = 1: bound = O(2/√n)
- Requires only O(4/ε²) examples
- INDEPENDENT of d!

This explains the success of SVMs in text classification (very high-dimensional sparse features) and bioinformatics (gene expression data with thousands of features).

import numpy as np

def margin_generalization_bound(B, R, gamma, n, delta=0.05):
"""
Margin-based generalization bound for linear classifiers.
R(h) <= R_hat(h) + sqrt(B^2 R^2 / (gamma^2 n)) + confidence

For SVMs: B = 1/gamma (from the SVM constraint ||w|| = 1/gamma).
Simplified: bound = O(R / (gamma * sqrt(n)))
"""
fat_vc_proxy = (B * R / gamma)**2 # effective VC dimension
main_term = np.sqrt(fat_vc_proxy / n)
confidence = np.sqrt(np.log(2/delta) / (2*n))
return main_term + confidence

print("Margin-based Bound: Independence from Input Dimension")
print("="*65)
print(f"SVM parameters: margin γ=0.5, data radius R=1.0 (normalized)")
print(f"Training size: n=10,000")
print()
print(f"{'Input dimension d':<25} {'Classical VC bound':<25} {'Margin bound':<20}")
print("-" * 70)

n = 10_000
gamma = 0.5 # margin
R = 1.0 # data radius (normalized)
B = 1/gamma # SVM weight norm constraint: ||w|| = 1/gamma

for d in [10, 100, 1_000, 10_000, 100_000]:
d_vc = d + 1 # classical VC dimension of linear classifier
classical = np.sqrt((d_vc + 10) / n) # simplified classical bound
margin_b = margin_generalization_bound(B, R, gamma, n)
print(f"{d:<25,} {classical:<25.4f} {margin_b:<20.4f}")

print()
print("Margin bound stays constant regardless of d! ✓")
print("Classical bound grows with d - useless for high-d text classification")

Part 8 - VC Dimension of Neural Networks

For neural networks, VC dimension is well-studied:

Result (Goldberg-Jerrum, 1995; Bartlett-Maass, 2003):

For a feedforward neural network with:

  • pp total parameters (weights and biases)
  • LL layers
  • Piecewise-linear activations (ReLU, threshold, etc.)

Ω(plogp)dVCO(pLlog(pL))\Omega(p \log p) \leq d_{VC} \leq O(p L \log(pL))

So VC dimension is roughly proportional to the number of parameters (with an extra log factor and depth factor).

For sigmoid/tanh activations (Bartlett-Maass): dVC=Θ(pL)d_{VC} = \Theta(pL)

For threshold (step function) activations: similar scaling.

Implications:

NetworkParametersVC dimension (approx)Required samples
2-layer MLP, 100 hidden~10K~100K~billions
ResNet-50~25M~500M~astronomical
GPT-2 small~117M~billionsphysically impossible

These bounds are vacuous for every practical deep network. Modern theory (Rademacher complexity, PAC-Bayes, NTK) provides better tools - but VC dimension gives the conceptual foundation.

:::tip The Double Descent Connection The classical bias-variance tradeoff predicts a "sweet spot" model size - too small (high bias), too large (high variance). VC dimension formalizes this: VC dimension grows with model size, and the VC bound predicts a U-shaped test error curve.

But modern deep networks violate this! Overparameterized models (VC dimension >> n) often generalize well - the "double descent" phenomenon. This shows VC dimension is a worst-case bound: it doesn't account for the implicit regularization of SGD that prevents overparameterized models from reaching the worst-case function. :::

:::note Role-specific relevance ML Engineers: When designing model architecture, VC dimension gives a rough sense of required data: O(dVC/ϵ2)O(d_{VC}/\epsilon^2). For a linear model with 1000 features, you need ~400K examples for 5% guaranteed error. This motivates regularization (reduces effective VC dimension) when you have less data.

Research Engineers / Scientists: VC dimension is the foundation for NeurIPS/ICML theory papers. The Fundamental Theorem, Sauer's Lemma, and uniform convergence appear in papers on generalization bounds for transformers, in-context learning, and large language models.

Data Engineers building ML pipelines: Understanding that VC dimension scales with feature count motivates feature selection - reducing dd from 50K to 5K features reduces sample complexity by 10×. :::

Interview Questions

Q1: What does it mean for a hypothesis class to shatter a set of points?

A hypothesis class H\mathcal{H} shatters a set C={x1,,xm}C = \{x_1, \ldots, x_m\} if for every possible binary labeling of these points (there are 2m2^m such labelings), there exists some hypothesis hHh \in \mathcal{H} that assigns exactly those labels. Shattering means the hypothesis class is expressive enough to perfectly explain every possible data pattern on those specific points - no labeling can "surprise" it. The VC dimension is the size of the largest set the class can shatter. A critical subtlety: shattering requires the existence of a shatterable set of that size - not that every set of that size is shatterable. So VCdim = 3 for linear classifiers in R2\mathbb{R}^2 means there exist 3 non-collinear points that are shattered - but 3 collinear points are not shattered (the interior point creates a problem). To prove VCdim d\geq d: exhibit one specific set of dd points that is shattered. To prove VCdim <d< d: show that no set of dd points can be shattered.

Q2: What is the VC dimension of linear classifiers in Rd\mathbb{R}^d, and why?

VCdim = d+1d+1. To prove VCdim d+1\geq d+1: Consider the d+1d+1 points {e1,,ed,0}\{e_1, \ldots, e_d, \mathbf{0}\} (standard basis vectors and origin). For any labeling y{+1,1}d+1\mathbf{y} \in \{+1,-1\}^{d+1}, define w=yd+1/20+j=1dyjej\mathbf{w} = y_{d+1}/2 \cdot \mathbf{0} + \sum_{j=1}^d y_j e_j and bias bb appropriately. With this choice, sign(wxi+b)=yi\text{sign}(\mathbf{w}^\top x_i + b) = y_i for each of the d+1d+1 points. To prove VCdim d+1\leq d+1 (i.e., no set of d+2d+2 points can be shattered): By Radon's Theorem, any d+2d+2 points in Rd\mathbb{R}^d can be partitioned into two sets A,BA, B with intersecting convex hulls. For the labeling that assigns +1 to all points in AA and -1 to all points in BB: a hyperplane separating AA from BB would have to separate the intersecting convex hulls, which is impossible. Practical implication: for a bag-of-words classifier with 10,000 vocabulary terms, VC dimension is 10,001. Guaranteed 5% error requires ~O(10001/0.0025)4O(10001 / 0.0025) \approx 4 million examples. This is why L2 regularization (reducing effective VC dimension) is essential for high-dimensional text classification.

Q3: State the Fundamental Theorem of Statistical Learning and explain its significance.

The Fundamental Theorem states: a binary hypothesis class H\mathcal{H} is PAC learnable if and only if it has finite VC dimension. Moreover, the sample complexity is Θ(dVC/ϵ2)\Theta(d_{VC}/\epsilon^2) in the agnostic case (up to log factors in δ\delta). Significance: (1) Complete characterization of learnability - you don't need to check infinitely many distributions; VC dimension is the single number that determines whether learning is possible; (2) Optimal sample complexity - ERM achieves the optimal bound O(dVC/ϵ2)O(d_{VC}/\epsilon^2); no algorithm can do better; (3) Infinite classes are learnable - the class of all linear classifiers in R1000\mathbb{R}^{1000} is infinite but has VCdim = 1001, so it is PAC learnable with enough data; (4) Impossibility results - the class of all functions f:X{0,1}f: \mathcal{X} \to \{0,1\} has infinite VC dimension (it can shatter any finite set), hence is NOT PAC learnable - there is no algorithm that guarantees generalization without structural assumptions on the hypothesis class. Limitations: The theorem is for binary classification; extensions to multiclass and regression require different complexity measures (Natarajan dimension, fat-shattering dimension). VC bounds are worst-case and often loose for practical deep learning.

Q4: How does the margin of an SVM relate to its generalization, and why does this explain SVM success in high-dimensional spaces?

Classical VC theory says a linear classifier in Rd\mathbb{R}^d has VC dimension d+1d+1 - bad news for high-dimensional data. But SVMs optimize for the margin γ\gamma - the geometric distance from the decision boundary to the nearest training points. The fat-shattering dimension of a linear classifier with margin γ\gamma is O(B2R2/γ2)O(B^2R^2/\gamma^2) where B=1/γB = 1/\gamma (SVM weight norm) and R=maxxiR = \max\|x_i\| (data radius). Substituting: fat-shattering dimension =O(R2/γ2)= O(R^2/\gamma^2) - dimension-free! The SVM generalization bound becomes O(R/(γn))O(R/(\gamma\sqrt{n})), independent of input dimension dd. For text classification with 50,000 TF-IDF features: classical VC bound gives O(50000/n)O(\sqrt{50000/n}) - requires millions of examples. Margin-based SVM bound gives O(R/(γn))O(R/(\gamma\sqrt{n})) - if the data is linearly separable with a good margin, even 10,000 examples may suffice. The practical implication: maximize the SVM margin (maximize γ\gamma) by (1) proper feature normalization (reduces RR), (2) soft-margin SVM with regularization (increases γ\gamma by ignoring outliers), and (3) using an RBF kernel to map to a space where a large margin exists.

Q5: Give an example of a hypothesis class with infinite VC dimension. What does this imply theoretically and practically?

The class of sinusoid classifiers hω(x)=sign(sin(ωx))h_\omega(x) = \text{sign}(\sin(\omega x)) for ωR\omega \in \mathbb{R} has infinite VC dimension. For any mm points 0<x1<x2<<xm0 < x_1 < x_2 < \cdots < x_m and any labeling, we can choose ω\omega so that the sinusoid oscillates with period small enough to separate consecutive points - with the sign in each interval matching the desired label. More concretely: choose ω\omega such that ωximod2π\omega x_i \mod 2\pi puts each xix_i in the positive half (for label +1) or negative half (for label -1) of the sine. By the Fundamental Theorem, infinite VC dimension means the class is not PAC learnable - no algorithm can guarantee generalization on this class without additional assumptions, regardless of how much data is collected. Practical implications: kk-nearest neighbor (for any kk) with k=1k=1: can memorize any training set, VCdim = \infty, and generalizes only through implicit smoothness of natural data (not guaranteed). Decision stumps with unbounded precision: if thresholds are real-valued, the class has VCdim = \infty; in practice, data has finite precision, effectively limiting VC dimension. Neural networks with the special activation sign(sin(ωx))\text{sign}(\sin(\omega x)): infinite VC dimension - this "Kolmogorov-Arnold" type activation can represent any function, which is why practical neural networks use bounded activations (ReLU, sigmoid) and weight norm constraints. The resolution in practice: we never use hypothesis classes with truly infinite VC dimension - we always add implicit or explicit regularization that restricts to a finite-VC-dimension subset.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the VC Dimension demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.