Skip to main content

Universal Approximation Theorem

The Real Interview Moment

Your team architect reviews your model design proposal. You have specified a 4-layer network with 512 units per layer. He asks: "Why 4 layers and 512 units? The parameter count is roughly the same as a 2-layer network with 2048 units. Have you compared them? What is your justification for the deeper architecture?"

You have tried both and the 4-layer version performs better, but "I tried it and it worked" is not an answer that survives a design review with a demanding architect. You need a principled justification grounded in theory.

The Universal Approximation Theorem and its extensions - particularly the depth separation results - give you that foundation. This lesson covers the UAT rigorously, explains what it actually guarantees and what it conspicuously does not, connects the theory to the NTK and Lottery Ticket Hypothesis, and grounds everything in empirical demonstrations. By the end, you can justify architectural choices with the same rigor as any published paper.

Why This Exists: The Need for Theoretical Foundations

Before the UAT, neural networks were justified entirely by analogy to biological neurons and by empirical observation. This was unsatisfying: the perceptron's limitations had been proved mathematically (Minsky and Papert, 1969), and without theoretical justification for multi-layer networks, the field could not rule out the possibility that deeper architectures had equally fundamental limitations.

The UAT changed this. It provided a rigorous existence proof: for any function you want to approximate, a neural network of sufficient size can approximate it to arbitrary precision. This gave theoretical legitimacy to the entire enterprise of deep learning. It did not tell practitioners how to build or train such networks - but it ruled out a class of impossibility results.

The subsequent depth separation results (Telgarsky, 2015/2016) added the missing piece: shallow networks are not just less efficient than deep ones - for important function classes, they are exponentially less efficient. This validates depth as a fundamental architectural principle, not just an engineering convenience.

The Cybenko Theorem (1989)

Formal statement (Cybenko 1989):

Let σ\sigma be any continuous sigmoidal function - monotone increasing, bounded, with σ(x)0\sigma(x) \to 0 as xx \to -\infty and σ(x)1\sigma(x) \to 1 as x+x \to +\infty. Let Id=[0,1]dI_d = [0,1]^d be the dd-dimensional unit hypercube.

For every continuous function f:IdRf: I_d \to \mathbb{R} and every ϵ>0\epsilon > 0, there exist NNN \in \mathbb{N} and parameters {(wi,bi,ci)}i=1N\{(\mathbf{w}_i, b_i, c_i)\}_{i=1}^N with wiRd\mathbf{w}_i \in \mathbb{R}^d, bi,ciRb_i, c_i \in \mathbb{R} such that:

f(x)i=1Nciσ ⁣(wiTx+bi)<ϵxId\left| f(\mathbf{x}) - \sum_{i=1}^{N} c_i \, \sigma\!\left(\mathbf{w}_i^T \mathbf{x} + b_i\right) \right| < \epsilon \quad \forall \mathbf{x} \in I_d

In plain terms: any continuous function on a compact domain can be approximated to arbitrary accuracy by a single hidden layer neural network with a sigmoidal activation, given sufficiently many neurons.

Proof Sketch: Stone-Weierstrass Approach

The proof uses functional analysis rather than constructive arguments. Key steps:

  1. The set of functions F=span{σ(wTx+b):wRd,bR}\mathcal{F} = \text{span}\{\sigma(\mathbf{w}^T\mathbf{x} + b) : \mathbf{w} \in \mathbb{R}^d, b \in \mathbb{R}\} is a linear subspace of the space C(Id)C(I_d) of continuous functions on the cube
  2. By the Hahn-Banach theorem, if F\mathcal{F} is not dense in C(Id)C(I_d), there exists a nonzero bounded linear functional μ\mu on C(Id)C(I_d) that vanishes on F\mathcal{F} - i.e., σ(wTx+b)dμ(x)=0\int \sigma(\mathbf{w}^T\mathbf{x} + b) d\mu(\mathbf{x}) = 0 for all w,b\mathbf{w}, b
  3. By a discriminatory property of sigmoidal functions, such a μ\mu must be zero - contradiction
  4. Therefore F\mathcal{F} is dense in C(Id)C(I_d) - any continuous function can be approximated

The proof is an existence argument. It says a good approximation exists but gives no algorithm to find it and no bound on how large NN must be.

Hornik's Generalization (1991)

Kurt Hornik generalized Cybenko's result:

Theorem (Hornik 1991): Any continuous, non-constant, bounded, and monotone-increasing function σ\sigma can serve as the activation function for a universal approximator. More generally, any continuous non-constant activation function that is not a polynomial can be used.

This immediately covers tanh, softplus, and essentially any practically useful activation function. Hornik also showed:

  • Multi-layer networks are also universal approximators (not just single-hidden-layer)
  • The result extends to output functions that are not necessarily bounded (regression, not just classification)
  • Approximation holds in LpL^p norm (integrals over probability measures), not just supremum norm

Leshno's Necessary and Sufficient Condition (1993)

Leshno et al. (1993) established the exact condition:

Theorem: A function σ\sigma can be used to build a universal approximator if and only if σ\sigma is not a polynomial (on any open interval).

This is both necessary and sufficient. Every polynomial activation function fails to be a universal approximator. Every non-polynomial activation function (sigmoid, tanh, ReLU, GELU, any smooth non-polynomial) succeeds.

Why polynomials fail: a network with polynomial activations can only represent polynomial functions - the composition of polynomials is a polynomial, and polynomial functions form a proper closed subset of all continuous functions. Any non-polynomial activation breaks out of this limitation.

What the UAT Guarantees vs. What It Does Not

This is the most important practical distinction:

What UAT does guaranteeWhat UAT does not guarantee
An approximating network existsHow to find it (no training algorithm)
Any continuous ff on compact domain approximableHow many neurons are needed (could be exponential)
Approximation to any ϵ>0\epsilon > 0Generalization to unseen data
One hidden layer is sufficientOne hidden layer is efficient
Any non-polynomial activation worksBetter generalizable approximation
Result holds on any compact domainResult on unbounded or discontinuous functions

:::warning The UAT is Frequently Misquoted The UAT is an existence proof, not a constructive or practical result. "Neural networks can approximate any function" does not mean "a trained neural network will approximate the target function." Finding the right approximating network through gradient descent on finite data is an entirely separate problem - and one the UAT says nothing about. The theorem proves that the right network exists in the parameter space; gradient descent may or may not find it. :::

NumPy Demonstration: Approximation Quality vs Width

The theorem predicts: more neurons → better approximation (given sufficient training). Let us verify:

import numpy as np
from scipy.special import erf


def sigmoid(x):
return 1 / (1 + np.exp(-np.clip(x, -500, 500)))


def single_layer_mlp_forward(X, W1, b1, W2, b2):
"""Single hidden layer network: input -> sigmoid -> output."""
H = sigmoid(X @ W1.T + b1)
return H @ W2.T + b2


def train_1d_approximator(target_fn, hidden_size: int,
n_points: int = 1000, n_epochs: int = 5000,
lr: float = 1e-2, seed: int = 42) -> tuple:
"""
Train a single hidden layer network to approximate target_fn on [-3, 3].
Demonstrates UAT: more neurons → smaller approximation error.
"""
rng = np.random.default_rng(seed)

x_train = np.linspace(-3, 3, n_points).reshape(-1, 1)
y_train = target_fn(x_train)

# Initialize weights (Xavier for sigmoid)
fan_in, fan_out = 1, hidden_size
std = np.sqrt(2.0 / (fan_in + fan_out))
W1 = rng.standard_normal((hidden_size, 1)) * std
b1 = np.zeros(hidden_size)
W2 = rng.standard_normal((1, hidden_size)) * std
b2 = np.zeros(1)

for epoch in range(n_epochs):
# Forward
Z1 = x_train @ W1.T + b1 # (N, hidden_size)
H1 = sigmoid(Z1) # (N, hidden_size)
Z2 = H1 @ W2.T + b2 # (N, 1)
y_pred = Z2

# MSE loss
diff = y_pred - y_train
loss = np.mean(diff ** 2)

# Backward
d_Z2 = 2 * diff / n_points # (N, 1)
d_W2 = d_Z2.T @ H1 # (1, hidden_size)
d_b2 = d_Z2.sum(axis=0) # (1,)

d_H1 = d_Z2 @ W2 # (N, hidden_size)
d_Z1 = d_H1 * H1 * (1 - H1) # sigmoid gradient (N, hidden_size)
d_W1 = d_Z1.T @ x_train # (hidden_size, 1)
d_b1 = d_Z1.sum(axis=0) # (hidden_size,)

# Update
W1 -= lr * d_W1
b1 -= lr * d_b1
W2 -= lr * d_W2
b2 -= lr * d_b2

# Final predictions and MSE
H1_final = sigmoid(x_train @ W1.T + b1)
y_pred_final = H1_final @ W2.T + b2
final_mse = np.mean((y_pred_final - y_train) ** 2)
n_params = hidden_size * 1 + hidden_size + 1 * hidden_size + 1

return y_pred_final.flatten(), final_mse, n_params


def demonstrate_uat():
"""
Show that approximation error decreases as hidden size increases.
This is the UAT in action: more neurons → better approximation.
"""
# Target: sum of sinusoids (multi-frequency - requires many "kinks" to approximate)
def target(x):
return np.sin(x) + 0.5 * np.sin(3 * x) + 0.25 * np.sin(5 * x)

hidden_sizes = [4, 8, 16, 32, 64, 128, 256]

print(f"{'Hidden Size':>12} | {'MSE':>12} | {'Parameters':>12} | {'MSE/param':>12}")
print("-" * 55)

for hidden_size in hidden_sizes:
_, mse, n_params = train_1d_approximator(target, hidden_size, n_epochs=3000)
print(f"{hidden_size:>12} | {mse:>12.8f} | {n_params:>12} | {mse/n_params:>12.2e}")

# Key observation: MSE decreases dramatically with hidden size
# This is the universal approximation theorem empirically demonstrated

demonstrate_uat()

Depth Separation: The Theoretical Case for Deep Networks

The UAT guarantees that wide-shallow networks are sufficient. Depth separation results prove that shallow networks are sometimes exponentially inefficient.

Telgarsky's Depth Separation (2015, 2016)

Theorem (Telgarsky 2016): For any k1k \geq 1, there exists a function ff (specifically, a sawtooth function with 2k2^k peaks) such that:

  • A ReLU network with O(k)O(k) layers and O(1)O(1) width per layer can compute ff exactly
  • Any ReLU network with fewer than k/2k/2 layers requires at least 2k/42^{k/4} total units to approximate ff with error less than 1/81/8

The construction: the key function is the triangle wave iterated kk times. One ReLU layer can double the frequency of an oscillating function. A kk-layer network can compute 2k2^k oscillations. A 2-layer network needs 2k2^k neurons to match those oscillations.

The intuition - compositional reuse: real data has hierarchical structure. An image detector for wheels needs to detect circles, which needs to detect curves, which needs to detect edges. A depth-kk network reuses each level of representation in all subsequent layers. A shallow network must redundantly compute every combination from scratch.

Consider a simple example: computing x2kx^{2^k} via iterated squaring.

  • Depth-kk network: z1=x2z_1 = x^2, z2=z12=x4z_2 = z_1^2 = x^4, ..., zk=x2kz_k = x^{2^k}. Each layer squares the previous. kk layers, kk neurons.
  • Depth-1 network: must compute x2kx^{2^k} directly from xx in one polynomial layer - requires 2k2^k distinct neurons to represent this polynomial.
import numpy as np


def demonstrate_depth_efficiency():
"""
Empirically compare depth vs width at equal parameter budgets.
On a compositional task (iterated function application), depth wins.
"""

def deep_vs_shallow_on_sawtooth(k: int = 5):
"""
Approximate a sawtooth function with k iterations (2^k oscillations).
Deep: k layers, shallow: 1 hidden layer with exponentially more neurons.
"""
# The target: triangle wave iterated k times
# After k applications: 2^k oscillations in [0, 1]
def triangle(x):
return 2 * np.abs(x - np.floor(x + 0.5))

def iterated_triangle(x, k):
for _ in range(k):
x = triangle(x)
return x

x = np.linspace(0, 1, 10000)
y_target = iterated_triangle(x, k)

print(f"\nSawtooth with k={k} iterations: {2**k} oscillations")
print(f"Deep network (k layers, 4 wide): ~{k * 4} parameters")
print(f"Shallow network (1 hidden): needs ~{2**k} neurons, ~{2**k} parameters")
print(f"Parameter ratio: {2**k / (k*4):.1f}x more for equivalent approximation")

for k in [3, 5, 7, 10]:
demonstrate_depth_efficiency_k = k
deep_params = k * 4 # k layers, 4 neurons each
shallow_params = 2**k # exponentially many for 1 layer
print(f"k={k}: deep≈{deep_params} params, shallow≈{shallow_params} params "
f"({shallow_params//deep_params}x ratio)")


demonstrate_depth_efficiency()
# k=3: deep≈12 params, shallow≈8 params (0.7x ratio - depth doesn't help for small k)
# k=5: deep≈20 params, shallow≈32 params (1.6x ratio)
# k=7: deep≈28 params, shallow≈128 params (4.6x ratio)
# k=10: deep≈40 params, shallow≈1024 params (25.6x ratio - depth wins dramatically)

Eldan and Shamir (2016): Radial Functions

Eldan and Shamir showed a concrete 3-layer network that approximates a radial function (depends only on x\|\mathbf{x}\|) which any 2-layer network requires exponentially many neurons to match. The gap between 2 layers and 3 layers can be exponential - not just between deep and shallow.

The Linear Regions Argument

For ReLU networks, expressivity can be measured by the number of linear regions the network carves out of input space:

  • Single hidden layer, nn neurons, dd-dimensional input: at most k=0d(nk)\sum_{k=0}^{d} \binom{n}{k} regions - polynomial in nn, with degree dd
  • LL-layer network, width nn: up to O((nd)L1)O\left(\binom{n}{d}^{L-1}\right) regions - exponential in LL

More linear regions = ability to represent more complex decision boundaries. A deep network creates exponentially more linear regions per parameter than a wide-shallow network.

Barron's Theorem (1993): Avoiding the Curse of Dimensionality

Barron's theorem characterizes exactly which functions can be efficiently approximated by neural networks, regardless of input dimension.

Setup: define the spectral norm of a function f:RdRf: \mathbb{R}^d \to \mathbb{R}:

Cf=Rdω2f^(ω)dωC_f = \int_{\mathbb{R}^d} \|\boldsymbol{\omega}\|_2 |\hat{f}(\boldsymbol{\omega})| \, d\boldsymbol{\omega}

Where f^\hat{f} is the Fourier transform of ff and CfC_f measures how concentrated the function's Fourier energy is at high frequencies.

Theorem (Barron 1993): For any ff with Cf<C_f < \infty, a single hidden layer network with nn neurons can approximate ff with integrated squared error:

f(x)fn(x)2dμ(x)(2Cf)2n\int |f(\mathbf{x}) - f_n(\mathbf{x})|^2 d\mu(\mathbf{x}) \leq \frac{(2 C_f)^2}{n}

The approximation error is O(1/n)O(1/n) regardless of input dimension dd.

Why this is profound: without neural networks, approximating a function of dd variables typically requires exponentially many basis functions (the curse of dimensionality). Barron's theorem says that for functions with bounded Fourier spectrum, neural networks avoid this curse - the number of neurons needed depends only on the desired accuracy and CfC_f, not on dd.

The catch: Cf<C_f < \infty is not free. Functions with bounded spectral norm are "smooth" in a specific sense - they do not have arbitrarily high-frequency components. Many physically meaningful functions (smooth physical processes, natural images at coarse scale) satisfy this. Highly discontinuous or fractal functions may not.

Neural Tangent Kernel: Infinite-Width Networks

Jacot, Gabriel, and Hongler (2018) proved a remarkable result about infinitely-wide networks:

Theorem (NTK, Jacot et al. 2018): In the limit of infinite width, a neural network trained with gradient descent converges to a linear model in the feature space defined by the Neural Tangent Kernel:

K(x,x)=Eθinit[θf(x;θ)θf(x;θ)]K(\mathbf{x}, \mathbf{x}') = \mathbb{E}_{\theta \sim \text{init}}\left[\nabla_\theta f(\mathbf{x}; \theta) \cdot \nabla_\theta f(\mathbf{x}'; \theta)\right]

Key implications:

  1. An infinitely-wide network trained with small learning rate behaves like kernel ridge regression with the NTK kernel
  2. The kernel is determined by the architecture and initialization, not the training data
  3. Gradient flow dynamics are linear in infinite width - the Jacobian does not change during training
  4. The NTK kernel can be computed analytically for standard architectures

Why this matters practically: NTK theory explains why very wide networks (with appropriate initialization and learning rate scaling) are easy to train - they behave like convex kernel methods. It also explains a limitation: infinitely-wide networks cannot "feature learn" - the kernel is fixed at initialization. Finite-width networks that feature learn (kernels change during training) are more powerful but harder to analyze.

The mean-field regime: at finite but large width, there is a regime where the network is neither infinitely wide (NTK regime) nor so narrow that individual neuron dynamics matter. This regime is called the mean-field or "feature learning" regime and is where practical large models like GPT-4 likely operate.

Lottery Ticket Hypothesis (Frankle and Carlin, 2019)

Hypothesis: dense, randomly-initialized neural networks contain sparse subnetworks (winning tickets) that, when trained from scratch with the same initialization, can match the full network's performance in comparable training time.

Formal statement: for a network f(x;θ)f(\mathbf{x}; \theta) with initial weights θ0\theta_0, there exists a sparse mask mm (with mθ0|m| \ll |\theta_0|) such that the subnetwork f(x;mθ0)f(\mathbf{x}; m \odot \theta_0) can be trained to the same accuracy as the full network in the same or fewer iterations.

Finding winning tickets: iterative magnitude pruning.

  1. Train the full network for nn iterations
  2. Prune the p%p\% of weights with smallest magnitude
  3. Reset remaining weights to their initialization values θ0\theta_0 (not the final trained values)
  4. Repeat from step 1 with the pruned network

The "reset to initialization" step is critical - the winning ticket must be trained from the specific initialization θ0\theta_0, not from a random new initialization.

Implications for the UAT: the Lottery Ticket Hypothesis suggests that large networks are not just "expressive" in the UAT sense - they are expressive in a way that includes many sparse, efficiently trainable subnetworks. The role of overparameterization may be to ensure that good sparse subnetworks exist with high probability.

import torch
import torch.nn as nn
import copy


def find_winning_ticket(model: nn.Module, train_loader, criterion,
n_rounds: int = 5, prune_pct: float = 0.2,
n_epochs: int = 10) -> tuple:
"""
Iterative magnitude pruning to find the winning ticket.
Returns the sparse mask and initial weights.
"""
device = next(model.parameters()).device
initial_weights = copy.deepcopy(model.state_dict())

# Initialize mask: all ones (no pruning yet)
masks = {}
for name, param in model.named_parameters():
if 'weight' in name:
masks[name] = torch.ones_like(param.data)

def apply_mask():
for name, param in model.named_parameters():
if name in masks:
param.data *= masks[name]

def get_sparse_fraction():
total = sum(m.numel() for m in masks.values())
remaining = sum(m.sum().item() for m in masks.values())
return 1 - remaining / total

print(f"Initial density: 100%")

for round_idx in range(n_rounds):
# Restore initial weights * current mask
model.load_state_dict(initial_weights)
apply_mask()

# Train for n_epochs
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
model.train()
for epoch in range(n_epochs):
for x, y in train_loader:
x, y = x.to(device), y.to(device)
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
apply_mask() # zero out pruned weights' gradients
optimizer.step()

# Prune: remove prune_pct of remaining weights by magnitude
all_weights = []
for name, param in model.named_parameters():
if name in masks:
# Only consider currently active (unmasked) weights
active_weights = param.data[masks[name].bool()].abs()
all_weights.append(active_weights.view(-1))

all_weights_flat = torch.cat(all_weights)
threshold = all_weights_flat.quantile(prune_pct)

for name, param in model.named_parameters():
if name in masks:
masks[name] = (param.data.abs() >= threshold).float() * masks[name]

sparsity = get_sparse_fraction()
print(f"Round {round_idx+1}: pruned to {100*(1-sparsity):.1f}% density")

# Restore the winning ticket: initial weights * final mask
model.load_state_dict(initial_weights)
apply_mask()

return masks, initial_weights

Double Descent: Beyond the Bias-Variance Tradeoff

The classical bias-variance tradeoff says: increasing model capacity reduces bias but increases variance. The optimal model lies at the sweet spot. For overparameterized models (parameters > training examples), the story changes dramatically.

Double descent phenomenon (Belkin et al., 2019; Nakkiran et al., 2020): as model size increases from zero to very large, the test error undergoes two descents:

  1. Classical regime: error decreases as model gets larger (more capacity)
  2. Interpolation threshold: model is just large enough to fit training data exactly - error spikes to its maximum
  3. Overparameterized regime: as the model grows beyond the threshold, test error decreases again, eventually reaching very low values
Test Error
|
| *
| * * ← classical regime: bias-variance tradeoff
| * *
| * ← interpolation threshold (peak error)
| **
| ***
| ***** ← double descent: overparameterized regime
+-----------------------------> Model Size / Parameters
Under- Interp. Over-param.
parameterized

Why overparameterized models can generalize: in the overparameterized regime, there are many different parameter settings that fit the training data exactly. Gradient descent (with appropriate inductive bias) finds the minimum-norm solution - the simplest model that fits the training data. For smooth target functions, the minimum-norm interpolating solution has low complexity and generalizes well.

Implications: the bias-variance tradeoff does not predict the behavior of large modern neural networks. GPT-4, LLaMA, and similar models have billions of parameters and vastly more parameters than training examples in any fine-tuning setup. They are firmly in the overparameterized regime where classical theory does not apply. Double descent explains why they generalize despite massive overparameterization.

Kolmogorov's Superposition Theorem: Connection and Limitation

Kolmogorov's superposition theorem (1957) predates neural networks and seems to offer an even stronger result:

Theorem: Any continuous function f:[0,1]nRf: [0,1]^n \to \mathbb{R} can be written as:

f(x)=q=02nΦq(p=1nψq,p(xp))f(\mathbf{x}) = \sum_{q=0}^{2n} \Phi_q\left(\sum_{p=1}^{n} \psi_{q,p}(x_p)\right)

Where ψq,p\psi_{q,p} are fixed (non-learnable) functions and Φq\Phi_q are learnable univariate functions.

Why this is not practically useful: the functions ψq,p\psi_{q,p} that make the theorem work are pathological - they are typically not smooth, not computable, and their construction depends on ff in an opaque way. The theorem shows representation exists but the representation is not learnable by gradient descent. It is a mathematical curiosity rather than an engineering tool.

Neural network universality is practically useful because the parameters (wi,bi,ci\mathbf{w}_i, b_i, c_i) are learnable via gradient descent on data. Kolmogorov's theorem does not have this property.

NumPy: Depth vs Width Empirical Comparison

import numpy as np


def relu(x):
return np.maximum(0, x)


def mlp_forward(x, layers):
"""Forward pass through an MLP defined as list of (W, b) tuples."""
a = x
for i, (W, b) in enumerate(layers):
z = a @ W.T + b
if i < len(layers) - 1:
a = relu(z)
else:
a = z # no activation on output
return a


def train_mlp(layers, x_train, y_train, n_epochs=2000, lr=0.01):
"""Simple gradient descent training for a 1D regression task."""
for epoch in range(n_epochs):
# Forward
activations = [x_train]
pre_acts = []
a = x_train
for W, b in layers:
z = a @ W.T + b
pre_acts.append(z)
if len(activations) < len(layers): # hidden layers
a = relu(z)
else:
a = z
activations.append(a)

# Backward (simplified: only last layer update for demonstration)
y_pred = activations[-1]
loss = np.mean((y_pred - y_train) ** 2)

# Only update last layer for this simplified demo
delta = 2 * (y_pred - y_train) / len(y_train)
W_last, b_last = layers[-1]
dW = delta.T @ activations[-2]
db = delta.sum(axis=0)

layers[-1] = (W_last - lr * dW, b_last - lr * db)

# Return final loss
final_pred = mlp_forward(x_train, layers)
return np.mean((final_pred - y_train) ** 2)


def compare_depth_vs_width():
"""
Compare networks with similar parameter counts:
- Shallow wide: 1 hidden layer with many neurons
- Deep narrow: multiple hidden layers with fewer neurons
Both trained on the same regression task.
"""
rng = np.random.default_rng(42)

# Task: approximate a highly oscillatory function (requires many "kinks")
x = np.linspace(-3, 3, 500).reshape(-1, 1)
y = np.sin(x) * np.cos(2*x) * np.exp(-x**2 / 4) # complex, oscillatory

def count_params(layers):
return sum(W.size + b.size for W, b in layers)

def init_layers(dims, seed=42):
rng = np.random.default_rng(seed)
layers = []
for i in range(len(dims)-1):
std = np.sqrt(2.0 / dims[i])
W = rng.standard_normal((dims[i+1], dims[i])) * std
b = np.zeros(dims[i+1])
layers.append((W, b))
return layers

# Comparable parameter budgets: ~1000 parameters each
architectures = {
"1 hidden (width=60)": [1, 60, 1],
"2 hidden (width=20)": [1, 20, 20, 1],
"3 hidden (width=12)": [1, 12, 12, 12, 1],
"5 hidden (width=8)": [1, 8, 8, 8, 8, 8, 1],
"8 hidden (width=5)": [1, 5, 5, 5, 5, 5, 5, 5, 5, 1],
}

print(f"{'Architecture':<25} | {'Params':>8} | {'Final MSE':>12}")
print("-" * 52)

for name, dims in architectures.items():
layers = init_layers(dims)
n_params = count_params(layers)
mse = train_mlp(layers, x, y, n_epochs=3000, lr=0.005)
print(f"{name:<25} | {n_params:>8} | {mse:>12.6f}")


compare_depth_vs_width()

The Role of Non-Linearity: Why Linear Depth Fails

import torch
import torch.nn as nn


class LinearOnlyMLP(nn.Module):
"""MLP with no activation functions - depth provides no benefit."""
def __init__(self, depth: int, width: int):
super().__init__()
self.layers = nn.ModuleList([nn.Linear(width, width) for _ in range(depth)])

def forward(self, x):
for layer in self.layers:
x = layer(x) # no activation
return x


class NonlinearMLP(nn.Module):
"""MLP with ReLU activations - depth provides exponential benefit."""
def __init__(self, depth: int, width: int):
super().__init__()
self.layers = nn.ModuleList([nn.Linear(width, width) for _ in range(depth)])

def forward(self, x):
for i, layer in enumerate(self.layers):
x = layer(x)
if i < len(self.layers) - 1:
x = torch.relu(x)
return x


def demonstrate_linear_depth_collapse():
"""
Show that a linear deep network is equivalent to a linear shallow network.
The effective rank and capacity do not increase with depth without activation.
"""
width = 64

linear_deep = LinearOnlyMLP(depth=20, width=width)
linear_shallow = LinearOnlyMLP(depth=1, width=width)
nonlinear_deep = NonlinearMLP(depth=20, width=width)

x = torch.randn(100, width)

with torch.no_grad():
out_linear_deep = linear_deep(x)
out_linear_shallow = linear_shallow(x)
out_nonlinear = nonlinear_deep(x)

# Linear deep: output lies in at most 64-dimensional subspace (same as shallow)
# Rank of output matrix - should be same for deep and shallow linear
rank_deep = torch.linalg.matrix_rank(out_linear_deep).item()
rank_shallow = torch.linalg.matrix_rank(out_linear_shallow).item()
rank_nonlinear = torch.linalg.matrix_rank(out_nonlinear).item()

print(f"Linear 20-layer network output rank: {rank_deep}")
print(f"Linear 1-layer network output rank: {rank_shallow}")
print(f"ReLU 20-layer network output rank: {rank_nonlinear}")
print(f"\nConclusion: linear depth doesn't increase expressivity.")
print(f"ReLU depth adds genuine nonlinearity and higher effective rank.")


demonstrate_linear_depth_collapse()

Answering the Architect's Question

Back to the production scenario. Your 4 × 512 network vs a 2 × 2048 network:

Why depth wins for compositional tasks:

  1. Exponentially more linear regions: a 4-layer ReLU network creates O((512d)3)O(\binom{512}{d}^3) linear regions - exponentially more than a 2-layer network with O((2048d))O(\binom{2048}{d}) regions. More regions = more complex decision boundaries per parameter.

  2. Hierarchical feature reuse: a 4-layer network can compute layer-1 features (edges) and reuse them in all subsequent layers. A 2-layer network must compute every needed combination directly from inputs - redundant and expensive.

  3. Telgarsky's guarantee: for the specific class of compositional functions that characterize image or text data, depth-4 can represent functions that depth-2 requires exponential width to match.

  4. Empirical consistency: on CIFAR-10, ImageNet, and NLP benchmarks, deeper networks consistently outperform matched-parameter shallow networks on compositional tasks.

The principled answer: "Depth separation results show that for compositional functions - which is what this dataset represents - deep networks can represent the target function with exponentially fewer parameters than shallow networks. Empirically, 4-layer networks consistently outperform 2-layer networks at matched parameter count on similar tasks. The 4×512 architecture encodes the right inductive bias for hierarchical feature composition."

Architecture Selection Guide

:::danger The UAT Does Not Guarantee Learnability The UAT proves existence of an approximating network. Existence does not guarantee learnability. Three additional challenges the UAT ignores: (1) Optimization: gradient descent may converge to a local minimum, saddle point, or never converge for the correct network size. The loss landscape for a UAT-sufficient network may be highly non-convex. (2) Generalization: the approximating network that minimizes training loss may overfit. The UAT gives no bound on generalization gap. (3) Sample complexity: the correct approximating network may require exponentially many neurons, and learning it may require exponentially many training examples. The UAT is a theoretical foundation, not a practical guarantee. :::

YouTube Resources

VideoChannelWhy Watch It
Universal Approximation Theorem3Blue1BrownVisual proof sketch and intuition for function approximation
Depth vs Width - Why Deep Learning WorksYannic KilcherTelgarsky depth separation paper explained
Neural Tangent KernelYannic KilcherNTK paper walkthrough - infinite width behavior
Lottery Ticket HypothesisYannic KilcherFrankle and Carlin 2019 explained with iterative pruning
Double DescentStanford CS229Modern understanding of generalization in overparameterized models

Interview Q&A

Q1: State the Universal Approximation Theorem precisely. What does it actually guarantee, and what does it not guarantee?

The UAT (Cybenko 1989, Hornik 1991) states: for any continuous function ff on a compact domain IdRdI_d \subset \mathbb{R}^d and any ϵ>0\epsilon > 0, there exists a single hidden layer neural network with a continuous non-polynomial activation function and finitely many neurons NN such that the network approximates ff to within ϵ\epsilon uniformly on IdI_d. What it guarantees: an approximating network exists; approximation to any desired accuracy is possible; any non-polynomial activation works. What it does not guarantee: how large NN must be (could be astronomical); how to find the approximating network through optimization; whether gradient descent will converge to it; whether the network will generalize to unseen data; whether the approximation is efficient.

Q2: The UAT proves that one hidden layer is sufficient. Why do we use deep networks in practice?

"Sufficient" does not mean "efficient." Telgarsky's depth separation results (2015/2016) prove that for specific function classes - particularly compositional, hierarchical functions like those in images and language - a shallow network requires exponentially more neurons than a deep network to achieve the same approximation accuracy. For a kk-layer iterated sawtooth function, a depth-kk network with O(1)O(1) neurons per layer computes it exactly, while a 2-layer network requires Ω(2k)\Omega(2^k) neurons. The exponential parameter savings from depth translate directly to better generalization from finite data: with fewer parameters needed for the same expressivity, the effective hypothesis space is smaller and overfitting is reduced. Additionally, deep networks create exponentially more linear regions per parameter (more complex decision boundaries), and their hierarchical computation matches the hierarchical structure of natural data.

Q3: What does the Neural Tangent Kernel tell us about infinitely-wide networks?

The NTK (Jacot et al., 2018) shows that infinitely-wide networks trained with gradient descent converge to linear models in the function space defined by the NTK kernel K(x,x)=Eθ0[θf(x;θ0)θf(x;θ0)]K(\mathbf{x}, \mathbf{x}') = \mathbb{E}_{\theta_0}[\nabla_\theta f(\mathbf{x}; \theta_0) \cdot \nabla_\theta f(\mathbf{x}'; \theta_0)]. In the infinite-width limit: (1) the Jacobian does not change during training - gradient flow is linear; (2) training is equivalent to kernel regression with the NTK; (3) the network cannot "feature learn" - the effective kernel is fixed at initialization. Practical implications: very wide networks with small learning rate behave like kernel methods - easy to analyze and guaranteed to converge to the global minimum, but less powerful than finite-width networks that feature learn. Large modern models (GPT-4, LLaMA) operate in a different regime - finite width, large learning rate - where feature learning is essential.

Q4: Explain the Lottery Ticket Hypothesis. What does it imply about overparameterization?

The Lottery Ticket Hypothesis (Frankle and Carlin, 2019) states that large randomly-initialized networks contain sparse subnetworks (winning tickets) that, trained from their original initialization values, can match the performance of the full network. Finding winning tickets requires iterative magnitude pruning followed by weight reset to initialization values - not any new initialization. The hypothesis implies that overparameterization serves a specific purpose: it ensures that good sparse subnetworks exist with high probability. With random initialization, the probability of any single sparse subnetwork being a winning ticket is low - but in a large network, many sparse subnetworks are sampled simultaneously, and at least one is likely to be a winning ticket. Larger networks sample more subnetworks and are therefore more likely to contain good ones. This reframes overparameterization from "waste" to "search": we train a large network to find the subnetwork (winning ticket), then the subnetwork does the actual learning.

Q5: What is double descent and why does it break the classical bias-variance tradeoff narrative?

The classical bias-variance tradeoff predicts a U-shaped test error curve: as model size increases, test error decreases (bias decreases) then increases (variance increases) after an optimal point. Double descent shows a second descent beyond the classical U-shaped curve. As model size crosses the interpolation threshold (parameters = training examples), test error spikes to its maximum. As model size continues to grow, test error decreases again, eventually surpassing the classical optimum. The mechanism: at the interpolation threshold, there is essentially one way to fit the training data, and that solution may not generalize. In the overparameterized regime, there are many ways to fit the training data, and gradient descent finds the minimum-norm solution - the simplest one - which for smooth target functions generalizes well. This explains why GPT-4 with billions of parameters generalizes despite having more parameters than training examples in any fine-tuning context. The bias-variance tradeoff is a special case that applies only to the underparameterized regime, not to modern large neural networks.

Q6: A colleague argues: "The UAT proves any function can be approximated, so width and depth don't matter - just add more neurons." How do you respond?

This argument conflates existence with efficiency and learnability. Three responses: (1) The UAT gives no bound on the width required. For compositional functions, the required width for a single-hidden-layer network may be exponential in the function's compositional depth (Telgarsky 2016). In practice, exponential width is computationally infeasible and impossible to train on finite data. (2) More neurons means more parameters, which can increase overfitting. The UAT says nothing about generalization - the approximating network may memorize training data rather than learning the underlying function. (3) Width and depth interact differently with gradient descent and the loss landscape. Deep networks with skip connections create more benign optimization landscapes with fewer saddle points. Shallow networks with many neurons often have harder-to-optimize loss landscapes. In virtually every empirical comparison on non-trivial tasks, depth outperforms matched-width shallow networks. The UAT is an asymptotic existence theorem; practical deep learning operates far from that asymptotic regime.

UAT and the Inductive Bias Argument

The UAT proves universality - any function can be approximated. But the practical question is: which functions does gradient descent actually learn from finite data? This is where architecture inductive biases dominate.

The three factors that determine what a network learns:

  1. Architecture (what it can represent): the UAT guarantees the representation exists, but does not say it will be found. The architecture's inductive biases determine which representations are easy to find.
  2. Optimization (how it trains): gradient descent with momentum on a finite-width network follows a specific trajectory through parameter space. The shape of the loss landscape - determined by both architecture and initialization - dictates what minima are reachable.
  3. Data (what it generalizes to): a network can approximate any function on training data. The question is which of the infinitely many consistent functions it learns when generalizing beyond training.

Inductive biases in common architectures:

ArchitectureInductive biasWhy it helps
CNNTranslation invariance, local connectivityNatural images have local structure
RNNSequential inductive biasLanguage and time series have order dependencies
TransformerPermutation equivariance, attentionLong-range dependencies in language and vision
GNNGraph structure invarianceMolecular and social graph tasks
MLPNo structural biasGeneral function approximator, but needs more data

When the inductive bias matches the data structure, the effective search space of functions is reduced - gradient descent finds the right answer faster with less data. When it mismatches (e.g., CNN for text), training still works but requires more examples to learn what a structurally-aware architecture learns immediately.

Approximation vs Generalization: The Critical Distinction

The UAT is a theorem about approximation - how well a network can fit a target function on its training domain. Generalization - how well it performs on new examples from the same distribution - is a separate question entirely.

Statistical learning theory framework:

The generalization gap is bounded by:

L(f^)population riskLn(f^)empirical riskO ⁣(VC(F)n)\underbrace{L(\hat{f})}_{\text{population risk}} - \underbrace{L_n(\hat{f})}_{\text{empirical risk}} \leq O\!\left(\sqrt{\frac{\text{VC}(F)}{n}}\right)

where VC(F)\text{VC}(F) is the Vapnik-Chervonenkis dimension of the function class FF (a measure of complexity) and nn is the number of training examples. For neural networks, the VC dimension scales approximately as O(WlogW)O(|W| \log |W|) where W|W| is the number of parameters.

The bound says: to guarantee good generalization with a network of complexity W|W|, you need at least O(WlogW)O(|W| \log |W|) training examples. However:

  1. Modern networks routinely generalize with far fewer examples than this bound requires
  2. Double descent shows the bound is not tight in the overparameterized regime
  3. Implicit regularization from gradient descent and architectural inductive biases fill the gap

The takeaway: the UAT gives the upper bound on what can be represented. Real performance depends on how much data you have, whether your architecture matches the data structure, and whether your optimizer finds a generalizable solution.

Practical Implications for Architecture Design

The theoretical results translate to concrete engineering decisions:

"""
Heuristics derived from UAT and depth separation theory.
"""


def estimate_min_layers_for_task(task_compositionality: str,
input_dim: int) -> dict:
"""
Rough guidance on minimum depth for different task types.
Based on depth separation results and empirical practice.
"""
recommendations = {
"tabular_regression": {
"min_layers": 2,
"reasoning": "Tabular data rarely has deep hierarchical structure. "
"3-5 hidden layers typically sufficient.",
"typical_width": max(64, input_dim * 2),
},
"tabular_classification": {
"min_layers": 3,
"reasoning": "Classification boundaries can be complex. "
"Depth adds decision boundary expressivity.",
"typical_width": max(128, input_dim * 4),
},
"image_classification": {
"min_layers": 10, # minimum meaningful CNN
"reasoning": "Images have hierarchical structure: pixels → edges → "
"shapes → parts → objects. Each layer adds one level.",
"typical_width": 64, # for CNN channels, not MLP
},
"sequence_modeling": {
"min_layers": 6, # transformer blocks
"reasoning": "Language has multi-level syntax + semantics. "
"Attention heads at different layers capture different phenomena.",
"typical_width": 512, # d_model for small transformers
},
"molecular_property": {
"min_layers": 4, # GNN layers
"reasoning": "Chemical properties depend on multi-hop neighborhoods. "
"Each GNN layer adds one hop of context.",
"typical_width": 256, # GNN hidden dim
},
}

if task_compositionality not in recommendations:
return {"error": f"Unknown task type: {task_compositionality}"}

rec = recommendations[task_compositionality]
print(f"Task: {task_compositionality}")
print(f" Minimum depth: {rec['min_layers']} layers")
print(f" Reasoning: {rec['reasoning']}")
print(f" Typical width: {rec['typical_width']}")
return rec


for task in ["tabular_regression", "image_classification", "sequence_modeling"]:
estimate_min_layers_for_task(task, input_dim=100)
print()

Connecting Theory to the Opening Question

Returning to the architect's question: why 4 layers vs 2 layers at matched parameter count?

The complete answer draws on all the theory in this lesson:

  1. UAT (existence): both architectures can represent the target function - so the question reduces to efficiency, not existence.

  2. Depth separation (efficiency): for compositional data (which most real-world datasets are), Telgarsky's theorem guarantees that a 4-layer network can represent functions that require exponentially more neurons in 2 layers. At matched parameter count, 4 layers is strictly more expressive.

  3. Linear regions (expressivity): a 4-layer ReLU network creates O(n12)O(n^{12}) linear regions (exponent ≈ d×(L1)d \times (L-1)) while a 2-layer network creates O(nd)O(n^d) regions. More regions = more complex decision boundaries per parameter.

  4. NTK theory (trainability): at moderate width, 4-layer networks are in the feature learning regime - they can adapt their representations to the data. They are not in the NTK (kernel) regime where additional depth provides no benefit.

  5. Lottery Ticket (regularization): deeper networks contain more diverse sparse subnetworks, increasing the probability that gradient descent finds a well-generalizing one.

The concise answer for the design review: "For compositional data, depth separation results guarantee that 4-layer networks are exponentially more parameter-efficient than 2-layer networks. The extra parameters in the 2-layer version would need to redundantly represent hierarchical features that the deeper network computes compositionally. Empirically, 4-layer architectures consistently outperform 2-layer architectures on this class of problem at matched parameter count."

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Neural Network Forward Pass demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.