Regularisation Theory
Reading time: ~24 minutes | Level: Statistical Learning Theory → Production ML
The Fundamental Tension
You have data and you want to find that minimizes the true risk . You can only measure the empirical risk .
The tension: minimizing over a complex hypothesis class will find a hypothesis that fits training noise - low , high .
Regularisation is the formal mechanism for controlling this tension. It adds a complexity term to the objective:
where is a complexity penalty and controls the trade-off.
This lesson provides the theoretical framework for why regularisation works, connecting it to Bayesian MAP estimation, Structural Risk Minimisation, stability theory, and the implicit regularisation of gradient descent.
What You Will Learn
- The regularisation objective: training loss + complexity penalty, and what each term does
- Tikhonov (L2/Ridge) regularisation: the closed-form solution, the Bayesian MAP interpretation, and the shrinkage effect
- Lasso (L1) regularisation: the sparsity-inducing geometry and when to use it over L2
- Structural Risk Minimisation: the formal framework that justifies choosing model complexity from data
- Stability theory: regularisation reduces algorithmic stability variance (the Bousquet-Elisseeff framework)
- Dropout, early stopping, and data augmentation as regularisation - with the formal theory behind each
- The implicit regularisation of gradient descent: why SGD finds minimum-norm solutions
- Five interview questions with full worked answers
Tikhonov Regularisation (L2 / Ridge)
The oldest and most theoretically understood form of regularisation:
Closed form solution:
The term ensures the matrix is invertible even when is singular (rank deficient, when ). This is the original motivation for Ridge regression: stability under collinearity.
Three Equivalent Views of L2 Regularisation
View 1 - Penalised ERM:
View 2 - Constrained ERM (Lagrangian duality):
For every in View 1, there exists in View 2 that gives the same solution (and vice versa).
View 3 - MAP Estimation with Gaussian Prior (Bayesian):
With :
Taking gives exactly the Ridge objective. More weight regularisation (larger ) corresponds to a tighter Gaussian prior (smaller ) on weights.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
def compare_regularisation(n_samples=50, n_features=30, noise_std=0.5):
"""
Compare OLS, Ridge, and Lasso on underdetermined problem (n < d).
"""
np.random.seed(42)
# True model: only first 5 features are relevant
true_w = np.zeros(n_features)
true_w[:5] = np.array([3.0, -2.0, 1.5, -1.0, 0.8])
# Generate data
X = np.random.randn(n_samples, n_features)
y = X @ true_w + np.random.normal(0, noise_std, n_samples)
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit models
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
ridge_coefs = []
lasso_coefs = []
for lam in lambdas:
ridge = Ridge(alpha=lam * n_samples, fit_intercept=False)
ridge.fit(X_scaled, y)
ridge_coefs.append(ridge.coef_.copy())
lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
lasso.fit(X_scaled, y)
lasso_coefs.append(lasso.coef_.copy())
ridge_coefs = np.array(ridge_coefs)
lasso_coefs = np.array(lasso_coefs)
# Plot coefficient paths
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for ax, coefs, name in zip(axes,
[ridge_coefs, lasso_coefs],
['Ridge (L2)', 'Lasso (L1)']):
for j in range(n_features):
color = 'red' if j < 5 else 'blue'
alpha = 0.8 if j < 5 else 0.2
lw = 2 if j < 5 else 0.5
ax.semilogx(lambdas, coefs[:, j], color=color, alpha=alpha, linewidth=lw)
ax.set_xlabel('Regularisation strength λ')
ax.set_ylabel('Coefficient value')
ax.set_title(f'{name} Regularisation Path')
ax.axhline(0, color='gray', linestyle='--', linewidth=0.5)
ax.grid(True, alpha=0.3)
# Legend
from matplotlib.lines import Line2D
legend = [Line2D([0],[0], color='red', lw=2, label='True features (first 5)'),
Line2D([0],[0], color='blue', lw=1, label='Noise features')]
ax.legend(handles=legend)
plt.suptitle('Regularisation Paths: How λ Controls Model Complexity', fontsize=13)
plt.tight_layout()
plt.savefig('regularisation_paths.png', dpi=150)
print("Regularisation paths saved")
print("\nKey observations:")
print("Ridge: all coefficients shrink toward 0, none exactly zero (no sparsity)")
print("Lasso: coefficients shrink and become exactly 0 (sparsity)")
print("Lasso = Laplace prior (sparse); Ridge = Gaussian prior (dense small weights)")
compare_regularisation()
Lasso (L1 Regularisation): Sparsity from Laplace Prior
Bayesian view: MAP estimation with a Laplace prior .
Why L1 produces sparsity: Geometrically, the L1 ball () has corners at the axes. The optimal solution tends to hit corners - setting all but a few coordinates to exactly zero. The L2 ball () is smooth and has no corners - solutions tend to be dense with many small nonzero values.
| Property | Ridge (L2) | Lasso (L1) | Elastic Net (L1+L2) |
|---|---|---|---|
| Bayesian prior | Gaussian | Laplace | Gaussian + Laplace |
| Sparsity | No (dense) | Yes (exact zeros) | Yes (sparse) |
| Closed form | Yes | No | No |
| Feature selection | No | Yes | Yes |
| Correlated features | Averages them | Picks one arbitrarily | Handles both |
| Stability | Always | Can oscillate | Stable |
Structural Risk Minimisation (SRM)
SRM (Vapnik, 1995) provides the theoretical framework for choosing model complexity:
Consider a nested sequence of hypothesis classes:
with increasing VC dimensions .
For each , there's a generalization bound:
where .
SRM chooses the class and hypothesis minimizing the structural risk:
Connection to regularisation: The regularised objective is an approximation to SRM, where (e.g., ) serves as a proxy for the complexity penalty , and balances empirical fit against complexity.
SRM makes explicit: the optimal model complexity depends on . With more data, you can afford more complex models (larger ). This is the formal justification for scaling - as you get more data, increase model size.
import numpy as np
import matplotlib.pyplot as plt
def structural_risk_minimisation_demo():
"""
Show how SRM selects optimal model complexity as n grows.
"""
def empirical_risk(complexity, n):
"""Simulated empirical risk vs complexity (U-shaped training error)."""
return np.maximum(0.5 - 0.02 * complexity + 0.001 * complexity**2, 0)
def complexity_penalty(d_vc, n, delta=0.05):
"""VC-based complexity penalty term."""
return np.sqrt(d_vc * (np.log(2*n/max(d_vc, 1)) + 1) / n)
n_values = [50, 200, 1000, 5000]
complexities = np.arange(1, 50)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for n, ax_idx in zip([200, 2000], [0, 1]):
ax = axes[ax_idx]
emp_risks = [empirical_risk(k, n) for k in complexities]
penalties = [complexity_penalty(k, n) for k in complexities]
structural_risks = [e + p for e, p in zip(emp_risks, penalties)]
ax.plot(complexities, emp_risks, 'b-', linewidth=2, label='Empirical risk')
ax.plot(complexities, penalties, 'r--', linewidth=2, label='Complexity penalty')
ax.plot(complexities, structural_risks, 'g-', linewidth=3, label='Structural risk')
opt_k = complexities[np.argmin(structural_risks)]
ax.axvline(opt_k, color='orange', linewidth=2, linestyle=':',
label=f'Optimal complexity k*={opt_k}')
ax.set_xlabel('Model complexity (VC dimension)')
ax.set_ylabel('Risk / Bound')
ax.set_title(f'SRM: n={n} training examples')
ax.legend()
ax.grid(True, alpha=0.3)
plt.suptitle('Structural Risk Minimisation: Optimal Complexity Grows With n', fontsize=13)
plt.tight_layout()
plt.savefig('srm_demo.png', dpi=150)
print("SRM plot saved")
print("Key insight: as n grows, optimal complexity increases")
print("This justifies scaling laws in deep learning!")
structural_risk_minimisation_demo()
Algorithm Stability and Generalisation
A different perspective on regularisation: a learning algorithm is stable if small changes to the training set produce small changes in the output hypothesis.
Definition (Uniform Stability): Algorithm is -uniformly stable if for any training sets differing in at most one example:
Theorem (Bousquet & Elisseeff, 2002): If algorithm is -uniformly stable, then with probability :
Ridge regression is -stable: The L2 penalty makes the objective strongly convex, which forces the solution to change smoothly when one training example changes. This is the formal reason L2 regularisation helps generalisation - not just because it constrains complexity, but because it makes the algorithm stable.
In contrast, 1-nearest-neighbor is -stable in a different sense (changing one example can completely change predictions in a region), and ERM without regularisation on a complex class can be highly unstable.
Dropout as Regularisation: Theory
Dropout was introduced empirically in Hinton et al. (2012). Subsequent theory has provided several interpretations:
Interpretation 1: Noise-Based Regularisation
Dropout randomly sets neurons to zero during training. This adds multiplicative noise to hidden activations. The regularisation effect: the network cannot rely on any single neuron - it must learn redundant representations.
Mathematically, for a linear model with dropout on inputs ( drop probability):
Dropout is equivalent to weighted L2 regularisation - each weight is penalised proportional to the scale of its input feature.
Interpretation 2: Bayesian Approximation (Gal & Ghahramani, 2016)
Dropout at test time (Monte Carlo Dropout) approximates Bayesian inference in a deep Gaussian process. The posterior distribution over network outputs is approximated by:
where are the weights with different dropout masks applied. This gives epistemic uncertainty estimates - predicting both the mean output and the variance, where high variance indicates the model is uncertain.
import torch
import torch.nn as nn
import numpy as np
class BayesianDropoutNN(nn.Module):
"""
Neural network with MC Dropout for uncertainty estimation.
Dropout is kept ON at test time for approximate Bayesian inference.
"""
def __init__(self, input_dim, hidden_dim, output_dim, dropout_p=0.5):
super().__init__()
self.dropout_p = dropout_p
self.network = nn.Sequential(
nn.Linear(input_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(p=dropout_p), # Dropout after first layer
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(p=dropout_p), # Dropout after second layer
nn.Linear(hidden_dim, output_dim)
)
def forward(self, x):
return self.network(x)
def predict_with_uncertainty(self, x, n_forward_passes=100):
"""
MC Dropout: run T stochastic forward passes to estimate uncertainty.
Returns mean prediction and epistemic uncertainty.
"""
self.train() # Keep dropout active during inference
predictions = []
with torch.no_grad():
for _ in range(n_forward_passes):
pred = self.forward(x)
predictions.append(pred)
predictions = torch.stack(predictions) # (T, batch, output_dim)
mean = predictions.mean(0)
epistemic_uncertainty = predictions.var(0) # variance across passes
return mean, epistemic_uncertainty
# Example: regression with uncertainty
np.random.seed(42)
model = BayesianDropoutNN(1, 50, 1, dropout_p=0.1)
# Generate test points
X_test = torch.linspace(-4, 4, 100).reshape(-1, 1)
# Before training: high uncertainty everywhere (prior)
mean_pred, uncertainty = model.predict_with_uncertainty(X_test)
print("MC Dropout Uncertainty Estimation:")
print(f"Mean prediction range: [{mean_pred.min():.3f}, {mean_pred.max():.3f}]")
print(f"Uncertainty range: [{uncertainty.min():.4f}, {uncertainty.max():.4f}]")
print("After training on x in [-2, 2], uncertainty should be lower in that range")
print("and higher outside (epistemic uncertainty from lack of training data)")
Early Stopping as Regularisation
Early stopping - halting gradient descent before convergence - is one of the most effective and widely used regularisation techniques, yet often misunderstood.
Theoretical connection to L2 regularisation (Yao et al., 2007): For gradient descent on a linear model with step size , stopping at step is equivalent to L2 regularisation with:
More training steps → smaller effective → less regularisation → more complex solutions.
Intuition: Gradient descent on a quadratic loss starts with the solution at the initialisation (often near zero) and moves toward the ERM minimiser. The trajectory traverses solutions with increasing weight magnitude. Stopping early gives a solution with small weights - similar to the effect of L2 regularisation.
import numpy as np
import matplotlib.pyplot as plt
def early_stopping_demonstration():
"""
Demonstrate early stopping as implicit L2 regularisation.
Compare test MSE for gradient descent at different stopping points.
"""
np.random.seed(42)
n, d = 30, 50 # underdetermined: n < d
# True sparse model
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]
X = np.random.randn(n, d)
y_train = X @ w_true + np.random.randn(n) * 0.3
# Test set
X_test = np.random.randn(500, d)
y_test = X_test @ w_true + np.random.randn(500) * 0.3
# Gradient descent on linear regression (no explicit regularisation)
w = np.zeros(d)
lr = 0.001
n_steps = 3000
train_errors, test_errors, weight_norms = [], [], []
for step in range(n_steps):
residuals = X @ w - y_train
grad = X.T @ residuals / n
w = w - lr * grad
if step % 10 == 0:
train_mse = np.mean((X @ w - y_train)**2)
test_mse = np.mean((X_test @ w - y_test)**2)
train_errors.append(train_mse)
test_errors.append(test_mse)
weight_norms.append(np.linalg.norm(w))
steps = list(range(0, n_steps, 10))
optimal_step = steps[np.argmin(test_errors)]
optimal_test_err = min(test_errors)
print(f"Early stopping analysis:")
print(f" Optimal stopping step: {optimal_step}")
print(f" Test MSE at optimal stop: {optimal_test_err:.4f}")
print(f" Test MSE at final step: {test_errors[-1]:.4f}")
print(f" Weight norm at optimal: {weight_norms[steps.index(optimal_step)]:.4f}")
print(f" Weight norm at final: {weight_norms[-1]:.4f}")
print(f" Equivalent L2 lambda ≈ {1/(lr*optimal_step):.4f}")
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].semilogy(steps, train_errors, 'b-', label='Train MSE')
axes[0].semilogy(steps, test_errors, 'r-', label='Test MSE')
axes[0].axvline(optimal_step, color='green', linestyle='--',
label=f'Optimal stop (step={optimal_step})')
axes[0].set_xlabel('Gradient descent step')
axes[0].set_ylabel('MSE')
axes[0].set_title('Early Stopping: Train vs Test Error')
axes[0].legend()
axes[1].plot(steps, weight_norms, 'purple')
axes[1].axvline(optimal_step, color='green', linestyle='--')
axes[1].set_xlabel('Gradient descent step')
axes[1].set_ylabel('||w||_2')
axes[1].set_title('Weight Norm Grows With Training Steps')
plt.tight_layout()
plt.savefig('early_stopping.png', dpi=150)
return optimal_step, optimal_test_err
early_stopping_demonstration()
Practical Guide: Choosing Regularisation
| Situation | Recommended Regularisation |
|---|---|
| Linear/logistic regression, many features | L2 (Ridge) by default; L1 if sparsity desired |
| Feature selection needed | L1 (Lasso) or Elastic Net |
| Neural network training | Weight decay (L2) + Dropout |
| Sequence models (RNN, LSTM) | Variational dropout, recurrent dropout |
| Transformers | Attention dropout, label smoothing |
| Convolutional networks | Batch normalisation, data augmentation |
| Small dataset | Stronger regularisation, more data augmentation |
| NLP fine-tuning | Layer-wise learning rate decay, weight decay |
| GAN discriminator | Spectral normalisation (Rademacher complexity control) |
Interview Questions
Q1: What are the three equivalent views of L2 regularisation, and which is most useful for understanding why it works?
(1) Penalised ERM: - penalises large weights, useful for computation; (2) Constrained ERM: s.t. - controls model complexity by bounding the hypothesis class, connects to Rademacher complexity ; (3) MAP estimation: MLE under Gaussian prior , equivalent to - Bayesian interpretation. The constrained view is most fundamental for understanding generalisation: it shows L2 regularisation limits the effective VC dimension / Rademacher complexity of the hypothesis class, directly reducing the generalisation gap. The Bayesian view explains why the choice of encodes prior beliefs about weight magnitudes.
Q2: Why does L1 regularisation produce sparsity but L2 does not?
Geometrically: the L1 constraint set is a diamond (in 2D) with corners at the coordinate axes. The L2 constraint set is a smooth sphere with no corners. When you minimise the empirical loss (an elliptical function) subject to these constraints, the solution tends to lie at the boundary of the constraint set. For L2, the smooth boundary means the solution can be anywhere - rarely exactly on an axis. For L1, the boundary has sharp corners at the axes, and the solution is "attracted" to these corners where one or more coordinates are zero. Analytically: the L1 subdifferential at is - any gradient in this range leaves as the optimal solution, creating exact zeros. L2 gradient at is exactly 0, so any positive gradient pushes away from zero.
Q3: Explain algorithm stability and its connection to L2 regularisation.
Algorithm stability measures how much the output changes when you remove or change one training example. For Ridge regression, the objective is , which is -strongly convex. Strong convexity implies -stability with . By the stability-based generalisation bound, . With optimal , the bound becomes . This is the formal reason L2 regularisation guarantees generalisation: it makes the algorithm stable (insensitive to individual training examples), which implies the training error concentrates around the test error.
Q4: What is the theoretical connection between dropout and L2 regularisation?
For a linear model with dropout (each set to 0 with probability and scaled by ), the expected loss under dropout masking is: . With Bernoulli dropout at rate : . So dropout is equivalent to weighted L2 regularisation with penalty for each weight - each weight is regularised proportionally to the variance of its input feature. This is adaptive L2 regularisation: features with higher variance receive stronger regularisation.
Q5: What does early stopping correspond to in regularisation theory, and how do you choose when to stop?
Early stopping corresponds to L2 regularisation with effective strength where is the step size and is the number of steps. More precisely, for gradient descent on squared loss: after gradient steps from initialisation near zero, the solution lies in the Krylov subspace spanned by the first Lanczos vectors of , which filters out small singular values of - equivalent to Tikhonov regularisation with threshold . In practice, use a validation set and stop when validation loss stops decreasing (with patience = some number of epochs). Alternatively, use learning rate schedulers (cosine annealing, one-cycle) that naturally prevent late-stage overfitting. In deep learning, the relationship is less exact due to nonlinearity, but the intuition holds: training longer from small-norm initialisation → larger-norm solutions → less implicit regularisation.
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Regularization Path demo on the EngineersOfAI Playground - no code required.
:::
