Skip to main content

Regularisation Theory

Reading time: ~24 minutes | Level: Statistical Learning Theory → Production ML

The Fundamental Tension

You have data D\mathcal{D} and you want to find hHh \in \mathcal{H} that minimizes the true risk R(h)R(h). You can only measure the empirical risk R^(h)\hat{R}(h).

The tension: minimizing R^(h)\hat{R}(h) over a complex hypothesis class H\mathcal{H} will find a hypothesis that fits training noise - low R^(h)\hat{R}(h), high R(h)R(h).

Regularisation is the formal mechanism for controlling this tension. It adds a complexity term to the objective:

h=argminhH[R^(h)+λΩ(h)]h^* = \arg\min_{h \in \mathcal{H}} \left[\hat{R}(h) + \lambda \Omega(h)\right]

where Ω(h)\Omega(h) is a complexity penalty and λ>0\lambda > 0 controls the trade-off.

This lesson provides the theoretical framework for why regularisation works, connecting it to Bayesian MAP estimation, Structural Risk Minimisation, stability theory, and the implicit regularisation of gradient descent.

What You Will Learn

  • The regularisation objective: training loss + complexity penalty, and what each term does
  • Tikhonov (L2/Ridge) regularisation: the closed-form solution, the Bayesian MAP interpretation, and the shrinkage effect
  • Lasso (L1) regularisation: the sparsity-inducing geometry and when to use it over L2
  • Structural Risk Minimisation: the formal framework that justifies choosing model complexity from data
  • Stability theory: regularisation reduces algorithmic stability variance (the Bousquet-Elisseeff framework)
  • Dropout, early stopping, and data augmentation as regularisation - with the formal theory behind each
  • The implicit regularisation of gradient descent: why SGD finds minimum-norm solutions
  • Five interview questions with full worked answers

Tikhonov Regularisation (L2 / Ridge)

The oldest and most theoretically understood form of regularisation:

hRidge=argminwi=1n(yiwxi)2+λw22h^*_{Ridge} = \arg\min_{\mathbf{w}} \sum_{i=1}^n (y_i - \mathbf{w}^\top x_i)^2 + \lambda\|\mathbf{w}\|_2^2

Closed form solution: w^Ridge=(XX+λI)1Xy\hat{\mathbf{w}}_{Ridge} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}

The λI\lambda\mathbf{I} term ensures the matrix is invertible even when XX\mathbf{X}^\top\mathbf{X} is singular (rank deficient, when n<dn < d). This is the original motivation for Ridge regression: stability under collinearity.

Three Equivalent Views of L2 Regularisation

View 1 - Penalised ERM:

minw1ni(yi,wxi)+λ2w22\min_\mathbf{w} \frac{1}{n}\sum_i \ell(y_i, \mathbf{w}^\top x_i) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2

View 2 - Constrained ERM (Lagrangian duality):

minw1ni(yi,wxi)s.t.w22B2\min_\mathbf{w} \frac{1}{n}\sum_i \ell(y_i, \mathbf{w}^\top x_i) \quad \text{s.t.} \quad \|\mathbf{w}\|_2^2 \leq B^2

For every λ\lambda in View 1, there exists BB in View 2 that gives the same solution (and vice versa).

View 3 - MAP Estimation with Gaussian Prior (Bayesian):

maxwP(wD)maxwP(Dw)P(w)\max_\mathbf{w} P(\mathbf{w} \mid \mathcal{D}) \propto \max_\mathbf{w} P(\mathcal{D} \mid \mathbf{w}) \cdot P(\mathbf{w})

With P(w)=N(0,σw2I)P(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \sigma_w^2 \mathbf{I}): logP(w)=12σw2w22+const\log P(\mathbf{w}) = -\frac{1}{2\sigma_w^2}\|\mathbf{w}\|_2^2 + \text{const}

Taking λ=1/(2σw2)\lambda = 1/(2\sigma_w^2) gives exactly the Ridge objective. More weight regularisation (larger λ\lambda) corresponds to a tighter Gaussian prior (smaller σw2\sigma_w^2) on weights.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

def compare_regularisation(n_samples=50, n_features=30, noise_std=0.5):
"""
Compare OLS, Ridge, and Lasso on underdetermined problem (n < d).
"""
np.random.seed(42)
# True model: only first 5 features are relevant
true_w = np.zeros(n_features)
true_w[:5] = np.array([3.0, -2.0, 1.5, -1.0, 0.8])

# Generate data
X = np.random.randn(n_samples, n_features)
y = X @ true_w + np.random.normal(0, noise_std, n_samples)

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit models
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]

ridge_coefs = []
lasso_coefs = []
for lam in lambdas:
ridge = Ridge(alpha=lam * n_samples, fit_intercept=False)
ridge.fit(X_scaled, y)
ridge_coefs.append(ridge.coef_.copy())

lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
lasso.fit(X_scaled, y)
lasso_coefs.append(lasso.coef_.copy())

ridge_coefs = np.array(ridge_coefs)
lasso_coefs = np.array(lasso_coefs)

# Plot coefficient paths
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for ax, coefs, name in zip(axes,
[ridge_coefs, lasso_coefs],
['Ridge (L2)', 'Lasso (L1)']):
for j in range(n_features):
color = 'red' if j < 5 else 'blue'
alpha = 0.8 if j < 5 else 0.2
lw = 2 if j < 5 else 0.5
ax.semilogx(lambdas, coefs[:, j], color=color, alpha=alpha, linewidth=lw)
ax.set_xlabel('Regularisation strength λ')
ax.set_ylabel('Coefficient value')
ax.set_title(f'{name} Regularisation Path')
ax.axhline(0, color='gray', linestyle='--', linewidth=0.5)
ax.grid(True, alpha=0.3)
# Legend
from matplotlib.lines import Line2D
legend = [Line2D([0],[0], color='red', lw=2, label='True features (first 5)'),
Line2D([0],[0], color='blue', lw=1, label='Noise features')]
ax.legend(handles=legend)

plt.suptitle('Regularisation Paths: How λ Controls Model Complexity', fontsize=13)
plt.tight_layout()
plt.savefig('regularisation_paths.png', dpi=150)
print("Regularisation paths saved")

print("\nKey observations:")
print("Ridge: all coefficients shrink toward 0, none exactly zero (no sparsity)")
print("Lasso: coefficients shrink and become exactly 0 (sparsity)")
print("Lasso = Laplace prior (sparse); Ridge = Gaussian prior (dense small weights)")

compare_regularisation()

Lasso (L1 Regularisation): Sparsity from Laplace Prior

hLasso=argminwi=1n(yiwxi)2+λw1h^*_{Lasso} = \arg\min_\mathbf{w} \sum_{i=1}^n (y_i - \mathbf{w}^\top x_i)^2 + \lambda\|\mathbf{w}\|_1

Bayesian view: MAP estimation with a Laplace prior P(wj)=λ2exp(λwj)P(w_j) = \frac{\lambda}{2}\exp(-\lambda|w_j|).

Why L1 produces sparsity: Geometrically, the L1 ball (w1B\|\mathbf{w}\|_1 \leq B) has corners at the axes. The optimal solution tends to hit corners - setting all but a few coordinates to exactly zero. The L2 ball (w2B\|\mathbf{w}\|_2 \leq B) is smooth and has no corners - solutions tend to be dense with many small nonzero values.

PropertyRidge (L2)Lasso (L1)Elastic Net (L1+L2)
Bayesian priorGaussianLaplaceGaussian + Laplace
SparsityNo (dense)Yes (exact zeros)Yes (sparse)
Closed formYesNoNo
Feature selectionNoYesYes
Correlated featuresAverages themPicks one arbitrarilyHandles both
StabilityAlwaysCan oscillateStable

Structural Risk Minimisation (SRM)

SRM (Vapnik, 1995) provides the theoretical framework for choosing model complexity:

Consider a nested sequence of hypothesis classes:

H1H2Hk\mathcal{H}_1 \subseteq \mathcal{H}_2 \subseteq \cdots \subseteq \mathcal{H}_k \subseteq \cdots

with increasing VC dimensions d1d2d_1 \leq d_2 \leq \cdots.

For each Hk\mathcal{H}_k, there's a generalization bound:

R(h)R^(h)+ϵk(n,δ)complexity penaltyR(h) \leq \hat{R}(h) + \underbrace{\epsilon_k(n, \delta)}_{\text{complexity penalty}}

where ϵk=dk(ln(2n/dk)+1)+ln(4/δ)n\epsilon_k = \sqrt{\frac{d_k(\ln(2n/d_k) + 1) + \ln(4/\delta)}{n}}.

SRM chooses the class kk^* and hypothesis hh^* minimizing the structural risk:

k,h=argmink,hHk[R^(h)+ϵk(n,δ)]k^*, h^* = \arg\min_{k, h \in \mathcal{H}_k} \left[\hat{R}(h) + \epsilon_k(n, \delta)\right]

Connection to regularisation: The regularised objective R^(h)+λΩ(h)\hat{R}(h) + \lambda\Omega(h) is an approximation to SRM, where Ω(h)\Omega(h) (e.g., w2\|\mathbf{w}\|^2) serves as a proxy for the complexity penalty ϵk\epsilon_k, and λ\lambda balances empirical fit against complexity.

SRM makes explicit: the optimal model complexity depends on nn. With more data, you can afford more complex models (larger kk). This is the formal justification for scaling - as you get more data, increase model size.

import numpy as np
import matplotlib.pyplot as plt

def structural_risk_minimisation_demo():
"""
Show how SRM selects optimal model complexity as n grows.
"""
def empirical_risk(complexity, n):
"""Simulated empirical risk vs complexity (U-shaped training error)."""
return np.maximum(0.5 - 0.02 * complexity + 0.001 * complexity**2, 0)

def complexity_penalty(d_vc, n, delta=0.05):
"""VC-based complexity penalty term."""
return np.sqrt(d_vc * (np.log(2*n/max(d_vc, 1)) + 1) / n)

n_values = [50, 200, 1000, 5000]
complexities = np.arange(1, 50)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for n, ax_idx in zip([200, 2000], [0, 1]):
ax = axes[ax_idx]
emp_risks = [empirical_risk(k, n) for k in complexities]
penalties = [complexity_penalty(k, n) for k in complexities]
structural_risks = [e + p for e, p in zip(emp_risks, penalties)]

ax.plot(complexities, emp_risks, 'b-', linewidth=2, label='Empirical risk')
ax.plot(complexities, penalties, 'r--', linewidth=2, label='Complexity penalty')
ax.plot(complexities, structural_risks, 'g-', linewidth=3, label='Structural risk')

opt_k = complexities[np.argmin(structural_risks)]
ax.axvline(opt_k, color='orange', linewidth=2, linestyle=':',
label=f'Optimal complexity k*={opt_k}')

ax.set_xlabel('Model complexity (VC dimension)')
ax.set_ylabel('Risk / Bound')
ax.set_title(f'SRM: n={n} training examples')
ax.legend()
ax.grid(True, alpha=0.3)

plt.suptitle('Structural Risk Minimisation: Optimal Complexity Grows With n', fontsize=13)
plt.tight_layout()
plt.savefig('srm_demo.png', dpi=150)
print("SRM plot saved")
print("Key insight: as n grows, optimal complexity increases")
print("This justifies scaling laws in deep learning!")

structural_risk_minimisation_demo()

Algorithm Stability and Generalisation

A different perspective on regularisation: a learning algorithm is stable if small changes to the training set produce small changes in the output hypothesis.

Definition (Uniform Stability): Algorithm AA is β\beta-uniformly stable if for any training sets S,SS, S' differing in at most one example:

supz(A(S),z)(A(S),z)β\sup_z |\ell(A(S), z) - \ell(A(S'), z)| \leq \beta

Theorem (Bousquet & Elisseeff, 2002): If algorithm AA is β\beta-uniformly stable, then with probability 1δ\geq 1-\delta:

R(A(S))R^(A(S))+2β+(4nβ+1)ln(1/δ)2nR(A(S)) \leq \hat{R}(A(S)) + 2\beta + (4n\beta + 1)\sqrt{\frac{\ln(1/\delta)}{2n}}

Ridge regression is O(λ1/n)O(\lambda^{-1}/n)-stable: The L2 penalty makes the objective strongly convex, which forces the solution to change smoothly when one training example changes. This is the formal reason L2 regularisation helps generalisation - not just because it constrains complexity, but because it makes the algorithm stable.

In contrast, 1-nearest-neighbor is O(1)O(1)-stable in a different sense (changing one example can completely change predictions in a region), and ERM without regularisation on a complex class can be highly unstable.

Dropout as Regularisation: Theory

Dropout was introduced empirically in Hinton et al. (2012). Subsequent theory has provided several interpretations:

Interpretation 1: Noise-Based Regularisation

Dropout randomly sets neurons to zero during training. This adds multiplicative noise to hidden activations. The regularisation effect: the network cannot rely on any single neuron - it must learn redundant representations.

Mathematically, for a linear model with dropout on inputs (p=0.5p = 0.5 drop probability):

Ldropout=Eϵ[i(yi(wϵ)xi)2]=Xwy2+constjwj2xj2L_{dropout} = E_\epsilon\left[\sum_i (y_i - (\mathbf{w} \odot \epsilon)^\top x_i)^2\right] = \|X\mathbf{w} - \mathbf{y}\|^2 + \text{const} \cdot \sum_j w_j^2 \|x_j\|^2

Dropout is equivalent to weighted L2 regularisation - each weight is penalised proportional to the scale of its input feature.

Interpretation 2: Bayesian Approximation (Gal & Ghahramani, 2016)

Dropout at test time (Monte Carlo Dropout) approximates Bayesian inference in a deep Gaussian process. The posterior distribution over network outputs is approximated by:

P(yx,D)1Tt=1TP(yx,W^t)P(y^* \mid x^*, \mathcal{D}) \approx \frac{1}{T}\sum_{t=1}^T P(y^* \mid x^*, \hat{\mathbf{W}}_t)

where W^t\hat{\mathbf{W}}_t are the weights with different dropout masks applied. This gives epistemic uncertainty estimates - predicting both the mean output and the variance, where high variance indicates the model is uncertain.

import torch
import torch.nn as nn
import numpy as np

class BayesianDropoutNN(nn.Module):
"""
Neural network with MC Dropout for uncertainty estimation.
Dropout is kept ON at test time for approximate Bayesian inference.
"""
def __init__(self, input_dim, hidden_dim, output_dim, dropout_p=0.5):
super().__init__()
self.dropout_p = dropout_p
self.network = nn.Sequential(
nn.Linear(input_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(p=dropout_p), # Dropout after first layer
nn.Linear(hidden_dim, hidden_dim),
nn.ReLU(),
nn.Dropout(p=dropout_p), # Dropout after second layer
nn.Linear(hidden_dim, output_dim)
)

def forward(self, x):
return self.network(x)

def predict_with_uncertainty(self, x, n_forward_passes=100):
"""
MC Dropout: run T stochastic forward passes to estimate uncertainty.
Returns mean prediction and epistemic uncertainty.
"""
self.train() # Keep dropout active during inference
predictions = []
with torch.no_grad():
for _ in range(n_forward_passes):
pred = self.forward(x)
predictions.append(pred)

predictions = torch.stack(predictions) # (T, batch, output_dim)
mean = predictions.mean(0)
epistemic_uncertainty = predictions.var(0) # variance across passes
return mean, epistemic_uncertainty

# Example: regression with uncertainty
np.random.seed(42)
model = BayesianDropoutNN(1, 50, 1, dropout_p=0.1)

# Generate test points
X_test = torch.linspace(-4, 4, 100).reshape(-1, 1)

# Before training: high uncertainty everywhere (prior)
mean_pred, uncertainty = model.predict_with_uncertainty(X_test)

print("MC Dropout Uncertainty Estimation:")
print(f"Mean prediction range: [{mean_pred.min():.3f}, {mean_pred.max():.3f}]")
print(f"Uncertainty range: [{uncertainty.min():.4f}, {uncertainty.max():.4f}]")
print("After training on x in [-2, 2], uncertainty should be lower in that range")
print("and higher outside (epistemic uncertainty from lack of training data)")

Early Stopping as Regularisation

Early stopping - halting gradient descent before convergence - is one of the most effective and widely used regularisation techniques, yet often misunderstood.

Theoretical connection to L2 regularisation (Yao et al., 2007): For gradient descent on a linear model with step size η\eta, stopping at step TT is equivalent to L2 regularisation with:

λequiv1ηT\lambda_{equiv} \approx \frac{1}{\eta T}

More training steps → smaller effective λ\lambda → less regularisation → more complex solutions.

Intuition: Gradient descent on a quadratic loss starts with the solution at the initialisation (often near zero) and moves toward the ERM minimiser. The trajectory traverses solutions with increasing weight magnitude. Stopping early gives a solution with small weights - similar to the effect of L2 regularisation.

import numpy as np
import matplotlib.pyplot as plt

def early_stopping_demonstration():
"""
Demonstrate early stopping as implicit L2 regularisation.
Compare test MSE for gradient descent at different stopping points.
"""
np.random.seed(42)
n, d = 30, 50 # underdetermined: n < d

# True sparse model
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]

X = np.random.randn(n, d)
y_train = X @ w_true + np.random.randn(n) * 0.3

# Test set
X_test = np.random.randn(500, d)
y_test = X_test @ w_true + np.random.randn(500) * 0.3

# Gradient descent on linear regression (no explicit regularisation)
w = np.zeros(d)
lr = 0.001
n_steps = 3000

train_errors, test_errors, weight_norms = [], [], []

for step in range(n_steps):
residuals = X @ w - y_train
grad = X.T @ residuals / n
w = w - lr * grad

if step % 10 == 0:
train_mse = np.mean((X @ w - y_train)**2)
test_mse = np.mean((X_test @ w - y_test)**2)
train_errors.append(train_mse)
test_errors.append(test_mse)
weight_norms.append(np.linalg.norm(w))

steps = list(range(0, n_steps, 10))
optimal_step = steps[np.argmin(test_errors)]
optimal_test_err = min(test_errors)

print(f"Early stopping analysis:")
print(f" Optimal stopping step: {optimal_step}")
print(f" Test MSE at optimal stop: {optimal_test_err:.4f}")
print(f" Test MSE at final step: {test_errors[-1]:.4f}")
print(f" Weight norm at optimal: {weight_norms[steps.index(optimal_step)]:.4f}")
print(f" Weight norm at final: {weight_norms[-1]:.4f}")
print(f" Equivalent L2 lambda ≈ {1/(lr*optimal_step):.4f}")

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].semilogy(steps, train_errors, 'b-', label='Train MSE')
axes[0].semilogy(steps, test_errors, 'r-', label='Test MSE')
axes[0].axvline(optimal_step, color='green', linestyle='--',
label=f'Optimal stop (step={optimal_step})')
axes[0].set_xlabel('Gradient descent step')
axes[0].set_ylabel('MSE')
axes[0].set_title('Early Stopping: Train vs Test Error')
axes[0].legend()

axes[1].plot(steps, weight_norms, 'purple')
axes[1].axvline(optimal_step, color='green', linestyle='--')
axes[1].set_xlabel('Gradient descent step')
axes[1].set_ylabel('||w||_2')
axes[1].set_title('Weight Norm Grows With Training Steps')

plt.tight_layout()
plt.savefig('early_stopping.png', dpi=150)
return optimal_step, optimal_test_err

early_stopping_demonstration()

Practical Guide: Choosing Regularisation

SituationRecommended Regularisation
Linear/logistic regression, many featuresL2 (Ridge) by default; L1 if sparsity desired
Feature selection neededL1 (Lasso) or Elastic Net
Neural network trainingWeight decay (L2) + Dropout
Sequence models (RNN, LSTM)Variational dropout, recurrent dropout
TransformersAttention dropout, label smoothing
Convolutional networksBatch normalisation, data augmentation
Small datasetStronger regularisation, more data augmentation
NLP fine-tuningLayer-wise learning rate decay, weight decay
GAN discriminatorSpectral normalisation (Rademacher complexity control)

Interview Questions

Q1: What are the three equivalent views of L2 regularisation, and which is most useful for understanding why it works?

(1) Penalised ERM: minwR^(w)+λw2\min_\mathbf{w} \hat{R}(\mathbf{w}) + \lambda\|\mathbf{w}\|^2 - penalises large weights, useful for computation; (2) Constrained ERM: minwR^(w)\min_\mathbf{w} \hat{R}(\mathbf{w}) s.t. wB\|\mathbf{w}\| \leq B - controls model complexity by bounding the hypothesis class, connects to Rademacher complexity RnBR/n\mathcal{R}_n \leq BR/\sqrt{n}; (3) MAP estimation: MLE under Gaussian prior wN(0,σ2I)\mathbf{w} \sim \mathcal{N}(0, \sigma^2 I), equivalent to λ=1/(2σ2)\lambda = 1/(2\sigma^2) - Bayesian interpretation. The constrained view is most fundamental for understanding generalisation: it shows L2 regularisation limits the effective VC dimension / Rademacher complexity of the hypothesis class, directly reducing the generalisation gap. The Bayesian view explains why the choice of λ\lambda encodes prior beliefs about weight magnitudes.

Q2: Why does L1 regularisation produce sparsity but L2 does not?

Geometrically: the L1 constraint set {w:w1B}\{\mathbf{w}: \|\mathbf{w}\|_1 \leq B\} is a diamond (in 2D) with corners at the coordinate axes. The L2 constraint set is a smooth sphere with no corners. When you minimise the empirical loss (an elliptical function) subject to these constraints, the solution tends to lie at the boundary of the constraint set. For L2, the smooth boundary means the solution can be anywhere - rarely exactly on an axis. For L1, the boundary has sharp corners at the axes, and the solution is "attracted" to these corners where one or more coordinates are zero. Analytically: the L1 subdifferential at wj=0w_j = 0 is [λ,λ][-\lambda, \lambda] - any gradient in this range leaves wj=0w_j = 0 as the optimal solution, creating exact zeros. L2 gradient at wj=0w_j = 0 is exactly 0, so any positive gradient pushes wjw_j away from zero.

Q3: Explain algorithm stability and its connection to L2 regularisation.

Algorithm stability measures how much the output changes when you remove or change one training example. For Ridge regression, the objective is 1nXwy2+λw2\frac{1}{n}\|X\mathbf{w} - y\|^2 + \lambda\|\mathbf{w}\|^2, which is (2λ)(2\lambda)-strongly convex. Strong convexity implies β\beta-stability with β=O(1/(λn))\beta = O(1/(\lambda n)). By the stability-based generalisation bound, R(h)R^(h)+O(ln(1/δ)/n)+O(1/(λn))R(h) \leq \hat{R}(h) + O(\sqrt{\ln(1/\delta)/n}) + O(1/(\lambda n)). With optimal λ=O(1/n)\lambda = O(1/\sqrt{n}), the bound becomes O(1/n)O(1/\sqrt{n}). This is the formal reason L2 regularisation guarantees generalisation: it makes the algorithm stable (insensitive to individual training examples), which implies the training error concentrates around the test error.

Q4: What is the theoretical connection between dropout and L2 regularisation?

For a linear model y=wxy = \mathbf{w}^\top x with dropout (each xjx_j set to 0 with probability pp and scaled by 1/(1p)1/(1-p)), the expected loss under dropout masking is: Eξ[yw(ξx)2]=ywE[ξx]2+jwj2Var[ξjxj]E_\xi[\|y - \mathbf{w}^\top (\xi \odot x)\|^2] = \|y - \mathbf{w}^\top E[\xi \odot x]\|^2 + \sum_j w_j^2 \text{Var}[\xi_j x_j]. With Bernoulli dropout at rate pp: Var[ξjxj]=p(1p)xj2/(1p)2=p/(1p)xj2\text{Var}[\xi_j x_j] = p(1-p)x_j^2 / (1-p)^2 = p/(1-p) \cdot x_j^2. So dropout is equivalent to weighted L2 regularisation with penalty λj=p/(1p)xj2/n\lambda_j = p/(1-p) \cdot x_j^2/n for each weight - each weight is regularised proportionally to the variance of its input feature. This is adaptive L2 regularisation: features with higher variance receive stronger regularisation.

Q5: What does early stopping correspond to in regularisation theory, and how do you choose when to stop?

Early stopping corresponds to L2 regularisation with effective strength λeff1/(ηT)\lambda_{eff} \approx 1/(\eta T) where η\eta is the step size and TT is the number of steps. More precisely, for gradient descent on squared loss: after TT gradient steps from initialisation near zero, the solution lies in the Krylov subspace spanned by the first TT Lanczos vectors of XX\mathbf{X}^\top\mathbf{X}, which filters out small singular values of X\mathbf{X} - equivalent to Tikhonov regularisation with threshold 1/(ηT)1/(\eta T). In practice, use a validation set and stop when validation loss stops decreasing (with patience = some number of epochs). Alternatively, use learning rate schedulers (cosine annealing, one-cycle) that naturally prevent late-stage overfitting. In deep learning, the relationship is less exact due to nonlinearity, but the intuition holds: training longer from small-norm initialisation → larger-norm solutions → less implicit regularisation.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Regularization Path demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.