Regularisation Theory

Reading time: ~24 minutes | Level: Statistical Learning Theory → Production ML

The Fundamental Tension

You have data $\mathcal{D}$ and you want to find $h \in \mathcal{H}$ that minimizes the true risk $R(h)$ . You can only measure the empirical risk $\hat{R}(h)$ .

The tension: minimizing $\hat{R}(h)$ over a complex hypothesis class $\mathcal{H}$ will find a hypothesis that fits training noise - low $\hat{R}(h)$ , high $R(h)$ .

Regularisation is the formal mechanism for controlling this tension. It adds a complexity term to the objective:

$h^* = \arg\min_{h \in \mathcal{H}} \left[\hat{R}(h) + \lambda \Omega(h)\right]$

where $\Omega(h)$ is a complexity penalty and $\lambda > 0$ controls the trade-off.

This lesson provides the theoretical framework for why regularisation works, connecting it to Bayesian MAP estimation, Structural Risk Minimisation, stability theory, and the implicit regularisation of gradient descent.

What You Will Learn

The regularisation objective: training loss + complexity penalty, and what each term does
Tikhonov (L2/Ridge) regularisation: the closed-form solution, the Bayesian MAP interpretation, and the shrinkage effect
Lasso (L1) regularisation: the sparsity-inducing geometry and when to use it over L2
Structural Risk Minimisation: the formal framework that justifies choosing model complexity from data
Stability theory: regularisation reduces algorithmic stability variance (the Bousquet-Elisseeff framework)
Dropout, early stopping, and data augmentation as regularisation - with the formal theory behind each
The implicit regularisation of gradient descent: why SGD finds minimum-norm solutions
Five interview questions with full worked answers

Tikhonov Regularisation (L2 / Ridge)

The oldest and most theoretically understood form of regularisation:

$h^*_{Ridge} = \arg\min_{\mathbf{w}} \sum_{i=1}^n (y_i - \mathbf{w}^\top x_i)^2 + \lambda\|\mathbf{w}\|_2^2$

Closed form solution: $\hat{\mathbf{w}}_{Ridge} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$

The $\lambda\mathbf{I}$ term ensures the matrix is invertible even when $\mathbf{X}^\top\mathbf{X}$ is singular (rank deficient, when $n < d$ ). This is the original motivation for Ridge regression: stability under collinearity.

Three Equivalent Views of L2 Regularisation

View 1 - Penalised ERM:

$\min_\mathbf{w} \frac{1}{n}\sum_i \ell(y_i, \mathbf{w}^\top x_i) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2$

View 2 - Constrained ERM (Lagrangian duality):

$\min_\mathbf{w} \frac{1}{n}\sum_i \ell(y_i, \mathbf{w}^\top x_i) \quad \text{s.t.} \quad \|\mathbf{w}\|_2^2 \leq B^2$

For every $\lambda$ in View 1, there exists $B$ in View 2 that gives the same solution (and vice versa).

View 3 - MAP Estimation with Gaussian Prior (Bayesian):

$\max_\mathbf{w} P(\mathbf{w} \mid \mathcal{D}) \propto \max_\mathbf{w} P(\mathcal{D} \mid \mathbf{w}) \cdot P(\mathbf{w})$

With $P(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \sigma_w^2 \mathbf{I})$ : $\log P(\mathbf{w}) = -\frac{1}{2\sigma_w^2}\|\mathbf{w}\|_2^2 + \text{const}$

Taking $\lambda = 1/(2\sigma_w^2)$ gives exactly the Ridge objective. More weight regularisation (larger $\lambda$ ) corresponds to a tighter Gaussian prior (smaller $\sigma_w^2$ ) on weights.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

def compare_regularisation(n_samples=50, n_features=30, noise_std=0.5):
    """
    Compare OLS, Ridge, and Lasso on underdetermined problem (n < d).
    """
    np.random.seed(42)
    # True model: only first 5 features are relevant
    true_w = np.zeros(n_features)
    true_w[:5] = np.array([3.0, -2.0, 1.5, -1.0, 0.8])

    # Generate data
    X = np.random.randn(n_samples, n_features)
    y = X @ true_w + np.random.normal(0, noise_std, n_samples)

    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Fit models
    lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]

    ridge_coefs = []
    lasso_coefs = []
    for lam in lambdas:
        ridge = Ridge(alpha=lam * n_samples, fit_intercept=False)
        ridge.fit(X_scaled, y)
        ridge_coefs.append(ridge.coef_.copy())

        lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
        lasso.fit(X_scaled, y)
        lasso_coefs.append(lasso.coef_.copy())

    ridge_coefs = np.array(ridge_coefs)
    lasso_coefs = np.array(lasso_coefs)

    # Plot coefficient paths
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    for ax, coefs, name in zip(axes,
                                [ridge_coefs, lasso_coefs],
                                ['Ridge (L2)', 'Lasso (L1)']):
        for j in range(n_features):
            color = 'red' if j < 5 else 'blue'
            alpha = 0.8 if j < 5 else 0.2
            lw = 2 if j < 5 else 0.5
            ax.semilogx(lambdas, coefs[:, j], color=color, alpha=alpha, linewidth=lw)
        ax.set_xlabel('Regularisation strength λ')
        ax.set_ylabel('Coefficient value')
        ax.set_title(f'{name} Regularisation Path')
        ax.axhline(0, color='gray', linestyle='--', linewidth=0.5)
        ax.grid(True, alpha=0.3)
        # Legend
        from matplotlib.lines import Line2D
        legend = [Line2D([0],[0], color='red', lw=2, label='True features (first 5)'),
                  Line2D([0],[0], color='blue', lw=1, label='Noise features')]
        ax.legend(handles=legend)

    plt.suptitle('Regularisation Paths: How λ Controls Model Complexity', fontsize=13)
    plt.tight_layout()
    plt.savefig('regularisation_paths.png', dpi=150)
    print("Regularisation paths saved")

    print("\nKey observations:")
    print("Ridge: all coefficients shrink toward 0, none exactly zero (no sparsity)")
    print("Lasso: coefficients shrink and become exactly 0 (sparsity)")
    print("Lasso = Laplace prior (sparse); Ridge = Gaussian prior (dense small weights)")

compare_regularisation()

Lasso (L1 Regularisation): Sparsity from Laplace Prior

$h^*_{Lasso} = \arg\min_\mathbf{w} \sum_{i=1}^n (y_i - \mathbf{w}^\top x_i)^2 + \lambda\|\mathbf{w}\|_1$

Bayesian view: MAP estimation with a Laplace prior $P(w_j) = \frac{\lambda}{2}\exp(-\lambda|w_j|)$ .

Why L1 produces sparsity: Geometrically, the L1 ball ( $\|\mathbf{w}\|_1 \leq B$ ) has corners at the axes. The optimal solution tends to hit corners - setting all but a few coordinates to exactly zero. The L2 ball ( $\|\mathbf{w}\|_2 \leq B$ ) is smooth and has no corners - solutions tend to be dense with many small nonzero values.

Property	Ridge (L2)	Lasso (L1)	Elastic Net (L1+L2)
Bayesian prior	Gaussian	Laplace	Gaussian + Laplace
Sparsity	No (dense)	Yes (exact zeros)	Yes (sparse)
Closed form	Yes	No	No
Feature selection	No	Yes	Yes
Correlated features	Averages them	Picks one arbitrarily	Handles both
Stability	Always	Can oscillate	Stable

Structural Risk Minimisation (SRM)

SRM (Vapnik, 1995) provides the theoretical framework for choosing model complexity:

Consider a nested sequence of hypothesis classes:

$\mathcal{H}_1 \subseteq \mathcal{H}_2 \subseteq \cdots \subseteq \mathcal{H}_k \subseteq \cdots$

with increasing VC dimensions $d_1 \leq d_2 \leq \cdots$ .

For each $\mathcal{H}_k$ , there's a generalization bound:

$R(h) \leq \hat{R}(h) + \underbrace{\epsilon_k(n, \delta)}_{\text{complexity penalty}}$

where $\epsilon_k = \sqrt{\frac{d_k(\ln(2n/d_k) + 1) + \ln(4/\delta)}{n}}$ .

SRM chooses the class $k^*$ and hypothesis $h^*$ minimizing the structural risk:

$k^*, h^* = \arg\min_{k, h \in \mathcal{H}_k} \left[\hat{R}(h) + \epsilon_k(n, \delta)\right]$

Connection to regularisation: The regularised objective $\hat{R}(h) + \lambda\Omega(h)$ is an approximation to SRM, where $\Omega(h)$ (e.g., $\|\mathbf{w}\|^2$ ) serves as a proxy for the complexity penalty $\epsilon_k$ , and $\lambda$ balances empirical fit against complexity.

SRM makes explicit: the optimal model complexity depends on $n$ . With more data, you can afford more complex models (larger $k$ ). This is the formal justification for scaling - as you get more data, increase model size.

import numpy as np
import matplotlib.pyplot as plt

def structural_risk_minimisation_demo():
    """
    Show how SRM selects optimal model complexity as n grows.
    """
    def empirical_risk(complexity, n):
        """Simulated empirical risk vs complexity (U-shaped training error)."""
        return np.maximum(0.5 - 0.02 * complexity + 0.001 * complexity**2, 0)

    def complexity_penalty(d_vc, n, delta=0.05):
        """VC-based complexity penalty term."""
        return np.sqrt(d_vc * (np.log(2*n/max(d_vc, 1)) + 1) / n)

    n_values = [50, 200, 1000, 5000]
    complexities = np.arange(1, 50)

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    for n, ax_idx in zip([200, 2000], [0, 1]):
        ax = axes[ax_idx]
        emp_risks = [empirical_risk(k, n) for k in complexities]
        penalties = [complexity_penalty(k, n) for k in complexities]
        structural_risks = [e + p for e, p in zip(emp_risks, penalties)]

        ax.plot(complexities, emp_risks, 'b-', linewidth=2, label='Empirical risk')
        ax.plot(complexities, penalties, 'r--', linewidth=2, label='Complexity penalty')
        ax.plot(complexities, structural_risks, 'g-', linewidth=3, label='Structural risk')

        opt_k = complexities[np.argmin(structural_risks)]
        ax.axvline(opt_k, color='orange', linewidth=2, linestyle=':',
                   label=f'Optimal complexity k*={opt_k}')

        ax.set_xlabel('Model complexity (VC dimension)')
        ax.set_ylabel('Risk / Bound')
        ax.set_title(f'SRM: n={n} training examples')
        ax.legend()
        ax.grid(True, alpha=0.3)

    plt.suptitle('Structural Risk Minimisation: Optimal Complexity Grows With n', fontsize=13)
    plt.tight_layout()
    plt.savefig('srm_demo.png', dpi=150)
    print("SRM plot saved")
    print("Key insight: as n grows, optimal complexity increases")
    print("This justifies scaling laws in deep learning!")

structural_risk_minimisation_demo()

Algorithm Stability and Generalisation

A different perspective on regularisation: a learning algorithm is stable if small changes to the training set produce small changes in the output hypothesis.

Definition (Uniform Stability): Algorithm $A$ is $\beta$ -uniformly stable if for any training sets $S, S'$ differing in at most one example:

$\sup_z |\ell(A(S), z) - \ell(A(S'), z)| \leq \beta$

Theorem (Bousquet & Elisseeff, 2002): If algorithm $A$ is $\beta$ -uniformly stable, then with probability $\geq 1-\delta$ :

$R(A(S)) \leq \hat{R}(A(S)) + 2\beta + (4n\beta + 1)\sqrt{\frac{\ln(1/\delta)}{2n}}$

Ridge regression is $O(\lambda^{-1}/n)$ -stable: The L2 penalty makes the objective strongly convex, which forces the solution to change smoothly when one training example changes. This is the formal reason L2 regularisation helps generalisation - not just because it constrains complexity, but because it makes the algorithm stable.

In contrast, 1-nearest-neighbor is $O(1)$ -stable in a different sense (changing one example can completely change predictions in a region), and ERM without regularisation on a complex class can be highly unstable.

Dropout as Regularisation: Theory

Dropout was introduced empirically in Hinton et al. (2012). Subsequent theory has provided several interpretations:

Interpretation 1: Noise-Based Regularisation

Dropout randomly sets neurons to zero during training. This adds multiplicative noise to hidden activations. The regularisation effect: the network cannot rely on any single neuron - it must learn redundant representations.

Mathematically, for a linear model with dropout on inputs ( $p = 0.5$ drop probability):

$L_{dropout} = E_\epsilon\left[\sum_i (y_i - (\mathbf{w} \odot \epsilon)^\top x_i)^2\right] = \|X\mathbf{w} - \mathbf{y}\|^2 + \text{const} \cdot \sum_j w_j^2 \|x_j\|^2$

Dropout is equivalent to weighted L2 regularisation - each weight is penalised proportional to the scale of its input feature.

Interpretation 2: Bayesian Approximation (Gal & Ghahramani, 2016)

Dropout at test time (Monte Carlo Dropout) approximates Bayesian inference in a deep Gaussian process. The posterior distribution over network outputs is approximated by:

$P(y^* \mid x^*, \mathcal{D}) \approx \frac{1}{T}\sum_{t=1}^T P(y^* \mid x^*, \hat{\mathbf{W}}_t)$

where $\hat{\mathbf{W}}_t$ are the weights with different dropout masks applied. This gives epistemic uncertainty estimates - predicting both the mean output and the variance, where high variance indicates the model is uncertain.

import torch
import torch.nn as nn
import numpy as np

class BayesianDropoutNN(nn.Module):
    """
    Neural network with MC Dropout for uncertainty estimation.
    Dropout is kept ON at test time for approximate Bayesian inference.
    """
    def __init__(self, input_dim, hidden_dim, output_dim, dropout_p=0.5):
        super().__init__()
        self.dropout_p = dropout_p
        self.network = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=dropout_p),   # Dropout after first layer
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p=dropout_p),   # Dropout after second layer
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.network(x)

    def predict_with_uncertainty(self, x, n_forward_passes=100):
        """
        MC Dropout: run T stochastic forward passes to estimate uncertainty.
        Returns mean prediction and epistemic uncertainty.
        """
        self.train()  # Keep dropout active during inference
        predictions = []
        with torch.no_grad():
            for _ in range(n_forward_passes):
                pred = self.forward(x)
                predictions.append(pred)

        predictions = torch.stack(predictions)  # (T, batch, output_dim)
        mean = predictions.mean(0)
        epistemic_uncertainty = predictions.var(0)  # variance across passes
        return mean, epistemic_uncertainty

# Example: regression with uncertainty
np.random.seed(42)
model = BayesianDropoutNN(1, 50, 1, dropout_p=0.1)

# Generate test points
X_test = torch.linspace(-4, 4, 100).reshape(-1, 1)

# Before training: high uncertainty everywhere (prior)
mean_pred, uncertainty = model.predict_with_uncertainty(X_test)

print("MC Dropout Uncertainty Estimation:")
print(f"Mean prediction range: [{mean_pred.min():.3f}, {mean_pred.max():.3f}]")
print(f"Uncertainty range: [{uncertainty.min():.4f}, {uncertainty.max():.4f}]")
print("After training on x in [-2, 2], uncertainty should be lower in that range")
print("and higher outside (epistemic uncertainty from lack of training data)")

Early Stopping as Regularisation

Early stopping - halting gradient descent before convergence - is one of the most effective and widely used regularisation techniques, yet often misunderstood.

Theoretical connection to L2 regularisation (Yao et al., 2007): For gradient descent on a linear model with step size $\eta$ , stopping at step $T$ is equivalent to L2 regularisation with:

$\lambda_{equiv} \approx \frac{1}{\eta T}$

More training steps → smaller effective $\lambda$ → less regularisation → more complex solutions.

Intuition: Gradient descent on a quadratic loss starts with the solution at the initialisation (often near zero) and moves toward the ERM minimiser. The trajectory traverses solutions with increasing weight magnitude. Stopping early gives a solution with small weights - similar to the effect of L2 regularisation.

import numpy as np
import matplotlib.pyplot as plt

def early_stopping_demonstration():
    """
    Demonstrate early stopping as implicit L2 regularisation.
    Compare test MSE for gradient descent at different stopping points.
    """
    np.random.seed(42)
    n, d = 30, 50  # underdetermined: n < d

    # True sparse model
    w_true = np.zeros(d)
    w_true[:3] = [2.0, -1.5, 1.0]

    X = np.random.randn(n, d)
    y_train = X @ w_true + np.random.randn(n) * 0.3

    # Test set
    X_test = np.random.randn(500, d)
    y_test = X_test @ w_true + np.random.randn(500) * 0.3

    # Gradient descent on linear regression (no explicit regularisation)
    w = np.zeros(d)
    lr = 0.001
    n_steps = 3000

    train_errors, test_errors, weight_norms = [], [], []

    for step in range(n_steps):
        residuals = X @ w - y_train
        grad = X.T @ residuals / n
        w = w - lr * grad

        if step % 10 == 0:
            train_mse = np.mean((X @ w - y_train)**2)
            test_mse = np.mean((X_test @ w - y_test)**2)
            train_errors.append(train_mse)
            test_errors.append(test_mse)
            weight_norms.append(np.linalg.norm(w))

    steps = list(range(0, n_steps, 10))
    optimal_step = steps[np.argmin(test_errors)]
    optimal_test_err = min(test_errors)

    print(f"Early stopping analysis:")
    print(f"  Optimal stopping step: {optimal_step}")
    print(f"  Test MSE at optimal stop: {optimal_test_err:.4f}")
    print(f"  Test MSE at final step: {test_errors[-1]:.4f}")
    print(f"  Weight norm at optimal: {weight_norms[steps.index(optimal_step)]:.4f}")
    print(f"  Weight norm at final: {weight_norms[-1]:.4f}")
    print(f"  Equivalent L2 lambda ≈ {1/(lr*optimal_step):.4f}")

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    axes[0].semilogy(steps, train_errors, 'b-', label='Train MSE')
    axes[0].semilogy(steps, test_errors, 'r-', label='Test MSE')
    axes[0].axvline(optimal_step, color='green', linestyle='--',
                    label=f'Optimal stop (step={optimal_step})')
    axes[0].set_xlabel('Gradient descent step')
    axes[0].set_ylabel('MSE')
    axes[0].set_title('Early Stopping: Train vs Test Error')
    axes[0].legend()

    axes[1].plot(steps, weight_norms, 'purple')
    axes[1].axvline(optimal_step, color='green', linestyle='--')
    axes[1].set_xlabel('Gradient descent step')
    axes[1].set_ylabel('||w||_2')
    axes[1].set_title('Weight Norm Grows With Training Steps')

    plt.tight_layout()
    plt.savefig('early_stopping.png', dpi=150)
    return optimal_step, optimal_test_err

early_stopping_demonstration()

Practical Guide: Choosing Regularisation

Situation	Recommended Regularisation
Linear/logistic regression, many features	L2 (Ridge) by default; L1 if sparsity desired
Feature selection needed	L1 (Lasso) or Elastic Net
Neural network training	Weight decay (L2) + Dropout
Sequence models (RNN, LSTM)	Variational dropout, recurrent dropout
Transformers	Attention dropout, label smoothing
Convolutional networks	Batch normalisation, data augmentation
Small dataset	Stronger regularisation, more data augmentation
NLP fine-tuning	Layer-wise learning rate decay, weight decay
GAN discriminator	Spectral normalisation (Rademacher complexity control)

Interview Questions

Q1: What are the three equivalent views of L2 regularisation, and which is most useful for understanding why it works?

(1) Penalised ERM: $\min_\mathbf{w} \hat{R}(\mathbf{w}) + \lambda\|\mathbf{w}\|^2$ - penalises large weights, useful for computation; (2) Constrained ERM: $\min_\mathbf{w} \hat{R}(\mathbf{w})$ s.t. $\|\mathbf{w}\| \leq B$ - controls model complexity by bounding the hypothesis class, connects to Rademacher complexity $\mathcal{R}_n \leq BR/\sqrt{n}$ ; (3) MAP estimation: MLE under Gaussian prior $\mathbf{w} \sim \mathcal{N}(0, \sigma^2 I)$ , equivalent to $\lambda = 1/(2\sigma^2)$ - Bayesian interpretation. The constrained view is most fundamental for understanding generalisation: it shows L2 regularisation limits the effective VC dimension / Rademacher complexity of the hypothesis class, directly reducing the generalisation gap. The Bayesian view explains why the choice of $\lambda$ encodes prior beliefs about weight magnitudes.

Q2: Why does L1 regularisation produce sparsity but L2 does not?

Geometrically: the L1 constraint set $\{\mathbf{w}: \|\mathbf{w}\|_1 \leq B\}$ is a diamond (in 2D) with corners at the coordinate axes. The L2 constraint set is a smooth sphere with no corners. When you minimise the empirical loss (an elliptical function) subject to these constraints, the solution tends to lie at the boundary of the constraint set. For L2, the smooth boundary means the solution can be anywhere - rarely exactly on an axis. For L1, the boundary has sharp corners at the axes, and the solution is "attracted" to these corners where one or more coordinates are zero. Analytically: the L1 subdifferential at $w_j = 0$ is $[-\lambda, \lambda]$ - any gradient in this range leaves $w_j = 0$ as the optimal solution, creating exact zeros. L2 gradient at $w_j = 0$ is exactly 0, so any positive gradient pushes $w_j$ away from zero.

Q3: Explain algorithm stability and its connection to L2 regularisation.

Algorithm stability measures how much the output changes when you remove or change one training example. For Ridge regression, the objective is $\frac{1}{n}\|X\mathbf{w} - y\|^2 + \lambda\|\mathbf{w}\|^2$ , which is $(2\lambda)$ -strongly convex. Strong convexity implies $\beta$ -stability with $\beta = O(1/(\lambda n))$ . By the stability-based generalisation bound, $R(h) \leq \hat{R}(h) + O(\sqrt{\ln(1/\delta)/n}) + O(1/(\lambda n))$ . With optimal $\lambda = O(1/\sqrt{n})$ , the bound becomes $O(1/\sqrt{n})$ . This is the formal reason L2 regularisation guarantees generalisation: it makes the algorithm stable (insensitive to individual training examples), which implies the training error concentrates around the test error.

Q4: What is the theoretical connection between dropout and L2 regularisation?

For a linear model $y = \mathbf{w}^\top x$ with dropout (each $x_j$ set to 0 with probability $p$ and scaled by $1/(1-p)$ ), the expected loss under dropout masking is: $E_\xi[\|y - \mathbf{w}^\top (\xi \odot x)\|^2] = \|y - \mathbf{w}^\top E[\xi \odot x]\|^2 + \sum_j w_j^2 \text{Var}[\xi_j x_j]$ . With Bernoulli dropout at rate $p$ : $\text{Var}[\xi_j x_j] = p(1-p)x_j^2 / (1-p)^2 = p/(1-p) \cdot x_j^2$ . So dropout is equivalent to weighted L2 regularisation with penalty $\lambda_j = p/(1-p) \cdot x_j^2/n$ for each weight - each weight is regularised proportionally to the variance of its input feature. This is adaptive L2 regularisation: features with higher variance receive stronger regularisation.

Q5: What does early stopping correspond to in regularisation theory, and how do you choose when to stop?

Early stopping corresponds to L2 regularisation with effective strength $\lambda_{eff} \approx 1/(\eta T)$ where $\eta$ is the step size and $T$ is the number of steps. More precisely, for gradient descent on squared loss: after $T$ gradient steps from initialisation near zero, the solution lies in the Krylov subspace spanned by the first $T$ Lanczos vectors of $\mathbf{X}^\top\mathbf{X}$ , which filters out small singular values of $\mathbf{X}$ - equivalent to Tikhonov regularisation with threshold $1/(\eta T)$ . In practice, use a validation set and stop when validation loss stops decreasing (with patience = some number of epochs). Alternatively, use learning rate schedulers (cosine annealing, one-cycle) that naturally prevent late-stage overfitting. In deep learning, the relationship is less exact due to nonlinearity, but the intuition holds: training longer from small-norm initialisation → larger-norm solutions → less implicit regularisation.

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Regularization Path demo on the EngineersOfAI Playground - no code required.

:::

The Fundamental Tension​

What You Will Learn​

Tikhonov Regularisation (L2 / Ridge)​

Three Equivalent Views of L2 Regularisation​

Lasso (L1 Regularisation): Sparsity from Laplace Prior​

Structural Risk Minimisation (SRM)​

Algorithm Stability and Generalisation​

Dropout as Regularisation: Theory​

Interpretation 1: Noise-Based Regularisation​

Interpretation 2: Bayesian Approximation (Gal & Ghahramani, 2016)​

Early Stopping as Regularisation​

Practical Guide: Choosing Regularisation​

Interview Questions​