What is regularization?

Master regularization from first principles - bias-variance decomposition, L2 Bayesian interpretation as Gaussian prior, L1 sparsity via subdifferential geometry, elastic net path algorithms, coordinate descent for LASSO, and cross-validation for lambda selection.

How does L1 regularization work in practice?

Regularization - L1, L2, and ElasticNet covers regularization, L1 regularization, L2 regularization from first principles with code examples. Free lesson at https://engineersofai.com/docs/ml/linear-models/regularization-l1-l2-elasticnet

What is the difference between regularization and L2 regularization?

See the full breakdown at https://engineersofai.com/docs/ml/linear-models/regularization-l1-l2-elasticnet

Regularization: L1, L2, and ElasticNet

:::note Reading time: ~45 minutes | Interview relevance: Very High | Target roles: ML Engineer, Data Scientist, Research Engineer, Applied Scientist :::

The Real Interview Moment

You are interviewing for an ML engineer role at a genomics startup. The interviewer shows you this: a dataset with 800 tumor samples and 22,000 gene expression features. "We tried ordinary linear regression. It took two hours and the test accuracy was 43%. What went wrong, and how do you fix it?"

The candidate who says "add regularization" gets a follow-up: "Which regularization, how much, and why? What does L1 do that L2 cannot?"

The candidate who can derive why L1 produces sparse solutions from the subdifferential calculus, explain why L2 is equivalent to a Gaussian prior in Bayesian statistics, show how Ridge handles multicollinearity through its SVD analysis, and implement coordinate descent for LASSO from scratch - that candidate gets the offer.

Regularization is not a hyperparameter to tune mindlessly. It is a principled statistical technique with deep mathematical foundations connecting optimization geometry, Bayesian inference, and information theory. The bias-variance tradeoff, the sparsity of the L1 ball, the grouping effect of ElasticNet - every one of these has a rigorous derivation. This lesson covers all of it.

The genomics problem you just saw is not exotic. Every high-dimensional problem - text classification, financial factor modeling, drug discovery, medical imaging - shares the same structure: too many features, not enough samples. Regularization is the universal answer. But which regularization, and how much, determines whether your model finds the true signal or fits noise.

Why This Exists - The Overfitting Problem

For an unregularized model with $d$ features and $n$ training samples, three regimes exist:

When $d \ll n$ : a unique OLS solution exists and the model generalizes reasonably well.
When $d \approx n$ : the model memorizes training data and generalizes poorly. The Gram matrix $\mathbf{X}^\top\mathbf{X}$ is near-singular.
When $d > n$ : infinitely many solutions minimize training loss perfectly. Most generalize terribly. The Gram matrix is exactly singular - uninvertible.

The 22,000-gene dataset has 800 samples and 22,000 features. OLS literally has no unique solution. But this is not just a mathematical inconvenience - it reflects a statistical reality: there is not enough data to pin down 22,000 independent coefficients. The model is free to fit the noise in the training data, and it does.

Historical Context

The idea of penalized regression dates to Tikhonov (1943), who used $\ell_2$ penalties to regularize ill-posed inverse problems in physics. Ridge regression in the statistical sense was formalized by Hoerl and Kennard (1970) specifically to handle multicollinearity. LASSO (Least Absolute Shrinkage and Selection Operator) was introduced by Robert Tibshirani in 1996, motivated by the observation that Ridge shrinks coefficients but never zeroes them. The Least Angle Regression (LARS) algorithm of Efron et al. (2004) revealed the full LASSO regularization path and established its connection to forward stepwise selection. ElasticNet was proposed by Zou and Hastie (2005) to handle correlated features that LASSO handles poorly.

Formal Bias-Variance Decomposition

The fundamental theorem that motivates regularization: the expected squared prediction error at any point $x$ decomposes into three irreducible components.

$\mathbb{E}\!\left[(y - \hat{f}(x))^2\right] = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{Bias}^2[\hat{f}]} + \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{Var}[\hat{f}]} + \underbrace{\sigma^2}_{\text{irreducible noise}}$

where:

$\text{Bias}^2[\hat{f}]$ : the squared systematic error - how far the average prediction is from the true function $f(x)$ .
$\text{Var}[\hat{f}]$ : the sensitivity of predictions to the specific training set used - different training sets produce different $\hat{f}$ values.
$\sigma^2$ : the irreducible noise variance - present in any model no matter how good.

The regularization trade-off: OLS with $d \approx n$ has low bias (the model is flexible enough to approximate $f$ well) but very high variance (different training sets of 800 samples give wildly different 22,000-dimensional coefficient vectors). Regularization adds a penalty that forces coefficients toward zero, introducing bias but dramatically reducing variance. The optimal $\lambda$ sits at the sweet spot where total error $= \text{Bias}^2 + \text{Var}$ is minimized.

As $\lambda \to 0$ : OLS - minimum bias, maximum variance. As $\lambda \to \infty$ : predict the global mean - maximum bias, zero variance. Cross-validation finds the optimal $\lambda$ between these extremes.

L2 Regularization (Ridge Regression)

L2 regularization adds the squared L2 norm of weights to the loss:

$\mathcal{J}_\text{ridge}(\mathbf{w}) = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|^2_2$

Bayesian Derivation: Gaussian Prior

Ridge regression is exactly MAP (Maximum A Posteriori) estimation with a Gaussian prior on the weights:

$P(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I}) \propto \exp\!\left(-\frac{\|\mathbf{w}\|^2}{2\tau^2}\right)$

Assuming Gaussian likelihood $P(\mathbf{y}|\mathbf{X},\mathbf{w}) = \mathcal{N}(\mathbf{X}\mathbf{w}, \sigma^2\mathbf{I})$ , the MAP objective is:

$\hat{\mathbf{w}}_\text{MAP} = \arg\max_\mathbf{w}\; \log P(\mathbf{y}|\mathbf{X},\mathbf{w}) + \log P(\mathbf{w})$

Substituting:

$= \arg\min_\mathbf{w}\; \frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \frac{1}{2\tau^2}\|\mathbf{w}\|^2$

Setting $\lambda = \sigma^2/\tau^2$ recovers exactly the Ridge objective. The regularization strength equals the ratio of noise variance to prior variance - when noise is large relative to the prior spread, $\lambda$ is large, penalizing coefficients heavily. The Gaussian prior's prior precision $1/\tau^2$ equals $\lambda/\sigma^2$ - stronger regularization means a tighter prior.

This Bayesian framing provides something the frequentist framing cannot: a principled way to set $\lambda$ if you have prior knowledge about the scale of the true coefficients.

Ridge Closed-Form Solution

Setting the gradient to zero:

$\nabla_\mathbf{w}\mathcal{J}_\text{ridge} = \frac{2}{n}\mathbf{X}^\top(\mathbf{X}\mathbf{w} - \mathbf{y}) + 2\lambda\mathbf{w} = \mathbf{0}$

$(\mathbf{X}^\top\mathbf{X} + n\lambda\mathbf{I})\mathbf{w} = \mathbf{X}^\top\mathbf{y}$

$\boxed{\hat{\mathbf{w}}_\text{ridge} = (\mathbf{X}^\top\mathbf{X} + n\lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}}$

Adding $n\lambda\mathbf{I}$ to $\mathbf{X}^\top\mathbf{X}$ shifts all eigenvalues up by $n\lambda > 0$ , making the matrix always positive definite and invertible, regardless of whether $d > n$ . This directly fixes multicollinearity: near-zero eigenvalues of $\mathbf{X}^\top\mathbf{X}$ (which cause numerical instability in OLS) become at least $n\lambda$ under Ridge. The regularization is literally adding a small positive number to the diagonal of an ill-conditioned matrix to stabilize it.

SVD Analysis: Which Directions Are Shrunk?

The SVD of $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$ reveals exactly what Ridge does to each direction. With singular values $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_d$ :

$\hat{\mathbf{w}}_\text{ridge} = \sum_{j=1}^{d} \underbrace{\frac{\sigma_j^2}{\sigma_j^2 + n\lambda}}_{\text{shrinkage factor}} \cdot \frac{\mathbf{u}_j^\top\mathbf{y}}{\sigma_j} \cdot \mathbf{v}_j$

Compare to OLS:

$\hat{\mathbf{w}}_\text{OLS} = \sum_{j=1}^{d} \frac{\mathbf{u}_j^\top\mathbf{y}}{\sigma_j} \cdot \mathbf{v}_j$

Ridge multiplies each component by the shrinkage factor $\frac{\sigma_j^2}{\sigma_j^2 + n\lambda}$ :

Large $\sigma_j$ (direction well-supported by data): factor $\approx 1$ - OLS estimate kept almost unchanged.
Small $\sigma_j$ (direction with little data support): factor $\approx 0$ - coefficient shrunk toward zero.

This is the key insight: Ridge selectively penalizes directions where the data provides little information. These are exactly the directions where OLS is most unreliable. The regularization is adaptive - it is strongest where it is most needed.

For the gene expression problem: genes that are nearly collinear (correlated pathway members) have near-zero singular values in $\mathbf{X}^\top\mathbf{X}$ . Ridge shrinks their coefficients toward zero, effectively pooling their contribution and preventing any one gene from getting an enormous coefficient just because it correlates with another.

L1 Regularization (Lasso)

L1 regularization adds the L1 norm of weights:

$\mathcal{J}_\text{lasso}(\mathbf{w}) = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|_1 = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda\sum_{j=1}^d |w_j|$

No closed-form solution. The L1 norm is not differentiable at $w_j = 0$ . But this non-differentiability is precisely the source of sparsity.

Bayesian Derivation: Laplace Prior

LASSO is MAP estimation with a Laplace (double exponential) prior:

$P(w_j) = \text{Laplace}(0, b) = \frac{1}{2b}\exp\!\left(-\frac{|w_j|}{b}\right)$

The log-prior is $\log P(\mathbf{w}) = -\frac{1}{b}\|\mathbf{w}\|_1 + \text{const}$ . Setting $\lambda = \sigma^2/(nb)$ recovers the LASSO objective from the MAP criterion.

The Laplace distribution has two properties that explain LASSO's behavior:

Heavier tails than Gaussian: the prior assigns non-negligible probability to large values. Informative features can have large coefficients - the prior does not fight them.
Sharp peak at zero: the prior has a spike at $w_j = 0$ , creating a strong pull toward zero for small coefficients. Uninformative features get zeroed out.

In contrast, the Gaussian prior has a gentle rounded peak at zero - it pulls coefficients toward zero but symmetrically and smoothly. Only the Laplace prior's sharp kink at zero creates the hard zeros that define sparsity.

Why L1 Induces Sparsity: The Geometry

The constrained-optimization view provides the clearest geometric explanation. The regularized problem is equivalent to:

$\min_\mathbf{w}\; \mathcal{L}(\mathbf{w}) \quad \text{subject to}\quad \|\mathbf{w}\|_p \leq t(\lambda)$

The OLS solution $\mathbf{w}^*$ lives somewhere in $\mathbb{R}^d$ . The MSE loss contours are ellipses (in 2D) centered at $\mathbf{w}^*$ . The constrained solution is where the smallest loss ellipse first touches the constraint set.

L2 constraint (sphere):          L1 constraint (diamond):

     w₂                               w₂
      |                                |
  . . | . .                       .   |   .
 .  --+-- .                      .  \ | /  .
.  /     \ .                    .    \|/    .
. | w*  O |.                   .  w* /|\    .
 .  \     /  .                  .   / | \   .
  . . \ / . .                    . /  |  \ .
        |                           /  |  \
 -------+------- w₁          ------+------- w₁
        |                        Corner at w₁=0!

Ellipse touches sphere at a        Ellipse touches at a corner
generic angle. Rarely on axis.     where w₂=0 exactly.
Ridge: never exact zeros.          LASSO: exact zeros at corners.

L2 (sphere/circle): smooth everywhere, no preferred directions. The loss ellipse touches the constraint at a generic angle - almost never where $w_j = 0$ exactly. Ridge shrinks all coefficients but never exactly zeros them for finite $\lambda$ .

L1 (diamond/cross-polytope): has corners exactly on the coordinate axes (where one or more $w_j = 0$ ). The loss ellipse, as we inflate it outward from $\mathbf{w}^*$ , almost always first touches one of these corners. L1 produces exact zeros - not approximately zero, but exactly zero.

The Subdifferential Condition and KKT Analysis

At a non-differentiable point ( $w_j = 0$ ), we need the subdifferential of the L1 norm: $\partial|w_j| = [-1, 1]$ when $w_j = 0$ .

The KKT optimality condition for coordinate $j$ :

$0 \in \partial_{w_j}\mathcal{J}_\text{lasso} = -\frac{2}{n}\mathbf{x}_j^\top\mathbf{r} + \lambda \cdot \partial|w_j|$

where $\mathbf{r} = \mathbf{y} - \mathbf{X}\mathbf{w}$ is the current residual. The partial correlation is $\rho_j = \frac{1}{n}\mathbf{x}_j^\top\mathbf{r}$ .

Feature selection condition: feature $j$ is set to zero (excluded from the model) if and only if:

$|\rho_j| = \left|\frac{1}{n}\mathbf{x}_j^\top\mathbf{r}\right| \leq \frac{\lambda}{2}$

In words: if the correlation between feature $j$ and the current residual is smaller than $\lambda/2$ , the L1 penalty "wins" and drives $w_j$ to exactly zero. Feature selection is automatic - and it has a precise mathematical criterion.

For the 22,000-gene problem: most genes have low correlation with the residual (after accounting for the informative genes). For those genes, $|\rho_j| < \lambda/2$ and they are automatically excluded. Only the truly predictive genes survive.

Coordinate Descent for LASSO

Since L1 is non-differentiable at $w_j = 0$ , we cannot use standard gradient descent. Coordinate descent minimizes one coordinate at a time while holding all others fixed - and each coordinate update has a closed-form solution.

Derivation of the Soft-Threshold Update

For coordinate $j$ , the partial residual (contribution of all other features) is:

$r_i^{(j)} = y_i - \sum_{k \neq j} w_k x_{ik} = y_i - (\hat{y}_i - w_j x_{ij})$

The objective as a function of $w_j$ alone (with standardized features so $\frac{1}{n}\sum_i x_{ij}^2 = 1$ ):

$\mathcal{J}(w_j) = \frac{1}{n}\sum_{i=1}^n (r_i^{(j)} - w_j x_{ij})^2 + \lambda|w_j| + \text{const}$

Expanding: $= \underbrace{\frac{1}{n}\|r^{(j)}\|^2}_{\text{const}} - \underbrace{\frac{2}{n}\mathbf{x}_j^\top r^{(j)}}_{= 2\rho_j} \cdot w_j + w_j^2 + \lambda|w_j|$

The unconstrained minimizer in $w_j$ is $\rho_j$ . With the L1 penalty, the optimal value is the soft-thresholding operator:

$\hat{w}_j = \mathcal{S}(\rho_j, \lambda/2) = \text{sign}(\rho_j)\max(|\rho_j| - \lambda/2,\; 0)$

This can be written explicitly:

$\hat{w}_j = \begin{cases} \rho_j - \lambda/2 & \text{if } \rho_j > \lambda/2 \\ 0 & \text{if } |\rho_j| \leq \lambda/2 \\ \rho_j + \lambda/2 & \text{if } \rho_j < -\lambda/2 \end{cases}$

The name "soft-thresholding" describes the shape: values with $|\rho_j| \leq \lambda/2$ are set to exactly zero (hard threshold), while values outside are shifted toward zero by $\lambda/2$ (soft shrinkage). This is in contrast to hard thresholding which would jump discontinuously.

Why exact zeros: when $|\rho_j| \leq \lambda/2$ , moving $w_j$ away from zero in any direction increases the L1 penalty faster than it decreases the data fit. The L1 penalty at zero has a "flat spot" in the subdifferential sense - zero is optimal. With L2, the gradient is proportional to $w_j$ , so it shrinks but never reaches zero.

Full NumPy Coordinate Descent LASSO

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def soft_threshold(rho: float, lam: float) -> float:
    """
    Soft-thresholding operator S(ρ, λ).
    Core of LASSO coordinate descent.
    Sets |ρ| ≤ λ to exactly zero (L1 sparsity origin).
    """
    if rho > lam:
        return rho - lam
    elif rho < -lam:
        return rho + lam
    else:
        return 0.0


class LassoCoordinateDescent:
    """
    LASSO via coordinate descent with soft-thresholding.
    Assumes standardized features (unit variance per feature).
    This is the algorithm underlying sklearn's Lasso.

    Each iteration: for j in 1..d:
        rho_j = (1/n) * x_j^T * (y - X * w + w_j * x_j)
        w_j   = S(rho_j, alpha/2)

    Converges to the global LASSO solution (convex problem).
    """

    def __init__(self, alpha: float = 0.1, max_iter: int = 2000,
                 tol: float = 1e-6):
        self.alpha = alpha
        self.max_iter = max_iter
        self.tol = tol
        self.coef_: np.ndarray | None = None
        self.intercept_: float = 0.0
        self.n_iter_: int = 0
        self.loss_path_: list[float] = []

    def fit(self, X: np.ndarray, y: np.ndarray) -> "LassoCoordinateDescent":
        n, d = X.shape

        # Center the target; use mean as intercept
        self.intercept_ = np.mean(y)
        y_c = y - self.intercept_

        # Initialize at zero (warm start)
        self.coef_ = np.zeros(d)

        for iteration in range(self.max_iter):
            coef_prev = self.coef_.copy()

            for j in range(d):
                # Partial residual: y - sum_{k≠j} w_k x_k
                # Efficient: full residual + add back the j-th term
                residual_j = y_c - X @ self.coef_ + X[:, j] * self.coef_[j]

                # Partial correlation (standardized features: ||x_j||²/n = 1)
                rho_j = (1.0 / n) * (X[:, j] @ residual_j)

                # Soft-threshold update
                self.coef_[j] = soft_threshold(rho_j, self.alpha / 2.0)

            # Track loss for convergence visualization
            y_hat = X @ self.coef_ + self.intercept_
            mse = np.mean((y - y_hat) ** 2)
            l1_pen = self.alpha * np.sum(np.abs(self.coef_))
            self.loss_path_.append(mse + l1_pen)

            # Convergence: max absolute change in coefficients
            max_change = np.max(np.abs(self.coef_ - coef_prev))
            if max_change < self.tol:
                self.n_iter_ = iteration + 1
                break
        else:
            self.n_iter_ = self.max_iter

        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        return X @ self.coef_ + self.intercept_

    def score(self, X: np.ndarray, y: np.ndarray) -> float:
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return float(1.0 - ss_res / ss_tot)

    @property
    def n_nonzero_(self) -> int:
        return int(np.sum(np.abs(self.coef_) > 1e-8))


# ── Demonstration on high-dimensional sparse problem ──────────────────────────
np.random.seed(42)
X, y, true_coef = make_regression(
    n_samples=500, n_features=200, n_informative=20,
    noise=10.0, coef=True, random_state=42
)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_tr_sc = scaler.fit_transform(X_tr)
X_te_sc  = scaler.transform(X_te)

# Fit our from-scratch LASSO
lasso_scratch = LassoCoordinateDescent(alpha=0.1, max_iter=2000, tol=1e-7)
lasso_scratch.fit(X_tr_sc, y_tr)

print("LASSO (coordinate descent from scratch):")
print(f"  Non-zero coefficients : {lasso_scratch.n_nonzero_}/200")
print(f"  Test R²               : {lasso_scratch.score(X_te_sc, y_te):.4f}")
print(f"  Iterations to converge: {lasso_scratch.n_iter_}")

# Verify: do we recover the true sparse support?
true_support  = set(np.where(np.abs(true_coef) > 1e-6)[0])
found_support = set(np.where(np.abs(lasso_scratch.coef_) > 1e-6)[0])
precision = len(true_support & found_support) / max(len(found_support), 1)
recall    = len(true_support & found_support) / max(len(true_support), 1)
print(f"  Support precision     : {precision:.3f}")
print(f"  Support recall        : {recall:.3f}")

Full Sklearn Implementation - Ridge, Lasso, ElasticNet

from sklearn.linear_model import (
    Ridge, Lasso, ElasticNet, LinearRegression,
    RidgeCV, LassoCV, ElasticNetCV, lasso_path
)


def compare_regularizers(X_tr, y_tr, X_te, y_te) -> None:
    """Compare all regularizers on the same high-dimensional problem."""

    configs = [
        ("OLS (no regularization)",  LinearRegression()),
        ("Ridge (α=1.0)",            Ridge(alpha=1.0)),
        ("Ridge (α=10.0)",           Ridge(alpha=10.0)),
        ("Lasso (α=0.05)",           Lasso(alpha=0.05, max_iter=10_000)),
        ("Lasso (α=0.5)",            Lasso(alpha=0.5,  max_iter=10_000)),
        ("ElasticNet (α=0.1, l1=0.5)", ElasticNet(alpha=0.1, l1_ratio=0.5,
                                                    max_iter=10_000)),
        ("ElasticNet (α=0.1, l1=0.9)", ElasticNet(alpha=0.1, l1_ratio=0.9,
                                                    max_iter=10_000)),
    ]

    print(f"{'Model':40s}  {'Train R²':>9}  {'Test R²':>8}  {'Non-zero':>9}")
    print("-" * 75)

    for name, model in configs:
        model.fit(X_tr, y_tr)
        tr_r2 = model.score(X_tr, y_tr)
        te_r2 = model.score(X_te, y_te)
        coef   = getattr(model, "coef_", None)
        nnz    = int(np.sum(np.abs(coef) > 1e-8)) if coef is not None else "-"
        print(f"{name:40s}  {tr_r2:9.4f}  {te_r2:8.4f}  {nnz:>9}")


compare_regularizers(X_tr_sc, y_tr, X_te_sc, y_te)

Regularization Paths and the LARS Connection

The regularization path shows how coefficients evolve as $\lambda$ decreases from $\infty$ (all-zero model) to $0$ (OLS model). The path reveals which features enter the model first - those most correlated with the target.

# ── Lasso regularization path ─────────────────────────────────────────────────
alphas_path, coefs_path, _ = lasso_path(
    X_tr_sc, y_tr,
    alphas=np.logspace(-3, 1, 150)[::-1],  # from large to small
    max_iter=20_000
)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Left: Lasso path (features enter the model one at a time)
ax = axes[0]
# Highlight top-20 features by maximum absolute coefficient
peak_importance = np.max(np.abs(coefs_path), axis=1)
top20 = np.argsort(peak_importance)[::-1][:20]
for idx in top20:
    ax.semilogx(alphas_path, coefs_path[idx], linewidth=1.5, alpha=0.8)
ax.axvline(x=0.05, color="crimson", linestyle="--", linewidth=2,
           label="Selected α=0.05")
ax.set_xlabel("Regularization α (log scale → decreasing)")
ax.set_ylabel("Coefficient value")
ax.set_title("Lasso Regularization Path\n"
             "Features enter model as α decreases; many stay exactly zero")
ax.legend()
ax.grid(True, alpha=0.25)

# Right: Number of non-zero features vs alpha
ax = axes[1]
n_nonzero_vs_alpha = [np.sum(np.abs(c) > 1e-8) for c in coefs_path.T]
ax.semilogx(alphas_path, n_nonzero_vs_alpha, "b-", linewidth=2)
ax.axvline(x=0.05, color="crimson", linestyle="--", linewidth=2)
ax.set_xlabel("Regularization α (log scale)")
ax.set_ylabel("Number of non-zero features")
ax.set_title("LASSO Feature Selection vs Regularization Strength\n"
             "Increasing α forces more coefficients to zero")
ax.grid(True, alpha=0.25)

plt.tight_layout()
plt.savefig("lasso_path.png", dpi=150, bbox_inches="tight")
plt.show()

# ── Ridge path (no features reach exactly zero) ───────────────────────────────
alphas_ridge = np.logspace(-3, 5, 100)
coefs_ridge = np.array([
    Ridge(alpha=a).fit(X_tr_sc, y_tr).coef_
    for a in alphas_ridge
])

plt.figure(figsize=(8, 5))
for idx in top20:
    plt.semilogx(alphas_ridge, coefs_ridge[:, idx], linewidth=1.2, alpha=0.7)
plt.xlabel("Regularization α (log scale)")
plt.ylabel("Coefficient value")
plt.title("Ridge Regularization Path\n"
          "All features shrink toward 0 but none are exactly 0")
plt.grid(True, alpha=0.25)
plt.tight_layout()
plt.savefig("ridge_path.png", dpi=150, bbox_inches="tight")
plt.show()

The LARS algorithm (Least Angle Regression, Efron et al. 2004) traces the LASSO path exactly and efficiently. LARS starts with all coefficients at zero and at each step adds the feature most correlated with the current residual, moving its coefficient in the direction of the correlation until another feature becomes equally correlated. At that bifurcation point, both features move together. This reveals the LASSO path as a sequence of piecewise-linear segments - and LARS computes the entire path in the same computational cost as a single OLS fit.

ElasticNet: Combining L1 and L2

ElasticNet (Zou and Hastie, 2005) adds both penalties:

$\mathcal{J}_\text{enet}(\mathbf{w}) = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \alpha \cdot \text{l1\_ratio} \cdot \|\mathbf{w}\|_1 + \frac{\alpha(1-\text{l1\_ratio})}{2}\|\mathbf{w}\|_2^2$

The l1_ratio parameter (also written as $\rho$ or the mixing parameter) controls the balance:

l1_ratio = 0: pure Ridge (no sparsity)
l1_ratio = 1: pure Lasso (maximum sparsity)
l1_ratio = 0.5: equal mix (sklearn default)

In the general form: $\lambda[(1-\alpha)\|\mathbf{w}\|_2^2 + \alpha\|\mathbf{w}\|_1]$ where $\alpha$ is the mixing parameter.

The grouping effect: LASSO's fundamental limitation is that when features are highly correlated (e.g., genes in the same biological pathway, TF-IDF features for related words), LASSO arbitrarily selects one representative from each correlated group and zeros out the rest. The selected representative can change dramatically between training runs on similar data - LASSO is unstable under correlated features.

The L2 component of ElasticNet encourages correlated features to have similar coefficients - the "grouping effect." When features are nearly identical ( $\mathbf{x}_j \approx \mathbf{x}_k$ ), their ElasticNet coefficients are nearly identical, while LASSO would assign all weight to one arbitrarily.

# ── Grouping effect: Lasso vs ElasticNet on correlated features ───────────────
np.random.seed(42)
n_corr, d_corr = 400, 60

# Create strongly correlated group: features 0–4 are near-duplicates of feature 0
X_corr = np.random.randn(n_corr, d_corr)
X_corr[:, 1] = X_corr[:, 0] + 0.05 * np.random.randn(n_corr)
X_corr[:, 2] = X_corr[:, 0] + 0.05 * np.random.randn(n_corr)
X_corr[:, 3] = X_corr[:, 0] + 0.08 * np.random.randn(n_corr)
X_corr[:, 4] = X_corr[:, 0] + 0.08 * np.random.randn(n_corr)

# True model: ALL five correlated features contribute equally
y_corr = 2.0 * (X_corr[:, 0] + X_corr[:, 1] + X_corr[:, 2] +
                X_corr[:, 3] + X_corr[:, 4]) / 5.0
y_corr += X_corr[:, 20] * 1.5   # one uncorrelated feature
y_corr += np.random.randn(n_corr) * 0.3

sc_corr = StandardScaler()
X_corr_sc = sc_corr.fit_transform(X_corr)

lasso_corr = Lasso(alpha=0.05, max_iter=10_000).fit(X_corr_sc, y_corr)
enet_corr  = ElasticNet(alpha=0.05, l1_ratio=0.5, max_iter=10_000).fit(
    X_corr_sc, y_corr)

print("\nGrouping effect - correlated features 0–4:")
print(f"  Lasso      coef[0:5] = {np.round(lasso_corr.coef_[:5], 3)}")
print(f"  ElasticNet coef[0:5] = {np.round(enet_corr.coef_[:5], 3)}")
print("  Lasso picks 1 arbitrarily; ElasticNet distributes weight across all 5")

# Stability: run Lasso on 10 subsamples - does it pick the same feature?
print("\nLasso instability on correlated features (10 subsamples):")
selected_feature = []
for seed in range(10):
    rng = np.random.default_rng(seed)
    idx = rng.choice(n_corr, size=300, replace=False)
    X_sub = sc_corr.fit_transform(X_corr[idx])
    y_sub = y_corr[idx]
    lasso_sub = Lasso(alpha=0.05, max_iter=10_000).fit(X_sub, y_sub)
    # Which of features 0-4 was selected?
    selected = np.argmax(np.abs(lasso_sub.coef_[:5]))
    selected_feature.append(selected)
from collections import Counter
print(f"  Feature selected from correlated group: {Counter(selected_feature)}")
print("  Different subsamples → different arbitrary choices (instability)")

Cross-Validation for Lambda Selection

The optimal $\lambda$ must be chosen via cross-validation. Never set it manually - the optimal value depends entirely on the scale of your features, the noise level, and the sparsity of the true model.

# ── RidgeCV: built-in efficient generalized CV ────────────────────────────────
ridge_cv = RidgeCV(
    alphas=np.logspace(-3, 5, 200),
    cv=5,
    scoring="r2"
)
ridge_cv.fit(X_tr_sc, y_tr)
print(f"\nRidgeCV:")
print(f"  Optimal α  : {ridge_cv.alpha_:.5f}")
print(f"  Test R²    : {ridge_cv.score(X_te_sc, y_te):.4f}")

# ── LassoCV: computes entire regularization path efficiently ──────────────────
lasso_cv = LassoCV(
    n_alphas=200,
    cv=5,
    max_iter=20_000,
    random_state=42
)
lasso_cv.fit(X_tr_sc, y_tr)
print(f"\nLassoCV:")
print(f"  Optimal α  : {lasso_cv.alpha_:.5f}")
print(f"  Test R²    : {lasso_cv.score(X_te_sc, y_te):.4f}")
print(f"  Features   : {np.sum(np.abs(lasso_cv.coef_) > 1e-8)}/200")

# ── ElasticNetCV: searches over both α and l1_ratio jointly ──────────────────
enet_cv = ElasticNetCV(
    l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 1.0],
    n_alphas=100,
    cv=5,
    max_iter=20_000,
    random_state=42,
    n_jobs=-1
)
enet_cv.fit(X_tr_sc, y_tr)
print(f"\nElasticNetCV:")
print(f"  Optimal α   : {enet_cv.alpha_:.5f}")
print(f"  Optimal l1  : {enet_cv.l1_ratio_:.2f}")
print(f"  Test R²     : {enet_cv.score(X_te_sc, y_te):.4f}")
print(f"  Features    : {np.sum(np.abs(enet_cv.coef_) > 1e-8)}/200")

# ── Visualize CV error curve ──────────────────────────────────────────────────
plt.figure(figsize=(9, 5))
mean_mse = lasso_cv.mse_path_.mean(axis=1)
std_mse  = lasso_cv.mse_path_.std(axis=1) / np.sqrt(5)

plt.semilogx(lasso_cv.alphas_, mean_mse, "b-", linewidth=2, label="Mean CV MSE")
plt.fill_between(lasso_cv.alphas_,
                 mean_mse - std_mse,
                 mean_mse + std_mse,
                 alpha=0.2, color="blue", label="±1 SE")
plt.axvline(lasso_cv.alpha_, color="crimson", linestyle="--", linewidth=2,
            label=f"Optimal α={lasso_cv.alpha_:.4f}")
plt.xlabel("Regularization α (log scale)")
plt.ylabel("Cross-Validated MSE")
plt.title("LassoCV Error Curve\n"
          "Minimum identifies the best bias-variance operating point")
plt.legend()
plt.grid(True, alpha=0.25)
plt.tight_layout()
plt.savefig("lasso_cv_curve.png", dpi=150, bbox_inches="tight")
plt.show()

Bias-Variance Tradeoff Visualization

# ── How bias and variance change as α increases ────────────────────────────────
from sklearn.model_selection import cross_validate

alphas_bv = np.logspace(-4, 5, 60)
train_errs, val_errs = [], []

for alpha in alphas_bv:
    cv_res = cross_validate(
        Ridge(alpha=alpha), X_tr_sc, y_tr,
        cv=5, scoring="neg_mean_squared_error",
        return_train_score=True
    )
    train_errs.append(-cv_res["train_score"].mean())
    val_errs.append(-cv_res["test_score"].mean())

opt_idx   = int(np.argmin(val_errs))
opt_alpha = alphas_bv[opt_idx]

plt.figure(figsize=(10, 5))
plt.semilogx(alphas_bv, train_errs, "b-", linewidth=2,
             label="Train MSE (proxy for bias²)")
plt.semilogx(alphas_bv, val_errs,   "r-", linewidth=2,
             label="Val MSE = bias² + variance")
plt.axvline(opt_alpha, color="green", linestyle="--", linewidth=2,
            label=f"Optimal α={opt_alpha:.2f}")

plt.annotate("High variance\n(underfitting: λ too small)",
             xy=(alphas_bv[3], val_errs[3]),
             fontsize=9, ha="left", color="darkred")
plt.annotate("High bias\n(underfitting: λ too large)",
             xy=(alphas_bv[-5], val_errs[-5]),
             fontsize=9, ha="right", color="darkblue")

plt.xlabel("Regularization α (log scale)")
plt.ylabel("MSE")
plt.title("Bias-Variance Tradeoff Under Ridge Regularization\n"
          "Optimal α balances the two sources of error")
plt.legend()
plt.grid(True, alpha=0.25)
plt.tight_layout()
plt.savefig("bias_variance_ridge.png", dpi=150, bbox_inches="tight")
plt.show()

Regularization in Deep Learning

Understanding L1/L2 regularization in linear models pays dividends in deep learning.

Weight decay = L2 regularization: in deep learning frameworks, the optimizer's weight_decay parameter adds $\lambda\|\mathbf{w}\|_2^2$ to the loss - this is identical to Ridge regularization. The gradient update becomes:

$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta(\nabla\mathcal{L} + 2\lambda\mathbf{w}_t) = (1 - 2\eta\lambda)\mathbf{w}_t - \eta\nabla\mathcal{L}$

The term $(1 - 2\eta\lambda)$ multiplies weights by a factor slightly less than 1 each step - hence the name "weight decay."

L2 vs Dropout: both reduce overfitting by different mechanisms. L2 penalizes large weights, encouraging the model to use all features slightly rather than a few features heavily. Dropout randomly zeroes activations during training, forcing the network to learn redundant representations. For neural networks, dropout is generally preferred because the model is usually not capacity-limited in the same way as linear models - adding noise to activations is more effective than penalizing weights directly.

import torch
import torch.nn as nn

# Weight decay = L2 regularization in PyTorch
model = nn.Linear(100, 1)
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3,
                             weight_decay=1e-4)  # equivalent to λ = 1e-4 L2

# Equivalent manual implementation (verifying the connection)
def manual_l2_loss(model, inputs, targets, lam=1e-4):
    outputs = model(inputs)
    mse = nn.functional.mse_loss(outputs, targets)
    # L2 penalty on all weight parameters (not biases)
    l2_penalty = sum(p.pow(2).sum() for name, p in model.named_parameters()
                     if "weight" in name)
    return mse + lam * l2_penalty

print("Weight decay (1e-4) is equivalent to L2 penalty λ=1e-4")
print("PyTorch AdamW decouples weight decay from the adaptive learning rate")
print("This is important: Adam + weight_decay ≠ Adam + L2 penalty due to")
print("the interaction between the gradient scaling and the decay term.")

When to Use Each Regularizer

Situation	Recommended
All features likely relevant, multicollinearity present	Ridge
Feature selection needed, $d \gg n$ , sparse true model	Lasso
Correlated features that should be selected together	ElasticNet
Genomics, text, finance with correlated feature groups	ElasticNet
Need closed-form analytical solution	Ridge only
Always: lambda selection	RidgeCV / LassoCV / ElasticNetCV

Common Mistakes

:::danger Not standardizing features before regularization

# WRONG - raw features on different scales make regularization unfair
lasso = Lasso(alpha=0.1).fit(X_raw, y)

# CORRECT - standardize first so every feature is penalized equally
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_sc = sc.fit_transform(X_train)
lasso = Lasso(alpha=0.1).fit(X_sc, y)

The L1 and L2 penalties measure the weight magnitude in the units of $\mathbf{w}$ . A feature with values in $[0, 1000]$ naturally has smaller weights than a feature in $[0, 1]$ , so it gets penalized less - even though it may be no more informative. Standardizing features before applying any regularization is mandatory. Apply the same StandardScaler.transform() (never fit_transform) to the test set. :::

:::danger Choosing lambda by visual inspection of training loss The regularization strength $\lambda$ controls bias vs variance. Training loss always decreases as $\lambda$ decreases - so you cannot use training loss to choose $\lambda$ . Always use the CV variants: LassoCV, RidgeCV, ElasticNetCV. They compute the validation error efficiently over the entire regularization path. :::

:::warning Interpreting Lasso zeros as "irrelevant features" LASSO sets feature $j$ to zero when $|{\mathbf{x}_j^\top\mathbf{r}}|/n \leq \lambda/2$ - when its partial correlation with the residual falls below the threshold. With correlated features, LASSO may zero out a feature not because it is uninformative but because another correlated feature has already captured its information. The zeroed feature may have near-identical predictive power to the selected one. Use ElasticNet when correlated features should be selected together, or verify zeroed features with univariate tests. :::

:::warning Ridge does not perform feature selection Ridge shrinks all coefficients toward zero but keeps all features active. For a model with 22,000 genes, Ridge returns 22,000 non-zero coefficients. If your goal is identifying which genes are predictive (a very common goal in genomics), Ridge is wrong. Use Lasso or ElasticNet. :::

YouTube Resources

Resource	Channel	Why Watch
Regularization - Clearly Explained	StatQuest with Josh Starmer	Best visual walkthrough of Ridge, Lasso, and the bias-variance tradeoff - start here
Ridge vs Lasso Regression	StatQuest with Josh Starmer	Detailed comparison with the geometric sparsity argument explained visually
MIT 6.S897: LASSO and Sparse Recovery	MIT OpenCourseWare	Rigorous treatment of LASSO, soft thresholding, and LARS - for deeper theory
Bayesian Interpretation of Regularization	Probabilistic ML (Tübingen)	Gaussian prior → Ridge; Laplace prior → LASSO; complete Bayesian derivation
ElasticNet and Correlated Features	StatQuest	When to use ElasticNet vs LASSO - the grouping effect explained clearly

Interview Q&A

Q1: Why does L1 regularization produce sparse models but L2 does not?

There are two equivalent explanations - geometric and algebraic.

Geometrically: the regularized problem is equivalent to minimizing the loss subject to a constraint on the norm $\|\mathbf{w}\|_p \leq t$ . The MSE loss has ellipsoidal contours centered at the OLS solution. The constrained solution is where the smallest ellipse first touches the constraint set. The L1 constraint set (cross-polytope/diamond in 2D) has corners exactly on the coordinate axes. Loss ellipses almost always first touch at a corner where one or more $w_j = 0$ exactly. The L2 constraint set (sphere/circle) has no corners - the touching point is generically not on an axis. Ridge shrinks all coefficients but never zeros them.

Algebraically: the L1 penalty has a subdifferential at $w_j = 0$ of $[-\lambda, \lambda]$ . The KKT optimality condition requires that the data gradient $|\partial\mathcal{L}/\partial w_j| \leq \lambda/2$ for $w_j = 0$ to be optimal. This is achievable for many features. The L2 gradient is $2\lambda w_j$ - it is zero only when $w_j = 0$ itself, but the gradient from the data pushes $w_j$ away from zero no matter how small - so only the limit $\lambda \to \infty$ gives exact zeros.

Q2: Derive the Ridge regression closed-form solution and explain how it fixes multicollinearity.

Setting the gradient of $\mathcal{J}_\text{ridge} = \frac{1}{n}\|\mathbf{y}-\mathbf{X}\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|^2$ to zero: $\frac{2}{n}\mathbf{X}^\top(\mathbf{X}\mathbf{w}-\mathbf{y}) + 2\lambda\mathbf{w} = \mathbf{0}$ . Rearranging: $(\mathbf{X}^\top\mathbf{X} + n\lambda\mathbf{I})\mathbf{w} = \mathbf{X}^\top\mathbf{y}$ , giving $\hat{\mathbf{w}}_\text{ridge} = (\mathbf{X}^\top\mathbf{X} + n\lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}$ .

Multicollinearity causes near-zero eigenvalues of $\mathbf{X}^\top\mathbf{X}$ . If the smallest eigenvalue is $\mu_\min \approx 0$ , inverting $\mathbf{X}^\top\mathbf{X}$ amplifies noise by $1/\mu_\min \to \infty$ - extremely unstable. Adding $n\lambda\mathbf{I}$ shifts all eigenvalues up by $n\lambda$ , making the smallest eigenvalue at least $n\lambda > 0$ . The condition number decreases from $\mu_\max/\mu_\min$ to $(\mu_\max + n\lambda)/(n\lambda)$ . Ridge literally adds numerical stability to an ill-conditioned system.

Q3: What is the Bayesian interpretation of L1 and L2 regularization?

Both are MAP (Maximum A Posteriori) estimation: $\hat{\mathbf{w}}_\text{MAP} = \arg\max_\mathbf{w} \log P(\mathbf{y}|\mathbf{X},\mathbf{w}) + \log P(\mathbf{w})$ .

L2 (Ridge): Gaussian prior $P(w_j) = \mathcal{N}(0, \tau^2)$ . Log-prior: $-\frac{1}{2\tau^2}\|\mathbf{w}\|^2 + \text{const}$ . This adds $-\lambda\|\mathbf{w}\|^2$ to the log-likelihood with $\lambda = \sigma^2/\tau^2$ - the Ridge objective. A Gaussian prior has a smooth, rounded peak at zero - coefficients are pulled toward zero but never exactly zeroed.

L1 (LASSO): Laplace prior $P(w_j) = \frac{1}{2b}\exp(-|w_j|/b)$ . Log-prior: $-\frac{1}{b}\|\mathbf{w}\|_1 + \text{const}$ . This adds $-\lambda\|\mathbf{w}\|_1$ to the log-likelihood - the LASSO objective. The Laplace distribution has a sharp, pointed peak at zero (non-differentiable, like the L1 norm itself) - coefficients are strongly pulled to exactly zero unless the data evidence is strong.

Q4: Explain the coordinate descent update for LASSO. Why does it produce exact zeros algebraically?

For coordinate $j$ with all others fixed, the partial residual is $r_i^{(j)} = y_i - \sum_{k\neq j} w_k x_{ik}$ . The objective reduces to $\mathcal{J}(w_j) = \frac{1}{n}\sum_i(r_i^{(j)} - w_j x_{ij})^2 + \lambda|w_j|$ .

With standardized features ( $\frac{1}{n}\sum_i x_{ij}^2 = 1$ ), the unconstrained minimum is $\rho_j = \frac{1}{n}\mathbf{x}_j^\top\mathbf{r}^{(j)}$ . Adding the L1 penalty at zero creates a non-differentiable kink. The subdifferential optimality condition gives: $\hat{w}_j = \text{sign}(\rho_j)\max(|\rho_j| - \lambda/2, 0)$ .

Exact zeros arise because: when $|\rho_j| \leq \lambda/2$ , the subdifferential at $w_j = 0$ contains the data gradient $-2\rho_j$ . So $w_j = 0$ satisfies the optimality condition - zero is a valid stationary point. With L2, the gradient at $w_j = 0$ from the penalty is $2\lambda \cdot 0 = 0$ and the data gradient $-2\rho_j \neq 0$ immediately pulls the coefficient away from zero.

Q5: How do you choose between Ridge, Lasso, and ElasticNet in practice?

Decision framework: (1) Do you need an interpretable sparse model or automatic feature selection? If yes, Lasso or ElasticNet; if no (all features expected to contribute), Ridge. (2) Are features in correlated groups (pathway genes, topic-related TF-IDF terms, financial factor exposures)? If yes, ElasticNet - Lasso picks one arbitrarily and its selection is unstable across subsamples. (3) Is $d > n$ ? Use Lasso or ElasticNet - Ridge still works but does not reduce model complexity. (4) Do you need a closed-form solution or analytical tractability? Ridge only. (5) Always use the CV variants - LassoCV, RidgeCV, ElasticNetCV - to select $\lambda$ . Never set it manually.

Q6: How does regularization interact with the bias-variance tradeoff, and when does increasing regularization actually improve test performance?

The expected test error decomposes as $\text{Bias}^2 + \text{Variance} + \sigma^2$ . Regularization increases bias (constraints prevent the model from fitting the true function perfectly) and decreases variance (the model is less sensitive to the specific training set). The question is which effect dominates.

When $d \approx n$ or $d > n$ : OLS variance is enormous - the model fits noise aggressively. Even a large increase in bias (e.g., heavily shrinking toward zero) can result in a net reduction of test error because the variance reduction dominates. In the 22,000-gene example with 800 samples, OLS has variance proportional to $d/n = 27.5 \times$ the signal - almost all of the model's variation is noise. Strong regularization (large $\lambda$ ) dramatically reduces variance at the cost of modest bias, and total error plummets.

When $n \gg d$ : OLS already has low variance. Regularization mainly introduces bias without meaningfully reducing variance. Here, only very small $\lambda$ (strong regularization only if specifically needed) improves test performance. The optimal $\lambda$ found by CV will be very small in this case - consistent with the mathematical analysis.

Stability Selection and Lasso as a Feature Screening Tool

A key limitation of LASSO for feature selection is its instability: the selected support can change substantially between slightly different training sets. Stability Selection (Meinshausen and Bühlmann, 2010) addresses this by combining LASSO with bootstrap resampling.

The algorithm: run LASSO on many (e.g., 100) bootstrap subsamples of the training data, at each of many regularization values. Record how often each feature is selected (its selection probability). Features with high selection probability across bootstrap runs are stably important; features selected erratically are likely false positives.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression

np.random.seed(42)
X, y, true_coef = make_regression(
    n_samples=300, n_features=100, n_informative=15,
    noise=15.0, coef=True, random_state=42
)
true_support = set(np.where(np.abs(true_coef) > 1e-6)[0])

sc = StandardScaler()
X_sc = sc.fit_transform(X)

# ── Stability Selection ────────────────────────────────────────────────────────
def stability_selection(X, y, alpha=0.1, n_bootstrap=100, subsample_frac=0.5):
    """
    Estimate selection probability for each feature under LASSO.
    Features with high selection probability are stably informative.
    """
    n, d = X.shape
    n_sub = int(n * subsample_frac)
    selection_counts = np.zeros(d)

    for seed in range(n_bootstrap):
        rng = np.random.default_rng(seed)
        idx = rng.choice(n, size=n_sub, replace=False)
        X_sub = X[idx]
        y_sub = y[idx]

        lasso = Lasso(alpha=alpha, max_iter=5000)
        lasso.fit(X_sub, y_sub)
        selection_counts += (np.abs(lasso.coef_) > 1e-8).astype(float)

    return selection_counts / n_bootstrap   # selection probabilities

sel_probs = stability_selection(X_sc, y, alpha=0.1, n_bootstrap=200)

# Features with selection probability > 0.6 are stably selected
threshold = 0.6
stable_features = set(np.where(sel_probs > threshold)[0])
precision = len(true_support & stable_features) / max(len(stable_features), 1)
recall    = len(true_support & stable_features) / max(len(true_support), 1)

print("Stability Selection Results:")
print(f"  Selected {len(stable_features)} features (threshold={threshold})")
print(f"  Precision: {precision:.3f}")
print(f"  Recall: {recall:.3f}")
print(f"  (Single-run LASSO precision/recall typically worse due to instability)")

# Show selection probabilities for top features
top_features = np.argsort(sel_probs)[::-1][:20]
print("\nTop 20 features by selection probability:")
for idx in top_features:
    is_true = "TRUE" if idx in true_support else "     "
    bar = "#" * int(sel_probs[idx] * 30)
    print(f"  Feature {idx:3d} [{is_true}] {sel_probs[idx]:.2f} {bar}")

Regularization Paths: The LARS-LASSO Connection

The Least Angle Regression (LARS) algorithm (Efron et al., 2004) computes the entire LASSO regularization path in $O(d^3 + nd^2)$ time - the same cost as a single OLS fit, regardless of the path length.

The LARS algorithm builds the path step by step:

Start with all coefficients at zero ( $\lambda = \infty$ ).
Identify the feature most correlated with the current residual and begin moving its coefficient in the direction of the correlation.
Continue until another feature becomes equally correlated with the residual.
At that point, add the new feature and move all currently active features jointly.
Repeat until all features are active or $\lambda = 0$ .

The LASSO path is recovered from LARS by adding a modification: when a coefficient crosses zero during the joint movement, set it to zero and remove it from the active set. This "drop step" is what makes the LARS path equal to the LASSO path.

from sklearn.linear_model import lars_path, LassoLars

# Compute the LASSO path via LARS (exact, efficient)
alphas_lars, _, coefs_lars = lars_path(
    X_sc, y, method="lasso", verbose=0
)

print(f"\nLARS-LASSO path:")
print(f"  Number of path steps: {len(alphas_lars)}")
print(f"  Alpha range: [{alphas_lars.min():.4f}, {alphas_lars.max():.4f}]")
print(f"  Path is piecewise linear - exact regularization path in O(d³) time")

# Verify: LassoLars at a specific alpha matches sklearn's Lasso
alpha_target = 0.1
lasso_lars = LassoLars(alpha=alpha_target).fit(X_sc, y)
lasso_std  = Lasso(alpha=alpha_target, max_iter=10_000).fit(X_sc, y)

max_diff = np.max(np.abs(lasso_lars.coef_ - lasso_std.coef_))
print(f"\nLassoLars vs Lasso at α={alpha_target}:")
print(f"  Max coefficient difference: {max_diff:.2e}  (should be ~0)")

Practical Regularization Checklist

When you encounter a new regression problem, follow this decision checklist:

def regularization_checklist(n_samples, n_features, has_correlated_groups,
                              need_feature_selection, need_closed_form):
    """
    Practical guide to regularization choice.
    Returns a recommendation with reasoning.
    """
    print("Regularization Decision Checklist")
    print("=" * 50)
    print(f"  n (samples)  : {n_samples}")
    print(f"  d (features) : {n_features}")
    print(f"  d/n ratio    : {n_features/n_samples:.2f}")
    print()

    if n_features / n_samples < 0.1:
        print("  d/n < 0.1: OLS may work fine.")
        print("  → Try OLS first; add Ridge if you see overfitting.")
    elif n_features / n_samples < 1.0:
        print("  0.1 ≤ d/n < 1: Regularization strongly recommended.")
    else:
        print("  d/n ≥ 1: OLS has no unique solution. Regularization required.")

    print()
    if need_closed_form:
        print("  Need closed form → Ridge only option.")
    elif need_feature_selection and has_correlated_groups:
        print("  Feature selection + correlated groups → ElasticNet (l1_ratio=0.5–0.9)")
    elif need_feature_selection:
        print("  Feature selection, no correlated groups → Lasso")
    elif has_correlated_groups:
        print("  Correlated groups, no selection needed → Ridge or ElasticNet (low l1)")
    else:
        print("  No special constraints → Ridge (default), try Lasso if sparsity useful")

    print()
    print("  Lambda selection: ALWAYS use CV variants (RidgeCV/LassoCV/ElasticNetCV)")
    print("  Feature scaling:  ALWAYS standardize features before regularization")


# Example: genomics problem
regularization_checklist(
    n_samples=800, n_features=22_000,
    has_correlated_groups=True,   # genes in same pathway are correlated
    need_feature_selection=True,  # want to identify which genes matter
    need_closed_form=False
)

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Regularization Path demo on the EngineersOfAI Playground - no code required.

:::

The Real Interview Moment​

Why This Exists - The Overfitting Problem​

Historical Context​

Formal Bias-Variance Decomposition​

L2 Regularization (Ridge Regression)​

Bayesian Derivation: Gaussian Prior​

Ridge Closed-Form Solution​

SVD Analysis: Which Directions Are Shrunk?​

L1 Regularization (Lasso)​

Bayesian Derivation: Laplace Prior​

Why L1 Induces Sparsity: The Geometry​

The Subdifferential Condition and KKT Analysis​

Coordinate Descent for LASSO​

Derivation of the Soft-Threshold Update​

Full NumPy Coordinate Descent LASSO​

Full Sklearn Implementation - Ridge, Lasso, ElasticNet​

Regularization Paths and the LARS Connection​

ElasticNet: Combining L1 and L2​

Cross-Validation for Lambda Selection​

Bias-Variance Tradeoff Visualization​

Regularization in Deep Learning​

When to Use Each Regularizer​

Common Mistakes​

YouTube Resources​

Interview Q&A​

Stability Selection and Lasso as a Feature Screening Tool​

Regularization Paths: The LARS-LASSO Connection​

Practical Regularization Checklist​