Skip to main content

Regularization: L1, L2, and ElasticNet

:::note Reading time: ~45 minutes | Interview relevance: Very High | Target roles: ML Engineer, Data Scientist, Research Engineer, Applied Scientist :::

The Real Interview Moment

You are interviewing for an ML engineer role at a genomics startup. The interviewer shows you this: a dataset with 800 tumor samples and 22,000 gene expression features. "We tried ordinary linear regression. It took two hours and the test accuracy was 43%. What went wrong, and how do you fix it?"

The candidate who says "add regularization" gets a follow-up: "Which regularization, how much, and why? What does L1 do that L2 cannot?"

The candidate who can derive why L1 produces sparse solutions from the subdifferential calculus, explain why L2 is equivalent to a Gaussian prior in Bayesian statistics, show how Ridge handles multicollinearity through its SVD analysis, and implement coordinate descent for LASSO from scratch - that candidate gets the offer.

Regularization is not a hyperparameter to tune mindlessly. It is a principled statistical technique with deep mathematical foundations connecting optimization geometry, Bayesian inference, and information theory. The bias-variance tradeoff, the sparsity of the L1 ball, the grouping effect of ElasticNet - every one of these has a rigorous derivation. This lesson covers all of it.

The genomics problem you just saw is not exotic. Every high-dimensional problem - text classification, financial factor modeling, drug discovery, medical imaging - shares the same structure: too many features, not enough samples. Regularization is the universal answer. But which regularization, and how much, determines whether your model finds the true signal or fits noise.

Why This Exists - The Overfitting Problem

For an unregularized model with dd features and nn training samples, three regimes exist:

  • When dnd \ll n: a unique OLS solution exists and the model generalizes reasonably well.
  • When dnd \approx n: the model memorizes training data and generalizes poorly. The Gram matrix XX\mathbf{X}^\top\mathbf{X} is near-singular.
  • When d>nd > n: infinitely many solutions minimize training loss perfectly. Most generalize terribly. The Gram matrix is exactly singular - uninvertible.

The 22,000-gene dataset has 800 samples and 22,000 features. OLS literally has no unique solution. But this is not just a mathematical inconvenience - it reflects a statistical reality: there is not enough data to pin down 22,000 independent coefficients. The model is free to fit the noise in the training data, and it does.

Historical Context

The idea of penalized regression dates to Tikhonov (1943), who used 2\ell_2 penalties to regularize ill-posed inverse problems in physics. Ridge regression in the statistical sense was formalized by Hoerl and Kennard (1970) specifically to handle multicollinearity. LASSO (Least Absolute Shrinkage and Selection Operator) was introduced by Robert Tibshirani in 1996, motivated by the observation that Ridge shrinks coefficients but never zeroes them. The Least Angle Regression (LARS) algorithm of Efron et al. (2004) revealed the full LASSO regularization path and established its connection to forward stepwise selection. ElasticNet was proposed by Zou and Hastie (2005) to handle correlated features that LASSO handles poorly.

Formal Bias-Variance Decomposition

The fundamental theorem that motivates regularization: the expected squared prediction error at any point xx decomposes into three irreducible components.

E ⁣[(yf^(x))2]=(E[f^(x)]f(x))2Bias2[f^]+E ⁣[(f^(x)E[f^(x)])2]Var[f^]+σ2irreducible noise\mathbb{E}\!\left[(y - \hat{f}(x))^2\right] = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{Bias}^2[\hat{f}]} + \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{Var}[\hat{f}]} + \underbrace{\sigma^2}_{\text{irreducible noise}}

where:

  • Bias2[f^]\text{Bias}^2[\hat{f}]: the squared systematic error - how far the average prediction is from the true function f(x)f(x).
  • Var[f^]\text{Var}[\hat{f}]: the sensitivity of predictions to the specific training set used - different training sets produce different f^\hat{f} values.
  • σ2\sigma^2: the irreducible noise variance - present in any model no matter how good.

The regularization trade-off: OLS with dnd \approx n has low bias (the model is flexible enough to approximate ff well) but very high variance (different training sets of 800 samples give wildly different 22,000-dimensional coefficient vectors). Regularization adds a penalty that forces coefficients toward zero, introducing bias but dramatically reducing variance. The optimal λ\lambda sits at the sweet spot where total error =Bias2+Var= \text{Bias}^2 + \text{Var} is minimized.

As λ0\lambda \to 0: OLS - minimum bias, maximum variance. As λ\lambda \to \infty: predict the global mean - maximum bias, zero variance. Cross-validation finds the optimal λ\lambda between these extremes.

L2 Regularization (Ridge Regression)

L2 regularization adds the squared L2 norm of weights to the loss:

Jridge(w)=1nyXw2+λw22\mathcal{J}_\text{ridge}(\mathbf{w}) = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|^2_2

Bayesian Derivation: Gaussian Prior

Ridge regression is exactly MAP (Maximum A Posteriori) estimation with a Gaussian prior on the weights:

P(w)=N(0,τ2I)exp ⁣(w22τ2)P(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \tau^2\mathbf{I}) \propto \exp\!\left(-\frac{\|\mathbf{w}\|^2}{2\tau^2}\right)

Assuming Gaussian likelihood P(yX,w)=N(Xw,σ2I)P(\mathbf{y}|\mathbf{X},\mathbf{w}) = \mathcal{N}(\mathbf{X}\mathbf{w}, \sigma^2\mathbf{I}), the MAP objective is:

w^MAP=argmaxw  logP(yX,w)+logP(w)\hat{\mathbf{w}}_\text{MAP} = \arg\max_\mathbf{w}\; \log P(\mathbf{y}|\mathbf{X},\mathbf{w}) + \log P(\mathbf{w})

Substituting:

=argminw  12σ2yXw2+12τ2w2= \arg\min_\mathbf{w}\; \frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \frac{1}{2\tau^2}\|\mathbf{w}\|^2

Setting λ=σ2/τ2\lambda = \sigma^2/\tau^2 recovers exactly the Ridge objective. The regularization strength equals the ratio of noise variance to prior variance - when noise is large relative to the prior spread, λ\lambda is large, penalizing coefficients heavily. The Gaussian prior's prior precision 1/τ21/\tau^2 equals λ/σ2\lambda/\sigma^2 - stronger regularization means a tighter prior.

This Bayesian framing provides something the frequentist framing cannot: a principled way to set λ\lambda if you have prior knowledge about the scale of the true coefficients.

Ridge Closed-Form Solution

Setting the gradient to zero:

wJridge=2nX(Xwy)+2λw=0\nabla_\mathbf{w}\mathcal{J}_\text{ridge} = \frac{2}{n}\mathbf{X}^\top(\mathbf{X}\mathbf{w} - \mathbf{y}) + 2\lambda\mathbf{w} = \mathbf{0}

(XX+nλI)w=Xy(\mathbf{X}^\top\mathbf{X} + n\lambda\mathbf{I})\mathbf{w} = \mathbf{X}^\top\mathbf{y}

w^ridge=(XX+nλI)1Xy\boxed{\hat{\mathbf{w}}_\text{ridge} = (\mathbf{X}^\top\mathbf{X} + n\lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}}

Adding nλIn\lambda\mathbf{I} to XX\mathbf{X}^\top\mathbf{X} shifts all eigenvalues up by nλ>0n\lambda > 0, making the matrix always positive definite and invertible, regardless of whether d>nd > n. This directly fixes multicollinearity: near-zero eigenvalues of XX\mathbf{X}^\top\mathbf{X} (which cause numerical instability in OLS) become at least nλn\lambda under Ridge. The regularization is literally adding a small positive number to the diagonal of an ill-conditioned matrix to stabilize it.

SVD Analysis: Which Directions Are Shrunk?

The SVD of X=UΣV\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top reveals exactly what Ridge does to each direction. With singular values σ1σ2σd\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_d:

w^ridge=j=1dσj2σj2+nλshrinkage factorujyσjvj\hat{\mathbf{w}}_\text{ridge} = \sum_{j=1}^{d} \underbrace{\frac{\sigma_j^2}{\sigma_j^2 + n\lambda}}_{\text{shrinkage factor}} \cdot \frac{\mathbf{u}_j^\top\mathbf{y}}{\sigma_j} \cdot \mathbf{v}_j

Compare to OLS:

w^OLS=j=1dujyσjvj\hat{\mathbf{w}}_\text{OLS} = \sum_{j=1}^{d} \frac{\mathbf{u}_j^\top\mathbf{y}}{\sigma_j} \cdot \mathbf{v}_j

Ridge multiplies each component by the shrinkage factor σj2σj2+nλ\frac{\sigma_j^2}{\sigma_j^2 + n\lambda}:

  • Large σj\sigma_j (direction well-supported by data): factor 1\approx 1 - OLS estimate kept almost unchanged.
  • Small σj\sigma_j (direction with little data support): factor 0\approx 0 - coefficient shrunk toward zero.

This is the key insight: Ridge selectively penalizes directions where the data provides little information. These are exactly the directions where OLS is most unreliable. The regularization is adaptive - it is strongest where it is most needed.

For the gene expression problem: genes that are nearly collinear (correlated pathway members) have near-zero singular values in XX\mathbf{X}^\top\mathbf{X}. Ridge shrinks their coefficients toward zero, effectively pooling their contribution and preventing any one gene from getting an enormous coefficient just because it correlates with another.

L1 Regularization (Lasso)

L1 regularization adds the L1 norm of weights:

Jlasso(w)=1nyXw2+λw1=1nyXw2+λj=1dwj\mathcal{J}_\text{lasso}(\mathbf{w}) = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|_1 = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda\sum_{j=1}^d |w_j|

No closed-form solution. The L1 norm is not differentiable at wj=0w_j = 0. But this non-differentiability is precisely the source of sparsity.

Bayesian Derivation: Laplace Prior

LASSO is MAP estimation with a Laplace (double exponential) prior:

P(wj)=Laplace(0,b)=12bexp ⁣(wjb)P(w_j) = \text{Laplace}(0, b) = \frac{1}{2b}\exp\!\left(-\frac{|w_j|}{b}\right)

The log-prior is logP(w)=1bw1+const\log P(\mathbf{w}) = -\frac{1}{b}\|\mathbf{w}\|_1 + \text{const}. Setting λ=σ2/(nb)\lambda = \sigma^2/(nb) recovers the LASSO objective from the MAP criterion.

The Laplace distribution has two properties that explain LASSO's behavior:

  1. Heavier tails than Gaussian: the prior assigns non-negligible probability to large values. Informative features can have large coefficients - the prior does not fight them.
  2. Sharp peak at zero: the prior has a spike at wj=0w_j = 0, creating a strong pull toward zero for small coefficients. Uninformative features get zeroed out.

In contrast, the Gaussian prior has a gentle rounded peak at zero - it pulls coefficients toward zero but symmetrically and smoothly. Only the Laplace prior's sharp kink at zero creates the hard zeros that define sparsity.

Why L1 Induces Sparsity: The Geometry

The constrained-optimization view provides the clearest geometric explanation. The regularized problem is equivalent to:

minw  L(w)subject towpt(λ)\min_\mathbf{w}\; \mathcal{L}(\mathbf{w}) \quad \text{subject to}\quad \|\mathbf{w}\|_p \leq t(\lambda)

The OLS solution w\mathbf{w}^* lives somewhere in Rd\mathbb{R}^d. The MSE loss contours are ellipses (in 2D) centered at w\mathbf{w}^*. The constrained solution is where the smallest loss ellipse first touches the constraint set.

L2 constraint (sphere): L1 constraint (diamond):

w₂ w₂
| |
. . | . . . | .
. --+-- . . \ | / .
. / \ . . \|/ .
. | w* O |. . w* /|\ .
. \ / . . / | \ .
. . \ / . . . / | \ .
| / | \
-------+------- w₁ ------+------- w₁
| Corner at w₁=0!

Ellipse touches sphere at a Ellipse touches at a corner
generic angle. Rarely on axis. where w₂=0 exactly.
Ridge: never exact zeros. LASSO: exact zeros at corners.

L2 (sphere/circle): smooth everywhere, no preferred directions. The loss ellipse touches the constraint at a generic angle - almost never where wj=0w_j = 0 exactly. Ridge shrinks all coefficients but never exactly zeros them for finite λ\lambda.

L1 (diamond/cross-polytope): has corners exactly on the coordinate axes (where one or more wj=0w_j = 0). The loss ellipse, as we inflate it outward from w\mathbf{w}^*, almost always first touches one of these corners. L1 produces exact zeros - not approximately zero, but exactly zero.

The Subdifferential Condition and KKT Analysis

At a non-differentiable point (wj=0w_j = 0), we need the subdifferential of the L1 norm: wj=[1,1]\partial|w_j| = [-1, 1] when wj=0w_j = 0.

The KKT optimality condition for coordinate jj:

0wjJlasso=2nxjr+λwj0 \in \partial_{w_j}\mathcal{J}_\text{lasso} = -\frac{2}{n}\mathbf{x}_j^\top\mathbf{r} + \lambda \cdot \partial|w_j|

where r=yXw\mathbf{r} = \mathbf{y} - \mathbf{X}\mathbf{w} is the current residual. The partial correlation is ρj=1nxjr\rho_j = \frac{1}{n}\mathbf{x}_j^\top\mathbf{r}.

Feature selection condition: feature jj is set to zero (excluded from the model) if and only if:

ρj=1nxjrλ2|\rho_j| = \left|\frac{1}{n}\mathbf{x}_j^\top\mathbf{r}\right| \leq \frac{\lambda}{2}

In words: if the correlation between feature jj and the current residual is smaller than λ/2\lambda/2, the L1 penalty "wins" and drives wjw_j to exactly zero. Feature selection is automatic - and it has a precise mathematical criterion.

For the 22,000-gene problem: most genes have low correlation with the residual (after accounting for the informative genes). For those genes, ρj<λ/2|\rho_j| < \lambda/2 and they are automatically excluded. Only the truly predictive genes survive.

Coordinate Descent for LASSO

Since L1 is non-differentiable at wj=0w_j = 0, we cannot use standard gradient descent. Coordinate descent minimizes one coordinate at a time while holding all others fixed - and each coordinate update has a closed-form solution.

Derivation of the Soft-Threshold Update

For coordinate jj, the partial residual (contribution of all other features) is:

ri(j)=yikjwkxik=yi(y^iwjxij)r_i^{(j)} = y_i - \sum_{k \neq j} w_k x_{ik} = y_i - (\hat{y}_i - w_j x_{ij})

The objective as a function of wjw_j alone (with standardized features so 1nixij2=1\frac{1}{n}\sum_i x_{ij}^2 = 1):

J(wj)=1ni=1n(ri(j)wjxij)2+λwj+const\mathcal{J}(w_j) = \frac{1}{n}\sum_{i=1}^n (r_i^{(j)} - w_j x_{ij})^2 + \lambda|w_j| + \text{const}

Expanding: =1nr(j)2const2nxjr(j)=2ρjwj+wj2+λwj= \underbrace{\frac{1}{n}\|r^{(j)}\|^2}_{\text{const}} - \underbrace{\frac{2}{n}\mathbf{x}_j^\top r^{(j)}}_{= 2\rho_j} \cdot w_j + w_j^2 + \lambda|w_j|

The unconstrained minimizer in wjw_j is ρj\rho_j. With the L1 penalty, the optimal value is the soft-thresholding operator:

w^j=S(ρj,λ/2)=sign(ρj)max(ρjλ/2,  0)\hat{w}_j = \mathcal{S}(\rho_j, \lambda/2) = \text{sign}(\rho_j)\max(|\rho_j| - \lambda/2,\; 0)

This can be written explicitly:

w^j={ρjλ/2if ρj>λ/20if ρjλ/2ρj+λ/2if ρj<λ/2\hat{w}_j = \begin{cases} \rho_j - \lambda/2 & \text{if } \rho_j > \lambda/2 \\ 0 & \text{if } |\rho_j| \leq \lambda/2 \\ \rho_j + \lambda/2 & \text{if } \rho_j < -\lambda/2 \end{cases}

The name "soft-thresholding" describes the shape: values with ρjλ/2|\rho_j| \leq \lambda/2 are set to exactly zero (hard threshold), while values outside are shifted toward zero by λ/2\lambda/2 (soft shrinkage). This is in contrast to hard thresholding which would jump discontinuously.

Why exact zeros: when ρjλ/2|\rho_j| \leq \lambda/2, moving wjw_j away from zero in any direction increases the L1 penalty faster than it decreases the data fit. The L1 penalty at zero has a "flat spot" in the subdifferential sense - zero is optimal. With L2, the gradient is proportional to wjw_j, so it shrinks but never reaches zero.

Full NumPy Coordinate Descent LASSO

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def soft_threshold(rho: float, lam: float) -> float:
"""
Soft-thresholding operator S(ρ, λ).
Core of LASSO coordinate descent.
Sets |ρ| ≤ λ to exactly zero (L1 sparsity origin).
"""
if rho > lam:
return rho - lam
elif rho < -lam:
return rho + lam
else:
return 0.0


class LassoCoordinateDescent:
"""
LASSO via coordinate descent with soft-thresholding.
Assumes standardized features (unit variance per feature).
This is the algorithm underlying sklearn's Lasso.

Each iteration: for j in 1..d:
rho_j = (1/n) * x_j^T * (y - X * w + w_j * x_j)
w_j = S(rho_j, alpha/2)

Converges to the global LASSO solution (convex problem).
"""

def __init__(self, alpha: float = 0.1, max_iter: int = 2000,
tol: float = 1e-6):
self.alpha = alpha
self.max_iter = max_iter
self.tol = tol
self.coef_: np.ndarray | None = None
self.intercept_: float = 0.0
self.n_iter_: int = 0
self.loss_path_: list[float] = []

def fit(self, X: np.ndarray, y: np.ndarray) -> "LassoCoordinateDescent":
n, d = X.shape

# Center the target; use mean as intercept
self.intercept_ = np.mean(y)
y_c = y - self.intercept_

# Initialize at zero (warm start)
self.coef_ = np.zeros(d)

for iteration in range(self.max_iter):
coef_prev = self.coef_.copy()

for j in range(d):
# Partial residual: y - sum_{k≠j} w_k x_k
# Efficient: full residual + add back the j-th term
residual_j = y_c - X @ self.coef_ + X[:, j] * self.coef_[j]

# Partial correlation (standardized features: ||x_j||²/n = 1)
rho_j = (1.0 / n) * (X[:, j] @ residual_j)

# Soft-threshold update
self.coef_[j] = soft_threshold(rho_j, self.alpha / 2.0)

# Track loss for convergence visualization
y_hat = X @ self.coef_ + self.intercept_
mse = np.mean((y - y_hat) ** 2)
l1_pen = self.alpha * np.sum(np.abs(self.coef_))
self.loss_path_.append(mse + l1_pen)

# Convergence: max absolute change in coefficients
max_change = np.max(np.abs(self.coef_ - coef_prev))
if max_change < self.tol:
self.n_iter_ = iteration + 1
break
else:
self.n_iter_ = self.max_iter

return self

def predict(self, X: np.ndarray) -> np.ndarray:
return X @ self.coef_ + self.intercept_

def score(self, X: np.ndarray, y: np.ndarray) -> float:
y_pred = self.predict(X)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
return float(1.0 - ss_res / ss_tot)

@property
def n_nonzero_(self) -> int:
return int(np.sum(np.abs(self.coef_) > 1e-8))


# ── Demonstration on high-dimensional sparse problem ──────────────────────────
np.random.seed(42)
X, y, true_coef = make_regression(
n_samples=500, n_features=200, n_informative=20,
noise=10.0, coef=True, random_state=42
)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_tr_sc = scaler.fit_transform(X_tr)
X_te_sc = scaler.transform(X_te)

# Fit our from-scratch LASSO
lasso_scratch = LassoCoordinateDescent(alpha=0.1, max_iter=2000, tol=1e-7)
lasso_scratch.fit(X_tr_sc, y_tr)

print("LASSO (coordinate descent from scratch):")
print(f" Non-zero coefficients : {lasso_scratch.n_nonzero_}/200")
print(f" Test R² : {lasso_scratch.score(X_te_sc, y_te):.4f}")
print(f" Iterations to converge: {lasso_scratch.n_iter_}")

# Verify: do we recover the true sparse support?
true_support = set(np.where(np.abs(true_coef) > 1e-6)[0])
found_support = set(np.where(np.abs(lasso_scratch.coef_) > 1e-6)[0])
precision = len(true_support & found_support) / max(len(found_support), 1)
recall = len(true_support & found_support) / max(len(true_support), 1)
print(f" Support precision : {precision:.3f}")
print(f" Support recall : {recall:.3f}")

Full Sklearn Implementation - Ridge, Lasso, ElasticNet

from sklearn.linear_model import (
Ridge, Lasso, ElasticNet, LinearRegression,
RidgeCV, LassoCV, ElasticNetCV, lasso_path
)


def compare_regularizers(X_tr, y_tr, X_te, y_te) -> None:
"""Compare all regularizers on the same high-dimensional problem."""

configs = [
("OLS (no regularization)", LinearRegression()),
("Ridge (α=1.0)", Ridge(alpha=1.0)),
("Ridge (α=10.0)", Ridge(alpha=10.0)),
("Lasso (α=0.05)", Lasso(alpha=0.05, max_iter=10_000)),
("Lasso (α=0.5)", Lasso(alpha=0.5, max_iter=10_000)),
("ElasticNet (α=0.1, l1=0.5)", ElasticNet(alpha=0.1, l1_ratio=0.5,
max_iter=10_000)),
("ElasticNet (α=0.1, l1=0.9)", ElasticNet(alpha=0.1, l1_ratio=0.9,
max_iter=10_000)),
]

print(f"{'Model':40s} {'Train R²':>9} {'Test R²':>8} {'Non-zero':>9}")
print("-" * 75)

for name, model in configs:
model.fit(X_tr, y_tr)
tr_r2 = model.score(X_tr, y_tr)
te_r2 = model.score(X_te, y_te)
coef = getattr(model, "coef_", None)
nnz = int(np.sum(np.abs(coef) > 1e-8)) if coef is not None else "-"
print(f"{name:40s} {tr_r2:9.4f} {te_r2:8.4f} {nnz:>9}")


compare_regularizers(X_tr_sc, y_tr, X_te_sc, y_te)

Regularization Paths and the LARS Connection

The regularization path shows how coefficients evolve as λ\lambda decreases from \infty (all-zero model) to 00 (OLS model). The path reveals which features enter the model first - those most correlated with the target.

# ── Lasso regularization path ─────────────────────────────────────────────────
alphas_path, coefs_path, _ = lasso_path(
X_tr_sc, y_tr,
alphas=np.logspace(-3, 1, 150)[::-1], # from large to small
max_iter=20_000
)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Left: Lasso path (features enter the model one at a time)
ax = axes[0]
# Highlight top-20 features by maximum absolute coefficient
peak_importance = np.max(np.abs(coefs_path), axis=1)
top20 = np.argsort(peak_importance)[::-1][:20]
for idx in top20:
ax.semilogx(alphas_path, coefs_path[idx], linewidth=1.5, alpha=0.8)
ax.axvline(x=0.05, color="crimson", linestyle="--", linewidth=2,
label="Selected α=0.05")
ax.set_xlabel("Regularization α (log scale → decreasing)")
ax.set_ylabel("Coefficient value")
ax.set_title("Lasso Regularization Path\n"
"Features enter model as α decreases; many stay exactly zero")
ax.legend()
ax.grid(True, alpha=0.25)

# Right: Number of non-zero features vs alpha
ax = axes[1]
n_nonzero_vs_alpha = [np.sum(np.abs(c) > 1e-8) for c in coefs_path.T]
ax.semilogx(alphas_path, n_nonzero_vs_alpha, "b-", linewidth=2)
ax.axvline(x=0.05, color="crimson", linestyle="--", linewidth=2)
ax.set_xlabel("Regularization α (log scale)")
ax.set_ylabel("Number of non-zero features")
ax.set_title("LASSO Feature Selection vs Regularization Strength\n"
"Increasing α forces more coefficients to zero")
ax.grid(True, alpha=0.25)

plt.tight_layout()
plt.savefig("lasso_path.png", dpi=150, bbox_inches="tight")
plt.show()

# ── Ridge path (no features reach exactly zero) ───────────────────────────────
alphas_ridge = np.logspace(-3, 5, 100)
coefs_ridge = np.array([
Ridge(alpha=a).fit(X_tr_sc, y_tr).coef_
for a in alphas_ridge
])

plt.figure(figsize=(8, 5))
for idx in top20:
plt.semilogx(alphas_ridge, coefs_ridge[:, idx], linewidth=1.2, alpha=0.7)
plt.xlabel("Regularization α (log scale)")
plt.ylabel("Coefficient value")
plt.title("Ridge Regularization Path\n"
"All features shrink toward 0 but none are exactly 0")
plt.grid(True, alpha=0.25)
plt.tight_layout()
plt.savefig("ridge_path.png", dpi=150, bbox_inches="tight")
plt.show()

The LARS algorithm (Least Angle Regression, Efron et al. 2004) traces the LASSO path exactly and efficiently. LARS starts with all coefficients at zero and at each step adds the feature most correlated with the current residual, moving its coefficient in the direction of the correlation until another feature becomes equally correlated. At that bifurcation point, both features move together. This reveals the LASSO path as a sequence of piecewise-linear segments - and LARS computes the entire path in the same computational cost as a single OLS fit.

ElasticNet: Combining L1 and L2

ElasticNet (Zou and Hastie, 2005) adds both penalties:

Jenet(w)=1nyXw2+αl1_ratiow1+α(1l1_ratio)2w22\mathcal{J}_\text{enet}(\mathbf{w}) = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \alpha \cdot \text{l1\_ratio} \cdot \|\mathbf{w}\|_1 + \frac{\alpha(1-\text{l1\_ratio})}{2}\|\mathbf{w}\|_2^2

The l1_ratio parameter (also written as ρ\rho or the mixing parameter) controls the balance:

  • l1_ratio = 0: pure Ridge (no sparsity)
  • l1_ratio = 1: pure Lasso (maximum sparsity)
  • l1_ratio = 0.5: equal mix (sklearn default)

In the general form: λ[(1α)w22+αw1]\lambda[(1-\alpha)\|\mathbf{w}\|_2^2 + \alpha\|\mathbf{w}\|_1] where α\alpha is the mixing parameter.

The grouping effect: LASSO's fundamental limitation is that when features are highly correlated (e.g., genes in the same biological pathway, TF-IDF features for related words), LASSO arbitrarily selects one representative from each correlated group and zeros out the rest. The selected representative can change dramatically between training runs on similar data - LASSO is unstable under correlated features.

The L2 component of ElasticNet encourages correlated features to have similar coefficients - the "grouping effect." When features are nearly identical (xjxk\mathbf{x}_j \approx \mathbf{x}_k), their ElasticNet coefficients are nearly identical, while LASSO would assign all weight to one arbitrarily.

# ── Grouping effect: Lasso vs ElasticNet on correlated features ───────────────
np.random.seed(42)
n_corr, d_corr = 400, 60

# Create strongly correlated group: features 0–4 are near-duplicates of feature 0
X_corr = np.random.randn(n_corr, d_corr)
X_corr[:, 1] = X_corr[:, 0] + 0.05 * np.random.randn(n_corr)
X_corr[:, 2] = X_corr[:, 0] + 0.05 * np.random.randn(n_corr)
X_corr[:, 3] = X_corr[:, 0] + 0.08 * np.random.randn(n_corr)
X_corr[:, 4] = X_corr[:, 0] + 0.08 * np.random.randn(n_corr)

# True model: ALL five correlated features contribute equally
y_corr = 2.0 * (X_corr[:, 0] + X_corr[:, 1] + X_corr[:, 2] +
X_corr[:, 3] + X_corr[:, 4]) / 5.0
y_corr += X_corr[:, 20] * 1.5 # one uncorrelated feature
y_corr += np.random.randn(n_corr) * 0.3

sc_corr = StandardScaler()
X_corr_sc = sc_corr.fit_transform(X_corr)

lasso_corr = Lasso(alpha=0.05, max_iter=10_000).fit(X_corr_sc, y_corr)
enet_corr = ElasticNet(alpha=0.05, l1_ratio=0.5, max_iter=10_000).fit(
X_corr_sc, y_corr)

print("\nGrouping effect - correlated features 0–4:")
print(f" Lasso coef[0:5] = {np.round(lasso_corr.coef_[:5], 3)}")
print(f" ElasticNet coef[0:5] = {np.round(enet_corr.coef_[:5], 3)}")
print(" Lasso picks 1 arbitrarily; ElasticNet distributes weight across all 5")

# Stability: run Lasso on 10 subsamples - does it pick the same feature?
print("\nLasso instability on correlated features (10 subsamples):")
selected_feature = []
for seed in range(10):
rng = np.random.default_rng(seed)
idx = rng.choice(n_corr, size=300, replace=False)
X_sub = sc_corr.fit_transform(X_corr[idx])
y_sub = y_corr[idx]
lasso_sub = Lasso(alpha=0.05, max_iter=10_000).fit(X_sub, y_sub)
# Which of features 0-4 was selected?
selected = np.argmax(np.abs(lasso_sub.coef_[:5]))
selected_feature.append(selected)
from collections import Counter
print(f" Feature selected from correlated group: {Counter(selected_feature)}")
print(" Different subsamples → different arbitrary choices (instability)")

Cross-Validation for Lambda Selection

The optimal λ\lambda must be chosen via cross-validation. Never set it manually - the optimal value depends entirely on the scale of your features, the noise level, and the sparsity of the true model.

# ── RidgeCV: built-in efficient generalized CV ────────────────────────────────
ridge_cv = RidgeCV(
alphas=np.logspace(-3, 5, 200),
cv=5,
scoring="r2"
)
ridge_cv.fit(X_tr_sc, y_tr)
print(f"\nRidgeCV:")
print(f" Optimal α : {ridge_cv.alpha_:.5f}")
print(f" Test R² : {ridge_cv.score(X_te_sc, y_te):.4f}")

# ── LassoCV: computes entire regularization path efficiently ──────────────────
lasso_cv = LassoCV(
n_alphas=200,
cv=5,
max_iter=20_000,
random_state=42
)
lasso_cv.fit(X_tr_sc, y_tr)
print(f"\nLassoCV:")
print(f" Optimal α : {lasso_cv.alpha_:.5f}")
print(f" Test R² : {lasso_cv.score(X_te_sc, y_te):.4f}")
print(f" Features : {np.sum(np.abs(lasso_cv.coef_) > 1e-8)}/200")

# ── ElasticNetCV: searches over both α and l1_ratio jointly ──────────────────
enet_cv = ElasticNetCV(
l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 1.0],
n_alphas=100,
cv=5,
max_iter=20_000,
random_state=42,
n_jobs=-1
)
enet_cv.fit(X_tr_sc, y_tr)
print(f"\nElasticNetCV:")
print(f" Optimal α : {enet_cv.alpha_:.5f}")
print(f" Optimal l1 : {enet_cv.l1_ratio_:.2f}")
print(f" Test R² : {enet_cv.score(X_te_sc, y_te):.4f}")
print(f" Features : {np.sum(np.abs(enet_cv.coef_) > 1e-8)}/200")

# ── Visualize CV error curve ──────────────────────────────────────────────────
plt.figure(figsize=(9, 5))
mean_mse = lasso_cv.mse_path_.mean(axis=1)
std_mse = lasso_cv.mse_path_.std(axis=1) / np.sqrt(5)

plt.semilogx(lasso_cv.alphas_, mean_mse, "b-", linewidth=2, label="Mean CV MSE")
plt.fill_between(lasso_cv.alphas_,
mean_mse - std_mse,
mean_mse + std_mse,
alpha=0.2, color="blue", label="±1 SE")
plt.axvline(lasso_cv.alpha_, color="crimson", linestyle="--", linewidth=2,
label=f"Optimal α={lasso_cv.alpha_:.4f}")
plt.xlabel("Regularization α (log scale)")
plt.ylabel("Cross-Validated MSE")
plt.title("LassoCV Error Curve\n"
"Minimum identifies the best bias-variance operating point")
plt.legend()
plt.grid(True, alpha=0.25)
plt.tight_layout()
plt.savefig("lasso_cv_curve.png", dpi=150, bbox_inches="tight")
plt.show()

Bias-Variance Tradeoff Visualization

# ── How bias and variance change as α increases ────────────────────────────────
from sklearn.model_selection import cross_validate

alphas_bv = np.logspace(-4, 5, 60)
train_errs, val_errs = [], []

for alpha in alphas_bv:
cv_res = cross_validate(
Ridge(alpha=alpha), X_tr_sc, y_tr,
cv=5, scoring="neg_mean_squared_error",
return_train_score=True
)
train_errs.append(-cv_res["train_score"].mean())
val_errs.append(-cv_res["test_score"].mean())

opt_idx = int(np.argmin(val_errs))
opt_alpha = alphas_bv[opt_idx]

plt.figure(figsize=(10, 5))
plt.semilogx(alphas_bv, train_errs, "b-", linewidth=2,
label="Train MSE (proxy for bias²)")
plt.semilogx(alphas_bv, val_errs, "r-", linewidth=2,
label="Val MSE = bias² + variance")
plt.axvline(opt_alpha, color="green", linestyle="--", linewidth=2,
label=f"Optimal α={opt_alpha:.2f}")

plt.annotate("High variance\n(underfitting: λ too small)",
xy=(alphas_bv[3], val_errs[3]),
fontsize=9, ha="left", color="darkred")
plt.annotate("High bias\n(underfitting: λ too large)",
xy=(alphas_bv[-5], val_errs[-5]),
fontsize=9, ha="right", color="darkblue")

plt.xlabel("Regularization α (log scale)")
plt.ylabel("MSE")
plt.title("Bias-Variance Tradeoff Under Ridge Regularization\n"
"Optimal α balances the two sources of error")
plt.legend()
plt.grid(True, alpha=0.25)
plt.tight_layout()
plt.savefig("bias_variance_ridge.png", dpi=150, bbox_inches="tight")
plt.show()

Regularization in Deep Learning

Understanding L1/L2 regularization in linear models pays dividends in deep learning.

Weight decay = L2 regularization: in deep learning frameworks, the optimizer's weight_decay parameter adds λw22\lambda\|\mathbf{w}\|_2^2 to the loss - this is identical to Ridge regularization. The gradient update becomes:

wt+1=wtη(L+2λwt)=(12ηλ)wtηL\mathbf{w}_{t+1} = \mathbf{w}_t - \eta(\nabla\mathcal{L} + 2\lambda\mathbf{w}_t) = (1 - 2\eta\lambda)\mathbf{w}_t - \eta\nabla\mathcal{L}

The term (12ηλ)(1 - 2\eta\lambda) multiplies weights by a factor slightly less than 1 each step - hence the name "weight decay."

L2 vs Dropout: both reduce overfitting by different mechanisms. L2 penalizes large weights, encouraging the model to use all features slightly rather than a few features heavily. Dropout randomly zeroes activations during training, forcing the network to learn redundant representations. For neural networks, dropout is generally preferred because the model is usually not capacity-limited in the same way as linear models - adding noise to activations is more effective than penalizing weights directly.

import torch
import torch.nn as nn

# Weight decay = L2 regularization in PyTorch
model = nn.Linear(100, 1)
optimizer = torch.optim.Adam(model.parameters(),
lr=1e-3,
weight_decay=1e-4) # equivalent to λ = 1e-4 L2

# Equivalent manual implementation (verifying the connection)
def manual_l2_loss(model, inputs, targets, lam=1e-4):
outputs = model(inputs)
mse = nn.functional.mse_loss(outputs, targets)
# L2 penalty on all weight parameters (not biases)
l2_penalty = sum(p.pow(2).sum() for name, p in model.named_parameters()
if "weight" in name)
return mse + lam * l2_penalty

print("Weight decay (1e-4) is equivalent to L2 penalty λ=1e-4")
print("PyTorch AdamW decouples weight decay from the adaptive learning rate")
print("This is important: Adam + weight_decay ≠ Adam + L2 penalty due to")
print("the interaction between the gradient scaling and the decay term.")

When to Use Each Regularizer

SituationRecommended
All features likely relevant, multicollinearity presentRidge
Feature selection needed, dnd \gg n, sparse true modelLasso
Correlated features that should be selected togetherElasticNet
Genomics, text, finance with correlated feature groupsElasticNet
Need closed-form analytical solutionRidge only
Always: lambda selectionRidgeCV / LassoCV / ElasticNetCV

Common Mistakes

:::danger Not standardizing features before regularization

# WRONG - raw features on different scales make regularization unfair
lasso = Lasso(alpha=0.1).fit(X_raw, y)

# CORRECT - standardize first so every feature is penalized equally
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_sc = sc.fit_transform(X_train)
lasso = Lasso(alpha=0.1).fit(X_sc, y)

The L1 and L2 penalties measure the weight magnitude in the units of w\mathbf{w}. A feature with values in [0,1000][0, 1000] naturally has smaller weights than a feature in [0,1][0, 1], so it gets penalized less - even though it may be no more informative. Standardizing features before applying any regularization is mandatory. Apply the same StandardScaler.transform() (never fit_transform) to the test set. :::

:::danger Choosing lambda by visual inspection of training loss The regularization strength λ\lambda controls bias vs variance. Training loss always decreases as λ\lambda decreases - so you cannot use training loss to choose λ\lambda. Always use the CV variants: LassoCV, RidgeCV, ElasticNetCV. They compute the validation error efficiently over the entire regularization path. :::

:::warning Interpreting Lasso zeros as "irrelevant features" LASSO sets feature jj to zero when xjr/nλ/2|{\mathbf{x}_j^\top\mathbf{r}}|/n \leq \lambda/2 - when its partial correlation with the residual falls below the threshold. With correlated features, LASSO may zero out a feature not because it is uninformative but because another correlated feature has already captured its information. The zeroed feature may have near-identical predictive power to the selected one. Use ElasticNet when correlated features should be selected together, or verify zeroed features with univariate tests. :::

:::warning Ridge does not perform feature selection Ridge shrinks all coefficients toward zero but keeps all features active. For a model with 22,000 genes, Ridge returns 22,000 non-zero coefficients. If your goal is identifying which genes are predictive (a very common goal in genomics), Ridge is wrong. Use Lasso or ElasticNet. :::

YouTube Resources

ResourceChannelWhy Watch
Regularization - Clearly ExplainedStatQuest with Josh StarmerBest visual walkthrough of Ridge, Lasso, and the bias-variance tradeoff - start here
Ridge vs Lasso RegressionStatQuest with Josh StarmerDetailed comparison with the geometric sparsity argument explained visually
MIT 6.S897: LASSO and Sparse RecoveryMIT OpenCourseWareRigorous treatment of LASSO, soft thresholding, and LARS - for deeper theory
Bayesian Interpretation of RegularizationProbabilistic ML (Tübingen)Gaussian prior → Ridge; Laplace prior → LASSO; complete Bayesian derivation
ElasticNet and Correlated FeaturesStatQuestWhen to use ElasticNet vs LASSO - the grouping effect explained clearly

Interview Q&A

Q1: Why does L1 regularization produce sparse models but L2 does not?

There are two equivalent explanations - geometric and algebraic.

Geometrically: the regularized problem is equivalent to minimizing the loss subject to a constraint on the norm wpt\|\mathbf{w}\|_p \leq t. The MSE loss has ellipsoidal contours centered at the OLS solution. The constrained solution is where the smallest ellipse first touches the constraint set. The L1 constraint set (cross-polytope/diamond in 2D) has corners exactly on the coordinate axes. Loss ellipses almost always first touch at a corner where one or more wj=0w_j = 0 exactly. The L2 constraint set (sphere/circle) has no corners - the touching point is generically not on an axis. Ridge shrinks all coefficients but never zeros them.

Algebraically: the L1 penalty has a subdifferential at wj=0w_j = 0 of [λ,λ][-\lambda, \lambda]. The KKT optimality condition requires that the data gradient L/wjλ/2|\partial\mathcal{L}/\partial w_j| \leq \lambda/2 for wj=0w_j = 0 to be optimal. This is achievable for many features. The L2 gradient is 2λwj2\lambda w_j - it is zero only when wj=0w_j = 0 itself, but the gradient from the data pushes wjw_j away from zero no matter how small - so only the limit λ\lambda \to \infty gives exact zeros.

Q2: Derive the Ridge regression closed-form solution and explain how it fixes multicollinearity.

Setting the gradient of Jridge=1nyXw2+λw2\mathcal{J}_\text{ridge} = \frac{1}{n}\|\mathbf{y}-\mathbf{X}\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|^2 to zero: 2nX(Xwy)+2λw=0\frac{2}{n}\mathbf{X}^\top(\mathbf{X}\mathbf{w}-\mathbf{y}) + 2\lambda\mathbf{w} = \mathbf{0}. Rearranging: (XX+nλI)w=Xy(\mathbf{X}^\top\mathbf{X} + n\lambda\mathbf{I})\mathbf{w} = \mathbf{X}^\top\mathbf{y}, giving w^ridge=(XX+nλI)1Xy\hat{\mathbf{w}}_\text{ridge} = (\mathbf{X}^\top\mathbf{X} + n\lambda\mathbf{I})^{-1}\mathbf{X}^\top\mathbf{y}.

Multicollinearity causes near-zero eigenvalues of XX\mathbf{X}^\top\mathbf{X}. If the smallest eigenvalue is \mu_\min \approx 0, inverting XX\mathbf{X}^\top\mathbf{X} amplifies noise by 1/\mu_\min \to \infty - extremely unstable. Adding nλIn\lambda\mathbf{I} shifts all eigenvalues up by nλn\lambda, making the smallest eigenvalue at least nλ>0n\lambda > 0. The condition number decreases from \mu_\max/\mu_\min to (\mu_\max + n\lambda)/(n\lambda). Ridge literally adds numerical stability to an ill-conditioned system.

Q3: What is the Bayesian interpretation of L1 and L2 regularization?

Both are MAP (Maximum A Posteriori) estimation: w^MAP=argmaxwlogP(yX,w)+logP(w)\hat{\mathbf{w}}_\text{MAP} = \arg\max_\mathbf{w} \log P(\mathbf{y}|\mathbf{X},\mathbf{w}) + \log P(\mathbf{w}).

L2 (Ridge): Gaussian prior P(wj)=N(0,τ2)P(w_j) = \mathcal{N}(0, \tau^2). Log-prior: 12τ2w2+const-\frac{1}{2\tau^2}\|\mathbf{w}\|^2 + \text{const}. This adds λw2-\lambda\|\mathbf{w}\|^2 to the log-likelihood with λ=σ2/τ2\lambda = \sigma^2/\tau^2 - the Ridge objective. A Gaussian prior has a smooth, rounded peak at zero - coefficients are pulled toward zero but never exactly zeroed.

L1 (LASSO): Laplace prior P(wj)=12bexp(wj/b)P(w_j) = \frac{1}{2b}\exp(-|w_j|/b). Log-prior: 1bw1+const-\frac{1}{b}\|\mathbf{w}\|_1 + \text{const}. This adds λw1-\lambda\|\mathbf{w}\|_1 to the log-likelihood - the LASSO objective. The Laplace distribution has a sharp, pointed peak at zero (non-differentiable, like the L1 norm itself) - coefficients are strongly pulled to exactly zero unless the data evidence is strong.

Q4: Explain the coordinate descent update for LASSO. Why does it produce exact zeros algebraically?

For coordinate jj with all others fixed, the partial residual is ri(j)=yikjwkxikr_i^{(j)} = y_i - \sum_{k\neq j} w_k x_{ik}. The objective reduces to J(wj)=1ni(ri(j)wjxij)2+λwj\mathcal{J}(w_j) = \frac{1}{n}\sum_i(r_i^{(j)} - w_j x_{ij})^2 + \lambda|w_j|.

With standardized features (1nixij2=1\frac{1}{n}\sum_i x_{ij}^2 = 1), the unconstrained minimum is ρj=1nxjr(j)\rho_j = \frac{1}{n}\mathbf{x}_j^\top\mathbf{r}^{(j)}. Adding the L1 penalty at zero creates a non-differentiable kink. The subdifferential optimality condition gives: w^j=sign(ρj)max(ρjλ/2,0)\hat{w}_j = \text{sign}(\rho_j)\max(|\rho_j| - \lambda/2, 0).

Exact zeros arise because: when ρjλ/2|\rho_j| \leq \lambda/2, the subdifferential at wj=0w_j = 0 contains the data gradient 2ρj-2\rho_j. So wj=0w_j = 0 satisfies the optimality condition - zero is a valid stationary point. With L2, the gradient at wj=0w_j = 0 from the penalty is 2λ0=02\lambda \cdot 0 = 0 and the data gradient 2ρj0-2\rho_j \neq 0 immediately pulls the coefficient away from zero.

Q5: How do you choose between Ridge, Lasso, and ElasticNet in practice?

Decision framework: (1) Do you need an interpretable sparse model or automatic feature selection? If yes, Lasso or ElasticNet; if no (all features expected to contribute), Ridge. (2) Are features in correlated groups (pathway genes, topic-related TF-IDF terms, financial factor exposures)? If yes, ElasticNet - Lasso picks one arbitrarily and its selection is unstable across subsamples. (3) Is d>nd > n? Use Lasso or ElasticNet - Ridge still works but does not reduce model complexity. (4) Do you need a closed-form solution or analytical tractability? Ridge only. (5) Always use the CV variants - LassoCV, RidgeCV, ElasticNetCV - to select λ\lambda. Never set it manually.

Q6: How does regularization interact with the bias-variance tradeoff, and when does increasing regularization actually improve test performance?

The expected test error decomposes as Bias2+Variance+σ2\text{Bias}^2 + \text{Variance} + \sigma^2. Regularization increases bias (constraints prevent the model from fitting the true function perfectly) and decreases variance (the model is less sensitive to the specific training set). The question is which effect dominates.

When dnd \approx n or d>nd > n: OLS variance is enormous - the model fits noise aggressively. Even a large increase in bias (e.g., heavily shrinking toward zero) can result in a net reduction of test error because the variance reduction dominates. In the 22,000-gene example with 800 samples, OLS has variance proportional to d/n=27.5×d/n = 27.5 \times the signal - almost all of the model's variation is noise. Strong regularization (large λ\lambda) dramatically reduces variance at the cost of modest bias, and total error plummets.

When ndn \gg d: OLS already has low variance. Regularization mainly introduces bias without meaningfully reducing variance. Here, only very small λ\lambda (strong regularization only if specifically needed) improves test performance. The optimal λ\lambda found by CV will be very small in this case - consistent with the mathematical analysis.

Stability Selection and Lasso as a Feature Screening Tool

A key limitation of LASSO for feature selection is its instability: the selected support can change substantially between slightly different training sets. Stability Selection (Meinshausen and Bühlmann, 2010) addresses this by combining LASSO with bootstrap resampling.

The algorithm: run LASSO on many (e.g., 100) bootstrap subsamples of the training data, at each of many regularization values. Record how often each feature is selected (its selection probability). Features with high selection probability across bootstrap runs are stably important; features selected erratically are likely false positives.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression

np.random.seed(42)
X, y, true_coef = make_regression(
n_samples=300, n_features=100, n_informative=15,
noise=15.0, coef=True, random_state=42
)
true_support = set(np.where(np.abs(true_coef) > 1e-6)[0])

sc = StandardScaler()
X_sc = sc.fit_transform(X)

# ── Stability Selection ────────────────────────────────────────────────────────
def stability_selection(X, y, alpha=0.1, n_bootstrap=100, subsample_frac=0.5):
"""
Estimate selection probability for each feature under LASSO.
Features with high selection probability are stably informative.
"""
n, d = X.shape
n_sub = int(n * subsample_frac)
selection_counts = np.zeros(d)

for seed in range(n_bootstrap):
rng = np.random.default_rng(seed)
idx = rng.choice(n, size=n_sub, replace=False)
X_sub = X[idx]
y_sub = y[idx]

lasso = Lasso(alpha=alpha, max_iter=5000)
lasso.fit(X_sub, y_sub)
selection_counts += (np.abs(lasso.coef_) > 1e-8).astype(float)

return selection_counts / n_bootstrap # selection probabilities

sel_probs = stability_selection(X_sc, y, alpha=0.1, n_bootstrap=200)

# Features with selection probability > 0.6 are stably selected
threshold = 0.6
stable_features = set(np.where(sel_probs > threshold)[0])
precision = len(true_support & stable_features) / max(len(stable_features), 1)
recall = len(true_support & stable_features) / max(len(true_support), 1)

print("Stability Selection Results:")
print(f" Selected {len(stable_features)} features (threshold={threshold})")
print(f" Precision: {precision:.3f}")
print(f" Recall: {recall:.3f}")
print(f" (Single-run LASSO precision/recall typically worse due to instability)")

# Show selection probabilities for top features
top_features = np.argsort(sel_probs)[::-1][:20]
print("\nTop 20 features by selection probability:")
for idx in top_features:
is_true = "TRUE" if idx in true_support else " "
bar = "#" * int(sel_probs[idx] * 30)
print(f" Feature {idx:3d} [{is_true}] {sel_probs[idx]:.2f} {bar}")

Regularization Paths: The LARS-LASSO Connection

The Least Angle Regression (LARS) algorithm (Efron et al., 2004) computes the entire LASSO regularization path in O(d3+nd2)O(d^3 + nd^2) time - the same cost as a single OLS fit, regardless of the path length.

The LARS algorithm builds the path step by step:

  1. Start with all coefficients at zero (λ=\lambda = \infty).
  2. Identify the feature most correlated with the current residual and begin moving its coefficient in the direction of the correlation.
  3. Continue until another feature becomes equally correlated with the residual.
  4. At that point, add the new feature and move all currently active features jointly.
  5. Repeat until all features are active or λ=0\lambda = 0.

The LASSO path is recovered from LARS by adding a modification: when a coefficient crosses zero during the joint movement, set it to zero and remove it from the active set. This "drop step" is what makes the LARS path equal to the LASSO path.

from sklearn.linear_model import lars_path, LassoLars

# Compute the LASSO path via LARS (exact, efficient)
alphas_lars, _, coefs_lars = lars_path(
X_sc, y, method="lasso", verbose=0
)

print(f"\nLARS-LASSO path:")
print(f" Number of path steps: {len(alphas_lars)}")
print(f" Alpha range: [{alphas_lars.min():.4f}, {alphas_lars.max():.4f}]")
print(f" Path is piecewise linear - exact regularization path in O(d³) time")

# Verify: LassoLars at a specific alpha matches sklearn's Lasso
alpha_target = 0.1
lasso_lars = LassoLars(alpha=alpha_target).fit(X_sc, y)
lasso_std = Lasso(alpha=alpha_target, max_iter=10_000).fit(X_sc, y)

max_diff = np.max(np.abs(lasso_lars.coef_ - lasso_std.coef_))
print(f"\nLassoLars vs Lasso at α={alpha_target}:")
print(f" Max coefficient difference: {max_diff:.2e} (should be ~0)")

Practical Regularization Checklist

When you encounter a new regression problem, follow this decision checklist:

def regularization_checklist(n_samples, n_features, has_correlated_groups,
need_feature_selection, need_closed_form):
"""
Practical guide to regularization choice.
Returns a recommendation with reasoning.
"""
print("Regularization Decision Checklist")
print("=" * 50)
print(f" n (samples) : {n_samples}")
print(f" d (features) : {n_features}")
print(f" d/n ratio : {n_features/n_samples:.2f}")
print()

if n_features / n_samples < 0.1:
print(" d/n < 0.1: OLS may work fine.")
print(" → Try OLS first; add Ridge if you see overfitting.")
elif n_features / n_samples < 1.0:
print(" 0.1 ≤ d/n < 1: Regularization strongly recommended.")
else:
print(" d/n ≥ 1: OLS has no unique solution. Regularization required.")

print()
if need_closed_form:
print(" Need closed form → Ridge only option.")
elif need_feature_selection and has_correlated_groups:
print(" Feature selection + correlated groups → ElasticNet (l1_ratio=0.5–0.9)")
elif need_feature_selection:
print(" Feature selection, no correlated groups → Lasso")
elif has_correlated_groups:
print(" Correlated groups, no selection needed → Ridge or ElasticNet (low l1)")
else:
print(" No special constraints → Ridge (default), try Lasso if sparsity useful")

print()
print(" Lambda selection: ALWAYS use CV variants (RidgeCV/LassoCV/ElasticNetCV)")
print(" Feature scaling: ALWAYS standardize features before regularization")


# Example: genomics problem
regularization_checklist(
n_samples=800, n_features=22_000,
has_correlated_groups=True, # genes in same pathway are correlated
need_feature_selection=True, # want to identify which genes matter
need_closed_form=False
)

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Regularization Path demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.