Norms and Distance Metrics - The Geometry of Regularization
Reading time: ~22 minutes | Level: Mathematical Foundations → ML Engineering
L1 regularization makes models sparse. L2 regularization makes models small. These are not arbitrary choices - they come directly from the geometry of different norms. The shape of the constraint region determines which weights go to zero.
When a regularized model trains, it is solving a constrained optimization problem: minimize the loss subject to the constraint that the weights live inside a norm ball. The L1 ball is a diamond. The L2 ball is a sphere. The corners of the diamond lie exactly on the coordinate axes. The loss surface, descending from above, is almost certain to touch a corner first - and a corner means some weights are exactly zero. That is why Lasso produces sparse models and Ridge does not.
This lesson gives you the full geometry.
What You Will Learn
- The three axioms that define a norm (and why they matter)
- L1 norm: the diamond, sparsity induction, why corners hit axes
- L2 norm: the sphere, smoothness, why solutions are small but non-zero
- L∞ norm: max absolute value, minimax problems
- Frobenius norm for matrices - equivalent to L2 on the vectorized matrix
- Nuclear norm - the convex relaxation of matrix rank
- Distance metrics derived from norms: Euclidean, Manhattan, Chebyshev, Mahalanobis
- NumPy:
np.linalg.normwith everyordvalue - ML: Lasso, Ridge, ElasticNet; cosine similarity vs Euclidean in embedding search
Prerequisites
Part 1 - What a Norm Is: Three Axioms
A norm is a function that assigns a non-negative length to a vector. Not every function qualifies - a norm must satisfy three axioms:
Axiom 1: Positive definiteness
‖v‖ ≥ 0 for all v
‖v‖ = 0 if and only if v = 0
Zero length only for the zero vector. No non-zero vector can have zero norm.
Axiom 2: Absolute homogeneity (scaling)
‖αv‖ = |α| ‖v‖ for any scalar α
Scaling a vector by α scales its length by |α|. Doubling a vector doubles its length.
Axiom 3: Triangle inequality
‖u + v‖ ≤ ‖u‖ + ‖v‖
The length of a sum cannot exceed the sum of lengths. This is the mathematical statement that "the straight-line path is never longer than the detour."
Why axioms matter for ML
The triangle inequality is what makes a norm usable as a distance metric. It guarantees that if you move from point A to point B and then from B to C, the total distance is at least as long as going directly from A to C - a property any reasonable "distance" must have.
When you regularize, you constrain weights to live inside a norm ball - the set of all vectors with norm ≤ r. The shape of that ball determines everything about which solutions are selected.
The Lp family
The most important norms in ML are the Lp norms, defined for vectors v ∈ ℝⁿ:
‖v‖_p = (|v₁|^p + |v₂|^p + ... + |vₙ|^p)^(1/p)
Special cases:
- p = 1: L1 norm (Manhattan distance)
- p = 2: L2 norm (Euclidean distance)
- p → ∞: L∞ norm (Chebyshev / max norm)
import numpy as np
v = np.array([3.0, -4.0, 2.0])
# All Lp norms via np.linalg.norm
l1 = np.linalg.norm(v, ord=1) # |3| + |-4| + |2| = 9.0
l2 = np.linalg.norm(v, ord=2) # sqrt(9+16+4) = sqrt(29) ≈ 5.385
linf = np.linalg.norm(v, ord=np.inf) # max(3, 4, 2) = 4.0
print(f"L1 ‖v‖₁ = {l1}") # 9.0
print(f"L2 ‖v‖₂ = {l2:.4f}") # 5.3852
print(f"L∞ ‖v‖∞ = {linf}") # 4.0
# Manual verification
l1_manual = np.sum(np.abs(v))
l2_manual = np.sqrt(np.sum(v**2))
linf_manual = np.max(np.abs(v))
assert np.isclose(l1, l1_manual)
assert np.isclose(l2, l2_manual)
assert np.isclose(linf, linf_manual)
# General p norm - as p grows, Lp converges to L∞
for p in [1, 1.5, 2, 3, 5, 10, 100]:
lp = np.linalg.norm(v, ord=p)
print(f"L{p:<4} ‖v‖_{p} = {lp:.4f}")
# At p=100 this will be very close to 4.0 (the L∞ value)
Part 2 - The L1 Norm: Geometry, Sparsity, and Why Lasso Works
Definition
‖v‖₁ = |v₁| + |v₂| + ... + |vₙ|
In 2D with v = (x, y): ‖v‖₁ = |x| + |y|.
The L1 ball is a diamond
The set {v : ‖v‖₁ ≤ 1} in 2D is:
|x| + |y| ≤ 1
This is a square rotated 45 degrees - a diamond shape. The four corners of the diamond sit exactly at (1,0), (-1,0), (0,1), (0,-1) - on the coordinate axes.
Why L1 induces sparsity: the geometric argument
Lasso regression solves:
minimize ‖Xw - y‖₂² subject to ‖w‖₁ ≤ r
This is equivalent to finding where the elliptical loss surface first touches the L1 ball.
The elliptical loss contours descend from the unconstrained optimum. Almost always, the first point of contact with the L1 diamond is a corner - and a corner has a zero component. The more dimensions there are, the more corners there are, and the more likely the contact point has multiple zero components.
For a d-dimensional L1 ball:
- Number of corners: 2d (one per axis, positive and negative)
- At a corner: d-1 components are zero
- The loss contour almost surely hits a corner rather than a face or interior point
This is not a coincidence - it is a consequence of the L1 ball having non-differentiable points at each corner. The subdifferential of the L1 norm at a corner spans a range of gradient directions, capturing any generic loss gradient that points toward that corner.
Soft-thresholding: the L1 proximal operator
The proximal operator of the L1 norm is soft-thresholding. It is the analytic solution to each coordinate step in Lasso coordinate descent:
S(z, t) = sign(z) * max(|z| - t, 0)
Values with |z| ≤ t are exactly zeroed. This is the mechanism that produces sparsity.
import numpy as np
def lasso_coordinate_descent(
X: np.ndarray,
y: np.ndarray,
alpha: float = 1.0,
n_iter: int = 1000,
tol: float = 1e-6
) -> np.ndarray:
"""
Lasso regression via coordinate descent.
Lasso objective: (1/2n) * ‖Xw - y‖₂² + alpha * ‖w‖₁
For each weight w_j, the 1D Lasso problem has a closed-form solution
via soft-thresholding - the proximal operator of the L1 norm.
"""
n, d = X.shape
w = np.zeros(d)
for iteration in range(n_iter):
w_old = w.copy()
for j in range(d):
# Partial residual: residual excluding contribution of w_j
r_j = y - X @ w + X[:, j] * w[j]
# OLS update for w_j ignoring L1 penalty
rho_j = X[:, j] @ r_j / n
# Soft-thresholding: shrink rho_j toward zero by alpha
# Values in [-alpha, alpha] are zeroed exactly
z_j = np.dot(X[:, j], X[:, j]) / n
w[j] = np.sign(rho_j) * max(abs(rho_j) - alpha, 0.0) / z_j
if np.max(np.abs(w - w_old)) < tol:
print(f"Converged at iteration {iteration + 1}")
break
return w
# Generate sparse ground truth and verify recovery
np.random.seed(42)
n, d = 100, 20
X = np.random.randn(n, d)
w_true = np.zeros(d)
w_true[[0, 5, 12]] = [3.0, -2.0, 1.5] # only 3 non-zero weights
y = X @ w_true + 0.1 * np.random.randn(n)
w_lasso = lasso_coordinate_descent(X, y, alpha=0.1)
n_nonzero = np.sum(np.abs(w_lasso) > 1e-6)
print(f"Non-zero weights: {n_nonzero} (true sparse: 3)")
print(f"Recovered indices: {np.where(np.abs(w_lasso) > 1e-6)[0]}")
print(f"True indices: [0, 5, 12]")
Part 3 - The L2 Norm: Geometry, Smoothness, and Why Ridge Works
Definition
‖v‖₂ = sqrt(v₁² + v₂² + ... + vₙ²)
The L2 norm is the standard Euclidean length - the straight-line distance from the origin to the point.
The L2 ball is a sphere
The set {v : ‖v‖₂ ≤ 1} is a perfect circle in 2D, a sphere in 3D, and a hypersphere in higher dimensions. It has no corners - it is perfectly smooth everywhere.
Why L2 does not induce sparsity
Ridge regression solves:
minimize ‖Xw - y‖₂² subject to ‖w‖₂ ≤ r
The loss contour (an ellipse) descends toward the circular L2 ball. Because the ball is smooth, the contact point is almost never at a coordinate axis - it is somewhere on the curved surface where all coordinates are generically non-zero.
The L2 penalty does not force coefficients to zero. Instead, it shrinks all coefficients toward zero proportionally. If you double the regularization strength, all weights get cut approximately in half - none become exactly zero.
Closed-form Ridge solution
For Ridge regression, the penalty gives a clean closed form:
w_ridge = (XᵀX + αI)⁻¹ Xᵀy
The αI term shifts all eigenvalues of XᵀX up by α, ensuring the matrix is invertible even when XᵀX is singular (collinear features).
import numpy as np
def ridge_regression(
X: np.ndarray,
y: np.ndarray,
alpha: float = 1.0
) -> np.ndarray:
"""
Ridge regression: closed-form solution.
Objective: ‖Xw - y‖₂² + alpha * ‖w‖₂²
Solution: w = (XᵀX + αI)⁻¹ Xᵀy
Ridge shrinks all weights proportionally - no exact zeros.
The α*I ensures invertibility even with collinear features.
"""
n, d = X.shape
I = np.eye(d)
w = np.linalg.solve(X.T @ X + alpha * I, X.T @ y)
return w
np.random.seed(42)
n, d = 100, 10
X = np.random.randn(n, d)
w_true = np.zeros(d)
w_true[[0, 3]] = [2.0, -1.5]
y = X @ w_true + 0.1 * np.random.randn(n)
alpha = 0.5
w_ridge = ridge_regression(X, y, alpha=alpha)
print("Ridge weights (all small, none exactly zero):")
for i, w in enumerate(w_ridge):
print(f" w[{i}] = {w:.4f}")
print(f"\nRidge zero weights (exact): {np.sum(np.abs(w_ridge) < 1e-10)}") # 0
print(f"Ridge min |w|: {np.min(np.abs(w_ridge)):.6f}") # small but > 0
# Ridge eigenvalue shift: how α stabilizes ill-conditioned matrices
XtX = X.T @ X
eigs_before = np.linalg.eigvalsh(XtX)
eigs_after = np.linalg.eigvalsh(XtX + alpha * np.eye(d))
print(f"\nSmallest eigenvalue before Ridge: {eigs_before.min():.4f}")
print(f"Smallest eigenvalue after Ridge: {eigs_after.min():.4f}")
print(f"Condition number before: {eigs_before.max()/eigs_before.min():.1f}")
print(f"Condition number after: {eigs_after.max()/eigs_after.min():.1f}")
L2 vs L1: comparison table
| Property | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Geometry | Diamond - corners on axes | Sphere - smooth everywhere |
| Solution structure | Sparse (corners hit axes) | Dense (smooth edge) |
| Shrinkage mechanism | Soft-threshold to exact zero | Proportional shrinkage |
| Correlated features | Picks one, zeros others | Averages them out |
| Solution uniqueness | Not always unique | Always unique |
| Closed form | No (requires iteration) | Yes |
| Best for | Feature selection | Stable regression with all features |
Part 4 - The L∞ Norm: Maximum Absolute Value
Definition
‖v‖∞ = max(|v₁|, |v₂|, ..., |vₙ|)
The L∞ norm is the maximum absolute component - the largest coordinate in absolute value.
The L∞ ball is a hypercube
The set {v : ‖v‖∞ ≤ 1} is the set where every coordinate satisfies -1 ≤ vᵢ ≤ 1 - an axis-aligned hypercube (a square in 2D).
When L∞ arises in ML
The L∞ norm appears in minimax problems - where you want to minimize the worst-case error:
- Adversarial robustness: L∞ attacks constrain the maximum pixel perturbation -
‖δ‖∞ ≤ εmeans no individual pixel changes by more than ε. FGSM and PGD attacks operate in this constraint set. - Chebyshev approximation: finding a polynomial that minimizes the maximum error on an interval
- Robust regression: minimizing the maximum residual rather than the sum of squared residuals
import numpy as np
v = np.array([1.2, -3.5, 0.8, 2.1])
linf = np.linalg.norm(v, ord=np.inf) # 3.5 = max absolute value
print(f"L∞ ‖v‖∞ = {linf}")
# Adversarial attack: L∞ ball projection
def project_linf_ball(delta: np.ndarray, epsilon: float) -> np.ndarray:
"""
Project perturbation delta onto the L∞ ball of radius epsilon.
This is the PGD projection step - clip each coordinate independently.
"""
return np.clip(delta, -epsilon, epsilon)
def fgsm_perturbation(gradient: np.ndarray, epsilon: float) -> np.ndarray:
"""
Fast Gradient Sign Method: L∞ adversarial perturbation.
Every pixel moves by exactly epsilon in the sign direction of the gradient.
‖delta‖∞ = epsilon - this saturates the L∞ constraint.
"""
return epsilon * np.sign(gradient)
# Example on a 4-pixel toy image
gradient = np.array([0.3, -0.7, 0.1, -0.5])
epsilon = 0.1
delta = fgsm_perturbation(gradient, epsilon)
print(f"FGSM perturbation: {delta}")
print(f"L∞ of perturbation: {np.linalg.norm(delta, ord=np.inf)}") # exactly 0.1
# PGD step: gradient step followed by L∞ projection
def pgd_step(
x: np.ndarray,
gradient: np.ndarray,
step_size: float,
epsilon: float
) -> np.ndarray:
"""One step of Projected Gradient Descent under L∞ constraint."""
x_adv = x + step_size * np.sign(gradient) # gradient step
return project_linf_ball(x_adv, epsilon) # project back to L∞ ball
Part 5 - The Frobenius Norm: L2 for Matrices
Definition
For a matrix A ∈ ℝ^(m×n):
‖A‖_F = sqrt(Σᵢ Σⱼ Aᵢⱼ²)
The Frobenius norm is the square root of the sum of all squared entries. This is exactly the L2 norm applied to the vectorized matrix - if you flatten A into a vector of length m*n, its L2 norm equals ‖A‖_F.
Connection to singular values
The Frobenius norm has a clean expression in terms of singular values:
‖A‖_F = sqrt(σ₁² + σ₂² + ... + σᵣ²) = ‖σ‖₂
where σᵢ are the singular values of A. This connects the Frobenius norm to SVD and makes it the natural matrix analog of the vector L2 norm.
Where Frobenius appears in ML
- Weight regularization: adding
α * ‖W‖_F²to a neural network loss penalizes large weights - this is L2 weight decay in matrix form - Low-rank approximation error: the Eckart-Young theorem says the best rank-k approximation in Frobenius norm is the truncated SVD
- Matrix factorization: collaborative filtering minimizes ‖R - UV^T‖_F² over user matrix U and item matrix V
import numpy as np
A = np.array([[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0]])
# Frobenius norm
frob = np.linalg.norm(A, 'fro')
print(f"Frobenius ‖A‖_F = {frob:.4f}") # sqrt(1+4+9+16+25+36) = sqrt(91)
# Verify: equals L2 of flattened vector
l2_flat = np.linalg.norm(A.flatten(), ord=2)
print(f"L2 of flatten = {l2_flat:.4f}")
assert np.isclose(frob, l2_flat)
# Verify: equals sqrt of sum of squared singular values
U, s, Vt = np.linalg.svd(A)
frob_from_svd = np.sqrt(np.sum(s**2))
print(f"From SVD: {frob_from_svd:.4f}")
assert np.isclose(frob, frob_from_svd)
# Low-rank approximation error (Eckart-Young theorem)
def low_rank_approx_error(A: np.ndarray, k: int) -> float:
"""
Frobenius error of best rank-k approximation.
Error = sqrt(sum of squared singular values beyond rank k).
This is the minimum achievable error - the truncated SVD achieves it.
"""
U, s, Vt = np.linalg.svd(A, full_matrices=False)
return float(np.sqrt(np.sum(s[k:]**2)))
A_test = np.random.randn(8, 6)
print("\nFrobenius approximation error by rank:")
for k in [1, 2, 3, 5]:
err = low_rank_approx_error(A_test, k)
print(f" rank-{k}: error = {err:.4f}")
# Matrix norms comparison
A_sq = np.random.randn(4, 4)
print("\nAll matrix norms of a 4x4 random matrix:")
print(f" Frobenius: {np.linalg.norm(A_sq, 'fro'):.4f}")
print(f" Nuclear: {np.linalg.norm(A_sq, 'nuc'):.4f}")
print(f" Spectral: {np.linalg.norm(A_sq, 2):.4f}") # = largest singular value
print(f" Row-max: {np.linalg.norm(A_sq, np.inf):.4f}")
print(f" Col-max: {np.linalg.norm(A_sq, 1):.4f}")
Part 6 - The Nuclear Norm: Convex Relaxation of Rank
The rank minimization problem
Many ML tasks want a low-rank matrix: recommender systems, matrix completion, robust PCA. The ideal objective is:
minimize rank(A) subject to constraints
But minimizing rank directly is NP-hard - rank is a combinatorial quantity (count of non-zero singular values). We need a convex relaxation.
The nuclear norm is the tightest convex relaxation
The nuclear norm (also called the trace norm or Schatten 1-norm) is:
‖A‖_* = σ₁ + σ₂ + ... + σᵣ (sum of all singular values)
This is to singular values what L1 is to vector entries. Just as L1 is the tightest convex relaxation of the L0 norm (count of non-zeros), the nuclear norm is the tightest convex relaxation of rank (count of non-zero singular values).
:::tip When to use the nuclear norm Use nuclear norm regularization when you need low-rank matrix completion - for example, filling in missing ratings in a recommender system:
minimize ‖A‖_* subject to Aᵢⱼ = Mᵢⱼ for observed (i,j)
This produces the lowest-rank matrix consistent with the observations. The proximal operator of the nuclear norm is singular value soft-thresholding - apply the L1 soft-threshold to each singular value. This is the Singular Value Thresholding (SVT) algorithm. In PyTorch, torch.linalg.matrix_norm(A, ord='nuc') computes this norm.
:::
import numpy as np
A = np.array([[1.0, 2.0, 0.0],
[0.0, 3.0, 1.0],
[2.0, 0.0, 4.0]])
U, s, Vt = np.linalg.svd(A)
nuclear_manual = np.sum(s)
nuclear_numpy = np.linalg.norm(A, 'nuc')
print(f"Nuclear norm (manual): {nuclear_manual:.4f}")
print(f"Nuclear norm (numpy): {nuclear_numpy:.4f}")
assert np.isclose(nuclear_manual, nuclear_numpy)
# Singular value soft-thresholding - proximal operator of nuclear norm
def nuclear_norm_prox(A: np.ndarray, tau: float) -> np.ndarray:
"""
Proximal operator of the nuclear norm at step size tau.
prox_{tau * ‖·‖_*}(A) = U * diag(max(σ - tau, 0)) * Vᵀ
This is the matrix analog of scalar soft-thresholding for L1.
Small singular values are zeroed, reducing the effective rank.
Used in the SVT algorithm for matrix completion.
"""
U, s, Vt = np.linalg.svd(A, full_matrices=False)
s_thresh = np.maximum(s - tau, 0.0)
return U @ np.diag(s_thresh) @ Vt
A_test = np.random.randn(5, 4)
A_thresh = nuclear_norm_prox(A_test, tau=1.0)
s_before = np.linalg.svd(A_test, compute_uv=False)
s_after = np.linalg.svd(A_thresh, compute_uv=False)
print(f"\nSingular values before: {s_before.round(3)}")
print(f"Singular values after: {s_after.round(3)}")
print(f"Rank before: {np.linalg.matrix_rank(A_test)}")
print(f"Rank after: {np.linalg.matrix_rank(A_thresh)}")
# Small singular values become zero → lower rank
# Nuclear norm vs rank: the relaxation relationship
print("\nNuclear norm and rank for varying low-rank matrices:")
for k in [1, 2, 3]:
# Build a rank-k matrix
U_r = np.random.randn(5, k)
V_r = np.random.randn(4, k)
M = U_r @ V_r.T
nuc = np.linalg.norm(M, 'nuc')
rk = np.linalg.matrix_rank(M)
print(f" Rank {rk}: nuclear norm = {nuc:.4f}")
Part 7 - Distance Metrics from Norms
A distance metric d(u, v) derived from a norm satisfies:
d(u, v) = ‖u - v‖
Every norm defines a distance metric. Different choices of norm give different notions of "distance."
Euclidean distance (from L2)
d(u, v) = ‖u - v‖₂ = sqrt(Σᵢ (uᵢ - vᵢ)²)
Standard straight-line distance. Sensitive to scale - features with large variance dominate.
Manhattan distance (from L1)
d(u, v) = ‖u - v‖₁ = Σᵢ |uᵢ - vᵢ|
Distance traveled along a grid (like city blocks). Less sensitive to outliers than L2 because it does not square the differences.
Chebyshev distance (from L∞)
d(u, v) = ‖u - v‖∞ = max_i |uᵢ - vᵢ|
The maximum coordinate difference. Used in chess (king moves), minimax games, and adversarial robustness settings.
Mahalanobis distance: the norm that accounts for correlation
The Mahalanobis distance is derived from a covariance-weighted norm:
d_M(u, v) = sqrt((u - v)ᵀ Σ⁻¹ (u - v))
where Σ is the covariance matrix of the data. Mahalanobis distance:
- Normalizes each feature by its standard deviation
- Decorrelates correlated features
- Is invariant to linear transformations of the data
- Reduces to Euclidean distance when Σ = I
import numpy as np
def mahalanobis_distance(
u: np.ndarray,
v: np.ndarray,
cov: np.ndarray
) -> float:
"""
Mahalanobis distance: accounts for feature scale and correlation.
d_M(u, v) = sqrt((u - v)ᵀ Σ⁻¹ (u - v))
Use this when:
- Features have different scales (instead of normalizing manually)
- Features are correlated (instead of first running PCA)
- Detecting outliers (data far from cluster center)
"""
diff = u - v
cov_inv = np.linalg.inv(cov)
return float(np.sqrt(diff.T @ cov_inv @ diff))
# Example: correlated features
np.random.seed(42)
n = 200
# x2 ≈ 2*x1 (strongly correlated)
x1 = np.random.randn(n)
x2 = 2.0 * x1 + 0.1 * np.random.randn(n)
X = np.column_stack([x1, x2])
cov = np.cov(X.T)
u = np.array([2.0, 4.0]) # on the correlation line (x2 = 2*x1)
v = np.array([0.0, 3.0]) # off the correlation line
d_euclidean = np.linalg.norm(u - v)
d_mahalanobis = mahalanobis_distance(u, v, cov)
print(f"Euclidean distance: {d_euclidean:.4f}")
print(f"Mahalanobis distance: {d_mahalanobis:.4f}")
# Mahalanobis accounts for correlation: v is farther off the data manifold
# Anomaly detection with Mahalanobis
def mahalanobis_scores(X: np.ndarray) -> np.ndarray:
"""Compute Mahalanobis distance of each point from the data centroid."""
mu = X.mean(axis=0)
cov = np.cov(X.T)
cov_inv = np.linalg.inv(cov)
diffs = X - mu
# Vectorized: (diffs @ cov_inv) * diffs, summed over features
scores = np.sqrt(np.sum((diffs @ cov_inv) * diffs, axis=1))
return scores
scores = mahalanobis_scores(X)
# Points with high Mahalanobis distance are anomalies
threshold = np.percentile(scores, 95)
anomalies = np.where(scores > threshold)[0]
print(f"\nTop 5% anomaly threshold: {threshold:.2f}")
print(f"Detected {len(anomalies)} anomalies")
Distance metric comparison
| Metric | Base norm | Formula | Scale-sensitive | When to use |
|---|---|---|---|---|
| Euclidean | L2 | sqrt(Σ(uᵢ-vᵢ)²) | Yes | Default starting point |
| Manhattan | L1 | Σ|uᵢ-vᵢ| | Yes | Robust to outliers |
| Chebyshev | L∞ | max|uᵢ-vᵢ| | Yes | Minimax, adversarial |
| Mahalanobis | Σ⁻¹-weighted | sqrt(dᵀΣ⁻¹d) | No | Correlated features |
| Cosine | - | 1 - cos(θ) | No | Embeddings, text |
Part 8 - NumPy: np.linalg.norm Complete Reference
import numpy as np
# ── Vector norms ──────────────────────────────────────────────────────────
v = np.array([3.0, -4.0, 2.0, -1.0])
print("=== Vector norms ===")
print(f"L0 (non-zeros, not a true norm): {np.sum(v != 0)}")
print(f"L1: {np.linalg.norm(v, ord=1)}") # sum of |vᵢ| = 10.0
print(f"L2: {np.linalg.norm(v, ord=2):.4f}") # sqrt(30) ≈ 5.477 (default)
print(f"L2 (default, no ord): {np.linalg.norm(v):.4f}")
print(f"L∞: {np.linalg.norm(v, ord=np.inf)}") # max|vᵢ| = 4.0
print(f"L-∞: {np.linalg.norm(v, ord=-np.inf)}") # min|vᵢ| = 1.0
# ── Matrix norms ──────────────────────────────────────────────────────────
A = np.array([[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0],
[7.0, 8.0, 9.0]])
print("\n=== Matrix norms ===")
print(f"Frobenius: {np.linalg.norm(A, 'fro'):.4f}") # sqrt of sum sq entries
print(f"Nuclear: {np.linalg.norm(A, 'nuc'):.4f}") # sum of singular values
print(f"Spectral (ord=2): {np.linalg.norm(A, 2):.4f}") # largest singular value
print(f"Max row sum (inf): {np.linalg.norm(A, np.inf):.4f}") # max(row sums)
print(f"Max col sum (1): {np.linalg.norm(A, 1):.4f}") # max(col sums)
# ── IMPORTANT: ord=inf means different things for vectors vs matrices ─────
v2 = np.array([3.0, 4.0])
M = v2.reshape(1, -1) # (1, 2) matrix - same numbers, different shape
print(f"\nVector L∞ (max |vᵢ|): {np.linalg.norm(v2, ord=np.inf)}") # 4.0
print(f"Matrix L∞ (max row sum): {np.linalg.norm(M, ord=np.inf)}") # 7.0 !!!
# Different! The matrix L∞ norm is max(sum of |row entries|), not max entry.
# ── Axis parameter: per-row or per-column norms ──────────────────────────
B = np.array([[3.0, 4.0],
[1.0, 0.0],
[2.0, 2.0]])
row_norms = np.linalg.norm(B, axis=1) # L2 norm of each row
col_norms = np.linalg.norm(B, ord=1, axis=0) # L1 norm of each column
print(f"\nRow-wise L2 norms: {row_norms}") # [5.0, 1.0, 2.828]
print(f"Col-wise L1 norms: {col_norms}") # [6.0, 6.0]
# Normalize rows to unit length (required before dot-product similarity)
B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
print(f"Row-normalized norms: {np.linalg.norm(B_unit, axis=1).round(4)}") # all 1.0
# ── Batch embeddings: norm over last axis ────────────────────────────────
embeddings = np.random.randn(32, 512) # 32 embeddings of dimension 512
norms = np.linalg.norm(embeddings, axis=1, keepdims=True) # (32, 1)
normalized = embeddings / norms # (32, 512)
print(f"\nEmbedding norms (after): {np.linalg.norm(normalized, axis=1).round(4)[:5]}")
# All should be 1.0
Part 9 - ML Connections: Regularization, Embeddings, and Search
ElasticNet: combining L1 and L2
ElasticNet blends Lasso and Ridge, getting sparsity from L1 and stability from L2:
objective = (1/2n)‖Xw - y‖₂² + alpha * l1_ratio * ‖w‖₁ + (alpha/2) * (1-l1_ratio) * ‖w‖₂²
This is useful when:
- There are many correlated features (Lasso picks one arbitrarily; ElasticNet spreads weight)
- You want sparsity but Lasso alone is unstable
import numpy as np
def elasticnet_coordinate_descent(
X: np.ndarray,
y: np.ndarray,
alpha: float = 1.0,
l1_ratio: float = 0.5,
n_iter: int = 1000
) -> np.ndarray:
"""
ElasticNet via coordinate descent.
l1_ratio = 1.0 → pure Lasso (sparse, unstable with correlations)
l1_ratio = 0.0 → pure Ridge (dense, stable)
l1_ratio in (0, 1) → ElasticNet blend
"""
n, d = X.shape
w = np.zeros(d)
alpha1 = alpha * l1_ratio # L1 strength
alpha2 = alpha * (1 - l1_ratio) # L2 strength
for _ in range(n_iter):
for j in range(d):
r_j = y - X @ w + X[:, j] * w[j]
rho_j = X[:, j] @ r_j / n
# L2 adds alpha2 to the denominator - shrinks but does not zero
z_j = X[:, j] @ X[:, j] / n + alpha2
# L1 soft-thresholds - creates sparsity
w[j] = np.sign(rho_j) * max(abs(rho_j) - alpha1, 0.0) / z_j
return w
:::danger Cosine similarity vs Euclidean distance for embeddings - choose carefully These two metrics answer fundamentally different questions:
Cosine similarity measures the angle between vectors - direction only, ignoring magnitude:
cos(u, v) = (u · v) / (‖u‖₂ ‖v‖₂)
Use when: magnitude carries no semantic meaning - word embeddings, sentence embeddings, TF-IDF vectors. The same document expressed in two differently-scaled embeddings should be identical.
Euclidean distance measures geometric distance - direction AND magnitude:
d(u, v) = ‖u - v‖₂
Use when: magnitude carries information, or when using embeddings from a model trained with contrastive loss where magnitude was calibrated.
The critical mistake: applying Euclidean distance to un-normalized embeddings from a model trained with cosine similarity loss. A vector of magnitude 10 and one of magnitude 1 pointing in the same direction will have large Euclidean distance but zero cosine distance. If your model was trained with cosine loss (sentence-transformers, DPR, CLIP), normalize embeddings to unit length first - then Euclidean distance and cosine similarity give identical rankings. :::
Cosine similarity in embedding search
import numpy as np
def cosine_similarity_matrix(A: np.ndarray, B: np.ndarray) -> np.ndarray:
"""
Cosine similarities between all rows of A and all rows of B.
A: (n, d) queries, B: (m, d) documents.
Returns (n, m) similarity matrix.
"""
A_norm = A / np.linalg.norm(A, axis=1, keepdims=True)
B_norm = B / np.linalg.norm(B, axis=1, keepdims=True)
return A_norm @ B_norm.T # dot product of unit vectors = cosine similarity
# Demonstrate that cosine and Euclidean give different rankings on raw embeddings
np.random.seed(42)
query = np.array([1.0, 0.5, 0.0])
docs = np.array([
[2.0, 1.0, 0.0], # same direction, 2x magnitude
[0.5, 0.25, 0.0], # same direction, 0.5x magnitude
[0.0, 1.0, 0.0], # different direction
[10.0, 5.0, 0.0], # same direction, 10x magnitude
])
# Cosine similarity (direction only)
q_unit = query / np.linalg.norm(query)
d_units = docs / np.linalg.norm(docs, axis=1, keepdims=True)
cos_sims = d_units @ q_unit
# Euclidean distance (direction + magnitude)
eucl_dists = np.linalg.norm(docs - query, axis=1)
print("Cosine ranking (docs 0,1,3 all direction-identical):")
for rank, idx in enumerate(np.argsort(cos_sims)[::-1]):
print(f" rank {rank+1}: doc[{idx}] cos_sim={cos_sims[idx]:.4f}")
print("\nEuclidean ranking (magnitude differences dominate):")
for rank, idx in enumerate(np.argsort(eucl_dists)):
print(f" rank {rank+1}: doc[{idx}] dist={eucl_dists[idx]:.4f}")
Regularization path: alpha vs sparsity
import numpy as np
np.random.seed(42)
n, d = 80, 10
X = np.random.randn(n, d)
w_true = np.zeros(d)
w_true[[0, 2, 7]] = [3.0, -1.5, 2.0]
y = X @ w_true + 0.2 * np.random.randn(n)
alphas = [0.005, 0.01, 0.05, 0.1, 0.3, 0.8, 1.5]
print(f"{'Alpha':>8} | {'Non-zeros':>10} | {'‖w‖₁':>8}")
print("-" * 32)
for alpha in alphas:
w = np.zeros(d)
for _ in range(500):
for j in range(d):
r_j = y - X @ w + X[:, j] * w[j]
rho_j = X[:, j] @ r_j / n
z_j = X[:, j] @ X[:, j] / n
w[j] = np.sign(rho_j) * max(abs(rho_j) - alpha, 0.0) / z_j
n_nz = np.sum(np.abs(w) > 1e-6)
l1 = np.sum(np.abs(w))
print(f"{alpha:>8.3f} | {n_nz:>10} | {l1:>8.4f}")
# As alpha increases: fewer non-zeros → sparser model
Interview Questions
Q1: Explain geometrically why L1 regularization induces sparsity but L2 does not.
Regularized regression is a constrained optimization problem: find the point on the norm ball that minimizes the loss function.
L1 ball (diamond shape): In 2D, the L1 ball is a rotated square with corners at (±r, 0) and (0, ±r) - exactly on the coordinate axes. The loss function has smooth, elliptical contours. When the ellipse descends toward the diamond, it almost always makes first contact at a corner. A corner is a point where one coordinate is zero. In d dimensions, corners become even more likely contact points because there are 2d corners and the loss surface is generic.
Why corners? The L1 norm is not differentiable at the coordinate axes - its subdifferential spans a range of subgradients at each corner. This means that for any loss gradient pointing toward the corner, the corner is still optimal. Mathematically, the optimality condition 0 ∈ ∂f(w*) + λ ∂‖w*‖₁ is satisfied at a corner for all loss gradients in a finite angular range.
L2 ball (sphere shape): The L2 ball is smooth everywhere - no corners. The elliptical loss contour makes tangential contact with the sphere at a single point determined by the gradient direction. That point is generically on the interior of a face with all coordinates non-zero. L2 shrinks all weights proportionally but cannot force any weight to exactly zero.
Summary: Sparsity requires corners on coordinate axes. L1 has them; L2 does not. This is a geometric property of the constraint region, not a numerical artifact.
Q2: What is the Frobenius norm and when does it appear in ML?
The Frobenius norm of a matrix A ∈ ℝ^(m×n) is the square root of the sum of all squared entries:
‖A‖_F = sqrt(Σᵢ Σⱼ Aᵢⱼ²)
Three equivalent characterizations:
- Square root of sum of squared entries
- L2 norm of the vectorized (flattened) matrix
- L2 norm of the singular values: ‖A‖_F = sqrt(σ₁² + σ₂² + ... + σᵣ²)
Where it appears in ML:
-
Weight decay in neural networks:
L_total = L_task + λ Σ_layers ‖W_l‖_F²- this is the standard L2 regularization. In PyTorch,weight_decayin optimizers implements this. -
Eckart-Young theorem: The best rank-k approximation of A in the Frobenius norm is the truncated SVD. The approximation error is
‖A - A_k‖_F = sqrt(σ_\{k+1\}² + ... + σ_r²). -
Collaborative filtering: Matrix factorization for recommender systems minimizes
‖R - UV^T‖_F²plus regularization terms on U and V. -
Gradient norms: monitoring
‖∇L‖_Fduring training, gradient clipping by Frobenius norm.
Use np.linalg.norm(A, 'fro') in NumPy and torch.linalg.matrix_norm(A, 'fro') in PyTorch.
Q3: What is the nuclear norm and why is it the convex relaxation of rank?
The nuclear norm of A is the sum of all singular values:
‖A‖_* = σ₁ + σ₂ + ... + σᵣ
Why it relaxes rank: The rank of A counts the number of non-zero singular values - a discrete, combinatorial quantity. Minimizing rank is NP-hard (equivalent to finding the sparsest representation). The nuclear norm replaces "count non-zero σᵢ" with "sum all σᵢ" - a continuous, convex function.
This is the exact same relaxation that L1 performs on vectors:
- Vectors: L0 (count non-zeros) → relaxed to L1 (sum of |vᵢ|)
- Matrices: rank (count non-zero σᵢ) → relaxed to nuclear (sum of σᵢ)
Formally, the nuclear norm is the convex envelope of the rank function on the set {A : ‖A‖₂ ≤ 1} (proven by Fazel 2002 and Recht et al. 2010).
Proximal operator: Singular value soft-thresholding - apply L1 soft-threshold to each singular value:
prox_{τ ‖·‖_*}(A) = U * diag(max(σ - τ, 0)) * Vᵀ
This is the SVT (Singular Value Thresholding) algorithm, the core subroutine for nuclear norm minimization problems like matrix completion and robust PCA.
Q4: When should you use cosine similarity vs Euclidean distance for embeddings?
The fundamental question: does magnitude carry semantic meaning in your embedding space?
Use cosine similarity when:
- Magnitude is irrelevant (word2vec, GloVe, TF-IDF)
- The model was explicitly trained with cosine similarity loss (most sentence-transformers, DPR, CLIP)
- You want direction-only comparison (the angle between representations)
Use Euclidean distance when:
- Both direction and magnitude carry information
- Working with normalized embeddings (cosine ≡ Euclidean on the unit sphere)
- The model's training procedure calibrated embedding magnitudes
The unit-sphere equivalence: For unit-normalized embeddings (‖u‖₂ = ‖v‖₂ = 1):
‖u - v‖₂² = 2 - 2·(u·v) = 2·(1 - cos(u,v))
So on the unit sphere, Euclidean distance and cosine similarity give identical rankings. This means: normalize embeddings to unit length, then use dot product (which equals cosine for unit vectors). This lets you use FAISS or any L2-based approximate nearest neighbor library and get cosine-equivalent rankings.
Practical rule: For sentence-transformers, CLIP, or DPR - always normalize to unit length first. For raw last-layer hidden states from a model not trained with cosine loss - use cosine similarity explicitly, since magnitudes may not be meaningful.
Practice Challenges
Level 1: Predict
Challenge: Without computing, predict which norm gives the largest value for v = [1, 1, 1, ..., 1] ∈ ℝ¹⁰⁰ (all ones):
- L1 norm
- L2 norm
- L∞ norm
Answer
For v = [1, 1, ..., 1] ∈ ℝ¹⁰⁰:
- L1 = |1| * 100 = 100
- L2 = sqrt(100) = 10
- L∞ = max(1,...,1) = 1
L1 is largest, L∞ is smallest. The ordering L∞ ≤ L2 ≤ L1 always holds for unit vectors (general fact: Lp decreases as p increases).
More precisely, for a d-dimensional vector with all entries equal to c:
‖v‖₁ = d·c, ‖v‖₂ = sqrt(d)·c, ‖v‖∞ = c
This illustrates why L1 regularization is "stricter" - it penalizes many small weights more heavily than L2 does, which is another way to understand why L1 prefers sparse solutions.
Level 2: Debug
Challenge: The following function is supposed to compute cosine similarity but returns incorrect results for some inputs. Find all bugs:
def cosine_sim(u, v):
return u @ v / np.linalg.norm(u) * np.linalg.norm(v)
Answer
Bug: Operator precedence. Python evaluates left to right: (u @ v / norm(u)) * norm(v) instead of u @ v / (norm(u) * norm(v)).
# Wrong: operator precedence multiplies by norm(v) instead of dividing
def cosine_sim_wrong(u, v):
return u @ v / np.linalg.norm(u) * np.linalg.norm(v)
# Fix 1: explicit parentheses
def cosine_sim_fixed_v1(u, v):
return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
# Fix 2: normalize first - cleaner and avoids precision issues
def cosine_sim_fixed_v2(u, v):
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)
return float(u_hat @ v_hat)
# Verify
u = np.array([3.0, 4.0]) # norm = 5
v = np.array([1.0, 0.0]) # norm = 1
# Correct: (3*1 + 4*0) / (5 * 1) = 3/5 = 0.6
print(f"Wrong: {cosine_sim_wrong(u, v):.4f}") # 3.0 (incorrect!)
print(f"Fixed v1: {cosine_sim_fixed_v1(u, v):.4f}") # 0.6 (correct)
print(f"Fixed v2: {cosine_sim_fixed_v2(u, v):.4f}") # 0.6 (correct)
# Second bug: no guard for zero-norm vectors
zero = np.zeros(2)
# cosine_sim_fixed_v1(u, zero) # ZeroDivisionError!
def cosine_sim_safe(u, v, eps=1e-8):
denom = np.linalg.norm(u) * np.linalg.norm(v)
if denom < eps:
return 0.0
return float(u @ v / denom)
Level 3: Design
Challenge: Implement a matrix completion algorithm using the nuclear norm proximal operator (SVT) that fills in missing entries of a rating matrix.
Answer
import numpy as np
def svt_matrix_completion(
M_observed: np.ndarray,
mask: np.ndarray,
tau: float = 1.0,
step: float = 1.9,
n_iter: int = 200,
tol: float = 1e-4
) -> np.ndarray:
"""
Singular Value Thresholding for matrix completion.
Solves: minimize ‖X‖_* subject to Xᵢⱼ = Mᵢⱼ for observed (i,j)
Algorithm:
1. Start with X = 0
2. Repeat:
a. Gradient step: Y = X + step * mask * (M_observed - X)
b. Nuclear norm prox: X = SVT(Y, tau)
3. Until convergence
"""
X = np.zeros_like(M_observed)
def soft_threshold_svd(A: np.ndarray, threshold: float) -> np.ndarray:
U, s, Vt = np.linalg.svd(A, full_matrices=False)
s_thresh = np.maximum(s - threshold, 0.0)
return U @ np.diag(s_thresh) @ Vt
for iteration in range(n_iter):
X_old = X.copy()
# Gradient step: push toward observed values on the mask
residual = mask * (M_observed - X)
Y = X + step * residual
# Nuclear norm proximal step: soft-threshold singular values
X = soft_threshold_svd(Y, tau)
# Check convergence
change = np.linalg.norm(X - X_old, 'fro') / (np.linalg.norm(X_old, 'fro') + 1e-10)
if change < tol:
print(f"Converged at iteration {iteration + 1}")
break
return X
# Test: recover a low-rank matrix with 50% missing entries
np.random.seed(42)
m, n, r = 20, 15, 3
# True low-rank matrix
U_true = np.random.randn(m, r)
V_true = np.random.randn(n, r)
M_true = U_true @ V_true.T # rank-3 matrix
# Random observation mask (50% observed)
mask = (np.random.rand(m, n) > 0.5).astype(float)
M_obs = mask * M_true # zero out unobserved entries
# Run SVT
M_recovered = svt_matrix_completion(M_obs, mask, tau=0.5, step=1.5)
# Evaluate recovery on unobserved entries
unobserved_mask = 1 - mask
rmse = np.sqrt(np.mean((unobserved_mask * (M_recovered - M_true))**2 /
(unobserved_mask.sum() + 1e-10) * m * n))
print(f"RMSE on unobserved entries: {rmse:.4f}")
print(f"Recovered rank: {np.linalg.matrix_rank(M_recovered, tol=1e-2)}")
print(f"True rank: {r}")
Quick Reference Cheatsheet
| Norm | Formula | Constraint shape | NumPy | ML use |
|---|---|---|---|---|
| L1 | Σ|vᵢ| | Diamond (corners on axes) | norm(v, 1) | Lasso, feature selection |
| L2 | sqrt(Σvᵢ²) | Sphere (smooth) | norm(v) | Ridge, Euclidean distance |
| L∞ | max|vᵢ| | Hypercube | norm(v, inf) | Adversarial ε-ball |
| Frobenius | sqrt(ΣΣAᵢⱼ²) | - | norm(A, 'fro') | Weight decay |
| Nuclear | Σσᵢ | - | norm(A, 'nuc') | Low-rank completion |
| Spectral | σ_max | - | norm(A, 2) | Lipschitz bounds |
Key Takeaways
- A norm is defined by three axioms: positive definiteness, homogeneity, and the triangle inequality - every norm is a valid distance metric
- The L1 ball is a diamond with corners on coordinate axes - the loss surface hits corners first, creating exact zeros (sparsity); L1 = Lasso = feature selection
- The L2 ball is a smooth sphere - the loss surface touches a generic boundary point, shrinking all weights proportionally but zeroing none; L2 = Ridge = stability
- The Frobenius norm is the L2 norm of the vectorized matrix, equal to the L2 norm of its singular values - use
np.linalg.norm(A, 'fro')for weight decay - The nuclear norm is the sum of singular values - the convex relaxation of rank - use it when you need low-rank matrix completion
- For embedding search: cosine similarity measures angle (normalize first), Euclidean measures angle + magnitude - normalize embeddings to unit length to make them equivalent
Next: Tensors for Deep Learning →
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Norms Explorer demo on the EngineersOfAI Playground - no code required.
:::
