Norms and Distance Metrics - The Geometry of Regularization

Reading time: ~22 minutes | Level: Mathematical Foundations → ML Engineering

L1 regularization makes models sparse. L2 regularization makes models small. These are not arbitrary choices - they come directly from the geometry of different norms. The shape of the constraint region determines which weights go to zero.

When a regularized model trains, it is solving a constrained optimization problem: minimize the loss subject to the constraint that the weights live inside a norm ball. The L1 ball is a diamond. The L2 ball is a sphere. The corners of the diamond lie exactly on the coordinate axes. The loss surface, descending from above, is almost certain to touch a corner first - and a corner means some weights are exactly zero. That is why Lasso produces sparse models and Ridge does not.

This lesson gives you the full geometry.

What You Will Learn

The three axioms that define a norm (and why they matter)
L1 norm: the diamond, sparsity induction, why corners hit axes
L2 norm: the sphere, smoothness, why solutions are small but non-zero
L∞ norm: max absolute value, minimax problems
Frobenius norm for matrices - equivalent to L2 on the vectorized matrix
Nuclear norm - the convex relaxation of matrix rank
Distance metrics derived from norms: Euclidean, Manhattan, Chebyshev, Mahalanobis
NumPy: np.linalg.norm with every ord value
ML: Lasso, Ridge, ElasticNet; cosine similarity vs Euclidean in embedding search

Prerequisites

Part 1 - What a Norm Is: Three Axioms

A norm is a function that assigns a non-negative length to a vector. Not every function qualifies - a norm must satisfy three axioms:

Axiom 1: Positive definiteness

‖v‖ ≥ 0  for all v
‖v‖ = 0  if and only if  v = 0

Zero length only for the zero vector. No non-zero vector can have zero norm.

Axiom 2: Absolute homogeneity (scaling)

‖αv‖ = |α| ‖v‖  for any scalar α

Scaling a vector by α scales its length by |α|. Doubling a vector doubles its length.

Axiom 3: Triangle inequality

‖u + v‖ ≤ ‖u‖ + ‖v‖

The length of a sum cannot exceed the sum of lengths. This is the mathematical statement that "the straight-line path is never longer than the detour."

Why axioms matter for ML

The triangle inequality is what makes a norm usable as a distance metric. It guarantees that if you move from point A to point B and then from B to C, the total distance is at least as long as going directly from A to C - a property any reasonable "distance" must have.

When you regularize, you constrain weights to live inside a norm ball - the set of all vectors with norm ≤ r. The shape of that ball determines everything about which solutions are selected.

The Lp family

The most important norms in ML are the Lp norms, defined for vectors v ∈ ℝⁿ:

‖v‖_p = (|v₁|^p + |v₂|^p + ... + |vₙ|^p)^(1/p)

Special cases:

p = 1: L1 norm (Manhattan distance)
p = 2: L2 norm (Euclidean distance)
p → ∞: L∞ norm (Chebyshev / max norm)

import numpy as np

v = np.array([3.0, -4.0, 2.0])

# All Lp norms via np.linalg.norm
l1   = np.linalg.norm(v, ord=1)          # |3| + |-4| + |2| = 9.0
l2   = np.linalg.norm(v, ord=2)          # sqrt(9+16+4) = sqrt(29) ≈ 5.385
linf = np.linalg.norm(v, ord=np.inf)     # max(3, 4, 2) = 4.0

print(f"L1   ‖v‖₁ = {l1}")       # 9.0
print(f"L2   ‖v‖₂ = {l2:.4f}")   # 5.3852
print(f"L∞   ‖v‖∞ = {linf}")     # 4.0

# Manual verification
l1_manual   = np.sum(np.abs(v))
l2_manual   = np.sqrt(np.sum(v**2))
linf_manual = np.max(np.abs(v))

assert np.isclose(l1, l1_manual)
assert np.isclose(l2, l2_manual)
assert np.isclose(linf, linf_manual)

# General p norm - as p grows, Lp converges to L∞
for p in [1, 1.5, 2, 3, 5, 10, 100]:
    lp = np.linalg.norm(v, ord=p)
    print(f"L{p:<4} ‖v‖_{p} = {lp:.4f}")
    # At p=100 this will be very close to 4.0 (the L∞ value)

Part 2 - The L1 Norm: Geometry, Sparsity, and Why Lasso Works

Definition

‖v‖₁ = |v₁| + |v₂| + ... + |vₙ|

In 2D with v = (x, y): ‖v‖₁ = |x| + |y|.

The L1 ball is a diamond

The set {v : ‖v‖₁ ≤ 1} in 2D is:

|x| + |y| ≤ 1

This is a square rotated 45 degrees - a diamond shape. The four corners of the diamond sit exactly at (1,0), (-1,0), (0,1), (0,-1) - on the coordinate axes.

Why L1 induces sparsity: the geometric argument

Lasso regression solves:

minimize  ‖Xw - y‖₂²   subject to  ‖w‖₁ ≤ r

This is equivalent to finding where the elliptical loss surface first touches the L1 ball.

The elliptical loss contours descend from the unconstrained optimum. Almost always, the first point of contact with the L1 diamond is a corner - and a corner has a zero component. The more dimensions there are, the more corners there are, and the more likely the contact point has multiple zero components.

For a d-dimensional L1 ball:

Number of corners: 2d (one per axis, positive and negative)
At a corner: d-1 components are zero
The loss contour almost surely hits a corner rather than a face or interior point

This is not a coincidence - it is a consequence of the L1 ball having non-differentiable points at each corner. The subdifferential of the L1 norm at a corner spans a range of gradient directions, capturing any generic loss gradient that points toward that corner.

Soft-thresholding: the L1 proximal operator

The proximal operator of the L1 norm is soft-thresholding. It is the analytic solution to each coordinate step in Lasso coordinate descent:

S(z, t) = sign(z) * max(|z| - t, 0)

Values with |z| ≤ t are exactly zeroed. This is the mechanism that produces sparsity.

import numpy as np

def lasso_coordinate_descent(
    X: np.ndarray,
    y: np.ndarray,
    alpha: float = 1.0,
    n_iter: int = 1000,
    tol: float = 1e-6
) -> np.ndarray:
    """
    Lasso regression via coordinate descent.

    Lasso objective: (1/2n) * ‖Xw - y‖₂² + alpha * ‖w‖₁

    For each weight w_j, the 1D Lasso problem has a closed-form solution
    via soft-thresholding - the proximal operator of the L1 norm.
    """
    n, d = X.shape
    w = np.zeros(d)

    for iteration in range(n_iter):
        w_old = w.copy()

        for j in range(d):
            # Partial residual: residual excluding contribution of w_j
            r_j = y - X @ w + X[:, j] * w[j]

            # OLS update for w_j ignoring L1 penalty
            rho_j = X[:, j] @ r_j / n

            # Soft-thresholding: shrink rho_j toward zero by alpha
            # Values in [-alpha, alpha] are zeroed exactly
            z_j = np.dot(X[:, j], X[:, j]) / n
            w[j] = np.sign(rho_j) * max(abs(rho_j) - alpha, 0.0) / z_j

        if np.max(np.abs(w - w_old)) < tol:
            print(f"Converged at iteration {iteration + 1}")
            break

    return w

# Generate sparse ground truth and verify recovery
np.random.seed(42)
n, d = 100, 20
X = np.random.randn(n, d)
w_true = np.zeros(d)
w_true[[0, 5, 12]] = [3.0, -2.0, 1.5]  # only 3 non-zero weights
y = X @ w_true + 0.1 * np.random.randn(n)

w_lasso = lasso_coordinate_descent(X, y, alpha=0.1)

n_nonzero = np.sum(np.abs(w_lasso) > 1e-6)
print(f"Non-zero weights: {n_nonzero} (true sparse: 3)")
print(f"Recovered indices: {np.where(np.abs(w_lasso) > 1e-6)[0]}")
print(f"True indices:      [0, 5, 12]")

Part 3 - The L2 Norm: Geometry, Smoothness, and Why Ridge Works

Definition

‖v‖₂ = sqrt(v₁² + v₂² + ... + vₙ²)

The L2 norm is the standard Euclidean length - the straight-line distance from the origin to the point.

The L2 ball is a sphere

The set {v : ‖v‖₂ ≤ 1} is a perfect circle in 2D, a sphere in 3D, and a hypersphere in higher dimensions. It has no corners - it is perfectly smooth everywhere.

Why L2 does not induce sparsity

Ridge regression solves:

minimize  ‖Xw - y‖₂²   subject to  ‖w‖₂ ≤ r

The loss contour (an ellipse) descends toward the circular L2 ball. Because the ball is smooth, the contact point is almost never at a coordinate axis - it is somewhere on the curved surface where all coordinates are generically non-zero.

The L2 penalty does not force coefficients to zero. Instead, it shrinks all coefficients toward zero proportionally. If you double the regularization strength, all weights get cut approximately in half - none become exactly zero.

Closed-form Ridge solution

For Ridge regression, the penalty gives a clean closed form:

w_ridge = (XᵀX + αI)⁻¹ Xᵀy

The αI term shifts all eigenvalues of XᵀX up by α, ensuring the matrix is invertible even when XᵀX is singular (collinear features).

import numpy as np

def ridge_regression(
    X: np.ndarray,
    y: np.ndarray,
    alpha: float = 1.0
) -> np.ndarray:
    """
    Ridge regression: closed-form solution.

    Objective: ‖Xw - y‖₂² + alpha * ‖w‖₂²
    Solution:  w = (XᵀX + αI)⁻¹ Xᵀy

    Ridge shrinks all weights proportionally - no exact zeros.
    The α*I ensures invertibility even with collinear features.
    """
    n, d = X.shape
    I = np.eye(d)
    w = np.linalg.solve(X.T @ X + alpha * I, X.T @ y)
    return w

np.random.seed(42)
n, d = 100, 10
X = np.random.randn(n, d)
w_true = np.zeros(d)
w_true[[0, 3]] = [2.0, -1.5]
y = X @ w_true + 0.1 * np.random.randn(n)

alpha = 0.5
w_ridge = ridge_regression(X, y, alpha=alpha)

print("Ridge weights (all small, none exactly zero):")
for i, w in enumerate(w_ridge):
    print(f"  w[{i}] = {w:.4f}")

print(f"\nRidge zero weights (exact): {np.sum(np.abs(w_ridge) < 1e-10)}")  # 0
print(f"Ridge min |w|: {np.min(np.abs(w_ridge)):.6f}")  # small but > 0

# Ridge eigenvalue shift: how α stabilizes ill-conditioned matrices
XtX = X.T @ X
eigs_before = np.linalg.eigvalsh(XtX)
eigs_after  = np.linalg.eigvalsh(XtX + alpha * np.eye(d))
print(f"\nSmallest eigenvalue before Ridge: {eigs_before.min():.4f}")
print(f"Smallest eigenvalue after  Ridge: {eigs_after.min():.4f}")
print(f"Condition number before: {eigs_before.max()/eigs_before.min():.1f}")
print(f"Condition number after:  {eigs_after.max()/eigs_after.min():.1f}")

L2 vs L1: comparison table

Property	L1 (Lasso)	L2 (Ridge)
Geometry	Diamond - corners on axes	Sphere - smooth everywhere
Solution structure	Sparse (corners hit axes)	Dense (smooth edge)
Shrinkage mechanism	Soft-threshold to exact zero	Proportional shrinkage
Correlated features	Picks one, zeros others	Averages them out
Solution uniqueness	Not always unique	Always unique
Closed form	No (requires iteration)	Yes
Best for	Feature selection	Stable regression with all features

Part 4 - The L∞ Norm: Maximum Absolute Value

Definition

‖v‖∞ = max(|v₁|, |v₂|, ..., |vₙ|)

The L∞ norm is the maximum absolute component - the largest coordinate in absolute value.

The L∞ ball is a hypercube

The set {v : ‖v‖∞ ≤ 1} is the set where every coordinate satisfies -1 ≤ vᵢ ≤ 1 - an axis-aligned hypercube (a square in 2D).

When L∞ arises in ML

The L∞ norm appears in minimax problems - where you want to minimize the worst-case error:

Adversarial robustness: L∞ attacks constrain the maximum pixel perturbation - ‖δ‖∞ ≤ ε means no individual pixel changes by more than ε. FGSM and PGD attacks operate in this constraint set.
Chebyshev approximation: finding a polynomial that minimizes the maximum error on an interval
Robust regression: minimizing the maximum residual rather than the sum of squared residuals

import numpy as np

v = np.array([1.2, -3.5, 0.8, 2.1])

linf = np.linalg.norm(v, ord=np.inf)   # 3.5 = max absolute value
print(f"L∞ ‖v‖∞ = {linf}")

# Adversarial attack: L∞ ball projection
def project_linf_ball(delta: np.ndarray, epsilon: float) -> np.ndarray:
    """
    Project perturbation delta onto the L∞ ball of radius epsilon.
    This is the PGD projection step - clip each coordinate independently.
    """
    return np.clip(delta, -epsilon, epsilon)

def fgsm_perturbation(gradient: np.ndarray, epsilon: float) -> np.ndarray:
    """
    Fast Gradient Sign Method: L∞ adversarial perturbation.
    Every pixel moves by exactly epsilon in the sign direction of the gradient.
    ‖delta‖∞ = epsilon - this saturates the L∞ constraint.
    """
    return epsilon * np.sign(gradient)

# Example on a 4-pixel toy image
gradient = np.array([0.3, -0.7, 0.1, -0.5])
epsilon = 0.1
delta = fgsm_perturbation(gradient, epsilon)
print(f"FGSM perturbation: {delta}")
print(f"L∞ of perturbation: {np.linalg.norm(delta, ord=np.inf)}")  # exactly 0.1

# PGD step: gradient step followed by L∞ projection
def pgd_step(
    x: np.ndarray,
    gradient: np.ndarray,
    step_size: float,
    epsilon: float
) -> np.ndarray:
    """One step of Projected Gradient Descent under L∞ constraint."""
    x_adv = x + step_size * np.sign(gradient)  # gradient step
    return project_linf_ball(x_adv, epsilon)    # project back to L∞ ball

Part 5 - The Frobenius Norm: L2 for Matrices

Definition

For a matrix A ∈ ℝ^(m×n):

‖A‖_F = sqrt(Σᵢ Σⱼ Aᵢⱼ²)

The Frobenius norm is the square root of the sum of all squared entries. This is exactly the L2 norm applied to the vectorized matrix - if you flatten A into a vector of length m*n, its L2 norm equals ‖A‖_F.

Connection to singular values

The Frobenius norm has a clean expression in terms of singular values:

‖A‖_F = sqrt(σ₁² + σ₂² + ... + σᵣ²) = ‖σ‖₂

where σᵢ are the singular values of A. This connects the Frobenius norm to SVD and makes it the natural matrix analog of the vector L2 norm.

Where Frobenius appears in ML

Weight regularization: adding α * ‖W‖_F² to a neural network loss penalizes large weights - this is L2 weight decay in matrix form
Low-rank approximation error: the Eckart-Young theorem says the best rank-k approximation in Frobenius norm is the truncated SVD
Matrix factorization: collaborative filtering minimizes ‖R - UV^T‖_F² over user matrix U and item matrix V

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Frobenius norm
frob = np.linalg.norm(A, 'fro')
print(f"Frobenius ‖A‖_F = {frob:.4f}")  # sqrt(1+4+9+16+25+36) = sqrt(91)

# Verify: equals L2 of flattened vector
l2_flat = np.linalg.norm(A.flatten(), ord=2)
print(f"L2 of flatten   = {l2_flat:.4f}")
assert np.isclose(frob, l2_flat)

# Verify: equals sqrt of sum of squared singular values
U, s, Vt = np.linalg.svd(A)
frob_from_svd = np.sqrt(np.sum(s**2))
print(f"From SVD:        {frob_from_svd:.4f}")
assert np.isclose(frob, frob_from_svd)

# Low-rank approximation error (Eckart-Young theorem)
def low_rank_approx_error(A: np.ndarray, k: int) -> float:
    """
    Frobenius error of best rank-k approximation.
    Error = sqrt(sum of squared singular values beyond rank k).
    This is the minimum achievable error - the truncated SVD achieves it.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return float(np.sqrt(np.sum(s[k:]**2)))

A_test = np.random.randn(8, 6)
print("\nFrobenius approximation error by rank:")
for k in [1, 2, 3, 5]:
    err = low_rank_approx_error(A_test, k)
    print(f"  rank-{k}: error = {err:.4f}")

# Matrix norms comparison
A_sq = np.random.randn(4, 4)
print("\nAll matrix norms of a 4x4 random matrix:")
print(f"  Frobenius:  {np.linalg.norm(A_sq, 'fro'):.4f}")
print(f"  Nuclear:    {np.linalg.norm(A_sq, 'nuc'):.4f}")
print(f"  Spectral:   {np.linalg.norm(A_sq, 2):.4f}")     # = largest singular value
print(f"  Row-max:    {np.linalg.norm(A_sq, np.inf):.4f}")
print(f"  Col-max:    {np.linalg.norm(A_sq, 1):.4f}")

Part 6 - The Nuclear Norm: Convex Relaxation of Rank

The rank minimization problem

Many ML tasks want a low-rank matrix: recommender systems, matrix completion, robust PCA. The ideal objective is:

minimize  rank(A)   subject to  constraints

But minimizing rank directly is NP-hard - rank is a combinatorial quantity (count of non-zero singular values). We need a convex relaxation.

The nuclear norm is the tightest convex relaxation

The nuclear norm (also called the trace norm or Schatten 1-norm) is:

‖A‖_* = σ₁ + σ₂ + ... + σᵣ  (sum of all singular values)

This is to singular values what L1 is to vector entries. Just as L1 is the tightest convex relaxation of the L0 norm (count of non-zeros), the nuclear norm is the tightest convex relaxation of rank (count of non-zero singular values).

:::tip When to use the nuclear norm Use nuclear norm regularization when you need low-rank matrix completion - for example, filling in missing ratings in a recommender system:

minimize  ‖A‖_*   subject to  Aᵢⱼ = Mᵢⱼ  for observed (i,j)

This produces the lowest-rank matrix consistent with the observations. The proximal operator of the nuclear norm is singular value soft-thresholding - apply the L1 soft-threshold to each singular value. This is the Singular Value Thresholding (SVT) algorithm. In PyTorch, torch.linalg.matrix_norm(A, ord='nuc') computes this norm. :::

import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 3.0, 1.0],
              [2.0, 0.0, 4.0]])

U, s, Vt = np.linalg.svd(A)
nuclear_manual = np.sum(s)
nuclear_numpy  = np.linalg.norm(A, 'nuc')
print(f"Nuclear norm (manual): {nuclear_manual:.4f}")
print(f"Nuclear norm (numpy):  {nuclear_numpy:.4f}")
assert np.isclose(nuclear_manual, nuclear_numpy)

# Singular value soft-thresholding - proximal operator of nuclear norm
def nuclear_norm_prox(A: np.ndarray, tau: float) -> np.ndarray:
    """
    Proximal operator of the nuclear norm at step size tau.

    prox_{tau * ‖·‖_*}(A) = U * diag(max(σ - tau, 0)) * Vᵀ

    This is the matrix analog of scalar soft-thresholding for L1.
    Small singular values are zeroed, reducing the effective rank.
    Used in the SVT algorithm for matrix completion.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_thresh = np.maximum(s - tau, 0.0)
    return U @ np.diag(s_thresh) @ Vt

A_test = np.random.randn(5, 4)
A_thresh = nuclear_norm_prox(A_test, tau=1.0)

s_before = np.linalg.svd(A_test, compute_uv=False)
s_after  = np.linalg.svd(A_thresh, compute_uv=False)
print(f"\nSingular values before: {s_before.round(3)}")
print(f"Singular values after:  {s_after.round(3)}")
print(f"Rank before: {np.linalg.matrix_rank(A_test)}")
print(f"Rank after:  {np.linalg.matrix_rank(A_thresh)}")
# Small singular values become zero → lower rank

# Nuclear norm vs rank: the relaxation relationship
print("\nNuclear norm and rank for varying low-rank matrices:")
for k in [1, 2, 3]:
    # Build a rank-k matrix
    U_r = np.random.randn(5, k)
    V_r = np.random.randn(4, k)
    M = U_r @ V_r.T
    nuc = np.linalg.norm(M, 'nuc')
    rk = np.linalg.matrix_rank(M)
    print(f"  Rank {rk}: nuclear norm = {nuc:.4f}")

Part 7 - Distance Metrics from Norms

A distance metric d(u, v) derived from a norm satisfies:

d(u, v) = ‖u - v‖

Every norm defines a distance metric. Different choices of norm give different notions of "distance."

Euclidean distance (from L2)

d(u, v) = ‖u - v‖₂ = sqrt(Σᵢ (uᵢ - vᵢ)²)

Standard straight-line distance. Sensitive to scale - features with large variance dominate.

Manhattan distance (from L1)

d(u, v) = ‖u - v‖₁ = Σᵢ |uᵢ - vᵢ|

Distance traveled along a grid (like city blocks). Less sensitive to outliers than L2 because it does not square the differences.

Chebyshev distance (from L∞)

d(u, v) = ‖u - v‖∞ = max_i |uᵢ - vᵢ|

The maximum coordinate difference. Used in chess (king moves), minimax games, and adversarial robustness settings.

Mahalanobis distance: the norm that accounts for correlation

The Mahalanobis distance is derived from a covariance-weighted norm:

d_M(u, v) = sqrt((u - v)ᵀ Σ⁻¹ (u - v))

where Σ is the covariance matrix of the data. Mahalanobis distance:

Normalizes each feature by its standard deviation
Decorrelates correlated features
Is invariant to linear transformations of the data
Reduces to Euclidean distance when Σ = I

import numpy as np

def mahalanobis_distance(
    u: np.ndarray,
    v: np.ndarray,
    cov: np.ndarray
) -> float:
    """
    Mahalanobis distance: accounts for feature scale and correlation.

    d_M(u, v) = sqrt((u - v)ᵀ Σ⁻¹ (u - v))

    Use this when:
    - Features have different scales (instead of normalizing manually)
    - Features are correlated (instead of first running PCA)
    - Detecting outliers (data far from cluster center)
    """
    diff = u - v
    cov_inv = np.linalg.inv(cov)
    return float(np.sqrt(diff.T @ cov_inv @ diff))

# Example: correlated features
np.random.seed(42)
n = 200
# x2 ≈ 2*x1 (strongly correlated)
x1 = np.random.randn(n)
x2 = 2.0 * x1 + 0.1 * np.random.randn(n)
X = np.column_stack([x1, x2])
cov = np.cov(X.T)

u = np.array([2.0, 4.0])   # on the correlation line (x2 = 2*x1)
v = np.array([0.0, 3.0])   # off the correlation line

d_euclidean   = np.linalg.norm(u - v)
d_mahalanobis = mahalanobis_distance(u, v, cov)

print(f"Euclidean distance:    {d_euclidean:.4f}")
print(f"Mahalanobis distance:  {d_mahalanobis:.4f}")
# Mahalanobis accounts for correlation: v is farther off the data manifold

# Anomaly detection with Mahalanobis
def mahalanobis_scores(X: np.ndarray) -> np.ndarray:
    """Compute Mahalanobis distance of each point from the data centroid."""
    mu = X.mean(axis=0)
    cov = np.cov(X.T)
    cov_inv = np.linalg.inv(cov)
    diffs = X - mu
    # Vectorized: (diffs @ cov_inv) * diffs, summed over features
    scores = np.sqrt(np.sum((diffs @ cov_inv) * diffs, axis=1))
    return scores

scores = mahalanobis_scores(X)
# Points with high Mahalanobis distance are anomalies
threshold = np.percentile(scores, 95)
anomalies = np.where(scores > threshold)[0]
print(f"\nTop 5% anomaly threshold: {threshold:.2f}")
print(f"Detected {len(anomalies)} anomalies")

Distance metric comparison

Metric	Base norm	Formula	Scale-sensitive	When to use
Euclidean	L2	sqrt(Σ(uᵢ-vᵢ)²)	Yes	Default starting point
Manhattan	L1	Σ\|uᵢ-vᵢ\|	Yes	Robust to outliers
Chebyshev	L∞	max\|uᵢ-vᵢ\|	Yes	Minimax, adversarial
Mahalanobis	Σ⁻¹-weighted	sqrt(dᵀΣ⁻¹d)	No	Correlated features
Cosine	-	1 - cos(θ)	No	Embeddings, text

Part 8 - NumPy: np.linalg.norm Complete Reference

import numpy as np

# ── Vector norms ──────────────────────────────────────────────────────────
v = np.array([3.0, -4.0, 2.0, -1.0])

print("=== Vector norms ===")
print(f"L0 (non-zeros, not a true norm): {np.sum(v != 0)}")
print(f"L1:  {np.linalg.norm(v, ord=1)}")           # sum of |vᵢ| = 10.0
print(f"L2:  {np.linalg.norm(v, ord=2):.4f}")       # sqrt(30) ≈ 5.477 (default)
print(f"L2 (default, no ord):  {np.linalg.norm(v):.4f}")
print(f"L∞:  {np.linalg.norm(v, ord=np.inf)}")      # max|vᵢ| = 4.0
print(f"L-∞: {np.linalg.norm(v, ord=-np.inf)}")     # min|vᵢ| = 1.0

# ── Matrix norms ──────────────────────────────────────────────────────────
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])

print("\n=== Matrix norms ===")
print(f"Frobenius:          {np.linalg.norm(A, 'fro'):.4f}")   # sqrt of sum sq entries
print(f"Nuclear:            {np.linalg.norm(A, 'nuc'):.4f}")   # sum of singular values
print(f"Spectral (ord=2):   {np.linalg.norm(A, 2):.4f}")       # largest singular value
print(f"Max row sum (inf):  {np.linalg.norm(A, np.inf):.4f}")  # max(row sums)
print(f"Max col sum (1):    {np.linalg.norm(A, 1):.4f}")       # max(col sums)

# ── IMPORTANT: ord=inf means different things for vectors vs matrices ─────
v2 = np.array([3.0, 4.0])
M  = v2.reshape(1, -1)   # (1, 2) matrix - same numbers, different shape

print(f"\nVector L∞ (max |vᵢ|):        {np.linalg.norm(v2, ord=np.inf)}")  # 4.0
print(f"Matrix L∞ (max row sum):      {np.linalg.norm(M, ord=np.inf)}")    # 7.0 !!!
# Different! The matrix L∞ norm is max(sum of |row entries|), not max entry.

# ── Axis parameter: per-row or per-column norms ──────────────────────────
B = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [2.0, 2.0]])

row_norms = np.linalg.norm(B, axis=1)          # L2 norm of each row
col_norms = np.linalg.norm(B, ord=1, axis=0)   # L1 norm of each column

print(f"\nRow-wise L2 norms: {row_norms}")   # [5.0, 1.0, 2.828]
print(f"Col-wise L1 norms: {col_norms}")    # [6.0, 6.0]

# Normalize rows to unit length (required before dot-product similarity)
B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
print(f"Row-normalized norms: {np.linalg.norm(B_unit, axis=1).round(4)}")  # all 1.0

# ── Batch embeddings: norm over last axis ────────────────────────────────
embeddings = np.random.randn(32, 512)   # 32 embeddings of dimension 512
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)  # (32, 1)
normalized = embeddings / norms                              # (32, 512)
print(f"\nEmbedding norms (after): {np.linalg.norm(normalized, axis=1).round(4)[:5]}")
# All should be 1.0

Part 9 - ML Connections: Regularization, Embeddings, and Search

ElasticNet: combining L1 and L2

ElasticNet blends Lasso and Ridge, getting sparsity from L1 and stability from L2:

objective = (1/2n)‖Xw - y‖₂² + alpha * l1_ratio * ‖w‖₁ + (alpha/2) * (1-l1_ratio) * ‖w‖₂²

This is useful when:

There are many correlated features (Lasso picks one arbitrarily; ElasticNet spreads weight)
You want sparsity but Lasso alone is unstable

import numpy as np

def elasticnet_coordinate_descent(
    X: np.ndarray,
    y: np.ndarray,
    alpha: float = 1.0,
    l1_ratio: float = 0.5,
    n_iter: int = 1000
) -> np.ndarray:
    """
    ElasticNet via coordinate descent.

    l1_ratio = 1.0 → pure Lasso (sparse, unstable with correlations)
    l1_ratio = 0.0 → pure Ridge (dense, stable)
    l1_ratio in (0, 1) → ElasticNet blend
    """
    n, d = X.shape
    w = np.zeros(d)
    alpha1 = alpha * l1_ratio         # L1 strength
    alpha2 = alpha * (1 - l1_ratio)   # L2 strength

    for _ in range(n_iter):
        for j in range(d):
            r_j = y - X @ w + X[:, j] * w[j]
            rho_j = X[:, j] @ r_j / n
            # L2 adds alpha2 to the denominator - shrinks but does not zero
            z_j = X[:, j] @ X[:, j] / n + alpha2
            # L1 soft-thresholds - creates sparsity
            w[j] = np.sign(rho_j) * max(abs(rho_j) - alpha1, 0.0) / z_j
    return w

:::danger Cosine similarity vs Euclidean distance for embeddings - choose carefully These two metrics answer fundamentally different questions:

Cosine similarity measures the angle between vectors - direction only, ignoring magnitude:

cos(u, v) = (u · v) / (‖u‖₂ ‖v‖₂)

Use when: magnitude carries no semantic meaning - word embeddings, sentence embeddings, TF-IDF vectors. The same document expressed in two differently-scaled embeddings should be identical.

Euclidean distance measures geometric distance - direction AND magnitude:

d(u, v) = ‖u - v‖₂

Use when: magnitude carries information, or when using embeddings from a model trained with contrastive loss where magnitude was calibrated.

The critical mistake: applying Euclidean distance to un-normalized embeddings from a model trained with cosine similarity loss. A vector of magnitude 10 and one of magnitude 1 pointing in the same direction will have large Euclidean distance but zero cosine distance. If your model was trained with cosine loss (sentence-transformers, DPR, CLIP), normalize embeddings to unit length first - then Euclidean distance and cosine similarity give identical rankings. :::

Cosine similarity in embedding search

import numpy as np

def cosine_similarity_matrix(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """
    Cosine similarities between all rows of A and all rows of B.
    A: (n, d) queries, B: (m, d) documents.
    Returns (n, m) similarity matrix.
    """
    A_norm = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_norm = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_norm @ B_norm.T   # dot product of unit vectors = cosine similarity

# Demonstrate that cosine and Euclidean give different rankings on raw embeddings
np.random.seed(42)
query = np.array([1.0, 0.5, 0.0])

docs = np.array([
    [2.0, 1.0, 0.0],    # same direction, 2x magnitude
    [0.5, 0.25, 0.0],   # same direction, 0.5x magnitude
    [0.0, 1.0, 0.0],    # different direction
    [10.0, 5.0, 0.0],   # same direction, 10x magnitude
])

# Cosine similarity (direction only)
q_unit = query / np.linalg.norm(query)
d_units = docs / np.linalg.norm(docs, axis=1, keepdims=True)
cos_sims = d_units @ q_unit

# Euclidean distance (direction + magnitude)
eucl_dists = np.linalg.norm(docs - query, axis=1)

print("Cosine ranking (docs 0,1,3 all direction-identical):")
for rank, idx in enumerate(np.argsort(cos_sims)[::-1]):
    print(f"  rank {rank+1}: doc[{idx}] cos_sim={cos_sims[idx]:.4f}")

print("\nEuclidean ranking (magnitude differences dominate):")
for rank, idx in enumerate(np.argsort(eucl_dists)):
    print(f"  rank {rank+1}: doc[{idx}] dist={eucl_dists[idx]:.4f}")

Regularization path: alpha vs sparsity

import numpy as np

np.random.seed(42)
n, d = 80, 10
X = np.random.randn(n, d)
w_true = np.zeros(d)
w_true[[0, 2, 7]] = [3.0, -1.5, 2.0]
y = X @ w_true + 0.2 * np.random.randn(n)

alphas = [0.005, 0.01, 0.05, 0.1, 0.3, 0.8, 1.5]
print(f"{'Alpha':>8} | {'Non-zeros':>10} | {'‖w‖₁':>8}")
print("-" * 32)
for alpha in alphas:
    w = np.zeros(d)
    for _ in range(500):
        for j in range(d):
            r_j = y - X @ w + X[:, j] * w[j]
            rho_j = X[:, j] @ r_j / n
            z_j = X[:, j] @ X[:, j] / n
            w[j] = np.sign(rho_j) * max(abs(rho_j) - alpha, 0.0) / z_j
    n_nz = np.sum(np.abs(w) > 1e-6)
    l1 = np.sum(np.abs(w))
    print(f"{alpha:>8.3f} | {n_nz:>10} | {l1:>8.4f}")
# As alpha increases: fewer non-zeros → sparser model

Interview Questions

Q1: Explain geometrically why L1 regularization induces sparsity but L2 does not.

Regularized regression is a constrained optimization problem: find the point on the norm ball that minimizes the loss function.

L1 ball (diamond shape): In 2D, the L1 ball is a rotated square with corners at (±r, 0) and (0, ±r) - exactly on the coordinate axes. The loss function has smooth, elliptical contours. When the ellipse descends toward the diamond, it almost always makes first contact at a corner. A corner is a point where one coordinate is zero. In d dimensions, corners become even more likely contact points because there are 2d corners and the loss surface is generic.

Why corners? The L1 norm is not differentiable at the coordinate axes - its subdifferential spans a range of subgradients at each corner. This means that for any loss gradient pointing toward the corner, the corner is still optimal. Mathematically, the optimality condition 0 ∈ ∂f(w*) + λ ∂‖w*‖₁ is satisfied at a corner for all loss gradients in a finite angular range.

L2 ball (sphere shape): The L2 ball is smooth everywhere - no corners. The elliptical loss contour makes tangential contact with the sphere at a single point determined by the gradient direction. That point is generically on the interior of a face with all coordinates non-zero. L2 shrinks all weights proportionally but cannot force any weight to exactly zero.

Summary: Sparsity requires corners on coordinate axes. L1 has them; L2 does not. This is a geometric property of the constraint region, not a numerical artifact.

Q2: What is the Frobenius norm and when does it appear in ML?

The Frobenius norm of a matrix A ∈ ℝ^(m×n) is the square root of the sum of all squared entries:

‖A‖_F = sqrt(Σᵢ Σⱼ Aᵢⱼ²)

Three equivalent characterizations:

Square root of sum of squared entries
L2 norm of the vectorized (flattened) matrix
L2 norm of the singular values: ‖A‖_F = sqrt(σ₁² + σ₂² + ... + σᵣ²)

Where it appears in ML:

Weight decay in neural networks: L_total = L_task + λ Σ_layers ‖W_l‖_F² - this is the standard L2 regularization. In PyTorch, weight_decay in optimizers implements this.
Eckart-Young theorem: The best rank-k approximation of A in the Frobenius norm is the truncated SVD. The approximation error is ‖A - A_k‖_F = sqrt(σ_\{k+1\}² + ... + σ_r²).
Collaborative filtering: Matrix factorization for recommender systems minimizes ‖R - UV^T‖_F² plus regularization terms on U and V.
Gradient norms: monitoring ‖∇L‖_F during training, gradient clipping by Frobenius norm.

Use np.linalg.norm(A, 'fro') in NumPy and torch.linalg.matrix_norm(A, 'fro') in PyTorch.

Q3: What is the nuclear norm and why is it the convex relaxation of rank?

The nuclear norm of A is the sum of all singular values:

‖A‖_* = σ₁ + σ₂ + ... + σᵣ

Why it relaxes rank: The rank of A counts the number of non-zero singular values - a discrete, combinatorial quantity. Minimizing rank is NP-hard (equivalent to finding the sparsest representation). The nuclear norm replaces "count non-zero σᵢ" with "sum all σᵢ" - a continuous, convex function.

This is the exact same relaxation that L1 performs on vectors:

Vectors: L0 (count non-zeros) → relaxed to L1 (sum of |vᵢ|)
Matrices: rank (count non-zero σᵢ) → relaxed to nuclear (sum of σᵢ)

Formally, the nuclear norm is the convex envelope of the rank function on the set {A : ‖A‖₂ ≤ 1} (proven by Fazel 2002 and Recht et al. 2010).

Proximal operator: Singular value soft-thresholding - apply L1 soft-threshold to each singular value:

prox_{τ ‖·‖_*}(A) = U * diag(max(σ - τ, 0)) * Vᵀ

This is the SVT (Singular Value Thresholding) algorithm, the core subroutine for nuclear norm minimization problems like matrix completion and robust PCA.

Q4: When should you use cosine similarity vs Euclidean distance for embeddings?

The fundamental question: does magnitude carry semantic meaning in your embedding space?

Use cosine similarity when:

Magnitude is irrelevant (word2vec, GloVe, TF-IDF)
The model was explicitly trained with cosine similarity loss (most sentence-transformers, DPR, CLIP)
You want direction-only comparison (the angle between representations)

Use Euclidean distance when:

Both direction and magnitude carry information
Working with normalized embeddings (cosine ≡ Euclidean on the unit sphere)
The model's training procedure calibrated embedding magnitudes

The unit-sphere equivalence: For unit-normalized embeddings (‖u‖₂ = ‖v‖₂ = 1):

‖u - v‖₂² = 2 - 2·(u·v) = 2·(1 - cos(u,v))

So on the unit sphere, Euclidean distance and cosine similarity give identical rankings. This means: normalize embeddings to unit length, then use dot product (which equals cosine for unit vectors). This lets you use FAISS or any L2-based approximate nearest neighbor library and get cosine-equivalent rankings.

Practical rule: For sentence-transformers, CLIP, or DPR - always normalize to unit length first. For raw last-layer hidden states from a model not trained with cosine loss - use cosine similarity explicitly, since magnitudes may not be meaningful.

Practice Challenges

Level 1: Predict

Challenge: Without computing, predict which norm gives the largest value for v = [1, 1, 1, ..., 1] ∈ ℝ¹⁰⁰ (all ones):

L1 norm
L2 norm
L∞ norm

Answer

For v = [1, 1, ..., 1] ∈ ℝ¹⁰⁰:

L1 = |1| * 100 = 100
L2 = sqrt(100) = 10
L∞ = max(1,...,1) = 1

L1 is largest, L∞ is smallest. The ordering L∞ ≤ L2 ≤ L1 always holds for unit vectors (general fact: Lp decreases as p increases).

More precisely, for a d-dimensional vector with all entries equal to c:

‖v‖₁ = d·c,  ‖v‖₂ = sqrt(d)·c,  ‖v‖∞ = c

This illustrates why L1 regularization is "stricter" - it penalizes many small weights more heavily than L2 does, which is another way to understand why L1 prefers sparse solutions.

Level 2: Debug

Challenge: The following function is supposed to compute cosine similarity but returns incorrect results for some inputs. Find all bugs:

def cosine_sim(u, v):
    return u @ v / np.linalg.norm(u) * np.linalg.norm(v)

Answer

Bug: Operator precedence. Python evaluates left to right: (u @ v / norm(u)) * norm(v) instead of u @ v / (norm(u) * norm(v)).

# Wrong: operator precedence multiplies by norm(v) instead of dividing
def cosine_sim_wrong(u, v):
    return u @ v / np.linalg.norm(u) * np.linalg.norm(v)

# Fix 1: explicit parentheses
def cosine_sim_fixed_v1(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Fix 2: normalize first - cleaner and avoids precision issues
def cosine_sim_fixed_v2(u, v):
    u_hat = u / np.linalg.norm(u)
    v_hat = v / np.linalg.norm(v)
    return float(u_hat @ v_hat)

# Verify
u = np.array([3.0, 4.0])   # norm = 5
v = np.array([1.0, 0.0])   # norm = 1
# Correct: (3*1 + 4*0) / (5 * 1) = 3/5 = 0.6

print(f"Wrong:   {cosine_sim_wrong(u, v):.4f}")         # 3.0 (incorrect!)
print(f"Fixed v1: {cosine_sim_fixed_v1(u, v):.4f}")     # 0.6 (correct)
print(f"Fixed v2: {cosine_sim_fixed_v2(u, v):.4f}")     # 0.6 (correct)

# Second bug: no guard for zero-norm vectors
zero = np.zeros(2)
# cosine_sim_fixed_v1(u, zero)  # ZeroDivisionError!

def cosine_sim_safe(u, v, eps=1e-8):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom < eps:
        return 0.0
    return float(u @ v / denom)

Level 3: Design

Challenge: Implement a matrix completion algorithm using the nuclear norm proximal operator (SVT) that fills in missing entries of a rating matrix.

Answer

import numpy as np

def svt_matrix_completion(
    M_observed: np.ndarray,
    mask: np.ndarray,
    tau: float = 1.0,
    step: float = 1.9,
    n_iter: int = 200,
    tol: float = 1e-4
) -> np.ndarray:
    """
    Singular Value Thresholding for matrix completion.

    Solves: minimize ‖X‖_*  subject to  Xᵢⱼ = Mᵢⱼ  for observed (i,j)

    Algorithm:
    1. Start with X = 0
    2. Repeat:
       a. Gradient step: Y = X + step * mask * (M_observed - X)
       b. Nuclear norm prox: X = SVT(Y, tau)
    3. Until convergence
    """
    X = np.zeros_like(M_observed)

    def soft_threshold_svd(A: np.ndarray, threshold: float) -> np.ndarray:
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        s_thresh = np.maximum(s - threshold, 0.0)
        return U @ np.diag(s_thresh) @ Vt

    for iteration in range(n_iter):
        X_old = X.copy()

        # Gradient step: push toward observed values on the mask
        residual = mask * (M_observed - X)
        Y = X + step * residual

        # Nuclear norm proximal step: soft-threshold singular values
        X = soft_threshold_svd(Y, tau)

        # Check convergence
        change = np.linalg.norm(X - X_old, 'fro') / (np.linalg.norm(X_old, 'fro') + 1e-10)
        if change < tol:
            print(f"Converged at iteration {iteration + 1}")
            break

    return X

# Test: recover a low-rank matrix with 50% missing entries
np.random.seed(42)
m, n, r = 20, 15, 3

# True low-rank matrix
U_true = np.random.randn(m, r)
V_true = np.random.randn(n, r)
M_true = U_true @ V_true.T   # rank-3 matrix

# Random observation mask (50% observed)
mask = (np.random.rand(m, n) > 0.5).astype(float)
M_obs = mask * M_true  # zero out unobserved entries

# Run SVT
M_recovered = svt_matrix_completion(M_obs, mask, tau=0.5, step=1.5)

# Evaluate recovery on unobserved entries
unobserved_mask = 1 - mask
rmse = np.sqrt(np.mean((unobserved_mask * (M_recovered - M_true))**2 /
               (unobserved_mask.sum() + 1e-10) * m * n))
print(f"RMSE on unobserved entries: {rmse:.4f}")
print(f"Recovered rank: {np.linalg.matrix_rank(M_recovered, tol=1e-2)}")
print(f"True rank: {r}")

Quick Reference Cheatsheet

Norm	Formula	Constraint shape	NumPy	ML use
L1	Σ\|vᵢ\|	Diamond (corners on axes)	`norm(v, 1)`	Lasso, feature selection
L2	sqrt(Σvᵢ²)	Sphere (smooth)	`norm(v)`	Ridge, Euclidean distance
L∞	max\|vᵢ\|	Hypercube	`norm(v, inf)`	Adversarial ε-ball
Frobenius	sqrt(ΣΣAᵢⱼ²)	-	`norm(A, 'fro')`	Weight decay
Nuclear	Σσᵢ	-	`norm(A, 'nuc')`	Low-rank completion
Spectral	σ_max	-	`norm(A, 2)`	Lipschitz bounds

Key Takeaways

A norm is defined by three axioms: positive definiteness, homogeneity, and the triangle inequality - every norm is a valid distance metric
The L1 ball is a diamond with corners on coordinate axes - the loss surface hits corners first, creating exact zeros (sparsity); L1 = Lasso = feature selection
The L2 ball is a smooth sphere - the loss surface touches a generic boundary point, shrinking all weights proportionally but zeroing none; L2 = Ridge = stability
The Frobenius norm is the L2 norm of the vectorized matrix, equal to the L2 norm of its singular values - use np.linalg.norm(A, 'fro') for weight decay
The nuclear norm is the sum of singular values - the convex relaxation of rank - use it when you need low-rank matrix completion
For embedding search: cosine similarity measures angle (normalize first), Euclidean measures angle + magnitude - normalize embeddings to unit length to make them equivalent

Next: Tensors for Deep Learning →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Norms Explorer demo on the EngineersOfAI Playground - no code required.

:::

What You Will Learn​

Prerequisites​

Part 1 - What a Norm Is: Three Axioms​

Why axioms matter for ML​

The Lp family​

Part 2 - The L1 Norm: Geometry, Sparsity, and Why Lasso Works​

Definition​

The L1 ball is a diamond​

Why L1 induces sparsity: the geometric argument​

Soft-thresholding: the L1 proximal operator​

Part 3 - The L2 Norm: Geometry, Smoothness, and Why Ridge Works​

Definition​

The L2 ball is a sphere​

Why L2 does not induce sparsity​

Closed-form Ridge solution​

L2 vs L1: comparison table​

Part 4 - The L∞ Norm: Maximum Absolute Value​

Definition​

The L∞ ball is a hypercube​

When L∞ arises in ML​

Part 5 - The Frobenius Norm: L2 for Matrices​

Definition​

Connection to singular values​

Where Frobenius appears in ML​

Part 6 - The Nuclear Norm: Convex Relaxation of Rank​

The rank minimization problem​

The nuclear norm is the tightest convex relaxation​

Part 7 - Distance Metrics from Norms​

Euclidean distance (from L2)​

Manhattan distance (from L1)​

Chebyshev distance (from L∞)​

Mahalanobis distance: the norm that accounts for correlation​

Distance metric comparison​

Part 8 - NumPy: np.linalg.norm Complete Reference​

Part 9 - ML Connections: Regularization, Embeddings, and Search​

ElasticNet: combining L1 and L2​

Cosine similarity in embedding search​

Regularization path: alpha vs sparsity​

Interview Questions​

Practice Challenges​

Level 1: Predict​

Level 2: Debug​

Level 3: Design​

Quick Reference Cheatsheet​

Key Takeaways​

What You Will Learn

Prerequisites

Part 1 - What a Norm Is: Three Axioms

Why axioms matter for ML

The Lp family

Part 2 - The L1 Norm: Geometry, Sparsity, and Why Lasso Works

Definition

The L1 ball is a diamond

Why L1 induces sparsity: the geometric argument

Soft-thresholding: the L1 proximal operator

Part 3 - The L2 Norm: Geometry, Smoothness, and Why Ridge Works

Definition

The L2 ball is a sphere

Why L2 does not induce sparsity

Closed-form Ridge solution

L2 vs L1: comparison table

Part 4 - The L∞ Norm: Maximum Absolute Value

Definition

The L∞ ball is a hypercube

When L∞ arises in ML

Part 5 - The Frobenius Norm: L2 for Matrices

Definition

Connection to singular values

Where Frobenius appears in ML

Part 6 - The Nuclear Norm: Convex Relaxation of Rank

The rank minimization problem

The nuclear norm is the tightest convex relaxation

Part 7 - Distance Metrics from Norms

Euclidean distance (from L2)

Manhattan distance (from L1)

Chebyshev distance (from L∞)

Mahalanobis distance: the norm that accounts for correlation

Distance metric comparison

Part 8 - NumPy: np.linalg.norm Complete Reference

Part 9 - ML Connections: Regularization, Embeddings, and Search

ElasticNet: combining L1 and L2

Cosine similarity in embedding search

Regularization path: alpha vs sparsity

Interview Questions

Practice Challenges

Level 1: Predict

Level 2: Debug

Level 3: Design

Quick Reference Cheatsheet

Key Takeaways