SVD and Matrix Decompositions - The Netflix Algorithm and JPEG Compression
Reading time: ~25 minutes | Level: Mathematical Foundations → ML Engineering
Netflix's original recommendation algorithm was matrix factorization via SVD. Every JPEG image you have ever seen uses a compression scheme mathematically related to SVD. Latent Semantic Analysis (LSA), one of the first word embedding techniques, was computed with SVD. Google's early search engine used SVD for dimensionality reduction.
The Singular Value Decomposition is the most powerful matrix decomposition in applied mathematics - and unlike eigendecomposition, it works on any matrix, not just square ones.
This lesson teaches you SVD and the other decompositions that computational ML depends on: LU for solving linear systems, QR for orthogonalization and least squares, Cholesky for positive definite matrices.
What You Will Learn
- SVD: what it is, what it means geometrically, why it works on any matrix
- Truncated SVD: low-rank approximation for dimensionality reduction and compression
- LU decomposition: how linear systems are actually solved numerically
- QR decomposition: orthogonalization, Gram-Schmidt, and least squares
- Cholesky decomposition: why positive definite matrices have their own decomposition
- NumPy:
np.linalg.svd,np.linalg.qr,np.linalg.cholesky,scipy.linalg.lu - ML connections: collaborative filtering, image compression, LSA, PCA via SVD
Prerequisites
Part 1 - SVD: The Fundamental Theorem of Linear Algebra
The Singular Value Decomposition states: any matrix A (m×n) can be written as:
A = U Σ Vᵀ
Where:
- U is an (m×m) orthogonal matrix: the "left singular vectors"
- Σ is an (m×n) diagonal matrix with non-negative entries: the "singular values"
- V is an (n×n) orthogonal matrix: the "right singular vectors"
The diagonal entries of Σ are called singular values σ₁ ≥ σ₂ ≥ ... ≥ σₘᵢₙ ≥ 0 (sorted descending), where min = min(m,n).
Geometric interpretation: rotate → scale → rotate
The SVD decomposes any linear transformation into three simple steps:
- Vᵀ: rotate the input space to align with the "natural input directions"
- Σ: scale each dimension by the corresponding singular value σᵢ
- U: rotate the output space to the "natural output directions"
This is remarkable: any linear transformation - no matter how complex - is just two rotations and one scaling. The singular values σᵢ tell you how much each direction is stretched (or shrunk).
Relationship to eigendecomposition
SVD and eigendecomposition are deeply connected:
AᵀA = (UΣVᵀ)ᵀ (UΣVᵀ) = VΣᵀUᵀUΣVᵀ = V(ΣᵀΣ)Vᵀ
AAᵀ = (UΣVᵀ)(UΣVᵀ)ᵀ = UΣVᵀVΣᵀUᵀ = U(ΣΣᵀ)Uᵀ
So:
- Columns of V are eigenvectors of AᵀA (right singular vectors)
- Columns of U are eigenvectors of AAᵀ (left singular vectors)
- σᵢ² are eigenvalues of both AᵀA and AAᵀ (singular values = square roots of eigenvalues)
This is why sklearn.decomposition.PCA uses SVD internally: computing SVD of the centered data matrix X is equivalent to eigendecomposing the covariance matrix XᵀX - but SVD is more numerically stable.
Part 2 - Truncated SVD: Low-Rank Approximation
The full SVD stores all m×m, m×n, and n×n matrices. For an m×n matrix, this is expensive. More importantly, the best rank-k approximation to any matrix A (in the Frobenius norm) is given by keeping only the top k singular values:
Aₖ = U[:, :k] Σ[:k, :k] Vᵀ[:k, :]
= σ₁ u₁ v₁ᵀ + σ₂ u₂ v₂ᵀ + ... + σₖ uₖ vₖᵀ
This is the Eckart-Young theorem: Aₖ is the best rank-k approximation to A.
Image compression demo
import numpy as np
import matplotlib.pyplot as plt
def compress_image_with_svd(image: np.ndarray, k: int) -> np.ndarray:
"""
Compress a grayscale image by keeping only the top-k singular values.
The SVD of an m×n image has min(m,n) singular values.
Keeping k << min(m,n) achieves compression at cost of quality.
Storage comparison:
- Original: m × n values
- Compressed: k × (m + 1 + n) values = k(m+n+1)
- Compression ratio: m*n / (k*(m+n+1))
"""
# Full SVD of the image matrix
U, sigma, Vt = np.linalg.svd(image, full_matrices=False)
# U: (m, min), sigma: (min,), Vt: (min, n) - "economy" SVD
# Keep only top-k singular values
U_k = U[:, :k] # (m, k)
sigma_k = sigma[:k] # (k,)
Vt_k = Vt[:k, :] # (k, n)
# Reconstruct
reconstructed = U_k @ np.diag(sigma_k) @ Vt_k
# Clip to valid pixel range [0, 255]
return np.clip(reconstructed, 0, 255)
# Simulate a grayscale image (256×256)
np.random.seed(42)
m, n = 256, 256
# Create a structured image (not pure noise)
x = np.linspace(0, 4*np.pi, n)
y = np.linspace(0, 4*np.pi, m)
X, Y = np.meshgrid(x, y)
image = (np.sin(X) * np.cos(Y) * 127 + 128).astype(float)
# Compute SVD
U, sigma, Vt = np.linalg.svd(image, full_matrices=False)
# Show singular value decay
print("Top-20 singular values:")
for i, s in enumerate(sigma[:20]):
bar = "█" * int(s / sigma[0] * 30)
print(f" σ_{i+1:2d} = {s:8.2f} {bar}")
# Compression quality vs k
for k in [1, 5, 10, 20, 50]:
compressed = compress_image_with_svd(image, k)
# Relative approximation error
error = np.linalg.norm(image - compressed, 'fro') / np.linalg.norm(image, 'fro')
# Storage ratio (compressed / original)
storage_ratio = k * (m + 1 + n) / (m * n)
print(f"k={k:3d}: error={error:.4f}, "
f"storage={storage_ratio:.3f}x original, "
f"compression={1/storage_ratio:.1f}:1")
Collaborative filtering: the Netflix algorithm
import numpy as np
def collaborative_filtering_svd(
ratings: np.ndarray, # (n_users, n_items) - NaN for unrated
k: int = 20 # number of latent factors
) -> np.ndarray:
"""
Matrix factorization for recommendation via truncated SVD.
The idea: users×items rating matrix has low-rank structure.
A few "latent factors" (genres, moods, etc.) explain most ratings.
SVD finds these latent factors automatically.
Real Netflix Prize winner used more sophisticated algorithms,
but the core idea is this matrix factorization.
"""
# Fill missing ratings with row mean (simple imputation)
row_means = np.nanmean(ratings, axis=1, keepdims=True)
ratings_filled = np.where(np.isnan(ratings), row_means, ratings)
# Center: subtract per-user mean (remove user bias)
ratings_centered = ratings_filled - row_means
# Truncated SVD: keep top-k latent factors
U, sigma, Vt = np.linalg.svd(ratings_centered, full_matrices=False)
U_k = U[:, :k] # (n_users, k) - user embeddings
sigma_k = sigma[:k] # (k,) - importance of each factor
Vt_k = Vt[:k, :] # (k, n_items) - item embeddings
# Reconstruct full ratings matrix
reconstructed = U_k @ np.diag(sigma_k) @ Vt_k + row_means
return reconstructed, U_k, Vt_k.T # also return embeddings
# Example: 50 users, 100 movies
np.random.seed(42)
n_users, n_items = 50, 100
# True rating matrix has rank-3 structure (3 movie genres)
true_user_prefs = np.random.randn(n_users, 3) # users prefer genres differently
true_item_genre = np.random.randn(3, n_items) # items belong to genres
true_ratings = 3.5 + true_user_prefs @ true_item_genre
# Observe only 20% of ratings
mask = np.random.rand(n_users, n_items) < 0.20
observed_ratings = np.where(mask, true_ratings, np.nan)
# Predict missing ratings
predicted, user_embeddings, item_embeddings = collaborative_filtering_svd(
observed_ratings, k=3
)
# Evaluate on unobserved entries
unobserved_mask = ~mask
rmse = np.sqrt(np.mean((predicted[unobserved_mask] - true_ratings[unobserved_mask])**2))
print(f"RMSE on unobserved ratings: {rmse:.4f}")
print(f"User embedding shape: {user_embeddings.shape}") # (50, 3)
print(f"Item embedding shape: {item_embeddings.shape}") # (100, 3)
Part 3 - LU Decomposition: How Linear Systems Are Actually Solved
The LU decomposition factors a square matrix A into:
A = LU (or PA = LU with a permutation matrix P for stability)
Where:
- L is lower triangular (zeros above the diagonal)
- U is upper triangular (zeros below the diagonal)
- P is a permutation matrix (reorders rows for numerical stability)
Why LU matters for solving Ax = b
Solving Ax = b directly requires O(n³) work. But once A = LU, solving:
Ly = b(forward substitution) - O(n²)Ux = y(back substitution) - O(n²)
More importantly: if you need to solve Ax = b for multiple right-hand sides b₁, b₂, ..., you only compute LU once (O(n³)) and then solve each system in O(n²).
This is exactly what np.linalg.solve does internally - it calls LAPACK's LU factorization.
import numpy as np
from scipy import linalg
# LU decomposition (scipy has better LU interface than numpy)
A = np.array([[2.0, 1.0, -1.0],
[-3.0, -1.0, 2.0],
[-2.0, 1.0, 2.0]])
# scipy.linalg.lu returns P, L, U such that A = P @ L @ U
P, L, U = linalg.lu(A)
print("P (permutation):\n", P)
print("L (lower triangular):\n", L.round(4))
print("U (upper triangular):\n", U.round(4))
print(f"Reconstruction error: {np.linalg.norm(A - P @ L @ U):.2e}")
# Solving multiple systems with the same A
# Real use case: solving A @ X = B where B has multiple columns
B = np.random.randn(3, 5) # 5 different right-hand sides
# Method 1: solve each separately (inefficient)
X1 = np.column_stack([np.linalg.solve(A, B[:, i]) for i in range(5)])
# Method 2: solve all at once (numpy handles this)
X2 = np.linalg.solve(A, B) # B is a matrix - solves all columns at once
print(f"Solutions match: {np.allclose(X1, X2)}")
# Method 3: factorize once, solve multiple times (scipy)
lu_factor = linalg.lu_factor(A) # returns LU factorization
X3 = linalg.lu_solve(lu_factor, B) # O(n²) per solve
print(f"lu_solve matches: {np.allclose(X1, X3)}")
Part 4 - QR Decomposition: Orthogonalization and Least Squares
The QR decomposition factors a matrix A (m×n, m ≥ n) into:
A = QR
Where:
- Q is (m×m) orthogonal: Qᵀ = Q⁻¹, columns are orthonormal
- R is (m×n) upper triangular
Gram-Schmidt process: building orthonormal bases
QR decomposition is the numerical implementation of the Gram-Schmidt orthogonalization process, which takes a set of linearly independent vectors and produces orthonormal vectors spanning the same space.
The Gram-Schmidt algorithm:
Given vectors a₁, a₂, ..., aₙ:
u₁ = a₁
e₁ = u₁ / ‖u₁‖
u₂ = a₂ - (a₂·e₁)e₁ ← remove component along e₁
e₂ = u₂ / ‖u₂‖
u₃ = a₃ - (a₃·e₁)e₁ - (a₃·e₂)e₂ ← remove components along e₁, e₂
e₃ = u₃ / ‖u₃‖
Each new vector is made orthogonal to all previous ones.
Least squares via QR
The least squares problem min ‖Ax - b‖₂ can be solved via QR:
A = QR
Ax = b → QRx = b → Rx = Qᵀb
Since R is upper triangular: solve by back-substitution
This is numerically more stable than the normal equations (AᵀA)x = Aᵀb because we never form the potentially ill-conditioned AᵀA.
import numpy as np
# QR decomposition
A = np.random.randn(5, 3) # overdetermined system: 5 equations, 3 unknowns
# np.linalg.qr returns "economy" QR by default (mode='reduced')
Q, R = np.linalg.qr(A)
print(f"Q shape: {Q.shape}") # (5, 3) - economy QR
print(f"R shape: {R.shape}") # (3, 3)
# Verify orthogonality: Qᵀ Q = I
print(f"Qᵀ Q = I: {np.allclose(Q.T @ Q, np.eye(3))}") # True
# Verify reconstruction: A = QR
print(f"A = QR: {np.allclose(A, Q @ R)}") # True
# Gram-Schmidt from scratch
def gram_schmidt(A: np.ndarray) -> tuple:
"""
QR decomposition via classical Gram-Schmidt orthogonalization.
(NumPy uses modified Gram-Schmidt or Householder - more stable)
"""
m, n = A.shape
Q = np.zeros((m, n))
R = np.zeros((n, n))
for j in range(n):
# Start with the j-th column of A
v = A[:, j].copy()
# Remove projections onto all previous Q columns
for i in range(j):
R[i, j] = Q[:, i] @ A[:, j] # R[i,j] = qᵢᵀaⱼ
v -= R[i, j] * Q[:, i]
# Normalize
R[j, j] = np.linalg.norm(v)
Q[:, j] = v / R[j, j]
return Q, R
Q_gs, R_gs = gram_schmidt(A)
print(f"Gram-Schmidt Q orthogonal: {np.allclose(Q_gs.T @ Q_gs, np.eye(3))}") # True
# Least squares via QR (numerically preferred over normal equations)
def least_squares_qr(A: np.ndarray, b: np.ndarray) -> np.ndarray:
"""
Solve min ‖Ax - b‖₂ using QR decomposition.
More stable than (AᵀA)⁻¹Aᵀb for ill-conditioned A.
"""
Q, R = np.linalg.qr(A) # A = QR
Qtb = Q.T @ b # Qᵀb - projection onto column space of A
x = np.linalg.solve(R, Qtb) # Back-substitution: Rx = Qᵀb
return x
b = np.random.randn(5)
x_qr = least_squares_qr(A, b)
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0] # NumPy's built-in
print(f"QR and lstsq match: {np.allclose(x_qr, x_lstsq)}") # True
Part 5 - Cholesky Decomposition: The Symmetric PD Special Case
The Cholesky decomposition applies only to symmetric positive definite (SPD) matrices. It factors A into:
A = LLᵀ
Where L is a lower triangular matrix with positive diagonal entries.
Why SPD matrices get their own decomposition
- Twice as fast as LU decomposition (exploit symmetry)
- More stable: no pivoting needed for SPD matrices
- Test for positive definiteness: Cholesky fails if A is not SPD - this is a reliable test
SPD matrices appear constantly in ML:
- Covariance matrices (Gaussian distributions)
- Gram matrices XᵀX (when X has full column rank)
- Kernel matrices K(xᵢ, xⱼ) (positive definite kernels by definition)
- Regularized Hessians H + εI (used in Newton's method)
import numpy as np
# Cholesky decomposition
n = 5
A = np.random.randn(n, n)
S = A.T @ A + np.eye(n) # symmetric positive definite (SPD) - add I ensures PD
L = np.linalg.cholesky(S)
print(f"L shape: {L.shape}") # (n, n)
print(f"L is lower triangular: {np.allclose(L, np.tril(L))}") # True
print(f"LLᵀ = S: {np.allclose(L @ L.T, S)}") # True
# Test for positive definiteness via Cholesky
def is_positive_definite(A: np.ndarray) -> bool:
"""Test if A is positive definite using Cholesky."""
try:
np.linalg.cholesky(A)
return True
except np.linalg.LinAlgError:
return False
pd_matrix = np.eye(3) + 0.1 * np.random.randn(3, 3)
pd_matrix = pd_matrix.T @ pd_matrix # definitely PD
not_pd = np.array([[1.0, 2.0],
[2.0, 1.0]]) # eigenvalues: 3, -1 → not PD
print(f"PD matrix is PD: {is_positive_definite(pd_matrix)}") # True
print(f"not_pd is PD: {is_positive_definite(not_pd)}") # False
# Solving SPD linear systems: use Cholesky for 2× speedup
from scipy import linalg
# For Gaussian distributions: need to solve Σx = b
# Cholesky is the preferred method
n_dim = 50
cov = np.random.randn(n_dim, n_dim)
cov = cov.T @ cov + 0.1 * np.eye(n_dim) # SPD covariance matrix
b = np.random.randn(n_dim)
# Method 1: LU via solve (general)
x_lu = np.linalg.solve(cov, b)
# Method 2: Cholesky (2× faster for large SPD matrices)
L = np.linalg.cholesky(cov)
# Solve LLᵀx = b:
# Step 1: Ly = b (forward substitution)
y = linalg.solve_triangular(L, b, lower=True)
# Step 2: Lᵀx = y (back substitution)
x_chol = linalg.solve_triangular(L.T, y, lower=False)
print(f"Cholesky and LU solutions match: {np.allclose(x_lu, x_chol)}")
# Sampling from multivariate Gaussian N(μ, Σ)
def sample_multivariate_normal(
mean: np.ndarray,
cov: np.ndarray,
n_samples: int
) -> np.ndarray:
"""
Sample from N(mean, cov) using Cholesky decomposition.
x = mean + L @ z, where z ~ N(0, I) and L is Cholesky factor of cov.
"""
d = len(mean)
L = np.linalg.cholesky(cov)
z = np.random.randn(n_samples, d)
return mean + z @ L.T # (n_samples, d)
samples = sample_multivariate_normal(
mean=np.zeros(5),
cov=cov[:5, :5],
n_samples=1000
)
print(f"Sample shape: {samples.shape}") # (1000, 5)
print(f"Sample mean ≈ 0: {np.allclose(samples.mean(axis=0), 0, atol=0.1)}")
Part 6 - NumPy: Complete Decomposition Reference
import numpy as np
from scipy import linalg
m, n = 6, 4
A = np.random.randn(m, n)
S = A.T @ A + np.eye(n) # SPD matrix (n×n)
# ── SVD ────────────────────────────────────────────────────────────────────
# Full SVD
U_full, sigma_full, Vt_full = np.linalg.svd(A, full_matrices=True)
print(f"Full SVD: U={U_full.shape}, σ={sigma_full.shape}, Vᵀ={Vt_full.shape}")
# U: (m, m), sigma: (min(m,n),), Vt: (n, n)
# Economy (thin) SVD - usually what you want
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
print(f"Economy SVD: U={U.shape}, σ={sigma.shape}, Vᵀ={Vt.shape}")
# U: (m, k), sigma: (k,), Vt: (k, n) where k = min(m,n)
# Reconstruct A
A_reconstructed = U @ np.diag(sigma) @ Vt
print(f"SVD reconstruction error: {np.linalg.norm(A - A_reconstructed):.2e}")
# Rank via SVD (most numerically stable method)
rank = np.sum(sigma > sigma[0] * max(m, n) * np.finfo(float).eps)
print(f"Matrix rank: {rank}")
# Condition number = σmax / σmin
cond = sigma[0] / sigma[-1]
print(f"Condition number: {cond:.4f}")
# ── Truncated SVD (scikit-learn style) ────────────────────────────────────
def truncated_svd(A: np.ndarray, k: int) -> tuple:
"""Low-rank approximation: keep top-k singular values."""
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
return U[:, :k], sigma[:k], Vt[:k, :]
U_k, sigma_k, Vt_k = truncated_svd(A, k=2)
A_approx = U_k @ np.diag(sigma_k) @ Vt_k
# Frobenius norm of approximation error
error = np.linalg.norm(A - A_approx, 'fro')
# Theory: ‖A - Aₖ‖_F = √(σₖ₊₁² + ... + σₘᵢₙ²)
theoretical_error = np.sqrt(np.sum(sigma[2:]**2))
print(f"Approx error: {error:.4f}, theoretical: {theoretical_error:.4f}")
# ── LU decomposition ───────────────────────────────────────────────────────
A_sq = np.random.randn(4, 4) # square matrix
P, L, U = linalg.lu(A_sq) # P @ A = L @ U
print(f"\nLU: P={P.shape}, L={L.shape}, U={U.shape}")
print(f"Reconstruction: {np.allclose(P @ A_sq, L @ U)}")
# ── QR decomposition ───────────────────────────────────────────────────────
Q, R = np.linalg.qr(A) # economy QR
Q_full, R_full = np.linalg.qr(A, mode='complete') # full QR
print(f"\nQR economy: Q={Q.shape}, R={R.shape}")
print(f"Q orthonormal: {np.allclose(Q.T @ Q, np.eye(n))}")
print(f"A = QR: {np.allclose(A, Q @ R)}")
# ── Cholesky decomposition ─────────────────────────────────────────────────
L_chol = np.linalg.cholesky(S) # L such that S = L @ L.T
print(f"\nCholesky: L={L_chol.shape}")
print(f"S = LLᵀ: {np.allclose(S, L_chol @ L_chol.T)}")
print(f"L lower triangular: {np.allclose(L_chol, np.tril(L_chol))}")
Part 7 - ML Connections: Decompositions in Production
PCA via SVD (what sklearn actually does)
import numpy as np
def pca_via_svd(X: np.ndarray, n_components: int) -> tuple:
"""
PCA using SVD - the numerically stable approach.
Why SVD instead of eigendecomposing the covariance matrix?
1. SVD operates on X directly (n_samples × n_features)
2. Avoids forming XᵀX (which squares the condition number)
3. More stable for thin matrices (n >> d)
This is what sklearn.decomposition.PCA does internally.
"""
n, d = X.shape
# Center the data
mean = X.mean(axis=0)
X_centered = X - mean
# SVD of centered data matrix
# U: (n, k), sigma: (k,), Vt: (k, d)
U, sigma, Vt = np.linalg.svd(X_centered, full_matrices=False)
# Principal components = rows of Vt (right singular vectors)
components = Vt[:n_components, :] # (n_components, d)
# Explained variance = (σᵢ / √(n-1))² = σᵢ² / (n-1)
explained_variance = sigma[:n_components]**2 / (n - 1)
explained_variance_ratio = explained_variance / (sigma**2 / (n - 1)).sum()
# Projected data
X_projected = X_centered @ components.T # (n, n_components)
return X_projected, components, explained_variance_ratio
# Verify matches sklearn
from sklearn.decomposition import PCA
np.random.seed(42)
X = np.random.randn(100, 20)
k = 5
X_svd, comps_svd, var_svd = pca_via_svd(X, k)
sk_pca = PCA(n_components=k).fit(X)
X_sklearn = sk_pca.transform(X)
print(f"Explained variance ratio (SVD): {var_svd.round(4)}")
print(f"Explained variance ratio (sklearn): {sk_pca.explained_variance_ratio_.round(4)}")
# Exact match (up to sign of components)
LSA: Latent Semantic Analysis
import numpy as np
def latent_semantic_analysis(
term_doc_matrix: np.ndarray,
k: int = 100
) -> tuple:
"""
Latent Semantic Analysis: find k latent topics in a document corpus.
term_doc_matrix: (n_terms, n_documents) - typically TF-IDF values
Returns: term_embeddings (n_terms, k), doc_embeddings (n_docs, k)
The k left singular vectors = directions in term space (topics)
The k right singular vectors = directions in document space
"""
U, sigma, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
# Term embeddings in latent space
term_embeddings = U[:, :k] @ np.diag(sigma[:k]) # (n_terms, k)
# Doc embeddings in latent space
doc_embeddings = Vt[:k, :].T # (n_docs, k)
# Similarity between documents: dot product in latent space
doc_similarity = doc_embeddings @ doc_embeddings.T
return term_embeddings, doc_embeddings, doc_similarity
# Conceptual example
np.random.seed(42)
n_terms, n_docs = 500, 200
tfidf_matrix = np.random.exponential(0.5, (n_terms, n_docs))
term_emb, doc_emb, doc_sim = latent_semantic_analysis(tfidf_matrix, k=50)
print(f"Term embeddings: {term_emb.shape}") # (500, 50)
print(f"Document embeddings: {doc_emb.shape}") # (200, 50)
print(f"Document similarity: {doc_sim.shape}") # (200, 200)
Part 8 - Common Failure Modes
:::danger SVD can be slow on large matrices - use truncated SVD Full SVD on an (m×n) matrix costs O(min(m,n) × m × n). For large NLP or vision matrices, this is prohibitive.
# SLOW: full SVD on large matrix
U, sigma, Vt = np.linalg.svd(large_matrix) # compute all singular values
# FAST: truncated SVD (only top-k singular values)
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=50) # only k=50 components
X_reduced = svd.fit_transform(large_matrix)
# FAST (scipy): randomized SVD for very large matrices
from sklearn.utils.extmath import randomized_svd
U, sigma, Vt = randomized_svd(large_matrix, n_components=50, random_state=42)
For matrices > 10,000 × 10,000, use randomized SVD (sklearn's randomized_svd), which computes an approximate truncated SVD in O(k × m × n) instead of O(min(m,n) × m × n).
:::
:::danger Cholesky fails on non-positive-definite matrices - add regularization
# This will fail if S has any eigenvalue ≤ 0
try:
L = np.linalg.cholesky(S)
except np.linalg.LinAlgError:
print("Matrix is not positive definite!")
# Fix: add small diagonal regularization
eps = 1e-6
L = np.linalg.cholesky(S + eps * np.eye(S.shape[0]))
Covariance matrices estimated from data can be near-singular (fewer samples than features). Always add a small regularization εI before Cholesky decomposition in production code.
:::
:::tip Use the appropriate decomposition for your matrix structure
- General square:
np.linalg.solve(LU internally) - Rectangular / overdetermined:
np.linalg.lstsq(QR internally) - Symmetric positive definite:
np.linalg.cholesky(2× faster than LU) - Any shape, low-rank approximation:
np.linalg.svdwith truncation - PCA on data matrix:
np.linalg.svdon centered data (more stable than covariance eigendecomp) :::
Interview Questions
Q1: What is the SVD and why is it more general than eigendecomposition?
SVD: Any m×n matrix A = U Σ Vᵀ where U, V are orthogonal and Σ is diagonal.
More general than eigendecomposition because:
- Eigendecomposition requires a square matrix AND n linearly independent eigenvectors
- SVD works on ANY matrix (rectangular, singular, non-diagonalizable)
- When A is symmetric positive definite: SVD = eigendecomposition (U = V, σᵢ = λᵢ)
Why this matters for ML:
- Data matrices X are almost never square (n_samples ≠ n_features)
- Even square matrices might not be diagonalizable (non-symmetric weight matrices)
- SVD always exists and always gives the best low-rank approximation (Eckart-Young theorem)
This is why sklearn.PCA uses SVD on the data matrix rather than eigendecomposing the covariance matrix - SVD handles the rectangular data matrix directly and avoids squaring the condition number.
Q2: Explain the Eckart-Young theorem and its ML implications.
Eckart-Young theorem: The best rank-k approximation to matrix A (in both Frobenius and spectral norm) is the truncated SVD:
Aₖ = σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ + ... + σₖuₖvₖᵀ
Proof intuition: Each rank-1 term σᵢuᵢvᵢᵀ captures the "principal direction" of variation in A. By taking the top-k, we capture the most variation with the fewest terms.
ML implications:
- PCA is optimal compression: The first k principal components give the best k-dimensional representation of the data (minimum reconstruction error)
- Collaborative filtering: The top-k latent factors capture the dominant rating patterns (genres, styles)
- Image compression: Keeping k singular values gives the best rank-k image approximation for any k
- Low-rank approximation for large models: LoRA uses the assumption that fine-tuning updates have low intrinsic rank (SVD-based justification)
Q3: When should you use QR decomposition instead of the normal equations?
For solving least squares min ‖Ax - b‖₂:
Normal equations approach: form AᵀA and solve (AᵀA)x = Aᵀb
- Problem:
AᵀAsquares the condition number: κ(AᵀA) = κ(A)² - If κ(A) = 1000, κ(AᵀA) = 10⁶ - 6 digits of accuracy lost in float64
QR approach: factor A = QR, solve Rx = Qᵀb
- Advantage: works with κ(A) directly - no squaring
- Preferred whenever A is near-rank-deficient or poorly conditioned
Rule of thumb:
- Well-conditioned A (κ < 1000): normal equations are fine, slightly faster
- Near-rank-deficient A (κ > 10⁶): always use QR or SVD
- In general ML code: use
np.linalg.lstsq(uses QR internally) and avoid normal equations
Q4: What is the Cholesky decomposition and when is it preferred over LU?
Cholesky: for symmetric positive definite (SPD) matrix A, factor as A = LLᵀ where L is lower triangular.
Preferred over LU when:
- A is symmetric positive definite: exploit structure for 2× speedup
- No pivoting needed: Cholesky is unconditionally stable for SPD matrices (unlike LU which requires pivoting for stability)
- Sampling from Gaussians:
x = mean + L @ zwhere z ~ N(0,I) generates x ~ N(mean, A) - Testing for PD: Cholesky throws LinAlgError if A is not PD - reliable detection
Where SPD matrices appear:
- Covariance matrices Σ (Gaussian distributions)
- Gram matrices XᵀX (kernel methods, ridge regression)
- Regularized Hessians H + εI (Newton's method)
- Kernel matrices K(xᵢ, xⱼ) (SVMs, Gaussian processes)
Practice Challenges
Level 1: Predict
Challenge: A data matrix X has shape (200, 50) and rank 10. What are the dimensions of U, Σ, V in the economy SVD? How many non-zero singular values are there?
Answer
Economy SVD of X (200×50):
- U: (200, 50) - m × min(m,n)
- Σ (as vector): (50,) - min(m,n) singular values
- Vᵀ: (50, 50) - min(m,n) × n
But since rank = 10, only the first 10 singular values are non-zero (σ₁ ≥ σ₂ ≥ ... ≥ σ₁₀ > 0 = σ₁₁ = ... = σ₅₀).
The approximation error of a rank-k approximation for k ≥ 10 is exactly zero (since A is already rank 10):
‖A - Aₖ‖_F = 0 for k ≥ 10
Level 2: Debug
Challenge: The following code for PCA reconstruction is supposed to return the original data (zero error), but the reconstruction error is large. Find the bug:
X = np.random.randn(100, 20)
X_centered = X - X.mean(axis=0)
U, sigma, Vt = np.linalg.svd(X_centered, full_matrices=False)
k = 5
X_projected = X_centered @ Vt[:k, :].T
# Bug: try to reconstruct
X_reconstructed = X_projected @ Vt[:k, :]
print(np.linalg.norm(X_centered - X_reconstructed)) # large error
Answer
Bug: Projection onto k=5 components destroys information from the remaining 15 components. The reconstruction from 5 components back to 20 dimensions can only recover the rank-5 approximation, not the original data.
There is no bug in the code per se - this is expected behavior. A rank-5 projection will not reconstruct a full-rank matrix.
If the goal is zero error: use all components (k = min(m,n)):
U, sigma, Vt = np.linalg.svd(X_centered, full_matrices=False)
# Full reconstruction (no compression)
X_reconstructed = U @ np.diag(sigma) @ Vt
print(np.linalg.norm(X_centered - X_reconstructed)) # ≈ 0 (machine epsilon)
If the goal is lossy compression: the error is expected and equals √(σ_{k+1}² + ... + σₘᵢₙ²).
Level 3: Design
Challenge: Implement a low-rank matrix completion algorithm for collaborative filtering. The algorithm should: (1) fill missing entries with user means, (2) apply truncated SVD, (3) iterate until convergence.
Answer
import numpy as np
def svd_matrix_completion(
R: np.ndarray, # (n_users, n_items) with NaN for missing
k: int = 10, # rank of approximation
n_iter: int = 20 # number of iterations
) -> np.ndarray:
"""
SVD-based matrix completion (collaborative filtering).
Algorithm:
1. Initialize missing entries (user means)
2. Compute rank-k SVD approximation
3. Replace observed entries with original values
4. Repeat until convergence
This is the "hard impute with SVD" algorithm.
"""
observed = ~np.isnan(R)
# Initialize: fill with user row means
row_means = np.nanmean(R, axis=1, keepdims=True)
M = np.where(observed, R, row_means)
prev_error = np.inf
for iteration in range(n_iter):
# Rank-k SVD approximation
U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
M_approx = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]
# Update: keep observed entries, fill unobserved with SVD predictions
M_new = np.where(observed, R, M_approx)
# Convergence check
error = np.linalg.norm(M_new - M, 'fro')
if error < 1e-6:
print(f"Converged at iteration {iteration + 1}")
break
M = M_new
return M_approx
# Test
np.random.seed(42)
n_users, n_items = 50, 100
# True low-rank ratings
true_U = np.random.randn(n_users, 3)
true_V = np.random.randn(n_items, 3)
true_R = 3.5 + true_U @ true_V.T + 0.1 * np.random.randn(n_users, n_items)
# Observe 20% of ratings
mask = np.random.rand(n_users, n_items) < 0.20
observed_R = np.where(mask, true_R, np.nan)
# Predict
predicted_R = svd_matrix_completion(observed_R, k=3, n_iter=50)
# Evaluate on unobserved
unobserved = ~mask
rmse = np.sqrt(np.mean((predicted_R[unobserved] - true_R[unobserved])**2))
print(f"RMSE on held-out ratings: {rmse:.4f}")
Quick Reference Cheatsheet
| Decomposition | Formula | NumPy | Best for |
|---|---|---|---|
| SVD (full) | A = UΣVᵀ | np.linalg.svd(A) | Any matrix analysis |
| SVD (economy) | A ≈ UₖΣₖVₖᵀ | np.linalg.svd(A, full_matrices=False) | Low-rank approx |
| SVD (truncated) | Aₖ (best rank-k) | sklearn.utils.extmath.randomized_svd | Large matrices |
| LU | PA = LU | scipy.linalg.lu(A) | Square systems |
| LU factor+solve | Ax = b | scipy.linalg.lu_factor + lu_solve | Multiple RHS |
| QR | A = QR | np.linalg.qr(A) | Least squares, orthogonalization |
| Cholesky | A = LLᵀ | np.linalg.cholesky(A) | SPD matrices, 2× faster |
| Condition number | σmax/σmin | sigma[0]/sigma[-1] after SVD | Numerical stability |
| Rank (via SVD) | #nonzero σᵢ | np.linalg.matrix_rank(A) | Intrinsic dimensionality |
Key Takeaways
- SVD works on any matrix (A = UΣVᵀ) and gives the best low-rank approximation - it is more general than eigendecomposition
- Singular values measure the "importance" of each direction: large σᵢ = major direction, small σᵢ = minor direction
- The Eckart-Young theorem guarantees the truncated SVD is optimal for compression and dimensionality reduction
- LU decomposition is what
np.linalg.solveuses internally - understand it to know whysolvebeatsinv - QR decomposition orthogonalizes a set of vectors and is more numerically stable than normal equations for least squares
- Cholesky decomposition is twice as fast as LU for symmetric positive definite matrices and appears everywhere covariance matrices are inverted
Next: Linear Transformations - The Geometry of Neural Network Layers →
:::tip 🎮 Interactive Playground
Visualize this concept: Try the SVD Explorer demo on the EngineersOfAI Playground - no code required.
:::
