SVD and Matrix Decompositions - The Netflix Algorithm and JPEG Compression

Reading time: ~25 minutes | Level: Mathematical Foundations → ML Engineering

Netflix's original recommendation algorithm was matrix factorization via SVD. Every JPEG image you have ever seen uses a compression scheme mathematically related to SVD. Latent Semantic Analysis (LSA), one of the first word embedding techniques, was computed with SVD. Google's early search engine used SVD for dimensionality reduction.

The Singular Value Decomposition is the most powerful matrix decomposition in applied mathematics - and unlike eigendecomposition, it works on any matrix, not just square ones.

This lesson teaches you SVD and the other decompositions that computational ML depends on: LU for solving linear systems, QR for orthogonalization and least squares, Cholesky for positive definite matrices.

What You Will Learn

SVD: what it is, what it means geometrically, why it works on any matrix
Truncated SVD: low-rank approximation for dimensionality reduction and compression
LU decomposition: how linear systems are actually solved numerically
QR decomposition: orthogonalization, Gram-Schmidt, and least squares
Cholesky decomposition: why positive definite matrices have their own decomposition
NumPy: np.linalg.svd, np.linalg.qr, np.linalg.cholesky, scipy.linalg.lu
ML connections: collaborative filtering, image compression, LSA, PCA via SVD

Prerequisites

Part 1 - SVD: The Fundamental Theorem of Linear Algebra

The Singular Value Decomposition states: any matrix A (m×n) can be written as:

A = U Σ Vᵀ

Where:

U is an (m×m) orthogonal matrix: the "left singular vectors"
Σ is an (m×n) diagonal matrix with non-negative entries: the "singular values"
V is an (n×n) orthogonal matrix: the "right singular vectors"

The diagonal entries of Σ are called singular values σ₁ ≥ σ₂ ≥ ... ≥ σₘᵢₙ ≥ 0 (sorted descending), where min = min(m,n).

Geometric interpretation: rotate → scale → rotate

The SVD decomposes any linear transformation into three simple steps:

Vᵀ: rotate the input space to align with the "natural input directions"
Σ: scale each dimension by the corresponding singular value σᵢ
U: rotate the output space to the "natural output directions"

This is remarkable: any linear transformation - no matter how complex - is just two rotations and one scaling. The singular values σᵢ tell you how much each direction is stretched (or shrunk).

Relationship to eigendecomposition

SVD and eigendecomposition are deeply connected:

AᵀA = (UΣVᵀ)ᵀ (UΣVᵀ) = VΣᵀUᵀUΣVᵀ = V(ΣᵀΣ)Vᵀ
AAᵀ = (UΣVᵀ)(UΣVᵀ)ᵀ = UΣVᵀVΣᵀUᵀ = U(ΣΣᵀ)Uᵀ

So:

Columns of V are eigenvectors of AᵀA (right singular vectors)
Columns of U are eigenvectors of AAᵀ (left singular vectors)
σᵢ² are eigenvalues of both AᵀA and AAᵀ (singular values = square roots of eigenvalues)

This is why sklearn.decomposition.PCA uses SVD internally: computing SVD of the centered data matrix X is equivalent to eigendecomposing the covariance matrix XᵀX - but SVD is more numerically stable.

Part 2 - Truncated SVD: Low-Rank Approximation

The full SVD stores all m×m, m×n, and n×n matrices. For an m×n matrix, this is expensive. More importantly, the best rank-k approximation to any matrix A (in the Frobenius norm) is given by keeping only the top k singular values:

Aₖ = U[:, :k] Σ[:k, :k] Vᵀ[:k, :]
   = σ₁ u₁ v₁ᵀ + σ₂ u₂ v₂ᵀ + ... + σₖ uₖ vₖᵀ

This is the Eckart-Young theorem: Aₖ is the best rank-k approximation to A.

Image compression demo

import numpy as np
import matplotlib.pyplot as plt

def compress_image_with_svd(image: np.ndarray, k: int) -> np.ndarray:
    """
    Compress a grayscale image by keeping only the top-k singular values.

    The SVD of an m×n image has min(m,n) singular values.
    Keeping k << min(m,n) achieves compression at cost of quality.

    Storage comparison:
    - Original: m × n values
    - Compressed: k × (m + 1 + n) values = k(m+n+1)
    - Compression ratio: m*n / (k*(m+n+1))
    """
    # Full SVD of the image matrix
    U, sigma, Vt = np.linalg.svd(image, full_matrices=False)
    # U: (m, min), sigma: (min,), Vt: (min, n) - "economy" SVD

    # Keep only top-k singular values
    U_k = U[:, :k]           # (m, k)
    sigma_k = sigma[:k]      # (k,)
    Vt_k = Vt[:k, :]         # (k, n)

    # Reconstruct
    reconstructed = U_k @ np.diag(sigma_k) @ Vt_k

    # Clip to valid pixel range [0, 255]
    return np.clip(reconstructed, 0, 255)

# Simulate a grayscale image (256×256)
np.random.seed(42)
m, n = 256, 256

# Create a structured image (not pure noise)
x = np.linspace(0, 4*np.pi, n)
y = np.linspace(0, 4*np.pi, m)
X, Y = np.meshgrid(x, y)
image = (np.sin(X) * np.cos(Y) * 127 + 128).astype(float)

# Compute SVD
U, sigma, Vt = np.linalg.svd(image, full_matrices=False)

# Show singular value decay
print("Top-20 singular values:")
for i, s in enumerate(sigma[:20]):
    bar = "█" * int(s / sigma[0] * 30)
    print(f"  σ_{i+1:2d} = {s:8.2f}  {bar}")

# Compression quality vs k
for k in [1, 5, 10, 20, 50]:
    compressed = compress_image_with_svd(image, k)
    # Relative approximation error
    error = np.linalg.norm(image - compressed, 'fro') / np.linalg.norm(image, 'fro')

    # Storage ratio (compressed / original)
    storage_ratio = k * (m + 1 + n) / (m * n)

    print(f"k={k:3d}: error={error:.4f}, "
          f"storage={storage_ratio:.3f}x original, "
          f"compression={1/storage_ratio:.1f}:1")

Collaborative filtering: the Netflix algorithm

import numpy as np

def collaborative_filtering_svd(
    ratings: np.ndarray,  # (n_users, n_items) - NaN for unrated
    k: int = 20           # number of latent factors
) -> np.ndarray:
    """
    Matrix factorization for recommendation via truncated SVD.

    The idea: users×items rating matrix has low-rank structure.
    A few "latent factors" (genres, moods, etc.) explain most ratings.
    SVD finds these latent factors automatically.

    Real Netflix Prize winner used more sophisticated algorithms,
    but the core idea is this matrix factorization.
    """
    # Fill missing ratings with row mean (simple imputation)
    row_means = np.nanmean(ratings, axis=1, keepdims=True)
    ratings_filled = np.where(np.isnan(ratings), row_means, ratings)

    # Center: subtract per-user mean (remove user bias)
    ratings_centered = ratings_filled - row_means

    # Truncated SVD: keep top-k latent factors
    U, sigma, Vt = np.linalg.svd(ratings_centered, full_matrices=False)
    U_k = U[:, :k]        # (n_users, k) - user embeddings
    sigma_k = sigma[:k]   # (k,) - importance of each factor
    Vt_k = Vt[:k, :]      # (k, n_items) - item embeddings

    # Reconstruct full ratings matrix
    reconstructed = U_k @ np.diag(sigma_k) @ Vt_k + row_means

    return reconstructed, U_k, Vt_k.T  # also return embeddings

# Example: 50 users, 100 movies
np.random.seed(42)
n_users, n_items = 50, 100

# True rating matrix has rank-3 structure (3 movie genres)
true_user_prefs = np.random.randn(n_users, 3)   # users prefer genres differently
true_item_genre = np.random.randn(3, n_items)    # items belong to genres
true_ratings = 3.5 + true_user_prefs @ true_item_genre

# Observe only 20% of ratings
mask = np.random.rand(n_users, n_items) < 0.20
observed_ratings = np.where(mask, true_ratings, np.nan)

# Predict missing ratings
predicted, user_embeddings, item_embeddings = collaborative_filtering_svd(
    observed_ratings, k=3
)

# Evaluate on unobserved entries
unobserved_mask = ~mask
rmse = np.sqrt(np.mean((predicted[unobserved_mask] - true_ratings[unobserved_mask])**2))
print(f"RMSE on unobserved ratings: {rmse:.4f}")
print(f"User embedding shape: {user_embeddings.shape}")   # (50, 3)
print(f"Item embedding shape: {item_embeddings.shape}")   # (100, 3)

Part 3 - LU Decomposition: How Linear Systems Are Actually Solved

The LU decomposition factors a square matrix A into:

A = LU    (or PA = LU with a permutation matrix P for stability)

Where:

L is lower triangular (zeros above the diagonal)
U is upper triangular (zeros below the diagonal)
P is a permutation matrix (reorders rows for numerical stability)

Why LU matters for solving Ax = b

Solving Ax = b directly requires O(n³) work. But once A = LU, solving:

Ly = b (forward substitution) - O(n²)
Ux = y (back substitution) - O(n²)

More importantly: if you need to solve Ax = b for multiple right-hand sides b₁, b₂, ..., you only compute LU once (O(n³)) and then solve each system in O(n²).

This is exactly what np.linalg.solve does internally - it calls LAPACK's LU factorization.

import numpy as np
from scipy import linalg

# LU decomposition (scipy has better LU interface than numpy)
A = np.array([[2.0, 1.0, -1.0],
              [-3.0, -1.0, 2.0],
              [-2.0, 1.0, 2.0]])

# scipy.linalg.lu returns P, L, U such that A = P @ L @ U
P, L, U = linalg.lu(A)

print("P (permutation):\n", P)
print("L (lower triangular):\n", L.round(4))
print("U (upper triangular):\n", U.round(4))
print(f"Reconstruction error: {np.linalg.norm(A - P @ L @ U):.2e}")

# Solving multiple systems with the same A
# Real use case: solving A @ X = B where B has multiple columns
B = np.random.randn(3, 5)  # 5 different right-hand sides

# Method 1: solve each separately (inefficient)
X1 = np.column_stack([np.linalg.solve(A, B[:, i]) for i in range(5)])

# Method 2: solve all at once (numpy handles this)
X2 = np.linalg.solve(A, B)  # B is a matrix - solves all columns at once
print(f"Solutions match: {np.allclose(X1, X2)}")

# Method 3: factorize once, solve multiple times (scipy)
lu_factor = linalg.lu_factor(A)  # returns LU factorization
X3 = linalg.lu_solve(lu_factor, B)  # O(n²) per solve
print(f"lu_solve matches: {np.allclose(X1, X3)}")

Part 4 - QR Decomposition: Orthogonalization and Least Squares

The QR decomposition factors a matrix A (m×n, m ≥ n) into:

A = QR

Where:

Q is (m×m) orthogonal: Qᵀ = Q⁻¹, columns are orthonormal
R is (m×n) upper triangular

Gram-Schmidt process: building orthonormal bases

QR decomposition is the numerical implementation of the Gram-Schmidt orthogonalization process, which takes a set of linearly independent vectors and produces orthonormal vectors spanning the same space.

The Gram-Schmidt algorithm:

Given vectors a₁, a₂, ..., aₙ:

u₁ = a₁
e₁ = u₁ / ‖u₁‖

u₂ = a₂ - (a₂·e₁)e₁          ← remove component along e₁
e₂ = u₂ / ‖u₂‖

u₃ = a₃ - (a₃·e₁)e₁ - (a₃·e₂)e₂    ← remove components along e₁, e₂
e₃ = u₃ / ‖u₃‖

Each new vector is made orthogonal to all previous ones.

Least squares via QR

The least squares problem min ‖Ax - b‖₂ can be solved via QR:

A = QR
Ax = b → QRx = b → Rx = Qᵀb

Since R is upper triangular: solve by back-substitution

This is numerically more stable than the normal equations (AᵀA)x = Aᵀb because we never form the potentially ill-conditioned AᵀA.

import numpy as np

# QR decomposition
A = np.random.randn(5, 3)  # overdetermined system: 5 equations, 3 unknowns

# np.linalg.qr returns "economy" QR by default (mode='reduced')
Q, R = np.linalg.qr(A)
print(f"Q shape: {Q.shape}")  # (5, 3) - economy QR
print(f"R shape: {R.shape}")  # (3, 3)

# Verify orthogonality: Qᵀ Q = I
print(f"Qᵀ Q = I: {np.allclose(Q.T @ Q, np.eye(3))}")  # True

# Verify reconstruction: A = QR
print(f"A = QR: {np.allclose(A, Q @ R)}")  # True

# Gram-Schmidt from scratch
def gram_schmidt(A: np.ndarray) -> tuple:
    """
    QR decomposition via classical Gram-Schmidt orthogonalization.
    (NumPy uses modified Gram-Schmidt or Householder - more stable)
    """
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))

    for j in range(n):
        # Start with the j-th column of A
        v = A[:, j].copy()

        # Remove projections onto all previous Q columns
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]  # R[i,j] = qᵢᵀaⱼ
            v -= R[i, j] * Q[:, i]

        # Normalize
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]

    return Q, R

Q_gs, R_gs = gram_schmidt(A)
print(f"Gram-Schmidt Q orthogonal: {np.allclose(Q_gs.T @ Q_gs, np.eye(3))}")  # True

# Least squares via QR (numerically preferred over normal equations)
def least_squares_qr(A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """
    Solve min ‖Ax - b‖₂ using QR decomposition.
    More stable than (AᵀA)⁻¹Aᵀb for ill-conditioned A.
    """
    Q, R = np.linalg.qr(A)        # A = QR
    Qtb = Q.T @ b                  # Qᵀb - projection onto column space of A
    x = np.linalg.solve(R, Qtb)   # Back-substitution: Rx = Qᵀb
    return x

b = np.random.randn(5)
x_qr = least_squares_qr(A, b)
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]  # NumPy's built-in
print(f"QR and lstsq match: {np.allclose(x_qr, x_lstsq)}")  # True

Part 5 - Cholesky Decomposition: The Symmetric PD Special Case

The Cholesky decomposition applies only to symmetric positive definite (SPD) matrices. It factors A into:

A = LLᵀ

Where L is a lower triangular matrix with positive diagonal entries.

Why SPD matrices get their own decomposition

Twice as fast as LU decomposition (exploit symmetry)
More stable: no pivoting needed for SPD matrices
Test for positive definiteness: Cholesky fails if A is not SPD - this is a reliable test

SPD matrices appear constantly in ML:

Covariance matrices (Gaussian distributions)
Gram matrices XᵀX (when X has full column rank)
Kernel matrices K(xᵢ, xⱼ) (positive definite kernels by definition)
Regularized Hessians H + εI (used in Newton's method)

import numpy as np

# Cholesky decomposition
n = 5
A = np.random.randn(n, n)
S = A.T @ A + np.eye(n)  # symmetric positive definite (SPD) - add I ensures PD

L = np.linalg.cholesky(S)
print(f"L shape: {L.shape}")  # (n, n)
print(f"L is lower triangular: {np.allclose(L, np.tril(L))}")  # True
print(f"LLᵀ = S: {np.allclose(L @ L.T, S)}")  # True

# Test for positive definiteness via Cholesky
def is_positive_definite(A: np.ndarray) -> bool:
    """Test if A is positive definite using Cholesky."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

pd_matrix = np.eye(3) + 0.1 * np.random.randn(3, 3)
pd_matrix = pd_matrix.T @ pd_matrix  # definitely PD
not_pd = np.array([[1.0, 2.0],
                   [2.0, 1.0]])  # eigenvalues: 3, -1 → not PD

print(f"PD matrix is PD: {is_positive_definite(pd_matrix)}")  # True
print(f"not_pd is PD: {is_positive_definite(not_pd)}")        # False

# Solving SPD linear systems: use Cholesky for 2× speedup
from scipy import linalg

# For Gaussian distributions: need to solve Σx = b
# Cholesky is the preferred method
n_dim = 50
cov = np.random.randn(n_dim, n_dim)
cov = cov.T @ cov + 0.1 * np.eye(n_dim)  # SPD covariance matrix

b = np.random.randn(n_dim)

# Method 1: LU via solve (general)
x_lu = np.linalg.solve(cov, b)

# Method 2: Cholesky (2× faster for large SPD matrices)
L = np.linalg.cholesky(cov)
# Solve LLᵀx = b:
# Step 1: Ly = b (forward substitution)
y = linalg.solve_triangular(L, b, lower=True)
# Step 2: Lᵀx = y (back substitution)
x_chol = linalg.solve_triangular(L.T, y, lower=False)

print(f"Cholesky and LU solutions match: {np.allclose(x_lu, x_chol)}")

# Sampling from multivariate Gaussian N(μ, Σ)
def sample_multivariate_normal(
    mean: np.ndarray,
    cov: np.ndarray,
    n_samples: int
) -> np.ndarray:
    """
    Sample from N(mean, cov) using Cholesky decomposition.
    x = mean + L @ z, where z ~ N(0, I) and L is Cholesky factor of cov.
    """
    d = len(mean)
    L = np.linalg.cholesky(cov)
    z = np.random.randn(n_samples, d)
    return mean + z @ L.T  # (n_samples, d)

samples = sample_multivariate_normal(
    mean=np.zeros(5),
    cov=cov[:5, :5],
    n_samples=1000
)
print(f"Sample shape: {samples.shape}")  # (1000, 5)
print(f"Sample mean ≈ 0: {np.allclose(samples.mean(axis=0), 0, atol=0.1)}")

Part 6 - NumPy: Complete Decomposition Reference

import numpy as np
from scipy import linalg

m, n = 6, 4
A = np.random.randn(m, n)
S = A.T @ A + np.eye(n)  # SPD matrix (n×n)

# ── SVD ────────────────────────────────────────────────────────────────────
# Full SVD
U_full, sigma_full, Vt_full = np.linalg.svd(A, full_matrices=True)
print(f"Full SVD: U={U_full.shape}, σ={sigma_full.shape}, Vᵀ={Vt_full.shape}")
# U: (m, m), sigma: (min(m,n),), Vt: (n, n)

# Economy (thin) SVD - usually what you want
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
print(f"Economy SVD: U={U.shape}, σ={sigma.shape}, Vᵀ={Vt.shape}")
# U: (m, k), sigma: (k,), Vt: (k, n) where k = min(m,n)

# Reconstruct A
A_reconstructed = U @ np.diag(sigma) @ Vt
print(f"SVD reconstruction error: {np.linalg.norm(A - A_reconstructed):.2e}")

# Rank via SVD (most numerically stable method)
rank = np.sum(sigma > sigma[0] * max(m, n) * np.finfo(float).eps)
print(f"Matrix rank: {rank}")

# Condition number = σmax / σmin
cond = sigma[0] / sigma[-1]
print(f"Condition number: {cond:.4f}")

# ── Truncated SVD (scikit-learn style) ────────────────────────────────────
def truncated_svd(A: np.ndarray, k: int) -> tuple:
    """Low-rank approximation: keep top-k singular values."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], sigma[:k], Vt[:k, :]

U_k, sigma_k, Vt_k = truncated_svd(A, k=2)
A_approx = U_k @ np.diag(sigma_k) @ Vt_k

# Frobenius norm of approximation error
error = np.linalg.norm(A - A_approx, 'fro')
# Theory: ‖A - Aₖ‖_F = √(σₖ₊₁² + ... + σₘᵢₙ²)
theoretical_error = np.sqrt(np.sum(sigma[2:]**2))
print(f"Approx error: {error:.4f}, theoretical: {theoretical_error:.4f}")

# ── LU decomposition ───────────────────────────────────────────────────────
A_sq = np.random.randn(4, 4)  # square matrix
P, L, U = linalg.lu(A_sq)    # P @ A = L @ U
print(f"\nLU: P={P.shape}, L={L.shape}, U={U.shape}")
print(f"Reconstruction: {np.allclose(P @ A_sq, L @ U)}")

# ── QR decomposition ───────────────────────────────────────────────────────
Q, R = np.linalg.qr(A)             # economy QR
Q_full, R_full = np.linalg.qr(A, mode='complete')  # full QR
print(f"\nQR economy: Q={Q.shape}, R={R.shape}")
print(f"Q orthonormal: {np.allclose(Q.T @ Q, np.eye(n))}")
print(f"A = QR: {np.allclose(A, Q @ R)}")

# ── Cholesky decomposition ─────────────────────────────────────────────────
L_chol = np.linalg.cholesky(S)     # L such that S = L @ L.T
print(f"\nCholesky: L={L_chol.shape}")
print(f"S = LLᵀ: {np.allclose(S, L_chol @ L_chol.T)}")
print(f"L lower triangular: {np.allclose(L_chol, np.tril(L_chol))}")

Part 7 - ML Connections: Decompositions in Production

PCA via SVD (what sklearn actually does)

import numpy as np

def pca_via_svd(X: np.ndarray, n_components: int) -> tuple:
    """
    PCA using SVD - the numerically stable approach.

    Why SVD instead of eigendecomposing the covariance matrix?
    1. SVD operates on X directly (n_samples × n_features)
    2. Avoids forming XᵀX (which squares the condition number)
    3. More stable for thin matrices (n >> d)

    This is what sklearn.decomposition.PCA does internally.
    """
    n, d = X.shape

    # Center the data
    mean = X.mean(axis=0)
    X_centered = X - mean

    # SVD of centered data matrix
    # U: (n, k), sigma: (k,), Vt: (k, d)
    U, sigma, Vt = np.linalg.svd(X_centered, full_matrices=False)

    # Principal components = rows of Vt (right singular vectors)
    components = Vt[:n_components, :]  # (n_components, d)

    # Explained variance = (σᵢ / √(n-1))² = σᵢ² / (n-1)
    explained_variance = sigma[:n_components]**2 / (n - 1)
    explained_variance_ratio = explained_variance / (sigma**2 / (n - 1)).sum()

    # Projected data
    X_projected = X_centered @ components.T  # (n, n_components)

    return X_projected, components, explained_variance_ratio

# Verify matches sklearn
from sklearn.decomposition import PCA

np.random.seed(42)
X = np.random.randn(100, 20)
k = 5

X_svd, comps_svd, var_svd = pca_via_svd(X, k)
sk_pca = PCA(n_components=k).fit(X)
X_sklearn = sk_pca.transform(X)

print(f"Explained variance ratio (SVD):    {var_svd.round(4)}")
print(f"Explained variance ratio (sklearn): {sk_pca.explained_variance_ratio_.round(4)}")
# Exact match (up to sign of components)

LSA: Latent Semantic Analysis

import numpy as np

def latent_semantic_analysis(
    term_doc_matrix: np.ndarray,
    k: int = 100
) -> tuple:
    """
    Latent Semantic Analysis: find k latent topics in a document corpus.

    term_doc_matrix: (n_terms, n_documents) - typically TF-IDF values
    Returns: term_embeddings (n_terms, k), doc_embeddings (n_docs, k)

    The k left singular vectors = directions in term space (topics)
    The k right singular vectors = directions in document space
    """
    U, sigma, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)

    # Term embeddings in latent space
    term_embeddings = U[:, :k] @ np.diag(sigma[:k])  # (n_terms, k)
    # Doc embeddings in latent space
    doc_embeddings = Vt[:k, :].T  # (n_docs, k)

    # Similarity between documents: dot product in latent space
    doc_similarity = doc_embeddings @ doc_embeddings.T

    return term_embeddings, doc_embeddings, doc_similarity

# Conceptual example
np.random.seed(42)
n_terms, n_docs = 500, 200
tfidf_matrix = np.random.exponential(0.5, (n_terms, n_docs))

term_emb, doc_emb, doc_sim = latent_semantic_analysis(tfidf_matrix, k=50)
print(f"Term embeddings: {term_emb.shape}")   # (500, 50)
print(f"Document embeddings: {doc_emb.shape}") # (200, 50)
print(f"Document similarity: {doc_sim.shape}") # (200, 200)

Part 8 - Common Failure Modes

:::danger SVD can be slow on large matrices - use truncated SVD Full SVD on an (m×n) matrix costs O(min(m,n) × m × n). For large NLP or vision matrices, this is prohibitive.

# SLOW: full SVD on large matrix
U, sigma, Vt = np.linalg.svd(large_matrix)  # compute all singular values

# FAST: truncated SVD (only top-k singular values)
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=50)  # only k=50 components
X_reduced = svd.fit_transform(large_matrix)

# FAST (scipy): randomized SVD for very large matrices
from sklearn.utils.extmath import randomized_svd
U, sigma, Vt = randomized_svd(large_matrix, n_components=50, random_state=42)

For matrices > 10,000 × 10,000, use randomized SVD (sklearn's randomized_svd), which computes an approximate truncated SVD in O(k × m × n) instead of O(min(m,n) × m × n). :::

:::danger Cholesky fails on non-positive-definite matrices - add regularization

# This will fail if S has any eigenvalue ≤ 0
try:
    L = np.linalg.cholesky(S)
except np.linalg.LinAlgError:
    print("Matrix is not positive definite!")
    # Fix: add small diagonal regularization
    eps = 1e-6
    L = np.linalg.cholesky(S + eps * np.eye(S.shape[0]))

Covariance matrices estimated from data can be near-singular (fewer samples than features). Always add a small regularization εI before Cholesky decomposition in production code. :::

:::tip Use the appropriate decomposition for your matrix structure

General square: np.linalg.solve (LU internally)
Rectangular / overdetermined: np.linalg.lstsq (QR internally)
Symmetric positive definite: np.linalg.cholesky (2× faster than LU)
Any shape, low-rank approximation: np.linalg.svd with truncation
PCA on data matrix: np.linalg.svd on centered data (more stable than covariance eigendecomp) :::

Interview Questions

Q1: What is the SVD and why is it more general than eigendecomposition?

SVD: Any m×n matrix A = U Σ Vᵀ where U, V are orthogonal and Σ is diagonal.

More general than eigendecomposition because:

Eigendecomposition requires a square matrix AND n linearly independent eigenvectors
SVD works on ANY matrix (rectangular, singular, non-diagonalizable)
When A is symmetric positive definite: SVD = eigendecomposition (U = V, σᵢ = λᵢ)

Why this matters for ML:

Data matrices X are almost never square (n_samples ≠ n_features)
Even square matrices might not be diagonalizable (non-symmetric weight matrices)
SVD always exists and always gives the best low-rank approximation (Eckart-Young theorem)

This is why sklearn.PCA uses SVD on the data matrix rather than eigendecomposing the covariance matrix - SVD handles the rectangular data matrix directly and avoids squaring the condition number.

Q2: Explain the Eckart-Young theorem and its ML implications.

Eckart-Young theorem: The best rank-k approximation to matrix A (in both Frobenius and spectral norm) is the truncated SVD:

Aₖ = σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ + ... + σₖuₖvₖᵀ

Proof intuition: Each rank-1 term σᵢuᵢvᵢᵀ captures the "principal direction" of variation in A. By taking the top-k, we capture the most variation with the fewest terms.

ML implications:

PCA is optimal compression: The first k principal components give the best k-dimensional representation of the data (minimum reconstruction error)
Collaborative filtering: The top-k latent factors capture the dominant rating patterns (genres, styles)
Image compression: Keeping k singular values gives the best rank-k image approximation for any k
Low-rank approximation for large models: LoRA uses the assumption that fine-tuning updates have low intrinsic rank (SVD-based justification)

Q3: When should you use QR decomposition instead of the normal equations?

For solving least squares min ‖Ax - b‖₂:

Normal equations approach: form AᵀA and solve (AᵀA)x = Aᵀb

Problem: AᵀA squares the condition number: κ(AᵀA) = κ(A)²
If κ(A) = 1000, κ(AᵀA) = 10⁶ - 6 digits of accuracy lost in float64

QR approach: factor A = QR, solve Rx = Qᵀb

Advantage: works with κ(A) directly - no squaring
Preferred whenever A is near-rank-deficient or poorly conditioned

Rule of thumb:

Well-conditioned A (κ < 1000): normal equations are fine, slightly faster
Near-rank-deficient A (κ > 10⁶): always use QR or SVD
In general ML code: use np.linalg.lstsq (uses QR internally) and avoid normal equations

Q4: What is the Cholesky decomposition and when is it preferred over LU?

Cholesky: for symmetric positive definite (SPD) matrix A, factor as A = LLᵀ where L is lower triangular.

Preferred over LU when:

A is symmetric positive definite: exploit structure for 2× speedup
No pivoting needed: Cholesky is unconditionally stable for SPD matrices (unlike LU which requires pivoting for stability)
Sampling from Gaussians: x = mean + L @ z where z ~ N(0,I) generates x ~ N(mean, A)
Testing for PD: Cholesky throws LinAlgError if A is not PD - reliable detection

Where SPD matrices appear:

Covariance matrices Σ (Gaussian distributions)
Gram matrices XᵀX (kernel methods, ridge regression)
Regularized Hessians H + εI (Newton's method)
Kernel matrices K(xᵢ, xⱼ) (SVMs, Gaussian processes)

Practice Challenges

Level 1: Predict

Challenge: A data matrix X has shape (200, 50) and rank 10. What are the dimensions of U, Σ, V in the economy SVD? How many non-zero singular values are there?

Answer

Economy SVD of X (200×50):

U: (200, 50) - m × min(m,n)
Σ (as vector): (50,) - min(m,n) singular values
Vᵀ: (50, 50) - min(m,n) × n

But since rank = 10, only the first 10 singular values are non-zero (σ₁ ≥ σ₂ ≥ ... ≥ σ₁₀ > 0 = σ₁₁ = ... = σ₅₀).

The approximation error of a rank-k approximation for k ≥ 10 is exactly zero (since A is already rank 10):

‖A - Aₖ‖_F = 0 for k ≥ 10

Level 2: Debug

Challenge: The following code for PCA reconstruction is supposed to return the original data (zero error), but the reconstruction error is large. Find the bug:

X = np.random.randn(100, 20)
X_centered = X - X.mean(axis=0)

U, sigma, Vt = np.linalg.svd(X_centered, full_matrices=False)
k = 5
X_projected = X_centered @ Vt[:k, :].T

# Bug: try to reconstruct
X_reconstructed = X_projected @ Vt[:k, :]
print(np.linalg.norm(X_centered - X_reconstructed))  # large error

Answer

Bug: Projection onto k=5 components destroys information from the remaining 15 components. The reconstruction from 5 components back to 20 dimensions can only recover the rank-5 approximation, not the original data.

There is no bug in the code per se - this is expected behavior. A rank-5 projection will not reconstruct a full-rank matrix.

If the goal is zero error: use all components (k = min(m,n)):

U, sigma, Vt = np.linalg.svd(X_centered, full_matrices=False)
# Full reconstruction (no compression)
X_reconstructed = U @ np.diag(sigma) @ Vt
print(np.linalg.norm(X_centered - X_reconstructed))  # ≈ 0 (machine epsilon)

If the goal is lossy compression: the error is expected and equals √(σ_{k+1}² + ... + σₘᵢₙ²).

Level 3: Design

Challenge: Implement a low-rank matrix completion algorithm for collaborative filtering. The algorithm should: (1) fill missing entries with user means, (2) apply truncated SVD, (3) iterate until convergence.

Answer

import numpy as np

def svd_matrix_completion(
    R: np.ndarray,    # (n_users, n_items) with NaN for missing
    k: int = 10,      # rank of approximation
    n_iter: int = 20  # number of iterations
) -> np.ndarray:
    """
    SVD-based matrix completion (collaborative filtering).

    Algorithm:
    1. Initialize missing entries (user means)
    2. Compute rank-k SVD approximation
    3. Replace observed entries with original values
    4. Repeat until convergence

    This is the "hard impute with SVD" algorithm.
    """
    observed = ~np.isnan(R)

    # Initialize: fill with user row means
    row_means = np.nanmean(R, axis=1, keepdims=True)
    M = np.where(observed, R, row_means)

    prev_error = np.inf

    for iteration in range(n_iter):
        # Rank-k SVD approximation
        U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
        M_approx = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

        # Update: keep observed entries, fill unobserved with SVD predictions
        M_new = np.where(observed, R, M_approx)

        # Convergence check
        error = np.linalg.norm(M_new - M, 'fro')
        if error < 1e-6:
            print(f"Converged at iteration {iteration + 1}")
            break

        M = M_new

    return M_approx

# Test
np.random.seed(42)
n_users, n_items = 50, 100

# True low-rank ratings
true_U = np.random.randn(n_users, 3)
true_V = np.random.randn(n_items, 3)
true_R = 3.5 + true_U @ true_V.T + 0.1 * np.random.randn(n_users, n_items)

# Observe 20% of ratings
mask = np.random.rand(n_users, n_items) < 0.20
observed_R = np.where(mask, true_R, np.nan)

# Predict
predicted_R = svd_matrix_completion(observed_R, k=3, n_iter=50)

# Evaluate on unobserved
unobserved = ~mask
rmse = np.sqrt(np.mean((predicted_R[unobserved] - true_R[unobserved])**2))
print(f"RMSE on held-out ratings: {rmse:.4f}")

Quick Reference Cheatsheet

Decomposition	Formula	NumPy	Best for
SVD (full)	A = UΣVᵀ	`np.linalg.svd(A)`	Any matrix analysis
SVD (economy)	A ≈ UₖΣₖVₖᵀ	`np.linalg.svd(A, full_matrices=False)`	Low-rank approx
SVD (truncated)	Aₖ (best rank-k)	`sklearn.utils.extmath.randomized_svd`	Large matrices
LU	PA = LU	`scipy.linalg.lu(A)`	Square systems
LU factor+solve	Ax = b	`scipy.linalg.lu_factor` + `lu_solve`	Multiple RHS
QR	A = QR	`np.linalg.qr(A)`	Least squares, orthogonalization
Cholesky	A = LLᵀ	`np.linalg.cholesky(A)`	SPD matrices, 2× faster
Condition number	σmax/σmin	`sigma[0]/sigma[-1]` after SVD	Numerical stability
Rank (via SVD)	#nonzero σᵢ	`np.linalg.matrix_rank(A)`	Intrinsic dimensionality

Key Takeaways

SVD works on any matrix (A = UΣVᵀ) and gives the best low-rank approximation - it is more general than eigendecomposition
Singular values measure the "importance" of each direction: large σᵢ = major direction, small σᵢ = minor direction
The Eckart-Young theorem guarantees the truncated SVD is optimal for compression and dimensionality reduction
LU decomposition is what np.linalg.solve uses internally - understand it to know why solve beats inv
QR decomposition orthogonalizes a set of vectors and is more numerically stable than normal equations for least squares
Cholesky decomposition is twice as fast as LU for symmetric positive definite matrices and appears everywhere covariance matrices are inverted

Next: Linear Transformations - The Geometry of Neural Network Layers →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the SVD Explorer demo on the EngineersOfAI Playground - no code required.

:::

What You Will Learn​

Prerequisites​

Part 1 - SVD: The Fundamental Theorem of Linear Algebra​

Geometric interpretation: rotate → scale → rotate​

Relationship to eigendecomposition​

Part 2 - Truncated SVD: Low-Rank Approximation​

Image compression demo​

Collaborative filtering: the Netflix algorithm​

Part 3 - LU Decomposition: How Linear Systems Are Actually Solved​

Why LU matters for solving Ax = b​

Part 4 - QR Decomposition: Orthogonalization and Least Squares​

Gram-Schmidt process: building orthonormal bases​

Least squares via QR​

Part 5 - Cholesky Decomposition: The Symmetric PD Special Case​

Why SPD matrices get their own decomposition​

Part 6 - NumPy: Complete Decomposition Reference​

Part 7 - ML Connections: Decompositions in Production​

PCA via SVD (what sklearn actually does)​

LSA: Latent Semantic Analysis​

Part 8 - Common Failure Modes​

Interview Questions​

Practice Challenges​

Level 1: Predict​

Level 2: Debug​

Level 3: Design​

Quick Reference Cheatsheet​

Key Takeaways​

What You Will Learn

Prerequisites

Part 1 - SVD: The Fundamental Theorem of Linear Algebra

Geometric interpretation: rotate → scale → rotate

Relationship to eigendecomposition

Part 2 - Truncated SVD: Low-Rank Approximation

Image compression demo

Collaborative filtering: the Netflix algorithm

Part 3 - LU Decomposition: How Linear Systems Are Actually Solved

Why LU matters for solving Ax = b

Part 4 - QR Decomposition: Orthogonalization and Least Squares

Gram-Schmidt process: building orthonormal bases

Least squares via QR

Part 5 - Cholesky Decomposition: The Symmetric PD Special Case

Why SPD matrices get their own decomposition

Part 6 - NumPy: Complete Decomposition Reference

Part 7 - ML Connections: Decompositions in Production

PCA via SVD (what sklearn actually does)

LSA: Latent Semantic Analysis

Part 8 - Common Failure Modes

Interview Questions

Practice Challenges

Level 1: Predict

Level 2: Debug

Level 3: Design

Quick Reference Cheatsheet

Key Takeaways