Linear Algebra in NumPy - A Complete Engineering Reference
Reading time: ~25 minutes | Level: Mathematical Foundations → ML Engineering
NumPy is the linear algebra engine underneath sklearn, PyTorch (CPU ops), TensorFlow, and JAX. Knowing NumPy linear algebra means you can implement - and debug - any ML algorithm from scratch.
Every major ML framework dispatches to BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage) under the hood. NumPy exposes this through its np.linalg module. SciPy extends it with sparse matrices and advanced factorizations. PyTorch mirrors it in torch.linalg for GPU execution and gradient support.
This lesson is a complete engineering reference: every function, every performance gotcha, every numerical stability trap, and the ML patterns you will reach for repeatedly.
What You Will Learn
- NumPy array internals vs Python lists - what makes NumPy fast
- The complete
np.linalgmodule: every function with ML context - Solving linear systems correctly:
np.linalg.solvenotnp.linalg.inv - Computing decompositions: SVD, eigenvalue, QR, Cholesky - when to use each
- SciPy
linalg: sparse matrices and advanced factorizations - Performance: memory layout, vectorization, avoiding loops, timing
- Numerical stability: condition number, catastrophic cancellation, float32 vs float64
- Common ML patterns from scratch: Gram matrix, covariance, whitening, rotations
- PyTorch
torch.linalg: the GPU-accelerated equivalent
Prerequisites
- Lesson 01: Vectors and Vector Spaces
- Lesson 03: Eigenvalues and Eigenvectors
- Lesson 04: SVD and Matrix Decompositions
Part 1 - NumPy Array vs Python List: Why Performance Differs
Memory layout is the fundamental difference
A Python list stores pointers to Python objects scattered across heap memory:
list: [ptr→obj₁, ptr→obj₂, ptr→obj₃, ...]
Each ptr is 8 bytes. Each obj is ≥28 bytes (Python float). Scattered in memory.
A NumPy array stores raw values in a single contiguous block of memory:
array: [val₁, val₂, val₃, ...]
Each val is 4 bytes (float32) or 8 bytes (float64). Sequential in memory.
This difference has profound performance consequences:
- Cache efficiency: sequential memory access is cache-friendly - the CPU prefetches upcoming values automatically. Pointer-following (Python lists) causes cache misses.
- SIMD vectorization: AVX-512 can process 16 float32s in one instruction, but only on contiguous memory.
- BLAS dispatch:
A @ Bin NumPy calls BLASdgemm- a decades-optimized routine. There is no Python loop anywhere in the call. - No GIL release needed: NumPy releases the Python GIL during heavy computation, enabling multi-threaded BLAS.
import numpy as np
import time
n = 1_000_000
# Memory footprint comparison
py_list = list(range(n))
np_array = np.arange(n, dtype=np.float64)
import sys
py_size = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list[:100]) * 10000
np_size = np_array.nbytes
print(f"Python list (approx): {py_size / 1e6:.1f} MB")
print(f"NumPy float64: {np_size / 1e6:.1f} MB")
print(f"NumPy float32: {np_size / 2 / 1e6:.1f} MB")
# NumPy is typically 5-10x more memory-efficient
# Speed comparison: sum of 1M elements
a_list = list(range(n))
a_array = np.arange(n, dtype=np.float64)
t0 = time.perf_counter()
for _ in range(100):
s = sum(a_list)
t_python = (time.perf_counter() - t0) / 100
t0 = time.perf_counter()
for _ in range(100):
s = np.sum(a_array)
t_numpy = (time.perf_counter() - t0) / 100
print(f"\nPython sum: {t_python*1000:.2f} ms")
print(f"NumPy sum: {t_numpy*1000:.3f} ms")
print(f"Speedup: {t_python/t_numpy:.0f}x")
# Data types: float32 vs float64
a32 = np.random.randn(1000, 1000).astype(np.float32)
a64 = np.random.randn(1000, 1000).astype(np.float64)
t0 = time.perf_counter()
for _ in range(10):
_ = a32 @ a32
t_f32 = (time.perf_counter() - t0) / 10
t0 = time.perf_counter()
for _ in range(10):
_ = a64 @ a64
t_f64 = (time.perf_counter() - t0) / 10
print(f"\nfloat32 matmul: {t_f32*1000:.2f} ms")
print(f"float64 matmul: {t_f64*1000:.2f} ms")
print(f"float32 speedup: ~{t_f64/t_f32:.1f}x (wider SIMD, less memory bandwidth)")
Part 2 - The np.linalg Module: Complete Reference
import numpy as np
# ── Matrix properties ───────────────────────────────────────────────────
A = np.array([[4.0, 2.0, 1.0],
[2.0, 5.0, 3.0],
[1.0, 3.0, 6.0]])
print("=== Matrix properties ===")
print(f"det(A) = {np.linalg.det(A):.4f}")
print(f"trace(A) = {np.trace(A):.4f}") # sum of eigenvalues
print(f"rank(A) = {np.linalg.matrix_rank(A)}")
print(f"cond(A) = {np.linalg.cond(A):.4f}") # condition number
print(f"norm(A, 'fro') = {np.linalg.norm(A, 'fro'):.4f}")
print(f"norm(A, 2) = {np.linalg.norm(A, 2):.4f}") # spectral norm
# ── System solving ──────────────────────────────────────────────────────
b = np.array([1.0, 2.0, 3.0])
# ALWAYS use solve, not inv
x = np.linalg.solve(A, b) # solution to Ax = b
print(f"\n=== Linear system: Ax = b ===")
print(f"x = {x}")
print(f"Residual ‖Ax - b‖: {np.linalg.norm(A @ x - b):.2e}")
# Multiple right-hand sides: A X = B (B has multiple columns)
B = np.random.randn(3, 5)
X = np.linalg.solve(A, B) # X shape: (3, 5) - solves 5 systems at once
print(f"Multi-RHS solve: X shape = {X.shape}")
# Least squares: overdetermined system Ax ≈ b (m > n)
A_rect = np.random.randn(10, 3) # 10 equations, 3 unknowns
b_rect = np.random.randn(10)
x_ls, residuals, rank, sv = np.linalg.lstsq(A_rect, b_rect, rcond=None)
print(f"\n=== Least squares ===")
print(f"x_ls shape: {x_ls.shape}, rank: {rank}")
# ── Decompositions ──────────────────────────────────────────────────────
print("\n=== Decompositions ===")
# SVD
U, s, Vt = np.linalg.svd(A) # full SVD
U2, s2, Vt2 = np.linalg.svd(A, full_matrices=False) # economy SVD
print(f"SVD: U={U.shape}, s={s.shape}, Vt={Vt.shape}")
print(f"Economy SVD: U={U2.shape}, s={s2.shape}, Vt={Vt2.shape}")
print(f"Singular values: {s2.round(4)}")
# Eigendecomposition - general matrix
eigenvals, eigenvecs = np.linalg.eig(A)
print(f"\nEig (general): λ = {eigenvals.round(4)}")
# Eigendecomposition - symmetric matrix (always prefer eigh for symmetric)
eigenvals_s, eigenvecs_s = np.linalg.eigh(A) # A is symmetric → use eigh
print(f"Eigh (symmetric): λ = {eigenvals_s.round(4)} (ascending)")
# QR decomposition
Q, R = np.linalg.qr(A)
print(f"\nQR: Q={Q.shape} (orthonormal), R={R.shape} (upper triangular)")
print(f"Q orthonormal: {np.allclose(Q.T @ Q, np.eye(3))}")
print(f"QR reconstruction: {np.allclose(Q @ R, A)}")
# Cholesky: for symmetric positive definite matrices
L = np.linalg.cholesky(A) # A must be symmetric positive definite
print(f"\nCholesky: L={L.shape} (lower triangular)")
print(f"LLᵀ = A: {np.allclose(L @ L.T, A)}")
# ── Inverses and pseudoinverses ─────────────────────────────────────────
print("\n=== Inverses ===")
A_inv = np.linalg.inv(A)
print(f"inv(A): shape={A_inv.shape}")
print(f"A @ inv(A) ≈ I: {np.allclose(A @ A_inv, np.eye(3))}")
# Pseudoinverse (Moore-Penrose): works for rectangular/singular matrices
A_rect = np.random.randn(5, 3) # tall matrix, rank 3
A_pinv = np.linalg.pinv(A_rect)
print(f"\npinv: A={A_rect.shape} → pinv(A)={A_pinv.shape}")
print(f"A_pinv @ A ≈ I₃: {np.allclose(A_pinv @ A_rect, np.eye(3), atol=1e-10)}")
# Log determinant for numerical stability
# Use slogdet instead of log(det) to avoid overflow/underflow
sign, logdet = np.linalg.slogdet(A)
print(f"\nlog|det(A)| = {logdet:.4f}")
print(f"det(A) = {np.exp(logdet) * sign:.4f}")
# For large matrices, det() overflows/underflows; slogdet stays stable
Part 3 - Solving Linear Systems: solve Not inv
Why np.linalg.solve is almost always better than np.linalg.inv
The common mistake: x = np.linalg.inv(A) @ b
The correct approach: x = np.linalg.solve(A, b)
Three reasons:
-
Speed:
solveuses LU factorization - O(n³/3).invalso computes LU but then multiplies by the identity for each column - doing roughly 3x more work. -
Numerical accuracy:
solvenever explicitly forms the inverse. The inverse amplifies numerical errors;solveminimizes them by directly finding x. -
Explicit inverse is rarely needed: in ML, you almost never need A⁻¹ itself - you need A⁻¹b for some b. Use
solve.
import numpy as np
import time
np.random.seed(42)
n = 500
# Well-conditioned system
A = np.random.randn(n, n)
A = A.T @ A + n * np.eye(n) # symmetric positive definite, well-conditioned
b = np.random.randn(n)
# Method 1: inv then multiply (wrong approach)
t0 = time.perf_counter()
for _ in range(20):
x_inv = np.linalg.inv(A) @ b
t_inv = (time.perf_counter() - t0) / 20
# Method 2: solve (correct approach)
t0 = time.perf_counter()
for _ in range(20):
x_solve = np.linalg.solve(A, b)
t_solve = (time.perf_counter() - t0) / 20
print(f"inv() + @: {t_inv*1000:.2f} ms")
print(f"solve(): {t_solve*1000:.2f} ms")
print(f"solve speedup: {t_inv/t_solve:.1f}x")
print(f"Results match: {np.allclose(x_inv, x_solve)}")
# Numerical accuracy: ill-conditioned system
# High condition number → inv amplifies errors more than solve
A_ill = np.vander(np.linspace(0.1, 1, 10), 10) # Vandermonde matrix (very ill-conditioned)
b_ill = np.random.randn(10)
true_x = np.linalg.lstsq(A_ill, b_ill, rcond=None)[0] # most stable reference
try:
x_inv_ill = np.linalg.inv(A_ill) @ b_ill
err_inv = np.linalg.norm(A_ill @ x_inv_ill - b_ill)
print(f"\nIll-conditioned residual (inv): {err_inv:.2e}")
except np.linalg.LinAlgError:
print("inv failed on ill-conditioned matrix")
x_solve_ill = np.linalg.solve(A_ill, b_ill)
err_solve = np.linalg.norm(A_ill @ x_solve_ill - b_ill)
print(f"Ill-conditioned residual (solve): {err_solve:.2e}")
cond = np.linalg.cond(A_ill)
print(f"Condition number: {cond:.2e}") # should be very large
# Solving multiple right-hand sides efficiently
# Factorize once, solve multiple times - use scipy for this
from scipy import linalg as sla
lu, piv = sla.lu_factor(A) # factorize once: O(n³)
n_rhs = 50
residuals = []
for _ in range(n_rhs):
b_i = np.random.randn(n)
x_i = sla.lu_solve((lu, piv), b_i) # each solve: O(n²)
residuals.append(np.linalg.norm(A @ x_i - b_i))
print(f"\nLU pre-factorization residuals: max = {max(residuals):.2e}")
print("(factorize once, solve {n_rhs} times at O(n²) each)")
Part 4 - Decompositions: When to Use Each
Decision guide
import numpy as np
# ── SVD: the most general decomposition ────────────────────────────────
def svd_examples():
np.random.seed(42)
A = np.random.randn(6, 4)
# Full vs economy SVD
U_full, s, Vt_full = np.linalg.svd(A, full_matrices=True) # U: (6,6), Vt: (4,4)
U_econ, s, Vt_econ = np.linalg.svd(A, full_matrices=False) # U: (6,4), Vt: (4,4)
# Reconstruction
Sigma = np.diag(s)
A_recon = U_econ @ Sigma @ Vt_econ
print(f"SVD reconstruction error: {np.linalg.norm(A - A_recon):.2e}")
# Singular values only (faster)
s_only = np.linalg.svd(A, compute_uv=False)
print(f"Singular values: {s_only.round(4)}")
# Rank from singular values
tol = 1e-10
rank = np.sum(s_only > tol)
print(f"Effective rank: {rank}")
svd_examples()
# ── Cholesky: 2x faster than LU for symmetric positive definite ────────
def cholesky_examples():
np.random.seed(42)
n = 200
A_pd = np.random.randn(n, n)
A_pd = A_pd.T @ A_pd + n * np.eye(n) # symmetric positive definite
import time
# Compare solve approaches for symmetric PD matrix
b = np.random.randn(n)
# LU-based solve (general)
t0 = time.perf_counter()
for _ in range(50):
x_lu = np.linalg.solve(A_pd, b)
t_lu = (time.perf_counter() - t0) / 50
# Cholesky-based solve (specialized, ~2x faster)
from scipy.linalg import cho_factor, cho_solve
t0 = time.perf_counter()
for _ in range(50):
c, low = cho_factor(A_pd)
x_cho = cho_solve((c, low), b)
t_cho = (time.perf_counter() - t0) / 50
print(f"\nLU solve: {t_lu*1000:.2f} ms")
print(f"Cho solve: {t_cho*1000:.2f} ms")
print(f"Results match: {np.allclose(x_lu, x_cho)}")
# Cholesky also used for sampling from multivariate Gaussian
# X = mu + L @ z, where L = cholesky(Sigma) and z ~ N(0, I)
mu = np.zeros(n)
Sigma = A_pd / n # scale to reasonable covariance
L = np.linalg.cholesky(Sigma)
z = np.random.randn(n, 1000) # 1000 standard normal samples
samples = mu[:, np.newaxis] + L @ z # (n, 1000) samples from N(mu, Sigma)
print(f"\nMultivariate Gaussian samples: {samples.shape}")
cholesky_examples()
# ── QR decomposition: orthonormal bases and linear regression ──────────
def qr_examples():
np.random.seed(42)
A = np.random.randn(8, 4) # tall matrix (overdetermined)
Q, R = np.linalg.qr(A)
print(f"\nQR: Q={Q.shape}, R={R.shape}")
print(f"Q orthonormal: {np.allclose(Q.T @ Q, np.eye(4))}")
# QR for least squares: Ax ≈ b → QR decomp → x = R⁻¹ Qᵀ b
b = np.random.randn(8)
x_qr = np.linalg.solve(R, Q.T @ b) # triangular solve
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]
print(f"QR vs lstsq match: {np.allclose(x_qr, x_lstsq)}")
# Gram-Schmidt via QR: orthonormal basis for column space of A
# Q columns are the orthonormal basis
print(f"Column space basis (Q): {Q.shape}") # (8, 4) - 4 orthonormal basis vectors
qr_examples()
Part 5 - SciPy linalg: Sparse Matrices and Advanced Factorizations
When NumPy is not enough:
scipy.linalg: more factorizations (Schur, Hessenberg, interpolative), Cholesky solve, LU with pre-factorizationscipy.sparse: sparse matrix formats (CSR, CSC, COO), sparse linear solversscipy.sparse.linalg: iterative solvers (conjugate gradient, GMRES, LOBPCG for eigenvalues)
import numpy as np
from scipy import linalg as sla
import scipy.sparse as sp
import scipy.sparse.linalg as spla
# ── Pre-factorize for multiple solves ─────────────────────────────────
np.random.seed(42)
n = 100
A = np.random.randn(n, n)
A = A.T @ A + 0.1 * np.eye(n) # symmetric positive definite
# LU factorization once, then solve many times
lu, piv = sla.lu_factor(A) # O(n³) - do once
b_vectors = [np.random.randn(n) for _ in range(20)]
solutions = [sla.lu_solve((lu, piv), b) for b in b_vectors] # O(n²) each
print(f"Pre-factored LU: solved {len(solutions)} systems")
# Cholesky factor once, solve many times (symmetric PD)
c, low = sla.cho_factor(A) # O(n³) once
solutions_cho = [sla.cho_solve((c, low), b) for b in b_vectors]
print(f"Cholesky solutions match LU: {np.allclose(solutions, solutions_cho)}")
# ── Sparse matrices ───────────────────────────────────────────────────
# Real-world matrices (graph Laplacians, FEM, NLP co-occurrence) are sparse
# Storing them dense would be wasteful - use sparse formats
n_sparse = 1000
density = 0.01 # 1% of entries are non-zero
# COO format: coordinate format (row, col, data)
rows = np.random.randint(0, n_sparse, size=int(n_sparse**2 * density))
cols = np.random.randint(0, n_sparse, size=int(n_sparse**2 * density))
data = np.random.randn(len(rows))
A_sparse = sp.coo_matrix((data, (rows, cols)), shape=(n_sparse, n_sparse))
# Convert to CSR (Compressed Sparse Row) for efficient matrix operations
A_csr = A_sparse.tocsr()
print(f"\nSparse matrix: {A_csr.shape}, nnz={A_csr.nnz}, density={A_csr.nnz/n_sparse**2:.4f}")
print(f"Dense size: {n_sparse**2 * 8 / 1e6:.1f} MB")
print(f"Sparse size: {(A_csr.data.nbytes + A_csr.indices.nbytes + A_csr.indptr.nbytes) / 1e6:.3f} MB")
# Sparse matrix-vector multiply (much faster than dense for low density)
b_sparse = np.random.randn(n_sparse)
x_sparse = A_csr @ b_sparse # uses sparse multiply
print(f"Sparse matvec: shape {x_sparse.shape}")
# Sparse symmetric eigenvalue problem (e.g., graph Laplacian)
# Build a symmetric sparse matrix for eigensolver
A_sym = A_sparse + A_sparse.T # make symmetric
A_sym_csr = A_sym.tocsr()
# LOBPCG or ARPACK for large sparse eigenproblems
# Get top-3 eigenvalues (in magnitude) without forming full matrix
k = 3
eigenvals_sparse, eigenvecs_sparse = spla.eigsh(
A_sym_csr,
k=k,
which='LM' # Largest Magnitude
)
print(f"\nSparse eigenvalues (top {k}): {eigenvals_sparse.round(4)}")
# ── Advanced SciPy factorizations ─────────────────────────────────────
A_sq = np.random.randn(5, 5)
# Schur decomposition: A = Q T Qᵀ where T is quasi-upper-triangular
T_schur, Z_schur = sla.schur(A_sq)
print(f"\nSchur: T quasi-triangular: T[1,0]={T_schur[1,0]:.6f}") # should be ~0
# Matrix functions via scipy
from scipy.linalg import expm, logm, sqrtm
# Matrix exponential - used in ODE solvers, graph neural networks
A_sym = A_sq + A_sq.T # symmetrize
eA = expm(A_sym)
print(f"Matrix exponential: shape {eA.shape}")
# Matrix square root - used in whitening transforms
A_pd = A_sym + 5 * np.eye(5) # ensure positive definite
sqrtA = sqrtm(A_pd)
print(f"sqrt(A) @ sqrt(A) ≈ A: {np.allclose(sqrtA @ sqrtA, A_pd, atol=1e-8)}")
Part 6 - Performance: Memory Layout, Vectorization, Timing
C-order vs Fortran-order: which is faster for your access pattern
import numpy as np
import time
n = 2000
n_trials = 5
# C-order: row-major (default in NumPy)
# Row elements are contiguous → row operations are cache-friendly
A_C = np.random.randn(n, n) # C-order by default
A_F = np.asfortranarray(A_C) # Fortran-order (column-major)
print("Memory layout performance:")
# Row sum: C-order is faster (rows are contiguous)
t0 = time.perf_counter()
for _ in range(n_trials):
r = A_C.sum(axis=1)
t_C = (time.perf_counter() - t0) / n_trials
t0 = time.perf_counter()
for _ in range(n_trials):
r = A_F.sum(axis=1)
t_F = (time.perf_counter() - t0) / n_trials
print(f" Row sum, C-order: {t_C*1000:.2f} ms")
print(f" Row sum, F-order: {t_F*1000:.2f} ms")
# Column sum: Fortran-order is faster (columns are contiguous)
t0 = time.perf_counter()
for _ in range(n_trials):
c = A_C.sum(axis=0)
t_C2 = (time.perf_counter() - t0) / n_trials
t0 = time.perf_counter()
for _ in range(n_trials):
c = A_F.sum(axis=0)
t_F2 = (time.perf_counter() - t0) / n_trials
print(f" Col sum, C-order: {t_C2*1000:.2f} ms")
print(f" Col sum, F-order: {t_F2*1000:.2f} ms")
# Matrix multiply: BLAS handles both orders efficiently
t0 = time.perf_counter()
for _ in range(n_trials):
_ = A_C @ A_C
t_mm = (time.perf_counter() - t0) / n_trials
print(f" MatMul (C-order): {t_mm*1000:.2f} ms")
# Contiguous check before heavy ops
def ensure_contiguous(A: np.ndarray) -> np.ndarray:
"""Make array C-contiguous if it is not already."""
if not A.flags['C_CONTIGUOUS']:
return np.ascontiguousarray(A)
return A
# Transpose creates a non-contiguous view - may slow subsequent ops
A_T = A_C.T # non-contiguous!
print(f"\nA.T is C-contiguous: {A_T.flags['C_CONTIGUOUS']}")
print(f"A.T is F-contiguous: {A_T.flags['F_CONTIGUOUS']}")
# Force contiguous copy when needed
A_T_contig = np.ascontiguousarray(A_T)
print(f"After ascontiguousarray: {A_T_contig.flags['C_CONTIGUOUS']}")
Vectorizing ML patterns
import numpy as np
import time
np.random.seed(42)
# Pattern 1: Pairwise distances - the common vectorization trick
n_queries, n_docs, d = 100, 500, 256
Q = np.random.randn(n_queries, d)
D = np.random.randn(n_docs, d)
# SLOW: nested Python loop
def pairwise_dist_loop(Q, D):
n, m = len(Q), len(D)
dists = np.zeros((n, m))
for i in range(n):
for j in range(m):
dists[i, j] = np.sqrt(np.sum((Q[i] - D[j])**2))
return dists
# FAST: vectorized broadcasting
def pairwise_dist_vectorized(Q, D):
# ‖q - d‖² = ‖q‖² + ‖d‖² - 2 q·d
Q_sq = (Q**2).sum(axis=1, keepdims=True) # (n_q, 1)
D_sq = (D**2).sum(axis=1, keepdims=True) # (n_d, 1)
cross = Q @ D.T # (n_q, n_d)
return np.sqrt(np.maximum(Q_sq + D_sq.T - 2 * cross, 0.0))
t0 = time.perf_counter()
dists_loop = pairwise_dist_loop(Q[:20], D[:50]) # smaller for timing
t_loop = time.perf_counter() - t0
t0 = time.perf_counter()
dists_vec = pairwise_dist_vectorized(Q, D) # full size
t_vec = time.perf_counter() - t0
print(f"Loop (20x50): {t_loop*1000:.1f} ms")
print(f"Vectorized (100x500): {t_vec*1000:.1f} ms (full problem!)")
print(f"Results match: {np.allclose(dists_loop, dists_vec[:20, :50])}")
# Pattern 2: Batch covariance computation
def batch_covariance(X: np.ndarray) -> np.ndarray:
"""Compute covariance matrix of X: (n, d) → (d, d)."""
X_c = X - X.mean(axis=0)
return X_c.T @ X_c / (len(X) - 1)
# Pattern 3: Gram matrix (inner products of all pairs)
def gram_matrix(X: np.ndarray) -> np.ndarray:
"""Compute Gram (kernel) matrix: K_ij = x_i · x_j."""
return X @ X.T # (n, n)
# Pattern 4: Normalizing rows to unit length (for cosine similarity)
def normalize_rows(X: np.ndarray) -> np.ndarray:
norms = np.linalg.norm(X, axis=1, keepdims=True)
return X / np.maximum(norms, 1e-10) # avoid division by zero
print("\nML pattern shapes:")
X = np.random.randn(200, 50)
print(f"Covariance: {batch_covariance(X).shape}") # (50, 50)
print(f"Gram: {gram_matrix(X).shape}") # (200, 200)
print(f"Normalized: {normalize_rows(X).shape}") # (200, 50)
print(f"Normalized norms: {np.linalg.norm(normalize_rows(X), axis=1)[:3].round(4)}") # all 1.0
Part 7 - Numerical Stability: Condition Number, Cancellation, Precision
Condition number
The condition number κ(A) measures how much the solution to Ax = b can change when b has small perturbations:
κ(A) = ‖A‖ · ‖A⁻¹‖ = σ_max / σ_min (for L2 norm)
A rule of thumb: if κ(A) ≈ 10^k, you may lose up to k digits of accuracy in the solution. For float64 (≈15 decimal digits), a matrix with κ ≈ 10^12 gives roughly 3 meaningful digits in the solution.
import numpy as np
# Condition number examples
def condition_analysis(A: np.ndarray, b: np.ndarray, name: str) -> None:
"""Analyze numerical stability of solving Ax = b."""
cond = np.linalg.cond(A)
x = np.linalg.solve(A, b)
residual = np.linalg.norm(A @ x - b)
# Perturb b slightly - how much does solution change?
delta_b = 1e-6 * np.random.randn(len(b))
x_perturbed = np.linalg.solve(A, b + delta_b)
solution_change = np.linalg.norm(x_perturbed - x) / np.linalg.norm(x)
rhs_change = np.linalg.norm(delta_b) / np.linalg.norm(b)
print(f"\n{name}:")
print(f" Condition number: {cond:.2e}")
print(f" Residual: {residual:.2e}")
print(f" RHS perturbation: {rhs_change:.2e}")
print(f" Solution change: {solution_change:.2e}")
print(f" Sensitivity ratio: {solution_change/rhs_change:.1f} (≤ κ = {cond:.2e})")
np.random.seed(42)
n = 8
b = np.random.randn(n)
# Well-conditioned
A_good = np.eye(n) + 0.1 * np.random.randn(n, n)
A_good = A_good.T @ A_good + n * np.eye(n)
condition_analysis(A_good, b, "Well-conditioned (κ ≈ 10)")
# Ill-conditioned: Hilbert matrix
def hilbert(n: int) -> np.ndarray:
"""Hilbert matrix: H_ij = 1/(i+j+1). Notorious for ill-conditioning."""
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing='ij')
return 1.0 / (i + j + 1)
H = hilbert(n)
condition_analysis(H, b, "Hilbert matrix (ill-conditioned)")
# Detecting ill-conditioning before solving
def safe_solve(A: np.ndarray, b: np.ndarray, max_cond: float = 1e10) -> np.ndarray:
"""Solve Ax = b with condition number check."""
cond = np.linalg.cond(A)
if cond > max_cond:
import warnings
warnings.warn(
f"Condition number {cond:.2e} > {max_cond:.2e}. "
f"Solution may be inaccurate. Consider regularization.",
RuntimeWarning
)
return np.linalg.solve(A, b)
Catastrophic cancellation and float32 vs float64
import numpy as np
# Catastrophic cancellation: subtracting nearly equal numbers
x = np.float64(1e15)
y = np.float64(1e15 + 1)
diff = y - x
print(f"1e15 + 1 - 1e15 = {diff}") # should be 1.0, might be 0.0 or wrong
# Numerically stable operations: use library functions that avoid cancellation
# Example: log(1 + x) for small x
x_small = 1e-10
log_unstable = np.log(1 + x_small) # may suffer cancellation
log_stable = np.log1p(x_small) # specifically designed for small x
print(f"\nlog(1+1e-10):")
print(f" np.log(1+x): {log_unstable:.16f}")
print(f" np.log1p(x): {log_stable:.16f}")
# Stable softmax (subtract max before exp)
def softmax_unstable(x: np.ndarray) -> np.ndarray:
"""Will overflow for large x values."""
return np.exp(x) / np.exp(x).sum()
def softmax_stable(x: np.ndarray) -> np.ndarray:
"""Numerically stable: subtract max first."""
x_shifted = x - x.max() # max becomes 0 → max exp = 1 → no overflow
exp_x = np.exp(x_shifted)
return exp_x / exp_x.sum()
x_large = np.array([1000.0, 1001.0, 999.0])
print(f"\nSoftmax on large values:")
try:
result_unstable = softmax_unstable(x_large)
print(f" Unstable: {result_unstable}")
except (RuntimeWarning, FloatingPointError):
print(" Unstable: overflow!")
result_stable = softmax_stable(x_large)
print(f" Stable: {result_stable}")
# float32 vs float64 precision
x32 = np.float32(1.0) / np.float32(3.0)
x64 = np.float64(1.0) / np.float64(3.0)
print(f"\n1/3 as float32: {x32:.15f}") # ~7 significant digits
print(f"1/3 as float64: {x64:.15f}") # ~15 significant digits
# When float32 causes problems: ill-conditioned matrix inversion
A = np.random.randn(50, 50)
A = A.T @ A
A32 = A.astype(np.float32)
A64 = A.astype(np.float64)
cond = np.linalg.cond(A64)
b = np.random.randn(50)
x32 = np.linalg.solve(A32, b.astype(np.float32))
x64 = np.linalg.solve(A64, b)
err32 = np.linalg.norm(A64 @ x32.astype(np.float64) - b)
err64 = np.linalg.norm(A64 @ x64 - b)
print(f"\ncond(A) = {cond:.2e}")
print(f"float32 residual: {err32:.2e}")
print(f"float64 residual: {err64:.2e}")
# For well-conditioned matrices, float32 is fine; ill-conditioned → use float64
# Rule: use float64 for linear system solving and decompositions
# use float32 for neural network forward/backward (GPU speed advantage)
Part 8 - Common ML Patterns from Scratch
import numpy as np
# ── Gram matrix (kernel matrix) ─────────────────────────────────────────
def gram_matrix(X: np.ndarray) -> np.ndarray:
"""
K_ij = xᵢ · xⱼ (linear kernel)
For RBF kernel: K_ij = exp(-‖xᵢ - xⱼ‖² / 2σ²)
"""
return X @ X.T # (n, n)
def rbf_kernel(X: np.ndarray, gamma: float = 1.0) -> np.ndarray:
"""Radial Basis Function (Gaussian) kernel matrix."""
# ‖xᵢ - xⱼ‖² = ‖xᵢ‖² + ‖xⱼ‖² - 2 xᵢ·xⱼ
sq_norms = (X**2).sum(axis=1)
K = sq_norms[:, np.newaxis] + sq_norms[np.newaxis, :] - 2 * (X @ X.T)
return np.exp(-gamma * np.maximum(K, 0.0))
# ── Covariance matrix ───────────────────────────────────────────────────
def covariance_matrix(X: np.ndarray, bias: bool = False) -> np.ndarray:
"""
Σ = Xc^T Xc / (n - 1) (unbiased, bias=False)
Σ = Xc^T Xc / n (biased/MLE, bias=True)
"""
X_c = X - X.mean(axis=0)
n = X.shape[0]
divisor = n if bias else n - 1
return X_c.T @ X_c / divisor
# ── Whitening transform ─────────────────────────────────────────────────
def zca_whiten(X: np.ndarray, eps: float = 1e-5) -> np.ndarray:
"""
ZCA whitening: transform data to have identity covariance.
Steps:
1. Compute covariance: Σ
2. Eigendecompose: Σ = Q Λ Qᵀ
3. Whitening matrix: W = Q Λ^(-1/2) Qᵀ
4. Return X_w = X_c @ W
ZCA whitening preserves the original orientation (unlike PCA whitening).
Used in image preprocessing, ICA, some normalization schemes.
"""
X_c = X - X.mean(axis=0)
cov = X_c.T @ X_c / (len(X) - 1)
eigenvals, Q = np.linalg.eigh(cov)
eigenvals = np.maximum(eigenvals, 0.0) # guard against numerical negatives
# Whitening matrix: Σ^(-1/2) = Q Λ^(-1/2) Qᵀ
D_inv_sqrt = np.diag(1.0 / np.sqrt(eigenvals + eps))
W = Q @ D_inv_sqrt @ Q.T
return X_c @ W
def pca_whiten(X: np.ndarray, n_components: int = None, eps: float = 1e-5) -> np.ndarray:
"""
PCA whitening: decorrelate features AND reduce to n_components dimensions.
Used in preprocessing for ICA, some generative models.
"""
X_c = X - X.mean(axis=0)
cov = X_c.T @ X_c / (len(X) - 1)
eigenvals, Q = np.linalg.eigh(cov)
# Sort descending
idx = np.argsort(eigenvals)[::-1]
eigenvals = eigenvals[idx]
Q = Q[:, idx]
if n_components is not None:
Q = Q[:, :n_components]
eigenvals = eigenvals[:n_components]
# Project and scale by 1/sqrt(eigenvalue)
X_pca = X_c @ Q # project
X_white = X_pca / np.sqrt(eigenvals + eps) # scale to unit variance
return X_white
# ── Rotation matrices ───────────────────────────────────────────────────
def rotation_matrix_2d(theta: float) -> np.ndarray:
"""2D rotation matrix by angle theta (radians)."""
c, s = np.cos(theta), np.sin(theta)
return np.array([[c, -s],
[s, c]])
def random_rotation_matrix(n: int) -> np.ndarray:
"""
Uniformly random rotation matrix in n dimensions.
Uses QR decomposition of a random Gaussian matrix.
"""
A = np.random.randn(n, n)
Q, R = np.linalg.qr(A)
# Ensure proper rotation (det = +1, not -1)
Q = Q * np.sign(np.diag(R))
return Q
# ── Projection matrices ─────────────────────────────────────────────────
def projection_onto_subspace(A: np.ndarray) -> np.ndarray:
"""
Orthogonal projection matrix onto column space of A.
P = A (AᵀA)⁻¹ Aᵀ (if columns of A are linearly independent)
P = A Aᵀ (if columns of A are orthonormal)
"""
# Use QR for numerical stability (instead of forming (AᵀA)⁻¹)
Q, R = np.linalg.qr(A)
return Q @ Q.T # projection matrix
# ── Tests ───────────────────────────────────────────────────────────────
np.random.seed(42)
n, d = 100, 10
X = np.random.randn(n, d)
# Test whitening: covariance of whitened data ≈ Identity
X_white = zca_whiten(X)
cov_white = covariance_matrix(X_white)
print("ZCA whitening:")
print(f" Max off-diagonal: {np.max(np.abs(cov_white - np.eye(d))):.4f} (should be ~0)")
# Test rotation: orthogonal
R = random_rotation_matrix(5)
print(f"\nRandom rotation matrix:")
print(f" Orthogonal (RᵀR = I): {np.allclose(R.T @ R, np.eye(5))}")
print(f" det = +1: {np.isclose(np.linalg.det(R), 1.0)}")
# Test projection: idempotent (P² = P)
P = projection_onto_subspace(X[:, :3]) # project onto first 3 feature directions
print(f"\nProjection matrix:")
print(f" P² = P (idempotent): {np.allclose(P @ P, P)}")
print(f" Pᵀ = P (symmetric): {np.allclose(P.T, P)}")
Part 9 - PyTorch torch.linalg: GPU-Accelerated Equivalent
# torch.linalg mirrors np.linalg almost exactly - same function names,
# GPU acceleration, and gradient support for automatic differentiation.
try:
import torch
import numpy as np
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Device: {device}")
A = torch.randn(4, 4, device=device)
A = A.T @ A + torch.eye(4, device=device) # symmetric positive definite
# ── torch.linalg reference ──────────────────────────────────────────
# Determinant
det = torch.linalg.det(A)
slogdet_sign, slogdet_val = torch.linalg.slogdet(A) # numerically stable
print(f"\ndet(A) = {det.item():.4f}")
print(f"slogdet: sign={slogdet_sign.item():.0f}, log|det|={slogdet_val.item():.4f}")
# Matrix norm
frob = torch.linalg.matrix_norm(A, ord='fro')
nuc = torch.linalg.matrix_norm(A, ord='nuc')
spec = torch.linalg.matrix_norm(A, ord=2)
print(f"Norms: Frobenius={frob.item():.4f}, Nuclear={nuc.item():.4f}, Spectral={spec.item():.4f}")
# Solve
b = torch.randn(4, device=device)
x = torch.linalg.solve(A, b)
print(f"Solve residual: {torch.linalg.norm(A @ x - b).item():.2e}")
# SVD
U, S, Vh = torch.linalg.svd(A, full_matrices=False)
print(f"SVD: U={U.shape}, S={S.shape}, Vh={Vh.shape}")
# Eigenvalues (symmetric)
eigenvals = torch.linalg.eigvalsh(A) # symmetric → eigh equivalent
print(f"Eigenvalues (symmetric): {eigenvals.cpu().numpy().round(4)}")
# QR
Q, R = torch.linalg.qr(A)
print(f"QR: Q={Q.shape}, R={R.shape}")
print(f"Q orthonormal: {torch.allclose(Q.T @ Q, torch.eye(4, device=device), atol=1e-5)}")
# Cholesky
L = torch.linalg.cholesky(A)
print(f"Cholesky: L={L.shape}")
# ── Gradient support through linalg ops ────────────────────────────
A_grad = torch.randn(3, 3, requires_grad=True)
A_pd = A_grad.T @ A_grad + torch.eye(3)
# Gradient flows through SVD
U2, S2, Vh2 = torch.linalg.svd(A_pd, full_matrices=False)
loss = S2.sum() # some loss involving singular values
loss.backward()
print(f"\nGradient through SVD: A_grad.grad shape = {A_grad.grad.shape}")
# ── When to use torch vs numpy ──────────────────────────────────────
print("\ntorch.linalg vs np.linalg:")
print(" Use torch when: on GPU, need gradients, inside a PyTorch model")
print(" Use numpy when: on CPU only, no gradient needed, scipy interop")
print(" Conversion: np_arr = t.detach().cpu().numpy()")
print(" t = torch.from_numpy(np_arr).to(device)")
# Performance: GPU vs CPU for large matrix
n_large = 1000
A_large_np = np.random.randn(n_large, n_large)
A_large_np = A_large_np.T @ A_large_np + n_large * np.eye(n_large)
import time
t0 = time.perf_counter()
U_np, s_np, Vt_np = np.linalg.svd(A_large_np)
t_np = time.perf_counter() - t0
print(f"\nSVD {n_large}x{n_large}: NumPy = {t_np*1000:.1f} ms")
A_large_t = torch.from_numpy(A_large_np).float()
t0 = time.perf_counter()
U_t, S_t, Vh_t = torch.linalg.svd(A_large_t, full_matrices=False)
t_torch_cpu = time.perf_counter() - t0
print(f"SVD {n_large}x{n_large}: PyTorch CPU = {t_torch_cpu*1000:.1f} ms")
if torch.cuda.is_available():
A_gpu = A_large_t.cuda()
torch.cuda.synchronize()
t0 = time.perf_counter()
U_g, S_g, Vh_g = torch.linalg.svd(A_gpu, full_matrices=False)
torch.cuda.synchronize()
t_gpu = time.perf_counter() - t0
print(f"SVD {n_large}x{n_large}: PyTorch GPU = {t_gpu*1000:.1f} ms")
print(f"GPU speedup: {t_np/t_gpu:.1f}x over NumPy")
except ImportError:
print("PyTorch not installed. Install with: pip install torch")
print("\ntorch.linalg function equivalents to np.linalg:")
torch_numpy_equiv = [
("np.linalg.solve(A, b)", "torch.linalg.solve(A, b)"),
("np.linalg.svd(A)", "torch.linalg.svd(A)"),
("np.linalg.eig(A)", "torch.linalg.eig(A)"),
("np.linalg.eigh(A)", "torch.linalg.eigh(A)"),
("np.linalg.eigvalsh(A)", "torch.linalg.eigvalsh(A)"),
("np.linalg.qr(A)", "torch.linalg.qr(A)"),
("np.linalg.cholesky(A)", "torch.linalg.cholesky(A)"),
("np.linalg.norm(A, 'fro')", "torch.linalg.matrix_norm(A, 'fro')"),
("np.linalg.norm(v)", "torch.linalg.vector_norm(v)"),
("np.linalg.det(A)", "torch.linalg.det(A)"),
("np.linalg.inv(A)", "torch.linalg.inv(A)"),
("np.linalg.pinv(A)", "torch.linalg.pinv(A)"),
("np.linalg.cond(A)", "torch.linalg.cond(A)"),
("np.linalg.matrix_rank(A)", "torch.linalg.matrix_rank(A)"),
]
for np_fn, torch_fn in torch_numpy_equiv:
print(f" {np_fn:<35} → {torch_fn}")
Interview Questions
Q1: What is the condition number and why does it matter for numerical computation?
The condition number of a matrix A (with respect to a given norm) is:
κ(A) = ‖A‖ · ‖A⁻¹‖
For the L2 norm and a non-singular matrix:
κ₂(A) = σ_max / σ_min
where σ_max and σ_min are the largest and smallest singular values.
What it measures: The condition number is the worst-case factor by which relative errors in the input (b) are amplified in the output (x) when solving Ax = b:
‖Δx‖/‖x‖ ≤ κ(A) · ‖Δb‖/‖b‖
A rule of thumb: if κ(A) ≈ 10^k, solving Ax = b in float64 (≈15 digits of precision) gives approximately 15 - k correct significant digits.
ML implications:
- Ridge regression: the αI term in (XᵀX + αI) lowers the condition number from σ_max(XᵀX)/σ_min(XᵀX) to (σ_max + α)/(σ_min + α), improving numerical stability
- Optimization: the condition number of the Hessian determines convergence speed of gradient descent - high condition number (elongated loss landscape) leads to slow convergence; Adam adapts per-parameter learning rates to effectively precondition the problem
- Feature scaling: StandardScaler reduces the condition number of XᵀX by making features comparable in scale, improving numerical stability and optimization speed
Check condition number before solving: cond = np.linalg.cond(A). If it exceeds 10^12 for float64 (or 10^6 for float32), consider regularization or a different numerical approach.
Q2: Explain the difference between C-order and Fortran-order memory layout in NumPy.
C-order (row-major): Elements of each row are stored contiguously in memory.
A = [[1, 2, 3], Memory: [1, 2, 3, 4, 5, 6]
[4, 5, 6]] (row 0, then row 1)
Fortran-order (column-major): Elements of each column are stored contiguously.
A = [[1, 2, 3], Memory: [1, 4, 2, 5, 3, 6]
[4, 5, 6]] (col 0, then col 1, then col 2)
Performance implications:
NumPy creates C-order arrays by default. Row operations (summing along axis=1, iterating over rows) are cache-friendly for C-order because each row's elements are adjacent in memory - the CPU fetches them in cache lines efficiently. Column operations on C-order arrays cause strided access and cache misses.
BLAS (called by np.dot, @) handles both orders internally and passes layout information to the LAPACK routine, so A @ B is efficient regardless of the layout.
When it matters in practice:
- After
A.T, the result is F-order (or viewed as F-order without copying). Callingreshapeon it may trigger an unexpected copy. np.ascontiguousarray(A)forces a copy to C-order if needed.- PyTorch defaults to C-order (contiguous in PyTorch terminology). After
.permute()or.transpose(), call.contiguous()before.view().
For ML: the main practical consequence is that after transposing or permuting tensor axes, you may need to explicitly call .contiguous() (PyTorch) or np.ascontiguousarray() (NumPy) before reshaping, to avoid RuntimeErrors or unexpected copies.
Q3: When should you use scipy.linalg instead of numpy.linalg?
Use scipy.linalg instead of (or in addition to) numpy.linalg in four situations:
-
Pre-factorization for multiple solves:
scipy.linalg.lu_factor+lu_solvelets you factorize once (O(n³)) and solve multiple systems cheaply (O(n²) each). NumPy has no equivalent -np.linalg.solvere-factorizes each call. -
Cholesky solve:
scipy.linalg.cho_factor+cho_solvefor symmetric positive definite matrices. Cholesky is roughly 2x faster than LU and guaranteed to be positive - NumPy lacks this pre-factorization workflow. -
Sparse matrices:
scipy.sparseprovides sparse matrix formats (CSR, CSC, COO) andscipy.sparse.linalgprovides iterative solvers (GMRES, CG, MINRES) and sparse eigensolvers (LOBPCG, ARPACK viaeigsh). For matrices with millions of rows but only a few non-zeros per row (graph Laplacians, FEM stiffness matrices), these are essential. -
Additional matrix functions:
scipy.linalg.expm(matrix exponential),logm,sqrtm,schur, and the Hessenberg decomposition are not innumpy.linalg.
For most day-to-day single-solve operations, numpy.linalg is sufficient. For production ML code with repeated linear system solving (Kalman filters, Gaussian processes, kernel methods), scipy.linalg pre-factorization can give 10-50x speedups.
Q4: What is the relationship between np.linalg and torch.linalg?
torch.linalg (introduced in PyTorch 1.9) was explicitly designed to mirror numpy.linalg in API, with two key additions:
API equivalence: Almost every np.linalg function has a direct torch.linalg counterpart with the same name and argument order: solve, svd, eig, eigh, qr, cholesky, inv, pinv, det, norm, matrix_rank, cond.
Key differences:
-
Gradient support:
torch.linalgoperations participate in autograd. Gradients flow through SVD, eigendecomposition, Cholesky, and solve - enabling gradient-based optimization over quantities that depend on these decompositions (e.g., nuclear norm minimization, optimizing over rotation matrices). -
Device-agnostic: tensors can live on CPU or GPU. The same code path works on both - dispatch to cuBLAS/cuSOLVER on GPU, BLAS/LAPACK on CPU.
-
Norm API difference:
np.linalg.norm(v)handles both vectors and matrices with theordparameter. PyTorch separates this intotorch.linalg.vector_norm(v)andtorch.linalg.matrix_norm(A).
When to use which:
- Inside a PyTorch model or loss function: always use
torch.linalg(gradients, GPU) - Preprocessing, offline analysis, sklearn integration: use
numpy.linalg - Conversion:
t.detach().cpu().numpy()for torch → numpy;torch.from_numpy(arr)for numpy → torch (shares memory on CPU)
Practice Challenges
Level 1: Predict
Challenge: Predict the output of each call:
A = np.array([[1.0, 2.0],
[3.0, 4.0]])
print(np.linalg.norm(A, 'fro'))
print(np.linalg.norm(A.flatten()))
print(np.trace(A))
print(np.linalg.det(A))
Answer
Frobenius norm: sqrt(1² + 2² + 3² + 4²) = sqrt(1+4+9+16) = sqrt(30) ≈ 5.4772
L2 of flatten: sqrt(1+4+9+16) = sqrt(30) ≈ 5.4772 (identical to Frobenius)
trace(A): 1 + 4 = 5 (sum of diagonal)
det(A): 1*4 - 2*3 = 4 - 6 = -2
The Frobenius norm equals the L2 norm of the vectorized matrix - this is confirmed by the first two lines giving identical results.
The negative determinant means A has one negative eigenvalue (and one positive). Sum of eigenvalues = trace = 5, product = det = -2. Eigenvalues: solve λ² - 5λ - 2 = 0 → λ = (5 ± sqrt(33)) / 2 ≈ {5.37, -0.37}.
Level 2: Debug
Challenge: Find the bug in this Ridge regression implementation:
def ridge_predict(X_train, y_train, X_test, alpha=1.0):
n, d = X_train.shape
w = np.linalg.inv(X_train.T @ X_train + alpha * np.eye(n)) @ X_train.T @ y_train
return X_test @ w
Answer
Bug: The identity matrix has size n (number of samples) instead of d (number of features).
Ridge regression adds αI to XᵀX, which is a d×d matrix. The code uses np.eye(n) (n×n) - this causes a shape mismatch when n ≠ d.
# Wrong: np.eye(n) - size is wrong
w = np.linalg.inv(X_train.T @ X_train + alpha * np.eye(n)) @ ...
# Fix 1: use np.eye(d)
def ridge_predict_fixed_v1(X_train, y_train, X_test, alpha=1.0):
n, d = X_train.shape
w = np.linalg.inv(X_train.T @ X_train + alpha * np.eye(d)) @ X_train.T @ y_train
return X_test @ w
# Fix 2: use solve instead of inv (better numerics)
def ridge_predict_fixed_v2(X_train, y_train, X_test, alpha=1.0):
n, d = X_train.shape
A = X_train.T @ X_train + alpha * np.eye(d) # (d, d)
b = X_train.T @ y_train # (d,)
w = np.linalg.solve(A, b) # never use inv!
return X_test @ w
# Verify
np.random.seed(42)
X_tr = np.random.randn(20, 5)
y_tr = np.random.randn(20)
X_te = np.random.randn(10, 5)
preds = ridge_predict_fixed_v2(X_tr, y_tr, X_te, alpha=0.1)
print(f"Predictions shape: {preds.shape}") # (10,) - correct
Level 3: Design
Challenge: Implement a numerically stable Gaussian Process regression from scratch using Cholesky decomposition and log marginal likelihood.
Answer
import numpy as np
from scipy.linalg import cho_factor, cho_solve
class GaussianProcessRegressor:
"""
Gaussian Process Regression with RBF kernel.
Uses Cholesky decomposition for numerical stability and efficiency.
"""
def __init__(self, length_scale: float = 1.0, noise_var: float = 1e-4):
self.length_scale = length_scale
self.noise_var = noise_var
self._chol_factor = None
self.X_train = None
self.y_train = None
def _rbf_kernel(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
"""K(x,x') = exp(-‖x - x'‖² / (2 * l²))"""
sq_norms = (
(X1**2).sum(axis=1, keepdims=True) # (n1, 1)
+ (X2**2).sum(axis=1, keepdims=True).T # (1, n2)
- 2 * X1 @ X2.T # (n1, n2)
)
return np.exp(-np.maximum(sq_norms, 0.0) / (2 * self.length_scale**2))
def fit(self, X: np.ndarray, y: np.ndarray) -> 'GaussianProcessRegressor':
"""
Fit GP: compute and factorize K + noise_var * I using Cholesky.
O(n³) for factorization, O(n²) for each subsequent predict call.
"""
self.X_train = X.copy()
self.y_train = y.copy()
n = len(X)
# Kernel matrix + noise diagonal
K = self._rbf_kernel(X, X)
K_noisy = K + self.noise_var * np.eye(n)
# Cholesky factorization: K_noisy = L Lᵀ
# More stable than LU for positive definite matrices
self._chol_factor = cho_factor(K_noisy)
# Pre-compute K⁻¹ y = L⁻ᵀ L⁻¹ y for prediction
self._alpha = cho_solve(self._chol_factor, y) # (n,)
return self
def predict(
self, X_test: np.ndarray, return_std: bool = False
) -> tuple:
"""
Predict mean and optional standard deviation.
O(n²) per call after O(n³) fit.
"""
# Cross-covariance: K_* = K(X_test, X_train)
K_star = self._rbf_kernel(X_test, self.X_train) # (n_test, n_train)
# Posterior mean: μ* = K_* K⁻¹ y = K_* alpha
mean = K_star @ self._alpha
if not return_std:
return mean
# Posterior variance: Var* = K(X_*, X_*) - K_* K⁻¹ K_*ᵀ
K_star_star = self._rbf_kernel(X_test, X_test) # (n_test, n_test)
# K⁻¹ K_*ᵀ via Cholesky solve - avoids forming explicit inverse
V = cho_solve(self._chol_factor, K_star.T) # (n_train, n_test)
var = np.diag(K_star_star) - np.einsum('ij,ji->i', K_star, V)
var = np.maximum(var, 0.0) # numerical safety: clamp negative variances
return mean, np.sqrt(var)
def log_marginal_likelihood(self) -> float:
"""
log p(y | X, θ) = -0.5 [yᵀ K⁻¹ y + log|K| + n log(2π)]
Using slogdet via Cholesky: log|K| = 2 * Σᵢ log(Lᵢᵢ)
"""
n = len(self.y_train)
L = self._chol_factor[0] # lower triangular Cholesky factor
# log|K| = 2 * sum of log diagonals of L
log_det_K = 2.0 * np.sum(np.log(np.diag(np.abs(L))))
data_fit = 0.5 * (self.y_train @ self._alpha)
complexity = 0.5 * log_det_K
constant = 0.5 * n * np.log(2 * np.pi)
return -(data_fit + complexity + constant)
# Test
np.random.seed(42)
n_train, n_test = 30, 100
X_train = np.linspace(0, 5, n_train).reshape(-1, 1)
y_train = np.sin(X_train.ravel()) + 0.1 * np.random.randn(n_train)
X_test = np.linspace(-0.5, 5.5, n_test).reshape(-1, 1)
gp = GaussianProcessRegressor(length_scale=1.0, noise_var=0.01)
gp.fit(X_train, y_train)
mean, std = gp.predict(X_test, return_std=True)
lml = gp.log_marginal_likelihood()
print(f"GP predictions shape: {mean.shape}") # (100,)
print(f"GP stdev shape: {std.shape}") # (100,)
print(f"Log marginal likelihood: {lml:.4f}")
print(f"Max prediction error: {np.max(np.abs(mean - np.sin(X_test.ravel()))):.4f}")
Complete np.linalg and torch.linalg Cheatsheet
| Operation | np.linalg | torch.linalg | Notes |
|---|---|---|---|
| Solve Ax=b | solve(A, b) | solve(A, b) | Prefer over inv |
| Least squares | lstsq(A, b) | lstsq(A, b) | Overdetermined systems |
| Inverse | inv(A) | inv(A) | Avoid - use solve |
| Pseudoinverse | pinv(A) | pinv(A) | SVD-based |
| Determinant | det(A) | det(A) | Use slogdet for stability |
| Log-det | slogdet(A) | slogdet(A) | Avoids overflow |
| Condition # | cond(A) | cond(A) | σ_max/σ_min |
| Rank | matrix_rank(A) | matrix_rank(A) | Uses SVD |
| SVD | svd(A) | svd(A) | full_matrices kwarg |
| Eig (general) | eig(A) | eig(A) | May be complex |
| Eig (symmetric) | eigh(A) | eigh(A) | Real, ascending |
| QR | qr(A) | qr(A) | Orthonormal basis |
| Cholesky | cholesky(A) | cholesky(A) | Symmetric PD only |
| Vector norm | norm(v) | vector_norm(v) | Default L2 |
| Matrix norm | norm(A, 'fro') | matrix_norm(A, 'fro') | fro, nuc, 2 |
Key Takeaways
- NumPy arrays use contiguous memory with BLAS dispatch - they are 100-1000x faster than Python lists for numerical computation
- Always use
np.linalg.solve(A, b)instead ofnp.linalg.inv(A) @ b- it is faster and more numerically stable - Use
np.linalg.eighfor symmetric matrices - it returns real eigenvalues in ascending order, is faster, and more stable thaneig - The condition number κ(A) = σ_max/σ_min measures numerical sensitivity - large condition number means small perturbations in b cause large changes in the solution
- Use
scipy.linalg.lu_factorandcho_factorwhen solving multiple systems with the same matrix - factorize O(n³) once, solve O(n²) many times - For float32 neural network weights, prefer float64 for any linear algebra operations (decompositions, covariance, kernel matrices) to avoid catastrophic cancellation
torch.linalgmirrorsnp.linalgexactly but adds GPU acceleration and gradient support - switch to it inside any PyTorch model
Back to: Module Overview →
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Matrix Transformations in 3D demo on the EngineersOfAI Playground - no code required.
:::
