Linear Algebra in NumPy - A Complete Engineering Reference

Reading time: ~25 minutes | Level: Mathematical Foundations → ML Engineering

NumPy is the linear algebra engine underneath sklearn, PyTorch (CPU ops), TensorFlow, and JAX. Knowing NumPy linear algebra means you can implement - and debug - any ML algorithm from scratch.

Every major ML framework dispatches to BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage) under the hood. NumPy exposes this through its np.linalg module. SciPy extends it with sparse matrices and advanced factorizations. PyTorch mirrors it in torch.linalg for GPU execution and gradient support.

This lesson is a complete engineering reference: every function, every performance gotcha, every numerical stability trap, and the ML patterns you will reach for repeatedly.

What You Will Learn

NumPy array internals vs Python lists - what makes NumPy fast
The complete np.linalg module: every function with ML context
Solving linear systems correctly: np.linalg.solve not np.linalg.inv
Computing decompositions: SVD, eigenvalue, QR, Cholesky - when to use each
SciPy linalg: sparse matrices and advanced factorizations
Performance: memory layout, vectorization, avoiding loops, timing
Numerical stability: condition number, catastrophic cancellation, float32 vs float64
Common ML patterns from scratch: Gram matrix, covariance, whitening, rotations
PyTorch torch.linalg: the GPU-accelerated equivalent

Prerequisites

Part 1 - NumPy Array vs Python List: Why Performance Differs

Memory layout is the fundamental difference

A Python list stores pointers to Python objects scattered across heap memory:

list: [ptr→obj₁, ptr→obj₂, ptr→obj₃, ...]
      Each ptr is 8 bytes. Each obj is ≥28 bytes (Python float). Scattered in memory.

A NumPy array stores raw values in a single contiguous block of memory:

array: [val₁, val₂, val₃, ...]
       Each val is 4 bytes (float32) or 8 bytes (float64). Sequential in memory.

This difference has profound performance consequences:

Cache efficiency: sequential memory access is cache-friendly - the CPU prefetches upcoming values automatically. Pointer-following (Python lists) causes cache misses.
SIMD vectorization: AVX-512 can process 16 float32s in one instruction, but only on contiguous memory.
BLAS dispatch: A @ B in NumPy calls BLAS dgemm - a decades-optimized routine. There is no Python loop anywhere in the call.
No GIL release needed: NumPy releases the Python GIL during heavy computation, enabling multi-threaded BLAS.

import numpy as np
import time

n = 1_000_000

# Memory footprint comparison
py_list = list(range(n))
np_array = np.arange(n, dtype=np.float64)

import sys
py_size = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list[:100]) * 10000
np_size = np_array.nbytes
print(f"Python list (approx): {py_size / 1e6:.1f} MB")
print(f"NumPy float64:        {np_size / 1e6:.1f} MB")
print(f"NumPy float32:        {np_size / 2 / 1e6:.1f} MB")
# NumPy is typically 5-10x more memory-efficient

# Speed comparison: sum of 1M elements
a_list = list(range(n))
a_array = np.arange(n, dtype=np.float64)

t0 = time.perf_counter()
for _ in range(100):
    s = sum(a_list)
t_python = (time.perf_counter() - t0) / 100

t0 = time.perf_counter()
for _ in range(100):
    s = np.sum(a_array)
t_numpy = (time.perf_counter() - t0) / 100

print(f"\nPython sum: {t_python*1000:.2f} ms")
print(f"NumPy sum:  {t_numpy*1000:.3f} ms")
print(f"Speedup:    {t_python/t_numpy:.0f}x")

# Data types: float32 vs float64
a32 = np.random.randn(1000, 1000).astype(np.float32)
a64 = np.random.randn(1000, 1000).astype(np.float64)

t0 = time.perf_counter()
for _ in range(10):
    _ = a32 @ a32
t_f32 = (time.perf_counter() - t0) / 10

t0 = time.perf_counter()
for _ in range(10):
    _ = a64 @ a64
t_f64 = (time.perf_counter() - t0) / 10

print(f"\nfloat32 matmul: {t_f32*1000:.2f} ms")
print(f"float64 matmul: {t_f64*1000:.2f} ms")
print(f"float32 speedup: ~{t_f64/t_f32:.1f}x (wider SIMD, less memory bandwidth)")

Part 2 - The np.linalg Module: Complete Reference

import numpy as np

# ── Matrix properties ───────────────────────────────────────────────────
A = np.array([[4.0, 2.0, 1.0],
              [2.0, 5.0, 3.0],
              [1.0, 3.0, 6.0]])

print("=== Matrix properties ===")
print(f"det(A)          = {np.linalg.det(A):.4f}")
print(f"trace(A)        = {np.trace(A):.4f}")         # sum of eigenvalues
print(f"rank(A)         = {np.linalg.matrix_rank(A)}")
print(f"cond(A)         = {np.linalg.cond(A):.4f}")  # condition number
print(f"norm(A, 'fro')  = {np.linalg.norm(A, 'fro'):.4f}")
print(f"norm(A, 2)      = {np.linalg.norm(A, 2):.4f}")  # spectral norm

# ── System solving ──────────────────────────────────────────────────────
b = np.array([1.0, 2.0, 3.0])

# ALWAYS use solve, not inv
x = np.linalg.solve(A, b)        # solution to Ax = b
print(f"\n=== Linear system: Ax = b ===")
print(f"x = {x}")
print(f"Residual ‖Ax - b‖: {np.linalg.norm(A @ x - b):.2e}")

# Multiple right-hand sides: A X = B (B has multiple columns)
B = np.random.randn(3, 5)
X = np.linalg.solve(A, B)         # X shape: (3, 5) - solves 5 systems at once
print(f"Multi-RHS solve: X shape = {X.shape}")

# Least squares: overdetermined system Ax ≈ b (m > n)
A_rect = np.random.randn(10, 3)   # 10 equations, 3 unknowns
b_rect  = np.random.randn(10)
x_ls, residuals, rank, sv = np.linalg.lstsq(A_rect, b_rect, rcond=None)
print(f"\n=== Least squares ===")
print(f"x_ls shape: {x_ls.shape}, rank: {rank}")

# ── Decompositions ──────────────────────────────────────────────────────
print("\n=== Decompositions ===")

# SVD
U, s, Vt = np.linalg.svd(A)                          # full SVD
U2, s2, Vt2 = np.linalg.svd(A, full_matrices=False)  # economy SVD
print(f"SVD: U={U.shape}, s={s.shape}, Vt={Vt.shape}")
print(f"Economy SVD: U={U2.shape}, s={s2.shape}, Vt={Vt2.shape}")
print(f"Singular values: {s2.round(4)}")

# Eigendecomposition - general matrix
eigenvals, eigenvecs = np.linalg.eig(A)
print(f"\nEig (general): λ = {eigenvals.round(4)}")

# Eigendecomposition - symmetric matrix (always prefer eigh for symmetric)
eigenvals_s, eigenvecs_s = np.linalg.eigh(A)  # A is symmetric → use eigh
print(f"Eigh (symmetric): λ = {eigenvals_s.round(4)} (ascending)")

# QR decomposition
Q, R = np.linalg.qr(A)
print(f"\nQR: Q={Q.shape} (orthonormal), R={R.shape} (upper triangular)")
print(f"Q orthonormal: {np.allclose(Q.T @ Q, np.eye(3))}")
print(f"QR reconstruction: {np.allclose(Q @ R, A)}")

# Cholesky: for symmetric positive definite matrices
L = np.linalg.cholesky(A)   # A must be symmetric positive definite
print(f"\nCholesky: L={L.shape} (lower triangular)")
print(f"LLᵀ = A: {np.allclose(L @ L.T, A)}")

# ── Inverses and pseudoinverses ─────────────────────────────────────────
print("\n=== Inverses ===")
A_inv = np.linalg.inv(A)
print(f"inv(A): shape={A_inv.shape}")
print(f"A @ inv(A) ≈ I: {np.allclose(A @ A_inv, np.eye(3))}")

# Pseudoinverse (Moore-Penrose): works for rectangular/singular matrices
A_rect = np.random.randn(5, 3)  # tall matrix, rank 3
A_pinv = np.linalg.pinv(A_rect)
print(f"\npinv: A={A_rect.shape} → pinv(A)={A_pinv.shape}")
print(f"A_pinv @ A ≈ I₃: {np.allclose(A_pinv @ A_rect, np.eye(3), atol=1e-10)}")

# Log determinant for numerical stability
# Use slogdet instead of log(det) to avoid overflow/underflow
sign, logdet = np.linalg.slogdet(A)
print(f"\nlog|det(A)| = {logdet:.4f}")
print(f"det(A) = {np.exp(logdet) * sign:.4f}")
# For large matrices, det() overflows/underflows; slogdet stays stable

Part 3 - Solving Linear Systems: solve Not inv

Why `np.linalg.solve` is almost always better than `np.linalg.inv`

The common mistake: x = np.linalg.inv(A) @ b

The correct approach: x = np.linalg.solve(A, b)

Three reasons:

Speed: solve uses LU factorization - O(n³/3). inv also computes LU but then multiplies by the identity for each column - doing roughly 3x more work.
Numerical accuracy: solve never explicitly forms the inverse. The inverse amplifies numerical errors; solve minimizes them by directly finding x.
Explicit inverse is rarely needed: in ML, you almost never need A⁻¹ itself - you need A⁻¹b for some b. Use solve.

import numpy as np
import time

np.random.seed(42)
n = 500

# Well-conditioned system
A = np.random.randn(n, n)
A = A.T @ A + n * np.eye(n)  # symmetric positive definite, well-conditioned
b = np.random.randn(n)

# Method 1: inv then multiply (wrong approach)
t0 = time.perf_counter()
for _ in range(20):
    x_inv = np.linalg.inv(A) @ b
t_inv = (time.perf_counter() - t0) / 20

# Method 2: solve (correct approach)
t0 = time.perf_counter()
for _ in range(20):
    x_solve = np.linalg.solve(A, b)
t_solve = (time.perf_counter() - t0) / 20

print(f"inv() + @:      {t_inv*1000:.2f} ms")
print(f"solve():        {t_solve*1000:.2f} ms")
print(f"solve speedup:  {t_inv/t_solve:.1f}x")
print(f"Results match:  {np.allclose(x_inv, x_solve)}")

# Numerical accuracy: ill-conditioned system
# High condition number → inv amplifies errors more than solve
A_ill = np.vander(np.linspace(0.1, 1, 10), 10)  # Vandermonde matrix (very ill-conditioned)
b_ill = np.random.randn(10)
true_x = np.linalg.lstsq(A_ill, b_ill, rcond=None)[0]  # most stable reference

try:
    x_inv_ill = np.linalg.inv(A_ill) @ b_ill
    err_inv = np.linalg.norm(A_ill @ x_inv_ill - b_ill)
    print(f"\nIll-conditioned residual (inv):   {err_inv:.2e}")
except np.linalg.LinAlgError:
    print("inv failed on ill-conditioned matrix")

x_solve_ill = np.linalg.solve(A_ill, b_ill)
err_solve = np.linalg.norm(A_ill @ x_solve_ill - b_ill)
print(f"Ill-conditioned residual (solve): {err_solve:.2e}")

cond = np.linalg.cond(A_ill)
print(f"Condition number: {cond:.2e}")  # should be very large

# Solving multiple right-hand sides efficiently
# Factorize once, solve multiple times - use scipy for this
from scipy import linalg as sla
lu, piv = sla.lu_factor(A)  # factorize once: O(n³)

n_rhs = 50
residuals = []
for _ in range(n_rhs):
    b_i = np.random.randn(n)
    x_i = sla.lu_solve((lu, piv), b_i)  # each solve: O(n²)
    residuals.append(np.linalg.norm(A @ x_i - b_i))

print(f"\nLU pre-factorization residuals: max = {max(residuals):.2e}")
print("(factorize once, solve {n_rhs} times at O(n²) each)")

Part 4 - Decompositions: When to Use Each

Decision guide

import numpy as np

# ── SVD: the most general decomposition ────────────────────────────────
def svd_examples():
    np.random.seed(42)
    A = np.random.randn(6, 4)

    # Full vs economy SVD
    U_full, s, Vt_full = np.linalg.svd(A, full_matrices=True)   # U: (6,6), Vt: (4,4)
    U_econ, s, Vt_econ = np.linalg.svd(A, full_matrices=False)  # U: (6,4), Vt: (4,4)

    # Reconstruction
    Sigma = np.diag(s)
    A_recon = U_econ @ Sigma @ Vt_econ
    print(f"SVD reconstruction error: {np.linalg.norm(A - A_recon):.2e}")

    # Singular values only (faster)
    s_only = np.linalg.svd(A, compute_uv=False)
    print(f"Singular values: {s_only.round(4)}")

    # Rank from singular values
    tol = 1e-10
    rank = np.sum(s_only > tol)
    print(f"Effective rank: {rank}")

svd_examples()

# ── Cholesky: 2x faster than LU for symmetric positive definite ────────
def cholesky_examples():
    np.random.seed(42)
    n = 200
    A_pd = np.random.randn(n, n)
    A_pd = A_pd.T @ A_pd + n * np.eye(n)  # symmetric positive definite

    import time

    # Compare solve approaches for symmetric PD matrix
    b = np.random.randn(n)

    # LU-based solve (general)
    t0 = time.perf_counter()
    for _ in range(50):
        x_lu = np.linalg.solve(A_pd, b)
    t_lu = (time.perf_counter() - t0) / 50

    # Cholesky-based solve (specialized, ~2x faster)
    from scipy.linalg import cho_factor, cho_solve
    t0 = time.perf_counter()
    for _ in range(50):
        c, low = cho_factor(A_pd)
        x_cho = cho_solve((c, low), b)
    t_cho = (time.perf_counter() - t0) / 50

    print(f"\nLU solve:   {t_lu*1000:.2f} ms")
    print(f"Cho solve:  {t_cho*1000:.2f} ms")
    print(f"Results match: {np.allclose(x_lu, x_cho)}")

    # Cholesky also used for sampling from multivariate Gaussian
    # X = mu + L @ z, where L = cholesky(Sigma) and z ~ N(0, I)
    mu = np.zeros(n)
    Sigma = A_pd / n   # scale to reasonable covariance
    L = np.linalg.cholesky(Sigma)
    z = np.random.randn(n, 1000)   # 1000 standard normal samples
    samples = mu[:, np.newaxis] + L @ z  # (n, 1000) samples from N(mu, Sigma)
    print(f"\nMultivariate Gaussian samples: {samples.shape}")

cholesky_examples()

# ── QR decomposition: orthonormal bases and linear regression ──────────
def qr_examples():
    np.random.seed(42)
    A = np.random.randn(8, 4)   # tall matrix (overdetermined)

    Q, R = np.linalg.qr(A)
    print(f"\nQR: Q={Q.shape}, R={R.shape}")
    print(f"Q orthonormal: {np.allclose(Q.T @ Q, np.eye(4))}")

    # QR for least squares: Ax ≈ b → QR decomp → x = R⁻¹ Qᵀ b
    b = np.random.randn(8)
    x_qr = np.linalg.solve(R, Q.T @ b)    # triangular solve
    x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]
    print(f"QR vs lstsq match: {np.allclose(x_qr, x_lstsq)}")

    # Gram-Schmidt via QR: orthonormal basis for column space of A
    # Q columns are the orthonormal basis
    print(f"Column space basis (Q): {Q.shape}")  # (8, 4) - 4 orthonormal basis vectors

qr_examples()

Part 5 - SciPy linalg: Sparse Matrices and Advanced Factorizations

When NumPy is not enough:

scipy.linalg: more factorizations (Schur, Hessenberg, interpolative), Cholesky solve, LU with pre-factorization
scipy.sparse: sparse matrix formats (CSR, CSC, COO), sparse linear solvers
scipy.sparse.linalg: iterative solvers (conjugate gradient, GMRES, LOBPCG for eigenvalues)

import numpy as np
from scipy import linalg as sla
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# ── Pre-factorize for multiple solves ─────────────────────────────────
np.random.seed(42)
n = 100
A = np.random.randn(n, n)
A = A.T @ A + 0.1 * np.eye(n)  # symmetric positive definite

# LU factorization once, then solve many times
lu, piv = sla.lu_factor(A)        # O(n³) - do once

b_vectors = [np.random.randn(n) for _ in range(20)]
solutions = [sla.lu_solve((lu, piv), b) for b in b_vectors]  # O(n²) each
print(f"Pre-factored LU: solved {len(solutions)} systems")

# Cholesky factor once, solve many times (symmetric PD)
c, low = sla.cho_factor(A)        # O(n³) once
solutions_cho = [sla.cho_solve((c, low), b) for b in b_vectors]
print(f"Cholesky solutions match LU: {np.allclose(solutions, solutions_cho)}")

# ── Sparse matrices ───────────────────────────────────────────────────
# Real-world matrices (graph Laplacians, FEM, NLP co-occurrence) are sparse
# Storing them dense would be wasteful - use sparse formats

n_sparse = 1000
density = 0.01   # 1% of entries are non-zero
# COO format: coordinate format (row, col, data)
rows = np.random.randint(0, n_sparse, size=int(n_sparse**2 * density))
cols = np.random.randint(0, n_sparse, size=int(n_sparse**2 * density))
data = np.random.randn(len(rows))
A_sparse = sp.coo_matrix((data, (rows, cols)), shape=(n_sparse, n_sparse))

# Convert to CSR (Compressed Sparse Row) for efficient matrix operations
A_csr = A_sparse.tocsr()
print(f"\nSparse matrix: {A_csr.shape}, nnz={A_csr.nnz}, density={A_csr.nnz/n_sparse**2:.4f}")
print(f"Dense size:  {n_sparse**2 * 8 / 1e6:.1f} MB")
print(f"Sparse size: {(A_csr.data.nbytes + A_csr.indices.nbytes + A_csr.indptr.nbytes) / 1e6:.3f} MB")

# Sparse matrix-vector multiply (much faster than dense for low density)
b_sparse = np.random.randn(n_sparse)
x_sparse = A_csr @ b_sparse   # uses sparse multiply
print(f"Sparse matvec: shape {x_sparse.shape}")

# Sparse symmetric eigenvalue problem (e.g., graph Laplacian)
# Build a symmetric sparse matrix for eigensolver
A_sym = A_sparse + A_sparse.T   # make symmetric
A_sym_csr = A_sym.tocsr()

# LOBPCG or ARPACK for large sparse eigenproblems
# Get top-3 eigenvalues (in magnitude) without forming full matrix
k = 3
eigenvals_sparse, eigenvecs_sparse = spla.eigsh(
    A_sym_csr,
    k=k,
    which='LM'   # Largest Magnitude
)
print(f"\nSparse eigenvalues (top {k}): {eigenvals_sparse.round(4)}")

# ── Advanced SciPy factorizations ─────────────────────────────────────
A_sq = np.random.randn(5, 5)

# Schur decomposition: A = Q T Qᵀ where T is quasi-upper-triangular
T_schur, Z_schur = sla.schur(A_sq)
print(f"\nSchur: T quasi-triangular: T[1,0]={T_schur[1,0]:.6f}")  # should be ~0

# Matrix functions via scipy
from scipy.linalg import expm, logm, sqrtm

# Matrix exponential - used in ODE solvers, graph neural networks
A_sym = A_sq + A_sq.T   # symmetrize
eA = expm(A_sym)
print(f"Matrix exponential: shape {eA.shape}")

# Matrix square root - used in whitening transforms
A_pd = A_sym + 5 * np.eye(5)   # ensure positive definite
sqrtA = sqrtm(A_pd)
print(f"sqrt(A) @ sqrt(A) ≈ A: {np.allclose(sqrtA @ sqrtA, A_pd, atol=1e-8)}")

Part 6 - Performance: Memory Layout, Vectorization, Timing

C-order vs Fortran-order: which is faster for your access pattern

import numpy as np
import time

n = 2000
n_trials = 5

# C-order: row-major (default in NumPy)
# Row elements are contiguous → row operations are cache-friendly
A_C = np.random.randn(n, n)             # C-order by default
A_F = np.asfortranarray(A_C)            # Fortran-order (column-major)

print("Memory layout performance:")

# Row sum: C-order is faster (rows are contiguous)
t0 = time.perf_counter()
for _ in range(n_trials):
    r = A_C.sum(axis=1)
t_C = (time.perf_counter() - t0) / n_trials

t0 = time.perf_counter()
for _ in range(n_trials):
    r = A_F.sum(axis=1)
t_F = (time.perf_counter() - t0) / n_trials

print(f"  Row sum, C-order: {t_C*1000:.2f} ms")
print(f"  Row sum, F-order: {t_F*1000:.2f} ms")

# Column sum: Fortran-order is faster (columns are contiguous)
t0 = time.perf_counter()
for _ in range(n_trials):
    c = A_C.sum(axis=0)
t_C2 = (time.perf_counter() - t0) / n_trials

t0 = time.perf_counter()
for _ in range(n_trials):
    c = A_F.sum(axis=0)
t_F2 = (time.perf_counter() - t0) / n_trials

print(f"  Col sum, C-order: {t_C2*1000:.2f} ms")
print(f"  Col sum, F-order: {t_F2*1000:.2f} ms")

# Matrix multiply: BLAS handles both orders efficiently
t0 = time.perf_counter()
for _ in range(n_trials):
    _ = A_C @ A_C
t_mm = (time.perf_counter() - t0) / n_trials
print(f"  MatMul (C-order): {t_mm*1000:.2f} ms")

# Contiguous check before heavy ops
def ensure_contiguous(A: np.ndarray) -> np.ndarray:
    """Make array C-contiguous if it is not already."""
    if not A.flags['C_CONTIGUOUS']:
        return np.ascontiguousarray(A)
    return A

# Transpose creates a non-contiguous view - may slow subsequent ops
A_T = A_C.T   # non-contiguous!
print(f"\nA.T is C-contiguous: {A_T.flags['C_CONTIGUOUS']}")
print(f"A.T is F-contiguous: {A_T.flags['F_CONTIGUOUS']}")

# Force contiguous copy when needed
A_T_contig = np.ascontiguousarray(A_T)
print(f"After ascontiguousarray: {A_T_contig.flags['C_CONTIGUOUS']}")

Vectorizing ML patterns

import numpy as np
import time

np.random.seed(42)

# Pattern 1: Pairwise distances - the common vectorization trick
n_queries, n_docs, d = 100, 500, 256
Q = np.random.randn(n_queries, d)
D = np.random.randn(n_docs, d)

# SLOW: nested Python loop
def pairwise_dist_loop(Q, D):
    n, m = len(Q), len(D)
    dists = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            dists[i, j] = np.sqrt(np.sum((Q[i] - D[j])**2))
    return dists

# FAST: vectorized broadcasting
def pairwise_dist_vectorized(Q, D):
    # ‖q - d‖² = ‖q‖² + ‖d‖² - 2 q·d
    Q_sq = (Q**2).sum(axis=1, keepdims=True)  # (n_q, 1)
    D_sq = (D**2).sum(axis=1, keepdims=True)  # (n_d, 1)
    cross = Q @ D.T                            # (n_q, n_d)
    return np.sqrt(np.maximum(Q_sq + D_sq.T - 2 * cross, 0.0))

t0 = time.perf_counter()
dists_loop = pairwise_dist_loop(Q[:20], D[:50])  # smaller for timing
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
dists_vec = pairwise_dist_vectorized(Q, D)  # full size
t_vec = time.perf_counter() - t0

print(f"Loop (20x50):       {t_loop*1000:.1f} ms")
print(f"Vectorized (100x500): {t_vec*1000:.1f} ms (full problem!)")
print(f"Results match: {np.allclose(dists_loop, dists_vec[:20, :50])}")

# Pattern 2: Batch covariance computation
def batch_covariance(X: np.ndarray) -> np.ndarray:
    """Compute covariance matrix of X: (n, d) → (d, d)."""
    X_c = X - X.mean(axis=0)
    return X_c.T @ X_c / (len(X) - 1)

# Pattern 3: Gram matrix (inner products of all pairs)
def gram_matrix(X: np.ndarray) -> np.ndarray:
    """Compute Gram (kernel) matrix: K_ij = x_i · x_j."""
    return X @ X.T   # (n, n)

# Pattern 4: Normalizing rows to unit length (for cosine similarity)
def normalize_rows(X: np.ndarray) -> np.ndarray:
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-10)  # avoid division by zero

print("\nML pattern shapes:")
X = np.random.randn(200, 50)
print(f"Covariance: {batch_covariance(X).shape}")    # (50, 50)
print(f"Gram:       {gram_matrix(X).shape}")          # (200, 200)
print(f"Normalized: {normalize_rows(X).shape}")       # (200, 50)
print(f"Normalized norms: {np.linalg.norm(normalize_rows(X), axis=1)[:3].round(4)}")  # all 1.0

Part 7 - Numerical Stability: Condition Number, Cancellation, Precision

Condition number

The condition number κ(A) measures how much the solution to Ax = b can change when b has small perturbations:

κ(A) = ‖A‖ · ‖A⁻¹‖ = σ_max / σ_min   (for L2 norm)

A rule of thumb: if κ(A) ≈ 10^k, you may lose up to k digits of accuracy in the solution. For float64 (≈15 decimal digits), a matrix with κ ≈ 10^12 gives roughly 3 meaningful digits in the solution.

import numpy as np

# Condition number examples
def condition_analysis(A: np.ndarray, b: np.ndarray, name: str) -> None:
    """Analyze numerical stability of solving Ax = b."""
    cond = np.linalg.cond(A)
    x = np.linalg.solve(A, b)
    residual = np.linalg.norm(A @ x - b)

    # Perturb b slightly - how much does solution change?
    delta_b = 1e-6 * np.random.randn(len(b))
    x_perturbed = np.linalg.solve(A, b + delta_b)
    solution_change = np.linalg.norm(x_perturbed - x) / np.linalg.norm(x)
    rhs_change = np.linalg.norm(delta_b) / np.linalg.norm(b)

    print(f"\n{name}:")
    print(f"  Condition number:  {cond:.2e}")
    print(f"  Residual:          {residual:.2e}")
    print(f"  RHS perturbation:  {rhs_change:.2e}")
    print(f"  Solution change:   {solution_change:.2e}")
    print(f"  Sensitivity ratio: {solution_change/rhs_change:.1f} (≤ κ = {cond:.2e})")

np.random.seed(42)
n = 8
b = np.random.randn(n)

# Well-conditioned
A_good = np.eye(n) + 0.1 * np.random.randn(n, n)
A_good = A_good.T @ A_good + n * np.eye(n)
condition_analysis(A_good, b, "Well-conditioned (κ ≈ 10)")

# Ill-conditioned: Hilbert matrix
def hilbert(n: int) -> np.ndarray:
    """Hilbert matrix: H_ij = 1/(i+j+1). Notorious for ill-conditioning."""
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing='ij')
    return 1.0 / (i + j + 1)

H = hilbert(n)
condition_analysis(H, b, "Hilbert matrix (ill-conditioned)")

# Detecting ill-conditioning before solving
def safe_solve(A: np.ndarray, b: np.ndarray, max_cond: float = 1e10) -> np.ndarray:
    """Solve Ax = b with condition number check."""
    cond = np.linalg.cond(A)
    if cond > max_cond:
        import warnings
        warnings.warn(
            f"Condition number {cond:.2e} > {max_cond:.2e}. "
            f"Solution may be inaccurate. Consider regularization.",
            RuntimeWarning
        )
    return np.linalg.solve(A, b)

Catastrophic cancellation and float32 vs float64

import numpy as np

# Catastrophic cancellation: subtracting nearly equal numbers
x = np.float64(1e15)
y = np.float64(1e15 + 1)
diff = y - x
print(f"1e15 + 1 - 1e15 = {diff}")   # should be 1.0, might be 0.0 or wrong

# Numerically stable operations: use library functions that avoid cancellation
# Example: log(1 + x) for small x
x_small = 1e-10
log_unstable = np.log(1 + x_small)    # may suffer cancellation
log_stable   = np.log1p(x_small)      # specifically designed for small x
print(f"\nlog(1+1e-10):")
print(f"  np.log(1+x):  {log_unstable:.16f}")
print(f"  np.log1p(x):  {log_stable:.16f}")

# Stable softmax (subtract max before exp)
def softmax_unstable(x: np.ndarray) -> np.ndarray:
    """Will overflow for large x values."""
    return np.exp(x) / np.exp(x).sum()

def softmax_stable(x: np.ndarray) -> np.ndarray:
    """Numerically stable: subtract max first."""
    x_shifted = x - x.max()   # max becomes 0 → max exp = 1 → no overflow
    exp_x = np.exp(x_shifted)
    return exp_x / exp_x.sum()

x_large = np.array([1000.0, 1001.0, 999.0])
print(f"\nSoftmax on large values:")
try:
    result_unstable = softmax_unstable(x_large)
    print(f"  Unstable: {result_unstable}")
except (RuntimeWarning, FloatingPointError):
    print("  Unstable: overflow!")
result_stable = softmax_stable(x_large)
print(f"  Stable:   {result_stable}")

# float32 vs float64 precision
x32 = np.float32(1.0) / np.float32(3.0)
x64 = np.float64(1.0) / np.float64(3.0)
print(f"\n1/3 as float32: {x32:.15f}")    # ~7 significant digits
print(f"1/3 as float64: {x64:.15f}")    # ~15 significant digits

# When float32 causes problems: ill-conditioned matrix inversion
A = np.random.randn(50, 50)
A = A.T @ A

A32 = A.astype(np.float32)
A64 = A.astype(np.float64)

cond = np.linalg.cond(A64)
b = np.random.randn(50)

x32 = np.linalg.solve(A32, b.astype(np.float32))
x64 = np.linalg.solve(A64, b)

err32 = np.linalg.norm(A64 @ x32.astype(np.float64) - b)
err64 = np.linalg.norm(A64 @ x64 - b)
print(f"\ncond(A) = {cond:.2e}")
print(f"float32 residual: {err32:.2e}")
print(f"float64 residual: {err64:.2e}")
# For well-conditioned matrices, float32 is fine; ill-conditioned → use float64

# Rule: use float64 for linear system solving and decompositions
# use float32 for neural network forward/backward (GPU speed advantage)

Part 8 - Common ML Patterns from Scratch

import numpy as np

# ── Gram matrix (kernel matrix) ─────────────────────────────────────────
def gram_matrix(X: np.ndarray) -> np.ndarray:
    """
    K_ij = xᵢ · xⱼ (linear kernel)
    For RBF kernel: K_ij = exp(-‖xᵢ - xⱼ‖² / 2σ²)
    """
    return X @ X.T   # (n, n)

def rbf_kernel(X: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Radial Basis Function (Gaussian) kernel matrix."""
    # ‖xᵢ - xⱼ‖² = ‖xᵢ‖² + ‖xⱼ‖² - 2 xᵢ·xⱼ
    sq_norms = (X**2).sum(axis=1)
    K = sq_norms[:, np.newaxis] + sq_norms[np.newaxis, :] - 2 * (X @ X.T)
    return np.exp(-gamma * np.maximum(K, 0.0))

# ── Covariance matrix ───────────────────────────────────────────────────
def covariance_matrix(X: np.ndarray, bias: bool = False) -> np.ndarray:
    """
    Σ = Xc^T Xc / (n - 1)  (unbiased, bias=False)
    Σ = Xc^T Xc / n         (biased/MLE, bias=True)
    """
    X_c = X - X.mean(axis=0)
    n = X.shape[0]
    divisor = n if bias else n - 1
    return X_c.T @ X_c / divisor

# ── Whitening transform ─────────────────────────────────────────────────
def zca_whiten(X: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """
    ZCA whitening: transform data to have identity covariance.

    Steps:
    1. Compute covariance: Σ
    2. Eigendecompose: Σ = Q Λ Qᵀ
    3. Whitening matrix: W = Q Λ^(-1/2) Qᵀ
    4. Return X_w = X_c @ W

    ZCA whitening preserves the original orientation (unlike PCA whitening).
    Used in image preprocessing, ICA, some normalization schemes.
    """
    X_c = X - X.mean(axis=0)
    cov = X_c.T @ X_c / (len(X) - 1)

    eigenvals, Q = np.linalg.eigh(cov)
    eigenvals = np.maximum(eigenvals, 0.0)   # guard against numerical negatives

    # Whitening matrix: Σ^(-1/2) = Q Λ^(-1/2) Qᵀ
    D_inv_sqrt = np.diag(1.0 / np.sqrt(eigenvals + eps))
    W = Q @ D_inv_sqrt @ Q.T

    return X_c @ W

def pca_whiten(X: np.ndarray, n_components: int = None, eps: float = 1e-5) -> np.ndarray:
    """
    PCA whitening: decorrelate features AND reduce to n_components dimensions.
    Used in preprocessing for ICA, some generative models.
    """
    X_c = X - X.mean(axis=0)
    cov = X_c.T @ X_c / (len(X) - 1)
    eigenvals, Q = np.linalg.eigh(cov)

    # Sort descending
    idx = np.argsort(eigenvals)[::-1]
    eigenvals = eigenvals[idx]
    Q = Q[:, idx]

    if n_components is not None:
        Q = Q[:, :n_components]
        eigenvals = eigenvals[:n_components]

    # Project and scale by 1/sqrt(eigenvalue)
    X_pca = X_c @ Q                              # project
    X_white = X_pca / np.sqrt(eigenvals + eps)  # scale to unit variance
    return X_white

# ── Rotation matrices ───────────────────────────────────────────────────
def rotation_matrix_2d(theta: float) -> np.ndarray:
    """2D rotation matrix by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

def random_rotation_matrix(n: int) -> np.ndarray:
    """
    Uniformly random rotation matrix in n dimensions.
    Uses QR decomposition of a random Gaussian matrix.
    """
    A = np.random.randn(n, n)
    Q, R = np.linalg.qr(A)
    # Ensure proper rotation (det = +1, not -1)
    Q = Q * np.sign(np.diag(R))
    return Q

# ── Projection matrices ─────────────────────────────────────────────────
def projection_onto_subspace(A: np.ndarray) -> np.ndarray:
    """
    Orthogonal projection matrix onto column space of A.
    P = A (AᵀA)⁻¹ Aᵀ  (if columns of A are linearly independent)
    P = A Aᵀ            (if columns of A are orthonormal)
    """
    # Use QR for numerical stability (instead of forming (AᵀA)⁻¹)
    Q, R = np.linalg.qr(A)
    return Q @ Q.T   # projection matrix

# ── Tests ───────────────────────────────────────────────────────────────
np.random.seed(42)
n, d = 100, 10
X = np.random.randn(n, d)

# Test whitening: covariance of whitened data ≈ Identity
X_white = zca_whiten(X)
cov_white = covariance_matrix(X_white)
print("ZCA whitening:")
print(f"  Max off-diagonal: {np.max(np.abs(cov_white - np.eye(d))):.4f} (should be ~0)")

# Test rotation: orthogonal
R = random_rotation_matrix(5)
print(f"\nRandom rotation matrix:")
print(f"  Orthogonal (RᵀR = I): {np.allclose(R.T @ R, np.eye(5))}")
print(f"  det = +1:             {np.isclose(np.linalg.det(R), 1.0)}")

# Test projection: idempotent (P² = P)
P = projection_onto_subspace(X[:, :3])   # project onto first 3 feature directions
print(f"\nProjection matrix:")
print(f"  P² = P (idempotent): {np.allclose(P @ P, P)}")
print(f"  Pᵀ = P (symmetric):  {np.allclose(P.T, P)}")

Part 9 - PyTorch torch.linalg: GPU-Accelerated Equivalent

# torch.linalg mirrors np.linalg almost exactly - same function names,
# GPU acceleration, and gradient support for automatic differentiation.

try:
    import torch
    import numpy as np

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f"Device: {device}")

    A = torch.randn(4, 4, device=device)
    A = A.T @ A + torch.eye(4, device=device)  # symmetric positive definite

    # ── torch.linalg reference ──────────────────────────────────────────

    # Determinant
    det = torch.linalg.det(A)
    slogdet_sign, slogdet_val = torch.linalg.slogdet(A)  # numerically stable
    print(f"\ndet(A) = {det.item():.4f}")
    print(f"slogdet: sign={slogdet_sign.item():.0f}, log|det|={slogdet_val.item():.4f}")

    # Matrix norm
    frob = torch.linalg.matrix_norm(A, ord='fro')
    nuc  = torch.linalg.matrix_norm(A, ord='nuc')
    spec = torch.linalg.matrix_norm(A, ord=2)
    print(f"Norms: Frobenius={frob.item():.4f}, Nuclear={nuc.item():.4f}, Spectral={spec.item():.4f}")

    # Solve
    b = torch.randn(4, device=device)
    x = torch.linalg.solve(A, b)
    print(f"Solve residual: {torch.linalg.norm(A @ x - b).item():.2e}")

    # SVD
    U, S, Vh = torch.linalg.svd(A, full_matrices=False)
    print(f"SVD: U={U.shape}, S={S.shape}, Vh={Vh.shape}")

    # Eigenvalues (symmetric)
    eigenvals = torch.linalg.eigvalsh(A)  # symmetric → eigh equivalent
    print(f"Eigenvalues (symmetric): {eigenvals.cpu().numpy().round(4)}")

    # QR
    Q, R = torch.linalg.qr(A)
    print(f"QR: Q={Q.shape}, R={R.shape}")
    print(f"Q orthonormal: {torch.allclose(Q.T @ Q, torch.eye(4, device=device), atol=1e-5)}")

    # Cholesky
    L = torch.linalg.cholesky(A)
    print(f"Cholesky: L={L.shape}")

    # ── Gradient support through linalg ops ────────────────────────────
    A_grad = torch.randn(3, 3, requires_grad=True)
    A_pd = A_grad.T @ A_grad + torch.eye(3)

    # Gradient flows through SVD
    U2, S2, Vh2 = torch.linalg.svd(A_pd, full_matrices=False)
    loss = S2.sum()   # some loss involving singular values
    loss.backward()
    print(f"\nGradient through SVD: A_grad.grad shape = {A_grad.grad.shape}")

    # ── When to use torch vs numpy ──────────────────────────────────────
    print("\ntorch.linalg vs np.linalg:")
    print("  Use torch when: on GPU, need gradients, inside a PyTorch model")
    print("  Use numpy when: on CPU only, no gradient needed, scipy interop")
    print("  Conversion: np_arr = t.detach().cpu().numpy()")
    print("              t = torch.from_numpy(np_arr).to(device)")

    # Performance: GPU vs CPU for large matrix
    n_large = 1000
    A_large_np  = np.random.randn(n_large, n_large)
    A_large_np  = A_large_np.T @ A_large_np + n_large * np.eye(n_large)

    import time
    t0 = time.perf_counter()
    U_np, s_np, Vt_np = np.linalg.svd(A_large_np)
    t_np = time.perf_counter() - t0
    print(f"\nSVD {n_large}x{n_large}: NumPy  = {t_np*1000:.1f} ms")

    A_large_t = torch.from_numpy(A_large_np).float()
    t0 = time.perf_counter()
    U_t, S_t, Vh_t = torch.linalg.svd(A_large_t, full_matrices=False)
    t_torch_cpu = time.perf_counter() - t0
    print(f"SVD {n_large}x{n_large}: PyTorch CPU = {t_torch_cpu*1000:.1f} ms")

    if torch.cuda.is_available():
        A_gpu = A_large_t.cuda()
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        U_g, S_g, Vh_g = torch.linalg.svd(A_gpu, full_matrices=False)
        torch.cuda.synchronize()
        t_gpu = time.perf_counter() - t0
        print(f"SVD {n_large}x{n_large}: PyTorch GPU = {t_gpu*1000:.1f} ms")
        print(f"GPU speedup: {t_np/t_gpu:.1f}x over NumPy")

except ImportError:
    print("PyTorch not installed. Install with: pip install torch")
    print("\ntorch.linalg function equivalents to np.linalg:")
    torch_numpy_equiv = [
        ("np.linalg.solve(A, b)",          "torch.linalg.solve(A, b)"),
        ("np.linalg.svd(A)",               "torch.linalg.svd(A)"),
        ("np.linalg.eig(A)",               "torch.linalg.eig(A)"),
        ("np.linalg.eigh(A)",              "torch.linalg.eigh(A)"),
        ("np.linalg.eigvalsh(A)",          "torch.linalg.eigvalsh(A)"),
        ("np.linalg.qr(A)",                "torch.linalg.qr(A)"),
        ("np.linalg.cholesky(A)",          "torch.linalg.cholesky(A)"),
        ("np.linalg.norm(A, 'fro')",       "torch.linalg.matrix_norm(A, 'fro')"),
        ("np.linalg.norm(v)",              "torch.linalg.vector_norm(v)"),
        ("np.linalg.det(A)",               "torch.linalg.det(A)"),
        ("np.linalg.inv(A)",               "torch.linalg.inv(A)"),
        ("np.linalg.pinv(A)",              "torch.linalg.pinv(A)"),
        ("np.linalg.cond(A)",              "torch.linalg.cond(A)"),
        ("np.linalg.matrix_rank(A)",       "torch.linalg.matrix_rank(A)"),
    ]
    for np_fn, torch_fn in torch_numpy_equiv:
        print(f"  {np_fn:<35} → {torch_fn}")

Interview Questions

Q1: What is the condition number and why does it matter for numerical computation?

The condition number of a matrix A (with respect to a given norm) is:

κ(A) = ‖A‖ · ‖A⁻¹‖

For the L2 norm and a non-singular matrix:

κ₂(A) = σ_max / σ_min

where σ_max and σ_min are the largest and smallest singular values.

What it measures: The condition number is the worst-case factor by which relative errors in the input (b) are amplified in the output (x) when solving Ax = b:

‖Δx‖/‖x‖ ≤ κ(A) · ‖Δb‖/‖b‖

A rule of thumb: if κ(A) ≈ 10^k, solving Ax = b in float64 (≈15 digits of precision) gives approximately 15 - k correct significant digits.

ML implications:

Ridge regression: the αI term in (XᵀX + αI) lowers the condition number from σ_max(XᵀX)/σ_min(XᵀX) to (σ_max + α)/(σ_min + α), improving numerical stability
Optimization: the condition number of the Hessian determines convergence speed of gradient descent - high condition number (elongated loss landscape) leads to slow convergence; Adam adapts per-parameter learning rates to effectively precondition the problem
Feature scaling: StandardScaler reduces the condition number of XᵀX by making features comparable in scale, improving numerical stability and optimization speed

Check condition number before solving: cond = np.linalg.cond(A). If it exceeds 10^12 for float64 (or 10^6 for float32), consider regularization or a different numerical approach.

Q2: Explain the difference between C-order and Fortran-order memory layout in NumPy.

C-order (row-major): Elements of each row are stored contiguously in memory.

A = [[1, 2, 3],    Memory: [1, 2, 3, 4, 5, 6]
     [4, 5, 6]]             (row 0, then row 1)

Fortran-order (column-major): Elements of each column are stored contiguously.

A = [[1, 2, 3],    Memory: [1, 4, 2, 5, 3, 6]
     [4, 5, 6]]             (col 0, then col 1, then col 2)

Performance implications:

NumPy creates C-order arrays by default. Row operations (summing along axis=1, iterating over rows) are cache-friendly for C-order because each row's elements are adjacent in memory - the CPU fetches them in cache lines efficiently. Column operations on C-order arrays cause strided access and cache misses.

BLAS (called by np.dot, @) handles both orders internally and passes layout information to the LAPACK routine, so A @ B is efficient regardless of the layout.

When it matters in practice:

After A.T, the result is F-order (or viewed as F-order without copying). Calling reshape on it may trigger an unexpected copy.
np.ascontiguousarray(A) forces a copy to C-order if needed.
PyTorch defaults to C-order (contiguous in PyTorch terminology). After .permute() or .transpose(), call .contiguous() before .view().

For ML: the main practical consequence is that after transposing or permuting tensor axes, you may need to explicitly call .contiguous() (PyTorch) or np.ascontiguousarray() (NumPy) before reshaping, to avoid RuntimeErrors or unexpected copies.

Q3: When should you use scipy.linalg instead of numpy.linalg?

Use scipy.linalg instead of (or in addition to) numpy.linalg in four situations:

Pre-factorization for multiple solves: scipy.linalg.lu_factor + lu_solve lets you factorize once (O(n³)) and solve multiple systems cheaply (O(n²) each). NumPy has no equivalent - np.linalg.solve re-factorizes each call.
Cholesky solve: scipy.linalg.cho_factor + cho_solve for symmetric positive definite matrices. Cholesky is roughly 2x faster than LU and guaranteed to be positive - NumPy lacks this pre-factorization workflow.
Sparse matrices: scipy.sparse provides sparse matrix formats (CSR, CSC, COO) and scipy.sparse.linalg provides iterative solvers (GMRES, CG, MINRES) and sparse eigensolvers (LOBPCG, ARPACK via eigsh). For matrices with millions of rows but only a few non-zeros per row (graph Laplacians, FEM stiffness matrices), these are essential.
Additional matrix functions: scipy.linalg.expm (matrix exponential), logm, sqrtm, schur, and the Hessenberg decomposition are not in numpy.linalg.

For most day-to-day single-solve operations, numpy.linalg is sufficient. For production ML code with repeated linear system solving (Kalman filters, Gaussian processes, kernel methods), scipy.linalg pre-factorization can give 10-50x speedups.

Q4: What is the relationship between np.linalg and torch.linalg?

torch.linalg (introduced in PyTorch 1.9) was explicitly designed to mirror numpy.linalg in API, with two key additions:

API equivalence: Almost every np.linalg function has a direct torch.linalg counterpart with the same name and argument order: solve, svd, eig, eigh, qr, cholesky, inv, pinv, det, norm, matrix_rank, cond.

Key differences:

Gradient support: torch.linalg operations participate in autograd. Gradients flow through SVD, eigendecomposition, Cholesky, and solve - enabling gradient-based optimization over quantities that depend on these decompositions (e.g., nuclear norm minimization, optimizing over rotation matrices).
Device-agnostic: tensors can live on CPU or GPU. The same code path works on both - dispatch to cuBLAS/cuSOLVER on GPU, BLAS/LAPACK on CPU.
Norm API difference: np.linalg.norm(v) handles both vectors and matrices with the ord parameter. PyTorch separates this into torch.linalg.vector_norm(v) and torch.linalg.matrix_norm(A).

When to use which:

Inside a PyTorch model or loss function: always use torch.linalg (gradients, GPU)
Preprocessing, offline analysis, sklearn integration: use numpy.linalg
Conversion: t.detach().cpu().numpy() for torch → numpy; torch.from_numpy(arr) for numpy → torch (shares memory on CPU)

Practice Challenges

Level 1: Predict

Challenge: Predict the output of each call:

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(np.linalg.norm(A, 'fro'))
print(np.linalg.norm(A.flatten()))
print(np.trace(A))
print(np.linalg.det(A))

Answer

Frobenius norm: sqrt(1² + 2² + 3² + 4²) = sqrt(1+4+9+16) = sqrt(30) ≈ 5.4772
L2 of flatten:  sqrt(1+4+9+16) = sqrt(30) ≈ 5.4772  (identical to Frobenius)
trace(A):       1 + 4 = 5  (sum of diagonal)
det(A):         1*4 - 2*3 = 4 - 6 = -2

The Frobenius norm equals the L2 norm of the vectorized matrix - this is confirmed by the first two lines giving identical results.

The negative determinant means A has one negative eigenvalue (and one positive). Sum of eigenvalues = trace = 5, product = det = -2. Eigenvalues: solve λ² - 5λ - 2 = 0 → λ = (5 ± sqrt(33)) / 2 ≈ {5.37, -0.37}.

Level 2: Debug

Challenge: Find the bug in this Ridge regression implementation:

def ridge_predict(X_train, y_train, X_test, alpha=1.0):
    n, d = X_train.shape
    w = np.linalg.inv(X_train.T @ X_train + alpha * np.eye(n)) @ X_train.T @ y_train
    return X_test @ w

Answer

Bug: The identity matrix has size n (number of samples) instead of d (number of features).

Ridge regression adds αI to XᵀX, which is a d×d matrix. The code uses np.eye(n) (n×n) - this causes a shape mismatch when n ≠ d.

# Wrong: np.eye(n) - size is wrong
w = np.linalg.inv(X_train.T @ X_train + alpha * np.eye(n)) @ ...

# Fix 1: use np.eye(d)
def ridge_predict_fixed_v1(X_train, y_train, X_test, alpha=1.0):
    n, d = X_train.shape
    w = np.linalg.inv(X_train.T @ X_train + alpha * np.eye(d)) @ X_train.T @ y_train
    return X_test @ w

# Fix 2: use solve instead of inv (better numerics)
def ridge_predict_fixed_v2(X_train, y_train, X_test, alpha=1.0):
    n, d = X_train.shape
    A = X_train.T @ X_train + alpha * np.eye(d)  # (d, d)
    b = X_train.T @ y_train                        # (d,)
    w = np.linalg.solve(A, b)                      # never use inv!
    return X_test @ w

# Verify
np.random.seed(42)
X_tr = np.random.randn(20, 5)
y_tr = np.random.randn(20)
X_te = np.random.randn(10, 5)

preds = ridge_predict_fixed_v2(X_tr, y_tr, X_te, alpha=0.1)
print(f"Predictions shape: {preds.shape}")   # (10,) - correct

Level 3: Design

Challenge: Implement a numerically stable Gaussian Process regression from scratch using Cholesky decomposition and log marginal likelihood.

Answer

import numpy as np
from scipy.linalg import cho_factor, cho_solve

class GaussianProcessRegressor:
    """
    Gaussian Process Regression with RBF kernel.
    Uses Cholesky decomposition for numerical stability and efficiency.
    """

    def __init__(self, length_scale: float = 1.0, noise_var: float = 1e-4):
        self.length_scale = length_scale
        self.noise_var = noise_var
        self._chol_factor = None
        self.X_train = None
        self.y_train = None

    def _rbf_kernel(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        """K(x,x') = exp(-‖x - x'‖² / (2 * l²))"""
        sq_norms = (
            (X1**2).sum(axis=1, keepdims=True)   # (n1, 1)
            + (X2**2).sum(axis=1, keepdims=True).T   # (1, n2)
            - 2 * X1 @ X2.T                          # (n1, n2)
        )
        return np.exp(-np.maximum(sq_norms, 0.0) / (2 * self.length_scale**2))

    def fit(self, X: np.ndarray, y: np.ndarray) -> 'GaussianProcessRegressor':
        """
        Fit GP: compute and factorize K + noise_var * I using Cholesky.
        O(n³) for factorization, O(n²) for each subsequent predict call.
        """
        self.X_train = X.copy()
        self.y_train = y.copy()
        n = len(X)

        # Kernel matrix + noise diagonal
        K = self._rbf_kernel(X, X)
        K_noisy = K + self.noise_var * np.eye(n)

        # Cholesky factorization: K_noisy = L Lᵀ
        # More stable than LU for positive definite matrices
        self._chol_factor = cho_factor(K_noisy)

        # Pre-compute K⁻¹ y = L⁻ᵀ L⁻¹ y for prediction
        self._alpha = cho_solve(self._chol_factor, y)  # (n,)
        return self

    def predict(
        self, X_test: np.ndarray, return_std: bool = False
    ) -> tuple:
        """
        Predict mean and optional standard deviation.
        O(n²) per call after O(n³) fit.
        """
        # Cross-covariance: K_* = K(X_test, X_train)
        K_star = self._rbf_kernel(X_test, self.X_train)   # (n_test, n_train)

        # Posterior mean: μ* = K_* K⁻¹ y = K_* alpha
        mean = K_star @ self._alpha

        if not return_std:
            return mean

        # Posterior variance: Var* = K(X_*, X_*) - K_* K⁻¹ K_*ᵀ
        K_star_star = self._rbf_kernel(X_test, X_test)     # (n_test, n_test)

        # K⁻¹ K_*ᵀ via Cholesky solve - avoids forming explicit inverse
        V = cho_solve(self._chol_factor, K_star.T)          # (n_train, n_test)
        var = np.diag(K_star_star) - np.einsum('ij,ji->i', K_star, V)
        var = np.maximum(var, 0.0)   # numerical safety: clamp negative variances

        return mean, np.sqrt(var)

    def log_marginal_likelihood(self) -> float:
        """
        log p(y | X, θ) = -0.5 [yᵀ K⁻¹ y + log|K| + n log(2π)]

        Using slogdet via Cholesky: log|K| = 2 * Σᵢ log(Lᵢᵢ)
        """
        n = len(self.y_train)
        L = self._chol_factor[0]   # lower triangular Cholesky factor

        # log|K| = 2 * sum of log diagonals of L
        log_det_K = 2.0 * np.sum(np.log(np.diag(np.abs(L))))

        data_fit  = 0.5 * (self.y_train @ self._alpha)
        complexity = 0.5 * log_det_K
        constant   = 0.5 * n * np.log(2 * np.pi)

        return -(data_fit + complexity + constant)

# Test
np.random.seed(42)
n_train, n_test = 30, 100
X_train = np.linspace(0, 5, n_train).reshape(-1, 1)
y_train = np.sin(X_train.ravel()) + 0.1 * np.random.randn(n_train)
X_test = np.linspace(-0.5, 5.5, n_test).reshape(-1, 1)

gp = GaussianProcessRegressor(length_scale=1.0, noise_var=0.01)
gp.fit(X_train, y_train)

mean, std = gp.predict(X_test, return_std=True)
lml = gp.log_marginal_likelihood()

print(f"GP predictions shape: {mean.shape}")    # (100,)
print(f"GP stdev shape:       {std.shape}")     # (100,)
print(f"Log marginal likelihood: {lml:.4f}")
print(f"Max prediction error: {np.max(np.abs(mean - np.sin(X_test.ravel()))):.4f}")

Complete np.linalg and torch.linalg Cheatsheet

Operation	np.linalg	torch.linalg	Notes
Solve Ax=b	`solve(A, b)`	`solve(A, b)`	Prefer over inv
Least squares	`lstsq(A, b)`	`lstsq(A, b)`	Overdetermined systems
Inverse	`inv(A)`	`inv(A)`	Avoid - use solve
Pseudoinverse	`pinv(A)`	`pinv(A)`	SVD-based
Determinant	`det(A)`	`det(A)`	Use slogdet for stability
Log-det	`slogdet(A)`	`slogdet(A)`	Avoids overflow
Condition #	`cond(A)`	`cond(A)`	σ_max/σ_min
Rank	`matrix_rank(A)`	`matrix_rank(A)`	Uses SVD
SVD	`svd(A)`	`svd(A)`	full_matrices kwarg
Eig (general)	`eig(A)`	`eig(A)`	May be complex
Eig (symmetric)	`eigh(A)`	`eigh(A)`	Real, ascending
QR	`qr(A)`	`qr(A)`	Orthonormal basis
Cholesky	`cholesky(A)`	`cholesky(A)`	Symmetric PD only
Vector norm	`norm(v)`	`vector_norm(v)`	Default L2
Matrix norm	`norm(A, 'fro')`	`matrix_norm(A, 'fro')`	fro, nuc, 2

Key Takeaways

NumPy arrays use contiguous memory with BLAS dispatch - they are 100-1000x faster than Python lists for numerical computation
Always use np.linalg.solve(A, b) instead of np.linalg.inv(A) @ b - it is faster and more numerically stable
Use np.linalg.eigh for symmetric matrices - it returns real eigenvalues in ascending order, is faster, and more stable than eig
The condition number κ(A) = σ_max/σ_min measures numerical sensitivity - large condition number means small perturbations in b cause large changes in the solution
Use scipy.linalg.lu_factor and cho_factor when solving multiple systems with the same matrix - factorize O(n³) once, solve O(n²) many times
For float32 neural network weights, prefer float64 for any linear algebra operations (decompositions, covariance, kernel matrices) to avoid catastrophic cancellation
torch.linalg mirrors np.linalg exactly but adds GPU acceleration and gradient support - switch to it inside any PyTorch model

Back to: Module Overview →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Matrix Transformations in 3D demo on the EngineersOfAI Playground - no code required.

:::

What You Will Learn​

Prerequisites​

Part 1 - NumPy Array vs Python List: Why Performance Differs​

Memory layout is the fundamental difference​

Part 2 - The np.linalg Module: Complete Reference​

Part 3 - Solving Linear Systems: solve Not inv​

Why np.linalg.solve is almost always better than np.linalg.inv​

Part 4 - Decompositions: When to Use Each​

Decision guide​

Part 5 - SciPy linalg: Sparse Matrices and Advanced Factorizations​

Part 6 - Performance: Memory Layout, Vectorization, Timing​

C-order vs Fortran-order: which is faster for your access pattern​

Vectorizing ML patterns​

Part 7 - Numerical Stability: Condition Number, Cancellation, Precision​

Condition number​

Catastrophic cancellation and float32 vs float64​

Part 8 - Common ML Patterns from Scratch​

Part 9 - PyTorch torch.linalg: GPU-Accelerated Equivalent​

Interview Questions​

Practice Challenges​

Level 1: Predict​

Level 2: Debug​

Level 3: Design​

Complete np.linalg and torch.linalg Cheatsheet​

Key Takeaways​

What You Will Learn

Prerequisites

Part 1 - NumPy Array vs Python List: Why Performance Differs

Memory layout is the fundamental difference

Part 2 - The np.linalg Module: Complete Reference

Part 3 - Solving Linear Systems: solve Not inv

Why `np.linalg.solve` is almost always better than `np.linalg.inv`

Part 4 - Decompositions: When to Use Each

Decision guide

Part 5 - SciPy linalg: Sparse Matrices and Advanced Factorizations

Part 6 - Performance: Memory Layout, Vectorization, Timing

C-order vs Fortran-order: which is faster for your access pattern

Vectorizing ML patterns

Part 7 - Numerical Stability: Condition Number, Cancellation, Precision

Condition number

Catastrophic cancellation and float32 vs float64

Part 8 - Common ML Patterns from Scratch

Part 9 - PyTorch torch.linalg: GPU-Accelerated Equivalent

Interview Questions

Practice Challenges

Level 1: Predict

Level 2: Debug

Level 3: Design

Complete np.linalg and torch.linalg Cheatsheet

Key Takeaways