Numerical Linear Algebra - Condition Numbers, Solvers, and Backprop Stability
Reading time: ~28 minutes | Level: Numerical Foundations → Production ML
Your linear regression model produces wildly different weights every time you retrain on slightly different data. Your covariance matrix inversion fails with a cryptic LinAlgError: Singular matrix. Your neural network with very deep layers has gradients that explode on some architectures but not others - and you don't understand why.
These are numerical linear algebra problems. They stem from a single root cause: some matrices are sensitive to small perturbations, and finite-precision arithmetic introduces perturbations at every step. Understanding condition numbers and matrix decompositions is what separates engineers who debug these issues in an hour from those who spend weeks adding random seeds and hoping.
What You Will Learn
- The condition number: quantifying how "dangerous" a matrix is numerically
- Ill-conditioned matrices and what they mean for ML models
- LU, QR, and Cholesky factorizations: when to use each
- Why inverting a matrix is almost always the wrong approach
- Numerical stability of the backpropagation algorithm
- Practical SciPy / NumPy patterns for stable linear system solving
Part 1 - The Condition Number
Definition
The condition number of a matrix (with respect to the L2 norm) is:
where and are the largest and smallest singular values of .
The condition number measures how much relative error in the output results from a small relative error in the input:
where and is a small perturbation.
Interpreting the condition number
| Classification | Meaning | |
|---|---|---|
| Well-conditioned | Small input errors → small output errors | |
| Moderately ill-conditioned | Caution needed | |
| Severely ill-conditioned | Near-singular; solutions unreliable in float32 | |
| Numerically singular | Cannot solve reliably even in float64 |
import numpy as np
from numpy.linalg import cond, svd
# Well-conditioned matrix
A_good = np.array([[2.0, 1.0],
[1.0, 2.0]])
print(f"cond(A_good) = {cond(A_good):.2f}") # 3.0 - excellent
# Ill-conditioned matrix (nearly singular)
A_bad = np.array([[1.0, 1.0],
[1.0, 1.0001]]) # rows nearly identical
print(f"cond(A_bad) = {cond(A_bad):.2e}") # ~40000 - problematic
# Demonstrate the consequence
b = np.array([2.0, 2.0001])
x_exact = np.linalg.solve(A_bad, b)
# Tiny perturbation to b
delta_b = np.array([0.0001, 0.0])
b_perturbed = b + delta_b
x_perturbed = np.linalg.solve(A_bad, b_perturbed)
print(f"Relative change in b: {np.linalg.norm(delta_b)/np.linalg.norm(b):.6f}")
print(f"Relative change in x: {np.linalg.norm(x_perturbed - x_exact)/np.linalg.norm(x_exact):.4f}")
# A tiny change in b produces a huge change in x - ill-conditioning in action
Where ill-conditioned matrices appear in ML
1. Gram matrices with correlated features
The least squares problem involves the Gram matrix . When features are highly correlated, is nearly singular:
import numpy as np
# Two nearly identical features → near-singular Gram matrix
np.random.seed(42)
n = 100
x1 = np.random.randn(n)
x2 = x1 + 0.001 * np.random.randn(n) # x2 ≈ x1
X = np.column_stack([x1, x2])
G = X.T @ X
print(f"Condition number of Gram matrix: {np.linalg.cond(G):.2e}")
# ~4e6 - severely ill-conditioned!
# Adding L2 regularization (ridge) stabilizes the Gram matrix
lambda_reg = 0.01
G_regularized = X.T @ X + lambda_reg * np.eye(2)
print(f"Condition number after ridge: {np.linalg.cond(G_regularized):.2e}")
# Dramatically improved
2. Deep network Jacobians
In a deep network with layers, the gradient of the loss with respect to the first layer involves the product of Jacobian matrices:
If each has , the product's condition number can grow exponentially with depth - gradient explosion. If , it shrinks - gradient vanishing.
Batch normalization, residual connections, and careful initialization (Xavier, He) exist specifically to keep these condition numbers near 1.
Part 2 - LU Factorization: The Workhorse Linear Solver
What it is
LU factorization decomposes where:
- is a permutation matrix (row reordering for numerical stability)
- is lower triangular with ones on the diagonal
- is upper triangular
Solving becomes:
- Factor: - cost, done once
- Solve - forward substitution,
- Solve - back substitution,
When you need to solve the same system for many right-hand sides , factoring once and reusing is efficient.
import numpy as np
from scipy.linalg import lu, lu_factor, lu_solve
# Factorize once
A = np.array([[2.0, 1.0, -1.0],
[-3.0, -1.0, 2.0],
[-2.0, 1.0, 2.0]])
lu_piv = lu_factor(A) # Pre-compute LU factorization
# Solve for multiple right-hand sides efficiently
b1 = np.array([8.0, -11.0, -3.0])
b2 = np.array([1.0, 2.0, 3.0])
b3 = np.array([0.0, 1.0, 0.0])
x1 = lu_solve(lu_piv, b1) # O(n^2) each - much faster than O(n^3)
x2 = lu_solve(lu_piv, b2)
x3 = lu_solve(lu_piv, b3)
# Verify
print(f"Ax1 - b1: {np.linalg.norm(A @ x1 - b1):.2e}") # ~0 (machine precision)
When LU is appropriate:
- Square, dense, well-conditioned system
- Multiple right-hand sides with the same matrix
- General-purpose linear system solver
When LU is NOT appropriate:
- Matrix is symmetric positive definite → use Cholesky (twice as fast)
- System is overdetermined () → use QR or least squares
- Matrix is very large and sparse → use iterative methods (see Lesson 03)
- You need the solution to be numerically robust despite near-singularity → use SVD
Part 3 - QR Factorization: The Numerically Safe Solver
What it is
QR factorization decomposes where:
- is orthogonal () - orthogonal matrices have condition number = 1
- is upper triangular
For the least squares problem with overdetermined ():
The QR approach is numerically superior to the normal equations because:
- Forming squares the condition number:
- The QR solution avoids this squaring, using only
import numpy as np
from scipy.linalg import qr, solve_triangular
# Least squares via QR - numerically stable
A = np.random.randn(100, 10) # overdetermined system: 100 equations, 10 unknowns
b = np.random.randn(100)
# Method 1: Normal equations (NUMERICALLY INFERIOR - do not use)
# x_normal = np.linalg.inv(A.T @ A) @ A.T @ b
# Condition number is squared: bad for nearly correlated features
# Method 2: QR factorization (NUMERICALLY SUPERIOR)
Q, R = qr(A, mode='economic') # 'economic' QR: Q is (100, 10), R is (10, 10)
Qtb = Q.T @ b # Q^T b - cheap, Q is orthogonal
x_qr = solve_triangular(R, Qtb) # Solve Rx = Q^T b by back substitution
# Method 3: numpy's least squares (uses SVD internally - most stable)
x_lstsq, _, _, _ = np.linalg.lstsq(A, b, rcond=None)
# All three should give the same answer (QR and SVD do; normal equations may not)
print(f"QR vs lstsq difference: {np.linalg.norm(x_qr - x_lstsq):.2e}")
print(f"Residual (QR): {np.linalg.norm(A @ x_qr - b):.6f}")
print(f"Residual (lstsq): {np.linalg.norm(A @ x_lstsq - b):.6f}")
:::tip QR in ML pipelines
When fitting a linear model to features with unknown correlation structure, always solve via QR or np.linalg.lstsq rather than computing the normal equations manually. For Scikit-learn's LinearRegression, this is already done for you - but knowing why matters when you implement custom solvers.
:::
Gram-Schmidt and numerical orthogonalization
QR is also the algorithm behind Gram-Schmidt orthogonalization, used whenever you need an orthogonal basis:
def gram_schmidt_qr(A: np.ndarray) -> tuple:
"""
Classical Gram-Schmidt QR factorization.
Note: scipy.linalg.qr uses Householder reflections - more stable.
This is for illustration.
"""
m, n = A.shape
Q = np.zeros((m, n))
R = np.zeros((n, n))
for j in range(n):
v = A[:, j].copy()
for i in range(j):
R[i, j] = Q[:, i] @ A[:, j]
v -= R[i, j] * Q[:, i]
R[j, j] = np.linalg.norm(v)
if R[j, j] > 1e-10:
Q[:, j] = v / R[j, j]
return Q, R
Part 4 - Cholesky Factorization: For Positive Definite Matrices
When it applies
A matrix is symmetric positive definite (SPD) if:
- (symmetric)
- for all non-zero (positive definite)
Common SPD matrices in ML:
- Covariance matrices: (after centering)
- Gram matrices with regularization:
- Fisher information matrix
- Hessian at a local minimum
For SPD matrices, Cholesky factorization gives where is lower triangular with positive diagonal entries. This is twice as fast as LU and guaranteed stable for SPD matrices.
import numpy as np
from scipy.linalg import cholesky, cho_factor, cho_solve
# Generate a symmetric positive definite matrix
n = 50
A_raw = np.random.randn(n, n)
A = A_raw.T @ A_raw + n * np.eye(n) # Guaranteed SPD: X^T X + nI
# Cholesky factorization
L = cholesky(A, lower=True) # A = L @ L.T
print(f"Is lower triangular: {np.allclose(L, np.tril(L))}") # True
# Efficient solve using Cholesky
cho_fac = cho_factor(A, lower=True) # Pre-factorize
b = np.random.randn(n)
x = cho_solve(cho_fac, b)
# Verify: Ax = b
print(f"Residual: {np.linalg.norm(A @ x - b):.2e}") # ~machine precision
# Use case: Gaussian process covariance
def gp_log_likelihood(K: np.ndarray, y: np.ndarray) -> float:
"""
Compute log marginal likelihood for a Gaussian process.
Uses Cholesky for numerical stability and efficiency.
log p(y | K) = -1/2 y^T K^{-1} y - 1/2 log|K| - n/2 log(2π)
"""
n = len(y)
eps = 1e-6 # Jitter for numerical stability
K_stable = K + eps * np.eye(n)
# Cholesky: K = L L^T
L = cholesky(K_stable, lower=True)
# Solve K^{-1} y = (L L^T)^{-1} y = L^{-T} (L^{-1} y)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
# log|K| = 2 * sum(log(diag(L))) - from det of triangular matrix
log_det_K = 2.0 * np.sum(np.log(np.diag(L)))
return -0.5 * (y @ alpha + log_det_K + n * np.log(2 * np.pi))
:::note Cholesky failure = matrix is not SPD
If scipy.linalg.cholesky(A) throws a LinAlgError, the matrix is not positive definite. This often happens in practice when:
- You compute a covariance matrix on too few data points (rank-deficient)
- Accumulated floating-point errors make a theoretically SPD matrix numerically indefinite
- Features are exactly linearly dependent
Fix: Add a small jitter/diagonal: A + ε·I with to . This is standard practice in Gaussian processes and is the right thing to do.
:::
Part 5 - Why You Should Never Invert a Matrix
This is one of the most important practical rules in numerical linear algebra:
Never compute explicitly. Always solve the linear system directly.
Why this rule exists
import numpy as np
import time
n = 1000
# Ill-conditioned matrix (for demonstration)
A = np.random.randn(n, n)
A = A + A.T + n * np.eye(n) # Make it positive definite
b = np.random.randn(n)
# Method 1: WRONG - explicit inverse (slow + numerically worse)
start = time.time()
x_wrong = np.linalg.inv(A) @ b
time_wrong = time.time() - start
# Method 2: CORRECT - solve the linear system directly
start = time.time()
x_correct = np.linalg.solve(A, b)
time_correct = time.time() - start
# Both give the same answer, but:
print(f"Time (inv): {time_wrong:.4f}s")
print(f"Time (solve): {time_correct:.4f}s")
# solve() is ~2-3x faster - inv() computes n solutions to verify, solve() computes 1
# More importantly: accuracy on ill-conditioned systems
A_bad = np.array([[1.0, 1.0], [1.0, 1.0001]])
b_bad = np.array([2.0, 2.0001])
x_inv = np.linalg.inv(A_bad) @ b_bad
x_solve = np.linalg.solve(A_bad, b_bad)
# x_inv accumulates more rounding error - solve() uses LU with pivoting
print(f"Residual (inv): {np.linalg.norm(A_bad @ x_inv - b_bad):.2e}")
print(f"Residual (solve): {np.linalg.norm(A_bad @ x_solve - b_bad):.2e}")
The fundamental reason: Computing requires solving linear systems (one per column of ). But if you only need , that is just one linear system. Wasting extra solves introduces more rounding error.
The only legitimate use of explicit inverse
When you need the actual matrix (not just the solution to a particular system), such as:
- Computing for matrix-valued right-hand sides
- Analytical formulas that require the inverse
Even then, use np.linalg.solve(A, B) for multiple right-hand sides, not np.linalg.inv(A) @ B.
Part 6 - Numerical Stability of Backpropagation
The Jacobian product view
Backpropagation is the chain rule applied to a composition of functions:
where is the Jacobian of layer with respect to its inputs.
The numerical stability of the entire computation depends on the product of these Jacobians.
Eigenvalue analysis of gradient flow
Consider a deep network with linear layers and the same weight matrix at each layer. The gradient at layer involves . Let be eigenvalues of :
- If : gradients explode exponentially with depth
- If : gradients vanish exponentially with depth
- If : stable gradient flow
import numpy as np
def simulate_gradient_flow(n_layers: int, weight_scale: float) -> float:
"""
Simulate gradient magnitude through a deep linear network.
weight_scale > 1: gradient explosion
weight_scale < 1: gradient vanishing
weight_scale ~ 1: stable
"""
d = 32 # hidden dimension
# Initialize weights with specific scale
weights = [np.random.randn(d, d) * weight_scale / np.sqrt(d)
for _ in range(n_layers)]
# Forward pass: h_l = W_l h_{l-1}
h = np.random.randn(d)
h /= np.linalg.norm(h)
activations = [h]
for W in weights:
h = W @ h
activations.append(h)
# Backward pass: g_l = W_{l+1}^T g_{l+1}
g = np.random.randn(d)
g /= np.linalg.norm(g)
gradient_norms = [np.linalg.norm(g)]
for W in reversed(weights):
g = W.T @ g
gradient_norms.append(np.linalg.norm(g))
return gradient_norms[-1] # gradient at first layer
# Gradient explosion
exploded = simulate_gradient_flow(n_layers=20, weight_scale=1.5)
print(f"weight_scale=1.5, 20 layers: gradient norm = {exploded:.2e}") # huge
# Gradient vanishing
vanished = simulate_gradient_flow(n_layers=20, weight_scale=0.7)
print(f"weight_scale=0.7, 20 layers: gradient norm = {vanished:.2e}") # ~0
# Stable (Xavier initialization)
stable = simulate_gradient_flow(n_layers=20, weight_scale=1.0)
print(f"weight_scale=1.0, 20 layers: gradient norm = {stable:.2e}") # ~1
Xavier and He initialization: numerical stability through eigenvalue control
Xavier (Glorot) initialization targets by initializing weights as:
This keeps the variance of activations and gradients roughly constant across layers.
He initialization (for ReLU activations, which halve the variance):
import torch
import torch.nn as nn
# PyTorch's default initialization for Linear layers
# Uses Kaiming uniform (He initialization for ReLU)
layer = nn.Linear(512, 512)
print(f"Default weight std: {layer.weight.std():.4f}")
# ≈ 0.0442 = sqrt(2/512) - He initialization for ReLU
# Manual initialization
def stable_init(layer: nn.Linear, activation: str = 'relu') -> None:
"""Apply numerically stable initialization."""
if activation == 'relu':
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
elif activation == 'tanh':
nn.init.xavier_normal_(layer.weight, gain=nn.init.calculate_gain('tanh'))
elif activation == 'linear':
nn.init.xavier_normal_(layer.weight)
nn.init.zeros_(layer.bias)
Condition number monitoring in training
import torch
import numpy as np
def monitor_weight_conditioning(model: torch.nn.Module) -> dict:
"""
Compute condition numbers of weight matrices.
High condition numbers predict gradient instability.
"""
conditions = {}
for name, param in model.named_parameters():
if param.dim() == 2: # Only for weight matrices
W = param.detach().cpu().numpy()
try:
kappa = np.linalg.cond(W)
conditions[name] = kappa
if kappa > 1e6:
print(f"WARNING: {name} has condition number {kappa:.2e}")
except np.linalg.LinAlgError:
conditions[name] = float('inf')
return conditions
Part 7 - SVD: The Most Numerically Robust Decomposition
The Singular Value Decomposition (SVD) decomposes any matrix:
where are orthogonal and is diagonal with non-negative entries (singular values).
SVD is the most numerically robust decomposition. It is used when:
- The matrix may be rank-deficient
- You need the pseudoinverse
- You need to compute the condition number
import numpy as np
A = np.random.randn(50, 10) # Overdetermined
# Full SVD
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(f"U shape: {U.shape}") # (50, 10) - left singular vectors
print(f"s shape: {s.shape}") # (10,) - singular values
print(f"Vt shape: {Vt.shape}") # (10, 10) - right singular vectors (transposed)
# Condition number
kappa = s[0] / s[-1]
print(f"Condition number: {kappa:.2f}") # sigma_max / sigma_min
# Moore-Penrose pseudoinverse via SVD (more stable than inv for rank-deficient)
def stable_pseudoinverse(A: np.ndarray, rcond: float = 1e-10) -> np.ndarray:
"""
Compute pseudoinverse using SVD.
Equivalent to np.linalg.pinv() but with explicit threshold.
"""
U, s, Vt = np.linalg.svd(A, full_matrices=False)
# Zero out singular values below threshold (treat as numerically zero)
s_inv = np.where(s > rcond * s[0], 1.0 / s, 0.0)
return Vt.T @ np.diag(s_inv) @ U.T
A_pinv = stable_pseudoinverse(A)
b = np.random.randn(50)
x_pinv = A_pinv @ b
print(f"Residual: {np.linalg.norm(A @ x_pinv - b):.6f}") # minimum norm solution
Part 8 - Practical Solver Selection Guide
Given: solve Ax = b or related problem
┌─── Is A square? ──────────────────────────────────────────────────┐
│ │
│ YES → Is A symmetric positive definite? │
│ YES → Cholesky (fastest, most stable for SPD) │
│ NO → Is A well-conditioned and dense? │
│ YES → LU (general square solver) │
│ NO → SVD-based pseudoinverse │
│ │
│ NO → Overdetermined (m > n)? │
│ YES → Is precision critical? │
│ YES → SVD via lstsq (most stable) │
│ NO → QR (faster, nearly as stable) │
│ NO → Underdetermined (m < n)? │
│ → Minimum-norm via SVD pseudoinverse │
└───────────────────────────────────────────────────────────────────┘
import numpy as np
from scipy.linalg import solve, cho_solve, cho_factor
def smart_solve(A: np.ndarray, b: np.ndarray,
assume_spd: bool = False) -> np.ndarray:
"""
Solve Ax = b using the most numerically appropriate method.
"""
m, n = A.shape
if m == n:
if assume_spd:
# Cholesky: fastest for symmetric positive definite
try:
c = cho_factor(A, lower=True)
return cho_solve(c, b)
except np.linalg.LinAlgError:
print("Cholesky failed - matrix not SPD, falling back to LU")
return solve(A, b)
else:
# LU with partial pivoting (scipy's solve)
return solve(A, b)
elif m > n:
# Overdetermined: least squares via SVD (most stable)
x, _, _, _ = np.linalg.lstsq(A, b, rcond=None)
return x
else:
# Underdetermined: minimum-norm solution via SVD
x, _, _, _ = np.linalg.lstsq(A, b, rcond=None)
return x
Interview Questions
Q1: What is the condition number and why does it matter for machine learning?
The condition number (ratio of largest to smallest singular values) measures how sensitive the solution of is to perturbations in or .
Formal bound:
A relative perturbation of in can cause up to relative error in .
ML relevance:
-
Correlated features: When features are highly correlated, the Gram matrix has a very large condition number. The normal equations become unreliable - tiny data noise causes huge swings in . Ridge regression adds to the Gram matrix, reducing the condition number to .
-
Gradient explosion/vanishing: The condition number of the product of layer Jacobians in a deep network determines whether gradients explode (condition number >> 1) or vanish (condition number << 1).
-
Covariance estimation: Covariance matrices estimated from fewer samples than dimensions are rank-deficient (). Adding diagonal regularization (shrinkage) makes them invertible and well-conditioned.
Q2: Why should you never invert a matrix to solve a linear system?
Computing by:
- Explicitly computing (cost: + solve linear systems)
- Multiplying (cost: )
is slower and less accurate than directly solving .
Speed: np.linalg.solve(A, b) uses LU factorization - once, per RHS. np.linalg.inv(A) @ b computes all columns of , equivalent to solving linear systems. For a single , this is times more work.
Accuracy: Explicitly forming accumulates rounding errors across all column solves. The final matrix-vector product adds more rounding error. Direct solve (numpy.linalg.solve) introduces rounding error only once along the single solution path.
The rule: Always use np.linalg.solve(A, b) or scipy.linalg.solve(A, b). The only exception is when you genuinely need the inverse matrix itself (e.g., computing for matrix ), in which case use np.linalg.solve(A, B) with matrix right-hand side.
Q3: When should you use QR decomposition over LU for solving linear systems?
Use LU when: The system is square (), well-conditioned, and you are solving .
Use QR when:
-
Overdetermined system (, more equations than unknowns): QR solves the least squares problem stably, avoiding the condition-number squaring of the normal equations.
-
Numerical stability is critical: QR uses orthogonal transformations (which have condition number = 1), so it does not amplify existing ill-conditioning. The normal equations square the condition number of .
-
Gram-Schmidt orthogonalization: When you need an orthogonal basis for the column space of .
Quantitative comparison:
- If , then - loses 4 decimal digits of precision in float32 (which has ~7 digits). QR keeps precision at .
- For features with correlation coefficient 0.9999, can easily reach - QR is essential.
Q4: How does ridge regression stabilize ill-conditioned Gram matrices?
The OLS problem has solution . When is ill-conditioned (due to correlated features), is numerically unstable.
Ridge regression adds to the Gram matrix:
The eigenvalues of are (squared singular values of ). Adding shifts all eigenvalues to :
For a matrix where and :
- Without ridge:
- With : - much better
Trade-off: Larger → better conditioning but more bias (shrinks toward zero). Cross-validation selects the optimal .
Q5: Explain gradient vanishing and explosion from a numerical linear algebra perspective.
In a deep network with layers, backpropagation computes:
where is the Jacobian of layer and is the gradient from the loss.
Spectral analysis: If each Jacobian has singular values near constant :
- : grows exponentially → gradient explosion
- : shrinks exponentially → gradient vanishing
- : stable gradient magnitude throughout network
Solutions targeting this condition:
-
Careful initialization (Xavier/He): Sets initial by scaling weights by .
-
Batch/Layer normalization: Normalizes activations to have unit variance, which constrains the Jacobian's spectral radius to be near 1.
-
Residual connections (ResNets): means the Jacobian is . The identity term ensures at least some singular values are near 1, providing a "gradient highway" that bypasses the vanishing problem.
-
Gradient clipping: Clips to
max_norm- treats the symptom (explosion) but not the cause.
Quick Reference
| Matrix Type | Best Decomposition | When to Use |
|---|---|---|
| General square | LU (via scipy.linalg.solve) | Default for square systems |
| SPD (covariance, Gram+λI) | Cholesky | 2× faster, guaranteed stable |
| Overdetermined (m > n) | QR or SVD-lstsq | Least squares problems |
| Rank-deficient | SVD (via np.linalg.lstsq) | When near-singularity is expected |
| Any (maximum stability) | SVD | When accuracy is paramount |
| Condition Number | Interpretation | Action |
|---|---|---|
| 1–10 | Well-conditioned | No action needed |
| 10³–10⁶ | Moderately ill-conditioned | Monitor, consider regularization |
| > 10⁸ | Severely ill-conditioned | Add regularization λI |
| > 1/ε_machine | Numerically singular | SVD pseudoinverse required |
Next: Lesson 03: Iterative Solvers →
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Condition Number demo on the EngineersOfAI Playground - no code required.
:::
