Skip to main content

Numerical Linear Algebra - Condition Numbers, Solvers, and Backprop Stability

Reading time: ~28 minutes | Level: Numerical Foundations → Production ML

Your linear regression model produces wildly different weights every time you retrain on slightly different data. Your covariance matrix inversion fails with a cryptic LinAlgError: Singular matrix. Your neural network with very deep layers has gradients that explode on some architectures but not others - and you don't understand why.

These are numerical linear algebra problems. They stem from a single root cause: some matrices are sensitive to small perturbations, and finite-precision arithmetic introduces perturbations at every step. Understanding condition numbers and matrix decompositions is what separates engineers who debug these issues in an hour from those who spend weeks adding random seeds and hoping.

What You Will Learn

  • The condition number: quantifying how "dangerous" a matrix is numerically
  • Ill-conditioned matrices and what they mean for ML models
  • LU, QR, and Cholesky factorizations: when to use each
  • Why inverting a matrix is almost always the wrong approach
  • Numerical stability of the backpropagation algorithm
  • Practical SciPy / NumPy patterns for stable linear system solving

Part 1 - The Condition Number

Definition

The condition number of a matrix AA (with respect to the L2 norm) is:

κ(A)=AA1=σmaxσmin\kappa(A) = \|A\| \cdot \|A^{-1}\| = \frac{\sigma_{\max}}{\sigma_{\min}}

where σmax\sigma_{\max} and σmin\sigma_{\min} are the largest and smallest singular values of AA.

The condition number measures how much relative error in the output results from a small relative error in the input:

δxxκ(A)δbb\frac{\|\delta x\|}{\|x\|} \leq \kappa(A) \cdot \frac{\|\delta b\|}{\|b\|}

where Ax=bAx = b and δb\delta b is a small perturbation.

Interpreting the condition number

κ(A)\kappa(A)ClassificationMeaning
1\approx 1Well-conditionedSmall input errors → small output errors
10310610^3 - 10^6Moderately ill-conditionedCaution needed
>108> 10^8Severely ill-conditionedNear-singular; solutions unreliable in float32
>1015> 10^{15}Numerically singularCannot solve reliably even in float64
import numpy as np
from numpy.linalg import cond, svd

# Well-conditioned matrix
A_good = np.array([[2.0, 1.0],
[1.0, 2.0]])
print(f"cond(A_good) = {cond(A_good):.2f}") # 3.0 - excellent

# Ill-conditioned matrix (nearly singular)
A_bad = np.array([[1.0, 1.0],
[1.0, 1.0001]]) # rows nearly identical
print(f"cond(A_bad) = {cond(A_bad):.2e}") # ~40000 - problematic

# Demonstrate the consequence
b = np.array([2.0, 2.0001])
x_exact = np.linalg.solve(A_bad, b)

# Tiny perturbation to b
delta_b = np.array([0.0001, 0.0])
b_perturbed = b + delta_b
x_perturbed = np.linalg.solve(A_bad, b_perturbed)

print(f"Relative change in b: {np.linalg.norm(delta_b)/np.linalg.norm(b):.6f}")
print(f"Relative change in x: {np.linalg.norm(x_perturbed - x_exact)/np.linalg.norm(x_exact):.4f}")
# A tiny change in b produces a huge change in x - ill-conditioning in action

Where ill-conditioned matrices appear in ML

1. Gram matrices with correlated features

The least squares problem XTXβ=XTyX^T X \beta = X^T y involves the Gram matrix G=XTXG = X^T X. When features are highly correlated, GG is nearly singular:

import numpy as np

# Two nearly identical features → near-singular Gram matrix
np.random.seed(42)
n = 100
x1 = np.random.randn(n)
x2 = x1 + 0.001 * np.random.randn(n) # x2 ≈ x1

X = np.column_stack([x1, x2])
G = X.T @ X
print(f"Condition number of Gram matrix: {np.linalg.cond(G):.2e}")
# ~4e6 - severely ill-conditioned!

# Adding L2 regularization (ridge) stabilizes the Gram matrix
lambda_reg = 0.01
G_regularized = X.T @ X + lambda_reg * np.eye(2)
print(f"Condition number after ridge: {np.linalg.cond(G_regularized):.2e}")
# Dramatically improved

2. Deep network Jacobians

In a deep network with LL layers, the gradient of the loss with respect to the first layer involves the product of LL Jacobian matrices:

LW1=JLJL1J2J1Ly^\frac{\partial \mathcal{L}}{\partial W_1} = J_L \cdot J_{L-1} \cdots J_2 \cdot J_1 \cdot \frac{\partial \mathcal{L}}{\partial \hat{y}}

If each JlJ_l has κ(Jl)>1\kappa(J_l) > 1, the product's condition number can grow exponentially with depth - gradient explosion. If κ(Jl)<1\kappa(J_l) < 1, it shrinks - gradient vanishing.

Batch normalization, residual connections, and careful initialization (Xavier, He) exist specifically to keep these condition numbers near 1.

Part 2 - LU Factorization: The Workhorse Linear Solver

What it is

LU factorization decomposes A=PLUA = PLU where:

  • PP is a permutation matrix (row reordering for numerical stability)
  • LL is lower triangular with ones on the diagonal
  • UU is upper triangular

Solving Ax=bAx = b becomes:

  1. Factor: A=PLUA = PLU - O(n3)O(n^3) cost, done once
  2. Solve Lc=PbLc = Pb - forward substitution, O(n2)O(n^2)
  3. Solve Ux=cUx = c - back substitution, O(n2)O(n^2)

When you need to solve the same system for many right-hand sides b1,b2,b_1, b_2, \ldots, factoring once and reusing is efficient.

import numpy as np
from scipy.linalg import lu, lu_factor, lu_solve

# Factorize once
A = np.array([[2.0, 1.0, -1.0],
[-3.0, -1.0, 2.0],
[-2.0, 1.0, 2.0]])

lu_piv = lu_factor(A) # Pre-compute LU factorization

# Solve for multiple right-hand sides efficiently
b1 = np.array([8.0, -11.0, -3.0])
b2 = np.array([1.0, 2.0, 3.0])
b3 = np.array([0.0, 1.0, 0.0])

x1 = lu_solve(lu_piv, b1) # O(n^2) each - much faster than O(n^3)
x2 = lu_solve(lu_piv, b2)
x3 = lu_solve(lu_piv, b3)

# Verify
print(f"Ax1 - b1: {np.linalg.norm(A @ x1 - b1):.2e}") # ~0 (machine precision)

When LU is appropriate:

  • Square, dense, well-conditioned system
  • Multiple right-hand sides with the same matrix AA
  • General-purpose linear system solver

When LU is NOT appropriate:

  • Matrix is symmetric positive definite → use Cholesky (twice as fast)
  • System is overdetermined (m>nm > n) → use QR or least squares
  • Matrix is very large and sparse → use iterative methods (see Lesson 03)
  • You need the solution to be numerically robust despite near-singularity → use SVD

Part 3 - QR Factorization: The Numerically Safe Solver

What it is

QR factorization decomposes A=QRA = QR where:

  • QQ is orthogonal (QTQ=IQ^T Q = I) - orthogonal matrices have condition number = 1
  • RR is upper triangular

For the least squares problem minAxb2\min \|Ax - b\|^2 with overdetermined AA (m>nm > n):

ATAx=ATbRx=QTbA^T A x = A^T b \quad \Rightarrow \quad Rx = Q^T b

The QR approach is numerically superior to the normal equations ATAx=ATbA^T A x = A^T b because:

  • Forming ATAA^T A squares the condition number: κ(ATA)=κ(A)2\kappa(A^T A) = \kappa(A)^2
  • The QR solution avoids this squaring, using κ(R)=κ(A)\kappa(R) = \kappa(A) only
import numpy as np
from scipy.linalg import qr, solve_triangular

# Least squares via QR - numerically stable
A = np.random.randn(100, 10) # overdetermined system: 100 equations, 10 unknowns
b = np.random.randn(100)

# Method 1: Normal equations (NUMERICALLY INFERIOR - do not use)
# x_normal = np.linalg.inv(A.T @ A) @ A.T @ b
# Condition number is squared: bad for nearly correlated features

# Method 2: QR factorization (NUMERICALLY SUPERIOR)
Q, R = qr(A, mode='economic') # 'economic' QR: Q is (100, 10), R is (10, 10)
Qtb = Q.T @ b # Q^T b - cheap, Q is orthogonal
x_qr = solve_triangular(R, Qtb) # Solve Rx = Q^T b by back substitution

# Method 3: numpy's least squares (uses SVD internally - most stable)
x_lstsq, _, _, _ = np.linalg.lstsq(A, b, rcond=None)

# All three should give the same answer (QR and SVD do; normal equations may not)
print(f"QR vs lstsq difference: {np.linalg.norm(x_qr - x_lstsq):.2e}")
print(f"Residual (QR): {np.linalg.norm(A @ x_qr - b):.6f}")
print(f"Residual (lstsq): {np.linalg.norm(A @ x_lstsq - b):.6f}")

:::tip QR in ML pipelines When fitting a linear model to features with unknown correlation structure, always solve via QR or np.linalg.lstsq rather than computing the normal equations manually. For Scikit-learn's LinearRegression, this is already done for you - but knowing why matters when you implement custom solvers. :::

Gram-Schmidt and numerical orthogonalization

QR is also the algorithm behind Gram-Schmidt orthogonalization, used whenever you need an orthogonal basis:

def gram_schmidt_qr(A: np.ndarray) -> tuple:
"""
Classical Gram-Schmidt QR factorization.
Note: scipy.linalg.qr uses Householder reflections - more stable.
This is for illustration.
"""
m, n = A.shape
Q = np.zeros((m, n))
R = np.zeros((n, n))

for j in range(n):
v = A[:, j].copy()
for i in range(j):
R[i, j] = Q[:, i] @ A[:, j]
v -= R[i, j] * Q[:, i]
R[j, j] = np.linalg.norm(v)
if R[j, j] > 1e-10:
Q[:, j] = v / R[j, j]

return Q, R

Part 4 - Cholesky Factorization: For Positive Definite Matrices

When it applies

A matrix AA is symmetric positive definite (SPD) if:

  • A=ATA = A^T (symmetric)
  • xTAx>0x^T A x > 0 for all non-zero xx (positive definite)

Common SPD matrices in ML:

  • Covariance matrices: Σ=1nXTX\Sigma = \frac{1}{n} X^T X (after centering)
  • Gram matrices with regularization: XTX+λIX^T X + \lambda I
  • Fisher information matrix
  • Hessian at a local minimum

For SPD matrices, Cholesky factorization gives A=LLTA = LL^T where LL is lower triangular with positive diagonal entries. This is twice as fast as LU and guaranteed stable for SPD matrices.

import numpy as np
from scipy.linalg import cholesky, cho_factor, cho_solve

# Generate a symmetric positive definite matrix
n = 50
A_raw = np.random.randn(n, n)
A = A_raw.T @ A_raw + n * np.eye(n) # Guaranteed SPD: X^T X + nI

# Cholesky factorization
L = cholesky(A, lower=True) # A = L @ L.T
print(f"Is lower triangular: {np.allclose(L, np.tril(L))}") # True

# Efficient solve using Cholesky
cho_fac = cho_factor(A, lower=True) # Pre-factorize
b = np.random.randn(n)
x = cho_solve(cho_fac, b)

# Verify: Ax = b
print(f"Residual: {np.linalg.norm(A @ x - b):.2e}") # ~machine precision

# Use case: Gaussian process covariance
def gp_log_likelihood(K: np.ndarray, y: np.ndarray) -> float:
"""
Compute log marginal likelihood for a Gaussian process.
Uses Cholesky for numerical stability and efficiency.

log p(y | K) = -1/2 y^T K^{-1} y - 1/2 log|K| - n/2 log(2π)
"""
n = len(y)
eps = 1e-6 # Jitter for numerical stability
K_stable = K + eps * np.eye(n)

# Cholesky: K = L L^T
L = cholesky(K_stable, lower=True)

# Solve K^{-1} y = (L L^T)^{-1} y = L^{-T} (L^{-1} y)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

# log|K| = 2 * sum(log(diag(L))) - from det of triangular matrix
log_det_K = 2.0 * np.sum(np.log(np.diag(L)))

return -0.5 * (y @ alpha + log_det_K + n * np.log(2 * np.pi))

:::note Cholesky failure = matrix is not SPD If scipy.linalg.cholesky(A) throws a LinAlgError, the matrix AA is not positive definite. This often happens in practice when:

  • You compute a covariance matrix on too few data points (rank-deficient)
  • Accumulated floating-point errors make a theoretically SPD matrix numerically indefinite
  • Features are exactly linearly dependent

Fix: Add a small jitter/diagonal: A + ε·I with ε=106\varepsilon = 10^{-6} to 10410^{-4}. This is standard practice in Gaussian processes and is the right thing to do. :::

Part 5 - Why You Should Never Invert a Matrix

This is one of the most important practical rules in numerical linear algebra:

Never compute A1A^{-1} explicitly. Always solve the linear system Ax=bAx = b directly.

Why this rule exists

import numpy as np
import time

n = 1000

# Ill-conditioned matrix (for demonstration)
A = np.random.randn(n, n)
A = A + A.T + n * np.eye(n) # Make it positive definite
b = np.random.randn(n)

# Method 1: WRONG - explicit inverse (slow + numerically worse)
start = time.time()
x_wrong = np.linalg.inv(A) @ b
time_wrong = time.time() - start

# Method 2: CORRECT - solve the linear system directly
start = time.time()
x_correct = np.linalg.solve(A, b)
time_correct = time.time() - start

# Both give the same answer, but:
print(f"Time (inv): {time_wrong:.4f}s")
print(f"Time (solve): {time_correct:.4f}s")
# solve() is ~2-3x faster - inv() computes n solutions to verify, solve() computes 1

# More importantly: accuracy on ill-conditioned systems
A_bad = np.array([[1.0, 1.0], [1.0, 1.0001]])
b_bad = np.array([2.0, 2.0001])

x_inv = np.linalg.inv(A_bad) @ b_bad
x_solve = np.linalg.solve(A_bad, b_bad)

# x_inv accumulates more rounding error - solve() uses LU with pivoting
print(f"Residual (inv): {np.linalg.norm(A_bad @ x_inv - b_bad):.2e}")
print(f"Residual (solve): {np.linalg.norm(A_bad @ x_solve - b_bad):.2e}")

The fundamental reason: Computing A1A^{-1} requires solving nn linear systems (one per column of II). But if you only need A1bA^{-1}b, that is just one linear system. Wasting n1n-1 extra solves introduces more rounding error.

The only legitimate use of explicit inverse

When you need the actual matrix A1A^{-1} (not just the solution to a particular system), such as:

  • Computing A1BA^{-1} B for matrix-valued right-hand sides
  • Analytical formulas that require the inverse

Even then, use np.linalg.solve(A, B) for multiple right-hand sides, not np.linalg.inv(A) @ B.

Part 6 - Numerical Stability of Backpropagation

The Jacobian product view

Backpropagation is the chain rule applied to a composition of functions:

Lθl=LyLloss gradientJfLJfl+1upstream JacobiansJfl(θ)layer gradient\frac{\partial \mathcal{L}}{\partial \theta_l} = \underbrace{\frac{\partial \mathcal{L}}{\partial y_L}}_{\text{loss gradient}} \cdot \underbrace{J_{f_L} \cdots J_{f_{l+1}}}_{\text{upstream Jacobians}} \cdot \underbrace{J_{f_l}^{(\theta)}}_{\text{layer gradient}}

where JflJ_{f_l} is the Jacobian of layer ll with respect to its inputs.

The numerical stability of the entire computation depends on the product of these Jacobians.

Eigenvalue analysis of gradient flow

Consider a deep network with linear layers and the same weight matrix WW at each layer. The gradient at layer ll involves WLlW^{L-l}. Let λ1,,λn\lambda_1, \ldots, \lambda_n be eigenvalues of WW:

LhlLhLλmaxLl\left\|\frac{\partial \mathcal{L}}{\partial h_l}\right\| \approx \left\|\frac{\partial \mathcal{L}}{\partial h_L}\right\| \cdot |\lambda_{\max}|^{L-l}

  • If λmax>1|\lambda_{\max}| > 1: gradients explode exponentially with depth
  • If λmax<1|\lambda_{\max}| < 1: gradients vanish exponentially with depth
  • If λmax1|\lambda_{\max}| \approx 1: stable gradient flow
import numpy as np

def simulate_gradient_flow(n_layers: int, weight_scale: float) -> float:
"""
Simulate gradient magnitude through a deep linear network.

weight_scale > 1: gradient explosion
weight_scale < 1: gradient vanishing
weight_scale ~ 1: stable
"""
d = 32 # hidden dimension
# Initialize weights with specific scale
weights = [np.random.randn(d, d) * weight_scale / np.sqrt(d)
for _ in range(n_layers)]

# Forward pass: h_l = W_l h_{l-1}
h = np.random.randn(d)
h /= np.linalg.norm(h)

activations = [h]
for W in weights:
h = W @ h
activations.append(h)

# Backward pass: g_l = W_{l+1}^T g_{l+1}
g = np.random.randn(d)
g /= np.linalg.norm(g)

gradient_norms = [np.linalg.norm(g)]
for W in reversed(weights):
g = W.T @ g
gradient_norms.append(np.linalg.norm(g))

return gradient_norms[-1] # gradient at first layer

# Gradient explosion
exploded = simulate_gradient_flow(n_layers=20, weight_scale=1.5)
print(f"weight_scale=1.5, 20 layers: gradient norm = {exploded:.2e}") # huge

# Gradient vanishing
vanished = simulate_gradient_flow(n_layers=20, weight_scale=0.7)
print(f"weight_scale=0.7, 20 layers: gradient norm = {vanished:.2e}") # ~0

# Stable (Xavier initialization)
stable = simulate_gradient_flow(n_layers=20, weight_scale=1.0)
print(f"weight_scale=1.0, 20 layers: gradient norm = {stable:.2e}") # ~1

Xavier and He initialization: numerical stability through eigenvalue control

Xavier (Glorot) initialization targets λ1|\lambda| \approx 1 by initializing weights as:

WijU(6nin+nout,6nin+nout)W_{ij} \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)

This keeps the variance of activations and gradients roughly constant across layers.

He initialization (for ReLU activations, which halve the variance):

WijN(0,2nin)W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)

import torch
import torch.nn as nn

# PyTorch's default initialization for Linear layers
# Uses Kaiming uniform (He initialization for ReLU)
layer = nn.Linear(512, 512)
print(f"Default weight std: {layer.weight.std():.4f}")
# ≈ 0.0442 = sqrt(2/512) - He initialization for ReLU

# Manual initialization
def stable_init(layer: nn.Linear, activation: str = 'relu') -> None:
"""Apply numerically stable initialization."""
if activation == 'relu':
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
elif activation == 'tanh':
nn.init.xavier_normal_(layer.weight, gain=nn.init.calculate_gain('tanh'))
elif activation == 'linear':
nn.init.xavier_normal_(layer.weight)
nn.init.zeros_(layer.bias)

Condition number monitoring in training

import torch
import numpy as np

def monitor_weight_conditioning(model: torch.nn.Module) -> dict:
"""
Compute condition numbers of weight matrices.
High condition numbers predict gradient instability.
"""
conditions = {}
for name, param in model.named_parameters():
if param.dim() == 2: # Only for weight matrices
W = param.detach().cpu().numpy()
try:
kappa = np.linalg.cond(W)
conditions[name] = kappa
if kappa > 1e6:
print(f"WARNING: {name} has condition number {kappa:.2e}")
except np.linalg.LinAlgError:
conditions[name] = float('inf')
return conditions

Part 7 - SVD: The Most Numerically Robust Decomposition

The Singular Value Decomposition (SVD) decomposes any matrix:

A=UΣVTA = U \Sigma V^T

where U,VU, V are orthogonal and Σ\Sigma is diagonal with non-negative entries (singular values).

SVD is the most numerically robust decomposition. It is used when:

  • The matrix may be rank-deficient
  • You need the pseudoinverse
  • You need to compute the condition number
import numpy as np

A = np.random.randn(50, 10) # Overdetermined

# Full SVD
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(f"U shape: {U.shape}") # (50, 10) - left singular vectors
print(f"s shape: {s.shape}") # (10,) - singular values
print(f"Vt shape: {Vt.shape}") # (10, 10) - right singular vectors (transposed)

# Condition number
kappa = s[0] / s[-1]
print(f"Condition number: {kappa:.2f}") # sigma_max / sigma_min

# Moore-Penrose pseudoinverse via SVD (more stable than inv for rank-deficient)
def stable_pseudoinverse(A: np.ndarray, rcond: float = 1e-10) -> np.ndarray:
"""
Compute pseudoinverse using SVD.
Equivalent to np.linalg.pinv() but with explicit threshold.
"""
U, s, Vt = np.linalg.svd(A, full_matrices=False)
# Zero out singular values below threshold (treat as numerically zero)
s_inv = np.where(s > rcond * s[0], 1.0 / s, 0.0)
return Vt.T @ np.diag(s_inv) @ U.T

A_pinv = stable_pseudoinverse(A)
b = np.random.randn(50)
x_pinv = A_pinv @ b
print(f"Residual: {np.linalg.norm(A @ x_pinv - b):.6f}") # minimum norm solution

Part 8 - Practical Solver Selection Guide

Given: solve Ax = b or related problem

┌─── Is A square? ──────────────────────────────────────────────────┐
│ │
│ YES → Is A symmetric positive definite? │
│ YES → Cholesky (fastest, most stable for SPD) │
│ NO → Is A well-conditioned and dense? │
│ YES → LU (general square solver) │
│ NO → SVD-based pseudoinverse │
│ │
│ NO → Overdetermined (m > n)? │
│ YES → Is precision critical? │
│ YES → SVD via lstsq (most stable) │
│ NO → QR (faster, nearly as stable) │
│ NO → Underdetermined (m < n)? │
│ → Minimum-norm via SVD pseudoinverse │
└───────────────────────────────────────────────────────────────────┘
import numpy as np
from scipy.linalg import solve, cho_solve, cho_factor

def smart_solve(A: np.ndarray, b: np.ndarray,
assume_spd: bool = False) -> np.ndarray:
"""
Solve Ax = b using the most numerically appropriate method.
"""
m, n = A.shape

if m == n:
if assume_spd:
# Cholesky: fastest for symmetric positive definite
try:
c = cho_factor(A, lower=True)
return cho_solve(c, b)
except np.linalg.LinAlgError:
print("Cholesky failed - matrix not SPD, falling back to LU")
return solve(A, b)
else:
# LU with partial pivoting (scipy's solve)
return solve(A, b)

elif m > n:
# Overdetermined: least squares via SVD (most stable)
x, _, _, _ = np.linalg.lstsq(A, b, rcond=None)
return x

else:
# Underdetermined: minimum-norm solution via SVD
x, _, _, _ = np.linalg.lstsq(A, b, rcond=None)
return x

Interview Questions

Q1: What is the condition number and why does it matter for machine learning?

The condition number κ(A)=σmax/σmin\kappa(A) = \sigma_{\max} / \sigma_{\min} (ratio of largest to smallest singular values) measures how sensitive the solution of Ax=bAx = b is to perturbations in bb or AA.

Formal bound: δxxκ(A)δbb\frac{\|\delta x\|}{\|x\|} \leq \kappa(A) \cdot \frac{\|\delta b\|}{\|b\|}

A relative perturbation of ϵ\epsilon in bb can cause up to κ(A)ϵ\kappa(A) \cdot \epsilon relative error in xx.

ML relevance:

  1. Correlated features: When features are highly correlated, the Gram matrix XTXX^T X has a very large condition number. The normal equations XTXβ=XTyX^T X \beta = X^T y become unreliable - tiny data noise causes huge swings in β\beta. Ridge regression adds λI\lambda I to the Gram matrix, reducing the condition number to (σmax2+λ)/(σmin2+λ)(\sigma_{\max}^2 + \lambda)/(\sigma_{\min}^2 + \lambda).

  2. Gradient explosion/vanishing: The condition number of the product of layer Jacobians in a deep network determines whether gradients explode (condition number >> 1) or vanish (condition number << 1).

  3. Covariance estimation: Covariance matrices estimated from fewer samples than dimensions are rank-deficient (κ=\kappa = \infty). Adding diagonal regularization (shrinkage) makes them invertible and well-conditioned.

Q2: Why should you never invert a matrix to solve a linear system?

Computing x=A1bx = A^{-1} b by:

  1. Explicitly computing A1A^{-1} (cost: O(n3)O(n^3) + solve nn linear systems)
  2. Multiplying A1bA^{-1} b (cost: O(n2)O(n^2))

is slower and less accurate than directly solving Ax=bAx = b.

Speed: np.linalg.solve(A, b) uses LU factorization - O(n3)O(n^3) once, O(n2)O(n^2) per RHS. np.linalg.inv(A) @ b computes all nn columns of A1A^{-1}, equivalent to solving nn linear systems. For a single bb, this is nn times more work.

Accuracy: Explicitly forming A1A^{-1} accumulates rounding errors across all nn column solves. The final matrix-vector product adds more rounding error. Direct solve (numpy.linalg.solve) introduces rounding error only once along the single solution path.

The rule: Always use np.linalg.solve(A, b) or scipy.linalg.solve(A, b). The only exception is when you genuinely need the inverse matrix itself (e.g., computing A1BA^{-1} B for matrix BB), in which case use np.linalg.solve(A, B) with matrix right-hand side.

Q3: When should you use QR decomposition over LU for solving linear systems?

Use LU when: The system is square (m=nm = n), well-conditioned, and you are solving Ax=bAx = b.

Use QR when:

  1. Overdetermined system (m>nm > n, more equations than unknowns): QR solves the least squares problem minAxb2\min \|Ax - b\|^2 stably, avoiding the condition-number squaring of the normal equations.

  2. Numerical stability is critical: QR uses orthogonal transformations (which have condition number = 1), so it does not amplify existing ill-conditioning. The normal equations ATAx=ATbA^T A x = A^T b square the condition number of AA.

  3. Gram-Schmidt orthogonalization: When you need an orthogonal basis for the column space of AA.

Quantitative comparison:

  • If κ(A)=104\kappa(A) = 10^4, then κ(ATA)=108\kappa(A^T A) = 10^8 - loses 4 decimal digits of precision in float32 (which has ~7 digits). QR keeps precision at κ(A)=104\kappa(A) = 10^4.
  • For features with correlation coefficient 0.9999, κ(XTX)\kappa(X^T X) can easily reach 10810^8 - QR is essential.
Q4: How does ridge regression stabilize ill-conditioned Gram matrices?

The OLS problem minXβy2\min \|X\beta - y\|^2 has solution β=(XTX)1XTy\beta = (X^T X)^{-1} X^T y. When XTXX^T X is ill-conditioned (due to correlated features), β\beta is numerically unstable.

Ridge regression adds λI\lambda I to the Gram matrix:

βridge=(XTX+λI)1XTy\beta_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y

The eigenvalues of XTXX^T X are σi2\sigma_i^2 (squared singular values of XX). Adding λI\lambda I shifts all eigenvalues to σi2+λ\sigma_i^2 + \lambda:

κ(XTX+λI)=σmax2+λσmin2+λ\kappa(X^T X + \lambda I) = \frac{\sigma_{\max}^2 + \lambda}{\sigma_{\min}^2 + \lambda}

For a matrix where σmax2=100\sigma_{\max}^2 = 100 and σmin2=106\sigma_{\min}^2 = 10^{-6}:

  • Without ridge: κ=108\kappa = 10^8
  • With λ=0.01\lambda = 0.01: κ=(100+0.01)/(106+0.01)104\kappa = (100 + 0.01) / (10^{-6} + 0.01) \approx 10^4 - much better

Trade-off: Larger λ\lambda → better conditioning but more bias (shrinks β\beta toward zero). Cross-validation selects the optimal λ\lambda.

Q5: Explain gradient vanishing and explosion from a numerical linear algebra perspective.

In a deep network with LL layers, backpropagation computes:

gl=JLTJL1TJl+1TgLg_l = J_{L}^T J_{L-1}^T \cdots J_{l+1}^T \cdot g_L

where JlJ_l is the Jacobian of layer ll and gLg_L is the gradient from the loss.

Spectral analysis: If each Jacobian JlJ_l has singular values near constant σ\sigma:

g1gLσL1\|g_1\| \approx \|g_L\| \cdot \sigma^{L-1}

  • σ>1\sigma > 1: g1\|g_1\| grows exponentially → gradient explosion
  • σ<1\sigma < 1: g1\|g_1\| shrinks exponentially → gradient vanishing
  • σ=1\sigma = 1: stable gradient magnitude throughout network

Solutions targeting this condition:

  1. Careful initialization (Xavier/He): Sets initial σ1\sigma \approx 1 by scaling weights by 1/nin1/\sqrt{n_{\text{in}}}.

  2. Batch/Layer normalization: Normalizes activations to have unit variance, which constrains the Jacobian's spectral radius to be near 1.

  3. Residual connections (ResNets): hl+1=hl+F(hl)h_{l+1} = h_l + F(h_l) means the Jacobian is I+JFI + J_F. The identity term ensures at least some singular values are near 1, providing a "gradient highway" that bypasses the vanishing problem.

  4. Gradient clipping: Clips \|\nabla\| to max_norm - treats the symptom (explosion) but not the cause.

Quick Reference

Matrix TypeBest DecompositionWhen to Use
General squareLU (via scipy.linalg.solve)Default for square systems
SPD (covariance, Gram+λI)Cholesky2× faster, guaranteed stable
Overdetermined (m > n)QR or SVD-lstsqLeast squares problems
Rank-deficientSVD (via np.linalg.lstsq)When near-singularity is expected
Any (maximum stability)SVDWhen accuracy is paramount
Condition NumberInterpretationAction
1–10Well-conditionedNo action needed
10³–10⁶Moderately ill-conditionedMonitor, consider regularization
> 10⁸Severely ill-conditionedAdd regularization λI
> 1/ε_machineNumerically singularSVD pseudoinverse required

Next: Lesson 03: Iterative Solvers →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Condition Number demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.