Numerical Linear Algebra - Condition Numbers, Solvers, and Backprop Stability

Reading time: ~28 minutes | Level: Numerical Foundations → Production ML

Your linear regression model produces wildly different weights every time you retrain on slightly different data. Your covariance matrix inversion fails with a cryptic LinAlgError: Singular matrix. Your neural network with very deep layers has gradients that explode on some architectures but not others - and you don't understand why.

These are numerical linear algebra problems. They stem from a single root cause: some matrices are sensitive to small perturbations, and finite-precision arithmetic introduces perturbations at every step. Understanding condition numbers and matrix decompositions is what separates engineers who debug these issues in an hour from those who spend weeks adding random seeds and hoping.

What You Will Learn

The condition number: quantifying how "dangerous" a matrix is numerically
Ill-conditioned matrices and what they mean for ML models
LU, QR, and Cholesky factorizations: when to use each
Why inverting a matrix is almost always the wrong approach
Numerical stability of the backpropagation algorithm
Practical SciPy / NumPy patterns for stable linear system solving

Part 1 - The Condition Number

Definition

The condition number of a matrix $A$ (with respect to the L2 norm) is:

$\kappa(A) = \|A\| \cdot \|A^{-1}\| = \frac{\sigma_{\max}}{\sigma_{\min}}$

where $\sigma_{\max}$ and $\sigma_{\min}$ are the largest and smallest singular values of $A$ .

The condition number measures how much relative error in the output results from a small relative error in the input:

$\frac{\|\delta x\|}{\|x\|} \leq \kappa(A) \cdot \frac{\|\delta b\|}{\|b\|}$

where $Ax = b$ and $\delta b$ is a small perturbation.

Interpreting the condition number

$\kappa(A)$	Classification	Meaning
$\approx 1$	Well-conditioned	Small input errors → small output errors
$10^3 - 10^6$	Moderately ill-conditioned	Caution needed
$> 10^8$	Severely ill-conditioned	Near-singular; solutions unreliable in float32
$> 10^{15}$	Numerically singular	Cannot solve reliably even in float64

import numpy as np
from numpy.linalg import cond, svd

# Well-conditioned matrix
A_good = np.array([[2.0, 1.0],
                   [1.0, 2.0]])
print(f"cond(A_good) = {cond(A_good):.2f}")   # 3.0 - excellent

# Ill-conditioned matrix (nearly singular)
A_bad = np.array([[1.0, 1.0],
                  [1.0, 1.0001]])  # rows nearly identical
print(f"cond(A_bad) = {cond(A_bad):.2e}")     # ~40000 - problematic

# Demonstrate the consequence
b = np.array([2.0, 2.0001])
x_exact = np.linalg.solve(A_bad, b)

# Tiny perturbation to b
delta_b = np.array([0.0001, 0.0])
b_perturbed = b + delta_b
x_perturbed = np.linalg.solve(A_bad, b_perturbed)

print(f"Relative change in b:   {np.linalg.norm(delta_b)/np.linalg.norm(b):.6f}")
print(f"Relative change in x:   {np.linalg.norm(x_perturbed - x_exact)/np.linalg.norm(x_exact):.4f}")
# A tiny change in b produces a huge change in x - ill-conditioning in action

Where ill-conditioned matrices appear in ML

1. Gram matrices with correlated features

The least squares problem $X^T X \beta = X^T y$ involves the Gram matrix $G = X^T X$ . When features are highly correlated, $G$ is nearly singular:

import numpy as np

# Two nearly identical features → near-singular Gram matrix
np.random.seed(42)
n = 100
x1 = np.random.randn(n)
x2 = x1 + 0.001 * np.random.randn(n)  # x2 ≈ x1

X = np.column_stack([x1, x2])
G = X.T @ X
print(f"Condition number of Gram matrix: {np.linalg.cond(G):.2e}")
# ~4e6 - severely ill-conditioned!

# Adding L2 regularization (ridge) stabilizes the Gram matrix
lambda_reg = 0.01
G_regularized = X.T @ X + lambda_reg * np.eye(2)
print(f"Condition number after ridge: {np.linalg.cond(G_regularized):.2e}")
# Dramatically improved

2. Deep network Jacobians

In a deep network with $L$ layers, the gradient of the loss with respect to the first layer involves the product of $L$ Jacobian matrices:

$\frac{\partial \mathcal{L}}{\partial W_1} = J_L \cdot J_{L-1} \cdots J_2 \cdot J_1 \cdot \frac{\partial \mathcal{L}}{\partial \hat{y}}$

If each $J_l$ has $\kappa(J_l) > 1$ , the product's condition number can grow exponentially with depth - gradient explosion. If $\kappa(J_l) < 1$ , it shrinks - gradient vanishing.

Batch normalization, residual connections, and careful initialization (Xavier, He) exist specifically to keep these condition numbers near 1.

Part 2 - LU Factorization: The Workhorse Linear Solver

What it is

LU factorization decomposes $A = PLU$ where:

$P$ is a permutation matrix (row reordering for numerical stability)
$L$ is lower triangular with ones on the diagonal
$U$ is upper triangular

Solving $Ax = b$ becomes:

Factor: $A = PLU$ - $O(n^3)$ cost, done once
Solve $Lc = Pb$ - forward substitution, $O(n^2)$
Solve $Ux = c$ - back substitution, $O(n^2)$

When you need to solve the same system for many right-hand sides $b_1, b_2, \ldots$ , factoring once and reusing is efficient.

import numpy as np
from scipy.linalg import lu, lu_factor, lu_solve

# Factorize once
A = np.array([[2.0, 1.0, -1.0],
              [-3.0, -1.0, 2.0],
              [-2.0, 1.0, 2.0]])

lu_piv = lu_factor(A)  # Pre-compute LU factorization

# Solve for multiple right-hand sides efficiently
b1 = np.array([8.0, -11.0, -3.0])
b2 = np.array([1.0, 2.0, 3.0])
b3 = np.array([0.0, 1.0, 0.0])

x1 = lu_solve(lu_piv, b1)   # O(n^2) each - much faster than O(n^3)
x2 = lu_solve(lu_piv, b2)
x3 = lu_solve(lu_piv, b3)

# Verify
print(f"Ax1 - b1: {np.linalg.norm(A @ x1 - b1):.2e}")  # ~0 (machine precision)

When LU is appropriate:

Square, dense, well-conditioned system
Multiple right-hand sides with the same matrix $A$
General-purpose linear system solver

When LU is NOT appropriate:

Matrix is symmetric positive definite → use Cholesky (twice as fast)
System is overdetermined ( $m > n$ ) → use QR or least squares
Matrix is very large and sparse → use iterative methods (see Lesson 03)
You need the solution to be numerically robust despite near-singularity → use SVD

Part 3 - QR Factorization: The Numerically Safe Solver

What it is

QR factorization decomposes $A = QR$ where:

$Q$ is orthogonal ( $Q^T Q = I$ ) - orthogonal matrices have condition number = 1
$R$ is upper triangular

For the least squares problem $\min \|Ax - b\|^2$ with overdetermined $A$ ( $m > n$ ):

$A^T A x = A^T b \quad \Rightarrow \quad Rx = Q^T b$

The QR approach is numerically superior to the normal equations $A^T A x = A^T b$ because:

Forming $A^T A$ squares the condition number: $\kappa(A^T A) = \kappa(A)^2$
The QR solution avoids this squaring, using $\kappa(R) = \kappa(A)$ only

import numpy as np
from scipy.linalg import qr, solve_triangular

# Least squares via QR - numerically stable
A = np.random.randn(100, 10)   # overdetermined system: 100 equations, 10 unknowns
b = np.random.randn(100)

# Method 1: Normal equations (NUMERICALLY INFERIOR - do not use)
# x_normal = np.linalg.inv(A.T @ A) @ A.T @ b
# Condition number is squared: bad for nearly correlated features

# Method 2: QR factorization (NUMERICALLY SUPERIOR)
Q, R = qr(A, mode='economic')     # 'economic' QR: Q is (100, 10), R is (10, 10)
Qtb = Q.T @ b                      # Q^T b - cheap, Q is orthogonal
x_qr = solve_triangular(R, Qtb)    # Solve Rx = Q^T b by back substitution

# Method 3: numpy's least squares (uses SVD internally - most stable)
x_lstsq, _, _, _ = np.linalg.lstsq(A, b, rcond=None)

# All three should give the same answer (QR and SVD do; normal equations may not)
print(f"QR vs lstsq difference: {np.linalg.norm(x_qr - x_lstsq):.2e}")
print(f"Residual (QR):  {np.linalg.norm(A @ x_qr - b):.6f}")
print(f"Residual (lstsq): {np.linalg.norm(A @ x_lstsq - b):.6f}")

:::tip QR in ML pipelines When fitting a linear model to features with unknown correlation structure, always solve via QR or np.linalg.lstsq rather than computing the normal equations manually. For Scikit-learn's LinearRegression, this is already done for you - but knowing why matters when you implement custom solvers. :::

Gram-Schmidt and numerical orthogonalization

QR is also the algorithm behind Gram-Schmidt orthogonalization, used whenever you need an orthogonal basis:

def gram_schmidt_qr(A: np.ndarray) -> tuple:
    """
    Classical Gram-Schmidt QR factorization.
    Note: scipy.linalg.qr uses Householder reflections - more stable.
    This is for illustration.
    """
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))

    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]
            v -= R[i, j] * Q[:, i]
        R[j, j] = np.linalg.norm(v)
        if R[j, j] > 1e-10:
            Q[:, j] = v / R[j, j]

    return Q, R

Part 4 - Cholesky Factorization: For Positive Definite Matrices

When it applies

A matrix $A$ is symmetric positive definite (SPD) if:

$A = A^T$ (symmetric)
$x^T A x > 0$ for all non-zero $x$ (positive definite)

Common SPD matrices in ML:

Covariance matrices: $\Sigma = \frac{1}{n} X^T X$ (after centering)
Gram matrices with regularization: $X^T X + \lambda I$
Fisher information matrix
Hessian at a local minimum

For SPD matrices, Cholesky factorization gives $A = LL^T$ where $L$ is lower triangular with positive diagonal entries. This is twice as fast as LU and guaranteed stable for SPD matrices.

import numpy as np
from scipy.linalg import cholesky, cho_factor, cho_solve

# Generate a symmetric positive definite matrix
n = 50
A_raw = np.random.randn(n, n)
A = A_raw.T @ A_raw + n * np.eye(n)   # Guaranteed SPD: X^T X + nI

# Cholesky factorization
L = cholesky(A, lower=True)           # A = L @ L.T
print(f"Is lower triangular: {np.allclose(L, np.tril(L))}")  # True

# Efficient solve using Cholesky
cho_fac = cho_factor(A, lower=True)   # Pre-factorize
b = np.random.randn(n)
x = cho_solve(cho_fac, b)

# Verify: Ax = b
print(f"Residual: {np.linalg.norm(A @ x - b):.2e}")  # ~machine precision

# Use case: Gaussian process covariance
def gp_log_likelihood(K: np.ndarray, y: np.ndarray) -> float:
    """
    Compute log marginal likelihood for a Gaussian process.
    Uses Cholesky for numerical stability and efficiency.

    log p(y | K) = -1/2 y^T K^{-1} y - 1/2 log|K| - n/2 log(2π)
    """
    n = len(y)
    eps = 1e-6  # Jitter for numerical stability
    K_stable = K + eps * np.eye(n)

    # Cholesky: K = L L^T
    L = cholesky(K_stable, lower=True)

    # Solve K^{-1} y = (L L^T)^{-1} y = L^{-T} (L^{-1} y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

    # log|K| = 2 * sum(log(diag(L))) - from det of triangular matrix
    log_det_K = 2.0 * np.sum(np.log(np.diag(L)))

    return -0.5 * (y @ alpha + log_det_K + n * np.log(2 * np.pi))

:::note Cholesky failure = matrix is not SPD If scipy.linalg.cholesky(A) throws a LinAlgError, the matrix $A$ is not positive definite. This often happens in practice when:

You compute a covariance matrix on too few data points (rank-deficient)
Accumulated floating-point errors make a theoretically SPD matrix numerically indefinite
Features are exactly linearly dependent

Fix: Add a small jitter/diagonal: A + ε·I with $\varepsilon = 10^{-6}$ to $10^{-4}$ . This is standard practice in Gaussian processes and is the right thing to do. :::

Part 5 - Why You Should Never Invert a Matrix

This is one of the most important practical rules in numerical linear algebra:

Never compute $A^{-1}$ explicitly. Always solve the linear system $Ax = b$ directly.

Why this rule exists

import numpy as np
import time

n = 1000

# Ill-conditioned matrix (for demonstration)
A = np.random.randn(n, n)
A = A + A.T + n * np.eye(n)  # Make it positive definite
b = np.random.randn(n)

# Method 1: WRONG - explicit inverse (slow + numerically worse)
start = time.time()
x_wrong = np.linalg.inv(A) @ b
time_wrong = time.time() - start

# Method 2: CORRECT - solve the linear system directly
start = time.time()
x_correct = np.linalg.solve(A, b)
time_correct = time.time() - start

# Both give the same answer, but:
print(f"Time (inv):   {time_wrong:.4f}s")
print(f"Time (solve): {time_correct:.4f}s")
# solve() is ~2-3x faster - inv() computes n solutions to verify, solve() computes 1

# More importantly: accuracy on ill-conditioned systems
A_bad = np.array([[1.0, 1.0], [1.0, 1.0001]])
b_bad = np.array([2.0, 2.0001])

x_inv = np.linalg.inv(A_bad) @ b_bad
x_solve = np.linalg.solve(A_bad, b_bad)

# x_inv accumulates more rounding error - solve() uses LU with pivoting
print(f"Residual (inv):   {np.linalg.norm(A_bad @ x_inv - b_bad):.2e}")
print(f"Residual (solve): {np.linalg.norm(A_bad @ x_solve - b_bad):.2e}")

The fundamental reason: Computing $A^{-1}$ requires solving $n$ linear systems (one per column of $I$ ). But if you only need $A^{-1}b$ , that is just one linear system. Wasting $n-1$ extra solves introduces more rounding error.

The only legitimate use of explicit inverse

When you need the actual matrix $A^{-1}$ (not just the solution to a particular system), such as:

Computing $A^{-1} B$ for matrix-valued right-hand sides
Analytical formulas that require the inverse

Even then, use np.linalg.solve(A, B) for multiple right-hand sides, not np.linalg.inv(A) @ B.

Part 6 - Numerical Stability of Backpropagation

The Jacobian product view

Backpropagation is the chain rule applied to a composition of functions:

$\frac{\partial \mathcal{L}}{\partial \theta_l} = \underbrace{\frac{\partial \mathcal{L}}{\partial y_L}}_{\text{loss gradient}} \cdot \underbrace{J_{f_L} \cdots J_{f_{l+1}}}_{\text{upstream Jacobians}} \cdot \underbrace{J_{f_l}^{(\theta)}}_{\text{layer gradient}}$

where $J_{f_l}$ is the Jacobian of layer $l$ with respect to its inputs.

The numerical stability of the entire computation depends on the product of these Jacobians.

Eigenvalue analysis of gradient flow

Consider a deep network with linear layers and the same weight matrix $W$ at each layer. The gradient at layer $l$ involves $W^{L-l}$ . Let $\lambda_1, \ldots, \lambda_n$ be eigenvalues of $W$ :

$\left\|\frac{\partial \mathcal{L}}{\partial h_l}\right\| \approx \left\|\frac{\partial \mathcal{L}}{\partial h_L}\right\| \cdot |\lambda_{\max}|^{L-l}$

If $|\lambda_{\max}| > 1$ : gradients explode exponentially with depth
If $|\lambda_{\max}| < 1$ : gradients vanish exponentially with depth
If $|\lambda_{\max}| \approx 1$ : stable gradient flow

import numpy as np

def simulate_gradient_flow(n_layers: int, weight_scale: float) -> float:
    """
    Simulate gradient magnitude through a deep linear network.

    weight_scale > 1: gradient explosion
    weight_scale < 1: gradient vanishing
    weight_scale ~ 1: stable
    """
    d = 32  # hidden dimension
    # Initialize weights with specific scale
    weights = [np.random.randn(d, d) * weight_scale / np.sqrt(d)
               for _ in range(n_layers)]

    # Forward pass: h_l = W_l h_{l-1}
    h = np.random.randn(d)
    h /= np.linalg.norm(h)

    activations = [h]
    for W in weights:
        h = W @ h
        activations.append(h)

    # Backward pass: g_l = W_{l+1}^T g_{l+1}
    g = np.random.randn(d)
    g /= np.linalg.norm(g)

    gradient_norms = [np.linalg.norm(g)]
    for W in reversed(weights):
        g = W.T @ g
        gradient_norms.append(np.linalg.norm(g))

    return gradient_norms[-1]  # gradient at first layer

# Gradient explosion
exploded = simulate_gradient_flow(n_layers=20, weight_scale=1.5)
print(f"weight_scale=1.5, 20 layers: gradient norm = {exploded:.2e}")  # huge

# Gradient vanishing
vanished = simulate_gradient_flow(n_layers=20, weight_scale=0.7)
print(f"weight_scale=0.7, 20 layers: gradient norm = {vanished:.2e}")  # ~0

# Stable (Xavier initialization)
stable = simulate_gradient_flow(n_layers=20, weight_scale=1.0)
print(f"weight_scale=1.0, 20 layers: gradient norm = {stable:.2e}")    # ~1

Xavier and He initialization: numerical stability through eigenvalue control

Xavier (Glorot) initialization targets $|\lambda| \approx 1$ by initializing weights as:

$W_{ij} \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)$

This keeps the variance of activations and gradients roughly constant across layers.

He initialization (for ReLU activations, which halve the variance):

$W_{ij} \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)$

import torch
import torch.nn as nn

# PyTorch's default initialization for Linear layers
# Uses Kaiming uniform (He initialization for ReLU)
layer = nn.Linear(512, 512)
print(f"Default weight std: {layer.weight.std():.4f}")
# ≈ 0.0442 = sqrt(2/512) - He initialization for ReLU

# Manual initialization
def stable_init(layer: nn.Linear, activation: str = 'relu') -> None:
    """Apply numerically stable initialization."""
    if activation == 'relu':
        nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
    elif activation == 'tanh':
        nn.init.xavier_normal_(layer.weight, gain=nn.init.calculate_gain('tanh'))
    elif activation == 'linear':
        nn.init.xavier_normal_(layer.weight)
    nn.init.zeros_(layer.bias)

Condition number monitoring in training

import torch
import numpy as np

def monitor_weight_conditioning(model: torch.nn.Module) -> dict:
    """
    Compute condition numbers of weight matrices.
    High condition numbers predict gradient instability.
    """
    conditions = {}
    for name, param in model.named_parameters():
        if param.dim() == 2:  # Only for weight matrices
            W = param.detach().cpu().numpy()
            try:
                kappa = np.linalg.cond(W)
                conditions[name] = kappa
                if kappa > 1e6:
                    print(f"WARNING: {name} has condition number {kappa:.2e}")
            except np.linalg.LinAlgError:
                conditions[name] = float('inf')
    return conditions

Part 7 - SVD: The Most Numerically Robust Decomposition

The Singular Value Decomposition (SVD) decomposes any matrix:

$A = U \Sigma V^T$

where $U, V$ are orthogonal and $\Sigma$ is diagonal with non-negative entries (singular values).

SVD is the most numerically robust decomposition. It is used when:

The matrix may be rank-deficient
You need the pseudoinverse
You need to compute the condition number

import numpy as np

A = np.random.randn(50, 10)   # Overdetermined

# Full SVD
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(f"U shape: {U.shape}")   # (50, 10) - left singular vectors
print(f"s shape: {s.shape}")   # (10,) - singular values
print(f"Vt shape: {Vt.shape}") # (10, 10) - right singular vectors (transposed)

# Condition number
kappa = s[0] / s[-1]
print(f"Condition number: {kappa:.2f}")  # sigma_max / sigma_min

# Moore-Penrose pseudoinverse via SVD (more stable than inv for rank-deficient)
def stable_pseudoinverse(A: np.ndarray, rcond: float = 1e-10) -> np.ndarray:
    """
    Compute pseudoinverse using SVD.
    Equivalent to np.linalg.pinv() but with explicit threshold.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Zero out singular values below threshold (treat as numerically zero)
    s_inv = np.where(s > rcond * s[0], 1.0 / s, 0.0)
    return Vt.T @ np.diag(s_inv) @ U.T

A_pinv = stable_pseudoinverse(A)
b = np.random.randn(50)
x_pinv = A_pinv @ b
print(f"Residual: {np.linalg.norm(A @ x_pinv - b):.6f}")   # minimum norm solution

Part 8 - Practical Solver Selection Guide

Given: solve Ax = b or related problem

┌─── Is A square? ──────────────────────────────────────────────────┐
│                                                                    │
│  YES → Is A symmetric positive definite?                          │
│         YES → Cholesky (fastest, most stable for SPD)             │
│         NO  → Is A well-conditioned and dense?                    │
│                YES → LU (general square solver)                   │
│                NO  → SVD-based pseudoinverse                      │
│                                                                    │
│  NO  → Overdetermined (m > n)?                                    │
│         YES → Is precision critical?                              │
│                YES → SVD via lstsq (most stable)                  │
│                NO  → QR (faster, nearly as stable)                │
│         NO  → Underdetermined (m < n)?                            │
│                → Minimum-norm via SVD pseudoinverse               │
└───────────────────────────────────────────────────────────────────┘

import numpy as np
from scipy.linalg import solve, cho_solve, cho_factor

def smart_solve(A: np.ndarray, b: np.ndarray,
                assume_spd: bool = False) -> np.ndarray:
    """
    Solve Ax = b using the most numerically appropriate method.
    """
    m, n = A.shape

    if m == n:
        if assume_spd:
            # Cholesky: fastest for symmetric positive definite
            try:
                c = cho_factor(A, lower=True)
                return cho_solve(c, b)
            except np.linalg.LinAlgError:
                print("Cholesky failed - matrix not SPD, falling back to LU")
                return solve(A, b)
        else:
            # LU with partial pivoting (scipy's solve)
            return solve(A, b)

    elif m > n:
        # Overdetermined: least squares via SVD (most stable)
        x, _, _, _ = np.linalg.lstsq(A, b, rcond=None)
        return x

    else:
        # Underdetermined: minimum-norm solution via SVD
        x, _, _, _ = np.linalg.lstsq(A, b, rcond=None)
        return x

Interview Questions

Q1: What is the condition number and why does it matter for machine learning?

The condition number $\kappa(A) = \sigma_{\max} / \sigma_{\min}$ (ratio of largest to smallest singular values) measures how sensitive the solution of $Ax = b$ is to perturbations in $b$ or $A$ .

Formal bound: $\frac{\|\delta x\|}{\|x\|} \leq \kappa(A) \cdot \frac{\|\delta b\|}{\|b\|}$

A relative perturbation of $\epsilon$ in $b$ can cause up to $\kappa(A) \cdot \epsilon$ relative error in $x$ .

ML relevance:

Correlated features: When features are highly correlated, the Gram matrix $X^T X$ has a very large condition number. The normal equations $X^T X \beta = X^T y$ become unreliable - tiny data noise causes huge swings in $\beta$ . Ridge regression adds $\lambda I$ to the Gram matrix, reducing the condition number to $(\sigma_{\max}^2 + \lambda)/(\sigma_{\min}^2 + \lambda)$ .
Gradient explosion/vanishing: The condition number of the product of layer Jacobians in a deep network determines whether gradients explode (condition number >> 1) or vanish (condition number << 1).
Covariance estimation: Covariance matrices estimated from fewer samples than dimensions are rank-deficient ( $\kappa = \infty$ ). Adding diagonal regularization (shrinkage) makes them invertible and well-conditioned.

Q2: Why should you never invert a matrix to solve a linear system?

Computing $x = A^{-1} b$ by:

Explicitly computing $A^{-1}$ (cost: $O(n^3)$ + solve $n$ linear systems)
Multiplying $A^{-1} b$ (cost: $O(n^2)$ )

is slower and less accurate than directly solving $Ax = b$ .

Speed: np.linalg.solve(A, b) uses LU factorization - $O(n^3)$ once, $O(n^2)$ per RHS. np.linalg.inv(A) @ b computes all $n$ columns of $A^{-1}$ , equivalent to solving $n$ linear systems. For a single $b$ , this is $n$ times more work.

Accuracy: Explicitly forming $A^{-1}$ accumulates rounding errors across all $n$ column solves. The final matrix-vector product adds more rounding error. Direct solve (numpy.linalg.solve) introduces rounding error only once along the single solution path.

The rule: Always use np.linalg.solve(A, b) or scipy.linalg.solve(A, b). The only exception is when you genuinely need the inverse matrix itself (e.g., computing $A^{-1} B$ for matrix $B$ ), in which case use np.linalg.solve(A, B) with matrix right-hand side.

Q3: When should you use QR decomposition over LU for solving linear systems?

Use LU when: The system is square ( $m = n$ ), well-conditioned, and you are solving $Ax = b$ .

Use QR when:

Overdetermined system ( $m > n$ , more equations than unknowns): QR solves the least squares problem $\min \|Ax - b\|^2$ stably, avoiding the condition-number squaring of the normal equations.
Numerical stability is critical: QR uses orthogonal transformations (which have condition number = 1), so it does not amplify existing ill-conditioning. The normal equations $A^T A x = A^T b$ square the condition number of $A$ .
Gram-Schmidt orthogonalization: When you need an orthogonal basis for the column space of $A$ .

Quantitative comparison:

If $\kappa(A) = 10^4$ , then $\kappa(A^T A) = 10^8$ - loses 4 decimal digits of precision in float32 (which has ~7 digits). QR keeps precision at $\kappa(A) = 10^4$ .
For features with correlation coefficient 0.9999, $\kappa(X^T X)$ can easily reach $10^8$ - QR is essential.

Q4: How does ridge regression stabilize ill-conditioned Gram matrices?

The OLS problem $\min \|X\beta - y\|^2$ has solution $\beta = (X^T X)^{-1} X^T y$ . When $X^T X$ is ill-conditioned (due to correlated features), $\beta$ is numerically unstable.

Ridge regression adds $\lambda I$ to the Gram matrix:

$\beta_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$

The eigenvalues of $X^T X$ are $\sigma_i^2$ (squared singular values of $X$ ). Adding $\lambda I$ shifts all eigenvalues to $\sigma_i^2 + \lambda$ :

$\kappa(X^T X + \lambda I) = \frac{\sigma_{\max}^2 + \lambda}{\sigma_{\min}^2 + \lambda}$

For a matrix where $\sigma_{\max}^2 = 100$ and $\sigma_{\min}^2 = 10^{-6}$ :

Without ridge: $\kappa = 10^8$
With $\lambda = 0.01$ : $\kappa = (100 + 0.01) / (10^{-6} + 0.01) \approx 10^4$ - much better

Trade-off: Larger $\lambda$ → better conditioning but more bias (shrinks $\beta$ toward zero). Cross-validation selects the optimal $\lambda$ .

Q5: Explain gradient vanishing and explosion from a numerical linear algebra perspective.

In a deep network with $L$ layers, backpropagation computes:

$g_l = J_{L}^T J_{L-1}^T \cdots J_{l+1}^T \cdot g_L$

where $J_l$ is the Jacobian of layer $l$ and $g_L$ is the gradient from the loss.

Spectral analysis: If each Jacobian $J_l$ has singular values near constant $\sigma$ :

$\|g_1\| \approx \|g_L\| \cdot \sigma^{L-1}$

$\sigma > 1$ : $\|g_1\|$ grows exponentially → gradient explosion
$\sigma < 1$ : $\|g_1\|$ shrinks exponentially → gradient vanishing
$\sigma = 1$ : stable gradient magnitude throughout network

Solutions targeting this condition:

Careful initialization (Xavier/He): Sets initial $\sigma \approx 1$ by scaling weights by $1/\sqrt{n_{\text{in}}}$ .
Batch/Layer normalization: Normalizes activations to have unit variance, which constrains the Jacobian's spectral radius to be near 1.
Residual connections (ResNets): $h_{l+1} = h_l + F(h_l)$ means the Jacobian is $I + J_F$ . The identity term ensures at least some singular values are near 1, providing a "gradient highway" that bypasses the vanishing problem.
Gradient clipping: Clips $\|\nabla\|$ to max_norm - treats the symptom (explosion) but not the cause.

Quick Reference

Matrix Type	Best Decomposition	When to Use
General square	LU (via `scipy.linalg.solve`)	Default for square systems
SPD (covariance, Gram+λI)	Cholesky	2× faster, guaranteed stable
Overdetermined (m > n)	QR or SVD-lstsq	Least squares problems
Rank-deficient	SVD (via `np.linalg.lstsq`)	When near-singularity is expected
Any (maximum stability)	SVD	When accuracy is paramount

Condition Number	Interpretation	Action
1–10	Well-conditioned	No action needed
10³–10⁶	Moderately ill-conditioned	Monitor, consider regularization
> 10⁸	Severely ill-conditioned	Add regularization λI
> 1/ε_machine	Numerically singular	SVD pseudoinverse required

Next: Lesson 03: Iterative Solvers →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Condition Number demo on the EngineersOfAI Playground - no code required.

:::

What You Will Learn​

Part 1 - The Condition Number​

Definition​

Interpreting the condition number​

Where ill-conditioned matrices appear in ML​

Part 2 - LU Factorization: The Workhorse Linear Solver​

What it is​

Part 3 - QR Factorization: The Numerically Safe Solver​

What it is​

Gram-Schmidt and numerical orthogonalization​

Part 4 - Cholesky Factorization: For Positive Definite Matrices​

When it applies​

Part 5 - Why You Should Never Invert a Matrix​

Why this rule exists​

The only legitimate use of explicit inverse​

Part 6 - Numerical Stability of Backpropagation​

The Jacobian product view​

Eigenvalue analysis of gradient flow​

Xavier and He initialization: numerical stability through eigenvalue control​

Condition number monitoring in training​

Part 7 - SVD: The Most Numerically Robust Decomposition​

Part 8 - Practical Solver Selection Guide​

Interview Questions​

Quick Reference​

What You Will Learn

Part 1 - The Condition Number

Definition

Interpreting the condition number

Where ill-conditioned matrices appear in ML

Part 2 - LU Factorization: The Workhorse Linear Solver

What it is

Part 3 - QR Factorization: The Numerically Safe Solver

What it is

Gram-Schmidt and numerical orthogonalization

Part 4 - Cholesky Factorization: For Positive Definite Matrices

When it applies

Part 5 - Why You Should Never Invert a Matrix

Why this rule exists

The only legitimate use of explicit inverse

Part 6 - Numerical Stability of Backpropagation

The Jacobian product view

Eigenvalue analysis of gradient flow

Xavier and He initialization: numerical stability through eigenvalue control

Condition number monitoring in training

Part 7 - SVD: The Most Numerically Robust Decomposition

Part 8 - Practical Solver Selection Guide

Interview Questions

Quick Reference