Module 08 - Numerical Methods for ML Engineering
The gap between mathematics on paper and mathematics on a computer is where most ML bugs live.
Your model's loss function, written in a textbook, looks perfect. But when you run it on a GPU with float16 tensors, the gradients explode to NaN. Your matrix equation has a unique theoretical solution - but when you solve it numerically, you get garbage because the matrix is ill-conditioned. Your attention weights are computed as softmax of dot products - but when those dot products grow large, the exponentials overflow.
Numerical methods is the discipline that bridges pure mathematics and real computation. For ML engineers, it is not optional - it is the difference between models that train reliably and models that silently produce wrong answers.
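As a small illustration of the softmax overflow mentioned above, the sketch below compares a direct translation of the textbook formula with the common max-subtraction variant. The function names and the example logits are made up for this illustration, and it runs in float64 with NumPy rather than on a GPU.

```python
import numpy as np

def softmax_naive(x):
    # Direct translation of the textbook formula; exp overflows for large x.
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    # Subtracting the max does not change the mathematical result,
    # but keeps every exponent <= 0, so exp cannot overflow.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits, chosen large enough to overflow float64 exp.
logits = np.array([1000.0, 999.0, 998.0])

# Silence the overflow warnings so the NaN result can be inspected.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(logits)    # inf / inf gives nan in every entry

stable = softmax_stable(logits)      # finite probabilities that sum to 1

print("naive :", naive)
print("stable:", stable)
```

Both functions compute the same quantity on paper; only the second survives large inputs, which is the kind of gap this module is about.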
Why Numerical Precision Matters in ML
NaN gradients and exploding activations
During backpropagation, gradients flow through hundreds of operations. Any single numerical instability - a division by a very small number, an overflow in an exponential, a catastrophic cancellation in a subtraction - corrupts all downstream gradients. The result: NaN loss values, frozen training, wasted GPU hours.
Understanding floating-point arithmetic lets you read error signals like loss: nan or grad norm: inf and trace them to root causes: a log of zero or of a negative number, a division by zero, or accumulated rounding errors.
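Two of the failure modes named above, rounding error and catastrophic cancellation, can be reproduced in a few lines. This is a minimal sketch with NumPy; the variable names and the choice of `(1 - cos x) / x^2` as the test expression are illustrative assumptions, not taken from the lessons.

```python
import numpy as np

# Machine epsilon: the gap between 1.0 and the next representable float64.
eps = np.finfo(np.float64).eps
one_plus_eps = 1.0 + eps             # still distinguishable from 1.0
one_plus_half_eps = 1.0 + eps / 2    # rounds back to exactly 1.0

# The same effect in float32: an update far below eps * |w| is rounded away.
w = np.float32(1.0)
update = np.float32(1e-8)
w_new = w + update                   # equal to w, the update is lost

# Catastrophic cancellation: (1 - cos x) / x^2 should be close to 0.5
# for small x, but 1 - cos(x) subtracts two nearly equal numbers.
x = 1e-8
naive = (1.0 - np.cos(x)) / x**2

# Algebraically identical form, using 1 - cos(x) = 2 sin^2(x / 2),
# which involves no subtraction of nearly equal values.
stable = 2.0 * np.sin(x / 2.0) ** 2 / x**2

print("update lost:", w_new == w)
print("naive :", naive)
print("stable:", stable)
```

The fix in the last step is a reformulation of the expression, not extra precision; that pattern recurs throughout the floating-point lesson.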
Mixed precision training
Modern ML training often uses float16 or bfloat16 to increase throughput on tensor cores. But the largest finite float16 value is only 65504, which intermediate activations easily exceed. Understanding floating-point formats tells you why PyTorch's autocast context switches specific operations back to float32, and why bfloat16 (roughly the same range as float32, but less precision) is often safer than float16.
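The float16 limits can be inspected directly with NumPy. The sketch below is illustrative only: the sample values are made up, and NumPy has no built-in bfloat16 type, so float32 is shown as a stand-in for the range that bfloat16 approximately shares.

```python
import numpy as np

f16 = np.finfo(np.float16)
f32 = np.finfo(np.float32)

f16_max = float(f16.max)    # 65504.0, the largest finite float16
f32_max = float(f32.max)    # about 3.4e38

with np.errstate(over="ignore", under="ignore"):
    # Overflow: the product exceeds the float16 maximum and becomes inf.
    overflowed = np.float16(60000.0) * np.float16(2.0)

    # Underflow: a small gradient is below the smallest float16 subnormal
    # and is flushed to zero when cast.
    tiny_grad = np.float16(1e-8)

# Limited precision: float16 has 11 significant bits, so above 2048
# consecutive integers are no longer representable.
absorbed = np.float16(2048.0) + np.float16(1.0)

print("float16 max:", f16_max, " float32 max:", f32_max)
print("overflowed :", overflowed)
print("tiny_grad  :", tiny_grad)
print("2048 + 1   :", absorbed)
```

Overflow, underflow, and lost precision are three separate problems; bfloat16 trades away precision to relieve the first two.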
Ill-conditioned systems
When you fit a linear model to data with highly correlated features, the Gram matrix becomes nearly singular - a tiny change in the data produces a huge change in the solution. Understanding condition numbers tells you when to add regularization, when to use QR instead of the normal equations, and when your covariance estimates are numerically meaningless.
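A rough sketch of this effect, assuming a synthetic two-feature dataset in which the second feature is almost a copy of the first. The data, the noise scale, and the ridge strength `lam` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)    # nearly identical to x1
X = np.column_stack([x1, x2])
w_true = np.array([1.0, 2.0])
y = X @ w_true                         # noise-free targets

# Forming the Gram matrix roughly squares the condition number.
cond_X = np.linalg.cond(X)
gram = X.T @ X
cond_gram = np.linalg.cond(gram)

# Normal equations: solve (X^T X) w = X^T y.
w_normal = np.linalg.solve(gram, X.T @ y)

# QR: factor X = Q R, then solve R w = Q^T y without forming X^T X.
Q, R = np.linalg.qr(X)
w_qr = np.linalg.solve(R, Q.T @ y)

# Ridge regularization shifts the small eigenvalue away from zero.
lam = 1e-3
cond_ridge = np.linalg.cond(gram + lam * np.eye(2))

print(f"cond(X) = {cond_X:.2e}, cond(X^T X) = {cond_gram:.2e}")
print(f"cond(X^T X + lam I) = {cond_ridge:.2e}")
print("error, normal equations:", np.linalg.norm(w_normal - w_true))
print("error, QR              :", np.linalg.norm(w_qr - w_true))
```

On data like this the QR route typically recovers the weights to many more digits than the normal equations, because it works with the condition number of X rather than its square.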
Module Map
Module 08: Numerical Methods
│
├── 01 - Floating-Point Arithmetic
│ IEEE 754, machine epsilon, catastrophic cancellation,
│ float16/bfloat16, mixed precision training
│
├── 02 - Numerical Linear Algebra
│ Condition number, LU/QR/Cholesky decompositions,
│ when NOT to invert matrices, backprop stability
│
├── 03 - Iterative Solvers
│ Conjugate gradient, Krylov subspace methods,
│ preconditioning, large-scale ML applications
│
├── 04 - Numerical Differentiation
│ Finite differences, central difference formula,
│ step size selection, comparison with autodiff
│
├── 05 - Numerical Integration
│ Quadrature methods, Monte Carlo integration,
│ importance sampling, Bayesian inference
│
├── 06 - Root-Finding Algorithms
│ Bisection, Newton-Raphson, secant method,
│ ML connection: learning rate and loss optimization
│
└── 07 - Sparse Matrix Methods
CSR/CSC formats, sparse operations,
attention masks, graph adjacency, sparse embeddings
Key Concepts at a Glance
| Concept | Why It Matters in ML |
|---|---|
| Machine epsilon | Upper bound on the relative rounding error of a single operation - weight updates much smaller than epsilon × weight are rounded away |
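| Condition number | Measures how much a system amplifies input and rounding errors - tells you when a computed solution can no longer be trusted |
| Catastrophic cancellation | Subtracting nearly equal floats destroys precision - avoided by reformulating the expression; the log-sum-exp trick handles the related overflow problem |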
| QR decomposition | More stable than solving normal equations - basis of least squares |
| Conjugate gradient | Solve 1M×1M systems without ever forming the matrix explicitly |
| Finite differences | Gradient checking: verify your backprop implementation is correct |
| Monte Carlo integration | Approximate intractable Bayesian integrals - foundation of variational inference |
| Sparse CSR format | 99% sparse attention mask stored in a few percent of the dense memory (values plus indices) - enables long-context models |
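To make the CSR row concrete, here is a hand-rolled sketch of the format in plain NumPy. The helper `dense_to_csr` and the banded local-attention mask are hypothetical examples written for this overview; in practice a library type such as SciPy's `csr_matrix` would be used instead.

```python
import numpy as np

def dense_to_csr(a):
    """Hand-rolled CSR conversion, for illustration only."""
    rows, cols = np.nonzero(a)            # nonzeros in row-major order
    data = a[rows, cols]                  # the stored values
    indices = cols.astype(np.int32)       # column index of each value
    counts = np.bincount(rows, minlength=a.shape[0])
    # indptr[r]:indptr[r + 1] is the slice of data/indices for row r.
    indptr = np.concatenate(([0], np.cumsum(counts))).astype(np.int32)
    return data, indices, indptr

n, window = 1024, 4
i = np.arange(n)
# Hypothetical mask: position i attends to positions j with |i - j| <= window.
mask = (np.abs(i[:, None] - i[None, :]) <= window).astype(np.float32)

data, indices, indptr = dense_to_csr(mask)

dense_bytes = mask.nbytes
csr_bytes = data.nbytes + indices.nbytes + indptr.nbytes
density = data.size / mask.size
ratio = csr_bytes / dense_bytes

# Columns attended to by row 10, read straight from the CSR arrays.
row10_cols = indices[indptr[10]:indptr[11]]

print(f"density = {density:.4f}, memory ratio = {ratio:.4f}")
print("row 10 columns:", row10_cols)
```

The memory ratio comes out somewhat above the density because CSR stores a column index alongside every value, plus one row pointer per row.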
Prerequisites
- Module 01 (Linear Algebra): matrix operations, eigenvalues, SVD
- Module 02 (Calculus and Optimization): gradients, Jacobians, chain rule
- Python and NumPy: comfortable with array operations
- Basic knowledge of neural network training (forward/backward pass)
What You Will Be Able to Do After This Module
- Debug NaN gradients by identifying the numerical source: overflow, underflow, or cancellation
- Choose the right matrix solver (LU vs. QR vs. Cholesky vs. iterative) for a given problem
- Implement gradient checking using finite differences to validate custom autograd operations
- Understand why transformers use scaled dot-product attention ($1/\sqrt{d_k}$ scaling)
- Work efficiently with sparse matrices in SciPy and understand when sparsity saves memory
- Explain mixed precision training: what float16 can and cannot do, and where bfloat16 is safer
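As a preview of the gradient-checking skill listed above, the following sketch compares a hand-written gradient with a central-difference estimate. The loss `f`, its gradient `grad_f`, the step size `h`, and the test point are all hypothetical choices for illustration.

```python
import numpy as np

def f(w):
    # Hypothetical scalar loss used only for this check.
    return np.sum(np.tanh(w) ** 2)

def grad_f(w):
    # Hand-derived gradient: d/dw tanh(w)^2 = 2 tanh(w) (1 - tanh(w)^2).
    t = np.tanh(w)
    return 2.0 * t * (1.0 - t ** 2)

def numerical_grad(func, w, h=1e-5):
    # Central difference: (f(w + h e_i) - f(w - h e_i)) / (2h) per coordinate.
    g = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = h
        g[i] = (func(w + step) - func(w - step)) / (2.0 * h)
    return g

w = np.array([0.3, -1.2, 2.0])
analytic = grad_f(w)
numeric = numerical_grad(f, w)

# Relative error; values far above roughly 1e-6 usually indicate a bug.
rel_err = np.linalg.norm(analytic - numeric) / (
    np.linalg.norm(analytic) + np.linalg.norm(numeric)
)
print("relative error:", rel_err)
```

The step size matters: too large and the truncation error dominates, too small and cancellation in the numerator does. The 1e-6 threshold is a rule of thumb for float64, not a guarantee.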
Connections to Other Modules
The ML Engineer's Numerical Methods Checklist
Before training a model, ask:
- What floating-point format am I using? Will activations overflow?
- Are any denominators or log arguments potentially zero? (log, softmax, normalization layers)
- Is my Gram matrix or covariance matrix well-conditioned?
- Am I using matrix inverse where I should use a solver?
- Are my sparse data structures as memory-efficient as they should be?
- Have I gradient-checked my custom backward pass?
These questions - and the mathematics needed to answer them - are what this module covers.
Start with Lesson 01: Floating-Point Arithmetic →
