Module 08 - Numerical Methods for ML Engineering

The gap between mathematics on paper and mathematics on a computer is where most ML bugs live.

Your model's loss function, written in a textbook, looks perfect. But when you run it on a GPU with float16 tensors, the gradients overflow and turn into NaN. Your matrix equation has a unique theoretical solution - but when you solve it numerically, you get garbage because the matrix is ill-conditioned. Your attention weights are computed as softmax of dot products - but once those dot products grow large, the exponentials overflow.

Numerical methods is the discipline that bridges pure mathematics and real computation. For ML engineers, it is not optional - it is the difference between models that train reliably and models that silently produce wrong answers.

Why Numerical Precision Matters in ML

NaN gradients and exploding activations

During backpropagation, gradients flow through hundreds of operations. Any single numerical instability - a division by a very small number, an overflow in an exponential, a catastrophic cancellation in a subtraction - corrupts all downstream gradients. The result: NaN loss values, frozen training, wasted GPU hours.

Understanding floating-point arithmetic lets you read error signals like loss: nan or grad norm: inf and trace them to root causes: log arguments that reach zero, divisions by zero or near-zero, or accumulated rounding errors.
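As a minimal sketch of tracing a NaN to its source, the snippet below compares a naive softmax with the usual max-subtraction variant. The function names and the sample logits are illustrative assumptions, not part of the module's code.

```python
import numpy as np

def naive_softmax(z):
    # exp() overflows to inf for large inputs, and inf / inf yields nan
    e = np.exp(z)
    return e / e.sum()

def stable_softmax(z):
    # Subtracting the max leaves the result unchanged mathematically,
    # but keeps every exponent <= 0, so exp() cannot overflow
    shifted = z - np.max(z)
    e = np.exp(shifted)
    return e / e.sum()

logits = np.array([1000.0, 1001.0, 1002.0])  # hypothetical large logits

# Silence the overflow/invalid warnings so the nan result is easy to see
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(logits)
stable = stable_softmax(logits)

print("naive: ", naive)   # every entry is nan
print("stable:", stable)  # finite probabilities that sum to 1
```

The overflow happens in the forward pass, yet it only becomes visible later as loss: nan, which is why the root cause can be far from the symptom.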

Mixed precision training

Modern ML uses float16 or bfloat16 to raise throughput on tensor cores. But the largest finite float16 value is only 65504 - easily exceeded by intermediate activations. Understanding floating-point formats tells you why PyTorch's autocast context runs specific operations in float32, and why bfloat16 (same exponent range as float32, fewer mantissa bits) is often safer than float16.
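The float16 limits can be inspected directly. The sketch below uses NumPy only; NumPy has no built-in bfloat16 type, so it covers float16 and float32, and the sample values are arbitrary choices for illustration.

```python
import numpy as np

fp16 = np.finfo(np.float16)
fp32 = np.finfo(np.float32)
print("float16 max:", float(fp16.max))  # 65504.0
print("float16 eps:", float(fp16.eps))  # 0.0009765625, i.e. 2**-10
print("float32 max:", float(fp32.max))  # about 3.4e38

with np.errstate(over="ignore", under="ignore"):
    # An activation just above the float16 maximum becomes inf
    overflowed = np.float32(70000.0).astype(np.float16)
    # A small gradient below the smallest float16 subnormal becomes 0
    underflowed = np.float32(1e-8).astype(np.float16)

# Above 2048 the spacing between float16 values is 2, so adding 1 is lost
lost_increment = np.float16(2048) + np.float16(1)

print(overflowed, underflowed, lost_increment)
```

The underflow case is the one loss scaling is meant to address: multiplying the loss by a constant shifts small gradients back into the representable range.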

Ill-conditioned systems

When you fit a linear model to data with highly correlated features, the Gram matrix $X^TX$ becomes nearly singular - a tiny change in the data produces a huge change in the solution. Understanding condition numbers tells you when to add regularization, when to use QR instead of the normal equations, and when your covariance estimates are numerically meaningless.
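A small experiment makes this concrete. The sketch below builds two nearly collinear features (synthetic data, with an arbitrary noise level of 1e-6), compares the condition numbers of $X$ and $X^TX$, and solves the same least-squares problem via QR and via the normal equations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)  # almost a copy of x1
X = np.column_stack([x1, x2])

# In exact arithmetic cond(X^T X) = cond(X)^2, so forming the Gram
# matrix roughly doubles the number of digits at risk
cond_X = np.linalg.cond(X)
cond_gram = np.linalg.cond(X.T @ X)
print(f"cond(X)     = {cond_X:.2e}")
print(f"cond(X^T X) = {cond_gram:.2e}")

beta_true = np.array([1.0, 2.0])
y = X @ beta_true

# QR route: solve R beta = Q^T y without ever forming X^T X
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# Normal equations route: solve (X^T X) beta = X^T y
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

err_qr = np.linalg.norm(beta_qr - beta_true)
err_ne = np.linalg.norm(beta_ne - beta_true)
print(f"QR error:               {err_qr:.2e}")
print(f"normal equations error: {err_ne:.2e}")
```

On data like this the normal-equations error is typically several orders of magnitude larger than the QR error, although the exact figures depend on the random draw.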

Module Map

Module 08: Numerical Methods
│
├── 01 - Floating-Point Arithmetic
│        IEEE 754, machine epsilon, catastrophic cancellation,
│        float16/bfloat16, mixed precision training
│
├── 02 - Numerical Linear Algebra
│        Condition number, LU/QR/Cholesky decompositions,
│        when NOT to invert matrices, backprop stability
│
├── 03 - Iterative Solvers
│        Conjugate gradient, Krylov subspace methods,
│        preconditioning, large-scale ML applications
│
├── 04 - Numerical Differentiation
│        Finite differences, central difference formula,
│        step size selection, comparison with autodiff
│
├── 05 - Numerical Integration
│        Quadrature methods, Monte Carlo integration,
│        importance sampling, Bayesian inference
│
├── 06 - Root-Finding Algorithms
│        Bisection, Newton-Raphson, secant method,
│        ML connection: learning rate and loss optimization
│
└── 07 - Sparse Matrix Methods
         CSR/CSC formats, sparse operations,
         attention masks, graph adjacency, sparse embeddings

Key Concepts at a Glance

Concept	Why It Matters in ML
Machine epsilon	Bounds the relative rounding error of each operation - updates smaller than this fraction of a weight are rounded away
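Condition number	Measures how strongly a problem amplifies small input perturbations - flags when solves and gradients become unreliable
Catastrophic cancellation	Subtracting nearly equal floats destroys precision - avoided by algebraic reformulation (the log-sum-exp trick handles the related overflow problem)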
QR decomposition	More stable than solving normal equations - basis of least squares
Conjugate gradient	Solves large symmetric positive-definite systems (e.g. 1M×1M) using only matrix-vector products, without ever forming the matrix explicitly
Finite differences	Gradient checking: verify your backprop implementation is correct
Monte Carlo integration	Approximate intractable Bayesian integrals - foundation of variational inference
Sparse CSR format	A 99% sparse attention mask stores only its nonzeros plus index arrays, a few percent of the dense memory - enables long-context models
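To illustrate the catastrophic cancellation row, here is a minimal sketch using only the standard library. The function $(1 - \cos x)/x^2$ is a textbook example chosen for illustration; its true value approaches 0.5 as $x \to 0$.

```python
import math

x = 1e-8

# Naive form: cos(x) is so close to 1 that the subtraction cancels
# every significant digit before the division happens
naive = (1.0 - math.cos(x)) / x**2

# Reformulated with the identity 1 - cos(x) = 2 * sin(x/2)**2,
# which involves no subtraction of nearly equal numbers
stable = 2.0 * math.sin(x / 2.0) ** 2 / x**2

print("naive: ", naive)   # far from the true value of 0.5
print("stable:", stable)  # very close to 0.5
```

Both expressions are identical on paper. Only the second survives double-precision rounding at this input.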

Prerequisites

Module 01 (Linear Algebra): matrix operations, eigenvalues, SVD
Module 02 (Calculus and Optimization): gradients, Jacobians, chain rule
Python and NumPy: comfortable with array operations
Basic knowledge of neural network training (forward/backward pass)

What You Will Be Able to Do After This Module

Debug NaN gradients by identifying the numerical source: overflow, underflow, or cancellation
Choose the right matrix solver (LU vs. QR vs. Cholesky vs. iterative) for a given problem
Implement gradient checking using finite differences to validate custom autograd operations
Understand why transformers use scaled dot-product attention (the $1/\sqrt{d_k}$ scaling)
Work efficiently with sparse matrices in SciPy and understand when sparsity saves memory
Explain mixed precision training: what float16 can and cannot do, and where bfloat16 is safer
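As a preview of the attention-scaling outcome, the sketch below estimates how the spread of raw dot products grows with the key dimension. It assumes query and key entries are independent with zero mean and unit variance, and the values of d_k and the sample count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512            # hypothetical key dimension
num_pairs = 10_000   # number of random query/key pairs to sample

q = rng.standard_normal((num_pairs, d_k))
k = rng.standard_normal((num_pairs, d_k))

# Each score is a sum of d_k independent products, so its variance is
# about d_k and its standard deviation about sqrt(d_k)
raw_scores = np.sum(q * k, axis=1)
scaled_scores = raw_scores / np.sqrt(d_k)

raw_std = raw_scores.std()
scaled_std = scaled_scores.std()
print(f"std of raw scores:    {raw_std:.2f}")    # near sqrt(512)
print(f"std of scaled scores: {scaled_std:.2f}") # near 1
```

Without the scaling, scores with a spread this large push softmax toward saturation, which is the numerical motivation for dividing by $\sqrt{d_k}$.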

Connections to Other Modules

The ML Engineer's Numerical Methods Checklist

Before training a model, ask:

What floating-point format am I using? Will activations overflow?
Are any denominators or log arguments potentially zero? (log, softmax, normalization layers)
Is my Gram matrix or covariance matrix well-conditioned?
Am I computing a matrix inverse where I should use a solver?
Are my sparse data structures as memory-efficient as they should be?
Have I gradient-checked my custom backward pass?

These questions - and the mathematics needed to answer them - are what this module covers.
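The last checklist item can be sketched in a few lines. The example below checks a hand-written gradient against central finite differences; the toy loss, the step size h=1e-5, and the function names are assumptions made for illustration, and a real check would target your own backward pass.

```python
import numpy as np

def loss(w):
    # Toy scalar loss standing in for a real model
    return np.sum(np.tanh(w) ** 2)

def analytic_grad(w):
    # Hand-derived gradient: d/dw tanh(w)^2 = 2 tanh(w) (1 - tanh(w)^2)
    t = np.tanh(w)
    return 2.0 * t * (1.0 - t**2)

def numerical_grad(f, w, h=1e-5):
    # Central difference: (f(w + h e_i) - f(w - h e_i)) / (2h) per coordinate
    g = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = h
        g[i] = (f(w + step) - f(w - step)) / (2.0 * h)
    return g

w = np.array([0.3, -1.2, 2.0])
num = numerical_grad(loss, w)
ana = analytic_grad(w)

# Relative error is scale-independent; very small values indicate agreement
rel_err = np.linalg.norm(num - ana) / (np.linalg.norm(num) + np.linalg.norm(ana))
print(f"relative error: {rel_err:.2e}")
```

Run the check in float64: in lower precision the rounding error of the finite difference itself can dominate and mask a correct gradient.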

Start with Lesson 01: Floating-Point Arithmetic →
