Numerical Differentiation - Finite Differences, Gradient Checking, and Autodiff

Reading time: ~20 minutes | Level: Numerical Methods → ML Engineering

You implement a custom PyTorch autograd function for a novel attention variant. The model trains, loss decreases. But the model performs worse than the baseline. Did you implement the backward pass correctly?

The answer: gradient checking with finite differences. This technique uses numerical differentiation to compare your analytical gradient against a numerical approximation. Any discrepancy signals a bug in your backward pass.

Numerical differentiation is also the historical foundation of machine learning optimization - before autodiff, we literally approximated gradients this way. Understanding it deeply helps you understand why autodiff was invented and what makes it superior.

What You Will Learn

Forward, backward, and central difference formulas with error analysis
How to select the optimal step size $h$
Truncation vs rounding error: the fundamental trade-off
Numerical Jacobians for multivariable functions
Gradient checking: implementing it correctly and interpreting results
Why automatic differentiation dominates and when numerical differentiation still has a role

Part 1 - Finite Difference Approximations

The forward difference formula

From Taylor's theorem, any smooth function satisfies:

$f(x + h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + O(h^3)$

Rearranging:

$f'(x) \approx \frac{f(x+h) - f(x)}{h}$

The error is $O(h)$ - first-order accurate. This is the forward difference formula.

The central difference formula

Using both forward and backward Taylor expansions:

$f(x + h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + \frac{h^3}{6} f'''(x) + O(h^4)$ $f(x - h) = f(x) - h f'(x) + \frac{h^2}{2} f''(x) - \frac{h^3}{6} f'''(x) + O(h^4)$

Subtracting and dividing by $2h$ :

$f'(x) \approx \frac{f(x+h) - f(x-h)}{2h} + O(h^2)$

The error is $O(h^2)$ - second-order accurate. For the same step size, central differences are much more accurate.

import numpy as np

def forward_difference(f, x: float, h: float = 1e-5) -> float:
    """First-order accurate finite difference: O(h) error."""
    return (f(x + h) - f(x)) / h

def central_difference(f, x: float, h: float = 1e-5) -> float:
    """Second-order accurate central difference: O(h²) error."""
    return (f(x + h) - f(x - h)) / (2 * h)

def five_point_stencil(f, x: float, h: float = 1e-4) -> float:
    """Fourth-order accurate: O(h⁴) error - higher accuracy, more function calls."""
    return (-f(x + 2*h) + 8*f(x + h) - 8*f(x - h) + f(x - 2*h)) / (12 * h)

# Test on f(x) = sin(x), f'(x) = cos(x)
f = np.sin
x = 1.0
true_deriv = np.cos(x)

h = 1e-4
fwd = forward_difference(f, x, h)
cen = central_difference(f, x, h)
fp5 = five_point_stencil(f, x, h)

print(f"True:     {true_deriv:.10f}")
print(f"Forward:  {fwd:.10f}  error: {abs(fwd - true_deriv):.2e}")
print(f"Central:  {cen:.10f}  error: {abs(cen - true_deriv):.2e}")
print(f"5-point:  {fp5:.10f}  error: {abs(fp5 - true_deriv):.2e}")
# Central has ~h times smaller error than forward for same h
# 5-point has ~h^4 error - near machine precision even for h=1e-4

Part 2 - Step Size Selection: The Critical Trade-off

Two competing error sources

The total error in finite differences has two components:

Truncation error: From dropping higher-order Taylor terms: $E_{\text{trunc}} = O(h^p) \quad \text{(p=1 for forward, p=2 for central)}$

Rounding error: From floating-point subtraction of nearly equal quantities: $E_{\text{round}} = \frac{2\epsilon_{\text{machine}} \cdot |f(x)|}{h}$

The total error is:

$E_{\text{total}}(h) = C_1 h^p + \frac{C_2 \epsilon_{\text{machine}}}{h}$

The optimal step size

Minimizing $E_{\text{total}}$ with respect to $h$ :

$h^* = \left(\frac{C_2 \epsilon_{\text{machine}}}{p C_1}\right)^{1/(p+1)}$

For float64 ( $\epsilon_{\text{machine}} \approx 10^{-16}$ ):

Forward difference (p=1): $h^* \approx \sqrt{\epsilon} \approx 10^{-8}$
Central difference (p=2): $h^* \approx \epsilon^{1/3} \approx 10^{-5}$ to $10^{-6}$

import numpy as np
import matplotlib.pyplot as plt

def analyze_step_size(f, f_prime_true, x: float):
    """
    Show how error varies with step size h.
    Reveals the optimal h where truncation and rounding errors balance.
    """
    h_values = np.logspace(-16, 0, 500)
    errors_fwd = []
    errors_cen = []

    true = f_prime_true(x)

    for h in h_values:
        fwd_est = (f(x + h) - f(x)) / h
        cen_est = (f(x + h) - f(x - h)) / (2 * h)
        errors_fwd.append(abs(fwd_est - true))
        errors_cen.append(abs(cen_est - true))

    # Find optimal h
    opt_h_fwd = h_values[np.argmin(errors_fwd)]
    opt_h_cen = h_values[np.argmin(errors_cen)]
    print(f"Optimal h (forward): {opt_h_fwd:.2e}")  # ~1e-8
    print(f"Optimal h (central): {opt_h_cen:.2e}")  # ~1e-5 to 1e-6

    return h_values, errors_fwd, errors_cen

# For f(x) = sin(x):
h_vals, e_fwd, e_cen = analyze_step_size(
    np.sin, np.cos, x=1.0
)
# Error curve: decreases as h decreases (truncation dominated)
# then increases as h decreases further (rounding dominated)
# Optimal h is the "sweet spot" minimum

Error vs step size h (log-log scale):

log₁₀(error)
     │
  0  ┤   Central:         Forward:
 -2  ┤    error ~ h²       error ~ h
 -4  ┤      ↘              ↘
 -6  ┤        ↘              ↘
 -8  ┤      ↗↘ opt h      ↗↘ opt h
-10  ┤    ↗                ↗
-12  ┤  ↗                ↗  (rounding error: 2ε/h)
     └────────────────────────────────
     -16  -12   -8   -4    0   log₁₀(h)

Central optimal h ≈ 1e-5, error ≈ 1e-10
Forward optimal h ≈ 1e-8, error ≈ 1e-8

:::tip Use central differences for gradient checking Always use central differences for gradient checking, with $h \approx 10^{-5}$ . Forward differences with $h \approx 10^{-8}$ appear to give similar total error but are more sensitive to the specific function being tested. Central differences are more robust. :::

Part 3 - Numerical Jacobians for Multivariable Functions

For a function $f: \mathbb{R}^n \to \mathbb{R}^m$ , the Jacobian $J \in \mathbb{R}^{m \times n}$ has entries:

$J_{ij} = \frac{\partial f_i}{\partial x_j}$

Numerical Jacobian via central differences:

$J_{ij} \approx \frac{f_i(x + h e_j) - f_i(x - h e_j)}{2h}$

where $e_j$ is the $j$ -th standard basis vector.

import numpy as np

def numerical_jacobian(
    f,
    x: np.ndarray,
    h: float = 1e-5
) -> np.ndarray:
    """
    Compute numerical Jacobian using central differences.

    f: R^n -> R^m function
    x: input point, shape (n,)
    Returns: Jacobian J, shape (m, n)
    """
    n = len(x)
    f0 = np.atleast_1d(f(x))
    m = len(f0)

    J = np.zeros((m, n))

    for j in range(n):
        # Perturb jth component
        x_plus = x.copy()
        x_minus = x.copy()
        x_plus[j] += h
        x_minus[j] -= h

        f_plus = np.atleast_1d(f(x_plus))
        f_minus = np.atleast_1d(f(x_minus))

        J[:, j] = (f_plus - f_minus) / (2 * h)

    return J

# Test: f(x, y) = [x^2 + y, x*y - y^2]
# Analytical Jacobian: [[2x, 1], [y, x - 2y]]
def f_test(x):
    return np.array([x[0]**2 + x[1], x[0]*x[1] - x[1]**2])

x0 = np.array([2.0, 3.0])

J_numerical = numerical_jacobian(f_test, x0)
J_analytical = np.array([[2*x0[0], 1],
                          [x0[1], x0[0] - 2*x0[1]]])

print("Numerical Jacobian:")
print(J_numerical)
print("\nAnalytical Jacobian:")
print(J_analytical)
print(f"\nMax error: {np.max(np.abs(J_numerical - J_analytical)):.2e}")
# ~1e-10 - excellent agreement

Part 4 - Gradient Checking: Validating Backpropagation

Gradient checking is the most important application of numerical differentiation in ML engineering. It verifies that your analytical backward pass computes the same gradients as numerical differentiation.

The relative error metric

def relative_error(a: np.ndarray, b: np.ndarray) -> float:
    """
    Compute the relative error between two gradient vectors.
    Standard metric for gradient checking.

    Interpretation:
    < 1e-7:  Excellent - almost certainly correct
    1e-7 to 1e-5: Probably OK (floating point noise)
    > 1e-3:  Bug in the gradient computation
    """
    numerator = np.linalg.norm(a - b)
    denominator = np.linalg.norm(a) + np.linalg.norm(b)
    if denominator < 1e-10:
        return 0.0  # Both near zero
    return numerator / denominator

Full gradient check implementation

import numpy as np

def gradient_check(
    forward_fn,
    backward_fn,
    params: np.ndarray,
    h: float = 1e-5
) -> dict:
    """
    Validate analytical gradients against numerical gradients.

    forward_fn: params -> scalar loss
    backward_fn: params -> gradient of loss w.r.t. params
    params: flat parameter vector

    Returns:
        dict with per-parameter relative errors
    """
    # Analytical gradients
    grad_analytical = backward_fn(params)

    # Numerical gradients via central differences
    grad_numerical = np.zeros_like(params)
    for i in range(len(params)):
        params_plus = params.copy()
        params_minus = params.copy()
        params_plus[i] += h
        params_minus[i] -= h

        loss_plus = forward_fn(params_plus)
        loss_minus = forward_fn(params_minus)

        grad_numerical[i] = (loss_plus - loss_minus) / (2 * h)

    # Relative error
    rel_error = relative_error(grad_analytical, grad_numerical)

    return {
        'relative_error': rel_error,
        'analytical': grad_analytical,
        'numerical': grad_numerical,
        'passed': rel_error < 1e-5
    }


# Example: gradient checking a custom sigmoid function
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_loss(params):
    """Scalar loss: sum of sigmoid(params)."""
    return np.sum(sigmoid(params))

def sigmoid_grad(params):
    """Analytical gradient: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(params)
    return s * (1 - s)

params = np.array([1.0, -0.5, 2.0, 0.0, -1.5])
result = gradient_check(sigmoid_loss, sigmoid_grad, params)
print(f"Relative error: {result['relative_error']:.2e}")
print(f"Gradient check passed: {result['passed']}")

PyTorch gradient check

import torch

def torch_gradient_check(
    module: torch.nn.Module,
    inputs: tuple,
    tol: float = 1e-4
) -> bool:
    """
    Gradient check for a PyTorch nn.Module using torch.autograd.gradcheck.
    """
    # Convert inputs to double precision (more accurate numerical gradients)
    inputs_double = tuple(
        x.double().requires_grad_(True) if x.is_floating_point() else x
        for x in inputs
    )
    module_double = module.double()

    try:
        result = torch.autograd.gradcheck(
            module_double,
            inputs_double,
            eps=1e-6,      # h for central differences
            atol=tol,      # absolute tolerance
            rtol=tol,      # relative tolerance
            raise_exception=True
        )
        print("Gradient check PASSED")
        return True
    except RuntimeError as e:
        print(f"Gradient check FAILED: {e}")
        return False

# Usage: check your custom autograd function
class CustomSoftmax(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        exp_x = torch.exp(x - x.max(dim=-1, keepdim=True).values)
        softmax = exp_x / exp_x.sum(dim=-1, keepdim=True)
        ctx.save_for_backward(softmax)
        return softmax

    @staticmethod
    def backward(ctx, grad_output):
        softmax, = ctx.saved_tensors
        # Jacobian-vector product: J_softmax @ grad_output
        dot = (grad_output * softmax).sum(dim=-1, keepdim=True)
        return softmax * (grad_output - dot)

# Check it
x = torch.randn(3, 5, dtype=torch.float64, requires_grad=True)
torch.autograd.gradcheck(CustomSoftmax.apply, x, eps=1e-6, atol=1e-4)

Part 5 - Automatic Differentiation vs Numerical Differentiation

The fundamental limitations of numerical differentiation

Property	Numerical Diff	Automatic Diff (Reverse Mode)
Accuracy	$O(h^2)$ at best, limited by floating point	Exact (machine precision)
Cost (gradient of scalar)	$O(n)$ function evaluations	$O(1)$ backward passes
Cost ( $n$ -input, $m$ -output Jacobian)	$O(n)$ or $O(m)$	$O(\min(n,m))$
Works for non-smooth functions	Yes (with care)	Only if derivative defined
Implementation complexity	Trivial	Complex (requires computation graph)

For computing gradients of $\mathcal{L}: \mathbb{R}^n \to \mathbb{R}$ (scalar loss), reverse-mode autodiff computes all $n$ partial derivatives in a single backward pass - no approximation, no step size selection, no trade-off between truncation and rounding error.

Numerical differentiation requires $n$ forward function evaluations (or $2n$ for central differences). For a 7B parameter model, that is 14 billion forward passes to compute one gradient - completely infeasible.

When numerical differentiation is still used

Gradient checking: Validate autodiff implementations against numerical approximation
External simulators: Physics engines, legacy code without autodiff support
Implicit differentiation: Differentiating through the solution of a system (e.g., via implicit function theorem)
Derivative-free optimization: When the function is non-differentiable or black-box

import numpy as np

def derivative_free_optimization(
    f,
    x0: np.ndarray,
    learning_rate: float = 0.01,
    h: float = 1e-5,
    n_iters: int = 100
) -> np.ndarray:
    """
    Simple gradient descent using numerical gradients.
    Only sensible for very low-dimensional problems.
    For high-dimensional ML: USE AUTODIFF.
    """
    x = x0.copy()

    for step in range(n_iters):
        # Compute numerical gradient (expensive: n function calls)
        grad = np.zeros_like(x)
        for i in range(len(x)):
            x_plus = x.copy(); x_plus[i] += h
            x_minus = x.copy(); x_minus[i] -= h
            grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)

        x = x - learning_rate * grad

        if step % 10 == 0:
            print(f"Step {step}: f(x) = {f(x):.6f}")

    return x

# This is fine for 2D optimization benchmarks, absurd for neural networks

Interview Questions

Q1: Why is central difference more accurate than forward difference, and what is the optimal step size?

Forward difference error analysis: $\frac{f(x+h) - f(x)}{h} = f'(x) + \frac{h}{2}f''(x) + O(h^2)$ Error = $O(h)$ - first order accurate.

Central difference error analysis: $\frac{f(x+h) - f(x-h)}{2h} = f'(x) + \frac{h^2}{6}f'''(x) + O(h^4)$ The $f''(x)$ terms cancel, giving $O(h^2)$ error - second-order accurate.

For the same $h$ , central differences are $O(h)$ times more accurate. Central differences require 2 function evaluations per derivative versus 1 for forward differences, but this is almost always worthwhile.

Optimal step size balances truncation error ( $\sim h^p$ ) against rounding error ( $\sim \epsilon_{\text{mach}}/h$ ):

$h^*_{\text{forward}} = \sqrt{\epsilon_{\text{mach}}} \approx 10^{-8} \quad (\text{float64})$ $h^*_{\text{central}} = \epsilon_{\text{mach}}^{1/3} \approx 10^{-5} \text{ to } 10^{-6} \quad (\text{float64})$

At these optimal step sizes, central difference achieves relative error $\approx 10^{-10}$ , while forward difference achieves $\approx 10^{-8}$ . Central difference is still more accurate even at its (slightly larger) optimal $h$ .

Q2: How do you implement a correct gradient check in PyTorch, and what relative error threshold indicates a bug?

Correct implementation:

Use float64 (double precision) for both forward passes - numerical differentiation's rounding error decreases dramatically at float64 vs float32.
Use central differences with $h = 10^{-5}$ to $10^{-6}$ .
Compare analytical gradient $g_a$ against numerical gradient $g_n$ using relative error: $\text{rel\_error} = \frac{\|g_a - g_n\|}{\|g_a\| + \|g_n\|}$

Thresholds:

< 1e-7: Correct with high confidence
1e-7 to 1e-5: Likely correct (may reflect float64 precision limits)
1e-5 to 1e-3: Suspicious - investigate
> 1e-3: Almost certainly a bug in the backward pass

Common gotchas:

Don't test at a point where the function is non-smooth (e.g., x=0 for ReLU, abs(x))
Include all parameters including biases
Test with both positive and negative inputs and at multiple random points
PyTorch's torch.autograd.gradcheck automates this correctly - prefer it over rolling your own

Typical bugs found by gradient checking:

Wrong sign (forgot a minus somewhere)
Missing factor of 2 or $n$
Wrong reshape/dimension order in Jacobian
Incorrect handling of batch dimension

Q3: Why is reverse-mode autodiff O(1) for gradients while numerical differentiation is O(n)?

Numerical differentiation approximates: $\frac{\partial \mathcal{L}}{\partial \theta_i} \approx \frac{\mathcal{L}(\theta + h e_i) - \mathcal{L}(\theta - h e_i)}{2h}$

Each partial derivative requires 2 forward function evaluations. For $n$ parameters: $2n$ evaluations total - $O(n)$ cost.

Reverse-mode autodiff (backpropagation):

Forward pass: computes and stores intermediate values - $O(n)$ time, $O(n)$ memory
Backward pass: computes all $n$ partial derivatives simultaneously using the chain rule - $O(n)$ time

The key insight: the backward pass computes all partial derivatives at once by propagating gradient "signals" backward through the computation graph. The total cost is proportional to the forward pass cost, not to the number of parameters.

Formal statement: For $f: \mathbb{R}^n \to \mathbb{R}^m$ :

Forward mode autodiff: $O(\text{cost of f})$ per input direction → efficient when $n \ll m$
Reverse mode autodiff: $O(\text{cost of f})$ per output direction → efficient when $m \ll n$
For neural network training: $m = 1$ (scalar loss), $n = \text{millions}$ → reverse mode is the right choice

Q4: When would you use numerical differentiation instead of automatic differentiation in an ML system?

1. Gradient checking (debugging): Always use numerical differentiation to validate custom autodiff implementations. PyTorch's gradcheck does this automatically.

2. External simulators: When part of your pipeline is a physics engine, a compiled simulation, or legacy code that cannot be differentiated analytically. You can wrap it with numerical gradients. Example: differentiating through a rigid body physics simulator for robotics RL.

3. Implicit differentiation: When your "function" is defined as the solution to a system of equations $F(y; \theta) = 0$ . The derivative $dy/d\theta$ can be computed implicitly via the implicit function theorem without differentiating through the solver iterations.

4. Non-differentiable components: Sorting, argmax, sampling - operations that autodiff cannot handle. Finite differences can still provide a useful signal in some cases (straight-through estimators are an alternative).

5. Derivative-free optimization: Some optimization algorithms (CMA-ES, Bayesian optimization) don't use gradients at all. Numerical gradients can approximate them for simple benchmarks but not for high-dimensional ML.

Never use numerical differentiation for: Training neural networks with millions of parameters - the $O(n)$ function evaluation cost is completely infeasible.

Q5: What is the truncation error vs rounding error trade-off in finite differences?

The total error in a central difference approximation has two components:

Truncation error (from Taylor series approximation): $E_{\text{trunc}} = \frac{h^2}{6} |f'''(\xi)| \quad \text{for some } \xi \in [x-h, x+h]$

This decreases as $h \to 0$ . Smaller $h$ → better truncation accuracy.

Rounding error (from floating-point subtraction of nearly equal numbers): $E_{\text{round}} \approx \frac{2\epsilon_{\text{mach}} |f(x)|}{h}$

$f(x+h) \approx f(x-h)$ for small $h$ , so their difference loses significant digits. Smaller $h$ → worse rounding accuracy.

Total error: $E(h) \approx C_1 h^2 + C_2 \epsilon_{\text{mach}} / h$

Minimizing: $\frac{dE}{dh} = 2C_1 h - C_2 \epsilon_{\text{mach}} / h^2 = 0$

$h^* = \left(\frac{C_2 \epsilon_{\text{mach}}}{2C_1}\right)^{1/3} \approx \epsilon_{\text{mach}}^{1/3} \approx 10^{-5} \text{ (float64)}$

At $h^*$ , the minimum error is $\approx \epsilon_{\text{mach}}^{2/3} \approx 10^{-10}$ - limited by floating-point precision, not step size choice.

Practical consequence: There is a finite minimum error for numerical differentiation, determined by floating-point precision. Automatic differentiation has no such limit - it achieves machine precision by construction.

Quick Reference

Method	Formula	Error Order	Optimal h (float64)	Cost
Forward difference	$[f(x+h) - f(x)]/h$	$O(h)$	$\sqrt{\epsilon} \approx 10^{-8}$	1 eval
Backward difference	$[f(x) - f(x-h)]/h$	$O(h)$	$\sqrt{\epsilon} \approx 10^{-8}$	1 eval
Central difference	$[f(x+h) - f(x-h)]/(2h)$	$O(h^2)$	$\epsilon^{1/3} \approx 10^{-5}$	2 evals
5-point stencil	(see formula above)	$O(h^4)$	$\epsilon^{1/5} \approx 10^{-3}$	4 evals

Relative Error	Interpretation
`< 1e-7`	Gradient is correct
`1e-7` to `1e-5`	Probably correct
`> 1e-3`	Bug in backward pass

Next: Lesson 05: Numerical Integration →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Derivatives & Gradients demo on the EngineersOfAI Playground - no code required.

:::

What You Will Learn​

Part 1 - Finite Difference Approximations​

The forward difference formula​

The central difference formula​

Part 2 - Step Size Selection: The Critical Trade-off​

Two competing error sources​

The optimal step size​

Part 3 - Numerical Jacobians for Multivariable Functions​

Part 4 - Gradient Checking: Validating Backpropagation​

The relative error metric​

Full gradient check implementation​

PyTorch gradient check​

Part 5 - Automatic Differentiation vs Numerical Differentiation​

The fundamental limitations of numerical differentiation​

When numerical differentiation is still used​

Interview Questions​

Quick Reference​