Numerical Differentiation - Finite Differences, Gradient Checking, and Autodiff
Reading time: ~20 minutes | Level: Numerical Methods → ML Engineering
You implement a custom PyTorch autograd function for a novel attention variant. The model trains, loss decreases. But the model performs worse than the baseline. Did you implement the backward pass correctly?
The answer: gradient checking with finite differences. This technique uses numerical differentiation to compare your analytical gradient against a numerical approximation. Any discrepancy signals a bug in your backward pass.
Numerical differentiation is also the historical foundation of machine learning optimization - before autodiff, we literally approximated gradients this way. Understanding it deeply helps you understand why autodiff was invented and what makes it superior.
What You Will Learn
- Forward, backward, and central difference formulas with error analysis
- How to select the optimal step size
- Truncation vs rounding error: the fundamental trade-off
- Numerical Jacobians for multivariable functions
- Gradient checking: implementing it correctly and interpreting results
- Why automatic differentiation dominates and when numerical differentiation still has a role
Part 1 - Finite Difference Approximations
The forward difference formula
From Taylor's theorem, any smooth function satisfies:
Rearranging:
The error is - first-order accurate. This is the forward difference formula.
The central difference formula
Using both forward and backward Taylor expansions:
Subtracting and dividing by :
The error is - second-order accurate. For the same step size, central differences are much more accurate.
import numpy as np
def forward_difference(f, x: float, h: float = 1e-5) -> float:
"""First-order accurate finite difference: O(h) error."""
return (f(x + h) - f(x)) / h
def central_difference(f, x: float, h: float = 1e-5) -> float:
"""Second-order accurate central difference: O(h²) error."""
return (f(x + h) - f(x - h)) / (2 * h)
def five_point_stencil(f, x: float, h: float = 1e-4) -> float:
"""Fourth-order accurate: O(h⁴) error - higher accuracy, more function calls."""
return (-f(x + 2*h) + 8*f(x + h) - 8*f(x - h) + f(x - 2*h)) / (12 * h)
# Test on f(x) = sin(x), f'(x) = cos(x)
f = np.sin
x = 1.0
true_deriv = np.cos(x)
h = 1e-4
fwd = forward_difference(f, x, h)
cen = central_difference(f, x, h)
fp5 = five_point_stencil(f, x, h)
print(f"True: {true_deriv:.10f}")
print(f"Forward: {fwd:.10f} error: {abs(fwd - true_deriv):.2e}")
print(f"Central: {cen:.10f} error: {abs(cen - true_deriv):.2e}")
print(f"5-point: {fp5:.10f} error: {abs(fp5 - true_deriv):.2e}")
# Central has ~h times smaller error than forward for same h
# 5-point has ~h^4 error - near machine precision even for h=1e-4
Part 2 - Step Size Selection: The Critical Trade-off
Two competing error sources
The total error in finite differences has two components:
Truncation error: From dropping higher-order Taylor terms:
Rounding error: From floating-point subtraction of nearly equal quantities:
The total error is:
The optimal step size
Minimizing with respect to :
For float64 ():
- Forward difference (p=1):
- Central difference (p=2): to
import numpy as np
import matplotlib.pyplot as plt
def analyze_step_size(f, f_prime_true, x: float):
"""
Show how error varies with step size h.
Reveals the optimal h where truncation and rounding errors balance.
"""
h_values = np.logspace(-16, 0, 500)
errors_fwd = []
errors_cen = []
true = f_prime_true(x)
for h in h_values:
fwd_est = (f(x + h) - f(x)) / h
cen_est = (f(x + h) - f(x - h)) / (2 * h)
errors_fwd.append(abs(fwd_est - true))
errors_cen.append(abs(cen_est - true))
# Find optimal h
opt_h_fwd = h_values[np.argmin(errors_fwd)]
opt_h_cen = h_values[np.argmin(errors_cen)]
print(f"Optimal h (forward): {opt_h_fwd:.2e}") # ~1e-8
print(f"Optimal h (central): {opt_h_cen:.2e}") # ~1e-5 to 1e-6
return h_values, errors_fwd, errors_cen
# For f(x) = sin(x):
h_vals, e_fwd, e_cen = analyze_step_size(
np.sin, np.cos, x=1.0
)
# Error curve: decreases as h decreases (truncation dominated)
# then increases as h decreases further (rounding dominated)
# Optimal h is the "sweet spot" minimum
Error vs step size h (log-log scale):
log₁₀(error)
│
0 ┤ Central: Forward:
-2 ┤ error ~ h² error ~ h
-4 ┤ ↘ ↘
-6 ┤ ↘ ↘
-8 ┤ ↗↘ opt h ↗↘ opt h
-10 ┤ ↗ ↗
-12 ┤ ↗ ↗ (rounding error: 2ε/h)
└────────────────────────────────
-16 -12 -8 -4 0 log₁₀(h)
Central optimal h ≈ 1e-5, error ≈ 1e-10
Forward optimal h ≈ 1e-8, error ≈ 1e-8
:::tip Use central differences for gradient checking Always use central differences for gradient checking, with . Forward differences with appear to give similar total error but are more sensitive to the specific function being tested. Central differences are more robust. :::
Part 3 - Numerical Jacobians for Multivariable Functions
For a function , the Jacobian has entries:
Numerical Jacobian via central differences:
where is the -th standard basis vector.
import numpy as np
def numerical_jacobian(
f,
x: np.ndarray,
h: float = 1e-5
) -> np.ndarray:
"""
Compute numerical Jacobian using central differences.
f: R^n -> R^m function
x: input point, shape (n,)
Returns: Jacobian J, shape (m, n)
"""
n = len(x)
f0 = np.atleast_1d(f(x))
m = len(f0)
J = np.zeros((m, n))
for j in range(n):
# Perturb jth component
x_plus = x.copy()
x_minus = x.copy()
x_plus[j] += h
x_minus[j] -= h
f_plus = np.atleast_1d(f(x_plus))
f_minus = np.atleast_1d(f(x_minus))
J[:, j] = (f_plus - f_minus) / (2 * h)
return J
# Test: f(x, y) = [x^2 + y, x*y - y^2]
# Analytical Jacobian: [[2x, 1], [y, x - 2y]]
def f_test(x):
return np.array([x[0]**2 + x[1], x[0]*x[1] - x[1]**2])
x0 = np.array([2.0, 3.0])
J_numerical = numerical_jacobian(f_test, x0)
J_analytical = np.array([[2*x0[0], 1],
[x0[1], x0[0] - 2*x0[1]]])
print("Numerical Jacobian:")
print(J_numerical)
print("\nAnalytical Jacobian:")
print(J_analytical)
print(f"\nMax error: {np.max(np.abs(J_numerical - J_analytical)):.2e}")
# ~1e-10 - excellent agreement
Part 4 - Gradient Checking: Validating Backpropagation
Gradient checking is the most important application of numerical differentiation in ML engineering. It verifies that your analytical backward pass computes the same gradients as numerical differentiation.
The relative error metric
def relative_error(a: np.ndarray, b: np.ndarray) -> float:
"""
Compute the relative error between two gradient vectors.
Standard metric for gradient checking.
Interpretation:
< 1e-7: Excellent - almost certainly correct
1e-7 to 1e-5: Probably OK (floating point noise)
> 1e-3: Bug in the gradient computation
"""
numerator = np.linalg.norm(a - b)
denominator = np.linalg.norm(a) + np.linalg.norm(b)
if denominator < 1e-10:
return 0.0 # Both near zero
return numerator / denominator
Full gradient check implementation
import numpy as np
def gradient_check(
forward_fn,
backward_fn,
params: np.ndarray,
h: float = 1e-5
) -> dict:
"""
Validate analytical gradients against numerical gradients.
forward_fn: params -> scalar loss
backward_fn: params -> gradient of loss w.r.t. params
params: flat parameter vector
Returns:
dict with per-parameter relative errors
"""
# Analytical gradients
grad_analytical = backward_fn(params)
# Numerical gradients via central differences
grad_numerical = np.zeros_like(params)
for i in range(len(params)):
params_plus = params.copy()
params_minus = params.copy()
params_plus[i] += h
params_minus[i] -= h
loss_plus = forward_fn(params_plus)
loss_minus = forward_fn(params_minus)
grad_numerical[i] = (loss_plus - loss_minus) / (2 * h)
# Relative error
rel_error = relative_error(grad_analytical, grad_numerical)
return {
'relative_error': rel_error,
'analytical': grad_analytical,
'numerical': grad_numerical,
'passed': rel_error < 1e-5
}
# Example: gradient checking a custom sigmoid function
def sigmoid(x):
return 1.0 / (1.0 + np.exp(-x))
def sigmoid_loss(params):
"""Scalar loss: sum of sigmoid(params)."""
return np.sum(sigmoid(params))
def sigmoid_grad(params):
"""Analytical gradient: sigmoid(x) * (1 - sigmoid(x))."""
s = sigmoid(params)
return s * (1 - s)
params = np.array([1.0, -0.5, 2.0, 0.0, -1.5])
result = gradient_check(sigmoid_loss, sigmoid_grad, params)
print(f"Relative error: {result['relative_error']:.2e}")
print(f"Gradient check passed: {result['passed']}")
PyTorch gradient check
import torch
def torch_gradient_check(
module: torch.nn.Module,
inputs: tuple,
tol: float = 1e-4
) -> bool:
"""
Gradient check for a PyTorch nn.Module using torch.autograd.gradcheck.
"""
# Convert inputs to double precision (more accurate numerical gradients)
inputs_double = tuple(
x.double().requires_grad_(True) if x.is_floating_point() else x
for x in inputs
)
module_double = module.double()
try:
result = torch.autograd.gradcheck(
module_double,
inputs_double,
eps=1e-6, # h for central differences
atol=tol, # absolute tolerance
rtol=tol, # relative tolerance
raise_exception=True
)
print("Gradient check PASSED")
return True
except RuntimeError as e:
print(f"Gradient check FAILED: {e}")
return False
# Usage: check your custom autograd function
class CustomSoftmax(torch.autograd.Function):
@staticmethod
def forward(ctx, x):
exp_x = torch.exp(x - x.max(dim=-1, keepdim=True).values)
softmax = exp_x / exp_x.sum(dim=-1, keepdim=True)
ctx.save_for_backward(softmax)
return softmax
@staticmethod
def backward(ctx, grad_output):
softmax, = ctx.saved_tensors
# Jacobian-vector product: J_softmax @ grad_output
dot = (grad_output * softmax).sum(dim=-1, keepdim=True)
return softmax * (grad_output - dot)
# Check it
x = torch.randn(3, 5, dtype=torch.float64, requires_grad=True)
torch.autograd.gradcheck(CustomSoftmax.apply, x, eps=1e-6, atol=1e-4)
Part 5 - Automatic Differentiation vs Numerical Differentiation
The fundamental limitations of numerical differentiation
| Property | Numerical Diff | Automatic Diff (Reverse Mode) |
|---|---|---|
| Accuracy | at best, limited by floating point | Exact (machine precision) |
| Cost (gradient of scalar) | function evaluations | backward passes |
| Cost (-input, -output Jacobian) | or | |
| Works for non-smooth functions | Yes (with care) | Only if derivative defined |
| Implementation complexity | Trivial | Complex (requires computation graph) |
For computing gradients of (scalar loss), reverse-mode autodiff computes all partial derivatives in a single backward pass - no approximation, no step size selection, no trade-off between truncation and rounding error.
Numerical differentiation requires forward function evaluations (or for central differences). For a 7B parameter model, that is 14 billion forward passes to compute one gradient - completely infeasible.
When numerical differentiation is still used
- Gradient checking: Validate autodiff implementations against numerical approximation
- External simulators: Physics engines, legacy code without autodiff support
- Implicit differentiation: Differentiating through the solution of a system (e.g., via implicit function theorem)
- Derivative-free optimization: When the function is non-differentiable or black-box
import numpy as np
def derivative_free_optimization(
f,
x0: np.ndarray,
learning_rate: float = 0.01,
h: float = 1e-5,
n_iters: int = 100
) -> np.ndarray:
"""
Simple gradient descent using numerical gradients.
Only sensible for very low-dimensional problems.
For high-dimensional ML: USE AUTODIFF.
"""
x = x0.copy()
for step in range(n_iters):
# Compute numerical gradient (expensive: n function calls)
grad = np.zeros_like(x)
for i in range(len(x)):
x_plus = x.copy(); x_plus[i] += h
x_minus = x.copy(); x_minus[i] -= h
grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)
x = x - learning_rate * grad
if step % 10 == 0:
print(f"Step {step}: f(x) = {f(x):.6f}")
return x
# This is fine for 2D optimization benchmarks, absurd for neural networks
Interview Questions
Q1: Why is central difference more accurate than forward difference, and what is the optimal step size?
Forward difference error analysis: Error = - first order accurate.
Central difference error analysis: The terms cancel, giving error - second-order accurate.
For the same , central differences are times more accurate. Central differences require 2 function evaluations per derivative versus 1 for forward differences, but this is almost always worthwhile.
Optimal step size balances truncation error () against rounding error ():
At these optimal step sizes, central difference achieves relative error , while forward difference achieves . Central difference is still more accurate even at its (slightly larger) optimal .
Q2: How do you implement a correct gradient check in PyTorch, and what relative error threshold indicates a bug?
Correct implementation:
- Use
float64(double precision) for both forward passes - numerical differentiation's rounding error decreases dramatically at float64 vs float32. - Use central differences with to .
- Compare analytical gradient against numerical gradient using relative error:
Thresholds:
< 1e-7: Correct with high confidence1e-7to1e-5: Likely correct (may reflect float64 precision limits)1e-5to1e-3: Suspicious - investigate> 1e-3: Almost certainly a bug in the backward pass
Common gotchas:
- Don't test at a point where the function is non-smooth (e.g., x=0 for ReLU, abs(x))
- Include all parameters including biases
- Test with both positive and negative inputs and at multiple random points
- PyTorch's
torch.autograd.gradcheckautomates this correctly - prefer it over rolling your own
Typical bugs found by gradient checking:
- Wrong sign (forgot a minus somewhere)
- Missing factor of 2 or
- Wrong reshape/dimension order in Jacobian
- Incorrect handling of batch dimension
Q3: Why is reverse-mode autodiff O(1) for gradients while numerical differentiation is O(n)?
Numerical differentiation approximates:
Each partial derivative requires 2 forward function evaluations. For parameters: evaluations total - cost.
Reverse-mode autodiff (backpropagation):
- Forward pass: computes and stores intermediate values - time, memory
- Backward pass: computes all partial derivatives simultaneously using the chain rule - time
The key insight: the backward pass computes all partial derivatives at once by propagating gradient "signals" backward through the computation graph. The total cost is proportional to the forward pass cost, not to the number of parameters.
Formal statement: For :
- Forward mode autodiff: per input direction → efficient when
- Reverse mode autodiff: per output direction → efficient when
- For neural network training: (scalar loss), → reverse mode is the right choice
Q4: When would you use numerical differentiation instead of automatic differentiation in an ML system?
1. Gradient checking (debugging): Always use numerical differentiation to validate custom autodiff implementations. PyTorch's gradcheck does this automatically.
2. External simulators: When part of your pipeline is a physics engine, a compiled simulation, or legacy code that cannot be differentiated analytically. You can wrap it with numerical gradients. Example: differentiating through a rigid body physics simulator for robotics RL.
3. Implicit differentiation: When your "function" is defined as the solution to a system of equations . The derivative can be computed implicitly via the implicit function theorem without differentiating through the solver iterations.
4. Non-differentiable components: Sorting, argmax, sampling - operations that autodiff cannot handle. Finite differences can still provide a useful signal in some cases (straight-through estimators are an alternative).
5. Derivative-free optimization: Some optimization algorithms (CMA-ES, Bayesian optimization) don't use gradients at all. Numerical gradients can approximate them for simple benchmarks but not for high-dimensional ML.
Never use numerical differentiation for: Training neural networks with millions of parameters - the function evaluation cost is completely infeasible.
Q5: What is the truncation error vs rounding error trade-off in finite differences?
The total error in a central difference approximation has two components:
Truncation error (from Taylor series approximation):
This decreases as . Smaller → better truncation accuracy.
Rounding error (from floating-point subtraction of nearly equal numbers):
for small , so their difference loses significant digits. Smaller → worse rounding accuracy.
Total error:
Minimizing:
At , the minimum error is - limited by floating-point precision, not step size choice.
Practical consequence: There is a finite minimum error for numerical differentiation, determined by floating-point precision. Automatic differentiation has no such limit - it achieves machine precision by construction.
Quick Reference
| Method | Formula | Error Order | Optimal h (float64) | Cost |
|---|---|---|---|---|
| Forward difference | 1 eval | |||
| Backward difference | 1 eval | |||
| Central difference | 2 evals | |||
| 5-point stencil | (see formula above) | 4 evals |
| Relative Error | Interpretation |
|---|---|
< 1e-7 | Gradient is correct |
1e-7 to 1e-5 | Probably correct |
> 1e-3 | Bug in backward pass |
Next: Lesson 05: Numerical Integration →
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Derivatives & Gradients demo on the EngineersOfAI Playground - no code required.
:::
