Skip to main content

Numerical Differentiation - Finite Differences, Gradient Checking, and Autodiff

Reading time: ~20 minutes | Level: Numerical Methods → ML Engineering

You implement a custom PyTorch autograd function for a novel attention variant. The model trains, loss decreases. But the model performs worse than the baseline. Did you implement the backward pass correctly?

The answer: gradient checking with finite differences. This technique uses numerical differentiation to compare your analytical gradient against a numerical approximation. Any discrepancy signals a bug in your backward pass.

Numerical differentiation is also the historical foundation of machine learning optimization - before autodiff, we literally approximated gradients this way. Understanding it deeply helps you understand why autodiff was invented and what makes it superior.

What You Will Learn

  • Forward, backward, and central difference formulas with error analysis
  • How to select the optimal step size hh
  • Truncation vs rounding error: the fundamental trade-off
  • Numerical Jacobians for multivariable functions
  • Gradient checking: implementing it correctly and interpreting results
  • Why automatic differentiation dominates and when numerical differentiation still has a role

Part 1 - Finite Difference Approximations

The forward difference formula

From Taylor's theorem, any smooth function satisfies:

f(x+h)=f(x)+hf(x)+h22f(x)+O(h3)f(x + h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + O(h^3)

Rearranging:

f(x)f(x+h)f(x)hf'(x) \approx \frac{f(x+h) - f(x)}{h}

The error is O(h)O(h) - first-order accurate. This is the forward difference formula.

The central difference formula

Using both forward and backward Taylor expansions:

f(x+h)=f(x)+hf(x)+h22f(x)+h36f(x)+O(h4)f(x + h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + \frac{h^3}{6} f'''(x) + O(h^4) f(xh)=f(x)hf(x)+h22f(x)h36f(x)+O(h4)f(x - h) = f(x) - h f'(x) + \frac{h^2}{2} f''(x) - \frac{h^3}{6} f'''(x) + O(h^4)

Subtracting and dividing by 2h2h:

f(x)f(x+h)f(xh)2h+O(h2)f'(x) \approx \frac{f(x+h) - f(x-h)}{2h} + O(h^2)

The error is O(h2)O(h^2) - second-order accurate. For the same step size, central differences are much more accurate.

import numpy as np

def forward_difference(f, x: float, h: float = 1e-5) -> float:
"""First-order accurate finite difference: O(h) error."""
return (f(x + h) - f(x)) / h

def central_difference(f, x: float, h: float = 1e-5) -> float:
"""Second-order accurate central difference: O(h²) error."""
return (f(x + h) - f(x - h)) / (2 * h)

def five_point_stencil(f, x: float, h: float = 1e-4) -> float:
"""Fourth-order accurate: O(h⁴) error - higher accuracy, more function calls."""
return (-f(x + 2*h) + 8*f(x + h) - 8*f(x - h) + f(x - 2*h)) / (12 * h)

# Test on f(x) = sin(x), f'(x) = cos(x)
f = np.sin
x = 1.0
true_deriv = np.cos(x)

h = 1e-4
fwd = forward_difference(f, x, h)
cen = central_difference(f, x, h)
fp5 = five_point_stencil(f, x, h)

print(f"True: {true_deriv:.10f}")
print(f"Forward: {fwd:.10f} error: {abs(fwd - true_deriv):.2e}")
print(f"Central: {cen:.10f} error: {abs(cen - true_deriv):.2e}")
print(f"5-point: {fp5:.10f} error: {abs(fp5 - true_deriv):.2e}")
# Central has ~h times smaller error than forward for same h
# 5-point has ~h^4 error - near machine precision even for h=1e-4

Part 2 - Step Size Selection: The Critical Trade-off

Two competing error sources

The total error in finite differences has two components:

Truncation error: From dropping higher-order Taylor terms: Etrunc=O(hp)(p=1 for forward, p=2 for central)E_{\text{trunc}} = O(h^p) \quad \text{(p=1 for forward, p=2 for central)}

Rounding error: From floating-point subtraction of nearly equal quantities: Eround=2ϵmachinef(x)hE_{\text{round}} = \frac{2\epsilon_{\text{machine}} \cdot |f(x)|}{h}

The total error is:

Etotal(h)=C1hp+C2ϵmachinehE_{\text{total}}(h) = C_1 h^p + \frac{C_2 \epsilon_{\text{machine}}}{h}

The optimal step size

Minimizing EtotalE_{\text{total}} with respect to hh:

h=(C2ϵmachinepC1)1/(p+1)h^* = \left(\frac{C_2 \epsilon_{\text{machine}}}{p C_1}\right)^{1/(p+1)}

For float64 (ϵmachine1016\epsilon_{\text{machine}} \approx 10^{-16}):

  • Forward difference (p=1): hϵ108h^* \approx \sqrt{\epsilon} \approx 10^{-8}
  • Central difference (p=2): hϵ1/3105h^* \approx \epsilon^{1/3} \approx 10^{-5} to 10610^{-6}
import numpy as np
import matplotlib.pyplot as plt

def analyze_step_size(f, f_prime_true, x: float):
"""
Show how error varies with step size h.
Reveals the optimal h where truncation and rounding errors balance.
"""
h_values = np.logspace(-16, 0, 500)
errors_fwd = []
errors_cen = []

true = f_prime_true(x)

for h in h_values:
fwd_est = (f(x + h) - f(x)) / h
cen_est = (f(x + h) - f(x - h)) / (2 * h)
errors_fwd.append(abs(fwd_est - true))
errors_cen.append(abs(cen_est - true))

# Find optimal h
opt_h_fwd = h_values[np.argmin(errors_fwd)]
opt_h_cen = h_values[np.argmin(errors_cen)]
print(f"Optimal h (forward): {opt_h_fwd:.2e}") # ~1e-8
print(f"Optimal h (central): {opt_h_cen:.2e}") # ~1e-5 to 1e-6

return h_values, errors_fwd, errors_cen

# For f(x) = sin(x):
h_vals, e_fwd, e_cen = analyze_step_size(
np.sin, np.cos, x=1.0
)
# Error curve: decreases as h decreases (truncation dominated)
# then increases as h decreases further (rounding dominated)
# Optimal h is the "sweet spot" minimum
Error vs step size h (log-log scale):

log₁₀(error)

0 ┤ Central: Forward:
-2 ┤ error ~ h² error ~ h
-4 ┤ ↘ ↘
-6 ┤ ↘ ↘
-8 ┤ ↗↘ opt h ↗↘ opt h
-10 ┤ ↗ ↗
-12 ┤ ↗ ↗ (rounding error: 2ε/h)
└────────────────────────────────
-16 -12 -8 -4 0 log₁₀(h)

Central optimal h ≈ 1e-5, error ≈ 1e-10
Forward optimal h ≈ 1e-8, error ≈ 1e-8

:::tip Use central differences for gradient checking Always use central differences for gradient checking, with h105h \approx 10^{-5}. Forward differences with h108h \approx 10^{-8} appear to give similar total error but are more sensitive to the specific function being tested. Central differences are more robust. :::

Part 3 - Numerical Jacobians for Multivariable Functions

For a function f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m, the Jacobian JRm×nJ \in \mathbb{R}^{m \times n} has entries:

Jij=fixjJ_{ij} = \frac{\partial f_i}{\partial x_j}

Numerical Jacobian via central differences:

Jijfi(x+hej)fi(xhej)2hJ_{ij} \approx \frac{f_i(x + h e_j) - f_i(x - h e_j)}{2h}

where eje_j is the jj-th standard basis vector.

import numpy as np

def numerical_jacobian(
f,
x: np.ndarray,
h: float = 1e-5
) -> np.ndarray:
"""
Compute numerical Jacobian using central differences.

f: R^n -> R^m function
x: input point, shape (n,)
Returns: Jacobian J, shape (m, n)
"""
n = len(x)
f0 = np.atleast_1d(f(x))
m = len(f0)

J = np.zeros((m, n))

for j in range(n):
# Perturb jth component
x_plus = x.copy()
x_minus = x.copy()
x_plus[j] += h
x_minus[j] -= h

f_plus = np.atleast_1d(f(x_plus))
f_minus = np.atleast_1d(f(x_minus))

J[:, j] = (f_plus - f_minus) / (2 * h)

return J

# Test: f(x, y) = [x^2 + y, x*y - y^2]
# Analytical Jacobian: [[2x, 1], [y, x - 2y]]
def f_test(x):
return np.array([x[0]**2 + x[1], x[0]*x[1] - x[1]**2])

x0 = np.array([2.0, 3.0])

J_numerical = numerical_jacobian(f_test, x0)
J_analytical = np.array([[2*x0[0], 1],
[x0[1], x0[0] - 2*x0[1]]])

print("Numerical Jacobian:")
print(J_numerical)
print("\nAnalytical Jacobian:")
print(J_analytical)
print(f"\nMax error: {np.max(np.abs(J_numerical - J_analytical)):.2e}")
# ~1e-10 - excellent agreement

Part 4 - Gradient Checking: Validating Backpropagation

Gradient checking is the most important application of numerical differentiation in ML engineering. It verifies that your analytical backward pass computes the same gradients as numerical differentiation.

The relative error metric

def relative_error(a: np.ndarray, b: np.ndarray) -> float:
"""
Compute the relative error between two gradient vectors.
Standard metric for gradient checking.

Interpretation:
< 1e-7: Excellent - almost certainly correct
1e-7 to 1e-5: Probably OK (floating point noise)
> 1e-3: Bug in the gradient computation
"""
numerator = np.linalg.norm(a - b)
denominator = np.linalg.norm(a) + np.linalg.norm(b)
if denominator < 1e-10:
return 0.0 # Both near zero
return numerator / denominator

Full gradient check implementation

import numpy as np

def gradient_check(
forward_fn,
backward_fn,
params: np.ndarray,
h: float = 1e-5
) -> dict:
"""
Validate analytical gradients against numerical gradients.

forward_fn: params -> scalar loss
backward_fn: params -> gradient of loss w.r.t. params
params: flat parameter vector

Returns:
dict with per-parameter relative errors
"""
# Analytical gradients
grad_analytical = backward_fn(params)

# Numerical gradients via central differences
grad_numerical = np.zeros_like(params)
for i in range(len(params)):
params_plus = params.copy()
params_minus = params.copy()
params_plus[i] += h
params_minus[i] -= h

loss_plus = forward_fn(params_plus)
loss_minus = forward_fn(params_minus)

grad_numerical[i] = (loss_plus - loss_minus) / (2 * h)

# Relative error
rel_error = relative_error(grad_analytical, grad_numerical)

return {
'relative_error': rel_error,
'analytical': grad_analytical,
'numerical': grad_numerical,
'passed': rel_error < 1e-5
}


# Example: gradient checking a custom sigmoid function
def sigmoid(x):
return 1.0 / (1.0 + np.exp(-x))

def sigmoid_loss(params):
"""Scalar loss: sum of sigmoid(params)."""
return np.sum(sigmoid(params))

def sigmoid_grad(params):
"""Analytical gradient: sigmoid(x) * (1 - sigmoid(x))."""
s = sigmoid(params)
return s * (1 - s)

params = np.array([1.0, -0.5, 2.0, 0.0, -1.5])
result = gradient_check(sigmoid_loss, sigmoid_grad, params)
print(f"Relative error: {result['relative_error']:.2e}")
print(f"Gradient check passed: {result['passed']}")

PyTorch gradient check

import torch

def torch_gradient_check(
module: torch.nn.Module,
inputs: tuple,
tol: float = 1e-4
) -> bool:
"""
Gradient check for a PyTorch nn.Module using torch.autograd.gradcheck.
"""
# Convert inputs to double precision (more accurate numerical gradients)
inputs_double = tuple(
x.double().requires_grad_(True) if x.is_floating_point() else x
for x in inputs
)
module_double = module.double()

try:
result = torch.autograd.gradcheck(
module_double,
inputs_double,
eps=1e-6, # h for central differences
atol=tol, # absolute tolerance
rtol=tol, # relative tolerance
raise_exception=True
)
print("Gradient check PASSED")
return True
except RuntimeError as e:
print(f"Gradient check FAILED: {e}")
return False

# Usage: check your custom autograd function
class CustomSoftmax(torch.autograd.Function):
@staticmethod
def forward(ctx, x):
exp_x = torch.exp(x - x.max(dim=-1, keepdim=True).values)
softmax = exp_x / exp_x.sum(dim=-1, keepdim=True)
ctx.save_for_backward(softmax)
return softmax

@staticmethod
def backward(ctx, grad_output):
softmax, = ctx.saved_tensors
# Jacobian-vector product: J_softmax @ grad_output
dot = (grad_output * softmax).sum(dim=-1, keepdim=True)
return softmax * (grad_output - dot)

# Check it
x = torch.randn(3, 5, dtype=torch.float64, requires_grad=True)
torch.autograd.gradcheck(CustomSoftmax.apply, x, eps=1e-6, atol=1e-4)

Part 5 - Automatic Differentiation vs Numerical Differentiation

The fundamental limitations of numerical differentiation

PropertyNumerical DiffAutomatic Diff (Reverse Mode)
AccuracyO(h2)O(h^2) at best, limited by floating pointExact (machine precision)
Cost (gradient of scalar)O(n)O(n) function evaluationsO(1)O(1) backward passes
Cost (nn-input, mm-output Jacobian)O(n)O(n) or O(m)O(m)O(min(n,m))O(\min(n,m))
Works for non-smooth functionsYes (with care)Only if derivative defined
Implementation complexityTrivialComplex (requires computation graph)

For computing gradients of L:RnR\mathcal{L}: \mathbb{R}^n \to \mathbb{R} (scalar loss), reverse-mode autodiff computes all nn partial derivatives in a single backward pass - no approximation, no step size selection, no trade-off between truncation and rounding error.

Numerical differentiation requires nn forward function evaluations (or 2n2n for central differences). For a 7B parameter model, that is 14 billion forward passes to compute one gradient - completely infeasible.

When numerical differentiation is still used

  1. Gradient checking: Validate autodiff implementations against numerical approximation
  2. External simulators: Physics engines, legacy code without autodiff support
  3. Implicit differentiation: Differentiating through the solution of a system (e.g., via implicit function theorem)
  4. Derivative-free optimization: When the function is non-differentiable or black-box
import numpy as np

def derivative_free_optimization(
f,
x0: np.ndarray,
learning_rate: float = 0.01,
h: float = 1e-5,
n_iters: int = 100
) -> np.ndarray:
"""
Simple gradient descent using numerical gradients.
Only sensible for very low-dimensional problems.
For high-dimensional ML: USE AUTODIFF.
"""
x = x0.copy()

for step in range(n_iters):
# Compute numerical gradient (expensive: n function calls)
grad = np.zeros_like(x)
for i in range(len(x)):
x_plus = x.copy(); x_plus[i] += h
x_minus = x.copy(); x_minus[i] -= h
grad[i] = (f(x_plus) - f(x_minus)) / (2 * h)

x = x - learning_rate * grad

if step % 10 == 0:
print(f"Step {step}: f(x) = {f(x):.6f}")

return x

# This is fine for 2D optimization benchmarks, absurd for neural networks

Interview Questions

Q1: Why is central difference more accurate than forward difference, and what is the optimal step size?

Forward difference error analysis: f(x+h)f(x)h=f(x)+h2f(x)+O(h2)\frac{f(x+h) - f(x)}{h} = f'(x) + \frac{h}{2}f''(x) + O(h^2) Error = O(h)O(h) - first order accurate.

Central difference error analysis: f(x+h)f(xh)2h=f(x)+h26f(x)+O(h4)\frac{f(x+h) - f(x-h)}{2h} = f'(x) + \frac{h^2}{6}f'''(x) + O(h^4) The f(x)f''(x) terms cancel, giving O(h2)O(h^2) error - second-order accurate.

For the same hh, central differences are O(h)O(h) times more accurate. Central differences require 2 function evaluations per derivative versus 1 for forward differences, but this is almost always worthwhile.

Optimal step size balances truncation error (hp\sim h^p) against rounding error (ϵmach/h\sim \epsilon_{\text{mach}}/h):

hforward=ϵmach108(float64)h^*_{\text{forward}} = \sqrt{\epsilon_{\text{mach}}} \approx 10^{-8} \quad (\text{float64}) hcentral=ϵmach1/3105 to 106(float64)h^*_{\text{central}} = \epsilon_{\text{mach}}^{1/3} \approx 10^{-5} \text{ to } 10^{-6} \quad (\text{float64})

At these optimal step sizes, central difference achieves relative error 1010\approx 10^{-10}, while forward difference achieves 108\approx 10^{-8}. Central difference is still more accurate even at its (slightly larger) optimal hh.

Q2: How do you implement a correct gradient check in PyTorch, and what relative error threshold indicates a bug?

Correct implementation:

  1. Use float64 (double precision) for both forward passes - numerical differentiation's rounding error decreases dramatically at float64 vs float32.
  2. Use central differences with h=105h = 10^{-5} to 10610^{-6}.
  3. Compare analytical gradient gag_a against numerical gradient gng_n using relative error: rel_error=gagnga+gn\text{rel\_error} = \frac{\|g_a - g_n\|}{\|g_a\| + \|g_n\|}

Thresholds:

  • < 1e-7: Correct with high confidence
  • 1e-7 to 1e-5: Likely correct (may reflect float64 precision limits)
  • 1e-5 to 1e-3: Suspicious - investigate
  • > 1e-3: Almost certainly a bug in the backward pass

Common gotchas:

  • Don't test at a point where the function is non-smooth (e.g., x=0 for ReLU, abs(x))
  • Include all parameters including biases
  • Test with both positive and negative inputs and at multiple random points
  • PyTorch's torch.autograd.gradcheck automates this correctly - prefer it over rolling your own

Typical bugs found by gradient checking:

  • Wrong sign (forgot a minus somewhere)
  • Missing factor of 2 or nn
  • Wrong reshape/dimension order in Jacobian
  • Incorrect handling of batch dimension
Q3: Why is reverse-mode autodiff O(1) for gradients while numerical differentiation is O(n)?

Numerical differentiation approximates: LθiL(θ+hei)L(θhei)2h\frac{\partial \mathcal{L}}{\partial \theta_i} \approx \frac{\mathcal{L}(\theta + h e_i) - \mathcal{L}(\theta - h e_i)}{2h}

Each partial derivative requires 2 forward function evaluations. For nn parameters: 2n2n evaluations total - O(n)O(n) cost.

Reverse-mode autodiff (backpropagation):

  • Forward pass: computes and stores intermediate values - O(n)O(n) time, O(n)O(n) memory
  • Backward pass: computes all nn partial derivatives simultaneously using the chain rule - O(n)O(n) time

The key insight: the backward pass computes all partial derivatives at once by propagating gradient "signals" backward through the computation graph. The total cost is proportional to the forward pass cost, not to the number of parameters.

Formal statement: For f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m:

  • Forward mode autodiff: O(cost of f)O(\text{cost of f}) per input direction → efficient when nmn \ll m
  • Reverse mode autodiff: O(cost of f)O(\text{cost of f}) per output direction → efficient when mnm \ll n
  • For neural network training: m=1m = 1 (scalar loss), n=millionsn = \text{millions} → reverse mode is the right choice
Q4: When would you use numerical differentiation instead of automatic differentiation in an ML system?

1. Gradient checking (debugging): Always use numerical differentiation to validate custom autodiff implementations. PyTorch's gradcheck does this automatically.

2. External simulators: When part of your pipeline is a physics engine, a compiled simulation, or legacy code that cannot be differentiated analytically. You can wrap it with numerical gradients. Example: differentiating through a rigid body physics simulator for robotics RL.

3. Implicit differentiation: When your "function" is defined as the solution to a system of equations F(y;θ)=0F(y; \theta) = 0. The derivative dy/dθdy/d\theta can be computed implicitly via the implicit function theorem without differentiating through the solver iterations.

4. Non-differentiable components: Sorting, argmax, sampling - operations that autodiff cannot handle. Finite differences can still provide a useful signal in some cases (straight-through estimators are an alternative).

5. Derivative-free optimization: Some optimization algorithms (CMA-ES, Bayesian optimization) don't use gradients at all. Numerical gradients can approximate them for simple benchmarks but not for high-dimensional ML.

Never use numerical differentiation for: Training neural networks with millions of parameters - the O(n)O(n) function evaluation cost is completely infeasible.

Q5: What is the truncation error vs rounding error trade-off in finite differences?

The total error in a central difference approximation has two components:

Truncation error (from Taylor series approximation): Etrunc=h26f(ξ)for some ξ[xh,x+h]E_{\text{trunc}} = \frac{h^2}{6} |f'''(\xi)| \quad \text{for some } \xi \in [x-h, x+h]

This decreases as h0h \to 0. Smaller hh → better truncation accuracy.

Rounding error (from floating-point subtraction of nearly equal numbers): Eround2ϵmachf(x)hE_{\text{round}} \approx \frac{2\epsilon_{\text{mach}} |f(x)|}{h}

f(x+h)f(xh)f(x+h) \approx f(x-h) for small hh, so their difference loses significant digits. Smaller hh → worse rounding accuracy.

Total error: E(h)C1h2+C2ϵmach/hE(h) \approx C_1 h^2 + C_2 \epsilon_{\text{mach}} / h

Minimizing: dEdh=2C1hC2ϵmach/h2=0\frac{dE}{dh} = 2C_1 h - C_2 \epsilon_{\text{mach}} / h^2 = 0

h=(C2ϵmach2C1)1/3ϵmach1/3105 (float64)h^* = \left(\frac{C_2 \epsilon_{\text{mach}}}{2C_1}\right)^{1/3} \approx \epsilon_{\text{mach}}^{1/3} \approx 10^{-5} \text{ (float64)}

At hh^*, the minimum error is ϵmach2/31010\approx \epsilon_{\text{mach}}^{2/3} \approx 10^{-10} - limited by floating-point precision, not step size choice.

Practical consequence: There is a finite minimum error for numerical differentiation, determined by floating-point precision. Automatic differentiation has no such limit - it achieves machine precision by construction.

Quick Reference

MethodFormulaError OrderOptimal h (float64)Cost
Forward difference[f(x+h)f(x)]/h[f(x+h) - f(x)]/hO(h)O(h)ϵ108\sqrt{\epsilon} \approx 10^{-8}1 eval
Backward difference[f(x)f(xh)]/h[f(x) - f(x-h)]/hO(h)O(h)ϵ108\sqrt{\epsilon} \approx 10^{-8}1 eval
Central difference[f(x+h)f(xh)]/(2h)[f(x+h) - f(x-h)]/(2h)O(h2)O(h^2)ϵ1/3105\epsilon^{1/3} \approx 10^{-5}2 evals
5-point stencil(see formula above)O(h4)O(h^4)ϵ1/5103\epsilon^{1/5} \approx 10^{-3}4 evals
Relative ErrorInterpretation
< 1e-7Gradient is correct
1e-7 to 1e-5Probably correct
> 1e-3Bug in backward pass

Next: Lesson 05: Numerical Integration →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Derivatives & Gradients demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.