Root-Finding Algorithms - Bisection, Newton-Raphson, and ML Applications
Reading time: ~18 minutes | Level: Numerical Methods → Optimization
You want to find the learning rate where the gradient norm equals a target value. You want to find the temperature parameter where the KL divergence between two distributions equals a budget. You implement layer normalization and need to find the scale that makes the output variance equal 1.
These are all root-finding problems: find such that (or equivalently, after shifting). Root-finding algorithms are the workhorses of optimization, calibration, and constraint satisfaction in production ML systems.
What You Will Learn
- Bisection method: guaranteed convergence, linear rate
- Newton-Raphson method: quadratic convergence, fragility
- Secant method: superlinear convergence without derivatives
- Convergence analysis: linear, superlinear, quadratic
- Fixed-point iterations and their connection to ML optimizers
- ML applications: learning rate search, quantile finding, calibration
Part 1 - The Bisection Method
The idea
Given (sign change), by the intermediate value theorem there is a root in . Bisection repeatedly halves the interval:
- Compute
- If : done
- If : root is in → set
- Else: root is in → set
- Repeat until
Convergence: Linear - error halves each step. After iterations, error .
import numpy as np
from typing import Callable, Optional
def bisection(
f: Callable,
a: float,
b: float,
tol: float = 1e-10,
maxiter: int = 100
) -> tuple:
"""
Bisection root-finding method.
Guaranteed to converge if f(a) and f(b) have opposite signs.
Returns: (root, n_iterations, converged)
"""
if f(a) * f(b) > 0:
raise ValueError(f"f(a)={f(a):.4f} and f(b)={f(b):.4f} must have opposite signs")
history = []
for n in range(maxiter):
c = (a + b) / 2.0
fc = f(c)
history.append({'c': c, 'f(c)': fc, 'interval_width': b - a})
if abs(fc) < tol or (b - a) / 2 < tol:
return c, n + 1, True, history
if f(a) * fc < 0:
b = c # Root is in left half
else:
a = c # Root is in right half
return (a + b) / 2.0, maxiter, False, history
# Example: find √2 as root of f(x) = x² - 2
f = lambda x: x**2 - 2.0
root, n_iters, converged, hist = bisection(f, 1.0, 2.0, tol=1e-12)
print(f"Root: {root:.12f}")
print(f"True: {np.sqrt(2):.12f}")
print(f"Iterations: {n_iters}")
print(f"Error: {abs(root - np.sqrt(2)):.2e}")
# Show convergence rate (should halve each iteration)
for i, step in enumerate(hist[:8]):
print(f"Iter {i+1}: c = {step['c']:.8f}, |f(c)| = {abs(step['f(c)']):.2e}")
Key property: Bisection always converges for continuous with a sign-bracketed interval. It is the method of last resort when you need guaranteed convergence and other methods fail.
Part 2 - Newton-Raphson Method
The idea
Newton-Raphson uses both and to take a better step. The linear approximation of near :
Solving for :
Convergence analysis
For a simple root (where and ), Newton-Raphson has quadratic convergence:
This means: once you are close to the root, the number of correct decimal digits doubles each iteration. Going from 4 correct digits to 8 to 16 - convergence is extremely fast.
import numpy as np
def newton_raphson(
f: Callable,
df: Callable,
x0: float,
tol: float = 1e-12,
maxiter: int = 50
) -> tuple:
"""
Newton-Raphson root-finding.
Requires f and its derivative df.
Quadratic convergence near the root.
Can diverge if starting far from root or if f'(x) ≈ 0.
"""
x = x0
history = []
for n in range(maxiter):
fx = f(x)
history.append({'x': x, 'f(x)': fx})
if abs(fx) < tol:
return x, n + 1, True, history
dfx = df(x)
if abs(dfx) < 1e-14:
raise ValueError(f"Near-zero derivative at x={x:.6f} - Newton-Raphson failed")
x = x - fx / dfx # Newton step
return x, maxiter, False, history
# Example: f(x) = x² - 2, f'(x) = 2x
f = lambda x: x**2 - 2.0
df = lambda x: 2.0 * x
root, n_iters, converged, hist = newton_raphson(f, df, x0=1.5, tol=1e-14)
print(f"Root: {root:.14f}")
print(f"Iterations: {n_iters}") # Typically 4-6 for this problem
# Observe quadratic convergence
true_root = np.sqrt(2)
print("\nQuadratic convergence:")
for i, step in enumerate(hist):
err = abs(step['x'] - true_root)
print(f"Iter {i}: error = {err:.2e}")
# Errors: 1e-1 → 1e-2 → 1e-4 → 1e-8 → 1e-16 (digits double!)
Newton-Raphson in multiple dimensions
For systems where , Newton's method uses the Jacobian:
Or equivalently (avoid explicit inverse): solve , then .
def newton_system(
F: Callable,
J_F: Callable,
x0: np.ndarray,
tol: float = 1e-10,
maxiter: int = 50
) -> np.ndarray:
"""
Newton's method for nonlinear systems F(x) = 0.
J_F: Jacobian function x -> J (n x n matrix)
"""
x = x0.copy()
for _ in range(maxiter):
Fx = F(x)
if np.linalg.norm(Fx) < tol:
break
J = J_F(x)
# Solve J @ delta = -F(x) - do NOT use J_inv!
delta = np.linalg.solve(J, -Fx)
x = x + delta
return x
When Newton-Raphson fails
# Failure mode 1: Oscillation (cycle)
f_cycle = lambda x: x**3 - 2*x + 2
df_cycle = lambda x: 3*x**2 - 2
# Starting at x=0: Newton step goes to x=1, then back to x=0 - infinite loop!
# Failure mode 2: Divergence (wrong basin of attraction)
f_div = lambda x: np.arctan(x)
df_div = lambda x: 1.0 / (1 + x**2)
# For x0 far from origin: converges to wrong root or diverges
# Safeguard: Combine Newton with bisection (Brent's method)
from scipy.optimize import brentq, newton
# scipy.optimize.brentq: guaranteed convergence + fast near root
# Best of both worlds for 1D problems
root = brentq(lambda x: x**2 - 2, 1.0, 2.0, xtol=1e-12)
print(f"Brent's method: {root:.12f}")
Part 3 - Secant Method
The secant method approximates using two previous iterates:
Substituting into the Newton step:
No derivative needed. Requires two initial points .
Convergence: Superlinear - convergence order (the golden ratio). Faster than bisection, slower than Newton.
def secant_method(
f: Callable,
x0: float,
x1: float,
tol: float = 1e-12,
maxiter: int = 50
) -> tuple:
"""
Secant method - derivative-free, superlinear convergence.
Requires two starting points but no analytical derivative.
"""
history = [(x0, f(x0)), (x1, f(x1))]
for n in range(maxiter):
f0, f1 = f(x0), f(x1)
if abs(f1 - f0) < 1e-14:
break # Near-parallel secant line - numerical issue
# Secant step
x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
history.append((x2, f(x2)))
if abs(x2 - x1) < tol or abs(f(x2)) < tol:
return x2, n + 1, True, history
x0, x1 = x1, x2
return x1, maxiter, False, history
# Example: find root of f(x) = cos(x) - x (the Dottie number ≈ 0.7391)
f = lambda x: np.cos(x) - x
root, n_iters, converged, _ = secant_method(f, 0.0, 1.0)
print(f"Root (Dottie number): {root:.10f}") # ≈ 0.7390851332
print(f"Iterations: {n_iters}") # ~8-10 iterations
Part 4 - Convergence Order Comparison
Convergence orders:
Method | Order | Evals/step | When to use
─────────────────┼───────┼────────────┼─────────────────────────
Bisection | 1.0 | 1 | Guaranteed convergence, no derivative
Newton-Raphson | 2.0 | 1 | Fast, have derivative, near root
Secant | 1.618 | 1 | Fast, no derivative
Brent's method | 1.618 | 1 | Guaranteed + fast (bracket + secant)
Halley's method | 3.0 | 1 | Have f, f', f'' - very fast
Error reduction per iteration:
Bisection: |e_{k+1}| = 0.5 × |e_k|
Secant: |e_{k+1}| = C × |e_k|^1.618
Newton: |e_{k+1}| = C × |e_k|^2
After 10 iterations starting with error 0.1:
Bisection: 0.1 / 2^10 ≈ 10^-4
Secant: (0.1)^(1.618^10) ≈ 10^-11 ← practically at machine precision
Newton: (0.1)^(2^10) ≈ 10^-1024 ← beyond float range after 4-5 iterations
Part 5 - ML Applications
Finding the optimal learning rate
One practical application of bisection: finding the learning rate where the gradient norm crosses a threshold (learning rate finder algorithm):
import numpy as np
def learning_rate_finder(
model,
dataloader,
loss_fn,
target_loss_ratio: float = 3.0, # Stop when loss is 3x initial loss
min_lr: float = 1e-7,
max_lr: float = 10.0,
n_steps: int = 100
) -> float:
"""
Find the maximum safe learning rate using linear sweep + exponential schedule.
Uses root-finding concept: find LR where loss starts to diverge.
"""
from copy import deepcopy
import torch
initial_state = deepcopy(model.state_dict())
lrs = np.exp(np.linspace(np.log(min_lr), np.log(max_lr), n_steps))
losses = []
initial_loss = None
for step, lr in enumerate(lrs):
batch = next(iter(dataloader))
# Take one step at this learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
loss = loss_fn(model(batch['x']), batch['y'])
if initial_loss is None:
initial_loss = loss.item()
losses.append(loss.item())
if loss.item() > target_loss_ratio * initial_loss or np.isnan(loss.item()):
# Divergence detected - last stable LR is a few steps back
optimal_lr = lrs[max(0, step - 3)]
model.load_state_dict(initial_state)
return optimal_lr
optimizer.zero_grad()
loss.backward()
optimizer.step()
model.load_state_dict(initial_state)
return lrs[np.argmin(losses)] # LR at minimum loss
def find_temperature_by_bisection(
log_probs: np.ndarray,
target_entropy: float,
tol: float = 1e-6
) -> float:
"""
Find temperature T such that entropy of softmax(log_probs / T) = target_entropy.
Used in knowledge distillation and policy entropy regularization.
Entropy is monotonically increasing in T → root-finding is valid.
"""
def entropy_deficit(log_T: float) -> float:
T = np.exp(log_T)
# Compute softmax(log_probs / T)
scaled = log_probs / T
probs = np.exp(scaled - np.max(scaled))
probs /= probs.sum()
entropy = -np.sum(probs * np.log(probs + 1e-12))
return entropy - target_entropy
# Bisection in log-temperature space
log_T_root, _, _, _ = bisection(
entropy_deficit, a=-5.0, b=5.0, tol=tol
)
return np.exp(log_T_root)
def find_calibration_threshold(
probabilities: np.ndarray,
labels: np.ndarray,
target_fpr: float = 0.05
) -> float:
"""
Find decision threshold where False Positive Rate = target_fpr.
Classic root-finding application in classifier calibration.
"""
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(labels, probabilities)
# Find threshold where fpr ≈ target_fpr using bisection
def fpr_at_threshold(t: float) -> float:
preds = (probabilities >= t).astype(int)
fp = np.sum((preds == 1) & (labels == 0))
n = np.sum(labels == 0)
return fp / n - target_fpr # We want this = 0
threshold, _, _, _ = bisection(fpr_at_threshold, a=0.0, b=1.0, tol=1e-6)
return threshold
Fixed-point iterations and ML optimizers
Many ML algorithms can be viewed as fixed-point iterations:
where we are finding a fixed point (equivalently, a root of ).
Banach fixed-point theorem: If is a contraction ( with ), then the iteration converges to a unique fixed point.
import numpy as np
# Gradient descent as fixed-point iteration
# x_{k+1} = x_k - η∇f(x_k) = g(x_k)
# Fixed point: g(x*) = x* → η∇f(x*) = 0 → x* is a critical point
# RMSprop/Adam as preconditioned fixed-point iteration
# Fixed point: adapted step such that accumulated gradient statistics balance
def power_iteration(A: np.ndarray, n_iters: int = 100) -> tuple:
"""
Power iteration: fixed-point method to find the dominant eigenvector.
x_{k+1} = A x_k / ‖A x_k‖
Fixed point: eigenvector corresponding to largest eigenvalue.
"""
n = A.shape[0]
v = np.random.randn(n)
v = v / np.linalg.norm(v)
for _ in range(n_iters):
Av = A @ v
eigenvalue = v @ Av # Rayleigh quotient
v = Av / np.linalg.norm(Av)
return v, v @ A @ v # (eigenvector, eigenvalue)
# Example: find largest eigenvalue of a matrix
A = np.array([[4.0, 1.0, 0.5],
[1.0, 3.0, 0.2],
[0.5, 0.2, 2.0]])
v, lambda_max = power_iteration(A)
print(f"Largest eigenvalue: {lambda_max:.6f}")
print(f"numpy.linalg.eigvals: {np.max(np.linalg.eigvals(A)):.6f}")
Part 6 - SciPy Root-Finding Interface
from scipy.optimize import brentq, newton, bisect, root_scalar
# 1D root finding - use root_scalar for a unified interface
def find_root_demo():
f = lambda x: x**3 - 2*x - 5 # Root near x ≈ 2.0946
# Method 1: Brent's method (recommended for bracketed 1D)
result = root_scalar(f, method='brentq', bracket=[1, 3])
print(f"Brent's: {result.root:.10f} in {result.iterations} iters")
# Method 2: Newton-Raphson (requires derivative, faster if available)
df = lambda x: 3*x**2 - 2
result = root_scalar(f, method='newton', x0=2.0, fprime=df)
print(f"Newton: {result.root:.10f} in {result.iterations} iters")
# Method 3: Secant (no derivative needed)
result = root_scalar(f, method='secant', x0=1.5, x1=2.5)
print(f"Secant: {result.root:.10f} in {result.iterations} iters")
find_root_demo()
# Multi-dimensional systems
from scipy.optimize import fsolve, root
def solve_nonlinear_system():
"""
Solve the system:
f1(x, y) = x² + y² - 4 = 0 (circle of radius 2)
f2(x, y) = x - y = 0 (diagonal)
Solution: (√2, √2) and (-√2, -√2)
"""
def system(vars):
x, y = vars
return [x**2 + y**2 - 4, x - y]
def jacobian(vars):
x, y = vars
return [[2*x, 2*y],
[1, -1]]
# fsolve: classic Newton-Krylov based solver
sol = fsolve(system, x0=[1.0, 1.5], fprime=jacobian)
print(f"Solution: ({sol[0]:.6f}, {sol[1]:.6f})") # (1.41421, 1.41421)
solve_nonlinear_system()
Interview Questions
Q1: Compare bisection, Newton-Raphson, and secant methods. When would you use each?
| Method | Convergence Order | Derivative Needed | Bracketing | Recommendation |
|---|---|---|---|---|
| Bisection | 1 (linear) | No | Required | Guaranteed convergence, last resort |
| Newton-Raphson | 2 (quadratic) | Yes | No | Fastest when derivative available |
| Secant | 1.618 (superlinear) | No | No | Good balance, no derivative needed |
| Brent's | 1.618 + bisection | No | Required | Best practice for 1D problems |
Use bisection when: You need guaranteed convergence and have a sign-bracketed interval. Tolerance requirements are modest (1-6 decimal places). Example: finding a decision threshold in a classifier.
Use Newton-Raphson when: You have an analytical derivative and a good starting guess near the root. Fast convergence is critical. Example: solving the Fisher information system in second-order optimization.
Use secant when: Derivative is unavailable or expensive, but you can evaluate at two nearby points. Example: bisecting in a smooth parameter space where finite differences are cheap.
In practice: scipy.optimize.root_scalar with method='brentq' (for bracketed problems) or 'newton' (for well-specified problems with derivatives). Brent's method gets the best of bisection (guaranteed convergence) and secant (fast near root).
Q2: Why does Newton-Raphson have quadratic convergence and what can break it?
Quadratic convergence proof sketch:
Let be the root: . Newton step: .
Error at step :
where is between and (by Taylor expansion). For near , which is bounded away from zero. So:
The error squares each iteration - quadratic convergence.
What breaks it:
-
Poor initial guess: Far from root, Newton step can overshoot, oscillate, or diverge. No convergence guarantee without bracket.
-
Near-zero derivative: If (near a local extremum), the step is enormous → divergence. Protect with a maximum step size.
-
Multiple roots: At a root where (multiplicity > 1), Newton degrades to linear convergence, not quadratic.
-
Discontinuous derivative: Quadratic convergence requires to be bounded. Functions with sharp kinks may not converge quadratically.
Safeguards: Use Brent's method (combines Newton/secant with bisection bracket), or add a maximum step size, or use line search to ensure each step reduces .
Q3: What is a fixed-point iteration and how does gradient descent fit this framework?
A fixed-point iteration finds satisfying by iterating .
Any root-finding problem can be converted: define for some . Then iff .
Gradient descent as fixed-point iteration: Minimizing :
Fixed point: iff - a critical point of .
Convergence condition (Banach fixed-point theorem): The iteration converges if and only if is a contraction in a neighborhood of : for near .
For gradient descent: where is the Hessian. The spectral radius condition gives the stability condition:
This is exactly the standard gradient descent convergence condition! The learning rate upper bound is determined by the largest eigenvalue of the Hessian.
Q4: How would you use root-finding to implement a temperature search for model calibration?
Problem: A neural network classifier produces logits and you want to find temperature such that the calibrated model is well-calibrated (expected calibration error is minimized, or some calibration metric equals a target).
Common formulation: Find such that the expected confidence equals the expected accuracy:
Since confidence is monotonically decreasing in (higher temperature → more uniform distribution → lower confidence), this is a monotone equation amenable to bisection.
def temperature_scaling_calibration(
logits: np.ndarray, # (n_samples, n_classes)
labels: np.ndarray, # (n_samples,) integer labels
target_nll: float = None
) -> float:
"""
Find optimal temperature T to minimize NLL (negative log-likelihood).
Uses bisection on the derivative of NLL w.r.t. T.
NLL(T) is convex in log(T), so its derivative has exactly one root.
"""
def nll_at_temperature(log_T: float) -> float:
T = np.exp(log_T)
scaled_logits = logits / T
# Numerically stable log-softmax
log_probs = scaled_logits - np.log(np.sum(np.exp(
scaled_logits - scaled_logits.max(axis=1, keepdims=True)
), axis=1, keepdims=True)) - scaled_logits.max(axis=1, keepdims=True)
return -np.mean(log_probs[np.arange(len(labels)), labels])
def dnll_dlogT(log_T: float) -> float:
"""Derivative of NLL w.r.t. log(T) - root = optimal temperature."""
delta = 1e-5
return (nll_at_temperature(log_T + delta) - nll_at_temperature(log_T - delta)) / (2 * delta)
# Bisection to find log_T where dNLL/d(log T) = 0
optimal_log_T, _, _, _ = bisection(dnll_dlogT, a=-3.0, b=3.0)
return np.exp(optimal_log_T)
This is the core of temperature scaling - a simple post-hoc calibration method that achieves state-of-the-art calibration on many classification benchmarks by solving a single 1D root-finding problem.
Q5: What is Brent's method and why is it preferred over both bisection and Newton-Raphson?
Brent's method (also called Brent-Dekker) combines three techniques:
- Bisection: Guaranteed convergence when the other methods are not safe
- Secant method: Superlinear convergence when conditions are favorable
- Inverse quadratic interpolation: Even faster convergence with 3 points
How it works: At each step, Brent's method tries the secant/quadratic interpolation step. If that step falls within the current bracket and makes sufficient progress, it uses it. Otherwise, it falls back to bisection. The algorithm always maintains a valid bracket.
Why it's preferred:
-
Guaranteed convergence: Always converges because bisection is the fallback - unlike pure Newton/secant which can diverge.
-
Superlinear convergence: In practice converges nearly as fast as secant/Newton near the root.
-
No derivative required: Uses only function evaluations.
-
Robustly handles discontinuities and sharp features: Unlike Newton which requires smooth .
scipy implementation: scipy.optimize.brentq(f, a, b) - this is the recommended default for all 1D root-finding in Python. It is what scipy.optimize.root_scalar(method='brentq') uses.
When to use Newton over Brent: Only when you have an analytic derivative and a close initial guess. Newton's quadratic convergence can reach machine precision in 4-5 iterations while Brent needs 8-15. For truly tight tolerances (< 10⁻¹⁴) with smooth functions, Newton-Raphson's derivative-guided steps win.
Quick Reference
| Method | Convergence | Requires | Best for |
|---|---|---|---|
| Bisection | Bracket, f | Guaranteed convergence, coarse problems | |
| Newton-Raphson | f, f', good x₀ | Fast with derivative, smooth f | |
| Secant | f, two x₀ | Moderate, no derivative | |
| Brent's | + safe | Bracket, f | Best general 1D choice |
# Recommended: scipy.optimize.brentq for 1D problems
from scipy.optimize import brentq, root_scalar
# Bracketed (most reliable)
root = brentq(f, a, b, xtol=1e-12)
# With derivative (fastest for smooth f)
root = root_scalar(f, method='newton', x0=x_init, fprime=df).root
# System of equations
from scipy.optimize import fsolve
solution = fsolve(F_system, x0, fprime=J_system)
Next: Lesson 07: Sparse Matrix Methods →
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Root-Finding Algorithms demo on the EngineersOfAI Playground - no code required.
:::
