Root-Finding Algorithms - Bisection, Newton-Raphson, and ML Applications

Reading time: ~18 minutes | Level: Numerical Methods → Optimization

You want to find the learning rate where the gradient norm equals a target value. You want to find the temperature parameter where the KL divergence between two distributions equals a budget. You implement layer normalization and need to find the scale that makes the output variance equal 1.

These are all root-finding problems: find $x$ such that $f(x) = 0$ (or equivalently, $f(x) = c$ after shifting). Root-finding algorithms are the workhorses of optimization, calibration, and constraint satisfaction in production ML systems.

What You Will Learn

Bisection method: guaranteed convergence, linear rate
Newton-Raphson method: quadratic convergence, fragility
Secant method: superlinear convergence without derivatives
Convergence analysis: linear, superlinear, quadratic
Fixed-point iterations and their connection to ML optimizers
ML applications: learning rate search, quantile finding, calibration

Part 1 - The Bisection Method

The idea

Given $f(a) < 0 < f(b)$ (sign change), by the intermediate value theorem there is a root in $[a, b]$ . Bisection repeatedly halves the interval:

Compute $c = (a + b)/2$
If $f(c) = 0$ : done
If $f(a) \cdot f(c) < 0$ : root is in $[a, c]$ → set $b = c$
Else: root is in $[c, b]$ → set $a = c$
Repeat until $|b - a| < \text{tolerance}$

Convergence: Linear - error halves each step. After $n$ iterations, error $\leq (b-a)/2^n$ .

import numpy as np
from typing import Callable, Optional

def bisection(
    f: Callable,
    a: float,
    b: float,
    tol: float = 1e-10,
    maxiter: int = 100
) -> tuple:
    """
    Bisection root-finding method.
    Guaranteed to converge if f(a) and f(b) have opposite signs.

    Returns: (root, n_iterations, converged)
    """
    if f(a) * f(b) > 0:
        raise ValueError(f"f(a)={f(a):.4f} and f(b)={f(b):.4f} must have opposite signs")

    history = []

    for n in range(maxiter):
        c = (a + b) / 2.0
        fc = f(c)
        history.append({'c': c, 'f(c)': fc, 'interval_width': b - a})

        if abs(fc) < tol or (b - a) / 2 < tol:
            return c, n + 1, True, history

        if f(a) * fc < 0:
            b = c   # Root is in left half
        else:
            a = c   # Root is in right half

    return (a + b) / 2.0, maxiter, False, history

# Example: find √2 as root of f(x) = x² - 2
f = lambda x: x**2 - 2.0
root, n_iters, converged, hist = bisection(f, 1.0, 2.0, tol=1e-12)
print(f"Root: {root:.12f}")
print(f"True: {np.sqrt(2):.12f}")
print(f"Iterations: {n_iters}")
print(f"Error: {abs(root - np.sqrt(2)):.2e}")

# Show convergence rate (should halve each iteration)
for i, step in enumerate(hist[:8]):
    print(f"Iter {i+1}: c = {step['c']:.8f}, |f(c)| = {abs(step['f(c)']):.2e}")

Key property: Bisection always converges for continuous $f$ with a sign-bracketed interval. It is the method of last resort when you need guaranteed convergence and other methods fail.

Part 2 - Newton-Raphson Method

The idea

Newton-Raphson uses both $f(x)$ and $f'(x)$ to take a better step. The linear approximation of $f$ near $x_k$ :

$f(x) \approx f(x_k) + f'(x_k)(x - x_k) = 0$

Solving for $x$ :

$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}$

Convergence analysis

For a simple root $x^*$ (where $f(x^*) = 0$ and $f'(x^*) \neq 0$ ), Newton-Raphson has quadratic convergence:

$|x_{k+1} - x^*| \leq C \cdot |x_k - x^*|^2$

This means: once you are close to the root, the number of correct decimal digits doubles each iteration. Going from 4 correct digits to 8 to 16 - convergence is extremely fast.

import numpy as np

def newton_raphson(
    f: Callable,
    df: Callable,
    x0: float,
    tol: float = 1e-12,
    maxiter: int = 50
) -> tuple:
    """
    Newton-Raphson root-finding.
    Requires f and its derivative df.

    Quadratic convergence near the root.
    Can diverge if starting far from root or if f'(x) ≈ 0.
    """
    x = x0
    history = []

    for n in range(maxiter):
        fx = f(x)
        history.append({'x': x, 'f(x)': fx})

        if abs(fx) < tol:
            return x, n + 1, True, history

        dfx = df(x)
        if abs(dfx) < 1e-14:
            raise ValueError(f"Near-zero derivative at x={x:.6f} - Newton-Raphson failed")

        x = x - fx / dfx   # Newton step

    return x, maxiter, False, history

# Example: f(x) = x² - 2, f'(x) = 2x
f = lambda x: x**2 - 2.0
df = lambda x: 2.0 * x

root, n_iters, converged, hist = newton_raphson(f, df, x0=1.5, tol=1e-14)
print(f"Root: {root:.14f}")
print(f"Iterations: {n_iters}")  # Typically 4-6 for this problem

# Observe quadratic convergence
true_root = np.sqrt(2)
print("\nQuadratic convergence:")
for i, step in enumerate(hist):
    err = abs(step['x'] - true_root)
    print(f"Iter {i}: error = {err:.2e}")
# Errors: 1e-1 → 1e-2 → 1e-4 → 1e-8 → 1e-16 (digits double!)

Newton-Raphson in multiple dimensions

For systems $F(x) = 0$ where $F: \mathbb{R}^n \to \mathbb{R}^n$ , Newton's method uses the Jacobian:

$x_{k+1} = x_k - J_F(x_k)^{-1} F(x_k)$

Or equivalently (avoid explicit inverse): solve $J_F(x_k) \delta = -F(x_k)$ , then $x_{k+1} = x_k + \delta$ .

def newton_system(
    F: Callable,
    J_F: Callable,
    x0: np.ndarray,
    tol: float = 1e-10,
    maxiter: int = 50
) -> np.ndarray:
    """
    Newton's method for nonlinear systems F(x) = 0.
    J_F: Jacobian function x -> J (n x n matrix)
    """
    x = x0.copy()

    for _ in range(maxiter):
        Fx = F(x)
        if np.linalg.norm(Fx) < tol:
            break

        J = J_F(x)
        # Solve J @ delta = -F(x) - do NOT use J_inv!
        delta = np.linalg.solve(J, -Fx)
        x = x + delta

    return x

When Newton-Raphson fails

# Failure mode 1: Oscillation (cycle)
f_cycle = lambda x: x**3 - 2*x + 2
df_cycle = lambda x: 3*x**2 - 2
# Starting at x=0: Newton step goes to x=1, then back to x=0 - infinite loop!

# Failure mode 2: Divergence (wrong basin of attraction)
f_div = lambda x: np.arctan(x)
df_div = lambda x: 1.0 / (1 + x**2)
# For x0 far from origin: converges to wrong root or diverges

# Safeguard: Combine Newton with bisection (Brent's method)
from scipy.optimize import brentq, newton

# scipy.optimize.brentq: guaranteed convergence + fast near root
# Best of both worlds for 1D problems
root = brentq(lambda x: x**2 - 2, 1.0, 2.0, xtol=1e-12)
print(f"Brent's method: {root:.12f}")

Part 3 - Secant Method

The secant method approximates $f'(x)$ using two previous iterates:

$f'(x_k) \approx \frac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}}$

Substituting into the Newton step:

$x_{k+1} = x_k - f(x_k) \cdot \frac{x_k - x_{k-1}}{f(x_k) - f(x_{k-1})}$

No derivative needed. Requires two initial points $x_0, x_1$ .

Convergence: Superlinear - convergence order $\approx 1.618$ (the golden ratio). Faster than bisection, slower than Newton.

def secant_method(
    f: Callable,
    x0: float,
    x1: float,
    tol: float = 1e-12,
    maxiter: int = 50
) -> tuple:
    """
    Secant method - derivative-free, superlinear convergence.
    Requires two starting points but no analytical derivative.
    """
    history = [(x0, f(x0)), (x1, f(x1))]

    for n in range(maxiter):
        f0, f1 = f(x0), f(x1)

        if abs(f1 - f0) < 1e-14:
            break  # Near-parallel secant line - numerical issue

        # Secant step
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        history.append((x2, f(x2)))

        if abs(x2 - x1) < tol or abs(f(x2)) < tol:
            return x2, n + 1, True, history

        x0, x1 = x1, x2

    return x1, maxiter, False, history

# Example: find root of f(x) = cos(x) - x (the Dottie number ≈ 0.7391)
f = lambda x: np.cos(x) - x
root, n_iters, converged, _ = secant_method(f, 0.0, 1.0)
print(f"Root (Dottie number): {root:.10f}")  # ≈ 0.7390851332
print(f"Iterations: {n_iters}")             # ~8-10 iterations

Part 4 - Convergence Order Comparison

Convergence orders:

Method           | Order | Evals/step | When to use
─────────────────┼───────┼────────────┼─────────────────────────
Bisection        | 1.0   | 1          | Guaranteed convergence, no derivative
Newton-Raphson   | 2.0   | 1          | Fast, have derivative, near root
Secant           | 1.618 | 1          | Fast, no derivative
Brent's method   | 1.618 | 1          | Guaranteed + fast (bracket + secant)
Halley's method  | 3.0   | 1          | Have f, f', f'' - very fast

Error reduction per iteration:
Bisection:  |e_{k+1}| = 0.5 × |e_k|
Secant:     |e_{k+1}| = C × |e_k|^1.618
Newton:     |e_{k+1}| = C × |e_k|^2

After 10 iterations starting with error 0.1:
Bisection: 0.1 / 2^10 ≈ 10^-4
Secant:    (0.1)^(1.618^10) ≈ 10^-11 ← practically at machine precision
Newton:    (0.1)^(2^10) ≈ 10^-1024 ← beyond float range after 4-5 iterations

Part 5 - ML Applications

Finding the optimal learning rate

One practical application of bisection: finding the learning rate where the gradient norm crosses a threshold (learning rate finder algorithm):

import numpy as np

def learning_rate_finder(
    model,
    dataloader,
    loss_fn,
    target_loss_ratio: float = 3.0,  # Stop when loss is 3x initial loss
    min_lr: float = 1e-7,
    max_lr: float = 10.0,
    n_steps: int = 100
) -> float:
    """
    Find the maximum safe learning rate using linear sweep + exponential schedule.
    Uses root-finding concept: find LR where loss starts to diverge.
    """
    from copy import deepcopy
    import torch

    initial_state = deepcopy(model.state_dict())

    lrs = np.exp(np.linspace(np.log(min_lr), np.log(max_lr), n_steps))
    losses = []
    initial_loss = None

    for step, lr in enumerate(lrs):
        batch = next(iter(dataloader))

        # Take one step at this learning rate
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        loss = loss_fn(model(batch['x']), batch['y'])

        if initial_loss is None:
            initial_loss = loss.item()

        losses.append(loss.item())

        if loss.item() > target_loss_ratio * initial_loss or np.isnan(loss.item()):
            # Divergence detected - last stable LR is a few steps back
            optimal_lr = lrs[max(0, step - 3)]
            model.load_state_dict(initial_state)
            return optimal_lr

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    model.load_state_dict(initial_state)
    return lrs[np.argmin(losses)]  # LR at minimum loss


def find_temperature_by_bisection(
    log_probs: np.ndarray,
    target_entropy: float,
    tol: float = 1e-6
) -> float:
    """
    Find temperature T such that entropy of softmax(log_probs / T) = target_entropy.
    Used in knowledge distillation and policy entropy regularization.

    Entropy is monotonically increasing in T → root-finding is valid.
    """
    def entropy_deficit(log_T: float) -> float:
        T = np.exp(log_T)
        # Compute softmax(log_probs / T)
        scaled = log_probs / T
        probs = np.exp(scaled - np.max(scaled))
        probs /= probs.sum()
        entropy = -np.sum(probs * np.log(probs + 1e-12))
        return entropy - target_entropy

    # Bisection in log-temperature space
    log_T_root, _, _, _ = bisection(
        entropy_deficit, a=-5.0, b=5.0, tol=tol
    )
    return np.exp(log_T_root)


def find_calibration_threshold(
    probabilities: np.ndarray,
    labels: np.ndarray,
    target_fpr: float = 0.05
) -> float:
    """
    Find decision threshold where False Positive Rate = target_fpr.
    Classic root-finding application in classifier calibration.
    """
    from sklearn.metrics import roc_curve

    fpr, tpr, thresholds = roc_curve(labels, probabilities)

    # Find threshold where fpr ≈ target_fpr using bisection
    def fpr_at_threshold(t: float) -> float:
        preds = (probabilities >= t).astype(int)
        fp = np.sum((preds == 1) & (labels == 0))
        n = np.sum(labels == 0)
        return fp / n - target_fpr  # We want this = 0

    threshold, _, _, _ = bisection(fpr_at_threshold, a=0.0, b=1.0, tol=1e-6)
    return threshold

Fixed-point iterations and ML optimizers

Many ML algorithms can be viewed as fixed-point iterations:

$x_{k+1} = g(x_k)$

where we are finding a fixed point $x^* = g(x^*)$ (equivalently, a root of $f(x) = x - g(x)$ ).

Banach fixed-point theorem: If $g$ is a contraction ( $\|g(x) - g(y)\| \leq L\|x - y\|$ with $L < 1$ ), then the iteration converges to a unique fixed point.

import numpy as np

# Gradient descent as fixed-point iteration
# x_{k+1} = x_k - η∇f(x_k) = g(x_k)
# Fixed point: g(x*) = x* → η∇f(x*) = 0 → x* is a critical point

# RMSprop/Adam as preconditioned fixed-point iteration
# Fixed point: adapted step such that accumulated gradient statistics balance

def power_iteration(A: np.ndarray, n_iters: int = 100) -> tuple:
    """
    Power iteration: fixed-point method to find the dominant eigenvector.
    x_{k+1} = A x_k / ‖A x_k‖
    Fixed point: eigenvector corresponding to largest eigenvalue.
    """
    n = A.shape[0]
    v = np.random.randn(n)
    v = v / np.linalg.norm(v)

    for _ in range(n_iters):
        Av = A @ v
        eigenvalue = v @ Av    # Rayleigh quotient
        v = Av / np.linalg.norm(Av)

    return v, v @ A @ v   # (eigenvector, eigenvalue)

# Example: find largest eigenvalue of a matrix
A = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.2],
              [0.5, 0.2, 2.0]])
v, lambda_max = power_iteration(A)
print(f"Largest eigenvalue: {lambda_max:.6f}")
print(f"numpy.linalg.eigvals: {np.max(np.linalg.eigvals(A)):.6f}")

Part 6 - SciPy Root-Finding Interface

from scipy.optimize import brentq, newton, bisect, root_scalar

# 1D root finding - use root_scalar for a unified interface
def find_root_demo():
    f = lambda x: x**3 - 2*x - 5   # Root near x ≈ 2.0946

    # Method 1: Brent's method (recommended for bracketed 1D)
    result = root_scalar(f, method='brentq', bracket=[1, 3])
    print(f"Brent's: {result.root:.10f} in {result.iterations} iters")

    # Method 2: Newton-Raphson (requires derivative, faster if available)
    df = lambda x: 3*x**2 - 2
    result = root_scalar(f, method='newton', x0=2.0, fprime=df)
    print(f"Newton: {result.root:.10f} in {result.iterations} iters")

    # Method 3: Secant (no derivative needed)
    result = root_scalar(f, method='secant', x0=1.5, x1=2.5)
    print(f"Secant: {result.root:.10f} in {result.iterations} iters")

find_root_demo()

# Multi-dimensional systems
from scipy.optimize import fsolve, root

def solve_nonlinear_system():
    """
    Solve the system:
    f1(x, y) = x² + y² - 4 = 0    (circle of radius 2)
    f2(x, y) = x - y = 0           (diagonal)
    Solution: (√2, √2) and (-√2, -√2)
    """
    def system(vars):
        x, y = vars
        return [x**2 + y**2 - 4, x - y]

    def jacobian(vars):
        x, y = vars
        return [[2*x, 2*y],
                [1,   -1]]

    # fsolve: classic Newton-Krylov based solver
    sol = fsolve(system, x0=[1.0, 1.5], fprime=jacobian)
    print(f"Solution: ({sol[0]:.6f}, {sol[1]:.6f})")  # (1.41421, 1.41421)

solve_nonlinear_system()

Interview Questions

Q1: Compare bisection, Newton-Raphson, and secant methods. When would you use each?

Method	Convergence Order	Derivative Needed	Bracketing	Recommendation
Bisection	1 (linear)	No	Required	Guaranteed convergence, last resort
Newton-Raphson	2 (quadratic)	Yes	No	Fastest when derivative available
Secant	1.618 (superlinear)	No	No	Good balance, no derivative needed
Brent's	1.618 + bisection	No	Required	Best practice for 1D problems

Use bisection when: You need guaranteed convergence and have a sign-bracketed interval. Tolerance requirements are modest (1-6 decimal places). Example: finding a decision threshold in a classifier.

Use Newton-Raphson when: You have an analytical derivative and a good starting guess near the root. Fast convergence is critical. Example: solving the Fisher information system in second-order optimization.

Use secant when: Derivative is unavailable or expensive, but you can evaluate $f$ at two nearby points. Example: bisecting in a smooth parameter space where finite differences are cheap.

In practice: scipy.optimize.root_scalar with method='brentq' (for bracketed problems) or 'newton' (for well-specified problems with derivatives). Brent's method gets the best of bisection (guaranteed convergence) and secant (fast near root).

Q2: Why does Newton-Raphson have quadratic convergence and what can break it?

Quadratic convergence proof sketch:

Let $x^*$ be the root: $f(x^*) = 0$ . Newton step: $x_{k+1} = x_k - f(x_k)/f'(x_k)$ .

Error at step $k+1$ : $e_{k+1} = x_{k+1} - x^* = \frac{f''(\xi)}{2 f'(x_k)} e_k^2$

where $\xi$ is between $x_k$ and $x^*$ (by Taylor expansion). For $x_k$ near $x^*$ , $f'(x_k) \approx f'(x^*)$ which is bounded away from zero. So:

$|e_{k+1}| \leq C |e_k|^2, \quad C = \frac{|f''(x^*)|}{2|f'(x^*)|}$

The error squares each iteration - quadratic convergence.

What breaks it:

Poor initial guess: Far from root, Newton step can overshoot, oscillate, or diverge. No convergence guarantee without bracket.
Near-zero derivative: If $f'(x_k) \approx 0$ (near a local extremum), the step $f/f'$ is enormous → divergence. Protect with a maximum step size.
Multiple roots: At a root where $f(x^*) = f'(x^*) = 0$ (multiplicity > 1), Newton degrades to linear convergence, not quadratic.
Discontinuous derivative: Quadratic convergence requires $f''$ to be bounded. Functions with sharp kinks may not converge quadratically.

Safeguards: Use Brent's method (combines Newton/secant with bisection bracket), or add a maximum step size, or use line search to ensure each step reduces $|f(x)|$ .

Q3: What is a fixed-point iteration and how does gradient descent fit this framework?

A fixed-point iteration finds $x^*$ satisfying $x^* = g(x^*)$ by iterating $x_{k+1} = g(x_k)$ .

Any root-finding problem $f(x) = 0$ can be converted: define $g(x) = x - \alpha f(x)$ for some $\alpha \neq 0$ . Then $x^* = g(x^*)$ iff $f(x^*) = 0$ .

Gradient descent as fixed-point iteration: Minimizing $L(\theta)$ : $\theta_{k+1} = \theta_k - \eta \nabla L(\theta_k) = g(\theta_k)$

Fixed point: $g(\theta^*) = \theta^*$ iff $\nabla L(\theta^*) = 0$ - a critical point of $L$ .

Convergence condition (Banach fixed-point theorem): The iteration converges if and only if $g$ is a contraction in a neighborhood of $x^*$ : $\|g'(x)\|_{op} < 1$ for $x$ near $x^*$ .

For gradient descent: $g'(\theta) = I - \eta H_L(\theta)$ where $H_L$ is the Hessian. The spectral radius condition $\rho(I - \eta H) < 1$ gives the stability condition:

$0 < \eta < \frac{2}{\lambda_{\max}(H)}$

This is exactly the standard gradient descent convergence condition! The learning rate upper bound is determined by the largest eigenvalue of the Hessian.

Q4: How would you use root-finding to implement a temperature search for model calibration?

Problem: A neural network classifier produces logits $z$ and you want to find temperature $T$ such that the calibrated model is well-calibrated (expected calibration error is minimized, or some calibration metric equals a target).

Common formulation: Find $T$ such that the expected confidence equals the expected accuracy: $E\left[\max_k \text{softmax}(z/T)_k\right] = \text{Accuracy}$

Since confidence is monotonically decreasing in $T$ (higher temperature → more uniform distribution → lower confidence), this is a monotone equation amenable to bisection.

def temperature_scaling_calibration(
    logits: np.ndarray,     # (n_samples, n_classes)
    labels: np.ndarray,     # (n_samples,) integer labels
    target_nll: float = None
) -> float:
    """
    Find optimal temperature T to minimize NLL (negative log-likelihood).
    Uses bisection on the derivative of NLL w.r.t. T.

    NLL(T) is convex in log(T), so its derivative has exactly one root.
    """
    def nll_at_temperature(log_T: float) -> float:
        T = np.exp(log_T)
        scaled_logits = logits / T
        # Numerically stable log-softmax
        log_probs = scaled_logits - np.log(np.sum(np.exp(
            scaled_logits - scaled_logits.max(axis=1, keepdims=True)
        ), axis=1, keepdims=True)) - scaled_logits.max(axis=1, keepdims=True)
        return -np.mean(log_probs[np.arange(len(labels)), labels])

    def dnll_dlogT(log_T: float) -> float:
        """Derivative of NLL w.r.t. log(T) - root = optimal temperature."""
        delta = 1e-5
        return (nll_at_temperature(log_T + delta) - nll_at_temperature(log_T - delta)) / (2 * delta)

    # Bisection to find log_T where dNLL/d(log T) = 0
    optimal_log_T, _, _, _ = bisection(dnll_dlogT, a=-3.0, b=3.0)
    return np.exp(optimal_log_T)

This is the core of temperature scaling - a simple post-hoc calibration method that achieves state-of-the-art calibration on many classification benchmarks by solving a single 1D root-finding problem.

Q5: What is Brent's method and why is it preferred over both bisection and Newton-Raphson?

Brent's method (also called Brent-Dekker) combines three techniques:

Bisection: Guaranteed convergence when the other methods are not safe
Secant method: Superlinear convergence when conditions are favorable
Inverse quadratic interpolation: Even faster convergence with 3 points

How it works: At each step, Brent's method tries the secant/quadratic interpolation step. If that step falls within the current bracket and makes sufficient progress, it uses it. Otherwise, it falls back to bisection. The algorithm always maintains a valid bracket.

Why it's preferred:

Guaranteed convergence: Always converges because bisection is the fallback - unlike pure Newton/secant which can diverge.
Superlinear convergence: In practice converges nearly as fast as secant/Newton near the root.
No derivative required: Uses only function evaluations.
Robustly handles discontinuities and sharp features: Unlike Newton which requires smooth $f$ .

scipy implementation: scipy.optimize.brentq(f, a, b) - this is the recommended default for all 1D root-finding in Python. It is what scipy.optimize.root_scalar(method='brentq') uses.

When to use Newton over Brent: Only when you have an analytic derivative and a close initial guess. Newton's quadratic convergence can reach machine precision in 4-5 iterations while Brent needs 8-15. For truly tight tolerances (< 10⁻¹⁴) with smooth functions, Newton-Raphson's derivative-guided steps win.

Quick Reference

Method	Convergence	Requires	Best for
Bisection	$O(1/2^n)$	Bracket, f	Guaranteed convergence, coarse problems
Newton-Raphson	$O(\\|e\\|^2)$	f, f', good x₀	Fast with derivative, smooth f
Secant	$O(\\|e\\|^{1.618})$	f, two x₀	Moderate, no derivative
Brent's	$O(\\|e\\|^{1.618})$ + safe	Bracket, f	Best general 1D choice

# Recommended: scipy.optimize.brentq for 1D problems
from scipy.optimize import brentq, root_scalar

# Bracketed (most reliable)
root = brentq(f, a, b, xtol=1e-12)

# With derivative (fastest for smooth f)
root = root_scalar(f, method='newton', x0=x_init, fprime=df).root

# System of equations
from scipy.optimize import fsolve
solution = fsolve(F_system, x0, fprime=J_system)

Next: Lesson 07: Sparse Matrix Methods →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Root-Finding Algorithms demo on the EngineersOfAI Playground - no code required.

:::

What You Will Learn​

Part 1 - The Bisection Method​

The idea​

Part 2 - Newton-Raphson Method​

The idea​

Convergence analysis​

Newton-Raphson in multiple dimensions​

When Newton-Raphson fails​

Part 3 - Secant Method​

Part 4 - Convergence Order Comparison​

Part 5 - ML Applications​

Finding the optimal learning rate​

Fixed-point iterations and ML optimizers​

Part 6 - SciPy Root-Finding Interface​

Interview Questions​

Quick Reference​

What You Will Learn

Part 1 - The Bisection Method

The idea

Part 2 - Newton-Raphson Method

The idea

Convergence analysis

Newton-Raphson in multiple dimensions

When Newton-Raphson fails

Part 3 - Secant Method

Part 4 - Convergence Order Comparison

Part 5 - ML Applications

Finding the optimal learning rate

Fixed-point iterations and ML optimizers

Part 6 - SciPy Root-Finding Interface

Interview Questions

Quick Reference