Taylor Series and Approximations - The Mathematics Behind Gradient Descent and Newton's Method

Reading time: ~20 minutes | Level: Mathematical Foundations → ML Engineering

Gradient descent takes one step toward the minimum. Newton's method often reaches a minimum in far fewer steps. The difference is not magic - it is how many terms of the Taylor expansion each method uses.

Gradient descent uses the first-order Taylor expansion. Newton's method uses the second-order expansion. Understanding why these approximations lead to these algorithms, and when they are computationally practical, is the foundation for reasoning about all of ML optimization.

What You Will Learn

Taylor series: polynomial approximation of any smooth function
First-order Taylor approximation → gradient descent
Second-order Taylor approximation → Newton's method
The Hessian matrix: the second-derivative object for multi-variable functions
Why second-order methods converge faster but are often impractical for large networks
Quasi-Newton methods: approximating the Hessian efficiently
Practical implications for ML optimizer design

Prerequisites

Lesson 01: Derivatives and Gradients (required)
Lesson 03: Gradient Descent Mechanics (recommended)
Lesson 04: Convex Functions and Optimization (helpful)

Part 1 - Taylor Series

The core idea

Any smooth function f can be approximated near a point x₀ as a polynomial:

$f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_0)}{2!}(x-x_0)^2 + \frac{f'''(x_0)}{3!}(x-x_0)^3 + \cdots$

More compactly:

$f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(x_0)}{k!} (x-x_0)^k$

This is the Taylor series of f around x₀. It is exact (when the series converges) - not an approximation. But we often truncate after a few terms to get an approximation.

Why Taylor series matters for optimization

Optimization algorithms work by locally approximating the objective function and then minimizing the approximation. The order of the Taylor expansion used determines the algorithm:

Order	Approximation	Algorithm
0th order	f(x₀) - constant	Random search
1st order	f(x₀) + f'(x₀)Δx	Gradient descent
2nd order	f(x₀) + f'(x₀)Δx + (1/2)f''(x₀)Δx²	Newton's method

Visualizing Taylor approximations

import numpy as np

def taylor_approximations(f, df, d2f, x0: float, n_points: int = 200):
    """
    Compute 1st and 2nd order Taylor approximations of f around x0.
    """
    x = np.linspace(x0 - 2, x0 + 2, n_points)

    f_true = np.array([f(xi) for xi in x])
    f_1st = f(x0) + df(x0) * (x - x0)                        # 1st order
    f_2nd = f(x0) + df(x0) * (x - x0) + 0.5 * d2f(x0) * (x - x0)**2  # 2nd order

    return x, f_true, f_1st, f_2nd

# Example: f(x) = sin(x) around x₀ = 0
f = np.sin
df = np.cos
d2f = lambda x: -np.sin(x)

x, f_true, f_1st, f_2nd = taylor_approximations(f, df, d2f, x0=0.0)

print("Taylor approximations of sin(x) at x=0:")
print(f"{'x':>6} | {'sin(x)':>8} | {'1st order':>10} | {'2nd order':>10}")
print("-" * 45)
for xi in [-1.5, -0.5, 0.0, 0.5, 1.0, 1.5]:
    true_val = np.sin(xi)
    first_val = xi          # sin(0) + cos(0)*(x-0) = x
    second_val = xi - xi**2/2  # + (-sin(0)/2)*(x-0)^2 = x (since sin(0)=0)
    # Note: for sin at 0, second_val should be x (0 contribution from sin(0))
    second_val = np.sin(0) + np.cos(0)*xi + (-np.sin(0)/2)*xi**2
    print(f"{xi:>6.1f} | {true_val:>8.4f} | {first_val:>10.4f} | {second_val:>10.4f}")

# Approximation error
h_values = [1.0, 0.5, 0.1, 0.01]
print("\nApproximation error vs step size:")
print(f"{'h':>6} | {'1st order error':>16} | {'2nd order error':>16}")
for h in h_values:
    f_true_h = np.sin(h)
    f_1st_h = h
    f_2nd_h = h - h**3/6  # actually this is 3rd order for sin... show the pattern
    err_1st = abs(f_true_h - f_1st_h)
    err_2nd = abs(f_true_h - (np.sin(0) + np.cos(0)*h))  # purely first order
    err_full_2nd = abs(f_true_h - (h))  # first order for sin at 0
    print(f"{h:>6.2f} | {err_1st:>16.6f} | {err_full_2nd:>16.6f}")

Part 2 - First-Order Taylor: Gradient Descent

Derivation

At current parameters θ₀, approximate L(θ) to first order:

$L(\theta_0 + \Delta\theta) \approx L(\theta_0) + \nabla L(\theta_0)^T \Delta\theta$

This is a linear function of Δθ. It has no finite minimum - the linear approximation decreases without bound in the direction -∇L(θ₀).

So we impose a constraint: ‖Δθ‖ ≤ α (take a step of size at most α).

Minimizing the linear approximation subject to ‖Δθ‖ ≤ α gives:

$\Delta\theta^* = -\alpha \frac{\nabla L(\theta_0)}{\|\nabla L(\theta_0)\|}$

Step in the direction opposite to the gradient, with step size α.

This is gradient descent, derived from the 1st-order Taylor expansion + step size constraint.

Why the first-order approximation is good enough (usually)

The 1st-order approximation is accurate when ‖Δθ‖ is small. Gradient descent uses a small learning rate precisely to keep each step within the region where the linear approximation holds.

The approximation error is O(‖Δθ‖²):

$|L(\theta_0 + \Delta\theta) - [L(\theta_0) + \nabla L(\theta_0)^T \Delta\theta]| \leq \frac{L_{\text{const}}}{2} \|\Delta\theta\|^2$

where L_const is the Lipschitz constant of the gradient (related to the maximum curvature).

Practical consequence: The learning rate α controls step size, which controls the approximation quality. If α is too large, we step outside the region where the linear approximation is valid - divergence.

import numpy as np

# Visualize: where the linear approximation breaks down
def f(x): return x**4 - 3*x**3 + 2  # some smooth function
def grad_f(x): return 4*x**3 - 9*x**2

x0 = 2.0
steps = [0.01, 0.1, 0.3, 0.5, 1.0]

print(f"Linear approximation quality at x₀={x0}:")
print(f"f(x₀) = {f(x0):.4f}, f'(x₀) = {grad_f(x0):.4f}")
print()
print(f"{'Δx':>8} | {'f(x₀+Δx)':>12} | {'linear approx':>14} | {'error':>10} | {'% error':>9}")
print("-" * 60)
for delta_x in steps:
    true_val = f(x0 + delta_x)
    linear_approx = f(x0) + grad_f(x0) * delta_x
    error = abs(true_val - linear_approx)
    pct_error = 100 * error / (abs(true_val) + 1e-8)
    print(f"{delta_x:>8.2f} | {true_val:>12.4f} | {linear_approx:>14.4f} | {error:>10.4f} | {pct_error:>8.1f}%")

Part 3 - Second-Order Taylor: Newton's Method

The Hessian matrix

For a multi-variable function f: ℝⁿ → ℝ, the second-order Taylor expansion requires the Hessian matrix H = ∇²f - the matrix of all second-order partial derivatives:

$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$

The Hessian captures curvature - how the gradient changes in each direction.

Second-order Taylor expansion

$L(\theta_0 + \Delta\theta) \approx L(\theta_0) + \nabla L(\theta_0)^T \Delta\theta + \frac{1}{2} \Delta\theta^T H \Delta\theta$

This is a quadratic function of Δθ. It has a finite minimum (when H is positive definite).

Setting the gradient of the quadratic approximation to zero:

$\nabla L(\theta_0) + H \Delta\theta = 0 \implies \Delta\theta = -H^{-1} \nabla L(\theta_0)$

This gives the Newton step: θ₁ = θ₀ - H⁻¹∇L(θ₀)

Newton's method vs gradient descent

Property	Gradient Descent	Newton's Method
Taylor order	1st	2nd
Update	θ - α∇L	θ - H⁻¹∇L
Curvature awareness	None (fixed step size)	Full (adapts to curvature)
Steps to converge	Many (linear rate)	Few (quadratic rate near minimum)
Cost per step	O(d)	O(d³) (matrix inversion) or O(d²) (solve linear system)
Memory	O(d)	O(d²) (store Hessian)
For d = 1M params	Feasible	Infeasible (10¹² elements in H)

Convergence rates

Gradient descent: linear convergence - error decreases by constant factor each step
Newton's method: quadratic convergence near optimum - if error is ε, next error is O(ε²)

Gradient descent:  ε → 0.9ε → 0.81ε → 0.73ε → ...  (slow, linear)
Newton's method:   ε → 0.1ε → 0.001ε → 10⁻⁶ε → ... (fast, quadratic)

Newton's method in 1D

For a single parameter: θ₁ = θ₀ - f'(θ₀) / f''(θ₀)

The step size adapts to the curvature: large curvature (large f'') → small step; small curvature → large step.

import numpy as np

def newtons_method_1d(
    f, df, d2f, x0: float,
    n_steps: int = 20, tol: float = 1e-10
) -> list:
    """Newton's method for 1D optimization."""
    x = x0
    history = [{'x': x, 'f': f(x), 'grad': df(x)}]

    for step in range(n_steps):
        grad = df(x)
        hess = d2f(x)

        if abs(hess) < 1e-12:
            print(f"Near-zero Hessian at step {step}, stopping")
            break

        # Newton step: Δx = -f'(x) / f''(x)
        delta_x = -grad / hess
        x = x + delta_x
        history.append({'x': x, 'f': f(x), 'grad': df(x)})

        if abs(grad) < tol:
            print(f"Converged at step {step+1}")
            break

    return history

# Compare gradient descent vs Newton on f(x) = x^4 - 2x^2 + 0.5 (has two minima)
f = lambda x: x**4 - 2*x**2 + 0.5
df = lambda x: 4*x**3 - 4*x
d2f = lambda x: 12*x**2 - 4

# From x0 = 2.0: should converge to minimum near x=1
print("Newton's method convergence (starting from x=2.0):")
hist_newton = newtons_method_1d(f, df, d2f, x0=2.0, n_steps=20)
for i, h in enumerate(hist_newton):
    print(f"  Step {i}: x={h['x']:.8f}, |f'(x)|={abs(h['grad']):.2e}")

# Compare with gradient descent
print("\nGradient descent convergence (lr=0.01, same problem):")
x = 2.0
for step in range(50):
    g = df(x)
    x -= 0.01 * g
    if step < 5 or step % 10 == 0:
        print(f"  Step {step}: x={x:.8f}, |f'(x)|={abs(g):.2e}")

Newton's method in multiple dimensions

import numpy as np

def newtons_method_nd(
    grad_fn,
    hessian_fn,
    theta0: np.ndarray,
    n_steps: int = 20,
    tol: float = 1e-8
) -> list:
    """Newton's method for multi-dimensional optimization."""
    theta = theta0.copy()
    history = [theta.copy()]

    for step in range(n_steps):
        grad = grad_fn(theta)
        H = hessian_fn(theta)

        # Newton step: Δθ = -H⁻¹ ∇L
        # Don't compute H⁻¹ explicitly; solve linear system H Δθ = -∇L
        try:
            delta_theta = np.linalg.solve(H, -grad)  # O(d³) but numerically stable
        except np.linalg.LinAlgError:
            print(f"Singular Hessian at step {step}, adding regularization")
            H_reg = H + 1e-6 * np.eye(len(theta))
            delta_theta = np.linalg.solve(H_reg, -grad)

        theta = theta + delta_theta
        history.append(theta.copy())

        grad_norm = np.linalg.norm(grad)
        if grad_norm < tol:
            print(f"Converged at step {step+1}, ‖∇L‖ = {grad_norm:.2e}")
            break

    return history

# Example: minimize f(θ) = θ₁² + 10θ₂² (condition number 10)
grad_fn = lambda t: np.array([2*t[0], 20*t[1]])
hess_fn = lambda t: np.array([[2, 0], [0, 20]])  # constant Hessian (quadratic)

theta0 = np.array([5.0, 5.0])
hist = newtons_method_nd(grad_fn, hess_fn, theta0)

print(f"\nStarting point: {theta0}")
print(f"Newton converged in {len(hist)-1} steps to: {hist[-1].round(8)}")
# For quadratic functions, Newton's method converges in exactly 1 step
# because the quadratic model is exact

Part 4 - Why Newton's Method Is Impractical for Large Networks

For a neural network with d = 100 million parameters:

Hessian size: d × d = 10¹⁶ float32 values ≈ 10⁷ GB
Hessian computation: O(d²) forward/backward passes
Hessian inversion: O(d³) operations

None of these are remotely feasible. Gradient descent requires O(d) memory and compute per step.

This is why virtually all large-scale deep learning uses first-order methods (gradient descent variants).

When Newton is practical

Newton's method is practical for:

Small models: d < 10,000 parameters
Classic ML: logistic regression, kernel SVMs, linear models
Fine-tuning with frozen layers: few free parameters
Trust region methods: approximate Newton update within a trust region

Part 5 - Quasi-Newton Methods: Approximating the Hessian

Quasi-Newton methods approximate H⁻¹ from gradient differences, without computing H explicitly.

The BFGS idea

If we observe two gradient values g_old = ∇L(θ_old) and g_new = ∇L(θ_new), we know:

The Hessian satisfies (approximately): H · (θ_new - θ_old) = g_new - g_old

This is the secant condition: the Hessian must map the parameter difference to the gradient difference.

BFGS (Broyden-Fletcher-Goldfarb-Shanno) uses this condition to iteratively update an approximation to H⁻¹, using rank-2 updates:

$H_{k+1}^{-1} = \left(I - \rho_k s_k y_k^T\right) H_k^{-1} \left(I - \rho_k y_k s_k^T\right) + \rho_k s_k s_k^T$

where:

s_k = θ_{k+1} - θ_k (parameter difference)
y_k = ∇L_{k+1} - ∇L_k (gradient difference)
ρ_k = 1/(y_k^T s_k)

L-BFGS: Limited-memory BFGS

BFGS requires storing H⁻¹ ∈ ℝ^{d×d}. For d = 100K parameters, this is 10¹⁰ float32 values - still too large.

L-BFGS stores only the last m (typically 10-20) (s_k, y_k) vector pairs and reconstructs the H⁻¹ product on the fly in O(md) time and O(md) memory.

This makes L-BFGS practical for medium-scale problems (d up to ~10 million).

from scipy.optimize import minimize
import numpy as np

def compare_optimizers():
    """
    Compare gradient descent vs L-BFGS on a logistic regression problem.
    """
    np.random.seed(42)
    n, d = 500, 20

    X = np.random.randn(n, d)
    true_w = np.random.randn(d)
    y = (1 / (1 + np.exp(-X @ true_w)) > 0.5).astype(float)

    def sigmoid(z): return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

    def logistic_loss(w):
        z = X @ w
        s = sigmoid(z)
        return -np.mean(y * np.log(s + 1e-10) + (1-y) * np.log(1-s + 1e-10))

    def logistic_grad(w):
        z = X @ w
        s = sigmoid(z)
        return (1/n) * X.T @ (s - y)

    w0 = np.zeros(d)

    # Method 1: gradient descent
    w = w0.copy()
    gd_losses = []
    for _ in range(500):
        grad = logistic_grad(w)
        w -= 0.5 * grad
        gd_losses.append(logistic_loss(w))

    # Method 2: L-BFGS
    result = minimize(logistic_loss, w0, jac=logistic_grad, method='L-BFGS-B')
    lbfgs_loss = result.fun
    lbfgs_steps = result.nit

    print(f"Gradient descent (500 steps): final loss = {gd_losses[-1]:.6f}")
    print(f"L-BFGS ({lbfgs_steps} steps): final loss = {lbfgs_loss:.6f}")
    print(f"GD needed ~{len([l for l in gd_losses if l < lbfgs_loss + 0.001])} steps to match L-BFGS quality")

compare_optimizers()

Part 6 - Taylor Series Applications in ML

Adam as a diagonal approximation to Newton's method

Adam maintains the exponential moving average of squared gradients (v_t). The update is:

$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{v_t} + \epsilon} m_t$

This is equivalent to using a diagonal approximation to the Hessian: H ≈ diag(v_t). Each parameter's learning rate is scaled by (an estimate of) its curvature.

This is the key insight: Adam is an approximation to Newton's method using a diagonal Hessian estimated from squared gradient history.

Learning rate and the Taylor approximation

The learning rate α must be chosen such that:

The step ‖Δθ‖ = α‖∇L‖ is within the region where the linear approximation holds
The Lipschitz condition says this region has radius ≈ 1/L_const

The maximum safe learning rate is α* = 1/L_const, which equals 1/(maximum eigenvalue of Hessian). This is the theoretical foundation for why warm restarts with decreasing learning rates work.

Taylor series and trust regions

Trust region methods explicitly enforce that we only trust the Taylor approximation within a ball:

$\min_{\Delta\theta} L(\theta) + \nabla L(\theta)^T \Delta\theta + \frac{1}{2} \Delta\theta^T H \Delta\theta$ $\text{subject to:} \quad \|\Delta\theta\| \leq r$

The trust region radius r grows when the approximation is accurate (actual decrease matches predicted decrease) and shrinks when the approximation is inaccurate.

import numpy as np

def trust_region_step_1d(
    f, df, d2f,
    x0: float,
    trust_radius: float = 1.0,
    max_iter: int = 100
) -> list:
    """
    1D trust region method.
    Adjusts step size based on how well the quadratic approximation predicts actual decrease.
    """
    x = x0
    history = [{'x': x, 'f': f(x)}]
    r = trust_radius

    for step in range(max_iter):
        grad = df(x)
        hess = d2f(x)

        # Newton step
        if abs(hess) < 1e-12:
            break
        newton_step = -grad / hess

        # Clip to trust region
        actual_step = np.sign(newton_step) * min(abs(newton_step), r)
        x_new = x + actual_step

        # Predicted decrease (from quadratic model)
        predicted_decrease = -(grad * actual_step + 0.5 * hess * actual_step**2)
        # Actual decrease
        actual_decrease = f(x) - f(x_new)

        # Ratio: actual / predicted
        rho = actual_decrease / (predicted_decrease + 1e-10)

        if rho > 0.75 and abs(actual_step) >= 0.9 * r:
            r *= 2.0  # expand trust region (model was accurate)
        elif rho < 0.25:
            r *= 0.5  # shrink trust region (model was inaccurate)

        if rho > 0:
            x = x_new  # accept step
            history.append({'x': x, 'f': f(x)})

        if abs(grad) < 1e-8:
            break

    return history

# Test on a function with varying curvature
f = lambda x: x**4 - 3*x**2 + x
df = lambda x: 4*x**3 - 6*x + 1
d2f = lambda x: 12*x**2 - 6

hist = trust_region_step_1d(f, df, d2f, x0=2.0, trust_radius=0.5, max_iter=50)
print(f"Trust region method: converged in {len(hist)} steps")
print(f"Final x = {hist[-1]['x']:.6f}, f(x) = {hist[-1]['f']:.6f}")

Part 7 - Taylor Series and Numerical Stability

Why f''(x) ≠ (f(x+h) - 2f(x) + f(x-h)) / h² exactly

The finite-difference approximation of the second derivative:

$f''(x) \approx \frac{f(x+h) - 2f(x) + f(x-h)}{h^2}$

has truncation error O(h²) but also floating-point cancellation error O(ε_machine/h²). The optimal h balances these:

$h^* \approx \left(\frac{\epsilon_{\text{machine}}}{f''(x)}\right)^{1/4} \approx 10^{-4} \text{ for } f''(x) \approx 1$

import numpy as np

# Optimal step size for second derivative approximation
f = np.exp  # exp(x), all derivatives = exp(x)
x = 1.0
true_d2f = np.exp(x)  # = e ≈ 2.718

print("Second derivative approximation error vs step size h:")
print(f"{'h':>10} | {'approx':>12} | {'error':>12}")
print("-" * 38)
for h in [1.0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8]:
    approx = (f(x+h) - 2*f(x) + f(x-h)) / h**2
    error = abs(approx - true_d2f)
    print(f"{h:>10.0e} | {approx:>12.6f} | {error:>12.2e}")

Part 8 - Common Mistakes and Engineering Red Flags

:::danger Newton's method with indefinite Hessian At a saddle point, the Hessian has negative eigenvalues. H⁻¹∇L may point toward a saddle point rather than a minimum. Newton's method requires a positive definite Hessian.

Fix: Add damping - replace H with H + λI (diagonal Levenberg-Marquardt regularization):

H_damped = H + lambda_reg * np.eye(d)
delta_theta = np.linalg.solve(H_damped, -grad)
# Choose lambda_reg such that H_damped is PD

:::

:::warning Learning rate too large relative to curvature The theoretically safe learning rate is α ≤ 1/L_const = 1/(max eigenvalue of Hessian). If you use a larger learning rate, you step outside the region where the linear approximation holds.

This is why learning rate schedules that decay the learning rate over time work: as you approach the minimum, the landscape becomes more curved (larger Hessian eigenvalues), so you need a smaller learning rate. :::

:::tip Gauss-Newton as a practical second-order method for least squares For least-squares problems L(θ) = (1/2)‖r(θ)‖², the Hessian is approximated by the Gauss-Newton matrix G = JᵀJ (where J is the Jacobian of residuals r). This avoids computing the full Hessian and is always positive semidefinite, making it practical for residual-based networks. :::

Interview Questions

Q1: What is the Taylor series and why does it matter for ML optimization?

The Taylor series approximates a smooth function f near a point x₀ as a polynomial:

$f(x_0 + \Delta x) = f(x_0) + f'(x_0)\Delta x + \frac{f''(x_0)}{2}\Delta x^2 + O(\Delta x^3)$

Why it matters for ML:

All optimization algorithms can be derived from truncating the Taylor series at different orders:

0th order: Evaluate f at random points - no derivative information used
1st order: f(x₀) + ∇L·Δθ - minimizing this gives gradient descent
2nd order: f(x₀) + ∇L·Δθ + (1/2)ΔθᵀHΔθ - minimizing this gives Newton's method

The learning rate α controls how far we trust the linear approximation. Large α → step outside valid range → potentially diverge. Small α → stay within valid range but converge slowly.

Adam's adaptive learning rate implicitly uses a diagonal approximation to the Hessian (from squared gradient history) - a partial second-order correction that is computationally cheap.

Q2: Derive Newton's method from the second-order Taylor expansion.

Step 1: Write the 2nd-order Taylor expansion around current θ: $L(\theta + \Delta\theta) \approx L(\theta) + \nabla L(\theta)^T \Delta\theta + \frac{1}{2} \Delta\theta^T H \Delta\theta$

Step 2: This is a quadratic function in Δθ. Minimize by setting the gradient to zero: $\frac{\partial}{\partial \Delta\theta}\left[\nabla L(\theta)^T \Delta\theta + \frac{1}{2} \Delta\theta^T H \Delta\theta\right] = \nabla L(\theta) + H \Delta\theta = 0$

Step 3: Solve for the optimal step: $\Delta\theta^* = -H^{-1} \nabla L(\theta)$

Step 4: Update: θ ← θ - H⁻¹∇L(θ)

Convergence: Near the optimum, Newton's method has quadratic convergence - if the current error is ε, the next error is O(ε²). For gradient descent, convergence is linear - error decreases by a constant factor each step.

Why we don't always use it: For d = 100M parameters, H is a 10M × 10M matrix (10¹⁴ float32 values). Infeasible to store or invert. Gradient descent requires O(d) memory, Newton requires O(d²).

Q3: What is L-BFGS and when would you use it over Adam?

L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) is a quasi-Newton method that:

Approximates H⁻¹ using the last m (typically 10-20) gradient/parameter differences
Requires O(md) memory instead of O(d²) for the full Hessian
Uses a full line search to find the optimal step size in each iteration

When to use L-BFGS over Adam:

Small to medium models (d < 10M): L-BFGS converges much faster than Adam in terms of function evaluations
Second-order convergence: For strongly convex problems, L-BFGS achieves superlinear convergence near the optimum - Adam only achieves linear
Classical ML: Logistic regression, kernel methods, GLMs - L-BFGS is standard
Full-batch optimization: L-BFGS requires consistent (not mini-batch) gradients for its Hessian approximation to be accurate

When to use Adam over L-BFGS:

Large models (d > 10M): L-BFGS becomes expensive per step
Mini-batch training: L-BFGS's Hessian approximation breaks down with noisy mini-batch gradients; Adam handles gradient noise well
Non-convex landscapes: Adam with momentum escapes saddle points naturally; L-BFGS can get stuck

In practice: Adam for almost all deep learning; L-BFGS for classical ML and fine-tuning small models.

Q4: Why does the learning rate in gradient descent need to be at most 1/L (where L is the Lipschitz constant of the gradient)?

Lipschitz gradient condition: ‖∇f(x) - ∇f(y)‖ ≤ L‖x - y‖ for all x, y. This means the gradient cannot change faster than L times the step size.

Connection to Taylor approximation: The error in the 1st-order Taylor approximation satisfies: $|f(x + \Delta x) - [f(x) + \nabla f(x)^T \Delta x]| \leq \frac{L}{2}\|\Delta x\|^2$

For gradient descent with step α, we get: $f(x - \alpha \nabla f) \leq f(x) - \alpha\|\nabla f\|^2 + \frac{L\alpha^2}{2}\|\nabla f\|^2 = f(x) - \alpha\left(1 - \frac{L\alpha}{2}\right)\|\nabla f\|^2$

For this to guarantee a decrease: We need 1 - Lα/2 > 0, which gives α < 2/L.

For optimal decrease: Maximize the right side over α: derivative zero at α* = 1/L, giving guaranteed decrease of (1/2L)‖∇f‖².

So the Lipschitz constant determines the maximum safe learning rate: α_max = 1/L. In practice, L is not known, so we estimate it with a learning rate finder or use adaptive methods.

Q5: How is Adam related to Newton's method?

Adam maintains exponential moving averages of first and second moments of gradients:

$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t \quad \text{(first moment - gradient EMA)}$ $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 \quad \text{(second moment - squared gradient EMA)}$

The update is: θ ← θ - α · m̂_t / (√v̂_t + ε)

Connection to Newton: The update θ ← θ - H⁻¹∇L for a diagonal Hessian H ≈ diag(v_t) gives: $\theta_{t+1} = \theta_t - \alpha \cdot \text{diag}(v_t)^{-1/2} g_t$

This is exactly Adam's structure (before bias correction).

What Adam is actually doing: Using √v_t as a per-parameter estimate of the gradient's typical magnitude, which is related to the diagonal Hessian. Parameters with large gradient magnitudes (high curvature) get smaller effective learning rates; parameters with small gradient magnitudes (low curvature) get larger effective learning rates.

The difference: Newton uses the true Hessian (full curvature information), Adam uses a diagonal approximation estimated from gradient history. Newton's method theoretically converges faster, but Adam is practical for large-scale, noisy mini-batch optimization.

Quick Reference

Expansion Order	Approximation	Algorithm	Cost/Step
0th	f(x₀)	Random search	O(1)
1st	f(x₀) + g^TΔθ	Gradient descent	O(d)
2nd (full)	f(x₀) + g^TΔθ + (1/2)Δθ^THΔθ	Newton's method	O(d³)
2nd (diagonal)	f(x₀) + g^TΔθ + (1/2)‖Δθ‖²_H	Adam (approx)	O(d)
2nd (approx)	Quasi-Newton with secant updates	L-BFGS	O(md)

Key Takeaways

Taylor series approximates any smooth function as a polynomial; optimization algorithms are derived from truncating this series
Gradient descent uses the 1st-order Taylor approximation - linear, guaranteed to decrease with small enough step
Newton's method uses the 2nd-order Taylor approximation - solves the exact quadratic model, converges quadratically but requires O(d²) memory and O(d³) compute
The maximum safe learning rate for gradient descent is 1/L_const, where L_const is the Lipschitz constant of the gradient (related to maximum Hessian eigenvalue)
L-BFGS approximates the Hessian inverse from gradient history using O(md) memory - practical for small to medium models and full-batch training
Adam is effectively a diagonal approximation to Newton's method, estimating per-parameter curvature from squared gradient history

Next: Automatic Differentiation →

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Taylor Series demo on the EngineersOfAI Playground - no code required.

:::

What You Will Learn​

Prerequisites​

Part 1 - Taylor Series​

The core idea​

Why Taylor series matters for optimization​

Visualizing Taylor approximations​

Part 2 - First-Order Taylor: Gradient Descent​

Derivation​

Why the first-order approximation is good enough (usually)​

Part 3 - Second-Order Taylor: Newton's Method​

The Hessian matrix​

Second-order Taylor expansion​

Newton's method vs gradient descent​

Convergence rates​

Newton's method in 1D​

Newton's method in multiple dimensions​

Part 4 - Why Newton's Method Is Impractical for Large Networks​

When Newton is practical​

Part 5 - Quasi-Newton Methods: Approximating the Hessian​

The BFGS idea​

L-BFGS: Limited-memory BFGS​

Part 6 - Taylor Series Applications in ML​

Adam as a diagonal approximation to Newton's method​

Learning rate and the Taylor approximation​

Taylor series and trust regions​

Part 7 - Taylor Series and Numerical Stability​

Why f''(x) ≠ (f(x+h) - 2f(x) + f(x-h)) / h² exactly​

Part 8 - Common Mistakes and Engineering Red Flags​

Interview Questions​

Quick Reference​

Key Takeaways​