Taylor Series and Approximations - The Mathematics Behind Gradient Descent and Newton's Method
Reading time: ~20 minutes | Level: Mathematical Foundations → ML Engineering
Gradient descent takes one step toward the minimum. Newton's method often reaches a minimum in far fewer steps. The difference is not magic - it is how many terms of the Taylor expansion each method uses.
Gradient descent uses the first-order Taylor expansion. Newton's method uses the second-order expansion. Understanding why these approximations lead to these algorithms, and when they are computationally practical, is the foundation for reasoning about all of ML optimization.
What You Will Learn
- Taylor series: polynomial approximation of any smooth function
- First-order Taylor approximation → gradient descent
- Second-order Taylor approximation → Newton's method
- The Hessian matrix: the second-derivative object for multi-variable functions
- Why second-order methods converge faster but are often impractical for large networks
- Quasi-Newton methods: approximating the Hessian efficiently
- Practical implications for ML optimizer design
Prerequisites
- Lesson 01: Derivatives and Gradients (required)
- Lesson 03: Gradient Descent Mechanics (recommended)
- Lesson 04: Convex Functions and Optimization (helpful)
Part 1 - Taylor Series
The core idea
Any smooth function f can be approximated near a point x₀ as a polynomial:
More compactly:
This is the Taylor series of f around x₀. It is exact (when the series converges) - not an approximation. But we often truncate after a few terms to get an approximation.
Why Taylor series matters for optimization
Optimization algorithms work by locally approximating the objective function and then minimizing the approximation. The order of the Taylor expansion used determines the algorithm:
| Order | Approximation | Algorithm |
|---|---|---|
| 0th order | f(x₀) - constant | Random search |
| 1st order | f(x₀) + f'(x₀)Δx | Gradient descent |
| 2nd order | f(x₀) + f'(x₀)Δx + (1/2)f''(x₀)Δx² | Newton's method |
Visualizing Taylor approximations
import numpy as np
def taylor_approximations(f, df, d2f, x0: float, n_points: int = 200):
"""
Compute 1st and 2nd order Taylor approximations of f around x0.
"""
x = np.linspace(x0 - 2, x0 + 2, n_points)
f_true = np.array([f(xi) for xi in x])
f_1st = f(x0) + df(x0) * (x - x0) # 1st order
f_2nd = f(x0) + df(x0) * (x - x0) + 0.5 * d2f(x0) * (x - x0)**2 # 2nd order
return x, f_true, f_1st, f_2nd
# Example: f(x) = sin(x) around x₀ = 0
f = np.sin
df = np.cos
d2f = lambda x: -np.sin(x)
x, f_true, f_1st, f_2nd = taylor_approximations(f, df, d2f, x0=0.0)
print("Taylor approximations of sin(x) at x=0:")
print(f"{'x':>6} | {'sin(x)':>8} | {'1st order':>10} | {'2nd order':>10}")
print("-" * 45)
for xi in [-1.5, -0.5, 0.0, 0.5, 1.0, 1.5]:
true_val = np.sin(xi)
first_val = xi # sin(0) + cos(0)*(x-0) = x
second_val = xi - xi**2/2 # + (-sin(0)/2)*(x-0)^2 = x (since sin(0)=0)
# Note: for sin at 0, second_val should be x (0 contribution from sin(0))
second_val = np.sin(0) + np.cos(0)*xi + (-np.sin(0)/2)*xi**2
print(f"{xi:>6.1f} | {true_val:>8.4f} | {first_val:>10.4f} | {second_val:>10.4f}")
# Approximation error
h_values = [1.0, 0.5, 0.1, 0.01]
print("\nApproximation error vs step size:")
print(f"{'h':>6} | {'1st order error':>16} | {'2nd order error':>16}")
for h in h_values:
f_true_h = np.sin(h)
f_1st_h = h
f_2nd_h = h - h**3/6 # actually this is 3rd order for sin... show the pattern
err_1st = abs(f_true_h - f_1st_h)
err_2nd = abs(f_true_h - (np.sin(0) + np.cos(0)*h)) # purely first order
err_full_2nd = abs(f_true_h - (h)) # first order for sin at 0
print(f"{h:>6.2f} | {err_1st:>16.6f} | {err_full_2nd:>16.6f}")
Part 2 - First-Order Taylor: Gradient Descent
Derivation
At current parameters θ₀, approximate L(θ) to first order:
This is a linear function of Δθ. It has no finite minimum - the linear approximation decreases without bound in the direction -∇L(θ₀).
So we impose a constraint: ‖Δθ‖ ≤ α (take a step of size at most α).
Minimizing the linear approximation subject to ‖Δθ‖ ≤ α gives:
Step in the direction opposite to the gradient, with step size α.
This is gradient descent, derived from the 1st-order Taylor expansion + step size constraint.
Why the first-order approximation is good enough (usually)
The 1st-order approximation is accurate when ‖Δθ‖ is small. Gradient descent uses a small learning rate precisely to keep each step within the region where the linear approximation holds.
The approximation error is O(‖Δθ‖²):
where L_const is the Lipschitz constant of the gradient (related to the maximum curvature).
Practical consequence: The learning rate α controls step size, which controls the approximation quality. If α is too large, we step outside the region where the linear approximation is valid - divergence.
import numpy as np
# Visualize: where the linear approximation breaks down
def f(x): return x**4 - 3*x**3 + 2 # some smooth function
def grad_f(x): return 4*x**3 - 9*x**2
x0 = 2.0
steps = [0.01, 0.1, 0.3, 0.5, 1.0]
print(f"Linear approximation quality at x₀={x0}:")
print(f"f(x₀) = {f(x0):.4f}, f'(x₀) = {grad_f(x0):.4f}")
print()
print(f"{'Δx':>8} | {'f(x₀+Δx)':>12} | {'linear approx':>14} | {'error':>10} | {'% error':>9}")
print("-" * 60)
for delta_x in steps:
true_val = f(x0 + delta_x)
linear_approx = f(x0) + grad_f(x0) * delta_x
error = abs(true_val - linear_approx)
pct_error = 100 * error / (abs(true_val) + 1e-8)
print(f"{delta_x:>8.2f} | {true_val:>12.4f} | {linear_approx:>14.4f} | {error:>10.4f} | {pct_error:>8.1f}%")
Part 3 - Second-Order Taylor: Newton's Method
The Hessian matrix
For a multi-variable function f: ℝⁿ → ℝ, the second-order Taylor expansion requires the Hessian matrix H = ∇²f - the matrix of all second-order partial derivatives:
The Hessian captures curvature - how the gradient changes in each direction.
Second-order Taylor expansion
This is a quadratic function of Δθ. It has a finite minimum (when H is positive definite).
Setting the gradient of the quadratic approximation to zero:
This gives the Newton step: θ₁ = θ₀ - H⁻¹∇L(θ₀)
Newton's method vs gradient descent
| Property | Gradient Descent | Newton's Method |
|---|---|---|
| Taylor order | 1st | 2nd |
| Update | θ - α∇L | θ - H⁻¹∇L |
| Curvature awareness | None (fixed step size) | Full (adapts to curvature) |
| Steps to converge | Many (linear rate) | Few (quadratic rate near minimum) |
| Cost per step | O(d) | O(d³) (matrix inversion) or O(d²) (solve linear system) |
| Memory | O(d) | O(d²) (store Hessian) |
| For d = 1M params | Feasible | Infeasible (10¹² elements in H) |
Convergence rates
- Gradient descent: linear convergence - error decreases by constant factor each step
- Newton's method: quadratic convergence near optimum - if error is ε, next error is O(ε²)
Gradient descent: ε → 0.9ε → 0.81ε → 0.73ε → ... (slow, linear)
Newton's method: ε → 0.1ε → 0.001ε → 10⁻⁶ε → ... (fast, quadratic)
Newton's method in 1D
For a single parameter: θ₁ = θ₀ - f'(θ₀) / f''(θ₀)
The step size adapts to the curvature: large curvature (large f'') → small step; small curvature → large step.
import numpy as np
def newtons_method_1d(
f, df, d2f, x0: float,
n_steps: int = 20, tol: float = 1e-10
) -> list:
"""Newton's method for 1D optimization."""
x = x0
history = [{'x': x, 'f': f(x), 'grad': df(x)}]
for step in range(n_steps):
grad = df(x)
hess = d2f(x)
if abs(hess) < 1e-12:
print(f"Near-zero Hessian at step {step}, stopping")
break
# Newton step: Δx = -f'(x) / f''(x)
delta_x = -grad / hess
x = x + delta_x
history.append({'x': x, 'f': f(x), 'grad': df(x)})
if abs(grad) < tol:
print(f"Converged at step {step+1}")
break
return history
# Compare gradient descent vs Newton on f(x) = x^4 - 2x^2 + 0.5 (has two minima)
f = lambda x: x**4 - 2*x**2 + 0.5
df = lambda x: 4*x**3 - 4*x
d2f = lambda x: 12*x**2 - 4
# From x0 = 2.0: should converge to minimum near x=1
print("Newton's method convergence (starting from x=2.0):")
hist_newton = newtons_method_1d(f, df, d2f, x0=2.0, n_steps=20)
for i, h in enumerate(hist_newton):
print(f" Step {i}: x={h['x']:.8f}, |f'(x)|={abs(h['grad']):.2e}")
# Compare with gradient descent
print("\nGradient descent convergence (lr=0.01, same problem):")
x = 2.0
for step in range(50):
g = df(x)
x -= 0.01 * g
if step < 5 or step % 10 == 0:
print(f" Step {step}: x={x:.8f}, |f'(x)|={abs(g):.2e}")
Newton's method in multiple dimensions
import numpy as np
def newtons_method_nd(
grad_fn,
hessian_fn,
theta0: np.ndarray,
n_steps: int = 20,
tol: float = 1e-8
) -> list:
"""Newton's method for multi-dimensional optimization."""
theta = theta0.copy()
history = [theta.copy()]
for step in range(n_steps):
grad = grad_fn(theta)
H = hessian_fn(theta)
# Newton step: Δθ = -H⁻¹ ∇L
# Don't compute H⁻¹ explicitly; solve linear system H Δθ = -∇L
try:
delta_theta = np.linalg.solve(H, -grad) # O(d³) but numerically stable
except np.linalg.LinAlgError:
print(f"Singular Hessian at step {step}, adding regularization")
H_reg = H + 1e-6 * np.eye(len(theta))
delta_theta = np.linalg.solve(H_reg, -grad)
theta = theta + delta_theta
history.append(theta.copy())
grad_norm = np.linalg.norm(grad)
if grad_norm < tol:
print(f"Converged at step {step+1}, ‖∇L‖ = {grad_norm:.2e}")
break
return history
# Example: minimize f(θ) = θ₁² + 10θ₂² (condition number 10)
grad_fn = lambda t: np.array([2*t[0], 20*t[1]])
hess_fn = lambda t: np.array([[2, 0], [0, 20]]) # constant Hessian (quadratic)
theta0 = np.array([5.0, 5.0])
hist = newtons_method_nd(grad_fn, hess_fn, theta0)
print(f"\nStarting point: {theta0}")
print(f"Newton converged in {len(hist)-1} steps to: {hist[-1].round(8)}")
# For quadratic functions, Newton's method converges in exactly 1 step
# because the quadratic model is exact
Part 4 - Why Newton's Method Is Impractical for Large Networks
For a neural network with d = 100 million parameters:
- Hessian size: d × d = 10¹⁶ float32 values ≈ 10⁷ GB
- Hessian computation: O(d²) forward/backward passes
- Hessian inversion: O(d³) operations
None of these are remotely feasible. Gradient descent requires O(d) memory and compute per step.
This is why virtually all large-scale deep learning uses first-order methods (gradient descent variants).
When Newton is practical
Newton's method is practical for:
- Small models: d < 10,000 parameters
- Classic ML: logistic regression, kernel SVMs, linear models
- Fine-tuning with frozen layers: few free parameters
- Trust region methods: approximate Newton update within a trust region
Part 5 - Quasi-Newton Methods: Approximating the Hessian
Quasi-Newton methods approximate H⁻¹ from gradient differences, without computing H explicitly.
The BFGS idea
If we observe two gradient values g_old = ∇L(θ_old) and g_new = ∇L(θ_new), we know:
The Hessian satisfies (approximately): H · (θ_new - θ_old) = g_new - g_old
This is the secant condition: the Hessian must map the parameter difference to the gradient difference.
BFGS (Broyden-Fletcher-Goldfarb-Shanno) uses this condition to iteratively update an approximation to H⁻¹, using rank-2 updates:
where:
- s_k = θ_{k+1} - θ_k (parameter difference)
- y_k = ∇L_{k+1} - ∇L_k (gradient difference)
- ρ_k = 1/(y_k^T s_k)
L-BFGS: Limited-memory BFGS
BFGS requires storing H⁻¹ ∈ ℝ^{d×d}. For d = 100K parameters, this is 10¹⁰ float32 values - still too large.
L-BFGS stores only the last m (typically 10-20) (s_k, y_k) vector pairs and reconstructs the H⁻¹ product on the fly in O(md) time and O(md) memory.
This makes L-BFGS practical for medium-scale problems (d up to ~10 million).
from scipy.optimize import minimize
import numpy as np
def compare_optimizers():
"""
Compare gradient descent vs L-BFGS on a logistic regression problem.
"""
np.random.seed(42)
n, d = 500, 20
X = np.random.randn(n, d)
true_w = np.random.randn(d)
y = (1 / (1 + np.exp(-X @ true_w)) > 0.5).astype(float)
def sigmoid(z): return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
def logistic_loss(w):
z = X @ w
s = sigmoid(z)
return -np.mean(y * np.log(s + 1e-10) + (1-y) * np.log(1-s + 1e-10))
def logistic_grad(w):
z = X @ w
s = sigmoid(z)
return (1/n) * X.T @ (s - y)
w0 = np.zeros(d)
# Method 1: gradient descent
w = w0.copy()
gd_losses = []
for _ in range(500):
grad = logistic_grad(w)
w -= 0.5 * grad
gd_losses.append(logistic_loss(w))
# Method 2: L-BFGS
result = minimize(logistic_loss, w0, jac=logistic_grad, method='L-BFGS-B')
lbfgs_loss = result.fun
lbfgs_steps = result.nit
print(f"Gradient descent (500 steps): final loss = {gd_losses[-1]:.6f}")
print(f"L-BFGS ({lbfgs_steps} steps): final loss = {lbfgs_loss:.6f}")
print(f"GD needed ~{len([l for l in gd_losses if l < lbfgs_loss + 0.001])} steps to match L-BFGS quality")
compare_optimizers()
Part 6 - Taylor Series Applications in ML
Adam as a diagonal approximation to Newton's method
Adam maintains the exponential moving average of squared gradients (v_t). The update is:
This is equivalent to using a diagonal approximation to the Hessian: H ≈ diag(v_t). Each parameter's learning rate is scaled by (an estimate of) its curvature.
This is the key insight: Adam is an approximation to Newton's method using a diagonal Hessian estimated from squared gradient history.
Learning rate and the Taylor approximation
The learning rate α must be chosen such that:
- The step ‖Δθ‖ = α‖∇L‖ is within the region where the linear approximation holds
- The Lipschitz condition says this region has radius ≈ 1/L_const
The maximum safe learning rate is α* = 1/L_const, which equals 1/(maximum eigenvalue of Hessian). This is the theoretical foundation for why warm restarts with decreasing learning rates work.
Taylor series and trust regions
Trust region methods explicitly enforce that we only trust the Taylor approximation within a ball:
The trust region radius r grows when the approximation is accurate (actual decrease matches predicted decrease) and shrinks when the approximation is inaccurate.
import numpy as np
def trust_region_step_1d(
f, df, d2f,
x0: float,
trust_radius: float = 1.0,
max_iter: int = 100
) -> list:
"""
1D trust region method.
Adjusts step size based on how well the quadratic approximation predicts actual decrease.
"""
x = x0
history = [{'x': x, 'f': f(x)}]
r = trust_radius
for step in range(max_iter):
grad = df(x)
hess = d2f(x)
# Newton step
if abs(hess) < 1e-12:
break
newton_step = -grad / hess
# Clip to trust region
actual_step = np.sign(newton_step) * min(abs(newton_step), r)
x_new = x + actual_step
# Predicted decrease (from quadratic model)
predicted_decrease = -(grad * actual_step + 0.5 * hess * actual_step**2)
# Actual decrease
actual_decrease = f(x) - f(x_new)
# Ratio: actual / predicted
rho = actual_decrease / (predicted_decrease + 1e-10)
if rho > 0.75 and abs(actual_step) >= 0.9 * r:
r *= 2.0 # expand trust region (model was accurate)
elif rho < 0.25:
r *= 0.5 # shrink trust region (model was inaccurate)
if rho > 0:
x = x_new # accept step
history.append({'x': x, 'f': f(x)})
if abs(grad) < 1e-8:
break
return history
# Test on a function with varying curvature
f = lambda x: x**4 - 3*x**2 + x
df = lambda x: 4*x**3 - 6*x + 1
d2f = lambda x: 12*x**2 - 6
hist = trust_region_step_1d(f, df, d2f, x0=2.0, trust_radius=0.5, max_iter=50)
print(f"Trust region method: converged in {len(hist)} steps")
print(f"Final x = {hist[-1]['x']:.6f}, f(x) = {hist[-1]['f']:.6f}")
Part 7 - Taylor Series and Numerical Stability
Why f''(x) ≠ (f(x+h) - 2f(x) + f(x-h)) / h² exactly
The finite-difference approximation of the second derivative:
has truncation error O(h²) but also floating-point cancellation error O(ε_machine/h²). The optimal h balances these:
import numpy as np
# Optimal step size for second derivative approximation
f = np.exp # exp(x), all derivatives = exp(x)
x = 1.0
true_d2f = np.exp(x) # = e ≈ 2.718
print("Second derivative approximation error vs step size h:")
print(f"{'h':>10} | {'approx':>12} | {'error':>12}")
print("-" * 38)
for h in [1.0, 1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8]:
approx = (f(x+h) - 2*f(x) + f(x-h)) / h**2
error = abs(approx - true_d2f)
print(f"{h:>10.0e} | {approx:>12.6f} | {error:>12.2e}")
Part 8 - Common Mistakes and Engineering Red Flags
:::danger Newton's method with indefinite Hessian At a saddle point, the Hessian has negative eigenvalues. H⁻¹∇L may point toward a saddle point rather than a minimum. Newton's method requires a positive definite Hessian.
Fix: Add damping - replace H with H + λI (diagonal Levenberg-Marquardt regularization):
H_damped = H + lambda_reg * np.eye(d)
delta_theta = np.linalg.solve(H_damped, -grad)
# Choose lambda_reg such that H_damped is PD
:::
:::warning Learning rate too large relative to curvature The theoretically safe learning rate is α ≤ 1/L_const = 1/(max eigenvalue of Hessian). If you use a larger learning rate, you step outside the region where the linear approximation holds.
This is why learning rate schedules that decay the learning rate over time work: as you approach the minimum, the landscape becomes more curved (larger Hessian eigenvalues), so you need a smaller learning rate. :::
:::tip Gauss-Newton as a practical second-order method for least squares For least-squares problems L(θ) = (1/2)‖r(θ)‖², the Hessian is approximated by the Gauss-Newton matrix G = JᵀJ (where J is the Jacobian of residuals r). This avoids computing the full Hessian and is always positive semidefinite, making it practical for residual-based networks. :::
Interview Questions
Q1: What is the Taylor series and why does it matter for ML optimization?
The Taylor series approximates a smooth function f near a point x₀ as a polynomial:
Why it matters for ML:
All optimization algorithms can be derived from truncating the Taylor series at different orders:
- 0th order: Evaluate f at random points - no derivative information used
- 1st order: f(x₀) + ∇L·Δθ - minimizing this gives gradient descent
- 2nd order: f(x₀) + ∇L·Δθ + (1/2)ΔθᵀHΔθ - minimizing this gives Newton's method
The learning rate α controls how far we trust the linear approximation. Large α → step outside valid range → potentially diverge. Small α → stay within valid range but converge slowly.
Adam's adaptive learning rate implicitly uses a diagonal approximation to the Hessian (from squared gradient history) - a partial second-order correction that is computationally cheap.
Q2: Derive Newton's method from the second-order Taylor expansion.
Step 1: Write the 2nd-order Taylor expansion around current θ:
Step 2: This is a quadratic function in Δθ. Minimize by setting the gradient to zero:
Step 3: Solve for the optimal step:
Step 4: Update: θ ← θ - H⁻¹∇L(θ)
Convergence: Near the optimum, Newton's method has quadratic convergence - if the current error is ε, the next error is O(ε²). For gradient descent, convergence is linear - error decreases by a constant factor each step.
Why we don't always use it: For d = 100M parameters, H is a 10M × 10M matrix (10¹⁴ float32 values). Infeasible to store or invert. Gradient descent requires O(d) memory, Newton requires O(d²).
Q3: What is L-BFGS and when would you use it over Adam?
L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) is a quasi-Newton method that:
- Approximates H⁻¹ using the last m (typically 10-20) gradient/parameter differences
- Requires O(md) memory instead of O(d²) for the full Hessian
- Uses a full line search to find the optimal step size in each iteration
When to use L-BFGS over Adam:
- Small to medium models (d < 10M): L-BFGS converges much faster than Adam in terms of function evaluations
- Second-order convergence: For strongly convex problems, L-BFGS achieves superlinear convergence near the optimum - Adam only achieves linear
- Classical ML: Logistic regression, kernel methods, GLMs - L-BFGS is standard
- Full-batch optimization: L-BFGS requires consistent (not mini-batch) gradients for its Hessian approximation to be accurate
When to use Adam over L-BFGS:
- Large models (d > 10M): L-BFGS becomes expensive per step
- Mini-batch training: L-BFGS's Hessian approximation breaks down with noisy mini-batch gradients; Adam handles gradient noise well
- Non-convex landscapes: Adam with momentum escapes saddle points naturally; L-BFGS can get stuck
In practice: Adam for almost all deep learning; L-BFGS for classical ML and fine-tuning small models.
Q4: Why does the learning rate in gradient descent need to be at most 1/L (where L is the Lipschitz constant of the gradient)?
Lipschitz gradient condition: ‖∇f(x) - ∇f(y)‖ ≤ L‖x - y‖ for all x, y. This means the gradient cannot change faster than L times the step size.
Connection to Taylor approximation: The error in the 1st-order Taylor approximation satisfies:
For gradient descent with step α, we get:
For this to guarantee a decrease: We need 1 - Lα/2 > 0, which gives α < 2/L.
For optimal decrease: Maximize the right side over α: derivative zero at α* = 1/L, giving guaranteed decrease of (1/2L)‖∇f‖².
So the Lipschitz constant determines the maximum safe learning rate: α_max = 1/L. In practice, L is not known, so we estimate it with a learning rate finder or use adaptive methods.
Q5: How is Adam related to Newton's method?
Adam maintains exponential moving averages of first and second moments of gradients:
The update is: θ ← θ - α · m̂_t / (√v̂_t + ε)
Connection to Newton: The update θ ← θ - H⁻¹∇L for a diagonal Hessian H ≈ diag(v_t) gives:
This is exactly Adam's structure (before bias correction).
What Adam is actually doing: Using √v_t as a per-parameter estimate of the gradient's typical magnitude, which is related to the diagonal Hessian. Parameters with large gradient magnitudes (high curvature) get smaller effective learning rates; parameters with small gradient magnitudes (low curvature) get larger effective learning rates.
The difference: Newton uses the true Hessian (full curvature information), Adam uses a diagonal approximation estimated from gradient history. Newton's method theoretically converges faster, but Adam is practical for large-scale, noisy mini-batch optimization.
Quick Reference
| Expansion Order | Approximation | Algorithm | Cost/Step |
|---|---|---|---|
| 0th | f(x₀) | Random search | O(1) |
| 1st | f(x₀) + g^TΔθ | Gradient descent | O(d) |
| 2nd (full) | f(x₀) + g^TΔθ + (1/2)Δθ^THΔθ | Newton's method | O(d³) |
| 2nd (diagonal) | f(x₀) + g^TΔθ + (1/2)‖Δθ‖²_H | Adam (approx) | O(d) |
| 2nd (approx) | Quasi-Newton with secant updates | L-BFGS | O(md) |
Key Takeaways
- Taylor series approximates any smooth function as a polynomial; optimization algorithms are derived from truncating this series
- Gradient descent uses the 1st-order Taylor approximation - linear, guaranteed to decrease with small enough step
- Newton's method uses the 2nd-order Taylor approximation - solves the exact quadratic model, converges quadratically but requires O(d²) memory and O(d³) compute
- The maximum safe learning rate for gradient descent is 1/L_const, where L_const is the Lipschitz constant of the gradient (related to maximum Hessian eigenvalue)
- L-BFGS approximates the Hessian inverse from gradient history using O(md) memory - practical for small to medium models and full-batch training
- Adam is effectively a diagonal approximation to Newton's method, estimating per-parameter curvature from squared gradient history
Next: Automatic Differentiation →
:::tip 🎮 Interactive Playground
Visualize this concept: Try the Taylor Series demo on the EngineersOfAI Playground - no code required.
:::
