Skip to main content

Gradient Descent From Scratch

The Real Interview Moment

You are three rounds into the ML engineer interview loop at a self-driving startup. The technical screen went well. The system design round was solid. Now the final round: a senior researcher puts this problem on the whiteboard.

"You have a dataset with 10 million samples and 500 features. You tried the normal equation and the process was killed after 2 hours. Walk me through exactly what gradient descent does differently and why it works. Derive the update rule. Explain why your learning rate matters. And tell me: why does feature scaling change everything for gradient descent but nothing for the normal equation?"

This question tests depth. The researcher does not want to hear "gradient descent takes steps in the direction of the negative gradient." They want to hear about Lipschitz constants, convergence rates, condition numbers, and the geometric relationship between the loss landscape and the feature covariance structure.

This lesson teaches you to answer that question completely. And more: to understand gradient descent well enough to debug convergence failures, choose learning rates intelligently, and explain why momentum exists.

Why Gradient Descent Exists

The normal equation gives the exact answer in one step: w^=(XX)1Xy\hat{\mathbf{w}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}. So why does the entire field of deep learning use gradient descent instead?

Forming XX\mathbf{X}^\top\mathbf{X} requires O(nd2)O(nd^2) operations. For n=107,d=500n = 10^7, d = 500: that is 2.5×10122.5 \times 10^{12} multiply-adds. Even if you had the compute, inverting the resulting 500×500500 \times 500 matrix can fail numerically with near-singular features. And for neural networks, no closed form exists at all - gradient descent is the only option.

Gradient descent uses a fundamentally different strategy: take small steps in the direction of steepest descent, updating weights after each step (or each batch). Each step costs O(nd)O(nd) at most - and with mini-batches, just O(Bd)O(Bd) where BnB \ll n.

Historical Context

Cauchy proposed the gradient descent method in 1847 for solving systems of equations. The method remained a mathematical curiosity until the 1960s when Robbins and Monro (1951) formalized stochastic approximation, and the backpropagation algorithm (Rumelhart, Hinton, Williams 1986) made gradient-based training of neural networks practical. The modern understanding of gradient descent - convergence rates, condition number dependence, the role of learning rate schedules - was largely developed through the 2000s and 2010s as deep learning scaled to massive datasets.

Deriving the Gradient of MSE Loss

The Mean Squared Error (MSE) loss over nn training samples is:

L(w)=1nyXw2=1ni=1n(yiwxi)2\mathcal{L}(\mathbf{w}) = \frac{1}{n} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 = \frac{1}{n} \sum_{i=1}^n (y_i - \mathbf{w}^\top\mathbf{x}_i)^2

To find the gradient, apply the chain rule. Let r=yXw\mathbf{r} = \mathbf{y} - \mathbf{X}\mathbf{w} (the residual vector), so L=1nrr\mathcal{L} = \frac{1}{n}\mathbf{r}^\top\mathbf{r}:

Lw=w1n(yXw)(yXw)\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \frac{\partial}{\partial \mathbf{w}} \frac{1}{n} (\mathbf{y} - \mathbf{X}\mathbf{w})^\top(\mathbf{y} - \mathbf{X}\mathbf{w})

Expanding and differentiating term by term using matrix calculus:

  • wyy=0\frac{\partial}{\partial\mathbf{w}} \mathbf{y}^\top\mathbf{y} = \mathbf{0}
  • w(2wXy)=2Xy\frac{\partial}{\partial\mathbf{w}} (-2\mathbf{w}^\top\mathbf{X}^\top\mathbf{y}) = -2\mathbf{X}^\top\mathbf{y}
  • wwXXw=2XXw\frac{\partial}{\partial\mathbf{w}} \mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w} = 2\mathbf{X}^\top\mathbf{X}\mathbf{w}

Combining:

wL=2nX(yXw)=2nX(Xwy)\boxed{\nabla_{\mathbf{w}} \mathcal{L} = -\frac{2}{n}\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w}) = \frac{2}{n}\mathbf{X}^\top(\mathbf{X}\mathbf{w} - \mathbf{y})}

The gradient points in the direction of steepest ascent. To minimize, we move opposite to the gradient, scaled by learning rate η\eta:

w(t+1)=w(t)ηwL(w(t))=w(t)2ηnX(Xw(t)y)\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \cdot \nabla_\mathbf{w}\mathcal{L}(\mathbf{w}^{(t)}) = \mathbf{w}^{(t)} - \frac{2\eta}{n}\mathbf{X}^\top(\mathbf{X}\mathbf{w}^{(t)} - \mathbf{y})

Step-by-Step Scalar Derivation (Intuition)

For the single-weight case, with y^i=wxi\hat{y}_i = w x_i:

L(w)=1ni=1n(yiwxi)2\mathcal{L}(w) = \frac{1}{n}\sum_{i=1}^n (y_i - w x_i)^2

dLdw=1ni=1n2(yiwxi)(xi)=2ni=1nxi(yiwxi)\frac{d\mathcal{L}}{dw} = \frac{1}{n}\sum_{i=1}^n 2(y_i - w x_i)(-x_i) = -\frac{2}{n}\sum_{i=1}^n x_i(y_i - w x_i)

This is the scalar version of 2nX(yXw)-\frac{2}{n}\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w}). Reading it: for each sample ii, if the prediction wxiw x_i is too low (positive residual yiwxiy_i - w x_i), then the gradient is negative (telling us to increase ww). This is exactly right intuitively.

The Loss Landscape for Linear Regression

For linear regression with MSE loss, the loss surface has a special structure: it is a convex quadratic bowl - smooth, with a single global minimum, no local minima, no saddle points.

The Hessian (second derivative matrix) of the MSE loss is:

w2L=2nXX\nabla^2_\mathbf{w} \mathcal{L} = \frac{2}{n}\mathbf{X}^\top\mathbf{X}

Since XX\mathbf{X}^\top\mathbf{X} is a Gram matrix, it is always positive semi-definite (all eigenvalues 0\geq 0). For an overdetermined system (n>dn > d with full-rank X\mathbf{X}), it is strictly positive definite - the loss surface is a strictly convex bowl with a unique global minimum.

This guarantee vanishes for neural networks. The composition of nonlinear activation functions creates a highly non-convex loss surface with many saddle points and local minima. For linear models, gradient descent is provably correct and converges to the global optimum.

Geometry of the Loss Landscape

The shape of the bowl is determined by the eigenvalues of XX\mathbf{X}^\top\mathbf{X}:

  • Eigenvalues all equal (κ=1\kappa = 1): perfectly circular bowl - gradient descent converges in a straight line to the minimum. Any learning rate up to the convergence threshold works equally well.
  • Eigenvalues very different (κ1\kappa \gg 1): narrow elongated valley - gradient descent oscillates across the narrow dimension while making slow progress along the long dimension. This is why learning rate choice is critical.

The Gradient Descent Algorithm

Starting from initial weights w(0)\mathbf{w}^{(0)} (typically all zeros or small random values), iterate:

w(t+1)w(t)ηwL(w(t))\mathbf{w}^{(t+1)} \leftarrow \mathbf{w}^{(t)} - \eta \cdot \nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w}^{(t)})

The algorithm continues until:

  • The gradient norm L<ϵ\|\nabla\mathcal{L}\| < \epsilon (convergence criterion)
  • A maximum number of iterations is reached
  • The loss stops decreasing (loss difference <δ< \delta)

Learning Rate Sensitivity: The Lipschitz Constant

The Lipschitz constant LL of the gradient is the key quantity governing learning rate choice. For the MSE loss:

L=2λmax(XX)nL = \frac{2\lambda_{\max}(\mathbf{X}^\top\mathbf{X})}{n}

where λmax\lambda_{\max} is the largest eigenvalue of XX\mathbf{X}^\top\mathbf{X}.

The gradient descent converges if and only if the learning rate satisfies:

η<2L=nλmax(XX)\eta < \frac{2}{L} = \frac{n}{\lambda_{\max}(\mathbf{X}^\top\mathbf{X})}

For η\eta above this threshold, the weight update overshoots the minimum in the direction of the dominant eigenvector and diverges. For η\eta below, it converges - but slowly if η2/L\eta \ll 2/L.

The theoretically optimal learning rate is:

η=2λmax+λmin=2L+μ\eta^* = \frac{2}{\lambda_{\max} + \lambda_{\min}} = \frac{2}{L + \mu}

where μ=2λmin/n\mu = 2\lambda_{\min}/n is the strong convexity constant. This balances between fast progress (using large steps) and stability (not overshooting).

Full NumPy Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression


class BatchGradientDescent:
"""
Linear regression via batch gradient descent.
Uses all n training samples per gradient step.
Includes convergence history and diagnostic methods.
"""

def __init__(self, learning_rate=0.01, n_iterations=1000,
tolerance=1e-6, verbose=False):
self.lr = learning_rate
self.n_iterations = n_iterations
self.tolerance = tolerance
self.verbose = verbose
self.weights = None
self.bias = None
self.loss_history = []
self.grad_norm_history = []

def fit(self, X, y):
n, d = X.shape
# Initialize at zero (safe for convex problems)
self.weights = np.zeros(d)
self.bias = 0.0

for iteration in range(self.n_iterations):
# Forward pass: compute predictions and residuals
y_pred = X @ self.weights + self.bias
residuals = y_pred - y # shape (n,) - prediction error

# MSE loss for monitoring
loss = np.mean(residuals ** 2)
self.loss_history.append(loss)

# Gradients: ∇w L = (2/n) X^T (Xw - y), ∇b L = (2/n) sum(residuals)
grad_w = (2 / n) * X.T @ residuals # shape (d,)
grad_b = (2 / n) * np.sum(residuals) # scalar

grad_norm = np.linalg.norm(grad_w)
self.grad_norm_history.append(grad_norm)

if self.verbose and iteration % 100 == 0:
print(f"Iter {iteration:5d} | Loss: {loss:.6f} | "
f"|∇L|: {grad_norm:.6f} | lr: {self.lr:.4f}")

# Update weights (gradient descent step)
self.weights -= self.lr * grad_w
self.bias -= self.lr * grad_b

# Convergence check: gradient norm below tolerance
if grad_norm < self.tolerance:
if self.verbose:
print(f"Converged at iteration {iteration}")
break

return self

def predict(self, X):
return X @ self.weights + self.bias

def score(self, X, y):
y_pred = self.predict(X)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
return 1 - ss_res / ss_tot


# Setup dataset
np.random.seed(42)
X, y = make_regression(n_samples=2000, n_features=10, noise=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Always scale before gradient descent
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

model = BatchGradientDescent(learning_rate=0.05, n_iterations=3000, verbose=True)
model.fit(X_train_sc, y_train)

print(f"\nTrain R²: {model.score(X_train_sc, y_train):.4f}")
print(f"Test R²: {model.score(X_test_sc, y_test):.4f}")

# Compare to normal equation
sk = LinearRegression().fit(X_train_sc, y_train)
print(f"sklearn (OLS) Test R²: {sk.score(X_test_sc, y_test):.4f}")

Learning Rate Sensitivity Experiment

def run_gd_with_lr(X_train, y_train, learning_rates, n_iterations=300):
"""Compare gradient descent trajectories for different learning rates."""
results = {}

for lr in learning_rates:
model = BatchGradientDescent(
learning_rate=lr,
n_iterations=n_iterations,
tolerance=1e-12 # disable early stopping for comparison
)
model.fit(X_train, y_train)
results[lr] = model.loss_history

return results


learning_rates = [0.001, 0.01, 0.05, 0.1, 0.3, 0.8]
results = run_gd_with_lr(X_train_sc, y_train, learning_rates)

plt.figure(figsize=(12, 5))
for lr, losses in results.items():
has_nan = any(np.isnan(l) or np.isinf(l) for l in losses)
label = f"η={lr}" + (" (DIVERGED)" if has_nan else "")
style = '--' if has_nan else '-'
clean_losses = [l for l in losses if not (np.isnan(l) or np.isinf(l))]
plt.semilogy(clean_losses, linestyle=style, label=label)

plt.xlabel("Iteration")
plt.ylabel("MSE Loss (log scale)")
plt.title("Learning Rate Sensitivity\n(Gradient Descent on Scaled Features)")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("lr_sensitivity.png", dpi=150)
plt.show()

# Summary table
print("\nLearning Rate Behavior Summary:")
print(f"{'η':>8} | {'Final Loss':>12} | {'Status':>15}")
print("-" * 40)
for lr, losses in results.items():
final = losses[-1]
if np.isnan(final) or np.isinf(final):
status = "DIVERGED"
final_str = "inf"
elif final > 1e6:
status = "Oscillating"
final_str = f"{final:.2e}"
else:
status = "Converged"
final_str = f"{final:.4f}"
print(f"{lr:>8} | {final_str:>12} | {status:>15}")

Convergence Analysis: The Condition Number

Why do some datasets converge in 100 iterations and others take 100,000? The answer is the condition number of XX\mathbf{X}^\top\mathbf{X}.

κ(XX)=λmaxλmin\kappa(\mathbf{X}^\top\mathbf{X}) = \frac{\lambda_{\max}}{\lambda_{\min}}

The convergence rate of gradient descent (with optimal learning rate) is:

w(T)w(κ1κ+1)Tw(0)w\|\mathbf{w}^{(T)} - \mathbf{w}^*\| \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^T \|\mathbf{w}^{(0)} - \mathbf{w}^*\|

The factor ρ=(κ1)/(κ+1)\rho = (\kappa-1)/(\kappa+1) is the convergence rate per iteration:

  • κ=1\kappa = 1: ρ=0\rho = 0 - converges in 1 iteration (the exact Newton step)
  • κ=10\kappa = 10: ρ=0.818\rho = 0.818 - needs ~25 iterations to reduce error by 10210^{-2}
  • κ=100\kappa = 100: ρ=0.980\rho = 0.980 - needs ~230 iterations to reduce error by 10210^{-2}
  • κ=1000\kappa = 1000: ρ=0.998\rho = 0.998 - needs ~2300 iterations to reduce error by 10210^{-2}

For Lipschitz-smooth convex objectives (which includes MSE for linear regression), the convergence gap after TT iterations satisfies:

L(w(T))L(w)Lw(0)w22T\mathcal{L}(\mathbf{w}^{(T)}) - \mathcal{L}(\mathbf{w}^*) \leq \frac{L \|\mathbf{w}^{(0)} - \mathbf{w}^*\|^2}{2T}

This is the O(1/T)O(1/T) convergence rate for convex Lipschitz-smooth objectives.

def analyze_convergence(X, label=""):
"""Analyze the condition number and theoretical convergence properties."""
gram = X.T @ X # X^T X: d × d matrix
eigenvalues = np.linalg.eigvalsh(gram) # symmetric → use eigvalsh
eigenvalues = np.sort(eigenvalues)
# Filter out numerical zeros
pos_eigs = eigenvalues[eigenvalues > 1e-10]

lambda_max = pos_eigs.max()
lambda_min = pos_eigs.min()
kappa = lambda_max / lambda_min
n = X.shape[0]

# Optimal learning rate for batch GD
alpha_opt = 2 * n / (lambda_max + lambda_min)

# Lipschitz constant of the gradient
L = 2 * lambda_max / n
mu = 2 * lambda_min / n # strong convexity constant

# Convergence rate at optimal lr
rho = (kappa - 1) / (kappa + 1)

# Iterations to reach epsilon = 1e-6 accuracy
eps = 1e-6
iters_needed = np.log(1 / eps) / np.log(1 / rho) if rho < 1 else float('inf')

print(f"\n=== Convergence Analysis: {label} ===")
print(f" λ_max = {lambda_max:.2f}, λ_min = {lambda_min:.6f}")
print(f" Condition number κ = {kappa:.1f}")
print(f" Lipschitz constant L = {L:.6f}")
print(f" Strong convexity μ = {mu:.6f}")
print(f" Optimal learning rate η* = {alpha_opt:.6f}")
print(f" Convergence rate ρ = {rho:.6f}")
print(f" Iterations for 1e-6 accuracy: ~{int(iters_needed)}")

return kappa, alpha_opt, rho


# Well-conditioned: scaled features
kappa_good, lr_good, rho_good = analyze_convergence(X_train_sc, "Scaled features")

# Poorly conditioned: one feature on a very different scale
X_bad = X_train.copy()
X_bad[:, 0] *= 1000 # massive scale difference
kappa_bad, lr_bad, rho_bad = analyze_convergence(X_bad, "Unscaled features")

print(f"\nCondition number increase from bad scaling: {kappa_bad / kappa_good:.1f}x")
print(f"Extra iterations needed: {rho_bad:.4f} vs {rho_good:.4f} convergence rate")

Gradient Clipping

For models where gradients can explode (large datasets, poorly scaled features, neural networks), gradient clipping prevents divergence:

If g>c, then gcgg\text{If } \|\mathbf{g}\| > c, \text{ then } \mathbf{g} \leftarrow c \cdot \frac{\mathbf{g}}{\|\mathbf{g}\|}

This rescales the gradient to have norm exactly cc when it exceeds the threshold, preserving the direction but bounding the magnitude.

class GradientDescentWithClipping:
"""
Gradient descent with gradient norm clipping and loss curve plotting.
Gradient clipping prevents divergence in poorly conditioned problems.
"""

def __init__(self, learning_rate=0.01, n_iterations=1000,
clip_norm=None, tolerance=1e-6):
self.lr = learning_rate
self.n_iterations = n_iterations
self.clip_norm = clip_norm # None = no clipping
self.tolerance = tolerance
self.weights = None
self.bias = None
self.loss_history = []
self.clipped_count = 0

def _clip_gradient(self, grad):
"""Apply gradient norm clipping if clip_norm is set."""
if self.clip_norm is None:
return grad
norm = np.linalg.norm(grad)
if norm > self.clip_norm:
self.clipped_count += 1
return self.clip_norm * grad / norm
return grad

def fit(self, X, y):
n, d = X.shape
self.weights = np.zeros(d)
self.bias = 0.0
self.clipped_count = 0

for t in range(self.n_iterations):
y_pred = X @ self.weights + self.bias
residuals = y_pred - y

loss = np.mean(residuals ** 2)
self.loss_history.append(loss)

grad_w = (2 / n) * X.T @ residuals
grad_b = (2 / n) * np.sum(residuals)

# Apply clipping before the update step
grad_w = self._clip_gradient(grad_w)

if np.linalg.norm(grad_w) < self.tolerance:
break

self.weights -= self.lr * grad_w
self.bias -= self.lr * grad_b

return self

def predict(self, X):
return X @ self.weights + self.bias

def score(self, X, y):
y_pred = self.predict(X)
return 1 - np.sum((y - y_pred)**2) / np.sum((y - np.mean(y))**2)


# Demonstrate: large learning rate without clipping diverges; with clipping it recovers
large_lr = 0.4 # above the convergence threshold for this data

gd_no_clip = GradientDescentWithClipping(learning_rate=large_lr, n_iterations=200)
gd_no_clip.fit(X_train_sc, y_train)

gd_clip = GradientDescentWithClipping(learning_rate=large_lr, n_iterations=200,
clip_norm=10.0)
gd_clip.fit(X_train_sc, y_train)

print(f"\nWith lr={large_lr}:")
print(f" No clipping: final loss = {gd_no_clip.loss_history[-1]:.2e}")
print(f" With clipping (c=10): final loss = {gd_clip.loss_history[-1]:.2f}")
print(f" Gradients clipped: {gd_clip.clipped_count} times")

plt.figure(figsize=(10, 4))
valid_no_clip = [l for l in gd_no_clip.loss_history if not np.isinf(l) and not np.isnan(l)]
plt.semilogy(range(len(valid_no_clip)), valid_no_clip,
label=f'No clipping (lr={large_lr})', color='red')
plt.semilogy(gd_clip.loss_history,
label=f'Clipping c=10 (lr={large_lr})', color='green')
plt.xlabel("Iteration")
plt.ylabel("MSE Loss (log scale)")
plt.title("Effect of Gradient Clipping with Large Learning Rate")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("gradient_clipping.png", dpi=150)
plt.show()

Momentum: Accelerating Gradient Descent

Gradient descent on an elongated loss landscape oscillates across the narrow dimension. Momentum (Polyak 1964) accumulates a velocity vector that damps oscillations and accelerates progress along consistent directions:

v(t+1)=βv(t)+wL(w(t))\mathbf{v}^{(t+1)} = \beta \mathbf{v}^{(t)} + \nabla_\mathbf{w}\mathcal{L}(\mathbf{w}^{(t)})

w(t+1)=w(t)ηv(t+1)\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \mathbf{v}^{(t+1)}

The momentum coefficient β[0,1)\beta \in [0, 1) (typically 0.90.9) controls how much past gradients are retained. With β=0\beta = 0, this reduces to standard gradient descent. With β=0.9\beta = 0.9, each velocity update is a weighted sum of all previous gradients, with the most recent weighted at 1 and gradients from kk steps ago weighted at βk\beta^k.

Why does this help? In the narrow valley case:

  • Across the valley (oscillating direction): gradients flip sign every step. The momentum vector for this direction averages to near zero - damping the oscillations.
  • Along the valley (slow progress direction): gradients consistently point in the same direction. Momentum accumulates, accelerating progress.
class GradientDescentMomentum:
"""
Gradient descent with Polyak momentum.
Accelerates convergence on ill-conditioned loss landscapes.
"""

def __init__(self, learning_rate=0.01, momentum=0.9, n_iterations=1000,
tolerance=1e-6, verbose=False):
self.lr = learning_rate
self.beta = momentum
self.n_iterations = n_iterations
self.tolerance = tolerance
self.verbose = verbose
self.loss_history = []

def fit(self, X, y):
n, d = X.shape
self.weights = np.zeros(d)
self.bias = 0.0
velocity_w = np.zeros(d) # accumulated gradient (velocity)
velocity_b = 0.0

for t in range(self.n_iterations):
y_pred = X @ self.weights + self.bias
residuals = y_pred - y

loss = np.mean(residuals ** 2)
self.loss_history.append(loss)

grad_w = (2 / n) * X.T @ residuals
grad_b = (2 / n) * np.sum(residuals)

if np.linalg.norm(grad_w) < self.tolerance:
if self.verbose:
print(f"Converged at step {t}")
break

# Momentum update: v = β*v + ∇L
velocity_w = self.beta * velocity_w + grad_w
velocity_b = self.beta * velocity_b + grad_b

# Weight update: w = w - η*v
self.weights -= self.lr * velocity_w
self.bias -= self.lr * velocity_b

if self.verbose and t % 200 == 0:
print(f"Iter {t:5d}: loss={loss:.6f}")

return self

def predict(self, X):
return X @ self.weights + self.bias

def score(self, X, y):
yp = self.predict(X)
return 1 - np.sum((y-yp)**2)/np.sum((y-np.mean(y))**2)


# Compare plain GD vs momentum on an ill-conditioned problem
# Create ill-conditioned data: one feature has 100x larger scale than others
np.random.seed(42)
X_ill, y_ill = make_regression(n_samples=500, n_features=5, noise=15, random_state=42)
X_ill_scaled = StandardScaler().fit_transform(X_ill)
# Artificially make it ill-conditioned by removing scaling on first feature
X_ill_scaled[:, 0] *= 20

gd_plain = BatchGradientDescent(learning_rate=0.001, n_iterations=2000)
gd_plain.fit(X_ill_scaled, y_ill)

gd_momentum = GradientDescentMomentum(learning_rate=0.001, momentum=0.9,
n_iterations=2000)
gd_momentum.fit(X_ill_scaled, y_ill)

plt.figure(figsize=(10, 4))
plt.semilogy(gd_plain.loss_history, label='Plain GD', color='blue')
plt.semilogy(gd_momentum.loss_history, label='GD + Momentum (β=0.9)', color='green')
plt.xlabel("Iteration")
plt.ylabel("MSE Loss (log scale)")
plt.title("Momentum Accelerates Convergence on Ill-Conditioned Problems")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("momentum_comparison.png", dpi=150)
plt.show()

# How many iterations for momentum to reach the same loss as plain GD at 2000 steps?
target_loss = gd_plain.loss_history[-1]
momentum_iters = next((i for i, l in enumerate(gd_momentum.loss_history)
if l < target_loss), None)
print(f"\nPlain GD final loss: {gd_plain.loss_history[-1]:.4f}")
print(f"Momentum reaches that loss at iteration: {momentum_iters}")

Learning Rate Finder Algorithm

In practice, instead of computing eigenvalues (expensive for large dd), use an empirical learning rate finder:

def find_learning_rate(X, y, lr_min=1e-5, lr_max=1.0, n_steps=100,
beta=0.98):
"""
Learning rate range test (Smith 2015, refined with exponential smoothing).
Increases lr exponentially, records smoothed loss. The optimal lr is
just before the point where smoothed loss begins rising sharply.

Parameters:
beta: exponential smoothing factor for the loss (0 = no smoothing)
"""
n, d = X.shape
lrs = np.logspace(np.log10(lr_min), np.log10(lr_max), n_steps)

weights = np.zeros(d)
bias = 0.0
smoothed_loss = 0.0
best_loss = float('inf')
recorded_losses = []
recorded_lrs = []

for i, lr in enumerate(lrs):
# One gradient step
y_pred = X @ weights + bias
residuals = y_pred - y

raw_loss = np.mean(residuals ** 2)

# Exponential smoothing to remove noise
smoothed_loss = beta * smoothed_loss + (1 - beta) * raw_loss
# Bias correction
smooth_corr = smoothed_loss / (1 - beta ** (i + 1))

if np.isnan(smooth_corr) or np.isinf(smooth_corr):
break
if smooth_corr < best_loss:
best_loss = smooth_corr
if smooth_corr > 4 * best_loss:
break # loss exploding - stop here

recorded_losses.append(smooth_corr)
recorded_lrs.append(lr)

grad_w = (2 / n) * X.T @ residuals
grad_b = (2 / n) * np.sum(residuals)

weights -= lr * grad_w
bias -= lr * grad_b

# Find the lr with steepest decline
if len(recorded_losses) > 10:
log_lrs = np.log10(recorded_lrs)
loss_arr = np.array(recorded_losses)
# Gradient of loss w.r.t. log lr
grad_loss = np.gradient(loss_arr, log_lrs)
best_idx = np.argmin(grad_loss)
suggested_lr = recorded_lrs[best_idx] / 10 # one order of magnitude below steepest

plt.figure(figsize=(8, 4))
plt.semilogx(recorded_lrs, recorded_losses, 'b-', linewidth=1.5)
plt.axvline(suggested_lr, color='r', linestyle='--',
label=f'Suggested lr = {suggested_lr:.2e}')
plt.xlabel("Learning Rate (log scale)")
plt.ylabel("Smoothed MSE Loss")
plt.title("Learning Rate Range Test\n(Find the lr just before loss starts rising)")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("lr_finder.png", dpi=150)
plt.show()

print(f"Suggested learning rate: {suggested_lr:.2e}")
return suggested_lr, recorded_lrs, recorded_losses

return None, recorded_lrs, recorded_losses


suggested_lr, _, _ = find_learning_rate(X_train_sc, y_train)

Ill-Conditioning and Preconditioning

Ill-conditioning is the root cause of slow gradient descent convergence. Preconditioning is the mathematical solution: transform the optimization problem to have a better-conditioned Hessian.

The simplest preconditioner is feature scaling (StandardScaler). More sophisticated preconditioners include:

  • Diagonal preconditioner: scale each gradient component by 1/Hjj1/H_{jj} (the inverse of the diagonal Hessian). This approximates Newton's method at low cost.
  • Adam optimizer: uses per-parameter adaptive learning rates, effectively scaling by an estimate of the RMS gradient magnitude. This is a form of diagonal preconditioning.
  • L-BFGS: stores a low-rank approximation of the inverse Hessian. Extremely effective for well-structured problems with smooth loss landscapes.
def compare_scaling_convergence(X_train, y_train, n_iterations=500):
"""
Demonstrate the dramatic effect of feature scaling on convergence.
Unscaled features → high condition number → slow convergence.
Scaled features → low condition number → fast convergence.
"""
# Unscaled features
kappa_raw, _, _ = analyze_convergence(X_train, "Unscaled")
gd_raw = BatchGradientDescent(learning_rate=1e-5, n_iterations=n_iterations)
gd_raw.fit(X_train, y_train)

# Scaled features
scaler = StandardScaler()
X_sc = scaler.fit_transform(X_train)
kappa_sc, _, _ = analyze_convergence(X_sc, "Scaled")
gd_sc = BatchGradientDescent(learning_rate=0.05, n_iterations=n_iterations)
gd_sc.fit(X_sc, y_train)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

ax = axes[0]
ax.semilogy(gd_raw.loss_history, label=f'Unscaled (κ={kappa_raw:.0f})', color='red')
ax.semilogy(gd_sc.loss_history, label=f'Scaled (κ={kappa_sc:.0f})', color='green')
ax.set_xlabel("Iteration")
ax.set_ylabel("MSE Loss (log scale)")
ax.set_title("Training Loss vs Iteration")
ax.legend()
ax.grid(True, alpha=0.3)

ax = axes[1]
ax.semilogy(gd_raw.grad_norm_history,
label=f'Unscaled (η={1e-5})', color='red')
ax.semilogy(gd_sc.grad_norm_history,
label=f'Scaled (η={0.05})', color='green')
ax.set_xlabel("Iteration")
ax.set_ylabel("Gradient Norm (log scale)")
ax.set_title("Gradient Norm vs Iteration")
ax.legend()
ax.grid(True, alpha=0.3)

plt.suptitle("Feature Scaling Dramatically Improves GD Convergence", y=1.02)
plt.tight_layout()
plt.savefig("scaling_convergence.png", dpi=150)
plt.show()


compare_scaling_convergence(X_train, y_train)

Comparing GD to the Normal Equation

import time

def compare_methods(n_samples, n_features, noise=20.0):
"""Compare gradient descent and normal equation on various sizes."""
X, y = make_regression(n_samples=n_samples, n_features=n_features,
noise=noise, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

sc = StandardScaler()
X_tr_sc = sc.fit_transform(X_tr)
X_te_sc = sc.transform(X_te)

# Normal equation (sklearn uses LAPACK SVD)
t0 = time.time()
sk = LinearRegression().fit(X_tr_sc, y_tr)
t_ne = time.time() - t0
r2_ne = sk.score(X_te_sc, y_te)

# Gradient descent
t0 = time.time()
gd = BatchGradientDescent(learning_rate=0.05, n_iterations=2000, tolerance=1e-5)
gd.fit(X_tr_sc, y_tr)
t_gd = time.time() - t0
r2_gd = gd.score(X_te_sc, y_te)

print(f"\nn={n_samples:>8,d}, d={n_features}")
print(f" Normal Equation: {t_ne:.3f}s, R²={r2_ne:.4f}")
print(f" Gradient Descent:{t_gd:.3f}s, R²={r2_gd:.4f}")
return t_ne, t_gd


compare_methods(1_000, 20)
compare_methods(10_000, 50)
compare_methods(50_000, 100)

Common Mistakes

:::danger Forgetting to scale features before gradient descent

# WRONG - features on different scales create ill-conditioned problems
model = BatchGradientDescent(lr=0.01)
model.fit(X_raw, y) # may need 100,000 iterations or diverge

# CORRECT - scale first, gradient descent works in the scaled space
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
model.fit(X_scaled, y_train)
# At inference: also scale test data with the SAME fitted scaler
X_test_scaled = scaler.transform(X_test) # NOT fit_transform!

Fitting the scaler on test data (a common mistake) introduces data leakage. :::

:::danger Learning rate too large - use a finder or start small The condition for convergence is η<2/L\eta < 2/L where LL depends on the data. Without scaling, the Lipschitz constant can be very large, requiring tiny learning rates. Always use a learning rate finder or start with η=0.01\eta = 0.01 after StandardScaling and monitor the loss curve. If it oscillates or diverges, reduce by 10x. :::

:::warning Not checking gradient norm at convergence Stopping after a fixed number of iterations without checking the gradient norm can leave the model far from the optimum. A large final gradient norm (>0.01> 0.01 for standardized data) means the model has not converged. Either increase iterations or increase the learning rate. :::

:::warning Momentum coefficient too close to 1 causes instability β=0.9\beta = 0.9 is the standard safe value for momentum. Values above 0.990.99 can cause oscillations even with a small learning rate, because the accumulated velocity takes a long time to decay when the gradient changes direction. Start with 0.90.9 and only increase if convergence is provably too slow. :::

YouTube Resources

ResourceChannelWhy Watch
Gradient Descent, Step-by-StepStatQuest with Josh StarmerClear visual derivation of the gradient update rule
MIT 18.065: Optimization and Deep LearningMIT OpenCourseWareGilbert Strang on condition number and convergence rates
Why Momentum Really Worksdistill.pub talkDeep visual intuition for Polyak momentum and Nesterov acceleration
An Introduction to OptimizationAndrej Karpathy / CS231nPractical GD variants explained for neural networks
Learning Rate Finderfast.ai / Jeremy HowardThe Smith (2015) learning rate range test explained and demonstrated

Interview Q&A

Q1: Derive the gradient of MSE loss for linear regression. What does it mean geometrically?

Starting with L(w)=1nyXw2\mathcal{L}(\mathbf{w}) = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2, expand and differentiate term by term: wL=1n(2Xy+2XXw)=2nX(Xwy)\nabla_\mathbf{w}\mathcal{L} = \frac{1}{n}(-2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\mathbf{w}) = \frac{2}{n}\mathbf{X}^\top(\mathbf{X}\mathbf{w} - \mathbf{y}).

Geometrically: Xwy\mathbf{X}\mathbf{w} - \mathbf{y} is the residual vector pointing from the target to the current prediction. X\mathbf{X}^\top projects this residual back into parameter space. The gradient is the direction in which the loss increases most steeply. Moving opposite to it (gradient descent) moves toward the projection of y\mathbf{y} onto the column space of X\mathbf{X}.

Q2: How does the condition number of XX\mathbf{X}^\top\mathbf{X} affect gradient descent convergence?

The condition number κ=λmax/λmin\kappa = \lambda_{\max}/\lambda_{\min} determines how elongated the loss bowl is. High κ\kappa means the loss surface is a narrow valley - gradient descent oscillates across the narrow dimension while making slow progress along the long dimension. The convergence rate per step is ρ=(κ1)/(κ+1)\rho = (\kappa-1)/(\kappa+1): iterations to converge scale as O(κlog(1/ϵ))O(\kappa\log(1/\epsilon)).

Feature standardization normalizes eigenvalues of XX\mathbf{X}^\top\mathbf{X} - each feature having unit variance means its contribution to the Gram matrix is similar. This reduces κ\kappa from potentially thousands to close to 1, and can reduce the required iterations by 3-4 orders of magnitude.

Q3: What learning rate would you choose? What is the Lipschitz condition?

The gradient of the MSE loss is Lipschitz continuous with constant L=2λmax(XX)/nL = 2\lambda_{\max}(\mathbf{X}^\top\mathbf{X})/n. For convergence, the learning rate must satisfy η<2/L\eta < 2/L. The optimal theoretical learning rate is η=2/(L+μ)\eta^* = 2/(L + \mu) where μ=2λmin/n\mu = 2\lambda_{\min}/n is the strong convexity constant.

In practice: (1) run a learning rate finder (exponentially increasing lr, find where loss starts rising); (2) start with η=0.01\eta = 0.01 on StandardScaled features; (3) monitor the loss curve - oscillation means η\eta is too large, very slow linear descent means η\eta is too small; (4) for production, use a cosine decay schedule with warmup.

Q4: Why is the MSE loss surface for linear regression convex? Why does this not hold for neural networks?

For linear regression, the Hessian is H=2nXX\mathbf{H} = \frac{2}{n}\mathbf{X}^\top\mathbf{X}, which is positive semi-definite (Gram matrix, all eigenvalues 0\geq 0). A positive semi-definite Hessian means the function is convex - any local minimum is the global minimum, and gradient descent converges to it.

For neural networks, the function composition through nonlinear activations (ReLU, sigmoid) creates a non-convex landscape. The Hessian has both positive and negative eigenvalues at most points. Saddle points (where the gradient is zero but the Hessian has both signs) are extremely common. Despite this, neural network training works because: (a) modern optimizers (Adam, SGD with momentum) escape saddle points via gradient noise; (b) large networks tend to have many high-quality local minima with similar loss values.

Q5: When would gradient descent outperform the normal equation even for linear regression?

Three key scenarios: (1) Large nn - gradient descent processes data in O(nd)O(nd) per iteration. The normal equation costs O(nd2)O(nd^2) to form the Gram matrix and O(d3)O(d^3) to invert. For n=107,d=500n = 10^7, d = 500: GD is 500×\sim 500\times faster per effective step. (2) Large dd - inverting the d×dd \times d Gram matrix costs O(d3)O(d^3), prohibitive for d>10,000d > 10,000. (3) Online/streaming settings - GD processes one mini-batch at a time, updating weights as new data arrives. The normal equation requires recomputing from scratch. (4) Regularized Lasso - no closed form exists; coordinate descent is the only option.

Q6: What is gradient clipping and when do you need it?

Gradient clipping rescales the gradient vector when its norm exceeds a threshold cc: gcg/g\mathbf{g} \leftarrow c \cdot \mathbf{g}/\|\mathbf{g}\|. This preserves the gradient direction but bounds its magnitude. You need it when: (1) features have very different scales and you cannot normalize (e.g., streaming data with unknown statistics); (2) the loss landscape has steep ravines (common in RNNs); (3) using a large learning rate for fast training in early epochs. For well-scaled features and linear models, clipping is usually unnecessary. For neural networks with recurrent connections, clipping with c=1.0c = 1.0 or c=5.0c = 5.0 is standard practice (Pascanu et al. 2013).

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Gradient Descent demo on the EngineersOfAI Playground - no code required.

:::

© 2026 EngineersOfAI. All rights reserved.