What is gradient descent?

Implement gradient descent for linear regression from first principles - derive the gradient, analyze the loss landscape, understand learning rate via Lipschitz constants, implement momentum, gradient clipping, and convergence analysis via condition number.

How does optimization work in practice?

Gradient Descent From Scratch covers gradient descent, optimization, machine learning from first principles with code examples. Free lesson at https://engineersofai.com/docs/ml/linear-models/gradient-descent-from-scratch

What is the difference between gradient descent and machine learning?

See the full breakdown at https://engineersofai.com/docs/ml/linear-models/gradient-descent-from-scratch

Gradient Descent From Scratch

The Real Interview Moment

You are three rounds into the ML engineer interview loop at a self-driving startup. The technical screen went well. The system design round was solid. Now the final round: a senior researcher puts this problem on the whiteboard.

"You have a dataset with 10 million samples and 500 features. You tried the normal equation and the process was killed after 2 hours. Walk me through exactly what gradient descent does differently and why it works. Derive the update rule. Explain why your learning rate matters. And tell me: why does feature scaling change everything for gradient descent but nothing for the normal equation?"

This question tests depth. The researcher does not want to hear "gradient descent takes steps in the direction of the negative gradient." They want to hear about Lipschitz constants, convergence rates, condition numbers, and the geometric relationship between the loss landscape and the feature covariance structure.

This lesson teaches you to answer that question completely. And more: to understand gradient descent well enough to debug convergence failures, choose learning rates intelligently, and explain why momentum exists.

Why Gradient Descent Exists

The normal equation gives the exact answer in one step: $\hat{\mathbf{w}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ . So why does the entire field of deep learning use gradient descent instead?

Forming $\mathbf{X}^\top\mathbf{X}$ requires $O(nd^2)$ operations. For $n = 10^7, d = 500$ : that is $2.5 \times 10^{12}$ multiply-adds. Even if you had the compute, inverting the resulting $500 \times 500$ matrix can fail numerically with near-singular features. And for neural networks, no closed form exists at all - gradient descent is the only option.

Gradient descent uses a fundamentally different strategy: take small steps in the direction of steepest descent, updating weights after each step (or each batch). Each step costs $O(nd)$ at most - and with mini-batches, just $O(Bd)$ where $B \ll n$ .

Historical Context

Cauchy proposed the gradient descent method in 1847 for solving systems of equations. The method remained a mathematical curiosity until the 1960s when Robbins and Monro (1951) formalized stochastic approximation, and the backpropagation algorithm (Rumelhart, Hinton, Williams 1986) made gradient-based training of neural networks practical. The modern understanding of gradient descent - convergence rates, condition number dependence, the role of learning rate schedules - was largely developed through the 2000s and 2010s as deep learning scaled to massive datasets.

Deriving the Gradient of MSE Loss

The Mean Squared Error (MSE) loss over $n$ training samples is:

$\mathcal{L}(\mathbf{w}) = \frac{1}{n} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 = \frac{1}{n} \sum_{i=1}^n (y_i - \mathbf{w}^\top\mathbf{x}_i)^2$

To find the gradient, apply the chain rule. Let $\mathbf{r} = \mathbf{y} - \mathbf{X}\mathbf{w}$ (the residual vector), so $\mathcal{L} = \frac{1}{n}\mathbf{r}^\top\mathbf{r}$ :

$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \frac{\partial}{\partial \mathbf{w}} \frac{1}{n} (\mathbf{y} - \mathbf{X}\mathbf{w})^\top(\mathbf{y} - \mathbf{X}\mathbf{w})$

Expanding and differentiating term by term using matrix calculus:

$\frac{\partial}{\partial\mathbf{w}} \mathbf{y}^\top\mathbf{y} = \mathbf{0}$
$\frac{\partial}{\partial\mathbf{w}} (-2\mathbf{w}^\top\mathbf{X}^\top\mathbf{y}) = -2\mathbf{X}^\top\mathbf{y}$
$\frac{\partial}{\partial\mathbf{w}} \mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w} = 2\mathbf{X}^\top\mathbf{X}\mathbf{w}$

Combining:

$\boxed{\nabla_{\mathbf{w}} \mathcal{L} = -\frac{2}{n}\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w}) = \frac{2}{n}\mathbf{X}^\top(\mathbf{X}\mathbf{w} - \mathbf{y})}$

The gradient points in the direction of steepest ascent. To minimize, we move opposite to the gradient, scaled by learning rate $\eta$ :

$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \cdot \nabla_\mathbf{w}\mathcal{L}(\mathbf{w}^{(t)}) = \mathbf{w}^{(t)} - \frac{2\eta}{n}\mathbf{X}^\top(\mathbf{X}\mathbf{w}^{(t)} - \mathbf{y})$

Step-by-Step Scalar Derivation (Intuition)

For the single-weight case, with $\hat{y}_i = w x_i$ :

$\mathcal{L}(w) = \frac{1}{n}\sum_{i=1}^n (y_i - w x_i)^2$

$\frac{d\mathcal{L}}{dw} = \frac{1}{n}\sum_{i=1}^n 2(y_i - w x_i)(-x_i) = -\frac{2}{n}\sum_{i=1}^n x_i(y_i - w x_i)$

This is the scalar version of $-\frac{2}{n}\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w})$ . Reading it: for each sample $i$ , if the prediction $w x_i$ is too low (positive residual $y_i - w x_i$ ), then the gradient is negative (telling us to increase $w$ ). This is exactly right intuitively.

The Loss Landscape for Linear Regression

For linear regression with MSE loss, the loss surface has a special structure: it is a convex quadratic bowl - smooth, with a single global minimum, no local minima, no saddle points.

The Hessian (second derivative matrix) of the MSE loss is:

$\nabla^2_\mathbf{w} \mathcal{L} = \frac{2}{n}\mathbf{X}^\top\mathbf{X}$

Since $\mathbf{X}^\top\mathbf{X}$ is a Gram matrix, it is always positive semi-definite (all eigenvalues $\geq 0$ ). For an overdetermined system ( $n > d$ with full-rank $\mathbf{X}$ ), it is strictly positive definite - the loss surface is a strictly convex bowl with a unique global minimum.

This guarantee vanishes for neural networks. The composition of nonlinear activation functions creates a highly non-convex loss surface with many saddle points and local minima. For linear models, gradient descent is provably correct and converges to the global optimum.

Geometry of the Loss Landscape

The shape of the bowl is determined by the eigenvalues of $\mathbf{X}^\top\mathbf{X}$ :

Eigenvalues all equal ( $\kappa = 1$ ): perfectly circular bowl - gradient descent converges in a straight line to the minimum. Any learning rate up to the convergence threshold works equally well.
Eigenvalues very different ( $\kappa \gg 1$ ): narrow elongated valley - gradient descent oscillates across the narrow dimension while making slow progress along the long dimension. This is why learning rate choice is critical.

The Gradient Descent Algorithm

Starting from initial weights $\mathbf{w}^{(0)}$ (typically all zeros or small random values), iterate:

$\mathbf{w}^{(t+1)} \leftarrow \mathbf{w}^{(t)} - \eta \cdot \nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w}^{(t)})$

The algorithm continues until:

The gradient norm $\|\nabla\mathcal{L}\| < \epsilon$ (convergence criterion)
A maximum number of iterations is reached
The loss stops decreasing (loss difference $< \delta$ )

Learning Rate Sensitivity: The Lipschitz Constant

The Lipschitz constant $L$ of the gradient is the key quantity governing learning rate choice. For the MSE loss:

$L = \frac{2\lambda_{\max}(\mathbf{X}^\top\mathbf{X})}{n}$

where $\lambda_{\max}$ is the largest eigenvalue of $\mathbf{X}^\top\mathbf{X}$ .

The gradient descent converges if and only if the learning rate satisfies:

$\eta < \frac{2}{L} = \frac{n}{\lambda_{\max}(\mathbf{X}^\top\mathbf{X})}$

For $\eta$ above this threshold, the weight update overshoots the minimum in the direction of the dominant eigenvector and diverges. For $\eta$ below, it converges - but slowly if $\eta \ll 2/L$ .

The theoretically optimal learning rate is:

$\eta^* = \frac{2}{\lambda_{\max} + \lambda_{\min}} = \frac{2}{L + \mu}$

where $\mu = 2\lambda_{\min}/n$ is the strong convexity constant. This balances between fast progress (using large steps) and stability (not overshooting).

Full NumPy Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression


class BatchGradientDescent:
    """
    Linear regression via batch gradient descent.
    Uses all n training samples per gradient step.
    Includes convergence history and diagnostic methods.
    """

    def __init__(self, learning_rate=0.01, n_iterations=1000,
                 tolerance=1e-6, verbose=False):
        self.lr = learning_rate
        self.n_iterations = n_iterations
        self.tolerance = tolerance
        self.verbose = verbose
        self.weights = None
        self.bias = None
        self.loss_history = []
        self.grad_norm_history = []

    def fit(self, X, y):
        n, d = X.shape
        # Initialize at zero (safe for convex problems)
        self.weights = np.zeros(d)
        self.bias = 0.0

        for iteration in range(self.n_iterations):
            # Forward pass: compute predictions and residuals
            y_pred = X @ self.weights + self.bias
            residuals = y_pred - y  # shape (n,) - prediction error

            # MSE loss for monitoring
            loss = np.mean(residuals ** 2)
            self.loss_history.append(loss)

            # Gradients: ∇w L = (2/n) X^T (Xw - y), ∇b L = (2/n) sum(residuals)
            grad_w = (2 / n) * X.T @ residuals  # shape (d,)
            grad_b = (2 / n) * np.sum(residuals)  # scalar

            grad_norm = np.linalg.norm(grad_w)
            self.grad_norm_history.append(grad_norm)

            if self.verbose and iteration % 100 == 0:
                print(f"Iter {iteration:5d} | Loss: {loss:.6f} | "
                      f"|∇L|: {grad_norm:.6f} | lr: {self.lr:.4f}")

            # Update weights (gradient descent step)
            self.weights -= self.lr * grad_w
            self.bias    -= self.lr * grad_b

            # Convergence check: gradient norm below tolerance
            if grad_norm < self.tolerance:
                if self.verbose:
                    print(f"Converged at iteration {iteration}")
                break

        return self

    def predict(self, X):
        return X @ self.weights + self.bias

    def score(self, X, y):
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - ss_res / ss_tot


# Setup dataset
np.random.seed(42)
X, y = make_regression(n_samples=2000, n_features=10, noise=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Always scale before gradient descent
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

model = BatchGradientDescent(learning_rate=0.05, n_iterations=3000, verbose=True)
model.fit(X_train_sc, y_train)

print(f"\nTrain R²: {model.score(X_train_sc, y_train):.4f}")
print(f"Test  R²: {model.score(X_test_sc,  y_test):.4f}")

# Compare to normal equation
sk = LinearRegression().fit(X_train_sc, y_train)
print(f"sklearn (OLS) Test R²: {sk.score(X_test_sc, y_test):.4f}")

Learning Rate Sensitivity Experiment

def run_gd_with_lr(X_train, y_train, learning_rates, n_iterations=300):
    """Compare gradient descent trajectories for different learning rates."""
    results = {}

    for lr in learning_rates:
        model = BatchGradientDescent(
            learning_rate=lr,
            n_iterations=n_iterations,
            tolerance=1e-12  # disable early stopping for comparison
        )
        model.fit(X_train, y_train)
        results[lr] = model.loss_history

    return results


learning_rates = [0.001, 0.01, 0.05, 0.1, 0.3, 0.8]
results = run_gd_with_lr(X_train_sc, y_train, learning_rates)

plt.figure(figsize=(12, 5))
for lr, losses in results.items():
    has_nan = any(np.isnan(l) or np.isinf(l) for l in losses)
    label = f"η={lr}" + (" (DIVERGED)" if has_nan else "")
    style = '--' if has_nan else '-'
    clean_losses = [l for l in losses if not (np.isnan(l) or np.isinf(l))]
    plt.semilogy(clean_losses, linestyle=style, label=label)

plt.xlabel("Iteration")
plt.ylabel("MSE Loss (log scale)")
plt.title("Learning Rate Sensitivity\n(Gradient Descent on Scaled Features)")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("lr_sensitivity.png", dpi=150)
plt.show()

# Summary table
print("\nLearning Rate Behavior Summary:")
print(f"{'η':>8} | {'Final Loss':>12} | {'Status':>15}")
print("-" * 40)
for lr, losses in results.items():
    final = losses[-1]
    if np.isnan(final) or np.isinf(final):
        status = "DIVERGED"
        final_str = "inf"
    elif final > 1e6:
        status = "Oscillating"
        final_str = f"{final:.2e}"
    else:
        status = "Converged"
        final_str = f"{final:.4f}"
    print(f"{lr:>8} | {final_str:>12} | {status:>15}")

Convergence Analysis: The Condition Number

Why do some datasets converge in 100 iterations and others take 100,000? The answer is the condition number of $\mathbf{X}^\top\mathbf{X}$ .

$\kappa(\mathbf{X}^\top\mathbf{X}) = \frac{\lambda_{\max}}{\lambda_{\min}}$

The convergence rate of gradient descent (with optimal learning rate) is:

$\|\mathbf{w}^{(T)} - \mathbf{w}^*\| \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^T \|\mathbf{w}^{(0)} - \mathbf{w}^*\|$

The factor $\rho = (\kappa-1)/(\kappa+1)$ is the convergence rate per iteration:

$\kappa = 1$ : $\rho = 0$ - converges in 1 iteration (the exact Newton step)
$\kappa = 10$ : $\rho = 0.818$ - needs ~25 iterations to reduce error by $10^{-2}$
$\kappa = 100$ : $\rho = 0.980$ - needs ~230 iterations to reduce error by $10^{-2}$
$\kappa = 1000$ : $\rho = 0.998$ - needs ~2300 iterations to reduce error by $10^{-2}$

For Lipschitz-smooth convex objectives (which includes MSE for linear regression), the convergence gap after $T$ iterations satisfies:

$\mathcal{L}(\mathbf{w}^{(T)}) - \mathcal{L}(\mathbf{w}^*) \leq \frac{L \|\mathbf{w}^{(0)} - \mathbf{w}^*\|^2}{2T}$

This is the $O(1/T)$ convergence rate for convex Lipschitz-smooth objectives.

def analyze_convergence(X, label=""):
    """Analyze the condition number and theoretical convergence properties."""
    gram = X.T @ X  # X^T X: d × d matrix
    eigenvalues = np.linalg.eigvalsh(gram)  # symmetric → use eigvalsh
    eigenvalues = np.sort(eigenvalues)
    # Filter out numerical zeros
    pos_eigs = eigenvalues[eigenvalues > 1e-10]

    lambda_max = pos_eigs.max()
    lambda_min = pos_eigs.min()
    kappa = lambda_max / lambda_min
    n = X.shape[0]

    # Optimal learning rate for batch GD
    alpha_opt = 2 * n / (lambda_max + lambda_min)

    # Lipschitz constant of the gradient
    L = 2 * lambda_max / n
    mu = 2 * lambda_min / n  # strong convexity constant

    # Convergence rate at optimal lr
    rho = (kappa - 1) / (kappa + 1)

    # Iterations to reach epsilon = 1e-6 accuracy
    eps = 1e-6
    iters_needed = np.log(1 / eps) / np.log(1 / rho) if rho < 1 else float('inf')

    print(f"\n=== Convergence Analysis: {label} ===")
    print(f"  λ_max = {lambda_max:.2f},  λ_min = {lambda_min:.6f}")
    print(f"  Condition number κ = {kappa:.1f}")
    print(f"  Lipschitz constant L = {L:.6f}")
    print(f"  Strong convexity μ = {mu:.6f}")
    print(f"  Optimal learning rate η* = {alpha_opt:.6f}")
    print(f"  Convergence rate ρ = {rho:.6f}")
    print(f"  Iterations for 1e-6 accuracy: ~{int(iters_needed)}")

    return kappa, alpha_opt, rho


# Well-conditioned: scaled features
kappa_good, lr_good, rho_good = analyze_convergence(X_train_sc, "Scaled features")

# Poorly conditioned: one feature on a very different scale
X_bad = X_train.copy()
X_bad[:, 0] *= 1000  # massive scale difference
kappa_bad, lr_bad, rho_bad = analyze_convergence(X_bad, "Unscaled features")

print(f"\nCondition number increase from bad scaling: {kappa_bad / kappa_good:.1f}x")
print(f"Extra iterations needed: {rho_bad:.4f} vs {rho_good:.4f} convergence rate")

Gradient Clipping

For models where gradients can explode (large datasets, poorly scaled features, neural networks), gradient clipping prevents divergence:

$\text{If } \|\mathbf{g}\| > c, \text{ then } \mathbf{g} \leftarrow c \cdot \frac{\mathbf{g}}{\|\mathbf{g}\|}$

This rescales the gradient to have norm exactly $c$ when it exceeds the threshold, preserving the direction but bounding the magnitude.

class GradientDescentWithClipping:
    """
    Gradient descent with gradient norm clipping and loss curve plotting.
    Gradient clipping prevents divergence in poorly conditioned problems.
    """

    def __init__(self, learning_rate=0.01, n_iterations=1000,
                 clip_norm=None, tolerance=1e-6):
        self.lr = learning_rate
        self.n_iterations = n_iterations
        self.clip_norm = clip_norm  # None = no clipping
        self.tolerance = tolerance
        self.weights = None
        self.bias = None
        self.loss_history = []
        self.clipped_count = 0

    def _clip_gradient(self, grad):
        """Apply gradient norm clipping if clip_norm is set."""
        if self.clip_norm is None:
            return grad
        norm = np.linalg.norm(grad)
        if norm > self.clip_norm:
            self.clipped_count += 1
            return self.clip_norm * grad / norm
        return grad

    def fit(self, X, y):
        n, d = X.shape
        self.weights = np.zeros(d)
        self.bias = 0.0
        self.clipped_count = 0

        for t in range(self.n_iterations):
            y_pred = X @ self.weights + self.bias
            residuals = y_pred - y

            loss = np.mean(residuals ** 2)
            self.loss_history.append(loss)

            grad_w = (2 / n) * X.T @ residuals
            grad_b = (2 / n) * np.sum(residuals)

            # Apply clipping before the update step
            grad_w = self._clip_gradient(grad_w)

            if np.linalg.norm(grad_w) < self.tolerance:
                break

            self.weights -= self.lr * grad_w
            self.bias    -= self.lr * grad_b

        return self

    def predict(self, X):
        return X @ self.weights + self.bias

    def score(self, X, y):
        y_pred = self.predict(X)
        return 1 - np.sum((y - y_pred)**2) / np.sum((y - np.mean(y))**2)


# Demonstrate: large learning rate without clipping diverges; with clipping it recovers
large_lr = 0.4  # above the convergence threshold for this data

gd_no_clip = GradientDescentWithClipping(learning_rate=large_lr, n_iterations=200)
gd_no_clip.fit(X_train_sc, y_train)

gd_clip = GradientDescentWithClipping(learning_rate=large_lr, n_iterations=200,
                                       clip_norm=10.0)
gd_clip.fit(X_train_sc, y_train)

print(f"\nWith lr={large_lr}:")
print(f"  No clipping: final loss = {gd_no_clip.loss_history[-1]:.2e}")
print(f"  With clipping (c=10): final loss = {gd_clip.loss_history[-1]:.2f}")
print(f"  Gradients clipped: {gd_clip.clipped_count} times")

plt.figure(figsize=(10, 4))
valid_no_clip = [l for l in gd_no_clip.loss_history if not np.isinf(l) and not np.isnan(l)]
plt.semilogy(range(len(valid_no_clip)), valid_no_clip,
             label=f'No clipping (lr={large_lr})', color='red')
plt.semilogy(gd_clip.loss_history,
             label=f'Clipping c=10 (lr={large_lr})', color='green')
plt.xlabel("Iteration")
plt.ylabel("MSE Loss (log scale)")
plt.title("Effect of Gradient Clipping with Large Learning Rate")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("gradient_clipping.png", dpi=150)
plt.show()

Momentum: Accelerating Gradient Descent

Gradient descent on an elongated loss landscape oscillates across the narrow dimension. Momentum (Polyak 1964) accumulates a velocity vector that damps oscillations and accelerates progress along consistent directions:

$\mathbf{v}^{(t+1)} = \beta \mathbf{v}^{(t)} + \nabla_\mathbf{w}\mathcal{L}(\mathbf{w}^{(t)})$

$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \mathbf{v}^{(t+1)}$

The momentum coefficient $\beta \in [0, 1)$ (typically $0.9$ ) controls how much past gradients are retained. With $\beta = 0$ , this reduces to standard gradient descent. With $\beta = 0.9$ , each velocity update is a weighted sum of all previous gradients, with the most recent weighted at 1 and gradients from $k$ steps ago weighted at $\beta^k$ .

Why does this help? In the narrow valley case:

Across the valley (oscillating direction): gradients flip sign every step. The momentum vector for this direction averages to near zero - damping the oscillations.
Along the valley (slow progress direction): gradients consistently point in the same direction. Momentum accumulates, accelerating progress.

class GradientDescentMomentum:
    """
    Gradient descent with Polyak momentum.
    Accelerates convergence on ill-conditioned loss landscapes.
    """

    def __init__(self, learning_rate=0.01, momentum=0.9, n_iterations=1000,
                 tolerance=1e-6, verbose=False):
        self.lr = learning_rate
        self.beta = momentum
        self.n_iterations = n_iterations
        self.tolerance = tolerance
        self.verbose = verbose
        self.loss_history = []

    def fit(self, X, y):
        n, d = X.shape
        self.weights = np.zeros(d)
        self.bias = 0.0
        velocity_w = np.zeros(d)  # accumulated gradient (velocity)
        velocity_b = 0.0

        for t in range(self.n_iterations):
            y_pred = X @ self.weights + self.bias
            residuals = y_pred - y

            loss = np.mean(residuals ** 2)
            self.loss_history.append(loss)

            grad_w = (2 / n) * X.T @ residuals
            grad_b = (2 / n) * np.sum(residuals)

            if np.linalg.norm(grad_w) < self.tolerance:
                if self.verbose:
                    print(f"Converged at step {t}")
                break

            # Momentum update: v = β*v + ∇L
            velocity_w = self.beta * velocity_w + grad_w
            velocity_b = self.beta * velocity_b + grad_b

            # Weight update: w = w - η*v
            self.weights -= self.lr * velocity_w
            self.bias    -= self.lr * velocity_b

            if self.verbose and t % 200 == 0:
                print(f"Iter {t:5d}: loss={loss:.6f}")

        return self

    def predict(self, X):
        return X @ self.weights + self.bias

    def score(self, X, y):
        yp = self.predict(X)
        return 1 - np.sum((y-yp)**2)/np.sum((y-np.mean(y))**2)


# Compare plain GD vs momentum on an ill-conditioned problem
# Create ill-conditioned data: one feature has 100x larger scale than others
np.random.seed(42)
X_ill, y_ill = make_regression(n_samples=500, n_features=5, noise=15, random_state=42)
X_ill_scaled = StandardScaler().fit_transform(X_ill)
# Artificially make it ill-conditioned by removing scaling on first feature
X_ill_scaled[:, 0] *= 20

gd_plain = BatchGradientDescent(learning_rate=0.001, n_iterations=2000)
gd_plain.fit(X_ill_scaled, y_ill)

gd_momentum = GradientDescentMomentum(learning_rate=0.001, momentum=0.9,
                                       n_iterations=2000)
gd_momentum.fit(X_ill_scaled, y_ill)

plt.figure(figsize=(10, 4))
plt.semilogy(gd_plain.loss_history, label='Plain GD', color='blue')
plt.semilogy(gd_momentum.loss_history, label='GD + Momentum (β=0.9)', color='green')
plt.xlabel("Iteration")
plt.ylabel("MSE Loss (log scale)")
plt.title("Momentum Accelerates Convergence on Ill-Conditioned Problems")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("momentum_comparison.png", dpi=150)
plt.show()

# How many iterations for momentum to reach the same loss as plain GD at 2000 steps?
target_loss = gd_plain.loss_history[-1]
momentum_iters = next((i for i, l in enumerate(gd_momentum.loss_history)
                       if l < target_loss), None)
print(f"\nPlain GD final loss: {gd_plain.loss_history[-1]:.4f}")
print(f"Momentum reaches that loss at iteration: {momentum_iters}")

Learning Rate Finder Algorithm

In practice, instead of computing eigenvalues (expensive for large $d$ ), use an empirical learning rate finder:

def find_learning_rate(X, y, lr_min=1e-5, lr_max=1.0, n_steps=100,
                        beta=0.98):
    """
    Learning rate range test (Smith 2015, refined with exponential smoothing).
    Increases lr exponentially, records smoothed loss. The optimal lr is
    just before the point where smoothed loss begins rising sharply.

    Parameters:
        beta: exponential smoothing factor for the loss (0 = no smoothing)
    """
    n, d = X.shape
    lrs = np.logspace(np.log10(lr_min), np.log10(lr_max), n_steps)

    weights = np.zeros(d)
    bias = 0.0
    smoothed_loss = 0.0
    best_loss = float('inf')
    recorded_losses = []
    recorded_lrs = []

    for i, lr in enumerate(lrs):
        # One gradient step
        y_pred = X @ weights + bias
        residuals = y_pred - y

        raw_loss = np.mean(residuals ** 2)

        # Exponential smoothing to remove noise
        smoothed_loss = beta * smoothed_loss + (1 - beta) * raw_loss
        # Bias correction
        smooth_corr = smoothed_loss / (1 - beta ** (i + 1))

        if np.isnan(smooth_corr) or np.isinf(smooth_corr):
            break
        if smooth_corr < best_loss:
            best_loss = smooth_corr
        if smooth_corr > 4 * best_loss:
            break  # loss exploding - stop here

        recorded_losses.append(smooth_corr)
        recorded_lrs.append(lr)

        grad_w = (2 / n) * X.T @ residuals
        grad_b = (2 / n) * np.sum(residuals)

        weights -= lr * grad_w
        bias    -= lr * grad_b

    # Find the lr with steepest decline
    if len(recorded_losses) > 10:
        log_lrs = np.log10(recorded_lrs)
        loss_arr = np.array(recorded_losses)
        # Gradient of loss w.r.t. log lr
        grad_loss = np.gradient(loss_arr, log_lrs)
        best_idx = np.argmin(grad_loss)
        suggested_lr = recorded_lrs[best_idx] / 10  # one order of magnitude below steepest

        plt.figure(figsize=(8, 4))
        plt.semilogx(recorded_lrs, recorded_losses, 'b-', linewidth=1.5)
        plt.axvline(suggested_lr, color='r', linestyle='--',
                    label=f'Suggested lr = {suggested_lr:.2e}')
        plt.xlabel("Learning Rate (log scale)")
        plt.ylabel("Smoothed MSE Loss")
        plt.title("Learning Rate Range Test\n(Find the lr just before loss starts rising)")
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.savefig("lr_finder.png", dpi=150)
        plt.show()

        print(f"Suggested learning rate: {suggested_lr:.2e}")
        return suggested_lr, recorded_lrs, recorded_losses

    return None, recorded_lrs, recorded_losses


suggested_lr, _, _ = find_learning_rate(X_train_sc, y_train)

Ill-Conditioning and Preconditioning

Ill-conditioning is the root cause of slow gradient descent convergence. Preconditioning is the mathematical solution: transform the optimization problem to have a better-conditioned Hessian.

The simplest preconditioner is feature scaling (StandardScaler). More sophisticated preconditioners include:

Diagonal preconditioner: scale each gradient component by $1/H_{jj}$ (the inverse of the diagonal Hessian). This approximates Newton's method at low cost.
Adam optimizer: uses per-parameter adaptive learning rates, effectively scaling by an estimate of the RMS gradient magnitude. This is a form of diagonal preconditioning.
L-BFGS: stores a low-rank approximation of the inverse Hessian. Extremely effective for well-structured problems with smooth loss landscapes.

def compare_scaling_convergence(X_train, y_train, n_iterations=500):
    """
    Demonstrate the dramatic effect of feature scaling on convergence.
    Unscaled features → high condition number → slow convergence.
    Scaled features → low condition number → fast convergence.
    """
    # Unscaled features
    kappa_raw, _, _ = analyze_convergence(X_train, "Unscaled")
    gd_raw = BatchGradientDescent(learning_rate=1e-5, n_iterations=n_iterations)
    gd_raw.fit(X_train, y_train)

    # Scaled features
    scaler = StandardScaler()
    X_sc = scaler.fit_transform(X_train)
    kappa_sc, _, _ = analyze_convergence(X_sc, "Scaled")
    gd_sc = BatchGradientDescent(learning_rate=0.05, n_iterations=n_iterations)
    gd_sc.fit(X_sc, y_train)

    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    ax = axes[0]
    ax.semilogy(gd_raw.loss_history, label=f'Unscaled (κ={kappa_raw:.0f})', color='red')
    ax.semilogy(gd_sc.loss_history,  label=f'Scaled (κ={kappa_sc:.0f})',   color='green')
    ax.set_xlabel("Iteration")
    ax.set_ylabel("MSE Loss (log scale)")
    ax.set_title("Training Loss vs Iteration")
    ax.legend()
    ax.grid(True, alpha=0.3)

    ax = axes[1]
    ax.semilogy(gd_raw.grad_norm_history,
                label=f'Unscaled (η={1e-5})', color='red')
    ax.semilogy(gd_sc.grad_norm_history,
                label=f'Scaled (η={0.05})', color='green')
    ax.set_xlabel("Iteration")
    ax.set_ylabel("Gradient Norm (log scale)")
    ax.set_title("Gradient Norm vs Iteration")
    ax.legend()
    ax.grid(True, alpha=0.3)

    plt.suptitle("Feature Scaling Dramatically Improves GD Convergence", y=1.02)
    plt.tight_layout()
    plt.savefig("scaling_convergence.png", dpi=150)
    plt.show()


compare_scaling_convergence(X_train, y_train)

Comparing GD to the Normal Equation

import time

def compare_methods(n_samples, n_features, noise=20.0):
    """Compare gradient descent and normal equation on various sizes."""
    X, y = make_regression(n_samples=n_samples, n_features=n_features,
                           noise=noise, random_state=42)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

    sc = StandardScaler()
    X_tr_sc = sc.fit_transform(X_tr)
    X_te_sc  = sc.transform(X_te)

    # Normal equation (sklearn uses LAPACK SVD)
    t0 = time.time()
    sk = LinearRegression().fit(X_tr_sc, y_tr)
    t_ne = time.time() - t0
    r2_ne = sk.score(X_te_sc, y_te)

    # Gradient descent
    t0 = time.time()
    gd = BatchGradientDescent(learning_rate=0.05, n_iterations=2000, tolerance=1e-5)
    gd.fit(X_tr_sc, y_tr)
    t_gd = time.time() - t0
    r2_gd = gd.score(X_te_sc, y_te)

    print(f"\nn={n_samples:>8,d}, d={n_features}")
    print(f"  Normal Equation: {t_ne:.3f}s, R²={r2_ne:.4f}")
    print(f"  Gradient Descent:{t_gd:.3f}s, R²={r2_gd:.4f}")
    return t_ne, t_gd


compare_methods(1_000, 20)
compare_methods(10_000, 50)
compare_methods(50_000, 100)

Common Mistakes

:::danger Forgetting to scale features before gradient descent

# WRONG - features on different scales create ill-conditioned problems
model = BatchGradientDescent(lr=0.01)
model.fit(X_raw, y)  # may need 100,000 iterations or diverge

# CORRECT - scale first, gradient descent works in the scaled space
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
model.fit(X_scaled, y_train)
# At inference: also scale test data with the SAME fitted scaler
X_test_scaled = scaler.transform(X_test)  # NOT fit_transform!

Fitting the scaler on test data (a common mistake) introduces data leakage. :::

:::danger Learning rate too large - use a finder or start small The condition for convergence is $\eta < 2/L$ where $L$ depends on the data. Without scaling, the Lipschitz constant can be very large, requiring tiny learning rates. Always use a learning rate finder or start with $\eta = 0.01$ after StandardScaling and monitor the loss curve. If it oscillates or diverges, reduce by 10x. :::

:::warning Not checking gradient norm at convergence Stopping after a fixed number of iterations without checking the gradient norm can leave the model far from the optimum. A large final gradient norm ( $> 0.01$ for standardized data) means the model has not converged. Either increase iterations or increase the learning rate. :::

:::warning Momentum coefficient too close to 1 causes instability $\beta = 0.9$ is the standard safe value for momentum. Values above $0.99$ can cause oscillations even with a small learning rate, because the accumulated velocity takes a long time to decay when the gradient changes direction. Start with $0.9$ and only increase if convergence is provably too slow. :::

YouTube Resources

Resource	Channel	Why Watch
Gradient Descent, Step-by-Step	StatQuest with Josh Starmer	Clear visual derivation of the gradient update rule
MIT 18.065: Optimization and Deep Learning	MIT OpenCourseWare	Gilbert Strang on condition number and convergence rates
Why Momentum Really Works	distill.pub talk	Deep visual intuition for Polyak momentum and Nesterov acceleration
An Introduction to Optimization	Andrej Karpathy / CS231n	Practical GD variants explained for neural networks
Learning Rate Finder	fast.ai / Jeremy Howard	The Smith (2015) learning rate range test explained and demonstrated

Interview Q&A

Q1: Derive the gradient of MSE loss for linear regression. What does it mean geometrically?

Starting with $\mathcal{L}(\mathbf{w}) = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$ , expand and differentiate term by term: $\nabla_\mathbf{w}\mathcal{L} = \frac{1}{n}(-2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{X}\mathbf{w}) = \frac{2}{n}\mathbf{X}^\top(\mathbf{X}\mathbf{w} - \mathbf{y})$ .

Geometrically: $\mathbf{X}\mathbf{w} - \mathbf{y}$ is the residual vector pointing from the target to the current prediction. $\mathbf{X}^\top$ projects this residual back into parameter space. The gradient is the direction in which the loss increases most steeply. Moving opposite to it (gradient descent) moves toward the projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$ .

Q2: How does the condition number of $\mathbf{X}^\top\mathbf{X}$ affect gradient descent convergence?

The condition number $\kappa = \lambda_{\max}/\lambda_{\min}$ determines how elongated the loss bowl is. High $\kappa$ means the loss surface is a narrow valley - gradient descent oscillates across the narrow dimension while making slow progress along the long dimension. The convergence rate per step is $\rho = (\kappa-1)/(\kappa+1)$ : iterations to converge scale as $O(\kappa\log(1/\epsilon))$ .

Feature standardization normalizes eigenvalues of $\mathbf{X}^\top\mathbf{X}$ - each feature having unit variance means its contribution to the Gram matrix is similar. This reduces $\kappa$ from potentially thousands to close to 1, and can reduce the required iterations by 3-4 orders of magnitude.

Q3: What learning rate would you choose? What is the Lipschitz condition?

The gradient of the MSE loss is Lipschitz continuous with constant $L = 2\lambda_{\max}(\mathbf{X}^\top\mathbf{X})/n$ . For convergence, the learning rate must satisfy $\eta < 2/L$ . The optimal theoretical learning rate is $\eta^* = 2/(L + \mu)$ where $\mu = 2\lambda_{\min}/n$ is the strong convexity constant.

In practice: (1) run a learning rate finder (exponentially increasing lr, find where loss starts rising); (2) start with $\eta = 0.01$ on StandardScaled features; (3) monitor the loss curve - oscillation means $\eta$ is too large, very slow linear descent means $\eta$ is too small; (4) for production, use a cosine decay schedule with warmup.

Q4: Why is the MSE loss surface for linear regression convex? Why does this not hold for neural networks?

For linear regression, the Hessian is $\mathbf{H} = \frac{2}{n}\mathbf{X}^\top\mathbf{X}$ , which is positive semi-definite (Gram matrix, all eigenvalues $\geq 0$ ). A positive semi-definite Hessian means the function is convex - any local minimum is the global minimum, and gradient descent converges to it.

For neural networks, the function composition through nonlinear activations (ReLU, sigmoid) creates a non-convex landscape. The Hessian has both positive and negative eigenvalues at most points. Saddle points (where the gradient is zero but the Hessian has both signs) are extremely common. Despite this, neural network training works because: (a) modern optimizers (Adam, SGD with momentum) escape saddle points via gradient noise; (b) large networks tend to have many high-quality local minima with similar loss values.

Q5: When would gradient descent outperform the normal equation even for linear regression?

Three key scenarios: (1) Large $n$ - gradient descent processes data in $O(nd)$ per iteration. The normal equation costs $O(nd^2)$ to form the Gram matrix and $O(d^3)$ to invert. For $n = 10^7, d = 500$ : GD is $\sim 500\times$ faster per effective step. (2) Large $d$ - inverting the $d \times d$ Gram matrix costs $O(d^3)$ , prohibitive for $d > 10,000$ . (3) Online/streaming settings - GD processes one mini-batch at a time, updating weights as new data arrives. The normal equation requires recomputing from scratch. (4) Regularized Lasso - no closed form exists; coordinate descent is the only option.

Q6: What is gradient clipping and when do you need it?

Gradient clipping rescales the gradient vector when its norm exceeds a threshold $c$ : $\mathbf{g} \leftarrow c \cdot \mathbf{g}/\|\mathbf{g}\|$ . This preserves the gradient direction but bounds its magnitude. You need it when: (1) features have very different scales and you cannot normalize (e.g., streaming data with unknown statistics); (2) the loss landscape has steep ravines (common in RNNs); (3) using a large learning rate for fast training in early epochs. For well-scaled features and linear models, clipping is usually unnecessary. For neural networks with recurrent connections, clipping with $c = 1.0$ or $c = 5.0$ is standard practice (Pascanu et al. 2013).

:::tip 🎮 Interactive Playground

Visualize this concept: Try the Gradient Descent demo on the EngineersOfAI Playground - no code required.

:::

The Real Interview Moment​

Why Gradient Descent Exists​

Historical Context​

Deriving the Gradient of MSE Loss​

Step-by-Step Scalar Derivation (Intuition)​

The Loss Landscape for Linear Regression​

Geometry of the Loss Landscape​

The Gradient Descent Algorithm​

Learning Rate Sensitivity: The Lipschitz Constant​

Full NumPy Implementation​

Learning Rate Sensitivity Experiment​

Convergence Analysis: The Condition Number​

Gradient Clipping​

Momentum: Accelerating Gradient Descent​

Learning Rate Finder Algorithm​

Ill-Conditioning and Preconditioning​

Comparing GD to the Normal Equation​

Common Mistakes​

YouTube Resources​

Interview Q&A​