Derivatives and Gradients - The Compass of Training

Reading time: ~24 minutes | Level: Mathematical Foundations → ML Engineering

Your model has just finished its first epoch. The training loss is 2.47. You call loss.backward() and then optimizer.step(). The loss drops to 2.31.

What just happened? PyTorch computed the gradient: a vector with one number per parameter, each number telling the optimizer, "if you increase this parameter slightly, the loss changes at this rate" (positive means the loss rises, negative means it falls). The optimizer used that compass reading to take a small step in the direction that reduces loss.

Nearly every neural network trained today follows this pattern. The gradient is the compass. Derivatives are what the compass reads.

What You Will Learn

The formal definition of a derivative and what it means geometrically
Partial derivatives: extending derivatives to multi-variable functions
The gradient vector: the direction of steepest ascent in parameter space
Directional derivatives: rate of change in any chosen direction
The Jacobian matrix: gradients for vector-valued functions
Computing gradients in NumPy and understanding what PyTorch .grad contains
Where gradients appear in real ML systems

Prerequisites

Basic algebra and function notation (f(x), y = mx + b)
Python and NumPy arrays
No prior calculus required

Part 1 - The Derivative: Rate of Change

What a derivative measures

A derivative measures how much a function's output changes when its input changes by a tiny amount.

Formally, the derivative of f at point x is:

$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$

Read this as: "Take a tiny step h in the input. Measure the change in output. Divide by h to get the rate. Shrink h to zero."

The result f'(x) is the instantaneous rate of change at x.
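The limit can be watched numerically. The sketch below is only an illustration: the function `square`, the point `x0 = 3.0`, and the list of step sizes are arbitrary choices, not part of the definition. It evaluates the difference quotient for shrinking h and shows it approaching the known derivative of x², which is 2x.

```python
def difference_quotient(f, x, h):
    """Average rate of change of f between x and x + h."""
    return (f(x + h) - f(x)) / h


def square(x):
    return x ** 2


x0 = 3.0  # analytically, d/dx x^2 = 2x, so f'(3) = 6
steps = [1.0, 0.1, 0.01, 0.001]

# Evaluate the quotient for smaller and smaller h
quotients = [difference_quotient(square, x0, h) for h in steps]

for h, q in zip(steps, quotients):
    print(f"h = {h:<6} quotient = {q:.6f}")
```

For this particular function the quotient works out to 6 + h, so each smaller h lands closer to 6.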

Geometric interpretation

Geometrically, the derivative is the slope of the tangent line to the function's curve at point x:

f(x)
│
│         ╱ ← tangent line at x₀
│        ╱    slope = f'(x₀)
│   ╱╲  ╱
│  ╱  ╲╱
│ ╱    ●  ← point (x₀, f(x₀))
│╱
└────────────── x
              x₀

If f'(x) > 0: the function is increasing at x (slope goes up)
If f'(x) < 0: the function is decreasing at x (slope goes down)
If f'(x) = 0: the function is flat at x - a potential minimum or maximum

A minimum must be one of these flat points, which is why optimization looks for them. Gradient descent steps against the sign of f'(x) and slows to a stop where f'(x) ≈ 0.
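The three cases above can be sketched on a concrete curve. The example assumes f(x) = x³ - 3x, chosen only because it has one flat point of each kind. To label the flat points it uses the sign of the second derivative, a test this section has not introduced, so treat that part as an extra assumption.

```python
def f(x):
    return x ** 3 - 3 * x


def f_prime(x):
    # Power rule: d/dx (x^3 - 3x) = 3x^2 - 3
    return 3 * x ** 2 - 3


def f_second(x):
    # Derivative of f_prime, used only to classify flat points
    return 6 * x


def describe(x):
    """Classify the behaviour of f at x from the sign of its derivative."""
    slope = f_prime(x)
    if slope > 0:
        return "increasing"
    if slope < 0:
        return "decreasing"
    # slope == 0: flat point, could be a minimum or a maximum
    curvature = f_second(x)
    if curvature > 0:
        return "local minimum"
    if curvature < 0:
        return "local maximum"
    return "flat (inconclusive)"


for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(f"x = {x:+.1f}  f'(x) = {f_prime(x):+.1f}  {describe(x)}")
```

Both x = -1 and x = 1 have zero slope, but only x = 1 is a minimum. A zero derivative alone does not say which kind of flat point you have reached.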

Key derivatives you need

Function f(x)	Derivative f'(x)	Where it appears in ML
xⁿ	nxⁿ⁻¹	Polynomial activations
eˣ	eˣ	Softmax, normalization
ln(x)	1/x	Cross-entropy loss
sigmoid(x) = 1/(1+e⁻ˣ)	σ(x)(1-σ(x))	Binary classification
ReLU(x) = max(0,x)	0 if x<0, 1 if x>0 (undefined at x=0)	Most modern networks
tanh(x)	1 - tanh²(x)	RNNs, LSTM gates

import numpy as np

# Verify derivative of sigmoid numerically
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative_analytical(x):
    s = sigmoid(x)
    return s * (1 - s)

def numerical_derivative(f, x, h=1e-5):
    """Finite difference approximation of derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)  # central differences

x = np.array([0.0, 1.0, -1.0, 2.0])

analytical = sigmoid_derivative_analytical(x)
numerical = numerical_derivative(sigmoid, x)

print("x:          ", x)
print("Analytical: ", np.round(analytical, 6))
print("Numerical:  ", np.round(numerical, 6))
print("Max error:  ", np.max(np.abs(analytical - numerical)))
# Max error is tiny (well below 1e-8), which validates the formula

:::note Why central differences are more accurate
The forward difference (f(x+h) - f(x)) / h has truncation error O(h). The central difference (f(x+h) - f(x-h)) / (2h) has truncation error O(h²). For h=1e-5, that is on the order of 1e-10 versus 1e-5, roughly a 100,000x improvement. Prefer central differences for numerical gradient checking.
:::
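The difference in error order can be checked directly. This sketch assumes eˣ as the test function (its derivative is eˣ, per the table above) and a hypothetical set of step sizes. It records the error of each scheme so the scaling with h is visible.

```python
import math


def forward_difference(f, x, h):
    return (f(x + h) - f(x)) / h


def central_difference(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)


x0 = 1.0
exact = math.exp(x0)  # d/dx e^x = e^x

errors = {}  # h -> (forward error, central error)
for h in (1e-2, 1e-3, 1e-4):
    fwd_err = abs(forward_difference(math.exp, x0, h) - exact)
    ctr_err = abs(central_difference(math.exp, x0, h) - exact)
    errors[h] = (fwd_err, ctr_err)
    print(f"h = {h:.0e}  forward error = {fwd_err:.2e}  central error = {ctr_err:.2e}")
```

Shrinking h by 10x should cut the forward error by about 10x and the central error by about 100x. For much smaller h, floating-point rounding starts to dominate and the pattern breaks down.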

Part 2 - Partial Derivatives: Many Variables

Neural network loss functions have millions of variables (one per parameter). We need derivatives with respect to each variable separately.

Definition

Given f(x₁, x₂, ..., xₙ), the partial derivative with respect to xᵢ is:

$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_i + h, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}$

In practice: treat all variables as constants except xᵢ, then differentiate with respect to xᵢ as if it were the only variable.
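As a small illustration of "hold everything else constant", the sketch below assumes f(x, y) = x²y + 3y, an arbitrary function picked for this example. Treating y as a constant gives ∂f/∂x = 2xy. Treating x as a constant gives ∂f/∂y = x² + 3. The hypothetical helper `partial` perturbs one coordinate at a time to confirm both.

```python
def f(x, y):
    return x ** 2 * y + 3 * y


def df_dx(x, y):
    # y is treated as a constant: d/dx (x^2 * y) = 2xy, d/dx (3y) = 0
    return 2 * x * y


def df_dy(x, y):
    # x is treated as a constant: d/dy (x^2 * y) = x^2, d/dy (3y) = 3
    return x ** 2 + 3


def partial(func, point, index, h=1e-5):
    """Central-difference partial derivative along one coordinate."""
    up = list(point)
    down = list(point)
    up[index] += h
    down[index] -= h
    return (func(*up) - func(*down)) / (2 * h)


point = (2.0, 5.0)
print("analytical:", df_dx(*point), df_dy(*point))
print("numerical: ", partial(f, point, 0), partial(f, point, 1))
```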

Example: Loss function partial derivatives

Consider mean squared error loss with two parameters:

$L(w, b) = \frac{1}{n} \sum_{i=1}^{n} (wx_i + b - y_i)^2$

Partial derivative with respect to w (treat b as constant):

$\frac{\partial L}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (wx_i + b - y_i) \cdot x_i$

Partial derivative with respect to b (treat w as constant):

$\frac{\partial L}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} (wx_i + b - y_i)$

import numpy as np

def mse_loss(w: float, b: float, X: np.ndarray, y: np.ndarray) -> float:
    """Mean squared error loss: (1/n) * sum((w*x + b - y)^2)"""
    predictions = w * X + b
    return np.mean((predictions - y) ** 2)

def mse_gradients(w: float, b: float, X: np.ndarray, y: np.ndarray) -> tuple:
    """Analytical partial derivatives of MSE with respect to w and b."""
    n = len(X)
    residuals = w * X + b - y  # (predictions - targets)

    dL_dw = (2 / n) * np.sum(residuals * X)
    dL_db = (2 / n) * np.sum(residuals)

    return dL_dw, dL_db

# Generate synthetic data
np.random.seed(42)
X = np.linspace(0, 10, 100)
y = 2.5 * X + 1.0 + np.random.randn(100) * 0.5  # true w=2.5, b=1.0

# Compute gradients at current parameter values
w, b = 1.0, 0.0  # wrong initial values
dL_dw, dL_db = mse_gradients(w, b, X, y)

print(f"Current w={w}, b={b}")
print(f"Loss = {mse_loss(w, b, X, y):.4f}")
print(f"∂L/∂w = {dL_dw:.4f}")  # negative here → increasing w reduces loss
print(f"∂L/∂b = {dL_db:.4f}")  # negative here → increasing b reduces loss

# Verify with numerical derivatives
h = 1e-5
numerical_dw = (mse_loss(w+h, b, X, y) - mse_loss(w-h, b, X, y)) / (2*h)
numerical_db = (mse_loss(w, b+h, X, y) - mse_loss(w, b-h, X, y)) / (2*h)
print(f"\nNumerical ∂L/∂w = {numerical_dw:.4f}  (matches: {np.isclose(dL_dw, numerical_dw)})")
print(f"Numerical ∂L/∂b = {numerical_db:.4f}  (matches: {np.isclose(dL_db, numerical_db)})")

What partial derivatives tell you

Each partial derivative answers: "If I change this one parameter by a tiny amount, holding everything else fixed, how much does the loss change?"

∂L/∂wᵢ > 0: Increasing wᵢ increases loss → decrease wᵢ
∂L/∂wᵢ < 0: Increasing wᵢ decreases loss → increase wᵢ
∂L/∂wᵢ = 0: wᵢ has no first-order effect on loss at this point

The sign and magnitude of each partial derivative directly drive the parameter update in gradient descent.
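To see the sign rules turn into updates, here is a minimal sketch that applies them to the two-parameter MSE from the example above. The learning rate of 0.01, the 20 steps, and the random seed are assumptions made for illustration, not tuned values.

```python
import numpy as np

# Synthetic data in the same spirit as the example above (true w=2.5, b=1.0)
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 100)
y = 2.5 * X + 1.0 + rng.normal(scale=0.5, size=100)


def loss_and_grads(w, b):
    """MSE loss and its two partial derivatives."""
    residuals = w * X + b - y
    loss = np.mean(residuals ** 2)
    dL_dw = 2 * np.mean(residuals * X)
    dL_db = 2 * np.mean(residuals)
    return loss, dL_dw, dL_db


w, b = 1.0, 0.0
learning_rate = 0.01  # assumed value, small enough to be stable for this data
history = []

for step in range(20):
    loss, dL_dw, dL_db = loss_and_grads(w, b)
    history.append(loss)
    # Positive partial -> decrease the parameter; negative -> increase it
    w -= learning_rate * dL_dw
    b -= learning_rate * dL_db

print(f"first loss = {history[0]:.4f}, last loss = {history[-1]:.4f}")
print(f"w = {w:.4f}, b = {b:.4f}")
```

Subtracting the partial derivative handles both sign cases at once, and a larger magnitude produces a larger correction.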

Part 3 - The Gradient Vector

The gradient assembles all partial derivatives into a single vector:

$\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}$

The gradient ∇f(x) is a vector in the same space as x. Each component is the partial derivative with respect to the corresponding parameter.
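One way to make "one partial derivative per component" concrete is to build the gradient numerically, one coordinate at a time. The sketch assumes f(x₀, x₁) = x₀² + 3x₀x₁ as a toy function. The helper name `numerical_gradient` is hypothetical and not from any library.

```python
import numpy as np


def f(x):
    # f(x0, x1) = x0^2 + 3 * x0 * x1
    return x[0] ** 2 + 3 * x[0] * x[1]


def analytical_gradient(x):
    # [df/dx0, df/dx1] = [2*x0 + 3*x1, 3*x0]
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])


def numerical_gradient(func, x, h=1e-5):
    """Central-difference estimate of each partial derivative in turn."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[i] = h  # perturb only coordinate i
        grad[i] = (func(x + step) - func(x - step)) / (2 * h)
    return grad


point = np.array([1.0, 2.0])
print("analytical:", analytical_gradient(point))
print("numerical: ", numerical_gradient(f, point))
print("same shape as input:", numerical_gradient(f, point).shape == point.shape)
```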

Geometric interpretation: direction of steepest ascent

This is the most important geometric fact about gradients:

The gradient ∇f(x) points in the direction in which f increases most rapidly.

Equivalently: -∇f(x) points in the direction in which f decreases most rapidly.

This is the direction gradient descent moves.

Loss surface (2D parameter space):

     b
     │
 2.0 ┤    ·  ·  ↑  ·  ·
     │    · ↗ │ ↖ ·  ·     ← gradient ∇L points
 1.5 ┤    ·  ·  ●  ·  ·       toward higher loss
     │    · ↘ │ ↙ ·  ·     ← -∇L points toward
 1.0 ┤    ·  · ↓ ·  ·        lower loss (gradient descent)
     │
     └─────┬─────┬────── w
          1.0   2.5
               ↑ minimum (∇L = 0)

Gradient in NumPy

import numpy as np

def compute_gradient_mse(params: np.ndarray, X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """
    Compute gradient of MSE loss with respect to all parameters.

    For linear regression with n features:
    params = [w₁, w₂, ..., wₙ, b]  (concatenated weights and bias)
    """
    w = params[:-1]  # all but last element
    b = params[-1]   # last element is bias

    predictions = X @ w + b  # (n_samples,)
    residuals = predictions - y  # (n_samples,)

    # Gradient with respect to each weight
    # ∂L/∂wⱼ = (2/n) * sum(residuals * Xⱼ)
    grad_w = (2 / len(y)) * (X.T @ residuals)  # (n_features,)

    # Gradient with respect to bias
    # ∂L/∂b = (2/n) * sum(residuals)
    grad_b = (2 / len(y)) * np.sum(residuals)   # scalar

    # Assemble gradient vector (same shape as params)
    gradient = np.concatenate([grad_w, [grad_b]])
    return gradient

# Multi-feature linear regression example
np.random.seed(42)
n_samples, n_features = 200, 3
X = np.random.randn(n_samples, n_features)
true_weights = np.array([1.5, -2.0, 0.8])
true_bias = 0.5
y = X @ true_weights + true_bias + np.random.randn(n_samples) * 0.1

# Initialize all parameters to zero
params = np.zeros(n_features + 1)  # [w₁, w₂, w₃, b]
gradient = compute_gradient_mse(params, X, y)

print(f"Parameters: {params}")
print(f"Gradient:   {gradient.round(4)}")
print(f"Gradient norm ‖∇L‖₂ = {np.linalg.norm(gradient):.4f}")
print()
print("Each gradient component:")
for i, (param, grad) in enumerate(zip(params, gradient)):
    direction = "increase" if grad < 0 else "decrease"
    print(f"  param[{i}]={param:.3f}, ∂L/∂p={grad:.4f} → {direction} this param")

Properties of the gradient

Property	Statement	ML Implication
Direction	∇f points in direction of steepest ascent	GD moves in -∇f direction
Magnitude	‖∇f‖ = rate of change in steepest direction	Large gradient → steep loss surface
Zero gradient	∇f = 0 is a necessary condition for min/max	Training converges when ‖∇f‖ ≈ 0
Orthogonality	∇f is orthogonal to level curves of f	Gradient is perpendicular to contour lines
Linearity	∇(af + bg) = a∇f + b∇g	Gradient of sum = sum of gradients
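Two of these properties are easy to spot-check. The sketch below assumes f(x, y) = x² + y², whose level curves are circles, and g(x, y) = xy. Both are arbitrary picks for illustration. It checks orthogonality against the circle's tangent and linearity against a numerical gradient of 2f + 3g.

```python
import numpy as np


def grad_f(p):
    # f(x, y) = x^2 + y^2
    return np.array([2 * p[0], 2 * p[1]])


def grad_g(p):
    # g(x, y) = x * y
    return np.array([p[1], p[0]])


def combined(p):
    # 2f + 3g, written out directly
    return 2 * (p[0] ** 2 + p[1] ** 2) + 3 * (p[0] * p[1])


def numerical_gradient(func, p, h=1e-5):
    grad = np.zeros_like(p, dtype=float)
    for i in range(p.size):
        step = np.zeros_like(p, dtype=float)
        step[i] = h
        grad[i] = (func(p + step) - func(p - step)) / (2 * h)
    return grad


p = np.array([3.0, 4.0])

# Orthogonality: the tangent to the circle x^2 + y^2 = 25 at p is (-y, x)
tangent = np.array([-p[1], p[0]])
orthogonality = float(np.dot(grad_f(p), tangent))

# Linearity: gradient of 2f + 3g versus 2*grad_f + 3*grad_g
linear_lhs = numerical_gradient(combined, p)
linear_rhs = 2 * grad_f(p) + 3 * grad_g(p)

print("grad_f · tangent =", orthogonality)
print("∇(2f + 3g)       =", linear_lhs.round(6))
print("2∇f + 3∇g        =", linear_rhs)
```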

Part 4 - Directional Derivatives

The gradient tells you the steepest direction. But what if you want the rate of change in a specific direction?

The directional derivative of f at x in the direction of unit vector u is:

$D_{\mathbf{u}} f(\mathbf{x}) = \nabla f(\mathbf{x}) \cdot \mathbf{u} = \|\nabla f(\mathbf{x})\| \cos(\theta)$

where θ is the angle between ∇f and u.

Key insight

The directional derivative is maximized when u = ∇f/‖∇f‖ (parallel to the gradient), giving maximum value ‖∇f‖. This confirms: the gradient direction is the steepest ascent direction.

The directional derivative is minimized when u = -∇f/‖∇f‖, giving -‖∇f‖. This is the steepest descent direction - exactly what gradient descent uses.

import numpy as np

def directional_derivative(grad: np.ndarray, direction: np.ndarray) -> float:
    """
    Compute directional derivative given gradient and unit direction vector.

    D_u f(x) = ∇f(x) · u  (dot product)
    """
    # Ensure direction is a unit vector
    unit_dir = direction / np.linalg.norm(direction)
    return float(np.dot(grad, unit_dir))

# Example: gradient at current position
gradient = np.array([3.0, 4.0])  # gradient vector

# Compute rate of change in various directions
directions = {
    "gradient direction (steepest ascent)": gradient,
    "negative gradient (steepest descent)": -gradient,
    "orthogonal to gradient": np.array([-4.0, 3.0]),  # perpendicular
    "random direction": np.array([1.0, 0.0]),
}

print(f"Gradient: {gradient}")
print(f"‖∇f‖ = {np.linalg.norm(gradient):.2f}  (maximum possible rate)")
print()
for name, direction in directions.items():
    rate = directional_derivative(gradient, direction)
    print(f"{name}: {rate:.4f}")

Output:

Gradient: [3. 4.]
‖∇f‖ = 5.00  (maximum possible rate)

gradient direction (steepest ascent): 5.0000
negative gradient (steepest descent): -5.0000
orthogonal to gradient: 0.0000
random direction: 3.0000
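The dot-product formula can also be checked against the limit definition by stepping along u directly. The sketch assumes f(x, y) = x² + 2y² evaluated at (1.5, 1.0), chosen so that the gradient there is [3, 4] as in the example above. The direction (1, 1) is arbitrary.

```python
import numpy as np


def f(p):
    return p[0] ** 2 + 2.0 * p[1] ** 2


def grad_f(p):
    return np.array([2.0 * p[0], 4.0 * p[1]])


def numerical_directional_derivative(func, p, direction, h=1e-5):
    """Central difference of func along the normalized direction."""
    u = direction / np.linalg.norm(direction)
    return (func(p + h * u) - func(p - h * u)) / (2.0 * h)


point = np.array([1.5, 1.0])      # gradient here is [3, 4]
direction = np.array([1.0, 1.0])  # arbitrary direction, normalized below
u = direction / np.linalg.norm(direction)

via_dot = float(np.dot(grad_f(point), u))
via_limit = numerical_directional_derivative(f, point, direction)

print(f"∇f · u            = {via_dot:.6f}")
print(f"finite difference = {via_limit:.6f}")
print(f"upper bound ‖∇f‖  = {np.linalg.norm(grad_f(point)):.6f}")
```

Both routes should agree closely, and neither can exceed ‖∇f‖ = 5.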

:::tip ML Connection: Gradient descent steps
In gradient descent, each update is:

θ ← θ - α · ∇L(θ)

The step direction is -∇L (steepest descent), and the learning rate α scales the step: the actual step length is α‖∇L‖. Among all directions, -∇L is the one with the most negative directional derivative, so for a sufficiently small step it decreases the loss fastest.
:::
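A single update can be compared against equally long steps in other directions. The sketch assumes a toy loss L(θ) = (θ₀ - 1)² + 3(θ₁ + 2)², a learning rate of 0.001, and eight evenly spaced compass directions as the alternatives. All of these are illustrative choices.

```python
import numpy as np


def loss(theta):
    return (theta[0] - 1.0) ** 2 + 3.0 * (theta[1] + 2.0) ** 2


def grad_loss(theta):
    return np.array([2.0 * (theta[0] - 1.0), 6.0 * (theta[1] + 2.0)])


theta = np.array([0.0, 0.0])
alpha = 0.001  # assumed learning rate, kept small so the linear picture holds

# Gradient descent update: theta <- theta - alpha * grad
g = grad_loss(theta)
theta_new = theta - alpha * g
step_length = np.linalg.norm(alpha * g)

# Steps of the same length in eight other directions
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
other_losses = [
    loss(theta + step_length * np.array([np.cos(a), np.sin(a)]))
    for a in angles
]

print(f"loss before update:       {loss(theta):.6f}")
print(f"loss after -∇L step:      {loss(theta_new):.6f}")
print(f"best of other directions: {min(other_losses):.6f}")
```

With a much larger α the comparison is no longer guaranteed, because curvature starts to matter.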

Part 5 - The Jacobian Matrix

A neural network layer maps a vector to a vector: f: ℝⁿ → ℝᵐ. The gradient alone (which is for scalar functions) is not enough. We need the Jacobian matrix.

Definition

Given f: ℝⁿ → ℝᵐ, the Jacobian J is an m×n matrix where:

$J_{ij} = \frac{\partial f_i}{\partial x_j}$

The ith row contains the gradient of the ith output with respect to all inputs. The jth column contains the partial derivatives of all outputs with respect to the jth input.
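The definition can be sketched for a small layer. The example assumes f(x) = tanh(Wx + b) with a made-up 2×3 weight matrix, so n = 3 inputs and m = 2 outputs. The analytical form uses the tanh derivative from the table in Part 1 together with the chain rule, which this section has not introduced, so take that line as an assumption and the finite-difference version as the direct reading of the definition.

```python
import numpy as np

# Hypothetical layer parameters: 3 inputs -> 2 outputs
W = np.array([[0.5, -1.0, 2.0],
              [1.5, 0.3, -0.7]])
b = np.array([0.1, -0.2])


def layer(x):
    return np.tanh(W @ x + b)


def analytical_jacobian(x):
    # J_ij = (1 - tanh^2(z_i)) * W_ij, where z = W x + b
    z = W @ x + b
    return (1.0 - np.tanh(z) ** 2)[:, None] * W


def numerical_jacobian(func, x, h=1e-5):
    """Fill column j with the partial derivatives of all outputs w.r.t. x_j."""
    m = func(x).size
    J = np.zeros((m, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        J[:, j] = (func(x + step) - func(x - step)) / (2 * h)
    return J


x = np.array([0.2, -0.4, 0.1])
J_exact = analytical_jacobian(x)
J_approx = numerical_jacobian(layer, x)

print("Jacobian shape (m x n):", J_exact.shape)
print("max abs difference:", np.max(np.abs(J_exact - J_approx)))
```

Each row of the result is the gradient of one output, and each column is filled by perturbing one input.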
