Derivatives and Gradients - The Compass of Training
Reading time: ~24 minutes | Level: Mathematical Foundations → ML Engineering
Your model has just finished its first epoch. The training loss is 2.47. You call loss.backward() and then optimizer.step(). The loss drops to 2.31.
What just happened? PyTorch computed the gradient - a vector with one number per parameter, each number telling the optimizer: "if you increase this parameter slightly, the loss goes up by this much." The optimizer used that compass reading to take a small step in the direction that reduces loss.
Every single training run of every neural network ever built follows this pattern. The gradient is the compass. Derivatives are what the compass reads.
## What You Will Learn
- The formal definition of a derivative and what it means geometrically
- Partial derivatives: extending derivatives to multi-variable functions
- The gradient vector: the direction of steepest ascent in parameter space
- Directional derivatives: rate of change in any chosen direction
- The Jacobian matrix: gradients for vector-valued functions
- Computing gradients in NumPy and understanding what PyTorch `.grad` contains
- Where gradients appear in real ML systems

## Prerequisites
- Basic algebra and function notation (f(x), y = mx + b)
- Python and NumPy arrays
- No prior calculus required
## Part 1 - The Derivative: Rate of Change

### What a derivative measures

A derivative measures how much a function's output changes when its input changes by a tiny amount.

Formally, the derivative of f at point x is:

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
Read this as: "Take a tiny step h in the input. Measure the change in output. Divide by h to get the rate. Shrink h to zero."
The result f'(x) is the instantaneous rate of change at x.
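To make the limit concrete, here is a minimal sketch that evaluates the difference quotient for shrinking values of h. The function `f(x) = x**2`, the point `x0 = 3.0`, and the step sizes are illustrative assumptions chosen because the exact answer (f'(3) = 6) is easy to check by hand.

```python
def f(x):
    """Illustrative function: f(x) = x^2, whose exact derivative is 2x."""
    return x ** 2


def difference_quotient(func, x, h):
    """Slope of the secant line through (x, func(x)) and (x + h, func(x + h))."""
    return (func(x + h) - func(x)) / h


x0 = 3.0
exact = 2 * x0  # f'(3) = 6

# Shrink h and watch the quotient approach the exact derivative
step_sizes = [1.0, 0.1, 0.01, 0.001]
errors = []
for h in step_sizes:
    approx = difference_quotient(f, x0, h)
    errors.append(abs(approx - exact))
    print(f"h = {h:<6} quotient = {approx:.6f}  error = {errors[-1]:.6f}")
```

For this particular function the quotient works out to 2x + h, so the error shrinks in step with h.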
### Geometric interpretation

Geometrically, the derivative is the slope of the tangent line to the function's curve at point x:

```
f(x)
 │
 │                ╱    ← tangent line at x₀
 │              ╱        slope = f'(x₀)
 │       ╱╲   ╱
 │      ╱  ╲╱
 │     ╱    ●          ← point (x₀, f(x₀))
 │    ╱
 └────────────────── x
            x₀
```
- If f'(x) > 0: the function is increasing at x (slope goes up)
- If f'(x) < 0: the function is decreasing at x (slope goes down)
- If f'(x) = 0: the function is flat at x - a potential minimum or maximum
Flat points are what optimization looks for: at a minimum of a smooth function, f'(x) = 0. Gradient descent repeatedly steps downhill (against the sign of f'(x)) until it reaches such a point.
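As a minimal sketch of that idea in one dimension, the loop below follows the negative derivative of an illustrative objective. The function `f(x) = (x - 3)^2`, the starting point, and the learning rate of 0.1 are assumptions made for this example only.

```python
def f(x):
    """Illustrative objective with its minimum at x = 3."""
    return (x - 3.0) ** 2


def f_prime(x):
    """Exact derivative of f."""
    return 2.0 * (x - 3.0)


x = 0.0              # arbitrary starting point
learning_rate = 0.1  # assumed step size for this sketch

for step in range(50):
    # Positive slope -> move left, negative slope -> move right
    x = x - learning_rate * f_prime(x)

print(f"x after 50 steps: {x:.6f}")
print(f"f'(x) at that point: {f_prime(x):.2e}")
```

The iterate settles near x = 3, where the derivative is close to zero and the updates become vanishingly small.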
### Key derivatives you need

| Function f(x) | Derivative f'(x) | Where it appears in ML |
|---|---|---|
| xⁿ | nxⁿ⁻¹ | Polynomial activations |
| eˣ | eˣ | Softmax, normalization |
| ln(x) | 1/x | Cross-entropy loss |
| σ(x) = sigmoid(x) = 1/(1+e⁻ˣ) | σ(x)(1-σ(x)) | Binary classification |
| ReLU(x) = max(0,x) | 0 if x<0, 1 if x>0 (undefined at x=0) | Most modern networks |
| tanh(x) | 1 - tanh²(x) | RNNs, LSTM gates |
```python
import numpy as np

# Verify the derivative of sigmoid numerically
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative_analytical(x):
    s = sigmoid(x)
    return s * (1 - s)

def numerical_derivative(f, x, h=1e-5):
    """Finite difference approximation of derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)  # central differences

x = np.array([0.0, 1.0, -1.0, 2.0])
analytical = sigmoid_derivative_analytical(x)
numerical = numerical_derivative(sigmoid, x)

print("x:          ", x)
print("Analytical: ", np.round(analytical, 6))
print("Numerical:  ", np.round(numerical, 6))
print("Max error:  ", np.max(np.abs(analytical - numerical)))
# Max error is tiny (on the order of 1e-11), which validates the formula
```
:::note Why central differences are more accurate
The forward difference (f(x+h) - f(x)) / h has error O(h). The central difference (f(x+h) - f(x-h)) / (2h) has error O(h²). For h=1e-5, this is 1e-10 vs 1e-5 - a 100,000x improvement. Always use central differences for numerical gradient checking.
:::
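The gap between the two schemes is easy to observe directly. The sketch below compares them on `math.sin` at x = 1, where the exact derivative is cos(1). The test function and the helper names `forward_difference` and `central_difference` are choices made for this illustration.

```python
import math


def forward_difference(f, x, h):
    """One-sided estimate: truncation error shrinks like h."""
    return (f(x + h) - f(x)) / h


def central_difference(f, x, h):
    """Two-sided estimate: truncation error shrinks like h^2."""
    return (f(x + h) - f(x - h)) / (2 * h)


x0 = 1.0
exact = math.cos(x0)  # d/dx sin(x) = cos(x)
h = 1e-5

forward_error = abs(forward_difference(math.sin, x0, h) - exact)
central_error = abs(central_difference(math.sin, x0, h) - exact)

print(f"forward difference error: {forward_error:.2e}")
print(f"central difference error: {central_error:.2e}")
```

Making h much smaller does not keep helping: floating-point round-off in the subtraction eventually dominates, which is why values around 1e-5 are a common compromise in double precision.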
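## Part 2 - Partial Derivatives: Many Variables

Neural network loss functions have millions or even billions of variables (one per parameter). We need derivatives with respect to each variable separately.

### Definition

Given f(x₁, x₂, ..., xₙ), the partial derivative with respect to xᵢ is:

$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_i + h, \ldots, x_n) - f(x_1, \ldots, x_i, \ldots, x_n)}{h}$$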
In practice: treat all variables as constants except xᵢ, then differentiate with respect to xᵢ as if it were the only variable.
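A small worked case may help. The function `f(x, y) = x²y + 3y` below is an illustrative assumption. Holding y constant gives ∂f/∂x = 2xy, and holding x constant gives ∂f/∂y = x² + 3. The sketch checks both against central differences that perturb one input at a time.

```python
def f(x, y):
    """Illustrative two-variable function."""
    return x ** 2 * y + 3 * y


def df_dx(x, y):
    """Partial derivative with respect to x (y treated as a constant)."""
    return 2 * x * y


def df_dy(x, y):
    """Partial derivative with respect to y (x treated as a constant)."""
    return x ** 2 + 3


x0, y0, h = 2.0, 5.0, 1e-5

# Perturb only one input at a time, keeping the other fixed
numeric_dx = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)
numeric_dy = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)

print(f"df/dx: analytical = {df_dx(x0, y0):.6f}, numerical = {numeric_dx:.6f}")
print(f"df/dy: analytical = {df_dy(x0, y0):.6f}, numerical = {numeric_dy:.6f}")
```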
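### Example: Loss function partial derivatives

Consider mean squared error loss with two parameters:

$$L(w, b) = \frac{1}{n}\sum_{i=1}^{n}(w x_i + b - y_i)^2$$

Partial derivative with respect to w (treat b as constant):

$$\frac{\partial L}{\partial w} = \frac{2}{n}\sum_{i=1}^{n}(w x_i + b - y_i) \cdot x_i$$

Partial derivative with respect to b (treat w as constant):

$$\frac{\partial L}{\partial b} = \frac{2}{n}\sum_{i=1}^{n}(w x_i + b - y_i)$$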
```python
import numpy as np

def mse_loss(w: float, b: float, X: np.ndarray, y: np.ndarray) -> float:
    """Mean squared error loss: (1/n) * sum((w*x + b - y)^2)"""
    predictions = w * X + b
    return np.mean((predictions - y) ** 2)

def mse_gradients(w: float, b: float, X: np.ndarray, y: np.ndarray) -> tuple:
    """Analytical partial derivatives of MSE with respect to w and b."""
    n = len(X)
    residuals = w * X + b - y  # (predictions - targets)
    dL_dw = (2 / n) * np.sum(residuals * X)
    dL_db = (2 / n) * np.sum(residuals)
    return dL_dw, dL_db

# Generate synthetic data
np.random.seed(42)
X = np.linspace(0, 10, 100)
y = 2.5 * X + 1.0 + np.random.randn(100) * 0.5  # true w=2.5, b=1.0

# Compute gradients at current parameter values
w, b = 1.0, 0.0  # wrong initial values (both below the true values)
dL_dw, dL_db = mse_gradients(w, b, X, y)
print(f"Current w={w}, b={b}")
print(f"Loss = {mse_loss(w, b, X, y):.4f}")
print(f"∂L/∂w = {dL_dw:.4f}")  # negative here → increasing w reduces loss
print(f"∂L/∂b = {dL_db:.4f}")  # negative here → increasing b reduces loss

# Verify with numerical derivatives
h = 1e-5
numerical_dw = (mse_loss(w+h, b, X, y) - mse_loss(w-h, b, X, y)) / (2*h)
numerical_db = (mse_loss(w, b+h, X, y) - mse_loss(w, b-h, X, y)) / (2*h)
print(f"\nNumerical ∂L/∂w = {numerical_dw:.4f} (matches: {np.isclose(dL_dw, numerical_dw)})")
print(f"Numerical ∂L/∂b = {numerical_db:.4f} (matches: {np.isclose(dL_db, numerical_db)})")
```
### What partial derivatives tell you
Each partial derivative answers: "If I change this one parameter by a tiny amount, holding everything else fixed, how much does the loss change?"
- ∂L/∂wᵢ > 0: Increasing wᵢ increases loss → decrease wᵢ
- ∂L/∂wᵢ < 0: Increasing wᵢ decreases loss → increase wᵢ
- ∂L/∂wᵢ = 0: wᵢ has no local effect on loss at this point
The sign and magnitude of each partial derivative directly drives the parameter update in gradient descent.
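To see the sign and magnitude at work, here is a minimal gradient descent sketch for the two-parameter MSE model. The synthetic data (true w = 2.5, b = 1.0), the learning rate of 0.01, and the 5000 iterations are assumptions picked so that this small example converges. They are not recommended defaults.

```python
import numpy as np

# Synthetic data for illustration: y ≈ 2.5 x + 1.0 plus noise
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 100)
y = 2.5 * X + 1.0 + rng.normal(scale=0.5, size=100)

w, b = 1.0, 0.0        # deliberately wrong starting values
learning_rate = 0.01   # assumed step size; too large a value would diverge

for step in range(5000):
    residuals = w * X + b - y
    dL_dw = 2 * np.mean(residuals * X)  # ∂L/∂w
    dL_db = 2 * np.mean(residuals)      # ∂L/∂b
    # Negative partial derivative → parameter increases; positive → decreases
    w -= learning_rate * dL_dw
    b -= learning_rate * dL_db

print(f"Learned w = {w:.3f}, b = {b:.3f}")
```

Each parameter moves by an amount proportional to its own partial derivative, so parameters with a larger effect on the loss are adjusted more strongly.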
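## Part 3 - The Gradient Vector

The gradient assembles all partial derivatives into a single vector:

$$\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \end{bmatrix}$$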
The gradient ∇f(x) is a vector in the same space as x. Each component is the partial derivative with respect to the corresponding parameter.
### Geometric interpretation: direction of steepest ascent

This is the most important geometric fact about gradients:

> The gradient ∇f(x) points in the direction in which f increases most rapidly.

Equivalently: -∇f(x) points in the direction in which f decreases most rapidly.
This is the direction gradient descent moves.

```
Loss surface (2D parameter space):

    b
    │
2.0 ┤  ·   ·   ↑   ·   ·
    │  ·   ↗   │   ↖   ·      ← gradient ∇L points
1.5 ┤  ·   ·   ●   ·   ·        toward higher loss
    │  ·   ↘   │   ↙   ·      ← -∇L points toward
1.0 ┤  ·   ·   ↓   ·   ·        lower loss (gradient descent)
    │
    └─────┬─────┬────── w
         1.0   2.5
                ↑ minimum (∇L = 0)
```
### Gradient in NumPy

```python
import numpy as np

def compute_gradient_mse(params: np.ndarray, X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """
    Compute gradient of MSE loss with respect to all parameters.

    For linear regression with n features:
    params = [w₁, w₂, ..., wₙ, b] (concatenated weights and bias)
    """
    w = params[:-1]  # all but last element
    b = params[-1]   # last element is bias

    predictions = X @ w + b      # (n_samples,)
    residuals = predictions - y  # (n_samples,)

    # Gradient with respect to each weight
    # ∂L/∂wⱼ = (2/n) * sum(residuals * Xⱼ)
    grad_w = (2 / len(y)) * (X.T @ residuals)  # (n_features,)

    # Gradient with respect to bias
    # ∂L/∂b = (2/n) * sum(residuals)
    grad_b = (2 / len(y)) * np.sum(residuals)  # scalar

    # Assemble gradient vector (same shape as params)
    gradient = np.concatenate([grad_w, [grad_b]])
    return gradient

# Multi-feature linear regression example
np.random.seed(42)
n_samples, n_features = 200, 3
X = np.random.randn(n_samples, n_features)
true_weights = np.array([1.5, -2.0, 0.8])
true_bias = 0.5
y = X @ true_weights + true_bias + np.random.randn(n_samples) * 0.1

# Initial parameters (all zeros)
params = np.zeros(n_features + 1)  # [w₁, w₂, w₃, b]
gradient = compute_gradient_mse(params, X, y)

print(f"Parameters: {params}")
print(f"Gradient: {gradient.round(4)}")
print(f"Gradient norm ‖∇L‖₂ = {np.linalg.norm(gradient):.4f}")
print()
print("Each gradient component:")
for i, (param, grad) in enumerate(zip(params, gradient)):
    # A negative partial derivative means increasing the parameter lowers the loss
    direction = "increase" if grad < 0 else "decrease"
    print(f"  param[{i}]={param:.3f}, ∂L/∂p={grad:.4f} → {direction} this param")
```
### Properties of the gradient
| Property | Statement | ML Implication |
|---|---|---|
| Direction | ∇f points in direction of steepest ascent | GD moves in -∇f direction |
| Magnitude | ‖∇f‖ = rate of change in steepest direction | Large gradient → steep loss surface |
| Zero gradient | ∇f = 0 is a necessary condition for min/max | Training converges when ‖∇f‖ ≈ 0 |
| Orthogonality | ∇f is orthogonal to level curves of f | Gradient is perpendicular to contour lines |
| Linearity | ∇(af + bg) = a∇f + b∇g | Gradient of sum = sum of gradients |
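Two of these properties can be checked numerically. The sketch below uses illustrative functions chosen for this example, `f(x, y) = x² + y²` and `g(x, y) = 3x - y`. The level curves of f are circles, so the tangent to the level curve through (x, y) is (-y, x).

```python
import numpy as np


def grad_f(p):
    """Gradient of f(x, y) = x^2 + y^2."""
    return 2 * p


def grad_g(p):
    """Gradient of g(x, y) = 3x - y (constant everywhere)."""
    return np.array([3.0, -1.0])


point = np.array([1.0, 2.0])

# Orthogonality: the gradient of f is perpendicular to the tangent of the
# circle (level curve) passing through the point.
tangent = np.array([-point[1], point[0]])
orthogonality = float(np.dot(grad_f(point), tangent))


# Linearity: differentiate h = 2f + 5g numerically and compare with 2∇f + 5∇g
def h_func(p):
    return 2 * (p[0] ** 2 + p[1] ** 2) + 5 * (3 * p[0] - p[1])


eps = 1e-6
numeric_grad_h = np.array([
    (h_func(point + eps * e) - h_func(point - eps * e)) / (2 * eps)
    for e in np.eye(2)  # perturb one coordinate at a time
])
combined = 2 * grad_f(point) + 5 * grad_g(point)

print(f"∇f · tangent = {orthogonality}")
print(f"numerical ∇h = {numeric_grad_h.round(6)}, 2∇f + 5∇g = {combined}")
```

Linearity is what lets a minibatch loss (a sum or mean of per-example losses) be differentiated example by example and then combined.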
## Part 4 - Directional Derivatives

The gradient tells you the steepest direction. But what if you want the rate of change in a specific direction?

The directional derivative of f at x in the direction of unit vector u is:

$$D_{\mathbf{u}} f(\mathbf{x}) = \nabla f(\mathbf{x}) \cdot \mathbf{u} = \|\nabla f(\mathbf{x})\| \cos\theta$$

where θ is the angle between ∇f and u.
### Key insight
The directional derivative is maximized when u = ∇f/‖∇f‖ (parallel to the gradient), giving maximum value ‖∇f‖. This confirms: the gradient direction is the steepest ascent direction.
The directional derivative is minimized when u = -∇f/‖∇f‖, giving -‖∇f‖. This is the steepest descent direction - exactly what gradient descent uses.
```python
import numpy as np

def directional_derivative(grad: np.ndarray, direction: np.ndarray) -> float:
    """
    Compute directional derivative given gradient and direction vector.
    D_u f(x) = ∇f(x) · u (dot product)
    """
    # Ensure direction is a unit vector
    unit_dir = direction / np.linalg.norm(direction)
    return float(np.dot(grad, unit_dir))

# Example: gradient at current position
gradient = np.array([3.0, 4.0])  # gradient vector

# Compute rate of change in various directions
directions = {
    "gradient direction (steepest ascent)": gradient,
    "negative gradient (steepest descent)": -gradient,
    "orthogonal to gradient": np.array([-4.0, 3.0]),  # perpendicular
    "x-axis direction": np.array([1.0, 0.0]),
}

print(f"Gradient: {gradient}")
print(f"‖∇f‖ = {np.linalg.norm(gradient):.2f} (maximum possible rate)")
print()
for name, direction in directions.items():
    rate = directional_derivative(gradient, direction)
    print(f"{name}: {rate:.4f}")
```

Output:

```
Gradient: [3. 4.]
‖∇f‖ = 5.00 (maximum possible rate)

gradient direction (steepest ascent): 5.0000
negative gradient (steepest descent): -5.0000
orthogonal to gradient: 0.0000
x-axis direction: 3.0000
```
:::tip ML Connection: Gradient descent steps
In gradient descent, each update is:

$$\theta \leftarrow \theta - \alpha \nabla L(\theta)$$

The step direction is -∇L (steepest descent). The step size is controlled by α (the learning rate). This follows from the directional derivative: among all unit directions, -∇L/‖∇L‖ is the one along which the loss decreases most rapidly, to first order.
:::
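The identity D_u f = ∇f · u can also be checked against the function itself rather than a given gradient vector. The sketch below is an illustration under assumed choices: `f(x, y) = x² + 3xy`, the point (1, 2), and the direction (3, 4)/5. It compares the dot product with a finite difference taken along u, then sweeps over many unit directions to confirm that no direction beats ‖∇f‖.

```python
import numpy as np


def f(p):
    """Illustrative scalar function f(x, y) = x^2 + 3xy."""
    x, y = p
    return x ** 2 + 3 * x * y


def grad_f(p):
    """Analytical gradient of f: (2x + 3y, 3x)."""
    x, y = p
    return np.array([2 * x + 3 * y, 3 * x])


point = np.array([1.0, 2.0])
u = np.array([3.0, 4.0]) / 5.0  # unit vector

# Directional derivative two ways
via_gradient = float(np.dot(grad_f(point), u))
h = 1e-6
via_finite_difference = (f(point + h * u) - f(point - h * u)) / (2 * h)

# Sweep unit directions around the circle; the best rate should match ‖∇f‖
angles = np.linspace(0, 2 * np.pi, 3601)
rates = np.array([
    np.dot(grad_f(point), [np.cos(a), np.sin(a)]) for a in angles
])
best_rate = rates.max()
grad_norm = np.linalg.norm(grad_f(point))

print(f"∇f · u            = {via_gradient:.6f}")
print(f"finite difference = {via_finite_difference:.6f}")
print(f"best sampled rate = {best_rate:.6f}, ‖∇f‖ = {grad_norm:.6f}")
```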
## Part 5 - The Jacobian Matrix

A neural network layer maps a vector to a vector: f: ℝⁿ → ℝᵐ. The gradient alone (which is for scalar functions) is not enough. We need the Jacobian matrix.

### Definition

Given f: ℝⁿ → ℝᵐ, the Jacobian J is an m×n matrix where:

$$J_{ij} = \frac{\partial f_i}{\partial x_j}$$

The ith row contains the gradient of the ith output with respect to all inputs. The jth column contains the partial derivatives of all outputs with respect to the jth input.

$$J = \begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix}$$

### Jacobian of a neural network layer

For a linear layer y = Wx + b (W is m×n, x is n-dim, y is m-dim):

$$J = \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = W$$

The Jacobian of a linear layer is just its weight matrix. This is why backpropagation involves multiplying by transposed weight matrices: given the upstream gradient ∂L/∂y, the gradient with respect to the input is ∂L/∂x = Wᵀ(∂L/∂y). You are propagating gradients through Jacobians.

```python
import numpy as np

def jacobian_numerical(f, x: np.ndarray, h: float = 1e-5) -> np.ndarray:
    """
    Compute Jacobian of f: ℝⁿ → ℝᵐ numerically.
    Returns J where J[i,j] = ∂f_i/∂x_j
    """
    f0 = f(x)
    m = len(f0)  # output dimension
    n = len(x)   # input dimension
    J = np.zeros((m, n))

    for j in range(n):
        # Perturb the j-th input component
        x_plus = x.copy()
        x_plus[j] += h
        x_minus = x.copy()
        x_minus[j] -= h
        # Central difference for column j
        J[:, j] = (f(x_plus) - f(x_minus)) / (2 * h)

    return J

# Example: Jacobian of softmax
def softmax(x: np.ndarray) -> np.ndarray:
    """Softmax: maps ℝⁿ → ℝⁿ (probability simplex)."""
    exp_x = np.exp(x - np.max(x))  # numerical stability
    return exp_x / exp_x.sum()

def softmax_jacobian_analytical(x: np.ndarray) -> np.ndarray:
    """
    Analytical Jacobian of softmax.
    J[i,j] = s_i * (δ_ij - s_j) where s = softmax(x)
    """
    s = softmax(x)
    n = len(s)
    J = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                J[i, j] = s[i] * (1 - s[i])
            else:
                J[i, j] = -s[i] * s[j]
    return J

# More efficiently using outer product:
def softmax_jacobian_vectorized(x: np.ndarray) -> np.ndarray:
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

x = np.array([1.0, 2.0, 0.5])
J_numerical = jacobian_numerical(softmax, x)
J_analytical = softmax_jacobian_analytical(x)

print(f"Numerical Jacobian:\n{J_numerical.round(6)}")
print(f"\nAnalytical Jacobian:\n{J_analytical.round(6)}")
print(f"\nMax error: {np.max(np.abs(J_numerical - J_analytical)):.2e}")
```

### Why Jacobians matter in backpropagation

In backpropagation, if you have a composed function y = f(g(x)), the chain rule gives:

$$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial g} \cdot \frac{\partial g}{\partial x} = J_f \cdot J_g$$

This is a **Jacobian matrix multiplication**. Backpropagation chains these multiplications together from the loss back to the inputs. The efficiency comes from the fact that for a scalar loss L and vector x, we never actually form the full Jacobian. Instead, we compute the vector-Jacobian product (VJP) directly.
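As a minimal sketch of what computing the VJP directly can look like, consider softmax, whose Jacobian is diag(s) - ssᵀ. Multiplying an upstream gradient v by this matrix simplifies to s ⊙ (v - v·s), which needs only O(n) work instead of an n×n matrix. The helper name `softmax_vjp` and the upstream values are hypothetical, chosen for this illustration. This is not how any particular framework implements it.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax."""
    shifted = np.exp(x - np.max(x))
    return shifted / shifted.sum()


def softmax_vjp(x, upstream):
    """
    Vector-Jacobian product upstream^T @ J for softmax, computed without
    building the n x n Jacobian: s * (upstream - upstream·s).
    """
    s = softmax(x)
    return s * (upstream - np.dot(upstream, s))


x = np.array([1.0, 2.0, 0.5])
upstream = np.array([0.1, -0.3, 0.7])  # hypothetical ∂L/∂y from later layers

# Reference: build the full Jacobian explicitly, then multiply
s = softmax(x)
full_jacobian = np.diag(s) - np.outer(s, s)
vjp_via_jacobian = upstream @ full_jacobian

vjp_direct = softmax_vjp(x, upstream)

print(f"VJP via full Jacobian: {vjp_via_jacobian.round(6)}")
print(f"VJP computed directly: {vjp_direct.round(6)}")
```

For three inputs the saving is negligible, but for a layer with thousands of outputs the full Jacobian would have millions of entries, while the direct product touches each element only a few times.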
## Part 6 - Gradient in PyTorch

Understanding what PyTorch `.grad` actually contains:

```python
import torch

# ── Simple example: gradient of scalar loss ──────────────────────────────────
# Simulate one linear layer: y = Wx + b, loss = mean(y^2)
# float64 keeps the finite-difference check below reliable; in float32,
# round-off with h = 1e-5 would be larger than the tolerance.
torch.manual_seed(42)
W = torch.randn(3, 4, dtype=torch.float64, requires_grad=True)  # 3 output × 4 input
b = torch.randn(3, dtype=torch.float64, requires_grad=True)
x = torch.randn(4, dtype=torch.float64)

y = W @ x + b           # shape: (3,)
loss = y.pow(2).mean()  # scalar
loss.backward()         # compute all gradients

print(f"W.grad shape: {W.grad.shape}")  # (3, 4) - same shape as W
print(f"b.grad shape: {b.grad.shape}")  # (3,)   - same shape as b
print()

# Each element of W.grad is ∂loss/∂W[i,j]
# Let's verify one element numerically
i, j = 1, 2  # pick a specific weight
h = 1e-5

W_plus = W.clone().detach()
W_plus[i, j] += h
y_plus = W_plus @ x + b.detach()
loss_plus = y_plus.pow(2).mean()

W_minus = W.clone().detach()
W_minus[i, j] -= h
y_minus = W_minus @ x + b.detach()
loss_minus = y_minus.pow(2).mean()

numerical_grad = (loss_plus - loss_minus) / (2 * h)
analytical_grad = W.grad[i, j]

print(f"Analytical ∂loss/∂W[{i},{j}] = {analytical_grad.item():.6f}")
print(f"Numerical  ∂loss/∂W[{i},{j}] = {numerical_grad.item():.6f}")
print(f"Match: {torch.isclose(analytical_grad, numerical_grad, atol=1e-4).item()}")

# ── Gradient of vector output with respect to parameters ────────────────────
# For a vector output, we need to sum or reduce to a scalar before backward()
x = torch.randn(5, requires_grad=True)
f = x ** 2 + 2 * x + 1  # element-wise operation, output is a vector

# To get gradient of sum with respect to x:
f.sum().backward()

print(f"\nf = x^2 + 2x + 1")
print(f"x = {x.detach().numpy().round(3)}")
print(f"∂(Σf)/∂x = 2x + 2 = {x.grad.numpy().round(3)}")
print(f"Expected: {(2*x + 2).detach().numpy().round(3)}")
```

### What `.grad` accumulates

:::warning Gradient accumulation gotcha
PyTorch **accumulates** gradients into `.grad` with each `.backward()` call. If you call `loss.backward()` twice without zeroing gradients, `.grad` contains the **sum** of gradients from both calls. Always call `optimizer.zero_grad()` before computing new gradients.

```python
# dataloader, compute_loss, and optimizer are placeholders for your own objects

# WRONG: gradients accumulate
for batch in dataloader:
    loss = compute_loss(batch)
    loss.backward()        # .grad adds to previous .grad!
    optimizer.step()       # updates with wrong accumulated gradient

# RIGHT: zero gradients before each backward pass
for batch in dataloader:
    optimizer.zero_grad()  # ← this is required
    loss = compute_loss(batch)
    loss.backward()
    optimizer.step()
```
:::

## Part 7 - Gradient Computation at Scale

In production ML, computing gradients efficiently is critical.

### Gradient of cross-entropy loss

The most common loss in classification is cross-entropy:

$$L = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$$

where y is the one-hot true label and ŷ = softmax(logits).

The gradient with respect to the logits has an elegant form:

$$\frac{\partial L}{\partial \text{logits}} = \hat{y} - y$$

```python
import numpy as np

def cross_entropy_loss(logits: np.ndarray, true_label: int) -> float:
    """Cross-entropy loss for multi-class classification."""
    # Softmax
    exp_logits = np.exp(logits - np.max(logits))  # subtract max for stability
    probs = exp_logits / exp_logits.sum()
    return -np.log(probs[true_label] + 1e-10)

def cross_entropy_gradient(logits: np.ndarray, true_label: int) -> np.ndarray:
    """
    Gradient of cross-entropy loss with respect to logits.

    For softmax + cross-entropy combined:
    ∂L/∂logits = softmax(logits) - one_hot(true_label)

    This elegant result comes from the chain rule applied to
    softmax → cross-entropy composition.
    """
    exp_logits = np.exp(logits - np.max(logits))
    probs = exp_logits / exp_logits.sum()
    one_hot = np.zeros_like(probs)
    one_hot[true_label] = 1.0
    return probs - one_hot  # ← beautifully simple gradient

# Example
logits = np.array([2.0, 1.0, 0.5, 3.0])  # 4-class problem
true_label = 3  # class index 3 is correct

loss = cross_entropy_loss(logits, true_label)
grad = cross_entropy_gradient(logits, true_label)

# Numerical gradient for verification: perturb one logit at a time
h = 1e-5
numerical_grad = np.array([
    (cross_entropy_loss(logits + h * e, true_label)
     - cross_entropy_loss(logits - h * e, true_label)) / (2 * h)
    for e in np.eye(len(logits))
])

print(f"Logits: {logits}")
print(f"Loss: {loss:.4f}")
print(f"Gradient (analytical): {grad.round(6)}")
print(f"Gradient (numerical):  {numerical_grad.round(6)}")
print(f"Max error: {np.max(np.abs(grad - numerical_grad)):.2e}")
```

### Batch gradient computation

In practice, gradients are computed on minibatches:

```python
import numpy as np

def batch_gradient_cross_entropy(
    logits: np.ndarray,  # (batch_size, n_classes)
    labels: np.ndarray,  # (batch_size,) integer labels
) -> np.ndarray:
    """
    Compute gradient of mean cross-entropy loss over a batch.

    Returns the gradient with respect to the logits, with the same shape
    as logits: (batch_size, n_classes). Each row is (probs - one_hot)
    divided by batch_size, because the loss is the mean over the batch.
    """
    batch_size, n_classes = logits.shape

    # Softmax (numerically stable)
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp_logits = np.exp(shifted)
    probs = exp_logits / exp_logits.sum(axis=1, keepdims=True)

    # One-hot encode labels
    one_hot = np.zeros_like(probs)
    one_hot[np.arange(batch_size), labels] = 1.0

    # Gradient: (probs - one_hot) / batch_size
    # Division by batch_size because we take the mean loss
    return (probs - one_hot) / batch_size

# Simulate a batch
batch_size, n_classes = 32, 10
logits = np.random.randn(batch_size, n_classes)
labels = np.random.randint(0, n_classes, size=batch_size)

grad = batch_gradient_cross_entropy(logits, labels)
print(f"Gradient shape: {grad.shape}")      # (32, 10)
print(f"Gradient mean: {grad.mean():.6f}")  # ~0 (each row of probs and one_hot sums to 1)
print(f"Gradient norm: {np.linalg.norm(grad):.4f}")
```

## Part 8 - Common Mistakes and Engineering Red Flags

:::danger Forgetting to zero gradients
One of the most common PyTorch bugs. Gradients accumulate by default. If you call `loss.backward()` without `optimizer.zero_grad()` first, the gradient from the previous step is added to the current one, corrupting the update.

Symptom: Loss decreases initially then suddenly jumps or oscillates wildly.
:::

:::danger Numerical instability in softmax
Never compute `np.exp(logits) / np.exp(logits).sum()` directly. For large logits (e.g., 1000), `np.exp(1000)` overflows to inf in float64, and inf / inf produces NaN. In float32 the overflow already happens near 89. Always subtract the max:

```python
# WRONG:
probs = np.exp(logits) / np.exp(logits).sum()  # can overflow

# RIGHT:
shifted = logits - logits.max()
probs = np.exp(shifted) / np.exp(shifted).sum()  # numerically stable
```
:::

:::warning Gradients at non-differentiable points
ReLU is not differentiable at exactly x=0. PyTorch convention: the subgradient at 0 is 0 (derivative of the "off" branch). This is fine in practice: exact zeros are uncommon among pre-activations, and any value between 0 and 1 is a valid subgradient there. However, some custom loss functions or activations may have problematic discontinuities that cause NaN gradients. Always check gradient norms during training for NaN/Inf values.
:::

:::tip Gradient checking for custom implementations
When implementing a custom layer or loss, always verify your analytical gradients against numerical gradients before training:

```python
import numpy as np

def gradient_check(f, x, grad_fn, eps=1e-5, tol=1e-4):
    """Check analytical vs numerical gradient (x must be a float array)."""
    analytical = grad_fn(x)
    numerical = np.zeros_like(x)

    for i in range(x.size):
        x_plus = x.copy()
        x_plus.flat[i] += eps
        x_minus = x.copy()
        x_minus.flat[i] -= eps
        numerical.flat[i] = (f(x_plus) - f(x_minus)) / (2 * eps)

    relative_error = np.abs(analytical - numerical) / (np.abs(analytical) + np.abs(numerical) + 1e-8)
    max_error = relative_error.max()

    if max_error < tol:
        print(f"Gradient check PASSED (max relative error: {max_error:.2e})")
    else:
        print(f"Gradient check FAILED (max relative error: {max_error:.2e})")

    return max_error
```
:::

## Interview Questions

<details>
<summary><strong>Q1: What is the gradient, and what does it represent geometrically?</strong></summary>

The gradient ∇f(x) is the vector of all partial derivatives of a scalar function f:

$$\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \end{bmatrix}$$

**Geometrically**: The gradient points in the direction of steepest ascent - the direction in which f increases most rapidly. Its magnitude ‖∇f‖ is the rate of change in that direction.

**In ML**: The gradient of the loss function ∇L(θ) tells us which direction in parameter space increases the loss. Gradient descent moves in the opposite direction: θ ← θ - α∇L(θ). After training converges, ∇L(θ) ≈ 0 at the final parameters.

**Key facts to mention in interviews**:
- Gradient is orthogonal to level curves/surfaces of f
- A zero gradient is a necessary (but not sufficient) condition for a minimum - saddle points also have zero gradient
- The gradient has the same dimension as the parameter vector

</details>

<details>
<summary><strong>Q2: What is a partial derivative, and how do you compute it?</strong></summary>

A partial derivative ∂f/∂xᵢ measures how f changes when you perturb the i-th input while holding all other inputs fixed.

**Computation rule**: When taking ∂f/∂xᵢ, treat all variables except xᵢ as constants and differentiate normally.

**Example for an ML interview**: For MSE loss L(w,b) = (1/n)Σ(wx+b-y)²:

$$\frac{\partial L}{\partial w} = \frac{2}{n}\sum(wx + b - y) \cdot x$$

$$\frac{\partial L}{\partial b} = \frac{2}{n}\sum(wx + b - y)$$

These are exactly the gradients used to train linear regression with gradient descent; each update subtracts the learning rate times the corresponding partial derivative. For each parameter, the gradient tells us: "increase this parameter → loss goes up (positive gradient) or down (negative gradient)."

**Practical note**: In deep networks, we never compute partial derivatives by hand. Autograd does it automatically. But understanding what partial derivatives mean is essential for debugging gradient flow and custom loss functions.

</details>

<details>
<summary><strong>Q3: What is the Jacobian, and why does it matter for backpropagation?</strong></summary>

The Jacobian is the generalization of the gradient to vector-valued functions. For f: ℝⁿ → ℝᵐ, the Jacobian J is an m×n matrix where J[i,j] = ∂fᵢ/∂xⱼ.

**Why it matters for backpropagation**: Neural network layers map vectors to vectors (not scalars).
When you chain layers together, the chain rule for vector-valued functions requires multiplying Jacobians:

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \frac{\partial \mathbf{y}}{\partial \mathbf{h}} \cdot \frac{\partial \mathbf{h}}{\partial \mathbf{x}} = J_{\text{output}} \cdot J_{\text{hidden}}$$

For a linear layer y = Wx + b, the Jacobian with respect to x is simply W. This is why backpropagation involves multiplying by Wᵀ when propagating gradients backward.

**Crucial efficiency insight**: In ML, loss is a scalar, so we never need the full Jacobian. We need the vector-Jacobian product (VJP): vᵀJ, where v is the upstream gradient. PyTorch's reverse-mode autodiff computes VJPs efficiently without materializing full Jacobians.

</details>

<details>
<summary><strong>Q4: How would you verify that your analytical gradient implementation is correct?</strong></summary>

Use **gradient checking** - compare analytical gradients to numerical finite-difference approximations.

**Algorithm** (pseudocode, where hᵢ is a step of size h along parameter i):

```
for each parameter θᵢ:
    numerical_grad[i] = (L(θ + hᵢ) - L(θ - hᵢ)) / (2h)
    relative_error[i] = |analytical[i] - numerical[i]| / (|analytical[i]| + |numerical[i]| + eps)
```

**Use central differences** (not forward differences) for O(h²) accuracy instead of O(h).

**Acceptance threshold**: Relative error < 1e-4 is acceptable. < 1e-6 is very good.

**Important caveats**:
- Don't gradient check during training - it needs two forward passes per parameter, so O(n) forward passes per gradient
- Watch out for functions that are not differentiable at the test point (e.g., ReLU at exactly 0)
- Numerical errors grow with parameter magnitude - use relative error, not absolute
- Always test with several different input points, not just one

**When to use it**: Any time you implement a custom `torch.autograd.Function`, a custom loss, or a non-standard layer. Catching gradient bugs early saves days of debugging.

</details>

<details>
<summary><strong>Q5: Why is the gradient of softmax + cross-entropy simply (ŷ - y)?</strong></summary>

This elegant result comes from the chain rule applied to the composed function.

**Step 1**: Cross-entropy loss for true class k: L = -log(ŷₖ) = -log(softmax(logits)ₖ)

**Step 2**: Chain rule: ∂L/∂logitsⱼ = Σᵢ (∂L/∂ŷᵢ) · (∂ŷᵢ/∂logitsⱼ)

**Step 3**: ∂L/∂ŷₖ = -1/ŷₖ (from log derivative). For i ≠ k: ∂L/∂ŷᵢ = 0, so only the i = k term of the sum survives.

**Step 4**: The softmax Jacobian: ∂ŷᵢ/∂logitsⱼ = ŷᵢ(δᵢⱼ - ŷⱼ)

**Step 5**: Combining:
- For j = k: ∂L/∂logitsₖ = (-1/ŷₖ) · ŷₖ(1 - ŷₖ) = ŷₖ - 1
- For j ≠ k: ∂L/∂logitsⱼ = (-1/ŷₖ) · ŷₖ(-ŷⱼ) = ŷⱼ

In vector form: ∂L/∂logits = ŷ - y (predicted probabilities minus one-hot labels).

**Why this matters**: This simple, well-behaved gradient is one reason softmax classifiers train smoothly. No special handling needed - the gradient just says "push the predicted probability toward 1 for the correct class, toward 0 for others."

</details>

## Quick Reference

| Concept | Formula | NumPy |
|---------|---------|-------|
| Derivative | f'(x) = lim_{h→0} (f(x+h)-f(x))/h | `np.gradient(f_values, x)` (on sampled values) |
| Partial derivative | ∂f/∂xᵢ (treat others as constant) | Analytical or `jacobian_numerical` |
| Gradient | ∇f = [∂f/∂x₁, ..., ∂f/∂xₙ] | `compute_gradient_mse(params, X, y)` |
| Gradient direction | direction of steepest ascent | `-∇f` for steepest descent |
| Directional derivative | D_u f = ∇f · u | `np.dot(grad, unit_dir)` |
| Jacobian | J[i,j] = ∂fᵢ/∂xⱼ | `jacobian_numerical(f, x)` |
| Numerical gradient check | (f(x+h)-f(x-h))/(2h) | Central difference |
| CE gradient | ŷ - y (softmax output minus one-hot) | `probs - one_hot` |

## Key Takeaways

- The derivative measures rate of change - geometrically, it is the slope of the tangent line
- Partial derivatives extend this to multi-variable functions by varying one input at a time
- The gradient ∇f assembles all partial derivatives into a vector that points in the direction of steepest ascent
- Gradient descent moves in the direction -∇f to minimize the loss function
- The Jacobian generalizes the gradient to vector-valued functions and is the building block of backpropagation
- The gradient of softmax + cross-entropy is simply (ŷ - y) - a beautiful result from the chain rule
- Always verify custom gradient implementations against numerical finite differences

*Next: [Chain Rule and Backpropagation →](./02-Chain-Rule-and-Backpropagation.md)*

:::tip 🎮 Interactive Playground
**Visualize this concept:** Try the **[Derivatives & Gradients](/playground/derivatives-3d)** demo on the EngineersOfAI Playground - no code required.
:::