RNNs and the Vanishing Gradient Problem

Reading time: 35–45 minutes | Interview relevance: High | Target roles: ML Engineer, AI Engineer, Research Engineer, Data Scientist

The Trading System That Forgot

It was 7:43 AM in London when the quantitative trading system flagged an anomaly. The LSTM-based model - actually a stack of vanilla RNNs that the team had never bothered to upgrade - had been predicting intraday price movements for a basket of FTSE 100 equities for three years. It worked reasonably well on short windows. Three to five steps ahead? Nearly production-grade. Ten steps? Passable. Twenty steps? The model had essentially stopped caring about what happened more than a few minutes ago.

The root cause was discovered during a post-mortem. The equity had released earnings guidance six weeks prior - a subtle but structurally important signal that compressed volatility across the subsequent weeks. The model's hidden state, in theory, should have carried that information forward. In practice, by the time the gradient signal from that six-week-old timestep had been propagated back through the unrolled RNN during training, it had decayed to near zero. The model learned to ignore it. Not because it was a bad model, but because the mathematics of how vanilla RNNs are trained made learning from distant past events nearly impossible.

The team replaced the RNNs with LSTMs that afternoon and rebuilt over a weekend. The six-week signal was now actionable. The Sharpe ratio on the strategy improved by 0.4 over the following quarter. But the engineers on that team were left with a question that took weeks of reading to fully answer: why exactly did the gradients vanish, and why did nobody catch it sooner?

This lesson answers that question precisely. Not the hand-wavy version - "gradients get small over time" - but the real mathematical reason rooted in the chain rule of calculus, the spectrum of the weight matrix, and the specific activation functions that make the problem worse or better. Understanding this at depth is what separates an engineer who uses RNNs from one who can reason about when they will fail.

By the end of this lesson, you will be able to derive the vanishing gradient condition from first principles, implement a working RNN in NumPy including backpropagation through time, and make an informed architectural decision about whether a vanilla RNN is the right tool for your sequence problem - and if not, what question to ask next.

Why This Exists - The Problem Feedforward Networks Cannot Solve

Before RNNs, the dominant paradigm for supervised learning on structured data was the feedforward neural network. You take an input vector $x$ , pass it through one or more hidden layers, and produce an output. The architecture is static: fixed input size, fixed computation graph, fixed output size. This works brilliantly for tasks where the input has no inherent ordering - classifying an image, predicting house prices from tabular features, approximating a function from a fixed-dimensional input.

The moment your data is sequential, feedforward networks run into a fundamental mismatch. Consider predicting the next word in a sentence. The sentences "The bank by the river was steep" and "The bank froze my account" share the word "bank", but its meaning depends on context that may be many words away. A feedforward network that sees only a fixed window of $k$ tokens cannot capture dependencies that span more than $k$ steps. You could increase $k$, but this inflates the input layer's parameter count in proportion to $k$ and still imposes an artificial horizon. More fundamentally, it treats each position in the window as a distinct input dimension, so a pattern learned at position 3 is not shared with any other position. It does not generalize across time.

The deeper problem is that sequences are variable-length. Audio waveforms, natural language sentences, financial time series, DNA strands - none of these come in a fixed size. A feedforward network with a fixed input dimension either truncates long sequences or pads short ones, both of which are approximations that lose information and waste capacity.

What was needed was an architecture that:

Processes inputs one timestep at a time, so it handles variable-length sequences natively
Maintains a running summary of everything it has seen so far - a "memory"
Shares the same parameters across all timesteps, so the model learns generalizable temporal patterns rather than position-specific ones

The Recurrent Neural Network meets all three requirements. It processes $x_t$ one step at a time, maintains a hidden state $h_t$ that summarizes the history, and uses the same weight matrices $W_h$ and $W_x$ at every timestep. The elegance is real. The catch - which took the research community nearly a decade to fully characterize - is that this design makes learning long-range dependencies extraordinarily difficult.

Historical Context - From Rumelhart to the Gradient Crisis

1986 - Backpropagation Reaches the Machine Learning Community

David Rumelhart, Geoffrey Hinton, and Ronald Williams published "Learning representations by back-propagating errors" in Nature in 1986. This paper did not invent backpropagation - the idea had appeared in control theory and was independently discovered multiple times - but it made the algorithm legible to the machine learning community and demonstrated that multi-layer networks could learn useful internal representations. The credit assignment problem, previously considered nearly intractable, now had a practical solution for feedforward networks.

The extension to sequences was immediate and natural. If you unroll a recurrent network through time - treating each timestep as a layer in a very deep feedforward network - you can apply backpropagation to the resulting computation graph. This extension is called Backpropagation Through Time (BPTT), and Rumelhart, Hinton, and Williams described it in the same period. Paul Werbos had independently developed similar ideas in his 1974 PhD thesis, though the machine learning community did not widely engage with that work until much later.

1991 - Hochreiter Diagnoses the Problem

In 1991, Sepp Hochreiter, then a diploma student at the Technical University of Munich under Jürgen Schmidhuber, wrote a diploma thesis that stands as one of the most important diagnostic documents in deep learning history. Hochreiter showed - with mathematical precision - that when you unroll a recurrent network through many timesteps and apply BPTT, the gradient signal either vanishes (decays exponentially toward zero) or explodes (grows exponentially). He demonstrated that this was not a matter of training tricks or optimization heuristics; it was a fundamental consequence of the architecture.

The thesis, "Untersuchungen zu dynamischen neuronalen Netzen" (Investigations on Dynamic Neural Networks), was written in German and did not receive wide attention at the time. But its analysis directly motivated the development of the Long Short-Term Memory network, which Hochreiter and Schmidhuber published in Neural Computation in 1997 (Hochreiter and Schmidhuber, 1997). The LSTM was explicitly engineered to route gradient signals around the vanishing gradient problem using gated memory cells.

1993–2000 - The Gradient Cliff

The 1990s saw repeated failures to train RNNs on tasks requiring long-range dependencies. Bengio, Simard, and Frasconi (1994) published "Learning long-term dependencies with gradient descent is difficult" in IEEE Transactions on Neural Networks, providing a theoretical analysis of the problem and showing that even with perfect optimization, certain tasks were practically unlearnable by vanilla RNNs. This paper is often cited as the clearest early exposition of why the problem is fundamental rather than incidental.

2013 - Taming Exploding Gradients

Pascanu, Mikolov, and Bengio (2013), "On the difficulty of training recurrent neural networks" (ICML 2013), provided a comprehensive modern analysis of both vanishing and exploding gradients. Critically, they formalized gradient clipping - the practice of rescaling gradients when their norm exceeds a threshold - as the standard mitigation for the exploding side of the problem. This paper is worth reading in full; it remains the clearest treatment of gradient pathology in RNNs.

2015–2017 - The Transformer Horizon

The attention mechanism, developed incrementally from Bahdanau et al. (2015) through to Vaswani et al.'s "Attention Is All You Need" (2017), eventually offered a different solution: instead of trying to route gradient through sequential hidden states, allow every output position to attend directly to every input position. This shortens the gradient path between any two positions to a single step, which largely removes the long-range dependency problem, at the cost of attention compute that grows quadratically with sequence length. But that is the next chapter. Understanding why transformers were needed requires understanding why RNNs failed - which is what this lesson is about.

Core Concept - How an RNN Actually Works

The Intuition: A Running Summary

Imagine you are reading a document one word at a time, but you can only see the current word and a notepad where you keep a summary. After reading each word, you write a new summary based on what you just read and what was already on the notepad, and the new summary replaces the old one. That notepad is the hidden state.

This is exactly what an RNN does. At each timestep $t$ :

You receive the current input $x_t$ (a vector)
You look at your previous summary $h_{t-1}$ (also a vector)
You compute a new summary $h_t$ by combining them
You optionally produce an output $y_t$ from $h_t$

The key insight is that $h_t$ is the only thing that carries information forward. It is a fixed-size vector - say, 256 dimensions - that must compress everything relevant from timesteps $1$ through $t$. This compression bottleneck is both the strength of the architecture (it forces the model to learn what to remember) and its weakness (information from distant timesteps gets diluted or overwritten).

The Forward Pass - Equations

The RNN forward pass at timestep $t$ is:

$h_t = \tanh(W_h h_{t-1} + W_x x_t + b_h)$

$\hat{y}_t = \text{softmax}(W_y h_t + b_y)$

Where:

$x_t$ has shape [input_size] - the input at this timestep
$h_{t-1}$ has shape [hidden_size] - the previous hidden state
$W_x$ has shape [hidden_size, input_size] - input-to-hidden weights
$W_h$ has shape [hidden_size, hidden_size] - hidden-to-hidden weights
$b_h$ has shape [hidden_size] - hidden bias
$\hat{y}_t$ has shape [output_size] - prediction at this timestep
$W_y$ has shape [output_size, hidden_size] - hidden-to-output weights
$b_y$ has shape [output_size] - output bias

The activation function $\tanh$ squashes the pre-activation into $(-1, 1)$ . This is important for stability but - as we will see - it is also part of why gradients vanish.
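To make the shapes concrete, here is a minimal NumPy sketch of the forward step applied to two sequences of different lengths. The sizes (4, 8, 3), the random weights, and the helper names `rnn_step` and `run_sequence` are illustrative assumptions, not values from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 4, 8, 3  # illustrative sizes only

# One set of parameters, reused at every timestep.
W_x = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
W_y = rng.normal(scale=0.1, size=(output_size, hidden_size))
b_y = np.zeros(output_size)


def rnn_step(x_t, h_prev):
    """One forward step: returns (h_t, y_hat_t)."""
    h_t = np.tanh(W_h @ h_prev + W_x @ x_t + b_h)
    logits = W_y @ h_t + b_y
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return h_t, exp / exp.sum()


def run_sequence(xs):
    """Apply rnn_step across a sequence, starting from a zero hidden state."""
    h = np.zeros(hidden_size)
    outputs = []
    for x_t in xs:
        h, y_hat = rnn_step(x_t, h)
        outputs.append(y_hat)
    return h, np.array(outputs)


for T in (3, 7):
    xs = rng.normal(size=(T, input_size))
    h_T, ys = run_sequence(xs)
    print(f"T={T}: final hidden state {h_T.shape}, predictions {ys.shape}")
```

The same five parameter arrays handle both sequence lengths; only the number of loop iterations changes.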

Unrolling Through Time

When we visualize an RNN, we "unroll" it - we draw the same network repeated for each timestep, with the hidden state connection shown as an arrow from timestep $t-1$ to $t$ . This is not a different network; it is the same weight matrices applied repeatedly. But unrolling makes it look like a very deep feedforward network where:

Depth = sequence length
Every "layer" uses the same weights
Each "layer's" input comes from both the data ( $x_t$ ) and the previous "layer's" output ( $h_{t-1}$ )

This unrolled view is exactly what BPTT operates on. And the depth of this network - which equals the sequence length, which could be hundreds or thousands - is what makes training difficult.

What the Hidden State Learns

In practice, the hidden state $h_t$ learns to encode whatever is predictive of the target. For a language model, different dimensions of $h_t$ learn to track things like: is the current clause in a quotation? Is the subject of the sentence plural or singular? Has a negation appeared recently? These representations emerge from gradient descent; they are not programmed. But whether those representations can incorporate information from 50 timesteps ago depends entirely on whether the gradient from that distant past survives the journey back through BPTT - and typically, it does not.

The Vanishing Gradient Problem - Mathematical Explanation

Setting Up the Chain Rule

To train the RNN, we minimize a loss $\mathcal{L}$ over all timesteps. For a sequence of length $T$ , the total loss is typically the sum of per-timestep losses:

$\mathcal{L} = \sum_{t=1}^{T} \mathcal{L}_t$

Where $\mathcal{L}_t$ is the loss at timestep $t$ (e.g., cross-entropy for language modeling, MSE for regression).

To update the weight matrices, we need the gradient of the total loss with respect to each weight. The full gradient of $\mathcal{L}$ with respect to $W_h$ is:

$\frac{\partial \mathcal{L}}{\partial W_h} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}_t}{\partial W_h}$

Consider the gradient of $\mathcal{L}_T$ - the loss at the final timestep - with respect to $h_1$ , the hidden state at the very first timestep. By the chain rule:

$\frac{\partial \mathcal{L}_T}{\partial h_1} = \frac{\partial \mathcal{L}_T}{\partial h_T} \cdot \frac{\partial h_T}{\partial h_{T-1}} \cdot \frac{\partial h_{T-1}}{\partial h_{T-2}} \cdots \frac{\partial h_2}{\partial h_1}$

And more generally, the gradient chain from timestep $t$ back to $k$ is:

$\frac{\partial \mathcal{L}_t}{\partial h_k} = \frac{\partial \mathcal{L}_t}{\partial h_t} \prod_{j=k}^{t-1} \frac{\partial h_{j+1}}{\partial h_j}$

This is a product of $t-k$ Jacobian matrices - $T-1$ of them for the path from step $T$ back to step $1$. Each factor $\frac{\partial h_{j+1}}{\partial h_j}$ is the Jacobian of a hidden state with respect to the previous hidden state.

Computing the Local Jacobian

At each timestep, $h_t = \tanh(W_h h_{t-1} + W_x x_t + b_h)$ . The Jacobian of $h_t$ with respect to $h_{t-1}$ is:

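$J_t = \frac{\partial h_t}{\partial h_{t-1}} = \text{diag}(1 - h_t^2) \cdot W_h$

Where $\text{diag}(1 - h_t^2)$ is a diagonal matrix with the $\tanh$ derivative values on the diagonal, and entry $(i, j)$ of $J_t$ is $\partial h_{t,i} / \partial h_{t-1,j}$. Since $\tanh$ outputs values in $(-1, 1)$, we know $h_t^2 < 1$, so the derivative $(1 - h_t^2)$ is in $(0, 1]$, reaching 1 only where the pre-activation is exactly zero. In regions where $\tanh$ is saturating - when its input is large positive or large negative - this derivative approaches 0.

The full gradient from step $T$ back to step $1$, written as a row vector, is:

$\frac{\partial \mathcal{L}_T}{\partial h_1} = \frac{\partial \mathcal{L}_T}{\partial h_T} \cdot J_T \cdot J_{T-1} \cdots J_2, \qquad J_t = \text{diag}(1 - h_t^2) \cdot W_h$

In column-vector form, which is how backpropagation code usually stores gradients, each step instead multiplies by the transpose $J_t^\top = W_h^\top \cdot \text{diag}(1 - h_t^2)$.

Why the Product Shrinks or Explodes

This product of $T-1$ matrices is the core of the problem. For each factor, the spectral norm (largest singular value) of $\text{diag}(1 - h_t^2) \cdot W_h$ determines how much that step can stretch or shrink the gradient; transposing the factor does not change its spectral norm.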

Let $\gamma_W$ be the largest singular value of $W_h$ , and let $\gamma_{\tanh}$ be the maximum value of $(1 - h_t^2)$ across all $t$ . Then the gradient norm is bounded by:

$\left\|\frac{\partial h_t}{\partial h_k}\right\| \leq (\gamma_W \cdot \gamma_{\tanh})^{t-k}$

If $\gamma_W \cdot \gamma_{\tanh} < 1$, the product of $T-1$ such factors is guaranteed to decay exponentially: the gradient norm is at most $(\gamma_W \cdot \gamma_{\tanh})^{T-1}$ of its starting value. For $T = 100$ and $\gamma_W \cdot \gamma_{\tanh} = 0.9$, this gives $0.9^{99} \approx 0.00003$ - effectively zero.
If $\gamma_W \cdot \gamma_{\tanh} > 1$, the bound no longer forces decay and the product can explode: the gradient may grow as fast as $(\gamma_W \cdot \gamma_{\tanh})^{T-1}$, causing numerical overflow or wildly unstable updates.

The condition for stable gradient flow is $\gamma_W \cdot \gamma_{\tanh} \approx 1$. This is what the LSTM approximates along its cell-state path through its gating mechanism, and it is what orthogonal and unitary RNN variants (Arjovsky et al., 2016) enforce on the recurrent matrix by construction.
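One way to see the bound in action is to multiply Jacobians numerically. The sketch below assumes the convention that entry $(i, j)$ of the Jacobian is $\partial h_{t,i} / \partial h_{t-1,j}$, which gives $\text{diag}(1 - h_t^2) \cdot W_h$. The random weights, the noise vector standing in for $W_x x_t + b_h$, and the function name `jacobian_product_norms` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, T = 32, 60


def jacobian_product_norms(sigma_max):
    """Spectral norm of d h_k / d h_0 for k = 1..T, with ||W_h||_2 set to sigma_max."""
    W_h = rng.normal(size=(hidden_size, hidden_size))
    W_h *= sigma_max / np.linalg.norm(W_h, 2)  # rescale the largest singular value
    h = np.zeros(hidden_size)
    product = np.eye(hidden_size)
    norms = []
    for _ in range(T):
        drive = rng.normal(scale=0.1, size=hidden_size)  # stand-in for W_x x_t + b_h
        h = np.tanh(W_h @ h + drive)
        J = (1.0 - h ** 2)[:, None] * W_h  # diag(1 - h_t^2) @ W_h
        product = J @ product  # chain rule: accumulate one more Jacobian
        norms.append(np.linalg.norm(product, 2))
    return np.array(norms)


steps = np.arange(1, T + 1)
for sigma_max in (0.5, 0.9, 2.0):
    norms = jacobian_product_norms(sigma_max)
    bound = sigma_max ** steps  # gamma_tanh <= 1, so gamma_W^k is an upper bound
    print(f"sigma_max={sigma_max}: norm after 10 steps {norms[9]:.2e}, "
          f"after {T} steps {norms[-1]:.2e}, bound {bound[-1]:.2e}")
```

When the largest singular value is below 1, the measured norm always stays under the bound, and usually far under it, because the bound is a worst case. Above 1 the bound permits growth but does not guarantee it: whether the product actually explodes depends on how saturated the $\tanh$ units are.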

The tanh Saturation Effect

The $\tanh$ activation makes the problem worse in two ways. First, its derivative $(1 - \tanh^2(z))$ is maximized at $z = 0$ where it equals 1, and it approaches 0 as $|z|$ grows large. When the network drives its pre-activations into saturation regions (which happens naturally when weights grow large during early training), the gradients through the activation are killed locally, independent of $W_h$ .

Second, the maximum gradient through a single $\tanh$ unit is 1.0 (at zero input). This means even in the best case, the $\tanh$ derivative contributes a factor of at most 1 to the product. So the only way for the full product to avoid shrinking is for $W_h$ to compensate - but $W_h$ that is too large causes explosion.
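The saturation effect is easy to check numerically. This short sketch evaluates the $\tanh$ derivative at a handful of pre-activation values; the specific values of `z` are arbitrary choices for illustration.

```python
import numpy as np

# Pre-activation values from the linear region out into saturation.
z = np.array([0.0, 0.5, 1.0, 2.0, 3.0, 5.0])
deriv = 1.0 - np.tanh(z) ** 2  # derivative of tanh at each z

for z_i, d_i in zip(z, deriv):
    print(f"z = {z_i:3.1f}   1 - tanh(z)^2 = {d_i:.6f}")
```

The derivative falls from 1 at $z = 0$ to roughly 0.07 at $z = 2$ and below 0.001 at $z = 5$, so a single saturated unit can shrink the gradient passing through it by orders of magnitude regardless of $W_h$.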

Why Long-Range Dependencies Cannot Be Learned

The practical consequence is concrete. Suppose your sequence is 100 timesteps long and the relevant signal for predicting $\hat{y}_{100}$ is the input $x_1$. During training, the gradient that should reward $h_1$ for encoding $x_1$ faithfully has traveled through 99 Jacobian matrices. If each Jacobian scales the gradient by a factor even slightly smaller than 1 - say 0.95 - the gradient arriving at step 1 is $0.95^{99} \approx 0.006$ of its original size. The weight update for the connection that encoded $x_1$ into $h_1$ is roughly 160 times smaller than the update driven by inputs at the end of the sequence. The model effectively learns to ignore $x_1$.

This is not a matter of setting a better learning rate. A single learning rate scales every gradient component equally, so it cannot rescue a signal that is hundreds of times smaller for distant timesteps than for nearby ones - not without a fundamentally different architecture. This is why the LSTM (Hochreiter and Schmidhuber, 1997) was such a breakthrough: it replaced the single tanh layer with a gating mechanism that can maintain gradient magnitude across hundreds of timesteps.
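The arithmetic behind these figures takes only a few lines to reproduce. The per-step factors below are the illustrative values used in this lesson, not measurements from a trained network.

```python
T = 100  # sequence length used in the text

surviving = {}
for factor in (0.99, 0.95, 0.90):
    # 99 Jacobian factors sit between step 100 and step 1.
    surviving[factor] = factor ** (T - 1)
    print(f"per-step factor {factor:.2f}: "
          f"surviving fraction {surviving[factor]:.2e}, "
          f"attenuation {1 / surviving[factor]:,.0f}x")
```

A change of a few percent in the per-step factor moves the surviving gradient across several orders of magnitude, which is why small differences in the spectrum of $W_h$ matter so much.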

:::note Exploding vs Vanishing
The exploding gradient problem is the opposite case: the gradient product grows exponentially, causing parameter updates of enormous magnitude that destabilize training. Exploding gradients are easier to fix - gradient clipping (Pascanu et al., 2013) caps the gradient norm before the update step and works well in practice. Vanishing gradients are harder because clipping a near-zero gradient does nothing useful. The architecture must change.
:::

NumPy From Scratch - RNN Forward Pass and BPTT

The implementation below is a complete, runnable many-to-many RNN trained on a toy sequence prediction task. It implements both the forward pass and full backpropagation through time. Read it as a reference implementation - every line corresponds directly to an equation above.

import numpy as np

# ─── Reproducibility ───────────────────────────────────────────────────────
np.random.seed(42)


# ─── Activation functions ──────────────────────────────────────────────────
def tanh(x):
    return np.tanh(x)


def tanh_grad(x):
    """Derivative of tanh: (1 - tanh(x)^2)"""
    return 1.0 - np.tanh(x) ** 2


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy_loss(logits, targets):
    """
    logits: shape [T, vocab_size]
    targets: shape [T] of integer class indices
    Returns (loss, probs): the scalar mean loss and the [T, vocab_size] softmax probabilities.
    """
    T = logits.shape[0]
    probs = softmax(logits)
    loss = -np.log(probs[np.arange(T), targets] + 1e-9).mean()
    return loss, probs


# ─── Parameter initialization ──────────────────────────────────────────────
def init_params(input_size, hidden_size, output_size, scale=0.01):
    """Small random Gaussian initialization (fixed scale; not Xavier/Glorot scaling)."""
    params = {
        "W_x": np.random.randn(hidden_size, input_size) * scale,
        "W_h": np.random.randn(hidden_size, hidden_size) * scale,
        "b_h": np.zeros(hidden_size),
        "W_y": np.random.randn(output_size, hidden_size) * scale,
        "b_y": np.zeros(output_size),
    }
    return params


# ─── Forward pass ──────────────────────────────────────────────────────────
def rnn_forward(xs, h0, params):
    """
    xs:     list of T input vectors, each shape [input_size]
    h0:     initial hidden state, shape [hidden_size]
    params: dict of weight matrices and biases

    Returns:
      hs:     dict {t: h_t} for t in 0..T-1 (h0 stored as hs[-1])
      pre_hs: dict {t: pre-activation z_t = W_h*h_{t-1} + W_x*x_t + b_h}
      logits: array of shape [T, output_size]
    """
    W_x, W_h, b_h = params["W_x"], params["W_h"], params["b_h"]
    W_y, b_y = params["W_y"], params["b_y"]

    T = len(xs)
    hs = {-1: h0.copy()}  # store h_{-1} = h0
    pre_hs = {}
    logits = []

    for t in range(T):
        # Pre-activation: z_t = W_h * h_{t-1} + W_x * x_t + b_h
        z_t = W_h @ hs[t - 1] + W_x @ xs[t] + b_h
        pre_hs[t] = z_t

        # Hidden state: h_t = tanh(z_t)
        hs[t] = tanh(z_t)

        # Output logits: y_t = W_y * h_t + b_y
        logits.append(W_y @ hs[t] + b_y)

    logits = np.array(logits)  # shape [T, output_size]
    return hs, pre_hs, logits


# ─── BPTT: Backpropagation Through Time ────────────────────────────────────
def rnn_backward(xs, hs, pre_hs, logits, targets, params):
    """
    xs:      list of T input vectors
    hs:      dict of hidden states from forward pass
    pre_hs:  dict of pre-activations from forward pass
    logits:  [T, output_size] array
    targets: [T] integer array of true class indices
    params:  current parameter dict

    Returns:
      grads: dict of gradients matching params keys
      loss:  scalar loss value
    """
    W_x, W_h, b_h = params["W_x"], params["W_h"], params["b_h"]
    W_y, b_y = params["W_y"], params["b_y"]

    T = len(xs)
    loss, probs = cross_entropy_loss(logits, targets)

    # Initialize gradient accumulators
    grads = {k: np.zeros_like(v) for k, v in params.items()}
    dh_next = np.zeros_like(hs[0])  # gradient from future timesteps

    for t in reversed(range(T)):
        # Gradient of loss w.r.t. logits (softmax + cross-entropy combined)
        d_logit = probs[t].copy()
        d_logit[targets[t]] -= 1.0
        d_logit /= T  # average over timesteps

        # Gradient w.r.t. W_y and b_y
        grads["W_y"] += np.outer(d_logit, hs[t])
        grads["b_y"] += d_logit

        # Gradient of loss w.r.t. h_t (from output + from future hidden state)
        dh = W_y.T @ d_logit + dh_next

        # Gradient through tanh: dL/dz_t = dL/dh_t * (1 - h_t^2)
        dz = dh * tanh_grad(pre_hs[t])

        # Gradient w.r.t. W_h, W_x, b_h
        grads["W_h"] += np.outer(dz, hs[t - 1])
        grads["W_x"] += np.outer(dz, xs[t])
        grads["b_h"] += dz

        # Pass gradient to previous timestep
        dh_next = W_h.T @ dz

    return grads, loss


# ─── Gradient clipping ─────────────────────────────────────────────────────
def clip_gradients(grads, max_norm=5.0):
    """Scale all gradients if their global norm exceeds max_norm."""
    total_norm = 0.0
    for g in grads.values():
        total_norm += (g ** 2).sum()
    total_norm = total_norm ** 0.5

    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-8)
        for k in grads:
            grads[k] *= scale

    return grads, total_norm


# ─── SGD update ────────────────────────────────────────────────────────────
def sgd_update(params, grads, lr=1e-2):
    for k in params:
        params[k] -= lr * grads[k]
    return params


# ─── Toy dataset: learn to predict next character in "hello world" ─────────
def make_sequence_dataset():
    text = "hello world! hello world! hello world!"
    chars = sorted(set(text))
    char_to_idx = {c: i for i, c in enumerate(chars)}
    idx_to_char = {i: c for c, i in char_to_idx.items()}
    indices = [char_to_idx[c] for c in text]
    return indices, char_to_idx, idx_to_char, len(chars)


# ─── One-hot encoding ──────────────────────────────────────────────────────
def one_hot(idx, vocab_size):
    v = np.zeros(vocab_size)
    v[idx] = 1.0
    return v


# ─── Training loop ─────────────────────────────────────────────────────────
def train():
    indices, char_to_idx, idx_to_char, vocab_size = make_sequence_dataset()
    hidden_size = 64

    params = init_params(
        input_size=vocab_size,
        hidden_size=hidden_size,
        output_size=vocab_size,
    )

    h0 = np.zeros(hidden_size)
    lr = 0.05
    T = len(indices) - 1  # predict next character

    xs = [one_hot(indices[t], vocab_size) for t in range(T)]
    targets = np.array(indices[1:])

    print(f"Sequence length: {T}, Vocab size: {vocab_size}, Hidden: {hidden_size}")
    print("-" * 60)

    for epoch in range(300):
        hs, pre_hs, logits = rnn_forward(xs, h0, params)
        grads, loss = rnn_backward(xs, hs, pre_hs, logits, targets, params)
        grads, grad_norm = clip_gradients(grads, max_norm=5.0)
        params = sgd_update(params, grads, lr=lr)

        if epoch % 50 == 0:
            # Sample a prediction
            pred_chars = [idx_to_char[np.argmax(logits[t])] for t in range(T)]
            print(f"Epoch {epoch:3d} | Loss: {loss:.4f} | GradNorm: {grad_norm:.3f}")
            print(f"  Predicted: {''.join(pred_chars[:20])}")
            print(f"  Target:    {''.join([idx_to_char[i] for i in targets[:20]])}")
            print()


if __name__ == "__main__":
    train()

What to Notice in the BPTT Code

The line dh_next = W_h.T @ dz is where gradient routing happens. At each timestep in the reverse pass, the gradient that arrived at timestep $t$ is propagated backward to $t-1$ by multiplying through W_h.T; combined with the tanh derivative applied when computing dz, this is one factor of the Jacobian product we derived analytically. If each step shrank the gradient by just 1%, it would retain $0.99^{50} \approx 0.60$ of its original magnitude after 50 steps and $0.99^{200} \approx 0.13$ after 200. With the small initial weights used here (scale=0.01), the per-step factor starts far below 0.99, so the decay is much steeper.

The clip_gradients function prevents the explosive case. If the gradient norm grows past 5.0, all gradients are rescaled proportionally. This keeps training stable without modifying the direction of the gradient update, only its magnitude.
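A standard way to gain confidence in a hand-written BPTT is a finite-difference gradient check. The sketch below is a compact, self-contained restatement of the same forward and backward equations (without the 1e-9 smoothing term inside the log); the function names `forward_backward` and `gradient_check`, the tiny sizes, and the random data are assumptions chosen to keep the check fast.

```python
import numpy as np


def forward_backward(params, xs, targets):
    """Mean cross-entropy loss and BPTT gradients for a tanh RNN."""
    W_x, W_h, b_h, W_y, b_y = (params[k] for k in ("W_x", "W_h", "b_h", "W_y", "b_y"))
    T = len(xs)
    hs = {-1: np.zeros(W_h.shape[0])}
    probs = []
    for t in range(T):
        hs[t] = np.tanh(W_h @ hs[t - 1] + W_x @ xs[t] + b_h)
        logits = W_y @ hs[t] + b_y
        e = np.exp(logits - logits.max())
        probs.append(e / e.sum())
    loss = -np.mean([np.log(probs[t][targets[t]]) for t in range(T)])

    grads = {k: np.zeros_like(v) for k, v in params.items()}
    dh_next = np.zeros_like(hs[0])
    for t in reversed(range(T)):
        d_logit = probs[t].copy()
        d_logit[targets[t]] -= 1.0
        d_logit /= T  # the loss is a mean over timesteps
        grads["W_y"] += np.outer(d_logit, hs[t])
        grads["b_y"] += d_logit
        dz = (W_y.T @ d_logit + dh_next) * (1.0 - hs[t] ** 2)
        grads["W_h"] += np.outer(dz, hs[t - 1])
        grads["W_x"] += np.outer(dz, xs[t])
        grads["b_h"] += dz
        dh_next = W_h.T @ dz
    return loss, grads


def gradient_check(params, xs, targets, eps=1e-5):
    """Relative error between BPTT and central-difference gradients, per parameter."""
    _, grads = forward_backward(params, xs, targets)
    errors = {}
    for name, value in params.items():
        numeric = np.zeros_like(value)
        for idx in np.ndindex(value.shape):
            original = value[idx]
            value[idx] = original + eps
            loss_plus, _ = forward_backward(params, xs, targets)
            value[idx] = original - eps
            loss_minus, _ = forward_backward(params, xs, targets)
            value[idx] = original  # restore the parameter
            numeric[idx] = (loss_plus - loss_minus) / (2 * eps)
        diff = np.linalg.norm(numeric - grads[name])
        errors[name] = diff / (np.linalg.norm(numeric) + np.linalg.norm(grads[name]))
    return errors


rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 3, 4, 3, 6
params = {
    "W_x": rng.normal(scale=0.5, size=(n_hidden, n_in)),
    "W_h": rng.normal(scale=0.5, size=(n_hidden, n_hidden)),
    "b_h": np.zeros(n_hidden),
    "W_y": rng.normal(scale=0.5, size=(n_out, n_hidden)),
    "b_y": np.zeros(n_out),
}
xs = [rng.normal(size=n_in) for _ in range(T)]
targets = rng.integers(0, n_out, size=T)

for name, err in gradient_check(params, xs, targets).items():
    print(f"{name}: relative error {err:.2e}")
```

Relative errors many orders of magnitude below 1 indicate that the analytic and numerical gradients agree; an error near 1 for any parameter usually points to a missing term or a transposed matrix in the backward pass.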

PyTorch Implementation

The NumPy version is for understanding. In production you use PyTorch's nn.RNN, which handles the forward pass, hidden state management, and gradient computation automatically. The code below shows a full training example including the gradient clipping idiom that is standard practice when training RNNs.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils import clip_grad_norm_

# ─── Reproducibility ───────────────────────────────────────────────────────
torch.manual_seed(42)


# ─── RNN Language Model ────────────────────────────────────────────────────
class RNNLanguageModel(nn.Module):
    """
    A simple character-level language model using nn.RNN.
    Predicts the next character at each timestep.
    """

    def __init__(self, vocab_size: int, hidden_size: int, num_layers: int = 1):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # Embedding: integer index -> dense vector
        self.embedding = nn.Embedding(vocab_size, hidden_size)

        # nn.RNN applies: h_t = tanh(W_h * h_{t-1} + W_x * x_t + b)
        # batch_first=True: expects input shape [batch, seq_len, input_size]
        self.rnn = nn.RNN(
            input_size=hidden_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            nonlinearity="tanh",  # default; can also use 'relu'
        )

        # Project hidden state to vocabulary logits
        self.output_layer = nn.Linear(hidden_size, vocab_size)

    def forward(self, x: torch.Tensor, h0: torch.Tensor = None):
        """
        x:  [batch, seq_len] integer tensor of character indices
        h0: [num_layers, batch, hidden_size] initial hidden state (optional)

        Returns:
          logits: [batch, seq_len, vocab_size]
          h_n:    [num_layers, batch, hidden_size] final hidden state
        """
        batch_size = x.size(0)

        if h0 is None:
            # Create the initial state on the same device as the input
            h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size, device=x.device)

        # Embed: [batch, seq_len] -> [batch, seq_len, hidden_size]
        embedded = self.embedding(x)

        # RNN forward: output shape [batch, seq_len, hidden_size]
        # h_n shape: [num_layers, batch, hidden_size]
        output, h_n = self.rnn(embedded, h0)

        # Project to vocab: [batch, seq_len, vocab_size]
        logits = self.output_layer(output)

        return logits, h_n


# ─── Gradient norm monitoring ──────────────────────────────────────────────
def get_grad_norm(model: nn.Module) -> float:
    total_norm = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total_norm += p.grad.data.norm(2).item() ** 2
    return total_norm ** 0.5


# ─── Training function ─────────────────────────────────────────────────────
def train_pytorch_rnn():
    # Toy dataset
    text = "hello world! hello world! hello world! " * 5
    chars = sorted(set(text))
    vocab_size = len(chars)
    char_to_idx = {c: i for i, c in enumerate(chars)}
    idx_to_char = {i: c for c, i in char_to_idx.items()}
    indices = torch.tensor([char_to_idx[c] for c in text], dtype=torch.long)

    # Single "batch" of the full sequence
    # x: all characters except last; y: all characters except first
    x = indices[:-1].unsqueeze(0)  # [1, T]
    y = indices[1:].unsqueeze(0)   # [1, T]

    # Model
    hidden_size = 128
    model = RNNLanguageModel(vocab_size=vocab_size, hidden_size=hidden_size)

    optimizer = optim.Adam(model.parameters(), lr=1e-2)
    criterion = nn.CrossEntropyLoss()

    print(f"Vocab size: {vocab_size}, Sequence length: {x.size(1)}")
    print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
    print("-" * 60)

    for epoch in range(500):
        model.train()
        optimizer.zero_grad()

        # Forward pass
        logits, _ = model(x)  # logits: [1, T, vocab_size]

        # Reshape for loss: [T, vocab_size] vs [T]
        loss = criterion(logits.squeeze(0), y.squeeze(0))

        # Backward pass
        loss.backward()

        # Gradient clipping - essential for RNNs
        grad_norm_before_clip = get_grad_norm(model)
        clip_grad_norm_(model.parameters(), max_norm=5.0)

        optimizer.step()

        if epoch % 100 == 0:
            model.eval()
            with torch.no_grad():
                logits_eval, _ = model(x)
                preds = logits_eval.squeeze(0).argmax(dim=-1)
                pred_text = "".join([idx_to_char[i.item()] for i in preds[:30]])
                target_text = "".join([idx_to_char[i.item()] for i in y.squeeze(0)[:30]])

            print(f"Epoch {epoch:3d} | Loss: {loss.item():.4f} | "
                  f"GradNorm: {grad_norm_before_clip:.3f}")
            print(f"  Predicted: {pred_text}")
            print(f"  Target:    {target_text}")
            print()


# ─── Inspecting gradient magnitudes across timesteps ──────────────────────
def inspect_gradient_decay():
    """
    Rough proxy for gradient flow: runs one backward pass on random data and
    prints the gradient norm of each token's embedding row. It does not
    register hooks or measure per-timestep gradients.
    """
    torch.manual_seed(0)
    vocab_size = 10
    hidden_size = 32
    seq_len = 50

    model = RNNLanguageModel(vocab_size=vocab_size, hidden_size=hidden_size)
    criterion = nn.CrossEntropyLoss()

    x = torch.randint(0, vocab_size, (1, seq_len))
    y = torch.randint(0, vocab_size, (1, seq_len))

    # Forward
    logits, _ = model(x)
    loss = criterion(logits.squeeze(0), y.squeeze(0))
    loss.backward()

    # The gradient of the loss w.r.t. the RNN weight W_hh (W_h in our notation)
    # is accumulated across all timesteps, so per-timestep gradients are not
    # visible here without hooks. As a coarse proxy we inspect the embedding gradients.
    # Each embedding row receives gradient only if its token appeared in the input,
    # and that gradient sums the contributions from every position where the token
    # occurred, so it cannot separate early positions from late ones.

    embedding_grads = model.embedding.weight.grad  # [vocab_size, hidden_size]
    grad_norms = embedding_grads.norm(dim=1)       # [vocab_size]

    print("Gradient norms for each token embedding:")
    for i, norm in enumerate(grad_norms):
        if norm > 0:
            print(f"  Token '{i}': {norm.item():.6f}")


if __name__ == "__main__":
    train_pytorch_rnn()
    print("\n" + "=" * 60)
    print("Gradient decay inspection:")
    inspect_gradient_decay()

Key PyTorch Idioms for RNNs

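Detaching the hidden state between batches. When you process long sequences in chunks (truncated BPTT), you must call h_n.detach() before passing the hidden state to the next chunk. Without detaching, the new chunk's graph stays connected to every previous chunk, so the next backward call tries to backpropagate through the entire history. In PyTorch this typically fails with a RuntimeError about backpropagating through the graph a second time; with retain_graph=True it runs, but it is slow and no longer truncated BPTT.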

batch_first=True. PyTorch's RNN layers default to expecting [seq_len, batch, features] shaped inputs. Setting batch_first=True changes this to [batch, seq_len, features], which is more intuitive and consistent with most downstream code. Always set it explicitly to avoid shape bugs.
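The shape difference is easy to check directly. The following is a minimal sketch with made-up sizes (4 sequences, 7 steps, 3 features, 5 hidden units); only nn.RNN and its batch_first flag come from the text above.

```python
import torch
import torch.nn as nn

batch, seq_len, features, hidden = 4, 7, 3, 5   # hypothetical sizes
x = torch.randn(batch, seq_len, features)

# Default layout: the layer expects [seq_len, batch, features].
rnn_default = nn.RNN(input_size=features, hidden_size=hidden)
out_default, h_default = rnn_default(x.transpose(0, 1))
print(out_default.shape)

# batch_first=True: input and output are [batch, seq_len, ...].
rnn_bf = nn.RNN(input_size=features, hidden_size=hidden, batch_first=True)
out_bf, h_bf = rnn_bf(x)
print(out_bf.shape)

# h_n stays [num_layers, batch, hidden] in both cases.
print(h_default.shape, h_bf.shape)
```

Note that batch_first changes only the input and output tensors, not the returned hidden state, which is a frequent source of shape bugs on its own.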

clip_grad_norm_ before the optimizer step. The function modifies gradients in-place and returns the pre-clip norm. Always log the pre-clip norm during training - if it is consistently below your clip threshold, clipping is not active and there is no explosion. If it is consistently at or above the threshold, you may need a smaller learning rate or architectural changes.

Empirically Measuring Gradient Decay

The equations above make a clear prediction: gradient magnitude decreases exponentially with the distance from the loss. You can verify this directly. The code below runs a randomly initialized vanilla RNN forward on random inputs, then backpropagates a unit-norm gradient from the final hidden state and records its norm at each earlier timestep. No training is involved; the decay is already visible at initialization. It is the same effect that makes the copy task - where the model must reproduce input tokens after a delay of $k$ steps with zero intermediate information - a canonical benchmark for long-range dependency learning.

import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt


def measure_gradient_decay(seq_len=60, hidden_size=32, n_trials=10, seed=7):
    """
    Run the backward pass of BPTT on a random RNN for `n_trials` random inputs.
    Starting from a unit-norm gradient at the final hidden state h_T, record
    the norm of the gradient that reaches each earlier hidden state h_t
    (a vector-Jacobian product with dh_T/dh_t).
    Returns: array of shape [seq_len] with average gradient norm at each step.
    """
    np.random.seed(seed)

    grad_norms_all = np.zeros((n_trials, seq_len))

    for trial in range(n_trials):
        # Gaussian W_h scaled so its spectral radius is roughly 0.9 (contractive)
        W_h = np.random.randn(hidden_size, hidden_size) * 0.9 / hidden_size**0.5
        W_x = np.random.randn(hidden_size, 1) * 0.1
        b_h = np.zeros(hidden_size)

        # Random input sequence
        xs = [np.random.randn(1) for _ in range(seq_len)]
        h = np.zeros(hidden_size)

        # Forward pass - store pre-activations and hidden states
        hs = [h.copy()]
        pre_hs = []
        for t in range(seq_len):
            z = W_h @ h + W_x @ xs[t] + b_h
            pre_hs.append(z.copy())
            h = np.tanh(z)
            hs.append(h.copy())

        # Backward pass - track gradient norm at each step
        # Start with unit gradient at the final hidden state h_T
        dh = np.ones(hidden_size) / hidden_size**0.5

        grad_norms_trial = []
        for t in reversed(range(seq_len)):
            grad_norms_trial.append(np.linalg.norm(dh))
            # Gradient through tanh
            dz = dh * (1.0 - np.tanh(pre_hs[t]) ** 2)
            # Gradient to previous hidden state
            dh = W_h.T @ dz

        # Reverse so index 0 = first timestep, index T-1 = last
        grad_norms_trial = list(reversed(grad_norms_trial))
        grad_norms_all[trial] = grad_norms_trial

    return grad_norms_all.mean(axis=0)


# Run the experiment
seq_len = 80
mean_grad_norms = measure_gradient_decay(seq_len=seq_len, hidden_size=64, n_trials=20)

print(f"Gradient norm at timestep {seq_len-1} (most recent): {mean_grad_norms[-1]:.4f}")
print(f"Gradient norm at timestep 60:                        {mean_grad_norms[60]:.6f}")
print(f"Gradient norm at timestep 40:                        {mean_grad_norms[40]:.6f}")
print(f"Gradient norm at timestep 20:                        {mean_grad_norms[20]:.6f}")
print(f"Gradient norm at timestep 0 (earliest):              {mean_grad_norms[0]:.8f}")
print()
print(f"Ratio (step 0 / step {seq_len-1}): {mean_grad_norms[0] / mean_grad_norms[-1]:.2e}")

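A run with a 64-unit hidden size and random weights produces output of the following shape. Treat the figures as illustrative: the exact values depend on the seed and on the scale of $W_h$, and the most recent timestep reads 1.0 by construction because the backward pass starts from a unit-norm vector.

Gradient norm at timestep 79 (most recent): 1.0000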
Gradient norm at timestep 60:               0.004312
Gradient norm at timestep 40:               0.000031
Gradient norm at timestep 20:               0.000000
Gradient norm at timestep 0 (earliest):     0.00000000

Ratio (step 0 / step 79): 3.2e-09

In this run the gradient at the earliest timestep is roughly nine orders of magnitude smaller than at the most recent timestep. This is not a small difference that a larger learning rate compensates for. The gradient signal from the early past is effectively lost: contributions from anything more than 30–40 steps back are negligible next to those from recent steps.

This experiment also reveals something important about initialization. With a well-conditioned $W_h$ (orthogonal initialization), the decay is slower but still exponential for $\tanh$ activations. The spectral radius of $W_h$ determines the rate; the $\tanh$ saturation sets the ceiling for how stable that rate can be.
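The interaction between a well-conditioned $W_h$ and the $\tanh$ derivative can be isolated in a few lines. The sketch below is a simplification, not the experiment above: it replaces the data-dependent $\tanh$ derivative with an assumed constant (`tanh_deriv`), and `backward_norms` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np

def backward_norms(W_h, steps, tanh_deriv=1.0, seed=0):
    """Norm of a unit gradient vector after each simplified backward step."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W_h.shape[0])
    v /= np.linalg.norm(v)
    norms = [np.linalg.norm(v)]
    for _ in range(steps):
        # One backward step: scale by the (assumed constant) tanh derivative,
        # then multiply by W_h transposed.
        v = W_h.T @ (tanh_deriv * v)
        norms.append(np.linalg.norm(v))
    return np.array(norms)

n = 64
# Random orthogonal matrix from the QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((n, n)))

linear = backward_norms(Q, steps=50)                     # no squashing
with_tanh = backward_norms(Q, steps=50, tanh_deriv=0.9)  # mild saturation

print(f"orthogonal, no squashing: {linear[-1]:.4f}")
print(f"orthogonal, tanh' = 0.9:  {with_tanh[-1]:.6f}")
```

With the orthogonal matrix alone the norm is preserved at every step; once each step is also scaled by a derivative below 1, the decay is geometric again, which is the point the paragraph above makes.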

Production Engineering Notes

When Vanilla RNNs Still Make Sense

The answer in 2024 is: almost never for sequence modeling from scratch, but often for retrofitting or constrained environments.

Microcontroller deployment. If you are running inference on an ESP32, a Cortex-M4, or any device where memory is measured in kilobytes, a vanilla RNN with a 32-unit hidden state has roughly 3,000 parameters and runs in under a millisecond per step. A transformer with even the smallest practical configuration is orders of magnitude larger. For on-device keyword spotting, gesture recognition, or simple sensor anomaly detection, a well-trained GRU (the simplest upgrade from vanilla RNN) is often the right call.

Streaming real-time inference. An RNN processes one token at a time, keeps a fixed-size state, and produces a prediction immediately at constant cost per step. Bidirectional transformers need the entire input sequence before they can compute attention; causal transformers can stream, but the cache they attend over grows with the context. For applications with strict latency requirements and long input streams - live audio processing, real-time ECG monitoring, high-frequency trading signals - the sequential nature of RNNs is a feature, not a bug.
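A streaming loop of this kind might look like the sketch below. The sizes, the `readout` head, and the random `stream` tensor are placeholders invented for illustration; nn.RNNCell is PyTorch's single-step counterpart to nn.RNN.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.RNNCell(input_size=3, hidden_size=8)  # hypothetical sizes
readout = nn.Linear(8, 1)                       # hypothetical per-step output head

h = torch.zeros(1, 8)         # fixed-size state; reset at sequence boundaries
stream = torch.randn(20, 3)   # stand-in for samples arriving one at a time

predictions = []
with torch.no_grad():
    for sample in stream:
        # One step: constant compute and memory, independent of history length.
        h = cell(sample.unsqueeze(0), h)
        predictions.append(readout(h).item())

print(len(predictions), h.shape)
```

The only thing carried between arrivals is `h`, so memory use does not depend on how long the stream has been running.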

Sequence lengths where transformers are overkill. For sequences of 10–30 steps with clear local dependencies and abundant training data, a GRU or LSTM often trains faster, generalizes comparably, and deploys more easily than a transformer. The transformer is the correct default for language and long documents; it is not always the correct default for 15-step multivariate time series with 10,000 training examples.

Truncated BPTT - The Standard Training Trick

Training an RNN on sequences of thousands of timesteps with full BPTT is both computationally expensive and numerically unstable. The standard practice is truncated BPTT: chunk the sequence into windows of $k$ steps, run the forward pass on each window, and backpropagate only within each window. The hidden state is carried forward between windows (detached from the computation graph), so the model retains memory, but gradients are only propagated $k$ steps back.

The choice of $k$ is a hyperparameter that trades gradient signal range against memory usage. For language modeling, $k = 35$ was standard in early LSTM work (Zaremba et al., 2014). For time series, $k = 50$ – $200$ is common. The right value depends on how long your actual dependencies are.
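As a rough sketch of the chunking described above, the loop below trains a toy next-token model with a window of $k = 10$. The model pieces, sizes, optimizer, and random token sequence are all assumptions made for the example; the part that matters is the detach at the end of each window.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab, hidden, k = 12, 16, 10   # hypothetical sizes; k is the truncation window
embed = nn.Embedding(vocab, hidden)
rnn = nn.RNN(hidden, hidden, batch_first=True)
head = nn.Linear(hidden, vocab)
params = list(embed.parameters()) + list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
criterion = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab, (1, 101))       # one long toy sequence
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # next-token prediction

h = None      # nn.RNN starts from zeros when no initial state is given
losses = []
for start in range(0, inputs.size(1), k):
    x = inputs[:, start:start + k]
    y = targets[:, start:start + k]

    optimizer.zero_grad()
    out, h = rnn(embed(x), h)
    loss = criterion(head(out).reshape(-1, vocab), y.reshape(-1))
    loss.backward()                               # reaches back at most k steps
    nn.utils.clip_grad_norm_(params, max_norm=5.0)
    optimizer.step()

    h = h.detach()   # keep the values, cut the graph to earlier windows
    losses.append(loss.item())

print(len(losses), h.requires_grad)
```

Removing the `h.detach()` line reconnects each window to all earlier ones, which is exactly what truncation is meant to avoid.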

Gradient Norm Monitoring in Production

Every RNN training run should log the gradient norm before and after clipping. A healthy training run shows:

Pre-clip norm that is roughly stable across epochs, not consistently growing
Post-clip norm equal to pre-clip norm most of the time (clipping rarely activates)
Occasional spikes in pre-clip norm that clipping handles, followed by recovery

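If the pre-clip norm sits at or above the clip threshold on nearly every step, your learning rate is too high or your weight initialization is wrong. If the pre-clip norm is consistently near zero for all parameters, almost no gradient is reaching the weights and you need LSTM or GRU. A healthy-looking total norm does not rule out vanishing gradients over time, because recent timesteps can dominate the sum; diagnosing long-range decay requires per-timestep measurements.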
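One way to collect both numbers in a single step is sketched below. `global_grad_norm` is a hypothetical helper written for this example, and the toy model and deliberately inflated loss are assumptions; the only library behaviour relied on is that clip_grad_norm_ returns the total norm it measured before clipping.

```python
import torch
import torch.nn as nn

def global_grad_norm(parameters):
    """L2 norm over all parameter gradients, treated as one flat vector."""
    grads = [p.grad.detach().flatten() for p in parameters if p.grad is not None]
    return torch.cat(grads).norm().item()

torch.manual_seed(0)
model = nn.RNN(4, 8, batch_first=True)   # hypothetical toy model
max_norm = 1.0

x = torch.randn(2, 30, 4)
out, _ = model(x)
loss = out.pow(2).sum() * 100.0          # inflated to make clipping likely
loss.backward()

# The return value is the norm before clipping; gradients are scaled in place.
pre_clip = nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm).item()
post_clip = global_grad_norm(model.parameters())
clipped = pre_clip > max_norm

print(f"pre={pre_clip:.3f} post={post_clip:.3f} clipped={clipped}")
```

Logging `pre_clip`, `post_clip`, and the fraction of steps where `clipped` is true is usually enough to distinguish the healthy pattern above from a run that is clipping constantly.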

Weight Initialization Matters More Than You Think

For RNNs, the initialization of $W_h$ - the hidden-to-hidden weight matrix - critically determines whether early training is stable. Three approaches are common:

Small random initialization (scale = 0.01): safe but slow to develop memory dynamics
Identity initialization (or near-identity): $W_h = I + \epsilon \cdot \text{random}$ . This ensures that at initialization, $h_t \approx h_{t-1}$ , so gradients flow without shrinking. Le et al. (2015) showed this enables vanilla RNNs with ReLU activations to learn dependencies over long sequences.
Orthogonal initialization: $W_h$ is initialized to a random orthogonal matrix. All singular values are exactly 1.0, which guarantees that the gradient product neither grows nor shrinks due to $W_h$ alone (Saxe et al., 2013). This is the theoretically motivated approach and is what torch.nn.init.orthogonal_ implements.

In practice: use orthogonal initialization for $W_h$ if you are training a vanilla RNN on sequences longer than 50 steps. For LSTM and GRU, the gates provide sufficient gradient routing that initialization is less critical.
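Applying this to PyTorch's nn.RNN could look like the following sketch. The layer sizes are arbitrary, and zeroing the biases is an extra choice made here rather than something the text prescribes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=8, hidden_size=32, batch_first=True)  # hypothetical sizes

# nn.RNN names its parameters weight_ih_l{k}, weight_hh_l{k}, bias_ih_l{k}, bias_hh_l{k}.
for name, param in rnn.named_parameters():
    if name.startswith("weight_hh"):
        nn.init.orthogonal_(param)   # hidden-to-hidden matrix only
    elif name.startswith("bias"):
        nn.init.zeros_(param)        # assumption: start biases at zero

# Check the property the text relies on: every singular value is 1.
singular_values = torch.linalg.svdvals(rnn.weight_hh_l0.detach())
print(singular_values.min().item(), singular_values.max().item())
```

The input-to-hidden matrix is left at its default initialization, since the gradient product through time involves only the hidden-to-hidden weights.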

The Case for GRU Over Vanilla RNN in Any Non-Trivial Task

The GRU (Cho et al., 2014) is a direct drop-in replacement for a vanilla RNN that adds two gates: a reset gate and an update gate. It has roughly three times the parameters of a vanilla RNN at the same hidden size (one weight set each for the reset gate, the update gate, and the candidate state) but avoids the worst of the vanishing gradient problem through a mechanism similar to residual connections. In PyTorch, it is literally one word change: nn.GRU instead of nn.RNN, with identical interface. If you are building something that requires sequential memory and will not be replaced by a transformer, use GRU. The vanilla RNN is for teaching and for extremely constrained environments only.
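The parameter cost of the swap can be read straight off the modules. This is a small sketch with arbitrary sizes; `n_params` is a helper defined here, not a PyTorch function.

```python
import torch.nn as nn

def n_params(module):
    """Total number of scalar parameters in a module."""
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size = 16, 32   # hypothetical sizes
counts = {
    "RNN": n_params(nn.RNN(input_size, hidden_size)),
    "GRU": n_params(nn.GRU(input_size, hidden_size)),
    "LSTM": n_params(nn.LSTM(input_size, hidden_size)),
}
for name, count in counts.items():
    print(f"{name:4s} {count:6d}  ({count / counts['RNN']:.0f}x vanilla)")
```

For a single layer, nn.GRU holds three weight-and-bias sets and nn.LSTM four, against one for nn.RNN, so the ratios are exactly 3 and 4 regardless of the sizes chosen.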

Common Mistakes

:::danger Forgetting to detach the hidden state between batches When training with truncated BPTT, you chunk the sequence and reuse the hidden state from the previous chunk. If you pass h_n directly without calling .detach(), the new chunk's graph stays connected to every previous chunk, and each backward call tries to backpropagate through that entire history. In PyTorch the second backward call typically fails with a RuntimeError about backpropagating through the graph a second time; if you work around that with retain_graph=True, memory grows with the number of processed chunks and eventually crashes the process. Always detach: h0 = h_n.detach(). :::

:::danger Using nn.RNN when you need long-range dependencies A vanilla nn.RNN with tanh activation will typically fail to learn dependencies spanning more than 10–20 timesteps on most practical tasks. If your task requires the model to remember information from more than a few steps back - and most real tasks do - use nn.LSTM or nn.GRU instead. nn.GRU has the same call signature as nn.RNN; nn.LSTM differs only in returning its state as a (h_n, c_n) tuple. The engineering cost of the switch is small. The main cost is a parameter count roughly three times (GRU) or four times (LSTM) that of the vanilla RNN at the same hidden size. :::

:::warning Not clipping gradients Failing to call clip_grad_norm_ before optimizer.step() when training RNNs leads to sporadic catastrophic loss spikes. These spikes can permanently destabilize training by corrupting weights. The fix is one line. Add it. Max norm values of 1.0–5.0 are standard; start with 5.0 and reduce if you see instability. :::

:::warning Processing variable-length sequences without padding masks When batching sequences of different lengths, PyTorch's nn.utils.rnn.pack_padded_sequence and pad_packed_sequence must be used to avoid the RNN processing padding tokens. Without packing, the hidden state is updated on padding positions, introducing noise into the final hidden state and corrupting downstream predictions. This is a common source of subtle accuracy degradation that does not produce obvious errors. :::
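The packing workflow from the warning above can be sketched as follows. The three random sequences and the layer sizes are invented for the example; the check at the end compares the packed result for one sequence against running that sequence on its own.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
rnn = nn.GRU(input_size=3, hidden_size=5, batch_first=True)  # hypothetical sizes

seqs = [torch.randn(6, 3), torch.randn(4, 3), torch.randn(2, 3)]
lengths = torch.tensor([len(s) for s in seqs])
padded = pad_sequence(seqs, batch_first=True)   # [3, 6, 3], zero-padded

# Packing tells the RNN where each sequence really ends.
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
packed_out, h_n = rnn(packed)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

# h_n[0, i] is the state after the last real step of sequence i,
# so it should match running that sequence alone with no padding.
_, h_single = rnn(seqs[1].unsqueeze(0))
matches = torch.allclose(h_n[0, 1], h_single[0, 0], atol=1e-5)
print(out.shape, out_lengths.tolist(), matches)
```

Feeding `padded` to the layer directly would instead return, for the shorter sequences, a final state that has been updated on the zero-padding steps as well.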

:::warning Using the same hidden state across unrelated sequences The hidden state h_n from one sequence is meaningless as an initialization for a different, unrelated sequence. Always reset to zeros (or a learned initial state) between sequences. A common mistake when implementing streaming inference is to carry h_n forward indefinitely - including across sequence boundaries - which corrupts the model's internal memory. :::

:::tip Use ReLU instead of tanh for very long sequences with identity initialization Le et al. (2015) showed that a vanilla RNN with ReLU activations and identity-initialized $W_h$ (called IRNN) can learn dependencies over hundreds of timesteps. The ReLU derivative is 1 for positive inputs (no saturation), which eliminates the squashing contribution to gradient vanishing. This is not better than LSTM in general but is a useful data point: much of the vanishing gradient problem comes from $\tanh$ saturation, not just from $W_h$ . :::

Interview Q&A

Q1: Explain the vanishing gradient problem in RNNs to someone who has not seen the math.

Answer: When we train an RNN, we need to figure out how much each weight contributed to the final prediction error. For weights that were used many timesteps ago, we have to trace the responsibility back through every intermediate step. Each step multiplies the error signal by a factor. If that factor is consistently less than 1 - which is typical because the tanh activation squashes values and its derivative is at most 1 - the error signal shrinks exponentially with each step back in time.

Concretely, if each step multiplies the gradient by 0.9, after 50 steps back in time the gradient is $0.9^{50} \approx 0.005$ . After 100 steps: $0.9^{100} \approx 0.000027$ . Because an RNN shares its weights across timesteps, this means the error signal about what happened 100 steps ago contributes essentially nothing to the weight update, so the model never learns to use that information. This is not a problem of optimization technique - no learning rate adjustment fixes it. It is a structural property of the architecture.
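The arithmetic is quick to reproduce. The sketch below assumes the same constant per-step factor of 0.9 used in the answer, which is a simplification of the real, data-dependent Jacobians.

```python
import math

factor = 0.9  # assumed constant per-step gradient scaling
remaining = {steps: factor ** steps for steps in (10, 50, 100)}
for steps, value in remaining.items():
    print(f"{steps:3d} steps back: {value:.6f} of the original signal")

# Steps until less than 1% of the signal is left: solve factor**n < 0.01.
steps_to_one_percent = math.ceil(math.log(0.01) / math.log(factor))
print(f"below 1% after {steps_to_one_percent} steps")
```

Even with a factor this close to 1, fewer than 50 steps are enough to lose 99% of the signal.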

Q2: Why does the chain rule produce an exponential decay in RNN gradients specifically?

Answer: The gradient of the loss at timestep $T$ with respect to the hidden state at timestep $t$ involves a product of $T - t$ Jacobian matrices. Each Jacobian has the form $\text{diag}(1 - h_t^2) \cdot W_h^\top$ . The norm of this product determines whether the gradient grows or shrinks. If the largest singular value of $W_h$ , scaled by the typical $\tanh$ derivative, is less than 1, the product decays geometrically. Since $\tanh$ derivatives are bounded above by 1 and the weight matrix is typically initialized small, the product almost always decays. The key word is "exponential" - the gradient shrinks roughly as the per-step factor raised to the power of the number of timesteps, so even a factor only slightly below 1 (say 0.9) leaves about 0.5% of the signal after 50 steps and almost nothing after 100.
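Written out, the argument is a norm bound on the Jacobian product. The following is a sketch using the spectral norm $\| \cdot \|_2$ and $\sigma_{\max}(W_h)$ for the largest singular value of $W_h$; it gives a sufficient condition for vanishing, not an exact decay rate.

$$
\frac{\partial h_T}{\partial h_t} = \prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}},
\qquad
\left\| \frac{\partial h_T}{\partial h_t} \right\|_2
\le \prod_{k=t+1}^{T} \left\| \text{diag}(1 - h_k^2) \right\|_2 \, \left\| W_h \right\|_2
\le \sigma_{\max}(W_h)^{\,T - t}
$$

The last step uses $0 < 1 - h_k^2 \le 1$ for $\tanh$ outputs. When $\sigma_{\max}(W_h) < 1$ the bound itself decays geometrically in $T - t$, and saturation (entries of $1 - h_k^2$ well below 1) only makes the true product smaller.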

Q3: How does LSTM solve the vanishing gradient problem?

Answer: The LSTM replaces the single $\tanh$ hidden state with a cell state $c_t$ that is updated through additive operations rather than multiplicative ones. The cell state update is: $c_t = f_t \cdot c_{t-1} + i_t \cdot g_t$ , where $f_t$ is the forget gate, $i_t$ is the input gate, and $g_t$ is the candidate update.

The critical difference is the $f_t \cdot c_{t-1}$ term. Along the direct cell-state path, the gradient of $c_t$ with respect to $c_{t-1}$ is simply $\text{diag}(f_t)$ - the forget gate values, each a sigmoid output between 0 and 1. When the forget gate is close to 1, the gradient through the cell state is close to 1, not multiplied by a decaying factor. The model can learn to keep the forget gate near 1 for signals it wants to preserve, effectively creating a shortcut that allows gradients to flow back hundreds of timesteps without shrinking. This is analogous to the residual connection in ResNets, which was also motivated by gradient flow.
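As a worked equation, following only the direct path through the cell state and treating the gates as constants with respect to $c_{t-1}$ (an approximation that ignores their indirect dependence through $h_{t-1}$):

$$
\frac{\partial c_T}{\partial c_t} \approx \prod_{k=t+1}^{T} \text{diag}(f_k)
$$

If every forget gate entry sits near 0.99, a signal 100 steps back is scaled by about $0.99^{100} \approx 0.37$, compared with $0.9^{100} \approx 0.000027$ for the vanilla RNN example. The product can still vanish when the gates close, but that is now something the model learns rather than something the architecture imposes.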

Q4: What is the difference between vanishing and exploding gradients, and how are they treated differently?

Answer: Both are consequences of the same product of Jacobians. Vanishing gradients occur when the spectral radius is less than 1; exploding gradients occur when it is greater than 1.

They require different fixes. Exploding gradients are addressed by gradient clipping: before applying the optimizer update, compute the global gradient norm and rescale all gradients proportionally if it exceeds a threshold. This preserves gradient direction and prevents the update from being catastrophically large. It is one line of code and works reliably.
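In symbols, with $g$ the concatenation of all parameter gradients and $c$ the clipping threshold (both names chosen here for illustration), global-norm clipping is:

$$
g \leftarrow g \cdot \min\left(1, \frac{c}{\| g \|_2}\right)
$$

When $\| g \|_2 \le c$ the gradient is left untouched; otherwise it is rescaled to have norm $c$ while keeping its direction. Implementations typically add a small constant to the denominator for numerical safety.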

Vanishing gradients cannot be fixed by clipping - you cannot scale up a near-zero gradient without introducing arbitrary signal. The fix is mainly architectural: use LSTM or GRU (which provide gradient shortcuts through gating) or attention mechanisms (which provide direct connections from each output to each input). Orthogonal weight initialization (which starts $W_h$ with singular values all equal to 1) removes the weight-matrix contribution to vanishing at the start of training, but it is a mitigation rather than a cure. In modern practice, for most sequence tasks, the architectural fix is the right answer.

Q5: In an interview, how would you decide between using an RNN/LSTM and a Transformer for a new sequence modeling task?

Answer: I would ask four questions:

First, what is the sequence length? For sequences under 512 tokens and without very long-range dependencies, LSTM remains competitive and is much easier to deploy. For longer sequences or when long-range patterns matter, transformer attention provides direct O(1)-hop connections between any two positions.

Second, is inference streaming or offline? If I need to process each token as it arrives and produce an immediate output with fixed memory, an RNN/LSTM is the natural choice. Bidirectional transformers require the full input to compute attention; causal transformers can stream, but their cache grows with the context. Linear-attention variants bound that cost, typically at some loss in quality.

Third, how much data do I have? Transformers are data-hungry. An LSTM trained on 50,000 sequences can generalize well. A transformer on the same data often underperforms unless pretrained. If data is limited, LSTM or GRU is likely better.

Fourth, what are the deployment constraints? A small LSTM with 64 hidden units runs on a microcontroller. A transformer of practical size usually needs considerably more memory and compute, often a GPU or a high-end CPU. If memory and compute are constrained, LSTM wins by default.

My default in 2024 for a new project without constraints: use a transformer-based pretrained model if one exists for the domain, fine-tune it. If building from scratch on a new modality with limited data and sequences under 200 steps: start with a 2-layer GRU, establish a baseline, and upgrade if needed.

Q6: What is Backpropagation Through Time (BPTT), and what does "truncated" BPTT mean in practice?

Answer: BPTT is backpropagation applied to the unrolled RNN computation graph. When you unroll an RNN through $T$ timesteps, you get a computation graph that looks like a feedforward network of depth $T$ , where all layers share weights. BPTT computes gradients by unrolling this graph backward from the loss to the input, accumulating gradient contributions from every timestep.

Full BPTT over a long sequence is expensive in memory ( $O(T)$ activations must be stored for the backward pass) and numerically unstable (very long chains of Jacobian products). Truncated BPTT addresses this by processing the sequence in chunks of $k$ steps. The forward pass processes chunk 1, then chunk 2, and so on, carrying the hidden state forward between chunks. The backward pass only unrolls within each chunk - $k$ steps instead of $T$ steps. The hidden state that is passed between chunks is detached from the computation graph (no gradient flows through it), so chunk 2's backward pass does not try to reach back through chunk 1.

The tradeoff: gradients cannot capture dependencies that span more than $k$ steps. In practice, $k = 35$ to $k = 200$ is common. If your task has dependencies longer than $k$ , the model cannot learn them with this setup, and you need either a longer truncation window, a full attention mechanism, or a different architecture.

Q7: A colleague says "just use a bigger learning rate to compensate for vanishing gradients." What is wrong with this?

Answer: An RNN shares the same weights across all timesteps, so the gradient for each weight is a sum of contributions, one per timestep. With vanishing gradients, the contributions that carry information about distant timesteps are near zero, and the sum is dominated by recent timesteps. Increasing the learning rate multiplies the whole sum by the same factor. It makes the updates driven by recent timesteps more aggressive (potentially causing instability or oscillation) without meaningfully improving the learning signal from distant timesteps, because a large learning rate times a near-zero contribution is still near-zero.

What is needed is for the gradient signal from distant timesteps to be of a similar magnitude to the signal from recent timesteps - not for all gradients to be scaled up proportionally. That requires a fundamentally different gradient flow path (LSTM cell state, attention, residual connections). Per-parameter adaptive learning rates (Adam) do not close the gap either: they rescale each parameter's total gradient but cannot reweight the per-timestep contributions inside it.
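A toy calculation makes the point concrete. The sketch below models each timestep's contribution to a shared weight's gradient as a scalar that shrinks by an assumed factor of 0.9 per step; real contributions are matrices that can also partly cancel, so this illustrates the scaling argument only.

```python
import numpy as np

decay = 0.9                          # assumed per-step gradient scaling
steps_back = np.arange(100)          # 0 = most recent timestep
contributions = decay ** steps_back  # relative size of each timestep's term

shares = []
for lr in (0.01, 0.1, 1.0):
    update_total = lr * contributions.sum()
    update_distant = lr * contributions[50:].sum()   # 50 or more steps back
    shares.append(update_distant / update_total)
    print(f"lr={lr}: distant share of update = {shares[-1]:.4%}")
```

The share attributable to timesteps 50 or more steps back is about half a percent at every learning rate: the learning rate cancels out of the ratio.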

Conceptual Connections to Other Architectures

Understanding the vanishing gradient problem in RNNs directly illuminates the design rationale of every major architecture that came after.

ResNets - The Feedforward Analogue

ResNets (He et al., 2015) solved the vanishing gradient problem in deep feedforward networks using skip connections: $\text{output} = F(x) + x$ . The $+x$ term creates a gradient highway - during backpropagation, the gradient of the loss with respect to the input of a residual block includes a direct additive term $1.0$ (from the skip connection), regardless of what $F(x)$ does. This prevents gradient decay even in networks of 1,000 layers.
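The gradient highway follows from one application of the chain rule. Writing $y = F(x) + x$ for a residual block and $\mathcal{L}$ for the loss (notation introduced here for illustration):

$$
\frac{\partial \mathcal{L}}{\partial x}
= \frac{\partial \mathcal{L}}{\partial y} \left( \frac{\partial F}{\partial x} + I \right)
= \frac{\partial \mathcal{L}}{\partial y} \frac{\partial F}{\partial x} + \frac{\partial \mathcal{L}}{\partial y}
$$

The second term passes the upstream gradient through unchanged, so even if $\partial F / \partial x$ is close to zero, the block cannot shrink the gradient to nothing.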

The LSTM cell state update $c_t = f_t \cdot c_{t-1} + i_t \cdot g_t$ is the sequential equivalent. Along the direct path, the gradient of $c_t$ with respect to $c_{t-1}$ is $\text{diag}(f_t)$ - an elementwise learned gate, not a product of dense Jacobians. When the forget gate is near 1, gradient flows through time with little decay.

Attention Mechanisms - Eliminating Sequential Bottlenecks

The attention mechanism (Bahdanau et al., 2015) takes a different approach entirely: instead of trying to route information through a sequential hidden state, it creates direct connections from each output position to each input position. The gradient of the loss at any output position with respect to any input position is a single hop - no chain of Jacobians, no exponential decay.

The transformer (Vaswani et al., 2017) extends this to self-attention, where every position in the sequence attends to every other position. The maximum gradient path length between any two positions in the sequence is O(1) instead of O(T). This is the fundamental architectural reason why transformers outperform RNNs on long-sequence tasks: it is not a matter of capacity or optimization tricks, but of gradient path length.

Highway Networks - The Intermediate Step

Highway networks (Srivastava et al., 2015) introduced gating to feedforward networks before ResNets, using a learned gate to interpolate between $F(x)$ and $x$ : $\text{output} = T(x) \cdot F(x) + (1 - T(x)) \cdot x$ . This closely parallels the gated state updates of LSTM and GRU and was explicitly motivated by the vanishing gradient analysis.

Recognizing these connections across architectures is valuable for interviews. When asked "why does architecture X work," the answer almost always involves gradient flow analysis. The specific mechanism differs - additive skip connections, multiplicative gates, direct attention paths - but the underlying problem being solved is the same: making it possible for gradient signals to reach every parameter that influenced the output.

Summary

The recurrent neural network is an elegant solution to the problem of variable-length sequential input. By maintaining a hidden state that is updated at each timestep using shared weights, it achieves parameter efficiency and handles sequences of arbitrary length. The forward pass is straightforward: $h_t = \tanh(W_h h_{t-1} + W_x x_t + b_h)$ .

The training failure is structural and mathematical. Backpropagation through time computes gradients as a product of Jacobian matrices - one per timestep. When the spectral radius of each Jacobian is less than 1, the product decays exponentially with sequence length. For a sequence of 100 steps with a per-step decay of 0.9, the gradient signal at step 1 is roughly 30,000 times smaller than at step 99 ( $0.9^{98} \approx 0.000033$ ). The model learns to ignore distant history, not because it chose to, but because the optimizer never received a meaningful signal about it.

The fix for exploding gradients is gradient clipping - one line of code. The fix for vanishing gradients is architectural - LSTM, GRU, or attention. Nearly every production sequence modeling system uses at least one of these architectural mitigations. Understanding why they are necessary requires understanding the precise mathematics of BPTT, which this lesson has developed from first principles.

The next lesson covers LSTM and GRU in detail - specifically how the gating mechanism routes gradients and what the cell state learns to store.

