Skip to main content

RNNs and LSTMs - Sequence Modeling Before Transformers

Reading time: ~40 min | Interview relevance: High | Roles: MLE, AI Eng, Research Engineer, NLP Engineer

The Real Interview Moment

You are in a Meta MLE interview. The interviewer draws a sequence-to-sequence diagram on the whiteboard and says: "Walk me through how an LSTM processes a sentence. What happens at each time step, what are the gates doing, and why can it learn long-range dependencies that a vanilla RNN cannot?"

You start explaining the hidden state update, and she immediately follows up: "Write the gradient flow equations for a vanilla RNN. Show me exactly where the vanishing gradient problem occurs. Then show me how the LSTM cell state fixes it."

This is not an obscure question. RNNs remain foundational to understanding modern sequence modeling. Even though Transformers have replaced RNNs in most production systems, interviewers ask about RNNs to test whether you understand the problems that Transformers were designed to solve. If you cannot explain vanishing gradients mathematically, you cannot convincingly explain why self-attention was such a breakthrough.

Candidates who say "RNNs are obsolete" get a probing follow-up: "When would you still use one?" Those who cannot answer lose credibility. This page gives you complete coverage of recurrent architectures - the math, the intuition, the gates, the modern alternatives, and the edge cases where RNNs still win.

What You Will Master

  • Define the vanilla RNN forward pass and derive its gradient flow
  • Prove mathematically why vanilla RNNs suffer from vanishing and exploding gradients
  • Explain each LSTM gate (forget, input, output) with precise equations and intuition
  • Compare LSTM vs GRU - when to use which and the parameter tradeoff
  • Draw and explain bidirectional RNNs and their use cases
  • Articulate why Transformers replaced RNNs (parallelism, long-range dependencies, scaling)
  • Identify when RNNs still make sense (streaming, low-latency, small devices)
  • Solve interview problems on RNN architectures, gradient flow, and sequence modeling design

Self-Assessment: Where Are You Now?

Skill1 -- Cannot2 -- Vaguely3 -- Can Explain4 -- Can Derive5 -- Can TeachYour Score
Write the vanilla RNN equations___
Derive the vanishing gradient problem___
Write all LSTM gate equations___
Explain why the cell state fixes vanishing gradients___
Compare LSTM vs GRU with parameter counts___
Explain bidirectional RNNs___
Articulate why Transformers replaced RNNs___
Identify when RNNs still make sense___

Target: All 4s and 5s before your interview.

Part 1 - The Vanilla RNN

The Core Idea

A recurrent neural network processes sequences one element at a time, maintaining a hidden state that acts as a "memory" of what it has seen so far. At each time step tt, the RNN takes two inputs: the current input xtx_t and the previous hidden state ht1h_{t-1}, and produces a new hidden state hth_t.

Forward Pass Equations

ht=tanh(Whhht1+Wxhxt+bh)h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h) yt=Whyht+byy_t = W_{hy} h_t + b_y

Where:

  • htRdhh_t \in \mathbb{R}^{d_h} is the hidden state at time tt
  • xtRdxx_t \in \mathbb{R}^{d_x} is the input at time tt
  • WhhRdh×dhW_{hh} \in \mathbb{R}^{d_h \times d_h} is the hidden-to-hidden weight matrix
  • WxhRdh×dxW_{xh} \in \mathbb{R}^{d_h \times d_x} is the input-to-hidden weight matrix
  • WhyRdy×dhW_{hy} \in \mathbb{R}^{d_y \times d_h} is the hidden-to-output weight matrix

RNN Unrolled Through Time

RNN Unrolled Through Time - Shared Weights Across All Steps

The key insight: the same weight matrices are shared across all time steps. This is weight tying - it is what makes RNNs parameter-efficient compared to having a separate network for each position, but it is also what creates the vanishing gradient problem.

60-Second Answer

"A vanilla RNN processes sequences step by step, updating a hidden state at each time step using the current input and the previous hidden state. The hidden state acts as a compressed memory of the entire history. The same weights are shared across all time steps, which gives parameter efficiency but also causes the vanishing gradient problem - during backpropagation through time, gradients are multiplied by the same weight matrix at each step, causing them to either explode or vanish exponentially. This makes vanilla RNNs unable to learn dependencies longer than about 10-20 steps in practice."

Parameter Count

For a vanilla RNN with hidden dimension dhd_h and input dimension dxd_x:

ComponentShapeParameters
WhhW_{hh}dh×dhd_h \times d_hdh2d_h^2
WxhW_{xh}dh×dxd_h \times d_xdhdxd_h \cdot d_x
bhb_hdhd_hdhd_h
WhyW_{hy}dy×dhd_y \times d_hdydhd_y \cdot d_h
byb_ydyd_ydyd_y
Totaldh2+dhdx+dh+dydh+dyd_h^2 + d_h \cdot d_x + d_h + d_y \cdot d_h + d_y

For a typical configuration with dh=256d_h = 256, dx=100d_x = 100, dy=50d_y = 50: approximately 97,000 parameters.

Part 2 - The Vanishing Gradient Problem

Backpropagation Through Time (BPTT)

To train an RNN, we unroll it through time and apply backpropagation. The loss at time TT depends on all previous hidden states:

L=t=1TLt(yt,y^t)L = \sum_{t=1}^{T} L_t(y_t, \hat{y}_t)

The gradient of the loss with respect to the weight matrix WhhW_{hh} requires computing:

LWhh=t=1TLththtWhh\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial h_t} \frac{\partial h_t}{\partial W_{hh}}

The critical chain rule expansion is:

hthk=i=k+1thihi1=i=k+1tWhhTdiag(tanh(zi))\frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{i=k+1}^{t} W_{hh}^T \cdot \text{diag}(\tanh'(z_i))

where zi=Whhhi1+Wxhxi+bhz_i = W_{hh} h_{i-1} + W_{xh} x_i + b_h.

Why Gradients Vanish

The product i=k+1tWhhTdiag(tanh(zi))\prod_{i=k+1}^{t} W_{hh}^T \cdot \text{diag}(\tanh'(z_i)) involves multiplying the same matrix WhhW_{hh} repeatedly. Consider the eigendecomposition of WhhW_{hh}:

Whh=QΛQ1W_{hh} = Q \Lambda Q^{-1}

Then:

Whhtk=QΛtkQ1W_{hh}^{t-k} = Q \Lambda^{t-k} Q^{-1}
  • If the largest eigenvalue λmax<1|\lambda_{\max}| < 1: Λtk0\Lambda^{t-k} \to 0 exponentially as tkt - k grows. Gradients vanish.
  • If the largest eigenvalue λmax>1|\lambda_{\max}| > 1: Λtk\Lambda^{t-k} \to \infty exponentially. Gradients explode.

Additionally, tanh(z)(0,1]\tanh'(z) \in (0, 1] (it equals 1 only at z=0z = 0), so the diagonal factors further compress the gradient toward zero.

Common Trap

Do NOT say "the vanishing gradient problem means the gradients become exactly zero." They become exponentially small, making it practically impossible for the optimizer to update the weights in response to long-range dependencies. The model can still learn - it just cannot learn long-range patterns. Interviewers expect precision here.

Quantifying the Problem

For a sequence of length TT and a gap of kk time steps, the gradient magnitude scales as:

hThTkWhhkγk\left\| \frac{\partial h_T}{\partial h_{T-k}} \right\| \approx \| W_{hh} \|^k \cdot \gamma^k

where γ1\gamma \leq 1 is the maximum of tanh\tanh'. If Whhγ=0.9\|W_{hh}\| \cdot \gamma = 0.9, then after k=100k = 100 steps:

0.91002.66×1050.9^{100} \approx 2.66 \times 10^{-5}

The gradient has decayed by five orders of magnitude. The network effectively cannot learn dependencies that span 100 time steps.

Gradient Clipping (Partial Fix for Exploding Gradients)

Gradient clipping addresses the exploding gradient problem (but not vanishing):

g^={gif gθθggif g>θ\hat{g} = \begin{cases} g & \text{if } \|g\| \leq \theta \\ \theta \cdot \frac{g}{\|g\|} & \text{if } \|g\| > \theta \end{cases}

This rescales the gradient to have norm θ\theta when it exceeds the threshold. It is a standard technique used with all RNN variants, including LSTMs.

Interviewer's Perspective

When I ask about vanishing gradients, I want three things: (1) the mathematical mechanism - repeated matrix multiplication, (2) the eigenvalue analysis showing exponential decay, (3) why this specifically prevents learning long-range dependencies while short-range still works. Candidates who only say "gradients get small" without the matrix multiplication insight get moderate marks at best.

Part 3 - The LSTM Cell

The Key Innovation: Additive Cell State

The Long Short-Term Memory network (Hochreiter and Schmidhuber, 1997) solves the vanishing gradient problem by introducing a cell state ctc_t that passes through the network with primarily additive interactions rather than multiplicative ones.

LSTM Cell - Forget, Input, and Output Gates with Additive Cell State Highway

The Four Equations

Forget Gate - What to erase from cell state:

ft=σ(Wf[ht1,xt]+bf)f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)

Input Gate - What new information to write:

it=σ(Wi[ht1,xt]+bi)i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)

Candidate Cell State - What to potentially add:

c~t=tanh(Wc[ht1,xt]+bc)\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)

Cell State Update - The crucial additive update:

ct=ftct1+itc~tc_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t

Output Gate - What to expose as hidden state:

ot=σ(Wo[ht1,xt]+bo)o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)

Hidden State - Filtered cell state:

ht=ottanh(ct)h_t = o_t \odot \tanh(c_t)

Where σ\sigma is the sigmoid function and \odot is element-wise multiplication.

Why the Cell State Fixes Vanishing Gradients

The gradient of ctc_t with respect to ct1c_{t-1} is:

ctct1=ft\frac{\partial c_t}{\partial c_{t-1}} = f_t

This is just a diagonal matrix of forget gate values (values between 0 and 1). Compare this to the vanilla RNN where the gradient involves a full matrix multiplication WhhTW_{hh}^T.

For the gradient to flow from time TT back to time kk:

cTck=i=k+1Tfi\frac{\partial c_T}{\partial c_k} = \prod_{i=k+1}^{T} f_i

If the forget gate is close to 1 (which the network learns when it needs to remember), the gradient flows essentially unimpeded. This is the constant error carousel - information can be stored in the cell state indefinitely by keeping the forget gate near 1.

Interviewer's Perspective

The strongest candidates frame the LSTM as an engineering solution to a mathematical problem. The vanishing gradient arises from repeated multiplication by WhhW_{hh}. The LSTM replaces this with element-wise multiplication by ftf_t (a value in [0,1] that the network controls). When ft1f_t \approx 1, the gradient flows perfectly. When ft0f_t \approx 0, the network has actively decided to forget. This is the key insight.

Gate Intuitions

GateNotationActivationRoleAnalogy
Forgetftf_tsigmoidErase old cell info"Delete key"
Inputiti_tsigmoidSelect what to write"Write permission"
Candidatec~t\tilde{c}_ttanhGenerate new content"Draft text"
Outputoto_tsigmoidFilter cell to hidden"Read permission"

Why sigmoid for gates? Sigmoid outputs values in (0,1)(0, 1), making it a natural "soft switch." A value near 0 means "block this," near 1 means "pass this through."

Why tanh for candidate? The candidate value needs to range from 1-1 to +1+1 to allow both positive and negative updates to the cell state. If we used sigmoid (always positive), the cell state could only increase.

LSTM Parameter Count

For hidden dimension dhd_h and input dimension dxd_x, each gate has a weight matrix of shape dh×(dh+dx)d_h \times (d_h + d_x) and a bias of shape dhd_h. With 4 gates:

Parameters=4(dh(dh+dx)+dh)=4dh(dh+dx+1)\text{Parameters} = 4 \cdot (d_h \cdot (d_h + d_x) + d_h) = 4d_h(d_h + d_x + 1)

For dh=256d_h = 256, dx=100d_x = 100: approximately 366,592 parameters - roughly 4x a vanilla RNN, because there are 4 sets of weights (one per gate).

Common Trap

Do NOT confuse the cell state ctc_t with the hidden state hth_t. The cell state is the "long-term memory" that flows through the constant error carousel. The hidden state is the "working memory" that is exposed to the output and the next time step. Interviewers often ask "what is passed to the next time step?" The answer is BOTH ctc_t and hth_t.

Part 4 - GRU: The Simplified Alternative

GRU Equations

The Gated Recurrent Unit (Cho et al., 2014) simplifies the LSTM by merging the cell state and hidden state into a single state, and using only two gates:

Reset Gate:

rt=σ(Wr[ht1,xt]+br)r_t = \sigma(W_r [h_{t-1}, x_t] + b_r)

Update Gate:

zt=σ(Wz[ht1,xt]+bz)z_t = \sigma(W_z [h_{t-1}, x_t] + b_z)

Candidate Hidden State:

h~t=tanh(Wh[rtht1,xt]+bh)\tilde{h}_t = \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h)

Hidden State Update:

ht=(1zt)ht1+zth~th_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t

LSTM vs GRU Comparison

AspectLSTMGRU
States2 (ctc_t, hth_t)1 (hth_t)
Gates3 (forget, input, output)2 (reset, update)
Parameters4dh(dh+dx+1)4d_h(d_h + d_x + 1)3dh(dh+dx+1)3d_h(d_h + d_x + 1)
Relative params1.0x0.75x
Long-range dependenciesSlightly betterSlightly worse
Training speedSlowerFaster
When to preferLong sequences, complex dependenciesShorter sequences, limited compute

Key insight about the GRU update gate: The term (1zt)(1 - z_t) creates a complementary pair - what fraction of the old state to keep vs what fraction of the new candidate to use. This is equivalent to combining the LSTM forget and input gates into a single gate with the constraint ft+it=1f_t + i_t = 1. The LSTM is more flexible because ftf_t and iti_t are independent.

Company Variation

At Google and DeepMind, interviewers expect you to know both LSTM and GRU equations from memory. At startups and applied ML roles, they care more about knowing when to choose which. The standard answer: "In practice, the performance difference is usually small. GRU trains faster and has fewer parameters. LSTM has a slight edge on tasks requiring very long memory. I would default to GRU and switch to LSTM if validation metrics suggest the model needs more capacity for long-range dependencies."

Part 5 - Bidirectional RNNs

The Motivation

A standard (unidirectional) RNN processes the sequence left-to-right. The hidden state at position tt encodes information from x1,x2,,xtx_1, x_2, \ldots, x_t - only the past. But for many tasks (named entity recognition, machine translation, sentiment analysis), future context is equally important.

Architecture

A bidirectional RNN runs two separate RNNs:

  1. Forward RNN: processes x1x2xTx_1 \to x_2 \to \ldots \to x_T, producing ht\overrightarrow{h}_t
  2. Backward RNN: processes xTxT1x1x_T \to x_{T-1} \to \ldots \to x_1, producing ht\overleftarrow{h}_t

The final representation at each position is the concatenation:

ht=[ht;ht]R2dhh_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t] \in \mathbb{R}^{2d_h}

Bidirectional RNN - Forward and Backward Context Concatenated

When Bidirectional Works (and Doesn't)

TaskBidirectional?Why
Named entity recognitionYesNeed both left and right context to classify entities
Sentiment analysisYesNegation at end can flip meaning of beginning
Machine translation (encoder)YesEncoder sees full source sentence
Machine translation (decoder)NoCannot look at future tokens during generation
Language modelingNoMust predict next token using only past
Speech recognitionDependsOffline: yes. Streaming: no
Real-time autocompleteNoCannot wait for future input
Instant Rejection

If you propose a bidirectional RNN for autoregressive language modeling or real-time text generation, the interviewer will immediately push back. Bidirectional architectures require the full sequence to be available - they cannot be used for tasks where you generate tokens one at a time. This confusion signals a fundamental misunderstanding of the architecture.

Part 6 - Sequence-to-Sequence Architecture

Encoder-Decoder with RNNs

The seq2seq model (Sutskever et al., 2014) uses one RNN as an encoder and another as a decoder:

  1. Encoder reads the input sequence and compresses it into a fixed-size context vector cc (typically the final hidden state)
  2. Decoder generates the output sequence conditioned on cc
c=hTencc = h_T^{\text{enc}} htdec=RNN(ht1dec,[yt1,c])h_t^{\text{dec}} = \text{RNN}(h_{t-1}^{\text{dec}}, [y_{t-1}, c])

The Information Bottleneck

The entire input sequence must be compressed into the single vector cc. For long sequences, this is catastrophic - the encoder must squeeze hundreds of tokens into a fixed-dimensional vector.

This bottleneck motivated the attention mechanism (covered in the next page). Attention allows the decoder to look at all encoder hidden states, not just the final one.

Seq2Seq Encoder-Decoder: The Information Bottleneck Motivates Attention

Part 7 - Why Transformers Replaced RNNs

The Three Fatal Limitations

1. Sequential computation prevents parallelism.

An RNN must compute h1h_1 before h2h_2, h2h_2 before h3h_3, and so on. For a sequence of length TT, training requires O(T)O(T) sequential operations. This cannot be parallelized across GPUs. Transformers compute all positions simultaneously in O(1)O(1) sequential steps (with O(T2)O(T^2) parallel work), making them dramatically faster on modern hardware.

2. Vanishing gradients limit effective context.

Even LSTMs struggle with dependencies beyond a few hundred tokens. The constant error carousel helps, but in practice, LSTM language models see diminishing returns beyond 200-500 tokens of context. Transformers with self-attention can directly attend to any position, giving O(1)O(1) path length for any dependency.

3. No direct position-to-position interaction.

In an RNN, for information to flow from position 1 to position 100, it must pass through 99 hidden state updates. At each step, information is compressed and potentially lost. In a Transformer, position 1 can directly attend to position 100 in a single layer.

AspectRNN/LSTMTransformer
Sequential ops per layerO(T)O(T)O(1)O(1)
Max path lengthO(T)O(T)O(1)O(1)
Computation per layerO(Td2)O(T \cdot d^2)O(T2d)O(T^2 \cdot d)
ParallelizableNoYes
Effective context (practical)~200-500 tokens2K-128K+ tokens
Hardware utilization on GPUPoorExcellent
60-Second Answer

"Transformers replaced RNNs for three reasons. First, RNNs are inherently sequential - you must compute each hidden state in order, preventing GPU parallelism. Transformers compute all positions simultaneously. Second, RNNs struggle with long-range dependencies - even LSTMs have an effective context window of a few hundred tokens, while Transformers can attend directly to any position. Third, RNNs compress information through a fixed-size hidden state bottleneck, while Transformers maintain full representations at each position and let attention select what is relevant."

When RNNs Still Make Sense

Despite Transformer dominance, there are legitimate use cases for RNNs:

1. Streaming / real-time inference. RNNs naturally process one token at a time with constant memory and computation per step. Transformers require reprocessing the full context (unless using KV caching) and have quadratic memory in context length.

2. Edge devices with limited memory. An LSTM with 256 hidden units uses ~366K parameters. A small Transformer with 4 layers, 4 heads, and 256 dimensions uses ~2.6M parameters. For ultra-low-latency applications on mobile or embedded devices, RNNs can be more practical.

3. Continuous control tasks. In robotics and control systems, decisions must be made at each time step with minimal latency. RNNs process one observation at a time; Transformers must maintain and attend over a growing context window.

4. State-space models are RNNs in disguise. Mamba, S4, and other state-space models are essentially modernized RNNs with structured state transitions. They achieve Transformer-like performance with RNN-like inference efficiency. If someone asks "will RNNs come back?", the answer is "they already have, in a different form."

Company Variation

At Google Brain, DeepMind, and Meta FAIR, interviewers expect deep knowledge of the Transformer advantages AND awareness of state-space models like Mamba. At applied companies, they care more about practical tradeoffs: "When would you NOT use a Transformer?" Being able to articulate concrete scenarios (streaming, edge deployment, latency budgets) is a strong differentiator.

Part 8 - Advanced Topics

Stacked (Deep) RNNs

Multiple RNN layers can be stacked. The hidden state of layer ll at time tt is:

ht(l)=RNN(ht1(l),ht(l1))h_t^{(l)} = \text{RNN}(h_{t-1}^{(l)}, h_t^{(l-1)})

The output of one layer becomes the input to the next. In practice, 2-3 layers work well. Beyond that, training becomes very difficult due to compounded gradient issues.

Teacher Forcing

During training, the decoder receives the ground-truth token yt1y_{t-1} as input, not its own prediction y^t1\hat{y}_{t-1}. This stabilizes training but creates a train-test discrepancy called exposure bias - at inference time, the model receives its own (potentially incorrect) predictions.

Solutions include:

  • Scheduled sampling: gradually replace ground-truth with model predictions during training
  • Beam search: explore multiple hypotheses at inference time
  • Reinforcement learning fine-tuning: optimize the model for sequence-level metrics

Truncated BPTT

For very long sequences, full backpropagation through time is computationally prohibitive. Truncated BPTT limits the gradient computation to the most recent kk steps:

LtWhhj=tktLthjhjWhh\frac{\partial L_t}{\partial W_{hh}} \approx \sum_{j=t-k}^{t} \frac{\partial L_t}{\partial h_j} \frac{\partial h_j}{\partial W_{hh}}

Typical values: k=35k = 35 for language modeling (Merity et al., AWD-LSTM). This bounds computation but limits the ability to learn dependencies longer than kk steps.

Peephole Connections

A variant where the gates can also look at the cell state directly:

ft=σ(Wf[ht1,xt,ct1]+bf)f_t = \sigma(W_f [h_{t-1}, x_t, c_{t-1}] + b_f)

This gives the gates access to the "long-term memory" when making gating decisions. In practice, peephole connections provide marginal improvement and are rarely used in modern architectures.

Forget Gate Bias Initialization

A crucial practical detail: initialize the forget gate bias to 1 or 2 (Jozefowicz et al., 2015). This ensures the forget gate starts near 1 (keeping information), preventing the network from erasing everything at initialization. Without this, LSTM training can be unstable.

# PyTorch example: initializing forget gate bias to 1
import torch.nn as nn

lstm = nn.LSTM(input_size=100, hidden_size=256, num_layers=2)
for name, param in lstm.named_parameters():
if 'bias' in name:
# LSTM bias is [input_gate, forget_gate, cell_gate, output_gate]
n = param.size(0)
param.data[n // 4 : n // 2].fill_(1.0) # forget gate bias = 1

Part 9 - Common Interview Questions

Q1: "Can you train an LSTM without the forget gate?"

Yes - the original LSTM (1997) actually did not have a forget gate. It was added by Gers et al. (2000). Without the forget gate, the cell state can only accumulate - it cannot erase outdated information. This works for simple tasks but fails when the model needs to reset its memory (e.g., at sentence boundaries in language modeling).

Q2: "What happens if all gates are always 1?"

If ft=it=ot=1f_t = i_t = o_t = 1 and c~t=tanh(Wc[ht1,xt]+bc)\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c):

ct=ct1+c~tc_t = c_{t-1} + \tilde{c}_t

The cell state becomes a running sum of all candidate values. The gradient flows perfectly (no vanishing), but the cell state grows without bound. The tanh on the cell state before the output gate would saturate, and the hidden state would become nearly constant (tanh(ct)±1\tanh(c_t) \to \pm 1 for all dimensions).

Q3: "How does dropout work in RNNs?"

Standard dropout applied at each time step with different masks destroys the temporal structure. The correct approach is variational dropout (Gal and Ghahramani, 2016): use the same dropout mask at every time step within a sequence. This is equivalent to dropping the same hidden units consistently, which can be interpreted as a Bayesian approximation.

# PyTorch: variational dropout in LSTM
# Use the same dropout mask across time steps
lstm = nn.LSTM(input_size=100, hidden_size=256, dropout=0.3, num_layers=2)
# Note: PyTorch applies dropout between LSTM layers, not within

Q4: "Compare vanilla RNN, LSTM, GRU, and Transformer for a specific task."

TaskBest ChoiceWhy
Real-time speech keyword detectionGRULow latency, streaming, small model
Offline translation (2017)Bidirectional LSTM + attentionState of the art before Transformers
Offline translation (2024)TransformerBetter quality, faster training
Music generation on Raspberry PiLSTMMemory-efficient, sequential generation
Document classificationTransformer (or bidirectional LSTM)Needs full-document context
Time-series forecasting (edge)LSTM or GRULow-latency, streaming updates

Practice Problems

Problem 1: Gradient Flow Analysis

A vanilla RNN has WhhW_{hh} with eigenvalues 0.95, 0.8, 0.6, 0.3. The sequence length is 50. For a dependency between position 1 and position 50:

(a) What is the gradient magnitude through the dominant eigenvalue after 49 steps? (b) What is the gradient magnitude through the smallest eigenvalue? (c) What is the practical consequence for learning?

Hint 1 -- Direction

Compute λk\lambda^k for each eigenvalue with k=49k = 49. The gradient decomposes along eigenvectors, each scaled by its eigenvalue raised to the power of the number of steps.

Hint 2 -- Insight

Even the dominant eigenvalue of 0.95 produces significant decay over 49 steps. The smaller eigenvalues produce near-zero gradients. The network will only learn dependencies along the eigenvector corresponding to 0.95, and even that will be very weak.

Hint 3 -- Full Solution + Rubric

(a) Dominant eigenvalue gradient: 0.95490.07690.95^{49} \approx 0.0769 (about 8% of original signal).

(b) Smallest eigenvalue gradient: 0.3492.4×10260.3^{49} \approx 2.4 \times 10^{-26} (effectively zero).

(c) Practical consequence: The network can only weakly learn dependencies along the principal eigenvector direction. Information from 49 steps ago that aligns with the smallest eigenvector is completely lost. This means the vanilla RNN has a strongly direction-dependent effective memory - some "types" of information (aligned with large eigenvalues) persist longer than others. But even the best case (0.95) loses 92% of gradient signal, making optimization very slow for long-range patterns.

Note: This analysis ignores the additional tanh\tanh' factors which further reduce the gradients since tanh(z)1\tanh'(z) \leq 1.

Scoring Rubric:

  • Strong Hire: Computes both eigenvalue powers correctly, explains direction-dependent memory, notes that even the dominant eigenvalue leads to significant decay, mentions the tanh derivative factor.
  • Lean Hire: Computes the powers correctly but does not discuss directional dependence or practical implications.
  • No Hire: Cannot set up the eigenvalue computation or confuses eigenvalues with singular values.

Problem 2: LSTM Gate Design

You are designing a sentiment analysis model. The input is a movie review, and you notice the model fails on reviews like: "I thought this movie would be terrible, but it was actually one of the best films I have ever seen."

(a) Which LSTM gate is most critical for handling the sentiment reversal? (b) Draw the ideal gate activation pattern across the sentence. (c) How would a vanilla RNN fail on this example?

Hint 1 -- Direction

Think about what needs to happen at the word "but." The model has accumulated a negative sentiment. At "but," it needs to partially erase that negative sentiment and start writing positive sentiment.

Hint 2 -- Insight

The forget gate is the hero here. At "but," the forget gate should activate strongly (close to 0 for the sentiment dimensions of the cell state), erasing the accumulated negative sentiment. The input gate then writes the new positive sentiment. A vanilla RNN cannot selectively erase - the hidden state is a single compressed representation where old and new information are entangled.

Hint 3 -- Full Solution + Rubric

(a) The forget gate is most critical. When the model encounters "but," it needs to erase (or heavily downweight) the negative sentiment accumulated from "I thought this movie would be terrible."

(b) Ideal gate activation pattern (for sentiment-related dimensions of the cell state):

TokenForget GateInput GateCell State (sentiment)
"I thought"~0.95moderateneutral
"terrible"~0.95highstrongly negative
"but"~0.1 (ERASE)low~neutral (reset)
"actually"~0.95moderateslightly positive
"best films"~0.95highstrongly positive

(c) A vanilla RNN would fail because:

  1. The hidden state after "terrible" encodes negative sentiment mixed with all other information
  2. There is no mechanism to selectively erase the sentiment component while keeping other information (like the subject "movie")
  3. The word "but" would need to trigger a complex nonlinear transformation of the entire hidden state, which a single tanh(Wh+Wx+b)\tanh(Wh + Wx + b) step is unlikely to learn reliably
  4. By the time the model reaches "best films," the negative sentiment from "terrible" has been compressed and mixed through multiple hidden state updates, making it very difficult to override

Scoring Rubric:

  • Strong Hire: Identifies the forget gate, provides a convincing gate activation timeline, explains why selective erasure is impossible in a vanilla RNN, discusses the cell state dimensions conceptually.
  • Lean Hire: Identifies the forget gate as important but cannot articulate the gate activation pattern or fully explain the vanilla RNN failure.
  • No Hire: Focuses on the output gate or cannot explain the role of any specific gate in sentiment reversal.

Problem 3: Architecture Selection

You are building a real-time voice assistant that must:

  • Process audio at 16kHz (16,000 samples per second)
  • Detect keywords with less than 50ms latency
  • Run on a device with 512KB of RAM
  • Maintain accuracy above 95%

(a) Choose between vanilla RNN, LSTM, GRU, and Transformer. Justify your choice. (b) Size the model - how many parameters can you afford? (c) What training strategy would you use?

Hint 1 -- Direction

Start with the constraints. 512KB of RAM rules out most Transformer architectures. 50ms latency means you cannot wait for long context windows. 16kHz input means you need to process audio features (not raw samples) - typically mel spectrograms at ~100 frames/second.

Hint 2 -- Insight

A GRU is the sweet spot here. It is 25% smaller than an LSTM with comparable performance, naturally operates in streaming mode, and can fit within 512KB. At ~100 frames per second with 50ms latency, you have ~5 frames of lookahead. A small GRU (64-128 hidden units) with ~50K parameters fits comfortably and can achieve 95%+ accuracy on keyword detection.

Hint 3 -- Full Solution + Rubric

(a) Choice: GRU

  • Vanilla RNN: Could fit the memory budget, but vanishing gradients would hurt keyword detection accuracy when the keyword spans multiple frames. Reject.
  • LSTM: Works well but uses 33% more parameters than GRU for the same hidden size. With 512KB constraint, this limits hidden dimension. Possible but suboptimal.
  • GRU: Best tradeoff - streaming-compatible, fewer parameters than LSTM, handles short-medium range dependencies well, and keyword detection typically requires 0.5-2 seconds of context (50-200 frames).
  • Transformer: Requires storing attention matrices over context window. Even a tiny Transformer with 2 layers and 64 dimensions needs ~160K parameters plus runtime memory for attention. Could work but is overkill and wastes memory budget.

(b) Model sizing:

512KB = 524,288 bytes. With 16-bit (2 bytes) parameters: 262,144 parameters maximum. Reserve 50% for activations and buffers: ~130,000 parameters for the model.

GRU with dh=96d_h = 96, dx=40d_x = 40 (mel features):

  • Parameters: 3×96×(96+40+1)=39, ⁣4563 \times 96 \times (96 + 40 + 1) = 39,\!456 parameters
  • Memory: ~79KB for weights, leaving ~440KB for activations and buffers

This comfortably fits with room for a small output classifier.

(c) Training strategy:

  1. Train on full keyword dataset with cross-entropy loss
  2. Use data augmentation: noise injection, speed perturbation, room simulation
  3. Quantize to INT8 post-training (further reducing to ~40KB)
  4. Use knowledge distillation from a larger Transformer teacher to boost accuracy
  5. Fine-tune on device-recorded audio to handle acoustic domain shift

Scoring Rubric:

  • Strong Hire: Chooses GRU with clear justification, does the parameter budget calculation, addresses all four constraints (latency, memory, accuracy, sample rate), proposes quantization and distillation.
  • Lean Hire: Chooses GRU or LSTM with reasonable justification but does not do the sizing calculation or misses key training strategies.
  • No Hire: Chooses Transformer without addressing memory constraints, or chooses vanilla RNN without acknowledging accuracy limitations.

Problem 4: LSTM vs Transformer Debate

Your team is building a time-series anomaly detection system for IoT sensors. Data arrives at 1 sample per second, and you need to detect anomalies within 200ms. You have 4GB GPU memory available. Your colleague insists on using a Transformer because "RNNs are obsolete." Evaluate their claim.

Hint 1 -- Direction

Consider the streaming requirement. What is the inference cost of a Transformer vs LSTM for processing one new sample?

Hint 2 -- Insight

For streaming inference at 1 sample per second, an LSTM adds one step in O(dh2)O(d_h^2) time and O(dh)O(d_h) memory. A Transformer either recomputes attention over the full context (O(T2d)O(T^2 d)) or uses KV cache (O(Td)O(Td) memory). For a 1-hour context window (3600 samples), the LSTM uses constant memory while the Transformer KV cache grows linearly. The colleague is wrong - this is exactly the scenario where RNNs are competitive.

Hint 3 -- Full Solution + Rubric

Evaluation: The colleague is wrong for this specific use case.

Why the LSTM is better here:

FactorLSTMTransformer
Inference per stepO(dh2)O(d_h^2) - constantO(Td)O(T \cdot d) with KV cache, or O(T2d)O(T^2 \cdot d) without
Memory per stepO(dh)O(d_h) - constantO(Td)O(T \cdot d) - grows with context
1-hour context (T=3600)~2MB with dh=512d_h = 512~56MB KV cache per layer, ~450MB for 8 layers
LatencySub-millisecond5-50ms depending on context length
Streaming-nativeYesNo (requires KV cache management)

Why the Transformer COULD still be considered:

  1. If anomaly patterns span very long ranges (hours), the LSTM may miss them due to vanishing gradients
  2. If you can batch-process (not truly streaming), Transformers train faster
  3. Modern alternatives (Mamba, RWKV) offer Transformer-quality with RNN-like inference

Recommended approach:

Use an LSTM/GRU for online inference with a 5-minute sliding window. If long-range patterns are needed, run a Transformer offline on hourly batches to detect slow-developing anomalies. This hybrid approach uses each architecture where it is strongest.

Scoring Rubric:

  • Strong Hire: Correctly identifies that streaming inference favors RNNs, does the memory calculation, acknowledges the tradeoff (RNN misses very long-range patterns), proposes the hybrid approach, mentions state-space models as modern alternatives.
  • Lean Hire: Correctly sides with LSTM but cannot quantify the memory/compute difference or does not acknowledge the long-range limitation.
  • No Hire: Agrees with the colleague that "RNNs are obsolete" without analyzing the specific requirements, or dogmatically insists RNNs are always better.

Problem 5: Implementing LSTM from Scratch

Write the forward pass of an LSTM cell in NumPy. Given inputs x_t (shape: [batch_size, input_dim]), h_prev (shape: [batch_size, hidden_dim]), c_prev (shape: [batch_size, hidden_dim]), and weight matrices, compute h_t and c_t.

Hint 1 -- Direction

Concatenate ht1h_{t-1} and xtx_t, then compute all four gates in a single matrix multiply for efficiency. Split the result into four chunks.

Hint 2 -- Insight

The efficient implementation concatenates all four weight matrices into one large matrix and does a single matmul. This is how frameworks like PyTorch implement it internally. The four chunks correspond to input gate, forget gate, cell candidate, and output gate.

Hint 3 -- Full Solution + Rubric
import numpy as np

def sigmoid(x):
return 1 / (1 + np.exp(-np.clip(x, -15, 15)))

def lstm_cell(x_t, h_prev, c_prev, W, b):
"""
x_t: [batch_size, input_dim]
h_prev: [batch_size, hidden_dim]
c_prev: [batch_size, hidden_dim]
W: [input_dim + hidden_dim, 4 * hidden_dim] (concatenated weights)
b: [4 * hidden_dim] (concatenated biases)

Returns: h_t, c_t
"""
hidden_dim = h_prev.shape[1]

# Concatenate input and previous hidden state
combined = np.concatenate([x_t, h_prev], axis=1) # [batch, input+hidden]

# Single matrix multiply for all 4 gates
gates = combined @ W + b # [batch, 4*hidden]

# Split into 4 gates
i = sigmoid(gates[:, 0*hidden_dim : 1*hidden_dim]) # input gate
f = sigmoid(gates[:, 1*hidden_dim : 2*hidden_dim]) # forget gate
c_hat = np.tanh(gates[:, 2*hidden_dim : 3*hidden_dim]) # candidate
o = sigmoid(gates[:, 3*hidden_dim : 4*hidden_dim]) # output gate

# Cell state update (the key additive operation)
c_t = f * c_prev + i * c_hat

# Hidden state
h_t = o * np.tanh(c_t)

return h_t, c_t

Key implementation details:

  • Clipping in sigmoid prevents numerical overflow
  • Single concatenated matmul is 4x faster than separate gate computations
  • Element-wise operations (*) correspond to the \odot (Hadamard product) in the equations

Scoring Rubric:

  • Strong Hire: Correct implementation with concatenated matmul, proper gate ordering, numerical stability (sigmoid clipping), and clear comments explaining each step.
  • Lean Hire: Correct equations implemented as separate matmuls, missing numerical stability considerations.
  • No Hire: Incorrect gate equations, missing the cell state update, or confusing matrix dimensions.

Interview Cheat Sheet

ConceptKey FormulaOne-LinerRed Flag
RNN hidden stateht=tanh(Whhht1+Wxhxt+b)h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t + b)Compress history into fixed vector"RNNs have no memory"
Vanishing gradienthT/hk=WhhTdiag(tanh)\partial h_T / \partial h_k = \prod W_{hh}^T \cdot \text{diag}(\tanh')Repeated matrix multiply causes exponential decay"Vanishing = zero gradient"
LSTM cell updatect=ftct1+itc~tc_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_tAdditive update preserves gradient flow"LSTM has no vanishing gradient at all"
Forget gateft=σ(Wf[ht1,xt]+bf)f_t = \sigma(W_f[h_{t-1},x_t] + b_f)Controls what to erase"Forget gate always forgets"
LSTM gradientcT/ck=fi\partial c_T/\partial c_k = \prod f_iProduct of forget gates (not full matrix)Confusing ctc_t and hth_t
GRU updateht=(1zt)ht1+zth~th_t = (1-z_t) \odot h_{t-1} + z_t \odot \tilde{h}_tSingle gate controls keep vs replace"GRU is strictly worse than LSTM"
Bidirectionalht=[ht;ht]h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]Sees both past and futureUsing bidir for autoregressive tasks
Seq2seq bottleneckc=hTencc = h_T^{\text{enc}}All info in one vector"Bottleneck is fine for short sequences"
RNN vs TransformerSequential O(T)O(T) vs parallel O(1)O(1)RNNs cannot parallelize training"RNNs are completely obsolete"
Forget bias initbf=1b_f = 1Start by remembering everythingNot knowing this practical trick

Spaced Repetition Checkpoints

Day 0 -- Initial Learning

  • Read this entire page
  • Write the vanilla RNN equations and LSTM equations from memory
  • Derive the vanishing gradient for a 3-step vanilla RNN on paper
  • Complete the self-assessment

Day 3 -- First Recall

  • Without notes, write all four LSTM gate equations
  • Give the "60-Second Answer" for why Transformers replaced RNNs, out loud, timed
  • Draw the LSTM cell diagram from memory, labeling all gates and data flows

Day 7 -- Connections

  • Explain the relationship between LSTM cell state, forget gate, and vanishing gradients (3-minute verbal explanation)
  • Do Practice Problem 2 (sentiment reversal) without looking at hints
  • Compare LSTM and GRU from memory - equations, parameter count, and when to prefer each

Day 14 -- Application

  • Do Practice Problem 3 (architecture selection) under timed conditions (10 minutes)
  • Implement an LSTM cell from scratch in NumPy without looking at reference code
  • Explain to an imaginary interviewer when you would choose an RNN over a Transformer

Day 21 -- Mock Interview

  • Have someone ask: "Walk me through an LSTM cell, derive the gradient flow, and explain why Transformers replaced RNNs"
  • Time yourself: LSTM walkthrough in under 5 minutes, gradient derivation in under 5 minutes, Transformer comparison in under 3 minutes
  • Do all 5 practice problems in sequence under timed conditions (50 minutes total)

Key Takeaways

  1. Vanilla RNNs have a beautiful simplicity but a fatal flaw. The shared weight matrix creates parameter efficiency but also repeated matrix multiplication in the gradient, causing exponential vanishing or explosion. Understanding this math is prerequisite to understanding everything that followed.

  2. LSTMs solve vanishing gradients with an additive cell state. The constant error carousel - controlled by the forget gate - allows gradients to flow through time via element-wise multiplication instead of full matrix multiplication. This is an engineering solution to a mathematical problem.

  3. GRUs are LSTMs with fewer parameters and comparable performance. Use GRU as your default recurrent architecture unless you have evidence that the task requires the extra capacity of LSTM's independent forget and input gates.

  4. Transformers replaced RNNs for good reasons, but RNNs are not dead. Streaming inference, edge deployment, and real-time control remain legitimate RNN territory. State-space models (Mamba, S4) are the modern evolution of recurrent ideas, combining RNN-like inference with Transformer-like training and quality.

© 2026 EngineersOfAI. All rights reserved.