How encoder-decoder networks with attention solve variable-length sequence-to-sequence problems - from machine translation to summarization and code generation.

Seq2Seq and Encoder-Decoder Architectures

The Moment the System Broke

It was 3:17 AM when the on-call engineer at a major e-commerce company got the alert. Their product description translation system - the one that localized listings from English to Japanese, German, and Portuguese for 14 million international customers - had started producing garbled output. Not crashes. Not 500 errors. Just quietly broken translations that no monitoring metric had flagged because the service was technically returning text.

The translations were fine for short product titles: "Blue Cotton T-Shirt" came back as sensible Japanese. But longer descriptions - the 200-word paragraphs explaining fabric composition, washing instructions, and size guides - were drifting into nonsense by the third sentence. The beginning of each description translated accurately. Then somewhere in the middle, the network seemed to forget what it was translating and began hallucinating product details from completely different items. A wool sweater description ended up with care instructions for electronics. A hiking boot listing included references to "soft, breathable linen."

The root cause was the architecture itself. The company's translation model was a simple RNN encoder-decoder from 2016 - built before attention mechanisms were standard practice. The encoder read an entire input sentence and compressed it into a single fixed-size vector: 512 floating-point numbers that were supposed to hold everything the decoder needed to produce the output. For short sentences, 512 dimensions was enough. For 200-word product descriptions, it was like trying to photograph a landscape through a keyhole. By the time the encoder finished reading the 200th word, the first 100 words had been overwritten in the hidden state.

The engineer's fix was not a patch. It was a rewrite to a seq2seq architecture with Bahdanau attention - where the decoder, at each output step, could look back at every encoder hidden state and decide which input words were most relevant to what it was currently generating. The new model remembered the beginning of a sentence while generating the end of it. Translation quality on long texts improved by 34% in human evaluation scores. The 3 AM alerts stopped.

That incident illustrates the exact problem that Sutskever, Vinyals, and Le solved in their landmark 2014 paper, and why attention - introduced by Bahdanau et al. in 2015 - became one of the most important ideas in the history of deep learning. Understanding these architectures is not optional for any ML engineer working with text, speech, code, or any domain where inputs and outputs are sequences of different lengths.

Why This Exists

The Problem With Fixed-Size Networks

Standard feedforward networks assume that the input and output have fixed, predetermined sizes, and basic RNN setups only partly relax this. A classifier takes an image of 224x224 pixels and outputs a probability over 1000 classes. A sentiment analyzer reads a variable-length sequence of tokens but still outputs a single label, and a sequence tagger emits exactly one output per input token. In each case the output shape is fixed, or tied to the input length, at training time.

Machine translation violates this assumption completely. "Hello" maps to "Hola" (1 word to 1 word). "The cat sat on the mat" maps to "Le chat était assis sur le tapis" (6 words to 7 words). "I would have liked to have been informed in advance" might map to a German construction that restructures the entire sentence differently. There is no fixed input length, no fixed output length, and no fixed alignment between them.

You could try padding all inputs to the maximum sequence length, but this wastes computation and - critically - forces the network to produce output that's always the maximum length too. You could try truncating inputs, but then you lose information. Neither approach scales to real-world text where lengths vary from 3 to 300 words.

The deeper problem is alignment. In translation, word 5 in the output might depend primarily on words 2 and 7 in the input. In summarization, the first output sentence might draw from scattered sentences across the input document. Standard networks process position-by-position and cannot model these long-range, crossed dependencies without architectural help.

What Seq2Seq Changes

Sequence-to-sequence learning, introduced by Sutskever et al. (2014), separates the problem into two distinct phases: encoding and decoding. The encoder reads the entire input sequence and produces a summary. The decoder uses that summary to generate the output sequence, one token at a time, conditioned on what it has already generated.

This separation is powerful because encoding and decoding can have different lengths. The encoder reads N input tokens. The decoder generates M output tokens. N and M do not need to match. The bridge between them is the context - some representation of what the encoder understood. What that context looks like, and how the decoder uses it, is where the architecture evolved enormously between 2014 and 2017.

Historical Context

2014: The Paper That Changed Sequence Modeling

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le published "Sequence to Sequence Learning with Neural Networks" at NeurIPS 2014. The idea was simple: use one LSTM to encode an input sentence into a fixed-size vector, then use a second LSTM to decode that vector into an output sentence.

One of their key tricks was reversing the input sentence before feeding it to the encoder. If the input is "The cat sat" and you reverse it to "sat cat The," the first word the decoder needs to produce ("Le" for "The") is now close in the encoder's computation to the last input word it processed ("The"). This shortened many of the long-range dependencies that LSTMs struggle with: the paper reports that reversal alone raised the BLEU score of a single LSTM on WMT'14 English-to-French from 25.9 to 30.6. An ensemble of five reversed LSTMs with beam search reached 34.8, ahead of the 33.3 scored by the phrase-based statistical MT baseline, a system that required far more hand-engineering.

The paper demonstrated something remarkable: a single neural architecture, trained end-to-end on input-output pairs, could learn to translate without any explicit linguistic knowledge. No phrase tables, no alignment models, no language models built separately. Just encoder, context vector, decoder.

2015: Bahdanau Attention - The Missing Piece

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio identified the weak point in a paper submitted in September 2014 and published at ICLR 2015: the fixed-size context vector. They argued that this vector is a bottleneck - no matter how long the input sequence, the encoder had to compress everything into one vector of fixed size. Information was inevitably lost.

Their solution was attention. Instead of giving the decoder a single fixed vector, they gave it access to all the encoder hidden states simultaneously. At each decoding step, the decoder could compute a weighted sum over all encoder states - putting high weight on the states most relevant to the current output token and low weight on everything else.

This mechanism was called additive attention or Bahdanau attention. It required learning a small alignment model - a feedforward network that scored how well each decoder state matched each encoder state. The scores were normalized into a probability distribution via softmax, and the weighted sum of encoder states (the "context vector") was recomputed at every decoder step.

The result: BLEU scores on English-to-French improved further, and qualitative analysis showed the model was learning linguistically meaningful alignments - French verbs aligned to English verbs, adjectives to adjectives, even across languages with different word orders.

2015: Luong Attention - The Faster Alternative

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning (2015) proposed simpler scoring functions for attention. Instead of a learned feedforward network to compute alignment scores, their "dot" and "general" variants use dot products between encoder and decoder states (multiplicative attention). This was computationally cheaper and achieved comparable results in their experiments.

Luong attention also introduced the concept of local attention - instead of attending over all encoder positions, attend over a window around a predicted alignment position. This bounded the computation for long sequences.
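A rough sketch of the windowing idea, under simplifying assumptions: the window center is passed in directly (the paper predicts it or takes it from the decoder position), scores are plain dot products, and the Gaussian reweighting used in the predictive variant is left out. The function name `local_attention` and the parameter `D` (half-width of the window) are illustrative choices.

```python
import numpy as np

def local_attention(s_t, H, center, D):
    """Attend only over encoder positions in [center - D, center + D]."""
    lo = max(0, center - D)
    hi = min(len(H), center + D + 1)
    scores = H[lo:hi] @ s_t                 # dot-product scores inside the window
    scores = scores - scores.max()          # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    alpha = np.zeros(len(H))                # zero weight outside the window
    alpha[lo:hi] = weights
    context = alpha @ H                     # (hidden_dim,)
    return alpha, context

rng = np.random.default_rng(0)
H = rng.normal(size=(10, 4))                # 10 encoder states, hidden_dim = 4
s_t = rng.normal(size=4)                    # stand-in for the decoder state
alpha, context = local_attention(s_t, H, center=6, D=2)
print(np.nonzero(alpha)[0])                 # positions 4..8 carry all the weight
```

The cost per decoder step depends on the window size (2D + 1 positions) rather than on the source length.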

2017: Attention Is All You Need

The natural endpoint of this trajectory was eliminating the RNN entirely. Vaswani et al. (2017) showed that if attention is the mechanism doing the heavy lifting of relating positions in a sequence, you do not need the sequential processing of an RNN at all. Replace it with self-attention and you get the Transformer - parallelizable, faster to train, and ultimately more powerful.

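But the original Transformer is still an encoder-decoder model built on the 2014 insight, with multi-head attention replacing the simple Bahdanau mechanism. Later models keep the attention machinery even when they keep only half of the architecture: BERT is encoder-only, GPT is decoder-only, and T5 is a full encoder-decoder. Understanding seq2seq with attention is understanding the conceptual foundation of these models and of the large language models in production today.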

Core Concept: The Encoder-Decoder Architecture

Encoding: Building a Memory of the Input

The encoder is an RNN (or LSTM or GRU) that reads the input sequence token by token. At each step $t$, it takes the token embedding $x_t$ and the previous hidden state $h_{t-1}$ and produces a new hidden state $h_t$.

After reading all N input tokens, the encoder has produced N hidden states: $[h_1, h_2, \ldots, h_N]$. Each $h_t$ summarizes the input up to position $t$. In a well-trained LSTM, the cell state carries information forward across many time steps, so $h_t$ can still reflect tokens from much earlier in the sentence. A unidirectional encoder, however, knows nothing about tokens after position $t$, which is why Bahdanau et al. used a bidirectional encoder whose states combine left and right context.

In the original Sutskever (2014) architecture, only the final hidden state $h_N$ was passed to the decoder. Everything the encoder learned had to fit into that single vector.

In attention-based architectures, all hidden states $[h_1, h_2, \ldots, h_N]$ are kept and made available to the decoder. The encoder now hands over one vector per input position rather than a single summary, and the decoder builds its context vector from them.
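To make the difference between "final state only" and "all states" concrete, here is a minimal NumPy sketch of an encoder. It assumes a plain tanh RNN cell instead of an LSTM or GRU, and the weight names `W_x`, `W_h`, `b` are illustrative.

```python
import numpy as np

def rnn_encoder(embeddings, W_x, W_h, b):
    """Run a plain tanh RNN over the input and keep every hidden state."""
    h = np.zeros(W_h.shape[0])                  # h_0
    states = []
    for x_t in embeddings:                      # one token embedding per step
        h = np.tanh(W_x @ x_t + W_h @ h + b)    # h_t from x_t and h_{t-1}
        states.append(h)
    return np.stack(states)                     # (N, hidden_dim)

rng = np.random.default_rng(0)
N, embed_dim, hidden_dim = 6, 4, 8
embeddings = rng.normal(size=(N, embed_dim))
W_x = rng.normal(size=(hidden_dim, embed_dim)) * 0.5
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.5
b = np.zeros(hidden_dim)

all_states = rnn_encoder(embeddings, W_x, W_h, b)
final_state = all_states[-1]        # all that a decoder without attention receives
print(all_states.shape)             # (6, 8): what an attention decoder can look at
print(final_state.shape)            # (8,)

# A unidirectional encoder is causal: changing the last token
# leaves every earlier hidden state untouched.
changed = embeddings.copy()
changed[-1] += 1.0
states_changed = rnn_encoder(changed, W_x, W_h, b)
print(np.allclose(all_states[:-1], states_changed[:-1]))   # True
```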

Decoding: Generating Output One Step at a Time

The decoder is a separate RNN. It generates the output sequence autoregressively - producing one token at a time, where each generated token becomes the input to the next decoder step.

The decoder starts with an initial state, typically derived from the encoder's final hidden state. At each step $t$, it takes:

The previously generated token (embedded)
Its own previous hidden state
A context vector from the encoder (which with attention changes at every step)

It combines these to predict the next output token via a linear projection and $\text{softmax}$ over the vocabulary.

The key property: the output sequence can be any length. The decoder keeps generating until it produces a special end-of-sequence token <EOS>. The encoder has already finished by this point - decoding is sequential, but encoding is done once.
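The generation loop can be sketched independently of any particular network. In the snippet below, `step_fn` is a hypothetical stand-in for one decoder step (it takes the previous token and state, and returns vocabulary scores and a new state); `toy_step` just replays a scripted sequence so the loop can run without a trained model.

```python
def greedy_decode(step_fn, init_state, sos_id, eos_id, max_len=20):
    """Generate tokens one at a time until <EOS> or max_len is reached."""
    token, state = sos_id, init_state
    output = []
    for _ in range(max_len):
        scores, state = step_fn(token, state)                    # scores over the vocabulary
        token = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if token == eos_id:
            break
        output.append(token)                                     # fed back in at the next step
    return output

# Hypothetical decoder step: ignores the input token and emits a scripted
# sequence, using the state as a position counter.
SCRIPT = [3, 4, 10, 2]          # three ordinary tokens, then <EOS> (id 2)

def toy_step(token, state):
    scores = [0.0] * 11
    scores[SCRIPT[min(state, len(SCRIPT) - 1)]] = 1.0
    return scores, state + 1

print(greedy_decode(toy_step, init_state=0, sos_id=1, eos_id=2))   # [3, 4, 10]
```

The `max_len` cap matters in practice: a model that never emits the end token would otherwise loop forever.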

The Context Vector Bottleneck Problem

Without attention, the context vector is just $h_N$ - the encoder's final hidden state. The decoder receives it once, at initialization, and then the connection to the input is severed. For short sentences, this works reasonably well. For longer sentences (roughly 20 to 30 tokens and beyond in the early experiments), performance degrades significantly.

The fundamental issue is that RNNs process information sequentially. Hidden state $h_N$ has seen all the input, but the signal from early tokens has passed through many nonlinear transformations to get there. In practice, the final hidden state is much more influenced by the last few tokens than the first few - the recency bias that LSTMs partially but not completely address.

Bahdanau et al. (2015) showed this explicitly by plotting BLEU against sentence length: without attention, translation quality dropped steadily as sentences grew beyond roughly 20 words. With attention, quality remained high even at 50-plus words because the decoder could always look back at the relevant encoder states directly.

The Attention Mechanism

The Core Idea Before the Math

Imagine you're translating "The large black cat sat on the mat" into French, and you're currently generating the word for "cat." Common sense says you should be paying the most attention to the word "cat" in the source sentence, somewhat less attention to the adjectives "large" and "black" (whose French forms must agree with the noun in gender and number), and almost no attention to "sat" or "mat" at this moment.

Attention is a mechanism for implementing this intuition computationally. At each decoder step, the model:

Computes a score between the current decoder state and every encoder state (how relevant is each input position to what we're currently generating?)
Normalizes these scores into a probability distribution via softmax
Takes a weighted sum of all encoder states using these probabilities as weights
Uses this weighted sum - the context vector - to inform the next output token

The context vector is different at every decoder step. When generating the French word for "cat," it emphasizes encoder states near "cat." When generating the word for "sat," it shifts to emphasize encoder states near "sat." The model learns which encoder states to emphasize through backpropagation.

Bahdanau (Additive) Attention

Bahdanau et al. compute alignment scores using a small feedforward network. Let $s_{t-1}$ be the decoder hidden state at step $t-1$, and $h_i$ be the encoder hidden state at position $i$.

The alignment score is:

$e_{t,i} = v_a^\top \tanh(W_a s_{t-1} + U_a h_i)$

Where $W_a$, $U_a$, and $v_a$ are learned parameters. The score $e_{t,i}$ measures how well the decoder state at step $t-1$ "matches" the encoder state at position $i$.

These scores are then normalized:

$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T_x} \exp(e_{t,j})}$

The context vector is the weighted sum:

$c_t = \sum_{i=1}^{T_x} \alpha_{t,i} h_i$

The $\alpha_{t,i}$ values form an attention weight matrix - if you visualize it with decoder positions on one axis and encoder positions on the other, you can literally see what the model is paying attention to when generating each output word.

The name "additive attention" comes from the fact that $W_a s_{t-1}$ and $U_a h_i$ are added inside the $\tanh$. This makes it possible to precompute $U_a h_i$ for all encoder positions once, rather than recomputing it for every decoder step - an important efficiency optimization.
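The precomputation can be checked numerically. The sketch below uses random stand-ins for the learned parameters and for the decoder states, and compares a naive per-position score against a version that caches $U_a h_i$; the names `scores_naive` and `scores_cached` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden_dim, attn_dim, n_steps = 5, 8, 6, 3

H = rng.normal(size=(seq_len, hidden_dim))                # encoder states h_1..h_N
W_a = rng.normal(size=(attn_dim, hidden_dim))
U_a = rng.normal(size=(attn_dim, hidden_dim))
v_a = rng.normal(size=attn_dim)
decoder_states = rng.normal(size=(n_steps, hidden_dim))   # stand-ins for s_{t-1}

def scores_naive(s_prev):
    """e_{t,i} computed from scratch, recomputing U_a h_i at every step."""
    return np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_i) for h_i in H])

UH = H @ U_a.T      # U_a h_i for all i, computed once per source sentence

def scores_cached(s_prev):
    """Same scores, reusing the cached encoder projections."""
    return np.tanh(W_a @ s_prev + UH) @ v_a

for s_prev in decoder_states:
    assert np.allclose(scores_naive(s_prev), scores_cached(s_prev))
print("cached and naive scores agree for every decoder step")
```

Only the decoder projection $W_a s_{t-1}$ has to be recomputed at each step; the encoder side is shared across the whole output sequence.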

Luong (Multiplicative) Attention

Luong et al. (2015) proposed simpler alignment functions. The most common is the dot product:

$\text{score}(s_t, h_i) = s_t^\top h_i$

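This is just a dot product between the current decoder state and each encoder state. No learned parameters, no feedforward network. Just inner products, which grow when the two vectors point in similar directions and when their norms are large (unlike cosine similarity, the score is not normalized by vector length).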

For this to work, the encoder and decoder must have the same hidden dimension. If they differ, you can use the "general" form with a weight matrix $W_a$:

$\text{score}(s_t, h_i) = s_t^\top W_a h_i$
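A small sketch of the "general" score, assuming a decoder dimension of 4 and an encoder dimension of 6 purely for illustration. It also checks that with equal dimensions and an identity $W_a$ the general form reduces to the plain dot product.

```python
import numpy as np

def luong_general_scores(s_t, H, W_a):
    """score(s_t, h_i) = s_t^T W_a h_i for every encoder position i."""
    return H @ (W_a.T @ s_t)        # (seq_len,)

rng = np.random.default_rng(0)
dec_dim, enc_dim, seq_len = 4, 6, 5
s_t = rng.normal(size=dec_dim)
H = rng.normal(size=(seq_len, enc_dim))
W_a = rng.normal(size=(dec_dim, enc_dim))     # bridges the two dimensions

scores = luong_general_scores(s_t, H, W_a)
loop_scores = np.array([s_t @ W_a @ h_i for h_i in H])
print(np.allclose(scores, loop_scores))       # True

# With equal dimensions and W_a = I, "general" is exactly the dot score.
H_same = rng.normal(size=(seq_len, dec_dim))
identity_scores = luong_general_scores(s_t, H_same, np.eye(dec_dim))
print(np.allclose(identity_scores, H_same @ s_t))   # True
```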

Luong attention also differs from Bahdanau in timing. Bahdanau computes the context vector using $s_{t-1}$ (the state before the current step) and incorporates it into computing $s_t$. Luong computes it using $s_t$ (the state after reading the input at step $t$) and combines the two into an attentional vector $\tilde{h}_t$ that produces the output at step $t$. Luong et al. also describe "input feeding": $\tilde{h}_t$ is concatenated with the next input embedding, so the decoder is informed about earlier alignment decisions.

Which to use? In practice, both work well. Bahdanau is the original and theoretically cleaner. Luong's dot-product variant is faster. For most modern applications, you would use multi-head self-attention from the Transformer anyway - but understanding these two variants is essential for interviews and for understanding older production systems you will inherit.

Alignment Scores in Practice

When you visualize the attention weight matrix $\alpha_{t,i}$ for a translation model, simple English-French sentences show a roughly diagonal pattern, because the two languages have similar word order and the model has learned that alignment. For language pairs whose word order differs more, such as English and Japanese, the matrix shows more complex patterns with crossings.

This interpretability is one of the most valuable properties of attention. You can look at what the model is attending to and diagnose failure modes. If a translation error occurs, you can check the attention weights and see whether the model was looking at the right source position when it generated the wrong token.
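As a sketch of that kind of check, the snippet below reports, for each generated token, the source token that received the most weight. The attention matrix here is made up for illustration, not taken from a trained model, and `strongest_alignment` is a hypothetical helper name.

```python
import numpy as np

def strongest_alignment(attn, src_tokens, tgt_tokens):
    """For each output token, return the most-attended source token and its weight."""
    return [
        (tgt, src_tokens[int(np.argmax(row))], float(row.max()))
        for tgt, row in zip(tgt_tokens, attn)
    ]

src = ["the", "cat", "sat"]
tgt = ["le", "chat", "assis"]

# Hypothetical weights: one row per generated token, each row sums to 1.
attn = np.array([
    [0.80, 0.15, 0.05],
    [0.10, 0.85, 0.05],
    [0.05, 0.15, 0.80],
])

alignment = strongest_alignment(attn, src, tgt)
for tgt_word, src_word, weight in alignment:
    print(f"{tgt_word:>6} <- {src_word:<4} (weight {weight:.2f})")
```

If a wrong output token lines up with the right source token, the error is more likely in the decoder or vocabulary projection than in the alignment; if it lines up with the wrong source token, the attention itself is the place to look.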

Teacher Forcing

The Training-Time Problem

When you train a seq2seq model from scratch, the decoder makes mistakes - especially early in training. If the decoder predicts the wrong token at step 3, then step 4's input is wrong, step 5's input is wrong, and by step 10 you're generating in a completely wrong context. Errors compound from step to step. The learning signal becomes weak and noisy, because most of the loss measures prediction quality given inputs that a trained model would rarely produce at inference time.

Teacher forcing solves this by deliberate cheating during training. Instead of feeding the decoder's own predictions back as inputs, you feed the ground-truth tokens. The decoder at step 4 gets the actual correct token from step 3, not whatever it predicted. This breaks the error cascade and makes training much more stable - gradients flow through meaningful states.

The term goes back to Williams and Zipser's 1989 work on training recurrent networks. The intuition is that of a teacher who supplies the correct answer at every step, so the student never practices on top of its own mistakes.

With teacher forcing, training converges much faster. Loss decreases smoothly instead of bouncing. The model can focus on learning the output distribution given correct context, rather than learning to recover from its own errors.
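In code, teacher forcing amounts to shifting the gold sequence by one position. A minimal sketch, with token ids chosen arbitrarily for illustration:

```python
def teacher_forcing_pairs(target_ids):
    """Decoder input at step t is gold token t-1; the label is gold token t."""
    return target_ids[:-1], target_ids[1:]

SOS, EOS = 1, 2
gold = [SOS, 3, 4, 10, EOS]

decoder_inputs, labels = teacher_forcing_pairs(gold)
print(decoder_inputs)   # [1, 3, 4, 10]
print(labels)           # [3, 4, 10, 2]
```

Every decoder input comes from the reference translation, so a wrong prediction at one step never contaminates the input to the next.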

Exposure Bias: The Dark Side of Teacher Forcing

There is a catch. Teacher forcing creates a train-test mismatch. During training, the decoder sees ground-truth tokens. During inference, it sees its own predictions. If the model has never been trained to recover from its own errors, it has not learned to handle the distributional shift.

This gap is called exposure bias, and it compounds at inference time. A small error at step 5 leads to slightly wrong input at step 6, which leads to a slightly less accurate prediction at step 6, which compounds at step 7, and so on. The model has been trained in a world where every input was perfect - it has no defense against the imperfect world of its own outputs.

Several strategies address exposure bias:

Scheduled Sampling (Bengio et al., 2015): Mix teacher forcing and free-running during training. Start with 100% teacher forcing and gradually reduce the probability of using ground-truth tokens, replacing them with model predictions. By the end of training, you're using model predictions a significant fraction of the time, so the model learns to handle imperfect inputs.

REINFORCE / policy gradient methods: Treat decoding as a reinforcement learning problem where the reward is the BLEU score of the complete output. This optimizes the metric you actually care about and exposes the model to its own distribution during training. But this is notoriously unstable and requires careful tuning.

Minimum Risk Training: Compute the expected loss over sampled output sequences rather than a single greedy sequence. More tractable than full RL but still computationally expensive.

For most practical applications, plain teacher forcing, sometimes combined with scheduled sampling, is the default approach. The exposure bias problem is real but often manageable, especially when combined with beam search at inference time.
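Scheduled sampling needs a schedule for the teacher forcing probability. The sketch below shows two options: a linear decay with a floor, and the inverse sigmoid decay $k / (k + \exp(i/k))$ from the scheduled sampling paper. The constants (`floor=0.25`, `k=100`) and the helper `choose_next_input` are illustrative assumptions, not recommended settings.

```python
import math
import random

def linear_decay(step, total_steps, floor=0.25):
    """Teacher forcing probability falls linearly from 1.0 down to `floor`."""
    return max(floor, 1.0 - step / total_steps)

def inverse_sigmoid_decay(step, k=100.0):
    """Stays near 1 early, then drops; larger k delays the drop.

    Note: math.exp overflows once step / k exceeds roughly 709.
    """
    return k / (k + math.exp(step / k))

def choose_next_input(gold_token, predicted_token, tf_ratio, rng=random):
    """Feed the gold token with probability tf_ratio, else the model's own output."""
    return gold_token if rng.random() < tf_ratio else predicted_token

for step in (0, 250, 500, 1000):
    print(step,
          round(linear_decay(step, total_steps=1000), 3),
          round(inverse_sigmoid_decay(step), 3))
```

The returned probability can be passed as the teacher forcing ratio of a training step, so early epochs see mostly gold inputs and later epochs see more of the model's own predictions.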

NumPy From Scratch: Attention Score Calculation

The following implementation shows exactly how attention weights are computed - no black boxes.

import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    x_shifted = x - np.max(x)
    exp_x = np.exp(x_shifted)
    return exp_x / np.sum(exp_x)

def bahdanau_attention(decoder_state, encoder_states, W_a, U_a, v_a):
    """
    Compute Bahdanau (additive) attention weights.

    Args:
        decoder_state:  shape (hidden_dim,)       -- s_{t-1}
        encoder_states: shape (seq_len, hidden_dim) -- [h_1, ..., h_N]
        W_a:            shape (attn_dim, hidden_dim) -- weight for decoder state
        U_a:            shape (attn_dim, hidden_dim) -- weight for encoder states
        v_a:            shape (attn_dim,)            -- projection vector

    Returns:
        alpha:   shape (seq_len,)    -- attention weights (sum to 1)
        context: shape (hidden_dim,) -- weighted sum of encoder states
    """
    seq_len = encoder_states.shape[0]

    # Project decoder state: (attn_dim,)
    projected_decoder = W_a @ decoder_state  # (attn_dim,)

    # Project each encoder state: (seq_len, attn_dim)
    # This can be precomputed once per source sentence -- optimization!
    projected_encoder = encoder_states @ U_a.T  # (seq_len, attn_dim)

    # Compute alignment scores: e_i = v_a^T * tanh(projected_decoder + projected_encoder[i])
    # Broadcast decoder projection across all encoder positions
    combined = np.tanh(projected_decoder + projected_encoder)  # (seq_len, attn_dim)
    scores = combined @ v_a  # (seq_len,) -- one score per encoder position

    # Normalize to probabilities
    alpha = softmax(scores)  # (seq_len,)

    # Weighted sum of encoder states
    context = alpha @ encoder_states  # (hidden_dim,)

    return alpha, context


def luong_dot_attention(decoder_state, encoder_states):
    """
    Compute Luong dot-product attention weights.

    Args:
        decoder_state:  shape (hidden_dim,)
        encoder_states: shape (seq_len, hidden_dim)

    Returns:
        alpha:   shape (seq_len,)    -- attention weights
        context: shape (hidden_dim,) -- weighted sum of encoder states
    """
    # Dot product between decoder state and each encoder state
    scores = encoder_states @ decoder_state  # (seq_len,)
    alpha = softmax(scores)
    context = alpha @ encoder_states  # (hidden_dim,)
    return alpha, context


# ---- Demonstration ----

np.random.seed(42)

hidden_dim = 8
attn_dim = 6
seq_len = 5  # 5-word source sentence

# Simulate encoder outputs for a 5-word sentence
encoder_states = np.random.randn(seq_len, hidden_dim)

# Simulate decoder state at current step
decoder_state = np.random.randn(hidden_dim)

# Random attention parameters (learned during training)
W_a = np.random.randn(attn_dim, hidden_dim) * 0.1
U_a = np.random.randn(attn_dim, hidden_dim) * 0.1
v_a = np.random.randn(attn_dim) * 0.1

# Bahdanau attention
alpha_bad, context_bad = bahdanau_attention(
    decoder_state, encoder_states, W_a, U_a, v_a
)
print("=== Bahdanau Attention ===")
print(f"Attention weights: {alpha_bad.round(4)}")
print(f"Sum of weights:    {alpha_bad.sum():.6f}  (must be 1.0)")
print(f"Context shape:     {context_bad.shape}")
print(f"Most attended position: {np.argmax(alpha_bad)} (0-indexed)")

# Luong attention
alpha_luong, context_luong = luong_dot_attention(decoder_state, encoder_states)
print("\n=== Luong Dot-Product Attention ===")
print(f"Attention weights: {alpha_luong.round(4)}")
print(f"Sum of weights:    {alpha_luong.sum():.6f}  (must be 1.0)")
print(f"Most attended position: {np.argmax(alpha_luong)} (0-indexed)")

# Verify the context vector is a convex combination of encoder states
print("\n=== Sanity Check ===")
# Manual weighted sum
manual_context = sum(alpha_bad[i] * encoder_states[i] for i in range(seq_len))
print(f"Context matches manual computation: {np.allclose(context_bad, manual_context)}")

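This produces output in the following form. The numbers are illustrative rather than an exact reproduction, and the unscaled dot-product weights in particular usually come out more peaked than the ones shown: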

=== Bahdanau Attention ===
Attention weights: [0.1823 0.2047 0.1954 0.2181 0.1996]
Sum of weights:    1.000000  (must be 1.0)
Context shape:     (8,)
Most attended position: 3 (0-indexed)

=== Luong Dot-Product Attention ===
Attention weights: [0.2134 0.1872 0.2311 0.1489 0.2194]
Sum of weights:    1.000000  (must be 1.0)
Most attended position: 2 (0-indexed)

=== Sanity Check ===
Context matches manual computation: True

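With small random parameters, the Bahdanau scores are all close to zero, so those weights are nearly uniform. Unscaled dot products of random 8-dimensional states have a larger spread, so the Luong weights can look peaked even though the peaks are meaningless at this point. After training, the model learns to concentrate weight on the relevant positions - the weights become peaked where it matters rather than flat or arbitrary.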
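How flat or peaked the weights are depends heavily on the magnitude of the scores going into the softmax. The sketch below compares raw dot-product scores with the same scores divided by $\sqrt{d}$. That scaling is the Transformer's convention and is used here only as an assumption for illustration; it is not part of the Bahdanau or Luong formulations.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return float(-(p * np.log(p + 1e-12)).sum())

rng = np.random.default_rng(0)
hidden_dim, seq_len = 64, 5
H = rng.normal(size=(seq_len, hidden_dim))     # random stand-in encoder states
s = rng.normal(size=hidden_dim)                # random stand-in decoder state

raw = H @ s                                    # spread grows with hidden_dim
scaled = raw / np.sqrt(hidden_dim)             # spread roughly independent of hidden_dim

alpha_raw, alpha_scaled = softmax(raw), softmax(scaled)
print("raw    max weight:", alpha_raw.max().round(3), "entropy:", round(entropy(alpha_raw), 3))
print("scaled max weight:", alpha_scaled.max().round(3), "entropy:", round(entropy(alpha_scaled), 3))
print("uniform entropy:  ", round(float(np.log(seq_len)), 3))
```

Dividing the scores by a constant greater than one always moves the softmax output toward uniform, which is why large hidden dimensions with unscaled dot products tend to give near one-hot weights from the first step of training.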

PyTorch Implementation: Seq2Seq With Attention

This implements a complete seq2seq model with Bahdanau attention for a toy translation task. Every component is explicit and annotated.

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import random
import numpy as np

torch.manual_seed(42)
random.seed(42)

# ---- Vocabulary (tiny toy example) ----
SRC_VOCAB = {
    "<pad>": 0, "<sos>": 1, "<eos>": 2,
    "the": 3, "cat": 4, "sat": 5, "dog": 6, "ran": 7, "mat": 8, "on": 9
}
TGT_VOCAB = {
    "<pad>": 0, "<sos>": 1, "<eos>": 2,
    "le": 3, "chat": 4, "a": 5, "chien": 6, "couru": 7,
    "tapis": 8, "sur": 9, "assis": 10
}
INV_TGT = {v: k for k, v in TGT_VOCAB.items()}
SRC_SIZE = len(SRC_VOCAB)
TGT_SIZE = len(TGT_VOCAB)


# ---- Encoder ----
class Encoder(nn.Module):
    def __init__(self, src_vocab_size, embed_dim, hidden_dim, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(src_vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        """
        Args:
            src: (batch, src_len) -- token indices
        Returns:
            encoder_outputs: (batch, src_len, hidden_dim) -- all hidden states
            hidden: (1, batch, hidden_dim) -- final hidden state
        """
        embedded = self.dropout(self.embedding(src))  # (batch, src_len, embed_dim)
        encoder_outputs, hidden = self.rnn(embedded)  # (batch, src_len, hidden_dim)
        return encoder_outputs, hidden


# ---- Bahdanau Attention ----
class BahdanauAttention(nn.Module):
    def __init__(self, hidden_dim, attn_dim):
        super().__init__()
        self.encoder_proj = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.decoder_proj = nn.Linear(hidden_dim, attn_dim, bias=False)
        self.score_proj = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, decoder_hidden, encoder_outputs):
        """
        Args:
            decoder_hidden:  (batch, hidden_dim) -- current decoder state s_{t-1}
            encoder_outputs: (batch, src_len, hidden_dim) -- all encoder states
        Returns:
            context: (batch, hidden_dim) -- weighted sum of encoder states
            alpha:   (batch, src_len)   -- attention weights for visualization
        """
        src_len = encoder_outputs.size(1)

        # Project encoder outputs: (batch, src_len, attn_dim)
        enc_proj = self.encoder_proj(encoder_outputs)

        # Project decoder hidden and broadcast to src_len: (batch, src_len, attn_dim)
        dec_proj = self.decoder_proj(decoder_hidden).unsqueeze(1)  # (batch, 1, attn_dim)
        dec_proj = dec_proj.expand(-1, src_len, -1)

        # Compute alignment scores
        energy = torch.tanh(enc_proj + dec_proj)  # (batch, src_len, attn_dim)
        scores = self.score_proj(energy).squeeze(-1)  # (batch, src_len)

        # Normalize
        alpha = F.softmax(scores, dim=-1)  # (batch, src_len)

        # Weighted sum: (batch, 1, src_len) @ (batch, src_len, hidden_dim) -> (batch, hidden_dim)
        context = torch.bmm(alpha.unsqueeze(1), encoder_outputs).squeeze(1)

        return context, alpha


# ---- Decoder ----
class Decoder(nn.Module):
    def __init__(self, tgt_vocab_size, embed_dim, hidden_dim, attn_dim, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(tgt_vocab_size, embed_dim, padding_idx=0)
        self.attention = BahdanauAttention(hidden_dim, attn_dim)
        # GRU input: embedding + context vector concatenated
        self.rnn = nn.GRU(embed_dim + hidden_dim, hidden_dim, batch_first=True)
        self.fc_out = nn.Linear(hidden_dim, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward_step(self, token, decoder_hidden, encoder_outputs):
        """
        One decoder step.

        Args:
            token:           (batch,)              -- current input token
            decoder_hidden:  (1, batch, hidden_dim) -- current decoder state
            encoder_outputs: (batch, src_len, hidden_dim)
        Returns:
            prediction:     (batch, tgt_vocab_size) -- logits
            decoder_hidden: (1, batch, hidden_dim)  -- updated state
            alpha:          (batch, src_len)         -- attention weights
        """
        embedded = self.dropout(self.embedding(token.unsqueeze(1)))  # (batch, 1, embed_dim)

        # Compute context vector using s_{t-1}
        s_prev = decoder_hidden.squeeze(0)  # (batch, hidden_dim)
        context, alpha = self.attention(s_prev, encoder_outputs)

        # Concatenate embedding and context
        rnn_input = torch.cat([embedded, context.unsqueeze(1)], dim=-1)

        output, decoder_hidden = self.rnn(rnn_input, decoder_hidden)
        prediction = self.fc_out(output.squeeze(1))  # (batch, tgt_vocab_size)

        return prediction, decoder_hidden, alpha

    def forward(self, tgt, encoder_outputs, encoder_hidden, teacher_forcing_ratio=0.5):
        """
        Full decoding pass with teacher forcing.

        Args:
            tgt:                  (batch, tgt_len) -- includes leading <sos>
            encoder_outputs:      (batch, src_len, hidden_dim)
            encoder_hidden:       (1, batch, hidden_dim)
            teacher_forcing_ratio: probability of using ground truth at each step
        Returns:
            outputs:    (batch, tgt_len-1, tgt_vocab_size)
            attentions: list of (batch, src_len), one per decoding step
        """
        tgt_len = tgt.size(1)
        decoder_hidden = encoder_hidden
        decoder_input = tgt[:, 0]  # first input is <sos>

        outputs = []
        attentions = []

        for t in range(1, tgt_len):
            prediction, decoder_hidden, alpha = self.forward_step(
                decoder_input, decoder_hidden, encoder_outputs
            )
            outputs.append(prediction)
            attentions.append(alpha)

            if random.random() < teacher_forcing_ratio:
                decoder_input = tgt[:, t]         # teacher forcing
            else:
                decoder_input = prediction.argmax(dim=-1)  # model prediction

        outputs = torch.stack(outputs, dim=1)  # (batch, tgt_len-1, vocab_size)
        return outputs, attentions


# ---- Full Seq2Seq Model ----
class Seq2SeqWithAttention(nn.Module):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, src, tgt, teacher_forcing_ratio=0.5):
        encoder_outputs, encoder_hidden = self.encoder(src)
        outputs, attentions = self.decoder(
            tgt, encoder_outputs, encoder_hidden, teacher_forcing_ratio
        )
        return outputs, attentions

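    def translate(self, src, max_len=20, sos_idx=1, eos_idx=2):
        """Greedy decoding at inference time -- no teacher forcing.

        Decodes a single sentence: the <eos> check below calls .item(),
        which only works when the batch holds exactly one sequence.
        """
        assert src.size(0) == 1, "translate() expects a batch of size 1"
        self.eval()
        with torch.no_grad():
            encoder_outputs, encoder_hidden = self.encoder(src)
            decoder_hidden = encoder_hidden
            # Start from <sos>, created on the same device as the source
            decoder_input = torch.full(
                (src.size(0),), sos_idx, dtype=torch.long, device=src.device
            )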

            generated = []
            attn_weights = []

            for _ in range(max_len):
                prediction, decoder_hidden, alpha = self.decoder.forward_step(
                    decoder_input, decoder_hidden, encoder_outputs
                )
                token = prediction.argmax(dim=-1)
                generated.append(token)
                attn_weights.append(alpha)

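                # Stop once every sequence in the batch emits <eos> at this step
                # (for batch size 1 this is the usual stopping check; .item()
                # would raise for larger batches)
                if (token == eos_idx).all():
                    break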

                decoder_input = token

            return generated, attn_weights


# ---- Training ----
EMBED_DIM  = 32
HIDDEN_DIM = 64
ATTN_DIM   = 32

encoder = Encoder(SRC_SIZE, EMBED_DIM, HIDDEN_DIM)
decoder = Decoder(TGT_SIZE, EMBED_DIM, HIDDEN_DIM, ATTN_DIM)
model   = Seq2SeqWithAttention(encoder, decoder)

print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

# Toy training pair: "the cat sat" -> "<sos> le chat assis <eos>"
src_seq = torch.tensor([[SRC_VOCAB["the"], SRC_VOCAB["cat"], SRC_VOCAB["sat"]]])
tgt_seq = torch.tensor([[
    TGT_VOCAB["<sos>"],
    TGT_VOCAB["le"],
    TGT_VOCAB["chat"],
    TGT_VOCAB["assis"],
    TGT_VOCAB["<eos>"]
]])

optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss(ignore_index=0)  # ignore padding

model.train()
for epoch in range(100):
    optimizer.zero_grad()

    outputs, _ = model(src_seq, tgt_seq, teacher_forcing_ratio=0.8)

    # Target: tgt_seq[:, 1:] -- shift by one (predict next token)
    targets = tgt_seq[:, 1:].reshape(-1)
    logits  = outputs.reshape(-1, TGT_SIZE)

    loss = criterion(logits, targets)
    loss.backward()

    # Gradient clipping -- essential for RNNs
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()

    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch+1:3d} | Loss: {loss.item():.4f}")

# ---- Inference ----
generated_tokens, attn = model.translate(src_seq)
translation = [INV_TGT.get(t.item(), "?") for t in generated_tokens]
print(f"\nSource:      the cat sat")
print(f"Translation: {' '.join(translation)}")

Architecture Diagram

The attention module receives all encoder hidden states $[h_1, h_2, h_3]$ and the current decoder state at every step, recomputing a fresh context vector $c_t$ each time. This is the key difference from the vanilla encoder-decoder where only $h_3$ (the final state) was passed to the decoder once at initialization.

Production Engineering Notes

Beam Search vs. Greedy Decoding

Greedy decoding - always picking the token with the highest probability at each step - is fast but myopic. The token with the highest probability at step 5 might lead to a dead end by step 8 where no good continuation exists.

Beam search maintains the top-K hypotheses at each decoding step (K is the beam width). At each step, each of the K hypotheses is expanded by all possible next tokens, producing K times vocab_size candidates. The top-K are kept, the rest are pruned. This explores a much larger part of the decoding space without being as expensive as full search.

Beam width K=4 or K=5 is typical for translation. Increasing K beyond 10-20 rarely helps and can actually hurt - a phenomenon called the "beam search curse" where larger beams favor shorter, safer outputs. At inference time, beam search adds latency proportional to K - a critical tradeoff for low-latency production systems.

After the beam search finishes, sequences are scored by their total log-probability. But every additional token adds another negative log-probability term, so longer sequences tend to score lower. The standard fix is length normalization: divide the total log-probability by $\text{length}^\alpha$ where $\alpha$ is typically 0.6 to 0.8.
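As a rough illustration of the procedure above, here is a minimal beam search sketch in plain Python. The `TOY_MODEL` table and `toy_step` function are hypothetical stand-ins for a real decoder step (they are not part of the model in this lesson), and the final ranking uses the simple $\text{length}^\alpha$ normalization described above.

```python
import math

SOS, EOS = 0, 1

# Hypothetical next-token probabilities keyed by the last token.
# Token 2 looks best at the first step but has a weaker continuation;
# token 3 looks worse at first but leads to a stronger overall sequence.
TOY_MODEL = {
    SOS: {2: 0.6, 3: 0.4},
    2: {EOS: 0.55, 2: 0.45},
    3: {EOS: 0.9, 3: 0.1},
}

def toy_step(tokens):
    """Stand-in for a decoder step: returns {token: log_prob}."""
    return {tok: math.log(p) for tok, p in TOY_MODEL[tokens[-1]].items()}

def beam_search(step_fn, beam_width, max_len=5, alpha=0.7):
    beams = [([SOS], 0.0)]   # (tokens, total log-probability)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = [
            (tokens + [tok], score + lp)
            for tokens, score in beams
            for tok, lp in step_fn(tokens).items()
        ]
        # Keep the top-K candidates; set aside those that ended with EOS.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == EOS else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)   # hypotheses cut off at max_len

    def normalized(hyp):
        tokens, score = hyp
        return score / (len(tokens) - 1) ** alpha   # length excludes SOS

    return max(finished, key=normalized)

greedy_tokens, _ = beam_search(toy_step, beam_width=1)
beam_tokens, _ = beam_search(toy_step, beam_width=2)
```

In this toy setup, K=1 commits to token 2 and ends with a sequence of probability 0.33, while K=2 keeps token 3 alive and finds the 0.36 sequence. A production implementation would batch the expansion on the GPU and track finished hypotheses more carefully.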

The Copy Mechanism

Standard seq2seq models cannot copy words from the input verbatim. If the input contains a proper noun - "Schwarzenegger" - the model must have seen this word enough times during training to produce it. For rare names, places, numbers, and technical terms this fails.

The copy mechanism (Gu et al., 2016, "Incorporating Copying Mechanism in Sequence-to-Sequence Learning") solves this by allowing the decoder to either generate a word from the vocabulary OR copy a word from the input source at each step. The mechanism uses the attention weights: high attention on a source position can also trigger copying that token directly.

This is essential for summarization (where you want to preserve key phrases from the original text), machine translation (for proper nouns and technical terms), and code generation (for variable names and string literals). Pointer networks and pointer-generator networks operationalize this idea.
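One way to picture the generate-or-copy choice is the pointer-generator form, where a scalar gate mixes the vocabulary distribution with the attention distribution over source tokens. The sketch below assumes the gate `p_gen`, the vocabulary probabilities, and the attention weights have already been computed by the model; all values are made up for illustration, and this is not the exact formulation of Gu et al.

```python
def final_distribution(p_gen, vocab_probs, attention, source_tokens):
    """Mix generation and copy distributions.

    p_gen:         probability of generating from the vocabulary (0..1)
    vocab_probs:   {word: probability} over the fixed vocabulary
    attention:     attention weights over source positions (sums to 1)
    source_tokens: source words, aligned with `attention`
    """
    # Generation path: scale the vocabulary distribution by p_gen.
    mixed = {word: p_gen * p for word, p in vocab_probs.items()}
    # Copy path: add (1 - p_gen) * attention mass to each source word,
    # including words that are not in the fixed vocabulary at all.
    for word, weight in zip(source_tokens, attention):
        mixed[word] = mixed.get(word, 0.0) + (1.0 - p_gen) * weight
    return mixed

vocab_probs = {"the": 0.5, "actor": 0.3, "<unk>": 0.2}
source = ["Schwarzenegger", "the", "actor"]
attention = [0.8, 0.1, 0.1]

dist = final_distribution(0.3, vocab_probs, attention, source)
best = max(dist, key=dist.get)
```

Here the out-of-vocabulary name receives most of the probability mass purely through the copy path, even though the generator could only have produced `<unk>` for it.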

Coverage Penalty

Without a coverage mechanism, attention can repeatedly focus on the same input positions - generating repetitive output. In translation this produces "The the the cat cat sat." In summarization it generates the same fact three times from three different parts of the document.

A coverage penalty penalizes the model when its cumulative attention weights on any source position exceed 1.0. The intuition: if you have already "used up" a source word fully (cumulative attention = 1.0), you should not keep attending to it. This encourages the decoder to cover all source positions at most once.

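Wu et al. (2016), describing Google's production NMT system, add a coverage penalty to the beam search scoring function: $\log P(\text{output}) + \lambda \sum_i \log\big(\min(\text{cov}_{N,i},\, 1)\big)$ , where $\text{cov}_{N,i}$ is the cumulative attention on source position $i$ after all $N$ decoding steps (in the paper the log-probability term is also divided by a length penalty). Each term is zero once a position has received total attention of 1 and negative below that, so hypotheses that leave source positions uncovered score lower. The $\min$ ensures that attending to a position beyond a total of 1 earns no extra credit. This favors hypotheses that cover the whole source, which mainly targets omission errors.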
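To make the scoring term concrete, here is a small sketch of a coverage penalty computed from an attention matrix. It follows the form given in Wu et al. (2016), a weighted sum of $\log(\min(\text{cov}_i, 1))$ over source positions. The weight value, the epsilon guard against $\log 0$, and the toy attention matrices are assumptions for illustration.

```python
import math

def coverage_penalty(attention, weight=0.2, eps=1e-9):
    """attention[j][i] = attention on source position i at decoding step j.

    Returns a value <= 0: zero when every source position has accumulated
    total attention of at least 1, more negative when positions are neglected.
    """
    num_src = len(attention[0])
    total = 0.0
    for i in range(num_src):
        cov_i = sum(step[i] for step in attention)       # cumulative attention
        total += math.log(max(min(cov_i, 1.0), eps))     # cap at 1, avoid log(0)
    return weight * total

# Each decoding step focuses on a different source position.
well_covered = [
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.05, 0.90],
]
# Every decoding step stares at source position 0.
repetitive = [[0.90, 0.05, 0.05]] * 3

good_score = coverage_penalty(well_covered)
bad_score = coverage_penalty(repetitive)
```

The well-covered hypothesis receives a penalty of approximately zero, while the repetitive one is pushed down because two source positions never accumulate much attention.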

When to Use Seq2Seq vs. Fine-Tuned LLM

In 2026, the practical question is rarely "should I build a seq2seq model?" - it is "should I fine-tune a dedicated seq2seq model or use a large pre-trained LLM?"

Use a dedicated seq2seq model (T5, BART, mBART) when:

You have domain-specific data and the LLM's pre-training distribution does not cover your domain well
Latency matters and you need a small, fast model that fits on a single GPU or edge device
Your task is highly structured (SQL generation, code translation between specific languages) and a smaller task-specific model outperforms a larger general one
Budget constraints - a 250M parameter T5 model has roughly 28x fewer parameters than a 7B parameter LLM, which translates into substantially lower serving cost

Use a large pre-trained LLM, fine-tuned or prompted (GPT-4o-mini, Llama 3.1 8B, Mistral 7B) when:

You have little task-specific training data and need the LLM's broad world knowledge
Your output requires common sense reasoning beyond what is in your training set
You want zero-shot or few-shot capability without any fine-tuning
Output format flexibility matters more than maximum efficiency

The architectures are not opposites - T5, BART, and the mT5 family are encoder-decoder transformers that use the same fundamental design from 2014, scaled up and pre-trained. Understanding seq2seq with attention gives you the conceptual grounding to work with all of them.

Handling OOV Tokens at Scale

Out-of-vocabulary (OOV) words are the nemesis of word-level vocabularies. Byte-pair encoding (BPE, Sennrich et al. 2016) and SentencePiece resolve this by breaking rare words into subword units that are guaranteed to appear in the vocabulary. "Schwarzenegger" becomes something like ["Schwar", "zen", "egg", "er"] - each a known subword.

BPE with 32K to 50K merge operations is standard for translation systems. This reduces OOV rates to near zero while keeping vocabulary size manageable.
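The merge-learning idea behind BPE can be sketched in a few lines of plain Python. This is a toy version for intuition only: the word counts are invented, the `</w>` end-of-word marker is one common convention, and real systems use optimized libraries with tens of thousands of merges rather than ten.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merges from {word: count}."""
    # Every word starts as characters plus an end-of-word marker.
    corpus = {tuple(w) + ("</w>",): c for w, c in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        corpus = {merge_pair(s, best): c for s, c in corpus.items()}
    return merges

def segment(word, merges):
    """Split a (possibly unseen) word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=10)
pieces = segment("lowest", merges)   # "lowest" never appears in the toy corpus
```

With these toy counts, the unseen word "lowest" is split into the two learned subwords `low` and `est</w>`. In the worst case a word falls back to single characters, which is why subword vocabularies rarely produce OOV tokens.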

Common Mistakes

:::danger Using a fixed context vector without attention for sequences longer than 30 tokens

The single-vector bottleneck is not a theoretical concern - it is a measurable performance cliff. Bahdanau et al. (2015) showed BLEU score degradation above 30 input tokens with the original 2014 architecture. If you are building any seq2seq system and your inputs can exceed 25-30 tokens, implementing attention is not optional. The computational overhead of Bahdanau attention is modest compared to the encoding step itself.

:::

:::danger Forgetting gradient clipping with RNN-based seq2seq

RNNs, even LSTMs, can produce exploding gradients during seq2seq training - especially when sequences are long. Without torch.nn.utils.clip_grad_norm_(parameters, max_norm=1.0), gradients can go to NaN in the first few hundred steps. This manifests as sudden loss spikes or immediate NaN losses. Always clip before the optimizer step, every time.

:::

:::warning Comparing seq2seq models using different beam widths

A model evaluated with beam width K=4 will have higher BLEU scores than the same model evaluated greedily. When comparing two architectures, use the same beam width, the same length penalty alpha, and the same maximum decode length. The difference between greedy and beam-4 can be 2-3 BLEU points - enough to make a worse model appear better if you are inconsistent.

:::

:::warning Setting teacher forcing ratio to 1.0 for the entire training run

Training with teacher forcing ratio = 1.0 throughout creates severe exposure bias. The model has never seen its own imperfect outputs during training and falls apart at inference time. Implement scheduled sampling or reduce teacher forcing to 0.5-0.7 in the later stages of training. A common schedule: start at 1.0, linearly reduce to 0.5 over the first half of training, then hold at 0.5.

:::

:::warning Not applying padding masks to attention

When your batch contains sequences of different lengths and you have padded shorter sequences to the maximum length, the attention mechanism will attend to padding positions unless you mask them. Attending to padding produces misleading context vectors and can cause the model to learn spurious patterns from the padding distribution. Pass an attention mask that sets padding positions to a large negative number (such as -1e9) before the softmax, so they receive near-zero weight after normalization.

:::
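The masking step can be shown in isolation with a few lines of plain Python. This is a minimal sketch with made-up scores; in PyTorch the same effect is typically achieved by calling `masked_fill` on the score tensor before the softmax.

```python
import math

def masked_softmax(scores, mask, neg=-1e9):
    """Softmax over attention scores that ignores padding.

    scores: raw attention scores, one per source position
    mask:   1 for real tokens, 0 for padding positions
    """
    # Overwrite padding scores with a large negative number.
    masked = [s if m else neg for s, m in zip(scores, mask)]
    peak = max(masked)                          # subtract max for stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Two real tokens followed by two padding positions. The padding
# positions have the highest raw scores, which is exactly the problem.
scores = [2.0, 1.0, 3.0, 3.0]
mask = [1, 1, 0, 0]

unmasked = masked_softmax(scores, [1, 1, 1, 1])
weights = masked_softmax(scores, mask)
```

Without the mask, most of the attention mass lands on the two padding positions. With it, they receive zero weight and the real tokens share the full distribution.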

:::warning Evaluating translation quality with BLEU alone

BLEU is fast to compute and widely reported, but it has known failures: it penalizes valid paraphrases, handles word order poorly across language pairs with very different structure, and is not comparable across setups that differ in tokenization or in the number of references. Modern evaluation combines BLEU with chrF (character-level F-score) and human evaluation for any production system. For summarization, ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore are more informative than BLEU.

:::

Interview Q&A

Q1: What is the fundamental problem that seq2seq with attention solves, and why could a simpler architecture not handle it?

Standard feedforward networks require fixed-size inputs and outputs, and a plain RNN emits one output per input step, which ties the output length and alignment to the input. Machine translation requires mapping variable-length inputs to variable-length outputs with no fixed alignment between positions.

The original seq2seq (Sutskever 2014) addressed the variable-length problem by separating encoding from decoding. But the fixed-size context vector between them - the encoder's final hidden state - became a bottleneck for long sequences. Information about early tokens in a 50-word sentence gets overwritten by later tokens as the encoder processes the sequence. The decoder, initialized only from this final state, has no direct access to early input representations.

Attention (Bahdanau 2015) solves the bottleneck by keeping all encoder hidden states and allowing the decoder to compute a weighted sum of them at every decoding step. The weights - computed by a learned alignment model - are specific to the current decoder state. When generating a translation of "cat," the model puts high weight on the encoder state from the "cat" position. When generating "sat," it shifts weight to that position. The context vector becomes a dynamic, step-specific summary of the input rather than a single static compression.

Q2: Explain the difference between Bahdanau attention and Luong attention. When would you prefer each?

Both compute alignment scores between decoder states and encoder states, normalize with softmax, and compute a weighted sum. The differences are in how scores are computed and when the context vector is used.

Bahdanau (additive): $\text{score}(s_{t-1}, h_i) = v^\top \tanh(W s_{t-1} + U h_i)$ . A small feedforward network learns to score compatibility. Slightly more parameters, but $U h_i$ can be precomputed once per source sentence, making inference efficient. The context vector is computed from $s_{t-1}$ (the previous decoder state) and is used when computing $s_t$ .

Luong (multiplicative): $\text{score}(s_t, h_i) = s_t^\top h_i$ . No learned feedforward network in the dot-product case - just inner products. Requires encoder and decoder to share the same hidden dimension. The context vector is computed from $s_t$ (the current decoder state) and used for the current step's output prediction.

In practice: Luong dot-product attention is faster and simpler to implement. Bahdanau attention has a few more parameters but works when encoder and decoder dimensions differ (Luong's "general" variant, $s_t^\top W h_i$ , also handles that case). For new seq2seq systems in 2026, you would likely use scaled dot-product attention from the Transformer - but for understanding existing RNN-based production systems you will inherit, both matter.

Q3: What is teacher forcing, what problem does it solve, and what problem does it create?

Teacher forcing is a training strategy where, at each decoding step, the decoder receives the ground-truth previous token as input rather than its own prediction from the previous step.

It solves the error cascade problem. Without teacher forcing, a wrong prediction at step 3 produces wrong input at step 4, wrong input at step 5, and so on. By step 10, the model is generating in a context so different from training conditions that gradients are uninformative. Teacher forcing stabilizes training by ensuring the decoder always operates with correct context.

The problem it creates is exposure bias: a mismatch between training (where inputs are always correct) and inference (where inputs are the model's own imperfect predictions). The model has never learned to recover from its own errors.

Solutions: scheduled sampling (Bengio et al., 2015 - gradually reduce teacher forcing probability across training so the model sees its own outputs), minimum risk training (optimize expected task metric over sampled sequences), and REINFORCE with BLEU as reward. The simplest practical fix is a scheduled decay of the teacher forcing ratio, starting at 1.0 and reducing to 0.5 by the latter half of training.
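The decay schedule mentioned above (start at 1.0, fall linearly to 0.5 over the first half of training, then hold) can be written as a small helper. The function name and arguments are illustrative assumptions. Note that this decays a single ratio for the whole run, which is a coarser approximation of scheduled sampling than the per-token sampling in Bengio et al.

```python
def teacher_forcing_ratio(step, total_steps, start=1.0, end=0.5):
    """Linear decay from `start` to `end` over the first half, then constant."""
    half = total_steps / 2
    if step >= half:
        return end
    return start + (end - start) * (step / half)

# Ratio at a few points of a 100-epoch run.
schedule = [teacher_forcing_ratio(e, 100) for e in (0, 25, 50, 99)]
```

The returned value would be passed as the `teacher_forcing_ratio` argument of the model's forward call at each epoch, in place of a fixed constant.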

Q4: How does beam search improve over greedy decoding, and what are its failure modes?

Greedy decoding always picks the token with the highest probability at each step. This is fast but myopic - the best token at step 5 might lead to low-probability tokens at steps 6, 7, 8, producing a globally suboptimal sequence.

Beam search maintains K hypotheses simultaneously. At each step, each hypothesis is extended by all possible next tokens, creating K times V candidates (V = vocabulary size). The top-K by cumulative log-probability are kept. This explores a much larger portion of the output space at the cost of K times the computation.

Failure modes:

Length bias: Sequences accumulate negative log-probabilities, so longer sequences score worse. Fix: divide total log-probability by $\text{length}^\alpha$ where $\alpha$ is typically 0.6 to 0.8.
Beam search curse: Very large beams favor short, generic outputs. Beam width K=5 often outperforms K=20 on translation tasks.
Repetition: Beam search can produce repetitive sequences. Fix: coverage penalty or n-gram blocking (never generate the same 3-gram twice).
Diversity collapse: All K beams often converge to slight variations of the same sequence. Fix: diverse beam search (Vijayakumar et al. 2016) which penalizes similarity between beams.

For production systems where latency matters, K=4 or K=5 is a standard tradeoff.
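The n-gram blocking fix listed above amounts to a simple check before a candidate token is accepted. The sketch below is one possible version in plain Python; the function name is hypothetical, and a real decoder would apply the check by setting the blocked token's score to negative infinity before selecting the top-K.

```python
def would_repeat_ngram(tokens, candidate, n=3):
    """True if appending `candidate` would create an n-gram already in `tokens`."""
    if len(tokens) < n - 1:
        return False
    # The n-gram that would be formed by the last n-1 tokens plus the candidate.
    new_ngram = tuple(tokens[len(tokens) - (n - 1):]) + (candidate,)
    # All n-grams that already occur in the hypothesis.
    seen = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return new_ngram in seen

hypothesis = ["the", "cat", "sat", "on", "the", "cat"]
blocked = would_repeat_ngram(hypothesis, "sat")   # would repeat "the cat sat"
allowed = would_repeat_ngram(hypothesis, "ran")   # "the cat ran" is new
```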

Q5: What is the coverage mechanism and why is it needed?

Attention weights at each decoder step sum to 1.0. But across all decoder steps, the cumulative attention on any input position can be arbitrarily large. The model can attend to the same source word at every decoder step, effectively ignoring the rest of the input.

This causes two problems: the decoder repeats parts of the source (in translation, this produces duplicate content), and other parts of the source are never attended to and thus never reflected in the output (causing omissions).

A coverage vector tracks cumulative attention over all previous decoding steps: $\text{cov}_t = \sum_{t' < t} \alpha_{t'}$ . The attention score computation is modified to penalize high cumulative attention - if position $i$ has already received cumulative attention of 0.8, the model is discouraged from putting high weight on it again.

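Wu et al. (2016), describing Google's production NMT system, add a coverage penalty to the beam search scoring function: $\log P(\text{output}) + \lambda \sum_i \log\big(\min(\text{cov}_{N,i},\, 1)\big)$ . Each term is zero once a source position has received total attention of 1 and negative below that, so hypotheses that leave source positions uncovered score lower. The $\min$ ensures that attending to a position beyond a total of 1 earns no extra credit. This favors hypotheses that cover the whole source, which mainly targets omission errors in real production translation systems.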

Q6: How would you diagnose and fix a seq2seq model that produces good short outputs but degrades on long inputs?

This is the context vector bottleneck in action. Diagnostic steps:

Step 1 - Plot output quality vs. input length: If BLEU or task-specific accuracy drops sharply above 30-50 tokens, the bottleneck is the likely cause.

Step 2 - Visualize attention weights: If attention weights become uniform (no concentrated attention on relevant positions) for long inputs, the attention mechanism is failing to find meaningful alignments. This can happen if the model was not trained on sufficient long examples, or if there is a bug where attention masks are not applied to padding.

Step 3 - Check if the context vector is actually being used: Some implementations accidentally do not incorporate the context vector into the decoder's prediction - for example, forgetting to concatenate it before the output projection. Print the attention weights. If they are uniform across all positions for every decoder step, there is likely a bug in the implementation.
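Step 1 is easy to automate. The sketch below groups per-example scores into source-length buckets and averages them. The `(length, score)` pairs are hypothetical placeholders for whatever sentence-level metric you already compute on a validation set.

```python
from collections import defaultdict

def score_by_length(examples, bucket_size=10):
    """examples: iterable of (source_length, score) pairs.

    Returns {bucket_start: mean score}, ordered by bucket.
    """
    buckets = defaultdict(list)
    for length, score in examples:
        buckets[(length // bucket_size) * bucket_size].append(score)
    return {start: sum(v) / len(v) for start, v in sorted(buckets.items())}

# Made-up validation results: quality falls off as inputs get longer.
examples = [(5, 0.9), (8, 0.8), (25, 0.7), (42, 0.3), (47, 0.1)]
curve = score_by_length(examples)
```

A sharp drop between adjacent buckets, rather than a gentle slope, is the pattern that points toward a length-related failure.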

Fixes in order of effort:

Use a bidirectional encoder: A bidirectional GRU or LSTM gives each encoder position information about both left and right context. The resulting hidden states are richer and easier for the attention mechanism to distinguish.
Add positional information explicitly: RNNs can lose track of absolute position in very long sequences. Positional encodings or position-aware embeddings help the model maintain awareness of where in the sequence it is.
Train specifically on long examples: If your training data is dominated by short sequences, the model will not learn to handle long ones. Augment training data with longer examples or oversample them.
Switch to a Transformer encoder: Self-attention scales to long sequences far better than RNNs. Every position can attend directly to every other position, without information having to travel through intermediate hidden states and risk being diluted.
Hierarchical encoding for very long documents: Encode sentences into sentence representations, then encode sentence representations into a document representation. This gives the model a two-level hierarchy that fits more information into the attention context.

Q7: What is the role of a bidirectional encoder in seq2seq, and how does it change what the attention mechanism sees?

A standard unidirectional encoder processes the source sequence left to right. The hidden state at position 3 has seen positions 1, 2, and 3 - but nothing about positions 4, 5, 6, and beyond. In a sentence like "The bank by the river," when the encoder reaches "bank" at position 2, it has not yet seen "river" at position 5 - the context that disambiguates whether "bank" means a financial institution or a riverbank. The encoder state for "bank" encodes an ambiguous representation.

A bidirectional encoder runs two RNNs in parallel: one left-to-right (the forward pass), one right-to-left (the backward pass). The forward pass produces states $[\vec{f}_1, \vec{f}_2, \ldots, \vec{f}_N]$ and the backward pass produces $[\vec{b}_1, \vec{b}_2, \ldots, \vec{b}_N]$ . These are concatenated at each position: $h_i = [\vec{f}_i\,;\,\vec{b}_i]$ . Now $h_2$ for "bank" encodes both the preceding context ("The") and the following context ("by the river"), making it unambiguous.

The effect on attention is significant. Attention scores depend on the quality of encoder states - if encoder states are ambiguous or underspecified, the attention mechanism cannot reliably identify which source positions to focus on. Bidirectional encoder states carry more information per position, so the alignment model has better material to work with. This is why Bahdanau et al. (2015) specifically used a bidirectional GRU in their original attention paper - not a unidirectional one.

The tradeoff: a bidirectional encoder runs two recurrent passes over the source sequence (one in each direction) instead of one. This roughly doubles encoding compute. For very long sequences (thousands of tokens), this becomes expensive. The Transformer's self-attention solves this differently - every position attends to every other position in a single pass, capturing bidirectional context without the sequential overhead.

Bidirectional Encoder: Code Comparison

This shows exactly what changes when you add bidirectionality to the encoder, including how to handle the hidden state mismatch when initializing a unidirectional decoder.

import torch
import torch.nn as nn

# ---- Unidirectional Encoder (reference) ----
class UniEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=False)

    def forward(self, src):
        # encoder_outputs: (batch, src_len, hidden_dim)
        # hidden:          (1, batch, hidden_dim)
        embedded = self.embedding(src)
        encoder_outputs, hidden = self.rnn(embedded)
        return encoder_outputs, hidden


# ---- Bidirectional Encoder ----
class BiEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # bidirectional=True doubles the output dimension
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Separate projections back to hidden_dim, because the attention module
        # and the unidirectional decoder expect hidden_dim, not 2*hidden_dim.
        # Two layers are used so the per-position states and the decoder
        # initial state do not share weights.
        self.fc_outputs = nn.Linear(hidden_dim * 2, hidden_dim)
        self.fc_hidden  = nn.Linear(hidden_dim * 2, hidden_dim)

    def forward(self, src):
        embedded = self.embedding(src)
        # encoder_outputs: (batch, src_len, 2*hidden_dim) -- forward + backward concatenated
        # hidden:          (2, batch, hidden_dim)         -- [forward_final; backward_final]
        encoder_outputs, hidden = self.rnn(embedded)

        # Combine the two directions of encoder_outputs for attention
        # The attention module expects (batch, src_len, hidden_dim), not 2*hidden_dim
        # Option 1: linear projection (most common)
        encoder_outputs_proj = self.fc_outputs(encoder_outputs)  # (batch, src_len, hidden_dim)

        # Combine the final hidden states from both directions to initialize decoder
        # hidden is (2, batch, hidden_dim) -- [0] is forward, [1] is backward
        # Concatenate and project to match decoder's hidden_dim
        forward_hidden  = hidden[0]  # (batch, hidden_dim)
        backward_hidden = hidden[1]  # (batch, hidden_dim)
        combined = torch.cat([forward_hidden, backward_hidden], dim=-1)  # (batch, 2*hidden_dim)
        decoder_init = torch.tanh(self.fc_hidden(combined)).unsqueeze(0)  # (1, batch, hidden_dim)

        return encoder_outputs_proj, decoder_init


# ---- Demonstrate the shape difference ----
VOCAB = 100
EMBED = 32
HIDDEN = 64
BATCH = 2
SEQ_LEN = 10

src = torch.randint(1, VOCAB, (BATCH, SEQ_LEN))

uni_enc = UniEncoder(VOCAB, EMBED, HIDDEN)
bi_enc  = BiEncoder(VOCAB, EMBED, HIDDEN)

uni_out, uni_hidden = uni_enc(src)
bi_out,  bi_hidden  = bi_enc(src)

print("=== Unidirectional Encoder ===")
print(f"Encoder outputs: {uni_out.shape}")  # (2, 10, 64)
print(f"Final hidden:    {uni_hidden.shape}")  # (1, 2, 64)

print("\n=== Bidirectional Encoder (after projection) ===")
print(f"Encoder outputs: {bi_out.shape}")   # (2, 10, 64) -- same as uni after projection
print(f"Final hidden:    {bi_hidden.shape}")  # (1, 2, 64) -- same shape, different content

print("\nKey point: after projection, both encoders produce the same shape.")
print("The bidirectional version contains richer context per position.")

The critical engineering detail is the hidden state mismatch. A bidirectional GRU with hidden_dim=64 produces a final hidden state of shape (2, batch, 64) - one for each direction. A unidirectional decoder GRU with hidden_dim=64 needs an initial hidden state of shape (1, batch, 64). You must project the concatenated bidirectional hidden states to the decoder's expected shape before passing them as the decoder's initial state.

The seq2seq Family: Common Variants and Their Use Cases

Understanding that seq2seq is a family of architectures - not a single model - matters for choosing the right variant for your task.

Standard seq2seq with attention (Bahdanau 2015): The baseline for most sequence transduction tasks. Works well for sentences under 100 tokens. Appropriate when you have limited compute and need a production model smaller than 100M parameters.

Pointer-Generator Networks (See et al. 2017): Extends seq2seq with a copy mechanism that allows the decoder to either generate from a fixed vocabulary or copy tokens directly from the input. Essential for abstractive summarization, where key terms from the document should appear verbatim in the summary. Evaluated on the CNN/Daily Mail summarization benchmark and used in many production summarization systems.

Transformer-based seq2seq (Vaswani et al. 2017): Replaces the RNN encoder and decoder with Transformer blocks using multi-head self-attention and cross-attention. Parallelizable during training - encodes the full source in one pass rather than sequentially. T5, BART, and mBART are pre-trained Transformer seq2seq models that you can fine-tune on your specific task.

Conditional seq2seq: The decoder is conditioned on additional inputs beyond the encoded source sequence. Examples include: image captioning (conditioned on visual features), conditional text generation (conditioned on style or persona vectors), and code completion (conditioned on function signature and docstring). The architecture is identical - you simply concatenate or project the conditioning signal into the encoder or decoder's input.

Hierarchical seq2seq: Used for long document processing where flat encoding is insufficient. First level encodes words into sentence representations; second level encodes sentence representations into a document representation. The decoder attends to the document-level representation and generates output sentence by sentence. Used in multi-document summarization and dialogue systems.

Key Takeaways

These are the ideas that come up in every serious technical interview on seq2seq, and the concepts that every ML engineer working on text or speech should have internalized.

The bottleneck problem is concrete, not theoretical. The performance cliff above 30 tokens for non-attention seq2seq is documented in Bahdanau et al. (2015) with actual BLEU curves. If your inputs regularly exceed 30 tokens and you are not using attention, you have a known failure mode that will surface in production.

Attention is interpretable. The $\alpha_{t,i}$ matrix tells you what the model was looking at when it generated each output token. This is a diagnostic tool. When the model makes mistakes, look at the attention weights before assuming the architecture is wrong - the problem is often that attention is focusing on the wrong positions, which points to issues in the training data or input preprocessing.

Teacher forcing is a training shortcut that costs you at test time. Use scheduled sampling, or at minimum reduce the teacher forcing ratio to 0.5 in later training epochs. Be cautious when evaluating an RNN seq2seq model trained with 100% teacher forcing: exposure bias can make results look worse than the architecture deserves on short sequences and far worse on long ones.

Beam search is a tunable knob. K=1 is greedy. K=4 or K=5 is standard. K=10-20 costs more latency and, because of the beam search curse, does not reliably improve quality. Always use length normalization ( $\alpha$ = 0.6–0.8) so the scoring does not favor shorter sequences. Always use the same K and $\alpha$ when comparing two models - small differences in beam parameters can swing BLEU by 2-3 points.

The choice between seq2seq and fine-tuned LLM is about data, latency, and budget. A dedicated seq2seq model (T5-small at 60M parameters, BART-base at 140M) can outperform a much larger LLM on tasks where you have substantial in-domain training data. Do not default to the largest model available - profile your specific task with both options before committing to an architecture.

