Attention Is All You Need - The Paper That Changed Everything

Reading time: ~45 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Engineer, Data Scientist

The Real Interview Moment

You are in an OpenAI research engineer interview. The interviewer draws a blank rectangle on the whiteboard and says: "Draw me the Transformer architecture. As you draw, explain every component - what it does, why it is there, and what happens if you remove it." She pauses, then adds: "I want the math for self-attention, and I want to know why you divide by root $d_k$ ."

This is the most commonly asked paper discussion in ML interviews, period. The Transformer is not just a paper - it is the foundation of modern AI. Every large language model, from BERT to GPT-4 to Claude, builds on this architecture. If there is one paper you must know cold, this is it.

This chapter gives you a complete, interview-ready understanding: the motivation, the architecture (component by component), the mathematics (with intuition for every design choice), the training details, the ablation results, and the paper's lasting impact.

What You Will Master

Explain why the Transformer was proposed (limitations of RNNs)
Draw the complete Transformer architecture from memory
Derive scaled dot-product attention from first principles
Explain multi-head attention and why it outperforms single-head
Describe positional encoding and its mathematical properties
Discuss the training setup (optimizer, regularization, label smoothing)
Cite specific ablation results from the paper
Identify limitations and connect to modern improvements

Self-Assessment: Where Are You Now?

Skill	1 - Cannot	2 - Vaguely	3 - Can Explain	4 - Can Derive	5 - Can Teach	Your Score
Explain why RNNs were replaced						___
Draw the full Transformer architecture						___
Derive the attention equation						___
Explain the scaling factor $\sqrt{d_k}$						___
Explain multi-head attention						___
Describe positional encoding						___
Explain residual connections + LayerNorm						___
Cite specific results and ablations						___
List 3+ limitations						___
Connect to modern architectures (GPT, BERT)						___

Target: All 4s and 5s before your interview.

Part 1 - The Problem: Why Replace RNNs?

The Sequential Bottleneck

Before 2017, the state of the art for sequence-to-sequence tasks (machine translation, summarization, etc.) was encoder-decoder RNNs with attention (Bahdanau et al., 2014; Luong et al., 2015). These models had a fundamental limitation:

RNNs process sequences one token at a time.

To compute the hidden state at position $t$ , you need the hidden state at position $t-1$ :

$h_t = f(h_{t-1}, x_t)$

This sequential dependency has three consequences:

No parallelization. You cannot compute $h_5$ until you have computed $h_1$ through $h_4$ . On modern GPUs with thousands of cores, most of the hardware sits idle.
Long-range dependency problems. Information from early tokens must survive through every intermediate hidden state to influence later tokens. Despite LSTMs and GRUs helping with this, gradients still degrade over hundreds of tokens.
Training speed. Sequential processing means training time scales linearly with sequence length, even on parallel hardware.
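
The dependency chain behind these three consequences can be seen in a minimal sketch. It assumes a vanilla tanh RNN cell with hypothetical weight names `W_h` and `W_x` (neither comes from the paper) and contrasts it with a position-wise projection that needs no loop over time.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                               # sequence length, hidden size (illustrative)
X = rng.standard_normal((n, d))           # one input vector per position
W_h = rng.standard_normal((d, d)) * 0.1   # hypothetical recurrent weights
W_x = rng.standard_normal((d, d)) * 0.1   # hypothetical input weights

def rnn_forward(X, W_h, W_x):
    """h_t = tanh(h_{t-1} W_h + x_t W_x); each step waits for the previous one."""
    h = np.zeros(X.shape[1])
    states = []
    for x_t in X:                         # n sequential steps, no parallelism over t
        h = np.tanh(h @ W_h + x_t @ W_x)
        states.append(h)
    return np.stack(states)

H_rnn = rnn_forward(X, W_h, W_x)          # (n, d), built one row at a time
H_par = np.tanh(X @ W_x)                  # (n, d), all positions in one matrix product

print(H_rnn.shape, H_par.shape)
```

The loop body cannot start step $t$ before step $t-1$ finishes, while the last line handles every position in a single matrix product. That single product is the kind of operation self-attention is built from.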

[Figure: RNN vs Transformer - sequential vs parallel processing]

60-Second Answer

"The Transformer was motivated by the fundamental limitation of RNNs: sequential processing. In an RNN, you cannot compute the representation for position $t$ until you have processed all positions before it, which prevents parallelization and limits practical sequence lengths. The Transformer replaces recurrence entirely with self-attention, which computes all pairwise interactions in parallel. This trades $O(n)$ sequential operations for $O(1)$ - at the cost of $O(n^2)$ total computation and memory - which is a favorable tradeoff for sequences under a few thousand tokens on modern parallel hardware."

What Existed Before: Attention with RNNs

It is crucial to understand that attention itself was not new. Bahdanau et al. (2014) introduced additive attention for machine translation, and Luong et al. (2015) proposed multiplicative (dot-product) attention. Both used attention as a mechanism on top of RNNs.

The Transformer's insight was not "let us use attention." It was "let us use ONLY attention" - removing recurrence entirely.

Feature	RNN + Attention	Transformer
Sequence processing	Sequential (one token at a time)	Parallel (all tokens simultaneously)
Long-range dependencies	Through hidden state chain	Direct pairwise attention
Training parallelization	Limited by sequential nature	Fully parallelizable
Complexity per layer	$O(n \cdot d^2)$	$O(n^2 \cdot d)$
Sequential operations	$O(n)$	$O(1)$
Maximum path length	$O(n)$	$O(1)$

Common Trap

Do not say "The Transformer invented attention." Attention mechanisms have existed since 2014. The Transformer's contribution was showing that attention alone - without any recurrence or convolution - is sufficient for state-of-the-art sequence modeling. This distinction matters in interviews.

Part 2 - The Architecture

Overall Structure

The Transformer uses an encoder-decoder architecture:

Encoder: Processes the input sequence (e.g., the source language sentence). Consists of $N = 6$ identical layers.
Decoder: Generates the output sequence (e.g., the target language sentence). Also $N = 6$ identical layers, with an additional cross-attention sublayer.

[Figure: Transformer architecture - encoder-decoder]

Each Encoder Layer

Every encoder layer has two sublayers:

Multi-head self-attention: Every position attends to all positions in the previous layer's output.
Position-wise feed-forward network: A two-layer MLP applied independently to each position.

Around each sublayer, there is:

A residual connection: $\text{output} = \text{sublayer}(x) + x$
Layer normalization: Applied after the addition

The formula for each sublayer is:

$\text{LayerNorm}(x + \text{Sublayer}(x))$

Each Decoder Layer

The decoder has three sublayers:

Masked multi-head self-attention: Like the encoder, but with a mask that prevents position $i$ from attending to positions $j > i$ . This ensures that predictions for position $i$ depend only on known outputs at positions less than $i$ .
Multi-head cross-attention: Queries come from the previous decoder sublayer, but keys and values come from the encoder output. This is how the decoder "looks at" the input sequence.
Position-wise feed-forward network: Same as the encoder.
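
The causal mask in the first sublayer is easy to get wrong on a whiteboard, so here is a minimal sketch of one common way to build it. It assumes the additive convention (0 where attention is allowed, $-\infty$ where it is blocked); the helper names `causal_mask` and `masked_softmax` are illustrative, not from the paper.

```python
import numpy as np

def causal_mask(n):
    """Additive mask: 0 on and below the diagonal, -inf strictly above it."""
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_softmax(scores, mask):
    """Row-wise softmax of (scores + mask), stabilized by subtracting the row max."""
    z = scores + mask
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

n = 4
scores = np.random.default_rng(0).standard_normal((n, n))
weights = masked_softmax(scores, causal_mask(n))

print(np.round(weights, 3))   # entries above the diagonal are exactly zero
```

Row $i$ of `weights` puts zero probability on every position $j > i$, so position $i$ can only mix information from itself and earlier positions.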

Instant Rejection

If asked "What is the difference between the encoder and decoder?", do not say "The decoder has masking." The decoder has TWO additional features compared to the encoder: (1) causal masking in the self-attention to prevent looking ahead, and (2) cross-attention over the encoder output. Missing either one shows shallow understanding.

Part 3 - Scaled Dot-Product Attention

The Core Equation

The heart of the Transformer is the scaled dot-product attention:

$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

Where:

$Q \in \mathbb{R}^{n \times d_k}$ - Queries: what each position is looking for
$K \in \mathbb{R}^{m \times d_k}$ - Keys: what each position offers to be matched against
$V \in \mathbb{R}^{m \times d_v}$ - Values: the actual content to be retrieved
$n$ is the number of query positions, $m$ is the number of key/value positions
In self-attention: $n = m$ (every position attends to every position, including itself)

Step-by-Step Computation

import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: (n, d_k) - queries
    K: (m, d_k) - keys
    V: (m, d_v) - values
    mask: optional (n, m) - attention mask
    """
    d_k = Q.shape[-1]

    # Step 1: Compute raw attention scores
    # QK^T: (n, d_k) @ (d_k, m) = (n, m)
    # Each entry (i,j) = dot product of query i with key j
    scores = Q @ K.T  # shape: (n, m)

    # Step 2: Scale by sqrt(d_k)
    # Prevents softmax saturation for large d_k
    scores = scores / np.sqrt(d_k)

    # Step 3: Apply mask (for decoder self-attention)
    if mask is not None:
        scores = scores + mask  # mask has -inf where attention is blocked

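    # Step 4: Softmax normalizes each row
    # Row i becomes a probability distribution over all key positions
    # Subtracting the row max avoids overflow in exp without changing the result
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)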

    # Step 5: Weighted sum of values
    # (n, m) @ (m, d_v) = (n, d_v)
    output = weights @ V

    return output, weights


# Example: 4 tokens, d_k = d_v = 8
n, d_k, d_v = 4, 8, 8
Q = np.random.randn(n, d_k)
K = np.random.randn(n, d_k)
V = np.random.randn(n, d_v)

output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {output.shape}")    # (4, 8)
print(f"Attention weights shape: {weights.shape}")  # (4, 4)
print(f"Attention weights (row 0): {weights[0]}")  # sums to 1
print(f"Sum of weights: {weights[0].sum():.4f}")    # 1.0000

Why Scale by $\sqrt{d_k}$ ?

This is the most commonly asked "why" question about the Transformer.

The problem: For large $d_k$ , the dot products $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ grow in magnitude. If $q$ and $k$ have independent components with mean 0 and variance 1:

$\text{Var}(q \cdot k) = \sum_{i=1}^{d_k} \text{Var}(q_i k_i) = \sum_{i=1}^{d_k} 1 = d_k$

So the standard deviation of the dot product is $\sqrt{d_k}$ . For $d_k = 64$ that is 8, so most dot products fall within about two standard deviations, roughly the range $[-16, 16]$ .

The consequence: Large dot products push the softmax into saturation:

$\text{softmax}([10, 1, 1, 1]) \approx [0.9996, 0.0001, 0.0001, 0.0001]$

When softmax is nearly one-hot, gradients become vanishingly small:

$\frac{\partial \text{softmax}(z)_i}{\partial z_j} = \text{softmax}(z)_i (\delta_{ij} - \text{softmax}(z)_j) \approx 0$

The fix: Dividing by $\sqrt{d_k}$ normalizes the variance of dot products back to 1, keeping them in the region where softmax gradients are well-behaved.

# Demonstration of the scaling effect
np.random.seed(42)
d_k = 64
n_pairs = 1000

dots = np.array([
    np.random.randn(d_k) @ np.random.randn(d_k)
    for _ in range(n_pairs)
])

print(f"Without scaling:")
print(f"  Mean: {dots.mean():.2f}, Std: {dots.std():.2f}")
# Mean ≈ 0, Std ≈ 8 (sqrt(64))

scaled_dots = dots / np.sqrt(d_k)
print(f"With scaling by sqrt({d_k}):")
print(f"  Mean: {scaled_dots.mean():.2f}, Std: {scaled_dots.std():.2f}")
# Mean ≈ 0, Std ≈ 1

# Effect on softmax entropy
def softmax(x):
    e_x = np.exp(x - x.max())
    return e_x / e_x.sum()

scores_unscaled = np.random.randn(10) * np.sqrt(d_k)
scores_scaled = scores_unscaled / np.sqrt(d_k)

w_unscaled = softmax(scores_unscaled)
w_scaled = softmax(scores_scaled)

entropy_unscaled = -np.sum(w_unscaled * np.log(w_unscaled + 1e-10))
entropy_scaled = -np.sum(w_scaled * np.log(w_scaled + 1e-10))

print(f"\nSoftmax entropy (unscaled): {entropy_unscaled:.4f}")  # Low (peaky)
print(f"Softmax entropy (scaled):   {entropy_scaled:.4f}")    # Higher (smoother)
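
The vanishing-gradient claim can also be checked directly on the softmax Jacobian. The sketch below is an illustration under simple assumptions: it reuses the score vector $[10, 1, 1, 1]$ from the text and divides by 8, which is $\sqrt{d_k}$ for $d_k = 64$. The function name `softmax_jacobian` is hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    """J[i, j] = s_i * (delta_ij - s_j), the derivative given in the text."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z_saturated = np.array([10.0, 1.0, 1.0, 1.0])
z_scaled = z_saturated / 8.0          # divide by sqrt(d_k) with d_k = 64

jac_sat = np.abs(softmax_jacobian(z_saturated)).max()
jac_scaled = np.abs(softmax_jacobian(z_scaled)).max()

print(f"Largest Jacobian entry (saturated): {jac_sat:.5f}")
print(f"Largest Jacobian entry (scaled):    {jac_scaled:.5f}")
```

The saturated case leaves every Jacobian entry below $10^{-3}$, while the scaled case keeps entries on the order of 0.25, which is the largest value $s(1 - s)$ can take.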

Dot-Product vs. Additive Attention

The paper discusses two types of attention:

$\text{Additive: } \text{score}(q, k) = v^T \tanh(W_1 q + W_2 k)$

$\text{Dot-product: } \text{score}(q, k) = q^T k$

Property	Additive Attention	Dot-Product Attention
Computational complexity	Involves a learned weight matrix and tanh	Simple matrix multiplication
Speed in practice	Slower (cannot use optimized BLAS)	Faster (uses optimized matmul)
Theoretical power	Can learn arbitrary compatibility	Limited to bilinear compatibility
Performance at small $d_k$	Comparable	Comparable
Performance at large $d_k$	Better (no saturation)	Worse without scaling
With scaling	N/A	Comparable to additive

The authors chose dot-product attention because it is "much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code."
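
To make the speed and space argument concrete, here is a minimal sketch of both score functions side by side. The hidden width `d_hidden` and the random weights `W1`, `W2`, `v` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d_k, d_hidden = 3, 5, 8, 16     # d_hidden is a hypothetical hidden width
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((m, d_k))

def dot_product_scores(Q, K):
    """score(q, k) = q^T k / sqrt(d_k): a single matrix multiplication."""
    return Q @ K.T / np.sqrt(Q.shape[-1])

def additive_scores(Q, K, W1, W2, v):
    """score(q, k) = v^T tanh(W1 q + W2 k), evaluated for every (query, key) pair."""
    q_proj = Q @ W1.T                                          # (n, d_hidden)
    k_proj = K @ W2.T                                          # (m, d_hidden)
    hidden = np.tanh(q_proj[:, None, :] + k_proj[None, :, :])  # (n, m, d_hidden)
    return hidden @ v                                          # (n, m)

W1 = rng.standard_normal((d_hidden, d_k)) * 0.1
W2 = rng.standard_normal((d_hidden, d_k)) * 0.1
v = rng.standard_normal(d_hidden)

dot_scores = dot_product_scores(Q, K)
add_scores = additive_scores(Q, K, W1, W2, v)
print(dot_scores.shape, add_scores.shape)   # (3, 5) (3, 5)
```

Both produce an $(n, m)$ score matrix, but this additive version materializes an $(n, m, d_{hidden})$ intermediate tensor before reducing it, while the dot-product version is one matmul.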

Part 4 - Multi-Head Attention

Why Multiple Heads?

A single attention function computes a single set of attention weights. But a token might need to attend to different things for different reasons:

For syntactic purposes, a verb might need to attend to its subject
For semantic purposes, the same verb might need to attend to its object
For positional purposes, it might need to attend to nearby tokens

Multi-head attention allows the model to jointly attend to information from different representation subspaces:

$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$

where each head is:

$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$

Dimensions

In the base Transformer model ( $d_{\text{model}} = 512$ , $h = 8$ heads):

Each head projects to $d_k = d_v = d_{\text{model}} / h = 512 / 8 = 64$
Each head computes attention independently over 64-dimensional subspaces
The $h = 8$ head outputs are concatenated to form a $512$ -dimensional vector
A final projection $W^O \in \mathbb{R}^{512 \times 512}$ mixes the head outputs

class MultiHeadAttention:
    """Simplified multi-head attention for understanding."""

    def __init__(self, d_model=512, h=8):
        self.h = h
        self.d_k = d_model // h  # 64

        # Learned projection matrices
        # In practice, these are combined into one matrix for efficiency
        self.W_Q = [np.random.randn(d_model, self.d_k) * 0.02 for _ in range(h)]
        self.W_K = [np.random.randn(d_model, self.d_k) * 0.02 for _ in range(h)]
        self.W_V = [np.random.randn(d_model, self.d_k) * 0.02 for _ in range(h)]
        self.W_O = np.random.randn(d_model, d_model) * 0.02

    def forward(self, Q, K, V):
        """
        Q, K, V: (n, d_model)
        Output: (n, d_model)
        """
        heads = []
        for i in range(self.h):
            # Project to subspace
            Q_i = Q @ self.W_Q[i]  # (n, d_k)
            K_i = K @ self.W_K[i]  # (n, d_k)
            V_i = V @ self.W_V[i]  # (n, d_k)

            # Compute attention in this subspace
            head_i, _ = scaled_dot_product_attention(Q_i, K_i, V_i)
            heads.append(head_i)  # (n, d_k)

        # Concatenate all heads
        concat = np.concatenate(heads, axis=-1)  # (n, d_model)

        # Final projection
        output = concat @ self.W_O  # (n, d_model)
        return output

# The total computation cost is the same as single-head attention
# with full d_model, because we split across h heads with d_k = d_model/h
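
The comment above about combining the per-head matrices can be sketched as follows. This is one possible fused layout, not the paper's code: it assumes self-attention only, and that columns `[i*d_k:(i+1)*d_k]` of each fused matrix play the role of head $i$'s projection. The function name is hypothetical.

```python
import numpy as np

def softmax_rows(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention_fused(X, W_Q, W_K, W_V, W_O, h):
    """Self-attention with all heads computed at once.

    X: (n, d_model); W_Q, W_K, W_V, W_O: (d_model, d_model).
    """
    n, d_model = X.shape
    d_k = d_model // h

    def split_heads(M):
        # (n, d_model) -> (h, n, d_k): head i gets columns [i*d_k:(i+1)*d_k]
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)        # (h, n, n)
    heads = softmax_rows(scores) @ V                        # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # (n, d_model)
    return concat @ W_O

rng = np.random.default_rng(0)
n, d_model, h = 5, 16, 4      # small illustrative sizes, not the paper's 512 and 8
X = rng.standard_normal((n, d_model))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))

out = multi_head_attention_fused(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)  # (5, 16)
```

The three projections become three large matmuls instead of $3h$ small ones, and the per-head attention runs as a batched matmul over the leading head axis.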

Why Not One Big Attention Head?

The paper's ablation (Table 3) shows that 8 heads of 64 dimensions outperforms 1 head of 512 dimensions. The authors' explanation: "Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this."

Intuitively, a single head must compute ONE set of attention weights that balances all the different reasons one token might attend to another. Multiple heads let each head specialize.

60-Second Answer

"Multi-head attention splits the model's representation into $h$ parallel subspaces, computes attention independently in each, then concatenates and projects. This allows different heads to capture different types of relationships - syntactic, semantic, positional - simultaneously. The total computation cost is the same as single-head attention because we reduce $d_k$ proportionally: 8 heads of dimension 64 requires the same FLOPs as 1 head of dimension 512."

Part 5 - Positional Encoding

The Problem

Self-attention is permutation-equivariant: if you shuffle the input tokens, the output tokens are shuffled in the same way. This means the Transformer has no notion of word order. "The cat sat on the mat" and "mat the on sat cat the" would produce the same attention patterns (up to permutation).
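
Permutation equivariance is quick to verify numerically. The sketch below assumes a single head with identity projections ($Q = K = V = X$) and no positional encoding, which is enough to show the property.

```python
import numpy as np

def self_attention(X):
    """Single-head self-attention with identity projections (Q = K = V = X)."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 8))       # 6 tokens, no positional information
perm = rng.permutation(6)

out_then_shuffle = self_attention(X)[perm]
shuffle_then_out = self_attention(X[perm])

print(np.allclose(out_then_shuffle, shuffle_then_out))  # True
```

Shuffling the tokens and then attending gives the same rows as attending and then shuffling, so nothing in the output depends on where a token sat in the sequence.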

The Solution

The authors add positional encodings to the input embeddings:

$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$

$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$

Where $pos$ is the position (0, 1, 2, ...) and $i$ is the dimension index (0, 1, ..., $d_{\text{model}}/2 - 1$ ).

Why Sinusoidal?

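The sinusoidal formulation has three useful properties. The paper itself highlights the second one, along with the hope that the encoding may extrapolate to sequences longer than those seen in training: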

Deterministic: No learned parameters needed. Works for any sequence length.
Relative position encoding: For any fixed offset $k$ , $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$ . Specifically:

$PE_{pos+k} = M_k \cdot PE_{pos}$

where $M_k$ is a rotation matrix that depends only on $k$ , not on $pos$ . This means the model can learn to attend to relative positions.

Bounded values: Sine and cosine are bounded in $[-1, 1]$ , so they do not dominate the embedding values.

def positional_encoding(max_len, d_model):
    """Generate sinusoidal positional encoding."""
    pe = np.zeros((max_len, d_model))
    position = np.arange(max_len).reshape(-1, 1)  # (max_len, 1)
    div_term = 10000 ** (np.arange(0, d_model, 2) / d_model)  # (d_model/2,)

    pe[:, 0::2] = np.sin(position / div_term)  # Even dimensions
    pe[:, 1::2] = np.cos(position / div_term)  # Odd dimensions

    return pe

pe = positional_encoding(100, 512)
print(f"PE shape: {pe.shape}")  # (100, 512)
print(f"PE[0, :4]: {pe[0, :4]}")  # sin(0), cos(0), sin(0), cos(0) = 0, 1, 0, 1
print(f"PE[1, :4]: {pe[1, :4]}")  # sin(1), cos(1), sin(1/10000^(2/512)), cos(1/10000^(2/512))

# Key property: different dimensions oscillate at different frequencies
# Low dimensions: high frequency (changes rapidly with position)
# High dimensions: low frequency (changes slowly with position)
# This creates a unique "fingerprint" for each position
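
The relative-position property $PE_{pos+k} = M_k \cdot PE_{pos}$ can be checked with a small sketch. It assumes the interleaved sin/cos layout used in the code above; `pe_vector` and `offset_matrix` are hypothetical helper names.

```python
import numpy as np

def pe_vector(pos, d_model):
    """Sinusoidal encoding for one position: [sin, cos, sin, cos, ...]."""
    freqs = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)
    pe = np.empty(d_model)
    pe[0::2] = np.sin(pos * freqs)
    pe[1::2] = np.cos(pos * freqs)
    return pe

def offset_matrix(k, d_model):
    """Block-diagonal M_k: one 2x2 rotation per (sin, cos) pair, depending only on k."""
    freqs = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)
    M = np.zeros((d_model, d_model))
    for i, w in enumerate(freqs):
        c, s = np.cos(k * w), np.sin(k * w)
        # sin((p+k)w) =  cos(kw) sin(pw) + sin(kw) cos(pw)
        # cos((p+k)w) = -sin(kw) sin(pw) + cos(kw) cos(pw)
        M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return M

d_model, k = 16, 5
M_k = offset_matrix(k, d_model)
for pos in (0, 3, 40):
    shifted = M_k @ pe_vector(pos, d_model)
    print(pos, np.allclose(shifted, pe_vector(pos + k, d_model)))  # True for each pos
```

The same matrix `M_k` maps every position to the position $k$ steps later, which follows from the angle-addition identities noted in the comments.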

Modern Alternatives

While the original paper used sinusoidal positional encodings, modern Transformers have largely moved to better alternatives:

Encoding	Description	Used By
Sinusoidal (original)	Fixed sin/cos frequencies	Original Transformer
Learned absolute	Trainable embedding per position	BERT, GPT-1/2
RoPE (Rotary)	Rotation-based relative encoding	LLaMA, GPT-NeoX, modern LLMs
ALiBi	Linear bias on attention scores	BLOOM, MPT

Company Variation

If asked about positional encoding in an interview for an LLM team, mention RoPE - it is the most common choice in current open LLMs. Explain why it is often preferred: it encodes relative positions directly in the attention computation, tends to extend to longer contexts more gracefully (usually with additional scaling tricks), and its attention scores tend to decay with distance.
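
If you are asked to back this up, a simplified RoPE sketch helps. The version below is an illustration, not a reference implementation: it rotates consecutive pairs of dimensions (real implementations may pair dimensions differently), handles a single vector at a time, and uses the hypothetical function name `rope`.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of x by angles pos * theta_i (simplified RoPE sketch)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

score_a = rope(q, 3) @ rope(k, 1)       # positions (3, 1), offset 2
score_b = rope(q, 50) @ rope(k, 48)     # positions (50, 48), same offset 2

print(np.isclose(score_a, score_b))     # True: the score depends only on the offset
```

Because the query and key are rotated by their own positions, their dot product depends only on the difference between those positions, which is what "encodes relative positions directly in the attention computation" means.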

Part 6 - Feed-Forward Networks and Residual Connections

Position-Wise Feed-Forward Networks

Each layer contains a two-layer MLP applied independently to each position:

$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$

Input and output dimension: $d_{\text{model}} = 512$
Inner dimension: $d_{ff} = 2048$ (4x expansion)
Activation: ReLU (modern Transformers use GeLU or SwiGLU)

The FFN acts as a "memory" or "knowledge store" in the network. Later interpretability research suggests that factual knowledge tends to be stored in the FFN layers, while attention layers handle relational reasoning.
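
The FFN formula translates almost line for line into NumPy. The sketch below uses the base model's dimensions; the random initialization scale of 0.02 is an assumption for illustration.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each row of x independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff, n = 512, 2048, 4
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

x = rng.standard_normal((n, d_model))
y = ffn(x, W1, b1, W2, b2)
n_params = W1.size + b1.size + W2.size + b2.size

print(y.shape, n_params)  # (4, 512) 2099712
```

"Position-wise" means that running the function on a single row gives the same result as that row of the batched output: no information moves between positions inside the FFN.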

Residual Connections

Every sublayer has a residual (skip) connection:

$\text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))$

Residual connections serve two purposes:

Gradient flow: Provide a direct path for gradients, preventing vanishing gradients in deep networks
Easy identity: If a layer is unnecessary, it can learn $\text{Sublayer}(x) \approx 0$ , effectively becoming an identity

Layer Normalization

The original Transformer uses post-norm: LayerNorm is applied after the residual addition. Modern Transformers (GPT-2 onward) use pre-norm: LayerNorm before the sublayer.

$\text{Post-norm (original):} \quad \text{LayerNorm}(x + \text{Sublayer}(x))$

$\text{Pre-norm (modern):} \quad x + \text{Sublayer}(\text{LayerNorm}(x))$

Pre-norm is more stable during training because the residual path remains unnormalized, allowing gradients to flow more freely.
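
The two orderings differ by one line of code. The sketch below is a simplification: `layer_norm` omits the learned gain and bias, and `toy_sublayer` is a hypothetical stand-in for attention or the FFN.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (learned gain/bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    """Original Transformer: normalize after the residual addition."""
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    """Pre-norm variant: normalize the sublayer input, leave the residual path untouched."""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) * 0.1
toy_sublayer = lambda x: x @ W        # hypothetical stand-in for attention or the FFN

x = rng.standard_normal((3, 8)) * 5.0
post = post_norm_block(x, toy_sublayer)
pre = pre_norm_block(x, toy_sublayer)

print(post.std(axis=-1))   # close to 1 for every row
print(pre.std(axis=-1))    # keeps the scale of x
```

In the post-norm block the residual stream itself passes through LayerNorm at every layer, while in the pre-norm block `x` reaches the output unchanged, which is the "unnormalized residual path" described above.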

Part 7 - Training Details

The Training Setup

Parameter	Value
Dataset	WMT 2014 English-German (4.5M sentence pairs), English-French (36M)
Tokens per batch	~25,000 source + 25,000 target tokens
Optimizer	Adam ( $\beta_1 = 0.9$ , $\beta_2 = 0.98$ , $\epsilon = 10^{-9}$ )
Learning rate schedule	Warmup + inverse square root decay
Warmup steps	4,000
Regularization	Dropout ( $P_{drop} = 0.1$ ) on each sublayer and embeddings
Label smoothing	$\epsilon_{ls} = 0.1$
Training time	12 hours on 8 P100 GPUs (base model, 100K steps); 3.5 days (big model, 300K steps)

The Learning Rate Schedule

The Transformer uses a custom learning rate schedule:

$lr = d_{\text{model}}^{-0.5} \cdot \min(\text{step}^{-0.5}, \text{step} \cdot \text{warmup}^{-1.5})$

This increases the learning rate linearly during warmup (first 4,000 steps), then decreases it proportionally to the inverse square root of the step number.

def transformer_lr_schedule(step, d_model=512, warmup_steps=4000):
    """The original Transformer learning rate schedule."""
    step = max(step, 1)  # Avoid division by zero
    return d_model ** (-0.5) * min(step ** (-0.5), step * warmup_steps ** (-1.5))

# Visualize
steps = range(1, 100001)
lrs = [transformer_lr_schedule(s) for s in steps]

print(f"LR at step 1:     {transformer_lr_schedule(1):.6f}")
print(f"LR at step 4000:  {transformer_lr_schedule(4000):.6f}")  # Peak
print(f"LR at step 10000: {transformer_lr_schedule(10000):.6f}")
print(f"LR at step 50000: {transformer_lr_schedule(50000):.6f}")

Why warmup? Early in training, parameters are randomly initialized. Large learning rates cause the Adam optimizer's adaptive estimates to be based on very few samples, leading to instability. Warmup allows these estimates to stabilize before using higher learning rates.

Label Smoothing

Instead of training with hard targets (one-hot vectors), the paper uses label smoothing with $\epsilon = 0.1$ :

$q(k) = (1 - \epsilon) \cdot \delta_{k,y} + \epsilon / K$

Where $y$ is the correct class and $K$ is the vocabulary size. This distributes 10% of the probability mass uniformly across all tokens.

Effect: Hurts perplexity (the model is less confident on training data) but improves BLEU score (the model generalizes better). The authors report that label smoothing improves BLEU by about 0.5 points.
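
A toy example shows why perplexity gets worse under label smoothing. The sketch assumes a 5-token vocabulary and a deliberately overconfident logit vector; both are illustrative, and the function names are hypothetical.

```python
import numpy as np

def smooth_labels(y, K, eps=0.1):
    """q(k) = (1 - eps) * one_hot(y) + eps / K."""
    q = np.full(K, eps / K)
    q[y] += 1.0 - eps
    return q

def cross_entropy(q, logits):
    """Cross-entropy between target distribution q and softmax(logits)."""
    z = logits - logits.max()
    log_p = z - np.log(np.exp(z).sum())
    return -(q * log_p).sum()

K, y = 5, 2                        # toy vocabulary size and correct class
q = smooth_labels(y, K)
print(q)                           # [0.02 0.02 0.92 0.02 0.02]

confident = np.array([0.0, 0.0, 20.0, 0.0, 0.0])   # nearly one-hot prediction
hard_loss = cross_entropy(np.eye(K)[y], confident)
smooth_loss = cross_entropy(q, confident)

print(f"hard targets:     {hard_loss:.4f}")
print(f"smoothed targets: {smooth_loss:.4f}")
```

With hard targets the overconfident prediction is almost free, but with smoothed targets it pays for the 8% of mass it refuses to spread over the other tokens. The model is pushed away from extreme logits.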

Part 8 - Ablation Study

The paper's ablation study (Table 3) is one of the most informative in ML. Key findings:

Number of Attention Heads

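Heads ( $h$ )	$d_k$	$d_v$	EN-DE BLEU (dev)
1	512	512	24.9
4	128	128	25.5
8	64	64	25.8
16	32	32	25.8
32	16	16	25.4

Insight: Quality peaks at 8 to 16 heads, which both reach 25.8 BLEU on the newstest2013 development set. A single head is 0.9 BLEU worse than the best setting, which limits representational diversity. Too many heads (32) also hurts, because each head becomes too small to compute meaningful attention.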

Model Dimensions

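Bigger models did better. In Table 3, increasing $d_{\text{model}}$ from 512 to 1024 raised dev BLEU from 25.8 to 26.0, and increasing $d_{ff}$ from 2048 to 4096 raised it to 26.2, each at the cost of more parameters and computation. The big configuration, which changes both along with other settings, reached 26.4.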

Attention Type

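The paper does not include an ablation that swaps dot-product attention for additive attention. The closest result is row (B) of Table 3, which shrinks the key dimension $d_k$ and finds that quality drops. The authors read this as a sign that determining compatibility is not easy and that a more sophisticated compatibility function than the dot product may be beneficial. The dot-product choice itself rests on the speed and memory argument in Section 3.2.1.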

Positional Encoding

Learned positional embeddings performed nearly identically to sinusoidal encodings on the translation task. The authors chose sinusoidal because it can potentially generalize to longer sequences.

Part 9 - Results

Main Results

Model	EN-DE BLEU	EN-FR BLEU	Training Cost (FLOPs)
Previous SOTA (ConvS2S ensemble)	26.36	41.29	-
Transformer (base)	27.3	38.1	$3.3 \times 10^{18}$
Transformer (big)	28.4	41.8	$2.3 \times 10^{19}$

Key takeaways:

The Transformer big model exceeded the previous SOTA (which was an ensemble of models) with a single model
Training cost was a fraction of comparable RNN-based models
The base model trained in 12 hours on 8 GPUs; the big model trained in 3.5 days

English Constituency Parsing

The paper also showed the Transformer generalizes beyond translation to English constituency parsing, achieving competitive results. This was an important early signal that the architecture was not task-specific.

Part 10 - Limitations and Modern Improvements

Limitation 1: Quadratic Attention Complexity

Self-attention is $O(n^2)$ in both time and memory, where $n$ is the sequence length. For $n = 4096$ , the attention matrix has about 16.8 million entries per head per layer.

Modern solutions:

Flash Attention (2022): Same $O(n^2)$ computation but $O(n)$ memory through tiling and recomputation
Sparse attention (Sparse Transformer, Longformer, BigBird): $O(n\sqrt{n})$ or $O(n)$ by attending to only a subset of positions
Linear attention (Performer): $O(n)$ through kernel approximations
State space models (Mamba): $O(n)$ through selective state spaces
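
The quadratic growth is easier to feel with numbers. The arithmetic below is a rough sketch under stated assumptions: batch size 1, 2-byte (fp16) entries, the base model's 8 heads and 6 layers, and every head's full attention matrix held in memory at once. The function names are hypothetical.

```python
def attention_matrix_entries(n):
    """Entries in one (n x n) attention matrix, for a single head in a single layer."""
    return n * n

def attention_matrix_bytes(n, heads=8, layers=6, bytes_per_entry=2):
    """Bytes needed if every head in every layer stored its full matrix at once.

    Assumptions: batch size 1 and 2-byte (fp16) entries.
    """
    return attention_matrix_entries(n) * heads * layers * bytes_per_entry

for n in (512, 4096, 100_000):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"n={n:>7,}: {attention_matrix_entries(n):>14,} entries per head, {gib:,.2f} GiB total")
```

Going from 512 to 4,096 tokens multiplies the sequence length by 8 but the attention matrix by 64, which is why the solutions listed above either avoid materializing the matrix or avoid computing all of it.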

Limitation 2: Fixed Context Window

The original Transformer is trained with a bounded sequence length, and it has no mechanism designed for processing inputs longer than the training context.

Modern solutions:

RoPE + NTK-aware scaling: Extends context through interpolation
Sliding window attention: Process longer sequences with local windows
Ring attention: Distribute long sequences across multiple devices

Limitation 3: No Explicit Memory

The Transformer processes each input independently - there is no mechanism for maintaining state across inputs (unlike RNNs).

Modern solutions:

Retrieval augmentation (RAG): External memory via retrieval
KV caching: Store and reuse key-value pairs from previous forward passes
Memory tokens: Dedicated tokens that persist across calls

Limitation 4: Position Encoding Limitations

Sinusoidal and learned absolute position encodings do not generalize well to sequence lengths unseen during training.

Modern solutions:

RoPE: Encodes relative position in the attention computation itself
ALiBi: Adds linear bias based on distance, naturally decaying attention

Common Trap

When discussing Transformer limitations, do not just list them - propose solutions and cite the follow-up work. This shows you have kept up with the field, not just read the original paper.

Part 11 - The Three Types of Attention in the Transformer

Understanding the three different uses of attention is critical:

Type	Location	Queries From	Keys/Values From	Mask?
Encoder self-attention	Encoder layers	Encoder input	Encoder input	No
Decoder self-attention	Decoder layers	Decoder input	Decoder input	Yes (causal)
Cross-attention	Decoder layers	Decoder state	Encoder output	No
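
The three rows of this table differ only in where the queries, keys, and values come from and whether a mask is applied. The sketch below makes that explicit with one shared function; it omits the learned projections and uses 7 source and 4 target positions as illustrative sizes.

```python
import numpy as np

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention with an optional additive mask."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = scores + mask
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
d = 8
enc = rng.standard_normal((7, d))     # encoder states: 7 source positions
dec = rng.standard_normal((4, d))     # decoder states: 4 target positions

_, w_enc = attention(enc, enc, enc)                    # encoder self-attention
causal = np.triu(np.full((4, 4), -np.inf), k=1)
_, w_dec = attention(dec, dec, dec, mask=causal)       # decoder masked self-attention
out_cross, w_cross = attention(dec, enc, enc)          # cross-attention

print(w_enc.shape, w_dec.shape, w_cross.shape)  # (7, 7) (4, 4) (4, 7)
```

Cross-attention is the only case where the weight matrix is not square: each of the 4 target positions gets a distribution over the 7 source positions.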

[Figure: Attention types - encoder self-attention, decoder masked self-attention, cross-attention]

Practice Problems

Problem 1: Dimensional Analysis

Given a Transformer with $d_{\text{model}} = 768$ , $h = 12$ heads, $d_{ff} = 3072$ , and $N = 12$ layers, calculate: (a) the dimension per head $d_k$ , (b) total parameters in one encoder layer (approximate), (c) total parameters in the full encoder.

Hint

(a) $d_k = 768 / 12 = 64$ . (b) One encoder layer has: multi-head attention ( $4 \times 768 \times 768 = 2.36M$ for $W_Q, W_K, W_V, W_O$ ), FFN ( $768 \times 3072 + 3072 \times 768 = 4.72M$ ), LayerNorm ( $2 \times 2 \times 768 = 3K$ ). Total: approximately $7.1M$ . (c) $12 \times 7.1M \approx 85M$ , plus embeddings.
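
The hint's arithmetic can be reproduced with a short helper. This is a sketch under the same simplification as the hint: biases are ignored unless `include_biases=True`, and embeddings are not counted. The function name is hypothetical.

```python
def encoder_layer_params(d_model, d_ff, include_biases=False):
    """Approximate parameter count for one encoder layer."""
    attn = 4 * d_model * d_model          # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * d_ff              # W_1 and W_2
    layer_norm = 2 * 2 * d_model          # gain and bias for two LayerNorms
    total = attn + ffn + layer_norm
    if include_biases:
        total += 4 * d_model + d_ff + d_model   # attention and FFN bias vectors
    return total

per_layer = encoder_layer_params(768, 3072)
print(per_layer)         # 7080960
print(12 * per_layer)    # 84971520
```

The FFN accounts for about two thirds of each layer's parameters, which is a useful detail to mention when an interviewer asks where the weights of a Transformer actually live.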

Problem 2: Attention Visualization

If a model has 8 attention heads and the input sentence is "The cat sat on the mat", what might each head learn to attend to?

Hint

Different heads tend to learn different patterns: one might attend to the previous token (local), one to syntactically related tokens ("sat" → "cat"), one to tokens in a related role ("cat" → "mat"), one to structural tokens such as "the", and so on. Attention visualization studies report patterns like these, although not every head is cleanly interpretable.

Problem 3: Masking in the Decoder

Explain why the decoder needs a causal mask in its self-attention. What would happen without it during training? Would it matter during inference?

Hint

During training, the decoder processes the entire target sequence at once (teacher forcing). Without the mask, position $t$ could "see" the ground truth at positions $t+1, t+2, ...$ , making the task trivially easy (just copy). During inference, future positions do not exist yet (generation is sequential), so masking is implicit. The mask is needed specifically to make training match the inference-time conditions.

Problem 4: Scaling Factor Derivation

Prove that if $q, k \in \mathbb{R}^{d_k}$ have independent components with mean 0 and variance 1, then $\text{Var}(q^T k) = d_k$ .

Hint

$q^T k = \sum_{i=1}^{d_k} q_i k_i$ . Since $q_i$ and $k_i$ are independent with mean 0 and variance 1, $\mathbb{E}[q_i k_i] = 0$ and $\text{Var}(q_i k_i) = \mathbb{E}[q_i^2 k_i^2] - (\mathbb{E}[q_i k_i])^2 = \mathbb{E}[q_i^2]\mathbb{E}[k_i^2] - 0 = 1 \cdot 1 = 1$ . Since all terms are independent, $\text{Var}(\sum q_i k_i) = \sum \text{Var}(q_i k_i) = d_k$ .

Problem 5: Architecture Comparison

If you were designing a model for a task where the input sequence is 100K tokens long, would you use the original Transformer? Why or why not? What modifications would you make?

Hint

No - $O(n^2)$ attention with $n = 100K$ means $10^{10}$ entries in the attention matrix per head per layer. You would need: sparse attention (attend to local windows + global tokens), or linear attention, or a hierarchical approach (chunk the input and attend within/across chunks), or a state space model like Mamba that processes sequences in $O(n)$ .

Interview Cheat Sheet

Question	Key Points
"Why was the Transformer proposed?"	Sequential bottleneck of RNNs. Cannot parallelize training.
"Draw the architecture"	Encoder-decoder, 6 layers each. Self-attention + FFN + residual + LayerNorm. Decoder adds causal mask and cross-attention.
"Write the attention equation"	$\text{softmax}(QK^T / \sqrt{d_k})V$ . Explain Q, K, V, softmax, scaling.
"Why divide by $\sqrt{d_k}$ ?"	Dot products grow with $d_k$ , causing softmax saturation and vanishing gradients.
"Why multi-head?"	Different heads capture different relationship types. 8 heads of 64 dims beats 1 head of 512 dims (ablation).
"How are positions encoded?"	Sinusoidal (fixed sin/cos at different frequencies). Modern: RoPE.
"What are the limitations?"	$O(n^2)$ complexity, fixed context, position encoding does not generalize.
"How does this relate to BERT/GPT?"	BERT uses the encoder. GPT uses the decoder. T5 uses both.
"Encoder vs decoder difference?"	Decoder has causal masking + cross-attention.
"Training details?"	Adam with warmup, dropout 0.1, label smoothing 0.1, 12 hours (base) or 3.5 days (big) on 8 P100 GPUs.

Spaced Repetition Checkpoints

Day 0 (Today)

Understand the motivation: why replace RNNs?
Memorize the architecture diagram
Derive the attention equation and explain each term

Day 3

Draw the full architecture from memory on paper
Explain the scaling factor derivation
Explain multi-head attention with dimensions

Day 7

Practice a 10-minute presentation of this paper
Recite key ablation results (8 heads optimal, etc.)
Explain all three types of attention

Day 14

Mock interview: answer all 10 cheat sheet questions
Connect to BERT, GPT, and modern architectures
Discuss limitations with proposed solutions

Day 21

Full paper discussion interview simulation (20 minutes)
Handle follow-up questions about scaling, efficiency, positional encoding
Compare to at least two alternative architectures

Next Steps

Now that you have mastered the Transformer architecture, move to Chapter 4: BERT to understand how the Transformer encoder was adapted for bidirectional pre-training - one of the most impactful applications of the architecture you just learned.
