
Attention Is All You Need - The Paper That Changed Everything

Reading time: ~45 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Engineer, Data Scientist

The Real Interview Moment

You are in an OpenAI research engineer interview. The interviewer draws a blank rectangle on the whiteboard and says: "Draw me the Transformer architecture. As you draw, explain every component - what it does, why it is there, and what happens if you remove it." She pauses, then adds: "I want the math for self-attention, and I want to know why you divide by $\sqrt{d_k}$."

This is the most commonly asked paper discussion in ML interviews, period. The Transformer is not just a paper - it is the foundation of modern AI. Every large language model, from BERT to GPT-4 to Claude, builds on this architecture. If there is one paper you must know cold, this is it.

This chapter gives you a complete, interview-ready understanding: the motivation, the architecture (component by component), the mathematics (with intuition for every design choice), the training details, the ablation results, and the paper's lasting impact.

What You Will Master

  • Explain why the Transformer was proposed (limitations of RNNs)
  • Draw the complete Transformer architecture from memory
  • Derive scaled dot-product attention from first principles
  • Explain multi-head attention and why it outperforms single-head
  • Describe positional encoding and its mathematical properties
  • Discuss the training setup (optimizer, regularization, label smoothing)
  • Cite specific ablation results from the paper
  • Identify limitations and connect to modern improvements

Self-Assessment: Where Are You Now?

| Skill | 1 - Cannot | 2 - Vaguely | 3 - Can Explain | 4 - Can Derive | 5 - Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Explain why RNNs were replaced | | | | | | ___ |
| Draw the full Transformer architecture | | | | | | ___ |
| Derive the attention equation | | | | | | ___ |
| Explain the scaling factor $\sqrt{d_k}$ | | | | | | ___ |
| Explain multi-head attention | | | | | | ___ |
| Describe positional encoding | | | | | | ___ |
| Explain residual connections + LayerNorm | | | | | | ___ |
| Cite specific results and ablations | | | | | | ___ |
| List 3+ limitations | | | | | | ___ |
| Connect to modern architectures (GPT, BERT) | | | | | | ___ |

Target: All 4s and 5s before your interview.

Part 1 - The Problem: Why Replace RNNs?

The Sequential Bottleneck

Before 2017, the state of the art for sequence-to-sequence tasks (machine translation, summarization, etc.) was encoder-decoder RNNs with attention (Bahdanau et al., 2014; Luong et al., 2015). These models had a fundamental limitation:

RNNs process sequences one token at a time.

To compute the hidden state at position $t$, you need the hidden state at position $t-1$:

$$h_t = f(h_{t-1}, x_t)$$

This sequential dependency has three consequences:

  1. No parallelization. You cannot compute $h_5$ until you have computed $h_1$ through $h_4$. On modern GPUs with thousands of cores, most of the hardware sits idle.

  2. Long-range dependency problems. Information from early tokens must survive through every intermediate hidden state to influence later tokens. LSTMs and GRUs help, but gradients still degrade over hundreds of tokens.

  3. Training speed. Sequential processing means training time scales linearly with sequence length, even on parallel hardware.
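The sequential dependency is easy to see in code. The sketch below is a minimal vanilla RNN forward pass, assuming a tanh cell; the function name `rnn_forward`, the weight names `W_h` and `W_x`, and the sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

def rnn_forward(x, W_h, W_x, h0):
    """Vanilla RNN forward pass: h_t = tanh(h_{t-1} W_h + x_t W_x)."""
    h = h0
    states = []
    for x_t in x:
        # Step t needs the result of step t-1, so this loop
        # cannot be parallelized across positions.
        h = np.tanh(h @ W_h + x_t @ W_x)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
n, d_in, d_h = 5, 3, 4  # illustrative sizes
x = rng.standard_normal((n, d_in))
W_h = rng.standard_normal((d_h, d_h)) * 0.1
W_x = rng.standard_normal((d_in, d_h)) * 0.1

states = rnn_forward(x, W_h, W_x, np.zeros(d_h))
print(states.shape)  # (5, 4)
```

The Python loop over positions is the bottleneck: no amount of hardware removes the dependency of each step on the previous one.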

RNN vs Transformer: Sequential vs Parallel Processing

60-Second Answer

"The Transformer was motivated by the fundamental limitation of RNNs: sequential processing. In an RNN, you cannot compute the representation for position $t$ until you have processed all positions before it, which prevents parallelization and limits practical sequence lengths. The Transformer replaces recurrence entirely with self-attention, which computes all pairwise interactions in parallel. This trades $O(n)$ sequential operations for $O(1)$ - at the cost of $O(n^2)$ total computation and memory - which is a favorable tradeoff for sequences under a few thousand tokens on modern parallel hardware."

What Existed Before: Attention with RNNs

It is crucial to understand that attention itself was not new. Bahdanau et al. (2014) introduced additive attention for machine translation, and Luong et al. (2015) proposed multiplicative (dot-product) attention. Both used attention as a mechanism on top of RNNs.

The Transformer's insight was not "let us use attention." It was "let us use ONLY attention" - removing recurrence entirely.

| Feature | RNN + Attention | Transformer |
|---|---|---|
| Sequence processing | Sequential (one token at a time) | Parallel (all tokens simultaneously) |
| Long-range dependencies | Through hidden state chain | Direct pairwise attention |
| Training parallelization | Limited by sequential nature | Fully parallelizable |
| Complexity per layer | $O(n \cdot d^2)$ | $O(n^2 \cdot d)$ |
| Sequential operations | $O(n)$ | $O(1)$ |
| Maximum path length | $O(n)$ | $O(1)$ |

Common Trap

Do not say "The Transformer invented attention." Attention mechanisms existed since 2014. The Transformer's contribution was showing that attention alone - without any recurrence or convolution - is sufficient for state-of-the-art sequence modeling. This distinction matters in interviews.

Part 2 - The Architecture

Overall Structure

The Transformer uses an encoder-decoder architecture:

  • Encoder: Processes the input sequence (e.g., the source language sentence). Consists of $N = 6$ identical layers.
  • Decoder: Generates the output sequence (e.g., the target language sentence). Also $N = 6$ identical layers, with an additional cross-attention sublayer.

Transformer Architecture: Encoder-Decoder

Each Encoder Layer

Every encoder layer has two sublayers:

  1. Multi-head self-attention: Every position attends to all positions in the previous layer's output.
  2. Position-wise feed-forward network: A two-layer MLP applied independently to each position.

Around each sublayer, there is:

  • A residual connection: $\text{output} = \text{sublayer}(x) + x$
  • Layer normalization: Applied after the addition

The formula for each sublayer is:

$$\text{LayerNorm}(x + \text{Sublayer}(x))$$

Each Decoder Layer

The decoder has three sublayers:

  1. Masked multi-head self-attention: Like the encoder, but with a mask that prevents position $i$ from attending to positions $j > i$. This ensures that predictions for position $i$ depend only on known outputs at positions less than $i$.
  2. Multi-head cross-attention: Queries come from the previous decoder sublayer, but keys and values come from the encoder output. This is how the decoder "looks at" the input sequence.
  3. Position-wise feed-forward network: Same as the encoder.
Instant Rejection

If asked "What is the difference between the encoder and decoder?", do not say "The decoder has masking." The decoder has TWO additional features compared to the encoder: (1) causal masking in the self-attention to prevent looking ahead, and (2) cross-attention over the encoder output. Missing either one shows shallow understanding.
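The causal mask in the decoder's self-attention can be sketched in a few lines of NumPy. This is a minimal illustration, assuming an additive mask (0 where attention is allowed, negative infinity where it is blocked); the helper names `causal_mask` and `masked_softmax` are hypothetical.

```python
import numpy as np

def causal_mask(n):
    """Additive mask: 0 where j <= i, -inf where j > i."""
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_softmax(scores, mask):
    """Row-wise softmax after adding the mask to the scores."""
    z = scores + mask
    z = z - z.max(axis=-1, keepdims=True)  # diagonal is never masked, so max is finite
    w = np.exp(z)                          # exp(-inf) = 0 for blocked positions
    return w / w.sum(axis=-1, keepdims=True)

n = 4
mask = causal_mask(n)
# With all-zero scores, each row spreads weight evenly over allowed positions
weights = masked_softmax(np.zeros((n, n)), mask)
print(weights)
```

Row $i$ of `weights` is nonzero only in columns $0$ through $i$, which is exactly the "no looking ahead" constraint.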

Part 3 - Scaled Dot-Product Attention

The Core Equation

The heart of the Transformer is the scaled dot-product attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:

  • $Q \in \mathbb{R}^{n \times d_k}$ - Queries: what each position is looking for
  • $K \in \mathbb{R}^{m \times d_k}$ - Keys: what each position offers to be matched against
  • $V \in \mathbb{R}^{m \times d_v}$ - Values: the actual content to be retrieved
  • $n$ is the number of query positions, $m$ is the number of key/value positions
  • In self-attention: $n = m$ (every position attends to every other position)

Step-by-Step Computation

import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q: (n, d_k) - queries
    K: (m, d_k) - keys
    V: (m, d_v) - values
    mask: optional (n, m) - attention mask
    """
    d_k = Q.shape[-1]

    # Step 1: Compute raw attention scores
    # QK^T: (n, d_k) @ (d_k, m) = (n, m)
    # Each entry (i,j) = dot product of query i with key j
    scores = Q @ K.T  # shape: (n, m)

    # Step 2: Scale by sqrt(d_k)
    # Prevents softmax saturation for large d_k
    scores = scores / np.sqrt(d_k)

    # Step 3: Apply mask (for decoder self-attention)
    if mask is not None:
        scores = scores + mask  # mask has -inf where attention is blocked

    # Step 4: Softmax normalizes each row
    # Row i becomes a probability distribution over all key positions
    # (subtracting the row max first avoids overflow in np.exp)
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(scores)
    weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)

    # Step 5: Weighted sum of values
    # (n, m) @ (m, d_v) = (n, d_v)
    output = weights @ V

    return output, weights


# Example: 4 tokens, d_k = d_v = 8
n, d_k, d_v = 4, 8, 8
Q = np.random.randn(n, d_k)
K = np.random.randn(n, d_k)
V = np.random.randn(n, d_v)

output, weights = scaled_dot_product_attention(Q, K, V)
print(f"Output shape: {output.shape}")  # (4, 8)
print(f"Attention weights shape: {weights.shape}")  # (4, 4)
print(f"Attention weights (row 0): {weights[0]}")  # sums to 1
print(f"Sum of weights: {weights[0].sum():.4f}")  # 1.0000

Why Scale by $\sqrt{d_k}$?

This is the most commonly asked "why" question about the Transformer.

The problem: For large $d_k$, the dot products $q \cdot k = \sum_{i=1}^{d_k} q_i k_i$ grow in magnitude. If $q$ and $k$ have independent components with mean 0 and variance 1:

$$\text{Var}(q \cdot k) = \sum_{i=1}^{d_k} \text{Var}(q_i k_i) = \sum_{i=1}^{d_k} 1 = d_k$$

So the standard deviation of the dot product is $\sqrt{d_k}$. For $d_k = 64$, that is 8, so dot products typically fall in the range $[-16, 16]$ (within two standard deviations).

The consequence: Large dot products push the softmax into saturation:

$$\text{softmax}([10, 1, 1, 1]) \approx [0.9996, 0.0001, 0.0001, 0.0001]$$

When softmax is nearly one-hot, gradients become vanishingly small:

$$\frac{\partial \text{softmax}(z)_i}{\partial z_j} = \text{softmax}(z)_i (\delta_{ij} - \text{softmax}(z)_j) \approx 0$$

The fix: Dividing by $\sqrt{d_k}$ normalizes the variance of dot products back to 1, keeping them in the region where softmax gradients are well-behaved.

# Demonstration of the scaling effect
np.random.seed(42)
d_k = 64
n_pairs = 1000

dots = np.array([
    np.random.randn(d_k) @ np.random.randn(d_k)
    for _ in range(n_pairs)
])

print("Without scaling:")
print(f"  Mean: {dots.mean():.2f}, Std: {dots.std():.2f}")
# Mean ≈ 0, Std ≈ 8 (sqrt(64))

scaled_dots = dots / np.sqrt(d_k)
print(f"With scaling by sqrt({d_k}):")
print(f"  Mean: {scaled_dots.mean():.2f}, Std: {scaled_dots.std():.2f}")
# Mean ≈ 0, Std ≈ 1

# Effect on softmax entropy
def softmax(x):
    e_x = np.exp(x - x.max())
    return e_x / e_x.sum()

scores_unscaled = np.random.randn(10) * np.sqrt(d_k)
scores_scaled = scores_unscaled / np.sqrt(d_k)

w_unscaled = softmax(scores_unscaled)
w_scaled = softmax(scores_scaled)

entropy_unscaled = -np.sum(w_unscaled * np.log(w_unscaled + 1e-10))
entropy_scaled = -np.sum(w_scaled * np.log(w_scaled + 1e-10))

print(f"\nSoftmax entropy (unscaled): {entropy_unscaled:.4f}")  # Low (peaky)
print(f"Softmax entropy (scaled): {entropy_scaled:.4f}")  # Higher (smoother)

Dot-Product vs. Additive Attention

The paper discusses two types of attention:

$$\text{Additive: } \text{score}(q, k) = v^T \tanh(W_1 q + W_2 k)$$

$$\text{Dot-product: } \text{score}(q, k) = q^T k$$

| Property | Additive Attention | Dot-Product Attention |
|---|---|---|
| Computational complexity | Involves a learned weight matrix and tanh | Simple matrix multiplication |
| Speed in practice | Slower (cannot use optimized BLAS) | Faster (uses optimized matmul) |
| Theoretical power | Can learn arbitrary compatibility | Limited to bilinear compatibility |
| Performance at small $d_k$ | Comparable | Comparable |
| Performance at large $d_k$ | Better (no saturation) | Worse without scaling |
| With scaling | N/A | Comparable to additive |

The authors chose dot-product attention because it is "much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code."
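To make the speed and memory argument concrete, here is a minimal sketch of both scoring functions in NumPy. It uses the row-vector convention (`q @ W1` rather than $W_1 q$); the hidden size `d_a` and all weight names are assumptions for illustration, not values from the paper.

```python
import numpy as np

def dot_product_scores(Q, K):
    """Scaled dot-product scores for all (query, key) pairs: one matmul."""
    return Q @ K.T / np.sqrt(Q.shape[-1])

def additive_scores(Q, K, W1, W2, v):
    """Additive scores v^T tanh(W1 q + W2 k) for all (query, key) pairs."""
    # (n, 1, d_a) + (1, m, d_a) broadcasts to an (n, m, d_a) intermediate
    hidden = np.tanh((Q @ W1)[:, None, :] + (K @ W2)[None, :, :])
    return hidden @ v  # (n, m)

rng = np.random.default_rng(0)
n, m, d_k, d_a = 3, 5, 8, 16  # illustrative sizes
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((m, d_k))
W1 = rng.standard_normal((d_k, d_a))
W2 = rng.standard_normal((d_k, d_a))
v = rng.standard_normal(d_a)

dot_s = dot_product_scores(Q, K)
add_s = additive_scores(Q, K, W1, W2, v)
print(dot_s.shape, add_s.shape)  # (3, 5) (3, 5)
```

Both produce an $(n, m)$ score matrix, but the additive version materializes an $(n, m, d_a)$ intermediate before reducing it, while the dot-product version is a single matrix multiplication.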

Part 4 - Multi-Head Attention

Why Multiple Heads?

A single attention function computes a single set of attention weights. But a token might need to attend to different things for different reasons:

  • For syntactic purposes, a verb might need to attend to its subject
  • For semantic purposes, the same verb might need to attend to its object
  • For positional purposes, it might need to attend to nearby tokens

Multi-head attention allows the model to jointly attend to information from different representation subspaces:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$$

where each head is:

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

Dimensions

In the base Transformer model ($d_{\text{model}} = 512$, $h = 8$ heads):

  • Each head projects to $d_k = d_v = d_{\text{model}} / h = 512 / 8 = 64$
  • Each head computes attention independently over 64-dimensional subspaces
  • The $h = 8$ head outputs are concatenated to form a $512$-dimensional vector
  • A final projection $W^O \in \mathbb{R}^{512 \times 512}$ mixes the head outputs

class MultiHeadAttention:
    """Simplified multi-head attention for understanding."""

    def __init__(self, d_model=512, h=8):
        self.h = h
        self.d_k = d_model // h  # 64

        # Learned projection matrices
        # In practice, these are combined into one matrix for efficiency
        self.W_Q = [np.random.randn(d_model, self.d_k) * 0.02 for _ in range(h)]
        self.W_K = [np.random.randn(d_model, self.d_k) * 0.02 for _ in range(h)]
        self.W_V = [np.random.randn(d_model, self.d_k) * 0.02 for _ in range(h)]
        self.W_O = np.random.randn(d_model, d_model) * 0.02

    def forward(self, Q, K, V):
        """
        Q, K, V: (n, d_model)
        Output: (n, d_model)
        """
        heads = []
        for i in range(self.h):
            # Project to subspace
            Q_i = Q @ self.W_Q[i]  # (n, d_k)
            K_i = K @ self.W_K[i]  # (n, d_k)
            V_i = V @ self.W_V[i]  # (n, d_k)

            # Compute attention in this subspace
            head_i, _ = scaled_dot_product_attention(Q_i, K_i, V_i)
            heads.append(head_i)  # (n, d_k)

        # Concatenate all heads
        concat = np.concatenate(heads, axis=-1)  # (n, d_model)

        # Final projection
        output = concat @ self.W_O  # (n, d_model)
        return output

# The total computation cost is the same as single-head attention
# with full d_model, because we split across h heads with d_k = d_model/h
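The per-head loop above is written for clarity. As its comment notes, implementations usually fuse the per-head projections into single matrices. A minimal self-attention sketch of that idea follows, assuming the head `i` weights occupy columns `i*d_k` to `(i+1)*d_k` of each fused matrix; the function name `fused_multi_head_attention` and the sizes are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def fused_multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Self-attention with all heads computed in one batched matmul."""
    n, d_model = X.shape
    d_k = d_model // h

    def split(M):
        # (n, d_model) -> (h, n, d_k): one slice of columns per head
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n)
    heads = softmax(scores) @ V                        # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
n, d_model, h = 6, 32, 4  # illustrative sizes
X = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))

out = fused_multi_head_attention(X, W_q, W_k, W_v, W_o, h)
print(out.shape)  # (6, 32)
```

The result matches the looped version head for head; the fused form simply replaces `h` small matmuls with one larger batched one.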

Why Not One Big Attention Head?

The paper's ablation (Table 3) shows that 8 heads of 64 dimensions outperforms 1 head of 512 dimensions. The authors' explanation: "Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this."

Intuitively, a single head must compute ONE set of attention weights that balances all the different reasons one token might attend to another. Multiple heads let each head specialize.

60-Second Answer

"Multi-head attention splits the model's representation into $h$ parallel subspaces, computes attention independently in each, then concatenates and projects. This allows different heads to capture different types of relationships - syntactic, semantic, positional - simultaneously. The total computation cost is the same as single-head attention because we reduce $d_k$ proportionally: 8 heads of dimension 64 requires the same FLOPs as 1 head of dimension 512."

Part 5 - Positional Encoding

The Problem

Self-attention is permutation-equivariant: if you shuffle the input tokens, the output tokens are shuffled in the same way. This means the Transformer has no notion of word order. "The cat sat on the mat" and "mat the on sat cat the" would produce the same attention patterns (up to permutation).

The Solution

The authors add positional encodings to the input embeddings:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Where $pos$ is the position (0, 1, 2, ...) and $i$ is the dimension index (0, 1, ..., $d_{\text{model}}/2 - 1$).

Why Sinusoidal?

This formulation has three useful properties (the paper itself highlights the second):

  1. Deterministic: No learned parameters needed. The encoding can be computed for any sequence length.

  2. Relative position encoding: For any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$. Specifically:

$$PE_{pos+k} = M_k \cdot PE_{pos}$$

where $M_k$ is a (block-diagonal) rotation matrix that depends only on $k$, not on $pos$. This means the model can learn to attend to relative positions.

  3. Bounded values: Sine and cosine are bounded in $[-1, 1]$, so they do not dominate the embedding values.

def positional_encoding(max_len, d_model):
    """Generate sinusoidal positional encoding."""
    pe = np.zeros((max_len, d_model))
    position = np.arange(max_len).reshape(-1, 1)  # (max_len, 1)
    div_term = 10000 ** (np.arange(0, d_model, 2) / d_model)  # (d_model/2,)

    pe[:, 0::2] = np.sin(position / div_term)  # Even dimensions
    pe[:, 1::2] = np.cos(position / div_term)  # Odd dimensions

    return pe

pe = positional_encoding(100, 512)
print(f"PE shape: {pe.shape}")  # (100, 512)
print(f"PE[0, :4]: {pe[0, :4]}")  # sin(0), cos(0), sin(0), cos(0) = 0, 1, 0, 1
print(f"PE[1, :4]: {pe[1, :4]}")  # sin(1), cos(1), sin(1/10000^(2/512)), ...

# Key property: different dimensions oscillate at different frequencies
# Low dimensions: high frequency (changes rapidly with position)
# High dimensions: low frequency (changes slowly with position)
# This creates a unique "fingerprint" for each position
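The relative-position property can be checked numerically. The sketch below builds the matrix $M_k$ explicitly as one 2x2 rotation per (sin, cos) pair, using the angle-addition identities, and confirms that the same matrix shifts every position by $k$. The helper name `shift_matrix` and the small sizes are assumptions for illustration.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding, repeated here so the check is self-contained."""
    position = np.arange(max_len).reshape(-1, 1)
    div_term = 10000 ** (np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(position / div_term)
    pe[:, 1::2] = np.cos(position / div_term)
    return pe

def shift_matrix(k, d_model):
    """Block-diagonal M_k with one 2x2 rotation per (sin, cos) pair."""
    omega = 1.0 / 10000 ** (np.arange(0, d_model, 2) / d_model)
    M = np.zeros((d_model, d_model))
    for i, w in enumerate(omega):
        c, s = np.cos(w * k), np.sin(w * k)
        # sin(a + b) =  cos(b) sin(a) + sin(b) cos(a)
        # cos(a + b) = -sin(b) sin(a) + cos(b) cos(a)
        M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return M

pe = positional_encoding(50, 16)
M_5 = shift_matrix(5, 16)

# The same M_5 maps every position to the position 5 steps later
shifted = pe[:45] @ M_5.T
print(np.allclose(shifted, pe[5:50]))  # True
```

Note that `M_5` is built without any reference to `pos`, which is the point: the offset alone determines the transformation.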

Modern Alternatives

While the original paper used sinusoidal positional encodings, modern Transformers have largely moved to better alternatives:

| Encoding | Description | Used By |
|---|---|---|
| Sinusoidal (original) | Fixed sin/cos frequencies | Original Transformer |
| Learned absolute | Trainable embedding per position | BERT, GPT-1/2 |
| RoPE (Rotary) | Rotation-based relative encoding | LLaMA, GPT-NeoX, modern LLMs |
| ALiBi | Linear bias on attention scores | BLOOM, MPT |

Company Variation

If asked about positional encoding in an interview for an LLM team, mention RoPE - it is the current standard. Explain why it is better: it encodes relative positions directly in the attention computation, generalizes better to unseen sequence lengths, and naturally decays attention with distance.

Part 6 - Feed-Forward Networks and Residual Connections

Position-Wise Feed-Forward Networks

Each layer contains a two-layer MLP applied independently to each position:

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

  • Input and output dimension: $d_{\text{model}} = 512$
  • Inner dimension: $d_{ff} = 2048$ (4x expansion)
  • Activation: ReLU (modern Transformers use GeLU or SwiGLU)

The FFN acts as a "memory" or "knowledge store" in the network. Research has shown that factual knowledge tends to be stored in the FFN layers, while attention layers handle relational reasoning.
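A minimal NumPy sketch of the position-wise FFN, using the base-model dimensions from the list above. The function name `position_wise_ffn` and the random initialization scale are illustrative assumptions.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each row of x."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
n, d_model, d_ff = 10, 512, 2048
x = rng.standard_normal((n, d_model))
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

out = position_wise_ffn(x, W1, b1, W2, b2)
print(out.shape)  # (10, 512)

# "Position-wise" means each row is transformed independently:
# feeding position 3 alone gives the same result as row 3 of the batch.
row3 = position_wise_ffn(x[3:4], W1, b1, W2, b2)
print(np.allclose(out[3], row3[0]))  # True
```

Unlike attention, the FFN never mixes information across positions; all cross-position interaction happens in the attention sublayers.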

Residual Connections

Every sublayer has a residual (skip) connection:

$$\text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))$$

Residual connections serve two purposes:

  1. Gradient flow: Provide a direct path for gradients, preventing vanishing gradients in deep networks
  2. Easy identity: If a layer is unnecessary, it can learn $\text{Sublayer}(x) \approx 0$, effectively becoming an identity

Layer Normalization

The original Transformer uses post-norm: LayerNorm is applied after the residual addition. Modern Transformers (GPT-2 onward) use pre-norm: LayerNorm before the sublayer.

$$\text{Post-norm (original):} \quad \text{LayerNorm}(x + \text{Sublayer}(x))$$

$$\text{Pre-norm (modern):} \quad x + \text{Sublayer}(\text{LayerNorm}(x))$$

Pre-norm is more stable during training because the residual path remains unnormalized, allowing gradients to flow more freely.
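The two orderings differ by a single line of code. The sketch below is a simplified illustration: `layer_norm` omits the learned gain and bias, and `sublayer` is a hypothetical stand-in (a small linear map) for attention or the FFN.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm over the last axis, without the learned gain and bias."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    """Original Transformer: normalize after the residual addition."""
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    """Pre-norm variant: normalize only the sublayer input."""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)) * 5.0
W = rng.standard_normal((8, 8)) * 0.1
sublayer = lambda t: t @ W  # hypothetical stand-in for attention or FFN

post = post_norm_block(x, sublayer)
pre = pre_norm_block(x, sublayer)
print(post.std(axis=-1))  # each row is rescaled to roughly unit std
print(pre.std(axis=-1))   # the scale of x passes through unchanged
```

In the pre-norm block, `x` reaches the output without passing through any normalization, which is the unnormalized residual path the text refers to.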

Part 7 - Training Details

The Training Setup

| Parameter | Value |
|---|---|
| Dataset | WMT 2014 English-German (4.5M sentence pairs), English-French (36M) |
| Tokens per batch | ~25,000 source + 25,000 target tokens |
| Optimizer | Adam ($\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$) |
| Learning rate schedule | Warmup + inverse square root decay |
| Warmup steps | 4,000 |
| Regularization | Dropout ($P_{drop} = 0.1$) on each sublayer and embeddings |
| Label smoothing | $\epsilon_{ls} = 0.1$ |
| Training time | 12 hours on 8 P100 GPUs (base model, 100K steps); 3.5 days (big model, 300K steps) |

The Learning Rate Schedule

The Transformer uses a custom learning rate schedule:

$$lr = d_{\text{model}}^{-0.5} \cdot \min(\text{step}^{-0.5}, \text{step} \cdot \text{warmup}^{-1.5})$$

This increases the learning rate linearly during warmup (first 4,000 steps), then decreases it proportionally to the inverse square root of the step number.

def transformer_lr_schedule(step, d_model=512, warmup_steps=4000):
    """The original Transformer learning rate schedule."""
    step = max(step, 1)  # Avoid division by zero
    return d_model ** (-0.5) * min(step ** (-0.5), step * warmup_steps ** (-1.5))

# Compute the schedule over 100,000 steps (e.g., for plotting)
steps = range(1, 100001)
lrs = [transformer_lr_schedule(s) for s in steps]

print(f"LR at step 1: {transformer_lr_schedule(1):.6f}")
print(f"LR at step 4000: {transformer_lr_schedule(4000):.6f}")  # Peak
print(f"LR at step 10000: {transformer_lr_schedule(10000):.6f}")
print(f"LR at step 50000: {transformer_lr_schedule(50000):.6f}")

Why warmup? Early in training, parameters are randomly initialized. Large learning rates cause the Adam optimizer's adaptive estimates to be based on very few samples, leading to instability. Warmup allows these estimates to stabilize before using higher learning rates.

Label Smoothing

Instead of training with hard targets (one-hot vectors), the paper uses label smoothing with $\epsilon = 0.1$:

$$q(k) = (1 - \epsilon) \cdot \delta_{k,y} + \epsilon / K$$

Where $y$ is the correct class and $K$ is the vocabulary size. This distributes 10% of the probability mass uniformly across all tokens.

Effect: Hurts perplexity (the model is less confident on training data) but improves BLEU score (the model generalizes better). The authors report that label smoothing improves BLEU by about 0.5 points.
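The smoothed target distribution is simple to construct. Below is a minimal sketch of the formula above together with a cross-entropy against it; the helper names `smooth_labels` and `cross_entropy` and the toy vocabulary size are assumptions for illustration.

```python
import numpy as np

def smooth_labels(y, K, eps=0.1):
    """q(k) = (1 - eps) * one_hot(y) + eps / K for each target in y."""
    q = np.full((len(y), K), eps / K)
    q[np.arange(len(y)), y] += 1.0 - eps
    return q

def cross_entropy(q, logits):
    """Mean cross-entropy between target distributions q and softmax(logits)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(q * log_p).sum(axis=-1).mean()

y = np.array([2, 0])          # correct classes for two examples
q = smooth_labels(y, K=5)     # toy vocabulary of 5 tokens
print(q[0])                   # [0.02 0.02 0.92 0.02 0.02]

logits = np.array([[0.0, 0.0, 8.0, 0.0, 0.0],
                   [8.0, 0.0, 0.0, 0.0, 0.0]])
hard = np.eye(5)[y]
print(cross_entropy(hard, logits), cross_entropy(q, logits))
```

With smoothing, even a confidently correct model pays a nonzero loss on the $\epsilon$ mass spread over the other tokens, which is why training perplexity gets worse.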

Part 8 - Ablation Study

The paper's ablation study (Table 3) is one of the most informative in ML. Key findings:

Number of Attention Heads

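| Heads ($h$) | $d_k$ | $d_v$ | EN-DE BLEU (dev) |
|---|---|---|---|
| 1 | 512 | 512 | 24.9 |
| 4 | 128 | 128 | 25.5 |
| 8 | 64 | 64 | 25.8 |
| 16 | 32 | 32 | 25.8 |
| 32 | 16 | 16 | 25.4 |

Insight: 8 and 16 heads tie for the best BLEU; the base model uses 8. Too few heads (1) limits representational diversity. Too many heads (32) makes each head too small to compute meaningful attention.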

Model Dimensions

Increasing $d_{\text{model}}$ from 512 to 1024 and $d_{ff}$ from 2048 to 4096 improved BLEU by about 0.7 points, at the cost of significantly more computation.

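Attention Key Size

Table 3 does not ablate additive attention directly. Instead, rows (B) reduce the attention key size $d_k$, which hurts model quality. The authors read this as a sign that determining compatibility is not easy and that a more sophisticated compatibility function than the dot product may be beneficial. Dot-product attention was chosen for its speed and memory efficiency.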

Positional Encoding

Learned positional embeddings performed nearly identically to sinusoidal encodings on the translation task. The authors chose sinusoidal because it can potentially generalize to longer sequences.

Part 9 - Results

Main Results

| Model | EN-DE BLEU | EN-FR BLEU | Training Cost (FLOPs) |
|---|---|---|---|
| Previous SOTA (ensemble) | 26.36 | 41.29 | - |
| Transformer (base) | 27.3 | 38.1 | $3.3 \times 10^{18}$ |
| Transformer (big) | 28.4 | 41.8 | $2.3 \times 10^{19}$ |

Key takeaways:

  • The Transformer big model exceeded the previous SOTA (which was an ensemble of models) with a single model
  • Training cost was a fraction of comparable RNN-based models
  • The base model trained in 12 hours on 8 GPUs; the big model trained in 3.5 days

English Constituency Parsing

The paper also showed the Transformer generalizes beyond translation to English constituency parsing, achieving competitive results. This was an important early signal that the architecture was not task-specific.

Part 10 - Limitations and Modern Improvements

Limitation 1: Quadratic Attention Complexity

Self-attention is $O(n^2)$ in both time and memory, where $n$ is the sequence length. For $n = 4096$, the attention matrix has about 16.8 million entries per head per layer.

Modern solutions:

  • Flash Attention (2022): Same $O(n^2)$ computation but $O(n)$ memory through tiling and recomputation
  • Sparse attention (BigBird, Longformer): $O(n\sqrt{n})$ or $O(n)$ by attending to only a subset of positions
  • Linear attention (Performer): $O(n)$ through kernel approximations
  • State space models (Mamba): $O(n)$ through selective state spaces

Limitation 2: Fixed Context Window

The original Transformer has a fixed maximum sequence length. There is no mechanism for processing inputs longer than the training context.

Modern solutions:

  • RoPE + NTK-aware scaling: Extends context through interpolation
  • Sliding window attention: Process longer sequences with local windows
  • Ring attention: Distribute long sequences across multiple devices

Limitation 3: No Explicit Memory

The Transformer processes each input independently - there is no mechanism for maintaining state across inputs (unlike RNNs).

Modern solutions:

  • Retrieval augmentation (RAG): External memory via retrieval
  • KV caching: Store and reuse key-value pairs from previous forward passes
  • Memory tokens: Dedicated tokens that persist across calls

Limitation 4: Position Encoding Limitations

Sinusoidal and learned absolute position encodings do not generalize well to sequence lengths unseen during training.

Modern solutions:

  • RoPE: Encodes relative position in the attention computation itself
  • ALiBi: Adds linear bias based on distance, naturally decaying attention
Common Trap

When discussing Transformer limitations, do not just list them - propose solutions and cite the follow-up work. This shows you have kept up with the field, not just read the original paper.

Part 11 - The Three Types of Attention in the Transformer

Understanding the three different uses of attention is critical:

| Type | Location | Queries From | Keys/Values From | Mask? |
|---|---|---|---|---|
| Encoder self-attention | Encoder layers | Encoder input | Encoder input | No |
| Decoder self-attention | Decoder layers | Decoder input | Decoder input | Yes (causal) |
| Cross-attention | Decoder layers | Decoder state | Encoder output | No |

Attention Types: Encoder Self, Decoder Masked, Cross-Attention
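The three uses share one attention function and differ only in where the queries, keys, and values come from. The sketch below shows the resulting shapes for encoder self-attention and cross-attention; it omits the learned projections and masking, and the sizes and variable names are illustrative assumptions.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention; returns (output, weights)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
d = 16
encoder_out = rng.standard_normal((7, d))    # m = 7 source positions
decoder_state = rng.standard_normal((3, d))  # n = 3 target positions

# Encoder self-attention: Q, K, V all come from the encoder side
_, w_enc = attention(encoder_out, encoder_out, encoder_out)

# Cross-attention: queries from the decoder, keys/values from the encoder
out_cross, w_cross = attention(decoder_state, encoder_out, encoder_out)

print(w_enc.shape, w_cross.shape, out_cross.shape)  # (7, 7) (3, 7) (3, 16)
```

Cross-attention is the one case where the weight matrix is not square: each of the 3 target positions gets its own distribution over the 7 source positions.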

Practice Problems

Problem 1: Dimensional Analysis

Given a Transformer with $d_{\text{model}} = 768$, $h = 12$ heads, $d_{ff} = 3072$, and $N = 12$ layers, calculate: (a) the dimension per head $d_k$, (b) total parameters in one encoder layer (approximate), (c) total parameters in the full encoder.

Hint

(a) $d_k = 768 / 12 = 64$. (b) One encoder layer has: multi-head attention ($4 \times 768 \times 768 \approx 2.36M$ for $W_Q, W_K, W_V, W_O$), FFN ($768 \times 3072 + 3072 \times 768 \approx 4.72M$), LayerNorm ($2 \times 2 \times 768 \approx 3K$). Total: approximately $7.1M$, ignoring bias terms. (c) $12 \times 7.1M \approx 85M$, plus embeddings.
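The arithmetic in the hint can be reproduced with a short helper. This is a sketch under the same simplifications as the hint (weights only by default); the function name `encoder_layer_params` and the `include_biases` flag are hypothetical.

```python
def encoder_layer_params(d_model, d_ff, include_biases=False):
    """Approximate parameter count for one Transformer encoder layer."""
    attn = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * d_ff       # W_1 and W_2
    layer_norm = 2 * 2 * d_model   # gain and bias for two LayerNorms
    total = attn + ffn + layer_norm
    if include_biases:
        # Four attention projection biases plus the two FFN biases
        total += 4 * d_model + d_ff + d_model
    return total

per_layer = encoder_layer_params(768, 3072)
print(per_layer)       # 7080960
print(12 * per_layer)  # 84971520
```

These match the hint's rounded figures of about 7.1M per layer and 85M for the 12-layer encoder, before adding embeddings.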

Problem 2: Attention Visualization

If a model has 8 attention heads and the input sentence is "The cat sat on the mat", what might each head learn to attend to?

Hint

Different heads learn different patterns: one might attend to the previous token (local), one to syntactically related tokens ("sat" → "cat"), one to semantically similar tokens ("cat" → "mat" via animal-object relations), one to punctuation/structure, etc. This is empirically verified in attention visualization studies.

Problem 3: Masking in the Decoder

Explain why the decoder needs a causal mask in its self-attention. What would happen without it during training? Would it matter during inference?

Hint

During training, the decoder processes the entire target sequence at once (teacher forcing). Without the mask, position $t$ could "see" the ground truth at positions $t+1, t+2, \ldots$, making the task trivially easy (just copy). During inference, future positions do not exist yet (generation is sequential), so masking is implicit. The mask is needed specifically to make training match the inference-time conditions.

Problem 4: Scaling Factor Derivation

Prove that if $q, k \in \mathbb{R}^{d_k}$ have independent components with mean 0 and variance 1, then $\text{Var}(q^T k) = d_k$.

Hint

$q^T k = \sum_{i=1}^{d_k} q_i k_i$. Since $q_i$ and $k_i$ are independent with mean 0 and variance 1, $\mathbb{E}[q_i k_i] = 0$ and $\text{Var}(q_i k_i) = \mathbb{E}[q_i^2 k_i^2] - (\mathbb{E}[q_i k_i])^2 = \mathbb{E}[q_i^2]\mathbb{E}[k_i^2] - 0 = 1 \cdot 1 = 1$. Since all terms are independent, $\text{Var}(\sum q_i k_i) = \sum \text{Var}(q_i k_i) = d_k$.

Problem 5: Architecture Comparison

If you were designing a model for a task where the input sequence is 100K tokens long, would you use the original Transformer? Why or why not? What modifications would you make?

Hint

No - $O(n^2)$ attention with $n = 100K$ means $10^{10}$ entries in the attention matrix per head per layer. You would need: sparse attention (attend to local windows + global tokens), or linear attention, or a hierarchical approach (chunk the input and attend within/across chunks), or a state space model like Mamba that processes sequences in $O(n)$.
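To put the entry count in memory terms, here is a back-of-the-envelope sketch. It assumes the full attention matrix is materialized and stored at 2 bytes per entry (for example fp16); both are assumptions for illustration, and the helper name `attention_matrix_bytes` is hypothetical.

```python
def attention_matrix_bytes(n, bytes_per_entry=2):
    """Memory for one fully materialized n x n attention matrix."""
    return n * n * bytes_per_entry

for n in (4096, 100_000):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"n={n}: {n * n:.2e} entries, {gib:.2f} GiB per head per layer")
```

Multiplying by the number of heads and layers makes clear why the full matrix cannot be stored at 100K tokens, and why methods that avoid materializing it matter.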

Interview Cheat Sheet

| Question | Key Points |
|---|---|
| "Why was the Transformer proposed?" | Sequential bottleneck of RNNs. Cannot parallelize training. |
| "Draw the architecture" | Encoder-decoder, 6 layers each. Self-attention + FFN + residual + LayerNorm. Decoder adds causal mask and cross-attention. |
| "Write the attention equation" | $\text{softmax}(QK^T / \sqrt{d_k})V$. Explain Q, K, V, softmax, scaling. |
| "Why divide by $\sqrt{d_k}$?" | Dot products grow with $d_k$, causing softmax saturation and vanishing gradients. |
| "Why multi-head?" | Different heads capture different relationship types. 8 heads of 64 dims beats 1 head of 512 dims (ablation). |
| "How are positions encoded?" | Sinusoidal (fixed sin/cos at different frequencies). Modern: RoPE. |
| "What are the limitations?" | $O(n^2)$ complexity, fixed context, position encoding does not generalize. |
| "How does this relate to BERT/GPT?" | BERT uses the encoder. GPT uses the decoder. T5 uses both. |
| "Encoder vs decoder difference?" | Decoder has causal masking + cross-attention. |
| "Training details?" | Adam with warmup, dropout 0.1, label smoothing 0.1, 12 hours (base) or 3.5 days (big) on 8 P100 GPUs. |

Spaced Repetition Checkpoints

Day 0 (Today)

  • Understand the motivation: why replace RNNs?
  • Memorize the architecture diagram
  • Derive the attention equation and explain each term

Day 3

  • Draw the full architecture from memory on paper
  • Explain the scaling factor derivation
  • Explain multi-head attention with dimensions

Day 7

  • Practice a 10-minute presentation of this paper
  • Recite key ablation results (8 heads optimal, etc.)
  • Explain all three types of attention

Day 14

  • Mock interview: answer all 10 cheat sheet questions
  • Connect to BERT, GPT, and modern architectures
  • Discuss limitations with proposed solutions

Day 21

  • Full paper discussion interview simulation (20 minutes)
  • Handle follow-up questions about scaling, efficiency, positional encoding
  • Compare to at least two alternative architectures

Next Steps

Now that you have mastered the Transformer architecture, move to Chapter 4: BERT to understand how the Transformer encoder was adapted for bidirectional pre-training - one of the most impactful applications of the architecture you just learned.

© 2026 EngineersOfAI. All rights reserved.