Transformer Architecture - The Engine of Modern AI
Reading time: ~45 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Engineer, LLM Engineer, NLP Engineer
The Real Interview Moment
You are in an Anthropic research engineer interview. The interviewer hands you a blank sheet and says: "Draw the complete Transformer architecture. Label every component with its exact dimensions for a model with $d_{model} = 512$, 8 heads, and a sequence length of 128. Then walk me through a single token from input embedding to output logits."
You draw the architecture, and she follows up: "Why does GPT use only the decoder? Why does BERT use only the encoder? What is the fundamental difference in their attention masks, and what tasks does each architecture excel at?"
This is the most asked question in modern AI interviews. The Transformer is the architecture behind GPT-4, Claude, BERT, ViT, Stable Diffusion, AlphaFold, and virtually every state-of-the-art model. You need to know it at the level of being able to implement it from scratch, explain every design decision, and discuss the modern variants that make it scale.
Candidates who can draw the architecture but cannot explain pre-norm vs post-norm, or who confuse the encoder and decoder attention patterns, get a "lean no-hire." Candidates who can walk through the full forward pass with dimensions, explain positional encoding design choices, and discuss efficiency techniques like Flash Attention and KV caching get a "strong hire."
What You Will Master
- Draw the complete Transformer architecture (encoder + decoder) with all sub-components
- Trace a token through the full forward pass with exact dimensions at every step
- Derive sinusoidal positional encodings and compare with learned and RoPE
- Explain pre-norm vs post-norm and why pre-norm trains more stably
- Compare encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) architectures
- Describe Flash Attention, KV cache, and sliding window attention
- Compute the parameter count and FLOPs for a given Transformer configuration
- Solve interview problems on architecture design, efficiency, and scaling
Self-Assessment: Where Are You Now?
| Skill | 1 -- Cannot | 2 -- Vaguely | 3 -- Can Explain | 4 -- Can Derive | 5 -- Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Draw the full Transformer architecture | | | | | | ___ |
| Walk through forward pass with dimensions | | | | | | ___ |
| Explain sinusoidal positional encoding | | | | | | ___ |
| Compare pre-norm vs post-norm | | | | | | ___ |
| Explain BERT vs GPT architecture differences | | | | | | ___ |
| Describe Flash Attention mechanism | | | | | | ___ |
| Explain KV cache for inference | | | | | | ___ |
| Compute parameter count for a given config | | | | | | ___ |
Target: All 4s and 5s before your interview.
Part 1 - The Full Architecture
High-Level Structure
The original Transformer (Vaswani et al., 2017, "Attention Is All You Need") has two main components:
- Encoder: Processes the input sequence into contextual representations
- Decoder: Generates the output sequence token by token, conditioned on the encoder output
Layer-by-Layer Walkthrough with Dimensions
Let us trace a forward pass with: $d_{model} = 512$, $h = 8$ heads, $d_{ff} = 2048$, sequence length $T = 128$, vocabulary size $V = 32000$.
Step 1: Input Embedding
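Token embedding lookup: each token ID maps to a 512-dimensional vector, so the embedded sequence is $X \in \mathbb{R}^{128 \times 512}$. Parameters: $V \times d_{model} = 32000 \times 512 \approx 16.4$M.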
Step 2: Positional Encoding
Add position information (see Part 2 for details).
Step 3: Multi-Head Self-Attention
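For each of 8 heads ($d_k = d_{model} / h = 64$):
$Q_i = X W_i^Q, \quad K_i = X W_i^K, \quad V_i = X W_i^V$, each in $\mathbb{R}^{128 \times 64}$
$\text{head}_i = \text{softmax}\left(Q_i K_i^\top / \sqrt{d_k}\right) V_i \in \mathbb{R}^{128 \times 64}$
Concatenate: $\text{Concat}(\text{head}_1, \ldots, \text{head}_8) \in \mathbb{R}^{128 \times 512}$
Output projection: $\text{Concat}(\text{head}_1, \ldots, \text{head}_8)\, W^O \in \mathbb{R}^{128 \times 512}$, with $W^O \in \mathbb{R}^{512 \times 512}$
Attention parameters per layer: $4 d_{model}^2 = 4 \times 512^2 \approx 1.05$M.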
Step 4: Add and Norm (Residual Connection + Layer Normalization)
$\text{LayerNorm}(X + \text{MultiHead}(X)) \in \mathbb{R}^{128 \times 512}$
LayerNorm parameters: $2 d_{model} = 1024$ (scale and shift).
Step 5: Feed-Forward Network
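$\text{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2$
Where $W_1 \in \mathbb{R}^{512 \times 2048}$, $W_2 \in \mathbb{R}^{2048 \times 512}$.
FFN parameters per layer: $2 \times 512 \times 2048 + 2048 + 512 \approx 2.1$M.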
Step 6: Add and Norm again.
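The six steps above can be sketched as a shape trace. This is a minimal illustration, not an implementation of the layer: the function name `trace_encoder_layer_shapes` and the step labels are made up for this example, and it only tracks tensor shapes for one sequence (no batch dimension).

```python
def trace_encoder_layer_shapes(T=128, d_model=512, h=8, d_ff=2048):
    """Return (step, shape) pairs for one sequence passing through one encoder layer."""
    assert d_model % h == 0, "d_model must be divisible by the number of heads"
    d_k = d_model // h  # per-head dimension
    return [
        ("token ids", (T,)),
        ("embedding + positional encoding", (T, d_model)),
        ("Q, K, V per head", (h, T, d_k)),
        ("attention scores per head", (h, T, T)),
        ("attention output per head", (h, T, d_k)),
        ("concatenated heads", (T, h * d_k)),
        ("output projection", (T, d_model)),
        ("add & norm", (T, d_model)),
        ("FFN hidden", (T, d_ff)),
        ("FFN output", (T, d_model)),
        ("add & norm", (T, d_model)),
    ]

for step, shape in trace_encoder_layer_shapes():
    print(f"{step:34s} {shape}")
```

Note that the only place the sequence length appears squared is the per-head score matrix, which is what the efficiency techniques later in this page target.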
"The Transformer processes sequences entirely through attention, with no recurrence. Each encoder layer has multi-head self-attention followed by a feed-forward network, with residual connections and layer normalization around each. The decoder adds causal masking in self-attention to prevent attending to future tokens, plus a cross-attention layer that queries the encoder output. Positional encoding is added to the input embeddings since attention has no inherent notion of position. The whole architecture is highly parallelizable because all positions are processed simultaneously."
Total Parameter Count
For a 6-layer encoder + 6-layer decoder Transformer with $d_{model} = 512$, $d_{ff} = 2048$, $V = 32000$:
| Component | Parameters |
|---|---|
| Token embedding | 32000 x 512 = 16.4M |
| Positional encoding (sinusoidal) | 0 (fixed) |
| Encoder self-attention (per layer) | 4 x 512^2 = 1.05M |
| Encoder FFN (per layer) | 2 x 512 x 2048 + 2048 + 512 = 2.1M |
| Encoder LayerNorm (per layer) | 2 x 2 x 512 = 2K |
| Encoder total (6 layers) | 6 x 3.15M = 18.9M |
| Decoder self-attention (per layer) | 1.05M |
| Decoder cross-attention (per layer) | 1.05M |
| Decoder FFN (per layer) | 2.1M |
| Decoder LayerNorm (per layer) | 3 x 2 x 512 = 3K |
| Decoder total (6 layers) | 6 x 4.2M = 25.2M |
| Output linear (tied with embedding) | 0 (weight tying) |
| Total | ~60.5M |
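The table can be reproduced with a few lines of arithmetic. The sketch below follows the same conventions as the table (attention without biases, FFN with biases, tied output head, fixed sinusoidal positions); the function names are hypothetical helpers for this example.

```python
def attention_params(d_model):
    # W_Q, W_K, W_V, W_O, each d_model x d_model, no biases (as in the table)
    return 4 * d_model * d_model

def ffn_params(d_model, d_ff):
    # two weight matrices plus their bias vectors
    return 2 * d_model * d_ff + d_ff + d_model

def layernorm_params(d_model):
    # one scale and one shift vector
    return 2 * d_model

def encoder_decoder_params(V=32000, d_model=512, d_ff=2048, n_layers=6):
    embedding = V * d_model  # shared with the output projection (weight tying)
    enc_layer = (attention_params(d_model) + ffn_params(d_model, d_ff)
                 + 2 * layernorm_params(d_model))
    dec_layer = (2 * attention_params(d_model) + ffn_params(d_model, d_ff)
                 + 3 * layernorm_params(d_model))
    return embedding + n_layers * (enc_layer + dec_layer)

print(f"{encoder_decoder_params():,}")  # 60,485,632
```

Changing `d_model` while holding `V` fixed shows how quickly the per-layer terms overtake the embedding table.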
At Google and Meta, interviewers expect you to compute parameter counts on the spot. At research-focused companies (DeepMind, Anthropic), they want you to know how parameters scale and which component dominates at different model sizes. Key insight: for small models, embeddings dominate; for large models (billions of parameters), FFN layers dominate because they scale as $O(L \cdot d_{model}^2)$ while embeddings scale as $O(V \cdot d_{model})$.
Part 2 - Positional Encoding
Why We Need It
Self-attention is permutation-equivariant: if you shuffle the input tokens, the output is shuffled in the same way, because the computation treats all positions identically. Without positional encoding, the Transformer has no notion of word order - "the cat sat on the mat" and "mat the on sat cat the" would give each word exactly the same representation, just in a different order.
Sinusoidal Positional Encoding (Original)
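$PE_{(pos,\, 2i)} = \sin\left(pos / 10000^{2i / d_{model}}\right)$
$PE_{(pos,\, 2i+1)} = \cos\left(pos / 10000^{2i / d_{model}}\right)$
Where $pos$ is the position index and $i$ is the dimension index.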
Why sinusoidal?
- Fixed (no learned parameters): Does not add to model size
- Extrapolation: Can theoretically handle sequences longer than those seen in training
- Relative position encoding: For any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$:
$PE_{pos+k} = M_k \cdot PE_{pos}$
where $M_k$ is a rotation matrix that depends only on $k$. This allows the model to learn relative position patterns.
- Multi-scale patterns: Low-frequency sinusoids capture long-range position differences; high-frequency sinusoids capture fine-grained position differences
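The relative-position property can be checked numerically. The sketch below is a pure-Python illustration (the helper names `sinusoidal_pe` and `shift_by_rotation` are made up here): it builds the encoding for one position and then reaches the encoding of a shifted position using only the offset, one 2x2 rotation per sine/cosine pair.

```python
import math

def sinusoidal_pe(pos, d_model=512):
    """Sinusoidal encoding for one position: even dims use sin, odd dims use cos."""
    pe = []
    for i in range(d_model // 2):
        omega = 1.0 / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(pos * omega), math.cos(pos * omega)])
    return pe

def shift_by_rotation(pe, k, d_model=512):
    """Map PE(pos) to PE(pos + k) using only k: one 2x2 rotation per frequency pair."""
    out = []
    for i in range(d_model // 2):
        omega = 1.0 / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        cos_k, sin_k = math.cos(k * omega), math.sin(k * omega)
        # angle-addition identities for sin(a + b) and cos(a + b)
        out.extend([s * cos_k + c * sin_k, c * cos_k - s * sin_k])
    return out

direct = sinusoidal_pe(17 + 5)
rotated = shift_by_rotation(sinusoidal_pe(17), k=5)
print(max(abs(a - b) for a, b in zip(direct, rotated)))  # floating-point noise only
```

The rotation never looks at the absolute position 17, only at the offset 5, which is what "a linear function of the encoding at the original position" means in practice.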
Learned Positional Encoding
A lookup table of learned vectors, one per position. Used in BERT and GPT-2.
Pros: More flexible, can learn task-specific patterns. Cons: Cannot extrapolate beyond the maximum trained length $T_{max}$, adds $T_{max} \times d_{model}$ parameters.
Rotary Position Embedding (RoPE)
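RoPE (Su et al., 2021) encodes position by rotating the query and key vectors:
$\tilde{q}_m = R_m q_m, \quad \tilde{k}_n = R_n k_n$
Where $R_m$ is a rotation matrix that depends on position $m$. The inner product becomes:
$\tilde{q}_m^\top \tilde{k}_n = q_m^\top R_m^\top R_n k_n = q_m^\top R_{n-m} k_n$
The attention score depends only on the relative position $n - m$, not the absolute positions. This is the key advantage of RoPE.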
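A small numerical check makes the relative-position claim concrete. This is a sketch under assumptions: it rotates consecutive pairs of dimensions (the interleaved-pair convention; some implementations pair the first and second halves of the vector instead), uses base 10000, and the vectors `q` and `k` are arbitrary example values.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of `vec` by the angle pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def score(q, k, m, n):
    """Dot product between a query rotated to position m and a key rotated to position n."""
    return sum(a * b for a, b in zip(rope(q, m), rope(k, n)))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
# Same offset (4) at very different absolute positions gives the same score.
print(score(q, k, m=3, n=7), score(q, k, m=103, n=107))
```

Changing the offset (for example `n=8` with `m=3`) changes the score, while shifting both positions by the same amount does not.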
RoPE is used by: LLaMA, Mistral, GPT-NeoX, and most modern LLMs.
Positional Encoding Comparison
| Method | Params | Extrapolation | Relative Position | Used By |
|---|---|---|---|---|
| Sinusoidal | 0 | Theoretically yes | Via linear transform | Original Transformer |
| Learned | $T_{max} \times d_{model}$ | No | No (absolute only) | BERT, GPT-2 |
| RoPE | 0 | With NTK scaling | Natively | LLaMA, Mistral |
| ALiBi | 0 | Naturally | Natively (linear bias) | BLOOM, MPT |
Do NOT say "sinusoidal encoding can extrapolate to any length." While mathematically the sinusoids are defined for any position, in practice models trained with max length $T_{max}$ show degraded performance beyond $T_{max}$. True length extrapolation requires techniques like NTK-aware RoPE scaling or ALiBi. The correct statement: "Sinusoidal encoding is defined for arbitrary positions, but the model's learned attention patterns may not generalize to unseen position ranges."
Part 3 - Pre-Norm vs Post-Norm
Post-Norm (Original Transformer)
$x_{l+1} = \text{LayerNorm}(x_l + \text{Sublayer}(x_l))$
The sublayer output is added to the input (residual), then normalized.
Pre-Norm (Modern Default)
$x_{l+1} = x_l + \text{Sublayer}(\text{LayerNorm}(x_l))$
The input is normalized BEFORE the sublayer, and the residual connection adds the raw (unnormalized) input.
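The two orderings differ only in where the normalization sits relative to the residual add. A minimal pure-Python sketch, assuming a LayerNorm without learned scale and shift and a made-up `toy_sublayer` standing in for attention or the FFN:

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale/shift (gamma = 1, beta = 0), for illustration."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm_block(x, sublayer):
    # normalize AFTER the residual add: LayerNorm(x + Sublayer(x))
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer):
    # normalize BEFORE the sublayer: x + Sublayer(LayerNorm(x))
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(x):
    # hypothetical stand-in for attention or the FFN
    return [0.5 * v for v in x]

x = [1.0, 2.0, 4.0, 8.0]
print(post_norm_block(x, toy_sublayer))  # re-normalized: the raw input scale is gone
print(pre_norm_block(x, toy_sublayer))   # raw input plus a bounded update
```

In the pre-norm output the original `x` is still present as an additive term, whereas the post-norm output has been rescaled as a whole.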
Why Pre-Norm Trains Better
1. Gradient flow through residual stream.
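In pre-norm, the gradient through the residual connection is an identity (unmodified by LayerNorm):
$\frac{\partial x_{l+1}}{\partial x_l} = I + \frac{\partial\, \text{Sublayer}(\text{LayerNorm}(x_l))}{\partial x_l}$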
The identity term guarantees gradient flow regardless of what happens in the sublayer. In post-norm, the LayerNorm sits on the residual path, complicating gradient flow.
2. Gradient magnitude stability.
With post-norm, gradients can grow or shrink unpredictably as they pass through LayerNorm on the residual path. Pre-norm keeps the residual stream "clean" - it is just a sum of sublayer outputs.
3. Learning rate sensitivity.
Post-norm typically requires learning rate warmup - without it, training often diverges. Pre-norm is much more robust to learning rate choice and does not strictly require warmup, though warmup still helps.
4. Which is used in practice?
| Model | Norm Style |
|---|---|
| Original Transformer | Post-norm |
| BERT | Post-norm |
| GPT-2 | Pre-norm |
| GPT-3 | Pre-norm |
| LLaMA | Pre-norm (with RMSNorm) |
| T5 | Pre-norm |
| PaLM | Pre-norm |
Modern consensus: Pre-norm with RMSNorm is the standard for LLMs.
When I ask about pre-norm vs post-norm, the strongest answer includes: (1) the mathematical difference in gradient flow - pre-norm has a clean identity path, (2) the practical consequence - post-norm needs warmup and careful LR tuning, pre-norm is more robust, (3) the fact that nearly all modern LLMs use pre-norm. Candidates who only know "pre-norm is better" without the gradient argument get moderate marks.
Part 4 - BERT vs GPT: Encoder-Only vs Decoder-Only
The Architecture Split
The original Transformer has both encoder and decoder. But the two most influential models after it each use only one half:
BERT (2018) - Encoder Only
- Bidirectional self-attention (no causal mask)
- Sees all positions simultaneously
- Trained with masked language modeling (predict masked tokens)
- Excels at understanding tasks: classification, NER, QA
GPT (2018-2024) - Decoder Only
- Causal (autoregressive) self-attention
- Position $i$ can only attend to positions $j \le i$
- Trained with next-token prediction
- Excels at generation tasks: text generation, code, reasoning
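The difference between the two attention patterns comes down to a mask. A minimal sketch, where 1 means "may attend" and 0 means "masked out" (the function names are made up for this example):

```python
def bidirectional_mask(T):
    """BERT-style: every position may attend to every position."""
    return [[1] * T for _ in range(T)]

def causal_mask(T):
    """GPT-style: position i may attend to positions j <= i (lower triangular)."""
    return [[1 if j <= i else 0 for j in range(T)] for i in range(T)]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

In an actual attention layer the zeros are usually applied by setting the corresponding scores to negative infinity before the softmax.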
Detailed Comparison
| Aspect | BERT (Encoder) | GPT (Decoder) | T5 (Enc-Dec) |
|---|---|---|---|
| Attention mask | None (full bidirectional) | Causal (lower triangular) | Encoder: none; Decoder: causal + cross |
| Training objective | MLM + NSP | Next token prediction | Span corruption (text-to-text) |
| Context access | Full sequence | Only left context | Encoder: full; Decoder: left + encoder |
| Generation | Requires tricks (iterative) | Natural (autoregressive) | Natural (autoregressive decoder) |
| Understanding | Excellent | Good (but unidirectional) | Excellent |
| Dominant use | Classification, NER, embeddings | Text generation, chat, code | Translation, summarization |
| Modern examples | BERT, RoBERTa, DeBERTa | GPT-4, Claude, LLaMA, Mistral | T5, FLAN-T5, UL2 |
Why Decoder-Only Won
Most modern LLMs (GPT-4, Claude, LLaMA, Gemini) are decoder-only. Why?
- Simplicity: One architecture for both understanding and generation
- Scaling: Decoder-only models scale more predictably (Kaplan et al., scaling laws)
- Emergent abilities: Large decoder-only models develop strong understanding capabilities even without bidirectional attention (in-context learning, chain-of-thought)
- Versatility: Any NLP task can be framed as text generation ("classify the following as positive or negative: ...")
- Training efficiency: Next-token prediction on every position provides a training signal at every token (more efficient than MLM which only trains on ~15% of tokens)
Do NOT say "BERT is better for understanding and GPT is better for generation, so you should use BERT for classification tasks." Modern large decoder-only models (GPT-4, Claude) match or exceed BERT on classification tasks when given appropriate prompting or fine-tuning. The BERT advantage is primarily for SMALLER models where bidirectional context helps overcome limited capacity. At scale, decoder-only models can learn to "understand" from next-token prediction alone.
Part 5 - The Feed-Forward Network
Standard FFN
$\text{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2$
This is a position-wise MLP - the same network is applied independently to each position. The hidden dimension is typically $d_{ff} = 4 d_{model}$.
Modern FFN: SwiGLU
Most modern LLMs use SwiGLU (Shazeer, 2020) instead of ReLU:
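$\text{SwiGLU}(x) = \left(\text{Swish}(x W_{gate}) \odot x W_{up}\right) W_{down}$
Where $\text{Swish}(z) = z \cdot \sigma(z)$ and $\odot$ is element-wise multiplication.
Note: SwiGLU has THREE weight matrices instead of two, so $d_{ff}$ is typically set to $\frac{2}{3} \cdot 4 d_{model} = \frac{8}{3} d_{model}$ (rounded to a multiple of 256) to keep parameter count similar.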
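A sketch of the sizing rule and the gated forward pass follows. The weight names `W_gate`, `W_up`, `W_down` are labels chosen for this example, and rounding up to the next multiple of 256 is an assumption about how the rounding is done; the matrices are plain nested lists so the example stays dependency-free.

```python
import math

def swiglu_hidden_dim(d_model, multiple_of=256):
    """Two thirds of the usual 4 * d_model, rounded up to a multiple of `multiple_of`."""
    hidden = int(2 * (4 * d_model) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

def swish(z):
    return z / (1.0 + math.exp(-z))  # z * sigmoid(z)

def matvec(x, W):
    """x (length n) times W (n rows, m columns) -> vector of length m."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = [swish(g) for g in matvec(x, W_gate)]  # gating branch
    up = matvec(x, W_up)                          # linear branch
    return matvec([g * u for g, u in zip(gate, up)], W_down)

d_model = 4096
d_ff = swiglu_hidden_dim(d_model)
# hidden size, SwiGLU weight count (3 matrices), standard FFN weight count (2 matrices)
print(d_ff, 3 * d_model * d_ff, 2 * d_model * 4 * d_model)
```

For `d_model = 4096` the two weight counts land within about one percent of each other, which is the point of shrinking the hidden dimension.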
| FFN Variant | Activation | Weight Matrices | Used By |
|---|---|---|---|
| Original | ReLU | 2 ($W_1$, $W_2$) | Original Transformer |
| GELU FFN | GELU | 2 | BERT, GPT-2 |
| SwiGLU | Swish + gating | 3 ($W_{gate}$, $W_{up}$, $W_{down}$) | LLaMA, PaLM, Mistral |
| GeGLU | GELU + gating | 3 | Some T5 variants |
Why FFN Matters
The FFN layers account for 2/3 of the parameters in a standard Transformer layer ($2 d_{model} d_{ff}$ for FFN vs $4 d_{model}^2$ for attention, and $d_{ff} = 4 d_{model}$, so FFN has $8 d_{model}^2$ vs attention's $4 d_{model}^2$).
Recent research suggests that FFN layers act as key-value memories (Geva et al., 2021): the first layer's rows store "keys" (patterns to match), and the second layer's columns store "values" (information to output). This is conceptually similar to attention but with the "keys" being fixed learned patterns rather than dynamic input-dependent patterns.
Part 6 - Efficiency Techniques
Flash Attention
Problem: Standard attention materializes the full $T \times T$ attention matrix, requiring $O(T^2)$ memory.
Solution: Flash Attention (Dao et al., 2022) computes attention in tiles that fit in GPU SRAM (on-chip memory), never materializing the full attention matrix.
Key ideas:
- Tiling: Split Q, K, V into blocks that fit in SRAM
- Online softmax: Compute softmax incrementally using the log-sum-exp trick
- Recomputation: During backward pass, recompute attention instead of storing it (trading compute for memory)
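The online-softmax idea can be shown for a single query row. This sketch is not Flash Attention itself (no GPU tiling, scalar values instead of vectors, and the function names are made up); it only demonstrates that processing scores block by block with a running max and a running normalizer gives the same result as the full softmax.

```python
import math

def naive_attention_row(scores, values):
    """softmax(scores) . values with the whole score row held in memory."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total

def tiled_attention_row(scores, values, block=2):
    """Same result, visiting one block at a time with a running max and normalizer."""
    running_max = float("-inf")
    running_sum = 0.0  # sum of exp(score - running_max) over blocks seen so far
    running_out = 0.0  # unnormalized weighted sum of values seen so far
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        scale = math.exp(running_max - new_max)  # rescale the old accumulators
        running_sum *= scale
        running_out *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            running_out += w * v
        running_max = new_max
    return running_out / running_sum

scores = [0.1, 2.0, -1.0, 3.5, 0.7]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(naive_attention_row(scores, values), tiled_attention_row(scores, values))
```

Only the current block and three running scalars are live at any time, which is why the full row of weights never needs to be stored.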
Complexity comparison:
| Aspect | Standard Attention | Flash Attention |
|---|---|---|
| Memory (forward) | $O(T^2)$ | $O(T)$ |
| FLOPs | $O(T^2 d)$ | $O(T^2 d)$ (same) |
| Wall-clock time | Baseline | 2-4x faster |
| HBM reads/writes | $O(Td + T^2)$ | $O(T^2 d^2 / M)$ where $M$ is SRAM size |
The speedup comes from reducing memory I/O, not reducing computation. Modern GPUs are memory-bandwidth-limited for attention, so Flash Attention's tiled approach - which keeps data in fast SRAM - achieves significant wall-clock speedups despite the same FLOP count.
KV Cache for Inference
During autoregressive generation, the model generates one token at a time. Without caching, generating token $t$ requires recomputing attention over all $t - 1$ previous tokens.
KV cache: Store the K and V projections from all previous tokens. When generating token $t$:
- Compute Q, K, V only for the new token (not the full sequence)
- Append the new K and V to the cache
- Compute attention between the new Q and the full K/V cache
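Without KV cache: Each step re-runs the layer over the whole growing context, costing $O(t^2 d + t d^2)$ per layer at step $t$. Generating $T$ tokens therefore costs $O(T^3 d + T^2 d^2)$ per layer.
With KV cache: Generating each new token costs $O(t d + d^2)$ per layer (one query against the full cache, plus the projections and FFN for a single token). Total over $T$ tokens: $O(T^2 d + T d^2)$ per layer, roughly a factor of $T$ less work, paid for with cache memory.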
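The caching steps can be sketched for a single attention head. This is a toy illustration with tiny hand-written projection matrices (all names and values here are made up); it checks that attending from the newest token over cached keys and values matches recomputing every key and value from scratch.

```python
import math

def project(x, W):
    """x (length n) times W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def attend(q, keys, values):
    """Single-head attention of one query over a list of keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x, W_q, W_k, W_v):
        """Project only the new token, append its K and V, attend over the cache."""
        self.keys.append(project(x, W_k))
        self.values.append(project(x, W_v))
        return attend(project(x, W_q), self.keys, self.values)

def last_token_without_cache(xs, W_q, W_k, W_v):
    """Recompute K and V for the whole prefix, then attend from the last token."""
    keys = [project(x, W_k) for x in xs]
    values = [project(x, W_v) for x in xs]
    return attend(project(xs[-1], W_q), keys, values)

W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.5, -0.2], [0.1, 0.3]]
W_v = [[1.0, 2.0], [0.0, -1.0]]
xs = [[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0]]

cache = KVCache()
cached_outputs = [cache.step(x, W_q, W_k, W_v) for x in xs]
uncached_outputs = [last_token_without_cache(xs[:t + 1], W_q, W_k, W_v)
                    for t in range(len(xs))]
print(cached_outputs[-1], uncached_outputs[-1])
```

The cached path performs one projection per step, while the uncached path performs one per prefix token per step.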
KV cache memory:
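Per layer: $2 \times T \times d_{model} \times \text{bytes per element}$ (keys and values for every position)
For a 70B model (80 layers, $d_{model} = 8192$, 8K context, float16):
$2 \times 80 \times 8192 \times 8192 \times 2 \text{ bytes} \approx 21.5$ GB (20 GiB) per sequence with full multi-head attention
This is a significant memory cost - across a batch of long-context sequences it is often larger than the model weights.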
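The sizing rule is simple enough to keep as a helper. In this sketch the function name is made up, and the 70B configuration is assumed to have a model dimension of 8192 split into 64 heads of size 128; the number of KV heads is left as a parameter so the same helper also covers attention variants with fewer KV heads.

```python
def kv_cache_bytes(n_layers, seq_len, n_kv_heads, d_head, bytes_per_elem=2):
    """Keys and values (the factor 2) for every layer, position, and KV head."""
    return 2 * n_layers * seq_len * n_kv_heads * d_head * bytes_per_elem

# 80 layers, 8K context, float16, 64 KV heads of dimension 128 (assumed split)
full_mha = kv_cache_bytes(80, 8192, n_kv_heads=64, d_head=128)
print(full_mha, full_mha / 2**30)  # bytes, GiB
```

This is the cost for a single sequence; a serving batch multiplies it by the number of concurrent sequences.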
Sliding Window Attention
Instead of attending to all previous positions, each position attends only to the most recent $W$ positions:
Position $i$ attends to positions $j$ with $\max(0,\, i - W + 1) \le j \le i$
Benefits:
- Memory per layer: $O(T \cdot W)$ instead of $O(T^2)$
- Inference memory: KV cache limited to $W$ entries per layer
- Effective context can exceed $W$ through stacking: with $L$ layers and window size $W$, the receptive field is $L \times W$
Used by: Mistral (window size 4096), Longformer (local + global tokens).
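A sliding window is a causal mask with a band of fixed width. A minimal sketch (function names are made up for this example), together with the stacked receptive-field bound from the list above:

```python
def sliding_window_mask(T, W):
    """Causal mask where position i sees only positions i-W+1 .. i (1 = may attend)."""
    return [[1 if 0 <= i - j < W else 0 for j in range(T)] for i in range(T)]

def receptive_field(n_layers, W):
    """Upper bound on how far back information can propagate through stacked layers."""
    return n_layers * W

for row in sliding_window_mask(5, 2):
    print(row)
print(receptive_field(32, 4096))  # 131072
```

Each row has at most `W` ones, so both attention memory and the KV cache stop growing once the context exceeds the window.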
Modern LLM interviews increasingly focus on inference efficiency. If you can explain KV cache sizing, Flash Attention's tiling strategy, and the tradeoffs of sliding window attention, you demonstrate practical systems knowledge that separates you from candidates who only know the theory.
Grouped-Query Attention (GQA)
Standard multi-head attention: each head has its own Q, K, V projections.
GQA: Multiple query heads share the same K and V projections.
| Variant | Q Heads | KV Heads | KV Cache Size |
|---|---|---|---|
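| Multi-Head Attention (MHA) | $h$ | $h$ | $2 \cdot L \cdot T \cdot h \cdot d_k$ |
| Grouped-Query (GQA) | $h$ | $g$ ($1 < g < h$) | $2 \cdot L \cdot T \cdot g \cdot d_k$ |
| Multi-Query (MQA) | $h$ | 1 | $2 \cdot L \cdot T \cdot d_k$ |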
For a model with 32 query heads and 8 KV heads (GQA with group size 4), the KV cache is 4x smaller than full MHA.
Used by: LLaMA 2 (GQA in the 34B and 70B models), Mistral (GQA), PaLM (MQA).
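The head sharing can be sketched as an index mapping. Grouping consecutive query heads onto one KV head is a common convention and is assumed here; the function names are made up for this example.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Consecutive query heads share one KV head."""
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def kv_cache_reduction(n_q_heads, n_kv_heads):
    """How many times smaller the KV cache is than with full multi-head attention."""
    return n_q_heads / n_kv_heads

print([kv_head_for_query_head(q, 32, 8) for q in range(32)])
print(kv_cache_reduction(32, 8))  # 4.0
```

Setting `n_kv_heads` equal to `n_q_heads` recovers multi-head attention, and setting it to 1 gives multi-query attention.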
Part 7 - Modern LLM Architecture Recipe
The Standard 2024-2026 Recipe
Most modern LLMs follow a remarkably similar architecture:
| Component | Choice |
|---|---|
| Architecture | Decoder-only Transformer |
| Normalization | Pre-norm with RMSNorm |
| Positional encoding | RoPE |
| FFN | SwiGLU ($d_{ff} \approx \frac{8}{3} d_{model}$) |
| Attention | GQA with Flash Attention |
| Activation | SiLU (Swish) in FFN |
| Embedding | Tied input-output embeddings in many smaller models; larger models such as LLaMA often keep them untied |
| Bias | No bias in linear layers (saves parameters) |
| Context extension | NTK-aware RoPE scaling or YaRN |
This recipe (pioneered by LLaMA) has become the default because each component has been individually validated through ablation studies.
Part 8 - Computing FLOPs
Per-Token Forward Pass FLOPs
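For a Transformer with $L$ layers, $d_{model} = d$, $d_{ff} = 4d$, processing a sequence of length $T$:
Attention (per layer):
- QKV projection: $3 \times 2Td^2 = 6Td^2$
- Attention scores ($QK^\top$): $2T^2 d$
- Attention output ($AV$): $2T^2 d$
- Output projection: $2Td^2$
- Total: $8Td^2 + 4T^2 d$
FFN (per layer):
- Two linear layers: $2 \times 2 \times T \times d \times 4d = 16Td^2$
Total per layer: $24Td^2 + 4T^2 d$
Total model: $L \cdot (24Td^2 + 4T^2 d)$ + (embedding + output)
For large $d$ and moderate $T$: $24Td^2 \gg 4T^2 d$, which gives the approximation:
$\text{FLOPs} \approx 24LTd^2 = 2NT$, where $N \approx 12Ld^2$ is the parameter count (about $2N$ FLOPs per token)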
(Factor of 2 because each parameter is used in one multiply and one add.)
"The FLOPs for a forward pass through a Transformer are approximately $2NT$, where $N$ is the parameter count and $T$ the number of tokens. For a 70B parameter model processing 2048 tokens, that is about $2.9 \times 10^{14}$ FLOPs. Training costs approximately $6ND$ FLOPs total for $D$ training tokens, because each training step involves a forward pass ($2N$ per token) and a backward pass that costs about twice the forward pass ($4N$ per token). For GPT-3 (175B params, 300B tokens), that is roughly $3.15 \times 10^{23}$ FLOPs."
Practice Problems
Problem 1: Architecture Walkthrough
You have a Transformer with $d_{model} = 1024$, $h$ attention heads, $d_{ff} = 4096$, $L = 24$ layers (decoder-only), $V = 50257$.
(a) Compute the total parameter count. (b) Compute the per-head dimension. (c) What is the KV cache size (in GB, float16) for a context length of 8192? (d) What is the approximate forward pass FLOP count for a 2048-token sequence?
Hint 1 -- Direction
Break down parameters by component: embedding, attention per layer, FFN per layer, LayerNorm per layer, output head. For KV cache: $2 \times L \times T \times d_{model} \times \text{bytes per element}$.
Hint 2 -- Insight
Per layer: attention = $4 d_{model}^2$ params, FFN = $2 d_{model} d_{ff}$ params (ignoring bias), LN = $2 \times 2 d_{model}$ params. Multiply by $L$ layers. Add embedding ($V \times d_{model}$) and final LN. If output weights are tied with embedding, do not double-count.
Hint 3 -- Full Solution + Rubric
(a) Parameter count:
| Component | Calculation | Parameters |
|---|---|---|
| Token embedding | 50257 x 1024 | 51.5M |
| Per-layer attention (QKV + O) | 4 x 1024^2 | 4.19M |
| Per-layer FFN (no bias) | 2 x 1024 x 4096 | 8.39M |
| Per-layer LayerNorm (x2) | 2 x 2 x 1024 | 4.1K |
| Total per layer | | 12.59M |
| All 24 layers | 24 x 12.59M | 302.1M |
| Final LayerNorm | 2 x 1024 | 2K |
| Output head (tied) | 0 | 0 |
| Total | | ~354M |
This is approximately GPT-2 Medium scale.
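The breakdown can be double-checked in code. This sketch follows the table's conventions (weights only: no biases and no learned position table), and `decoder_only_params` is a hypothetical helper name.

```python
def decoder_only_params(V, d_model, d_ff, n_layers, tied_output=True):
    """Weights only: no biases and no positional-embedding table, matching the table."""
    embedding = V * d_model
    per_layer = (4 * d_model**2          # Q, K, V, O projections
                 + 2 * d_model * d_ff    # two FFN matrices
                 + 2 * (2 * d_model))    # two LayerNorms (scale + shift)
    final_norm = 2 * d_model
    output_head = 0 if tied_output else V * d_model
    return embedding + n_layers * per_layer + final_norm + output_head

print(f"{decoder_only_params(50257, 1024, 4096, 24):,}")  # 353,553,408
```

Passing `tied_output=False` adds a second vocabulary-sized matrix, which at this scale is a noticeable share of the total.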
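(b) Per-head dimension:
$d_k = d_{model} / h = 1024 / h$ (for example, $1024 / 16 = 64$ with the 16 heads used by GPT-2 Medium)
(c) KV cache for T=8192:
$2 \times L \times T \times d_{model} \times 2$ bytes (float16)
$= 2 \times 24 \times 8192 \times 1024 \times 2 = 805{,}306{,}368$ bytes = ~0.75 GiB (about 0.81 GB)
(d) Forward FLOPs for T=2048:
Using the approximation: $2NT = 2 \times 354\text{M} \times 2048 \approx 1.45 \times 10^{12}$ FLOPs (~1.45 TFLOPs)
More precisely: $L \cdot (24Td^2 + 4T^2 d) \approx 1.65 \times 10^{12}$ FLOPs for the 24 layers, plus $2TdV \approx 0.21 \times 10^{12}$ FLOPs for the output logits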
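The two estimates in (d) can be compared directly. This sketch assumes a feed-forward width of four times the model dimension (true for this configuration), counts only matrix multiplications, and uses made-up helper names; the parameter total is the exact sum of the part (a) breakdown.

```python
def forward_flops_layers(n_layers, d, T):
    """Matmul FLOPs for T tokens through all layers, assuming d_ff = 4d."""
    attention = 8 * T * d * d + 4 * T * T * d  # projections + scores + weighted sum
    ffn = 16 * T * d * d                       # two linear layers
    return n_layers * (attention + ffn)

def output_head_flops(d, T, V):
    """Projecting every position onto the vocabulary."""
    return 2 * T * d * V

def forward_flops_approx(n_params, T):
    """The 2 * N * T rule of thumb."""
    return 2 * n_params * T

n_layers, d, T, V = 24, 1024, 2048, 50257
n_params = 353_553_408  # exact sum of the part (a) breakdown
print(f"approx: {forward_flops_approx(n_params, T):.3e}")
print(f"layers: {forward_flops_layers(n_layers, d, T):.3e}")
print(f"logits: {output_head_flops(d, T, V):.3e}")
```

The gap between the rule of thumb and the detailed count comes mostly from the attention-score term, which grows with the square of the sequence length.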
Scoring Rubric:
- Strong Hire: Correct parameter count with component breakdown, correct KV cache calculation, reasonable FLOP estimate, mentions weight tying.
- Lean Hire: Approximately correct total but misses some components or makes small errors in the breakdown.
- No Hire: Cannot set up the calculation or is off by more than 2x.
Problem 2: Pre-Norm vs Post-Norm Training
Your team trained a 1B parameter Transformer with post-norm. Training diverged at step 5000 despite using the Adam optimizer with momentum parameters $\beta_1$, $\beta_2$ and learning rate $\eta$.
(a) Diagnose the most likely cause. (b) Propose three fixes, ranked by likelihood of success. (c) If you switch to pre-norm, what changes would you expect in training dynamics?
Hint 1 -- Direction
Post-norm is known to be sensitive to learning rate and requires warmup. A 1B model with a high learning rate $\eta$ and no mentioned warmup is a classic recipe for divergence.
Hint 2 -- Insight
The most likely cause is missing or insufficient learning rate warmup. Post-norm has large gradient magnitudes in early training, and a high initial LR causes weight explosion. The three most effective fixes: (1) add LR warmup over 1-2K steps, (2) switch to pre-norm, (3) reduce initial LR. Switching to pre-norm is the most robust long-term fix.
Hint 3 -- Full Solution + Rubric
(a) Most likely cause: Missing learning rate warmup.
With post-norm, early training has large, unstable gradients because LayerNorm is on the residual path and the model weights are random. Applying the full learning rate $\eta$ from the first step, without warmup, causes the gradients to amplify through the LayerNorm, leading to weight explosion around step 5000.
Additional contributing factors:
- At 1B parameters, gradient norms are inherently larger (more parameters contributing to gradient)
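- With the common setting $\beta_2 = 0.999$, Adam's second-moment average takes ~1000 steps (about $1 / (1 - \beta_2)$) to stabilize the adaptive learning rate estimates, meaning the effective LR is poorly calibrated in early training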
(b) Three fixes, ranked:
1. Switch to pre-norm (highest confidence). This structurally eliminates the gradient instability from LayerNorm on the residual path. Pre-norm models are robust to a much wider range of learning rates and rarely diverge. Cost: architecture change, full retrain.
2. Add learning rate warmup (quick fix). Warm up linearly from near zero to the target learning rate $\eta$ over 2000 steps. This lets the Adam statistics stabilize before applying large updates. Cost: hyperparameter change, can resume from earlier checkpoint.
3. Reduce the peak learning rate, with warmup. Even with warmup, the original learning rate $\eta$ may be too high for a 1B post-norm model. Many post-norm training recipes use lower peak LR than pre-norm equivalents.
(c) Expected changes with pre-norm:
- Training becomes stable without warmup (though warmup still helps)
- Can often use a higher peak learning rate than the post-norm model tolerated
- Loss curve is smoother with fewer spikes
- Final performance may be slightly different (some studies show post-norm converges to slightly better solutions when it successfully trains, but pre-norm is more reliable)
- Gradient norms are more uniform across layers (at initialization, post-norm has larger gradients in the layers nearest the output)
Scoring Rubric:
- Strong Hire: Correctly identifies warmup as the primary issue, recommends pre-norm as the structural fix, explains the gradient flow difference between pre-norm and post-norm, mentions Adam warmup interaction.
- Lean Hire: Identifies the learning rate as too high but does not connect to the pre-norm/post-norm distinction.
- No Hire: Suggests reducing model size or adding dropout as the primary fix without addressing the normalization architecture.
Problem 3: Positional Encoding Design
You are designing a Transformer for processing DNA sequences (alphabet: A, C, G, T). Sequences are 10,000-100,000 base pairs long. Which positional encoding would you choose and why?
Hint 1 -- Direction
DNA sequences are extremely long (10K-100K), and biological properties can depend on relative positions (e.g., distance between a promoter and a gene). Consider which encodings handle long sequences and capture relative position naturally.
Hint 2 -- Insight
Learned positional encodings are out - you would need 100K position embeddings and cannot extrapolate. Sinusoidal can define positions for any length but models still struggle at unseen lengths. RoPE with NTK-aware scaling is the best choice: it natively captures relative position (critical for biology), has no learned parameters, and can be extended to longer sequences than seen in training. ALiBi is also a strong contender because it naturally handles extrapolation.
Hint 3 -- Full Solution + Rubric
Recommendation: RoPE with NTK-aware scaling, combined with sliding window attention.
Why RoPE:
- Relative position matters in biology. The distance between two nucleotides often influences how strongly they interact. RoPE's attention scores naturally depend on relative position $m - n$.
- No learned parameters. With 100K positions, learned embeddings would add 100K x $d_{model}$ unnecessary parameters.
- Extrapolation. NTK-aware RoPE scaling allows training on shorter sequences (e.g., 16K) and inference on longer ones (e.g., 100K).
Why also sliding window attention:
- Full attention over 100K tokens requires $(10^5)^2 = 10^{10}$ entries per head - prohibitively expensive
- Biology often has local structure (codon-level, ~3bp) and medium-range structure (gene-level, ~1-10Kbp)
- Sliding window of ~4K captures local structure; add global tokens every 1K positions for long-range
- This reduces memory from $O(T^2)$ to $O(T \cdot W)$
Alternative: ALiBi
- Also handles long sequences natively (attention decays linearly with distance)
- The linear decay matches biological intuition: nearby nucleotides interact more strongly
- Simpler to implement than RoPE scaling
- Potential downside: the fixed linear decay pattern may not match all biological distance relationships
What NOT to use:
- Learned positional embeddings: cannot extrapolate, wasteful parameters
- Sinusoidal (without modification): while defined for all positions, models trained on 16K rarely generalize to 100K
- No positional encoding: DNA sequence order is critical (ACGT and TGCA are very different)
Scoring Rubric:
- Strong Hire: Chooses RoPE or ALiBi with clear justification for relative position and extrapolation, addresses the computational challenge of long sequences with efficient attention, connects to biological domain knowledge.
- Lean Hire: Chooses an appropriate encoding but does not address the long-sequence computational challenge.
- No Hire: Chooses learned positional embeddings for 100K positions or does not address extrapolation.
Problem 4: KV Cache Optimization
You are serving a 70B parameter LLM (80 layers, $d_{model} = 8192$, 64 query heads, 8 KV heads via GQA) on 8 x A100 GPUs (80GB each).
(a) What is the KV cache memory per user for a 32K context length? (b) How many concurrent users can you serve? (c) Propose two strategies to increase concurrent users by 4x.
Hint 1 -- Direction
KV cache per user = $2 \times L \times T \times n_{kv} \times d_k \times \text{bytes per element}$. Note GQA: 8 KV heads, not 64. Total GPU memory minus model weights gives you the cache budget.
Hint 2 -- Insight
The 70B model in float16 uses ~140GB. Across 8 A100s (640GB total), that leaves ~500GB for KV cache and overhead. With GQA (8 KV heads instead of 64), the KV cache per user for 32K context should be significantly smaller than full MHA. Strategies to increase users: quantize KV cache to int8 (2x), use PagedAttention to reduce memory fragmentation (1.5-2x), reduce context window with sliding window attention.
Hint 3 -- Full Solution + Rubric
(a) KV cache per user:
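Per layer: $2 \times T \times n_{kv} \times d_k \times 2$ bytes
Breaking it down:
- KV heads = 8 (GQA)
- Head dimension: $d_k = 8192 / 64 = 128$
- Per layer: $2 \times 32768 \times 8 \times 128 \times 2$ bytes = 128 MB
- 80 layers: $80 \times 128$ MB = 10.24 GB per user
Without GQA (64 KV heads): would be $8 \times 10.24 = 81.92$ GB per user - illustrating why GQA is critical for serving.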
(b) Concurrent users:
Model weights: 70B params x 2 bytes = 140 GB (spread across 8 GPUs)
Total GPU memory: 8 x 80 = 640 GB
Available for KV cache: ~640 - 140 - 50 (overhead) = 450 GB
Users: $450 / 10.24 \approx 43$ concurrent users
(c) Strategies to increase to ~172 users (4x):
Strategy 1: Quantize KV cache to INT8 (2x improvement)
- Store K and V values in 8-bit integers instead of float16
- Reduces per-user cache from 10.24 GB to 5.12 GB
- Quality impact: minimal (KV values have limited dynamic range)
- New capacity: ~88 users
Strategy 2: PagedAttention / vLLM (1.5-2x improvement)
- Standard KV cache pre-allocates max context per user, wasting memory when actual context is shorter
- PagedAttention allocates cache in pages (like OS virtual memory), only allocating what is actually used
- If average context is 50% of max, this doubles effective capacity
- Combined with INT8: ~130-175 users
Other strategies:
- Sliding window attention: Limit KV cache to last $W$ tokens per layer. With $W = 4096$ (instead of 32K), reduces cache by 8x.
- KV cache eviction: Evict low-attention keys (Heavy-Hitter Oracle, H2O). Keeps top-$k$ most attended positions.
- Model quantization: Quantize model weights to INT4 (from 140GB to 35GB), freeing ~105GB more for cache.
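The capacity estimates above are straightforward to tabulate. This sketch takes the memory figures from the solution as inputs (in GB) and floors the result, so it can land one user below the rounded figures in the text; the 50% average cache fill for paging is the same assumption made in Strategy 2, and the function name is made up.

```python
import math

def concurrent_users(total_gb, weights_gb, overhead_gb, cache_gb_per_user):
    """Whole users that fit in the memory left after weights and overhead."""
    budget = total_gb - weights_gb - overhead_gb
    return math.floor(budget / cache_gb_per_user)

baseline = concurrent_users(640, 140, 50, 10.24)            # float16 cache
int8_cache = concurrent_users(640, 140, 50, 10.24 / 2)      # INT8 cache
int8_paged = concurrent_users(640, 140, 50, 10.24 / 2 / 2)  # INT8 + paging at 50% average fill
int4_weights = concurrent_users(640, 35, 50, 10.24)         # INT4 weights, float16 cache
print(baseline, int8_cache, int8_paged, int4_weights)
```

Comparing the last value with the others shows that shrinking the cache per user moves capacity far more than shrinking the weights does.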
Scoring Rubric:
- Strong Hire: Correct KV cache calculation with GQA, reasonable concurrent user estimate, proposes INT8 cache quantization AND PagedAttention with quantified improvement estimates, mentions GQA benefit.
- Lean Hire: Correct calculation but only suggests one optimization strategy.
- No Hire: Incorrect KV cache calculation (forgets GQA or gets dimensions wrong) or suggests only "use a bigger GPU."
Interview Cheat Sheet
| Concept | Key Formula / Fact | One-Liner | Red Flag |
|---|---|---|---|
| Transformer blocks | Self-attn + FFN + residual + norm | Parallel processing of all positions | "Transformers use RNNs internally" |
| Position encoding | Sinusoidal / Learned / RoPE | Without it, Transformer is permutation-equivariant | "Position encoding is optional" |
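| Pre-norm vs post-norm | Pre: $x + \text{Sublayer}(\text{LN}(x))$; Post: $\text{LN}(x + \text{Sublayer}(x))$ | Pre-norm has clean gradient path | "They are equivalent" |
| FFN | $\max(0, xW_1 + b_1)W_2 + b_2$; modern: SwiGLU | 2/3 of layer parameters | "FFN is just a small detail" |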
| BERT vs GPT | Bidirectional vs causal masking | Both use attention, different masks | "BERT is always better for understanding" |
| Flash Attention | Tiled attention, $O(T)$ memory | Same FLOPs, less memory, faster | "Flash Attention is approximate" |
| KV Cache | Store K,V for previous tokens | Avoids recomputation during generation | "KV cache stores attention weights" |
| GQA | Shared KV heads | Reduces KV cache by a factor of $h / g$ ($h$ query heads, $g$ KV heads) | "GQA reduces model quality significantly" |
| Sliding window | Attend to last $W$ positions | $O(T \cdot W)$ instead of $O(T^2)$ | "Sliding window sees full context" |
| Parameter count | $\approx 12 L d_{model}^2 + V d_{model}$ for decoder-only | FFN dominates at scale | Off by more than 2x |
Spaced Repetition Checkpoints
Day 0 -- Initial Learning
- Read this entire page
- Draw the full Transformer architecture from memory (encoder + decoder)
- Walk through the forward pass for a 4-token sequence, writing dimensions at each step
- Complete the self-assessment
Day 3 -- First Recall
- Without notes, explain pre-norm vs post-norm and why pre-norm is preferred
- Give the "60-Second Answer" for the full Transformer, out loud, timed
- Write the sinusoidal positional encoding formula and explain why sinusoidal
Day 7 -- Connections
- Compare BERT, GPT, and T5 architectures (attention masks, training objectives, use cases)
- Do Practice Problem 1 (parameter count + KV cache) on paper without hints
- Explain Flash Attention to an imaginary interviewer (tiling, online softmax, memory savings)
Day 14 -- Application
- Do Practice Problem 4 (KV cache optimization) under timed conditions (12 minutes)
- Compute the parameter count for LLaMA-7B from its config (32 layers, 4096 dim, 32 heads)
- Explain the modern LLM recipe and justify each design choice
Day 21 -- Mock Interview
- Have someone ask: "Draw the Transformer, walk through a forward pass with dimensions, then explain how you would optimize it for serving at scale"
- Time yourself: architecture drawing in under 3 minutes, forward pass in under 5 minutes, optimization discussion in under 5 minutes
- Do all 4 practice problems in sequence under timed conditions (45 minutes total)
Key Takeaways
- The Transformer is fundamentally a stack of two operations: attention and feed-forward. Everything else - residual connections, normalization, positional encoding - is scaffolding that makes these two operations trainable and effective. Understanding the architecture means understanding why each piece of scaffolding exists.
- Pre-norm with RMSNorm is the modern standard for a reason. It provides a clean gradient path through the residual stream, reduces the need for careful warmup scheduling, and is computationally cheaper than full LayerNorm. If you design a new Transformer, use pre-norm.
- Decoder-only won because of simplicity and scaling. While encoder-decoder models can be more parameter-efficient for some tasks, the simplicity of a single decoder stack - one architecture for understanding and generation - made it the foundation of GPT-4, Claude, and the modern LLM paradigm.
- Efficiency techniques are not optional knowledge. Flash Attention, KV caching, GQA, and sliding window attention are the difference between a Transformer that works on paper and one that can actually serve millions of users. Interviewers at production-focused companies increasingly weight this knowledge heavily.
