Transformer Internals for LLMs - The Architecture That Powers Everything
Reading time: ~50 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Eng, LLM Eng, ML Infra Eng
The Real Interview Moment
You are in a Google DeepMind interview. The interviewer opens with: "Let us say we are designing a new 70B parameter decoder-only model with 80 layers, a hidden dimension of 8192, and 64 attention heads. Walk me through the architecture - what is GQA and why would we use it? How much KV cache memory do we need for a single sequence of 128K tokens in fp16? How many FLOPs does a forward pass take?"
You start writing on the whiteboard. She interrupts after your first equation: "Wait - you wrote the attention formula. But modern LLMs do not use vanilla attention. What positional encoding scheme are we using, and why did the field move away from learned absolute embeddings?"
This is a standard Tier 1 interview question. The interviewer is not testing whether you have heard of Transformers - everyone has. She is testing whether you understand the specific architectural choices that make modern LLMs work, and whether you can reason quantitatively about memory, compute, and tradeoffs.
Candidates who describe the original 2017 Transformer and stop there get a "no hire." Candidates who can walk through a modern decoder-only architecture with RoPE, GQA, SwiGLU, and RMSNorm - with exact dimension calculations - get a "strong hire."
What You Will Master
- Draw a modern decoder-only Transformer block with all sub-components labeled
- Explain why LLMs use decoder-only architecture instead of encoder-decoder
- Derive causal attention masking and its role in autoregressive generation
- Compare RoPE, ALiBi, and learned positional encodings with mathematical precision
- Calculate KV cache memory requirements for any model configuration
- Explain GQA, MQA, and MHA with memory and quality tradeoffs
- Describe SwiGLU FFN and why it replaced standard ReLU FFN
- Justify RMSNorm over LayerNorm and pre-norm over post-norm
- Count parameters and estimate FLOPs for any Transformer configuration
- Compare architectures of GPT-4, LLaMA 3, Gemini, Claude, and Mistral
Self-Assessment: Where Are You Now?
| Skill | 1 -- Cannot | 2 -- Vaguely | 3 -- Can Explain | 4 -- Can Derive | 5 -- Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Draw a modern decoder-only block | | | | | | ___ |
| Explain causal masking mathematically | | | | | | ___ |
| Derive RoPE positional encoding | | | | | | ___ |
| Compare GQA/MQA/MHA with numbers | | | | | | ___ |
| Calculate KV cache memory | | | | | | ___ |
| Explain SwiGLU activation | | | | | | ___ |
| RMSNorm vs LayerNorm difference | | | | | | ___ |
| Count parameters for a given config | | | | | | ___ |
| Estimate forward pass FLOPs | | | | | | ___ |
| Compare GPT-4 / LLaMA 3 / Mistral architectures | | | | | | ___ |
Target: All 4s and 5s before your interview.
Part 1 - The Modern Decoder-Only Architecture
Why Decoder-Only?
The original Transformer (Vaswani et al., 2017) had both encoder and decoder stacks. Modern LLMs overwhelmingly use decoder-only architecture. Here is why:
| Factor | Encoder-Decoder | Decoder-Only |
|---|---|---|
| Task generality | Best for seq2seq (translation) | Handles all tasks via prompting |
| Training simplicity | Two separate stacks to train | Single uniform stack |
| Scaling behavior | Parameters split across stacks | All parameters serve generation |
| In-context learning | Limited by encoder bottleneck | Naturally handles long contexts |
| KV cache | Separate caches for encoder/decoder | Single unified cache |
"Why decoder-only?" is a warm-up question. The real test is whether you understand that decoder-only models can still do "encoding" - the prefix tokens in a prompt serve as a pseudo-encoder. The key architectural difference is causal masking, not the absence of encoding capability.
The Modern LLM Block
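A single Transformer block in a modern LLM (circa LLaMA 3, 2024) looks like this:
x → RMSNorm → GQA attention (RoPE applied to Q and K) → Add(x, ·) → RMSNorm → SwiGLU FFN → Add(·, ·)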
Key differences from the original Transformer:
| Component | Original (2017) | Modern LLM (2024+) |
|---|---|---|
| Normalization | Post-norm LayerNorm | Pre-norm RMSNorm |
| Positional encoding | Sinusoidal (additive) | RoPE (multiplicative, relative) |
| Attention | Multi-Head Attention | Grouped Query Attention |
| Feed-forward | ReLU with 4x expansion | SwiGLU with 8/3x expansion |
| Architecture | Encoder-Decoder | Decoder-only |
Causal Masking
In a decoder-only model, each token can only attend to itself and all preceding tokens. This is enforced by a causal mask (also called the "look-ahead mask"):
$$M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$$
The masked attention computation becomes:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V$$
The $-\infty$ entries become 0 after softmax, effectively preventing information flow from future tokens.
Candidates often say "the mask is a triangular matrix of 0s and 1s." This is conceptually correct but technically wrong - the mask uses $-\infty$ (or a very large negative number like $-10^{9}$) which is added to the attention logits before softmax. A mask of 0s and 1s would have to be multiplied into the attention weights after softmax and then renormalized, which is a different (less numerically stable) approach. Interviewers notice this distinction.
Why causal masking enables training parallelism: During training, all positions in a sequence are computed simultaneously. The causal mask ensures position $i$ does not "see" positions $j > i$ even though they are present in the same sequence. This is why Transformer training is much faster than RNN training - the entire sequence is processed in one forward pass.
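The additive mask can be checked on a toy example. The following is a minimal pure-Python sketch, not a production kernel; the helper names (`causal_mask`, `softmax`, `masked_attention_weights`) and the toy logits are illustrative assumptions.

```python
import math

def causal_mask(n):
    """n x n additive mask: 0.0 where j <= i, -inf where j > i."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to exactly 0.0."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores):
    """Add the causal mask to the raw logits, then softmax each row."""
    n = len(scores)
    mask = causal_mask(n)
    return [softmax([s + m for s, m in zip(scores[i], mask[i])]) for i in range(n)]

# Toy logits for a 3-token sequence (illustrative values only).
scores = [[1.0, 2.0, 3.0],
          [0.5, 0.5, 0.5],
          [2.0, 0.0, 1.0]]
weights = masked_attention_weights(scores)
```

Every row of `weights` still sums to 1, and every entry above the diagonal is exactly zero, so the first token attends only to itself even though its raw logits for later tokens were larger.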
Part 2 - Positional Encodings: From Sinusoidal to RoPE
The Position Problem
Self-attention is permutation-equivariant - without positional information, it treats "the cat sat on the mat" and "mat the on sat cat the" identically. We need to inject position information.
Sinusoidal (Original, 2017)
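The original Transformer used fixed sinusoidal embeddings added to token embeddings:
$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$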
Limitations for LLMs:
- Additive: position info gets diluted through deep networks
- Absolute: does not generalize well to sequences longer than training length
- Not relative: attention scores do not directly encode relative distance
Learned Absolute Embeddings (GPT-2 era)
Replace fixed sinusoidal with a learnable embedding table of shape (max_seq_len, d).
Limitations:
- Hard ceiling on context length - cannot extrapolate beyond training length
- Still absolute, not relative
RoPE - Rotary Position Embedding (Su et al., 2021)
RoPE is used by LLaMA, Mistral, Qwen, and most modern open-source LLMs. It encodes position by rotating the query and key vectors in 2D subspaces.
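Core idea: Group the dimensions of $q$ and $k$ into pairs and rotate each pair by an angle proportional to the position:
$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}$$
where $\theta_i = 10000^{-2i/d}$, $d$ is the dimension being rotated (the head dimension), and $m$ is the position index.
Why RoPE is elegant: The dot product after RoPE application depends only on the relative position $m - n$, not on $m$ and $n$ individually. This is because:
$$\langle R_m q,\; R_n k \rangle = q^\top R_m^\top R_n\, k = q^\top R_{n-m}\, k$$
The rotation of $q$ at position $m$ and $k$ at position $n$ produces an inner product that is a function of the relative distance $m - n$ - exactly the property we want for attention.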
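The relative-position property is easy to verify numerically. The sketch below assumes the adjacent-pair layout $(x_{2i}, x_{2i+1})$ used in the formula above (some implementations pair dimension $i$ with $i + d/2$ instead); the function names and the toy vectors are illustrative.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]   # toy query (illustrative values)
k = [1.1, 0.4, -0.6, 0.9]   # toy key (illustrative values)

score_near = dot(rope(q, 5), rope(k, 2))       # positions 5 and 2, distance 3
score_far = dot(rope(q, 105), rope(k, 102))    # positions 105 and 102, distance 3
score_other = dot(rope(q, 5), rope(k, 4))      # distance 1
```

`score_near` and `score_far` agree up to floating-point error because both pairs are three positions apart, while `score_other` differs. The rotation also leaves vector norms unchanged, which is why RoPE does not distort attention magnitudes.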
RoPE advantages for LLMs:
- Relative: attention inherently captures relative position
- Multiplicative: position information is deeply intertwined with content, not easily washed away
- Extensible: with frequency scaling (e.g., YaRN, NTK-aware scaling), RoPE can extrapolate to longer contexts than seen during training
- No extra parameters: the rotation angles are deterministic, not learned
"RoPE encodes position by rotating query and key vectors in 2D subspaces. Each pair of dimensions gets rotated by an angle proportional to the position, with different frequencies for different dimension pairs. The key property is that the dot product between a rotated query at position $m$ and a rotated key at position $n$ depends only on the relative distance $m - n$. This gives us relative positional encoding without any additional parameters, and it can be extended to longer contexts through frequency scaling techniques like YaRN."
ALiBi - Attention with Linear Biases
ALiBi (Press et al., 2022) takes a different approach: it adds a linear bias to attention scores based on distance:
$$\text{score}_{ij} = \frac{q_i^\top k_j}{\sqrt{d_k}} - m_h\,(i - j), \qquad j \le i$$
where $m_h$ is a head-specific slope. Heads with steeper slopes focus on local context; gentler slopes attend globally.
ALiBi vs RoPE tradeoff:
- ALiBi: simpler, better length extrapolation, but attention scores are biased (penalizes long-range attention)
- RoPE: more expressive, preserves attention magnitude, but needs explicit scaling for length extension
Most 2024-2026 models use RoPE. ALiBi saw adoption in BLOOM and some MPT models but fell out of favor.
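A small sketch of the ALiBi bias, assuming the geometric slope schedule described in the ALiBi paper for power-of-two head counts (slopes $2^{-8/n}, 2^{-16/n}, \ldots$ for $n$ heads). The function names are hypothetical, and the causal part of the mask is folded into the same matrix for convenience.

```python
import math

def alibi_slopes(n_heads):
    """Head-specific slopes: a geometric sequence starting at 2^(-8/n_heads)."""
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    """Additive bias: -slope * (i - j) for j <= i, and -inf for future positions."""
    return [[-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
            for i in range(seq_len)]

slopes = alibi_slopes(8)             # 1/2, 1/4, ..., 1/256
bias = alibi_bias(3, slopes[0])      # steepest head: strongest locality preference
```

The bias is added to the attention logits before softmax. The head with slope 1/2 penalizes a key two positions back by 1.0, while the head with slope 1/256 barely penalizes distance at all.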
Comparison Table
| Method | Type | Relative | Extra Params | Length Extension | Used By |
|---|---|---|---|---|---|
| Sinusoidal | Additive, absolute | No | None | Poor | Original Transformer |
| Learned | Additive, absolute | No | max_seq_len × d | None (hard limit) | GPT-2 |
| RoPE | Multiplicative, relative | Yes | None | Good (with scaling) | LLaMA, Mistral, Qwen |
| ALiBi | Additive bias, relative | Yes | None | Good (native) | BLOOM, MPT |
Part 3 - Grouped Query Attention (GQA)
The KV Cache Problem
During autoregressive generation, each new token needs to attend to all previous tokens. Recomputing all keys and values at every step is wasteful, so we cache them - the KV cache.
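KV cache size per token per layer:
$$\text{bytes per token per layer} = 2 \times n_{kv} \times d_{head} \times \text{bytes per element}$$
For a model with $d = 8192$, 64 heads (so $d_{head} = 8192 / 64 = 128$), 80 layers, at fp16 (2 bytes):
$$2 \times 64 \times 128 \times 2 \times 80 = 2{,}621{,}440 \text{ bytes} = 2.5 \text{ MB per token}$$
For 128K context length: $2.5 \text{ MB} \times 131{,}072 = 320$ GB (binary units, 1 GB $= 2^{30}$ bytes). Just for the KV cache of one sequence!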
This is the motivation for reducing the number of KV heads.
MHA, MQA, and GQA
| Variant | Q Heads | KV Heads | KV Cache Ratio | Quality Impact |
|---|---|---|---|---|
| MHA (original) | $n_h$ | $n_h$ | 1x (baseline) | Best quality |
| MQA (Shazeer, 2019) | $n_h$ | 1 | $1/n_h$ | Noticeable degradation |
| GQA (Ainslie et al., 2023) | $n_h$ | $g$ | $g/n_h$ | Minimal degradation |
GQA groups query heads and shares a single KV head per group. With $n_h$ query heads and $g$ groups, you get $g$ KV heads. The KV cache is now $g/n_h$ of MHA.
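Revised KV cache for our 70B example with GQA (8 KV heads instead of 64):
$$2 \times 8 \times 128 \times 2 \times 80 = 327{,}680 \text{ bytes} = 320 \text{ KB per token}$$
For 128K context: $320 \text{ KB} \times 131{,}072 = 40$ GB. Down from 320 GB - an 8x reduction.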
"GQA groups multiple query heads to share a single set of key-value heads. If you have 64 query heads and 8 KV head groups, each group of 8 query heads shares one KV head. This reduces KV cache memory by 8x with minimal quality loss. LLaMA 3 70B uses GQA with 8 KV heads for its 64 query heads. The key insight is that nearby query heads learn similar attention patterns, so sharing KV projections is a reasonable approximation."
Do not confuse the GQA group count with the group size. If someone says "GQA with 8 groups and 64 query heads," the number of KV heads is 8 (one per group), and each group contains $64 / 8 = 8$ query heads. Some papers and model configs state the number of KV heads directly instead. Clarify which convention is being used.
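The head-sharing rule can be written as a one-line index mapping. This sketch assumes contiguous grouping (query heads 0-7 share KV head 0, and so on), which is a common layout but an assumption here; the function name is hypothetical.

```python
def kv_head_for_query_head(q_head, n_heads, n_kv_heads):
    """Index of the KV head that a given query head reads from (contiguous groups)."""
    assert n_heads % n_kv_heads == 0, "query heads must divide evenly into groups"
    group_size = n_heads // n_kv_heads
    return q_head // group_size

# 64 query heads sharing 8 KV heads, as in the 70B example.
mapping = [kv_head_for_query_head(h, 64, 8) for h in range(64)]
```

With `n_kv_heads == n_heads` the mapping is the identity (MHA), and with `n_kv_heads == 1` every query head maps to KV head 0 (MQA), so the same function covers all three variants.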
KV Cache Memory Formula (Memorize This)
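$$\text{KV cache bytes} = 2 \times L \times n_{kv} \times d_{head} \times s \times \text{bytes per element}$$
Here $L$ is the number of layers and $s$ is the sequence length. The factor of 2 is for K and V separately. For batch size $B$, multiply by $B$.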
Quick reference for popular models:
| Model | Layers | KV Heads | $d_{head}$ | KV Cache per Token (fp16) | 128K Context |
|---|---|---|---|---|---|
| LLaMA 3 8B | 32 | 8 | 128 | 128 KB | 16 GB |
| LLaMA 3 70B | 80 | 8 | 128 | 320 KB | 40 GB |
| Mistral 7B | 32 | 8 | 128 | 128 KB | 16 GB |
| GPT-3 175B (MHA) | 96 | 96 | 128 | 4.5 MB | 576 GB |
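The table rows can be reproduced with a few lines of arithmetic. This is a sketch of the formula above; `kv_cache_bytes` is a hypothetical helper, and sizes are reported in binary units (1 KB = 1024 bytes), which is what the 128 KB and 40 GB figures assume.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per_elem=2):
    """2 (K and V) x layers x KV heads x head dim x tokens x batch x element size."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem

KIB = 1024
GIB = 1024 ** 3
CTX_128K = 128 * 1024

llama3_70b_per_token = kv_cache_bytes(80, 8, 128, 1)
llama3_70b_128k = kv_cache_bytes(80, 8, 128, CTX_128K)
mha_70b_128k = kv_cache_bytes(80, 64, 128, CTX_128K)   # same model without GQA

print(llama3_70b_per_token / KIB, "KB per token")
print(llama3_70b_128k / GIB, "GB at 128K with GQA")
print(mha_70b_128k / GIB, "GB at 128K with MHA")
```

Swapping in a different config, batch size, or `bytes_per_elem=1` for an 8-bit cache gives the corresponding budget directly.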
Part 4 - SwiGLU Feed-Forward Network
The Standard FFN (Original Transformer)
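The original feed-forward network:
$$\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)\,W_2 + b_2$$
with $W_1 \in \mathbb{R}^{d \times 4d}$ and $W_2 \in \mathbb{R}^{4d \times d}$. The intermediate dimension is $4d$.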
SwiGLU (Shazeer, 2020)
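Modern LLMs replace ReLU FFN with SwiGLU:
$$\text{SwiGLU}(x) = \left(\text{Swish}(xW_{gate}) \odot xW_{up}\right) W_{down}$$
where $\text{Swish}(z) = z \cdot \sigma(z)$ (also called SiLU) and $\odot$ is element-wise multiplication.
Three weight matrices instead of two:
- $W_{gate} \in \mathbb{R}^{d \times d_{ff}}$ - produces gating signal
- $W_{up} \in \mathbb{R}^{d \times d_{ff}}$ - produces value signal
- $W_{down} \in \mathbb{R}^{d_{ff} \times d}$ - projects back down
To keep the total parameter count comparable to the standard FFN (which has $2 \times d \times 4d = 8d^2$ parameters), the intermediate dimension is set to $d_{ff} = \frac{8}{3}d$ (often rounded to a multiple of 256):
$$3 \times d \times \frac{8}{3}d = 8d^2$$
Same parameter budget as standard FFN, but consistently better performance across benchmarks.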
Why SwiGLU works better: The gating mechanism allows the network to selectively pass information. The Swish activation is smooth (unlike ReLU which has a hard zero), which helps with gradient flow. Empirically, SwiGLU consistently outperforms ReLU and GELU FFNs at the same parameter count (Shazeer, 2020).
If asked "Why SwiGLU instead of ReLU?" do NOT just say "it works better." Explain: (1) the gating mechanism provides multiplicative interaction that is more expressive, (2) Swish is smooth so gradients flow better at zero, (3) the three-matrix formulation with intermediate dimension $\frac{8}{3}d$ matches the parameter budget of two-matrix ReLU FFN. Then mention that this was validated empirically in PaLM and LLaMA.
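To make the three-matrix structure concrete, here is a pure-Python sketch of a SwiGLU forward pass on a single vector, plus a helper for the $\frac{8}{3}d$ sizing. All names are illustrative, and rounding up (rather than to the nearest multiple) is an assumption about the rounding convention.

```python
import math

def silu(z):
    """Swish / SiLU: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(x, w):
    """Row vector x (length d_in) times matrix w (d_in x d_out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """(Swish(x W_gate) * (x W_up)) W_down, with * taken element-wise."""
    gate = [silu(g) for g in matvec(x, w_gate)]
    up = matvec(x, w_up)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(hidden, w_down)

def swiglu_hidden_dim(d, multiple_of=256):
    """Floor of 8d/3, rounded up to the next multiple of `multiple_of`."""
    raw = (8 * d) // 3
    return multiple_of * ((raw + multiple_of - 1) // multiple_of)

# Tiny example: d = 2, d_ff = 1 (weights chosen by hand for readability).
out = swiglu_ffn([1.0, 3.0], [[1.0], [0.0]], [[0.0], [1.0]], [[2.0, 0.0]])
```

In the tiny example the gate sees the first input coordinate and the up projection sees the second, so the output is `2 * silu(1.0) * 3.0` in the first slot. When the gate pre-activation is zero, the gated unit passes nothing through regardless of the up projection.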
Part 5 - RMSNorm and Pre-Norm Architecture
LayerNorm vs RMSNorm
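LayerNorm (Ba et al., 2016) normalizes across the feature dimension:
$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
where $\mu$ and $\sigma^2$ are the mean and variance of $x$ across features.
RMSNorm (Zhang and Sennrich, 2019) drops the mean centering:
$$\text{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}$$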
Why RMSNorm for LLMs:
| Factor | LayerNorm | RMSNorm |
|---|---|---|
| Operations | Mean, variance, normalize, scale, shift | RMS, normalize, scale |
| Parameters | $\gamma$ (scale) + $\beta$ (shift) | $\gamma$ (scale) only |
| Compute | Slightly more (mean subtraction) | ~15% faster |
| Quality | Baseline | Equivalent or better |
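The mean subtraction in LayerNorm appears to be unnecessary in practice - Zhang and Sennrich argue that the benefit comes from re-scaling invariance rather than re-centering, and the learned scale ($\gamma$) is enough to compensate. Dropping the mean simplifies the operation and is slightly faster, which matters when you have 80+ layers and billions of tokens.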
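The two normalizations differ only in the mean subtraction and the shift. The sketch below implements both for a single feature vector; the function names and the `eps` default are illustrative choices.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-6):
    """Subtract the mean, divide by the standard deviation, then scale and shift."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [g * (v - mu) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-6):
    """Divide by the root mean square, then scale. No mean, no shift."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

ones, zeros = [1.0] * 4, [0.0] * 4
centered = [1.0, -1.0, 2.0, -2.0]    # mean is already zero
shifted = [11.0, 9.0, 12.0, 8.0]     # same values plus an offset of 10

ln_centered = layer_norm(centered, ones, zeros)
rms_centered = rms_norm(centered, ones)
ln_shifted = layer_norm(shifted, ones, zeros)
rms_shifted = rms_norm(shifted, ones)
```

On zero-mean input the two functions return the same result. On the shifted input LayerNorm removes the offset while RMSNorm keeps it, which is exactly the re-centering step that RMSNorm gives up.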
Pre-Norm vs Post-Norm
Post-norm (original Transformer):
x → Attention → Add(x, ·) → LayerNorm → FFN → Add(·, ·) → LayerNorm
Pre-norm (modern LLMs):
x → RMSNorm → Attention → Add(x, ·) → RMSNorm → FFN → Add(·, ·)
Why pre-norm is critical for deep models:
In pre-norm, the residual stream flows unimpeded from input to output. Each sublayer reads a normalized copy of the stream and adds its output back, so the identity path itself is never rescaled by a normalization layer. This means:
- Stable gradients: The gradient flows directly through the residual connections without passing through normalization layers
- Less dependence on warmup: Post-norm typically needs careful learning rate warmup to avoid early divergence; pre-norm is far more forgiving, although most LLM training runs still use a short warmup
- Better scaling: Models with 100+ layers train reliably with pre-norm
If you draw the Transformer with post-norm and claim that is what GPT or LLaMA uses, the interviewer will question your practical knowledge. Every major LLM since GPT-3 uses pre-norm. This is a basic factual check.
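The "unimpeded residual stream" claim can be demonstrated with stand-in sublayers. In this sketch both variants use an unscaled RMSNorm for brevity (the original post-norm design used LayerNorm), and `silent` is a hypothetical sublayer that contributes nothing.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned scale, to keep the sketch small."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def pre_norm_block(x, attn, ffn):
    """x -> x + attn(norm(x)) -> h + ffn(norm(h))."""
    h = add(x, attn(rms_norm(x)))
    return add(h, ffn(rms_norm(h)))

def post_norm_block(x, attn, ffn):
    """x -> norm(x + attn(x)) -> norm(h + ffn(h))."""
    h = rms_norm(add(x, attn(x)))
    return rms_norm(add(h, ffn(h)))

def silent(v):
    """Stand-in sublayer whose output is all zeros."""
    return [0.0] * len(v)

x = [3.0, 4.0]
pre_out = pre_norm_block(x, silent, silent)     # identity path: x comes back unchanged
post_out = post_norm_block(x, silent, silent)   # x has been rescaled by the norms
```

With silent sublayers the pre-norm block is an exact identity, so a stack of such blocks passes the input (and its gradient) straight through. The post-norm block rescales the stream at every layer even when the sublayers do nothing.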
Part 6 - Parameter Counting
The Complete Parameter Count
For a decoder-only Transformer with:
- $L$ layers, hidden dimension $d$, vocabulary size $V$
- $n_h$ query heads, $n_{kv}$ KV heads, head dimension $d_{head}$
- SwiGLU FFN with intermediate dimension $d_{ff}$
Per-layer parameters:
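| Component | Parameters | Notes |
|---|---|---|
| Q projection | $d \times n_h d_{head}$ | |
| K projection | $d \times n_{kv} d_{head}$ | Reduced for GQA |
| V projection | $d \times n_{kv} d_{head}$ | Same as K |
| Output projection | $n_h d_{head} \times d$ | |
| RMSNorm (attention) | $d$ | Scale parameter only |
| Gate projection (SwiGLU) | $d \times d_{ff}$ | |
| Up projection (SwiGLU) | $d \times d_{ff}$ | |
| Down projection (SwiGLU) | $d_{ff} \times d$ | |
| RMSNorm (FFN) | $d$ | |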
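Total attention params per layer:
$$P_{attn} = 2\,d\,n_h d_{head} + 2\,d\,n_{kv} d_{head}$$
For MHA ($n_{kv} = n_h$, with $n_h d_{head} = d$):
$$P_{attn} = 4d^2$$
For GQA ($n_{kv} < n_h$):
$$P_{attn} = 2d^2\left(1 + \frac{n_{kv}}{n_h}\right)$$
Total FFN params per layer (SwiGLU):
$$P_{ffn} = 3\,d\,d_{ff}$$
With $d_{ff} = \frac{8}{3}d$:
$$P_{ffn} = 8d^2$$
Total model parameters:
$$N = Vd + L\,(P_{attn} + P_{ffn} + 2d) + Vd + d$$
where the first $Vd$ is the token embedding, the last $Vd$ is the output projection (often tied to the embedding), and the $d$ at the end is the final RMSNorm.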
Worked Example: LLaMA 3 8B
| Hyperparameter | Value |
|---|---|
| $L$ (layers) | 32 |
| $d$ (hidden dim) | 4096 |
| $n_h$ (query heads) | 32 |
| $n_{kv}$ (KV heads) | 8 |
| $d_{head}$ (head dim) | 128 |
| $d_{ff}$ (FFN intermediate) | 14336 |
| $V$ (vocab size) | 128256 |
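Attention per layer:
$W_Q$: $4096 \times 4096 = 16.8$M, $W_K$: $4096 \times 1024 = 4.2$M, $W_V$: same as $W_K$, $W_O$: $4096 \times 4096 = 16.8$M.
$$P_{attn} = 16.8\text{M} + 4.2\text{M} + 4.2\text{M} + 16.8\text{M} \approx 41.9\text{M}$$
FFN per layer:
$$P_{ffn} = 3 \times 4096 \times 14336 \approx 176.2\text{M}$$
Per-layer total:
$$41.9\text{M} + 176.2\text{M} + 2 \times 4096 \approx 218.1\text{M}$$
All layers:
$$32 \times 218.1\text{M} \approx 6.98\text{B}$$
Embedding: $128256 \times 4096 \approx 525$M (counted once if shared with the output head)
Total: $6.98\text{B} + 0.53\text{B} \approx 7.5$B with a tied output head. LLaMA 3 8B keeps a separate (untied) output head, so adding a second 525M matrix brings the total to about 8.03B, which matches the reported ~8.0B. The final norm adds only 4096 parameters.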
Interviewers do not expect exact numbers. They want to see that you (1) know the formula, (2) can set up the calculation correctly, and (3) get within 10-20% of the right answer. Being methodical matters more than being exact.
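The worked example can be checked in code. This sketch implements the counting formula from this part; `count_params` and its `tied` flag are hypothetical names, and biases are assumed absent (as in LLaMA-style models).

```python
def count_params(n_layers, d, n_heads, n_kv_heads, d_head, d_ff, vocab, tied=False):
    """Parameter count for a decoder-only model with GQA, SwiGLU, and RMSNorm."""
    attn = 2 * d * n_heads * d_head + 2 * d * n_kv_heads * d_head   # Q, O + K, V
    ffn = 3 * d * d_ff                                              # gate, up, down
    per_layer = attn + ffn + 2 * d                                  # plus two RMSNorm scales
    embedding = vocab * d
    output_head = 0 if tied else vocab * d
    return embedding + n_layers * per_layer + d + output_head       # + final RMSNorm

llama3_8b_untied = count_params(32, 4096, 32, 8, 128, 14336, 128256)
llama3_8b_tied = count_params(32, 4096, 32, 8, 128, 14336, 128256, tied=True)

print(f"untied: {llama3_8b_untied / 1e9:.2f}B, tied: {llama3_8b_tied / 1e9:.2f}B")
```

The untied count comes out at about 8.03B and the tied count at about 7.50B, so the gap between the back-of-envelope figure and the reported size is almost entirely the second vocabulary matrix.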
Part 7 - FLOP Estimation
The 6ND Rule
For a Transformer with $N$ parameters processing $D$ tokens:
$$\text{FLOPs}_{forward} \approx 2ND, \qquad \text{FLOPs}_{backward} \approx 4ND, \qquad \text{FLOPs}_{total} \approx 6ND$$
The factor of 2 for forward comes from the fact that each parameter participates in one multiply-add operation (2 FLOPs per parameter per token). The backward pass costs roughly 2x the forward pass (one for activation gradients, one for weight gradients).
More Precise Breakdown
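For a single layer processing batch $B$ with sequence length $s$:
Attention:
- QKV projections: $6Bsd^2$ FLOPs (for MHA; less for GQA)
- Attention scores ($QK^\top$): $2Bs^2d$ FLOPs
- Attention weighted values ($AV$): $2Bs^2d$ FLOPs
- Output projection: $2Bsd^2$ FLOPs
FFN (SwiGLU):
- Gate, up, down projections: $6Bsd\,d_{ff}$ FLOPs ($16Bsd^2$ with $d_{ff} = \frac{8}{3}d$)
Key insight: The attention score computation scales as $O(s^2 d)$ while everything else scales as $O(s d^2)$. For long contexts, attention becomes the bottleneck (the figures below assume $d = 8192$ and do not discount the work skipped by causal masking):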
| Sequence Length | Attention Score FLOPs Fraction ($QK^\top$ and $AV$) | Everything Else (projections + FFN) |
|---|---|---|
| 2K | ~5% | ~95% |
| 32K | ~40% | ~60% |
| 128K | ~75% | ~25% |
This is why Flash Attention and KV cache optimization matter so much for long-context models.
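The fractions in the table follow from the per-layer breakdown. The sketch below assumes MHA projections, $d = 8192$, $d_{ff} = \frac{8}{3}d$, and no savings from causal masking; under those assumptions the quadratic share simplifies to $s / (s + 6d)$. Function names are illustrative.

```python
def layer_flops(seq_len, d, d_ff, batch=1):
    """Return (projection, attention-score, FFN) FLOPs for one layer."""
    projections = 8 * batch * seq_len * d * d       # Q, K, V, and output projections
    scores = 4 * batch * seq_len ** 2 * d           # QK^T plus attention-weighted values
    ffn = 6 * batch * seq_len * d * d_ff            # gate, up, down
    return projections, scores, ffn

def attention_score_fraction(seq_len, d=8192):
    """Share of per-layer FLOPs spent on the quadratic attention terms."""
    projections, scores, ffn = layer_flops(seq_len, d, 8 * d / 3)
    return scores / (projections + scores + ffn)

for s in (2 * 1024, 32 * 1024, 128 * 1024):
    print(f"s = {s:>6}: {attention_score_fraction(s):.1%} of layer FLOPs in attention scores")
```

The crossover where the quadratic terms reach half of the layer's FLOPs is at $s = 6d$, about 49K tokens for $d = 8192$.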
Training Compute Budget
Training FLOPs for a model with $N$ parameters on $D$ tokens:
$$C \approx 6ND$$
Chinchilla scaling law (Hoffmann et al., 2022) suggests $D \approx 20N$ for compute-optimal training. So for a 70B model, you want about $20 \times 70 \times 10^{9} = 1.4 \times 10^{12}$ (1.4T) tokens.
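GPU hours estimate:
For a 70B model trained on 15T tokens, on H100s at ~1000 TFLOPS (bf16) with 40% MFU (model FLOP utilization):
$$\frac{6 \times 70 \times 10^{9} \times 15 \times 10^{12}}{0.4 \times 10^{15} \times 3600} \approx 4.4 \times 10^{6} \text{ GPU-hours}$$
That is about 500 H100s running for 1 year, or 6,000 H100s for one month. At the Chinchilla-optimal 1.4T tokens, the same calculation gives roughly 0.4M GPU-hours.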
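The estimate is a two-line calculation. In this sketch the peak throughput (1e15 FLOP/s) and 40% MFU are the same round-number assumptions used above, and the function names are hypothetical.

```python
def training_flops(n_params, n_tokens):
    """6ND rule: forward plus backward FLOPs for one pass over the data."""
    return 6 * n_params * n_tokens

def gpu_hours(total_flops, peak_flops_per_s=1e15, mfu=0.4):
    """Wall-clock GPU-hours at a given peak throughput and utilization."""
    return total_flops / (peak_flops_per_s * mfu) / 3600

hours_15t = gpu_hours(training_flops(70e9, 15e12))            # 70B on 15T tokens
hours_chinchilla = gpu_hours(training_flops(70e9, 20 * 70e9)) # 70B on 1.4T tokens

gpus_for_one_year = hours_15t / (365 * 24)
gpus_for_one_month = hours_15t / (30 * 24)
```

`hours_15t` comes out near 4.4 million, which is where the "about 500 H100s for a year" figure comes from. MFU is the most uncertain input, so it is worth quoting the result as a range in an interview.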
Part 8 - Model Architecture Comparison
The Big Picture (2024-2026 Models)
| Feature | LLaMA 3 70B | Mistral 7B | Gemini 1.5 Pro | GPT-4 (estimated) | Claude 3.5 |
|---|---|---|---|---|---|
| Architecture | Decoder-only | Decoder-only | Decoder-only (sparse MoE) | Decoder-only (MoE, rumored) | Decoder-only |
| Positional | RoPE | RoPE | RoPE variant | Unknown | Unknown |
| Attention | GQA (8 KV) | GQA (8 KV) | Multi-Query or GQA | Unknown | Unknown |
| FFN | SwiGLU | SwiGLU | SwiGLU variant | Unknown | Unknown |
| Norm | RMSNorm, pre-norm | RMSNorm, pre-norm | RMSNorm, pre-norm | Pre-norm | Pre-norm |
| Context | 128K | 32K (sliding window) | 1M+ | 128K | 200K |
| Params | 70B | 7.3B | Unknown (large) | ~1.8T MoE (rumored) | Unknown |
| Special | - | Sliding window attn | Ring attention, long context | MoE routing | Constitutional AI |
Anthropic and OpenAI do not publish full architecture details. In interviews at these companies, acknowledge what is public and what is speculation. Saying "GPT-4 is rumored to be a mixture of experts with about 1.8T total parameters and ~200B active" shows awareness. Claiming certainty about unpublished details shows poor judgment.
Mistral's Sliding Window Attention
Mistral 7B uses sliding window attention (SWA) where each token attends only to the previous $W$ tokens (e.g., $W = 4096$):
$$\text{token } i \text{ attends to token } j \iff i - W < j \le i$$
Benefits: KV cache capped at $W$ tokens regardless of sequence length. Memory is $O(W)$ instead of $O(s)$.
Effective receptive field: After $k$ layers with window size $W$, a token can theoretically attend to the previous $k \times W$ tokens (information propagates one window further back with each stacked layer). With 32 layers and $W = 4096$: effective receptive field = $32 \times 4096 = 131{,}072$ tokens.
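A short sketch of the window mask and the receptive-field arithmetic. The rolling-buffer slot rule (`position % window`) follows the scheme described in the Mistral 7B paper; the function names are illustrative.

```python
def sliding_window_mask(n, window):
    """allowed[i][j] is True when token i may attend to token j: i - window < j <= i."""
    return [[(i - window < j <= i) for j in range(n)] for i in range(n)]

def rolling_cache_slot(position, window):
    """Slot in a fixed-size rolling KV buffer; old entries are overwritten."""
    return position % window

def theoretical_span(n_layers, window):
    """Upper bound on how far back information can propagate through the stack."""
    return n_layers * window

mask = sliding_window_mask(5, window=2)
span = theoretical_span(32, 4096)
```

With a window of 2, the last of five tokens attends only to itself and its immediate predecessor. The span is an upper bound on reach, not a guarantee that distant information survives every hop.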
Mixture of Experts (MoE)
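Mixtral, and reportedly GPT-4, use MoE layers where only a subset of "expert" FFN blocks are activated per token:
$$y = \sum_{i \in \text{TopK}} g_i(x)\, E_i(x)$$
where $E_i$ is the $i$-th expert FFN and $g_i(x)$ is the router weight assigned to it.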
Key MoE numbers for Mixtral 8x7B:
- 8 experts per layer, top-2 routing
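- Each expert is a SwiGLU FFN, roughly 5.6B parameters per expert summed over all 32 layers; attention and embeddings are shared, which is why the total is ~47B rather than $8 \times 7\text{B} = 56$B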
- Total parameters: ~47B (but only ~13B active per token)
- Inference cost: similar to a 13B dense model
- Quality: comparable to LLaMA 2 70B
MoE tradeoffs:
- More total parameters (more memory for weights) but fewer active FLOPs
- Load balancing challenges - some experts get used much more than others
- Harder to fine-tune - which experts should adapt?
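Top-2 routing can be sketched in a few lines. This version normalizes the router weights with a softmax over only the selected logits, which is how the Mixtral paper describes its gate; the expert functions here are toy stand-ins, and all names are hypothetical.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k largest logits and softmax over just those k."""
    top = sorted(range(len(router_logits)),
                 key=lambda e: router_logits[e], reverse=True)[:k]
    peak = max(router_logits[e] for e in top)
    exps = [math.exp(router_logits[e] - peak) for e in top]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]

def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    out = [0.0] * len(x)
    for expert_index, weight in top_k_route(router_logits, k):
        y = experts[expert_index](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Eight toy experts: expert e multiplies its input by (e + 1).
experts = [lambda v, s=e + 1: [s * vi for vi in v] for e in range(8)]
logits = [0.0, 2.0, -1.0, 1.0, 0.5, -2.0, 0.2, 0.1]   # illustrative router output

routes = top_k_route(logits)
y = moe_forward([1.0], logits, experts)
```

Only experts 1 and 3 are evaluated for this token, which is why active FLOPs track the top-k count rather than the total number of experts. Load balancing matters because nothing in this rule stops the router from sending most tokens to the same few experts.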
Practice Problems
Problem 1: KV Cache Memory
You are deploying a model with these specs: 40 layers, hidden dimension 5120, 40 query heads, 8 KV heads, head dimension 128. You need to serve batch size 32 with context length 8192 in fp16. How much KV cache memory is needed?
Hint 1 - Direction
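Use the KV cache formula:
$$\text{KV cache bytes} = 2 \times L \times n_{kv} \times d_{head} \times s \times B \times \text{bytes per element}$$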
Hint 2 - Insight
Plug in: $L = 40$, $n_{kv} = 8$, $d_{head} = 128$, $s = 8192$, $B = 32$, 2 bytes per element. Be careful with units - work in bytes and convert to GB at the end.
Hint 3 - Full Solution + Rubric
Step by step:
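- Per token per layer: $2 \times 8 \times 128 \times 2 = 4096$ bytes
- Per token all layers: $4096 \times 40 = 163{,}840$ bytes $= 160$ KB
- Per sequence: $163{,}840 \times 8192 = 1{,}342{,}177{,}280$ bytes $= 1.25$ GB
- Full batch: $1.25 \times 32 = 40$ GB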
Answer: ~40 GB of KV cache memory.
Scoring Rubric:
| Criterion | Strong Hire | Lean Hire | No Hire |
|---|---|---|---|
| Correct formula | Wrote formula from memory | Needed to derive step by step | Could not set up the calculation |
| Correct answer | Within 5% of 40 GB | Within 20% | Off by more than 2x |
| Follow-up insight | "This means we need at least an 80 GB GPU or tensor parallelism across 2 GPUs, plus model weights" | "That is a lot of memory" | No interpretation |
| Speed | Under 2 minutes | Under 5 minutes | Could not finish |
Problem 2: Architecture Design
You are tasked with designing a 13B parameter model optimized for 32K context length inference on 8 H100 GPUs. Specify: layers, hidden dimension, query heads, KV heads, FFN dimension, and positional encoding. Justify each choice.
Hint 1 - Direction
Start by working backward from 13B parameters using the parameter counting formulas. Consider how to minimize KV cache memory for 32K context while maintaining quality.
Hint 2 - Insight
A 13B model typically has $d = 5120$ and $L = 40$. For 32K context, GQA with few KV heads is essential. Use RoPE for position extension capability. Use SwiGLU with $d_{ff} \approx \frac{8}{3}d$.
Hint 3 - Full Solution + Rubric
Proposed architecture:
| Hyperparameter | Value | Justification |
|---|---|---|
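| Layers ($L$) | 40 | Standard depth for 13B class |
| Hidden dim ($d$) | 5120 | $n_h \times d_{head} = 40 \times 128$ |
| Query heads ($n_h$) | 40 | $d_{head} = 5120 / 40 = 128$ (standard) |
| KV heads ($n_{kv}$) | 8 | GQA with 5x reduction; KV cache for one 32K sequence in fp16: ~5 GB |
| FFN dim ($d_{ff}$) | 13824 | $\frac{8}{3} \times 5120 \approx 13653$, rounded up to a multiple of 256 |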
| Positional encoding | RoPE | Relative, extensible to longer contexts |
| Normalization | Pre-norm RMSNorm | Standard for stability |
| Activation | SwiGLU | Standard for quality |
Memory budget on 8 H100s (80 GB each = 640 GB total):
- Model weights (bf16): ~26 GB (tensor parallel across 8 GPUs = ~3.3 GB/GPU)
- KV cache for batch 16 at 32K: $16 \times 5$ GB/sequence $\div\, 8$ GPUs $= 10$ GB/GPU
- Activations and overhead: ~5 GB/GPU
- Total per GPU: ~18 GB - fits comfortably in 80 GB
Scoring Rubric:
| Criterion | Strong Hire | Lean Hire | No Hire |
|---|---|---|---|
| Parameter count works out | Within 10% of 13B | Within 30% | Did not verify |
| KV cache considered | Calculated memory, chose GQA | Mentioned GQA without numbers | Used MHA or ignored memory |
| Multi-GPU plan | Tensor parallelism layout | Mentioned parallelism vaguely | No consideration of 8 GPUs |
| Modern components | RoPE, GQA, SwiGLU, RMSNorm | Most modern choices | Used original 2017 architecture |
Problem 3: RoPE Extension
Your 13B model was trained with RoPE on 8K context length. Your users need 32K context. What approaches can you use to extend the context length without retraining from scratch?
Hint 1 - Direction
Think about what RoPE does at positions beyond the training range. The rotation angles become much larger - the model has never seen those angles during training.
Hint 2 - Insight
The main approaches are: (1) Position Interpolation - scale down positions so 32K maps to the 0-8K range, (2) NTK-aware scaling - modify the base frequency to spread rotations more evenly, (3) YaRN - combine NTK scaling with attention scaling and fine-tune briefly.
Hint 3 - Full Solution + Rubric
Three main approaches:
1. Position Interpolation (PI) - Chen et al., 2023: Scale all positions by $L_{train} / L_{target} = 8\text{K} / 32\text{K} = 1/4$. Position 32K becomes position 8K in RoPE-space. Requires short fine-tuning (~1000 steps) to adapt.
2. NTK-aware Scaling - "Code LLaMA" approach: Modify the RoPE base frequency: $b' = b \cdot \alpha^{d/(d-2)}$ where $\alpha$ is the scaling factor and $d$ is the head dimension. This spreads the rotations more evenly across the extended range, preserving the relative position resolution at short distances while extending range.
3. YaRN - Yet another RoPE extension (Peng et al., 2023): Combines NTK-aware interpolation with an attention temperature scaling factor. Different frequency bands are scaled differently: low-frequency (long-range) dimensions are interpolated more aggressively, while high-frequency (local) dimensions are preserved. Fine-tune for ~400 steps.
Comparison:
| Method | Fine-tuning Needed | Quality at 32K | Complexity |
|---|---|---|---|
| PI | ~1000 steps | Good | Low |
| NTK-aware | None (or very little) | Decent | Medium |
| YaRN | ~400 steps | Best | Medium |
| Retrain | Full pretraining | Ideal | Very high |
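The first two methods reduce to simple transformations of the RoPE inputs. The sketch below assumes a base of 10000 and a head dimension of 128; the function names are hypothetical, and YaRN's per-band blending and temperature term are deliberately left out.

```python
import math

def pi_position(pos, train_len, target_len):
    """Position Interpolation: squeeze positions back into the trained range."""
    return pos * train_len / target_len

def ntk_base(base, scale, d_head):
    """NTK-aware scaling: enlarge the RoPE base instead of shrinking positions."""
    return base * scale ** (d_head / (d_head - 2))

def rope_frequencies(base, d_head):
    """theta_i = base^(-2i / d_head) for each rotated pair."""
    return [base ** (-2.0 * i / d_head) for i in range(d_head // 2)]

scale = 32768 / 8192                        # 8K -> 32K, a factor of 4
new_base = ntk_base(10000.0, scale, 128)

old = rope_frequencies(10000.0, 128)
new = rope_frequencies(new_base, 128)
```

Comparing `old` and `new` shows what NTK-aware scaling does: the highest-frequency pair is untouched, so local resolution is preserved, while the lowest-frequency pair is slowed by exactly the scale factor, matching what Position Interpolation would do to it.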
Scoring Rubric:
| Criterion | Strong Hire | Lean Hire | No Hire |
|---|---|---|---|
| Knows multiple methods | Describes 2-3 with tradeoffs | Knows one method | "Just train on longer data" |
| Understands the math | Can explain why interpolation works | Vague "scale the positions" | No understanding of mechanism |
| Practical considerations | Fine-tuning cost, quality degradation | Mentions fine-tuning | Does not mention practical steps |
Problem 4: Why Not Encoder-Decoder?
An interviewer asks: "Google's T5 and BART used encoder-decoder architecture and achieved great results. Why did the field converge on decoder-only? Are there cases where encoder-decoder is still better?"
Hint 1 - Direction
Think about three angles: scaling efficiency, task generality, and in-context learning ability.
Hint 2 - Insight
Decoder-only won because: (1) all parameters contribute to generation (no encoder "dead weight" for generation tasks), (2) in-context learning emerges naturally from causal modeling, (3) training is simpler (single objective). Encoder-decoder is still better for fixed seq2seq tasks like translation and summarization where the input and output are clearly separated.
Hint 3 - Full Solution + Rubric
Why decoder-only dominates:
- Parameter efficiency for generation: In an encoder-decoder model, ~50% of parameters are in the encoder, which does not directly participate in token generation. A decoder-only model uses all parameters for the generation task.
- Emergent in-context learning: Causal language models naturally develop the ability to learn from examples in their context (few-shot learning). Encoder-decoder models are less natural at this because the "input" and "output" are architecturally separated.
- Training simplicity: One objective (next-token prediction), one loss, one architecture. Encoder-decoder requires designing input/output splits and potentially different objectives for each stack.
- Scaling behavior: Scaling laws show that decoder-only models are more compute-efficient at larger scales for the kinds of tasks we care about (general-purpose generation, reasoning, tool use).
Where encoder-decoder still wins:
- Machine translation with fixed language pairs
- Summarization where the compression ratio is high
- Speech recognition (Whisper uses encoder-decoder)
- Any task with a clear, fixed input-output structure where you do not need in-context flexibility
The nuance: Prefix LM (decoder-only with bidirectional attention on the prefix) can approximate encoder-decoder behavior. This is what models like PaLM and UL2 explored - decoder-only architecture with flexible attention masking.
Scoring Rubric:
| Criterion | Strong Hire | Lean Hire | No Hire |
|---|---|---|---|
| Multiple reasons | 3+ well-explained reasons | 1-2 reasons | "Decoder-only is just better" |
| Nuance | Acknowledges where enc-dec wins | Mentions translation | Absolute claims |
| Research awareness | Mentions prefix LM, UL2 | Mentions T5 | Only knows GPT-style |
Interview Cheat Sheet
| Concept | Key Formula / Fact | Common Follow-Up |
|---|---|---|
| Attention | $\text{softmax}\left(QK^\top / \sqrt{d_k} + M\right)V$ | "What is the complexity? How does Flash Attention help?" |
| Causal Mask | Upper triangle = $-\infty$, applied before softmax | "How does this enable parallel training?" |
| RoPE | Rotate by $m\theta_i$; dot product depends on $m - n$ | "How do you extend to longer contexts?" |
| GQA | $n_h$ query heads share $n_{kv}$ KV heads | "What is the memory savings vs MHA?" |
| KV Cache | $2 \times L \times n_{kv} \times d_{head} \times s \times B \times \text{bytes}$ | "How much for batch 32 at 128K?" |
| SwiGLU | $(\text{Swish}(xW_{gate}) \odot xW_{up})\,W_{down}$, $d_{ff} = \frac{8}{3}d$ | "Why three matrices? Parameter count?" |
| RMSNorm | $\gamma \odot x / \text{RMS}(x)$, no mean subtraction | "Why not LayerNorm?" |
| Pre-norm | Norm before attention/FFN, residual flows unimpeded | "What goes wrong with post-norm at depth?" |
| Param count | Attn: $2d^2 + 2d\,n_{kv}d_{head}$; FFN: $3d\,d_{ff}$ | "Count params for LLaMA 3 70B" |
| FLOPs | Forward: $2ND$; Train: $6ND$ | "How many GPU-hours for 70B on 15T tokens?" |
| MoE | Top-k routing, total params much larger than active | "Load balancing? Fine-tuning MoE?" |
Spaced Repetition Checkpoints
Day 0 (After reading this chapter)
- Draw a modern LLM block from memory (RMSNorm, GQA, SwiGLU, residuals)
- Write the causal mask formula and explain why it uses $-\infty$
- Calculate KV cache for LLaMA 3 8B at 8K context, fp16
- Explain RoPE in one paragraph without looking at notes
Day 3
- From memory, list all differences between original Transformer and modern LLM block
- Derive the parameter count for a 7B model ($d = 4096$, $L = 32$, GQA with 8 KV heads)
- Explain GQA to someone in 60 seconds - include the memory savings number
- Write the SwiGLU formula and explain the three weight matrices
Day 7
- Compare RoPE, ALiBi, and sinusoidal encodings - pros/cons of each
- Calculate training FLOPs for a 13B model on 2T tokens
- Explain why pre-norm is necessary for deep models - what fails with post-norm?
- Quiz: given a model config, can you estimate total parameters within 20%?
Day 14
- Do Practice Problem 2 (architecture design) from scratch, timed (15 min)
- Explain MoE architecture - routing, expert selection, load balancing
- From memory, fill in the model comparison table for LLaMA 3, Mistral, and GPT-4
- Practice the 60-second answers for all cheat sheet entries
Day 21
- Full mock: "Design a 30B model for 64K context. Specify everything and justify."
- Rapid fire: answer 10 random cheat sheet questions in under 60 seconds each
- Explain context length extension methods (PI, NTK, YaRN) with tradeoffs
- Re-take the self-assessment - all scores should be 4+
