Transformer Architecture - The Engine of Modern AI

Reading time: ~45 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Engineer, LLM Engineer, NLP Engineer

The Real Interview Moment

You are in an Anthropic research engineer interview. The interviewer hands you a blank sheet and says: "Draw the complete Transformer architecture. Label every component with its exact dimensions for a model with $d_{\text{model}} = 512$ , 8 heads, and a sequence length of 128. Then walk me through a single token from input embedding to output logits."

You draw the architecture, and she follows up: "Why does GPT use only the decoder? Why does BERT use only the encoder? What is the fundamental difference in their attention masks, and what tasks does each architecture excel at?"

This is one of the most frequently asked questions in modern AI interviews. The Transformer is the architecture behind GPT-4, Claude, BERT, ViT, AlphaFold, and virtually every state-of-the-art model, and its attention blocks also appear inside diffusion models such as Stable Diffusion. You need to know it at the level of being able to implement it from scratch, explain every design decision, and discuss the modern variants that make it scale.

Candidates who can draw the architecture but cannot explain pre-norm vs post-norm, or who confuse the encoder and decoder attention patterns, get a "lean no-hire." Candidates who can walk through the full forward pass with dimensions, explain positional encoding design choices, and discuss efficiency techniques like Flash Attention and KV caching get a "strong hire."

What You Will Master

Draw the complete Transformer architecture (encoder + decoder) with all sub-components
Trace a token through the full forward pass with exact dimensions at every step
Derive sinusoidal positional encodings and compare with learned and RoPE
Explain pre-norm vs post-norm and why pre-norm trains more stably
Compare encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) architectures
Describe Flash Attention, KV cache, and sliding window attention
Compute the parameter count and FLOPs for a given Transformer configuration
Solve interview problems on architecture design, efficiency, and scaling

Self-Assessment: Where Are You Now?

Skill	1 -- Cannot	2 -- Vaguely	3 -- Can Explain	4 -- Can Derive	5 -- Can Teach	Your Score
Draw the full Transformer architecture						___
Walk through forward pass with dimensions						___
Explain sinusoidal positional encoding						___
Compare pre-norm vs post-norm						___
Explain BERT vs GPT architecture differences						___
Describe Flash Attention mechanism						___
Explain KV cache for inference						___
Compute parameter count for a given config						___

Target: All 4s and 5s before your interview.

Part 1 - The Full Architecture

High-Level Structure

The original Transformer (Vaswani et al., 2017, "Attention Is All You Need") has two main components:

Encoder: Processes the input sequence into contextual representations
Decoder: Generates the output sequence token by token, conditioned on the encoder output

Transformer Full Architecture: Encoder + Decoder with Cross-Attention (Vaswani et al., 2017)

Layer-by-Layer Walkthrough with Dimensions

Let us trace a forward pass with: $d_{\text{model}} = 512$ , $h = 8$ heads, $d_{ff} = 2048$ , sequence length $T = 128$ , vocabulary size $V = 32000$ .

Step 1: Input Embedding

X = \text{Embed}(\text{tokens}) \in \mathbb{R}^{128 \times 512}

Token embedding lookup: each token ID maps to a 512-dimensional vector. Parameters: $V \times d_{\text{model}} = 32000 \times 512 = 16.4M$ .

Step 2: Positional Encoding

X' = X + PE \in \mathbb{R}^{128 \times 512}

Add position information (see Part 2 for details).

Step 3: Multi-Head Self-Attention

For each of 8 heads ( $d_k = d_v = 512/8 = 64$ ):

Q_i = X'W_i^Q \in \mathbb{R}^{128 \times 64}

K_i = X'W_i^K \in \mathbb{R}^{128 \times 64}

V_i = X'W_i^V \in \mathbb{R}^{128 \times 64}

\text{head}_i = \text{softmax}\left(\frac{Q_iK_i^T}{\sqrt{64}}\right)V_i \in \mathbb{R}^{128 \times 64}

Concatenate: $\text{Concat}(\text{head}_1, \ldots, \text{head}_8) \in \mathbb{R}^{128 \times 512}$

Output projection: $\text{MultiHead} = \text{Concat} \cdot W^O \in \mathbb{R}^{128 \times 512}$

Attention parameters per layer (four $d_{\text{model}} \times d_{\text{model}}$ matrices $W^Q, W^K, W^V, W^O$ , ignoring biases): $4 \times d_{\text{model}}^2 = 4 \times 512^2 = 1,048,576$ .

Step 4: Add and Norm (Residual Connection + Layer Normalization)

X'' = \text{LayerNorm}(X' + \text{MultiHead}(X'))

LayerNorm parameters: $2 \times d_{\text{model}} = 1024$ (scale and shift).

Step 5: Feed-Forward Network

\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2

Where $W_1 \in \mathbb{R}^{512 \times 2048}$ , $W_2 \in \mathbb{R}^{2048 \times 512}$ .

FFN parameters per layer: $2 \times d_{\text{model}} \times d_{ff} + d_{ff} + d_{\text{model}} = 2 \times 512 \times 2048 + 2048 + 512 = 2,099,712$ .

Step 6: Add and Norm again.

\text{Output} = \text{LayerNorm}(X'' + \text{FFN}(X''))
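The six steps above can be checked mechanically by tracking only tensor shapes. The following is a minimal sketch, not a model implementation; the function name `trace_shapes` and the step labels are illustrative assumptions.

```python
def trace_shapes(T=128, d_model=512, h=8, d_ff=2048):
    """Return (step, shape) pairs for one post-norm encoder layer."""
    assert d_model % h == 0, "d_model must be divisible by the number of heads"
    d_k = d_model // h  # per-head dimension
    return [
        ("embedding X", (T, d_model)),
        ("X + PE", (T, d_model)),
        ("Q_i, K_i, V_i (per head)", (T, d_k)),
        ("scores Q_i K_i^T (per head)", (T, T)),
        ("head_i", (T, d_k)),
        ("concat of heads", (T, h * d_k)),
        ("after W^O", (T, d_model)),
        ("add & norm", (T, d_model)),
        ("FFN hidden", (T, d_ff)),
        ("FFN output", (T, d_model)),
        ("add & norm (layer output)", (T, d_model)),
    ]

for name, shape in trace_shapes():
    print(f"{name:30s} {shape}")
```

Note that the only $T \times T$ object is the per-head score matrix; everything else is linear in the sequence length.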

60-Second Answer

"The Transformer processes sequences entirely through attention, with no recurrence. Each encoder layer has multi-head self-attention followed by a feed-forward network, with residual connections and layer normalization around each. The decoder adds causal masking in self-attention to prevent attending to future tokens, plus a cross-attention layer that queries the encoder output. Positional encoding is added to the input embeddings since attention has no inherent notion of position. The whole architecture is highly parallelizable because all positions are processed simultaneously."

Total Parameter Count

For a 6-layer encoder + 6-layer decoder Transformer with $d_{\text{model}} = 512$ , $d_{ff} = 2048$ , $V = 32000$ :

Component	Parameters
Token embedding	32000 x 512 = 16.4M
Positional encoding (sinusoidal)	0 (fixed)
Encoder self-attention (per layer)	4 x 512^2 = 1.05M
Encoder FFN (per layer)	2 x 512 x 2048 + 2048 + 512 = 2.1M
Encoder LayerNorm (per layer)	2 x 2 x 512 = 2K
Encoder total (6 layers)	6 x 3.15M = 18.9M
Decoder self-attention (per layer)	1.05M
Decoder cross-attention (per layer)	1.05M
Decoder FFN (per layer)	2.1M
Decoder LayerNorm (per layer)	3 x 2 x 512 = 3K
Decoder total (6 layers)	6 x 4.2M = 25.2M
Output linear (tied with embedding)	0 (weight tying)
Total	~60.5M
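The table can be reproduced with a few lines of arithmetic. This is a sketch under the same assumptions as the table (no attention biases, FFN biases included, sinusoidal positions, tied output weights); the function names are illustrative.

```python
def attention_params(d_model):
    # W^Q, W^K, W^V, W^O, each d_model x d_model (biases ignored)
    return 4 * d_model * d_model

def ffn_params(d_model, d_ff):
    # W_1, W_2 plus their bias vectors
    return 2 * d_model * d_ff + d_ff + d_model

def layernorm_params(d_model):
    # scale and shift
    return 2 * d_model

def encoder_decoder_params(d_model=512, d_ff=2048, vocab=32000, n_enc=6, n_dec=6):
    embedding = vocab * d_model
    enc_layer = (attention_params(d_model) + ffn_params(d_model, d_ff)
                 + 2 * layernorm_params(d_model))
    dec_layer = (2 * attention_params(d_model) + ffn_params(d_model, d_ff)
                 + 3 * layernorm_params(d_model))
    total = embedding + n_enc * enc_layer + n_dec * dec_layer
    return {"embedding": embedding, "enc_layer": enc_layer,
            "dec_layer": dec_layer, "total": total}

counts = encoder_decoder_params()
print(counts)
print(f"embedding share: {counts['embedding'] / counts['total']:.1%}")
```

Running it shows that the embedding table is a bit over a quarter of the total at this size.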

Company Variation

At Google and Meta, interviewers expect you to compute parameter counts on the spot. At research-focused companies (DeepMind, Anthropic), they want you to know how parameters scale and which component dominates at different model sizes. Key insight: for small models, embeddings are a large share of the total (about 27% in the 60.5M example above); for large models (billions of parameters), the Transformer layers dominate, with the FFN as the largest part, because layer parameters grow as $O(L \times d_{\text{model}} \times d_{ff}) = O(L \, d_{\text{model}}^2)$ while embeddings grow only as $O(V \times d_{\text{model}})$ .

Part 2 - Positional Encoding

Why We Need It

Self-attention is permutation-equivariant: if you shuffle the input tokens, the output is shuffled in the same way (the attention weights are permuted along with the tokens, because the computation treats all positions identically). Without positional encoding, the Transformer has no notion of word order - "the cat sat on the mat" and "mat the on sat cat the" would give each word exactly the same representation, just in a different order.
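Permutation equivariance is easy to verify numerically. The sketch below is a toy single-head attention in pure Python; it assumes identity projections ($Q = K = V = X$) purely to keep the example short, and the input values are arbitrary.

```python
import math

def self_attention(X):
    """Single-head self-attention with identity projections (Q = K = V = X)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)                      # subtract the max for stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]     # three tokens, no position signal
perm = [2, 0, 1]                             # shuffle the token order

Y = self_attention(X)
Y_perm = self_attention([X[p] for p in perm])

# Row i of the shuffled output equals row perm[i] of the original output.
max_diff = max(abs(Y_perm[i][j] - Y[perm[i]][j])
               for i in range(len(X)) for j in range(len(X[0])))
print(max_diff)
```

The difference is zero up to floating-point rounding, which is why some positional signal has to be injected.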

Sinusoidal Positional Encoding (Original)

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)

PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)

Where $pos$ is the position index and $i \in \{0, \ldots, d_{\text{model}}/2 - 1\}$ indexes the sine/cosine pair, so dimensions $2i$ and $2i+1$ share one frequency.

Why sinusoidal?

Fixed (no learned parameters): Does not add to model size
Extrapolation: Can theoretically handle sequences longer than those seen in training
Relative position encoding: For any fixed offset $k$ , $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$ :

PE_{pos+k} = M_k \cdot PE_{pos}

where $M_k$ is a block-diagonal rotation matrix that depends only on the offset $k$ , not on $pos$ . This allows the model to learn relative position patterns.

Multi-scale patterns: Low-frequency sinusoids capture long-range position differences; high-frequency sinusoids capture fine-grained position differences
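The relative-position property can be checked directly. The sketch below builds the encoding from the two formulas above and applies $M_k$ as a rotation of each (sin, cos) pair by the angle $k\,\omega_i$ ; the helper names are illustrative.

```python
import math

def sinusoidal_pe(pos, d_model):
    """PE vector for one position, interleaved as [sin, cos, sin, cos, ...]."""
    pe = []
    for i in range(d_model // 2):
        omega = 1.0 / (10000 ** (2 * i / d_model))
        pe.append(math.sin(pos * omega))   # dimension 2i
        pe.append(math.cos(pos * omega))   # dimension 2i + 1
    return pe

def shift_by_offset(pe, k, d_model):
    """Apply M_k: rotate each (sin, cos) pair by the angle k * omega_i."""
    out = []
    for i in range(d_model // 2):
        omega = 1.0 / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        out.append(s * math.cos(k * omega) + c * math.sin(k * omega))
        out.append(c * math.cos(k * omega) - s * math.sin(k * omega))
    return out

d_model = 8
direct = sinusoidal_pe(5 + 3, d_model)                             # PE at position 8
rotated = shift_by_offset(sinusoidal_pe(5, d_model), 3, d_model)   # M_3 applied to PE_5
max_error = max(abs(a - b) for a, b in zip(direct, rotated))
print(max_error)
```

Because `shift_by_offset` never looks at `pos`, the same linear map moves any position forward by `k`.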

Learned Positional Encoding

PE = W_{pos}[pos] \in \mathbb{R}^{d_{\text{model}}}

A lookup table of $T_{\max}$ learned vectors, one per position. Used in BERT and GPT-2.

Pros: More flexible, can learn task-specific patterns. Cons: Cannot extrapolate beyond $T_{\max}$ , adds $T_{\max} \times d_{\text{model}}$ parameters.

Rotary Position Embedding (RoPE)

RoPE (Su et al., 2021) encodes position by rotating the query and key vectors:

f_q(x_m, m) = R_m x_m, \quad f_k(x_n, n) = R_n x_n

Where $R_m$ is a rotation matrix that depends on position $m$ . The inner product becomes:

f_q(x_m, m)^T f_k(x_n, n) = x_m^T R_m^T R_n x_n = x_m^T R_{n-m} x_n

Position enters the attention score only through the relative offset $n - m$ , not through the absolute positions. This is the key advantage of RoPE.

RoPE is used by: LLaMA, Mistral, GPT-NeoX, and most modern LLMs.
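A small numerical check of the relative-position property follows. It is a sketch that assumes the common pairing of adjacent dimensions and base 10000; real implementations differ in how they pair dimensions, and the vectors here are arbitrary.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each adjacent pair (x[2i], x[2i+1]) by the angle pos * base^(-2i/d)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)
        a, b = x[2 * i], x[2 * i + 1]
        out.append(a * math.cos(theta) - b * math.sin(theta))
        out.append(a * math.sin(theta) + b * math.cos(theta))
    return out

def score(q, m, k, n):
    """Unnormalized attention score between a query at position m and a key at n."""
    return sum(a * b for a, b in zip(rope(q, m), rope(k, n)))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 1.5]

near = score(q, 3, k, 7)        # offset n - m = 4
far = score(q, 103, k, 107)     # same offset, shifted by 100 positions
other = score(q, 3, k, 8)       # different offset
print(near, far, other)
```

The first two scores agree up to rounding, while changing the offset changes the score.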

Positional Encoding Comparison

Method	Params	Extrapolation	Relative Position	Used By
Sinusoidal	0	Theoretically yes	Via linear transform	Original Transformer
Learned	$T \times d$	No	No (absolute only)	BERT, GPT-2
RoPE	0	With NTK scaling	Natively	LLaMA, Mistral
ALiBi	0	Naturally	Natively (linear bias)	BLOOM, MPT

Common Trap

Do NOT say "sinusoidal encoding can extrapolate to any length." While mathematically the sinusoids are defined for any position, in practice models trained with max length $T$ show degraded performance beyond $T$ . True length extrapolation requires techniques like NTK-aware RoPE scaling or ALiBi. The correct statement: "Sinusoidal encoding is defined for arbitrary positions, but the model's learned attention patterns may not generalize to unseen position ranges."

Part 3 - Pre-Norm vs Post-Norm

Post-Norm (Original Transformer)

X' = \text{LayerNorm}(X + \text{SubLayer}(X))

The sublayer output is added to the input (residual), then normalized.

Pre-Norm (Modern Default)

X' = X + \text{SubLayer}(\text{LayerNorm}(X))

The input is normalized BEFORE the sublayer, and the residual connection adds the raw (unnormalized) input.
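The two orderings differ only in where the normalization sits. The sketch below uses a simplified LayerNorm without the learned scale and shift (an assumption made for brevity) and a hypothetical `sublayer` callable standing in for attention or the FFN.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale/shift (simplification for this sketch)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm_block(x, sublayer):
    # LayerNorm(X + SubLayer(X)): the norm sits on the residual path
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer):
    # X + SubLayer(LayerNorm(X)): the residual path carries X untouched
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def zero_sublayer(v):
    """A sublayer that contributes nothing, to isolate the residual path."""
    return [0.0] * len(v)

x = [1.0, 2.0, 4.0]
pre_out = pre_norm_block(x, zero_sublayer)
post_out = post_norm_block(x, zero_sublayer)
print(pre_out)    # identical to x
print(post_out)   # x has been re-normalized
```

With a sublayer that outputs zero, pre-norm is exactly the identity while post-norm still rescales the input.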

Why Pre-Norm Trains Better

Pre-Norm vs Post-Norm - Modern LLMs Use Pre-Norm for Stable Training

1. Gradient flow through residual stream.

In pre-norm, the gradient through the residual connection is an identity (unmodified by LayerNorm):

\frac{\partial X'}{\partial X} = I + \frac{\partial \text{SubLayer}(\text{LN}(X))}{\partial X}

The identity term guarantees gradient flow regardless of what happens in the sublayer. In post-norm, the LayerNorm sits on the residual path, complicating gradient flow.

2. Gradient magnitude stability.

With post-norm, gradients can grow or shrink unpredictably as they pass through LayerNorm on the residual path. Pre-norm keeps the residual stream "clean" - it is just a sum of sublayer outputs.

3. Learning rate sensitivity.

Post-norm typically requires learning rate warmup - without it, training of deep models often diverges. Pre-norm is much more robust to learning rate choice and does not strictly require warmup, though warmup still helps.

4. Which is used in practice?

Model	Norm Style
Original Transformer	Post-norm
BERT	Post-norm
GPT-2	Pre-norm
GPT-3	Pre-norm
LLaMA	Pre-norm (with RMSNorm)
T5	Pre-norm
PaLM	Pre-norm

Modern consensus: Pre-norm with RMSNorm is the standard for LLMs.

Interviewer's Perspective

When I ask about pre-norm vs post-norm, the strongest answer includes: (1) the mathematical difference in gradient flow - pre-norm has a clean identity path, (2) the practical consequence - post-norm needs warmup and careful LR tuning, pre-norm is more robust, (3) the fact that nearly all modern LLMs use pre-norm. Candidates who only know "pre-norm is better" without the gradient argument get moderate marks.

Part 4 - BERT vs GPT: Encoder-Only vs Decoder-Only

The Architecture Split

The original Transformer has both encoder and decoder. But the two most influential models after it each use only one half:

BERT (2018) - Encoder Only

Bidirectional self-attention (no causal mask)
Sees all positions simultaneously
Trained with masked language modeling (predict masked tokens)
Excels at understanding tasks: classification, NER, QA

GPT (2018-2024) - Decoder Only

Causal (autoregressive) self-attention
Position $i$ can only attend to positions $\leq i$
Trained with next-token prediction
Excels at generation tasks: text generation, code, reasoning

BERT Encoder-Only (Bidirectional) vs GPT Decoder-Only (Causal) Architecture
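The only structural difference in self-attention is the mask. A minimal sketch follows, with `True` meaning "row position may attend to column position"; the function names are illustrative.

```python
def bidirectional_mask(T):
    """BERT-style: every position may attend to every position."""
    return [[True] * T for _ in range(T)]

def causal_mask(T):
    """GPT-style: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(T)] for i in range(T)]

def show(mask):
    # Render the mask as rows of 1s (allowed) and 0s (blocked).
    return "\n".join(" ".join("1" if ok else "0" for ok in row) for row in mask)

print(show(bidirectional_mask(4)))
print()
print(show(causal_mask(4)))
```

In practice the blocked entries are set to negative infinity before the softmax, so they receive zero attention weight.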

Detailed Comparison

Aspect	BERT (Encoder)	GPT (Decoder)	T5 (Enc-Dec)
Attention mask	None (full bidirectional)	Causal (lower triangular)	Encoder: none; Decoder: causal + cross
Training objective	MLM + NSP	Next token prediction	Span corruption (text-to-text)
Context access	Full sequence	Only left context	Encoder: full; Decoder: left + encoder
Generation	Requires tricks (iterative)	Natural (autoregressive)	Natural (autoregressive decoder)
Understanding	Excellent	Good (but unidirectional)	Excellent
Dominant use	Classification, NER, embeddings	Text generation, chat, code	Translation, summarization
Modern examples	BERT, RoBERTa, DeBERTa	GPT-4, Claude, LLaMA, Mistral	T5, FLAN-T5, UL2

Why Decoder-Only Won

Most modern LLMs (GPT-4, Claude, LLaMA, Gemini) are decoder-only. Why?

Simplicity: One architecture for both understanding and generation
Scaling: The original scaling laws (Kaplan et al., 2020) were measured on decoder-only language models, and loss has improved predictably with scale for this architecture
Emergent abilities: Large decoder-only models develop strong understanding capabilities even without bidirectional attention (in-context learning, chain-of-thought)
Versatility: Any NLP task can be framed as text generation ("classify the following as positive or negative: ...")
Training efficiency: Next-token prediction on every position provides a training signal at every token (more efficient than MLM which only trains on ~15% of tokens)

Common Trap

Do NOT say "BERT is better for understanding and GPT is better for generation, so you should use BERT for classification tasks." Modern large decoder-only models (GPT-4, Claude) match or exceed BERT on classification tasks when given appropriate prompting or fine-tuning. The BERT advantage is primarily for SMALLER models where bidirectional context helps overcome limited capacity. At scale, decoder-only models can learn to "understand" from next-token prediction alone.

Part 5 - The Feed-Forward Network

Standard FFN

\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2

This is a position-wise MLP - the same network is applied independently to each position. The hidden dimension $d_{ff}$ is typically $4 \times d_{\text{model}}$ .

Modern FFN: SwiGLU

Most modern LLMs use SwiGLU (Shazeer, 2020) instead of ReLU:

\text{SwiGLU}(x) = (\text{Swish}(xW_1) \odot xW_3) W_2

Where $\text{Swish}(x) = x \cdot \sigma(x)$ and $\odot$ is element-wise multiplication.

Note: SwiGLU has THREE weight matrices instead of two, so $d_{ff}$ is typically set to $\frac{8}{3} d_{\text{model}}$ (rounded to a multiple of 256) to keep parameter count similar.

FFN Variant	Activation	Weight Matrices	Used By
Original	ReLU	2 ( $W_1$ , $W_2$ )	Original Transformer
GELU FFN	GELU	2	BERT, GPT-2
SwiGLU	Swish + gating	3 ( $W_1$ , $W_2$ , $W_3$ )	LLaMA, PaLM, Mistral
GeGLU	GELU + gating	3	Some T5 variants
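The $\frac{8}{3}$ rule can be checked with arithmetic. The sketch below rounds the hidden size up to a multiple of 256; that rounding direction is an assumption based on the convention in publicly released LLaMA-style code, and the function name is illustrative.

```python
def swiglu_hidden_dim(d_model, multiple_of=256):
    """Hidden size giving a 3-matrix SwiGLU roughly the parameters of a 4x ReLU FFN."""
    hidden = int(8 * d_model / 3)
    # Round up to the next multiple (assumed convention).
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

d_model = 4096
d_ff_swiglu = swiglu_hidden_dim(d_model)

standard_ffn = 2 * d_model * (4 * d_model)   # W_1, W_2 with d_ff = 4 * d_model
swiglu_ffn = 3 * d_model * d_ff_swiglu       # W_1, W_2, W_3

print(d_ff_swiglu)
print(swiglu_ffn / standard_ffn)
```

For $d_{\text{model}} = 4096$ this gives a hidden size of 11008, and the two parameter counts differ by under 1%.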

Why FFN Matters

The FFN layers account for 2/3 of the parameters in a standard Transformer layer ( $2 d_{\text{model}} d_{ff}$ for FFN vs $4 d_{\text{model}}^2$ for attention, and $d_{ff} = 4 d_{\text{model}}$ , so FFN has $8 d_{\text{model}}^2$ vs attention's $4 d_{\text{model}}^2$ ).

Recent research suggests that FFN layers act as key-value memories (Geva et al., 2021): each hidden unit's input weights (a column of $W_1$ in the $xW_1$ convention used here) act as a "key" (a pattern to match), and the corresponding row of $W_2$ acts as the "value" (information to write to the output). This is conceptually similar to attention but with the "keys" being fixed learned patterns rather than dynamic input-dependent patterns.

Part 6 - Efficiency Techniques

Flash Attention

Problem: Standard attention materializes the full $T \times T$ attention matrix, requiring $O(T^2)$ memory.

Solution: Flash Attention (Dao et al., 2022) computes attention in tiles that fit in GPU SRAM (on-chip memory), never materializing the full attention matrix.

Key ideas:

Tiling: Split Q, K, V into blocks that fit in SRAM
Online softmax: Compute softmax incrementally, keeping a running maximum and running normalizer and rescaling the partial results as each new block arrives
Recomputation: During backward pass, recompute attention instead of storing it (trading compute for memory)

Complexity comparison:

Aspect	Standard Attention	Flash Attention
Memory (forward)	$O(T^2)$	$O(T)$
FLOPs	$O(T^2 d)$	$O(T^2 d)$ (same)
Wall-clock time	Baseline	2-4x faster
HBM reads/writes	$O(T^2 + Td)$	$O(T^2 d^2 / M)$ where $M$ is SRAM size

The speedup comes from reducing memory I/O, not reducing computation. Modern GPUs are memory-bandwidth-limited for attention, so Flash Attention's tiled approach - which keeps data in fast SRAM - achieves significant wall-clock speedups despite the same FLOP count.
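The online-softmax idea behind the tiling can be shown for a single query in pure Python. This is a sketch of the numerical trick only, not of the GPU kernel: it processes keys and values in blocks while keeping a running max, a running normalizer, and a running weighted sum. Names and test values are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_reference(q, K, V):
    """Standard attention for one query: materializes every score first."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / z for j in range(len(V[0]))]

def attention_online(q, K, V, block_size=2):
    """Same result, but K/V are consumed block by block with running statistics."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), block_size):
        k_blk = K[start:start + block_size]
        v_blk = V[start:start + block_size]
        scores = [dot(q, k) * scale for k in k_blk]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)   # 0.0 on the first block
        probs = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(probs)
        acc = [a * correction + sum(p * v[j] for p, v in zip(probs, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.3, -1.2, 0.8, 0.5]
K = [[1.0, 0.0, 2.0, -1.0], [0.5, 0.5, 0.5, 0.5], [-2.0, 1.0, 0.0, 3.0],
     [0.0, -1.0, 1.0, 1.0], [4.0, 0.0, -1.0, 0.2]]
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0], [2.0, -1.0, 0.5]]

ref = attention_reference(q, K, V)
tiled = attention_online(q, K, V, block_size=2)
max_gap = max(abs(a - b) for a, b in zip(ref, tiled))
print(max_gap)
```

The two outputs match up to rounding, which is why Flash Attention is exact rather than approximate.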

KV Cache for Inference

During autoregressive generation, the model generates one token at a time. Without caching, generating token $t$ requires re-running the forward pass, including the K and V projections, for all $t$ previous tokens.

KV cache: Store the K and V projections from all previous tokens. When generating token $t$ :

Compute Q, K, V only for the new token $t$ (not the full sequence)
Append the new K and V to the cache
Compute attention between the new Q and the full K/V cache
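The three steps above can be sketched for a single attention layer. The `project` function is a toy stand-in for the learned $W^Q, W^K, W^V$ projections (an assumption for illustration), and the token vectors are given rather than sampled.

```python
import math

def attend(q, keys, values):
    """One query against a list of keys/values (scaled dot-product attention)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) * scale for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / z
            for j in range(len(values[0]))]

def project(x):
    """Toy stand-ins for the learned Q, K, V projections (hypothetical)."""
    q = list(x)
    k = [2.0 * v for v in x]
    v = [v + 1.0 for v in x]
    return q, k, v

def generate_with_cache(tokens):
    cache_k, cache_v, outputs = [], [], []
    for x in tokens:
        q, k, v = project(x)          # step 1: project only the new token
        cache_k.append(k)             # step 2: append new K and V to the cache
        cache_v.append(v)
        outputs.append(attend(q, cache_k, cache_v))   # step 3: new Q vs full cache
    return outputs

def generate_without_cache(tokens):
    outputs = []
    for t in range(1, len(tokens) + 1):
        projected = [project(x) for x in tokens[:t]]  # re-project the whole prefix
        q = projected[-1][0]
        keys = [p[1] for p in projected]
        values = [p[2] for p in projected]
        outputs.append(attend(q, keys, values))
    return outputs

tokens = [[0.1, 0.2], [0.5, -0.3], [-0.4, 0.9], [1.0, 1.0]]
cached = generate_with_cache(tokens)
uncached = generate_without_cache(tokens)
print(cached == uncached)
```

Both paths produce the same outputs; the cached version simply avoids redoing the projections for the prefix at every step.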

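Without KV cache: Every step reruns the full forward pass over the whole prefix. Step $t$ costs $O(t \cdot d^2 + t^2 \cdot d)$ per layer (re-projecting $t$ tokens plus full causal attention), so generating $T$ tokens costs $O(T^2 \cdot d^2 + T^3 \cdot d)$ per layer.

With KV cache: Step $t$ costs $O(d^2 + t \cdot d)$ per layer (project one token, then run one query against $t$ cached keys and values). Total for $T$ tokens: $O(T \cdot d^2 + T^2 \cdot d)$ per layer, roughly a factor of $T$ less work, paid for with the cache memory described next.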

KV cache memory:

Per layer: $2 \times T \times d_{\text{model}} \times \text{bytes}$

For a 70B model (80 layers, $d_{\text{model}} = 8192$ , 8K context, float16), assuming full multi-head attention with no GQA:

80 \times 2 \times 8192 \times 8192 \times 2 = 21.5 \text{ GB}

This is a significant memory cost - it grows linearly with context length and with the number of concurrent sequences, and can exceed the model weights for long contexts or large batches.
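The sizing formula is simple enough to keep as a helper. In this sketch `kv_dim` is the total width of the cached keys (or values) per token, which equals $d_{\text{model}}$ for full multi-head attention; the function name is illustrative.

```python
def kv_cache_bytes(n_layers, seq_len, kv_dim, bytes_per_value=2):
    """Bytes for one sequence: K and V, for every layer and every cached token."""
    return 2 * n_layers * seq_len * kv_dim * bytes_per_value

size = kv_cache_bytes(n_layers=80, seq_len=8192, kv_dim=8192)   # float16
print(size)                 # bytes
print(size / 1e9)           # decimal GB
print(size / 2**30)         # binary GiB
```

For the configuration above this is about 21.5 GB in decimal units, or exactly 20 GiB.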

KV Cache - Reuse Precomputed Keys and Values for O(T) Inference

Sliding Window Attention

Instead of attending to all previous positions, each position attends only to the most recent $w$ positions:

\text{Attention}_i = \text{softmax}\left(\frac{q_i K_{[i-w:i]}^T}{\sqrt{d_k}}\right) V_{[i-w:i]}

Benefits:

Memory per layer: $O(T \cdot w)$ instead of $O(T^2)$
Inference memory: KV cache limited to $w$ entries per layer
Effective context can exceed $w$ through stacking: with $L$ layers and window size $w$ , the receptive field is $L \times w$

Used by: Mistral (window size 4096), Longformer (local + global tokens).
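A sliding-window mask is a causal mask with a lower bound. In the sketch below, position $i$ attends to the $w$ most recent positions including itself, which is one reasonable reading of the slice notation above; the function name is illustrative.

```python
def sliding_window_mask(T, w):
    """True where position i may attend to position j: i - w < j <= i."""
    return [[i - w < j <= i for j in range(T)] for i in range(T)]

T, w = 6, 3
mask = sliding_window_mask(T, w)
for row in mask:
    print(" ".join("1" if ok else "0" for ok in row))

allowed = sum(sum(row) for row in mask)
print(allowed, "allowed entries vs", T * (T + 1) // 2, "for full causal attention")
```

Each row has at most `w` allowed entries, so the score storage is bounded by $T \times w$ .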

Interviewer's Perspective

Modern LLM interviews increasingly focus on inference efficiency. If you can explain KV cache sizing, Flash Attention's tiling strategy, and the tradeoffs of sliding window attention, you demonstrate practical systems knowledge that separates you from candidates who only know the theory.

Grouped-Query Attention (GQA)

Standard multi-head attention: each head has its own Q, K, V projections.

GQA: Multiple query heads share the same K and V projections.

Variant	Q Heads	KV Heads	KV Cache Size
Multi-Head Attention (MHA)	$h$	$h$	$2 \times h \times d_k \times T$
Grouped-Query (GQA)	$h$	$h/g$	$2 \times (h/g) \times d_k \times T$
Multi-Query (MQA)	$h$	1	$2 \times d_k \times T$

For a model with 32 query heads and 8 KV heads (GQA with group size 4), the KV cache is 4x smaller than full MHA.

Used by: LLaMA 2 (GQA in the 34B and 70B sizes), Mistral (GQA), PaLM (MQA).
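The head sharing can be sketched as an index mapping. This assumes the common convention that consecutive query heads form a group sharing one KV head; the function names and the example dimensions are illustrative.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Index of the KV head used by a given query head (contiguous grouping assumed)."""
    assert n_q_heads % n_kv_heads == 0, "query heads must divide evenly into groups"
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def kv_cache_elements(n_kv_heads, d_k, T):
    """Cached values per layer: K and V for each KV head and each token."""
    return 2 * n_kv_heads * d_k * T

mapping = [kv_head_for_query_head(h, 32, 8) for h in range(32)]
print(mapping)

mha = kv_cache_elements(n_kv_heads=32, d_k=128, T=4096)
gqa = kv_cache_elements(n_kv_heads=8, d_k=128, T=4096)
mqa = kv_cache_elements(n_kv_heads=1, d_k=128, T=4096)
print(mha // gqa, mha // mqa)
```

With 32 query heads and 8 KV heads, query heads 0-3 share KV head 0, heads 4-7 share KV head 1, and so on.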

Part 7 - Modern LLM Architecture Recipe

The Standard 2024-2026 Recipe

Most modern LLMs follow a remarkably similar architecture:

Component	Choice
Architecture	Decoder-only Transformer
Normalization	Pre-norm with RMSNorm
Positional encoding	RoPE
FFN	SwiGLU ( $d_{ff} \approx \frac{8}{3} d_{\text{model}}$ )
Attention	GQA with Flash Attention
Activation	SiLU (Swish) in FFN
Embedding	Tied or untied input-output embeddings (tying is more common in smaller models)
Bias	No bias in linear layers (saves parameters)
Context extension	NTK-aware RoPE scaling or YaRN

This recipe (pioneered by LLaMA) has become the default because each component has been individually validated through ablation studies.

Part 8 - Computing FLOPs

Forward Pass FLOPs

For a Transformer with $L$ layers, $d_{\text{model}} = d$ , $d_{ff} = 4d$ , sequence length $T$ :

Attention (per layer):

QKV projection: $3 \times 2Td^2 = 6Td^2$
Attention scores ( $QK^T$ ): $2T^2d$
Attention output ( $AV$ ): $2T^2d$
Output projection: $2Td^2$
Total: $8Td^2 + 4T^2d$

FFN (per layer):

Two linear layers: $2 \times 2T \times d \times 4d = 16Td^2$

Total per layer: $24Td^2 + 4T^2d$

Total model: $L(24Td^2 + 4T^2d) + 2TVd$ (the last term is the output projection; the embedding lookup needs no matrix multiply)

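For large $d$ and moderate $T$ (so that $24Td^2 \gg 4T^2d$ , i.e. $T \ll 6d$ ), the cost is $\approx 24LTd^2$ . The layers hold about $12Ld^2$ parameters ( $4d^2$ attention plus $8d^2$ FFN), which gives the approximation: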

\text{FLOPs} \approx 2 \times \text{Parameters} \times T

(Factor of 2 because each parameter is used in one multiply and one add.)
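The exact formula and the $2 \times \text{params} \times T$ shortcut can be compared in code. This sketch assumes $d_{ff} = 4d$ unless told otherwise and uses an illustrative configuration ( $L = 24$ , $d = 1024$ , $T = 2048$ , $V = 50257$ ); the function names are hypothetical.

```python
def forward_flops(n_layers, d, T, vocab, d_ff=None):
    """Forward-pass FLOPs for a sequence of T tokens, following the breakdown above."""
    d_ff = 4 * d if d_ff is None else d_ff
    attn_projections = 8 * T * d * d       # QKV (6Td^2) + output projection (2Td^2)
    attn_scores = 4 * T * T * d            # QK^T (2T^2 d) + AV (2T^2 d)
    ffn = 4 * T * d * d_ff                 # two linear layers, 2 FLOPs per weight
    output_projection = 2 * T * vocab * d
    return n_layers * (attn_projections + attn_scores + ffn) + output_projection

def approx_forward_flops(n_params, T):
    """The 2 * params * T rule of thumb."""
    return 2 * n_params * T

layers_only = forward_flops(24, 1024, 2048, vocab=0)
with_output = forward_flops(24, 1024, 2048, vocab=50257)
n_params = 12 * 24 * 1024**2 + 50257 * 1024     # layer weights + tied embedding
print(layers_only, with_output, approx_forward_flops(n_params, 2048))
```

The gap between the exact and approximate figures is the $4T^2d$ attention term, which the parameter-based shortcut cannot see.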

60-Second Answer

"The FLOPs for a forward pass through a Transformer are approximately $2 \times \text{params} \times \text{sequence\_length}$ . For a 70B parameter model processing 2048 tokens, that is about $2 \times 70 \times 10^9 \times 2048 \approx 2.9 \times 10^{14}$ FLOPs. Training costs approximately $6 \times \text{params} \times \text{tokens}$ FLOPs total because each token needs a forward pass ( $2 \times \text{params}$ ) plus a backward pass that costs about twice the forward pass ( $4 \times \text{params}$ ). For GPT-3 (175B params, 300B tokens), that is roughly $3.15 \times 10^{23}$ FLOPs."

Practice Problems

Problem 1: Architecture Walkthrough

You have a Transformer with $d_{\text{model}} = 1024$ , $h = 16$ , $d_{ff} = 4096$ , $L = 24$ layers (decoder-only), $V = 50257$ .

(a) Compute the total parameter count. (b) Compute the per-head dimension. (c) What is the KV cache size (in GB, float16) for a context length of 8192? (d) What is the approximate forward pass FLOP count for a 2048-token sequence?

Hint 1 -- Direction

Break down parameters by component: embedding, attention per layer, FFN per layer, LayerNorm per layer, output head. For KV cache: $2 \times L \times T \times d_{\text{model}} \times \text{bytes}$ .

Hint 2 -- Insight

Per layer: attention = $4 d^2$ params, FFN = $2 \times d \times d_{ff}$ params (ignoring bias), LN = $2d$ params. Multiply by $L$ layers. Add embedding ( $V \times d$ ) and final LN. If output weights are tied with embedding, do not double-count.

Hint 3 -- Full Solution + Rubric

(a) Parameter count:

Component	Calculation	Parameters
Token embedding	50257 x 1024	51.5M
Per-layer attention (QKV + O)	4 x 1024^2	4.19M
Per-layer FFN (no bias)	2 x 1024 x 4096	8.39M
Per-layer LayerNorm (x2)	2 x 2 x 1024	4.1K
Total per layer		12.59M
All 24 layers	24 x 12.59M	302.1M
Final LayerNorm	2 x 1024	2K
Output head (tied)	0	0
Total		~354M

This is approximately GPT-2 Medium scale.

(b) Per-head dimension: $d_k = d_v = 1024 / 16 = 64$

(c) KV cache for T=8192:

$2 \times 24 \times 8192 \times 1024 \times 2$ bytes (float16) $= 805,306,368$ bytes, about 0.81 GB (0.75 GiB)

(d) Forward FLOPs for T=2048:

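Using the approximation: $2 \times 354M \times 2048 \approx 1.45 \times 10^{12}$ FLOPs (~1.45 TFLOP)

More precisely, the layers cost $24 \times (24 \times 2048 \times 1024^2 + 4 \times 2048^2 \times 1024)$ $= 24 \times (51.5 \times 10^9 + 17.2 \times 10^9) = 24 \times 68.7 \times 10^9 \approx 1.65 \times 10^{12}$ FLOPs, and the output projection adds $2TVd = 2 \times 2048 \times 50257 \times 1024 \approx 0.21 \times 10^{12}$ , for about $1.86 \times 10^{12}$ FLOPs in total. The approximation comes out lower because it ignores the $4T^2d$ attention term.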
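Parts (a) to (c) can be verified with a short script. It follows the same assumptions as the table (no biases, no learned positional embeddings, tied output head); the function names are illustrative.

```python
def decoder_only_params(d_model, d_ff, n_layers, vocab):
    """Parameter count: embedding + layers + final LayerNorm, output head tied."""
    attention = 4 * d_model * d_model
    ffn = 2 * d_model * d_ff
    layernorms = 2 * (2 * d_model)          # two LayerNorms per layer
    per_layer = attention + ffn + layernorms
    return vocab * d_model + n_layers * per_layer + 2 * d_model

def kv_cache_bytes(n_layers, seq_len, d_model, bytes_per_value=2):
    """K and V for every layer and token (full multi-head attention)."""
    return 2 * n_layers * seq_len * d_model * bytes_per_value

total = decoder_only_params(d_model=1024, d_ff=4096, n_layers=24, vocab=50257)
d_head = 1024 // 16
cache = kv_cache_bytes(n_layers=24, seq_len=8192, d_model=1024)

print(total)                          # (a) total parameters
print(d_head)                         # (b) per-head dimension
print(cache, cache / 2**30, "GiB")    # (c) KV cache size
```

Working in exact bytes also makes the unit explicit: the cache is 0.75 GiB, which is a little over 0.8 decimal GB.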

Scoring Rubric:

Strong Hire: Correct parameter count with component breakdown, correct KV cache calculation, reasonable FLOP estimate, mentions weight tying.
Lean Hire: Approximately correct total but misses some components or makes small errors in the breakdown.
No Hire: Cannot set up the calculation or is off by more than 2x.

Problem 2: Pre-Norm vs Post-Norm Training

Your team trained a 1B parameter Transformer with post-norm. Training diverged at step 5000 despite using Adam optimizer with $\beta_1 = 0.9$ , $\beta_2 = 0.999$ , learning rate $3 \times 10^{-4}$ .

(a) Diagnose the most likely cause. (b) Propose three fixes, ranked by likelihood of success. (c) If you switch to pre-norm, what changes would you expect in training dynamics?

Hint 1 -- Direction

Post-norm is known to be sensitive to learning rate and requires warmup. A 1B model with learning rate $3 \times 10^{-4}$ and no mentioned warmup is a classic recipe for divergence.

Hint 2 -- Insight

The most likely cause is missing or insufficient learning rate warmup. Post-norm has large gradient magnitudes in early training, and a high initial LR causes weight explosion. The three most effective fixes: (1) add LR warmup over 1-2K steps, (2) switch to pre-norm, (3) reduce initial LR. Switching to pre-norm is the most robust long-term fix.

Hint 3 -- Full Solution + Rubric

(a) Most likely cause: Missing learning rate warmup.

With post-norm, early training has large, unstable gradients because LayerNorm is on the residual path and the model weights are random. A learning rate of $3 \times 10^{-4}$ without warmup causes the gradients to amplify through the LayerNorm, leading to weight explosion around step 5000.

Additional contributing factors:

At 1B parameters, gradient norms are inherently larger (more parameters contributing to gradient)
$\beta_2 = 0.999$ takes ~1000 steps to stabilize the adaptive learning rate estimates, meaning the effective LR is poorly calibrated in early training

(b) Three fixes, ranked:

Switch to pre-norm (highest confidence). This structurally eliminates the gradient instability from LayerNorm on the residual path. Pre-norm models are robust to a much wider range of learning rates and rarely diverge. Cost: architecture change, full retrain.
Add learning rate warmup (quick fix). Warmup from $10^{-7}$ to $3 \times 10^{-4}$ over 2000 steps. This lets the Adam statistics stabilize before applying large updates. Cost: hyperparameter change, can resume from earlier checkpoint.
Reduce learning rate to $10^{-4}$ with warmup. Even with warmup, $3 \times 10^{-4}$ may be too high for a 1B post-norm model. Many post-norm training recipes use lower peak LR than pre-norm equivalents.
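The warmup in the second fix can be written as a small schedule function. This is a sketch of plain linear warmup followed by a constant rate; real recipes usually add a decay phase afterwards, and the function name is illustrative.

```python
def warmup_lr(step, peak_lr=3e-4, start_lr=1e-7, warmup_steps=2000):
    """Linear warmup from start_lr to peak_lr, then hold peak_lr."""
    if step >= warmup_steps:
        return peak_lr
    return start_lr + (peak_lr - start_lr) * step / warmup_steps

for step in (0, 500, 1000, 2000, 5000):
    print(step, warmup_lr(step))
```

The value returned for each step would be assigned to the optimizer's learning rate before that step's update.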

(c) Expected changes with pre-norm:

Training becomes stable without warmup (though warmup still helps)
Can use higher learning rate ( $3 \times 10^{-4}$ or even $5 \times 10^{-4}$ )
Loss curve is smoother with fewer spikes
Final performance may be slightly different (some studies show post-norm converges to slightly better solutions when it successfully trains, but pre-norm is more reliable)
Gradient norms are more uniform across layers (post-norm at initialization has larger gradients in the layers nearest the output)

Scoring Rubric:

Strong Hire: Correctly identifies warmup as the primary issue, recommends pre-norm as the structural fix, explains the gradient flow difference between pre-norm and post-norm, mentions Adam warmup interaction.
Lean Hire: Identifies the learning rate as too high but does not connect to the pre-norm/post-norm distinction.
No Hire: Suggests reducing model size or adding dropout as the primary fix without addressing the normalization architecture.

Problem 3: Positional Encoding Design

You are designing a Transformer for processing DNA sequences (alphabet: A, C, G, T). Sequences are 10,000-100,000 base pairs long. Which positional encoding would you choose and why?

Hint 1 -- Direction

DNA sequences are extremely long (10K-100K), and biological properties can depend on relative positions (e.g., distance between a promoter and a gene). Consider which encodings handle long sequences and capture relative position naturally.

Hint 2 -- Insight

Learned positional encodings are out - you would need 100K position embeddings and cannot extrapolate. Sinusoidal can define positions for any length but models still struggle at unseen lengths. RoPE with NTK-aware scaling is the best choice: it natively captures relative position (critical for biology), has no learned parameters, and can be extended to longer sequences than seen in training. ALiBi is also a strong contender because it naturally handles extrapolation.

Hint 3 -- Full Solution + Rubric

Recommendation: RoPE with NTK-aware scaling, combined with sliding window attention.

Why RoPE:

Relative position matters in biology. Interactions between sequence elements often depend on how far apart they are (for example, the distance between a promoter and a gene). RoPE's attention scores naturally depend on relative position $n - m$ .
No learned parameters. With 100K positions, learned embeddings would add 100K x $d_{\text{model}}$ unnecessary parameters.
Extrapolation. NTK-aware RoPE scaling allows training on shorter sequences (e.g., 16K) and inference on longer ones (e.g., 100K).

Why also sliding window attention:

Full attention over 100K tokens requires $(10^5)^2 = 10^{10}$ entries per head - prohibitively expensive
Biology often has local structure (codon-level, ~3bp) and medium-range structure (gene-level, ~1-10Kbp)
Sliding window of ~4K captures local structure; add global tokens every 1K positions for long-range
This reduces memory from $O(T^2)$ to $O(T \times w)$

Alternative: ALiBi

Also handles long sequences natively (attention decays linearly with distance)
The linear decay matches biological intuition: nearby nucleotides interact more strongly
Simpler to implement than RoPE scaling
Potential downside: the fixed linear decay pattern may not match all biological distance relationships

What NOT to use:

Learned positional embeddings: cannot extrapolate, wasteful parameters
Sinusoidal (without modification): while defined for all positions, models trained on 16K rarely generalize to 100K
No positional encoding: DNA sequence order is critical (ACGT and TGCA are very different)

Scoring Rubric:

Strong Hire: Chooses RoPE or ALiBi with clear justification for relative position and extrapolation, addresses the computational challenge of long sequences with efficient attention, connects to biological domain knowledge.
Lean Hire: Chooses an appropriate encoding but does not address the long-sequence computational challenge.
No Hire: Chooses learned positional embeddings for 100K positions or does not address extrapolation.

Problem 4: KV Cache Optimization

You are serving a 70B parameter LLM (80 layers, $d_{\text{model}} = 8192$ , 64 query heads, 8 KV heads via GQA) on 8 x A100 GPUs (80GB each).

(a) What is the KV cache memory per user for a 32K context length? (b) How many concurrent users can you serve? (c) Propose two strategies to increase concurrent users by 4x.

Hint 1 -- Direction

KV cache per user = $2 \times \text{layers} \times T \times d_{\text{head}} \times \text{KV\_heads} \times \text{bytes}$ . Note GQA: 8 KV heads, not 64. Total GPU memory minus model weights gives you the cache budget.

Hint 2 -- Insight

The 70B model in float16 uses ~140GB. Across 8 A100s (640GB total), that leaves ~500GB for KV cache and overhead. With GQA (8 KV heads instead of 64), the KV cache per user for 32K context should be significantly smaller than full MHA. Strategies to increase users: quantize KV cache to int8 (2x), use PagedAttention to reduce memory fragmentation (1.5-2x), reduce context window with sliding window attention.

Hint 3 -- Full Solution + Rubric

(a) KV cache per user:

Per layer: $2 \times 32768 \times 8 \times (8192/64) \times 2$ bytes

Breaking it down:

$d_{\text{head}} = 8192 / 64 = 128$
KV heads = 8 (GQA)
Per layer: $2 \times 32768 \times 8 \times 128 \times 2 = 134,217,728$ bytes = 128 MB
80 layers: $80 \times 128$ MB = 10.24 GB per user

Without GQA (64 KV heads): would be $8 \times 10.24 = 81.9$ GB per user - illustrating why GQA is critical for serving.

(b) Concurrent users:

Model weights: 70B params x 2 bytes = 140 GB (spread across 8 GPUs) Total GPU memory: 8 x 80 = 640 GB Available for KV cache: ~640 - 140 - 50 (overhead) = 450 GB Users: $450 / 10.24 \approx 43$ concurrent users

(c) Strategies to increase to ~172 users (4x):

Strategy 1: Quantize KV cache to INT8 (2x improvement)

Store K and V values in 8-bit integers instead of float16
Reduces per-user cache from 10.24 GB to 5.12 GB
Quality impact: minimal (KV values have limited dynamic range)
New capacity: ~88 users

Strategy 2: PagedAttention / vLLM (1.5-2x improvement)

Standard KV cache pre-allocates max context per user, wasting memory when actual context is shorter
PagedAttention allocates cache in pages (like OS virtual memory), only allocating what is actually used
If average context is 50% of max, this doubles effective capacity
Combined with INT8: ~130-175 users
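The paging idea in Strategy 2 can be sketched as a toy allocator that hands out fixed-size pages on demand. This is a simplified illustration of the concept only; the class `PagedKVAllocator` and its methods are hypothetical and do not mirror the vLLM API.

```python
class PagedKVAllocator:
    """Toy page allocator: each sequence owns a list of fixed-size pages."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.page_table = {}  # seq_id -> list of physical page ids
        self.lengths = {}     # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token; allocate a page only when needed."""
        n = self.lengths.get(seq_id, 0)
        if n % self.page_size == 0:  # sequence is new or its last page is full
            if not self.free_pages:
                raise MemoryError("no free KV pages")
            self.page_table.setdefault(seq_id, []).append(self.free_pages.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return all pages of a finished sequence to the free pool."""
        self.free_pages.extend(self.page_table.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_pages=8, page_size=16)
for _ in range(20):
    alloc.append_token("user_a")  # 20 tokens -> 2 pages
for _ in range(5):
    alloc.append_token("user_b")  # 5 tokens -> 1 page
pages_used = {seq: len(pages) for seq, pages in alloc.page_table.items()}
print(pages_used, len(alloc.free_pages))
```

Memory use tracks actual sequence length rounded up to a page, so the waste per user is at most one partially filled page instead of the unused remainder of a 32K reservation.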

Other strategies:

Sliding window attention: Limit KV cache to last $w$ tokens per layer. With $w = 4096$ (instead of 32K), reduces cache by 8x.
KV cache eviction: Evict keys that receive little attention, as in the Heavy-Hitter Oracle (H2O) approach, keeping only the top-$k$ most attended positions.
Model quantization: Quantize model weights to INT4 (from 140GB to 35GB), freeing ~105GB more for cache.
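The sliding-window option above amounts to a bounded buffer per layer. A minimal sketch, assuming a hypothetical `SlidingWindowKVCache` class and using strings in place of real key and value tensors:

```python
from collections import deque

class SlidingWindowKVCache:
    """Keeps only the most recent `window` (key, value) pairs for one layer."""

    def __init__(self, window):
        # deque with maxlen drops the oldest entry automatically on append
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)

cache = SlidingWindowKVCache(window=4)
for t in range(10):
    cache.append(f"k{t}", f"v{t}")  # placeholders for per-token K/V tensors

print(list(cache.keys))  # only the last 4 positions remain

# Cache size ratio for the numbers in the text: 32K context vs 4096 window.
reduction = 32768 // 4096
```

The cache stops growing once it holds `window` entries, which is where the 8x reduction for a 32K context comes from; the trade-off is that tokens older than the window are no longer directly attendable in that layer.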

Scoring Rubric:

Strong Hire: Correct KV cache calculation with GQA, reasonable concurrent user estimate, proposes INT8 cache quantization AND PagedAttention with quantified improvement estimates, mentions GQA benefit.
Lean Hire: Correct calculation but only suggests one optimization strategy.
No Hire: Incorrect KV cache calculation (forgets GQA or gets dimensions wrong) or suggests only "use a bigger GPU."

Interview Cheat Sheet

Concept	Key Formula / Fact	One-Liner	Red Flag
Transformer blocks	Self-attn + FFN + residual + norm	Parallel processing of all positions	"Transformers use RNNs internally"
Position encoding	Sinusoidal / Learned / RoPE	Without it, Transformer is permutation-equivariant	"Position encoding is optional"
Pre-norm vs post-norm	Pre: $X + \text{Sub}(\text{LN}(X))$	Pre-norm has clean gradient path	"They are equivalent"
FFN	$\text{ReLU}(xW_1)W_2$ ; modern: SwiGLU	2/3 of layer parameters	"FFN is just a small detail"
BERT vs GPT	Bidirectional vs causal masking	Both use attention, different masks	"BERT is always better for understanding"
Flash Attention	Tiled attention, $O(T)$ memory	Same FLOPs, less memory, faster	"Flash Attention is approximate"
KV Cache	Store K,V for previous tokens	Avoids recomputation during generation	"KV cache stores attention weights"
GQA	Shared KV heads	Reduces KV cache by $h/g$	"GQA reduces model quality significantly"
Sliding window	Attend to last $w$ positions	$O(Tw)$ instead of $O(T^2)$	"Sliding window sees full context"
Parameter count	$\approx 12 L d^2$ for decoder-only	FFN dominates at scale	Off by more than 2x
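The $\approx 12 L d^2$ rule in the last row comes from $4d^2$ attention weights plus $8d^2$ FFN weights per layer (with a $4d$ hidden size). A small sketch, using an illustrative 12-layer, 768-dimensional configuration and a hypothetical helper name; it ignores embeddings, biases, and norm parameters:

```python
def approx_block_params(num_layers, d_model, ffn_mult=4):
    """Approximate weight count of the Transformer blocks only."""
    attn = 4 * d_model * d_model            # W_Q, W_K, W_V, W_O
    ffn = 2 * ffn_mult * d_model * d_model  # up and down projections
    return num_layers * (attn + ffn)

total = approx_block_params(num_layers=12, d_model=768)
ffn_share = (2 * 4) / (4 + 2 * 4)  # FFN fraction of per-layer weights

print(f"{total:,} parameters, FFN share = {ffn_share:.2f}")
```

With `ffn_mult=4` the FFN accounts for two thirds of each layer, matching the FFN row of the table.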

Spaced Repetition Checkpoints

Day 0 -- Initial Learning

Read this entire page
Draw the full Transformer architecture from memory (encoder + decoder)
Walk through the forward pass for a 4-token sequence, writing dimensions at each step
Complete the self-assessment

Day 3 -- First Recall

Without notes, explain pre-norm vs post-norm and why pre-norm is preferred
Give the "60-Second Answer" for the full Transformer, out loud, timed
Write the sinusoidal positional encoding formula and explain why sinusoidal

Day 7 -- Connections

Compare BERT, GPT, and T5 architectures (attention masks, training objectives, use cases)
Do Practice Problem 1 (parameter count + KV cache) on paper without hints
Explain Flash Attention to an imaginary interviewer (tiling, online softmax, memory savings)

Day 14 -- Application

Do Practice Problem 4 (KV cache optimization) under timed conditions (12 minutes)
Compute the parameter count for LLaMA-7B from its config (32 layers, 4096 dim, 32 heads)
Explain the modern LLM recipe and justify each design choice

Day 21 -- Mock Interview

Have someone ask: "Draw the Transformer, walk through a forward pass with dimensions, then explain how you would optimize it for serving at scale"
Time yourself: architecture drawing in under 3 minutes, forward pass in under 5 minutes, optimization discussion in under 5 minutes
Do all 4 practice problems in sequence under timed conditions (45 minutes total)

Key Takeaways

The Transformer is fundamentally a stack of two operations: attention and feed-forward. Everything else - residual connections, normalization, positional encoding - is scaffolding that makes these two operations trainable and effective. Understanding the architecture means understanding why each piece of scaffolding exists.
Pre-norm with RMSNorm is the modern standard for a reason. Pre-norm provides a clean gradient path through the residual stream and makes training far less sensitive to learning-rate warmup, while RMSNorm is computationally cheaper than full LayerNorm because it skips mean-centering. If you design a new Transformer, use pre-norm.
Decoder-only won because of simplicity and scaling. While encoder-decoder models can be more parameter-efficient for some tasks, the simplicity of a single decoder stack - one architecture for understanding and generation - made it the foundation of GPT-4, Claude, and the modern LLM paradigm.
Efficiency techniques are not optional knowledge. Flash Attention, KV caching, GQA, and sliding window attention are the difference between a Transformer that works on paper and one that can actually serve millions of users. Interviewers at production-focused companies increasingly weight this knowledge heavily.
