Transformer Architecture - The Engine of Modern AI

Reading time: ~45 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Engineer, LLM Engineer, NLP Engineer

The Real Interview Moment

You are in an Anthropic research engineer interview. The interviewer hands you a blank sheet and says: "Draw the complete Transformer architecture. Label every component with its exact dimensions for a model with $d_{\text{model}} = 512$, 8 heads, and a sequence length of 128. Then walk me through a single token from input embedding to output logits."

You draw the architecture, and she follows up: "Why does GPT use only the decoder? Why does BERT use only the encoder? What is the fundamental difference in their attention masks, and what tasks does each architecture excel at?"

This is the most asked question in modern AI interviews. The Transformer is the architecture behind GPT-4, Claude, BERT, ViT, Stable Diffusion, AlphaFold, and virtually every state-of-the-art model. You need to know it at the level of being able to implement it from scratch, explain every design decision, and discuss the modern variants that make it scale.

Candidates who can draw the architecture but cannot explain pre-norm vs post-norm, or who confuse the encoder and decoder attention patterns, get a "lean no-hire." Candidates who can walk through the full forward pass with dimensions, explain positional encoding design choices, and discuss efficiency techniques like Flash Attention and KV caching get a "strong hire."

What You Will Master

  • Draw the complete Transformer architecture (encoder + decoder) with all sub-components
  • Trace a token through the full forward pass with exact dimensions at every step
  • Derive sinusoidal positional encodings and compare with learned and RoPE
  • Explain pre-norm vs post-norm and why pre-norm trains more stably
  • Compare encoder-only (BERT), decoder-only (GPT), and encoder-decoder (T5) architectures
  • Describe Flash Attention, KV cache, and sliding window attention
  • Compute the parameter count and FLOPs for a given Transformer configuration
  • Solve interview problems on architecture design, efficiency, and scaling

Self-Assessment: Where Are You Now?

| Skill | 1 -- Cannot | 2 -- Vaguely | 3 -- Can Explain | 4 -- Can Derive | 5 -- Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Draw the full Transformer architecture | | | | | | ___ |
| Walk through forward pass with dimensions | | | | | | ___ |
| Explain sinusoidal positional encoding | | | | | | ___ |
| Compare pre-norm vs post-norm | | | | | | ___ |
| Explain BERT vs GPT architecture differences | | | | | | ___ |
| Describe Flash Attention mechanism | | | | | | ___ |
| Explain KV cache for inference | | | | | | ___ |
| Compute parameter count for a given config | | | | | | ___ |

Target: All 4s and 5s before your interview.

Part 1 - The Full Architecture

High-Level Structure

The original Transformer (Vaswani et al., 2017, "Attention Is All You Need") has two main components:

  1. Encoder: Processes the input sequence into contextual representations
  2. Decoder: Generates the output sequence token by token, conditioned on the encoder output

Transformer Full Architecture: Encoder + Decoder with Cross-Attention (Vaswani et al., 2017)

Layer-by-Layer Walkthrough with Dimensions

Let us trace a forward pass with: $d_{\text{model}} = 512$, $h = 8$ heads, $d_{ff} = 2048$, sequence length $T = 128$, vocabulary size $V = 32000$.

Step 1: Input Embedding

$$X = \text{Embed}(\text{tokens}) \in \mathbb{R}^{128 \times 512}$$

Token embedding lookup: each token ID maps to a 512-dimensional vector. Parameters: $V \times d_{\text{model}} = 32000 \times 512 = 16.4\text{M}$.

Step 2: Positional Encoding

$$X' = X + PE \in \mathbb{R}^{128 \times 512}$$

Add position information (see Part 2 for details).

Step 3: Multi-Head Self-Attention

For each of 8 heads ($d_k = d_v = 512/8 = 64$):

$$Q_i = X'W_i^Q \in \mathbb{R}^{128 \times 64}$$
$$K_i = X'W_i^K \in \mathbb{R}^{128 \times 64}$$
$$V_i = X'W_i^V \in \mathbb{R}^{128 \times 64}$$
$$\text{head}_i = \text{softmax}\left(\frac{Q_iK_i^T}{\sqrt{64}}\right)V_i \in \mathbb{R}^{128 \times 64}$$

Concatenate: $\text{Concat}(\text{head}_1, \ldots, \text{head}_8) \in \mathbb{R}^{128 \times 512}$

Output projection: $\text{MultiHead} = \text{Concat} \cdot W^O \in \mathbb{R}^{128 \times 512}$

Attention parameters per layer: $4 \times d_{\text{model}}^2 = 4 \times 512^2 = 1{,}048{,}576$ (the four $512 \times 512$ projections $W^Q$, $W^K$, $W^V$, $W^O$, counted without biases).

Step 4: Add and Norm (Residual Connection + Layer Normalization)

$$X'' = \text{LayerNorm}(X' + \text{MultiHead}(X'))$$

LayerNorm parameters: $2 \times d_{\text{model}} = 1024$ (scale and shift).

Step 5: Feed-Forward Network

$$\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2$$

Where $W_1 \in \mathbb{R}^{512 \times 2048}$, $W_2 \in \mathbb{R}^{2048 \times 512}$.

FFN parameters per layer: $2 \times d_{\text{model}} \times d_{ff} + d_{ff} + d_{\text{model}} = 2 \times 512 \times 2048 + 2048 + 512 = 2{,}099{,}712$.

Step 6: Add and Norm again.

$$\text{Output} = \text{LayerNorm}(X'' + \text{FFN}(X''))$$

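As a quick check on the dimensions in Steps 1-6, the sketch below traces the shape at each stage of one encoder layer using plain tuple arithmetic. The function name `trace_encoder_layer` and the dictionary keys are illustrative only; no tensors are actually computed.

```python
def trace_encoder_layer(T=128, d_model=512, h=8, d_ff=2048):
    """Return the tensor shape at each step of one encoder layer (no batch dim)."""
    assert d_model % h == 0, "d_model must be divisible by the number of heads"
    d_k = d_model // h  # per-head dimension
    return {
        "embedding + PE": (T, d_model),
        "Q_i / K_i / V_i (per head)": (T, d_k),
        "scores Q_i K_i^T (per head)": (T, T),
        "head_i": (T, d_k),
        "concat of heads": (T, h * d_k),
        "after W^O": (T, d_model),
        "FFN hidden": (T, d_ff),
        "layer output": (T, d_model),
    }


shapes = trace_encoder_layer()
for name, shape in shapes.items():
    print(f"{name:30s} {shape}")
```

The only place where a $T \times T$ shape appears is the attention score matrix, which is why attention memory grows quadratically with sequence length.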
60-Second Answer

"The Transformer processes sequences entirely through attention, with no recurrence. Each encoder layer has multi-head self-attention followed by a feed-forward network, with residual connections and layer normalization around each. The decoder adds causal masking in self-attention to prevent attending to future tokens, plus a cross-attention layer that queries the encoder output. Positional encoding is added to the input embeddings since attention has no inherent notion of position. The whole architecture is highly parallelizable because all positions are processed simultaneously."

Total Parameter Count

For a 6-layer encoder + 6-layer decoder Transformer with $d_{\text{model}} = 512$, $d_{ff} = 2048$, $V = 32000$:

| Component | Parameters |
|---|---|
| Token embedding | 32000 x 512 = 16.4M |
| Positional encoding (sinusoidal) | 0 (fixed) |
| Encoder self-attention (per layer) | 4 x 512^2 = 1.05M |
| Encoder FFN (per layer) | 2 x 512 x 2048 + 2048 + 512 = 2.1M |
| Encoder LayerNorm (per layer) | 2 x 2 x 512 = 2K |
| Encoder total (6 layers) | 6 x 3.15M = 18.9M |
| Decoder self-attention (per layer) | 1.05M |
| Decoder cross-attention (per layer) | 1.05M |
| Decoder FFN (per layer) | 2.1M |
| Decoder LayerNorm (per layer) | 3 x 2 x 512 = 3K |
| Decoder total (6 layers) | 6 x 4.2M = 25.2M |
| Output linear (tied with embedding) | 0 (weight tying) |
| Total | ~60.5M |

Company Variation

At Google and Meta, interviewers expect you to compute parameter counts on the spot. At research-focused companies (DeepMind, Anthropic), they want you to know how parameters scale and which component dominates at different model sizes. Key insight: for small models, embeddings dominate; for large models (billions of parameters), FFN layers dominate because each of the $L$ layers contributes $O(d_{\text{model}} \times d_{ff})$ parameters, while embeddings contribute $O(V \times d_{\text{model}})$ only once.
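The on-the-spot computation can be sketched as a few small functions. This follows the same counting conventions as the table above (no attention biases, FFN biases included, tied output projection); the function names are illustrative.

```python
def attention_params(d_model):
    # W^Q, W^K, W^V, W^O, each d_model x d_model, biases not counted
    return 4 * d_model * d_model


def ffn_params(d_model, d_ff):
    # two weight matrices plus their bias vectors
    return 2 * d_model * d_ff + d_ff + d_model


def layernorm_params(d_model):
    return 2 * d_model  # scale and shift


def encoder_decoder_params(d_model=512, d_ff=2048, vocab=32000, n_enc=6, n_dec=6):
    embedding = vocab * d_model  # shared with the output projection (weight tying)
    enc_layer = (attention_params(d_model) + ffn_params(d_model, d_ff)
                 + 2 * layernorm_params(d_model))
    # decoder layers add cross-attention and a third LayerNorm
    dec_layer = (2 * attention_params(d_model) + ffn_params(d_model, d_ff)
                 + 3 * layernorm_params(d_model))
    counts = {
        "embedding": embedding,
        "encoder": n_enc * enc_layer,
        "decoder": n_dec * dec_layer,
    }
    counts["total"] = sum(counts.values())
    return counts


counts = encoder_decoder_params()
for name, n in counts.items():
    print(f"{name:10s} {n:>12,d}  ({n / counts['total']:.1%})")
```

At this size the embedding table is roughly a quarter of the model; rerunning with a larger `d_model` and more layers shows that share shrinking quickly.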

Part 2 - Positional Encoding

Why We Need It

Self-attention is permutation-equivariant: if you shuffle the input tokens, the output is shuffled in the same way, because the computation treats all positions identically. Without positional encoding, the Transformer has no notion of word order - "the cat sat on the mat" and "mat the on sat cat the" would produce the same per-token representations, just in a different order.

Sinusoidal Positional Encoding (Original)

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

Where $pos$ is the position index and $i$ indexes the sine/cosine dimension pair ($0 \le i < d_{\text{model}}/2$).

Why sinusoidal?

  1. Fixed (no learned parameters): Does not add to model size
  2. Extrapolation: Can theoretically handle sequences longer than those seen in training
  3. Relative position encoding: For any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$:
$$PE_{pos+k} = M_k \cdot PE_{pos}$$

where $M_k$ is a block-diagonal rotation matrix that depends only on $k$, not on $pos$. This allows the model to learn relative position patterns.

  4. Multi-scale patterns: Low-frequency sinusoids capture long-range position differences; high-frequency sinusoids capture fine-grained position differences
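The relative-position property can be checked numerically. The sketch below builds the encoding from the formulas above and then obtains $PE_{pos+k}$ from $PE_{pos}$ by rotating each (sin, cos) pair with the angle-addition identities. The helper names are illustrative.

```python
import math


def sinusoidal_pe(pos, d_model):
    """Positional encoding vector for one position (d_model must be even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe


def shift_by_rotation(pe, k, d_model):
    """Compute PE[pos + k] from PE[pos] alone by rotating each (sin, cos) pair."""
    out = []
    for i in range(d_model // 2):
        w = 1.0 / (10000 ** (2 * i / d_model))  # frequency of pair i
        s, c = pe[2 * i], pe[2 * i + 1]
        # sin(a + b) and cos(a + b) with b = w * k
        out.append(s * math.cos(w * k) + c * math.sin(w * k))
        out.append(c * math.cos(w * k) - s * math.sin(w * k))
    return out


d_model = 16
direct = sinusoidal_pe(12, d_model)
rotated = shift_by_rotation(sinusoidal_pe(5, d_model), 7, d_model)
max_err = max(abs(a - b) for a, b in zip(direct, rotated))
print(f"max difference: {max_err:.2e}")
```

Note that `shift_by_rotation` never looks at `pos`: the transform depends only on the offset `k`, which is the content of the $M_k$ claim.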

Learned Positional Encoding

$$PE = W_{pos}[pos] \in \mathbb{R}^{d_{\text{model}}}$$

A lookup table of $T_{\max}$ learned vectors, one per position. Used in BERT and GPT-2.

Pros: More flexible, can learn task-specific patterns. Cons: Cannot extrapolate beyond $T_{\max}$, adds $T_{\max} \times d_{\text{model}}$ parameters.

Rotary Position Embedding (RoPE)

RoPE (Su et al., 2021) encodes position by rotating the query and key vectors:

$$f_q(x_m, m) = R_m x_m, \quad f_k(x_n, n) = R_n x_n$$

Where $R_m$ is a rotation matrix that depends on position $m$. The inner product becomes:

$$f_q(x_m, m)^T f_k(x_n, n) = x_m^T R_m^T R_n x_n = x_m^T R_{n-m} x_n$$

The attention score depends only on the relative position $n - m$, not the absolute positions. This is the key advantage of RoPE.

RoPE is used by: LLaMA, Mistral, GPT-NeoX, and most modern LLMs.
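A minimal sketch of the relative-position property, assuming the common formulation in which consecutive pairs of dimensions are rotated by angles $m \cdot \theta_i$ with $\theta_i = 10000^{-2i/d}$. The vectors `q` and `k` are arbitrary illustrative values.

```python
import math


def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of x by position-dependent angles (len(x) even)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)
        a, b = x[2 * i], x[2 * i + 1]
        out.append(a * math.cos(theta) - b * math.sin(theta))
        out.append(a * math.sin(theta) + b * math.cos(theta))
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

score_a = dot(rope(q, 3), rope(k, 7))    # positions (3, 7): offset 4
score_b = dot(rope(q, 10), rope(k, 14))  # positions (10, 14): offset 4
score_c = dot(rope(q, 3), rope(k, 8))    # positions (3, 8): offset 5
print(score_a, score_b, score_c)
```

The first two scores agree to floating-point precision because only the offset $n - m$ enters the inner product; changing the offset changes the score.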

Positional Encoding Comparison

| Method | Params | Extrapolation | Relative Position | Used By |
|---|---|---|---|---|
| Sinusoidal | 0 | Theoretically yes | Via linear transform | Original Transformer |
| Learned | $T \times d$ | No | No (absolute only) | BERT, GPT-2 |
| RoPE | 0 | With NTK scaling | Natively | LLaMA, Mistral |
| ALiBi | 0 | Naturally | Natively (linear bias) | BLOOM, MPT |

Common Trap

Do NOT say "sinusoidal encoding can extrapolate to any length." While mathematically the sinusoids are defined for any position, in practice models trained with max length $T$ show degraded performance beyond $T$. True length extrapolation requires techniques like NTK-aware RoPE scaling or ALiBi. The correct statement: "Sinusoidal encoding is defined for arbitrary positions, but the model's learned attention patterns may not generalize to unseen position ranges."

Part 3 - Pre-Norm vs Post-Norm

Post-Norm (Original Transformer)

$$X' = \text{LayerNorm}(X + \text{SubLayer}(X))$$

The sublayer output is added to the input (residual), then normalized.

Pre-Norm (Modern Default)

$$X' = X + \text{SubLayer}(\text{LayerNorm}(X))$$

The input is normalized BEFORE the sublayer, and the residual connection adds the raw (unnormalized) input.
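The two orderings can be written side by side. This is a toy sketch on a single vector: `layer_norm` omits the learned scale and shift for brevity, and `zero_sublayer` is a hypothetical stand-in that makes the difference in the residual path easy to see.

```python
import math


def layer_norm(x, eps=1e-5):
    """LayerNorm over one vector, without learned scale/shift for brevity."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_norm_block(x, sublayer):
    # X' = LayerNorm(X + SubLayer(X))
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_norm_block(x, sublayer):
    # X' = X + SubLayer(LayerNorm(X))
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def zero_sublayer(v):
    return [0.0] * len(v)


x = [4.0, -2.0, 10.0, 0.0]
pre_out = pre_norm_block(x, zero_sublayer)
post_out = post_norm_block(x, zero_sublayer)
print(pre_out)   # the residual stream passes through untouched
print(post_out)  # LayerNorm has rescaled the residual stream
```

Even when the sublayer contributes nothing, post-norm still transforms the residual stream, whereas pre-norm leaves it as an exact identity.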

Why Pre-Norm Trains Better

Pre-Norm vs Post-Norm - Modern LLMs Use Pre-Norm for Stable Training

1. Gradient flow through residual stream.

In pre-norm, the gradient through the residual connection is an identity (unmodified by LayerNorm):

$$\frac{\partial X'}{\partial X} = I + \frac{\partial \text{SubLayer}(\text{LN}(X))}{\partial X}$$

The identity term guarantees gradient flow regardless of what happens in the sublayer. In post-norm, the LayerNorm sits on the residual path, complicating gradient flow.

2. Gradient magnitude stability.

With post-norm, gradients can grow or shrink unpredictably as they pass through LayerNorm on the residual path. Pre-norm keeps the residual stream "clean" - it is just a sum of sublayer outputs.

3. Learning rate sensitivity.

Post-norm typically requires learning rate warmup - without it, training of deep models often diverges. Pre-norm is much more robust to learning rate choice and does not strictly require warmup, though warmup still helps.

4. Which is used in practice?

| Model | Norm Style |
|---|---|
| Original Transformer | Post-norm |
| BERT | Post-norm |
| GPT-2 | Pre-norm |
| GPT-3 | Pre-norm |
| LLaMA | Pre-norm (with RMSNorm) |
| T5 | Pre-norm |
| PaLM | Pre-norm |

Modern consensus: Pre-norm with RMSNorm is the standard for LLMs.

Interviewer's Perspective

When I ask about pre-norm vs post-norm, the strongest answer includes: (1) the mathematical difference in gradient flow - pre-norm has a clean identity path, (2) the practical consequence - post-norm needs warmup and careful LR tuning, pre-norm is more robust, (3) the fact that nearly all modern LLMs use pre-norm. Candidates who only know "pre-norm is better" without the gradient argument get moderate marks.

Part 4 - BERT vs GPT: Encoder-Only vs Decoder-Only

The Architecture Split

The original Transformer has both encoder and decoder. But the two most influential models after it each use only one half:

BERT (2018) - Encoder Only

  • Bidirectional self-attention (no causal mask)
  • Sees all positions simultaneously
  • Trained with masked language modeling (predict masked tokens)
  • Excels at understanding tasks: classification, NER, QA

GPT (2018-2024) - Decoder Only

  • Causal (autoregressive) self-attention
  • Position $i$ can only attend to positions $j \leq i$
  • Trained with next-token prediction
  • Excels at generation tasks: text generation, code, reasoning
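The difference between the two attention patterns comes down to which positions each row of the softmax is allowed to see. A small sketch, using uniform scores so that the effect of the mask is obvious (the function name is illustrative):

```python
import math


def attention_weights(scores, causal):
    """Row-wise softmax over a T x T score matrix, optionally with a causal mask."""
    T = len(scores)
    weights = []
    for i in range(T):
        # causal: position i sees only j <= i; bidirectional: sees every j
        visible = [j for j in range(T) if (j <= i or not causal)]
        m = max(scores[i][j] for j in visible)
        exps = {j: math.exp(scores[i][j] - m) for j in visible}
        z = sum(exps.values())
        weights.append([exps.get(j, 0.0) / z for j in range(T)])
    return weights


scores = [[0.0] * 4 for _ in range(4)]
bert_like = attention_weights(scores, causal=False)
gpt_like = attention_weights(scores, causal=True)

for row in gpt_like:
    print([round(w, 2) for w in row])
```

With the causal mask the weight matrix is lower triangular: the first token can only attend to itself, and each later token spreads its attention over a growing prefix.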

BERT Encoder-Only (Bidirectional) vs GPT Decoder-Only (Causal) Architecture

Detailed Comparison

| Aspect | BERT (Encoder) | GPT (Decoder) | T5 (Enc-Dec) |
|---|---|---|---|
| Attention mask | None (full bidirectional) | Causal (lower triangular) | Encoder: none; Decoder: causal + cross |
| Training objective | MLM + NSP | Next token prediction | Span corruption (text-to-text) |
| Context access | Full sequence | Only left context | Encoder: full; Decoder: left + encoder |
| Generation | Requires tricks (iterative) | Natural (autoregressive) | Natural (autoregressive decoder) |
| Understanding | Excellent | Good (but unidirectional) | Excellent |
| Dominant use | Classification, NER, embeddings | Text generation, chat, code | Translation, summarization |
| Modern examples | BERT, RoBERTa, DeBERTa | GPT-4, Claude, LLaMA, Mistral | T5, FLAN-T5, UL2 |

Why Decoder-Only Won

Most modern LLMs (GPT-4, Claude, LLaMA, Gemini) are decoder-only. Why?

  1. Simplicity: One architecture for both understanding and generation
  2. Scaling: Decoder-only models scale more predictably (Kaplan et al., scaling laws)
  3. Emergent abilities: Large decoder-only models develop strong understanding capabilities even without bidirectional attention (in-context learning, chain-of-thought)
  4. Versatility: Any NLP task can be framed as text generation ("classify the following as positive or negative: ...")
  5. Training efficiency: Next-token prediction on every position provides a training signal at every token (more efficient than MLM, which only trains on ~15% of tokens)

Common Trap

Do NOT say "BERT is better for understanding and GPT is better for generation, so you should use BERT for classification tasks." Modern large decoder-only models (GPT-4, Claude) match or exceed BERT on classification tasks when given appropriate prompting or fine-tuning. The BERT advantage is primarily for SMALLER models where bidirectional context helps overcome limited capacity. At scale, decoder-only models can learn to "understand" from next-token prediction alone.

Part 5 - The Feed-Forward Network

Standard FFN

FFN(x)=ReLU(xW1+b1)W2+b2\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2

This is a position-wise MLP - the same network is applied independently to each position. The hidden dimension $d_{ff}$ is typically $4 \times d_{\text{model}}$.

Modern FFN: SwiGLU

Most modern LLMs use SwiGLU (Shazeer, 2020) instead of ReLU:

$$\text{SwiGLU}(x) = (\text{Swish}(xW_1) \odot xW_3) W_2$$

Where $\text{Swish}(x) = x \cdot \sigma(x)$ and $\odot$ is element-wise multiplication.

Note: SwiGLU has THREE weight matrices instead of two, so $d_{ff}$ is typically set to $\frac{8}{3} d_{\text{model}}$ (rounded to a multiple of 256) to keep parameter count similar.
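A small sketch of both points: the hidden-size rule and the gated computation itself. Rounding up (rather than to the nearest multiple) is an assumption here, since the text only says "rounded"; `matvec` and the function names are illustrative helpers.

```python
import math


def swiglu_hidden_dim(d_model, multiple_of=256):
    """Take (8/3) * d_model and round UP to a multiple of `multiple_of` (assumed)."""
    hidden = int(8 * d_model / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)


def swish(v):
    return v / (1.0 + math.exp(-v))  # v * sigmoid(v)


def matvec(x, W):
    """Row vector x (length n) times matrix W (n x m, given as a list of rows)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def swiglu(x, W1, W3, W2):
    gate = [swish(v) for v in matvec(x, W1)]  # Swish(x W1)
    up = matvec(x, W3)                        # x W3
    return matvec([g * u for g, u in zip(gate, up)], W2)


d_model = 4096
d_ff = swiglu_hidden_dim(d_model)
print(d_ff)
print("SwiGLU weights:  ", 3 * d_model * d_ff)
print("Standard weights:", 2 * d_model * 4 * d_model)
```

With three matrices at roughly $\frac{8}{3} d_{\text{model}}$, the weight count stays within about one percent of the standard two-matrix FFN at $4 d_{\text{model}}$.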

| FFN Variant | Activation | Weight Matrices | Used By |
|---|---|---|---|
| Original | ReLU | 2 ($W_1$, $W_2$) | Original Transformer |
| GELU FFN | GELU | 2 | BERT, GPT-2 |
| SwiGLU | Swish + gating | 3 ($W_1$, $W_2$, $W_3$) | LLaMA, PaLM, Mistral |
| GeGLU | GELU + gating | 3 | Some T5 variants |

Why FFN Matters

The FFN layers account for 2/3 of the parameters in a standard Transformer layer ($2 d_{\text{model}} d_{ff}$ for FFN vs $4 d_{\text{model}}^2$ for attention, and $d_{ff} = 4 d_{\text{model}}$, so FFN has $8 d_{\text{model}}^2$ vs attention's $4 d_{\text{model}}^2$).

Recent research suggests that FFN layers act as key-value memories (Geva et al., 2021): each hidden unit's incoming weight vector in the first layer (a column of $W_1$ in the $xW_1$ convention used here) acts as a "key" (a pattern to match), and the corresponding row of $W_2$ acts as the "value" (information to output). This is conceptually similar to attention but with the "keys" being fixed learned patterns rather than dynamic input-dependent patterns.

Part 6 - Efficiency Techniques

Flash Attention

Problem: Standard attention materializes the full $T \times T$ attention matrix, requiring $O(T^2)$ memory.

Solution: Flash Attention (Dao et al., 2022) computes attention in tiles that fit in GPU SRAM (on-chip memory), never materializing the full attention matrix.

Key ideas:

  1. Tiling: Split Q, K, V into blocks that fit in SRAM
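  2. Online softmax: Compute softmax incrementally, keeping a running maximum and running sum (the log-sum-exp trick) and rescaling earlier partial results as each new block arrives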
  3. Recomputation: During backward pass, recompute attention instead of storing it (trading compute for memory)
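The tiling and online-softmax ideas can be illustrated for a single query in pure Python. This is only a sketch of the arithmetic: it does not model SRAM, batching, or the backward pass, and it is not the actual Flash Attention kernel. Function names and the block size are illustrative.

```python
import math


def attention_reference(q, K, V):
    """Standard single-query attention: all scores are materialized at once."""
    scale = 1.0 / math.sqrt(len(q))
    s = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(s)
    e = [math.exp(v - m) for v in s]
    z = sum(e)
    return [sum(e[j] * V[j][c] for j in range(len(V))) / z for c in range(len(V[0]))]


def attention_tiled(q, K, V, block=2):
    """Same result, visiting K/V one block at a time with a running max and sum."""
    scale = 1.0 / math.sqrt(len(q))
    m, z = float("-inf"), 0.0  # running max and running softmax denominator
    acc = [0.0] * len(V[0])    # running unnormalized output
    for start in range(0, len(K), block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = [scale * sum(a * b for a, b in zip(q, k)) for k in Kb]
        m_new = max(m, max(s))
        rescale = math.exp(m - m_new)  # shrink earlier partial results to the new max
        z *= rescale
        acc = [a * rescale for a in acc]
        for sj, vj in zip(s, Vb):
            p = math.exp(sj - m_new)
            z += p
            acc = [a + p * v for a, v in zip(acc, vj)]
        m = m_new
    return [a / z for a in acc]


q = [0.5, -1.0]
K = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [-0.5, 0.5], [3.0, 3.0]]
V = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0], [4.0, 4.0, 4.0], [-1.0, 0.0, 1.0], [2.0, 0.0, 2.0]]
ref = attention_reference(q, K, V)
tiled = attention_tiled(q, K, V, block=2)
print(ref)
print(tiled)
```

The tiled version only ever holds one block of scores, yet matches the reference to floating-point precision, which is why the full score matrix never needs to be stored.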

Complexity comparison:

| Aspect | Standard Attention | Flash Attention |
|---|---|---|
| Memory (forward) | $O(T^2)$ | $O(T)$ |
| FLOPs | $O(T^2 d)$ | $O(T^2 d)$ (same) |
| Wall-clock time | Baseline | 2-4x faster |
| HBM reads/writes | $O(T^2 + Td)$ | $O(T^2 d^2 / M)$ where $M$ is SRAM size |

The speedup comes from reducing memory I/O, not reducing computation. Modern GPUs are memory-bandwidth-limited for attention, so Flash Attention's tiled approach - which keeps data in fast SRAM - achieves significant wall-clock speedups despite the same FLOP count.

KV Cache for Inference

During autoregressive generation, the model generates one token at a time. Without caching, generating token $t$ requires recomputing the keys, values, and attention for all $t$ previous tokens.

KV cache: Store the K and V projections from all previous tokens. When generating token $t$:

  1. Compute Q, K, V only for the new token $t$ (not the full sequence)
  2. Append the new K and V to the cache
  3. Compute attention between the new Q and the full K/V cache
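The three steps above can be sketched with a toy single-head, single-layer cache. `KVCache` is a hypothetical class written for illustration, not an API from any library, and the weights and inputs are arbitrary small values.

```python
import math


def matvec(x, W):
    """Row vector x (length n) times matrix W (n x m, given as a list of rows)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def attend(q, keys, values):
    scale = 1.0 / math.sqrt(len(q))
    s = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(s)
    e = [math.exp(v - m) for v in s]
    z = sum(e)
    return [sum(e[j] * values[j][c] for j in range(len(values))) / z
            for c in range(len(values[0]))]


class KVCache:
    """Toy single-head, single-layer cache (hypothetical, for illustration)."""

    def __init__(self, Wq, Wk, Wv):
        self.Wq, self.Wk, self.Wv = Wq, Wk, Wv
        self.keys, self.values = [], []

    def step(self, x_new):
        # 1. project only the new token
        q, k, v = (matvec(x_new, W) for W in (self.Wq, self.Wk, self.Wv))
        # 2. append its key and value to the cache
        self.keys.append(k)
        self.values.append(v)
        # 3. attend from the new query over everything cached so far
        return attend(q, self.keys, self.values)


def last_token_without_cache(xs, Wq, Wk, Wv):
    """Recompute K and V for the whole prefix, then attend from the last token."""
    keys = [matvec(x, Wk) for x in xs]
    values = [matvec(x, Wv) for x in xs]
    return attend(matvec(xs[-1], Wq), keys, values)


Wq = [[0.5, -0.2], [0.1, 0.3]]
Wk = [[0.4, 0.0], [-0.3, 0.6]]
Wv = [[1.0, 0.5], [0.0, -1.0]]
xs = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0]]

cache = KVCache(Wq, Wk, Wv)
for x in xs:
    cached_out = cache.step(x)
full_out = last_token_without_cache(xs, Wq, Wk, Wv)
print(cached_out, full_out)
```

Both paths give the same output for the newest token; the cached path simply avoids re-projecting the earlier tokens at every step.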

Without KV cache: Every step reruns the model over the whole prefix, so step $t$ costs $O(t \cdot d^2 + t^2 \cdot d)$ per layer (projections plus attention), and generating $T$ tokens costs $O(T^2 \cdot d^2 + T^3 \cdot d)$ per layer.

With KV cache: Step $t$ costs $O(d^2 + t \cdot d)$ per layer (project one token, then run one query against the cache). Generating $T$ tokens costs $O(T \cdot d^2 + T^2 \cdot d)$ per layer, roughly a factor of $T$ less work.

KV cache memory:

Per layer: $2 \times T \times d_{\text{model}} \times \text{bytes}$ (the factor of 2 covers K and V)

For a 70B model (80 layers, $d_{\text{model}} = 8192$, 8K context, float16) with full multi-head attention:

$$80 \times 2 \times 8192 \times 8192 \times 2 \text{ bytes} \approx 21.5 \text{ GB}$$

This is a significant memory cost per sequence. With long contexts and many concurrent sequences, the total KV cache can exceed the model weights.
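The sizing formula is easy to wrap in a helper. The sketch below writes $d_{\text{model}}$ as `n_kv_heads * d_head` so the same function also covers models that store fewer KV heads than query heads; the split into 64 heads of dimension 128 is an assumption chosen to give $d_{\text{model}} = 8192$.

```python
def kv_cache_bytes(n_layers, seq_len, n_kv_heads, d_head, bytes_per_value=2):
    """Bytes of K and V stored for one sequence across all layers."""
    return 2 * n_layers * seq_len * n_kv_heads * d_head * bytes_per_value


# Full multi-head attention: 64 KV heads x 128 dims = d_model of 8192 (assumed split)
mha = kv_cache_bytes(n_layers=80, seq_len=8192, n_kv_heads=64, d_head=128)
# Same model if only 8 KV heads were stored
fewer_kv_heads = kv_cache_bytes(n_layers=80, seq_len=8192, n_kv_heads=8, d_head=128)

print(f"64 KV heads: {mha / 1e9:.1f} GB")
print(f" 8 KV heads: {fewer_kv_heads / 1e9:.1f} GB")
```

The cache scales linearly in every argument, so halving the precision or the number of stored KV heads halves the memory.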

KV Cache - Reuse Precomputed Keys and Values for O(T) Inference

Sliding Window Attention

Instead of attending to all previous positions, each position attends only to the most recent $w$ positions:

$$\text{Attention}_i = \text{softmax}\left(\frac{q_i K_{[i-w:i]}^T}{\sqrt{d_k}}\right) V_{[i-w:i]}$$

Benefits:

  • Memory per layer: $O(T \cdot w)$ instead of $O(T^2)$
  • Inference memory: KV cache limited to $w$ entries per layer
  • Effective context can exceed $w$ through stacking: with $L$ layers and window size $w$, the receptive field is $L \times w$

Used by: Mistral (window size 4096), Longformer (local + global tokens).
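A short sketch of the window and the stacking argument. The indexing convention (token $i$ sees itself plus the $w - 1$ tokens before it) is an assumption, since implementations differ on whether the current token counts toward the window; the 32-layer configuration is illustrative.

```python
def window_indices(i, w):
    """Positions token i may attend to: itself and the w - 1 tokens before it."""
    return list(range(max(0, i - w + 1), i + 1))


def receptive_field(n_layers, w):
    """Rough upper bound on how far back information can propagate by stacking."""
    return n_layers * w


print(window_indices(10, 4))      # a full window
print(window_indices(2, 4))       # a truncated window near the start
print(receptive_field(32, 4096))  # illustrative 32-layer model
```

Each layer lets information hop back by at most one window, so depth, not window size alone, sets the effective context.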

Interviewer's Perspective

Modern LLM interviews increasingly focus on inference efficiency. If you can explain KV cache sizing, Flash Attention's tiling strategy, and the tradeoffs of sliding window attention, you demonstrate practical systems knowledge that separates you from candidates who only know the theory.

Grouped-Query Attention (GQA)

Standard multi-head attention: each head has its own Q, K, V projections.

GQA: Multiple query heads share the same K and V projections.

| Variant | Q Heads | KV Heads | KV Cache Size |
|---|---|---|---|
| Multi-Head Attention (MHA) | $h$ | $h$ | $2 \times h \times d_k \times T$ |
| Grouped-Query (GQA) | $h$ | $h/g$ | $2 \times (h/g) \times d_k \times T$ |
| Multi-Query (MQA) | $h$ | 1 | $2 \times d_k \times T$ |

For a model with 32 query heads and 8 KV heads (GQA with group size 4), the KV cache is 4x smaller than full MHA.

Used by: LLaMA 2 (GQA in the 34B and 70B models), Mistral (GQA), PaLM (MQA).

Part 7 - Modern LLM Architecture Recipe

The Standard 2024-2026 Recipe

Most modern LLMs follow a remarkably similar architecture:

| Component | Choice |
|---|---|
| Architecture | Decoder-only Transformer |
| Normalization | Pre-norm with RMSNorm |
| Positional encoding | RoPE |
| FFN | SwiGLU ($d_{ff} \approx \frac{8}{3} d_{\text{model}}$) |
| Attention | GQA with Flash Attention |
| Activation | SiLU (Swish) in FFN |
| Embedding | Tied input-output embeddings (common in smaller models; LLaMA keeps them untied) |
| Bias | No bias in linear layers (saves parameters) |
| Context extension | NTK-aware RoPE scaling or YaRN |

This recipe (pioneered by LLaMA) has become the default because each component has been individually validated through ablation studies.

Part 8 - Computing FLOPs

Per-Token Forward Pass FLOPs

For a Transformer with $L$ layers, $d_{\text{model}} = d$, $d_{ff} = 4d$, sequence length $T$ (the counts below are for the whole $T$-token sequence; divide by $T$ for the per-token cost):

Attention (per layer):

  • QKV projection: $3 \times 2Td^2 = 6Td^2$
  • Attention scores ($QK^T$): $2T^2d$
  • Attention output ($AV$): $2T^2d$
  • Output projection: $2Td^2$
  • Total: $8Td^2 + 4T^2d$

FFN (per layer):

  • Two linear layers: $2 \times 2T \times d \times 4d = 16Td^2$

Total per layer: $24Td^2 + 4T^2d$

Total model: $L(24Td^2 + 4T^2d) + 2TVd$ (the last term is the output projection to logits; the embedding lookup needs no matrix multiply)

For large $d$ and moderate $T$ this is $\approx 24LTd^2$. Each layer holds about $4d^2 + 8d^2 = 12d^2$ parameters, so the model has roughly $12Ld^2$ parameters, which gives the approximation:

$$\text{FLOPs} \approx 2 \times \text{Parameters} \times T$$

(Factor of 2 because each parameter is used in one multiply and one add.)
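The exact count and the $2 \times \text{Parameters} \times T$ shortcut can be compared directly. The sketch assumes $d_{ff} = 4d$ as above; the configuration in the usage lines (24 layers, $d = 1024$, $T = 2048$) is just an example, and the function names are illustrative.

```python
def forward_flops(n_layers, d, T, vocab=0):
    """Forward-pass FLOPs for T tokens, assuming d_ff = 4d and 2 FLOPs per multiply-add."""
    attention = 8 * T * d * d + 4 * T * T * d  # projections + score/value matmuls
    ffn = 16 * T * d * d
    logits = 2 * T * vocab * d                 # output projection (0 if vocab omitted)
    return n_layers * (attention + ffn) + logits


def approx_flops(n_params, T):
    return 2 * n_params * T


L, d, T = 24, 1024, 2048
layer_params = 12 * L * d * d  # attention (4d^2) + FFN (8d^2) per layer
exact = forward_flops(L, d, T)
approx = approx_flops(layer_params, T)
print(f"exact:  {exact:.3e}")
print(f"approx: {approx:.3e}")
print(f"quadratic attention share: {(exact - approx) / exact:.1%}")
```

The gap between the two numbers is exactly the $4LT^2d$ attention term, which the shortcut ignores and which grows with sequence length.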

60-Second Answer

"The FLOPs for a forward pass through a Transformer are approximately $2 \times \text{params} \times \text{sequence\_length}$. For a 70B parameter model processing 2048 tokens, that is about $2 \times 70 \times 10^9 \times 2048 \approx 2.9 \times 10^{14}$ FLOPs. Training costs approximately $6 \times \text{params} \times \text{tokens}$ FLOPs total because each training step involves a forward pass (2 FLOPs per parameter per token) and a backward pass that costs about twice the forward pass (4 more). For GPT-3 (175B params, 300B tokens), that is roughly $3.15 \times 10^{23}$ FLOPs."

Practice Problems

Problem 1: Architecture Walkthrough

You have a Transformer with $d_{\text{model}} = 1024$, $h = 16$, $d_{ff} = 4096$, $L = 24$ layers (decoder-only), $V = 50257$.

(a) Compute the total parameter count. (b) Compute the per-head dimension. (c) What is the KV cache size (in GB, float16) for a context length of 8192? (d) What is the approximate forward pass FLOP count for a 2048-token sequence?

Hint 1 -- Direction

Break down parameters by component: embedding, attention per layer, FFN per layer, LayerNorm per layer, output head. For KV cache: $2 \times L \times T \times d_{\text{model}} \times \text{bytes}$.

Hint 2 -- Insight

Per layer: attention = $4 d^2$ params, FFN = $2 \times d \times d_{ff}$ params (ignoring bias), LN = $2d$ params. Multiply by $L$ layers. Add embedding ($V \times d$) and final LN. If output weights are tied with embedding, do not double-count.

Hint 3 -- Full Solution + Rubric

(a) Parameter count:

| Component | Calculation | Parameters |
|---|---|---|
| Token embedding | 50257 x 1024 | 51.5M |
| Per-layer attention (QKV + O) | 4 x 1024^2 | 4.19M |
| Per-layer FFN (no bias) | 2 x 1024 x 4096 | 8.39M |
| Per-layer LayerNorm (x2) | 2 x 2 x 1024 | 4.1K |
| Total per layer | | 12.59M |
| All 24 layers | 24 x 12.59M | 302.1M |
| Final LayerNorm | 2 x 1024 | 2K |
| Output head (tied) | 0 | 0 |
| Total | | ~354M |

This is approximately GPT-2 Medium scale.

(b) Per-head dimension: $d_k = d_v = 1024 / 16 = 64$

(c) KV cache for T=8192:

$2 \times 24 \times 8192 \times 1024 \times 2$ bytes (K and V, 24 layers, 8192 tokens, $d_{\text{model}} = 1024$, 2 bytes per float16 value)

$= 805{,}306{,}368$ bytes $\approx 0.81$ GB (0.75 GiB)

(d) Forward FLOPs for T=2048:

Using the approximation: $2 \times 354\text{M} \times 2048 \approx 1.45 \times 10^{12}$ FLOPs (~1.45 TFLOP)

More precisely, for the 24 layers: $24 \times (24 \times 2048 \times 1024^2 + 4 \times 2048^2 \times 1024) = 24 \times (51.5 \times 10^9 + 17.2 \times 10^9) = 24 \times 68.7 \times 10^9 \approx 1.65 \times 10^{12}$ FLOPs. Adding the output logits ($2TVd \approx 0.21 \times 10^{12}$) gives about $1.86 \times 10^{12}$ FLOPs.

Scoring Rubric:

  • Strong Hire: Correct parameter count with component breakdown, correct KV cache calculation, reasonable FLOP estimate, mentions weight tying.
  • Lean Hire: Approximately correct total but misses some components or makes small errors in the breakdown.
  • No Hire: Cannot set up the calculation or is off by more than 2x.

Problem 2: Pre-Norm vs Post-Norm Training

Your team trained a 1B parameter Transformer with post-norm. Training diverged at step 5000 despite using Adam optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$, learning rate $3 \times 10^{-4}$.

(a) Diagnose the most likely cause. (b) Propose three fixes, ranked by likelihood of success. (c) If you switch to pre-norm, what changes would you expect in training dynamics?

Hint 1 -- Direction

Post-norm is known to be sensitive to learning rate and requires warmup. A 1B model with learning rate $3 \times 10^{-4}$ and no mentioned warmup is a classic recipe for divergence.

Hint 2 -- Insight

The most likely cause is missing or insufficient learning rate warmup. Post-norm has large gradient magnitudes in early training, and a high initial LR causes weight explosion. The three most effective fixes: (1) add LR warmup over 1-2K steps, (2) switch to pre-norm, (3) reduce initial LR. Switching to pre-norm is the most robust long-term fix.

Hint 3 -- Full Solution + Rubric

(a) Most likely cause: Missing learning rate warmup.

With post-norm, early training has large, unstable gradients because LayerNorm is on the residual path and the model weights are random. A learning rate of $3 \times 10^{-4}$ without warmup causes the gradients to amplify through the LayerNorm, leading to weight explosion around step 5000.

Additional contributing factors:

  • At 1B parameters, gradient norms are inherently larger (more parameters contributing to gradient)
  • $\beta_2 = 0.999$ takes ~1000 steps to stabilize the adaptive learning rate estimates, meaning the effective LR is poorly calibrated in early training

(b) Three fixes, ranked:

  1. Switch to pre-norm (highest confidence). This structurally eliminates the gradient instability from LayerNorm on the residual path. Pre-norm models are robust to a much wider range of learning rates and rarely diverge. Cost: architecture change, full retrain.

  2. Add learning rate warmup (quick fix). Warmup from $10^{-7}$ to $3 \times 10^{-4}$ over 2000 steps. This lets the Adam statistics stabilize before applying large updates. Cost: hyperparameter change, can resume from earlier checkpoint.

  3. Reduce learning rate to $10^{-4}$ with warmup. Even with warmup, $3 \times 10^{-4}$ may be too high for a 1B post-norm model. Many post-norm training recipes use lower peak LR than pre-norm equivalents.

(c) Expected changes with pre-norm:

  • Training becomes stable without warmup (though warmup still helps)
  • Can use higher learning rate ($3 \times 10^{-4}$ or even $5 \times 10^{-4}$)
  • Loss curve is smoother with fewer spikes
  • Final performance may be slightly different (some studies show post-norm converges to slightly better solutions when it successfully trains, but pre-norm is more reliable)
  • Gradient norms are more uniform across layers (at initialization, post-norm has larger gradients in the layers nearest the output)

Scoring Rubric:

  • Strong Hire: Correctly identifies warmup as the primary issue, recommends pre-norm as the structural fix, explains the gradient flow difference between pre-norm and post-norm, mentions Adam warmup interaction.
  • Lean Hire: Identifies the learning rate as too high but does not connect to the pre-norm/post-norm distinction.
  • No Hire: Suggests reducing model size or adding dropout as the primary fix without addressing the normalization architecture.

Problem 3: Positional Encoding Design

You are designing a Transformer for processing DNA sequences (alphabet: A, C, G, T). Sequences are 10,000-100,000 base pairs long. Which positional encoding would you choose and why?

Hint 1 -- Direction

DNA sequences are extremely long (10K-100K), and biological properties can depend on relative positions (e.g., distance between a promoter and a gene). Consider which encodings handle long sequences and capture relative position naturally.

Hint 2 -- Insight

Learned positional encodings are out - you would need 100K position embeddings and cannot extrapolate. Sinusoidal can define positions for any length but models still struggle at unseen lengths. RoPE with NTK-aware scaling is the best choice: it natively captures relative position (critical for biology), has no learned parameters, and can be extended to longer sequences than seen in training. ALiBi is also a strong contender because it naturally handles extrapolation.

Hint 3 -- Full Solution + Rubric

Recommendation: RoPE with NTK-aware scaling, combined with sliding window attention.

Why RoPE:

  1. Relative position matters in biology. The distance between two nucleotides determines their interaction strength. RoPE's attention scores naturally depend on relative position $n - m$.
  2. No learned parameters. With 100K positions, learned embeddings would add 100K x $d_{\text{model}}$ unnecessary parameters.
  3. Extrapolation. NTK-aware RoPE scaling allows training on shorter sequences (e.g., 16K) and inference on longer ones (e.g., 100K).

Why also sliding window attention:

  • Full attention over 100K tokens requires $(100\text{K})^2 = 10^{10}$ entries per head - prohibitively expensive
  • Biology often has local structure (codon-level, ~3bp) and medium-range structure (gene-level, ~1-10Kbp)
  • Sliding window of ~4K captures local structure; add global tokens every 1K positions for long-range
  • This reduces memory from $O(T^2)$ to $O(T \times w)$

Alternative: ALiBi

  • Also handles long sequences natively (attention decays linearly with distance)
  • The linear decay matches biological intuition: nearby nucleotides interact more strongly
  • Simpler to implement than RoPE scaling
  • Potential downside: the fixed linear decay pattern may not match all biological distance relationships

What NOT to use:

  • Learned positional embeddings: cannot extrapolate, wasteful parameters
  • Sinusoidal (without modification): while defined for all positions, models trained on 16K rarely generalize to 100K
  • No positional encoding: DNA sequence order is critical (ACGT and TGCA are very different)

Scoring Rubric:

  • Strong Hire: Chooses RoPE or ALiBi with clear justification for relative position and extrapolation, addresses the computational challenge of long sequences with efficient attention, connects to biological domain knowledge.
  • Lean Hire: Chooses an appropriate encoding but does not address the long-sequence computational challenge.
  • No Hire: Chooses learned positional embeddings for 100K positions or does not address extrapolation.

Problem 4: KV Cache Optimization

You are serving a 70B parameter LLM (80 layers, $d_{\text{model}} = 8192$, 64 query heads, 8 KV heads via GQA) on 8 x A100 GPUs (80GB each).

(a) What is the KV cache memory per user for a 32K context length? (b) How many concurrent users can you serve? (c) Propose two strategies to increase concurrent users by 4x.

Hint 1 -- Direction

KV cache per user = $2 \times \text{layers} \times T \times d_{\text{head}} \times \text{KV\_heads} \times \text{bytes}$. Note GQA: 8 KV heads, not 64. Total GPU memory minus model weights gives you the cache budget.

Hint 2 -- Insight

The 70B model in float16 uses ~140GB. Across 8 A100s (640GB total), that leaves ~500GB for KV cache and overhead. With GQA (8 KV heads instead of 64), the KV cache per user for 32K context should be significantly smaller than with full MHA. Strategies to increase users: quantize the KV cache to int8 (2x), use PagedAttention to reduce memory fragmentation (1.5-2x), or reduce the context window with sliding window attention.

Hint 3 -- Full Solution + Rubric

(a) KV cache per user:

Per layer: $2 \times 32768 \times 8 \times (8192/64) \times 2$ bytes

Breaking it down:

  • $d_{\text{head}} = 8192 / 64 = 128$
  • KV heads = 8 (GQA)
  • Per layer: $2 \times 32768 \times 8 \times 128 \times 2 = 134{,}217{,}728$ bytes = 128 MB
  • 80 layers: $80 \times 128$ MB = 10,240 MB ≈ 10.24 GB per user (strictly 10 GiB, since 128 MB here means $128 \times 2^{20}$ bytes)

Without GQA (64 KV heads), the cache would be $8 \times 10.24 = 81.9$ GB per user, which illustrates why GQA is critical for serving.
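The calculation above can be checked with a few lines of Python. This is a sketch under the problem's stated configuration; the function name is illustrative. It reports the result in GiB, where the GQA figure comes out to exactly 10 GiB (the 10.24 GB quoted above).

```python
def kv_cache_bytes(layers: int, seq_len: int, kv_heads: int, d_head: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes of K and V cached for one sequence (the leading 2 covers K and V)."""
    return 2 * layers * seq_len * kv_heads * d_head * bytes_per_value


D_MODEL, QUERY_HEADS = 8192, 64
d_head = D_MODEL // QUERY_HEADS  # 128

gqa = kv_cache_bytes(layers=80, seq_len=32_768, kv_heads=8, d_head=d_head)
mha = kv_cache_bytes(layers=80, seq_len=32_768, kv_heads=64, d_head=d_head)

GIB = 1024 ** 3
print(f"GQA (8 KV heads):  {gqa / GIB:.0f} GiB per user")
print(f"MHA (64 KV heads): {mha / GIB:.0f} GiB per user")
```

Note that `d_head` is derived from the number of query heads, not KV heads: GQA shares K and V across query heads but does not change the per-head dimension.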

(b) Concurrent users:

  • Model weights: 70B params x 2 bytes = 140 GB (spread across 8 GPUs)
  • Total GPU memory: 8 x 80 = 640 GB
  • Available for KV cache: ~640 - 140 - 50 (overhead) = 450 GB
  • Users: $450 / 10.24 \approx 43$ concurrent users
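The same budget arithmetic as a small sketch. The 50 GB overhead is the estimate used in this solution, not a measured value, and the function name is illustrative.

```python
def max_concurrent_users(total_gb: float, weights_gb: float,
                         overhead_gb: float, cache_per_user_gb: float) -> int:
    """Whole users that fit in the memory left after weights and overhead."""
    budget_gb = total_gb - weights_gb - overhead_gb
    return int(budget_gb // cache_per_user_gb)


users = max_concurrent_users(
    total_gb=8 * 80,          # 8 x A100 80GB
    weights_gb=70 * 2,        # 70B params x 2 bytes (float16)
    overhead_gb=50,           # assumed activation/runtime overhead
    cache_per_user_gb=10.24,  # per-user KV cache at 32K context with GQA
)
print(users)
```

Flooring matters here: a user whose cache does not fully fit cannot be admitted, so the estimate rounds down rather than to the nearest integer.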

(c) Strategies to increase to ~172 users (4x):

Strategy 1: Quantize KV cache to INT8 (2x improvement)

  • Store K and V values in 8-bit integers instead of float16
  • Reduces per-user cache from 10.24 GB to 5.12 GB
  • Quality impact: minimal (KV values have limited dynamic range)
  • New capacity: ~88 users
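A minimal sketch of the idea behind INT8 cache quantization, assuming symmetric per-tensor scaling on a plain Python list. Production systems typically quantize per channel, per token, or per group and operate on tensors; the function names and sample values here are hypothetical.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric quantization: map [-max|x|, max|x|] onto integer codes [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes: list[int], scale: float) -> list[float]:
    """Recover approximate float values from int8 codes and the stored scale."""
    return [c * scale for c in codes]


keys = [0.12, -0.5, 0.33, 1.27, -1.0]  # stand-in for one cached key vector
codes, scale = quantize_int8(keys)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(keys, restored))
print(codes)
print(f"scale = {scale:.4f}, max round-trip error = {max_error:.6f}")
```

Each value is stored in one byte instead of two, plus a small amount of scale metadata, which is where the roughly 2x saving comes from. The round-trip error is bounded by half the scale.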

Strategy 2: PagedAttention / vLLM (1.5-2x improvement)

  • Standard KV cache pre-allocates max context per user, wasting memory when actual context is shorter
  • PagedAttention allocates cache in pages (like OS virtual memory), only allocating what is actually used
  • If average context is 50% of max, this doubles effective capacity
  • Combined with INT8: ~130-175 users
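The capacity effect of paging can be illustrated with a toy block allocator. This is a simplified capacity model, not vLLM's API: the 16-token block size, the workload, and all function names are assumptions made for illustration.

```python
import math


def blocks_needed(context_len: int, block_size: int) -> int:
    """Pages required to hold a sequence's KV entries, rounding up."""
    return math.ceil(context_len / block_size)


def users_preallocated(total_blocks: int, max_context: int, block_size: int) -> int:
    """Every user reserves enough pages for the maximum context."""
    return total_blocks // blocks_needed(max_context, block_size)


def users_paged(total_blocks: int, context_lens: list[int], block_size: int) -> int:
    """Admit users in order, charging each only for the pages it actually uses."""
    admitted, free = 0, total_blocks
    for length in context_lens:
        need = blocks_needed(length, block_size)
        if need > free:
            break
        free -= need
        admitted += 1
    return admitted


BLOCK = 16                                  # tokens per page (hypothetical)
MAX_CONTEXT = 32_768
TOTAL_BLOCKS = 43 * (MAX_CONTEXT // BLOCK)  # budget sized for 43 full-length users
requests = [16_384] * 200                   # assumed workload: contexts at 50% of max

print(users_preallocated(TOTAL_BLOCKS, MAX_CONTEXT, BLOCK))
print(users_paged(TOTAL_BLOCKS, requests, BLOCK))
```

With contexts at half the maximum length, the paged allocator admits twice as many users from the same pool, matching the "doubles effective capacity" estimate above.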

Other strategies:

  • Sliding window attention: Limit KV cache to the last $w$ tokens per layer. With $w = 4096$ (instead of 32K), reduces cache by 8x.
  • KV cache eviction: Evict low-attention keys (as in H2O, the Heavy-Hitter Oracle). Keeps the top-$k$ most attended positions.
  • Model quantization: Quantize model weights to INT4 (from 140GB to 35GB), freeing ~105GB more for cache.
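As one way to picture the sliding-window option, the sketch below keeps a fixed-size cache per layer using `collections.deque`. The class name is hypothetical, and strings stand in for the key and value tensors a real server would store.

```python
from collections import deque


class SlidingWindowKVCache:
    """Keeps only the most recent `window` (key, value) pairs for one layer."""

    def __init__(self, window: int) -> None:
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value) -> None:
        # deque(maxlen=...) drops the oldest entry automatically when full.
        self.keys.append(key)
        self.values.append(value)

    def __len__(self) -> int:
        return len(self.keys)


cache = SlidingWindowKVCache(window=4096)
for position in range(32_768):
    cache.append(f"k{position}", f"v{position}")  # stand-ins for real tensors

print(len(cache), cache.keys[0], 32_768 // len(cache))
```

The cache size stays at $w$ entries regardless of how many tokens have been generated, which is the source of the 8x reduction at 32K context.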

Scoring Rubric:

  • Strong Hire: Correct KV cache calculation with GQA, reasonable concurrent user estimate, proposes INT8 cache quantization AND PagedAttention with quantified improvement estimates, mentions GQA benefit.
  • Lean Hire: Correct calculation but only suggests one optimization strategy.
  • No Hire: Incorrect KV cache calculation (forgets GQA or gets dimensions wrong) or suggests only "use a bigger GPU."

Interview Cheat Sheet

| Concept | Key Formula / Fact | One-Liner | Red Flag |
|---|---|---|---|
| Transformer blocks | Self-attn + FFN + residual + norm | Parallel processing of all positions | "Transformers use RNNs internally" |
| Position encoding | Sinusoidal / Learned / RoPE | Without it, Transformer is permutation-equivariant | "Position encoding is optional" |
| Pre-norm vs post-norm | Pre: $X + \text{Sub}(\text{LN}(X))$ | Pre-norm has clean gradient path | "They are equivalent" |
| FFN | $\text{ReLU}(xW_1)W_2$; modern: SwiGLU | 2/3 of layer parameters | "FFN is just a small detail" |
| BERT vs GPT | Bidirectional vs causal masking | Both use attention, different masks | "BERT is always better for understanding" |
| Flash Attention | Tiled attention, $O(T)$ memory | Same FLOPs, less memory, faster | "Flash Attention is approximate" |
| KV Cache | Store K,V for previous tokens | Avoids recomputation during generation | "KV cache stores attention weights" |
| GQA | Shared KV heads | Reduces KV cache by $h/g$ | "GQA reduces model quality significantly" |
| Sliding window | Attend to last $w$ positions | $O(Tw)$ instead of $O(T^2)$ | "Sliding window sees full context" |
| Parameter count | $\approx 12 L d^2$ for decoder-only | FFN dominates at scale | Off by more than 2x |
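The $12 L d^2$ rule of thumb and the "2/3 of layer parameters" figure can be reproduced with a short sketch. It assumes a standard FFN with hidden size $4d$ and ignores embeddings, biases, and norm parameters; the function name is illustrative. The example uses the 32-layer, 4096-dim configuration listed in the Day 14 checkpoint.

```python
def approx_decoder_params(num_layers: int, d_model: int) -> dict[str, int]:
    """Rule-of-thumb count: 4d^2 attention (Q, K, V, O) + 8d^2 FFN (d -> 4d -> d)."""
    attention = 4 * d_model * d_model
    ffn = 8 * d_model * d_model
    return {
        "attention": num_layers * attention,
        "ffn": num_layers * ffn,
        "total": num_layers * (attention + ffn),
    }


counts = approx_decoder_params(num_layers=32, d_model=4096)
print(f"total = {counts['total'] / 1e9:.2f}B")
print(f"FFN share = {counts['ffn'] / counts['total']:.2f}")
```

The estimate lands somewhat below a model's nominal size because embeddings are excluded and because gated FFNs such as SwiGLU use a different hidden size than $4d$.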

Spaced Repetition Checkpoints

Day 0 -- Initial Learning

  • Read this entire page
  • Draw the full Transformer architecture from memory (encoder + decoder)
  • Walk through the forward pass for a 4-token sequence, writing dimensions at each step
  • Complete the self-assessment

Day 3 -- First Recall

  • Without notes, explain pre-norm vs post-norm and why pre-norm is preferred
  • Give the "60-Second Answer" for the full Transformer, out loud, timed
  • Write the sinusoidal positional encoding formula and explain why sinusoids were chosen

Day 7 -- Connections

  • Compare BERT, GPT, and T5 architectures (attention masks, training objectives, use cases)
  • Do Practice Problem 1 (parameter count + KV cache) on paper without hints
  • Explain Flash Attention to an imaginary interviewer (tiling, online softmax, memory savings)

Day 14 -- Application

  • Do Practice Problem 4 (KV cache optimization) under timed conditions (12 minutes)
  • Compute the parameter count for LLaMA-7B from its config (32 layers, 4096 dim, 32 heads)
  • Explain the modern LLM recipe and justify each design choice

Day 21 -- Mock Interview

  • Have someone ask: "Draw the Transformer, walk through a forward pass with dimensions, then explain how you would optimize it for serving at scale"
  • Time yourself: architecture drawing in under 3 minutes, forward pass in under 5 minutes, optimization discussion in under 5 minutes
  • Do all 4 practice problems in sequence under timed conditions (45 minutes total)

Key Takeaways

  1. The Transformer is fundamentally a stack of two operations: attention and feed-forward. Everything else - residual connections, normalization, positional encoding - is scaffolding that makes these two operations trainable and effective. Understanding the architecture means understanding why each piece of scaffolding exists.

  2. Pre-norm with RMSNorm is the modern standard for a reason. Pre-norm provides a clean gradient path through the residual stream and reduces the need for careful warmup scheduling, while RMSNorm is computationally cheaper than full LayerNorm. If you design a new Transformer, use pre-norm.

  3. Decoder-only won because of simplicity and scaling. While encoder-decoder models can be more parameter-efficient for some tasks, the simplicity of a single decoder stack - one architecture for understanding and generation - made it the foundation of GPT-4, Claude, and the modern LLM paradigm.

  4. Efficiency techniques are not optional knowledge. Flash Attention, KV caching, GQA, and sliding window attention are the difference between a Transformer that works on paper and one that can actually serve millions of users. Interviewers at production-focused companies increasingly weight this knowledge heavily.

© 2026 EngineersOfAI. All rights reserved.