
Transformer Internals for LLMs - The Architecture That Powers Everything

Reading time: ~50 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Eng, LLM Eng, ML Infra Eng

The Real Interview Moment

You are in a Google DeepMind interview. The interviewer opens with: "Let us say we are designing a new 70B parameter decoder-only model with 80 layers, a hidden dimension of 8192, and 64 attention heads. Walk me through the architecture - what is GQA and why would we use it? How much KV cache memory do we need for a single sequence of 128K tokens in fp16? How many FLOPs does a forward pass take?"

You start writing on the whiteboard. She interrupts after your first equation: "Wait - you wrote the attention formula. But modern LLMs do not use vanilla attention. What positional encoding scheme are we using, and why did the field move away from learned absolute embeddings?"

This is a standard Tier 1 interview question. The interviewer is not testing whether you have heard of Transformers - everyone has. She is testing whether you understand the specific architectural choices that make modern LLMs work, and whether you can reason quantitatively about memory, compute, and tradeoffs.

Candidates who describe the original 2017 Transformer and stop there get a "no hire." Candidates who can walk through a modern decoder-only architecture with RoPE, GQA, SwiGLU, and RMSNorm - with exact dimension calculations - get a "strong hire."

What You Will Master

  • Draw a modern decoder-only Transformer block with all sub-components labeled
  • Explain why LLMs use decoder-only architecture instead of encoder-decoder
  • Derive causal attention masking and its role in autoregressive generation
  • Compare RoPE, ALiBi, and learned positional encodings with mathematical precision
  • Calculate KV cache memory requirements for any model configuration
  • Explain GQA, MQA, and MHA with memory and quality tradeoffs
  • Describe SwiGLU FFN and why it replaced standard ReLU FFN
  • Justify RMSNorm over LayerNorm and pre-norm over post-norm
  • Count parameters and estimate FLOPs for any Transformer configuration
  • Compare architectures of GPT-4, LLaMA 3, Gemini, Claude, and Mistral

Self-Assessment: Where Are You Now?

| Skill | 1 -- Cannot | 2 -- Vaguely | 3 -- Can Explain | 4 -- Can Derive | 5 -- Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Draw a modern decoder-only block | | | | | | ___ |
| Explain causal masking mathematically | | | | | | ___ |
| Derive RoPE positional encoding | | | | | | ___ |
| Compare GQA/MQA/MHA with numbers | | | | | | ___ |
| Calculate KV cache memory | | | | | | ___ |
| Explain SwiGLU activation | | | | | | ___ |
| RMSNorm vs LayerNorm difference | | | | | | ___ |
| Count parameters for a given config | | | | | | ___ |
| Estimate forward pass FLOPs | | | | | | ___ |
| Compare GPT-4 / LLaMA 3 / Mistral architectures | | | | | | ___ |

Target: All 4s and 5s before your interview.

Part 1 - The Modern Decoder-Only Architecture

Why Decoder-Only?

The original Transformer (Vaswani et al., 2017) had both encoder and decoder stacks. Modern LLMs overwhelmingly use decoder-only architecture. Here is why:

| Factor | Encoder-Decoder | Decoder-Only |
|---|---|---|
| Task generality | Best for seq2seq (translation) | Handles all tasks via prompting |
| Training simplicity | Two separate stacks to train | Single uniform stack |
| Scaling behavior | Parameters split across stacks | All parameters serve generation |
| In-context learning | Limited by encoder bottleneck | Naturally handles long contexts |
| KV cache | Separate caches for encoder/decoder | Single unified cache |

Interviewer's Perspective

"Why decoder-only?" is a warm-up question. The real test is whether you understand that decoder-only models can still do "encoding" - the prefix tokens in a prompt serve as a pseudo-encoder. The key architectural difference is causal masking, not the absence of encoding capability.

The Modern LLM Block

A single Transformer block in a modern LLM (circa LLaMA 3, 2024) looks like this:

[Figure: Transformer Block Architecture]

Key differences from the original Transformer:

| Component | Original (2017) | Modern LLM (2024+) |
|---|---|---|
| Normalization | Post-norm LayerNorm | Pre-norm RMSNorm |
| Positional encoding | Sinusoidal (additive) | RoPE (multiplicative, relative) |
| Attention | Multi-Head Attention | Grouped Query Attention |
| Feed-forward | ReLU with 4x expansion | SwiGLU with 8/3x expansion |
| Architecture | Encoder-Decoder | Decoder-only |

Causal Masking

In a decoder-only model, each token can only attend to itself and all preceding tokens. This is enforced by a causal mask (also called the "look-ahead mask"):

$$\text{Mask}_{ij} = \begin{cases} 0 & \text{if } i \geq j \\ -\infty & \text{if } i < j \end{cases}$$

The masked attention computation becomes:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + \text{Mask}\right) V$$

The $-\infty$ entries become 0 after softmax, effectively preventing information flow from future tokens.

Common Trap

Candidates often say "the mask is a triangular matrix of 0s and 1s." This is conceptually correct but technically wrong - the mask uses $-\infty$ (or a very large negative number like $-10^9$), which is added to the attention logits before softmax. A mask of 0s and 1s would be multiplied, which is a different (less numerically stable) approach. Interviewers notice this distinction.

Why causal masking enables training parallelism: During training, all positions in a sequence are computed simultaneously. The causal mask ensures position $i$ does not "see" positions $i+1, i+2, \ldots$ even though they exist in the batch. This is why Transformer training is much faster than RNN training - the entire sequence is processed in one forward pass.
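The additive mask can be checked numerically. The following is a minimal NumPy sketch of single-head causal attention, not a production kernel; the function name `causal_attention` and the toy shapes are assumptions made for illustration.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head causal attention over a (T, d_k) sequence."""
    T, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)                  # (T, T) attention logits
    # Additive mask: 0 where i >= j, -inf where i < j (future positions).
    mask = np.triu(np.full((T, T), -np.inf), k=1)
    scores = scores + mask
    # Numerically stable softmax along the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((5, 8)) for _ in range(3))
out, weights = causal_attention(q, k, v)
```

Every row of `weights` sums to 1 and has zeros above the diagonal, so position 0 can only copy `v[0]`. All five positions are still computed in one matrix product, which is the training parallelism described above.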

Part 2 - Positional Encodings: From Sinusoidal to RoPE

The Position Problem

Self-attention is permutation-equivariant - without positional information, it treats "the cat sat on the mat" and "mat the on sat cat the" identically. We need to inject position information.

Sinusoidal (Original, 2017)

The original Transformer used fixed sinusoidal embeddings added to token embeddings:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

Limitations for LLMs:

  • Additive: position info gets diluted through deep networks
  • Absolute: does not generalize well to sequences longer than training length
  • Not relative: attention scores do not directly encode relative distance

Learned Absolute Embeddings (GPT-2 era)

Replace fixed sinusoidal with a learnable embedding table of shape (max_seq_len, d).

Limitations:

  • Hard ceiling on context length - cannot extrapolate beyond training length
  • Still absolute, not relative

RoPE - Rotary Position Embedding (Su et al., 2021)

RoPE is used by LLaMA, Mistral, Qwen, and most modern open-source LLMs. It encodes position by rotating the query and key vectors in 2D subspaces.

Core idea: Group the dimensions of $q$ and $k$ into pairs $(q_{2i}, q_{2i+1})$ and rotate each pair by an angle proportional to the position:

$$\text{RoPE}(x_m, m) = \begin{pmatrix} x_{m,0} \\ x_{m,1} \\ x_{m,2} \\ x_{m,3} \\ \vdots \end{pmatrix} \otimes \begin{pmatrix} \cos(m\theta_0) \\ \cos(m\theta_0) \\ \cos(m\theta_1) \\ \cos(m\theta_1) \\ \vdots \end{pmatrix} + \begin{pmatrix} -x_{m,1} \\ x_{m,0} \\ -x_{m,3} \\ x_{m,2} \\ \vdots \end{pmatrix} \otimes \begin{pmatrix} \sin(m\theta_0) \\ \sin(m\theta_0) \\ \sin(m\theta_1) \\ \sin(m\theta_1) \\ \vdots \end{pmatrix}$$

where $\theta_i = 10000^{-2i/d}$ and $m$ is the position index.

Why RoPE is elegant: The dot product $q_m \cdot k_n$ after RoPE application depends only on the relative position $m - n$, not on $m$ and $n$ individually. This is because:

$$\text{RoPE}(q_m) \cdot \text{RoPE}(k_n) = g(q, k, m - n)$$

The rotation of $q$ at position $m$ and $k$ at position $n$ produces an inner product that is a function of the relative distance - exactly the property we want for attention.
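The relative-position property is easy to verify numerically. Below is a small NumPy sketch of the pairwise rotation for a single vector; the function name `rope` and the toy dimension are assumptions, and real implementations usually precompute the cos/sin tables and operate on batched tensors.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[2i], x[2i+1]) of a 1-D vector by pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)    # theta_i = base^(-2i/d)
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin       # first component of each rotated pair
    out[1::2] = x_even * sin + x_odd * cos       # second component of each rotated pair
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
score_near = rope(q, 5) @ rope(k, 2)         # positions (5, 2), relative distance 3
score_far = rope(q, 105) @ rope(k, 102)      # positions (105, 102), same distance 3
```

Both scores agree up to floating-point error because only $m - n$ enters the inner product. Rotations also preserve the vector norm, so RoPE changes direction without rescaling $q$ or $k$.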

RoPE advantages for LLMs:

  • Relative: attention inherently captures relative position
  • Multiplicative: position information is deeply intertwined with content, not easily washed away
  • Extensible: with frequency scaling (e.g., YaRN, NTK-aware scaling), RoPE can extrapolate to longer contexts than seen during training
  • No extra parameters: the rotation angles are deterministic, not learned
60-Second Answer

"RoPE encodes position by rotating query and key vectors in 2D subspaces. Each pair of dimensions gets rotated by an angle proportional to the position, with different frequencies for different dimension pairs. The key property is that the dot product between a rotated query at position $m$ and a rotated key at position $n$ depends only on the relative distance $m - n$. This gives us relative positional encoding without any additional parameters, and it can be extended to longer contexts through frequency scaling techniques like YaRN."

ALiBi - Attention with Linear Biases

ALiBi (Press et al., 2022) takes a different approach: it adds a linear bias to attention scores based on distance:

$$\text{Attention}_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}} - m \cdot |i - j|$$

where $m$ is a head-specific slope. Heads with steeper slopes focus on local context; gentler slopes attend globally.
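As a rough illustration, the bias term can be built as one matrix per head. This sketch assumes the geometric slope schedule $2^{-8h/H}$ described by Press et al. for power-of-two head counts; the function name `alibi_bias` is hypothetical, and the causal mask would still be applied separately.

```python
import numpy as np

def alibi_bias(n_heads, T):
    """Return the (n_heads, T, T) bias -m * |i - j| added to attention logits."""
    # Assumed slope schedule: 2^(-8/n), 2^(-16/n), ..., 2^(-8) for n heads.
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    distance = np.abs(i - j)                         # |i - j|
    return -slopes[:, None, None] * distance

bias = alibi_bias(n_heads=8, T=4)
```

With 8 heads the first head has slope 0.5, so a key three positions back is penalized by 1.5 logits, while the last head (slope $2^{-8}$) barely penalizes distance at all.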

ALiBi vs RoPE tradeoff:

  • ALiBi: simpler, better length extrapolation, but attention scores are biased (penalizes long-range attention)
  • RoPE: more expressive, preserves attention magnitude, but needs explicit scaling for length extension

Most 2024-2026 models use RoPE. ALiBi saw adoption in BLOOM and some MPT models but fell out of favor.

Comparison Table

| Method | Type | Relative | Extra Params | Length Extension | Used By |
|---|---|---|---|---|---|
| Sinusoidal | Additive, absolute | No | None | Poor | Original Transformer |
| Learned | Additive, absolute | No | $T \times d$ | None (hard limit) | GPT-2 |
| RoPE | Multiplicative, relative | Yes | None | Good (with scaling) | LLaMA, Mistral, Qwen |
| ALiBi | Additive bias, relative | Yes | None | Good (native) | BLOOM, MPT |

Part 3 - Grouped Query Attention (GQA)

The KV Cache Problem

During autoregressive generation, each new token needs to attend to all previous tokens. Recomputing all keys and values at every step is wasteful, so we cache them - the KV cache.

KV cache size per token per layer:

$$\text{KV per token per layer} = 2 \times d_{\text{head}} \times n_{\text{kv\_heads}} \times \text{bytes\_per\_param}$$

For a model with $d_{\text{model}} = 8192$, 64 heads (so $d_{\text{head}} = 128$), 80 layers, at fp16 (2 bytes):

$$\text{KV per token (all layers)} = 2 \times 128 \times 64 \times 80 \times 2 = 2{,}621{,}440 \text{ bytes} \approx 2.5 \text{ MB/token}$$

For 128K context length: $2.5 \text{ MB} \times 128{,}000 \approx 320$ GB. Just for the KV cache of one sequence!

This is the motivation for reducing the number of KV heads.

MHA, MQA, and GQA

[Figure: MHA vs MQA vs GQA Comparison]

| Variant | Q Heads | KV Heads | KV Cache Ratio | Quality Impact |
|---|---|---|---|---|
| MHA (original) | $H$ | $H$ | 1x (baseline) | Best quality |
| MQA (Shazeer, 2019) | $H$ | 1 | $1/H$ | Noticeable degradation |
| GQA (Ainslie et al., 2023) | $H$ | $H/G$ | $1/G$ | Minimal degradation |

GQA partitions the query heads into groups, and each group shares a single KV head. Here $G$ is the group size (query heads per KV head): with $H = 64$ query heads and $G = 8$, you get $64/8 = 8$ KV heads. The KV cache is now $8/64 = 1/8$ of MHA.

Revised KV cache for our 70B example with GQA (8 KV heads instead of 64):

$$\text{KV per token (all layers)} = 2 \times 128 \times 8 \times 80 \times 2 = 327{,}680 \text{ bytes} \approx 0.31 \text{ MB/token}$$

For 128K context: $0.31 \text{ MB} \times 128{,}000 \approx 40$ GB. Down from 320 GB - an 8x reduction.
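The head sharing can be sketched in a few lines of NumPy. This is an illustrative implementation under simplifying assumptions (no batch dimension, no projections, consecutive query heads form a group); the function name `gqa_attention` is hypothetical.

```python
import numpy as np

def gqa_attention(q, k, v):
    """q: (H, T, d_h); k, v: (H_kv, T, d_h), with H divisible by H_kv."""
    H, T, d_h = q.shape
    H_kv = k.shape[0]
    group_size = H // H_kv
    # Each cached KV head is reused by `group_size` consecutive query heads.
    k = np.repeat(k, group_size, axis=0)             # (H, T, d_h)
    v = np.repeat(v, group_size, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_h)
    scores = scores + np.triu(np.full((T, T), -np.inf), k=1)   # causal mask
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v                                # (H, T, d_h)

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))                   # 8 query heads
k = rng.standard_normal((2, 4, 16))                   # only 2 KV heads are stored
v = rng.standard_normal((2, 4, 16))
out = gqa_attention(q, k, v)
```

Only the 2 KV heads need to live in the cache; the `np.repeat` expansion happens at compute time, which is where the memory saving comes from.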

60-Second Answer

"GQA groups multiple query heads to share a single set of key-value heads. If you have 64 query heads and 8 KV head groups, each group of 8 query heads shares one KV head. This reduces KV cache memory by 8x with minimal quality loss. LLaMA 3 70B uses GQA with 8 KV heads for its 64 query heads. The key insight is that nearby query heads learn similar attention patterns, so sharing KV projections is a reasonable approximation."

Common Trap

Do not confuse the GQA group size with the KV head count. With 64 query heads and 8 KV heads, each group contains $64/8 = 8$ query heads; the two numbers coincide here only because $8 \times 8 = 64$. Some papers specify the number of groups (which equals the number of KV heads), others the number of query heads per group. Clarify which convention is being used.

KV Cache Memory Formula (Memorize This)

$$\text{KV Cache (bytes)} = 2 \times n_{\text{layers}} \times n_{\text{kv\_heads}} \times d_{\text{head}} \times \text{seq\_len} \times \text{bytes\_per\_param}$$

The factor of 2 is for K and V separately. For batch size $B$, multiply by $B$.
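The formula translates directly into a small helper. This is a sketch with a hypothetical function name, `kv_cache_bytes`; it takes "128K" to mean $128 \times 1024$ tokens and reports GiB ($2^{30}$ bytes).

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_param=2, batch=1):
    """KV cache size in bytes; the leading 2 counts K and V separately."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_param * batch

# The 70B-style example from this section: 80 layers, d_head = 128, fp16.
per_token_mha = kv_cache_bytes(80, 64, 128, seq_len=1)    # 64 KV heads (MHA)
per_token_gqa = kv_cache_bytes(80, 8, 128, seq_len=1)     # 8 KV heads (GQA)
full_gqa_gib = kv_cache_bytes(80, 8, 128, seq_len=128 * 1024) / 2**30
```

The per-token values are 2,621,440 and 327,680 bytes, and the 128K-token GQA cache comes to 40 GiB, in line with the hand calculation.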

Quick reference for popular models:

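| Model | Layers | KV Heads | $d_{\text{head}}$ | KV Cache per Token (fp16) | 128K Context |
|---|---|---|---|---|---|
| LLaMA 3 8B | 32 | 8 | 128 | 128 KB | 16 GB |
| LLaMA 3 70B | 80 | 8 | 128 | 320 KB | 40 GB |
| Mistral 7B | 32 | 8 | 128 | 128 KB | 16 GB |
| GPT-3 175B (MHA) | 96 | 96 | 128 | 4.7 MB | 600 GB |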

Part 4 - SwiGLU Feed-Forward Network

The Standard FFN (Original Transformer)

The original feed-forward network:

$$\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2$$

with $W_1 \in \mathbb{R}^{d \times 4d}$ and $W_2 \in \mathbb{R}^{4d \times d}$. The intermediate dimension is $4 \times d_{\text{model}}$.

SwiGLU (Shazeer, 2020)

Modern LLMs replace ReLU FFN with SwiGLU:

$$\text{SwiGLU}(x) = (\text{Swish}(xW_{\text{gate}}) \odot xW_{\text{up}}) W_{\text{down}}$$

where $\text{Swish}(x) = x \cdot \sigma(x)$ (also called SiLU) and $\odot$ is element-wise multiplication.

Three weight matrices instead of two:

  • $W_{\text{gate}} \in \mathbb{R}^{d \times d_{\text{ff}}}$ - produces gating signal
  • $W_{\text{up}} \in \mathbb{R}^{d \times d_{\text{ff}}}$ - produces value signal
  • $W_{\text{down}} \in \mathbb{R}^{d_{\text{ff}} \times d}$ - projects back down

To keep the total parameter count comparable to the standard $4d$ FFN (which has $2 \times d \times 4d = 8d^2$ parameters), the intermediate dimension is set to $d_{\text{ff}} = \frac{8d}{3}$ (often rounded to a multiple of 256):

$$\text{SwiGLU params} = 3 \times d \times \frac{8d}{3} = 8d^2$$

Same parameter budget as standard FFN, but consistently better performance across benchmarks.

[Figure: SwiGLU FFN Architecture]

Why SwiGLU works better: The gating mechanism allows the network to selectively pass information. The Swish activation is smooth (unlike ReLU which has a hard zero), which helps with gradient flow. Empirically, SwiGLU consistently outperforms ReLU and GELU FFNs at the same parameter count (Shazeer, 2020).
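A minimal NumPy sketch of the gated FFN follows, together with the $8d/3$ sizing rule. The helper names are assumptions, and rounding up to the next multiple of 256 is one common convention rather than a universal rule.

```python
import math
import numpy as np

def silu(x):
    """Swish / SiLU: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """(Swish(x W_gate) * (x W_up)) W_down, with an element-wise gate."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def swiglu_d_ff(d, multiple_of=256):
    """8d/3 rounded up to a multiple of `multiple_of` (assumed convention)."""
    return multiple_of * math.ceil(8 * d / 3 / multiple_of)

rng = np.random.default_rng(0)
d, d_ff = 8, 16                                     # toy sizes for illustration
x = rng.standard_normal((4, d))
w_gate, w_up = rng.standard_normal((d, d_ff)), rng.standard_normal((d, d_ff))
w_down = rng.standard_normal((d_ff, d))
y = swiglu_ffn(x, w_gate, w_up, w_down)
```

For $d = 4096$ the sizing helper gives 11008, and for $d = 5120$ it gives 13824. Wherever the gate branch outputs zero, the corresponding hidden unit contributes nothing, which is the selective-passing behavior described above.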

Interviewer's Perspective

If asked "Why SwiGLU instead of ReLU?" do NOT just say "it works better." Explain: (1) the gating mechanism provides multiplicative interaction that is more expressive, (2) Swish is smooth so gradients flow better at zero, (3) the three-matrix formulation with $8d/3$ intermediate dimension matches the parameter budget of two-matrix $4d$ ReLU FFN. Then mention that this was validated empirically in PaLM and LLaMA.

Part 5 - RMSNorm and Pre-Norm Architecture

LayerNorm vs RMSNorm

LayerNorm (Ba et al., 2016) normalizes across the feature dimension:

$$\text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta$$

where $\mu$ and $\sigma^2$ are the mean and variance of $x$ across features.

RMSNorm (Zhang and Sennrich, 2019) drops the mean centering:

$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \cdot \gamma$$

Why RMSNorm for LLMs:

| Factor | LayerNorm | RMSNorm |
|---|---|---|
| Operations | Mean, variance, normalize, scale, shift | RMS, normalize, scale |
| Parameters | $\gamma$ (scale) + $\beta$ (shift) | $\gamma$ (scale) only |
| Compute | Slightly more (mean subtraction) | ~15% faster |
| Quality | Baseline | Equivalent or better |

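The mean subtraction in LayerNorm turns out to be largely unnecessary: the RMSNorm paper argues that the benefit comes from re-scaling invariance, not re-centering, so both the mean subtraction and the shift $\beta$ can be dropped. Dropping them simplifies the operation and is slightly faster, which matters when you have 80+ layers and billions of tokens.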
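The two formulas differ only in the centering and the shift, which a short NumPy sketch makes concrete. The function names and the `eps` default are assumptions for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-6):
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms * gamma                            # no mean subtraction, no shift

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 16))
x_centered = x - x.mean(axis=-1, keepdims=True)
```

On zero-mean inputs with $\gamma = 1$ and $\beta = 0$ the two functions return the same values, since the variance then equals the mean square. On general inputs they differ only by the centering step.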

Pre-Norm vs Post-Norm

Post-norm (original Transformer):

x → Attention → Add(x, ·) → LayerNorm → FFN → Add(·, ·) → LayerNorm

Pre-norm (modern LLMs):

x → RMSNorm → Attention → Add(x, ·) → RMSNorm → FFN → Add(·, ·)

[Figure: Pre-Norm vs Post-Norm]

Why pre-norm is critical for deep models:

In pre-norm, the residual stream flows unimpeded from input to output. Each layer's contribution is normalized before being added, preventing the residual magnitudes from growing unboundedly. This means:

  1. Stable gradients: The gradient flows directly through the residual connections without passing through normalization layers
  2. Less reliance on warmup: Post-norm training is often unstable without careful learning rate warmup; pre-norm is far less sensitive to it, although warmup is still commonly used in practice
  3. Better scaling: Models with 100+ layers train reliably with pre-norm
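The difference in the residual path can be seen in a toy sketch. Here `attn` and `ffn` are placeholder callables (an assumption for illustration), and RMSNorm without a learned scale stands in for both normalization variants.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def pre_norm_block(x, attn, ffn):
    x = x + attn(rms_norm(x))      # the residual stream itself is never normalized
    x = x + ffn(rms_norm(x))
    return x

def post_norm_block(x, attn, ffn):
    x = rms_norm(x + attn(x))      # normalization sits on the residual path
    x = rms_norm(x + ffn(x))
    return x

zero = lambda h: np.zeros_like(h)  # sub-layers that contribute nothing
x = np.array([[3.0, -4.0, 12.0, 0.5]])
pre_out = pre_norm_block(x, zero, zero)
post_out = post_norm_block(x, zero, zero)
```

With sub-layers that output zero, the pre-norm block is an exact identity while the post-norm block still rescales its input. That identity path is what lets gradients reach early layers without passing through a normalization at every block.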
Instant Rejection

If you draw the Transformer with post-norm and claim that is what GPT-3 or LLaMA uses, the interviewer will question your practical knowledge. Nearly every major LLM since GPT-2 uses pre-norm. This is a basic factual check.

Part 6 - Parameter Counting

The Complete Parameter Count

For a decoder-only Transformer with:

  • $L$ layers, hidden dimension $d$, vocabulary size $V$
  • $H$ query heads, $H_{\text{kv}}$ KV heads, head dimension $d_h = d/H$
  • SwiGLU FFN with intermediate dimension $d_{\text{ff}}$

Per-layer parameters:

| Component | Parameters | Formula / Notes |
|---|---|---|
| Q projection | $d \times d$ | $d \times (H \times d_h)$ |
| K projection | $d \times H_{\text{kv}} \times d_h$ | Reduced for GQA |
| V projection | $d \times H_{\text{kv}} \times d_h$ | Same as K |
| Output projection | $d \times d$ | |
| RMSNorm (attention) | $d$ | Scale parameter only |
| Gate projection (SwiGLU) | $d \times d_{\text{ff}}$ | |
| Up projection (SwiGLU) | $d \times d_{\text{ff}}$ | |
| Down projection (SwiGLU) | $d_{\text{ff}} \times d$ | |
| RMSNorm (FFN) | $d$ | |

Total attention params per layer:

$$P_{\text{attn}} = d^2 + 2 \times d \times H_{\text{kv}} \times d_h + d^2 = 2d^2 + 2 d \cdot H_{\text{kv}} \cdot d_h$$

For MHA ($H_{\text{kv}} = H$): $P_{\text{attn}} = 4d^2$

For GQA ($H_{\text{kv}} = H/G$): $P_{\text{attn}} = 2d^2 + 2d^2/G$

Total FFN params per layer (SwiGLU):

$$P_{\text{ffn}} = 3 \times d \times d_{\text{ff}}$$

With $d_{\text{ff}} = \frac{8d}{3}$: $P_{\text{ffn}} = 8d^2$

Total model parameters:

$$P_{\text{total}} = V \times d + L \times (P_{\text{attn}} + P_{\text{ffn}} + 2d) + d + V \times d$$

where the first $V \times d$ is the token embedding, the last $V \times d$ is the output projection (often tied to the embedding), and the $+d$ at the end is the final RMSNorm.

Worked Example: LLaMA 3 8B

| Hyperparameter | Value |
|---|---|
| $L$ (layers) | 32 |
| $d$ (hidden dim) | 4096 |
| $H$ (query heads) | 32 |
| $H_{\text{kv}}$ (KV heads) | 8 |
| $d_h$ (head dim) | 128 |
| $d_{\text{ff}}$ (FFN intermediate) | 14336 |
| $V$ (vocab size) | 128256 |

Attention per layer:

Term by term: $Q = d \times d = 4096 \times 4096 = 16{,}777{,}216$, $K = d \times H_{\text{kv}} \times d_h = 4096 \times 8 \times 128 = 4{,}194{,}304$, $V$ is the same size as $K$ ($4{,}194{,}304$), and $O = d \times d = 16{,}777{,}216$.

$$P_{\text{attn}} = 16{,}777{,}216 + 4{,}194{,}304 + 4{,}194{,}304 + 16{,}777{,}216 = 41{,}943{,}040 \approx 42\text{M}$$

FFN per layer:

$$P_{\text{ffn}} = 3 \times 4096 \times 14336 = 176{,}160{,}768 \approx 176\text{M}$$

Per-layer total: $42\text{M} + 176\text{M} + 2 \times 4096 \approx 218\text{M}$

All layers: $32 \times 218\text{M} \approx 6{,}976\text{M} \approx 7.0\text{B}$

Embedding: $128{,}256 \times 4{,}096 \approx 525\text{M}$. LLaMA 3 8B does not tie its output head to the embedding, so this matrix is counted twice ($\approx 1.05\text{B}$).

Total: $7.0\text{B} + 1.05\text{B} \approx 8.0\text{B}$, which matches the reported ~8.0B. With a tied output head the same calculation would give $7.0\text{B} + 0.53\text{B} \approx 7.5\text{B}$, so always state whether the embeddings are shared.
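The worked example can be reproduced with a short script. This is a sketch of the counting formula above with a hypothetical function name, `count_params`; the `tie_embeddings` flag is an added assumption that toggles whether the output head is counted separately from the embedding.

```python
def count_params(L, d, H, H_kv, d_ff, V, tie_embeddings=False):
    """Parameter count for a decoder-only model with GQA, SwiGLU and RMSNorm."""
    d_h = d // H
    attn = 2 * d * d + 2 * d * H_kv * d_h      # Q and O, plus the reduced K and V
    ffn = 3 * d * d_ff                         # gate, up, down projections
    per_layer = attn + ffn + 2 * d             # plus two RMSNorm scale vectors
    embed = V * d
    head = 0 if tie_embeddings else V * d      # separate output projection if untied
    return embed + L * per_layer + d + head    # + d for the final RMSNorm

llama3_8b = dict(L=32, d=4096, H=32, H_kv=8, d_ff=14336, V=128256)
untied = count_params(**llama3_8b)
tied = count_params(**llama3_8b, tie_embeddings=True)
```

Here `untied` is 8,030,261,248 (about 8.03B) and `tied` is 7,504,924,672 (about 7.5B), so the gap between the two totals is exactly one $V \times d$ matrix.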

Interviewer's Perspective

Interviewers do not expect exact numbers. They want to see that you (1) know the formula, (2) can set up the calculation correctly, and (3) get within 10-20% of the right answer. Being methodical matters more than being exact.

Part 7 - FLOP Estimation

The 6ND Rule

For a Transformer with $N$ parameters processing a sequence of $T$ tokens:

$$\text{FLOPs (forward pass)} \approx 2NT$$

$$\text{FLOPs (forward + backward)} \approx 6NT$$

The factor of 2 for forward comes from the fact that each parameter participates in one multiply-add operation (2 FLOPs per parameter per token). The backward pass costs roughly 2x the forward pass (one for activation gradients, one for weight gradients).

More Precise Breakdown

For a single layer processing batch $B$ with sequence length $T$:

Attention:

  • QKV projections: $6BTd^2$ FLOPs (for MHA; less for GQA)
  • Attention scores ($QK^T$): $2BH \cdot T^2 \cdot d_h$ FLOPs
  • Attention weighted values ($\text{scores} \cdot V$): $2BH \cdot T^2 \cdot d_h$ FLOPs
  • Output projection: $2BTd^2$ FLOPs

FFN (SwiGLU):

  • Gate, up, down projections: $3 \times 2BTd \cdot d_{\text{ff}}$ FLOPs

Key insight: The attention score computation scales as $O(T^2)$ while everything else scales as $O(T)$. For long contexts, attention becomes the bottleneck:

| Sequence Length | Attention FLOPs Fraction | FFN FLOPs Fraction |
|---|---|---|
| 2K | ~5% | ~95% |
| 32K | ~40% | ~60% |
| 128K | ~75% | ~25% |

This is why Flash Attention and KV cache optimization matter so much for long-context models.
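The per-layer breakdown can be turned into a quick estimator. The sketch below uses hypothetical function names and an assumed 8B-style configuration ($d = 4096$, 32 heads, $d_{\text{ff}} = 14336$). It counts the full $T^2$ score matrix with MHA-sized projections, so its fractions are rough and will not match the table exactly.

```python
def layer_flops(B, T, d, H, d_ff):
    """Return (quadratic, linear) forward FLOPs for one layer."""
    d_h = d // H
    qkv = 6 * B * T * d * d                  # Q, K, V projections (MHA-sized)
    scores = 2 * B * H * T * T * d_h         # Q K^T
    weighted = 2 * B * H * T * T * d_h       # scores times V
    out_proj = 2 * B * T * d * d
    ffn = 3 * 2 * B * T * d * d_ff           # gate, up, down projections
    return scores + weighted, qkv + out_proj + ffn

def quadratic_fraction(T, d=4096, H=32, d_ff=14336):
    quadratic, linear = layer_flops(1, T, d, H, d_ff)
    return quadratic / (quadratic + linear)

fractions = {T: quadratic_fraction(T) for T in (2048, 32768, 131072)}
```

The fraction grows monotonically with $T$ because the score terms grow as $T^2$ and everything else grows as $T$. Counting only the causal half of the score matrix would roughly halve the quadratic term.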

Training Compute Budget

Training FLOPs for a model with $N$ parameters on $D$ tokens:

$$C \approx 6ND$$

Chinchilla scaling law (Hoffmann et al., 2022) suggests $N \approx D/20$ for compute-optimal training. So for a 70B model, you want about $70\text{B} \times 20 = 1.4\text{T}$ tokens.

GPU hours estimate:

$$\text{GPU hours} = \frac{6ND}{\text{GPU FLOPS} \times \text{MFU} \times 3600}$$

For an H100 at ~1000 TFLOPS (bf16) with 40% MFU (model FLOP utilization):

$$\text{GPU hours for LLaMA 3 70B} = \frac{6 \times 70 \times 10^9 \times 15 \times 10^{12}}{1000 \times 10^{12} \times 0.4 \times 3600} \approx 4{,}375{,}000 \text{ GPU hours}$$

That is about 500 H100s running for 1 year, or 6,000 H100s for one month.
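The same estimate as a reusable helper, for trying other model sizes or utilization assumptions. The function name is hypothetical, and the peak throughput and MFU are the assumed values from the text, not measurements.

```python
def training_gpu_hours(n_params, n_tokens, peak_flops, mfu):
    """GPU hours from C = 6ND, a peak FLOP/s figure and a utilization factor."""
    return 6 * n_params * n_tokens / (peak_flops * mfu * 3600)

hours = training_gpu_hours(n_params=70e9, n_tokens=15e12, peak_flops=1000e12, mfu=0.4)
gpus_for_one_year = hours / (365 * 24)       # number of GPUs if the run lasts a year
chinchilla_tokens = 20 * 70e9                # compute-optimal token budget for 70B
```

This returns roughly 4.4 million GPU hours, or about 500 GPUs for a year. Halving the MFU doubles the bill, which is why utilization is a standard follow-up question.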

Part 8 - Model Architecture Comparison

The Big Picture (2024-2026 Models)

| Feature | LLaMA 3 70B | Mistral 7B | Gemini 1.5 Pro | GPT-4 (estimated) | Claude 3.5 |
|---|---|---|---|---|---|
| Architecture | Decoder-only | Decoder-only | Decoder-only (sparse MoE) | Decoder-only (MoE, rumored) | Decoder-only |
| Positional | RoPE | RoPE | RoPE variant | Unknown | Unknown |
| Attention | GQA (8 KV) | GQA (8 KV) | Multi-Query or GQA | Unknown | Unknown |
| FFN | SwiGLU | SwiGLU | SwiGLU variant | Unknown | Unknown |
| Norm | RMSNorm, pre-norm | RMSNorm, pre-norm | RMSNorm, pre-norm | Pre-norm | Pre-norm |
| Context | 128K (Llama 3.1; 8K at original release) | 32K (sliding window) | 1M+ | 128K | 200K |
| Params | 70B | 7.3B | Unknown (large) | ~1.8T MoE (rumored) | Unknown |
| Special | - | Sliding window attn | Ring attention, long context | MoE routing | Constitutional AI |

Company Variation

Anthropic and OpenAI do not publish full architecture details. In interviews at these companies, acknowledge what is public and what is speculation. Saying "GPT-4 is rumored to be a mixture of experts with about 1.8T total parameters and ~200B active" shows awareness. Claiming certainty about unpublished details shows poor judgment.

Mistral's Sliding Window Attention

Mistral 7B uses sliding window attention (SWA), where each token attends only to the previous $W$ tokens (e.g., $W = 4096$):

$$\text{Attention}_{ij} = \begin{cases} \frac{q_i \cdot k_j}{\sqrt{d_k}} & \text{if } 0 \leq i - j \leq W \\ -\infty & \text{otherwise} \end{cases}$$

Benefits: KV cache capped at $W$ tokens regardless of sequence length. Memory is $O(W)$ instead of $O(T)$.

Effective receptive field: After $L$ layers with window size $W$, a token can theoretically draw on information from the previous $L \times W$ tokens, because each stacked attention layer extends the reach by another window. With 32 layers and $W = 4096$: effective receptive field = $131{,}072$ tokens.
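The windowed mask follows directly from the case definition. This is a minimal NumPy sketch with a hypothetical function name; it uses the condition $0 \leq i - j \leq W$ exactly as written above.

```python
import numpy as np

def sliding_window_mask(T, W):
    """Additive mask: 0 where 0 <= i - j <= W, -inf everywhere else."""
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    allowed = (i - j >= 0) & (i - j <= W)
    return np.where(allowed, 0.0, -np.inf)

mask = sliding_window_mask(T=6, W=2)
visible_from_last = int((mask[5] == 0).sum())    # keys visible to the last query
receptive_field = 32 * 4096                      # L x W for 32 layers, W = 4096
```

The mask is added to the attention logits in the same way as the causal mask. Keys older than the window get $-\infty$, so their cache entries can be evicted, typically by using a rolling buffer.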

Mixture of Experts (MoE)

Mixtral (and, reportedly, GPT-4) use MoE layers where only a subset of "expert" FFN blocks is activated per token:

[Figure: MoE Routing Architecture]

Key MoE numbers for Mixtral 8x7B:

  • 8 experts per layer, top-2 routing
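  • Each expert is a SwiGLU FFN block, not a full 7B model (~176M parameters per layer, ~5.6B across 32 layers); attention and embedding weights are shared across experts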
  • Total parameters: ~47B (but only ~13B active per token)
  • Inference cost: similar to a 13B dense model
  • Quality: comparable to LLaMA 2 70B

MoE tradeoffs:

  • More total parameters (more memory for weights) but fewer active FLOPs
  • Load balancing challenges - some experts get used much more than others
  • Harder to fine-tune - which experts should adapt?
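Top-2 routing can be sketched as follows. This is an illustrative token-by-token loop, not an efficient implementation; the names `moe_layer` and `w_router` are assumptions, and the gate is a softmax over the selected experts' router logits, as described for Mixtral.

```python
import numpy as np

def moe_layer(x, w_router, experts, top_k=2):
    """x: (T, d); w_router: (d, n_experts); experts: list of callables on (d,) vectors."""
    logits = x @ w_router                                 # (T, n_experts) router scores
    top = np.argsort(logits, axis=-1)[:, -top_k:]         # top-k expert indices per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        selected = logits[t, top[t]]
        gate = np.exp(selected - selected.max())
        gate = gate / gate.sum()                          # softmax over the chosen experts
        for g, e in zip(gate, top[t]):
            out[t] += g * experts[e](x[t])                # only top_k experts run per token
    return out, top

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
w_router = rng.standard_normal((8, 8))
identity_experts = [lambda h: h for _ in range(8)]        # toy experts for checking
out, top = moe_layer(x, w_router, identity_experts)
```

Because the gate weights sum to 1, identity experts reproduce the input, a handy sanity check. The `top` array shows which experts each token used, which is the quantity that load-balancing losses try to even out.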

Practice Problems

Problem 1: KV Cache Memory

You are deploying a model with these specs: 40 layers, hidden dimension 5120, 40 query heads, 8 KV heads, head dimension 128. You need to serve batch size 32 with context length 8192 in fp16. How much KV cache memory is needed?

Hint 1 - Direction

Use the KV cache formula: $2 \times n_{\text{layers}} \times n_{\text{kv\_heads}} \times d_{\text{head}} \times \text{seq\_len} \times \text{bytes} \times \text{batch}$

Hint 2 - Insight

Plug in: $2 \times 40 \times 8 \times 128 \times 8192 \times 2 \times 32$. Be careful with units - work in bytes and convert to GB at the end.

Hint 3 - Full Solution + Rubric

$$\text{KV Cache} = 2 \times 40 \times 8 \times 128 \times 8192 \times 2 \times 32$$

Step by step:

  • Per token per layer: $2 \times 8 \times 128 \times 2 = 4{,}096$ bytes
  • Per token all layers: $4{,}096 \times 40 = 163{,}840$ bytes $\approx 160$ KB
  • Per sequence: $163{,}840 \times 8{,}192 = 1{,}342{,}177{,}280$ bytes $\approx 1.25$ GB
  • Full batch: $1.25 \times 32 = 40$ GB

Answer: ~40 GB of KV cache memory.

Scoring Rubric:

| Criterion | Strong Hire | Lean Hire | No Hire |
|---|---|---|---|
| Correct formula | Wrote formula from memory | Needed to derive step by step | Could not set up the calculation |
| Correct answer | Within 5% of 40 GB | Within 20% | Off by more than 2x |
| Follow-up insight | "This means we need at least an 80 GB GPU or tensor parallelism across 2 GPUs, plus model weights" | "That is a lot of memory" | No interpretation |
| Speed | Under 2 minutes | Under 5 minutes | Could not finish |

Problem 2: Architecture Design

You are tasked with designing a 13B parameter model optimized for 32K context length inference on 8 H100 GPUs. Specify: layers, hidden dimension, query heads, KV heads, FFN dimension, and positional encoding. Justify each choice.

Hint 1 - Direction

Start by working backward from 13B parameters using the parameter counting formulas. Consider how to minimize KV cache memory for 32K context while maintaining quality.

Hint 2 - Insight

A 13B model typically has $d \approx 5120$ and $L \approx 40$. For 32K context, GQA with few KV heads is essential. Use RoPE for position extension capability. Use SwiGLU with $d_{\text{ff}} \approx \frac{8d}{3} \approx 13{,}653$, rounded up to a multiple of 256 ($13{,}824$).

Hint 3 - Full Solution + Rubric

Proposed architecture:

| Hyperparameter | Value | Justification |
|---|---|---|
| Layers ($L$) | 40 | Standard depth for 13B class |
| Hidden dim ($d$) | 5120 | Blocks: $40 \times (2.4d^2 + 3d \cdot d_{\text{ff}}) \approx 11.0\text{B}$; embeddings add the rest (e.g., about 1.3B for a 128K vocabulary with an untied output head) |
| Query heads ($H$) | 40 | $d_h = 5120/40 = 128$ (standard) |
| KV heads ($H_{\text{kv}}$) | 8 | GQA with 5x reduction; KV cache for 32K in fp16: ~5 GB per sequence |
| FFN dim ($d_{\text{ff}}$) | 13824 | $\frac{8 \times 5120}{3} \approx 13653$, rounded up to a multiple of 256 |
| Positional encoding | RoPE | Relative, extensible to longer contexts |
| Normalization | Pre-norm RMSNorm | Standard for stability |
| Activation | SwiGLU | Standard for quality |

Memory budget on 8 H100s (80 GB each = 640 GB total):

  • Model weights (bf16): ~26 GB (tensor parallel across 8 GPUs = ~3.3 GB/GPU)
  • KV cache for batch 16 at 32K: $\approx 5$ GB/sequence $\times\ 16 / 8$ GPUs $\approx 10$ GB/GPU
  • Activations and overhead: ~5 GB/GPU
  • Total per GPU: ~18 GB - fits comfortably in 80 GB

Scoring Rubric:

| Criterion | Strong Hire | Lean Hire | No Hire |
|---|---|---|---|
| Parameter count works out | Within 10% of 13B | Within 30% | Did not verify |
| KV cache considered | Calculated memory, chose GQA | Mentioned GQA without numbers | Used MHA or ignored memory |
| Multi-GPU plan | Tensor parallelism layout | Mentioned parallelism vaguely | No consideration of 8 GPUs |
| Modern components | RoPE, GQA, SwiGLU, RMSNorm | Most modern choices | Used original 2017 architecture |

Problem 3: RoPE Extension

Your 13B model was trained with RoPE on 8K context length. Your users need 32K context. What approaches can you use to extend the context length without retraining from scratch?

Hint 1 - Direction

Think about what RoPE does at positions beyond the training range. The rotation angles become much larger - the model has never seen those angles during training.

Hint 2 - Insight

The main approaches are: (1) Position Interpolation - scale down positions so 32K maps to the 0-8K range, (2) NTK-aware scaling - modify the base frequency to spread rotations more evenly, (3) YaRN - combine NTK scaling with attention scaling and fine-tune briefly.

Hint 3 - Full Solution + Rubric

Three main approaches:

1. Position Interpolation (PI) - Chen et al., 2023. Scale all positions by $\frac{L_{\text{train}}}{L_{\text{target}}} = \frac{8192}{32768} = 0.25$. Position 32K becomes position 8K in RoPE-space. Requires short fine-tuning (~1000 steps) to adapt.

2. NTK-aware Scaling (related to the enlarged RoPE base used by Code LLaMA). Modify the RoPE base, i.e. the constant 10000 in $\theta_i$: $\theta_{\text{new}} = \theta_{\text{base}} \times \alpha^{d/(d-2)}$ where $\alpha$ is the scaling factor and $d$ is the head dimension. This spreads the rotations more evenly across the extended range, preserving the relative position resolution at short distances while extending range.

3. YaRN - Yet another RoPE extension (Peng et al., 2023). Combines NTK-aware interpolation with an attention temperature scaling factor. Different frequency bands are scaled differently: low-frequency (long-range) dimensions are interpolated more aggressively, while high-frequency (local) dimensions are preserved. Fine-tune for ~400 steps.

Comparison:

| Method | Fine-tuning Needed | Quality at 32K | Complexity |
|---|---|---|---|
| PI | ~1000 steps | Good | Low |
| NTK-aware | None (or very little) | Decent | Medium |
| YaRN | ~400 steps | Best | Medium |
| Retrain | Full pretraining | Ideal | Very high |
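The first two methods change only how the rotation angles are computed, which can be sketched in a few lines. The function name `rope_angles` and the head dimension of 128 are assumptions for illustration; YaRN's per-band interpolation and temperature scaling are not shown.

```python
import numpy as np

def rope_angles(pos, d, base=10000.0, pos_scale=1.0):
    """Rotation angles pos * theta_i for every dimension pair of a d-dim head."""
    theta = base ** (-np.arange(0, d, 2) / d)
    return (pos * pos_scale) * theta

d = 128
train_len, target_len = 8192, 32768

# Position Interpolation: squeeze positions by L_train / L_target = 0.25.
pi_angles = rope_angles(target_len, d, pos_scale=train_len / target_len)

# NTK-aware scaling: keep positions, enlarge the base by alpha^(d / (d - 2)).
alpha = target_len / train_len
ntk_base = 10000.0 * alpha ** (d / (d - 2))
ntk_angles = rope_angles(target_len, d, base=ntk_base)

original_angles = rope_angles(train_len, d)
```

PI maps position 32K onto exactly the angles the model saw at position 8K, for every dimension pair. NTK-aware scaling leaves the highest-frequency pair untouched and slows the lowest-frequency pair by the full factor $\alpha$, which is how it keeps short-distance resolution while extending range.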

Scoring Rubric:

| Criterion | Strong Hire | Lean Hire | No Hire |
|---|---|---|---|
| Knows multiple methods | Describes 2-3 with tradeoffs | Knows one method | "Just train on longer data" |
| Understands the math | Can explain why interpolation works | Vague "scale the positions" | No understanding of mechanism |
| Practical considerations | Fine-tuning cost, quality degradation | Mentions fine-tuning | Does not mention practical steps |

Problem 4: Why Not Encoder-Decoder?

An interviewer asks: "Google's T5 and Meta's BART used encoder-decoder architecture and achieved great results. Why did the field converge on decoder-only? Are there cases where encoder-decoder is still better?"

Hint 1 - Direction

Think about three angles: scaling efficiency, task generality, and in-context learning ability.

Hint 2 - Insight

Decoder-only won because: (1) all parameters contribute to generation (no encoder "dead weight" for generation tasks), (2) in-context learning emerges naturally from causal modeling, (3) training is simpler (single objective). Encoder-decoder is still better for fixed seq2seq tasks like translation and summarization where the input and output are clearly separated.

Hint 3 - Full Solution + Rubric

Why decoder-only dominates:

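  1. Parameter efficiency for generation: In a balanced encoder-decoder model, roughly half of the parameters sit in the encoder. The encoder runs once over the input and its weights are not applied at each generated token, so only the decoder's share does per-token generation work. A decoder-only model applies all of its parameters at every generation step.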

  2. Emergent in-context learning: Causal language models naturally develop the ability to learn from examples in their context (few-shot learning). Encoder-decoder models are less natural at this because the "input" and "output" are architecturally separated.

  3. Training simplicity: One objective (next-token prediction), one loss, one architecture. Encoder-decoder requires designing input/output splits and potentially different objectives for each stack.

  4. Scaling behavior: The major scaling-law studies (Kaplan et al., Hoffmann et al.) were run on decoder-only models, and in practice decoder-only models have scaled well on the tasks we care about (general-purpose generation, reasoning, tool use).

Where encoder-decoder still wins:

  • Machine translation with fixed language pairs
  • Summarization where the compression ratio is high
  • Speech recognition (Whisper uses encoder-decoder)
  • Any task with a clear, fixed input-output structure where you do not need in-context flexibility

The nuance: Prefix LM (decoder-only with bidirectional attention on the prefix) can approximate encoder-decoder behavior. This is what work such as UL2 and U-PaLM (PaLM with continued UL2-style training) explored - a decoder-only architecture with flexible attention masking.
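The prefix-LM idea comes down to a change in the attention mask. The sketch below is illustrative only: `prefix_lm_mask` is a hypothetical helper, and the 4-token sequence with a 2-token prefix is an assumed toy example.

```python
def prefix_lm_mask(seq_len: int, prefix_len: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j.

    Positions inside the prefix see the whole prefix (bidirectional);
    positions after the prefix see everything up to and including themselves.
    """
    return [
        [j <= i or j < prefix_len for j in range(seq_len)]
        for i in range(seq_len)
    ]


# prefix_len = 0 reduces to the ordinary causal (lower-triangular) mask.
causal = prefix_lm_mask(4, 0)
# With a 2-token prefix, the prefix tokens attend to each other in both directions.
prefix = prefix_lm_mask(4, 2)

for row in prefix:
    print("".join("x" if allowed else "." for allowed in row))
# xx..
# xx..
# xxx.
# xxxx
```

In a real model the `False` entries would be filled with $-\infty$ before the softmax, exactly as with the causal mask.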

Scoring Rubric:

| Criterion | Strong Hire | Lean Hire | No Hire |
|---|---|---|---|
| Multiple reasons | 3+ well-explained reasons | 1-2 reasons | "Decoder-only is just better" |
| Nuance | Acknowledges where enc-dec wins | Mentions translation | Absolute claims |
| Research awareness | Mentions prefix LM, UL2 | Mentions T5 | Only knows GPT-style |

Interview Cheat Sheet

| Concept | Key Formula / Fact | Common Follow-Up |
|---|---|---|
| Attention | $\text{softmax}(QK^T/\sqrt{d_k} + \text{mask})V$ | "What is the complexity? How does Flash Attention help?" |
| Causal Mask | Upper triangle = $-\infty$, applied before softmax | "How does this enable parallel training?" |
| RoPE | Rotate $(q_{2i}, q_{2i+1})$ by $m\theta_i$; dot product depends on $m-n$ | "How do you extend to longer contexts?" |
| GQA | $H$ query heads share $H_{\text{kv}}$ KV heads | "What is the memory savings vs MHA?" |
| KV Cache | $2 \times L \times H_{\text{kv}} \times d_h \times T \times \text{bytes}$ | "How much for batch 32 at 128K?" |
| SwiGLU | $(\text{Swish}(xW_g) \odot xW_u)W_d$, $d_{\text{ff}} = 8d/3$ | "Why three matrices? Parameter count?" |
| RMSNorm | $x / \text{RMS}(x) \times \gamma$, no mean subtraction | "Why not LayerNorm?" |
| Pre-norm | Norm before attention/FFN, residual flows unimpeded | "What goes wrong with post-norm at depth?" |
| Param count | Attn: $2d^2 + 2dH_{\text{kv}}d_h$; FFN: $3d \cdot d_{\text{ff}}$ | "Count params for LLaMA 3 70B" |
| FLOPs | Forward: $\approx 2NT$; Train: $\approx 6ND$ | "How many GPU-hours for 70B on 15T tokens?" |
| MoE | Top-k routing, total params much larger than active | "Load balancing? Fine-tuning MoE?" |
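As a quick check of the KV cache row, the sketch below applies that formula to the model from the opening scenario (80 layers, hidden dimension 8192, 64 heads, fp16). The choice of 8 KV heads for the GQA case and the reading of "128K" as 131,072 tokens are assumptions, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_elem=2, batch=1):
    """2 (K and V) x L x H_kv x d_h x T x bytes per element, times batch size."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem * batch


GIB = 1024 ** 3
layers, hidden, heads = 80, 8192, 64
head_dim = hidden // heads   # 128
tokens = 128 * 1024          # 128K context, assumed to mean 131,072 tokens

# MHA: every query head has its own K/V head.
mha = kv_cache_bytes(layers, kv_heads=heads, head_dim=head_dim, tokens=tokens)
# GQA: 8 KV heads shared across the 64 query heads (assumed group size).
gqa = kv_cache_bytes(layers, kv_heads=8, head_dim=head_dim, tokens=tokens)

print(f"MHA: {mha / GIB:.0f} GiB")                 # MHA: 320 GiB
print(f"GQA: {gqa / GIB:.0f} GiB")                 # GQA: 40 GiB
print(f"GQA, batch 32: {32 * gqa / GIB:.0f} GiB")  # GQA, batch 32: 1280 GiB
```

The saving is simply the ratio $H / H_{\text{kv}}$, here 8x, because nothing else in the formula depends on the number of query heads.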

Spaced Repetition Checkpoints

Day 0 (After reading this chapter)

  • Draw a modern LLM block from memory (RMSNorm, GQA, SwiGLU, residuals)
  • Write the causal mask formula and explain why it uses $-\infty$
  • Calculate KV cache for LLaMA 3 8B at 8K context, fp16
  • Explain RoPE in one paragraph without looking at notes

Day 3

  • From memory, list all differences between original Transformer and modern LLM block
  • Derive the parameter count for a 7B model ($d=4096$, $L=32$, GQA with 8 KV heads)
  • Explain GQA to someone in 60 seconds - include the memory savings number
  • Write the SwiGLU formula and explain the three weight matrices
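To check your answer to the parameter-count item, here is one possible worked version using the cheat sheet formulas. Only $d$, $L$, and the 8 KV heads come from the exercise; the 32 query heads, `d_ff = 14336`, the 32,000-token vocabulary, and untied embeddings are assumptions chosen to resemble a Mistral-7B-style configuration.

```python
def attn_params(d, kv_heads, head_dim):
    """W_q and W_o are d x d; W_k and W_v are d x (H_kv * d_h)."""
    return 2 * d * d + 2 * d * kv_heads * head_dim


def ffn_params(d, d_ff):
    """SwiGLU has three matrices: gate and up (d x d_ff), down (d_ff x d)."""
    return 3 * d * d_ff


d, layers, kv_heads = 4096, 32, 8   # given in the exercise
q_heads = 32                        # assumed number of query heads
head_dim = d // q_heads             # 128
d_ff = 14336                        # assumed FFN width
vocab = 32000                       # assumed vocabulary size

attn = attn_params(d, kv_heads, head_dim)
ffn = ffn_params(d, d_ff)
per_layer = attn + ffn + 2 * d      # plus two RMSNorm gain vectors per block
embeddings = 2 * vocab * d          # untied input embedding and output head (assumed)
total = layers * per_layer + embeddings + d   # plus the final RMSNorm

print(f"attention per layer: {attn:,}")   # attention per layer: 41,943,040
print(f"FFN per layer: {ffn:,}")          # FFN per layer: 176,160,768
print(f"total: {total / 1e9:.2f}B")       # total: 7.24B
```

Note that the FFN accounts for roughly four fifths of each block, which is why the choice of `d_ff` moves the total far more than the choice of KV head count.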

Day 7

  • Compare RoPE, ALiBi, and sinusoidal encodings - pros/cons of each
  • Calculate training FLOPs for a 13B model on 2T tokens
  • Explain why pre-norm is necessary for deep models - what fails with post-norm?
  • Quiz: given a model config, can you estimate total parameters within 20%?
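For the training FLOPs item, the $6ND$ rule from the cheat sheet can be turned into a GPU-hours estimate as sketched below. The peak throughput of 312 TFLOP/s (an A100-class dense bf16 figure) and the 40% utilization are assumptions; substitute your own hardware numbers.

```python
def train_flops(params: float, tokens: float) -> float:
    """Training compute estimate: about 6 FLOPs per parameter per token."""
    return 6 * params * tokens


def gpu_hours(flops: float, peak_flops_per_s: float, mfu: float) -> float:
    """GPU-hours needed at a given peak throughput and model FLOPs utilization."""
    return flops / (peak_flops_per_s * mfu) / 3600


N = 13e9        # parameters (from the exercise)
D = 2e12        # training tokens (from the exercise)
PEAK = 312e12   # assumed peak dense bf16 FLOP/s per GPU
MFU = 0.40      # assumed model FLOPs utilization

total = train_flops(N, D)
print(f"{total:.2e} FLOPs")                               # 1.56e+23 FLOPs
print(f"{gpu_hours(total, PEAK, MFU):,.0f} GPU-hours")    # 347,222 GPU-hours
```

The estimate scales linearly in both $N$ and $D$ and inversely in utilization, so the MFU assumption matters as much as the model size.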

Day 14

  • Do Practice Problem 2 (architecture design) from scratch, timed (15 min)
  • Explain MoE architecture - routing, expert selection, load balancing
  • From memory, fill in the model comparison table for LLaMA 3, Mistral, and GPT-4
  • Practice the 60-second answers for all cheat sheet entries

Day 21

  • Full mock: "Design a 30B model for 64K context. Specify everything and justify."
  • Rapid fire: answer 10 random cheat sheet questions in under 60 seconds each
  • Explain context length extension methods (PI, NTK, YaRN) with tradeoffs
  • Re-take the self-assessment - all scores should be 4+
© 2026 EngineersOfAI. All rights reserved.