Transformer Internals for LLMs - The Architecture That Powers Everything

Reading time: ~50 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Eng, LLM Eng, ML Infra Eng

The Real Interview Moment

You are in a Google DeepMind interview. The interviewer opens with: "Let us say we are designing a new 70B parameter decoder-only model with 80 layers, a hidden dimension of 8192, and 64 attention heads. Walk me through the architecture - what is GQA and why would we use it? How much KV cache memory do we need for a single sequence of 128K tokens in fp16? How many FLOPs does a forward pass take?"

You start writing on the whiteboard. She interrupts after your first equation: "Wait - you wrote the attention formula. But modern LLMs do not use vanilla attention. What positional encoding scheme are we using, and why did the field move away from learned absolute embeddings?"

This is a standard Tier 1 interview question. The interviewer is not testing whether you have heard of Transformers - everyone has. She is testing whether you understand the specific architectural choices that make modern LLMs work, and whether you can reason quantitatively about memory, compute, and tradeoffs.

Candidates who describe the original 2017 Transformer and stop there get a "no hire." Candidates who can walk through a modern decoder-only architecture with RoPE, GQA, SwiGLU, and RMSNorm - with exact dimension calculations - get a "strong hire."

What You Will Master

Draw a modern decoder-only Transformer block with all sub-components labeled
Explain why LLMs use decoder-only architecture instead of encoder-decoder
Derive causal attention masking and its role in autoregressive generation
Compare RoPE, ALiBi, and learned positional encodings with mathematical precision
Calculate KV cache memory requirements for any model configuration
Explain GQA, MQA, and MHA with memory and quality tradeoffs
Describe SwiGLU FFN and why it replaced standard ReLU FFN
Justify RMSNorm over LayerNorm and pre-norm over post-norm
Count parameters and estimate FLOPs for any Transformer configuration
Compare architectures of GPT-4, LLaMA 3, Gemini, Claude, and Mistral

Self-Assessment: Where Are You Now?

Skill	1 -- Cannot	2 -- Vaguely	3 -- Can Explain	4 -- Can Derive	5 -- Can Teach	Your Score
Draw a modern decoder-only block						___
Explain causal masking mathematically						___
Derive RoPE positional encoding						___
Compare GQA/MQA/MHA with numbers						___
Calculate KV cache memory						___
Explain SwiGLU activation						___
RMSNorm vs LayerNorm difference						___
Count parameters for a given config						___
Estimate forward pass FLOPs						___
Compare GPT-4 / LLaMA 3 / Mistral architectures						___

Target: All 4s and 5s before your interview.

Part 1 - The Modern Decoder-Only Architecture

Why Decoder-Only?

The original Transformer (Vaswani et al., 2017) had both encoder and decoder stacks. Modern LLMs overwhelmingly use decoder-only architecture. Here is why:

Factor	Encoder-Decoder	Decoder-Only
Task generality	Best for seq2seq (translation)	Handles all tasks via prompting
Training simplicity	Two separate stacks to train	Single uniform stack
Scaling behavior	Parameters split across stacks	All parameters serve generation
In-context learning	Limited by encoder bottleneck	Naturally handles long contexts
KV cache	Separate caches for encoder/decoder	Single unified cache

Interviewer's Perspective

"Why decoder-only?" is a warm-up question. The real test is whether you understand that decoder-only models can still do "encoding" - the prefix tokens in a prompt serve as a pseudo-encoder. The key architectural difference is causal masking, not the absence of encoding capability.

The Modern LLM Block

A single Transformer block in a modern LLM (circa LLaMA 3, 2024) looks like this:

[Figure: Transformer block architecture]

Key differences from the original Transformer:

Component	Original (2017)	Modern LLM (2024+)
Normalization	Post-norm LayerNorm	Pre-norm RMSNorm
Positional encoding	Sinusoidal (additive)	RoPE (multiplicative, relative)
Attention	Multi-Head Attention	Grouped Query Attention
Feed-forward	ReLU with 4x expansion	SwiGLU with 8/3x expansion
Architecture	Encoder-Decoder	Decoder-only

Causal Masking

In a decoder-only model, each token can only attend to itself and all preceding tokens. This is enforced by a causal mask (also called the "look-ahead mask"):

\text{Mask}_{ij} = \begin{cases} 0 & \text{if } i \geq j \\ -\infty & \text{if } i < j \end{cases}

The masked attention computation becomes:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + \text{Mask}\right) V

The $-\infty$ entries become 0 after softmax, effectively preventing information flow from future tokens.

Common Trap

Candidates often say "the mask is a triangular matrix of 0s and 1s." This is conceptually right but technically wrong: the mask uses $-\infty$ (or a very large negative number like $-10^9$ ) and is added to the attention logits before softmax. Multiplying the logits by a 0/1 mask does not work, because a logit of 0 still receives weight $e^0 = 1$ before normalization; a 0/1 mask would have to be applied after softmax and followed by renormalization. Interviewers notice this distinction.

Why causal masking enables training parallelism: During training, all positions in a sequence are computed simultaneously. The causal mask ensures position $i$ does not "see" positions $i+1, i+2, \ldots$ even though they are present in the same sequence. This is why Transformer training is much faster than RNN training - the entire sequence is processed in one forward pass instead of one step per token.
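As a minimal sketch of the masked softmax above (NumPy, single head, no batch dimension; the function name `causal_attention_weights` is illustrative and not taken from any library), the additive mask can be built and applied like this:

```python
import numpy as np

def causal_attention_weights(q, k):
    """Softmax attention weights with an additive causal mask.

    q, k: arrays of shape (T, d_k). Row i of the result holds the
    weights that position i places on positions 0..i.
    """
    T, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)
    # Additive mask: 0 where j <= i, -inf where j > i (strict upper triangle).
    mask = np.triu(np.full((T, T), -np.inf), k=1)
    scores = scores + mask
    # Numerically stable softmax over the key axis; exp(-inf) evaluates to 0.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
w = causal_attention_weights(rng.normal(size=(5, 8)), rng.normal(size=(5, 8)))
```

Every row of `w` sums to 1 and everything above the diagonal is exactly 0, so all five positions are handled in one matrix product without leaking future tokens.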

Part 2 - Positional Encodings: From Sinusoidal to RoPE

The Position Problem

Self-attention is permutation-equivariant: without positional information, permuting the input tokens simply permutes the outputs, so the model cannot tell "the cat sat on the mat" from "mat the on sat cat the" by word order. We need to inject position information.

Sinusoidal (Original, 2017)

The original Transformer used fixed sinusoidal embeddings added to token embeddings:

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)

Limitations for LLMs:

Additive: position info gets diluted through deep networks
Absolute: does not generalize well to sequences longer than training length
Not relative: attention scores do not directly encode relative distance

Learned Absolute Embeddings (GPT-2 era)

Replace fixed sinusoidal with a learnable embedding table of shape (max_seq_len, d).

Limitations:

Hard ceiling on context length - cannot extrapolate beyond training length
Still absolute, not relative

RoPE - Rotary Position Embedding (Su et al., 2021)

RoPE is used by LLaMA, Mistral, Qwen, and most modern open-source LLMs. It encodes position by rotating the query and key vectors in 2D subspaces.

Core idea: Group the dimensions of $q$ and $k$ into pairs $(q_{2i}, q_{2i+1})$ and rotate each pair by an angle proportional to the position:

\text{RoPE}(x_m, m) = \begin{pmatrix} x_{m,0} \\ x_{m,1} \\ x_{m,2} \\ x_{m,3} \\ \vdots \end{pmatrix} \otimes \begin{pmatrix} \cos(m\theta_0) \\ \cos(m\theta_0) \\ \cos(m\theta_1) \\ \cos(m\theta_1) \\ \vdots \end{pmatrix} + \begin{pmatrix} -x_{m,1} \\ x_{m,0} \\ -x_{m,3} \\ x_{m,2} \\ \vdots \end{pmatrix} \otimes \begin{pmatrix} \sin(m\theta_0) \\ \sin(m\theta_0) \\ \sin(m\theta_1) \\ \sin(m\theta_1) \\ \vdots \end{pmatrix}

where $\theta_i = 10000^{-2i/d}$ and $m$ is the position index.

Why RoPE is elegant: The dot product $q_m \cdot k_n$ after RoPE application depends only on the relative position $m - n$ , not on $m$ and $n$ individually. This is because:

\text{RoPE}(q_m) \cdot \text{RoPE}(k_n) = g(q, k, m - n)

The rotation of $q$ at position $m$ and $k$ at position $n$ produces an inner product that is a function of the relative distance - exactly the property we want for attention.
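The relative-position property can be checked numerically. The sketch below is a hedged, single-vector NumPy version of the pairwise rotation written above (the helper name `rope` and the test positions are illustrative assumptions, not part of any library):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[2i], x[2i+1]) of a 1-D vector by pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # theta_i = base^(-2i/d)
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_odd * cos + x_even * sin
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)

# Same relative offset (m - n = 3) at two very different absolute positions.
score_near = rope(q, 5) @ rope(k, 2)
score_far = rope(q, 105) @ rope(k, 102)
```

The two scores agree to floating-point precision because the query rotation undoes all but $m - n$ of the key rotation in each 2D pair. Rotations also preserve vector norms, so RoPE does not change the scale of the logits.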

RoPE advantages for LLMs:

Relative: attention inherently captures relative position
Multiplicative: position information is deeply intertwined with content, not easily washed away
Extensible: with frequency scaling (e.g., YaRN, NTK-aware scaling), RoPE can extrapolate to longer contexts than seen during training
No extra parameters: the rotation angles are deterministic, not learned

60-Second Answer

"RoPE encodes position by rotating query and key vectors in 2D subspaces. Each pair of dimensions gets rotated by an angle proportional to the position, with different frequencies for different dimension pairs. The key property is that the dot product between a rotated query at position $m$ and a rotated key at position $n$ depends only on the relative distance $m - n$ . This gives us relative positional encoding without any additional parameters, and it can be extended to longer contexts through frequency scaling techniques like YaRN."

ALiBi - Attention with Linear Biases

ALiBi (Press et al., 2022) takes a different approach: it adds a linear bias to attention scores based on distance:

\text{Attention}_{ij} = \frac{q_i \cdot k_j}{\sqrt{d_k}} - m \cdot |i - j|

where $m$ is a head-specific slope. Heads with steeper slopes focus on local context; gentler slopes attend globally.
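A hedged sketch of the bias term follows. It assumes the geometric slope schedule described in the ALiBi paper for power-of-two head counts ( $2^{-8h/H}$ for head $h = 1, \ldots, H$ ); the function name `alibi_bias` is illustrative:

```python
import numpy as np

def alibi_bias(num_heads, T):
    """Return a (num_heads, T, T) additive bias equal to -slope_h * |i - j|."""
    # Assumed slope schedule: 1/2, 1/4, ..., 1/256 for 8 heads.
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    distance = np.abs(i - j)
    return -slopes[:, None, None] * distance

bias = alibi_bias(num_heads=8, T=4)
```

The bias is added to the scaled logits together with the causal mask before softmax. Head 0 (slope 1/2) penalizes distance steeply and stays local, while head 7 (slope 1/256) is nearly flat.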

ALiBi vs RoPE tradeoff:

ALiBi: simpler, better length extrapolation, but attention scores are biased (penalizes long-range attention)
RoPE: more expressive, preserves attention magnitude, but needs explicit scaling for length extension

Most 2024-2026 models use RoPE. ALiBi saw adoption in BLOOM and some MPT models but fell out of favor.

Comparison Table

Method	Type	Relative	Extra Params	Length Extension	Used By
Sinusoidal	Additive, absolute	No	None	Poor	Original Transformer
Learned	Additive, absolute	No	$T \times d$	None (hard limit)	GPT-2
RoPE	Multiplicative, relative	Yes	None	Good (with scaling)	LLaMA, Mistral, Qwen
ALiBi	Additive bias, relative	Yes	None	Good (native)	BLOOM, MPT

Part 3 - Grouped Query Attention (GQA)

The KV Cache Problem

During autoregressive generation, each new token needs to attend to all previous tokens. Recomputing all keys and values at every step is wasteful, so we cache them - the KV cache.

KV cache size per token per layer:

\text{KV per token per layer} = 2 \times d_{\text{head}} \times n_{\text{kv\_heads}} \times \text{bytes\_per\_param}

For a model with $d_{\text{model}} = 8192$ , 64 heads (so $d_{\text{head}} = 128$ ), 80 layers, at fp16 (2 bytes):

\text{KV per token (all layers)} = 2 \times 128 \times 64 \times 80 \times 2 = 2{,}621{,}440 \text{ bytes} \approx 2.5 \text{ MB/token}

For 128K context length: $2.5 \times 128{,}000 \approx 320$ GB. Just for the KV cache of one sequence!

This is the motivation for reducing the number of KV heads.

MHA, MQA, and GQA

[Figure: MHA vs MQA vs GQA comparison]

Variant	Q Heads	KV Heads	KV Cache Ratio	Quality Impact
MHA (original)	$H$	$H$	1x (baseline)	Best quality
MQA (Shazeer, 2019)	$H$	1	$1/H$	Noticeable degradation
GQA (Ainslie et al., 2023)	$H$	$H/G$	$1/G$	Minimal degradation

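GQA partitions the query heads into groups, and each group shares a single KV head. In the table, $G$ denotes the group size (query heads per KV head): with $H = 64$ query heads and $G = 8$ query heads per group, you get $64/8 = 8$ KV heads. The KV cache is now $8/64 = 1/8$ of MHA.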

Revised KV cache for our 70B example with GQA (8 KV heads instead of 64):

\text{KV per token (all layers)} = 2 \times 128 \times 8 \times 80 \times 2 = 327{,}680 \text{ bytes} \approx 0.31 \text{ MB/token}

For 128K context: $0.31 \times 128{,}000 \approx 40$ GB. Down from 320 GB - an 8x reduction.
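To make the sharing concrete, here is a minimal NumPy sketch of grouped-query attention for one sequence. It assumes consecutive query heads form a group and omits batching, RoPE, and the projections; `gqa_attention` is an illustrative name:

```python
import numpy as np

def gqa_attention(q, k, v):
    """Causal grouped-query attention.

    q: (H, T, d_h); k, v: (H_kv, T, d_h), with H divisible by H_kv.
    """
    H, T, d_h = q.shape
    group_size = H // k.shape[0]
    # Each cached KV head is reused by group_size consecutive query heads.
    k = np.repeat(k, group_size, axis=0)              # (H, T, d_h)
    v = np.repeat(v, group_size, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_h)  # (H, T, T)
    scores = scores + np.triu(np.full((T, T), -np.inf), k=1)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v                                # (H, T, d_h)

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4, 16))    # 8 query heads
k = rng.normal(size=(2, 4, 16))    # 2 KV heads -> groups of 4
v = rng.normal(size=(2, 4, 16))
out = gqa_attention(q, k, v)
```

Only the 2 KV heads need to be cached; the `np.repeat` expansion is a view-level convenience here, and optimized kernels avoid materializing it.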

60-Second Answer

"GQA groups multiple query heads to share a single set of key-value heads. If you have 64 query heads and 8 KV head groups, each group of 8 query heads shares one KV head. This reduces KV cache memory by 8x with minimal quality loss. LLaMA 3 70B uses GQA with 8 KV heads for its 64 query heads. The key insight is that nearby query heads learn similar attention patterns, so sharing KV projections is a reasonable approximation."

Common Trap

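Do not confuse the number of groups with the number of query heads per group. Each group shares exactly one KV head, so the number of groups equals the number of KV heads. With 64 query heads and 8 KV heads there are 8 groups of 8 query heads, so the two numbers happen to coincide; with 32 query heads and 8 KV heads (LLaMA 3 8B) there are 8 groups of 4. Some papers quote the group count and others the group size, so clarify which convention is being used.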

KV Cache Memory Formula (Memorize This)

\text{KV Cache (bytes)} = 2 \times n_{\text{layers}} \times n_{\text{kv\_heads}} \times d_{\text{head}} \times \text{seq\_len} \times \text{bytes\_per\_param}

The factor of 2 is for K and V separately. For batch size $B$ , multiply by $B$ .
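The formula is easy to script for quick sanity checks. This is a small sketch with an illustrative function name, using binary units (1 GiB = $1024^3$ bytes) and taking 128K to mean 131,072 tokens:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len,
                   bytes_per_param=2, batch=1):
    """KV cache size in bytes; the leading 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_param * batch

GIB = 1024 ** 3

# The 70B-class example from this part: 80 layers, d_head = 128, fp16.
mha_128k = kv_cache_bytes(80, 64, 128, 128 * 1024)   # 64 KV heads (MHA)
gqa_128k = kv_cache_bytes(80, 8, 128, 128 * 1024)    # 8 KV heads (GQA)

print(mha_128k / GIB)  # 320.0
print(gqa_128k / GIB)  # 40.0
```

Passing `seq_len=1` gives the per-token figure, for example 327,680 bytes (320 KiB) for the GQA configuration.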

Quick reference for popular models:

Model	Layers	KV Heads	$d_{\text{head}}$	KV Cache per Token (fp16)	128K Context
LLaMA 3 8B	32	8	128	128 KB	16 GB
LLaMA 3 70B	80	8	128	320 KB	40 GB
Mistral 7B	32	8	128	128 KB	16 GB
GPT-3 175B (MHA)	96	96	128	4.5 MB	576 GB

Part 4 - SwiGLU Feed-Forward Network

The Standard FFN (Original Transformer)

The original feed-forward network:

\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2

with $W_1 \in \mathbb{R}^{d \times 4d}$ and $W_2 \in \mathbb{R}^{4d \times d}$ . The intermediate dimension is $4 \times d_{\text{model}}$ .

SwiGLU (Shazeer, 2020)

Modern LLMs replace ReLU FFN with SwiGLU:

\text{SwiGLU}(x) = (\text{Swish}(xW_{\text{gate}}) \odot xW_{\text{up}}) W_{\text{down}}

where $\text{Swish}(x) = x \cdot \sigma(x)$ (also called SiLU) and $\odot$ is element-wise multiplication.

Three weight matrices instead of two:

$W_{\text{gate}} \in \mathbb{R}^{d \times d_{\text{ff}}}$ - produces gating signal
$W_{\text{up}} \in \mathbb{R}^{d \times d_{\text{ff}}}$ - produces value signal
$W_{\text{down}} \in \mathbb{R}^{d_{\text{ff}} \times d}$ - projects back down

To keep the total parameter count comparable to the standard $4d$ FFN (which has $2 \times d \times 4d = 8d^2$ parameters), the intermediate dimension is set to $d_{\text{ff}} = \frac{8d}{3}$ (often rounded to a multiple of 256):

\text{SwiGLU params} = 3 \times d \times \frac{8d}{3} = 8d^2

Same parameter budget as standard FFN, but consistently better performance across benchmarks.

[Figure: SwiGLU FFN architecture]

Why SwiGLU works better: The gating mechanism allows the network to selectively pass information. The Swish activation is smooth (unlike ReLU which has a hard zero), which helps with gradient flow. Empirically, SwiGLU consistently outperforms ReLU and GELU FFNs at the same parameter count (Shazeer, 2020).
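A minimal NumPy sketch of the SwiGLU block and the $8d/3$ sizing rule follows. The helper names are illustrative, and the round-up-to-a-multiple rule is an assumption modeled on common open-source implementations rather than something fixed by the formula:

```python
import numpy as np

def silu(x):
    """Swish / SiLU: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """x: (T, d); w_gate, w_up: (d, d_ff); w_down: (d_ff, d)."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def swiglu_hidden_dim(d, multiple_of=256):
    """8d/3, rounded up to a multiple of `multiple_of` (assumed rounding rule)."""
    target = (8 * d) // 3
    return multiple_of * ((target + multiple_of - 1) // multiple_of)

d = 4096
d_ff = swiglu_hidden_dim(d)            # 11008
ratio = (3 * d * d_ff) / (8 * d * d)   # SwiGLU params vs. the 4d ReLU FFN
```

For $d = 4096$ the rounded width is 11008, which keeps the three-matrix block within about 1% of the $8d^2$ budget.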

Interviewer's Perspective

If asked "Why SwiGLU instead of ReLU?" do NOT just say "it works better." Explain: (1) the gating mechanism provides multiplicative interaction that is more expressive, (2) Swish is smooth so gradients flow better at zero, (3) the three-matrix formulation with $8d/3$ intermediate dimension matches the parameter budget of two-matrix $4d$ ReLU FFN. Then mention that this was validated empirically in PaLM and LLaMA.

Part 5 - RMSNorm and Pre-Norm Architecture

LayerNorm vs RMSNorm

LayerNorm (Ba et al., 2016) normalizes across the feature dimension:

\text{LayerNorm}(x) = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta

where $\mu$ and $\sigma^2$ are the mean and variance of $x$ across features.

RMSNorm (Zhang and Sennrich, 2019) drops the mean centering:

\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \cdot \gamma

Why RMSNorm for LLMs:

Factor	LayerNorm	RMSNorm
Operations	Mean, variance, normalize, scale, shift	RMS, normalize, scale
Parameters	$\gamma$ (scale) + $\beta$ (shift)	$\gamma$ (scale) only
Compute	Slightly more (mean subtraction)	Less; the RMSNorm paper reports 7-64% lower running time depending on the model
Quality	Baseline	Equivalent or better

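The hypothesis behind RMSNorm is that LayerNorm's benefit comes from re-scaling invariance rather than re-centering, so the mean subtraction (and the shift $\beta$ ) can be dropped with little or no loss in quality. Dropping them simplifies the operation and is slightly faster, which matters when you have 80+ layers and billions of tokens.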
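The two formulas differ only in the centering step and the shift, which a short NumPy sketch makes visible (function names are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    """Normalize over the last axis using mean and variance, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma, eps=1e-6):
    """Normalize over the last axis using the root mean square only, then scale."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 64))
gamma = np.ones(64)

# On zero-mean inputs (with beta = 0) the two coincide.
x_centered = x - x.mean(axis=-1, keepdims=True)
same = np.allclose(layer_norm(x_centered, gamma, 0.0), rms_norm(x_centered, gamma))
```

When the input already has zero mean, the variance equals the mean square, so `same` is `True`; the two operations only diverge when activations carry a non-zero mean.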

Pre-Norm vs Post-Norm

Post-norm (original Transformer):

x → Attention → Add(x, ·) → LayerNorm → FFN → Add(·, ·) → LayerNorm

Pre-norm (modern LLMs):

x → RMSNorm → Attention → Add(x, ·) → RMSNorm → FFN → Add(·, ·)

[Figure: Pre-norm vs post-norm]

Why pre-norm is critical for deep models:

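In pre-norm, the residual stream flows unimpeded from input to output. Each sublayer reads a normalized copy of the stream and adds its output back, so the skip path itself never passes through a normalization layer (the stream's magnitude can grow with depth, which is why a final RMSNorm sits before the output head). This means: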

Stable gradients: The gradient flows directly through the residual connections without passing through normalization layers
Less warmup sensitivity: Post-norm often diverges without careful learning rate warmup; pre-norm is far more tolerant, although most LLM training recipes still use a short warmup
Better scaling: Models with 100+ layers train reliably with pre-norm
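The difference between the two orderings is only where the norm sits relative to the residual add. The sketch below uses placeholder sublayers (`attn`, `ffn` are arbitrary callables, an assumption for illustration) to show that pre-norm keeps an identity path through the block:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMSNorm with the scale fixed to 1, for illustration only."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def post_norm_block(x, attn, ffn):
    # The norm is applied to the sum, so the skip path is renormalized every time.
    x = rms_norm(x + attn(x))
    return rms_norm(x + ffn(x))

def pre_norm_block(x, attn, ffn):
    # The norm is applied only to the sublayer input; the skip path is untouched.
    x = x + attn(rms_norm(x))
    return x + ffn(rms_norm(x))

def zero_sublayer(h):
    """Stand-in sublayer that contributes nothing."""
    return np.zeros_like(h)

x = np.full((2, 4), 3.0)
pre_out = pre_norm_block(x, zero_sublayer, zero_sublayer)
post_out = post_norm_block(x, zero_sublayer, zero_sublayer)
```

With sublayers that output zero, the pre-norm block returns `x` unchanged, while the post-norm block still rescales it. That untouched skip path is what lets gradients reach early layers directly.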

Instant Rejection

If you draw the Transformer with post-norm and claim that is what GPT or LLaMA uses, the interviewer will question your practical knowledge. GPT-2 moved the GPT family to pre-norm, and GPT-3 and LLaMA follow it. This is a basic factual check.

Part 6 - Parameter Counting

The Complete Parameter Count

For a decoder-only Transformer with:

$L$ layers, hidden dimension $d$ , vocabulary size $V$
$H$ query heads, $H_{\text{kv}}$ KV heads, head dimension $d_h = d/H$
SwiGLU FFN with intermediate dimension $d_{\text{ff}}$

Per-layer parameters:

Component	Parameters	Formula
Q projection	$d \times d$	$d \times (H \times d_h)$
K projection	$d \times H_{\text{kv}} \times d_h$	Reduced for GQA
V projection	$d \times H_{\text{kv}} \times d_h$	Same as K
Output projection	$d \times d$
RMSNorm (attention)	$d$	Scale parameter only
Gate projection (SwiGLU)	$d \times d_{\text{ff}}$
Up projection (SwiGLU)	$d \times d_{\text{ff}}$
Down projection (SwiGLU)	$d_{\text{ff}} \times d$
RMSNorm (FFN)	$d$

Total attention params per layer:

P_{\text{attn}} = d^2 + 2 \times d \times H_{\text{kv}} \times d_h + d^2 = 2d^2 + 2 d \cdot H_{\text{kv}} \cdot d_h

For MHA ( $H_{\text{kv}} = H$ ): $P_{\text{attn}} = 4d^2$

For GQA ( $H_{\text{kv}} = H/G$ ): $P_{\text{attn}} = 2d^2 + 2d^2/G$

Total FFN params per layer (SwiGLU):

P_{\text{ffn}} = 3 \times d \times d_{\text{ff}}

With $d_{\text{ff}} = \frac{8d}{3}$ : $P_{\text{ffn}} = 8d^2$

Total model parameters:

P_{\text{total}} = V \times d + L \times (P_{\text{attn}} + P_{\text{ffn}} + 2d) + d + V \times d

where the first $V \times d$ is the token embedding, the last $V \times d$ is the output projection (often tied to the embedding), and the $+d$ at the end is the final RMSNorm.

Worked Example: LLaMA 3 8B

Hyperparameter	Value
$L$ (layers)	32
$d$ (hidden dim)	4096
$H$ (query heads)	32
$H_{\text{kv}}$ (KV heads)	8
$d_h$ (head dim)	128
$d_{\text{ff}}$ (FFN intermediate)	14336
$V$ (vocab size)	128256

Attention per layer:

Term by term: $Q = d \times d = 4096 \times 4096 = 16{,}777{,}216$ , $K = d \times H_{\text{kv}} \times d_h = 4096 \times 8 \times 128 = 4{,}194{,}304$ , $V =$ same as $K = 4{,}194{,}304$ , $O = d \times d = 16{,}777{,}216$ .

P_{\text{attn}} = 16{,}777{,}216 + 4{,}194{,}304 + 4{,}194{,}304 + 16{,}777{,}216 = 41{,}943{,}040 \approx 42\text{M}

FFN per layer:

P_{\text{ffn}} = 3 \times 4096 \times 14336 = 176{,}160{,}768 \approx 176\text{M}

Per-layer total: $42\text{M} + 176\text{M} + 2 \times 4096 \approx 218\text{M}$

All layers: $32 \times 218\text{M} \approx 6{,}976\text{M} \approx 7.0\text{B}$

Embedding: $128{,}256 \times 4{,}096 \approx 525\text{M}$

Output head: another $\approx 525\text{M}$ , because LLaMA 3 8B does not tie the output projection to the embedding

Total: $7.0\text{B} + 2 \times 0.53\text{B} \approx 8.0\text{B}$ , which matches the reported ~8.0B. With tied embeddings the total would be ~7.5B; forgetting the untied output head is the usual source of a ~0.5B discrepancy in back-of-envelope counts.
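The counting formula from this part can be scripted in a few lines. This is a sketch with an illustrative function name; it assumes no biases and, for the LLaMA 3 8B call, an untied output head:

```python
def count_params(L, d, H, H_kv, d_ff, V, tie_embeddings=False):
    """Parameter count for a decoder-only model with GQA, SwiGLU, and RMSNorm."""
    d_h = d // H
    attn = 2 * d * d + 2 * d * H_kv * d_h   # Q and O, plus K and V
    ffn = 3 * d * d_ff                      # gate, up, down
    per_layer = attn + ffn + 2 * d          # plus two RMSNorm scale vectors
    embeddings = V * d if tie_embeddings else 2 * V * d
    return L * per_layer + d + embeddings   # + d for the final RMSNorm

llama3_8b = count_params(L=32, d=4096, H=32, H_kv=8, d_ff=14336, V=128256)
llama3_8b_tied = count_params(L=32, d=4096, H=32, H_kv=8, d_ff=14336,
                              V=128256, tie_embeddings=True)

print(llama3_8b)       # 8030261248
print(llama3_8b_tied)  # 7504924672
```

The untied count lands on roughly 8.03B, while tying the output head to the embedding would remove one $V \times d$ block and give roughly 7.5B.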

Interviewer's Perspective

Interviewers do not expect exact numbers. They want to see that you (1) know the formula, (2) can set up the calculation correctly, and (3) get within 10-20% of the right answer. Being methodical matters more than being exact.

Part 7 - FLOP Estimation

The 6ND Rule

For a Transformer with $N$ parameters processing a sequence of $T$ tokens:

\text{FLOPs (forward pass)} \approx 2NT

\text{FLOPs (forward + backward)} \approx 6NT

The factor of 2 for forward comes from the fact that each parameter participates in one multiply-add operation (2 FLOPs per parameter per token). The backward pass costs roughly 2x the forward pass (one for activation gradients, one for weight gradients).

More Precise Breakdown

For a single layer processing batch $B$ with sequence length $T$ :

Attention:

QKV projections: $6BTd^2$ FLOPs (for MHA; less for GQA)
Attention scores ( $QK^T$ ): $2BH \cdot T^2 \cdot d_h$ FLOPs
Attention weighted values ( $\text{scores} \cdot V$ ): $2BH \cdot T^2 \cdot d_h$ FLOPs
Output projection: $2BTd^2$ FLOPs

FFN (SwiGLU):

Gate, up, down projections: $3 \times 2BTd \cdot d_{\text{ff}}$ FLOPs

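Key insight: The attention score computation scales as $O(T^2)$ while everything else scales as $O(T)$ . For long contexts, attention becomes the bottleneck. Rough fractions for a 70B-class model ( $d = 8192$ ), not counting any savings from causal masking:

Sequence Length	Attention Score FLOPs Fraction	Projection + FFN FLOPs Fraction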
2K	~5%	~95%
32K	~40%	~60%
128K	~75%	~25%

This is why Flash Attention and KV cache optimization matter so much for long-context models.
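The per-layer breakdown above can be turned into a quick estimator. The sketch below uses an illustrative function name and, as an assumption, LLaMA 3 70B-like dimensions ( $d = 8192$ , 8 KV heads, $d_h = 128$ , $d_{\text{ff}} = 28672$ ); it counts the full $T \times T$ score matrix with no causal discount:

```python
def layer_flops(T, d, H_kv, d_h, d_ff, B=1):
    """Forward FLOPs for one layer as (linear_in_T, quadratic_in_T)."""
    proj = 2 * B * T * (2 * d * d + 2 * d * H_kv * d_h)  # Q, K, V, O projections
    ffn = 3 * 2 * B * T * d * d_ff                       # gate, up, down
    # QK^T and scores @ V cost 2*B*H*T^2*d_h each, and H * d_h = d.
    attn_scores = 2 * 2 * B * T * T * d
    return proj + ffn, attn_scores

def attention_fraction(T, d=8192, H_kv=8, d_h=128, d_ff=28672):
    """Share of forward FLOPs spent on the quadratic attention terms."""
    linear, quadratic = layer_flops(T, d, H_kv, d_h, d_ff)
    return quadratic / (linear + quadratic)

fractions = {T: attention_fraction(T) for T in (2048, 32768, 131072)}
```

Under these assumptions the quadratic share grows from a few percent at 2K tokens to well over half at 128K. Smaller models reach the crossover at shorter contexts because the ratio scales roughly as $T / (6d)$ .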

Training Compute Budget

Training FLOPs for a model with $N$ parameters on $D$ tokens:

C \approx 6ND

Chinchilla scaling law (Hoffmann et al., 2022) suggests $N \approx D/20$ for compute-optimal training. So for a 70B model, you want about $70\text{B} \times 20 = 1.4\text{T}$ tokens.

GPU hours estimate:

\text{GPU hours} = \frac{6ND}{\text{GPU FLOPS} \times \text{MFU} \times 3600}

For an H100 at ~1000 TFLOPS (bf16) with 40% MFU (model FLOP utilization):

\text{GPU hours for LLaMA 3 70B} = \frac{6 \times 70 \times 10^9 \times 15 \times 10^{12}}{1000 \times 10^{12} \times 0.4 \times 3600} \approx 4{,}375{,}000 \text{ GPU hours}

That is about 500 H100s running for 1 year, or 6,000 H100s for one month.
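The same estimate as a reusable helper, as a sketch with an illustrative name; the peak throughput and MFU values are the assumptions stated above, not measurements:

```python
def training_gpu_hours(n_params, n_tokens, peak_flops, mfu):
    """GPU hours for training at C = 6ND, given peak FLOP/s per GPU and utilization."""
    return 6 * n_params * n_tokens / (peak_flops * mfu * 3600)

hours = training_gpu_hours(n_params=70e9, n_tokens=15e12,
                           peak_flops=1000e12, mfu=0.4)
gpus_for_one_year = hours / (365 * 24)
gpus_for_one_month = hours / (365 * 24 / 12)
```

This gives about 4.4 million GPU hours, or roughly 500 GPUs for a year and 6,000 for a month. Halving the MFU doubles every one of these figures, so the utilization assumption is worth stating aloud in an interview.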

Part 8 - Model Architecture Comparison

The Big Picture (2024-2026 Models)

Feature	LLaMA 3 70B	Mistral 7B	Gemini 1.5 Pro	GPT-4 (estimated)	Claude 3.5
Architecture	Decoder-only	Decoder-only	Decoder-only (sparse MoE)	Decoder-only (MoE, rumored)	Decoder-only
Positional	RoPE	RoPE	RoPE variant (unconfirmed)	Unknown	Unknown
Attention	GQA (8 KV)	GQA (8 KV)	Multi-Query or GQA (unconfirmed)	Unknown	Unknown
FFN	SwiGLU	SwiGLU	SwiGLU variant (unconfirmed)	Unknown	Unknown
Norm	RMSNorm, pre-norm	RMSNorm, pre-norm	RMSNorm, pre-norm (unconfirmed)	Pre-norm	Pre-norm
Context	128K	32K (sliding window)	1M+	128K	200K
Params	70B	7.3B	Unknown (large)	~1.8T MoE (rumored)	Unknown
Special	-	Sliding window attn	Ring attention, long context	MoE routing	Constitutional AI

Company Variation

Anthropic and OpenAI do not publish full architecture details. In interviews at these companies, acknowledge what is public and what is speculation. Saying "GPT-4 is rumored to be a mixture of experts with about 1.8T total parameters and ~200B active" shows awareness. Claiming certainty about unpublished details shows poor judgment.

Mistral's Sliding Window Attention

Mistral introduced sliding window attention (SWA) where each token attends only to the previous $W$ tokens (e.g., $W = 4096$ ):

\text{Attention}_{ij} = \begin{cases} \frac{q_i \cdot k_j}{\sqrt{d_k}} & \text{if } 0 \leq i - j \leq W \\ -\infty & \text{otherwise} \end{cases}

Benefits: KV cache capped at $W$ tokens regardless of sequence length. Memory is $O(W)$ instead of $O(T)$ .

Effective receptive field: After $L$ layers with window size $W$ , a token can theoretically draw on information from up to $L \times W$ tokens back, because each stacked attention layer lets information hop one more window. With 32 layers and $W = 4096$ : effective receptive field = $131{,}072$ tokens.
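A hedged NumPy sketch of the additive mask for the rule above, following the $0 \leq i - j \leq W$ convention of the formula in this section (the function name is illustrative):

```python
import numpy as np

def sliding_window_mask(T, W):
    """Additive mask: 0 where 0 <= i - j <= W, -inf everywhere else."""
    i = np.arange(T)[:, None]   # query positions
    j = np.arange(T)[None, :]   # key positions
    allowed = (i - j >= 0) & (i - j <= W)
    return np.where(allowed, 0.0, -np.inf)

mask = sliding_window_mask(T=6, W=2)
keys_visible_to_last_token = np.flatnonzero(mask[5] == 0)
```

For `T=6, W=2` the last token sees keys 3, 4, and 5 only. Older keys can be evicted from the cache, which is how the memory stays bounded regardless of sequence length.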

Mixture of Experts (MoE)

Mixtral (and, reportedly, GPT-4) use MoE layers where only a subset of "expert" FFN blocks are activated per token:

[Figure: MoE routing architecture]

Key MoE numbers for Mixtral 8x7B:

8 experts per layer, top-2 routing
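Each expert is one SwiGLU FFN block (~176M parameters per layer); attention, embeddings, and norms are shared across experts
Total parameters: ~47B rather than $8 \times 7\text{B} = 56\text{B}$ , because only the FFNs are replicated (only ~13B active per token)
Inference compute: similar to a 13B dense model, although all ~47B parameters must still be held in memory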
Quality: comparable to LLaMA 2 70B

MoE tradeoffs:

More total parameters (more memory for weights) but fewer active FLOPs
Load balancing challenges - some experts get used much more than others
Harder to fine-tune - which experts should adapt?
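The routing step and the total-versus-active arithmetic can be sketched as follows. Function names are illustrative, the router applies softmax over the selected experts' logits (the formulation described for Mixtral), and the default dimensions are assumptions taken from Mixtral 8x7B's public configuration:

```python
import numpy as np

def top_k_route(router_logits, k=2):
    """router_logits: (T, E). Returns expert indices and mixing weights, each (T, k)."""
    idx = np.argsort(router_logits, axis=-1)[:, -k:]        # top-k expert ids per token
    top = np.take_along_axis(router_logits, idx, axis=-1)
    top = top - top.max(axis=-1, keepdims=True)
    w = np.exp(top)
    return idx, w / w.sum(axis=-1, keepdims=True)           # softmax over selected experts

def moe_param_counts(L=32, d=4096, H_kv=8, d_h=128, d_ff=14336,
                     V=32000, E=8, k=2):
    """Return (total, active-per-token) parameters for a SwiGLU MoE decoder."""
    attn = 2 * d * d + 2 * d * H_kv * d_h
    expert = 3 * d * d_ff                                    # one SwiGLU FFN
    router = d * E
    shared = L * (attn + router + 2 * d) + 2 * V * d + d     # everything except experts
    return shared + L * E * expert, shared + L * k * expert

idx, weights = top_k_route(np.array([[0.1, 2.0, -1.0, 1.5]]))
total, active = moe_param_counts()
```

With these assumed dimensions the totals come out near 46.7B and 12.9B: the experts dominate the parameter count, yet only two of the eight run for any given token.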

Practice Problems

Problem 1: KV Cache Memory

You are deploying a model with these specs: 40 layers, hidden dimension 5120, 40 query heads, 8 KV heads, head dimension 128. You need to serve batch size 32 with context length 8192 in fp16. How much KV cache memory is needed?

Hint 1 - Direction

Use the KV cache formula: $2 \times n_{\text{layers}} \times n_{\text{kv\_heads}} \times d_{\text{head}} \times \text{seq\_len} \times \text{bytes} \times \text{batch}$

Hint 2 - Insight

Plug in: $2 \times 40 \times 8 \times 128 \times 8192 \times 2 \times 32$ . Be careful with units - work in bytes and convert to GB at the end.

Hint 3 - Full Solution + Rubric

\text{KV Cache} = 2 \times 40 \times 8 \times 128 \times 8192 \times 2 \times 32

Step by step:

Per token per layer: $2 \times 8 \times 128 \times 2 = 4{,}096$ bytes
Per token all layers: $4{,}096 \times 40 = 163{,}840$ bytes $\approx 160$ KB
Per sequence: $163{,}840 \times 8{,}192 = 1{,}342{,}177{,}280$ bytes $\approx 1.25$ GB
Full batch: $1.25 \times 32 = 40$ GB

Answer: ~40 GB of KV cache memory.

Scoring Rubric:

Criterion	Strong Hire	Lean Hire	No Hire
Correct formula	Wrote formula from memory	Needed to derive step by step	Could not set up the calculation
Correct answer	Within 5% of 40 GB	Within 20%	Off by more than 2x
Follow-up insight	"This means we need at least an 80 GB GPU or tensor parallelism across 2 GPUs, plus model weights"	"That is a lot of memory"	No interpretation
Speed	Under 2 minutes	Under 5 minutes	Could not finish

Problem 2: Architecture Design

You are tasked with designing a 13B parameter model optimized for 32K context length inference on 8 H100 GPUs. Specify: layers, hidden dimension, query heads, KV heads, FFN dimension, and positional encoding. Justify each choice.

Hint 1 - Direction

Start by working backward from 13B parameters using the parameter counting formulas. Consider how to minimize KV cache memory for 32K context while maintaining quality.

Hint 2 - Insight

A 13B model typically has $d \approx 5120$ and $L \approx 40$ . For 32K context, GQA with few KV heads is essential. Use RoPE for position extension capability. Use SwiGLU with $d_{\text{ff}} \approx \frac{8d}{3} \approx 13{,}696$ .

Hint 3 - Full Solution + Rubric

Proposed architecture:

Hyperparameter	Value	Justification
Layers ( $L$ )	40	Standard depth for 13B class
Hidden dim ( $d$ )	5120	$40 \times (2.4d^2 + 3d \cdot d_{\text{ff}}) \approx 10.9\text{B}$ in the blocks; embeddings add $2Vd$ (about 1.3B for an assumed 128K vocabulary with an untied head), for ~12.2B total
Query heads ( $H$ )	40	$d_h = 5120/40 = 128$ (standard)
KV heads ( $H_{\text{kv}}$ )	8	GQA with 5x reduction; KV cache for 32K in fp16: ~5 GB per sequence
FFN dim ( $d_{\text{ff}}$ )	13696	$\frac{8 \times 5120}{3} \approx 13653$ , rounded up to a multiple of 128
Positional encoding	RoPE	Relative, extensible to longer contexts
Normalization	Pre-norm RMSNorm	Standard for stability
Activation	SwiGLU	Standard for quality

Memory budget on 8 H100s (80 GB each = 640 GB total):

Model weights (bf16): ~26 GB (tensor parallel across 8 GPUs = ~3.3 GB/GPU)
KV cache for batch 16 at 32K: $\approx 5$ GB/sequence $\times 16 / 8$ GPUs $\approx 10$ GB/GPU
Activations and overhead: ~5 GB/GPU
Total per GPU: ~18 GB - fits comfortably in 80 GB

Scoring Rubric:

Criterion	Strong Hire	Lean Hire	No Hire
Parameter count works out	Within 10% of 13B	Within 30%	Did not verify
KV cache considered	Calculated memory, chose GQA	Mentioned GQA without numbers	Used MHA or ignored memory
Multi-GPU plan	Tensor parallelism layout	Mentioned parallelism vaguely	No consideration of 8 GPUs
Modern components	RoPE, GQA, SwiGLU, RMSNorm	Most modern choices	Used original 2017 architecture

Problem 3: RoPE Extension

Your 13B model was trained with RoPE on 8K context length. Your users need 32K context. What approaches can you use to extend the context length without retraining from scratch?

Hint 1 - Direction

Think about what RoPE does at positions beyond the training range. The rotation angles become much larger - the model has never seen those angles during training.

Hint 2 - Insight

The main approaches are: (1) Position Interpolation - scale down positions so 32K maps to the 0-8K range, (2) NTK-aware scaling - modify the base frequency to spread rotations more evenly, (3) YaRN - combine NTK scaling with attention scaling and fine-tune briefly.

Hint 3 - Full Solution + Rubric

Three main approaches:

1. Position Interpolation (PI) - Chen et al., 2023: Scale all positions by $\frac{L_{\text{train}}}{L_{\text{target}}} = \frac{8192}{32768} = 0.25$ . Position 32K becomes position 8K in RoPE-space, so every rotation angle stays inside the range seen during training. Requires short fine-tuning (~1000 steps) to adapt.

2. NTK-aware Scaling: Instead of scaling positions, raise the RoPE base (the 10000 in $\theta_i$ ): $b_{\text{new}} = b \times \alpha^{d/(d-2)}$ where $\alpha$ is the context scaling factor (4 here) and $d$ is the head dimension. The highest-frequency (local) dimensions are left almost unchanged while the lowest-frequency dimension is stretched by the full factor $\alpha$ , preserving relative position resolution at short distances while extending range. Code LLaMA uses a related idea: it raises the base to 1,000,000 and fine-tunes on long sequences.

3. YaRN - Yet another RoPE extension (Peng et al., 2023): Combines NTK-aware interpolation with an attention temperature scaling factor. Different frequency bands are scaled differently: low-frequency (long-range) dimensions are interpolated more aggressively, while high-frequency (local) dimensions are preserved. Fine-tune for ~400 steps.

Comparison:

Method	Fine-tuning Needed	Quality at 32K	Complexity
PI	~1000 steps	Good	Low
NTK-aware	None (or very little)	Decent	Medium
YaRN	~400 steps	Best	Medium
Retrain	Full pretraining	Ideal	Very high
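The first two methods reduce to one-line changes in how the rotation angles are computed. The sketch below uses illustrative helper names and assumes a head dimension of 128 with a 4x extension; YaRN's per-band blending and temperature term are not shown:

```python
def rope_thetas(d, base=10000.0):
    """Per-pair rotation frequencies theta_i = base^(-2i/d), i = 0 .. d/2 - 1."""
    return [base ** (-2 * i / d) for i in range(d // 2)]

def pi_position(pos, train_len, target_len):
    """Position Interpolation: squeeze positions back into the trained range."""
    return pos * train_len / target_len

def ntk_base(base, alpha, d):
    """NTK-aware scaling: enlarge the base instead of rescaling positions."""
    return base * alpha ** (d / (d - 2))

d_head, alpha = 128, 4
old = rope_thetas(d_head)
new = rope_thetas(d_head, base=ntk_base(10000.0, alpha, d_head))

print(pi_position(32768, train_len=8192, target_len=32768))  # 8192.0
```

With the enlarged base, the fastest-rotating pair is untouched (`new[0] == old[0] == 1.0`) while the slowest pair rotates exactly `alpha` times more slowly. This is the sense in which NTK-aware scaling keeps local resolution and stretches only the long-range dimensions.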

Scoring Rubric:

Criterion	Strong Hire	Lean Hire	No Hire
Knows multiple methods	Describes 2-3 with tradeoffs	Knows one method	"Just train on longer data"
Understands the math	Can explain why interpolation works	Vague "scale the positions"	No understanding of mechanism
Practical considerations	Fine-tuning cost, quality degradation	Mentions fine-tuning	Does not mention practical steps

Problem 4: Why Not Encoder-Decoder?

An interviewer asks: "Google's T5 and BART used encoder-decoder architecture and achieved great results. Why did the field converge on decoder-only? Are there cases where encoder-decoder is still better?"

Hint 1 - Direction

Think about three angles: scaling efficiency, task generality, and in-context learning ability.

Hint 2 - Insight

Decoder-only won because: (1) all parameters contribute to generation (no encoder "dead weight" for generation tasks), (2) in-context learning emerges naturally from causal modeling, (3) training is simpler (single objective). Encoder-decoder is still better for fixed seq2seq tasks like translation and summarization where the input and output are clearly separated.

Hint 3 - Full Solution + Rubric

Why decoder-only dominates:

Parameter efficiency for generation: In an encoder-decoder model, ~50% of parameters are in the encoder, which does not directly participate in token generation. A decoder-only model uses all parameters for the generation task.
Emergent in-context learning: Causal language models naturally develop the ability to learn from examples in their context (few-shot learning). Encoder-decoder models are less natural at this because the "input" and "output" are architecturally separated.
Training simplicity: One objective (next-token prediction), one loss, one architecture. Encoder-decoder requires designing input/output splits and potentially different objectives for each stack.
Scaling behavior: Empirically, decoder-only models have scaled compute-efficiently for the kinds of tasks we care about (general-purpose generation, reasoning, tool use), although controlled head-to-head comparisons with encoder-decoder models at the largest scales are limited.

Where encoder-decoder still wins:

Machine translation with fixed language pairs
Summarization where the compression ratio is high
Speech recognition (Whisper uses encoder-decoder)
Any task with a clear, fixed input-output structure where you do not need in-context flexibility

The nuance: Prefix LM (decoder-only with bidirectional attention on the prefix) can approximate encoder-decoder behavior. This is what the T5 paper's prefix-LM ablations and UL2 explored - decoder-only architecture with flexible attention masking.

Scoring Rubric:

Criterion	Strong Hire	Lean Hire	No Hire
Multiple reasons	3+ well-explained reasons	1-2 reasons	"Decoder-only is just better"
Nuance	Acknowledges where enc-dec wins	Mentions translation	Absolute claims
Research awareness	Mentions prefix LM, UL2	Mentions T5	Only knows GPT-style

Interview Cheat Sheet

Concept	Key Formula / Fact	Common Follow-Up
Attention	$\text{softmax}(QK^T/\sqrt{d_k} + \text{mask})V$	"What is the complexity? How does Flash Attention help?"
Causal Mask	Strict upper triangle (future positions) = $-\infty$, applied before softmax	"How does this enable parallel training?"
RoPE	Rotate $(q_{2i}, q_{2i+1})$ by $m\theta_i$; dot product depends on $m-n$	"How do you extend to longer contexts?"
GQA	$H$ query heads share $H_{\text{kv}}$ KV heads	"What is the memory savings vs MHA?"
KV Cache	$2 \times L \times H_{\text{kv}} \times d_h \times T \times \text{bytes}$	"How much for batch 32 at 128K?"
SwiGLU	$(\text{Swish}(xW_g) \odot xW_u)W_d$, $d_{\text{ff}} = 8d/3$	"Why three matrices? Parameter count?"
RMSNorm	$x / \text{RMS}(x) \times \gamma$, no mean subtraction	"Why not LayerNorm?"
Pre-norm	Norm before attention/FFN, residual flows unimpeded	"What goes wrong with post-norm at depth?"
Param count	Attn: $2d^2 + 2dH_{\text{kv}}d_h$; FFN: $3d \cdot d_{\text{ff}}$	"Count params for LLaMA 3 70B"
FLOPs	Forward: $\approx 2N$ per token ($\approx 2NT$ for $T$ tokens); Train: $\approx 6ND$ for $D$ training tokens	"How many GPU-hours for 70B on 15T tokens?"
MoE	Top-k routing, total params much larger than active	"Load balancing? Fine-tuning MoE?"
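The KV cache and GQA rows of the cheat sheet can be checked with a few lines of Python. This is a sketch under stated assumptions: the model shape is the one from the opening interview question (80 layers, hidden dimension 8192, 64 query heads), the choice of 8 KV heads is an assumption since the question does not fix it, "128K" is taken to mean 131,072 tokens, and `batch_size` is added as a plain multiplier on top of the per-sequence formula.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """KV cache size: 2 (keys and values) x L x H_kv x d_h x T x bytes, per sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch_size

GIB = 1024 ** 3
head_dim = 8192 // 64      # 128
seq_len = 128 * 1024       # "128K" read as 131,072 tokens (assumption)

gqa = kv_cache_bytes(80, 8, head_dim, seq_len)    # assumed 8 KV heads
mha = kv_cache_bytes(80, 64, head_dim, seq_len)   # one KV head per query head

print(f"GQA, 1 sequence: {gqa / GIB:.1f} GiB")    # 40.0 GiB
print(f"MHA, 1 sequence: {mha / GIB:.1f} GiB")    # 320.0 GiB
print(f"Savings factor:  {mha / gqa:.1f}x")       # 8.0x
print(f"GQA, batch 32:   {32 * gqa / GIB:.1f} GiB")  # 1280.0 GiB
```

The savings factor is simply $H / H_{\text{kv}}$, which is the number to quote when asked about GQA versus MHA.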

Spaced Repetition Checkpoints

Day 0 (After reading this chapter)

Draw a modern LLM block from memory (RMSNorm, GQA, SwiGLU, residuals)
Write the causal mask formula and explain why it uses $-\infty$
Calculate KV cache for LLaMA 3 8B at 8K context, fp16
Explain RoPE in one paragraph without looking at notes
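For the causal mask checkpoint, a small NumPy sketch shows why $-\infty$ is the right fill value: after softmax, $e^{-\infty} = 0$, so future positions get exactly zero weight and each row still sums to 1. The function name `causal_attention_weights` is illustrative, and the sketch takes already-scaled scores for a single head.

```python
import numpy as np

def causal_attention_weights(scores: np.ndarray) -> np.ndarray:
    """Softmax over keys with future positions masked to -inf.

    scores: (T, T) array of scaled dot products, rows = queries, columns = keys.
    """
    T = scores.shape[-1]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # strictly above the diagonal
    masked = np.where(future, -np.inf, scores)
    # Subtract the row max for numerical stability; the diagonal is never masked,
    # so every row has at least one finite entry.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)                              # exp(-inf) == 0.0
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
w = causal_attention_weights(rng.standard_normal((4, 4)))
print(np.round(w, 3))
```

Using a large negative finite number instead would also work in practice, but $-\infty$ makes the zero exact and independent of the score scale.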

Day 3

From memory, list all differences between original Transformer and modern LLM block
Derive the parameter count for a 7B model ($d=4096$, $L=32$, GQA with 8 KV heads)
Explain GQA to someone in 60 seconds - include the memory savings number
Write the SwiGLU formula and explain the three weight matrices
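One way to check the Day 3 parameter count is to code up the cheat sheet formulas. The exercise fixes only $d$, $L$, and the number of KV heads, so the remaining values below are assumptions chosen to resemble a Mistral-7B-style configuration: 32 query heads, $d_{\text{ff}} = 14336$, a 32,000-token vocabulary, untied input and output embeddings, no biases, and two RMSNorm weight vectors per layer. The function name `decoder_param_count` is illustrative.

```python
def decoder_param_count(d_model, n_layers, n_heads, n_kv_heads, d_ff,
                        vocab_size, tie_embeddings=False):
    """Rough parameter count for a GQA + SwiGLU + RMSNorm decoder (no biases)."""
    head_dim = d_model // n_heads
    # W_q and W_o are d x d; W_k and W_v are d x (H_kv * d_h)
    attn = 2 * d_model * d_model + 2 * d_model * n_kv_heads * head_dim
    ffn = 3 * d_model * d_ff            # gate, up, and down projections
    norms = 2 * d_model                 # one RMSNorm before attention, one before FFN
    per_layer = attn + ffn + norms
    embeddings = vocab_size * d_model * (1 if tie_embeddings else 2)
    final_norm = d_model
    return n_layers * per_layer + embeddings + final_norm

# Assumed values: n_heads=32, d_ff=14336, vocab_size=32000 (not given in the exercise).
total = decoder_param_count(4096, 32, 32, 8, 14336, 32000)
print(f"{total:,}")   # 7,241,732,096
```

Swapping in a 128,256-token vocabulary with the same shape gives about 8.03B, which shows how much of the gap between a "7B" and an "8B" model of this shape can come from the embedding tables alone.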

Day 7

Compare RoPE, ALiBi, and sinusoidal encodings - pros/cons of each
Calculate training FLOPs for a 13B model on 2T tokens
Explain why pre-norm is necessary for deep models - what fails with post-norm?
Quiz: given a model config, can you estimate total parameters within 20%?
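As a check for the Day 7 FLOPs exercise, the $6ND$ rule gives the following for $N = 13 \times 10^{9}$ parameters and $D = 2 \times 10^{12}$ tokens:

$$C \approx 6ND = 6 \times (13 \times 10^{9}) \times (2 \times 10^{12}) = 1.56 \times 10^{23}\ \text{FLOPs}$$

Converting to GPU-hours needs a sustained per-GPU throughput $R$, which depends on hardware and utilization. Taking $R = 4 \times 10^{14}$ FLOP/s purely as an illustrative assumption (not a measured figure):

$$\text{GPU-hours} \approx \frac{C}{3600 \cdot R} = \frac{1.56 \times 10^{23}}{3600 \times 4 \times 10^{14}} \approx 1.08 \times 10^{5}$$

In an interview, state the assumed throughput out loud; the arithmetic matters less than showing where the number comes from.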

Day 14

Do Practice Problem 2 (architecture design) from scratch, timed (15 min)
Explain MoE architecture - routing, expert selection, load balancing
From memory, fill in the model comparison table for LLaMA 3, Mistral, and GPT-4
Practice the 60-second answers for all cheat sheet entries

Day 21

Full mock: "Design a 30B model for 64K context. Specify everything and justify."
Rapid fire: answer 10 random cheat sheet questions in under 60 seconds each
Explain context length extension methods (PI, NTK, YaRN) with tradeoffs
Re-take the self-assessment - all scores should be 4+
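For the context extension checkpoint, the sketch below illustrates RoPE and the simplest extension method, position interpolation (PI), in NumPy. It is a minimal sketch, not a reference implementation: the base of 10000 is the commonly used default and is assumed here, the helper names `rope_angles`, `apply_rope`, and `score` are illustrative, and NTK-aware scaling and YaRN (which modify the frequencies rather than just the positions) are not shown.

```python
import numpy as np

def rope_angles(positions, head_dim, base=10000.0, scale=1.0):
    """Angles m * theta_i with theta_i = base^(-2i/d); `scale` < 1 applies PI."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)      # (d/2,)
    return np.outer(np.asarray(positions, dtype=float) * scale, inv_freq)

def apply_rope(x, angles):
    """Rotate each pair (x[2i], x[2i+1]) of every row of x by the matching angle."""
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 64))
k = rng.standard_normal((1, 64))

def score(m, n, scale=1.0):
    """Dot product of q placed at position m and k placed at position n."""
    qm = apply_rope(q, rope_angles([m], 64, scale=scale))
    kn = apply_rope(k, rope_angles([n], 64, scale=scale))
    return float(np.sum(qm * kn))

# The score depends only on the offset m - n, not on absolute position.
print(np.isclose(score(5, 2), score(105, 102)))

# PI with scale = L_train / L_target = 0.25: an offset of 4 tokens at position 8000
# produces the same score as an offset of 1 token inside the original range.
print(np.isclose(score(8000, 7996, scale=0.25), score(2000, 1999)))
```

The second check is the tradeoff in miniature: PI keeps every position inside the range seen during training, at the cost of squeezing neighboring tokens closer together in angle, which is why some fine-tuning is usually reported as necessary.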
