GPT Series - The Arc from 117M to a Trillion Parameters
Reading time: ~50 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Engineer, LLM Engineer, NLP Engineer
The Real Interview Moment
You are in an OpenAI research engineer interview. The interviewer leans back and says: "Walk me through the GPT lineage - GPT-1 through GPT-4. For each model, tell me the single most important idea it introduced, and why that idea changed the field. I do not want a parameter count recitation. I want to know what each paper proved."
You start with GPT-1, and she immediately probes: "GPT-1 used the same architecture as the Transformer decoder. Why was unsupervised pre-training with a language model objective the key insight, and why had nobody done it effectively before?" You explain the pre-train then fine-tune paradigm, and she follows up: "GPT-2 dropped fine-tuning entirely. Why? And how does in-context learning in GPT-3 actually work - is it gradient-free learning?"
This is the most important lineage in modern AI. Every model in the GPT series introduced an idea that reshaped how the field thinks about language, scale, and intelligence. Candidates who can only recite parameter counts get a "no-hire." Candidates who can articulate the conceptual leap at each generation - and explain why scale enabled those leaps - get a "strong hire."
What You Will Master
- Explain GPT-1's contribution: unsupervised pre-training + supervised fine-tuning
- Describe GPT-2's paradigm shift: task-agnostic multitask learning via zero-shot
- Derive GPT-3's in-context learning mechanism and explain few-shot prompting
- Explain InstructGPT's RLHF pipeline: SFT, reward modeling, PPO
- Discuss GPT-4's multimodal capabilities and rumored MoE architecture
- Trace the conceptual evolution from fine-tuning to prompting to alignment
- Compare the GPT series to BERT and modern alternatives
Self-Assessment: Where Are You Now?
| Skill | 1 - Cannot | 2 - Vaguely | 3 - Can Explain | 4 - Can Derive | 5 - Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Explain GPT-1's pre-training objective | | | | | | ___ |
| Explain why GPT-2 dropped fine-tuning | | | | | | ___ |
| Describe in-context learning (GPT-3) | | | | | | ___ |
| Explain few-shot vs zero-shot vs one-shot | | | | | | ___ |
| Describe the RLHF pipeline (InstructGPT) | | | | | | ___ |
| Explain reward modeling and PPO | | | | | | ___ |
| Discuss GPT-4's multimodal capabilities | | | | | | ___ |
| Explain the MoE architecture hypothesis | | | | | | ___ |
| Trace the paradigm shifts across generations | | | | | | ___ |
| Compare GPT vs BERT design philosophies | | | | | | ___ |
Target: All 4s and 5s before your interview.
Part 1 - GPT-1: Unsupervised Pre-Training (2018)
The Paper
"Improving Language Understanding by Generative Pre-Training" - Radford et al., 2018
The Core Idea
Before GPT-1, most NLP models were trained largely from scratch on each task. Word2Vec and GloVe provided pre-trained word embeddings, but the rest of the model still had to be trained on labeled data. GPT-1 proved that a two-stage approach works dramatically better:
- Pre-train a Transformer decoder on a large unlabeled text corpus using a language modeling objective
- Fine-tune the same model on each downstream task with minimal architectural changes
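The language modeling objective is simple next-token prediction. Given a corpus of tokens $\mathcal{U} = \{u_1, \dots, u_n\}$ and a context window of size $k$, the model with parameters $\Theta$ maximizes:

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$$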
The model learns to predict the next word given all previous words. This is unsupervised - no labeled data is needed. The key insight is that next-token prediction on diverse text forces the model to learn syntax, semantics, world knowledge, and reasoning as a byproduct.
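To make the objective concrete, here is a minimal sketch that scores a toy sequence. The `bigram_probs` table and its probabilities are invented for illustration. A real GPT conditions on the whole prefix through a Transformer, while this sketch conditions only on the previous token to stay small.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token only.
# A real GPT conditions on the full prefix; a bigram table keeps the sketch tiny.
bigram_probs = {
    "<s>": {"the": 0.5, "a": 0.5},
    "the": {"cat": 0.25, "dog": 0.75},
    "cat": {"sat": 1.0},
}

def sequence_nll(tokens):
    """Sum of -log P(token | previous token) over the sequence."""
    nll = 0.0
    prev = "<s>"
    for tok in tokens:
        nll += -math.log(bigram_probs[prev][tok])
        prev = tok
    return nll

nll = sequence_nll(["the", "cat", "sat"])
print(f"negative log-likelihood: {nll:.4f}")
```

Training minimizes this quantity averaged over a large corpus, which is the same as maximizing the log-likelihood of the observed next tokens.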
Architecture Details
| Parameter | Value |
|---|---|
| Architecture | Transformer decoder (12 layers) |
| Parameters | 117M |
| Hidden size | 768 |
| Attention heads | 12 |
| Context window | 512 tokens |
| Training data | BooksCorpus (7,000 unpublished books, ~800M words) |
| Tokenization | BPE (40,000 merges) |
| Optimizer | Adam with warmup and cosine decay |
How Fine-Tuning Worked
GPT-1's fine-tuning was clever. Instead of adding complex task-specific architectures, every task was reformulated as a sequence:
- Classification: `[START] text [EXTRACT]` → linear layer on the `[EXTRACT]` token
- Entailment: `[START] premise [DELIM] hypothesis [EXTRACT]`
- Similarity: both sentence orderings are processed separately and their final representations are added element-wise
- Multiple choice: each option is paired with the context and scored independently
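The input transformations listed above can be sketched as plain string formatting. The literal token strings and helper names below are hypothetical. The paper uses learned start, delimiter, and extract embeddings, not these strings.

```python
# Hypothetical special-token strings, used here only for illustration.
START, DELIM, EXTRACT = "[START]", "[DELIM]", "[EXTRACT]"

def format_classification(text):
    return [f"{START} {text} {EXTRACT}"]

def format_entailment(premise, hypothesis):
    return [f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"]

def format_similarity(a, b):
    # Two sequences, one per ordering; their final representations are summed downstream.
    return [
        f"{START} {a} {DELIM} {b} {EXTRACT}",
        f"{START} {b} {DELIM} {a} {EXTRACT}",
    ]

def format_multiple_choice(context, options):
    # One sequence per candidate answer; each is scored independently.
    return [f"{START} {context} {DELIM} {opt} {EXTRACT}" for opt in options]

sim = format_similarity("A man sleeps.", "A person is asleep.")
mc = format_multiple_choice("The sky is", ["blue", "loud"])
print(sim)
print(mc)
```

Every task ends up as one or more token sequences fed to the same pre-trained model, so only a small linear head has to be learned per task.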
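The fine-tuning loss combined the task loss with the language modeling loss:

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$

Here $L_2$ is the supervised task loss on the labeled dataset $\mathcal{C}$ and $L_1$ is the language modeling loss on the same data. The auxiliary LM loss (weighted by $\lambda$, set to 0.5 in the paper) acted as a regularizer that helps the model retain its pre-trained representations; the paper reports that it improved generalization and sped up convergence.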
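As a small worked example of the combined objective, the sketch below adds a weighted language modeling loss to a task loss. The loss values are made up. The default weight of 0.5 is the value the paper reports, and the function name is hypothetical.

```python
def fine_tuning_loss(task_loss, lm_loss, lam=0.5):
    """Supervised task loss plus a weighted auxiliary language modeling loss."""
    return task_loss + lam * lm_loss

# Made-up loss values for a single batch.
total = fine_tuning_loss(task_loss=0.40, lm_loss=3.0)
print(f"combined loss: {total:.2f}")
```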
Results and Impact
GPT-1 achieved state-of-the-art on 9 of 12 benchmarks, including:
- Commonsense reasoning (Stories Cloze): 86.5% (+8.9%)
- Question answering (RACE): 59.0% (+5.7%)

It did not win everywhere: on the small textual entailment dataset RTE it scored 56.0%, below the prior best.
"GPT-1 proved that unsupervised pre-training with a language model objective, followed by supervised fine-tuning, dramatically outperforms training from scratch. It used a 12-layer Transformer decoder trained on BooksCorpus to predict the next token. The pre-trained representations captured enough linguistic knowledge that fine-tuning with minimal task-specific modifications achieved state-of-the-art on most benchmarks. The key insight was that next-token prediction is a sufficiently rich objective to learn general-purpose language representations."
Part 2 - GPT-2: Zero-Shot Transfer (2019)
The Paper
"Language Models are Unsupervised Multitask Learners" - Radford et al., 2019
The Paradigm Shift
GPT-2's core claim was radical: a language model trained on enough diverse text can perform tasks without any fine-tuning at all. The argument runs as follows:
- Every NLP task can be framed as predicting text given some context
- A sufficiently large language model trained on diverse data has implicitly seen examples of every task
- Therefore, the model can perform tasks zero-shot by conditioning on appropriate prompts
For example, to translate English to French, you do not fine-tune a translation model. You prompt:
```text
Translate English to French:
sea otter => loutre de mer
cheese => fromage
the cat sat on the mat =>
```
The model completes the sequence by producing the translation.
Scale Changes Things
GPT-2 demonstrated a crucial principle that would define the next era of AI: scale changes the qualitative behavior of models.
| Variant | Parameters | Layers | Hidden Size | Zero-Shot Performance |
|---|---|---|---|---|
| GPT-2 Small | 117M | 12 | 768 | Baseline |
| GPT-2 Medium | 345M | 24 | 1024 | Better |
| GPT-2 Large | 762M | 36 | 1280 | Much better |
| GPT-2 XL | 1.5B | 48 | 1600 | Best |
The training data was also scaled dramatically:
| Aspect | GPT-1 | GPT-2 |
|---|---|---|
| Training data | BooksCorpus (800M words) | WebText (40GB, 8M web pages) |
| Data curation | Existing dataset | Custom: Reddit links with 3+ karma |
| Vocabulary | 40K BPE | 50,257 BPE |
| Context window | 512 | 1024 |
Key Results
GPT-2 achieved state-of-the-art on several benchmarks without any training on those benchmarks:
| Benchmark | Previous SOTA | GPT-2 (zero-shot) |
|---|---|---|
| LAMBADA (last word prediction) | 99.8 (PPL) | 8.6 (PPL) |
| Children's Book Test (NE) | 82.3% | 89.1% |
| Winograd Schema | 63.7% | 70.7% |
The model also generated remarkably coherent long text, which led to the (controversial) decision to initially withhold the full model weights.
Why WebText Mattered
The key data innovation was quality-filtered web text. Instead of crawling the entire web (Common Crawl), Radford et al. scraped all outbound links from Reddit posts with at least 3 upvotes. This produced a dataset that was:
- Diverse: Covered every topic discussed on Reddit
- Quality-filtered: Human curation via upvotes
- Large enough: 40GB of text (~10x BooksCorpus)
Do not describe GPT-2 as "just a bigger GPT-1." The conceptual leap from "pre-train then fine-tune" to "pre-train and directly prompt" was fundamental. GPT-1 required task-specific fine-tuning with labeled data. GPT-2 showed that task performance emerges from scale alone - no fine-tuning needed. This was the seed of the "foundation model" concept.
```python
# Conceptual difference: GPT-1 vs GPT-2 task adaptation.
# `fine_tune`, `model.classify`, and `model.generate` are placeholders,
# not real library calls.

# GPT-1 approach: fine-tune on each task
def gpt1_sentiment(text, model, labeled_data):
    """Requires labeled training data and gradient updates."""
    fine_tuned_model = fine_tune(model, labeled_data, epochs=3, lr=2e-5)
    return fine_tuned_model.classify(text)

# GPT-2 approach: zero-shot prompting
def gpt2_sentiment(text, model):
    """No training data needed. Just prompt."""
    prompt = f"Review: {text}\nSentiment:"
    return model.generate(prompt)  # Expected to continue with "positive" or "negative"
```
Part 3 - GPT-3: In-Context Learning (2020)
The Paper
"Language Models are Few-Shot Learners" - Brown et al., 2020
The Scale Leap
GPT-3 scaled by two orders of magnitude:
| Parameter | GPT-2 | GPT-3 |
|---|---|---|
| Parameters | 1.5B | 175B |
| Layers | 48 | 96 |
| Hidden size | 1600 | 12,288 |
| Attention heads | 25 | 96 |
| Context window | 1024 | 2048 |
| Training data | 40GB | 570GB (filtered Common Crawl + books + Wikipedia) |
| Training cost | ~$50K (estimated) | ~$4.6M (estimated) |
In-Context Learning: The Key Innovation
GPT-3 introduced in-context learning - the ability to perform tasks by conditioning on a few examples in the prompt, with no gradient updates whatsoever:
```text
Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => girafe en peluche
cheese =>
```
The model outputs "fromage" without ever being fine-tuned on translation.
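Three modes of in-context learning:
- Zero-shot: the prompt contains only a natural-language task description and the query
- One-shot: the prompt adds a single demonstration before the query
- Few-shot: the prompt adds several demonstrations, as many as fit in the context window (typically 10 to 100 in the paper)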
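Zero-shot, one-shot, and few-shot prompting differ only in how many demonstrations are placed in the prompt. The sketch below builds all three from the translation example. The `build_prompt` helper and the ` => ` separator are illustrative choices, not an API from the paper.

```python
def build_prompt(task_description, demonstrations, query, sep=" => "):
    """Assemble a prompt from a task description, k demonstrations, and a query."""
    lines = [task_description]
    for source, target in demonstrations:
        lines.append(f"{source}{sep}{target}")
    # Leave the final answer slot empty for the model to complete.
    lines.append(f"{query}{sep}".rstrip())
    return "\n".join(lines)

demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]
zero_shot = build_prompt("Translate English to French:", [], "cheese")
one_shot = build_prompt("Translate English to French:", demos[:1], "cheese")
few_shot = build_prompt("Translate English to French:", demos, "cheese")
print(few_shot)
```

In all three cases the model's weights are identical; only the text it conditions on changes.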
How Does In-Context Learning Work?
This is one of the most debated questions in modern ML. The key hypotheses:
Hypothesis 1: Implicit Bayesian inference. The model has learned a prior over tasks during pre-training. The in-context examples narrow the posterior to the correct task, and the model applies the inferred task to the test input.
Hypothesis 2: Mesa-optimization. The Transformer internally implements a learning algorithm (akin to gradient descent) within its forward pass. Research by Akyürek et al. (2022) showed that Transformers can implement linear regression in their forward pass.
Hypothesis 3: Task location. The model has already learned to perform many tasks during pre-training. The in-context examples serve as a "task locator" - they tell the model which of its existing capabilities to apply. This is supported by the finding that in-context learning performance does not degrade much when labels are randomized (Min et al., 2022).
Taken together, these hypotheses suggest that in-context learning can behave as if the model were running a learning algorithm internally, but without any actual parameter updates.
If asked "Is in-context learning actually learning?" and you say "Yes, the model updates its weights," that is an instant rejection. In-context learning involves ZERO parameter updates. All computation happens in a single forward pass. The model's weights are frozen. This is fundamentally different from fine-tuning or training. What changes is the input context, not the parameters.
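The implicit Bayesian inference view (Hypothesis 1) can be illustrated with a toy posterior over candidate tasks. This is only an analogy: the two tasks, the uniform prior, and the `noise` likelihood are assumptions made up for the sketch, and nothing here claims to describe what a Transformer computes internally.

```python
# Two hypothetical candidate "tasks" the model might already know.
tasks = {
    "uppercase": lambda s: s.upper(),
    "reverse": lambda s: s[::-1],
}

def task_posterior(demonstrations, prior=None, noise=0.05):
    """Posterior over tasks given (input, output) pairs, via Bayes' rule.

    A demonstration has likelihood (1 - noise) under a task that reproduces
    its output and `noise` otherwise. No model weights are updated anywhere.
    """
    prior = prior or {name: 1.0 / len(tasks) for name in tasks}
    unnorm = {}
    for name, fn in tasks.items():
        likelihood = 1.0
        for x, y in demonstrations:
            likelihood *= (1.0 - noise) if fn(x) == y else noise
        unnorm[name] = prior[name] * likelihood
    total = sum(unnorm.values())
    return {name: p / total for name, p in unnorm.items()}

posterior = task_posterior([("abc", "cba"), ("hello", "olleh")])
best = max(posterior, key=posterior.get)
print(posterior, tasks[best]("otter"))
```

Two demonstrations are enough to concentrate almost all posterior mass on one task, which mirrors how a handful of in-context examples can pin down the intended behavior while every parameter stays frozen.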
Scaling Laws in GPT-3
The paper showed smooth power-law improvements across three orders of magnitude of scale:
```python
# Power-law scaling: loss falls as a power law of model size.
# The functional form and constants below come from Kaplan et al. (2020),
# "Scaling Laws for Neural Language Models", not from the GPT-3 paper itself:
# L(N) ≈ (N_c / N)^α with α ≈ 0.076 and N_c ≈ 8.8 × 10^13,
# where N counts non-embedding parameters.

def gpt3_loss_vs_params(N):
    """Approximate test loss as a function of parameter count."""
    N_c = 8.8e13
    alpha = 0.076
    return (N_c / N) ** alpha

# Model sizes reported in the GPT-3 paper
param_counts = [125e6, 350e6, 760e6, 1.3e9, 2.7e9, 6.7e9, 13e9, 175e9]
model_names = ["125M", "350M", "760M", "1.3B", "2.7B", "6.7B", "13B", "175B"]

for name, N in zip(model_names, param_counts):
    loss = gpt3_loss_vs_params(N)
    print(f"GPT-3 {name:>5s}: Approx loss = {loss:.3f}")
```
Data Contamination
GPT-3 was one of the first papers to seriously analyze data contamination - the risk that test sets appear in the massive training data. They found contamination in several benchmarks and attempted to measure its effect. This became a template for all subsequent large model evaluations.
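A contamination check of this kind can be sketched as an n-gram overlap test. The helper names, whitespace tokenization, and lowercase normalization are simplifying assumptions. The default of 13 follows the 13-gram overlap used in the GPT-3 analysis, and the demo uses a shorter window so the toy strings are long enough.

```python
def ngrams(text, n):
    """Set of word-level n-grams from lowercased, whitespace-tokenized text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(test_example, training_docs, n=13):
    """Flag a test example if any of its n-grams also appears in a training doc."""
    test_grams = ngrams(test_example, n)
    return any(test_grams & ngrams(doc, n) for doc in training_docs)

train = ["the quick brown fox jumps over the lazy dog near the river bank"]
leaked = "Quick brown fox jumps over the lazy dog"
clean = "a slow green turtle crawls under the fence"
print(is_contaminated(leaked, train, n=5), is_contaminated(clean, train, n=5))
```

A production pipeline would index the training n-grams once instead of recomputing them per test example, and would then compare scores on the clean and flagged subsets of each benchmark.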
"GPT-3's key contribution is in-context learning: the ability to perform new tasks by conditioning on a few examples in the prompt, with zero parameter updates. At 175B parameters trained on 570GB of text, GPT-3 showed that scale enables emergent capabilities - few-shot performance that smaller models cannot achieve. This proved that sufficiently large language models are general-purpose few-shot learners, eliminating the need for task-specific fine-tuning in many cases. The mechanism is still debated, but the leading hypothesis is that the model has learned a distribution over tasks during pre-training, and in-context examples serve to locate the relevant task."
Part 4 - InstructGPT: Alignment via RLHF (2022)
The Paper
"Training language models to follow instructions with human feedback" - Ouyang et al., 2022
The Problem
GPT-3 was powerful but poorly behaved. It would:
- Generate toxic or harmful content when prompted
- Make up facts confidently (hallucinate)
- Follow the literal prompt instead of the user's actual intent
- Produce verbose, unhelpful responses
The core issue: the language modeling objective optimizes for predicting likely text, not for being helpful, harmless, and honest. A model trained to predict web text will produce text that looks like web text - including all its toxicity, misinformation, and irrelevance.
The Three-Stage RLHF Pipeline
InstructGPT introduced the three-stage pipeline that would become the standard for aligning language models:
Stage 1: Supervised Fine-Tuning (SFT)
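Collect a dataset of prompts and human-written ideal responses. Fine-tune GPT-3 on this data using standard supervised learning:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \sum_{t} \log \pi_\theta(y_t \mid x, y_{<t}) \right]$$

Where $x$ is the prompt, $y$ is the human-written response, and $\pi_\theta$ is the language model being fine-tuned.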
This produces a model that follows instructions but is not yet optimized for quality. The SFT model is a starting point, not the final product.
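A minimal sketch of the SFT loss on one example is shown below. Excluding prompt tokens from the loss is a common convention and an assumption here, not a detail taken from the paper. The per-token probabilities are made up.

```python
import math

def sft_loss(token_log_probs, is_response_token):
    """Mean negative log-likelihood over response tokens only.

    token_log_probs: log-probability the model assigned to each target token.
    is_response_token: True for tokens of the human-written response,
    False for prompt tokens (excluded from the loss in this sketch).
    """
    picked = [lp for lp, keep in zip(token_log_probs, is_response_token) if keep]
    return -sum(picked) / len(picked)

# Hypothetical probabilities for a 2-token prompt followed by a 3-token response.
log_probs = [math.log(p) for p in (0.9, 0.8, 0.5, 0.25, 0.5)]
mask = [False, False, True, True, True]
loss = sft_loss(log_probs, mask)
print(f"SFT loss: {loss:.4f}")
```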
Stage 2: Reward Model Training
Collect comparison data: for a given prompt, generate multiple responses and have humans rank them from best to worst. Train a reward model to predict which response humans will prefer.
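A ranking of K responses can be expanded into K choose 2 pairwise comparisons, which is the form the reward model is trained on. The sketch below assumes the list is ordered best to worst and contains no ties. The function name is hypothetical.

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into (preferred, rejected) training pairs."""
    # combinations() preserves input order, so the first item of each pair
    # is always the higher-ranked response.
    return list(combinations(ranked_responses, 2))

pairs = ranking_to_pairs(["A", "B", "C", "D"])
print(len(pairs), pairs[0], pairs[-1])
```

Four ranked responses yield six comparisons, so one labeling pass produces much more training signal than a single best-response label would.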
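The loss function uses the Bradley-Terry model of pairwise comparisons:

$$\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big) \right]$$

Where $y_w$ is the preferred response, $y_l$ is the less-preferred response, $r_\phi$ is the reward model, and $\sigma$ is the sigmoid function.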
```python
import numpy as np

def reward_model_loss(r_preferred, r_rejected):
    """
    Bradley-Terry pairwise loss for reward modeling.
    r_preferred: reward score for the human-preferred response
    r_rejected: reward score for the rejected response
    """
    # We want r_preferred > r_rejected.
    # -log(sigmoid(gap)) == log(1 + exp(-gap)); it shrinks as the gap grows.
    return np.log1p(np.exp(-(r_preferred - r_rejected)))

# Example: reward model correctly ranks
r_good = 2.5   # Human-preferred response
r_bad = -1.0   # Rejected response
print(f"Loss (correct ranking): {reward_model_loss(r_good, r_bad):.4f}")
# Low loss - model agrees with humans

# Example: reward model incorrectly ranks
r_good = -0.5  # Human-preferred response scores lower
r_bad = 1.5    # Rejected response scores higher
print(f"Loss (incorrect ranking): {reward_model_loss(r_good, r_bad):.4f}")
# High loss - model disagrees with humans
```
Stage 3: PPO Optimization
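Use the reward model to optimize the language model via Proximal Policy Optimization (PPO). The objective is:

$$\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \left[ r_\phi(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{SFT}}(y \mid x)} \right]$$

Here $\pi_\theta$ is the policy being trained, $\pi_{\text{SFT}}$ is the frozen SFT model, $r_\phi$ is the reward model, and $\beta$ controls the strength of the KL penalty (the log-ratio term).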
The KL penalty is critical: it prevents the model from drifting too far from the SFT model, which would lead to reward hacking - generating degenerate text that scores highly on the reward model but is nonsensical.
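The effect of the KL penalty on a single sampled response can be sketched as follows. The log-probabilities are invented, the `beta` value is illustrative, and the log-ratio of one sample is used as a rough single-sample KL estimate. Actual PPO training involves considerably more machinery than this.

```python
def kl_penalized_reward(reward, logp_policy, logp_sft, beta=0.02):
    """Reward-model score minus beta times a single-sample KL estimate.

    logp_policy / logp_sft: total log-probability of the sampled response
    under the current policy and under the frozen SFT model.
    """
    kl_estimate = logp_policy - logp_sft
    return reward - beta * kl_estimate

# Same reward-model score, but the second response has drifted far from the SFT model.
close = kl_penalized_reward(reward=2.0, logp_policy=-40.0, logp_sft=-42.0)
drifted = kl_penalized_reward(reward=2.0, logp_policy=-10.0, logp_sft=-90.0)
print(close, drifted)
```

A response that the policy finds very likely but the SFT model finds very unlikely loses most of its reward, which is what discourages degenerate outputs that merely exploit the reward model.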
Results
The results were striking:
| Model | Parameters | Human Preference Rate |
|---|---|---|
| GPT-3 (175B, no alignment) | 175B | Baseline |
| InstructGPT (SFT only) | 1.3B | Preferred over GPT-3 175B |
| InstructGPT (SFT + RLHF) | 1.3B | Strongly preferred over GPT-3 175B |
The most important finding: a 1.3B parameter model with RLHF was preferred by humans over a 175B parameter model without alignment. Alignment is not just a safety measure - it is a capability amplifier.
Do not describe RLHF as "just making the model nicer." InstructGPT showed that RLHF improves helpfulness, reduces hallucination, and makes the model better at following complex instructions. It is a fundamental training methodology that improves capability, not just safety. The aligned 1.3B model was preferred over the unaligned 175B model - alignment actually unlocks capability that raw pre-training does not surface.
The Alignment Tax
InstructGPT also highlighted the "alignment tax" - the tradeoff between alignment and raw task performance. RLHF can reduce performance on some traditional NLP benchmarks while dramatically improving real-world usefulness. The paper observed such regressions on several public NLP datasets and showed they could be greatly reduced by mixing pre-training gradients into the PPO updates (the PPO-ptx variant), leaving a small tax.
"InstructGPT solved the alignment problem for language models using a three-stage pipeline: (1) supervised fine-tuning on human demonstrations, (2) training a reward model on human preference comparisons, and (3) optimizing the language model against the reward model using PPO with a KL penalty. The key result was that a 1.3B RLHF model was preferred by humans over the 175B GPT-3, proving that alignment is a capability multiplier, not just a safety constraint. The KL penalty prevents reward hacking by keeping the model close to the SFT distribution."
Part 5 - GPT-4: Multimodal and Mixture of Experts (2023)
The Paper
"GPT-4 Technical Report" - OpenAI, 2023
What We Know
The GPT-4 technical report is notably sparse on details. OpenAI stated: "Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar."
Despite this, enough information has leaked and been repeated by multiple sources to sketch a plausible, though unconfirmed, picture.
Multimodal Input
GPT-4 was the first GPT model to accept both text and image inputs.
Images are processed through a vision encoder (likely a ViT variant) that produces patch embeddings, which are projected into the same embedding space as text tokens. The Transformer then attends over both text and image tokens.
Mixture of Experts (MoE) - The Rumored Architecture
Several unofficial sources (remarks by George Hotz, leaked reports) suggest GPT-4 uses a Mixture of Experts architecture. None of the following figures are confirmed:
- ~1.8 trillion total parameters across all experts
- 8 experts per MoE layer, with top-2 routing (2 experts active per token)
- ~220B parameters active per forward pass (not all 1.8T)
- 16 inference passes for a single response (speculative decoding or similar)
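The MoE architecture replaces the standard FFN layer with multiple expert FFN layers and a routing mechanism:

$$y = \sum_{i=1}^{N} G(x)_i \, E_i(x)$$

Where $E_i$ is the $i$-th expert FFN and $G(x)$ is the gating function that determines how much each expert contributes:

$$G(x) = \text{softmax}\big(\text{TopK}(W_g x,\, k)\big)$$

Here $\text{TopK}$ keeps the $k$ largest gating logits and sets the rest to $-\infty$, so unselected experts receive zero weight. With top-2 routing, only 2 of 8 experts are active per token, meaning the model uses roughly $2/8 = 25\%$ of its expert parameters per forward pass.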
```python
import numpy as np

def mixture_of_experts(x, expert_ffns, gate_weights, top_k=2):
    """
    Simplified MoE forward pass.
    x: input tensor (d_model,)
    expert_ffns: list of expert functions
    gate_weights: (num_experts, d_model) gating matrix
    top_k: number of experts to route to
    """
    # Compute gating scores
    scores = gate_weights @ x  # (num_experts,)
    # Softmax over experts (subtract the max for numerical stability)
    exp_scores = np.exp(scores - scores.max())
    probs = exp_scores / exp_scores.sum()
    # Select top-k experts
    top_indices = np.argsort(probs)[-top_k:]
    # Renormalize probabilities over selected experts
    top_probs = probs[top_indices]
    top_probs = top_probs / top_probs.sum()
    # Compute weighted sum of expert outputs
    output = np.zeros_like(x)
    for idx, prob in zip(top_indices, top_probs):
        output += prob * expert_ffns[idx](x)
    return output

# Example: 8 experts, top-2 routing
# Total expert params: 8 × FFN_size, but only 2 × FFN_size used per token.
# This is how an MoE model can hold far more parameters than it uses per token.
num_experts = 8
top_k = 2
active_fraction = top_k / num_experts
print(f"Active fraction: {active_fraction:.1%}")  # 25.0%
print(f"If total params = 1.8T, active params ≈ {1.8 * active_fraction:.2f}T")
# ≈ 0.45T from the expert fraction alone. Always-active layers such as attention
# can only push the true active count above this, so the rumored ~220B active
# figure does not follow from 8 experts with top-2 routing; treat the leaked
# numbers as rough and mutually inconsistent estimates.
```
Why MoE?
The key advantage of MoE is decoupling total model capacity from per-token compute cost:
| Architecture | Total Parameters | Active per Token | Compute per Token |
|---|---|---|---|
| Dense 175B (GPT-3) | 175B | 175B | High |
| Dense 1.8T (hypothetical) | 1.8T | 1.8T | Extreme |
| MoE 1.8T (GPT-4 style) | 1.8T | ~220B | Moderate |
MoE gives you the knowledge capacity of a 1.8T model with the inference cost of a ~220B model. The tradeoff is higher memory requirements (all experts must be loaded) and load balancing complexity.
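The capacity versus compute split can be made explicit with a little parameter accounting. The sizes below are hypothetical and chosen only to illustrate the arithmetic; they are not GPT-4's real layout, which is undisclosed.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, top_k):
    """Return (total, active-per-token) parameter counts for an MoE model.

    shared_params: parameters every token uses (attention, embeddings, ...).
    params_per_expert: parameters in one expert, summed over all MoE layers.
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active

# Hypothetical sizes, for illustration only.
total, active = moe_param_counts(
    shared_params=50e9, params_per_expert=100e9, num_experts=8, top_k=2
)
print(f"total = {total / 1e9:.0f}B, active = {active / 1e9:.0f}B")
```

Memory scales with `total` because every expert must be resident, while per-token FLOPs scale with `active`. Because shared layers are always used, the active share of the model is somewhat larger than `top_k / num_experts`.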
GPT-4 Performance
GPT-4 achieved remarkable results on professional and academic exams:
| Exam | GPT-3.5 Percentile | GPT-4 Percentile |
|---|---|---|
| Bar Exam (Uniform) | ~10th | ~90th |
| SAT Math | ~70th | ~89th |
| SAT EBRW | ~87th | ~93rd |
| GRE Quantitative | ~25th | ~80th |
| AP Biology | ~62nd-85th | ~85th-100th |
| AP Chemistry | ~22nd-46th | ~71st-88th |
| LSAT | ~40th | ~88th |
Predictable Scaling
Perhaps the most scientifically important contribution of the GPT-4 report was demonstrating predictable loss scaling. OpenAI trained small models on reduced compute and accurately predicted GPT-4's final loss before training the full model, by fitting a power law with an irreducible-loss term:

$$L(C) = a C^{b} + c$$

Here $C$ is training compute and $a$, $b$, $c$ are fitted constants. They claimed the prediction was accurate to within a small margin for a model that reportedly cost $100M+ to train, validating that scaling laws work reliably enough for planning purposes.
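The general workflow of fitting on small runs and extrapolating can be sketched with a log-log least-squares fit. The data here is synthetic, generated from a made-up law, and the sketch fits a pure power law with no irreducible-loss offset, so it is a simplification and not OpenAI's actual procedure.

```python
import math

def fit_power_law(computes, losses):
    """Least-squares fit of log L = log a + b * log C; returns (a, b)."""
    xs = [math.log(c) for c in computes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    b = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    a = math.exp(y_mean - b * x_mean)
    return a, b

# Synthetic "small runs" generated from a made-up law L = 10 * C^-0.05.
small_computes = [1e15, 1e16, 1e17, 1e18]
small_losses = [10 * c ** -0.05 for c in small_computes]

a, b = fit_power_law(small_computes, small_losses)
predicted = a * (1e22) ** b  # extrapolate four orders of magnitude past the largest run
print(f"a = {a:.3f}, b = {b:.3f}, predicted loss at 1e22 = {predicted:.3f}")
```

With real training runs the points are noisy and the fit needs the offset term, but the planning logic is the same: estimate the curve cheaply, then read off the loss at the target compute budget.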
If asked "How many parameters does GPT-4 have?" and you answer with certainty, that is a red flag. OpenAI has not officially disclosed GPT-4's architecture. The ~1.8T MoE figure comes from leaks, not official sources. The correct answer is: "OpenAI has not officially disclosed GPT-4's architecture. Leaked information suggests approximately 1.8 trillion total parameters with a Mixture of Experts architecture using 8 experts and top-2 routing, giving roughly 220 billion active parameters per forward pass. But this is unconfirmed."
"GPT-4 represents two major advances: multimodal input (text + images) and massive scale, likely via a Mixture of Experts architecture. MoE decouples capacity from compute by having many expert FFN layers but routing each token to only the top-k experts. The rumored architecture has ~1.8T total parameters but ~220B active per token. GPT-4 demonstrated near-expert performance on professional exams and showed that loss scaling is predictable enough to plan $100M training runs. The most important insight is that MoE allows continued scaling without proportional compute increases."
Part 6 - The Conceptual Arc: Five Paradigm Shifts
The Evolution
Each GPT generation introduced a fundamentally new idea about how to use language models:
| Generation | Key Paradigm Shift | What It Proved |
|---|---|---|
| GPT-1 | Pre-train then fine-tune | Unsupervised pre-training learns transferable representations |
| GPT-2 | Task-agnostic prompting | Large LMs implicitly learn to perform tasks without fine-tuning |
| GPT-3 | In-context learning | Few examples in the prompt suffice; no gradient updates needed |
| InstructGPT | RLHF alignment | Human feedback makes small models outperform large unaligned ones |
| GPT-4 | Multimodal MoE | MoE scales capacity without proportional compute; vision + language unification |
The Scaling Hypothesis
The deepest lesson of the GPT series is the scaling hypothesis: increasing model size, data size, and compute - in the right proportions - reliably produces qualitatively new capabilities. Abilities like in-context learning, chain-of-thought reasoning, and multimodal understanding were not explicitly programmed; they emerged from scale.
The relationship is not linear - it follows power laws with occasional emergent phase transitions where new capabilities appear suddenly as scale increases.
Part 7 - Comparing GPT to BERT
This comparison is among the most commonly asked interview questions. Understanding both sides of the architecture fork is critical.
| Aspect | GPT Series | BERT |
|---|---|---|
| Architecture | Transformer decoder | Transformer encoder |
| Directionality | Unidirectional (causal) | Bidirectional |
| Pre-training objective | Next token prediction | Masked language modeling |
| Adaptation paradigm | Prompting / in-context learning | Fine-tuning |
| Strengths | Generation, reasoning, few-shot | Understanding, classification, retrieval |
| Scaling trajectory | 117M → 1.8T (continued scaling) | 110M → 340M (largely stopped) |
| Modern relevance | Dominant paradigm (ChatGPT, Claude, etc.) | Still used in search, NER, embeddings |
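The two pre-training objectives in the table can be contrasted by looking at the training examples each one produces from the same sentence. The helper names are hypothetical, and the masked position is chosen by hand here, whereas BERT samples positions at random.

```python
def next_token_examples(tokens):
    """Causal LM: predict each token from everything to its left."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, mask_positions, mask_token="[MASK]"):
    """Masked LM: hide chosen positions and predict them from both sides."""
    corrupted = [mask_token if i in mask_positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(mask_positions)}
    return corrupted, targets

tokens = ["the", "cat", "sat", "down"]
causal = next_token_examples(tokens)
corrupted, targets = masked_lm_example(tokens, {1})
print(causal)
print(corrupted, targets)
```

The causal objective yields a prediction at every position and matches how text is generated left to right. The masked objective sees context on both sides but only learns from the hidden positions.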
Why GPT Won
The GPT paradigm ultimately became dominant for a simple reason: autoregressive generation is a universal interface. Any task - classification, translation, QA, reasoning, coding - can be expressed as "generate the right text." BERT's bidirectional architecture is inherently limited to encoding, not generation.
As models scaled, the advantage of bidirectional context for understanding tasks was overwhelmed by the versatility and emergent capabilities of autoregressive models.
At Google, BERT-style encoder models are still heavily used in production (for example in Search ranking and Ads quality). At OpenAI, Anthropic, and most startups, the GPT-style decoder-only architecture dominates. Know your audience.
Part 8 - Implementation: Building a Mini-GPT
Understanding the GPT architecture means being able to implement it. Here is a simplified NumPy implementation of the transformer block stack (forward pass only, random weights, no embeddings or training loop):
```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Layer normalization (learned scale and shift omitted for brevity)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class GPTBlock:
    """A single GPT transformer block."""

    def __init__(self, d_model, n_heads, d_ff):
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.d_ff = d_ff
        # Multi-head self-attention weights
        self.W_Q = np.random.randn(d_model, d_model) * 0.02
        self.W_K = np.random.randn(d_model, d_model) * 0.02
        self.W_V = np.random.randn(d_model, d_model) * 0.02
        self.W_O = np.random.randn(d_model, d_model) * 0.02
        # Feed-forward network weights
        self.W1 = np.random.randn(d_model, d_ff) * 0.02
        self.b1 = np.zeros(d_ff)
        self.W2 = np.random.randn(d_ff, d_model) * 0.02
        self.b2 = np.zeros(d_model)

    def _split_heads(self, t):
        """(seq_len, d_model) -> (n_heads, seq_len, d_k)."""
        seq_len = t.shape[0]
        return t.reshape(seq_len, self.n_heads, self.d_k).transpose(1, 0, 2)

    def causal_attention(self, x):
        """Multi-head causal (masked) self-attention."""
        seq_len = x.shape[0]
        Q = self._split_heads(x @ self.W_Q)
        K = self._split_heads(x @ self.W_K)
        V = self._split_heads(x @ self.W_V)
        # Per-head attention scores: (n_heads, seq_len, seq_len)
        scores = Q @ K.transpose(0, 2, 1) / np.sqrt(self.d_k)
        # Apply causal mask: position i can only attend to positions <= i
        mask = np.triu(np.full((seq_len, seq_len), -1e9), k=1)
        scores = scores + mask
        # Softmax over the key dimension
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)
        # Weighted sum per head, then concatenate heads back to d_model
        heads = weights @ V  # (n_heads, seq_len, d_k)
        output = heads.transpose(1, 0, 2).reshape(seq_len, self.d_model)
        return output @ self.W_O

    def feed_forward(self, x):
        """Position-wise feed-forward network with GELU activation."""
        h = x @ self.W1 + self.b1
        # GELU tanh approximation (used in GPT-2+)
        h = h * 0.5 * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
        return h @ self.W2 + self.b2

    def forward(self, x):
        """Pre-norm transformer block (GPT-2 style)."""
        # Self-attention with residual (pre-norm)
        x = x + self.causal_attention(layer_norm(x))
        # Feed-forward with residual (pre-norm)
        x = x + self.feed_forward(layer_norm(x))
        return x


# GPT-2 Small dimensions
d_model = 768
n_heads = 12
d_ff = 3072
n_layers = 12

# Create a stack of transformer blocks
blocks = [GPTBlock(d_model, n_heads, d_ff) for _ in range(n_layers)]

# Forward pass through all layers
x = np.random.randn(10, d_model)  # 10 tokens
for block in blocks:
    x = block.forward(x)

print(f"Input shape: (10, {d_model})")
print(f"Output shape: {x.shape}")  # (10, 768)
# Final output goes through LayerNorm → Linear (d_model, vocab_size) → Softmax
```
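The final step mentioned in the last comment, projecting the hidden state onto the vocabulary and normalizing, can be sketched on a toy scale. The three-word vocabulary, the 2-dimensional hidden state, and the `unembed` matrix are all made up, and plain Python lists are used to keep the sketch dependency-free.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_distribution(hidden, unembed):
    """Project a final hidden vector onto the vocabulary and normalize.

    hidden: list of d_model floats (assumed already layer-normalized).
    unembed: vocab_size rows, each a list of d_model floats.
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    return softmax(logits)

# Hypothetical 3-word vocabulary and 2-dimensional hidden state.
vocab = ["cat", "dog", "mat"]
unembed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = next_token_distribution([2.0, 1.0], unembed)
greedy = vocab[probs.index(max(probs))]
print(greedy, [round(p, 3) for p in probs])
```

Greedy decoding picks the most probable token at each step. Sampling from `probs` instead trades determinism for diversity.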
Practice Problems
Problem 1: Pre-Training Objective
Explain why next-token prediction is a sufficient objective for learning general-purpose language representations. What types of knowledge must the model acquire to predict the next token well?
Hint
To predict the next token, the model must learn: (1) syntax (grammar rules determine what tokens are valid next), (2) semantics (meaning constrains what is likely), (3) world knowledge (facts about the world help predict factual text), (4) reasoning (logical sequences require reasoning to continue), (5) common sense (understanding everyday situations). The beauty of next-token prediction is that ALL of these are needed to minimize the loss on diverse text.
Problem 2: In-Context Learning Mechanism
A Transformer is given 5 examples of "input → output" in its context, followed by a new input. The model produces the correct output. Explain two hypotheses for how this works, given that NO parameter updates occur.
Hint
Hypothesis 1 (Task location): The model has already learned many tasks during pre-training. The examples serve as a "pointer" to the correct task in the model's learned distribution. The model identifies the task and applies its pre-existing knowledge. Supported by: randomizing labels in examples barely hurts performance on some tasks (Min et al., 2022). Hypothesis 2 (Mesa-optimization): The Transformer implements an implicit learning algorithm in its forward pass. The attention mechanism can compute something resembling gradient descent over the examples. Supported by: Transformers can provably learn to implement linear regression in-context (Akyürek et al., 2022).
Problem 3: RLHF Failure Modes
What happens if you remove the KL penalty from the PPO objective in RLHF? Describe the failure mode and explain why the KL term is necessary.
Hint
Without the KL penalty, the model will "reward hack" - it will find degenerate text patterns that score highly on the reward model but are nonsensical to humans. For example, it might repeat certain phrases the reward model has learned to associate with quality, or produce adversarial outputs that exploit reward model weaknesses. The KL penalty keeps the model's output distribution close to the SFT model, preventing it from drifting into regions where the reward model's predictions are unreliable.
Problem 4: MoE vs Dense Trade-offs
You are designing a new LLM and must choose between a 200B dense model and a 1T MoE model with 8 experts (top-2 routing). Both have approximately the same active parameter count per token. What are the trade-offs for training, inference, and deployment?
Hint
Training: MoE requires all experts in memory (5x the memory of dense), but has higher capacity. MoE also requires load balancing losses to prevent expert collapse. Inference: Both have similar per-token compute (FLOPs), but MoE needs 5x memory for all experts. Batching is harder with MoE because different tokens route to different experts, creating uneven workloads. Deployment: MoE requires model parallelism across more GPUs due to memory. Dense models are simpler to shard. However, MoE offers better quality-per-FLOP if memory is not the bottleneck.
Problem 5: GPT Timeline
For each GPT generation, name the single most important idea and explain why it could not have happened at a smaller scale.
Hint
GPT-1 (pre-train + fine-tune): Could work at any scale - even 117M parameters showed clear benefits from pre-training. GPT-2 (zero-shot): Needed ~1B+ parameters - smaller models cannot perform tasks zero-shot because they lack the capacity to implicitly encode task descriptions. GPT-3 (in-context learning): Needed ~100B+ parameters - few-shot in-context learning is an emergent capability that appears only at sufficient scale. InstructGPT (RLHF): Could theoretically work at any scale, but RLHF's impact is most dramatic at scales where the base model has rich capabilities that alignment can surface. GPT-4 (MoE): MoE is specifically a technique for scaling beyond what dense models can economically achieve.
Interview Cheat Sheet
| Question | Key Points |
|---|---|
| "What did GPT-1 introduce?" | Pre-train Transformer decoder on next-token prediction, then fine-tune for downstream tasks. 117M params on BooksCorpus. |
| "How is GPT-2 different from GPT-1?" | Paradigm shift from fine-tuning to zero-shot. 1.5B params. Showed tasks emerge from scale without task-specific training. |
| "Explain in-context learning (GPT-3)" | Conditioning on few examples in the prompt, zero parameter updates. 175B params. Leading hypothesis: task location, not learning. |
| "How does RLHF work?" | Three stages: SFT on demonstrations, reward model from comparisons, PPO with KL penalty. Aligned 1.3B beats unaligned 175B. |
| "What is GPT-4's architecture?" | Not officially disclosed. Leaked: ~1.8T MoE with 8 experts, top-2 routing, ~220B active params. Multimodal (text + image). |
| "Why MoE?" | Decouples capacity from per-token compute. 1.8T total but ~220B active. More knowledge, same inference cost as smaller dense model. |
| "GPT vs BERT?" | GPT: decoder, autoregressive, generation-focused, scales to prompting. BERT: encoder, bidirectional, understanding-focused, requires fine-tuning. |
| "Why did GPT win over BERT?" | Autoregressive generation is a universal interface. Any task = generate the right text. BERT is limited to encoding. |
| "What is the alignment tax?" | RLHF may slightly reduce benchmark performance but dramatically improves real-world usefulness. Tax is small and sometimes negative. |
| "What is reward hacking?" | Without KL penalty, models exploit reward model weaknesses. KL constraint keeps outputs close to SFT distribution. |
Spaced Repetition Checkpoints
Day 0 (Today)
- Explain the key innovation of each GPT generation in one sentence
- Draw the three-stage RLHF pipeline from memory
- Explain in-context learning and why it involves zero parameter updates
Day 3
- Write the RLHF loss functions (SFT, reward model, PPO) from memory
- Explain MoE architecture: gating, top-k routing, active parameters
- Compare GPT vs BERT across 5 dimensions
Day 7
- Give a 10-minute presentation on the GPT series evolution
- Explain two hypotheses for how in-context learning works
- Discuss GPT-4's multimodal architecture
Day 14
- Mock interview: answer all 10 cheat sheet questions
- Explain why the scaling hypothesis matters
- Discuss reward hacking and the KL penalty
Day 21
- Full 20-minute paper discussion simulation covering the GPT series
- Handle follow-up questions about scaling, alignment, and MoE
- Discuss the future of the GPT paradigm and open questions
Next Steps
You have now traced the most important lineage in modern AI - from GPT-1's pre-training insight to GPT-4's multimodal MoE. Next, step into a different branch of deep learning history with Chapter 6: ResNet and Skip Connections - the paper that proved depth matters and solved the degradation problem that had limited neural networks for years.
