GPT Series - The Arc from 117M to a Trillion Parameters

Reading time: ~50 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Engineer, LLM Engineer, NLP Engineer

The Real Interview Moment

You are in an OpenAI research engineer interview. The interviewer leans back and says: "Walk me through the GPT lineage - GPT-1 through GPT-4. For each model, tell me the single most important idea it introduced, and why that idea changed the field. I do not want a parameter count recitation. I want to know what each paper proved."

You start with GPT-1, and she immediately probes: "GPT-1 used the same architecture as the Transformer decoder. Why was unsupervised pre-training with a language model objective the key insight, and why had nobody done it effectively before?" You explain the pre-train then fine-tune paradigm, and she follows up: "GPT-2 dropped fine-tuning entirely. Why? And how does in-context learning in GPT-3 actually work - is it gradient-free learning?"

This is the most important lineage in modern AI. Every model in the GPT series introduced an idea that reshaped how the field thinks about language, scale, and intelligence. Candidates who can only recite parameter counts get a "no-hire." Candidates who can articulate the conceptual leap at each generation - and explain why scale enabled those leaps - get a "strong hire."

What You Will Master

  • Explain GPT-1's contribution: unsupervised pre-training + supervised fine-tuning
  • Describe GPT-2's paradigm shift: task-agnostic multitask learning via zero-shot
  • Derive GPT-3's in-context learning mechanism and explain few-shot prompting
  • Explain InstructGPT's RLHF pipeline: SFT, reward modeling, PPO
  • Discuss GPT-4's multimodal capabilities and rumored MoE architecture
  • Trace the conceptual evolution from fine-tuning to prompting to alignment
  • Compare the GPT series to BERT and modern alternatives

Self-Assessment: Where Are You Now?

| Skill | 1 - Cannot | 2 - Vaguely | 3 - Can Explain | 4 - Can Derive | 5 - Can Teach | Your Score |
|---|---|---|---|---|---|---|
| Explain GPT-1's pre-training objective | | | | | | ___ |
| Explain why GPT-2 dropped fine-tuning | | | | | | ___ |
| Describe in-context learning (GPT-3) | | | | | | ___ |
| Explain few-shot vs zero-shot vs one-shot | | | | | | ___ |
| Describe the RLHF pipeline (InstructGPT) | | | | | | ___ |
| Explain reward modeling and PPO | | | | | | ___ |
| Discuss GPT-4's multimodal capabilities | | | | | | ___ |
| Explain the MoE architecture hypothesis | | | | | | ___ |
| Trace the paradigm shifts across generations | | | | | | ___ |
| Compare GPT vs BERT design philosophies | | | | | | ___ |

Target: All 4s and 5s before your interview.

Part 1 - GPT-1: Unsupervised Pre-Training (2018)

The Paper

"Improving Language Understanding by Generative Pre-Training" - Radford et al., 2018

The Core Idea

Before GPT-1, most NLP models were trained largely from scratch on each task. Word2Vec and GloVe provided pre-trained word embeddings, but the rest of the model still had to be trained on labeled data. GPT-1 proved that a two-stage approach works dramatically better:

  1. Pre-train a Transformer decoder on a large unlabeled text corpus using a language modeling objective
  2. Fine-tune the same model on each downstream task with minimal architectural changes

The language modeling objective is simple next-token prediction:

$$\mathcal{L}_{\text{LM}} = -\sum_{t=1}^{T} \log P(w_t \mid w_1, w_2, \ldots, w_{t-1}; \theta)$$

The model learns to predict the next word given all previous words. This is unsupervised - no labeled data is needed. The key insight is that next-token prediction on diverse text forces the model to learn syntax, semantics, world knowledge, and reasoning as a byproduct.
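As a minimal sketch of this objective, the snippet below computes the summed negative log-probability that each position assigns to the token that actually follows it. The `next_token_loss` helper and the toy logits are illustrative assumptions, not code from the paper. Because there is no start-of-sequence token here, the sum runs over positions 2..T.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Summed -log P(w_t | w_<t) for t = 2..T.

    logits: (T, vocab) array; row t holds the scores for the token at t+1
    token_ids: (T,) integer array with the actual tokens
    """
    # Row t predicts token t+1, so drop the last row and the first target.
    pred, targets = logits[:-1], token_ids[1:]
    # Numerically stable log-softmax over the vocabulary.
    shifted = pred - pred.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out the log-probability of each true next token.
    return -log_probs[np.arange(len(targets)), targets].sum()

# Toy check: uniform logits over a 4-token vocabulary (made-up data).
tokens = np.array([0, 2, 1, 3])
uniform_logits = np.zeros((4, 4))
loss = next_token_loss(uniform_logits, tokens)
print(f"Loss with uniform predictions: {loss:.4f}")  # 3 * ln(4)
```

With uniform predictions every target costs ln(4), so three targets give 3 · ln(4); any model that has learned something about the text should score lower.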

Architecture Details

| Parameter | Value |
|---|---|
| Architecture | Transformer decoder (12 layers) |
| Parameters | 117M |
| Hidden size | 768 |
| Attention heads | 12 |
| Context window | 512 tokens |
| Training data | BooksCorpus (7,000 unpublished books, ~800M words) |
| Tokenization | BPE (40,000 merges) |
| Optimizer | Adam with warmup and cosine decay |

GPT-1 Two-Stage Training: Pre-Train then Fine-Tune

How Fine-Tuning Worked

GPT-1's fine-tuning was clever. Instead of adding complex task-specific architectures, every task was reformulated as a sequence:

  • Classification: [START] text [EXTRACT] → linear layer on the [EXTRACT] token
  • Entailment: [START] premise [DELIM] hypothesis [EXTRACT]
  • Similarity: Both orderings concatenated, representations added
  • Multiple choice: Each option paired with context, scored independently

The fine-tuning loss combined the task loss with the language modeling loss:

$$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda \cdot \mathcal{L}_{\text{LM}}$$

The auxiliary LM loss ($\lambda = 0.5$) acted as a regularizer, preventing the model from forgetting its pre-trained representations.
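A rough sketch of how the two terms combine for a classification task is shown below. It assumes a linear classifier applied to the final-layer state at the [EXTRACT] token; the names `finetune_loss` and `W_cls`, and the toy numbers, are hypothetical.

```python
import numpy as np

def cross_entropy(logits, target):
    """-log softmax(logits)[target], computed stably."""
    shifted = logits - logits.max()
    return -(shifted[target] - np.log(np.exp(shifted).sum()))

def finetune_loss(extract_hidden, W_cls, label, lm_loss, lam=0.5):
    """Task loss on the [EXTRACT] state plus lam times the auxiliary LM loss.

    extract_hidden: (d_model,) final-layer state at the [EXTRACT] token
    W_cls: (d_model, num_classes) task-specific linear layer
    lm_loss: language modeling loss on the same input (a scalar)
    """
    task_logits = extract_hidden @ W_cls
    task_loss = cross_entropy(task_logits, label)
    return task_loss + lam * lm_loss, task_loss

rng = np.random.default_rng(0)
h = rng.normal(size=8)          # stand-in for a real hidden state
W = np.zeros((8, 2))            # untrained classifier: uniform over 2 classes
total, task = finetune_loss(h, W, label=1, lm_loss=3.0)
print(f"task loss = {task:.4f}, total loss = {total:.4f}")
```

The only new parameters are those in `W_cls` (plus any delimiter embeddings), which is what "minimal architectural changes" means in practice.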

Results and Impact

GPT-1 achieved state-of-the-art on 9 of 12 benchmarks, including:

  • Commonsense reasoning (Stories Cloze): 86.5% (+8.9%)
  • Question answering (RACE): 59.0% (+5.7%)

Textual entailment on RTE (56.0%) was one of the datasets where it fell short of the previous best result.

60-Second Answer

"GPT-1 proved that unsupervised pre-training with a language model objective, followed by supervised fine-tuning, dramatically outperforms training from scratch. It used a 12-layer Transformer decoder trained on BooksCorpus to predict the next token. The pre-trained representations captured enough linguistic knowledge that fine-tuning with minimal task-specific modifications achieved state-of-the-art on most benchmarks. The key insight was that next-token prediction is a sufficiently rich objective to learn general-purpose language representations."

Part 2 - GPT-2: Zero-Shot Transfer (2019)

The Paper

"Language Models are Unsupervised Multitask Learners" - Radford et al., 2019

The Paradigm Shift

GPT-2's core claim was radical: a language model trained on enough diverse text can perform tasks without any fine-tuning at all. The argument runs as follows:

  1. Every NLP task can be framed as predicting text given some context
  2. A sufficiently large language model trained on diverse data has implicitly seen examples of every task
  3. Therefore, the model can perform tasks zero-shot by conditioning on appropriate prompts

In the paper's framing, a single-task model estimates $p(\text{output} \mid \text{input})$, whereas a general system must also condition on the task to be performed:

$$p(\text{output} \mid \text{input}, \text{task})$$

For example, to translate English to French, you do not fine-tune a translation model. You prompt:

Translate English to French:
sea otter => loutre de mer
cheese => fromage
the cat sat on the mat =>

The model completes the sequence by producing the translation.

Scale Changes Things

GPT-2 demonstrated a crucial principle that would define the next era of AI: scale changes the qualitative behavior of models.

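| Variant | Parameters | Layers | Hidden Size | Zero-Shot Performance |
|---|---|---|---|---|
| GPT-2 Small | 117M | 12 | 768 | Baseline |
| GPT-2 Medium | 345M | 24 | 1024 | Better |
| GPT-2 Large | 762M | 36 | 1280 | Much better |
| GPT-2 XL | 1.5B | 48 | 1600 | Best |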

The training data was also scaled dramatically:

| Aspect | GPT-1 | GPT-2 |
|---|---|---|
| Training data | BooksCorpus (800M words) | WebText (40GB, 8M web pages) |
| Data curation | Existing dataset | Custom: Reddit links with 3+ karma |
| Vocabulary | 40K BPE | 50,257 BPE |
| Context window | 512 | 1024 |

Key Results

GPT-2 achieved state-of-the-art on several benchmarks without any training on those benchmarks:

| Benchmark | Previous SOTA | GPT-2 (zero-shot) |
|---|---|---|
| LAMBADA (last word prediction) | 99.8 (PPL) | 8.6 (PPL) |
| Children's Book Test (NE) | 82.3% | 89.1% |
| Winograd Schema | - | 70.7% |

The model also generated remarkably coherent long text, which led to the (controversial) decision to initially withhold the full model weights.

Why WebText Mattered

The key data innovation was quality-filtered web text. Instead of crawling the entire web (Common Crawl), Radford et al. scraped all outbound links from Reddit posts with at least 3 upvotes. This produced a dataset that was:

  • Diverse: Covered every topic discussed on Reddit
  • Quality-filtered: Human curation via upvotes
  • Large enough: 40GB of text (~10x BooksCorpus)

Common Trap

Do not describe GPT-2 as "just a bigger GPT-1." The conceptual leap from "pre-train then fine-tune" to "pre-train and directly prompt" was fundamental. GPT-1 required task-specific fine-tuning with labeled data. GPT-2 showed that non-trivial task performance can emerge from scale and data diversity with no fine-tuning at all, even though it still trailed supervised systems on tasks such as summarization and translation. This was the seed of the "foundation model" concept.

# Conceptual difference: GPT-1 vs GPT-2 task adaptation.
# Pseudocode: fine_tune(), model.classify() and model.generate() are
# placeholder names, not calls from a real library.

# GPT-1 approach: fine-tune on each task
def gpt1_sentiment(text, model, labeled_data):
    """Requires labeled training data and gradient updates."""
    fine_tuned_model = fine_tune(model, labeled_data, epochs=3, lr=2e-5)
    return fine_tuned_model.classify(text)

# GPT-2 approach: zero-shot prompting
def gpt2_sentiment(text, model):
    """No training data needed. Just prompt."""
    prompt = f"Review: {text}\nSentiment:"
    return model.generate(prompt)  # Model continues with e.g. "positive" or "negative"

Part 3 - GPT-3: In-Context Learning (2020)

The Paper

"Language Models are Few-Shot Learners" - Brown et al., 2020

The Scale Leap

GPT-3 scaled by two orders of magnitude:

| Parameter | GPT-2 | GPT-3 |
|---|---|---|
| Parameters | 1.5B | 175B |
| Layers | 48 | 96 |
| Hidden size | 1600 | 12,288 |
| Attention heads | 25 | 96 |
| Context window | 1024 | 2048 |
| Training data | 40GB | 570GB (filtered Common Crawl + books + Wikipedia) |
| Training cost | ~$50K (estimated) | ~$4.6M (estimated) |

In-Context Learning: The Key Innovation

GPT-3 introduced in-context learning - the ability to perform tasks by conditioning on a few examples in the prompt, with no gradient updates whatsoever:

Translate English to French:

sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => girafe en peluche
cheese =>

The model outputs "fromage" without ever being trained on translation.

Three modes of in-context learning:

$$\text{Zero-shot: } P(y \mid \text{task description}, x)$$

$$\text{One-shot: } P(y \mid \text{task description}, x_1, y_1, x)$$

$$\text{Few-shot: } P(y \mid \text{task description}, x_1, y_1, \ldots, x_k, y_k, x)$$
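The three modes differ only in how many demonstrations are placed in the prompt. A minimal sketch follows; the `build_prompt` helper is hypothetical and simply mirrors the `=>` format of the translation example above.

```python
def build_prompt(task_description, query, examples=()):
    """Assemble a prompt with k = len(examples) in-context demonstrations."""
    lines = [task_description, ""]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    # The query is left open so the model's continuation is the answer.
    lines.append(f"{query} =>")
    return "\n".join(lines)

demos = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush giraffe", "girafe en peluche"),
]
task = "Translate English to French:"

zero_shot = build_prompt(task, "cheese")            # k = 0
one_shot = build_prompt(task, "cheese", demos[:1])  # k = 1
few_shot = build_prompt(task, "cheese", demos)      # k = 3

print(few_shot)
```

In every case the model's weights are identical; only the conditioning text changes.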

In-Context Learning: Zero-Shot, One-Shot, Few-Shot

How Does In-Context Learning Work?

This is one of the most debated questions in modern ML. The key hypotheses:

Hypothesis 1: Implicit Bayesian inference. The model has learned a prior over tasks during pre-training. The in-context examples narrow the posterior to the correct task, and the model applies the inferred task to the test input.

Hypothesis 2: Mesa-optimization. The Transformer internally implements a learning algorithm (akin to gradient descent) within its forward pass. Research by Akyürek et al. (2022) showed that Transformers can implement linear regression in their forward pass.

Hypothesis 3: Task location. The model has already learned to perform many tasks during pre-training. The in-context examples serve as a "task locator" - they tell the model which of its existing capabilities to apply. This is supported by the finding that in-context learning performance does not degrade much when labels are randomized (Min et al., 2022).

$$\text{ICL}(x) \approx f_{\theta^*}(x) \text{ where } \theta^* = \arg\min_\theta \sum_i \ell(f_\theta(x_i), y_i)$$

The notation above is suggestive: in-context learning behaves as if the model is doing gradient descent internally, but without any actual parameter updates.
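Hypothesis 1 can be made concrete with a toy model. The sketch below is an illustration under strong assumptions, not a claim about what a Transformer computes: the three candidate "tasks", the uniform prior, and the `noise` parameter are all made up. It shows how a handful of demonstrations can sharpen a posterior over tasks with no parameter updates.

```python
import numpy as np

# Hypothetical latent tasks the "model" already knows from pre-training.
TASKS = {
    "add_one": lambda x: x + 1,
    "double": lambda x: 2 * x,
    "square": lambda x: x * x,
}

def task_posterior(examples, noise=0.05):
    """Posterior over tasks after seeing (x, y) demonstrations.

    Each demonstration has likelihood 1 - noise under a task that
    reproduces it, and noise otherwise.
    """
    names = list(TASKS)
    post = np.full(len(names), 1.0 / len(names))  # uniform prior
    for x, y in examples:
        like = np.array([1 - noise if TASKS[n](x) == y else noise for n in names])
        post = post * like
        post = post / post.sum()
    return dict(zip(names, post))

post1 = task_posterior([(2, 4)])          # ambiguous: double and square both fit
post2 = task_posterior([(2, 4), (3, 9)])  # only square fits both examples

best = max(post2, key=post2.get)
prediction = TASKS[best](5)
print(post1)
print(post2)
print(f"Inferred task: {best}, prediction for 5: {prediction}")
```

One example leaves "double" and "square" tied; the second breaks the tie. This mirrors how additional in-context examples can disambiguate the intended task.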

Instant Rejection

If asked "Is in-context learning actually learning?" and you say "Yes, the model updates its weights," that is an instant rejection. In-context learning involves ZERO parameter updates. All computation happens in the forward pass, once per generated token. The model's weights are frozen. This is fundamentally different from fine-tuning or training. What changes is the input context, not the parameters.

Scaling Laws in GPT-3

The paper showed smooth power-law improvements across three orders of magnitude of scale:

# Power-law scaling: loss falls as a power law of model size
import numpy as np

# Approximate cross-entropy loss vs. (non-embedding) parameter count.
# The fit L(N) ≈ (N_c / N)^α with α ≈ 0.076 and N_c ≈ 8.8 × 10^13 is from
# Kaplan et al. (2020), "Scaling Laws for Neural Language Models"; the GPT-3
# paper itself reports a similar power law for loss as a function of compute.
def gpt3_loss_vs_params(N):
    """Approximate test loss as a function of parameter count."""
    N_c = 8.8e13
    alpha = 0.076
    return (N_c / N) ** alpha

# The eight model sizes trained in the GPT-3 paper
param_counts = [125e6, 350e6, 760e6, 1.3e9, 2.7e9, 6.7e9, 13e9, 175e9]
model_names = ["125M", "350M", "760M", "1.3B", "2.7B", "6.7B", "13B", "175B"]

for name, N in zip(model_names, param_counts):
    loss = gpt3_loss_vs_params(N)
    print(f"GPT-3 {name:>5s}: Approx loss = {loss:.3f}")

Data Contamination

GPT-3 was one of the first papers to seriously analyze data contamination - the risk that test sets appear in the massive training data. They found contamination in several benchmarks and attempted to measure its effect. This became a template for all subsequent large model evaluations.

60-Second Answer

"GPT-3's key contribution is in-context learning: the ability to perform new tasks by conditioning on a few examples in the prompt, with zero parameter updates. At 175B parameters trained on 570GB of text, GPT-3 showed that scale enables emergent capabilities - few-shot performance that smaller models cannot achieve. This proved that sufficiently large language models are general-purpose few-shot learners, eliminating the need for task-specific fine-tuning in many cases. The mechanism is still debated, but the leading hypothesis is that the model has learned a distribution over tasks during pre-training, and in-context examples serve to locate the relevant task."

Part 4 - InstructGPT: Alignment via RLHF (2022)

The Paper

"Training language models to follow instructions with human feedback" - Ouyang et al., 2022

The Problem

GPT-3 was powerful but poorly behaved. It would:

  • Generate toxic or harmful content when prompted
  • Make up facts confidently (hallucinate)
  • Follow the literal prompt instead of the user's actual intent
  • Produce verbose, unhelpful responses

The core issue: the language modeling objective optimizes for predicting likely text, not for being helpful, harmless, and honest. A model trained to predict web text will produce text that looks like web text - including all its toxicity, misinformation, and irrelevance.

The Three-Stage RLHF Pipeline

InstructGPT introduced the three-stage pipeline that would become the standard for aligning language models:

InstructGPT RLHF Pipeline: SFT → Reward Model → PPO

Stage 1: Supervised Fine-Tuning (SFT)

Collect a dataset of prompts and human-written ideal responses. Fine-tune GPT-3 on this data using standard supervised learning:

$$\mathcal{L}_{\text{SFT}} = -\sum_{t} \log P(y_t \mid y_{<t}, x; \theta)$$

Where $x$ is the prompt and $y$ is the human-written response.

This produces a model that follows instructions but is not yet optimized for quality. The SFT model is a starting point, not the final product.
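The SFT loss sums only over response tokens $y_t$; the prompt $x$ is conditioning context. One common way to implement this, shown here as a sketch with a hypothetical `sft_loss` helper and toy logits, is to compute per-token losses over the concatenated sequence and mask out the prompt positions. Whether InstructGPT's training code masked prompts in exactly this way is an assumption.

```python
import numpy as np

def sft_loss(logits, token_ids, prompt_len):
    """Negative log-likelihood of the response tokens only.

    logits: (T, vocab) scores; row t predicts the token at position t+1
    token_ids: (T,) prompt tokens followed by response tokens
    prompt_len: number of prompt tokens at the start of token_ids
    """
    pred, targets = logits[:-1], token_ids[1:]
    shifted = pred - pred.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    # targets[i] is the token at position i+1; keep only response positions.
    is_response = np.arange(1, len(token_ids)) >= prompt_len
    return (token_nll * is_response).sum()

# Toy example: 3 prompt tokens, 2 response tokens, vocabulary of 10.
tokens = np.array([5, 1, 7, 2, 3])
logits = np.zeros((5, 10))  # uniform predictions
loss = sft_loss(logits, tokens, prompt_len=3)
print(f"SFT loss over 2 response tokens: {loss:.4f}")  # 2 * ln(10)
```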

Stage 2: Reward Model Training

Collect comparison data: for a given prompt, generate multiple responses and have humans rank them from best to worst. Train a reward model $R(x, y)$ to predict which response humans will prefer.

The loss function uses the Bradley-Terry model of pairwise comparisons:

$$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\left(R(x, y_w) - R(x, y_l)\right) \right]$$

Where $y_w$ is the preferred response and $y_l$ is the less-preferred response.

import numpy as np

def reward_model_loss(r_preferred, r_rejected):
    """
    Bradley-Terry pairwise loss for reward modeling.

    r_preferred: reward score for the human-preferred response
    r_rejected: reward score for the rejected response
    """
    # We want r_preferred > r_rejected.
    # The loss pushes the gap to be positive and large.
    # -log(sigmoid(d)) = log(1 + exp(-d)); logaddexp keeps it numerically stable.
    return np.logaddexp(0.0, -(r_preferred - r_rejected))

# Example: reward model correctly ranks
r_good = 2.5   # Human-preferred response
r_bad = -1.0   # Rejected response
print(f"Loss (correct ranking): {reward_model_loss(r_good, r_bad):.4f}")
# Low loss - model agrees with humans

# Example: reward model incorrectly ranks
r_good = -0.5  # Human-preferred response scores lower
r_bad = 1.5    # Rejected response scores higher
print(f"Loss (incorrect ranking): {reward_model_loss(r_good, r_bad):.4f}")
# High loss - model disagrees with humans

Stage 3: PPO Optimization

Use the reward model to optimize the language model via Proximal Policy Optimization (PPO). The objective, which is maximized rather than minimized, is:

$$\mathcal{L}_{\text{PPO}} = \mathbb{E}_{x \sim D, y \sim \pi_\theta} \left[ R(x, y) - \beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{SFT}}) \right]$$

The KL penalty is critical: it prevents the model from drifting too far from the SFT model, which would lead to reward hacking - generating degenerate text that scores highly on the reward model but is nonsensical.
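The quantity inside the expectation can be sketched for a single sampled response as below. This is an illustration only: the `kl_penalized_reward` helper, the per-token probabilities, and the default `beta` are assumptions, and the KL term is a one-sample estimate built from the log-probabilities of the sampled tokens. The PPO update itself (clipping, value function, advantages) is omitted.

```python
import numpy as np

def kl_penalized_reward(reward, logp_policy, logp_sft, beta=0.02):
    """Reward-model score minus a KL penalty toward the SFT model.

    reward: scalar R(x, y) from the reward model
    logp_policy: (T,) log pi_theta(y_t | ...) for the sampled response
    logp_sft: (T,) log pi_SFT(y_t | ...) for the same tokens
    beta: KL coefficient (illustrative value)
    """
    # One-sample estimate of KL(pi_theta || pi_SFT) on this response.
    kl_estimate = np.sum(logp_policy - logp_sft)
    return reward - beta * kl_estimate, kl_estimate

# Policy identical to SFT: no penalty.
same = np.log(np.array([0.5, 0.4]))
shaped_same, kl_same = kl_penalized_reward(2.0, same, same)

# Policy has drifted: it is far more confident than the SFT model.
drifted = np.log(np.array([0.9, 0.8]))
shaped_drift, kl_drift = kl_penalized_reward(2.0, drifted, same)

print(f"No drift: KL = {kl_same:.4f}, shaped reward = {shaped_same:.4f}")
print(f"Drifted:  KL = {kl_drift:.4f}, shaped reward = {shaped_drift:.4f}")
```

A larger `beta` pulls the policy back toward the SFT model more strongly; setting it to zero removes the guard against reward hacking.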

PPO Training Loop with KL Penalty

Results

The results were striking:

| Model | Parameters | Human Preference Rate |
|---|---|---|
| GPT-3 (175B, no alignment) | 175B | Baseline |
| InstructGPT (SFT only) | 1.3B | Preferred over GPT-3 175B |
| InstructGPT (SFT + RLHF) | 1.3B | Strongly preferred over GPT-3 175B |

The most important finding: a 1.3B parameter model with RLHF was preferred by humans over a 175B parameter model without alignment. Alignment is not just a safety measure - it is a capability amplifier.

Common Trap

Do not describe RLHF as "just making the model nicer." InstructGPT showed that RLHF improves helpfulness, reduces hallucination, and makes the model better at following complex instructions. It is a fundamental training methodology that improves capability, not just safety. The aligned 1.3B model was preferred over the unaligned 175B model - alignment actually unlocks capability that raw pre-training does not surface.

The Alignment Tax

InstructGPT also discussed the "alignment tax" - the tradeoff between alignment and raw task performance. RLHF can reduce performance on some traditional NLP benchmarks while dramatically improving real-world usefulness. The paper reported that mixing pre-training gradients into the PPO updates (the PPO-ptx variant) largely mitigated these regressions.

60-Second Answer

"InstructGPT solved the alignment problem for language models using a three-stage pipeline: (1) supervised fine-tuning on human demonstrations, (2) training a reward model on human preference comparisons, and (3) optimizing the language model against the reward model using PPO with a KL penalty. The key result was that a 1.3B RLHF model was preferred by humans over the 175B GPT-3, proving that alignment is a capability multiplier, not just a safety constraint. The KL penalty prevents reward hacking by keeping the model close to the SFT distribution."

Part 5 - GPT-4: Multimodal and Mixture of Experts (2023)

The Paper

"GPT-4 Technical Report" - OpenAI, 2023

What We Know

The GPT-4 technical report is notably sparse on details. OpenAI stated: "Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar."

Despite this, enough has been reported and leaked to sketch a plausible, though unconfirmed, picture.

Multimodal Input

GPT-4 was the first GPT model to accept both text and image inputs:

$$P(y_t \mid y_{<t}, x_{\text{text}}, x_{\text{image}}; \theta)$$

Images are processed through a vision encoder (likely a ViT variant) that produces patch embeddings, which are projected into the same embedding space as text tokens. The Transformer then attends over both text and image tokens.
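Since GPT-4's vision pipeline is undisclosed, the following is only a generic ViT-style sketch of the idea described above: cut the image into patches, flatten each patch, and project it to the model dimension so that it can sit in the same sequence as text embeddings. All sizes, the `image_to_tokens` helper, and the random weights are illustrative assumptions. A real vision encoder would also run Transformer layers over the patches, which is omitted here.

```python
import numpy as np

def image_to_tokens(image, patch_size, W_proj):
    """Split an (H, W, C) image into patches and project each to d_model."""
    H, W, C = image.shape
    p = patch_size
    assert H % p == 0 and W % p == 0, "image must tile evenly into patches"
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C): one block per patch
    patches = image.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, p * p * C)  # (num_patches, p*p*C)
    return patches @ W_proj                   # (num_patches, d_model)

rng = np.random.default_rng(0)
d_model = 16
image = rng.normal(size=(8, 8, 3))                      # toy 8x8 RGB image
W_proj = rng.normal(size=(4 * 4 * 3, d_model)) * 0.02   # patch projection

image_tokens = image_to_tokens(image, patch_size=4, W_proj=W_proj)
text_tokens = rng.normal(size=(5, d_model))             # stand-in text embeddings

# One sequence containing both modalities; attention then runs over all of it.
sequence = np.concatenate([image_tokens, text_tokens], axis=0)
print(f"Image tokens: {image_tokens.shape}, full sequence: {sequence.shape}")
```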

GPT-4 Multimodal Architecture

Mixture of Experts (MoE) - The Rumored Architecture

Several unofficial sources (George Hotz, leaked information) suggest GPT-4 uses a Mixture of Experts architecture. None of the figures below has been confirmed by OpenAI:

  • ~1.8 trillion total parameters across all experts
  • 8 experts per MoE layer, with top-2 routing (2 experts active per token)
  • ~220B parameters active per forward pass (not all 1.8T)
  • 16 inference passes for a single response (speculative decoding or similar)

The MoE architecture replaces the standard FFN layer with multiple expert FFN layers and a routing mechanism:

$$\text{MoE}(x) = \sum_{i=1}^{E} g_i(x) \cdot \text{FFN}_i(x)$$

Where $g_i(x)$ is the gating function that determines how much each expert contributes:

$$g(x) = \text{TopK}\left(\text{softmax}(W_g \cdot x)\right)$$

With top-2 routing, only 2 of 8 experts are active per token, so each token uses roughly $\frac{2}{8} = 25\%$ of the expert (FFN) parameters. Attention and embedding parameters are shared across tokens and are always active.

import numpy as np

def mixture_of_experts(x, expert_ffns, gate_weights, top_k=2):
    """
    Simplified MoE forward pass for a single token.

    x: input tensor (d_model,)
    expert_ffns: list of expert functions
    gate_weights: (num_experts, d_model) gating matrix
    top_k: number of experts to route to
    """
    # Compute gating scores
    scores = gate_weights @ x  # (num_experts,)

    # Softmax over experts (subtract the max for numerical stability)
    exp_scores = np.exp(scores - scores.max())
    probs = exp_scores / exp_scores.sum()

    # Select top-k experts
    top_indices = np.argsort(probs)[-top_k:]

    # Renormalize probabilities over selected experts
    top_probs = probs[top_indices]
    top_probs = top_probs / top_probs.sum()

    # Compute weighted sum of expert outputs
    output = np.zeros_like(x)
    for idx, prob in zip(top_indices, top_probs):
        output += prob * expert_ffns[idx](x)

    return output

# Example: 8 experts, top-2 routing
# Total expert params: 8 × FFN_size, but only 2 × FFN_size used per token
num_experts = 8
top_k = 2
active_fraction = top_k / num_experts
print(f"Active fraction: {active_fraction:.1%}")  # 25.0%
print(f"If total params = 1.8T, active params ≈ {1.8 * active_fraction:.2f}T")
# This naive estimate (≈ 0.45T) does not match the rumored ~220B active figure.
# 1.8T / 8 ≈ 225B is roughly one expert's share, and shared layers such as
# attention would add to the active count, not reduce it. The leaked numbers
# are unconfirmed and do not fully reconcile with each other.

Why MoE?

The key advantage of MoE is decoupling total model capacity from per-token compute cost:

| Architecture | Total Parameters | Active per Token | Compute per Token |
|---|---|---|---|
| Dense 175B (GPT-3) | 175B | 175B | High |
| Dense 1.8T (hypothetical) | 1.8T | 1.8T | Extreme |
| MoE 1.8T (GPT-4 style) | 1.8T | ~220B | Moderate |

MoE gives you the knowledge capacity of a 1.8T model with the inference cost of a ~220B model. The tradeoff is higher memory requirements (all experts must be loaded) and load balancing complexity.
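The load-balancing problem is usually addressed with an auxiliary loss that penalizes routers for sending most tokens to a few experts. The sketch below follows the form used in the Switch Transformer (Fedus et al.), which is an assumption here: nothing public confirms what, if anything, GPT-4 uses. The `load_balancing_loss` helper and the toy routing data are made up for illustration.

```python
import numpy as np

def load_balancing_loss(router_probs, chosen_expert):
    """Switch-Transformer-style auxiliary loss: E * sum_i f_i * P_i.

    router_probs: (num_tokens, num_experts) softmax outputs of the router
    chosen_expert: (num_tokens,) index of the expert each token was sent to
    """
    num_tokens, num_experts = router_probs.shape
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(chosen_expert, minlength=num_experts) / num_tokens
    # P_i: mean router probability assigned to expert i
    P = router_probs.mean(axis=0)
    return num_experts * np.sum(f * P)

# Balanced routing: 4 tokens spread evenly over 4 experts.
balanced_probs = np.full((4, 4), 0.25)
balanced = load_balancing_loss(balanced_probs, np.array([0, 1, 2, 3]))

# Collapsed routing: every token goes to expert 0 with probability 1.
collapsed_probs = np.zeros((4, 4))
collapsed_probs[:, 0] = 1.0
collapsed = load_balancing_loss(collapsed_probs, np.array([0, 0, 0, 0]))

print(f"Balanced routing loss:  {balanced:.2f}")
print(f"Collapsed routing loss: {collapsed:.2f}")
```

The loss is smallest when tokens are spread uniformly and grows as routing collapses onto a few experts, so adding a small multiple of it to the training loss discourages expert collapse.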

GPT-4 Performance

GPT-4 achieved remarkable results on professional and academic exams:

| Exam | GPT-3.5 Percentile | GPT-4 Percentile |
|---|---|---|
| Bar Exam (Uniform) | ~10th | ~90th |
| SAT Math | ~70th | ~89th |
| SAT EBRW | ~87th | ~93rd |
| GRE Quantitative | ~25th | ~80th |
| AP Biology | ~62nd | ~85th-100th |
| AP Chemistry | ~22nd-46th | ~71st-85th |
| LSAT | ~40th | ~88th |

Predictable Scaling

Perhaps the most scientifically important contribution of the GPT-4 report was demonstrating predictable loss scaling. OpenAI trained small models on reduced compute and accurately predicted GPT-4's final loss before training the full model:

$$L(C) = aC^{-\alpha} + L_{\infty}$$

They claimed the prediction was accurate to within a small margin for a model that cost $100M+ to train - validating that scaling laws work reliably enough for planning purposes.
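To illustrate the idea of fitting $L(C) = aC^{-\alpha} + L_{\infty}$ on cheap runs and extrapolating, here is a small sketch on synthetic data. The constants, compute units, and the grid-search fitting procedure in `fit_scaling_law` are all assumptions for demonstration; OpenAI has not published its fitting code or data.

```python
import numpy as np

def fit_scaling_law(C, L, step=0.01):
    """Fit L(C) = a * C**(-alpha) + L_inf.

    Grid-search the irreducible loss L_inf; for each candidate, the rest is
    a straight line in log-log space: log(L - L_inf) = log(a) - alpha*log(C).
    """
    C, L = np.asarray(C, dtype=float), np.asarray(L, dtype=float)
    log_C = np.log(C)
    best = None
    for L_inf in np.arange(0.0, L.min(), step):
        log_excess = np.log(L - L_inf)
        slope, intercept = np.polyfit(log_C, log_excess, 1)
        resid = np.sum((log_excess - (slope * log_C + intercept)) ** 2)
        if best is None or resid < best[0]:
            best = (resid, np.exp(intercept), -slope, L_inf)
    _, a, alpha, L_inf = best
    return a, alpha, L_inf

# Synthetic "small runs" generated from made-up constants (arbitrary units).
true_a, true_alpha, true_L_inf = 10.0, 0.1, 1.5
small_C = np.array([1e0, 1e1, 1e2, 1e3, 1e4])
small_L = true_a * small_C ** -true_alpha + true_L_inf

a, alpha, L_inf = fit_scaling_law(small_C, small_L)

# Extrapolate to a run with 10,000x more compute than the largest small run.
big_C = 1e8
predicted = a * big_C ** -alpha + L_inf
true_value = true_a * big_C ** -true_alpha + true_L_inf
print(f"Fitted: a={a:.3f}, alpha={alpha:.3f}, L_inf={L_inf:.2f}")
print(f"Predicted loss at C=1e8: {predicted:.4f} (true: {true_value:.4f})")
```

On noise-free synthetic data the extrapolation is essentially exact; on real training runs the fit is noisier, and the extrapolation is only as good as the assumed functional form.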

Instant Rejection

If asked "How many parameters does GPT-4 have?" and you answer with certainty, that is a red flag. OpenAI has not officially disclosed GPT-4's architecture. The ~1.8T MoE figure comes from leaks, not official sources. The correct answer is: "OpenAI has not officially disclosed GPT-4's architecture. Leaked information suggests approximately 1.8 trillion total parameters with a Mixture of Experts architecture using 8 experts and top-2 routing, giving roughly 220 billion active parameters per forward pass. But this is unconfirmed."

60-Second Answer

"GPT-4 represents two major advances: multimodal input (text + images) and massive scale, likely via a Mixture of Experts architecture. MoE decouples capacity from compute by having many expert FFN layers but routing each token to only the top-k experts. The rumored architecture has ~1.8T total parameters but ~220B active per token. GPT-4 demonstrated near-expert performance on professional exams and showed that loss scaling is predictable enough to plan $100M training runs. The most important insight is that MoE allows continued scaling without proportional compute increases."

Part 6 - The Conceptual Arc: Five Paradigm Shifts

The Evolution

Each GPT generation introduced a fundamentally new idea about how to use language models:

GPT Evolution: GPT-1 through GPT-4

| Generation | Key Paradigm Shift | What It Proved |
|---|---|---|
| GPT-1 | Pre-train then fine-tune | Unsupervised pre-training learns transferable representations |
| GPT-2 | Task-agnostic prompting | Large LMs implicitly learn to perform tasks without fine-tuning |
| GPT-3 | In-context learning | Few examples in the prompt suffice; no gradient updates needed |
| InstructGPT | RLHF alignment | Human feedback makes small models outperform large unaligned ones |
| GPT-4 | Multimodal MoE | MoE scales capacity without proportional compute; vision + language unification |

The Scaling Hypothesis

The deepest lesson of the GPT series is the scaling hypothesis: increasing model size, data size, and compute - in the right proportions - reliably produces qualitatively new capabilities. Abilities like in-context learning, chain-of-thought reasoning, and multimodal understanding were not explicitly programmed; they emerged from scale.

$$\text{Capability} = f(\text{Parameters}, \text{Data}, \text{Compute})$$

The relationship is not linear - it follows power laws with occasional emergent phase transitions where new capabilities appear suddenly as scale increases.

Part 7 - Comparing GPT to BERT

This comparison is among the most commonly asked interview questions. Understanding both sides of the architecture fork is critical.

| Aspect | GPT Series | BERT |
|---|---|---|
| Architecture | Transformer decoder | Transformer encoder |
| Directionality | Unidirectional (causal) | Bidirectional |
| Pre-training objective | Next token prediction | Masked language modeling |
| Adaptation paradigm | Prompting / in-context learning | Fine-tuning |
| Strengths | Generation, reasoning, few-shot | Understanding, classification, retrieval |
| Scaling trajectory | 117M → 1.8T (continued scaling) | 110M → 340M (largely stopped) |
| Modern relevance | Dominant paradigm (ChatGPT, Claude, etc.) | Still used in search, NER, embeddings |
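The directionality row is the structural heart of this comparison, and it comes down to the attention mask. A minimal sketch (the `attention_mask` helper is hypothetical):

```python
import numpy as np

def attention_mask(seq_len, causal):
    """Entry [i, j] is 1 if query position i may attend to key position j."""
    full = np.ones((seq_len, seq_len), dtype=int)
    # Causal (GPT): keep only the lower triangle, so no looking ahead.
    # Bidirectional (BERT): every position sees every other position.
    return np.tril(full) if causal else full

gpt_mask = attention_mask(4, causal=True)
bert_mask = attention_mask(4, causal=False)

print("GPT (causal):")
print(gpt_mask)
print("BERT (bidirectional):")
print(bert_mask)
```

The causal mask is what lets a GPT-style model generate left to right: the representation at position i never depends on tokens after i, so it can be used to predict token i+1.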

Why GPT Won

The GPT paradigm ultimately became dominant for a simple reason: autoregressive generation is a universal interface. Any task - classification, translation, QA, reasoning, coding - can be expressed as "generate the right text." BERT's bidirectional architecture is inherently limited to encoding, not generation.

As models scaled, the advantage of bidirectional context for understanding tasks was overwhelmed by the versatility and emergent capabilities of autoregressive models.

Company Variation

At Google, BERT-style encoder models are still heavily used in production (Search ranking, Ads quality). At OpenAI, Anthropic, and most startups, the GPT-style decoder-only architecture dominates. Know your audience.

Part 8 - Implementation: Building a Mini-GPT

Understanding the GPT architecture means being able to implement it. Here is a simplified NumPy sketch of the forward pass through a stack of GPT blocks (no embeddings, learned LayerNorm parameters, or training loop):

import numpy as np

class GPTBlock:
    """A single GPT transformer block."""

    def __init__(self, d_model, n_heads, d_ff):
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.d_ff = d_ff

        # Multi-head self-attention weights
        self.W_Q = np.random.randn(d_model, d_model) * 0.02
        self.W_K = np.random.randn(d_model, d_model) * 0.02
        self.W_V = np.random.randn(d_model, d_model) * 0.02
        self.W_O = np.random.randn(d_model, d_model) * 0.02

        # Feed-forward network weights
        self.W1 = np.random.randn(d_model, d_ff) * 0.02
        self.b1 = np.zeros(d_ff)
        self.W2 = np.random.randn(d_ff, d_model) * 0.02
        self.b2 = np.zeros(d_model)

    def causal_attention(self, x):
        """Multi-head causal (masked) self-attention."""
        seq_len = x.shape[0]

        def split_heads(t):
            # (seq_len, d_model) -> (n_heads, seq_len, d_k)
            return t.reshape(seq_len, self.n_heads, self.d_k).transpose(1, 0, 2)

        Q = split_heads(x @ self.W_Q)
        K = split_heads(x @ self.W_K)
        V = split_heads(x @ self.W_V)

        # Scaled dot-product scores, computed independently per head
        scores = Q @ K.transpose(0, 2, 1) / np.sqrt(self.d_k)  # (n_heads, seq_len, seq_len)

        # Apply causal mask: position i can only attend to positions <= i
        mask = np.triu(np.ones((seq_len, seq_len)) * -1e9, k=1)
        scores = scores + mask  # mask broadcasts over the head dimension

        # Softmax over key positions
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)

        # Weighted sum per head, then concatenate heads back to d_model
        heads = weights @ V  # (n_heads, seq_len, d_k)
        output = heads.transpose(1, 0, 2).reshape(seq_len, self.d_model)
        return output @ self.W_O

    def feed_forward(self, x):
        """Position-wise feed-forward network with GELU activation."""
        # GELU approximation (used in GPT-2+)
        h = x @ self.W1 + self.b1
        h = h * 0.5 * (1 + np.tanh(np.sqrt(2/np.pi) * (h + 0.044715 * h**3)))
        return h @ self.W2 + self.b2

    def forward(self, x):
        """Pre-norm transformer block (GPT-2 style)."""
        # Self-attention with residual (pre-norm)
        x = x + self.causal_attention(layer_norm(x))
        # Feed-forward with residual (pre-norm)
        x = x + self.feed_forward(layer_norm(x))
        return x


def layer_norm(x, eps=1e-5):
    """Layer normalization (learned scale and bias omitted for brevity)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


# GPT-2 Small dimensions
d_model = 768
n_heads = 12
d_ff = 3072
n_layers = 12

# Create a stack of transformer blocks
blocks = [GPTBlock(d_model, n_heads, d_ff) for _ in range(n_layers)]

# Forward pass through all layers
x = np.random.randn(10, d_model)  # 10 tokens (stand-in for token + position embeddings)
for block in blocks:
    x = block.forward(x)

print(f"Input shape: (10, {d_model})")
print(f"Output shape: {x.shape}")  # (10, 768)
# Final output goes through LayerNorm → Linear (d_model, vocab_size) → Softmax

Practice Problems

Problem 1: Pre-Training Objective

Explain why next-token prediction is a sufficient objective for learning general-purpose language representations. What types of knowledge must the model acquire to predict the next token well?

Hint

To predict the next token, the model must learn: (1) syntax (grammar rules determine what tokens are valid next), (2) semantics (meaning constrains what is likely), (3) world knowledge (facts about the world help predict factual text), (4) reasoning (logical sequences require reasoning to continue), (5) common sense (understanding everyday situations). The beauty of next-token prediction is that ALL of these are needed to minimize the loss on diverse text.

Problem 2: In-Context Learning Mechanism

A Transformer is given 5 examples of "input → output" in its context, followed by a new input. The model produces the correct output. Explain two hypotheses for how this works, given that NO parameter updates occur.

Hint

Hypothesis 1 (Task location): The model has already learned many tasks during pre-training. The examples serve as a "pointer" to the correct task in the model's learned distribution. The model identifies the task and applies its pre-existing knowledge. Supported by: randomizing labels in examples barely hurts performance on some tasks (Min et al., 2022). Hypothesis 2 (Mesa-optimization): The Transformer implements an implicit learning algorithm in its forward pass. The attention mechanism can compute something resembling gradient descent over the examples. Supported by: Transformers can provably learn to implement linear regression in-context (Akyürek et al., 2022).

Problem 3: RLHF Failure Modes

What happens if you remove the KL penalty from the PPO objective in RLHF? Describe the failure mode and explain why the KL term is necessary.

Hint

Without the KL penalty, the model will "reward hack" - it will find degenerate text patterns that score highly on the reward model but are nonsensical to humans. For example, it might repeat certain phrases the reward model has learned to associate with quality, or produce adversarial outputs that exploit reward model weaknesses. The KL penalty $\beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{SFT}})$ keeps the model's output distribution close to the SFT model, preventing it from drifting into regions where the reward model's predictions are unreliable.

Problem 4: MoE vs Dense Trade-offs

You are designing a new LLM and must choose between a 200B dense model and a 1T MoE model with 8 experts (top-2 routing). Both have approximately the same active parameter count per token. What are the trade-offs for training, inference, and deployment?

Hint

Training: MoE must hold all experts in memory (roughly 5x the weight memory of the dense model, since 1T / 200B = 5), but it has higher capacity. MoE also requires load-balancing losses to prevent expert collapse.

Inference: Both have similar per-token compute (FLOPs), but MoE still needs about 5x the memory to keep every expert resident. Batching is harder with MoE because different tokens route to different experts, creating uneven workloads.

Deployment: MoE requires model parallelism across more GPUs because of its memory footprint. Dense models are simpler to shard. However, MoE offers better quality-per-FLOP if memory is not the bottleneck.
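
The 5x figure and the "similar active parameters" claim can be checked with back-of-the-envelope arithmetic. The sketch below assumes 2 bytes per parameter (bf16/fp16 weights, no optimizer state or activations) and, as a deliberate simplification, that all MoE parameters live in equally sized experts; real MoE models also have shared attention and embedding weights.

```python
BYTES_PER_PARAM = 2  # assumption: bf16/fp16 weights only, no optimizer state

def weight_memory_gb(total_params):
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return total_params * BYTES_PER_PARAM / 1e9

def moe_active_params(total_params, num_experts, top_k):
    """Active params per token, assuming ALL params sit in equally sized experts."""
    return total_params * top_k / num_experts

dense_params = 200e9   # 200B dense model from the problem statement
moe_params = 1e12      # 1T MoE model from the problem statement

dense_mem = weight_memory_gb(dense_params)
moe_mem = weight_memory_gb(moe_params)
memory_ratio = moe_mem / dense_mem
moe_active = moe_active_params(moe_params, num_experts=8, top_k=2)
```

Under these assumptions the MoE model needs five times the weight memory while activating about 250B parameters per token, the same order as the 200B dense model.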

Problem 5: GPT Timeline

For each GPT generation, name the single most important idea and explain why it could not have happened at a smaller scale.

Hint

  • GPT-1 (pre-train + fine-tune): Could work at any scale - even 117M parameters showed clear benefits from pre-training.
  • GPT-2 (zero-shot): Needed ~1B+ parameters - smaller models cannot perform tasks zero-shot because they lack the capacity to implicitly encode task descriptions.
  • GPT-3 (in-context learning): Needed ~100B+ parameters - few-shot in-context learning is an emergent capability that appears only at sufficient scale.
  • InstructGPT (RLHF): Could theoretically work at any scale, but RLHF's impact is most dramatic at scales where the base model has rich capabilities that alignment can surface.
  • GPT-4 (multimodal, reportedly MoE): The architecture is not officially disclosed, but MoE is specifically a technique for scaling beyond what dense models can economically achieve.

Interview Cheat Sheet

| Question | Key Points |
| --- | --- |
| "What did GPT-1 introduce?" | Pre-train Transformer decoder on next-token prediction, then fine-tune for downstream tasks. 117M params on BooksCorpus. |
| "How is GPT-2 different from GPT-1?" | Paradigm shift from fine-tuning to zero-shot. 1.5B params. Showed tasks emerge from scale without task-specific training. |
| "Explain in-context learning (GPT-3)" | Conditioning on a few examples in the prompt, zero parameter updates. 175B params. Leading hypothesis: task location, not learning. |
| "How does RLHF work?" | Three stages: SFT on demonstrations, reward model from comparisons, PPO with KL penalty. Labelers preferred the 1.3B InstructGPT model's outputs over those of the 175B GPT-3. |
| "What is GPT-4's architecture?" | Not officially disclosed. Unconfirmed reports: ~1.8T MoE with 8 experts, top-2 routing, ~220B active params. Multimodal (text + image). |
| "Why MoE?" | Decouples capacity from per-token compute. GPT-4 is reportedly 1.8T total but ~220B active. More knowledge at roughly the per-token compute of a much smaller dense model. |
| "GPT vs BERT?" | GPT: decoder, autoregressive, generation-focused, scales to prompting. BERT: encoder, bidirectional, understanding-focused, requires fine-tuning. |
| "Why did GPT win over BERT?" | Autoregressive generation is a universal interface. Any task = generate the right text. BERT is limited to encoding. |
| "What is the alignment tax?" | RLHF may slightly reduce benchmark performance but dramatically improves real-world usefulness. Tax is small and sometimes negative. |
| "What is reward hacking?" | Without KL penalty, models exploit reward model weaknesses. KL constraint keeps outputs close to SFT distribution. |

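The "Why MoE?" answer rests on the routing step: a gate scores every expert, but only the top-k actually run. Below is a minimal sketch of top-2 routing, assuming softmax gate probabilities renormalized over the selected experts. The eight "experts" are scalar stand-ins for feed-forward blocks and the gate logits are hypothetical; production routers differ in details such as noise, capacity limits, and load-balancing losses.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(gate_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are the softmax gate probabilities renormalized over the chosen k.
    """
    probs = softmax(gate_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(x, gate_logits, experts, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))

# Eight hypothetical experts; each scalar function stands in for an FFN block.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
gate_logits = [0.1, 2.0, -1.0, 0.3, 1.5, -0.5, 0.0, 0.2]  # hypothetical router output

routes = top_k_route(gate_logits, k=2)
output = moe_layer(3.0, gate_logits, experts, k=2)
```

Only two of the eight expert functions are evaluated for this token, which is why total capacity can grow without a matching increase in per-token compute.
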
Spaced Repetition Checkpoints

Day 0 (Today)

  • Explain the key innovation of each GPT generation in one sentence
  • Draw the three-stage RLHF pipeline from memory
  • Explain in-context learning and why it involves zero parameter updates

Day 3

  • Write the RLHF loss functions (SFT, reward model, PPO) from memory
  • Explain MoE architecture: gating, top-k routing, active parameters
  • Compare GPT vs BERT across 5 dimensions

Day 7

  • Give a 10-minute presentation on the GPT series evolution
  • Explain two hypotheses for how in-context learning works
  • Discuss GPT-4's multimodal architecture

Day 14

  • Mock interview: answer all 10 cheat sheet questions
  • Explain why the scaling hypothesis matters
  • Discuss reward hacking and the KL penalty

Day 21

  • Full 20-minute paper discussion simulation covering the GPT series
  • Handle follow-up questions about scaling, alignment, and MoE
  • Discuss the future of the GPT paradigm and open questions

Next Steps

You have now traced the most important lineage in modern AI - from GPT-1's pre-training insight to GPT-4's multimodal, reportedly MoE-based design. Next, step into a different branch of deep learning history with Chapter 6: ResNet and Skip Connections - the paper that proved depth matters and solved the degradation problem that had limited neural networks for years.

© 2026 EngineersOfAI. All rights reserved.