GPT Series - The Arc from 117M to a Trillion Parameters

Reading time: ~50 min | Interview relevance: Critical | Roles: MLE, AI Eng, Research Engineer, LLM Engineer, NLP Engineer

The Real Interview Moment

You are in an OpenAI research engineer interview. The interviewer leans back and says: "Walk me through the GPT lineage - GPT-1 through GPT-4. For each model, tell me the single most important idea it introduced, and why that idea changed the field. I do not want a parameter count recitation. I want to know what each paper proved."

You start with GPT-1, and she immediately probes: "GPT-1 used the same architecture as the Transformer decoder. Why was unsupervised pre-training with a language model objective the key insight, and why had nobody done it effectively before?" You explain the pre-train then fine-tune paradigm, and she follows up: "GPT-2 dropped fine-tuning entirely. Why? And how does in-context learning in GPT-3 actually work - is it gradient-free learning?"

This is the most important lineage in modern AI. Every model in the GPT series introduced an idea that reshaped how the field thinks about language, scale, and intelligence. Candidates who can only recite parameter counts get a "no-hire." Candidates who can articulate the conceptual leap at each generation - and explain why scale enabled those leaps - get a "strong hire."

What You Will Master

Explain GPT-1's contribution: unsupervised pre-training + supervised fine-tuning
Describe GPT-2's paradigm shift: task-agnostic multitask learning via zero-shot
Derive GPT-3's in-context learning mechanism and explain few-shot prompting
Explain InstructGPT's RLHF pipeline: SFT, reward modeling, PPO
Discuss GPT-4's multimodal capabilities and rumored MoE architecture
Trace the conceptual evolution from fine-tuning to prompting to alignment
Compare the GPT series to BERT and modern alternatives

Self-Assessment: Where Are You Now?

Skill	1 - Cannot	2 - Vaguely	3 - Can Explain	4 - Can Derive	5 - Can Teach	Your Score
Explain GPT-1's pre-training objective						___
Explain why GPT-2 dropped fine-tuning						___
Describe in-context learning (GPT-3)						___
Explain few-shot vs zero-shot vs one-shot						___
Describe the RLHF pipeline (InstructGPT)						___
Explain reward modeling and PPO						___
Discuss GPT-4's multimodal capabilities						___
Explain the MoE architecture hypothesis						___
Trace the paradigm shifts across generations						___
Compare GPT vs BERT design philosophies						___

Target: All 4s and 5s before your interview.

Part 1 - GPT-1: Unsupervised Pre-Training (2018)

The Paper

"Improving Language Understanding by Generative Pre-Training" - Radford et al., 2018

The Core Idea

Before GPT-1, most NLP models were trained from scratch on each task. Word2Vec and GloVe provided pre-trained word embeddings, but the rest of the network still had to be trained on labeled data. GPT-1 proved that a two-stage approach works dramatically better:

Pre-train a Transformer decoder on a large unlabeled text corpus using a language modeling objective
Fine-tune the same model on each downstream task with minimal architectural changes

The language modeling objective is simple next-token prediction:

$\mathcal{L}_{\text{LM}} = -\sum_{t=1}^{T} \log P(w_t | w_1, w_2, \ldots, w_{t-1}; \theta)$

The model learns to predict the next word given all previous words. This is unsupervised - no labeled data is needed. The key insight is that next-token prediction on diverse text forces the model to learn syntax, semantics, world knowledge, and reasoning as a byproduct.
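As a minimal sketch of this objective, the loss is the summed negative log-probability the model assigns to each token that actually came next. The toy vocabulary size, the random logits, and the helper name `next_token_nll` are illustrative assumptions, not details from the paper.

```python
import numpy as np

def next_token_nll(logits, token_ids):
    """Summed negative log-likelihood under next-token prediction.

    logits: (T, vocab_size) array; row t holds the model's scores for
            token t, computed from tokens 0..t-1 only.
    token_ids: (T,) integer array with the tokens that actually occurred.
    """
    # Numerically stable log-softmax over the vocabulary
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out log P(w_t | w_<t) at every position and sum
    picked = log_probs[np.arange(len(token_ids)), token_ids]
    return -picked.sum()

rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(5, 10))   # 5 positions, toy vocabulary of 10
toy_tokens = np.array([3, 1, 4, 1, 5])

loss = next_token_nll(toy_logits, toy_tokens)
# A model that knows nothing (all-zero logits) pays log(vocab_size) per token
uniform_loss = next_token_nll(np.zeros((5, 10)), toy_tokens)
print(f"toy loss: {loss:.3f}, uniform-model loss: {uniform_loss:.3f}")
```

The uniform-model value is a useful sanity check: any trained model should score below `T * log(vocab_size)`.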

Architecture Details

Parameter	Value
Architecture	Transformer decoder (12 layers)
Parameters	117M
Hidden size	768
Attention heads	12
Context window	512 tokens
Training data	BooksCorpus (7,000 unpublished books, ~800M words)
Tokenization	BPE (40,000 merges)
Optimizer	Adam with warmup and cosine decay

Figure: GPT-1 Two-Stage Training: Pre-Train then Fine-Tune

How Fine-Tuning Worked

GPT-1's fine-tuning was clever. Instead of adding complex task-specific architectures, every task was reformulated as a sequence:

Classification: [START] text [EXTRACT] → linear layer on the [EXTRACT] token
Entailment: [START] premise [DELIM] hypothesis [EXTRACT]
Similarity: Both orderings concatenated, representations added
Multiple choice: Each option paired with context, scored independently

The fine-tuning loss combined the task loss with the language modeling loss:

$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda \cdot \mathcal{L}_{\text{LM}}$

The auxiliary LM loss ( $\lambda = 0.5$ ) acted as a regularizer, preventing the model from forgetting its pre-trained representations.
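The input transformations and the combined loss can be sketched as follows. The special-token strings follow the notation used above; whitespace tokenization and every function name here are simplifying assumptions (the real model uses BPE tokens and learned embeddings for the special tokens).

```python
START, DELIM, EXTRACT = "[START]", "[DELIM]", "[EXTRACT]"

def format_classification(text):
    """Single text: the classifier reads the state at [EXTRACT]."""
    return [START, *text.split(), EXTRACT]

def format_entailment(premise, hypothesis):
    """Premise and hypothesis joined by a delimiter token."""
    return [START, *premise.split(), DELIM, *hypothesis.split(), EXTRACT]

def format_similarity(text_a, text_b):
    """No natural ordering, so both orderings are encoded.
    The two [EXTRACT] representations would be added downstream."""
    return [format_entailment(text_a, text_b), format_entailment(text_b, text_a)]

def format_multiple_choice(context, options):
    """One sequence per option; each is scored independently."""
    return [format_entailment(context, option) for option in options]

def fine_tuning_loss(task_loss, lm_loss, lam=0.5):
    """Task loss plus the auxiliary language-modeling loss."""
    return task_loss + lam * lm_loss

print(format_entailment("it rains", "the ground is wet"))
print(fine_tuning_loss(task_loss=1.0, lm_loss=2.0))
```

The point of the design is that only the input layout changes per task; the Transformer itself stays the same.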

Results and Impact

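GPT-1 achieved state-of-the-art on 9 of the 12 benchmarks it was evaluated on, including:

Commonsense reasoning (Stories Cloze): 86.5% (+8.9%)
Question answering (RACE): 59.0% (+5.7%)

It did not win everywhere: on the small RTE textual entailment dataset it scored 56.0%, below the best prior result.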

60-Second Answer

"GPT-1 proved that unsupervised pre-training with a language model objective, followed by supervised fine-tuning, dramatically outperforms training from scratch. It used a 12-layer Transformer decoder trained on BooksCorpus to predict the next token. The pre-trained representations captured enough linguistic knowledge that fine-tuning with minimal task-specific modifications achieved state-of-the-art on most benchmarks. The key insight was that next-token prediction is a sufficiently rich objective to learn general-purpose language representations."

Part 2 - GPT-2: Zero-Shot Transfer (2019)

The Paper

"Language Models are Unsupervised Multitask Learners" - Radford et al., 2019

The Paradigm Shift

GPT-2's core claim was radical: a language model trained on enough diverse text can perform tasks without any fine-tuning at all. The argument runs as follows:

Many NLP tasks can be framed as predicting text given some context
A sufficiently large language model trained on diverse data has implicitly seen naturally occurring demonstrations of many tasks
Therefore, the model can perform tasks zero-shot by conditioning on appropriate prompts

Instead of learning a separate $P(\text{output} | \text{input})$ for each task, a single model learns the task-conditioned distribution:

$P(\text{output} | \text{input}, \text{task})$

For example, to translate English to French, you do not fine-tune a translation model. You prompt:

Translate English to French:
sea otter => loutre de mer
cheese => fromage
the cat sat on the mat =>

The model completes the sequence by producing the translation.

Scale Changes Things

GPT-2 demonstrated a crucial principle that would define the next era of AI: scale changes the qualitative behavior of models.

Variant	Parameters	Layers	Hidden Size	Zero-Shot Performance
GPT-2 Small	117M	12	768	Baseline
GPT-2 Medium	345M	24	1024	Better
GPT-2 Large	762M	36	1280	Much better
GPT-2 XL	1.5B	48	1600	Best

The training data was also scaled dramatically:

Aspect	GPT-1	GPT-2
Training data	BooksCorpus (800M words)	WebText (40GB, 8M web pages)
Data curation	Existing dataset	Custom: Reddit links with 3+ karma
Vocabulary	40K BPE	50,257 BPE
Context window	512	1024

Key Results

GPT-2 achieved state-of-the-art on 7 of the 8 language modeling datasets it was tested on, in a zero-shot setting with no training on those datasets. Selected results:

Benchmark	Previous SOTA	GPT-2 (zero-shot)
LAMBADA (last word prediction, perplexity)	99.8	8.6
Children's Book Test (named entities, accuracy)	82.3%	89.1%
Winograd Schema (accuracy)	63.7%	70.7%

The model also generated remarkably coherent long text, which led to the (controversial) decision to initially withhold the full model weights.

Why WebText Mattered

The key data innovation was quality-filtered web text. Instead of crawling the entire web (Common Crawl), Radford et al. scraped all outbound links from Reddit posts with at least 3 upvotes. This produced a dataset that was:

Diverse: Covered every topic discussed on Reddit
Quality-filtered: Human curation via upvotes
Large enough: 40GB of text (~10x BooksCorpus)
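The filtering rule can be sketched in a few lines. The post schema (`url` and `karma` keys), the decision to skip links that point back to Reddit, and the exact-URL deduplication are all assumptions made for illustration; the actual WebText pipeline also extracted and cleaned page text, which is not shown.

```python
from urllib.parse import urlparse

def select_outbound_links(posts, min_karma=3):
    """Keep unique outbound URLs from posts with at least min_karma."""
    seen, kept = set(), []
    for post in posts:
        url = post["url"]
        host = urlparse(url).netloc.lower()
        if post["karma"] < min_karma:
            continue                      # not enough human endorsement
        if host.endswith("reddit.com"):
            continue                      # assumption: internal links are not "outbound"
        if url in seen:
            continue                      # exact-duplicate URL
        seen.add(url)
        kept.append(url)
    return kept

# Hypothetical posts, for illustration only
posts = [
    {"url": "https://example.org/essay", "karma": 12},
    {"url": "https://example.org/essay", "karma": 5},      # duplicate
    {"url": "https://example.com/spam", "karma": 1},       # below threshold
    {"url": "https://www.reddit.com/r/x/1", "karma": 40},  # internal link
]
links = select_outbound_links(posts)
print(links)
```

The karma threshold acts as a cheap human quality signal: no classifier is trained, the votes do the filtering.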

Common Trap

Do not describe GPT-2 as "just a bigger GPT-1." The conceptual leap from "pre-train then fine-tune" to "pre-train and directly prompt" was fundamental. GPT-1 required task-specific fine-tuning with labeled data. GPT-2 showed that task performance emerges from scale alone - no fine-tuning needed. This was the seed of the "foundation model" concept.

# Conceptual difference: GPT-1 vs GPT-2 task adaptation

# GPT-1 approach: Fine-tune on each task
def gpt1_sentiment(text, model, labeled_data):
    """Requires labeled training data and gradient updates."""
    # fine_tune is a hypothetical helper standing in for a supervised training loop;
    # the GPT-1 paper reports 3 epochs at a learning rate of 6.25e-5
    fine_tuned_model = fine_tune(model, labeled_data, epochs=3, lr=6.25e-5)
    return fine_tuned_model.classify(text)

# GPT-2 approach: Zero-shot prompting
def gpt2_sentiment(text, model):
    """No training data needed. Just prompt."""
    prompt = f"Review: {text}\nSentiment:"
    return model.generate(prompt)  # Ideally continues with "positive" or "negative"

Part 3 - GPT-3: In-Context Learning (2020)

The Paper

"Language Models are Few-Shot Learners" - Brown et al., 2020

The Scale Leap

GPT-3 scaled by two orders of magnitude:

Parameter	GPT-2	GPT-3
Parameters	1.5B	175B
Layers	48	96
Hidden size	1600	12,288
Attention heads	25	96
Context window	1024	2048
Training data	40GB	~570GB of filtered Common Crawl, plus WebText2, books, and Wikipedia
Training cost	~$50K (estimated)	~$4.6M (estimated)

In-Context Learning: The Key Innovation

GPT-3 introduced in-context learning - the ability to perform tasks by conditioning on a few examples in the prompt, with no gradient updates whatsoever:

Translate English to French:

sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => girafe en peluche
cheese =>

The model outputs "fromage" without any translation-specific fine-tuning.

Three modes of in-context learning:

$\text{Zero-shot: } P(y | \text{task description}, x)$

$\text{One-shot: } P(y | \text{task description}, x_1, y_1, x)$

$\text{Few-shot: } P(y | \text{task description}, x_1, y_1, \ldots, x_k, y_k, x)$
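The three modes differ only in how many demonstrations are placed in the prompt. A small sketch makes that concrete; the `build_prompt` helper and the `=>` separator are illustrative choices that mirror the translation example above, not an official prompt format.

```python
def build_prompt(task_description, query, examples=()):
    """Assemble a zero-, one-, or few-shot prompt as plain text."""
    lines = [task_description]
    for source, target in examples:
        lines.append(f"{source} => {target}")   # demonstrations x_i, y_i
    lines.append(f"{query} =>")                 # the model completes this line
    return "\n".join(lines)

demos = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush giraffe", "girafe en peluche"),
]
task = "Translate English to French:"

zero_shot = build_prompt(task, "cheese")
one_shot = build_prompt(task, "cheese", demos[:1])
few_shot = build_prompt(task, "cheese", demos)
print(few_shot)
```

In all three cases the model weights are identical; only the conditioning text changes.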

Figure: In-Context Learning: Zero-Shot, One-Shot, Few-Shot

How Does In-Context Learning Work?

This is one of the most debated questions in modern ML. The key hypotheses:

Hypothesis 1: Implicit Bayesian inference. The model has learned a prior over tasks during pre-training. The in-context examples narrow the posterior to the correct task, and the model applies the inferred task to the test input.

Hypothesis 2: Mesa-optimization. The Transformer internally implements a learning algorithm (akin to gradient descent) within its forward pass. Research by Akyürek et al. (2022) showed that Transformers can implement linear regression in their forward pass.

Hypothesis 3: Task location. The model has already learned to perform many tasks during pre-training. The in-context examples serve as a "task locator" - they tell the model which of its existing capabilities to apply. This is supported by the finding that in-context learning performance does not degrade much when labels are randomized (Min et al., 2022).

$\text{ICL}(x) \approx f_{\theta^*}(x) \text{ where } \theta^* = \arg\min_\theta \sum_i \ell(f_\theta(x_i), y_i)$

The notation above is suggestive: in-context learning behaves as if the model is doing gradient descent internally, but without any actual parameter updates.
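To make the right-hand side concrete, here is what an explicit learner would compute for a linear task: fit $\theta^*$ on the in-context pairs by least squares, then apply it to the query. This is only the reference computation that the "implicit learning algorithm" hypothesis compares against; it is not a claim about what GPT-3 does internally. The synthetic task, dimensions, and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 16                       # input dimension, number of in-context examples

true_w = rng.normal(size=d)        # hidden task the examples come from
X_ctx = rng.normal(size=(k, d))    # in-context inputs x_1..x_k
y_ctx = X_ctx @ true_w             # in-context labels y_1..y_k (noise-free)
x_query = rng.normal(size=d)       # the new input x

# theta* = argmin_theta sum_i (theta . x_i - y_i)^2
theta_star, *_ = np.linalg.lstsq(X_ctx, y_ctx, rcond=None)

prediction = x_query @ theta_star  # f_{theta*}(x)
target = x_query @ true_w
print(f"absolute error on the query: {abs(prediction - target):.2e}")
```

Studies of in-context learning on linear tasks compare a Transformer's in-context predictions against exactly this kind of baseline.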

Instant Rejection

If asked "Is in-context learning actually learning?" and you say "Yes, the model updates its weights," that is an instant rejection. In-context learning involves ZERO parameter updates. All computation happens in the forward pass (one pass per generated token). The model's weights are frozen. This is fundamentally different from fine-tuning or training. What changes is the input context, not the parameters.

Scaling Laws in GPT-3

The paper showed smooth power-law improvements across three orders of magnitude of scale:

# GPT-3 scaling: performance improves as a power law of model size
import numpy as np

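# Approximate cross-entropy loss vs parameter count.
# The functional form and constants come from Kaplan et al. (2020),
# "Scaling Laws for Neural Language Models", where N counts non-embedding parameters:
# L(N) ≈ (N_c / N)^α where α ≈ 0.076 and N_c ≈ 8.8 × 10^13
def gpt3_loss_vs_params(N):
    """Approximate test loss (nats per token) as a function of parameter count."""
    N_c = 8.8e13
    alpha = 0.076
    return (N_c / N) ** alpha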

param_counts = [125e6, 350e6, 760e6, 1.3e9, 2.7e9, 6.7e9, 13e9, 175e9]
model_names = ["125M", "350M", "760M", "1.3B", "2.7B", "6.7B", "13B", "175B"]

for name, N in zip(model_names, param_counts):
    loss = gpt3_loss_vs_params(N)
    print(f"GPT-3 {name:>5s}: Approx loss = {loss:.3f}")

Data Contamination

GPT-3 was one of the first papers to seriously analyze data contamination - the risk that test sets appear in the massive training data. They found contamination in several benchmarks and attempted to measure its effect. This became a template for all subsequent large model evaluations.
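A minimal sketch of an n-gram overlap check follows. The GPT-3 paper describes searching for 13-gram overlaps between benchmark examples and training data; the whitespace tokenizer, lowercasing, the helper names, and the toy strings here are simplifying assumptions rather than the paper's exact procedure.

```python
def ngrams(tokens, n):
    """Set of all contiguous n-token windows."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_text, training_ngrams, n=13):
    """Flag a test example that shares any n-gram with the training data."""
    test_tokens = test_text.lower().split()
    return not ngrams(test_tokens, n).isdisjoint(training_ngrams)

N = 13
training_doc = ("the quick brown fox jumps over the lazy dog and then "
                "runs far away into the dark forest at night")
training_ngrams = ngrams(training_doc.lower().split(), N)

leaked = ("Question: the quick brown fox jumps over the lazy dog and then "
          "runs far away into what")
clean = ("a completely different sentence about language models and how "
         "they are evaluated on held out data")

print(is_contaminated(leaked, training_ngrams, N))
print(is_contaminated(clean, training_ngrams, N))
```

At real scale the training n-grams would not fit in a Python set; a hashed or probabilistic index would be needed, which this sketch leaves out.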

60-Second Answer

"GPT-3's key contribution is in-context learning: the ability to perform new tasks by conditioning on a few examples in the prompt, with zero parameter updates. At 175B parameters trained on 570GB of text, GPT-3 showed that scale enables emergent capabilities - few-shot performance that smaller models cannot achieve. This proved that sufficiently large language models are general-purpose few-shot learners, eliminating the need for task-specific fine-tuning in many cases. The mechanism is still debated, but the leading hypothesis is that the model has learned a distribution over tasks during pre-training, and in-context examples serve to locate the relevant task."

Part 4 - InstructGPT: Alignment via RLHF (2022)

The Paper

"Training language models to follow instructions with human feedback" - Ouyang et al., 2022

The Problem

GPT-3 was powerful but poorly behaved. It would:

Generate toxic or harmful content when prompted
Make up facts confidently (hallucinate)
Follow the literal prompt instead of the user's actual intent
Produce verbose, unhelpful responses

The core issue: the language modeling objective optimizes for predicting likely text, not for being helpful, harmless, and honest. A model trained to predict web text will produce text that looks like web text - including all its toxicity, misinformation, and irrelevance.

The Three-Stage RLHF Pipeline

InstructGPT introduced the three-stage pipeline that would become the standard for aligning language models:

Figure: InstructGPT RLHF Pipeline: SFT → Reward Model → PPO

Stage 1: Supervised Fine-Tuning (SFT)

Collect a dataset of prompts and human-written ideal responses. Fine-tune GPT-3 on this data using standard supervised learning:

$\mathcal{L}_{\text{SFT}} = -\sum_{t} \log P(y_t | y_{<t}, x; \theta)$

Where $x$ is the prompt and $y$ is the human-written response.

This produces a model that follows instructions but is not yet optimized for quality. The SFT model is a starting point, not the final product.
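Because the sum in the SFT loss runs over response tokens only, a common way to implement it is to mask out the prompt positions. The sketch below assumes per-position log-probabilities are already available; the mask convention, shapes, and function name are illustrative assumptions.

```python
import numpy as np

def sft_loss(log_probs, token_ids, response_mask):
    """Negative log-likelihood of the response tokens only.

    log_probs: (T, vocab_size) log-probabilities for each position
    token_ids: (T,) tokens of the concatenated prompt + response
    response_mask: (T,) 1.0 where the token belongs to the response y, else 0.0
    """
    picked = log_probs[np.arange(len(token_ids)), token_ids]
    return -(picked * response_mask).sum()

vocab_size, seq_len = 8, 6
# Toy "model" that is uniform over the vocabulary at every position
uniform_log_probs = np.full((seq_len, vocab_size), -np.log(vocab_size))
tokens = np.array([2, 5, 1, 7, 0, 3])                  # 3 prompt + 3 response tokens
mask = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

loss = sft_loss(uniform_log_probs, tokens, mask)
print(f"SFT loss: {loss:.4f}")   # three response tokens, each costing log(8)
```

The prompt tokens still condition the prediction; they simply contribute nothing to the loss.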

Stage 2: Reward Model Training

Collect comparison data: for a given prompt, generate multiple responses and have humans rank them from best to worst. Train a reward model $R(x, y)$ to predict which response humans will prefer.

The loss function uses the Bradley-Terry model of pairwise comparisons:

$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\left(R(x, y_w) - R(x, y_l)\right) \right]$

Where $y_w$ is the preferred response and $y_l$ is the less-preferred response.

import numpy as np

def reward_model_loss(r_preferred, r_rejected):
    """
    Bradley-Terry pairwise loss for reward modeling.

    r_preferred: reward score for the human-preferred response
    r_rejected: reward score for the rejected response
    """
    # We want r_preferred > r_rejected
    # The loss pushes the gap to be positive and large
    # -log(sigmoid(gap)) == log(1 + exp(-gap)); logaddexp avoids overflow for large |gap|
    return np.logaddexp(0.0, -(r_preferred - r_rejected))

# Example: reward model correctly ranks
r_good = 2.5   # Human-preferred response
r_bad = -1.0   # Rejected response
print(f"Loss (correct ranking): {reward_model_loss(r_good, r_bad):.4f}")
# Low loss - model agrees with humans

# Example: reward model incorrectly ranks
r_good = -0.5  # Human-preferred response scores lower
r_bad = 1.5    # Rejected response scores higher
print(f"Loss (incorrect ranking): {reward_model_loss(r_good, r_bad):.4f}")
# High loss - model disagrees with humans

Stage 3: PPO Optimization

Use the reward model to optimize the language model via Proximal Policy Optimization (PPO). The objective, which is maximized rather than minimized, is:

$J(\theta) = \mathbb{E}_{x \sim D, y \sim \pi_\theta} \left[ R(x, y) - \beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{SFT}}) \right]$

The KL penalty is critical: it prevents the model from drifting too far from the SFT model, which would lead to reward hacking - generating degenerate text that scores highly on the reward model but is nonsensical.
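In practice the KL term is usually estimated from the sampled response itself, as the sum over tokens of the log-probability ratio between the policy and the SFT model. The sketch below shows only this reward shaping, not the PPO update; the function name, the example log-probabilities, and the `beta` values are assumptions chosen for illustration.

```python
import numpy as np

def kl_penalized_reward(reward, logp_policy, logp_sft, beta=0.02):
    """Reward-model score minus a sample-based KL penalty.

    logp_policy, logp_sft: per-token log-probabilities of the sampled
    response under the current policy and under the frozen SFT model.
    """
    kl_estimate = np.sum(logp_policy - logp_sft)   # sum of log ratios
    return reward - beta * kl_estimate

logp_sft = np.array([-2.0, -1.5, -3.0])

# Policy identical to the SFT model: no penalty
same = kl_penalized_reward(3.0, logp_sft, logp_sft, beta=0.5)

# Policy has drifted and is far more confident than the SFT model
logp_drifted = np.array([-0.5, -0.4, -0.6])
drifted = kl_penalized_reward(3.0, logp_drifted, logp_sft, beta=0.5)

print(f"no drift: {same:.2f}, drifted: {drifted:.2f}")
```

The same reward-model score is worth less once the policy moves away from the SFT distribution, which is what discourages reward hacking.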

Figure: PPO Training Loop with KL Penalty

Results

The results were striking:

Model	Parameters	Human preference
GPT-3 (no alignment)	175B	Baseline
InstructGPT (SFT only)	1.3B	Improves on same-size GPT-3; weaker than SFT + RLHF
InstructGPT (SFT + RLHF)	1.3B	Preferred over GPT-3 175B

The most important finding: a 1.3B parameter model with RLHF was preferred by humans over a 175B parameter model without alignment. Alignment is not just a safety measure - it is a capability amplifier.

Common Trap

Do not describe RLHF as "just making the model nicer." InstructGPT showed that RLHF improves helpfulness, reduces hallucination, and makes the model better at following complex instructions. It is a fundamental training methodology that improves capability, not just safety. The aligned 1.3B model was preferred over the unaligned 175B model - alignment actually unlocks capability that raw pre-training does not surface.

The Alignment Tax

InstructGPT also examined the "alignment tax": the tradeoff between alignment and raw task performance. RLHF can reduce performance on some traditional NLP benchmarks while dramatically improving real-world usefulness. The paper reported that mixing pre-training gradients into the PPO updates (a variant it calls PPO-ptx) greatly reduces these regressions without hurting human preference scores.

60-Second Answer

"InstructGPT solved the alignment problem for language models using a three-stage pipeline: (1) supervised fine-tuning on human demonstrations, (2) training a reward model on human preference comparisons, and (3) optimizing the language model against the reward model using PPO with a KL penalty. The key result was that a 1.3B RLHF model was preferred by humans over the 175B GPT-3, proving that alignment is a capability multiplier, not just a safety constraint. The KL penalty prevents reward hacking by keeping the model close to the SFT distribution."

Part 5 - GPT-4: Multimodal and Mixture of Experts (2023)

The Paper

"GPT-4 Technical Report" - OpenAI, 2023

What We Know

The GPT-4 technical report is notably sparse on details. OpenAI stated: "Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar."

Despite this, enough information has been leaked and widely reported to construct a plausible picture, although OpenAI has confirmed none of it.

Multimodal Input

GPT-4 was the first GPT model to accept both text and image inputs:

$P(y_t | y_{<t}, x_{\text{text}}, x_{\text{image}}; \theta)$

OpenAI has not described how images are encoded. A common design for vision-language models, and a plausible one here, is a vision encoder (such as a ViT variant) that produces patch embeddings, which are projected into the same embedding space as text tokens. The Transformer then attends over both text and image tokens.
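The patch-embedding idea can be sketched generically. This follows the standard ViT-style recipe (cut the image into patches, flatten, project linearly) and is not GPT-4's disclosed design; the image size, patch size, `d_model`, and random projection matrix are all assumptions.

```python
import numpy as np

def extract_patches(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    p = patch_size
    grid = image.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, p * p * C)          # (num_patches, p*p*C)

def image_to_tokens(image, patch_size, W_proj):
    """Project each flattened patch into the text embedding space."""
    return extract_patches(image, patch_size) @ W_proj

rng = np.random.default_rng(0)
d_model, patch = 64, 8
image = rng.random((32, 32, 3))
W_proj = rng.normal(size=(patch * patch * 3, d_model)) * 0.02

image_tokens = image_to_tokens(image, patch, W_proj)   # 16 patches
text_tokens = rng.normal(size=(5, d_model))            # stand-in text embeddings

# One sequence for the Transformer: image tokens followed by text tokens
sequence = np.concatenate([image_tokens, text_tokens], axis=0)
print(image_tokens.shape, sequence.shape)
```

Once both modalities live in the same `d_model`-dimensional space, attention treats image patches and text tokens uniformly.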

Figure: GPT-4 Multimodal Architecture

Mixture of Experts (MoE) - The Rumored Architecture

Unconfirmed reports (comments by George Hotz, leaked information) suggest GPT-4 uses a Mixture of Experts architecture:

~1.8 trillion total parameters across all experts
8 experts per MoE layer, with top-2 routing (2 experts active per token)
~220B parameters active per forward pass (not all 1.8T)
16 inference passes for a single response (speculative decoding or similar)

The MoE architecture replaces the standard FFN layer with multiple expert FFN layers and a routing mechanism:

$\text{MoE}(x) = \sum_{i=1}^{E} g_i(x) \cdot \text{FFN}_i(x)$

Where $g_i(x)$ is the gating function that determines how much each expert contributes:

$g(x) = \text{TopK}\left(\text{softmax}(W_g \cdot x)\right)$

With top-2 routing, only 2 of 8 experts are active per token, meaning each MoE layer uses roughly $\frac{2}{8} = 25\%$ of its expert parameters per token. Attention and embedding parameters are shared and always active.

import numpy as np

def mixture_of_experts(x, expert_ffns, gate_weights, top_k=2):
    """
    Simplified MoE forward pass.

    x: input tensor (d_model,)
    expert_ffns: list of expert functions
    gate_weights: (num_experts, d_model) gating matrix
    top_k: number of experts to route to
    """
    num_experts = len(expert_ffns)

    # Compute gating scores
    scores = gate_weights @ x  # (num_experts,)

    # Softmax over experts (subtract the max for numerical stability)
    exp_scores = np.exp(scores - scores.max())
    probs = exp_scores / exp_scores.sum()

    # Select top-k experts
    top_indices = np.argsort(probs)[-top_k:]

    # Renormalize probabilities over selected experts
    top_probs = probs[top_indices]
    top_probs = top_probs / top_probs.sum()

    # Compute weighted sum of expert outputs
    output = np.zeros_like(x)
    for idx, prob in zip(top_indices, top_probs):
        output += prob * expert_ffns[idx](x)

    return output

# Example: 8 experts, top-2 routing
# Total expert params: 8 × FFN_size, but only 2 × FFN_size used per token
# This is how a rumored ~1.8T-parameter model could activate only a fraction per token
num_experts = 8
top_k = 2
active_fraction = top_k / num_experts
print(f"Active fraction: {active_fraction:.1%}")  # 25.0%
print(f"If total params = 1.8T, active params ≈ {1.8 * active_fraction:.2f}T")
# ≈ 0.45T from this simple fraction. Shared layers such as attention are always
# active and would push the count up, not down, so this does not reproduce the
# rumored ~220B figure; the leaked numbers are unconfirmed and not mutually consistent.

Why MoE?

The key advantage of MoE is decoupling total model capacity from per-token compute cost:

Architecture	Total Parameters	Active per Token	Compute per Token
Dense 175B (GPT-3)	175B	175B	High
Dense 1.8T (hypothetical)	1.8T	1.8T	Extreme
MoE 1.8T (GPT-4 style)	1.8T	~220B	Moderate

MoE gives you the knowledge capacity of a 1.8T model with the inference cost of a ~220B model. The tradeoff is higher memory requirements (all experts must be loaded) and load balancing complexity.
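The load balancing problem is usually handled with an auxiliary loss. The sketch below adapts the auxiliary loss from the Switch Transformer paper (number of experts times the sum over experts of dispatch fraction times mean router probability) to top-2 routing. Nothing is known about what GPT-4 uses, so this is shown only as a representative technique; the function name and toy router outputs are assumptions.

```python
import numpy as np

def load_balancing_loss(router_probs, top_k=2):
    """Auxiliary loss that is smallest when experts are used evenly.

    router_probs: (num_tokens, num_experts) softmax outputs of the router.
    """
    num_tokens, num_experts = router_probs.shape
    top_idx = np.argsort(router_probs, axis=-1)[:, -top_k:]
    counts = np.bincount(top_idx.ravel(), minlength=num_experts)
    f = counts / (num_tokens * top_k)   # share of routing slots per expert
    P = router_probs.mean(axis=0)       # mean router probability per expert
    return num_experts * np.sum(f * P)

num_experts = 8

# Balanced router: token t splits its probability between experts t and t+1
balanced = np.zeros((8, num_experts))
for t in range(8):
    balanced[t, t] = 0.5
    balanced[t, (t + 1) % num_experts] = 0.5

# Collapsed router: every token prefers experts 0 and 1
collapsed = np.zeros((8, num_experts))
collapsed[:, 0] = 0.6
collapsed[:, 1] = 0.4

print(f"balanced:  {load_balancing_loss(balanced):.2f}")
print(f"collapsed: {load_balancing_loss(collapsed):.2f}")
```

Adding a small multiple of this term to the training loss nudges the router away from sending every token to the same few experts.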

GPT-4 Performance

GPT-4 achieved remarkable results on professional and academic exams:

Exam	GPT-3.5 Percentile	GPT-4 Percentile
Bar Exam (Uniform)	~10th	~90th
SAT Math	~70th	~89th
SAT EBRW	~87th	~93rd
GRE Quantitative	~25th	~80th
AP Biology	~62nd-85th	~85th-100th
AP Chemistry	~22nd-46th	~71st-88th
LSAT	~40th	~88th

Predictable Scaling

Perhaps the most scientifically important contribution of the GPT-4 report was demonstrating predictable loss scaling. OpenAI trained small models on reduced compute and accurately predicted GPT-4's final loss before training the full model:

$L(C) = aC^{-\alpha} + L_{\infty}$

OpenAI reported that the prediction, fitted on runs using at most 1/10,000th of GPT-4's training compute, closely matched the final loss. For a model reported to have cost more than $100M to train, this validates that scaling laws work reliably enough for planning purposes.
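The extrapolation idea can be sketched on synthetic data: fit the three constants of $L(C) = aC^{-\alpha} + L_{\infty}$ on small runs, then predict the loss of a much larger run. The constants, compute budgets, and the grid-search fitting procedure below are all invented for illustration; OpenAI has not published its fitting method or data.

```python
import numpy as np

def fit_scaling_law(compute, loss, n_grid=400):
    """Fit L(C) = a * C**(-alpha) + L_inf.

    Scans candidate values of L_inf; for each one, log(L - L_inf) is
    linear in log(C), so a and alpha come from a straight-line fit.
    """
    log_c = np.log(compute)
    best = None
    for L_inf in np.linspace(0.0, loss.min() - 1e-3, n_grid):
        log_excess = np.log(loss - L_inf)
        slope, intercept = np.polyfit(log_c, log_excess, 1)
        residual = np.sum((slope * log_c + intercept - log_excess) ** 2)
        if best is None or residual < best[0]:
            best = (residual, np.exp(intercept), -slope, L_inf)
    _, a, alpha, L_inf = best
    return a, alpha, L_inf

# Synthetic "small runs" generated from known constants (not real data)
true_a, true_alpha, true_L_inf = 20.0, 0.05, 1.7
small_runs = np.array([1e15, 1e16, 1e17, 1e18, 1e19])
observed = true_a * small_runs ** (-true_alpha) + true_L_inf

a, alpha, L_inf = fit_scaling_law(small_runs, observed)

big_run = 1e23    # 10,000x more compute than the largest small run
predicted = a * big_run ** (-alpha) + L_inf
actual = true_a * big_run ** (-true_alpha) + true_L_inf
print(f"predicted {predicted:.3f} vs actual {actual:.3f}")
```

Real training curves are noisy, so the fit would be less clean than on this noise-free toy data.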

Instant Rejection

If asked "How many parameters does GPT-4 have?" and you answer with certainty, that is a red flag. OpenAI has not officially disclosed GPT-4's architecture. The ~1.8T MoE figure comes from leaks, not official sources. The correct answer is: "OpenAI has not officially disclosed GPT-4's architecture. Leaked information suggests approximately 1.8 trillion total parameters with a Mixture of Experts architecture using 8 experts and top-2 routing, giving roughly 220 billion active parameters per forward pass. But this is unconfirmed."

60-Second Answer

"GPT-4 represents two major advances: multimodal input (text + images) and massive scale, likely via a Mixture of Experts architecture. MoE decouples capacity from compute by having many expert FFN layers but routing each token to only the top-k experts. The rumored architecture has ~1.8T total parameters but ~220B active per token. GPT-4 demonstrated near-expert performance on professional exams and showed that loss scaling is predictable enough to plan $100M training runs. The most important insight is that MoE allows continued scaling without proportional compute increases."

Part 6 - The Conceptual Arc: Five Paradigm Shifts

The Evolution

Each GPT generation introduced a fundamentally new idea about how to use language models:

Figure: GPT Evolution: GPT-1 through GPT-4

Generation	Key Paradigm Shift	What It Proved
GPT-1	Pre-train then fine-tune	Unsupervised pre-training learns transferable representations
GPT-2	Task-agnostic prompting	Large LMs implicitly learn to perform tasks without fine-tuning
GPT-3	In-context learning	Few examples in the prompt suffice; no gradient updates needed
InstructGPT	RLHF alignment	Human feedback makes small models outperform large unaligned ones
GPT-4	Multimodal MoE	MoE scales capacity without proportional compute; vision + language unification

The Scaling Hypothesis

The deepest lesson of the GPT series is the scaling hypothesis: increasing model size, data size, and compute - in the right proportions - reliably produces qualitatively new capabilities. Abilities like in-context learning, chain-of-thought reasoning, and multimodal understanding were not explicitly programmed; they emerged from scale.

$\text{Capability} = f(\text{Parameters}, \text{Data}, \text{Compute})$

The relationship is not linear - it follows power laws with occasional emergent phase transitions where new capabilities appear suddenly as scale increases.

Part 7 - Comparing GPT to BERT

This comparison is among the most commonly asked interview questions. Understanding both sides of the architecture fork is critical.

Aspect	GPT Series	BERT
Architecture	Transformer decoder	Transformer encoder
Directionality	Unidirectional (causal)	Bidirectional
Pre-training objective	Next token prediction	Masked language modeling
Adaptation paradigm	Prompting / in-context learning	Fine-tuning
Strengths	Generation, reasoning, few-shot	Understanding, classification, retrieval
Scaling trajectory	117M → ~1.8T, rumored (continued scaling)	110M → 340M (largely stopped)
Modern relevance	Dominant paradigm (ChatGPT, Claude, etc.)	Still used in search, NER, embeddings
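The directionality and objective rows can be made concrete with a small sketch that builds the attention masks and the training targets for each style. The helper names and the toy sentence are assumptions; BERT's actual masking procedure (random 15% selection with replacement rules) is simplified here to masking chosen positions.

```python
import numpy as np

def causal_mask(seq_len):
    """GPT-style: position i may attend to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def bidirectional_mask(seq_len):
    """BERT-style: every position may attend to every position."""
    return np.ones((seq_len, seq_len), dtype=bool)

def next_token_pairs(tokens):
    """GPT objective: predict each token from its left context."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, masked_positions, mask_token="[MASK]"):
    """BERT objective: predict hidden tokens from both sides."""
    corrupted = [mask_token if i in masked_positions else t
                 for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked_positions}
    return corrupted, targets

tokens = ["the", "cat", "sat", "down"]
print(causal_mask(4).astype(int))
print(next_token_pairs(tokens)[-1])
print(masked_lm_example(tokens, {1}))
```

The causal setup yields one training signal per position and doubles as a generator; the masked setup sees both sides of a gap but only learns from the masked positions.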

Why GPT Won

The GPT paradigm ultimately became dominant for a simple reason: autoregressive generation is a universal interface. Any task - classification, translation, QA, reasoning, coding - can be expressed as "generate the right text." BERT's bidirectional architecture is inherently limited to encoding, not generation.

As models scaled, the advantage of bidirectional context for understanding tasks was overwhelmed by the versatility and emergent capabilities of autoregressive models.

Company Variation

At Google, BERT-style models have been heavily used in production, most visibly in Search ranking. At OpenAI, Anthropic, and most startups, the GPT-style decoder-only architecture dominates. Know your audience.

Part 8 - Implementation: Building a Mini-GPT

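Understanding the GPT architecture means being able to implement it. Here is a simplified sketch of the Transformer block stack, with random weights and without token embeddings, positional embeddings, or the output head: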

import numpy as np

class GPTBlock:
    """A single GPT transformer block."""

    def __init__(self, d_model, n_heads, d_ff):
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.d_ff = d_ff

        # Multi-head self-attention weights
        self.W_Q = np.random.randn(d_model, d_model) * 0.02
        self.W_K = np.random.randn(d_model, d_model) * 0.02
        self.W_V = np.random.randn(d_model, d_model) * 0.02
        self.W_O = np.random.randn(d_model, d_model) * 0.02

        # Feed-forward network weights
        self.W1 = np.random.randn(d_model, d_ff) * 0.02
        self.b1 = np.zeros(d_ff)
        self.W2 = np.random.randn(d_ff, d_model) * 0.02
        self.b2 = np.zeros(d_model)

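    def causal_attention(self, x):
        """Multi-head causal (masked) self-attention."""
        seq_len = x.shape[0]

        def split_heads(t):
            # (seq_len, d_model) -> (n_heads, seq_len, d_k)
            return t.reshape(seq_len, self.n_heads, self.d_k).transpose(1, 0, 2)

        Q = split_heads(x @ self.W_Q)
        K = split_heads(x @ self.W_K)
        V = split_heads(x @ self.W_V)

        # Compute per-head attention scores
        scores = Q @ K.transpose(0, 2, 1) / np.sqrt(self.d_k)  # (n_heads, seq_len, seq_len)

        # Apply causal mask: position i can only attend to positions <= i
        mask = np.triu(np.full((seq_len, seq_len), -1e9), k=1)
        scores = scores + mask  # broadcasts over heads

        # Softmax
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)

        # Weighted sum per head, then concatenate heads
        heads = weights @ V  # (n_heads, seq_len, d_k)
        output = heads.transpose(1, 0, 2).reshape(seq_len, self.d_model)
        return output @ self.W_O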

    def feed_forward(self, x):
        """Position-wise feed-forward network with GELU activation."""
        # tanh approximation of GELU (GPT-1 and later models use GELU)
        h = x @ self.W1 + self.b1
        h = h * 0.5 * (1 + np.tanh(np.sqrt(2/np.pi) * (h + 0.044715 * h**3)))
        return h @ self.W2 + self.b2

    def forward(self, x):
        """Pre-norm transformer block (GPT-2 style)."""
        # Self-attention with residual (pre-norm)
        x = x + self.causal_attention(layer_norm(x))
        # Feed-forward with residual (pre-norm)
        x = x + self.feed_forward(layer_norm(x))
        return x


def layer_norm(x, eps=1e-5):
    """Layer normalization (learnable gain and bias omitted for brevity)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


# GPT-2 Small dimensions
d_model = 768
n_heads = 12
d_ff = 3072
n_layers = 12

# Create a stack of transformer blocks
blocks = [GPTBlock(d_model, n_heads, d_ff) for _ in range(n_layers)]

# Forward pass through all layers
x = np.random.randn(10, d_model)  # 10 tokens
for block in blocks:
    x = block.forward(x)

print(f"Input shape: (10, {d_model})")
print(f"Output shape: {x.shape}")  # (10, 768)
# Final output goes through LayerNorm → Linear (d_model, vocab_size) → Softmax

Practice Problems

Problem 1: Pre-Training Objective

Explain why next-token prediction is a sufficient objective for learning general-purpose language representations. What types of knowledge must the model acquire to predict the next token well?

Hint

To predict the next token, the model must learn: (1) syntax (grammar rules determine what tokens are valid next), (2) semantics (meaning constrains what is likely), (3) world knowledge (facts about the world help predict factual text), (4) reasoning (logical sequences require reasoning to continue), (5) common sense (understanding everyday situations). The beauty of next-token prediction is that ALL of these are needed to minimize the loss on diverse text.

Problem 2: In-Context Learning Mechanism

A Transformer is given 5 examples of "input → output" in its context, followed by a new input. The model produces the correct output. Explain two hypotheses for how this works, given that NO parameter updates occur.

Hint

Hypothesis 1 (Task location): The model has already learned many tasks during pre-training. The examples serve as a "pointer" to the correct task in the model's learned distribution. The model identifies the task and applies its pre-existing knowledge. Supported by: randomizing labels in examples barely hurts performance on some tasks (Min et al., 2022). Hypothesis 2 (Mesa-optimization): The Transformer implements an implicit learning algorithm in its forward pass. The attention mechanism can compute something resembling gradient descent over the examples. Supported by: Transformers can provably learn to implement linear regression in-context (Akyürek et al., 2022).

Problem 3: RLHF Failure Modes

What happens if you remove the KL penalty from the PPO objective in RLHF? Describe the failure mode and explain why the KL term is necessary.

Hint

Without the KL penalty, the model will "reward hack" - it will find degenerate text patterns that score highly on the reward model but are nonsensical to humans. For example, it might repeat certain phrases the reward model has learned to associate with quality, or produce adversarial outputs that exploit reward model weaknesses. The KL penalty $\beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{SFT}})$ keeps the model's output distribution close to the SFT model, preventing it from drifting into regions where the reward model's predictions are unreliable.

Problem 4: MoE vs Dense Trade-offs

You are designing a new LLM and must choose between a 200B dense model and a 1T MoE model with 8 experts (top-2 routing). Both have approximately the same active parameter count per token. What are the trade-offs for training, inference, and deployment?

Hint

Training: MoE must keep all experts in memory (about 5x the weight memory of the dense model, 1T vs 200B parameters), but has higher capacity. MoE also needs load-balancing losses to keep the router from collapsing onto a few experts.
Inference: Both have similar per-token compute (FLOPs), but MoE needs about 5x the memory to hold all experts. Batching is harder with MoE because different tokens route to different experts, creating uneven workloads.
Deployment: MoE requires model parallelism across more GPUs because of its memory footprint. Dense models are simpler to shard. However, MoE offers better quality-per-FLOP if memory is not the bottleneck.
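The memory and compute comparison is simple arithmetic, sketched below. Two assumptions are not from the problem statement: weights are stored in 2 bytes per parameter (bf16/fp16), and forward-pass cost is approximated by the common rule of thumb of about 2 FLOPs per active parameter per token. Optimizer state, activations, and KV cache are ignored.

```python
BYTES_PER_PARAM = 2  # assumption: bf16/fp16 weights only

def weight_memory_gb(total_params):
    """Memory needed just to hold the weights, in gigabytes."""
    return total_params * BYTES_PER_PARAM / 1e9

def flops_per_token(active_params):
    """Rule-of-thumb forward cost: ~2 FLOPs per active parameter per token."""
    return 2 * active_params

dense_total = 200e9
dense_active = 200e9   # a dense model uses every parameter for every token
moe_total = 1e12
moe_active = 200e9     # from the problem: roughly the same active count as the dense model

memory_ratio = weight_memory_gb(moe_total) / weight_memory_gb(dense_total)
flops_ratio = flops_per_token(moe_active) / flops_per_token(dense_active)

print(f"dense weights: {weight_memory_gb(dense_total):.0f} GB")
print(f"MoE weights:   {weight_memory_gb(moe_total):.0f} GB")
print(f"memory ratio:  {memory_ratio:.1f}x, FLOPs ratio: {flops_ratio:.1f}x")
```

The split is the whole trade-off: memory scales with total parameters, while per-token compute scales with active parameters.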

Problem 5: GPT Timeline

For each GPT generation, name the single most important idea and explain why it could not have happened at a smaller scale.

Hint

GPT-1 (pre-train + fine-tune): Could work at any scale; even 117M parameters showed clear benefits from pre-training.
GPT-2 (zero-shot): Needed roughly 1B+ parameters. Smaller models perform poorly zero-shot because they lack the capacity to implicitly encode the tasks that appear in their training text.
GPT-3 (in-context learning): Needed roughly 100B+ parameters. Few-shot performance improves steeply with scale and is often described as an emergent capability; smaller models gain far less from in-context examples.
InstructGPT (RLHF): Could theoretically work at any scale, but RLHF's impact is most dramatic at scales where the base model has rich capabilities that alignment can surface.
GPT-4 (MoE, rumored): The architecture is not officially disclosed. If the MoE reports are accurate, the idea is tied to scale by design: MoE is a technique for growing capacity beyond what dense models can economically achieve.

Interview Cheat Sheet

Question	Key Points
"What did GPT-1 introduce?"	Pre-train Transformer decoder on next-token prediction, then fine-tune for downstream tasks. 117M params on BooksCorpus.
"How is GPT-2 different from GPT-1?"	Paradigm shift from fine-tuning to zero-shot. 1.5B params. Showed tasks emerge from scale without task-specific training.
"Explain in-context learning (GPT-3)"	Conditioning on few examples in the prompt, zero parameter updates. 175B params. Leading hypothesis: task location, not learning.
"How does RLHF work?"	Three stages: SFT on demonstrations, reward model from comparisons, PPO with KL penalty. Human labelers preferred outputs from the 1.3B InstructGPT model over those from the 175B GPT-3.
"What is GPT-4's architecture?"	Not officially disclosed. Unconfirmed leaks: ~1.8T MoE with 8 experts, top-2 routing, ~220B active params. Multimodal (text + image).
"Why MoE?"	Decouples capacity from per-token compute. Rumored for GPT-4: 1.8T total but ~220B active. More knowledge at the per-token compute cost of a smaller dense model, though all experts must still fit in memory.
"GPT vs BERT?"	GPT: decoder, autoregressive, generation-focused, scales to prompting. BERT: encoder, bidirectional, understanding-focused, requires fine-tuning.
"Why did GPT win over BERT?"	Autoregressive generation is a universal interface. Any task = generate the right text. BERT is limited to encoding.
"What is the alignment tax?"	RLHF may slightly reduce benchmark performance but dramatically improves real-world usefulness. Tax is small and sometimes negative.
"What is reward hacking?"	The policy exploits reward model weaknesses to score highly without actually being better. Most severe without the KL penalty, which keeps outputs close to the SFT distribution.

Spaced Repetition Checkpoints

Day 0 (Today)

Explain the key innovation of each GPT generation in one sentence
Draw the three-stage RLHF pipeline from memory
Explain in-context learning and why it involves zero parameter updates

Day 3

Write the RLHF loss functions (SFT, reward model, PPO) from memory
Explain MoE architecture: gating, top-k routing, active parameters
Compare GPT vs BERT across 5 dimensions

Day 7

Give a 10-minute presentation on the GPT series evolution
Explain two hypotheses for how in-context learning works
Discuss GPT-4's multimodal architecture

Day 14

Mock interview: answer all 10 cheat sheet questions
Explain why the scaling hypothesis matters
Discuss reward hacking and the KL penalty

Day 21

Full 20-minute paper discussion simulation covering the GPT series
Handle follow-up questions about scaling, alignment, and MoE
Discuss the future of the GPT paradigm and open questions

Next Steps

You have now traced the most important lineage in modern AI - from GPT-1's pre-training insight to GPT-4's multimodal MoE. Next, step into a different branch of deep learning history with Chapter 6: ResNet and Skip Connections - the paper that proved depth matters and solved the degradation problem that had limited neural networks for years.
