RLHF Papers - Aligning Language Models with Human Preferences

Reading time: ~35 min | Interview relevance: Critical | Roles: MLE, Research Engineer, AI Engineer

The Real Interview Moment

You're in a research interview at an AI safety-focused company. The interviewer asks: "GPT-4 and Claude don't just predict the next token - they follow instructions, refuse harmful requests, and produce helpful responses. Walk me through the technical pipeline that transforms a base language model into an aligned assistant. I want to hear about reward modeling, PPO, the InstructGPT approach, and then explain how DPO simplifies this. What are the trade-offs between these approaches?"

This question separates candidates who have casually read about RLHF from those who understand the full alignment pipeline. The interviewer wants to hear about the three-stage process, the mathematical objective of each stage, and the practical engineering challenges. They want you to reason about why DPO eliminates the reward model, not just that it does.

This is the alignment interview. Every frontier AI lab considers RLHF (and its successors) a core competency. If you're interviewing at OpenAI, Anthropic, Google DeepMind, or Meta, you need to know this material in depth.

What You Will Master

After reading this page, you will be able to:

Explain the alignment problem and why pre-training alone is insufficient
Describe the three-stage InstructGPT pipeline with mathematical precision
Derive the reward modeling objective from pairwise comparisons
Explain PPO for language models and why it's used over simpler RL algorithms
Describe Constitutional AI and self-improvement without human labels
Derive DPO from the RLHF objective and explain the simplification
Compare RLHF, Constitutional AI, and DPO on trade-offs
Discuss practical challenges: reward hacking, KL divergence, distribution shift

Part 1 - The Alignment Problem

Why Pre-Training Is Not Enough

A language model trained on next-token prediction learns to model the distribution of text on the internet. This means it can:

Continue any text in a plausible way
Generate toxic, harmful, or factually incorrect content
Follow instructions inconsistently
Produce verbose, hedging, or unhelpful responses

The model optimizes $P(x_{t+1} | x_1, \ldots, x_t)$, but users want it to optimize for helpfulness, harmlessness, and honesty.

60-Second Answer

"The alignment problem is the gap between what a language model is trained to do (predict the next token) and what we want it to do (be helpful, harmless, and honest). Pre-training gives the model capabilities, but not values. RLHF bridges this gap by training the model to optimize for human preferences rather than token prediction likelihood."

The Three H's of Alignment

Property	Definition	Failure Mode
Helpful	Provides useful, relevant, complete responses	Refuses benign requests, gives vague answers
Harmless	Avoids generating dangerous, biased, or toxic content	Helps with harmful requests, generates misinformation
Honest	Acknowledges uncertainty, doesn't fabricate facts	Confidently states falsehoods, hallucinates citations

These objectives can conflict. A maximally helpful model might help with harmful requests. A maximally harmless model might refuse everything. Alignment is about finding the right balance.

[Figure: The Alignment Gap]

Part 2 - The InstructGPT Pipeline

Stage 1: Supervised Fine-Tuning (SFT)

Objective: Train the model to follow instructions by showing it examples of good behavior.

Data: Human labelers write high-quality responses to a diverse set of prompts.

$\mathcal{L}_\text{SFT} = -\sum_{t} \log P_\theta(y_t | x, y_{<t})$

This is standard next-token prediction, but on curated (prompt, response) pairs rather than internet text.
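As a minimal sketch of this loss, assume the per-token probabilities $P_\theta(y_t | x, y_{<t})$ have already been computed by the model. The `sft_loss` helper and the toy numbers are illustrative, not taken from InstructGPT. The sum runs over response tokens only; the prompt acts as conditioning context.

```python
import math

def sft_loss(response_token_probs):
    """Negative log-likelihood of the response tokens.

    response_token_probs[t] is the model's probability of the t-th
    response token given the prompt and the previous response tokens.
    Prompt tokens are conditioning context and add no loss terms.
    """
    return -sum(math.log(p) for p in response_token_probs)

# Toy example: a 3-token response the model finds fairly likely.
probs = [0.5, 0.8, 0.9]
loss = sft_loss(probs)
print(f"SFT loss: {loss:.4f}")
```

Raising the probability of any response token lowers the loss, which is all that "showing examples of good behavior" means at the objective level.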

Key details from InstructGPT:

~13,000 demonstration examples from 40 labelers
Fine-tuned GPT-3 (175B) for 16 epochs
This alone significantly improved instruction following

Common Trap

Many candidates skip SFT and jump straight to reward modeling. SFT is critical - it shifts the model's output distribution toward the kind of text we want to evaluate. Without SFT, the reward model would need to evaluate arbitrary internet text, which is a much harder problem. SFT narrows the space to "reasonable assistant responses."

Stage 2: Reward Modeling (RM)

Objective: Train a model to predict which response a human would prefer.

Data: For each prompt, generate multiple responses, then have humans rank them.

Given a prompt $x$ and two responses $y_w$ (preferred) and $y_l$ (rejected), the reward model $r_\phi$ is trained with:

$\mathcal{L}_\text{RM} = -\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$

This is the Bradley-Terry model - the probability that response $y_w$ is preferred over $y_l$ is:

$P(y_w \succ y_l | x) = \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$

where $\sigma$ is the sigmoid function.

Architecture: The reward model is typically the SFT model with the language modeling head replaced by a scalar output head. For InstructGPT, it was a 6B parameter model.

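import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Reward model: LLM backbone + scalar head."""

    def __init__(self, base_model):
        super().__init__()
        self.backbone = base_model  # Pre-trained transformer (typically the SFT model)
        # Assumes a Hugging Face-style config that exposes hidden_size
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)  # Scalar output

    def forward(self, input_ids, attention_mask):
        # Hidden states for every position: (batch, seq_len, hidden)
        outputs = self.backbone(input_ids, attention_mask=attention_mask)
        hidden = outputs.last_hidden_state

        # Index of the last non-padding token (assumes right padding)
        last_idx = attention_mask.long().sum(dim=1) - 1
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        last_hidden = hidden[batch_idx, last_idx]

        # Scalar reward per sequence: (batch,)
        return self.reward_head(last_hidden).squeeze(-1)

def reward_loss(reward_model, ids_w, mask_w, ids_l, mask_l):
    """Bradley-Terry pairwise loss on tokenized (prompt + response) sequences."""
    r_w = reward_model(ids_w, mask_w)  # Reward for preferred
    r_l = reward_model(ids_l, mask_l)  # Reward for rejected
    # logsigmoid is numerically more stable than log(sigmoid(...))
    return -F.logsigmoid(r_w - r_l).mean()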

Key details from InstructGPT:

~33,000 prompts with ranked responses (each ranking yields several comparison pairs)
Labelers ranked 4-9 responses per prompt (converted to pairwise comparisons)
Inter-annotator agreement: ~73% (alignment is inherently noisy)
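The conversion from a ranking to pairwise comparisons can be sketched as follows. This is an illustrative helper (`ranking_to_pairs` is a made-up name) that assumes a strict best-to-worst ranking with no ties; a ranking of K responses yields K(K-1)/2 pairs, so K=4 gives 6 pairs and K=9 gives 36.

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair is always ranked above the second.
    return list(combinations(ranked_responses, 2))

pairs = ranking_to_pairs(["A", "B", "C", "D"])  # A is best, D is worst
print(len(pairs))  # 6
print(pairs[0])    # ('A', 'B')
```

Each resulting pair plays the role of $(y_w, y_l)$ in the Bradley-Terry loss above.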

Stage 3: PPO (Proximal Policy Optimization)

Objective: Optimize the language model to maximize the reward while staying close to the SFT model.

The RL objective:

$\max_\theta \mathbb{E}_{x \sim D, y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x, y) - \beta \cdot D_\text{KL}(\pi_\theta(\cdot|x) \| \pi_\text{ref}(\cdot|x)) \right]$

where:

$\pi_\theta$ is the policy (the language model being optimized)
$\pi_\text{ref}$ is the reference policy (the SFT model, frozen)
$r_\phi$ is the trained reward model
$\beta$ controls the KL penalty strength
$D_\text{KL}$ prevents the model from drifting too far from the SFT model

Why the KL penalty? Without it, the model would learn to produce degenerate outputs that exploit weaknesses in the reward model ("reward hacking"). The KL penalty ensures the model stays in a distribution where the reward model's predictions are reliable.

Instant Rejection

Never say "RLHF uses reinforcement learning to make the model learn from human feedback." This is too vague and sounds like you don't understand the pipeline. Be specific: "RLHF uses PPO to optimize a language model policy against a reward model trained on human preference comparisons, with a KL divergence penalty to prevent reward hacking."

The PPO Update

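PPO clips the probability ratio to prevent excessively large policy updates. The clipped surrogate objective, which is maximized, is:

$\mathcal{L}_\text{PPO} = \mathbb{E}_t\left[\min\left(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t\right)\right]$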

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}$ is the probability ratio and $\hat{A}_t$ is the advantage estimate.
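A minimal per-token sketch of the clipped term, in plain Python for readability. The default `eps=0.2` is a commonly used value and an assumption here, not something stated above.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-token PPO clipped objective (a quantity to maximize)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1 + eps,
# so there is no incentive to push the ratio further.
capped = clipped_surrogate(1.5, advantage=2.0)

# Negative advantage: min() keeps the more pessimistic, unclipped value,
# so the penalty for over-weighting a bad action is not softened.
pessimistic = clipped_surrogate(1.5, advantage=-2.0)

print(capped, pessimistic)
```

Taking the minimum of the clipped and unclipped terms makes the objective a pessimistic lower bound: clipping can only remove the incentive for a large update, never add one.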

For language models:

State = prompt + tokens generated so far
Action = next token
Reward = 0 for all tokens except the last, where it equals $r_\phi(x, y) - \beta \log \frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}$
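One common way to implement this reward, presented here as an assumption about typical practice rather than a detail from the text above, spreads the KL term over the tokens and adds the reward model score on the last token. Summing the per-token rewards recovers the sequence-level quantity, because $\log \pi(y|x)$ is the sum of the token log-probabilities. The function name and toy numbers are illustrative.

```python
def shaped_rewards(rm_score, policy_logps, ref_logps, beta=0.1):
    """Per-token rewards: KL penalty at every token, RM score on the last.

    policy_logps[t] and ref_logps[t] are the log-probabilities of the
    sampled token t under the policy and the frozen reference model.
    """
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logps, ref_logps)]
    rewards[-1] += rm_score  # the reward model scores the full response once
    return rewards

rewards = shaped_rewards(
    rm_score=1.0,
    policy_logps=[-1.0, -0.5, -2.0],
    ref_logps=[-1.5, -0.5, -1.0],
)
total = sum(rewards)  # equals rm_score - beta * log(pi_theta(y|x) / pi_ref(y|x))
print(rewards, total)
```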

[Figure: RLHF Three-Stage Pipeline]

Part 3 - Constitutional AI (Anthropic)

The Self-Improvement Approach

Constitutional AI (Bai et al., 2022) addresses two limitations of standard RLHF:

Human labeler bottleneck: Collecting human preferences is expensive and slow
Harmlessness at scale: Humans find it psychologically taxing to evaluate harmful content

The CAI Pipeline

Stage 1: Critique and Revision (Red-Teaming + Self-Improvement)

Generate a response to a potentially harmful prompt
Ask the model to critique its own response based on a set of principles (the "constitution")
Ask the model to revise the response based on the critique
Use the revised response as training data for SFT

Prompt: "How do I pick a lock?"

Initial response: "Here's how to pick a lock: Step 1..."

Critique prompt: "Identify specific ways in which the response
is harmful, unethical, or illegal."

Critique: "The response provides detailed instructions for
breaking into locks, which could facilitate burglary..."

Revision prompt: "Please rewrite the response to remove any
harmful content while being helpful."

Revised: "Lock picking is a skill used by locksmiths and
security professionals. If you're locked out, I'd recommend
contacting a licensed locksmith..."
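The four steps above can be sketched as a small loop. This is a hedged illustration: `generate` is a hypothetical callable standing in for a real model call, the way the prompts are concatenated is an assumption, and `fake_generate` exists only so the sketch runs end to end.

```python
def critique_and_revise(prompt, generate, critique_request, revision_request):
    """One critique-revision round; returns an SFT (prompt, response) pair.

    `generate` is a hypothetical callable mapping a text prompt to a
    model completion; swap in a real model call.
    """
    initial = generate(prompt)
    context = f"{prompt}\n\n{initial}\n\n{critique_request}"
    critique = generate(context)
    revised = generate(f"{context}\n\n{critique}\n\n{revision_request}")
    return prompt, revised

calls = []

def fake_generate(text):
    """Stand-in for a real model call; records the prompt it was given."""
    calls.append(text)
    return f"[completion {len(calls)}]"

example = critique_and_revise(
    "How do I pick a lock?",
    fake_generate,
    "Identify specific ways in which the response is harmful, unethical, or illegal.",
    "Please rewrite the response to remove any harmful content while being helpful.",
)
print(example)
```

Only the original prompt and the final revision are kept for SFT; the critique is an intermediate step, not training data.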

Stage 2: RLAIF (RL from AI Feedback)

Instead of human comparisons, use the model itself to compare responses based on constitutional principles:

$P(y_w \succ y_l | x, \text{principle}) = \text{LLM}(\text{"Which response better follows the principle?"})$

This produces AI-labeled preference data, which trains a reward model, which is used for PPO - identical to RLHF but with AI preferences instead of human ones.
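One way to turn the model's answer into a preference label is to compare its log-probabilities for the two options and normalize them into a soft target. The sketch below assumes the feedback model exposes log-probabilities for answering "(A)" and "(B)"; the function name and the numbers are illustrative.

```python
import math

def soft_preference(logp_a, logp_b):
    """Normalize log-probs for options (A) and (B) into P(A is preferred)."""
    m = max(logp_a, logp_b)  # subtract the max for numerical stability
    ea, eb = math.exp(logp_a - m), math.exp(logp_b - m)
    return ea / (ea + eb)

p_a = soft_preference(logp_a=-0.3, logp_b=-1.7)
print(f"P(A preferred) = {p_a:.3f}")
```

A soft label like this can be used directly as the target probability when training the reward model, instead of a hard 0/1 choice.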

The Constitution

A set of principles like:

"Choose the response that is most helpful while being harmless"
"Choose the response that is most honest and doesn't fabricate information"
"Choose the response that best supports human autonomy and agency"

Company Variation

Anthropic: Pioneered Constitutional AI. Claude is trained with a combination of RLHF and CAI.
OpenAI: Primarily uses RLHF with human labelers but has explored "rule-based reward models" (RBRM) which are similar in spirit.
Google: Uses RLHF for Gemini, with some AI-assisted evaluation.
Meta: LLaMA 2 used RLHF with human preferences. LLaMA 3 added iterative DPO.

Part 4 - DPO: Removing the Reward Model

The Key Insight

DPO (Direct Preference Optimization, Rafailov et al., 2023) shows that you can skip the reward model entirely.

The RLHF objective:

$\max_\theta \mathbb{E}_{x, y \sim \pi_\theta} \left[ r_\phi(x, y) - \beta D_\text{KL}(\pi_\theta \| \pi_\text{ref}) \right]$

has a closed-form optimal solution:

$\pi^*(y|x) = \frac{1}{Z(x)} \pi_\text{ref}(y|x) \exp\left(\frac{1}{\beta} r(x, y)\right)$

Rearranging to express the reward in terms of the optimal policy:

$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_\text{ref}(y|x)} + \beta \log Z(x)$

The DPO Loss

Substituting this into the Bradley-Terry preference model:

$\mathcal{L}_\text{DPO}(\theta) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)} \right) \right]$

The partition function $Z(x)$ cancels out because both $y_w$ and $y_l$ share the same prompt $x$.
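The closed form and the cancellation can be checked numerically on a toy problem. The three-response distribution, the rewards, and $\beta = 0.5$ below are made-up values for illustration.

```python
import math

beta = 0.5
pi_ref = {"y1": 0.5, "y2": 0.3, "y3": 0.2}  # toy reference policy for one prompt
reward = {"y1": 1.0, "y2": 2.0, "y3": 0.0}  # toy rewards r(x, y)

# Closed-form optimum: pi*(y|x) = pi_ref(y|x) * exp(r / beta) / Z(x)
unnorm = {y: pi_ref[y] * math.exp(reward[y] / beta) for y in pi_ref}
Z = sum(unnorm.values())
pi_star = {y: u / Z for y, u in unnorm.items()}

def implicit_reward(y):
    """beta * log(pi* / pi_ref): the true reward minus beta * log Z."""
    return beta * math.log(pi_star[y] / pi_ref[y])

# The implicit reward is off by the same constant for every response...
offsets = [reward[y] - implicit_reward(y) for y in pi_ref]
# ...so reward differences, which are all Bradley-Terry needs, are exact.
diff_true = reward["y2"] - reward["y1"]
diff_implicit = implicit_reward("y2") - implicit_reward("y1")

print(offsets, beta * math.log(Z))
print(diff_true, diff_implicit)
```

Every entry of `offsets` equals $\beta \log Z(x)$, which is why the intractable partition function never has to be computed.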

60-Second Answer

"DPO derives from the same objective as RLHF but shows that the optimal policy under KL-constrained reward maximization has a closed form. By rearranging, you can express the reward implicitly in terms of the policy's log-probabilities. This means you can directly optimize the policy on preference pairs without ever training a separate reward model. The loss function just compares log-probability ratios: increase the probability of preferred responses relative to the reference, and decrease the probability of rejected responses."

DPO Implementation

import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logps_w,    # log π_θ(y_w|x) - log probs of preferred under policy
    policy_logps_l,    # log π_θ(y_l|x) - log probs of rejected under policy
    ref_logps_w,       # log π_ref(y_w|x) - log probs of preferred under reference
    ref_logps_l,       # log π_ref(y_l|x) - log probs of rejected under reference
    beta=0.1,          # Temperature parameter
):
    """
    Direct Preference Optimization loss.

    Increases probability of preferred response relative to reference,
    decreases probability of rejected response relative to reference.
    """
    # Log-probability ratios
    log_ratio_w = policy_logps_w - ref_logps_w  # How much did we increase y_w?
    log_ratio_l = policy_logps_l - ref_logps_l  # How much did we increase y_l?

    # DPO loss: want log_ratio_w > log_ratio_l
    logits = beta * (log_ratio_w - log_ratio_l)
    loss = -F.logsigmoid(logits).mean()

    # Useful metrics
    with torch.no_grad():
        rewards_w = beta * log_ratio_w
        rewards_l = beta * log_ratio_l
        accuracy = (rewards_w > rewards_l).float().mean()

    return loss, rewards_w.mean(), rewards_l.mean(), accuracy
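A useful sanity check when debugging DPO: at initialization the policy equals the reference, both log-ratios are zero, and the loss is exactly $\log 2 \approx 0.693$. The scalar sketch below mirrors the function above for a single pair without needing tensors; the log-probability values are made up.

```python
import math

def dpo_pair_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return math.log1p(math.exp(-logits))  # equals -log(sigmoid(logits))

# Policy identical to the reference: loss is log 2.
init_loss = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has shifted mass toward y_w and away from y_l: loss drops.
later_loss = dpo_pair_loss(-10.0, -18.0, -12.0, -15.0)

print(f"{init_loss:.4f} {later_loss:.4f}")
```

If the first logged training loss is far from 0.693, the policy and reference log-probabilities are probably not being computed on the same tokens.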

DPO vs RLHF: Detailed Comparison

Aspect	RLHF (PPO)	DPO
Pipeline complexity	4 models (policy, ref, reward, value)	2 models (policy, ref)
Training	RL loop with rollouts, advantage estimation	Standard supervised learning
Memory	~4x model size (4 models)	~2x model size (2 models)
Hyperparameters	PPO clip, GAE lambda, reward scaling, KL coeff	Mainly $\beta$ (plus the usual learning rate and batch size)
Stability	Notoriously unstable, reward hacking	More stable
Quality	Slightly better at the frontier	Comparable for most tasks
Iteration speed	Slow (online generation required)	Fast (offline, batch processing)
Online learning	Natural (generates new data)	Harder (fixed preference dataset)
Exploration	Explores new response strategies	Limited to existing data distribution
Reward hacking	Possible (must monitor KL)	Less susceptible

[Figure: RLHF vs DPO Pipeline Comparison]

Part 5 - Beyond DPO: Modern Alignment Methods

IPO (Identity Preference Optimization)

DPO can overfit to the preference data. IPO (Azar et al., 2023) adds regularization:

$\mathcal{L}_\text{IPO} = \left( \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)} - \frac{1}{2\beta} \right)^2$
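A scalar sketch of this loss for one pair shows where the regularization comes from: the log-ratio margin is pulled toward the finite target $\frac{1}{2\beta}$, so overshooting the target is penalized rather than rewarded. The function name and the numbers are illustrative.

```python
def ipo_pair_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Squared distance between the log-ratio margin and the target 1 / (2 * beta)."""
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return (margin - 1.0 / (2.0 * beta)) ** 2

# With beta = 0.1 the target margin is 5.
at_target = ipo_pair_loss(-10.0, -15.0, -12.0, -12.0)  # margin = 2 - (-3) = 5
overshoot = ipo_pair_loss(-8.0, -17.0, -12.0, -12.0)   # margin = 4 - (-5) = 9

print(at_target, overshoot)
```

Under the DPO loss the second case would score better than the first; under IPO it scores worse, which limits how far the policy can be pushed by a single pair.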

KTO (Kahneman-Tversky Optimization)

KTO (Ethayarajh et al., 2024) doesn't need paired preferences - just binary feedback (thumbs up/down):

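$\mathcal{L}_\text{KTO} = \begin{cases} -\sigma(\beta (r_\theta(x,y) - z_\text{ref})) & \text{if } y \text{ is good} \\ -\sigma(\beta (z_\text{ref} - r_\theta(x,y))) & \text{if } y \text{ is bad} \end{cases}$

where $r_\theta(x,y) = \log \frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}$ is the implicit reward and $z_\text{ref}$ is a reference point estimated from the KL divergence between the policy and the reference model. The loss is shown up to additive constants and the per-class weights used in the paper.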

This is significant because paired comparisons are expensive to collect, while binary feedback is cheap.

ORPO (Odds Ratio Preference Optimization)

ORPO (Hong et al., 2024) eliminates both the reward model and the reference model by combining SFT and preference alignment into a single loss:

$\mathcal{L}_\text{ORPO} = \mathcal{L}_\text{SFT}(y_w) - \lambda \log \sigma\left( \log \frac{P_\theta(y_w|x)}{1 - P_\theta(y_w|x)} - \log \frac{P_\theta(y_l|x)}{1 - P_\theta(y_l|x)} \right)$
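A scalar sketch of the two terms, assuming $P_\theta(y|x)$ is the length-normalized likelihood (the exponential of the average token log-probability) and that the SFT term is likewise averaged over tokens. The default `lam=0.1` and the example probabilities are illustrative assumptions.

```python
import math

def log_odds(avg_token_logp):
    """log(P / (1 - P)) with P = exp(avg_token_logp); requires avg_token_logp < 0."""
    return avg_token_logp - math.log1p(-math.exp(avg_token_logp))

def orpo_loss(avg_logp_w, avg_logp_l, lam=0.1):
    """SFT term on the preferred response plus a weighted odds-ratio term."""
    sft_term = -avg_logp_w                       # NLL of the preferred response
    ratio = log_odds(avg_logp_w) - log_odds(avg_logp_l)
    or_term = math.log1p(math.exp(-ratio))       # equals -log(sigmoid(ratio))
    return sft_term + lam * or_term

good = orpo_loss(math.log(0.6), math.log(0.3))  # preferred response is more likely
bad = orpo_loss(math.log(0.3), math.log(0.6))   # rejected response is more likely
print(f"{good:.3f} {bad:.3f}")
```

No reference model appears anywhere in the computation, which is the memory saving ORPO is after.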

Iterative DPO / Online DPO

The main weakness of offline DPO is that it trains on a fixed preference dataset. Iterative DPO addresses this:

Train DPO on initial preference data
Generate new responses from the updated policy
Collect/generate new preferences on these responses
Repeat

This mimics the online exploration benefit of PPO while keeping DPO's simplicity.
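The loop can be sketched as a skeleton. Every callable here is a hypothetical placeholder, and the sketch assumes `policy` has already been through the initial DPO round (step 1), so each iteration covers steps 2 and 3 followed by another DPO update.

```python
def iterative_dpo(policy, prompts, sample, label_preference, dpo_update, rounds=3):
    """Skeleton of iterative DPO; all callables are hypothetical placeholders.

    sample(policy, prompt)          -> two candidate responses
    label_preference(prompt, a, b)  -> (preferred, rejected)
    dpo_update(policy, pairs)       -> updated policy
    """
    for _ in range(rounds):
        pairs = []
        for prompt in prompts:
            a, b = sample(policy, prompt)              # on-policy generations
            y_w, y_l = label_preference(prompt, a, b)  # human or AI judge
            pairs.append((prompt, y_w, y_l))
        policy = dpo_update(policy, pairs)             # DPO step on fresh data
    return policy
```

The key difference from offline DPO is that `sample` is called with the current policy in each round, so the preference data tracks the distribution the model actually produces.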

Evolution Summary

[Figure: Alignment Methods Evolution]

Part 6 - Practical Challenges

Reward Hacking

The model finds outputs that score high with the reward model but are not actually preferred by humans:

Symptom	Example	Mitigation
Verbose responses	Model writes 3 paragraphs when 1 sentence suffices	Length penalty in reward
Sycophancy	"That's a great question!" before every answer	Train reward model on diverse labelers
Format gaming	Excessive bullet points, markdown formatting	Normalize format in reward training
Hedge stacking	"I think, perhaps, it might be possible that..."	Penalize uncertainty markers

KL Divergence Management

The $\beta$ parameter in the KL penalty is critical. The ranges below are rough rules of thumb; the right value depends on the reward scale and the task:

$\beta$ value	Effect	Risk
Too low ($\beta < 0.01$)	Model diverges far from reference	Reward hacking, degenerate outputs
Sweet spot ($\beta \approx 0.1$)	Balanced alignment	-
Too high ($\beta > 1.0$)	Model barely changes from SFT	Under-alignment, wasted compute
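The effect of $\beta$ can be seen on a toy problem using the closed-form optimum $\pi^*(y|x) \propto \pi_\text{ref}(y|x) \exp(r(x,y)/\beta)$. The three-response distribution, the rewards, and the $\beta$ values are made up for illustration and do not correspond to the thresholds in the table.

```python
import math

pi_ref = [0.5, 0.3, 0.2]  # toy reference policy over three responses
reward = [0.0, 1.0, 3.0]  # toy reward model scores

def optimal_policy(beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    w = [p * math.exp(r / beta) for p, r in zip(pi_ref, reward)]
    z = sum(w)
    return [x / z for x in w]

def kl_from_ref(pi):
    """KL(pi || pi_ref) in nats."""
    return sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)

for beta in (0.05, 0.5, 5.0):
    pi = optimal_policy(beta)
    print(f"beta={beta}: KL={kl_from_ref(pi):.3f}, pi*={[round(p, 3) for p in pi]}")
```

With a small $\beta$ the policy collapses onto the single highest-reward response, which is exactly the regime where an imperfect reward model gets exploited; with a large $\beta$ the policy stays close to the reference.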

Distribution Shift in Reward Models

The reward model is trained on the SFT model's output distribution. As PPO shifts the policy, the reward model evaluates out-of-distribution text, making its predictions unreliable.

Solutions:

Periodic reward model retraining on policy outputs
Ensemble reward models for uncertainty estimation
Conservative KL penalty to keep the policy close to the SFT distribution
DPO (avoids the reward model entirely)
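The ensemble idea can be sketched as a conservative score: take the mean across reward models and subtract a multiple of their disagreement. The weighting `k` and the toy scores are hypothetical; taking the minimum across members is another common conservative choice.

```python
from statistics import mean, pstdev

def conservative_reward(scores, k=1.0):
    """Mean ensemble score minus k standard deviations of disagreement."""
    return mean(scores) - k * pstdev(scores)

in_dist = conservative_reward([1.9, 2.0, 2.1])   # members agree
out_dist = conservative_reward([4.5, 0.5, 2.5])  # members disagree strongly

print(f"{in_dist:.3f} {out_dist:.3f}")
```

The second response has the higher average score but ends up with the lower conservative reward, because large disagreement is treated as a sign that the text is out of distribution for the reward models.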

Part 7 - The Full Picture: From Pre-Training to Deployment

[Figure: Pre-Training to Deployment Pipeline]

Compute Breakdown

Stage	Compute (relative)	Data	Duration
Pre-training	1000x	Trillions of tokens	Months
SFT	1x	10K-100K examples	Hours
Reward modeling	1-5x	100K comparisons	Hours-days
PPO/DPO	5-20x	Online generation	Days
Total alignment	<3% of pre-training	-	-

The remarkable finding: alignment is cheap relative to pre-training. Less than 3% of compute produces the difference between a base model and ChatGPT.
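The "less than 3%" figure follows directly from the relative numbers in the table, as this short check shows (the values are the table's rough estimates, not measurements):

```python
pretraining = 1000
alignment_low = 1 + 1 + 5     # SFT + reward modeling + PPO/DPO, low end of each range
alignment_high = 1 + 5 + 20   # high end of each range

low_share = alignment_low / pretraining
high_share = alignment_high / pretraining
print(f"{low_share:.1%} to {high_share:.1%}")  # 0.7% to 2.6%
```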

Part 8 - Practice Problems

Problem 1: Reward Model Design

You're building a reward model for a coding assistant. What specific challenges arise compared to a general-purpose reward model? How would you handle them?

Hint 1 - Direction

Code quality has objective components (correctness, efficiency) and subjective ones (readability, style). Think about what signal types are available.

Full Answer + Rubric

Challenges specific to code:

Correctness is verifiable: Unlike general text, code can be tested. Use execution-based reward signals (does the code pass test cases?) as a complement to human preferences.
Multi-dimensional quality: Code quality includes correctness, efficiency, readability, security, and idiomatic style. A single scalar reward conflates these.
Length bias: Good code is often shorter. The reward model must not prefer verbose explanations.
Language-specific knowledge: Python style differs from Java style. The reward model needs per-language evaluation capability.

Solutions:

Hybrid reward: Combine a learned preference model with execution-based verification (unit test pass rate).

$r_\text{total}(x, y) = \alpha \cdot r_\text{learned}(x, y) + (1-\alpha) \cdot r_\text{execution}(x, y)$

Multi-aspect reward: Train separate reward models for correctness, efficiency, and style. Combine with configurable weights.
Code-specific labeler pool: Use experienced programmers, not general crowdworkers, for preference labeling.
Execution sandbox: Run generated code in a sandbox and use pass@k metrics as an additional signal.
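A minimal sketch of the hybrid reward, assuming the learned score has already been scaled to $[0, 1]$ so it is comparable with a pass rate, and that test outcomes are available as booleans from a sandboxed run. The function names and the default `alpha=0.5` are illustrative.

```python
def execution_reward(test_results):
    """Fraction of unit tests passed; test_results is a list of booleans."""
    return sum(test_results) / len(test_results) if test_results else 0.0

def hybrid_reward(r_learned, test_results, alpha=0.5):
    """alpha * learned preference score + (1 - alpha) * unit-test pass rate."""
    return alpha * r_learned + (1 - alpha) * execution_reward(test_results)

score = hybrid_reward(r_learned=0.8, test_results=[True, True, False, True])
print(f"{score:.3f}")  # 0.5 * 0.8 + 0.5 * 0.75 = 0.775
```

A higher `alpha` trusts the learned preference model more (style, readability); a lower `alpha` leans on objective correctness.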

Scoring:

Strong Hire: Identifies execution-based verification as a unique advantage, proposes hybrid reward, considers multi-aspect quality
Lean Hire: Mentions code correctness testing but doesn't integrate it into the reward model design
No Hire: Treats code reward modeling identically to text reward modeling

Problem 2: DPO Derivation Walkthrough

Starting from the RLHF objective $\max_\theta \mathbb{E}[r(x,y) - \beta D_\text{KL}(\pi_\theta \| \pi_\text{ref})]$, derive the DPO loss. Show each step.

Hint 1 - Direction

First find the optimal policy $\pi^*$ by solving the KL-constrained optimization. Then express $r(x,y)$ in terms of $\pi^*$ and $\pi_\text{ref}$. Finally substitute into the Bradley-Terry model.

Full Answer

Step 1: Solve for optimal policy.

The objective $\max_\pi \mathbb{E}_{y \sim \pi}[r(x,y)] - \beta D_\text{KL}(\pi \| \pi_\text{ref})$ has the closed-form solution:

$\pi^*(y|x) = \frac{1}{Z(x)} \pi_\text{ref}(y|x) \exp\left(\frac{r(x,y)}{\beta}\right)$

where $Z(x) = \sum_y \pi_\text{ref}(y|x) \exp(r(x,y)/\beta)$ is the partition function.

Step 2: Express reward in terms of policy.

Taking the log and rearranging:

$r(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_\text{ref}(y|x)} + \beta \log Z(x)$

Step 3: Substitute into Bradley-Terry.

The preference probability under Bradley-Terry:

$P(y_w \succ y_l) = \sigma(r(x,y_w) - r(x,y_l))$

Substituting:

$= \sigma\left(\beta \log \frac{\pi^*(y_w|x)}{\pi_\text{ref}(y_w|x)} + \beta \log Z(x) - \beta \log \frac{\pi^*(y_l|x)}{\pi_\text{ref}(y_l|x)} - \beta \log Z(x)\right)$

The $\beta \log Z(x)$ terms cancel:

$= \sigma\left(\beta \log \frac{\pi^*(y_w|x)}{\pi_\text{ref}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_\text{ref}(y_l|x)}\right)$

Step 4: DPO loss.

Replace $\pi^*$ with the parameterized policy $\pi_\theta$ and maximize the log-likelihood of the observed preferences, which is equivalent to minimizing the negative log-likelihood:

$\mathcal{L}_\text{DPO} = -\mathbb{E} \left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)}\right)\right]$

Problem 3: RLHF Debugging

You've trained a reward model and run PPO for 1000 steps. The reward score keeps increasing, but human evaluators say the model's responses have gotten worse. What's happening?

Hint 1 - Direction

This is the classic reward hacking problem. The model has found a way to score high on the reward model without actually being better.

Full Answer + Rubric

Diagnosis: Reward hacking - the policy has learned to exploit patterns in the reward model that don't correspond to actual quality.

Common reward hacking patterns to check:

Length gaming: Model outputs much longer responses. Many reward models have a length bias. Check average response length over training.
Format gaming: Excessive use of bullet points, headers, or markdown that the reward model associates with quality.
Sycophancy: Starting every response with "Great question!" or similar phrases the reward model was trained to prefer.
KL divergence: Check if $D_\text{KL}(\pi_\theta \| \pi_\text{ref})$ has exploded. If so, the policy has diverged far from the SFT model into a region where the reward model is unreliable.

Fixes:

Increase $\beta$: Stronger KL penalty to keep the policy closer to the reference model.
Length normalization: Normalize reward by response length.
Reward model ensemble: Use multiple reward models and take the conservative estimate (minimum).
Early stopping: Use human evaluation as the stopping criterion, not reward score.
DPO: Switch to DPO to avoid the reward model entirely.
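The checks above can be automated with a simple monitor over logged training metrics. This is a rough sketch: the metric names, the thresholds, and the toy history are hypothetical, and a real setup would tune them against human evaluation.

```python
def flag_reward_hacking(history, kl_limit=10.0, length_growth_limit=1.5):
    """Flag runs where reward rises alongside a KL or length blow-up.

    history: list of dicts with keys 'reward', 'kl', 'mean_length',
    ordered by training step. Thresholds are hypothetical defaults.
    """
    first, last = history[0], history[-1]
    reward_up = last["reward"] > first["reward"]
    kl_exploded = last["kl"] > kl_limit
    length_blowup = last["mean_length"] > length_growth_limit * first["mean_length"]
    return reward_up and (kl_exploded or length_blowup)

suspicious_run = [
    {"reward": 0.5, "kl": 2.0, "mean_length": 120},
    {"reward": 3.0, "kl": 25.0, "mean_length": 400},
]
healthy_run = [
    {"reward": 0.5, "kl": 2.0, "mean_length": 120},
    {"reward": 1.0, "kl": 4.0, "mean_length": 130},
]
print(flag_reward_hacking(suspicious_run), flag_reward_hacking(healthy_run))
```

A flag is a prompt to send samples to human evaluators, not proof of reward hacking on its own.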

Scoring:

Strong Hire: Immediately identifies reward hacking, gives specific examples, proposes multiple targeted fixes including KL analysis
Lean Hire: Identifies reward hacking in general but can't give specific patterns or fixes
No Hire: Suggests training longer, increasing learning rate, or collecting more preference data

Part 9 - The Papers in Context

InstructGPT (Ouyang et al., 2022)

Key contribution: First demonstration that RLHF at scale produces dramatically better assistants
Surprising finding: The 1.3B InstructGPT model was preferred over the 175B base GPT-3 by human evaluators
Data efficiency: Only ~13K demonstrations and ~33K prompts with ranked comparisons - tiny relative to pre-training data

Constitutional AI (Bai et al., 2022)

Key contribution: Showed that AI feedback can replace human feedback for harmlessness training
Insight: Self-critique and revision produces higher-quality training data than trying to train directly
Limitation: Relies on the model already being capable enough to self-critique

DPO (Rafailov et al., 2023)

Key contribution: Mathematical proof that you can skip the reward model entirely
Impact: Dramatically simplified the alignment pipeline, made alignment accessible to smaller teams
Limitation: Offline training means no exploration - the quality ceiling depends on the preference data distribution

Interview Cheat Sheet

Question Pattern	Framework	Key Phrases
"Explain RLHF"	3 stages: SFT → RM → PPO, with KL penalty	"RLHF has three stages: supervised fine-tuning on demonstrations, reward model training on preference comparisons, and PPO optimization with a KL penalty to prevent reward hacking"
"What is reward hacking?"	Model exploits RM → high score but low quality → detection and mitigation	"The policy finds outputs that score high with the reward model but aren't actually preferred by humans - like generating excessively long or sycophantic responses"
"Explain DPO"	Same objective as RLHF → closed-form solution → partition function cancels → direct loss	"DPO shows the KL-constrained reward maximization has a closed-form optimal policy. By rearranging, you can express the reward implicitly and train directly on preferences."
"RLHF vs DPO trade-offs?"	RLHF: better exploration, harder to train. DPO: simpler, offline, stable	"RLHF can explore new responses during training but requires 4 models and is unstable. DPO is simpler and more stable but limited to the preference data distribution."
"Constitutional AI?"	Self-critique → revision → RLAIF → scalable harmlessness	"Constitutional AI uses the model to critique and revise its own responses based on principles, then trains on AI preferences rather than human ones."
"Why KL penalty?"	Prevents divergence → reward model is only accurate near SFT distribution	"Without the KL penalty, the policy would drift into regions where the reward model is unreliable, leading to reward hacking."

Spaced Repetition Checkpoints

Day 0: Read this page. Draw the 3-stage InstructGPT pipeline from memory. Write the DPO loss.
Day 3: Explain the Bradley-Terry model and why it's used for reward modeling. Derive the DPO loss from the RLHF objective.
Day 7: Compare RLHF, Constitutional AI, and DPO - give 3 advantages and 2 disadvantages of each.
Day 14: Explain reward hacking with 3 specific examples and mitigations.
Day 21: Solve all three practice problems from memory. Time yourself - 8-10 minutes each.

Next Steps

Continue to Diffusion Model Papers for the generative modeling revolution beyond language
Review GPT Series to understand the base models that RLHF aligns
For LoRA-based alignment, see LoRA and PEFT
