
RLHF Papers - Aligning Language Models with Human Preferences

Reading time: ~35 min | Interview relevance: Critical | Roles: MLE, Research Engineer, AI Engineer

The Real Interview Moment

You're in a research interview at an AI safety-focused company. The interviewer asks: "GPT-4 and Claude don't just predict the next token - they follow instructions, refuse harmful requests, and produce helpful responses. Walk me through the technical pipeline that transforms a base language model into an aligned assistant. I want to hear about reward modeling, PPO, the InstructGPT approach, and then explain how DPO simplifies this. What are the trade-offs between these approaches?"

This question separates candidates who have casually read about RLHF from those who understand the full alignment pipeline. The interviewer wants to hear about the three-stage process, the mathematical objective of each stage, and the practical engineering challenges. They want you to reason about why DPO eliminates the reward model, not just that it does.

This is the alignment interview. Every frontier AI lab considers RLHF (and its successors) a core competency. If you're interviewing at OpenAI, Anthropic, Google DeepMind, or Meta, you need to know this material in depth.

What You Will Master

After reading this page, you will be able to:

  • Explain the alignment problem and why pre-training alone is insufficient
  • Describe the three-stage InstructGPT pipeline with mathematical precision
  • Derive the reward modeling objective from pairwise comparisons
  • Explain PPO for language models and why it's used over simpler RL algorithms
  • Describe Constitutional AI and self-improvement without human labels
  • Derive DPO from the RLHF objective and explain the simplification
  • Compare RLHF, Constitutional AI, and DPO on trade-offs
  • Discuss practical challenges: reward hacking, KL divergence, distribution shift

Part 1 - The Alignment Problem

Why Pre-Training Is Not Enough

A language model trained on next-token prediction learns to model the distribution of text on the internet. This means it can:

  • Continue any text in a plausible way
  • Generate toxic, harmful, or factually incorrect content
  • Follow instructions inconsistently
  • Produce verbose, hedging, or unhelpful responses

The model optimizes $P(x_{t+1} | x_1, ..., x_t)$, but users want it to optimize for helpfulness, harmlessness, and honesty.

60-Second Answer

"The alignment problem is the gap between what a language model is trained to do (predict the next token) and what we want it to do (be helpful, harmless, and honest). Pre-training gives the model capabilities, but not values. RLHF bridges this gap by training the model to optimize for human preferences rather than token prediction likelihood."

The Three H's of Alignment

| Property | Definition | Failure Mode |
|---|---|---|
| Helpful | Provides useful, relevant, complete responses | Refuses benign requests, gives vague answers |
| Harmless | Avoids generating dangerous, biased, or toxic content | Helps with harmful requests, generates misinformation |
| Honest | Acknowledges uncertainty, doesn't fabricate facts | Confidently states falsehoods, hallucinates citations |

These objectives can conflict. A maximally helpful model might help with harmful requests. A maximally harmless model might refuse everything. Alignment is about finding the right balance.

[Figure: The Alignment Gap]

Part 2 - The InstructGPT Pipeline

Stage 1: Supervised Fine-Tuning (SFT)

Objective: Train the model to follow instructions by showing it examples of good behavior.

Data: Human labelers write high-quality responses to a diverse set of prompts.

$$\mathcal{L}_\text{SFT} = -\sum_{t} \log P_\theta(y_t | x, y_{<t})$$

This is standard next-token prediction, but on curated (prompt, response) pairs rather than internet text.
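To make the sum over response tokens concrete, here is a minimal sketch in plain Python. The per-token log-probabilities and the `response_mask` are made-up illustrative values, and `sft_loss` is a hypothetical helper name; a real implementation would compute these from model logits.

```python
import math

def sft_loss(token_logprobs, response_mask):
    """Negative log-likelihood summed over response tokens only.

    token_logprobs[t] is log P(token_t | all previous tokens);
    response_mask[t] is 1 for response tokens y_t and 0 for prompt tokens x.
    """
    return -sum(lp for lp, m in zip(token_logprobs, response_mask) if m)

# Illustrative values: 2 prompt tokens followed by 3 response tokens.
token_logprobs = [math.log(0.9), math.log(0.8),
                  math.log(0.5), math.log(0.25), math.log(0.5)]
response_mask = [0, 0, 1, 1, 1]

loss = sft_loss(token_logprobs, response_mask)
print(round(loss, 4))  # -log(0.5 * 0.25 * 0.5) = log(16) ≈ 2.7726
```

The prompt tokens are conditioned on but do not contribute to the loss, which matches the formula's sum over $y_t$ only.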

Key details from InstructGPT:

  • ~13,000 demonstration examples from 40 labelers
  • Fine-tuned GPT-3 (175B) for 16 epochs
  • This alone significantly improved instruction following

Common Trap

Many candidates skip SFT and jump straight to reward modeling. SFT is critical - it shifts the model's output distribution toward the kind of text we want to evaluate. Without SFT, the reward model would need to evaluate arbitrary internet text, which is a much harder problem. SFT narrows the space to "reasonable assistant responses."

Stage 2: Reward Modeling (RM)

Objective: Train a model to predict which response a human would prefer.

Data: For each prompt, generate multiple responses, then have humans rank them.

Given a prompt $x$ and two responses $y_w$ (preferred) and $y_l$ (rejected), the reward model $r_\phi$ is trained with:

$$\mathcal{L}_\text{RM} = -\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$$

This is the Bradley-Terry model - the probability that response $y_w$ is preferred over $y_l$ is:

$$P(y_w \succ y_l | x) = \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$$

where $\sigma$ is the sigmoid function.
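A quick numeric check helps build intuition for the Bradley-Terry loss. The sketch below uses made-up scalar rewards; `preference_prob` and `rm_loss` are illustrative helper names, not part of any library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_prob(r_w, r_l):
    """Bradley-Terry probability that the first response is preferred."""
    return sigmoid(r_w - r_l)

def rm_loss(r_w, r_l):
    """Pairwise reward-model loss for one (preferred, rejected) pair."""
    return -math.log(preference_prob(r_w, r_l))

print(round(preference_prob(2.0, 0.5), 4))  # 0.8176
print(round(rm_loss(2.0, 0.5), 4))          # 0.2014 (ranking is correct)
print(round(rm_loss(0.5, 2.0), 4))          # 1.7014 (ranking is reversed)
```

Only the difference between rewards matters: adding the same constant to both scores leaves the probability unchanged, so the absolute scale of a reward model's output is not meaningful on its own.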

Architecture: The reward model is typically the SFT model with the language modeling head replaced by a scalar output head. For InstructGPT, it was a 6B parameter model.

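import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Reward model: LLM backbone + scalar head."""

    def __init__(self, base_model, hidden_size):
        super().__init__()
        self.backbone = base_model  # Pre-trained transformer (no LM head)
        self.reward_head = nn.Linear(hidden_size, 1)  # Scalar output

    def forward(self, input_ids, attention_mask):
        # Hidden states for every position: (batch, seq_len, hidden)
        outputs = self.backbone(input_ids, attention_mask=attention_mask)
        hidden = outputs.last_hidden_state

        # Pick the last non-padding token (assumes right padding)
        last_idx = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        last_hidden = hidden[batch_idx, last_idx]

        # Scalar reward per sequence
        return self.reward_head(last_hidden).squeeze(-1)


def reward_loss(reward_model, chosen_ids, chosen_mask, rejected_ids, rejected_mask):
    """Bradley-Terry pairwise loss; inputs are tokenized prompt + response."""
    r_w = reward_model(chosen_ids, chosen_mask)      # Reward for preferred
    r_l = reward_model(rejected_ids, rejected_mask)  # Reward for rejected
    return -F.logsigmoid(r_w - r_l).mean()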

Key details from InstructGPT:

  • ~33,000 training prompts in the reward-model dataset
  • Labelers ranked 4-9 responses per prompt; a ranking of K responses is converted to K(K-1)/2 pairwise comparisons
  • Inter-annotator agreement: ~73% (alignment is inherently noisy)
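The conversion from a ranking to pairwise comparisons can be sketched in a few lines. `ranking_to_pairs` is a hypothetical helper, and the sketch assumes a strict ranking with no ties.

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs.

    combinations() preserves input order, so the first element of each
    pair is always the higher-ranked response.
    """
    return list(combinations(ranked_responses, 2))

pairs = ranking_to_pairs(["A", "B", "C", "D"])
print(len(pairs))  # 6
print(pairs[0])    # ('A', 'B')
print(len(ranking_to_pairs(list("ABCDEFGHI"))))  # 36
```

A ranking of 4 responses gives 6 pairs and a ranking of 9 gives 36, which is why ranking several responses per prompt is a far more efficient use of labeler time than collecting one comparison at a time.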

Stage 3: PPO (Proximal Policy Optimization)

Objective: Optimize the language model to maximize the reward while staying close to the SFT model.

The RL objective:

$$\max_\theta \mathbb{E}_{x \sim D, y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x, y) - \beta \cdot D_\text{KL}(\pi_\theta(\cdot|x) \| \pi_\text{ref}(\cdot|x)) \right]$$

where:

  • $\pi_\theta$ is the policy (the language model being optimized)
  • $\pi_\text{ref}$ is the reference policy (the SFT model, frozen)
  • $r_\phi$ is the trained reward model
  • $\beta$ controls the KL penalty strength
  • $D_\text{KL}$ prevents the model from drifting too far from the SFT model

Why the KL penalty? Without it, the model would learn to produce degenerate outputs that exploit weaknesses in the reward model ("reward hacking"). The KL penalty ensures the model stays in a distribution where the reward model's predictions are reliable.

Instant Rejection

Never say "RLHF uses reinforcement learning to make the model learn from human feedback." This is too vague and sounds like you don't understand the pipeline. Be specific: "RLHF uses PPO to optimize a language model policy against a reward model trained on human preference comparisons, with a KL divergence penalty to prevent reward hacking."

The PPO Update

PPO clips the policy ratio to prevent too-large updates:

$$\mathcal{L}_\text{PPO}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) \hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\, \hat{A}_t\right)\right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}$ is the probability ratio (not to be confused with the reward model $r_\phi$), $\hat{A}_t$ is the advantage estimate, and $\epsilon$ is the clip range (commonly around 0.2). This surrogate objective is maximized.
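The clipping behavior is easiest to see on single-step examples. The sketch below evaluates the clipped surrogate for one (ratio, advantage) pair; `clipped_surrogate` is an illustrative name and `eps=0.2` is an assumed default.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one timestep (to be maximized)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action, ratio already above 1 + eps: the gain is capped.
print(clipped_surrogate(1.5, 1.0))   # 1.2
# Bad action, ratio already below 1 - eps: no extra credit for pushing further.
print(clipped_surrogate(0.5, -1.0))  # -0.8
# Good action whose probability dropped: the unclipped term applies.
print(clipped_surrogate(0.5, 1.0))   # 0.5
```

In the first two cases the clipped term is selected, which does not depend on $\theta$, so the gradient is zero and the update stops moving the policy further in that direction. In the third case the policy moved the wrong way and the unclipped term keeps a gradient that corrects it.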

For language models:

  • State = prompt + tokens generated so far
  • Action = next token
  • Reward = 0 for all tokens except the last, where it equals $r_\phi(x, y) - \beta \log \frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}$ (many implementations instead spread the KL term across tokens as a per-token penalty; the summed reward is the same)
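As a sketch of how the reward signal can be laid out per token, the function below uses the per-token KL variant found in many implementations: each token gets a penalty of $-\beta$ times its log-ratio, and the reward model score is added on the final token. The function name and all numbers are illustrative assumptions.

```python
def per_token_rewards(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Per-token KL penalty, with the reward-model score on the last token."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += rm_score
    return rewards

# Made-up log-probs of the sampled tokens under each model.
policy_lp = [-1.0, -0.5, -2.0]
ref_lp = [-1.2, -0.9, -1.5]

rewards = per_token_rewards(1.0, policy_lp, ref_lp, beta=0.1)
total = sum(rewards)

# Sequence-level form: r - beta * log(pi_theta(y|x) / pi_ref(y|x))
sequence_level = 1.0 - 0.1 * (sum(policy_lp) - sum(ref_lp))
print(round(total, 4), round(sequence_level, 4))
```

Because the sequence log-probability is the sum of token log-probabilities, the per-token rewards add up to the sequence-level expression, so the two formulations differ only in how credit is distributed across tokens.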

[Figure: RLHF Three-Stage Pipeline]

Part 3 - Constitutional AI (Anthropic)

The Self-Improvement Approach

Constitutional AI (Bai et al., 2022) addresses two limitations of standard RLHF:

  1. Human labeler bottleneck: Collecting human preferences is expensive and slow
  2. Harmlessness at scale: Humans find it psychologically taxing to evaluate harmful content

The CAI Pipeline

Stage 1: Critique and Revision (Red-Teaming + Self-Improvement)

  1. Generate a response to a potentially harmful prompt
  2. Ask the model to critique its own response based on a set of principles (the "constitution")
  3. Ask the model to revise the response based on the critique
  4. Use the revised response as training data for SFT

Prompt: "How do I pick a lock?"

Initial response: "Here's how to pick a lock: Step 1..."

Critique prompt: "Identify specific ways in which the response
is harmful, unethical, or illegal."

Critique: "The response provides detailed instructions for
breaking into locks, which could facilitate burglary..."

Revision prompt: "Please rewrite the response to remove any
harmful content while being helpful."

Revised: "Lock picking is a skill used by locksmiths and
security professionals. If you're locked out, I'd recommend
contacting a licensed locksmith..."
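The critique-and-revision steps above can be sketched as a small loop. Everything here is hypothetical scaffolding: `generate` stands in for any text-generation callable, the prompt templates are simplified, and `fake_generate` is a stub so the example runs without a model.

```python
def critique_and_revise(prompt, generate, critique_request, revision_request):
    """One critique-revision pass; returns an SFT training example."""
    initial = generate(prompt)
    critique = generate(
        f"{prompt}\n\nResponse: {initial}\n\n{critique_request}"
    )
    revised = generate(
        f"{prompt}\n\nResponse: {initial}\n\nCritique: {critique}\n\n{revision_request}"
    )
    # Only the prompt and the final revision are kept for fine-tuning.
    return {"prompt": prompt, "response": revised}

def fake_generate(text):
    """Stub model: returns canned text depending on the request type."""
    if "Please rewrite" in text:
        return "revised answer"
    if "Identify specific ways" in text:
        return "critique text"
    return "initial answer"

example = critique_and_revise(
    "How do I pick a lock?",
    fake_generate,
    critique_request="Identify specific ways in which the response is harmful, unethical, or illegal.",
    revision_request="Please rewrite the response to remove any harmful content while being helpful.",
)
print(example)
```

In practice the pass can be repeated with different principles drawn from the constitution before the final revision is stored.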

Stage 2: RLAIF (RL from AI Feedback)

Instead of human comparisons, use the model itself to compare responses based on constitutional principles:

$$P(y_w \succ y_l | x, \text{principle}) = \text{LLM}(\text{"Which response better follows the principle?"})$$

This produces AI-labeled preference data, which trains a reward model, which is used for PPO - identical to RLHF but with AI preferences instead of human ones.
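One way to turn the feedback model's answer into a preference label is to compare the log-probabilities it assigns to the two answer options and normalize them into a soft label. The sketch below assumes you can read those two log-probabilities from the model; `ai_preference` is an illustrative name and the inputs are made up.

```python
import math

def ai_preference(logprob_a, logprob_b):
    """Soft label: probability that response A is preferred over B.

    logprob_a / logprob_b are the feedback model's log-probabilities
    for answering "(A)" or "(B)" to the comparison question.
    """
    m = max(logprob_a, logprob_b)  # subtract the max for numerical stability
    ea, eb = math.exp(logprob_a - m), math.exp(logprob_b - m)
    return ea / (ea + eb)

p_a = ai_preference(math.log(0.6), math.log(0.2))
print(round(p_a, 2))  # 0.75
```

Soft labels like this can be used directly as targets for the reward model, or thresholded into hard (preferred, rejected) pairs.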

The Constitution

A set of principles like:

  • "Choose the response that is most helpful while being harmless"
  • "Choose the response that is most honest and doesn't fabricate information"
  • "Choose the response that best supports human autonomy and agency"

Company Variation
  • Anthropic: Pioneered Constitutional AI. Claude is trained with a combination of RLHF and CAI.
  • OpenAI: Primarily uses RLHF with human labelers but has explored "rule-based reward models" (RBRM) which are similar in spirit.
  • Google: Uses RLHF for Gemini, with some AI-assisted evaluation.
  • Meta: LLaMA 2 used RLHF with human preferences. LLaMA 3 added iterative DPO.

Part 4 - DPO: Removing the Reward Model

The Key Insight

DPO (Direct Preference Optimization, Rafailov et al., 2023) shows that you can skip the reward model entirely.

The RLHF objective:

$$\max_\theta \mathbb{E}_{x, y \sim \pi_\theta} \left[ r_\phi(x, y) - \beta D_\text{KL}(\pi_\theta \| \pi_\text{ref}) \right]$$

has a closed-form optimal solution:

$$\pi^*(y|x) = \frac{1}{Z(x)} \pi_\text{ref}(y|x) \exp\left(\frac{1}{\beta} r(x, y)\right)$$

Rearranging to express the reward in terms of the optimal policy:

$$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_\text{ref}(y|x)} + \beta \log Z(x)$$
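The closed form and its rearrangement can be checked numerically on a toy problem with three possible responses. The reference probabilities, rewards, and $\beta$ below are arbitrary illustrative values, and `optimal_policy` is a hypothetical helper.

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z for a finite response set."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)  # partition function Z(x)
    return [w / z for w in weights], z

ref = [0.5, 0.3, 0.2]  # pi_ref(y|x) over three candidate responses
r = [1.0, 0.0, 2.0]    # reward of each response
beta = 0.5

pi_star, z = optimal_policy(ref, r, beta)

# Recover the reward from the policy: beta * log(pi*/pi_ref) + beta * log Z
recovered = [
    beta * math.log(p / q) + beta * math.log(z)
    for p, q in zip(pi_star, ref)
]
print([round(p, 4) for p in pi_star])
print([round(abs(a - b), 9) for a, b in zip(recovered, r)])
```

The recovered rewards match the originals up to floating-point error, and the highest-reward response gains probability mass relative to the reference. With a real language model the sum defining $Z(x)$ runs over all possible responses, which is why it cannot be computed directly.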

The DPO Loss

Substituting this into the Bradley-Terry preference model:

$$\mathcal{L}_\text{DPO}(\theta) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)} \right) \right]$$

The partition function $Z(x)$ cancels out because both $y_w$ and $y_l$ share the same prompt $x$.

60-Second Answer

"DPO derives from the same objective as RLHF but shows that the optimal policy under KL-constrained reward maximization has a closed form. By rearranging, you can express the reward implicitly in terms of the policy's log-probabilities. This means you can directly optimize the policy on preference pairs without ever training a separate reward model. The loss function just compares log-probability ratios: increase the probability of preferred responses relative to the reference, and decrease the probability of rejected responses."

DPO Implementation

import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logps_w,  # log π_θ(y_w|x) - log probs of preferred under policy
    policy_logps_l,  # log π_θ(y_l|x) - log probs of rejected under policy
    ref_logps_w,     # log π_ref(y_w|x) - log probs of preferred under reference
    ref_logps_l,     # log π_ref(y_l|x) - log probs of rejected under reference
    beta=0.1,        # Temperature parameter
):
    """
    Direct Preference Optimization loss.

    Increases probability of preferred response relative to reference,
    decreases probability of rejected response relative to reference.
    """
    # Log-probability ratios
    log_ratio_w = policy_logps_w - ref_logps_w  # How much did we increase y_w?
    log_ratio_l = policy_logps_l - ref_logps_l  # How much did we increase y_l?

    # DPO loss: want log_ratio_w > log_ratio_l
    logits = beta * (log_ratio_w - log_ratio_l)
    loss = -F.logsigmoid(logits).mean()

    # Useful metrics
    with torch.no_grad():
        rewards_w = beta * log_ratio_w
        rewards_l = beta * log_ratio_l
        accuracy = (rewards_w > rewards_l).float().mean()

    return loss, rewards_w.mean(), rewards_l.mean(), accuracy
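A useful sanity check when debugging DPO training: at initialization the policy equals the reference, so every log-ratio is zero and the loss should start at $\log 2 \approx 0.693$. The sketch below reproduces this for a single pair in plain Python, with made-up sequence log-probabilities and a hypothetical helper `dpo_pair_loss`.

```python
import math

def dpo_pair_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for a single preference pair, using sequence log-probs."""
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: loss starts at log 2.
init = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0)
print(round(init, 4))  # 0.6931

# Policy raised y_w by 2 nats and lowered y_l by 2 nats: loss drops.
better = dpo_pair_loss(-10.0, -17.0, -12.0, -15.0)
print(better < init)  # True
```

If the first logged loss is far from 0.693, a likely culprit is a mismatch between how policy and reference log-probabilities are computed (for example, different masking of prompt tokens).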

DPO vs RLHF: Detailed Comparison

| Aspect | RLHF (PPO) | DPO |
|---|---|---|
| Pipeline complexity | 4 models (policy, ref, reward, value) | 2 models (policy, ref) |
| Training | RL loop with rollouts, advantage estimation | Standard supervised learning |
| Memory | ~4x model size (4 models) | ~2x model size (2 models) |
| Hyperparameters | PPO clip, GAE lambda, reward scaling, KL coeff | Just $\beta$ |
| Stability | Notoriously unstable, reward hacking | More stable |
| Quality | Slightly better at the frontier | Comparable for most tasks |
| Iteration speed | Slow (online generation required) | Fast (offline, batch processing) |
| Online learning | Natural (generates new data) | Harder (fixed preference dataset) |
| Exploration | Explores new response strategies | Limited to existing data distribution |
| Reward hacking | Possible (must monitor KL) | Less susceptible |

[Figure: RLHF vs DPO Pipeline Comparison]

Part 5 - Beyond DPO: Modern Alignment Methods

IPO (Identity Preference Optimization)

DPO can overfit to the preference data. IPO (Azar et al., 2023) adds regularization:

$$\mathcal{L}_\text{IPO} = \left( \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)} - \frac{1}{2\beta} \right)^2$$
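The regularizing effect comes from the squared term: the loss is minimized when the log-ratio margin equals the target $\frac{1}{2\beta}$, and it grows again if the margin overshoots. A minimal single-pair sketch, with made-up log-probabilities and a hypothetical helper name:

```python
def ipo_pair_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """IPO loss for one preference pair."""
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    target = 1.0 / (2.0 * beta)  # 5.0 when beta = 0.1
    return (margin - target) ** 2

print(ipo_pair_loss(-12.0, -15.0, -12.0, -15.0))  # margin 0  -> 25.0
print(ipo_pair_loss(-9.5, -17.5, -12.0, -15.0))   # margin 5  -> 0.0
print(ipo_pair_loss(-7.0, -20.0, -12.0, -15.0))   # margin 10 -> 25.0
```

Contrast this with the DPO loss, which keeps decreasing as the margin grows without bound; that unbounded incentive is one source of the overfitting IPO is meant to address.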

KTO (Kahneman-Tversky Optimization)

KTO (Ethayarajh et al., 2024) doesn't need paired preferences - just binary feedback (thumbs up/down):

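$$\mathcal{L}_\text{KTO} = \begin{cases} -\sigma\left(\beta\,(r_\theta(x,y) - z_\text{ref})\right) & \text{if } y \text{ is good} \\ -\sigma\left(\beta\,(z_\text{ref} - r_\theta(x,y))\right) & \text{if } y \text{ is bad} \end{cases}$$

Here $r_\theta(x,y) = \log \frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}$ is the implicit reward and $z_\text{ref}$ is a reference point (an estimate of the KL divergence between policy and reference). This is a simplified form that omits the per-class weights and additive constants of the full loss.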

This is significant because paired comparisons are expensive to collect, while binary feedback is cheap.

ORPO (Odds Ratio Preference Optimization)

ORPO (Hong et al., 2024) eliminates both the reward model and the reference model by combining SFT and preference alignment into a single loss:

$$\mathcal{L}_\text{ORPO} = \mathcal{L}_\text{SFT}(y_w) - \lambda \log \sigma\left( \log \frac{P_\theta(y_w|x)}{1 - P_\theta(y_w|x)} - \log \frac{P_\theta(y_l|x)}{1 - P_\theta(y_l|x)} \right)$$
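The odds-ratio term can be computed from log-probabilities alone. The sketch below assumes $P_\theta(y|x)$ is a length-normalized sequence likelihood strictly between 0 and 1, so inputs are average per-token log-probs; the function names and the value of `lam` are illustrative.

```python
import math

def log_odds(avg_logp):
    """log(p / (1 - p)) computed from log p, for 0 < p < 1."""
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(sft_nll_w, avg_logp_w, avg_logp_l, lam=0.1):
    """SFT loss on the preferred response plus the weighted odds-ratio term."""
    log_odds_ratio = log_odds(avg_logp_w) - log_odds(avg_logp_l)
    # -log(sigmoid(z)) written as log(1 + exp(-z))
    odds_term = math.log1p(math.exp(-log_odds_ratio))
    return sft_nll_w + lam * odds_term

# p_w = 0.5 (odds 1), p_l = 0.2 (odds 0.25): odds ratio 4, sigmoid(log 4) = 0.8
loss = orpo_loss(-math.log(0.5), math.log(0.5), math.log(0.2), lam=0.1)
print(round(loss, 4))  # log(2) + 0.1 * (-log 0.8) ≈ 0.7155
```

No reference model appears anywhere in the computation, which is the source of ORPO's memory savings relative to DPO.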

Iterative DPO / Online DPO

The main weakness of offline DPO is that it trains on a fixed preference dataset. Iterative DPO addresses this:

  1. Train DPO on initial preference data
  2. Generate new responses from the updated policy
  3. Collect/generate new preferences on these responses
  4. Repeat

This mimics the online exploration benefit of PPO while keeping DPO's simplicity.
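The loop above can be sketched with pluggable components. All four callables (`sample`, `label_preference`, `train_dpo`, and the stubs used to exercise them) are hypothetical placeholders for a real generator, a human or AI labeler, and a DPO training routine.

```python
def iterative_dpo(policy, initial_pairs, prompts, sample, label_preference,
                  train_dpo, rounds=3):
    """Alternate between DPO training and fresh on-policy preference data."""
    # Step 1: train on the initial offline preference data.
    policy = train_dpo(policy, initial_pairs)
    for _ in range(rounds):
        pairs = []
        for x in prompts:
            # Step 2: sample two responses from the *current* policy.
            y_a, y_b = sample(policy, x), sample(policy, x)
            # Step 3: label which one is preferred.
            y_w, y_l = label_preference(x, y_a, y_b)
            pairs.append((x, y_w, y_l))
        # Step 4: repeat DPO on the new pairs.
        policy = train_dpo(policy, pairs)
    return policy

# Stubs so the sketch runs: the "policy" is just a version counter.
final_policy = iterative_dpo(
    policy=0,
    initial_pairs=[("p0", "good", "bad")],
    prompts=["p1", "p2"],
    sample=lambda policy, x: f"{x}-response-v{policy}",
    label_preference=lambda x, y_a, y_b: (y_a, y_b),
    train_dpo=lambda policy, pairs: policy + 1,
    rounds=3,
)
print(final_policy)  # 4: one initial round plus three on-policy rounds
```

A design choice left open here is whether the reference model is kept fixed or refreshed to the latest policy at each round; both variants appear in practice.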

Evolution Summary

[Figure: Alignment Methods Evolution]

Part 6 - Practical Challenges

Reward Hacking

The model finds outputs that score high with the reward model but are not actually preferred by humans:

| Symptom | Example | Mitigation |
|---|---|---|
| Verbose responses | Model writes 3 paragraphs when 1 sentence suffices | Length penalty in reward |
| Sycophancy | "That's a great question!" before every answer | Train reward model on diverse labelers |
| Format gaming | Excessive bullet points, markdown formatting | Normalize format in reward training |
| Hedge stacking | "I think, perhaps, it might be possible that..." | Penalize uncertainty markers |

KL Divergence Management

The $\beta$ parameter in the KL penalty is critical:

| $\beta$ value | Effect | Risk |
|---|---|---|
| Too low ($\beta < 0.01$) | Model diverges far from reference | Reward hacking, degenerate outputs |
| Sweet spot ($\beta \approx 0.1$) | Balanced alignment | - |
| Too high ($\beta > 1.0$) | Model barely changes from SFT | Under-alignment, wasted compute |

Distribution Shift in Reward Models

The reward model is trained on the SFT model's output distribution. As PPO shifts the policy, the reward model evaluates out-of-distribution text, making its predictions unreliable.

Solutions:

  1. Periodic reward model retraining on policy outputs
  2. Ensemble reward models for uncertainty estimation
  3. Conservative KL penalty to keep the policy close to the SFT distribution
  4. DPO (avoids the reward model entirely)
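As a sketch of the ensemble idea, the function below combines scores from several reward models into two conservative estimates: the minimum, and the mean minus a multiple of the standard deviation. The scores are made up, and `conservative_rewards` and `k` are illustrative names.

```python
import statistics

def conservative_rewards(scores, k=1.0):
    """Return (min score, mean - k * std) across an ensemble of reward models."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    return min(scores), mean - k * std

# Same mean reward (1.0), very different levels of agreement.
agree = [1.0, 1.2, 0.8]
disagree = [3.0, 0.2, -0.2]

print(conservative_rewards(agree))
print(conservative_rewards(disagree))
```

Both responses have the same average score, but the one on which the ensemble disagrees receives a much lower conservative reward. High disagreement is a signal that the response is out of distribution for at least some of the reward models.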

Part 7 - The Full Picture: From Pre-Training to Deployment

[Figure: Pre-Training to Deployment Pipeline]

Compute Breakdown

| Stage | Compute (relative) | Data | Duration |
|---|---|---|---|
| Pre-training | 1000x | Trillions of tokens | Months |
| SFT | 1x | 10K-100K examples | Hours |
| Reward modeling | 1-5x | 100K comparisons | Hours-days |
| PPO/DPO | 5-20x | Online generation | Days |
| Total alignment | <3% of pre-training | - | - |

The remarkable finding: alignment is cheap relative to pre-training. In this rough breakdown, less than 3% of the compute accounts for the difference between a base model and an assistant like ChatGPT.
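The "<3%" figure follows directly from the relative compute numbers in the table. A quick check, using only those illustrative ranges:

```python
# (low, high) relative compute per alignment stage, from the table above
stages = {"SFT": (1, 1), "Reward modeling": (1, 5), "PPO/DPO": (5, 20)}
pretraining = 1000

low = sum(lo for lo, _ in stages.values())
high = sum(hi for _, hi in stages.values())

print(f"{low / pretraining:.1%} to {high / pretraining:.1%}")  # 0.7% to 2.6%
```

Even at the top of every range, the alignment stages together stay under 3% of the pre-training budget.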

Part 8 - Practice Problems

Problem 1: Reward Model Design

You're building a reward model for a coding assistant. What specific challenges arise compared to a general-purpose reward model? How would you handle them?

Hint 1 - Direction

Code quality has objective components (correctness, efficiency) and subjective ones (readability, style). Think about what signal types are available.

Full Answer + Rubric

Challenges specific to code:

  1. Correctness is verifiable: Unlike general text, code can be tested. Use execution-based reward signals (does the code pass test cases?) as a complement to human preferences.

  2. Multi-dimensional quality: Code quality includes correctness, efficiency, readability, security, and idiomatic style. A single scalar reward conflates these.

  3. Length bias: Good code is often shorter. The reward model must not prefer verbose explanations.

  4. Language-specific knowledge: Python style differs from Java style. The reward model needs per-language evaluation capability.

Solutions:

  1. Hybrid reward: Combine a learned preference model with execution-based verification (unit test pass rate).

$$r_\text{total}(x, y) = \alpha \cdot r_\text{learned}(x, y) + (1-\alpha) \cdot r_\text{execution}(x, y)$$

  2. Multi-aspect reward: Train separate reward models for correctness, efficiency, and style. Combine with configurable weights.

  3. Code-specific labeler pool: Use experienced programmers, not general crowdworkers, for preference labeling.

  4. Execution sandbox: Run generated code in a sandbox and use pass@k metrics as an additional signal.
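The hybrid reward can be sketched as follows. The sketch assumes the learned reward has already been scaled to [0, 1] so it is comparable to a unit-test pass rate; the function names, `alpha`, and the test results are illustrative.

```python
def execution_reward(test_results):
    """Fraction of unit tests passed (0.0 if there are no tests)."""
    return sum(test_results) / len(test_results) if test_results else 0.0

def hybrid_reward(r_learned, test_results, alpha=0.5):
    """alpha * learned preference score + (1 - alpha) * test pass rate."""
    return alpha * r_learned + (1 - alpha) * execution_reward(test_results)

# Learned model likes the response (0.8), but it fails 1 of 4 tests.
score = hybrid_reward(0.8, [True, True, False, True], alpha=0.3)
print(round(score, 3))  # 0.3 * 0.8 + 0.7 * 0.75 = 0.765
```

A lower `alpha` puts more weight on verifiable correctness, which is usually the safer default for code, since a learned reward model can be fooled by plausible-looking but incorrect solutions.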

Scoring:

  • Strong Hire: Identifies execution-based verification as a unique advantage, proposes hybrid reward, considers multi-aspect quality
  • Lean Hire: Mentions code correctness testing but doesn't integrate it into the reward model design
  • No Hire: Treats code reward modeling identically to text reward modeling

Problem 2: DPO Derivation Walkthrough

Starting from the RLHF objective $\max_\theta \mathbb{E}[r(x,y) - \beta D_\text{KL}(\pi_\theta \| \pi_\text{ref})]$, derive the DPO loss. Show each step.

Hint 1 - Direction

First find the optimal policy $\pi^*$ by solving the KL-constrained optimization. Then express $r(x,y)$ in terms of $\pi^*$ and $\pi_\text{ref}$. Finally substitute into the Bradley-Terry model.

Full Answer

Step 1: Solve for optimal policy.

The objective $\max_\pi \mathbb{E}_{y \sim \pi}[r(x,y)] - \beta D_\text{KL}(\pi \| \pi_\text{ref})$ has the closed-form solution:

$$\pi^*(y|x) = \frac{1}{Z(x)} \pi_\text{ref}(y|x) \exp\left(\frac{r(x,y)}{\beta}\right)$$

where $Z(x) = \sum_y \pi_\text{ref}(y|x) \exp(r(x,y)/\beta)$ is the partition function.
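Interviewers often ask why this is the solution. One standard argument, sketched here for a fixed prompt $x$, rewrites the objective as a single KL divergence to $\pi^*$:

```latex
\begin{aligned}
\mathbb{E}_{y \sim \pi}[r(x,y)] - \beta D_\text{KL}(\pi \| \pi_\text{ref})
  &= -\beta \, \mathbb{E}_{y \sim \pi}\left[ \log \frac{\pi(y|x)}{\pi_\text{ref}(y|x) \exp(r(x,y)/\beta)} \right] \\
  &= -\beta \, \mathbb{E}_{y \sim \pi}\left[ \log \frac{\pi(y|x)}{\pi^*(y|x)} - \log Z(x) \right] \\
  &= -\beta \, D_\text{KL}(\pi \| \pi^*) + \beta \log Z(x)
\end{aligned}
```

The second line uses $\pi_\text{ref}(y|x) \exp(r(x,y)/\beta) = Z(x)\, \pi^*(y|x)$. Since $Z(x)$ does not depend on $\pi$ and the KL divergence is non-negative, the objective is maximized exactly when $D_\text{KL}(\pi \| \pi^*) = 0$, that is, when $\pi = \pi^*$.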

Step 2: Express reward in terms of policy.

Taking the log and rearranging:

$$r(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_\text{ref}(y|x)} + \beta \log Z(x)$$

Step 3: Substitute into Bradley-Terry.

The preference probability under Bradley-Terry:

$$P(y_w \succ y_l) = \sigma(r(x,y_w) - r(x,y_l))$$

Substituting:

$$= \sigma\left(\beta \log \frac{\pi^*(y_w|x)}{\pi_\text{ref}(y_w|x)} + \beta \log Z(x) - \beta \log \frac{\pi^*(y_l|x)}{\pi_\text{ref}(y_l|x)} - \beta \log Z(x)\right)$$

The $\beta \log Z(x)$ terms cancel:

$$= \sigma\left(\beta \log \frac{\pi^*(y_w|x)}{\pi_\text{ref}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_\text{ref}(y_l|x)}\right)$$

Step 4: DPO loss.

Replace $\pi^*$ with the parameterized policy $\pi_\theta$ and maximize the log-likelihood:

$$\mathcal{L}_\text{DPO} = -\mathbb{E} \left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_\text{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_\text{ref}(y_l|x)}\right)\right]$$

Problem 3: RLHF Debugging

You've trained a reward model and run PPO for 1000 steps. The reward score keeps increasing, but human evaluators say the model's responses have gotten worse. What's happening?

Hint 1 - Direction

This is the classic reward hacking problem. The model has found a way to score high on the reward model without actually being better.

Full Answer + Rubric

Diagnosis: Reward hacking - the policy has learned to exploit patterns in the reward model that don't correspond to actual quality.

Common reward hacking patterns to check:

  1. Length gaming: Model outputs much longer responses. Many reward models have a length bias. Check average response length over training.

  2. Format gaming: Excessive use of bullet points, headers, or markdown that the reward model associates with quality.

  3. Sycophancy: Starting every response with "Great question!" or similar phrases the reward model was trained to prefer.

  4. KL divergence: Check if $D_\text{KL}(\pi_\theta \| \pi_\text{ref})$ has exploded. If so, the policy has diverged far from the SFT model into a region where the reward model is unreliable.

Fixes:

  1. Increase $\beta$: Stronger KL penalty to keep the policy closer to the reference model.
  2. Length normalization: Normalize reward by response length.
  3. Reward model ensemble: Use multiple reward models and take the conservative estimate (minimum).
  4. Early stopping: Use human evaluation as the stopping criterion, not reward score.
  5. DPO: Switch to DPO to avoid the reward model entirely.

Scoring:

  • Strong Hire: Immediately identifies reward hacking, gives specific examples, proposes multiple targeted fixes including KL analysis
  • Lean Hire: Identifies reward hacking in general but can't give specific patterns or fixes
  • No Hire: Suggests training longer, increasing learning rate, or collecting more preference data

Part 9 - The Papers in Context

InstructGPT (Ouyang et al., 2022)

  • Key contribution: First demonstration that RLHF at scale produces dramatically better assistants
  • Surprising finding: The 1.3B InstructGPT model was preferred over the 175B base GPT-3 by human evaluators
  • Data efficiency: Only ~13K demonstration prompts and ~33K reward-model prompts (each with several ranked responses) - tiny relative to pre-training data

Constitutional AI (Bai et al., 2022)

  • Key contribution: Showed that AI feedback can replace human feedback for harmlessness training
  • Insight: Self-critique and revision produce higher-quality training data than the model's unrevised responses
  • Limitation: Relies on the model already being capable enough to self-critique

DPO (Rafailov et al., 2023)

  • Key contribution: Mathematical proof that you can skip the reward model entirely
  • Impact: Dramatically simplified the alignment pipeline, made alignment accessible to smaller teams
  • Limitation: Offline training means no exploration - the quality ceiling depends on the preference data distribution

Interview Cheat Sheet

| Question Pattern | Framework | Key Phrases |
|---|---|---|
| "Explain RLHF" | 3 stages: SFT → RM → PPO, with KL penalty | "RLHF has three stages: supervised fine-tuning on demonstrations, reward model training on preference comparisons, and PPO optimization with a KL penalty to prevent reward hacking" |
| "What is reward hacking?" | Model exploits RM → high score but low quality → detection and mitigation | "The policy finds outputs that score high with the reward model but aren't actually preferred by humans - like generating excessively long or sycophantic responses" |
| "Explain DPO" | Same objective as RLHF → closed-form solution → reward cancels → direct loss | "DPO shows the KL-constrained reward maximization has a closed-form optimal policy. By rearranging, you can express the reward implicitly and train directly on preferences." |
| "RLHF vs DPO trade-offs?" | RLHF: better exploration, harder to train. DPO: simpler, offline, stable | "RLHF can explore new responses during training but requires 4 models and is unstable. DPO is simpler and more stable but limited to the preference data distribution." |
| "Constitutional AI?" | Self-critique → revision → RLAIF → scalable harmlessness | "Constitutional AI uses the model to critique and revise its own responses based on principles, then trains on AI preferences rather than human ones." |
| "Why KL penalty?" | Prevents divergence → reward model is only accurate near SFT distribution | "Without the KL penalty, the policy would drift into regions where the reward model is unreliable, leading to reward hacking." |

Spaced Repetition Checkpoints

  • Day 0: Read this page. Draw the 3-stage InstructGPT pipeline from memory. Write the DPO loss.
  • Day 3: Explain the Bradley-Terry model and why it's used for reward modeling. Derive the DPO loss from the RLHF objective.
  • Day 7: Compare RLHF, Constitutional AI, and DPO - give 3 advantages and 2 disadvantages of each.
  • Day 14: Explain reward hacking with 3 specific examples and mitigations.
  • Day 21: Solve all three practice problems from memory. Time yourself - 8-10 minutes each.


© 2026 EngineersOfAI. All rights reserved.