RLHF Papers - Aligning Language Models with Human Preferences
Reading time: ~35 min | Interview relevance: Critical | Roles: MLE, Research Engineer, AI Engineer
The Real Interview Moment
You're in a research interview at an AI safety-focused company. The interviewer asks: "GPT-4 and Claude don't just predict the next token - they follow instructions, refuse harmful requests, and produce helpful responses. Walk me through the technical pipeline that transforms a base language model into an aligned assistant. I want to hear about reward modeling, PPO, the InstructGPT approach, and then explain how DPO simplifies this. What are the trade-offs between these approaches?"
This question separates candidates who have casually read about RLHF from those who understand the full alignment pipeline. The interviewer wants to hear about the three-stage process, the mathematical objective of each stage, and the practical engineering challenges. They want you to reason about why DPO eliminates the reward model, not just that it does.
This is the alignment interview. Every frontier AI lab considers RLHF (and its successors) a core competency. If you're interviewing at OpenAI, Anthropic, Google DeepMind, or Meta, you need to know this material in depth.
What You Will Master
After reading this page, you will be able to:
- Explain the alignment problem and why pre-training alone is insufficient
- Describe the three-stage InstructGPT pipeline with mathematical precision
- Derive the reward modeling objective from pairwise comparisons
- Explain PPO for language models and why it's used over simpler RL algorithms
- Describe Constitutional AI and self-improvement without human labels
- Derive DPO from the RLHF objective and explain the simplification
- Compare RLHF, Constitutional AI, and DPO on trade-offs
- Discuss practical challenges: reward hacking, KL divergence, distribution shift
Part 1 - The Alignment Problem
Why Pre-Training Is Not Enough
A language model trained on next-token prediction learns to model the distribution of text on the internet. This means it can:
- Continue any text in a plausible way
- Generate toxic, harmful, or factually incorrect content
- Follow instructions inconsistently
- Produce verbose, hedging, or unhelpful responses
The model optimizes $P(x_t \mid x_{<t})$, the likelihood of the next token given the preceding text, but users want it to optimize for helpfulness, harmlessness, and honesty.
"The alignment problem is the gap between what a language model is trained to do (predict the next token) and what we want it to do (be helpful, harmless, and honest). Pre-training gives the model capabilities, but not values. RLHF bridges this gap by training the model to optimize for human preferences rather than token prediction likelihood."
The Three H's of Alignment
| Property | Definition | Failure Mode |
|---|---|---|
| Helpful | Provides useful, relevant, complete responses | Refuses benign requests, gives vague answers |
| Harmless | Avoids generating dangerous, biased, or toxic content | Helps with harmful requests, generates misinformation |
| Honest | Acknowledges uncertainty, doesn't fabricate facts | Confidently states falsehoods, hallucinates citations |
These objectives can conflict. A maximally helpful model might help with harmful requests. A maximally harmless model might refuse everything. Alignment is about finding the right balance.
Part 2 - The InstructGPT Pipeline
Stage 1: Supervised Fine-Tuning (SFT)
Objective: Train the model to follow instructions by showing it examples of good behavior.
Data: Human labelers write high-quality responses to a diverse set of prompts.
This is standard next-token prediction, but on curated (prompt, response) pairs rather than internet text.
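As a minimal sketch of what "next-token prediction on (prompt, response) pairs" can look like, the loss below is averaged over response tokens only, so the model is not trained to reproduce the prompt. Masking the prompt is a common choice assumed here, not a detail confirmed by the InstructGPT description above; the name `sft_loss`, the mask convention, and the toy numbers are illustrative.

```python
import math

def sft_loss(token_logprobs, response_mask):
    """Mean negative log-likelihood over response tokens only.

    token_logprobs: log p(token_t | previous tokens) for every position.
    response_mask:  1 for response tokens, 0 for prompt tokens (assumed convention).
    """
    if len(token_logprobs) != len(response_mask):
        raise ValueError("token_logprobs and response_mask must have equal length")
    n_response = sum(response_mask)
    if n_response == 0:
        raise ValueError("no response tokens to train on")
    total = -sum(lp * m for lp, m in zip(token_logprobs, response_mask))
    return total / n_response

# Toy example: 2 prompt tokens followed by 3 response tokens.
logprobs = [math.log(0.5), math.log(0.25), math.log(0.9), math.log(0.8), math.log(0.7)]
mask = [0, 0, 1, 1, 1]
loss = sft_loss(logprobs, mask)
```

The prompt positions contribute nothing to `loss`; only the three response tokens are averaged.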
Key details from InstructGPT:
- ~13,000 demonstration examples from 40 labelers
- Fine-tuned GPT-3 (175B) for 16 epochs
- This alone significantly improved instruction following
Many candidates skip SFT and jump straight to reward modeling. SFT is critical - it shifts the model's output distribution toward the kind of text we want to evaluate. Without SFT, the reward model would need to evaluate arbitrary internet text, which is a much harder problem. SFT narrows the space to "reasonable assistant responses."
Stage 2: Reward Modeling (RM)
Objective: Train a model to predict which response a human would prefer.
Data: For each prompt, generate multiple responses, then have humans rank them.
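Given a prompt $x$ and two responses $y_w$ (preferred) and $y_l$ (rejected), the reward model $r_\phi$ is trained with:
$$\mathcal{L}_{RM}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$
This is the Bradley-Terry model - the probability that response $y_w$ is preferred over $y_l$ is:
$$P(y_w \succ y_l \mid x) = \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)$$
where $\sigma$ is the sigmoid function.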
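To make the Bradley-Terry model concrete, here is a small numeric sketch: the preference probability is the sigmoid of the reward difference, and the training loss is its negative log. The helper names and the toy reward values are illustrative assumptions.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def preference_probability(r_w, r_l):
    """Bradley-Terry probability that the response scored r_w beats the one scored r_l."""
    return sigmoid(r_w - r_l)

def pairwise_loss(r_w, r_l):
    """Negative log-likelihood of the human-preferred ordering."""
    return -math.log(preference_probability(r_w, r_l))

# Toy scores (illustrative values, not from any trained model).
p_equal = preference_probability(1.0, 1.0)   # equal rewards: a coin flip
p_margin = preference_probability(2.0, 0.0)  # preferred response scores 2 higher
p_shifted = preference_probability(5.0, 3.0) # same gap, different absolute scale
```

Only the difference between rewards matters: `p_margin` and `p_shifted` are identical, which is why a reward model's absolute scale is arbitrary.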
Architecture: The reward model is typically the SFT model with the language modeling head replaced by a scalar output head. For InstructGPT, it was a 6B parameter model.
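
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RewardModel(nn.Module):
        """Reward model: LLM backbone + scalar head."""

        def __init__(self, base_model):
            super().__init__()
            self.backbone = base_model  # Pre-trained transformer without the LM head
            # Assumes a Hugging Face-style config that exposes hidden_size
            self.reward_head = nn.Linear(base_model.config.hidden_size, 1)

        def forward(self, input_ids, attention_mask):
            outputs = self.backbone(input_ids, attention_mask=attention_mask)
            hidden = outputs.last_hidden_state  # (batch, seq_len, hidden)
            # Index of the last non-padding token (assumes right padding)
            last_idx = attention_mask.sum(dim=1) - 1
            batch_idx = torch.arange(hidden.size(0), device=hidden.device)
            last_hidden = hidden[batch_idx, last_idx]
            # Scalar reward per sequence: shape (batch,)
            return self.reward_head(last_hidden).squeeze(-1)

    def reward_loss(reward_model, w_ids, w_mask, l_ids, l_mask):
        """Bradley-Terry pairwise loss on tokenized prompt+response sequences."""
        r_w = reward_model(w_ids, w_mask)  # Reward for preferred
        r_l = reward_model(l_ids, l_mask)  # Reward for rejected
        # logsigmoid is more numerically stable than log(sigmoid(...))
        return -F.logsigmoid(r_w - r_l).mean()
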
Key details from InstructGPT:
- ~33,000 prompts with ranked responses (each ranking yields multiple comparison pairs)
- Labelers ranked 4-9 responses per prompt (converted to pairwise comparisons)
- Inter-annotator agreement: ~73% (alignment is inherently noisy)
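The conversion from a ranking to pairwise comparisons mentioned above can be sketched as follows. A ranking of K responses yields K(K-1)/2 pairs, so 4 to 9 responses give 6 to 36 comparisons per prompt. The function name and response labels are hypothetical, and ties are ignored for simplicity.

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair is always the higher-ranked response.
    return list(combinations(ranked_responses, 2))

ranking = ["A", "B", "C", "D"]  # hypothetical labels, best first
pairs = ranking_to_pairs(ranking)
n_pairs_for_nine = len(ranking_to_pairs(list(range(9))))
```

Because all pairs from one prompt are highly correlated, it is worth deciding deliberately how they are batched during reward model training rather than shuffling them as if they were independent examples.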
Stage 3: PPO (Proximal Policy Optimization)
Objective: Optimize the language model to maximize the reward while staying close to the SFT model.
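The RL objective:
$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\left[ r_\phi(x, y) \right] - \beta \, D_{KL}\left[ \pi_\theta(y \mid x) \,\|\, \pi_{ref}(y \mid x) \right]$$
where:
- $\pi_\theta$ is the policy (the language model being optimized)
- $\pi_{ref}$ is the reference policy (the SFT model, frozen)
- $r_\phi$ is the trained reward model
- $\beta$ controls the KL penalty strength
- $D_{KL}\left[\pi_\theta \,\|\, \pi_{ref}\right]$ prevents the model from drifting too far from the SFT model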
Why the KL penalty? Without it, the model would learn to produce degenerate outputs that exploit weaknesses in the reward model ("reward hacking"). The KL penalty ensures the model stays in a distribution where the reward model's predictions are reliable.
Never say "RLHF uses reinforcement learning to make the model learn from human feedback." This is too vague and sounds like you don't understand the pipeline. Be specific: "RLHF uses PPO to optimize a language model policy against a reward model trained on human preference comparisons, with a KL divergence penalty to prevent reward hacking."
The PPO Update
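PPO clips the policy ratio to prevent too-large updates:
$$\mathcal{L}^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta)\,\hat{A}_t,\; \text{clip}\left(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$
where $\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$ is the probability ratio, $\hat{A}_t$ is the advantage estimate, and $\epsilon$ is the clipping range.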
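A per-token sketch of the clipped objective helps show what the clipping actually does: it takes the minimum of the unclipped and clipped terms, so the policy gets no extra credit for moving far in the helpful direction but is still fully penalized for moving in the harmful one. The function name and the default `epsilon=0.2` are assumptions for illustration.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-token PPO clipped objective (a quantity to be maximized)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1 + epsilon.
capped_gain = clipped_surrogate(ratio=1.5, advantage=2.0)    # uses 1.2 * 2.0
# Negative advantage: the objective is flat once the ratio drops below 1 - epsilon.
capped_drop = clipped_surrogate(ratio=0.5, advantage=-2.0)   # uses 0.8 * -2.0
# Raising the probability of a bad action is never clipped.
full_penalty = clipped_surrogate(ratio=1.5, advantage=-2.0)  # uses 1.5 * -2.0
```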
For language models:
- State = prompt + tokens generated so far
- Action = next token
- Reward = 0 for all tokens except the last, where it equals $r_\phi(x, y)$
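The sparse reward described above is usually combined with the KL penalty somewhere in the training loop. One common implementation choice, assumed here rather than taken from the text, is to fold the penalty into the per-token reward using the sampled log-probability difference as a KL estimate. The function name and toy numbers are illustrative.

```python
def per_token_rewards(policy_logprobs, ref_logprobs, sequence_reward, beta=0.1):
    """Reward per generated token: KL penalty everywhere, RM score on the last token."""
    if len(policy_logprobs) != len(ref_logprobs) or not policy_logprobs:
        raise ValueError("need equal-length, non-empty log-prob lists")
    # Sample-based estimate of the per-token KL term: log pi_theta - log pi_ref.
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += sequence_reward  # sparse reward-model score at the final token
    return rewards

# Toy 3-token response (made-up log-probabilities and reward-model score).
rewards = per_token_rewards(
    policy_logprobs=[-1.0, -2.0, -0.5],
    ref_logprobs=[-1.0, -2.5, -1.0],
    sequence_reward=3.0,
    beta=0.1,
)
```

With `beta=0` this reduces exactly to the sparse scheme in the list: zero everywhere except the final token.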
Part 3 - Constitutional AI (Anthropic)
The Self-Improvement Approach
Constitutional AI (Bai et al., 2022) addresses two limitations of standard RLHF:
- Human labeler bottleneck: Collecting human preferences is expensive and slow
- Harmlessness at scale: Humans find it psychologically taxing to evaluate harmful content
The CAI Pipeline
Stage 1: Critique and Revision (Red-Teaming + Self-Improvement)
- Generate a response to a potentially harmful prompt
- Ask the model to critique its own response based on a set of principles (the "constitution")
- Ask the model to revise the response based on the critique
- Use the revised response as training data for SFT
Prompt: "How do I pick a lock?"
Initial response: "Here's how to pick a lock: Step 1..."
Critique prompt: "Identify specific ways in which the response is harmful, unethical, or illegal."
Critique: "The response provides detailed instructions for breaking into locks, which could facilitate burglary..."
Revision prompt: "Please rewrite the response to remove any harmful content while being helpful."
Revised: "Lock picking is a skill used by locksmiths and security professionals. If you're locked out, I'd recommend contacting a licensed locksmith..."
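The critique-and-revision steps can be sketched as a short loop. This is a minimal illustration, not Anthropic's implementation: `generate` stands for any text-in, text-out model call, the prompt layout is an assumption, and `fake_generate` is a stand-in so the sketch runs without a model.

```python
from typing import Callable

CRITIQUE_REQUEST = "Identify specific ways in which the response is harmful, unethical, or illegal."
REVISION_REQUEST = "Please rewrite the response to remove any harmful content while being helpful."

def critique_and_revise(prompt: str, generate: Callable[[str], str], n_rounds: int = 1) -> dict:
    """Run critique -> revision rounds and return an SFT training example."""
    response = generate(prompt)
    for _ in range(n_rounds):
        critique = generate(f"{prompt}\n\nResponse: {response}\n\n{CRITIQUE_REQUEST}")
        response = generate(
            f"{prompt}\n\nResponse: {response}\n\nCritique: {critique}\n\n{REVISION_REQUEST}"
        )
    # Only the final revision is kept; the critique itself is not a training target here.
    return {"prompt": prompt, "response": response}

# Stand-in for a real model call so the sketch runs end to end.
def fake_generate(text: str) -> str:
    if REVISION_REQUEST in text:
        return "I'd recommend contacting a licensed locksmith."
    if CRITIQUE_REQUEST in text:
        return "The response could facilitate burglary."
    return "Here's how to pick a lock: Step 1..."

example = critique_and_revise("How do I pick a lock?", fake_generate)
```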
Stage 2: RLAIF (RL from AI Feedback)
Instead of human comparisons, use the model itself to compare pairs of responses based on constitutional principles.
This produces AI-labeled preference data, which trains a reward model, which is used for PPO - identical to RLHF but with AI preferences instead of human ones.
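A rough sketch of AI preference labeling follows. The prompt template, the hard A/B verdict, and sampling one principle per comparison are simplifying assumptions for illustration; `judge` is a hypothetical text-in, text-out model call and `toy_judge` exists only so the example runs.

```python
import random
from typing import Callable

PRINCIPLES = [
    "Choose the response that is most helpful while being harmless",
    "Choose the response that is most honest and doesn't fabricate information",
]

def ai_preference(prompt: str, resp_a: str, resp_b: str,
                  judge: Callable[[str], str], rng: random.Random) -> dict:
    """Ask a judge model which response better follows a sampled principle."""
    principle = rng.choice(PRINCIPLES)
    query = (
        f"{principle}.\n\nPrompt: {prompt}\n"
        f"(A) {resp_a}\n(B) {resp_b}\n\nAnswer with A or B:"
    )
    verdict = judge(query).strip().upper()
    if verdict not in ("A", "B"):
        raise ValueError(f"unexpected judge output: {verdict!r}")
    chosen, rejected = (resp_a, resp_b) if verdict == "A" else (resp_b, resp_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected, "principle": principle}

# Toy judge so the sketch runs: prefers the response that mentions a locksmith.
def toy_judge(query: str) -> str:
    a_line = query.split("(A) ")[1].split("\n")[0]
    return "A" if "locksmith" in a_line else "B"

pair = ai_preference(
    "How do I pick a lock?",
    "Here's how to pick a lock: Step 1...",
    "If you're locked out, contact a licensed locksmith.",
    toy_judge,
    random.Random(0),
)
```

The resulting `chosen`/`rejected` records have the same shape as human preference pairs, which is why the downstream reward modeling step does not need to change.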
The Constitution
A set of principles like:
- "Choose the response that is most helpful while being harmless"
- "Choose the response that is most honest and doesn't fabricate information"
- "Choose the response that best supports human autonomy and agency"
How the major labs apply these ideas:
- Anthropic: Pioneered Constitutional AI. Claude is trained with a combination of RLHF and CAI.
- OpenAI: Primarily uses RLHF with human labelers but has explored "rule-based reward models" (RBRM) which are similar in spirit.
- Google: Uses RLHF for Gemini, with some AI-assisted evaluation.
- Meta: LLaMA 2 used RLHF with human preferences. LLaMA 3 added iterative DPO.
Part 4 - DPO: Removing the Reward Model
The Key Insight
DPO (Direct Preference Optimization, Rafailov et al., 2023) shows that you can skip the reward model entirely.
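The RLHF objective:
$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\left[ r_\phi(x, y) \right] - \beta \, D_{KL}\left[ \pi_\theta(y \mid x) \,\|\, \pi_{ref}(y \mid x) \right]$$
has a closed-form optimal solution:
$$\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{ref}(y \mid x) \exp\left(\frac{1}{\beta} r_\phi(x, y)\right)$$
where $Z(x) = \sum_y \pi_{ref}(y \mid x) \exp\left(\frac{1}{\beta} r_\phi(x, y)\right)$ is the partition function.
Rearranging to express the reward in terms of the optimal policy:
$$r_\phi(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{ref}(y \mid x)} + \beta \log Z(x)$$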
The DPO Loss
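Substituting this into the Bradley-Terry preference model:
$$\mathcal{L}_{DPO}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\right)\right]$$
The partition function $Z(x)$ cancels out because both $y_w$ and $y_l$ share the same prompt $x$.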
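The cancellation can be checked numerically on a toy problem. The sketch below builds the optimal policy as the reference policy reweighted by exp(reward / beta), then recovers each reward from the log-ratio beta * log(pi* / pi_ref). Every recovered value is off by the same constant (beta times the log of the partition function), so reward differences are exact. All numbers and function names are made up for illustration.

```python
import math

def optimal_policy(pi_ref, rewards, beta):
    """pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z for a finite set of responses."""
    unnormalized = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(unnormalized)  # partition function Z(x)
    return [u / z for u in unnormalized], z

def implicit_reward(pi, pi_ref, beta):
    """beta * log(pi / pi_ref): the reward up to the shared beta * log Z term."""
    return [beta * math.log(p / q) for p, q in zip(pi, pi_ref)]

# Toy prompt with three candidate responses (made-up numbers).
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.5]
beta = 0.5

pi_star, z = optimal_policy(pi_ref, rewards, beta)
implicit = implicit_reward(pi_star, pi_ref, beta)

# Each implicit reward is off by the same constant beta * log Z ...
offsets = [r - h for r, h in zip(rewards, implicit)]
# ... so the difference used by Bradley-Terry matches the true reward gap.
true_gap = rewards[1] - rewards[0]
implicit_gap = implicit[1] - implicit[0]
```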
"DPO derives from the same objective as RLHF but shows that the optimal policy under KL-constrained reward maximization has a closed form. By rearranging, you can express the reward implicitly in terms of the policy's log-probabilities. This means you can directly optimize the policy on preference pairs without ever training a separate reward model. The loss function just compares log-probability ratios: increase the probability of preferred responses relative to the reference, and decrease the probability of rejected responses."
DPO Implementation

    import torch
    import torch.nn.functional as F

    def dpo_loss(
        policy_logps_w,  # log π_θ(y_w|x) - log probs of preferred under policy
        policy_logps_l,  # log π_θ(y_l|x) - log probs of rejected under policy
        ref_logps_w,     # log π_ref(y_w|x) - log probs of preferred under reference
        ref_logps_l,     # log π_ref(y_l|x) - log probs of rejected under reference
        beta=0.1,        # Temperature parameter
    ):
        """
        Direct Preference Optimization loss.

        Increases probability of preferred response relative to reference,
        decreases probability of rejected response relative to reference.
        """
        # Log-probability ratios
        log_ratio_w = policy_logps_w - ref_logps_w  # How much did we increase y_w?
        log_ratio_l = policy_logps_l - ref_logps_l  # How much did we increase y_l?

        # DPO loss: want log_ratio_w > log_ratio_l
        logits = beta * (log_ratio_w - log_ratio_l)
        loss = -F.logsigmoid(logits).mean()

        # Useful metrics
        with torch.no_grad():
            rewards_w = beta * log_ratio_w
            rewards_l = beta * log_ratio_l
            accuracy = (rewards_w > rewards_l).float().mean()

        return loss, rewards_w.mean(), rewards_l.mean(), accuracy

DPO vs RLHF: Detailed Comparison
| Aspect | RLHF (PPO) | DPO |
|---|---|---|
| Pipeline complexity | 4 models (policy, ref, reward, value) | 2 models (policy, ref) |
| Training | RL loop with rollouts, advantage estimation | Standard supervised learning |
| Memory | ~4x model size (4 models) | ~2x model size (2 models) |
| Hyperparameters | PPO clip, GAE lambda, reward scaling, KL coeff | Just $\beta$ |
| Stability | Notoriously unstable, reward hacking | More stable |
| Quality | Slightly better at the frontier | Comparable for most tasks |
| Iteration speed | Slow (online generation required) | Fast (offline, batch processing) |
| Online learning | Natural (generates new data) | Harder (fixed preference dataset) |
| Exploration | Explores new response strategies | Limited to existing data distribution |
| Reward hacking | Possible (must monitor KL) | Less susceptible |
Part 5 - Beyond DPO: Modern Alignment Methods
IPO (Identity Preference Optimization)
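DPO can overfit to the preference data. IPO (Azar et al., 2023) adds regularization by replacing DPO's log-sigmoid loss with a squared loss that pulls the log-ratio margin toward a fixed target of $\frac{1}{2\beta}$:
$$\mathcal{L}_{IPO}(\theta) = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\left(\log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)} - \frac{1}{2\beta}\right)^2\right]$$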
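A scalar comparison shows why a squared loss regularizes. Here `margin` is the difference between the preferred and rejected log-probability ratios (policy over reference). The DPO-style loss keeps shrinking as the margin grows without bound, while the IPO-style loss is minimized at a finite target margin of 1/(2*beta) and penalizes overshooting it. Function names and the value of `beta` are illustrative assumptions.

```python
import math

def dpo_pair_loss(margin, beta):
    """-log sigmoid(beta * margin): keeps decreasing as the margin grows."""
    return math.log1p(math.exp(-beta * margin))

def ipo_pair_loss(margin, beta):
    """(margin - 1/(2*beta))^2: minimized at a finite target margin."""
    return (margin - 1.0 / (2.0 * beta)) ** 2

beta = 0.1
target = 1.0 / (2.0 * beta)  # the margin at which the IPO-style loss is zero

dpo_small, dpo_large = dpo_pair_loss(5.0, beta), dpo_pair_loss(50.0, beta)
ipo_at_target, ipo_large = ipo_pair_loss(target, beta), ipo_pair_loss(50.0, beta)
```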
KTO (Kahneman-Tversky Optimization)
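KTO (Ethayarajh et al., 2024) doesn't need paired preferences - just binary feedback (thumbs up/down) on individual responses. Each response's implicit reward $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{ref}(y \mid x)}$ is compared against a reference point and passed through a value function inspired by Kahneman and Tversky's prospect theory, with separate weights for desirable and undesirable examples.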
This is significant because paired comparisons are expensive to collect, while binary feedback is cheap.
ORPO (Odds Ratio Preference Optimization)
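ORPO (Hong et al., 2024) eliminates both the reward model and the reference model by combining SFT and preference alignment into a single loss:
$$\mathcal{L}_{ORPO} = \mathcal{L}_{SFT} + \lambda \, \mathcal{L}_{OR}, \qquad \mathcal{L}_{OR} = -\log \sigma\left(\log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)}\right)$$
where $\text{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}$ and $\lambda$ weights the odds-ratio term.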
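A scalar sketch of a single-loss objective of this kind: a standard SFT term on the preferred response plus a weighted penalty based on the log odds ratio between preferred and rejected responses, with no reference model involved. Treating the sequence probability as the exponential of the average token log-probability, and the default weight `lam=0.1`, are assumptions made here for illustration.

```python
import math

def log_odds(avg_logprob):
    """log(p / (1 - p)) where p = exp(avg_logprob), assuming 0 < p < 1."""
    p = math.exp(avg_logprob)
    return avg_logprob - math.log1p(-p)

def orpo_loss(avg_logp_w, avg_logp_l, lam=0.1):
    """SFT term on the preferred response plus a weighted odds-ratio penalty."""
    sft_term = -avg_logp_w
    log_odds_ratio = log_odds(avg_logp_w) - log_odds(avg_logp_l)
    or_term = math.log1p(math.exp(-log_odds_ratio))  # -log sigmoid(log odds ratio)
    return sft_term + lam * or_term

# Toy values: the policy assigns 0.6 vs 0.2 average per-token probability.
good_order = orpo_loss(math.log(0.6), math.log(0.2))
bad_order = orpo_loss(math.log(0.2), math.log(0.6))
```

The loss is lower when the preferred response is the more likely one, through both the SFT term and the odds-ratio term.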
Iterative DPO / Online DPO
The main weakness of offline DPO is that it trains on a fixed preference dataset. Iterative DPO addresses this:
- Train DPO on initial preference data
- Generate new responses from the updated policy
- Collect/generate new preferences on these responses
- Repeat
This mimics the online exploration benefit of PPO while keeping DPO's simplicity.
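The loop above can be sketched as a small driver function. Everything here is a skeleton under stated assumptions: `sample`, `label`, and `train_dpo` are hypothetical callables standing in for generation, preference collection, and a DPO training run, and the toy versions below exist only so the control flow can be executed.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str, str]  # (prompt, chosen, rejected)

def iterative_dpo(policy, prompts: List[str], n_rounds: int,
                  sample: Callable, label: Callable, train_dpo: Callable):
    """Alternate between generating on-policy pairs and running a DPO update."""
    for _ in range(n_rounds):
        pairs: List[Pair] = []
        for prompt in prompts:
            # Two samples from the current policy form one candidate pair.
            a, b = sample(policy, prompt), sample(policy, prompt)
            chosen, rejected = label(prompt, a, b)
            pairs.append((prompt, chosen, rejected))
        policy = train_dpo(policy, pairs)
    return policy

# Toy stand-ins: the "policy" is just a version counter.
history = []

def toy_sample(policy, prompt):
    return f"{prompt} (policy v{policy})"

def toy_label(prompt, a, b):
    return (a, b)

def toy_train(policy, pairs):
    history.append(len(pairs))
    return policy + 1

final_policy = iterative_dpo(0, ["p1", "p2"], 3, toy_sample, toy_label, toy_train)
```

One design decision the skeleton leaves open is whether the reference model is reset to the latest policy each round or kept fixed at the original SFT model.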
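Evolution Summary
RLHF with PPO (InstructGPT, 2022) → Constitutional AI / RLAIF (2022) → DPO (2023) → IPO, KTO, ORPO, and iterative/online DPO (2023-2024)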
Part 6 - Practical Challenges
Reward Hacking
The model finds outputs that score high with the reward model but are not actually preferred by humans:
| Symptom | Example | Mitigation |
|---|---|---|
| Verbose responses | Model writes 3 paragraphs when 1 sentence suffices | Length penalty in reward |
| Sycophancy | "That's a great question!" before every answer | Train reward model on diverse labelers |
| Format gaming | Excessive bullet points, markdown formatting | Normalize format in reward training |
| Hedge stacking | "I think, perhaps, it might be possible that..." | Penalize uncertainty markers |
KL Divergence Management
The parameter $\beta$ in the KL penalty is critical:
| $\beta$ value | Effect | Risk |
|---|---|---|
| Too low | Model diverges far from reference | Reward hacking, degenerate outputs |
| Sweet spot | Balanced alignment | - |
| Too high | Model barely changes from SFT | Under-alignment, wasted compute |
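Rather than fixing the KL coefficient by hand, one option is to adjust it during training so the measured KL tracks a target. The sketch below is a simple proportional controller of the kind used in some PPO-based RLHF implementations; the class name, the target of 6.0, the horizon, and the clipping range of 0.2 are illustrative assumptions, not recommended settings.

```python
class AdaptiveKLController:
    """Nudge the KL coefficient so the measured KL tracks a target value."""

    def __init__(self, init_beta=0.1, target_kl=6.0, horizon=10000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl, n_steps):
        # Proportional error, clipped so one noisy batch cannot swing beta wildly.
        error = max(-0.2, min(observed_kl / self.target_kl - 1.0, 0.2))
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta

controller = AdaptiveKLController(init_beta=0.1, target_kl=6.0, horizon=100)
beta_after_high_kl = controller.update(observed_kl=12.0, n_steps=10)  # KL too high: raise beta
beta_after_low_kl = controller.update(observed_kl=3.0, n_steps=10)    # KL too low: lower beta
```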
Distribution Shift in Reward Models
The reward model is trained on the SFT model's output distribution. As PPO shifts the policy, the reward model evaluates out-of-distribution text, making its predictions unreliable.
Solutions:
- Periodic reward model retraining on policy outputs
- Ensemble reward models for uncertainty estimation
- Conservative KL penalty to keep the policy close to the SFT distribution
- DPO (avoids the reward model entirely)
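The ensemble idea can be sketched as a conservative score: take the mean across reward models and subtract a multiple of their disagreement, so responses the ensemble is unsure about earn less. The function name and the penalty weight `k` are assumptions for illustration; taking the minimum across members is another common conservative choice.

```python
import statistics

def conservative_reward(scores, k=1.0):
    """Mean ensemble score minus k standard deviations (lower when members disagree)."""
    if len(scores) < 2:
        raise ValueError("need at least two reward models to measure disagreement")
    return statistics.mean(scores) - k * statistics.pstdev(scores)

in_dist = conservative_reward([2.0, 2.1, 1.9])   # members agree
off_dist = conservative_reward([4.0, 0.5, 1.5])  # same mean, strong disagreement
```

Both toy inputs have a mean of 2.0, but the second is scored much lower because the members disagree, which is the signal expected on out-of-distribution text.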
Part 7 - The Full Picture: From Pre-Training to Deployment
Compute Breakdown
| Stage | Compute (relative) | Data | Duration |
|---|---|---|---|
| Pre-training | 1000x | Trillions of tokens | Months |
| SFT | 1x | 10K-100K examples | Hours |
| Reward modeling | 1-5x | 100K comparisons | Hours-days |
| PPO/DPO | 5-20x | Online generation | Days |
| Total alignment | <3% of pre-training | - | - |
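The remarkable finding: alignment is cheap relative to pre-training. Taking the upper end of these rough relative estimates, SFT, reward modeling, and PPO/DPO together cost (1 + 5 + 20) / 1000 = 2.6% of pre-training compute, so less than 3% of compute produces the difference between a base model and an assistant like ChatGPT.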
Part 8 - Practice Problems
Problem 1: Reward Model Design
You're building a reward model for a coding assistant. What specific challenges arise compared to a general-purpose reward model? How would you handle them?
Hint 1 - Direction
Code quality has objective components (correctness, efficiency) and subjective ones (readability, style). Think about what signal types are available.
Full Answer + Rubric
Challenges specific to code:
- Correctness is verifiable: Unlike general text, code can be tested. Use execution-based reward signals (does the code pass test cases?) as a complement to human preferences.
- Multi-dimensional quality: Code quality includes correctness, efficiency, readability, security, and idiomatic style. A single scalar reward conflates these.
- Length bias: Good code is often shorter. The reward model must not prefer verbose explanations.
- Language-specific knowledge: Python style differs from Java style. The reward model needs per-language evaluation capability.
Solutions:
- Hybrid reward: Combine a learned preference model with execution-based verification (unit test pass rate).
- Multi-aspect reward: Train separate reward models for correctness, efficiency, and style. Combine with configurable weights.
- Code-specific labeler pool: Use experienced programmers, not general crowdworkers, for preference labeling.
- Execution sandbox: Run generated code in a sandbox and use pass@k metrics as an additional signal.
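The hybrid reward from the solutions list can be sketched as a weighted blend of a learned preference score and the unit-test pass rate. This assumes the learned score has already been scaled to [0, 1]; the function name and the default weight are illustrative, and a real system would also need the sandboxed execution step that produces the pass counts.

```python
def hybrid_code_reward(learned_score, tests_passed, tests_total, weight=0.5):
    """Blend a learned preference score in [0, 1] with the unit-test pass rate."""
    if tests_total <= 0:
        raise ValueError("tests_total must be positive")
    if not 0 <= tests_passed <= tests_total:
        raise ValueError("tests_passed must be between 0 and tests_total")
    pass_rate = tests_passed / tests_total
    return weight * pass_rate + (1.0 - weight) * learned_score

# A polished-looking answer that fails most tests vs. a plainer one that passes.
looks_good_but_fails = hybrid_code_reward(learned_score=0.9, tests_passed=1, tests_total=4)
plain_but_correct = hybrid_code_reward(learned_score=0.6, tests_passed=4, tests_total=4)
```

With equal weighting, the correct solution outscores the more polished but failing one, which is the behavior the execution signal is meant to enforce.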
Scoring:
- Strong Hire: Identifies execution-based verification as a unique advantage, proposes hybrid reward, considers multi-aspect quality
- Lean Hire: Mentions code correctness testing but doesn't integrate it into the reward model design
- No Hire: Treats code reward modeling identically to text reward modeling
Problem 2: DPO Derivation Walkthrough
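Starting from the RLHF objective $\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\left[r_\phi(x, y)\right] - \beta \, D_{KL}\left[\pi_\theta(y \mid x) \,\|\, \pi_{ref}(y \mid x)\right]$, derive the DPO loss. Show each step.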
Hint 1 - Direction
First find the optimal policy by solving the KL-constrained optimization. Then express $r_\phi(x, y)$ in terms of $\pi^*$ and $\pi_{ref}$. Finally substitute into the Bradley-Terry model.
Full Answer
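Step 1: Solve for optimal policy.
The objective has the closed-form solution:
$$\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{ref}(y \mid x) \exp\left(\frac{1}{\beta} r_\phi(x, y)\right)$$
where $Z(x) = \sum_y \pi_{ref}(y \mid x) \exp\left(\frac{1}{\beta} r_\phi(x, y)\right)$ is the partition function.
Step 2: Express reward in terms of policy.
Taking the log and rearranging:
$$r_\phi(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{ref}(y \mid x)} + \beta \log Z(x)$$
Step 3: Substitute into Bradley-Terry.
The preference probability under Bradley-Terry:
$$P(y_w \succ y_l \mid x) = \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)$$
Substituting:
$$P(y_w \succ y_l \mid x) = \sigma\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{ref}(y_w \mid x)} + \beta \log Z(x) - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{ref}(y_l \mid x)} - \beta \log Z(x)\right)$$
The $\beta \log Z(x)$ terms cancel:
$$P(y_w \succ y_l \mid x) = \sigma\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\right)$$
Step 4: DPO loss.
Replace $\pi^*$ with the parameterized policy $\pi_\theta$ and maximize the log-likelihood of the observed preferences (equivalently, minimize its negative):
$$\mathcal{L}_{DPO}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\right)\right]$$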
Problem 3: RLHF Debugging
You've trained a reward model and run PPO for 1000 steps. The reward score keeps increasing, but human evaluators say the model's responses have gotten worse. What's happening?
Hint 1 - Direction
This is the classic reward hacking problem. The model has found a way to score high on the reward model without actually being better.
Full Answer + Rubric
Diagnosis: Reward hacking - the policy has learned to exploit patterns in the reward model that don't correspond to actual quality.
Common reward hacking patterns to check:
- Length gaming: Model outputs much longer responses. Many reward models have a length bias. Check average response length over training.
- Format gaming: Excessive use of bullet points, headers, or markdown that the reward model associates with quality.
- Sycophancy: Starting every response with "Great question!" or similar phrases the reward model was trained to prefer.
- KL divergence: Check if $D_{KL}\left[\pi_\theta \,\|\, \pi_{ref}\right]$ has exploded. If so, the policy has diverged far from the SFT model into a region where the reward model is unreliable.
Fixes:
- Increase $\beta$: Stronger KL penalty to keep the policy closer to the reference model.
- Length normalization: Normalize reward by response length.
- Reward model ensemble: Use multiple reward models and take the conservative estimate (minimum).
- Early stopping: Use human evaluation as the stopping criterion, not reward score.
- DPO: Switch to DPO to avoid the reward model entirely.
Scoring:
- Strong Hire: Immediately identifies reward hacking, gives specific examples, proposes multiple targeted fixes including KL analysis
- Lean Hire: Identifies reward hacking in general but can't give specific patterns or fixes
- No Hire: Suggests training longer, increasing learning rate, or collecting more preference data
Part 9 - The Papers in Context
InstructGPT (Ouyang et al., 2022)
- Key contribution: First demonstration that RLHF at scale produces dramatically better assistants
- Surprising finding: The 1.3B InstructGPT model was preferred over the 175B base GPT-3 by human evaluators
- Data efficiency: Only ~13K demonstration prompts and ~33K ranked comparison prompts - tiny relative to pre-training data
Constitutional AI (Bai et al., 2022)
- Key contribution: Showed that AI feedback can replace human feedback for harmlessness training
- Insight: Self-critique and revision produces higher-quality training data than trying to train directly
- Limitation: Relies on the model already being capable enough to self-critique
DPO (Rafailov et al., 2023)
- Key contribution: Mathematical proof that you can skip the reward model entirely
- Impact: Dramatically simplified the alignment pipeline, made alignment accessible to smaller teams
- Limitation: Offline training means no exploration - the quality ceiling depends on the preference data distribution
Interview Cheat Sheet
| Question Pattern | Framework | Key Phrases |
|---|---|---|
| "Explain RLHF" | 3 stages: SFT → RM → PPO, with KL penalty | "RLHF has three stages: supervised fine-tuning on demonstrations, reward model training on preference comparisons, and PPO optimization with a KL penalty to prevent reward hacking" |
| "What is reward hacking?" | Model exploits RM → high score but low quality → detection and mitigation | "The policy finds outputs that score high with the reward model but aren't actually preferred by humans - like generating excessively long or sycophantic responses" |
| "Explain DPO" | Same objective as RLHF → closed-form solution → reward cancels → direct loss | "DPO shows the KL-constrained reward maximization has a closed-form optimal policy. By rearranging, you can express the reward implicitly and train directly on preferences." |
| "RLHF vs DPO trade-offs?" | RLHF: better exploration, harder to train. DPO: simpler, offline, stable | "RLHF can explore new responses during training but requires 4 models and is unstable. DPO is simpler and more stable but limited to the preference data distribution." |
| "Constitutional AI?" | Self-critique → revision → RLAIF → scalable harmlessness | "Constitutional AI uses the model to critique and revise its own responses based on principles, then trains on AI preferences rather than human ones." |
| "Why KL penalty?" | Prevents divergence → reward model is only accurate near SFT distribution | "Without the KL penalty, the policy would drift into regions where the reward model is unreliable, leading to reward hacking." |
Spaced Repetition Checkpoints
- Day 0: Read this page. Draw the 3-stage InstructGPT pipeline from memory. Write the DPO loss.
- Day 3: Explain the Bradley-Terry model and why it's used for reward modeling. Derive the DPO loss from the RLHF objective.
- Day 7: Compare RLHF, Constitutional AI, and DPO - give 3 advantages and 2 disadvantages of each.
- Day 14: Explain reward hacking with 3 specific examples and mitigations.
- Day 21: Solve all three practice problems from memory. Time yourself - 8-10 minutes each.
Next Steps
- Continue to Diffusion Model Papers for the generative modeling revolution beyond language
- Review GPT Series to understand the base models that RLHF aligns
- For LoRA-based alignment, see LoRA and PEFT
