RLHF and Alignment
Alignment is the field's answer to a deceptively simple question: how do you make a language model do what humans actually want? This chapter covers the full alignment pipeline - from collecting preference data, through reward modeling and PPO optimization, to modern alternatives like DPO and constitutional AI. If you are interviewing for any role that touches LLM training, safety, or product, expect at least one alignment question.
Why Alignment Matters
Pre-trained language models learn to predict the next token. That objective produces fluent text, but fluency is not the same as helpfulness, truthfulness, or safety. A model trained only on next-token prediction will happily:
- Generate toxic content if the prompt context makes it likely
- Hallucinate facts with high confidence
- Follow dangerous instructions (e.g., synthesize harmful substances)
- Produce verbose, hedging, or sycophantic responses
Alignment techniques close the gap between "predict likely text" and "produce text humans actually prefer."
Interviewers want you to articulate why pre-training alone is insufficient. The strongest candidates connect alignment to concrete failure modes - not just vague statements about "safety."
The RLHF Pipeline
The canonical RLHF pipeline, popularized by InstructGPT (Ouyang et al., 2022), has three stages:
Stage 1: Supervised Fine-Tuning (SFT)
Start with a pre-trained base model and fine-tune it on high-quality demonstration data - typically human-written responses to a diverse set of prompts.
Key details:
- The SFT dataset is usually 10K-100K examples (much smaller than pre-training data)
- Quality matters more than quantity - a small set of expert demonstrations outperforms a large set of noisy ones
- The SFT model serves as the reference policy in later stages
At OpenAI, SFT data was written by a team of ~40 contractors with detailed labeling guidelines. At Anthropic, the SFT stage uses a mix of human and AI-generated demonstrations (RLAIF-style). Smaller companies often start with open-source instruction datasets like Alpaca or OpenAssistant.
Stage 2: Reward Model Training
The reward model (RM) learns to predict which of two responses a human would prefer. It is the bridge between human judgment and automated optimization.
Data collection:
- Sample a prompt $x$ from the prompt distribution
- Generate $K$ responses from the SFT model (typically $K = 4$ to $K = 9$)
- Human annotators rank the responses (or provide pairwise comparisons)
- Convert rankings to pairwise preferences: $(x, y_w, y_l)$, where $y_w$ is preferred over $y_l$
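The last step of the list above, turning one ranking into pairwise preferences, can be sketched as follows. This is a minimal illustration, assuming each ranking is a best-first list with no ties; the function name `ranking_to_pairs` and the triple layout are hypothetical, not taken from any particular codebase.

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_responses):
    """Turn one ranking (best first, no ties) into (prompt, chosen, rejected) triples."""
    pairs = []
    # combinations() preserves input order, so `better` always outranks `worse`.
    for better, worse in combinations(ranked_responses, 2):
        pairs.append((prompt, better, worse))
    return pairs

# Illustrative usage with placeholder response strings.
pairs = ranking_to_pairs("Explain KL divergence.", ["resp_a", "resp_b", "resp_c"])
for triple in pairs:
    print(triple)
```

A ranking of three responses yields three preference pairs, so a single ranking judgment produces several training examples for the reward model.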
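Training objective (Bradley-Terry model):

$$\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \right]$$

where $r_\phi(x, y)$ is the scalar reward assigned by the model to response $y$ given prompt $x$, and $\sigma$ is the sigmoid function.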
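As a rough illustration of the Bradley-Terry objective, the sketch below computes the pairwise loss from scalar rewards that are assumed to have been produced already by a reward model. It uses plain Python floats rather than tensors, and the helper names (`log_sigmoid`, `bradley_terry_loss`, `preference_probability`) are chosen here for illustration.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    # Branch on the sign of z so exp() never receives a large positive argument.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bradley_terry_loss(reward_pairs):
    """Mean pairwise loss over (reward_chosen, reward_rejected) pairs."""
    losses = [-log_sigmoid(r_w - r_l) for r_w, r_l in reward_pairs]
    return sum(losses) / len(losses)

def preference_probability(r_w, r_l):
    """Probability the model assigns to the chosen response being preferred."""
    return math.exp(log_sigmoid(r_w - r_l))

# Only the reward difference matters: shifting both rewards leaves the loss unchanged.
print(bradley_terry_loss([(2.0, 0.0)]), bradley_terry_loss([(12.0, 10.0)]))
```

Because the loss depends only on reward differences, the absolute scale of the reward is not pinned down by the preference data.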
Architecture: The reward model is typically initialized from the SFT model checkpoint, with the final unembedding layer replaced by a linear head that outputs a scalar reward.
"The reward model takes a prompt-response pair and outputs a scalar score. It's trained on human preference comparisons using a Bradley-Terry pairwise loss. The key insight is that we only need relative preferences - annotators don't need to assign absolute quality scores, just say which response is better."
Stage 3: PPO Optimization
Proximal Policy Optimization (PPO) is used to fine-tune the SFT model to maximize the reward model's score, subject to a KL divergence penalty that prevents the policy from drifting too far from the reference model.
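The RL objective:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big] - \beta \, D_{\text{KL}}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big) \right]$$

where:
- $\pi_\theta$ is the policy being optimized
- $\pi_{\text{ref}}$ is the frozen SFT model (reference policy)
- $r_\phi$ is the learned reward model
- $\beta$ controls the strength of the KL penalty
- $\mathcal{D}$ is the prompt distribution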
PPO implementation details for LLMs:
| Component | Role |
|---|---|
| Policy model | The LLM being optimized ($\pi_\theta$) |
| Reference model | Frozen copy of the SFT model ($\pi_{\text{ref}}$) |
| Reward model | Scores completions ($r_\phi$) |
| Value model | Estimates expected future reward (critic) |
This means RLHF with PPO requires four models in memory simultaneously - a major computational cost.
PPO-specific mechanics:
- Clipped surrogate objective: PPO clips the probability ratio $\pi_\theta / \pi_{\theta_{\text{old}}}$ to $[1 - \epsilon, 1 + \epsilon]$ to prevent destructively large updates
- Generalized Advantage Estimation (GAE): Used to compute per-token advantages with bias-variance tradeoff controlled by $\lambda$
- Mini-batch updates: Each batch of generations is used for multiple gradient steps
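Don't confuse the KL penalty in the RLHF objective with the PPO clipping mechanism. They serve different purposes: the KL penalty is measured against the frozen reference model and limits reward hacking over the whole run (policy-level), while PPO clipping is measured against the policy that generated the current batch and prevents destructively large individual updates (step-level). Many candidates conflate these.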
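The two PPO mechanics named above, the clipped surrogate and GAE, can be sketched in a few lines. This is a simplified, framework-free illustration on Python lists: it assumes per-token log-probabilities, rewards, and value estimates for one finished response are already available, and the defaults (`clip_eps=0.2`, `gamma=1.0`, `lam=0.95`) are illustrative choices rather than values taken from this chapter.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Per-token PPO clipped objective (to be maximized), averaged over tokens."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # new / old probability of the sampled token
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)  # take the pessimistic of the two terms
    return total / len(advantages)

def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for one finished episode.

    values[t] is the critic's estimate at step t; the value after the
    final step is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# A reward that arrives only on the last token is propagated backwards by GAE.
print(gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
```

With `lam=1.0` the advantage is the full Monte Carlo return minus the value estimate (low bias, high variance); with `lam=0.0` it collapses to the one-step TD error (higher bias, lower variance).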
The KL Divergence Constraint
The KL penalty is arguably the most important design choice in RLHF. Without it, the policy will exploit the reward model.
Why KL Matters
Intuition: The reward model is an imperfect proxy for human preferences. It was trained on a finite dataset and has blind spots. Without the KL constraint, the policy will find outputs that score highly according to the reward model but are clearly degenerate to a human - this is reward hacking (also called reward overoptimization), an instance of Goodhart's Law.
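Mathematical detail: The per-token KL divergence is:

$$D_{\text{KL}}\big( \pi_\theta(\cdot \mid x, y_{<t}) \,\|\, \pi_{\text{ref}}(\cdot \mid x, y_{<t}) \big) = \sum_{v \in \mathcal{V}} \pi_\theta(v \mid x, y_{<t}) \log \frac{\pi_\theta(v \mid x, y_{<t})}{\pi_{\text{ref}}(v \mid x, y_{<t})}$$

In practice, this is estimated as:

$$\hat{D}_{\text{KL}} \approx \log \pi_\theta(y_t \mid x, y_{<t}) - \log \pi_{\text{ref}}(y_t \mid x, y_{<t})$$

using only the sampled token $y_t$, which avoids a sum over the full vocabulary $\mathcal{V}$.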
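A minimal sketch of how a sample-based KL estimate is commonly folded into the per-token reward signal: every token receives a penalty proportional to the log-probability gap between policy and reference, and the reward model's sequence-level score is added on the final token. The function names and the convention of placing the score on the last token are assumptions made for illustration; implementations differ in the details.

```python
def per_token_kl_estimates(logp_policy, logp_ref):
    """Sample-based estimate per token: log-prob under policy minus log-prob under reference."""
    return [lp - lr for lp, lr in zip(logp_policy, logp_ref)]

def kl_shaped_rewards(logp_policy, logp_ref, rm_score, beta):
    """Per-token rewards: -beta * KL estimate on every token, RM score added on the last one."""
    rewards = [-beta * kl for kl in per_token_kl_estimates(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards

# Log-probabilities of the sampled tokens under each model (made-up numbers).
logp_policy = [-1.0, -2.0, -0.5]
logp_ref = [-1.5, -2.0, -1.5]
print(kl_shaped_rewards(logp_policy, logp_ref, rm_score=3.0, beta=0.1))
```

Tokens where the policy has become more confident than the reference are penalized, so the policy pays a cost for every step it drifts away from the SFT model.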
Choosing $\beta$
- Too high $\beta$: Policy barely changes from SFT - alignment gains are minimal
- Too low $\beta$: Reward hacking - policy exploits RM weaknesses
- Typical range:
- Some implementations use adaptive KL: adjust $\beta$ dynamically to target a specific KL budget
A strong answer on KL explains three things: (1) why it is needed (reward model is imperfect), (2) what happens without it (reward hacking), and (3) how $\beta$ is tuned in practice (grid search or adaptive targeting).
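One way adaptive targeting can work is a simple proportional controller: raise the penalty coefficient when the observed KL exceeds the target, and lower it when the KL falls below. The sketch below is one such scheme under stated assumptions; the class name, the error clipping at 0.2, and the `horizon` parameter are illustrative choices, not a description of any specific library's API.

```python
class AdaptiveKLController:
    """Proportional controller that nudges the KL coefficient toward a target KL."""

    def __init__(self, init_beta, target_kl, horizon):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon  # larger horizon means slower adjustments

    def update(self, observed_kl, n_steps):
        # Relative error, clipped so one noisy batch cannot swing the coefficient wildly.
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta

controller = AdaptiveKLController(init_beta=0.1, target_kl=6.0, horizon=100)
print(controller.update(observed_kl=12.0, n_steps=10))  # KL above target: coefficient grows
print(controller.update(observed_kl=3.0, n_steps=10))   # KL below target: coefficient shrinks
```

Compared with a grid search over fixed values, this trades one hyperparameter (the coefficient) for another (the target KL), which is often easier to reason about because it is expressed in the same units as the drift being controlled.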
Reward Hacking and Mitigations
Reward hacking is when the optimized policy achieves high reward scores without actually improving response quality. It is one of the most studied failure modes in alignment.
Common Reward Hacking Patterns
| Pattern | Description | Example |
|---|---|---|
| Length gaming | Reward model gives higher scores to longer responses | Model produces verbose, padded responses with unnecessary caveats |
| Sycophancy | RM rewards agreement with user | Model agrees with factually wrong premises |
| Repetition exploitation | Certain phrases consistently score high | Model repeats "I hope that helps!" or "Great question!" |
| Format gaming | RM prefers certain formats | Model always produces numbered lists regardless of context |
| Hedging | RM rewards cautious language | Model adds excessive disclaimers to every response |
Mitigations
- KL divergence constraint - The primary defense (discussed above)
- Reward model ensembles - Train multiple RMs and use the conservative estimate (e.g., minimum or lower confidence bound): $r_{\text{ens}}(x, y) = \min_i r_{\phi_i}(x, y)$
- Length penalties - Explicitly penalize the reward for response length: $r_{\text{adj}}(x, y) = r_\phi(x, y) - \alpha \, |y|$, where $|y|$ is the response length in tokens and $\alpha > 0$ sets the penalty strength
- Reward model retraining - Periodically retrain the RM on outputs from the current policy (iterated RLHF)
- Process-based reward models - Reward each reasoning step, not just the final answer. This makes gaming harder since every step must look reasonable.
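Two of the mitigations above, conservative ensemble estimates and length penalties, are simple enough to sketch directly. The code assumes each ensemble member has already scored the response; the function names, the `k` multiplier on the standard deviation, and the `alpha` and `free_tokens` parameters are hypothetical knobs introduced for illustration.

```python
import statistics

def conservative_reward(scores, k=1.0, mode="lcb"):
    """Combine ensemble scores pessimistically.

    mode="min": take the lowest score in the ensemble.
    mode="lcb": mean minus k population standard deviations (lower confidence bound).
    """
    if mode == "min":
        return min(scores)
    return statistics.mean(scores) - k * statistics.pstdev(scores)

def length_penalized_reward(reward, num_tokens, alpha=0.001, free_tokens=0):
    """Subtract alpha per token beyond an allowance of `free_tokens`."""
    excess = max(0, num_tokens - free_tokens)
    return reward - alpha * excess

scores = [1.0, 2.0, 3.0]  # one score per reward model in the ensemble
print(conservative_reward(scores, mode="min"))
print(conservative_reward(scores, mode="lcb"))
print(length_penalized_reward(1.0, num_tokens=500))
```

Both adjustments make the reward harder to exploit: disagreement between ensemble members lowers the score, and padding a response costs reward instead of earning it.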
Saying "just increase the reward model's accuracy" without acknowledging Goodhart's Law signals a fundamental misunderstanding of alignment. The issue is not that the RM is inaccurate - it is that any learned proxy will be exploited by a sufficiently powerful optimizer.
Preference Data Collection
The quality of RLHF depends entirely on the quality of preference data. This section covers practical considerations that interviewers - especially at AI labs - care deeply about.
Annotation Guidelines
What makes a good response? Annotators need clear criteria. InstructGPT used a hierarchy:
- Helpfulness - Does the response actually answer the question?
- Truthfulness - Are claims factually correct?
- Harmlessness - Does the response avoid generating harmful content?
When criteria conflict (e.g., a user asks "how do I pick a lock?"), the guidelines must specify priority ordering. This is where alignment philosophy meets engineering.
Pairwise vs. Rating vs. Ranking
| Method | Pros | Cons |
|---|---|---|
| Pairwise comparison | Simplest for annotators, highest agreement | $\binom{K}{2}$ comparisons for $K$ responses |
| Likert rating (1-5) | Efficient, one judgment per response | Low inter-annotator agreement, scale anchoring varies |
| Full ranking | Maximum information per annotation | Cognitively demanding for more than 4-5 responses |
Best practice: Use pairwise comparisons for the core dataset. Derive rankings from pairwise results using the Bradley-Terry model or Elo rating. InstructGPT generated $K = 4$ to $9$ responses and collected $\binom{K}{2}$ pairwise comparisons per prompt.
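To make the cost of pairwise comparison concrete, the short sketch below counts the distinct pairs for a few illustrative numbers of responses per prompt. The function name is chosen here for illustration.

```python
import math

def num_pairwise_comparisons(k):
    """Number of distinct unordered pairs among k responses (k choose 2)."""
    return math.comb(k, 2)

for k in (2, 4, 9):
    print(k, num_pairwise_comparisons(k))
```

The count grows quadratically, which is why asking an annotator to rank a handful of responses once is usually cheaper than asking for every pair separately.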
Inter-Annotator Agreement
Human preferences are noisy. Typical inter-annotator agreement rates:
- Easy comparisons (one response is clearly better): ~85-95% agreement
- Hard comparisons (both responses are similar quality): ~55-65% agreement
- Overall: ~72-77% agreement (reported by InstructGPT)
Candidates sometimes assume preference data is "ground truth." It is not - human preferences are stochastic, context-dependent, and vary across annotators. Strong candidates acknowledge this noise and discuss how it affects RM training (e.g., label smoothing, filtering low-confidence pairs).
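A small sketch of how annotator noise might be measured and handled: it computes the raw pairwise agreement rate across annotators and keeps only comparisons with a clear majority. It assumes every comparison was labeled "A" or "B" by at least two annotators; the function names and the 0.75 majority threshold are illustrative assumptions.

```python
from itertools import combinations

def pairwise_agreement(labels_per_item):
    """Fraction of annotator pairs that gave the same label, pooled over all items."""
    agree = 0
    total = 0
    for labels in labels_per_item:
        for a, b in combinations(labels, 2):
            agree += int(a == b)
            total += 1
    return agree / total

def filter_confident(labels_per_item, min_majority=0.75):
    """Indices of items where the majority label reaches the threshold."""
    kept = []
    for idx, labels in enumerate(labels_per_item):
        top = max(labels.count("A"), labels.count("B"))
        if top / len(labels) >= min_majority:
            kept.append(idx)
    return kept

# Each inner list holds the annotators' choices for one comparison.
labels = [["A", "A", "A"], ["A", "B", "A"], ["B", "B", "A", "A"]]
print(pairwise_agreement(labels))
print(filter_confident(labels))
```

Filtering trades dataset size for label quality; an alternative is to keep the contested pairs and soften their targets, which preserves the information that the two responses are close in quality.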
Scaling Laws for Preference Data
More preference data helps, but with diminishing returns:
- 10K-50K comparisons: Meaningful reward model
- 50K-200K comparisons: Strong reward model
- 200K+ comparisons: Marginal gains; quality and diversity matter more
Direct Preference Optimization (DPO)
DPO (Rafailov et al., 2023) eliminates the need for a separate reward model and RL loop entirely. It is one of the most important developments in post-RLHF alignment.
Core Insight
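The key mathematical insight: the optimal policy under the RLHF objective (reward maximization + KL constraint) has a closed-form solution:

$$\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y \mid x) \exp\left( \frac{1}{\beta} r(x, y) \right)$$

where $Z(x)$ is the partition function that normalizes the distribution. Rearranging for the reward:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

Substituting this into the Bradley-Terry preference model and noting that $Z(x)$ cancels in pairwise comparisons, we get the DPO loss:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$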
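The DPO loss for a single preference pair can be sketched from four numbers: the summed log-probability of the chosen and rejected responses under the policy and under the frozen reference model. The sketch below uses plain floats and assumes those log-probabilities were computed elsewhere; the function and argument names and the default `beta=0.1` are illustrative.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair from sequence log-probabilities."""
    chosen_logratio = policy_logp_w - ref_logp_w      # policy vs. reference on chosen
    rejected_logratio = policy_logp_l - ref_logp_l    # policy vs. reference on rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)

# At initialization the policy equals the reference, so the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Raising the chosen response's probability relative to the reference lowers the loss.
print(dpo_loss(-10.0, -15.0, -12.0, -15.0))
```

Note that the loss depends only on how the policy has moved relative to the reference, not on the absolute log-probabilities, which is why the reference model must stay frozen during training.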
DPO vs. RLHF
| Dimension | RLHF (PPO) | DPO |
|---|---|---|
| Reward model | Required (separate training) | Implicit (derived from policy) |
| Models in memory | 4 (policy, ref, RM, value) | 2 (policy, ref) |
| Training stability | Sensitive to hyperparameters | More stable (standard supervised loss) |
| Compute cost | High (RL loop, generation) | Lower (no generation during training) |
| Reward hacking | Explicit RM can be exploited | No explicit RM to exploit, though the implicit reward can still overfit the preference data |
| Expressiveness | Can learn complex reward signals | Limited by Bradley-Terry assumption |
| Online data | Generates new responses during training | Uses fixed offline dataset |
DPO Variants
- IPO (Identity Preference Optimization): Addresses DPO's overfitting to preference margins by using a simpler loss that does not assume the Bradley-Terry model
- KTO (Kahneman-Tversky Optimization): Works with binary feedback (thumbs up/down) instead of pairwise comparisons - much easier to collect
- ORPO (Odds Ratio Preference Optimization): Combines SFT and preference optimization into a single stage
- SimPO: Simplifies DPO by removing the need for a reference model entirely
"DPO shows that you can skip the reward model and RL loop entirely. The math works out because the optimal RLHF policy has a closed form - you can rearrange the objective to get a supervised loss that directly increases the log-probability gap between preferred and dispreferred responses, scaled by beta."
RLAIF: Reinforcement Learning from AI Feedback
RLAIF (Bai et al., 2022; Lee et al., 2023) replaces human annotators with an AI model that provides preference judgments.
How RLAIF Works
- Generate response pairs for each prompt
- Ask a capable LLM (e.g., GPT-4, Claude) to judge which response is better
- Train a reward model on the AI-generated preferences
- Run PPO as in standard RLHF (or use DPO on AI preferences)
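The judging step above can be sketched with a stand-in judge. Because LLM judges can favor whichever response is shown first, this sketch queries the judge in both orders and drops pairs where the verdict flips. The `judge` callable is hypothetical (it stands in for a call to whatever model is used and returns `"first"` or `"second"`), and the two toy judges exist only to make the example runnable.

```python
def label_with_judge(prompt, response_1, response_2, judge):
    """Return a preference record, or None if the judge is inconsistent under swapping."""
    forward = judge(prompt, response_1, response_2)
    backward = judge(prompt, response_2, response_1)
    # Map both verdicts back to the original responses.
    forward_pick = response_1 if forward == "first" else response_2
    backward_pick = response_2 if backward == "first" else response_1
    if forward_pick != backward_pick:
        return None  # verdict depends on presentation order, so discard the pair
    rejected = response_2 if forward_pick == response_1 else response_1
    return {"prompt": prompt, "chosen": forward_pick, "rejected": rejected}

def longer_is_better(prompt, first, second):
    """Toy stand-in judge: prefers the longer response."""
    return "first" if len(first) >= len(second) else "second"

def always_first(prompt, first, second):
    """Toy judge with pure position bias."""
    return "first"

print(label_with_judge("p", "short", "much longer", longer_is_better))
print(label_with_judge("p", "short", "much longer", always_first))
```

The records that survive have the same `(prompt, chosen, rejected)` shape as human preference data, so they can feed either reward model training or DPO.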
Advantages
- Scale: AI feedback can be generated at 100-1000x the rate of human annotation
- Consistency: AI judges are more consistent (lower noise) than human annotators
- Cost: Orders of magnitude cheaper than human annotation
- Coverage: Can evaluate on specialized domains where human experts are scarce
Risks
- Capability ceiling: The AI judge cannot reliably evaluate responses that exceed its own capability
- Bias amplification: The AI judge's biases get baked into the reward model
- Homogenization: Models trained on AI feedback may converge to similar styles
- Sycophancy loops: If the AI judge has the same sycophantic tendencies, these get reinforced
Anthropic pioneered RLAIF with Constitutional AI. Google used RLAIF extensively for PaLM 2 and Gemini. OpenAI primarily uses human feedback but supplements with model-generated comparisons for scale. Meta's Llama models use a mix of human and AI feedback.
Constitutional AI (CAI)
Constitutional AI (Bai et al., 2022) is Anthropic's framework for alignment that uses a set of principles (a "constitution") to guide AI self-improvement.
The Two-Phase Process
Phase 1 - Critique and Revision (SL-CAI):
- Prompt the model to generate a potentially harmful response
- Ask the model to critique its own response based on a constitutional principle (e.g., "Is this response harmful?")
- Ask the model to revise the response to address the critique
- Fine-tune on the revised (improved) responses
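The critique-and-revision loop of Phase 1 can be sketched as a function over a generic text-generation callable. Everything here is an illustrative assumption: `generate` stands in for a model call, the prompt templates are invented for the example, and cycling through principles round by round is just one simple selection policy.

```python
def critique_and_revise(prompt, generate, principles, n_rounds=1):
    """Produce one SFT example by repeatedly critiquing and revising a draft."""
    response = generate(prompt)  # initial, possibly harmful, draft
    for i in range(n_rounds):
        principle = principles[i % len(principles)]
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # Only the final revision is kept as the fine-tuning target.
    return {"prompt": prompt, "completion": response}

def make_stub_model():
    """Toy stand-in for a model: records calls and returns numbered outputs."""
    calls = []
    def generate(text):
        calls.append(text)
        return f"output_{len(calls)}"
    return generate, calls

generate, calls = make_stub_model()
example = critique_and_revise("How do I pick a lock?", generate,
                              ["Is this response harmful?"], n_rounds=2)
print(example, len(calls))
```

Each round costs two model calls on top of the initial draft, so the number of rounds is a direct trade-off between revision quality and generation cost.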
Phase 2 - RL from AI Feedback (RL-CAI):
- Generate pairs of responses
- Ask the AI to judge which response better adheres to the constitution
- Train a reward model on these AI-generated preferences
- Optimize with PPO
Example Constitutional Principles
- "Choose the response that is least likely to be used for harm"
- "Choose the response that is most helpful while being honest"
- "Choose the response that is least toxic or offensive"
- "Choose the response that most respects individual autonomy"
Why It Matters
CAI reduces dependence on human annotators for safety-related preferences, scales better, and makes the alignment criteria explicit and auditable - you can read the constitution and understand what the model was trained to value.
Safety vs. Helpfulness Tradeoff
One of the central tensions in alignment is the tradeoff between making a model safe and making it helpful.
The Pareto Frontier
Over-refusal problem: An overly aligned model refuses benign requests because they superficially resemble dangerous ones. Examples:
- Refusing to explain how locks work (legitimate educational content)
- Refusing to write fiction with conflict (normal storytelling)
- Adding unnecessary safety disclaimers to cooking recipes
Under-alignment problem: An insufficiently aligned model provides harmful information freely:
- Detailed instructions for illegal activities
- Generating manipulative content
- Confirming user's false medical self-diagnoses
Practical Approaches to the Tradeoff
- Tiered safety: Apply different safety thresholds to different harm categories. Mild topics (e.g., alcohol) get lighter restrictions than severe topics (e.g., weapons).
- Context-aware refusal: Instead of categorical refusal, consider the likely use case. "How does a virus spread?" is a legitimate biology question, not a bioweapons inquiry.
- Calibrated responses: Provide information with appropriate caveats rather than refusing outright. "Here's how X works, but Y is illegal and Z is the legal alternative."
- System prompt steering: Allow deployers to adjust the safety-helpfulness tradeoff for their use case (e.g., a medical AI can discuss medications more freely than a general chatbot).
The best candidates discuss this tradeoff with nuance. Avoid taking extreme positions ("models should never refuse" or "safety always trumps helpfulness"). Interviewers want to see that you understand the complexity and can reason about specific cases.
RLHF vs. DPO vs. RLAIF Comparison
| Dimension | RLHF (PPO) | DPO | RLAIF |
|---|---|---|---|
| Feedback source | Human annotators | Human annotators | AI model |
| Training method | RL (PPO) | Supervised loss | RL or DPO |
| Reward model | Explicit, separate | Implicit | Explicit, separate |
| Compute cost | Very high | Moderate | High (but annotation is cheap) |
| Training stability | Low (RL is finicky) | High | Moderate |
| Data efficiency | Moderate | High | Low per label (AI labels carry the judge's biases but are abundant) |
| Scalability | Limited by human annotation | Limited by human annotation | Highly scalable |
| Reward hacking risk | High | Lower | Moderate |
| Best for | Frontier models with budget | Open-source, research | Scaling alignment cheaply |
"RLHF gives maximum control but is expensive and unstable. DPO is simpler, more stable, and cheaper - it is the default choice for most teams today. RLAIF solves the data bottleneck by using AI judges, but it introduces the risk of amplifying AI biases. In practice, most labs use a combination: AI feedback for scale, human feedback for calibration."
Practice Problems
Problem 1: Reward Model Design
You are designing a reward model for a customer service chatbot. The model should reward helpful, accurate, and polite responses. How would you structure the training data and loss function?
Hint 1 - Direction
Think about what makes customer service different from general chat. Consider domain-specific quality criteria and how to source comparison data.
Hint 2 - Insight
You need multi-dimensional quality criteria: accuracy (did the agent solve the problem?), tone (was it professional?), efficiency (did it resolve quickly?). Consider using real customer satisfaction data as a signal.
Hint 3 - Full Solution
Data collection:
- Source prompts from real customer conversations (anonymized)
- Generate several responses per prompt using the SFT model
- Have trained annotators rate responses on three axes: helpfulness (1-5), accuracy (binary), and tone (1-3)
- Construct pairwise preferences from the ratings
Loss function:
- Use the standard Bradley-Terry pairwise loss
- Add a length penalty to prevent verbose responses (customers want quick resolutions)
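- Consider a multi-objective reward: $r(x, y) = w_1 \, r_{\text{help}}(x, y) + w_2 \, r_{\text{acc}}(x, y) + w_3 \, r_{\text{tone}}(x, y)$, with non-negative weights $w_i$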
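A weighted multi-objective reward over the three annotation axes from the data collection step could look like the sketch below. The rescaling of each axis to the range 0 to 1 and the particular weights are assumptions made for illustration; in practice the weights would need to be tuned.

```python
def combined_reward(helpfulness, accurate, tone, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of three quality axes, each rescaled to [0, 1].

    helpfulness: integer rating from 1 to 5
    accurate:    bool
    tone:        integer rating from 1 to 3
    """
    h = (helpfulness - 1) / 4
    a = 1.0 if accurate else 0.0
    t = (tone - 1) / 2
    w_h, w_a, w_t = weights
    return w_h * h + w_a * a + w_t * t

def to_preference(resp_a, resp_b):
    """Build a (chosen, rejected) pair from two rated responses, or None on a tie."""
    score_a = combined_reward(*resp_a["ratings"])
    score_b = combined_reward(*resp_b["ratings"])
    if score_a == score_b:
        return None
    return (resp_a["text"], resp_b["text"]) if score_a > score_b else (resp_b["text"], resp_a["text"])

polite_but_wrong = {"text": "Sure! Refunds take 2 days.", "ratings": (4, False, 3)}
curt_but_right = {"text": "Refunds take 5-7 days.", "ratings": (4, True, 2)}
print(to_preference(polite_but_wrong, curt_but_right))
```

Making accuracy a separate, explicitly weighted term is one way to stop a pleasant tone from compensating for a wrong answer.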
Domain-specific considerations:
- Include "escalation detection" - the model should know when to hand off to a human
- Test for hallucination of policies (e.g., promising refunds the company does not offer)
- Evaluate on resolution rate, not just response quality
Scoring rubric:
| Grade | Criteria |
|---|---|
| Strong Hire | Discusses multi-dimensional reward, domain-specific failure modes (hallucinated policies), and evaluation beyond preference accuracy (e.g., resolution rate). Mentions calibration between annotators. |
| Lean Hire | Covers basic RM training, mentions pairwise loss, considers at least one domain-specific issue. |
| No Hire | Describes generic RM training without tailoring to the customer service domain. Does not consider domain-specific failure modes. |
Problem 2: DPO Troubleshooting
You fine-tuned a model with DPO and the loss converges, but the model's responses are worse than the SFT baseline. What went wrong?
Hint 1 - Direction
Think about the assumptions DPO makes and what data quality issues could cause the model to learn the wrong direction.
Hint 2 - Insight
DPO assumes the preference data is consistent with the Bradley-Terry model and that the reference model generated the responses. What happens if neither assumption holds?
Hint 3 - Full Solution
Likely causes:
- Preference data quality: If chosen/rejected labels are noisy or inconsistent, DPO will learn to increase probability of low-quality responses. Check inter-annotator agreement.
- Distribution mismatch: DPO works best when the response pairs were generated by (or near) the reference model. If you are using preference data from a different model, the implicit reward function may be miscalibrated.
- $\beta$ too low: The model diverged too far from the reference, overfitting to noise in the preferences. Try increasing $\beta$.
- $\beta$ too high: The model barely changed. Check the KL divergence - if it is near zero, increase learning rate or decrease $\beta$.
- Chosen responses are out-of-distribution: If the "preferred" responses are written by humans or a much more capable model, the policy may struggle to assign them high probability, leading to unstable gradients.
- Evaluation mismatch: The evaluation criteria may differ from the preference criteria. DPO optimized for what annotators preferred, but you are evaluating on something else (e.g., factual accuracy).
Debugging steps:
- Plot training loss curve - look for instability
- Measure KL divergence from reference - too high or too low?
- Sample outputs and compare to SFT baseline qualitatively
- Check if chosen responses are within the reference model's capability
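Some of these debugging steps can be made quantitative. The sketch below computes three diagnostics from sequence log-probabilities: how often the implicit reward ranks the chosen response higher, the average reward margin, and how far the chosen responses' log-probability has moved relative to the reference. The dictionary keys and function name are hypothetical, and the log-probabilities are assumed to have been computed separately.

```python
def dpo_diagnostics(batch, beta=0.1):
    """Summarize implicit-reward behavior on a batch of preference pairs.

    Each example is a dict of summed sequence log-probs with keys:
    policy_w, policy_l (policy on chosen/rejected) and ref_w, ref_l (reference).
    """
    margins = []
    chosen_shifts = []
    for ex in batch:
        reward_w = beta * (ex["policy_w"] - ex["ref_w"])  # implicit reward, chosen
        reward_l = beta * (ex["policy_l"] - ex["ref_l"])  # implicit reward, rejected
        margins.append(reward_w - reward_l)
        chosen_shifts.append(ex["policy_w"] - ex["ref_w"])
    n = len(batch)
    return {
        "reward_accuracy": sum(m > 0 for m in margins) / n,
        "mean_margin": sum(margins) / n,
        "mean_chosen_logratio": sum(chosen_shifts) / n,
    }

batch = [
    {"policy_w": -10.0, "policy_l": -20.0, "ref_w": -8.0, "ref_l": -12.0},
    {"policy_w": -9.0, "policy_l": -9.0, "ref_w": -9.0, "ref_l": -9.0},
]
print(dpo_diagnostics(batch))
```

A positive margin combined with a negative `mean_chosen_logratio` is worth investigating: the loss can fall because the rejected responses lose probability faster than the chosen ones, even while the chosen responses themselves become less likely than under the reference.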
Scoring rubric:
| Grade | Criteria |
|---|---|
| Strong Hire | Identifies 3+ plausible causes, proposes systematic debugging approach, understands the distribution mismatch issue. |
| Lean Hire | Identifies at least 2 causes including data quality. Mentions $\beta$ tuning. |
| No Hire | Cannot identify plausible causes beyond "bad data." Does not understand DPO's assumptions. |
Problem 3: Safety Alignment Strategy
Your company is launching a general-purpose chatbot. Design an alignment strategy that balances safety and helpfulness. You have a budget for 50K human preference annotations.
Hint 1 - Direction
Think about how to allocate your annotation budget across safety-critical and general helpfulness data. Consider what can be automated and what requires human judgment.
Hint 2 - Insight
A hybrid approach works best: use RLAIF for the bulk of general helpfulness training, reserve human annotations for safety-critical categories where AI judgment is unreliable. Consider constitutional AI for scalable safety training.
Hint 3 - Full Solution
Strategy:
- Annotation budget allocation:
  - 20K annotations for safety-critical comparisons (harmful content, misinformation, illegal advice)
  - 20K annotations for general helpfulness
  - 10K annotations for edge cases (ambiguous requests, dual-use knowledge)
- Supplement with RLAIF:
  - Use a strong AI judge (GPT-4 or Claude) to generate 200K+ additional preference pairs for general quality
  - Focus AI feedback on style, clarity, and completeness (domains where AI judges are reliable)
  - Reserve human annotation for safety judgments (where AI judges may have blind spots)
- Training pipeline:
  - Stage 1: SFT on high-quality demonstrations
  - Stage 2: DPO on the combined human + AI preference dataset (DPO for stability and lower compute)
  - Stage 3: Safety-specific DPO pass using only the 20K safety annotations with lower $\beta$ (a weaker KL constraint, so the safety preferences move the policy further)
- Evaluation:
  - Red-teaming with adversarial prompts
  - Helpfulness benchmarks (MT-Bench, AlpacaEval)
  - Safety benchmarks (ToxiGen, BBQ, custom harm categories)
  - Monitor over-refusal rate on benign prompts
- Post-deployment:
  - Collect user feedback (thumbs up/down) for ongoing preference tuning (binary feedback suits KTO-style methods)
  - Monitor for reward hacking patterns
  - Periodic human evaluation audits
Scoring rubric:
| Grade | Criteria |
|---|---|
| Strong Hire | Proposes a principled budget allocation, combines human and AI feedback appropriately, addresses both safety and helpfulness, includes evaluation and monitoring. |
| Lean Hire | Reasonable strategy with human annotation. Mentions safety vs. helpfulness tradeoff. |
| No Hire | Spends all budget on one dimension. No evaluation plan. Does not consider RLAIF for scaling. |
Interview Cheat Sheet
| Topic | Key Fact | Why It Matters |
|---|---|---|
| RLHF stages | SFT, then RM, then PPO | The canonical 3-stage pipeline |
| Reward model loss | Bradley-Terry pairwise sigmoid loss | Most common formulation |
| PPO memory cost | 4 models simultaneously | Major practical constraint |
| KL penalty | Prevents reward hacking | Goodhart's Law in action |
| DPO advantage | No RM, no RL, just supervised loss | Simpler and more stable |
| DPO loss | Log-prob ratio gap between preferred and dispreferred | Closed-form from RLHF objective |
| RLAIF | AI generates preferences instead of humans | Scales but risks bias amplification |
| Constitutional AI | Principles-based self-critique and revision | Anthropic's framework, makes criteria explicit |
| Reward hacking | Policy exploits RM weaknesses | Mitigate with KL, ensembles, length penalties |
| Safety tradeoff | Over-refusal vs. under-alignment | Context-aware, tiered approach works best |
| $\beta$ parameter | Controls KL constraint strength | Too high: no change. Too low: reward hacking. |
| DPO variants | IPO, KTO, ORPO, SimPO | Each relaxes different DPO assumptions |
Spaced Repetition Checkpoints
Use these prompts to test your recall at increasing intervals:
Day 0 (Today)
- What are the three stages of the RLHF pipeline?
- Write the reward model (Bradley-Terry) loss function from memory
- What is the role of the KL divergence constraint?
Day 3
- Explain the DPO loss function and its derivation in one paragraph
- Name three reward hacking patterns and one mitigation for each
- What are the four models required in PPO-based RLHF?
Day 7
- Compare RLHF, DPO, and RLAIF: when would you use each?
- Describe the two phases of Constitutional AI
- How would you allocate a 50K annotation budget between safety and helpfulness?
Day 14
- Design a complete alignment pipeline for a domain-specific chatbot
- Explain why reward hacking is fundamentally unavoidable (Goodhart's Law) and how to mitigate it
- Discuss the safety vs. helpfulness tradeoff with concrete examples
Day 21
- Teach the full RLHF-to-DPO derivation to someone unfamiliar with the topic
- Critique the limitations of current alignment techniques - what problems remain unsolved?
- Design an evaluation suite that measures both safety and helpfulness without over-indexing on either
