RLHF and Alignment

Alignment is the field's answer to a deceptively simple question: how do you make a language model do what humans actually want? This chapter covers the full alignment pipeline - from collecting preference data, through reward modeling and PPO optimization, to modern alternatives like DPO and constitutional AI. If you are interviewing for any role that touches LLM training, safety, or product, expect at least one alignment question.

Why Alignment Matters

Pre-trained language models learn to predict the next token. That objective produces fluent text, but fluency is not the same as helpfulness, truthfulness, or safety. A model trained only on next-token prediction will happily:

Generate toxic content if the prompt context makes it likely
Hallucinate facts with high confidence
Follow dangerous instructions (e.g., synthesize harmful substances)
Produce verbose, hedging, or sycophantic responses

Alignment techniques close the gap between "predict likely text" and "produce text humans actually prefer."

Interviewer's Perspective

Interviewers want you to articulate why pre-training alone is insufficient. The strongest candidates connect alignment to concrete failure modes - not just vague statements about "safety."

The RLHF Pipeline

The canonical RLHF pipeline, as described in InstructGPT (Ouyang et al., 2022), has three stages:

[Figure: RLHF pipeline]

Stage 1: Supervised Fine-Tuning (SFT)

Start with a pre-trained base model and fine-tune it on high-quality demonstration data - typically human-written responses to a diverse set of prompts.

Key details:

The SFT dataset is usually 10K-100K examples (much smaller than pre-training data)
Quality matters more than quantity - a small set of expert demonstrations outperforms a large set of noisy ones
The SFT model serves as the reference policy $\pi_{\text{ref}}$ in later stages

Company Variation

At OpenAI, InstructGPT's SFT data was written by a team of about 40 contractors following detailed labeling guidelines. In Anthropic's Constitutional AI work, the supervised stage fine-tunes on model-generated revisions rather than purely human-written demonstrations. Smaller companies often start with open-source instruction datasets like Alpaca or OpenAssistant.

Stage 2: Reward Model Training

The reward model (RM) learns to predict which of two responses a human would prefer. It is the bridge between human judgment and automated optimization.

Data collection:

Sample a prompt from the prompt distribution
Generate $K$ responses from the SFT model (typically $K = 4$ to $K = 9$ )
Human annotators rank the responses (or provide pairwise comparisons)
Convert rankings to pairwise preferences: $(x, y_w, y_l)$ where $y_w$ is preferred over $y_l$
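The ranking-to-pairs conversion in the last step can be sketched in a few lines. This is a minimal illustration, assuming each ranking arrives as a list ordered best-first; the function name `ranking_to_pairs` and the placeholder responses are hypothetical.

```python
from itertools import combinations


def ranking_to_pairs(prompt, ranked_responses):
    """Turn one ranking (best first) into (x, y_w, y_l) preference triples."""
    # combinations() preserves input order, so the first element of each
    # pair is always the higher-ranked (preferred) response.
    return [(prompt, y_w, y_l) for y_w, y_l in combinations(ranked_responses, 2)]


pairs = ranking_to_pairs(
    "Explain KL divergence.",
    ["resp_a", "resp_b", "resp_c", "resp_d"],  # ranked best to worst
)
print(len(pairs))  # K = 4 responses give C(4, 2) = 6 pairs
print(pairs[0])
```

Note that all pairs from one prompt are correlated, so it is worth keeping them in the same train or validation split.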

Training objective (Bradley-Terry model):

$$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right]$$

where $r_\theta(x, y)$ is the scalar reward assigned by the model to response $y$ given prompt $x$ , and $\sigma$ is the sigmoid function.
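To make the loss concrete, here is a small sketch that evaluates it on hand-picked scalar rewards. It assumes the reward model's scores are already computed; `bradley_terry_loss` and the example numbers are illustrative, not taken from any particular codebase.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bradley_terry_loss(reward_pairs):
    """Mean of -log sigmoid(r_w - r_l) over (r_w, r_l) score pairs."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in reward_pairs) / len(reward_pairs)


# Preferred response scored higher: small loss.
well_ordered = bradley_terry_loss([(2.0, -1.0), (1.5, 0.0)])
# Preferred response scored lower: large loss.
mis_ordered = bradley_terry_loss([(-1.0, 2.0), (0.0, 1.5)])
# Equal scores: the loss equals log(2).
tied = bradley_terry_loss([(0.3, 0.3)])
print(well_ordered, tied, mis_ordered)
```

Only the difference $r_\theta(x, y_w) - r_\theta(x, y_l)$ enters the loss, so adding a constant to every reward leaves it unchanged. This is why reward values are only meaningful relative to each other.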

Architecture: The reward model is typically initialized from the SFT model checkpoint, with the final unembedding layer replaced by a linear head that outputs a scalar reward.

60-Second Answer

"The reward model takes a prompt-response pair and outputs a scalar score. It's trained on human preference comparisons using a Bradley-Terry pairwise loss. The key insight is that we only need relative preferences - annotators don't need to assign absolute quality scores, just say which response is better."

Stage 3: PPO Optimization

Proximal Policy Optimization (PPO) is used to fine-tune the SFT model to maximize the reward model's score, subject to a KL divergence penalty that prevents the policy from drifting too far from the reference model.

The RL objective:

$$\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x, y) - \beta \, D_{\text{KL}}\left(\pi_\theta(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x)\right) \right]$$

where:

$\pi_\theta$ is the policy being optimized
$\pi_{\text{ref}}$ is the frozen SFT model (reference policy)
$r_\phi$ is the learned reward model
$\beta$ controls the strength of the KL penalty
$\mathcal{D}$ is the prompt distribution
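One common way to feed this objective to PPO is to fold the KL term into a per-token reward: every token is charged $\beta$ times its log-probability ratio, and the reward model's score is added on the final token. The sketch below assumes that convention, which is an implementation choice rather than something the objective itself dictates; `shaped_rewards` and the numbers are hypothetical.

```python
def shaped_rewards(rm_score, policy_logprobs, ref_logprobs, beta):
    """Per-token rewards: -beta * (log pi_theta - log pi_ref) on every token,
    plus the reward model's scalar score on the last token."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += rm_score
    return rewards


policy_lp = [-1.0, -0.5, -2.0]  # log pi_theta(y_t | x, y_<t) for a 3-token response
ref_lp = [-1.2, -0.9, -2.0]     # log pi_ref(y_t | x, y_<t) for the same tokens
rewards = shaped_rewards(rm_score=1.0, policy_logprobs=policy_lp, ref_logprobs=ref_lp, beta=0.1)
total = sum(rewards)            # 1.0 - 0.1 * (0.2 + 0.4 + 0.0) = 0.94
print(rewards, total)
```

Summing the shaped rewards recovers the sequence-level quantity inside the expectation, with the KL term replaced by its single-sample estimate.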

PPO implementation details for LLMs:

Component	Role
Policy model	The LLM being optimized ( $\pi_\theta$ )
Reference model	Frozen copy of the SFT model ( $\pi_{\text{ref}}$ )
Reward model	Scores completions ( $r_\phi$ )
Value model	Estimates expected future reward (critic)

This means RLHF with PPO requires four models in memory simultaneously - a major computational cost.

PPO-specific mechanics:

Clipped surrogate objective: PPO clips the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)}$ to $[1-\epsilon, 1+\epsilon]$ to prevent destructively large updates
Generalized Advantage Estimation (GAE): Used to compute per-token advantages with bias-variance tradeoff controlled by $\lambda$
Mini-batch updates: Each batch of generations is used for multiple gradient steps
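The first two mechanics are easy to get wrong on a whiteboard, so here is a minimal scalar sketch of both. It assumes a single finished episode with the value after the last token taken as 0; the function names and default hyperparameters are illustrative.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO objective for one token: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one finished episode.

    values[t] is the critic's estimate at step t; the value after the
    final step is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Ratio 1.5 with a positive advantage is clipped to 1.2.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))
# With a negative advantage the unclipped (more pessimistic) term is kept.
print(clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0))
# lam = 1 spreads the final reward to every step; lam = 0 uses only the one-step TD error.
print(gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], lam=1.0))
print(gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], lam=0.0))
```

The `min` makes the objective a pessimistic bound: the policy gets no extra credit for pushing the ratio outside the clip range in the direction the advantage favors.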

Common Trap

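Don't confuse the KL penalty in the RLHF objective with the PPO clipping mechanism. They serve different purposes: the KL penalty keeps the policy close to the frozen reference model $\pi_{\text{ref}}$ over the whole run, which limits reward hacking (policy-level), while PPO clipping keeps each update close to the previous iterate $\pi_{\text{old}}$, which prevents destructive optimization steps (update-level). Many candidates conflate these.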

The KL Divergence Constraint

The KL penalty is arguably the most important design choice in RLHF. Without it, the policy will exploit the reward model.

Why KL Matters

[Figure: KL divergence constraint]

Intuition: The reward model is an imperfect proxy for human preferences. It was trained on a finite dataset and has blind spots. Without the KL constraint, the policy will find outputs that score highly according to the reward model but are clearly degenerate to a human. This is reward hacking (also called reward overoptimization), an instance of Goodhart's Law.

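Mathematical detail: For an autoregressive policy, the sequence-level KL divergence decomposes into a sum of per-token KL terms, averaged over prefixes sampled from the policy:

$$D_{\text{KL}}\left(\pi_\theta(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x)\right) = \mathbb{E}_{y \sim \pi_\theta(\cdot|x)} \left[ \sum_t \sum_{v \in V} \pi_\theta(v|x, y_{< t}) \log \frac{\pi_\theta(v|x, y_{< t})}{\pi_{\text{ref}}(v|x, y_{< t})} \right]$$

In practice, this is estimated from a sampled response $y \sim \pi_\theta(\cdot|x)$, using only the log-probabilities of the tokens that were actually generated:

$$\hat{D}_{\text{KL}} = \sum_t \left[ \log \pi_\theta(y_t|x, y_{< t}) - \log \pi_{\text{ref}}(y_t|x, y_{< t}) \right]$$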
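A toy check helps show why the sampled log-ratio is a valid estimate. The sketch below compares the exact KL for a single token position with the average log-ratio over tokens sampled from the policy. The three-token vocabulary and both distributions are made up for illustration.

```python
import math
import random


def exact_kl(p, q):
    """KL(p || q) for two categorical distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sampled_kl(p, q, num_samples, seed=0):
    """Average of log p(v) - log q(v) over tokens v sampled from p."""
    rng = random.Random(seed)
    tokens = rng.choices(range(len(p)), weights=p, k=num_samples)
    return sum(math.log(p[v]) - math.log(q[v]) for v in tokens) / num_samples


policy = [0.7, 0.2, 0.1]     # pi_theta over a 3-token toy vocabulary
reference = [0.5, 0.3, 0.2]  # pi_ref over the same vocabulary

kl = exact_kl(policy, reference)
estimate = sampled_kl(policy, reference, num_samples=20000)
print(kl, estimate)
```

The estimator is unbiased when tokens are sampled from $\pi_\theta$, but individual terms can be negative even though the true KL never is, so per-sample values are noisy.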

Choosing $\beta$

Too high $\beta$ : Policy barely changes from SFT - alignment gains are minimal
Too low $\beta$ : Reward hacking - policy exploits RM weaknesses
Typical range: $\beta \in [0.01, 0.2]$
Some implementations use adaptive KL: adjust $\beta$ dynamically to target a specific KL budget
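Adaptive KL can be as simple as a proportional controller. The sketch below is one plausible form, not a specific library's implementation; `update_beta`, `step_size`, and `max_error` are hypothetical names and values.

```python
def update_beta(beta, observed_kl, target_kl, step_size=0.1, max_error=0.2):
    """Proportional controller: raise beta when the measured KL is above the
    target, lower it when the KL is below the target."""
    error = observed_kl / target_kl - 1.0
    error = max(-max_error, min(max_error, error))  # clip to avoid large jumps
    return beta * (1.0 + step_size * error)


beta = 0.1
print(update_beta(beta, observed_kl=12.0, target_kl=6.0))  # KL too high: beta increases
print(update_beta(beta, observed_kl=3.0, target_kl=6.0))   # KL too low: beta decreases
print(update_beta(beta, observed_kl=6.0, target_kl=6.0))   # on target: unchanged
```

Clipping the error keeps a single noisy batch from swinging $\beta$ sharply, which matters because the KL estimate itself has high variance.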

Interviewer's Perspective

A strong answer on KL explains three things: (1) why it is needed (reward model is imperfect), (2) what happens without it (reward hacking), and (3) how $\beta$ is tuned in practice (grid search or adaptive targeting).

Reward Hacking and Mitigations

Reward hacking is when the optimized policy achieves high reward scores without actually improving response quality. It is one of the most studied failure modes in alignment.

Common Reward Hacking Patterns

Pattern	Description	Example
Length gaming	Reward model gives higher scores to longer responses	Model produces verbose, padded responses with unnecessary caveats
Sycophancy	RM rewards agreement with user	Model agrees with factually wrong premises
Repetition exploitation	Certain phrases consistently score high	Model repeats "I hope that helps!" or "Great question!"
Format gaming	RM prefers certain formats	Model always produces numbered lists regardless of context
Hedging	RM rewards cautious language	Model adds excessive disclaimers to every response

Mitigations

KL divergence constraint - The primary defense (discussed above)
Reward model ensembles - Train multiple RMs and use the conservative estimate (e.g., minimum or lower confidence bound):

$$r_{\text{ensemble}}(x, y) = \min_{i \in [N]} r_{\phi_i}(x, y)$$

Length penalties - Explicitly penalize the reward for response length:

$$r_{\text{adjusted}}(x, y) = r_\phi(x, y) - \alpha \cdot \text{len}(y)$$

Reward model retraining - Periodically retrain the RM on outputs from the current policy (iterated RLHF)
Process-based reward models - Reward each reasoning step, not just the final answer. This makes gaming harder since every step must look reasonable.
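The ensemble and length-penalty formulas above translate directly into code. This is a minimal sketch assuming the individual reward model scores are already available; the value of `alpha` and the example scores are arbitrary.

```python
def conservative_reward(scores):
    """Ensemble reward: the minimum over the individual reward models' scores."""
    return min(scores)


def length_adjusted_reward(score, num_tokens, alpha=0.01):
    """Subtract alpha * len(y) from the raw reward."""
    return score - alpha * num_tokens


scores = [2.1, 1.8, 0.4]                 # three reward models disagree on one response
r = conservative_reward(scores)          # 0.4: one model is not fooled
adjusted = length_adjusted_reward(r, num_tokens=30)  # 0.4 - 0.01 * 30 = 0.1
print(r, adjusted)
```

Taking the minimum means a response only scores well if every reward model agrees, so an exploit has to fool all of them at once.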

Instant Rejection

Saying "just increase the reward model's accuracy" without acknowledging Goodhart's Law signals a fundamental misunderstanding of alignment. The issue is not just that the RM is inaccurate: any learned proxy will be exploited by a sufficiently powerful optimizer.

Preference Data Collection

The quality of RLHF depends entirely on the quality of preference data. This section covers practical considerations that interviewers - especially at AI labs - care deeply about.

Annotation Guidelines

What makes a good response? Annotators need clear criteria. InstructGPT's labeling guidelines centered on three criteria:

Helpfulness - Does the response actually answer the question?
Truthfulness - Are claims factually correct?
Harmlessness - Does the response avoid generating harmful content?

When criteria conflict (e.g., a user asks "how do I pick a lock?"), the guidelines must specify priority ordering. This is where alignment philosophy meets engineering.

Pairwise vs. Rating vs. Ranking

Method	Pros	Cons
Pairwise comparison	Simplest for annotators, highest agreement	$O(K^2)$ comparisons for $K$ responses
Likert rating (1-5)	Efficient, one judgment per response	Low inter-annotator agreement, scale anchoring varies
Full ranking	Maximum information per annotation	Cognitively demanding for more than 4-5 responses

Best practice: Use pairwise comparisons for the core dataset. Derive rankings from pairwise results using the Bradley-Terry model or Elo rating. InstructGPT had labelers rank $K \in \{4, ..., 9\}$ responses per prompt, which yields $\binom{K}{2}$ pairwise comparisons per prompt.
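Going the other way, from pairwise results back to a ranking, can be sketched by fitting Bradley-Terry strengths to a win-count matrix. The example below uses the classic minorization-maximization update. The win counts are invented, and the sketch assumes every response wins at least once.

```python
def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from a win-count matrix.

    wins[i][j] is how many times response i was preferred over response j.
    Strengths are normalized to sum to 1 after every iteration.
    """
    n = len(wins)
    strengths = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        norm = sum(updated)
        strengths = [s / norm for s in updated]
    return strengths


# Hypothetical counts: each pair of responses was compared 10 times.
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
strengths = fit_bradley_terry(wins)
ranking = sorted(range(len(wins)), key=lambda i: strengths[i], reverse=True)
print(strengths, ranking)
```

Under the fitted model, the probability that response `i` beats response `j` is `strengths[i] / (strengths[i] + strengths[j])`.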

Inter-Annotator Agreement

Human preferences are noisy. Typical inter-annotator agreement rates:

Easy comparisons (one response is clearly better): ~85-95% agreement
Hard comparisons (both responses are similar quality): ~55-65% agreement
Overall: ~72-77% agreement (reported by InstructGPT)

Common Trap

Candidates sometimes assume preference data is "ground truth." It is not - human preferences are stochastic, context-dependent, and vary across annotators. Strong candidates acknowledge this noise and discuss how it affects RM training (e.g., label smoothing, filtering low-confidence pairs).
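Label smoothing, mentioned above, has a compact form for the pairwise loss. The sketch assumes a single noise rate `epsilon`, the probability that an annotated label is flipped; both the name and the value are illustrative.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def smoothed_pairwise_loss(r_w, r_l, epsilon=0.1):
    """Bradley-Terry loss with label smoothing.

    epsilon is the assumed probability that the annotated label is flipped;
    epsilon = 0 recovers the standard loss.
    """
    margin = r_w - r_l
    return -(1.0 - epsilon) * log_sigmoid(margin) - epsilon * log_sigmoid(-margin)


moderate = smoothed_pairwise_loss(math.log(9.0), 0.0)  # sigmoid(margin) = 0.9 = 1 - epsilon
extreme = smoothed_pairwise_loss(10.0, 0.0)            # very confident reward model
print(moderate, extreme)
```

With smoothing, the loss is minimized where the predicted preference probability equals `1 - epsilon`, so the reward model is no longer pushed toward unbounded margins on pairs that may be mislabeled.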

Scaling Laws for Preference Data

More preference data helps, but with diminishing returns:

10K-50K comparisons: Meaningful reward model
50K-200K comparisons: Strong reward model
200K+ comparisons: Marginal gains; quality and diversity matter more

Direct Preference Optimization (DPO)

DPO (Rafailov et al., 2023) eliminates the need for a separate reward model and RL loop entirely. It is one of the most important developments in post-RLHF alignment.

Core Insight

The key mathematical insight: the optimal policy under the RLHF objective (reward maximization + KL constraint) has a closed-form solution:

$$\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r^*(x, y)\right)$$

Rearranging for the reward:

$$r^*(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$

Substituting this into the Bradley-Terry preference model and noting that $Z(x)$ cancels in pairwise comparisons, we get the DPO loss:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right]$$
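For a single preference triple, the loss reduces to a few arithmetic operations on sequence log-probabilities (the sum of token log-probs for each response). The sketch below assumes those four numbers are already computed; the argument names and example values are illustrative.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (x, y_w, y_l) triple, given sequence log-probabilities."""
    chosen_logratio = policy_logp_w - ref_logp_w      # log pi_theta(y_w|x) / pi_ref(y_w|x)
    rejected_logratio = policy_logp_l - ref_logp_l    # log pi_theta(y_l|x) / pi_ref(y_l|x)
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# At initialization the policy equals the reference, so the loss is log(2).
at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# After training, the policy favors y_w more than the reference does.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(at_init, improved)
```

Because the loss depends only on log-ratios against the reference, a response that was already likely under $\pi_{\text{ref}}$ earns no credit until the policy moves relative to it.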

DPO vs. RLHF

[Figure: DPO vs. RLHF]

Dimension	RLHF (PPO)	DPO
Reward model	Required (separate training)	Implicit (derived from policy)
Models in memory	4 (policy, ref, RM, value)	2 (policy, ref)
Training stability	Sensitive to hyperparameters	More stable (standard supervised loss)
Compute cost	High (RL loop, generation)	Lower (no generation during training)
Reward hacking	Explicit RM can be exploited	No explicit RM, but the policy can still overfit the offline preference data
Expressiveness	Can learn complex reward signals	Limited by Bradley-Terry assumption
Online data	Generates new responses during training	Uses fixed offline dataset

DPO Variants

IPO (Identity Preference Optimization): Addresses DPO's overfitting to preference margins by using a simpler loss that does not assume the Bradley-Terry model
KTO (Kahneman-Tversky Optimization): Works with binary feedback (thumbs up/down) instead of pairwise comparisons - much easier to collect
ORPO (Odds Ratio Preference Optimization): Combines SFT and preference optimization into a single stage
SimPO: Simplifies DPO by removing the need for a reference model entirely
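Two of these variants differ from DPO by only a line or two. The sketch below shows the IPO and SimPO losses as they are commonly stated; treat the exact forms and default hyperparameters as assumptions to verify against the original papers.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def ipo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """Squared error pulling the log-ratio gap toward 1 / (2 * tau)."""
    gap = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return (gap - 1.0 / (2.0 * tau)) ** 2


def simpo_loss(policy_logp_w, policy_logp_l, len_w, len_l, beta=2.0, gamma=0.5):
    """Reference-free loss on length-normalized log-probs with target margin gamma."""
    margin = beta * (policy_logp_w / len_w - policy_logp_l / len_l) - gamma
    return -log_sigmoid(margin)


# IPO is minimized at a finite gap (here 5.0) rather than an unbounded one.
print(ipo_loss(-5.0, -10.0, -5.0, -5.0, tau=0.1))
# SimPO needs no reference log-probs, only the policy's and the response lengths.
print(simpo_loss(-10.0, -20.0, len_w=10, len_l=20, gamma=0.0))
```

The contrast with DPO is visible in the signatures: IPO keeps the reference model but replaces the sigmoid with a bounded regression target, while SimPO drops the reference arguments entirely.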

60-Second Answer

"DPO shows that you can skip the reward model and RL loop entirely. The math works out because the optimal RLHF policy has a closed form - you can rearrange the objective to get a supervised loss that directly increases the gap between preferred and dispreferred responses, measured as log-probability ratios against the reference model and scaled by beta."

RLAIF: Reinforcement Learning from AI Feedback

RLAIF (Bai et al., 2022; Lee et al., 2023) replaces human annotators with an AI model that provides preference judgments.

How RLAIF Works

Generate response pairs for each prompt
Ask a capable LLM (e.g., GPT-4, Claude) to judge which response is better
Train a reward model on the AI-generated preferences
Run PPO as in standard RLHF (or use DPO on AI preferences)
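The labeling step (step 2) can be sketched with the judge abstracted as a callable. Everything here is a stand-in: `toy_judge` replaces a real LLM call with a word-overlap heuristic purely so the example runs offline, and the verdict format ("A" or "B") is an assumption.

```python
def label_with_ai_judge(judge, prompt, response_a, response_b):
    """Return an (x, y_w, y_l) triple using the judge's verdict ("A" or "B")."""
    verdict = judge(prompt, response_a, response_b)
    if verdict == "A":
        return (prompt, response_a, response_b)
    if verdict == "B":
        return (prompt, response_b, response_a)
    raise ValueError(f"unexpected verdict: {verdict!r}")


def toy_judge(prompt, response_a, response_b):
    """Stand-in for an LLM judge: prefers the response sharing more words with the prompt."""
    prompt_words = set(prompt.lower().split())

    def overlap(response):
        return len(prompt_words & set(response.lower().split()))

    return "A" if overlap(response_a) >= overlap(response_b) else "B"


triple = label_with_ai_judge(
    toy_judge,
    "why is the sky blue",
    "I don't know",
    "the sky is blue because of Rayleigh scattering",
)
print(triple)
```

The output has the same $(x, y_w, y_l)$ shape as human preference data, which is why the downstream reward model or DPO training code does not need to change.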

Advantages

Scale: AI feedback can be generated at 100-1000x the rate of human annotation
Consistency: AI judges are more consistent (lower noise) than human annotators
Cost: Orders of magnitude cheaper than human annotation
Coverage: Can evaluate on specialized domains where human experts are scarce

Risks

Capability ceiling: The AI judge cannot reliably evaluate responses that exceed its own capability
Bias amplification: The AI judge's biases get baked into the reward model
Homogenization: Models trained on AI feedback may converge to similar styles
Sycophancy loops: If the AI judge has the same sycophantic tendencies, these get reinforced

Company Variation

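Anthropic introduced RLAIF as part of Constitutional AI (Bai et al., 2022). Google researchers (Lee et al., 2023) studied RLAIF as a scalable substitute for human labels. Labs rarely publish full details of their production pipelines, so treat claims about which lab uses which mix as approximate; a common pattern is human feedback supplemented with model-generated comparisons for scale.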

Constitutional AI (CAI)

Constitutional AI (Bai et al., 2022) is Anthropic's framework for alignment that uses a set of principles (a "constitution") to guide AI self-improvement.

The Two-Phase Process

[Figure: Constitutional AI]

Phase 1 - Critique and Revision (SL-CAI):

Prompt the model with red-teaming prompts designed to elicit a potentially harmful response
Ask the model to critique its own response based on a constitutional principle (e.g., "Is this response harmful?")
Ask the model to revise the response to address the critique
Fine-tune on the revised (improved) responses
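The Phase 1 loop can be sketched with the model abstracted as a `generate` callable. The prompt templates below are hypothetical wording rather than the ones used in the paper, and `stub_generate` only exists so the example runs without a model.

```python
def critique_and_revise(generate, prompt, principles):
    """Run one critique-revision pass per principle; return (prompt, final_revision)."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return (prompt, response)


calls = []


def stub_generate(text):
    """Stand-in for a model call; records the input and returns a numbered output."""
    calls.append(text)
    return f"output_{len(calls)}"


sft_example = critique_and_revise(
    stub_generate,
    "some red-teaming prompt",
    ["Is this response harmful?", "Is this response honest?"],
)
print(len(calls), sft_example)
```

Only the original prompt and the final revision are kept as the SFT example; the intermediate critiques are scaffolding and are discarded.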

Phase 2 - RL from AI Feedback (RL-CAI):

Generate pairs of responses
Ask the AI to judge which response better adheres to the constitution
Train a reward model on these AI-generated preferences
Optimize with PPO

Example Constitutional Principles

"Choose the response that is least likely to be used for harm"
"Choose the response that is most helpful while being honest"
"Choose the response that is least toxic or offensive"
"Choose the response that most respects individual autonomy"

Why It Matters

CAI reduces dependence on human annotators for safety-related preferences, scales better, and makes the alignment criteria explicit and auditable - you can read the constitution and understand what the model was trained to value.

Safety vs. Helpfulness Tradeoff

One of the central tensions in alignment is the tradeoff between making a model safe and making it helpful.

The Pareto Frontier

[Figure: Safety-helpfulness tradeoff]

Over-refusal problem: An overly cautious model refuses benign requests because they superficially resemble dangerous ones. Examples:

Refusing to explain how locks work (legitimate educational content)
Refusing to write fiction with conflict (normal storytelling)
Adding unnecessary safety disclaimers to cooking recipes

Under-alignment problem: An insufficiently aligned model provides harmful information freely:

Detailed instructions for illegal activities
Generating manipulative content
Confirming user's false medical self-diagnoses

Practical Approaches to the Tradeoff

Tiered safety: Apply different safety thresholds to different harm categories. Mild topics (e.g., alcohol) get lighter restrictions than severe topics (e.g., weapons).
Context-aware refusal: Instead of categorical refusal, consider the likely use case. "How does a virus spread?" is a legitimate biology question, not a bioweapons inquiry.
Calibrated responses: Provide information with appropriate caveats rather than refusing outright. "Here's how X works, but Y is illegal and Z is the legal alternative."
System prompt steering: Allow deployers to adjust the safety-helpfulness tradeoff for their use case (e.g., a medical AI can discuss medications more freely than a general chatbot).
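Tiered thresholds, calibrated responses, and deployer overrides can be combined in a small policy function. This is a toy sketch: the categories, threshold values, and the idea of a scalar `risk_score` from some upstream classifier are all assumptions made for illustration.

```python
# Hypothetical refusal thresholds per harm category (lower means stricter).
REFUSAL_THRESHOLDS = {"alcohol": 0.9, "medical": 0.7, "weapons": 0.3}
DEFAULT_THRESHOLD = 0.5
CAVEAT_MARGIN = 0.2  # scores this close to the threshold get caveats instead of refusal


def decide(category, risk_score, thresholds=REFUSAL_THRESHOLDS):
    """Map a risk score in [0, 1] to 'answer', 'answer_with_caveats', or 'refuse'."""
    threshold = thresholds.get(category, DEFAULT_THRESHOLD)
    if risk_score >= threshold:
        return "refuse"
    if risk_score >= threshold - CAVEAT_MARGIN:
        return "answer_with_caveats"
    return "answer"


print(decide("weapons", 0.4))   # strict tier: refuse
print(decide("alcohol", 0.4))   # lenient tier: answer
print(decide("medical", 0.6))   # near the threshold: answer with caveats

# A deployer (e.g., a medical product) relaxes one tier without touching the others.
medical_deployment = {**REFUSAL_THRESHOLDS, "medical": 0.95}
print(decide("medical", 0.6, thresholds=medical_deployment))
```

Keeping the thresholds in data rather than in code makes the policy auditable and lets deployers adjust one tier at a time.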

Interviewer's Perspective

The best candidates discuss this tradeoff with nuance. Avoid taking extreme positions ("models should never refuse" or "safety always trumps helpfulness"). Interviewers want to see that you understand the complexity and can reason about specific cases.

RLHF vs. DPO vs. RLAIF Comparison

Dimension	RLHF (PPO)	DPO	RLAIF
Feedback source	Human annotators	Human annotators	AI model
Training method	RL (PPO)	Supervised loss	RL or DPO
Reward model	Explicit, separate	Implicit	Explicit, separate
Compute cost	Very high	Moderate	High (but annotation is cheap)
Training stability	Low (RL is finicky)	High	Moderate
Data efficiency	Moderate	High	Low per label (AI labels are abundant but can carry systematic bias)
Scalability	Limited by human annotation	Limited by human annotation	Highly scalable
Reward hacking risk	High	Lower	Moderate
Best for	Frontier models with budget	Open-source, research	Scaling alignment cheaply

60-Second Answer

"RLHF gives maximum control but is expensive and unstable. DPO is simpler, more stable, and cheaper - it is a common default for teams without a large RL infrastructure. RLAIF solves the data bottleneck by using AI judges, but it introduces the risk of amplifying AI biases. In practice, many labs combine them: AI feedback for scale, human feedback for calibration."

Practice Problems

Problem 1: Reward Model Design

You are designing a reward model for a customer service chatbot. The model should reward helpful, accurate, and polite responses. How would you structure the training data and loss function?

Hint 1 - Direction

Think about what makes customer service different from general chat. Consider domain-specific quality criteria and how to source comparison data.

Hint 2 - Insight

You need multi-dimensional quality criteria: accuracy (did the agent solve the problem?), tone (was it professional?), efficiency (did it resolve quickly?). Consider using real customer satisfaction data as a signal.

Hint 3 - Full Solution

Data collection:

Source prompts from real customer conversations (anonymized)
Generate $K=4$ responses per prompt using the SFT model
Have trained annotators rate responses on three axes: helpfulness (1-5), accuracy (binary), and tone (1-3)
Construct pairwise preferences from the ratings

Loss function:

Use the standard Bradley-Terry pairwise loss
Add a length penalty to prevent verbose responses (customers want quick resolutions)
Consider a multi-objective reward: $r(x, y) = w_1 \cdot r_{\text{helpful}} + w_2 \cdot r_{\text{accurate}} + w_3 \cdot r_{\text{tone}}$
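One way to connect the three rating axes to pairwise training data is to collapse them into a weighted score and keep only pairs with a clear gap. The weights and the `min_gap` filter below are hypothetical choices, shown only to make the data flow concrete.

```python
from itertools import combinations

# Hypothetical weights w_1, w_2, w_3 for helpfulness, accuracy, and tone.
WEIGHTS = {"helpful": 0.5, "accurate": 0.3, "tone": 0.2}


def combined_score(ratings, weights=WEIGHTS):
    """Weighted sum of per-axis ratings, each rescaled to [0, 1]."""
    helpful = (ratings["helpful"] - 1) / 4  # 1-5 scale
    accurate = float(ratings["accurate"])   # binary
    tone = (ratings["tone"] - 1) / 2        # 1-3 scale
    return (
        weights["helpful"] * helpful
        + weights["accurate"] * accurate
        + weights["tone"] * tone
    )


def pairs_from_ratings(prompt, rated_responses, min_gap=0.2):
    """Build (x, y_w, y_l) triples, skipping near-ties that would add label noise."""
    pairs = []
    for (resp_a, rat_a), (resp_b, rat_b) in combinations(rated_responses, 2):
        gap = combined_score(rat_a) - combined_score(rat_b)
        if abs(gap) < min_gap:
            continue
        pairs.append((prompt, resp_a, resp_b) if gap > 0 else (prompt, resp_b, resp_a))
    return pairs


rated = [
    ("A", {"helpful": 5, "accurate": True, "tone": 3}),   # score 1.0
    ("B", {"helpful": 4, "accurate": True, "tone": 3}),   # score 0.875
    ("C", {"helpful": 2, "accurate": False, "tone": 2}),  # score 0.225
]
pairs = pairs_from_ratings("Where is my order?", rated)
print(pairs)  # A vs. B is dropped as a near-tie
```

An alternative design is to train one reward head per axis and combine them at RL time, which keeps the weights tunable after annotation is finished.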

Domain-specific considerations:

Include "escalation detection" - the model should know when to hand off to a human
Test for hallucination of policies (e.g., promising refunds the company does not offer)
Evaluate on resolution rate, not just response quality

Scoring rubric:

Grade	Criteria
Strong Hire	Discusses multi-dimensional reward, domain-specific failure modes (hallucinated policies), and evaluation beyond preference accuracy (e.g., resolution rate). Mentions calibration between annotators.
Lean Hire	Covers basic RM training, mentions pairwise loss, considers at least one domain-specific issue.
No Hire	Describes generic RM training without tailoring to the customer service domain. Does not consider domain-specific failure modes.

Problem 2: DPO Troubleshooting

You fine-tuned a model with DPO and the loss converges, but the model's responses are worse than the SFT baseline. What went wrong?

Hint 1 - Direction

Think about the assumptions DPO makes and what data quality issues could cause the model to learn the wrong direction.

Hint 2 - Insight

DPO assumes the preference data is consistent with the Bradley-Terry model, and it works best when the responses come from (or near) the reference model's distribution. What happens if neither holds?

Hint 3 - Full Solution

Likely causes:

Preference data quality: If chosen/rejected labels are noisy or inconsistent, DPO will learn to increase probability of low-quality responses. Check inter-annotator agreement.
Distribution mismatch: DPO works best when the response pairs were generated by (or near) the reference model. If you are using preference data from a different model, the implicit reward function may be miscalibrated.
$\beta$ too low: The model diverged too far from the reference, overfitting to noise in the preferences. Try increasing $\beta$ .
$\beta$ too high: The model barely changed. Check the KL divergence - if it is near zero, increase learning rate or decrease $\beta$ .
Chosen responses are out-of-distribution: If the "preferred" responses are written by humans or a much more capable model, the policy may struggle to assign them high probability, leading to unstable gradients.
Evaluation mismatch: The evaluation criteria may differ from the preference criteria. DPO optimized for what annotators preferred, but you are evaluating on something else (e.g., factual accuracy).

Debugging steps:

Plot training loss curve - look for instability
Measure KL divergence from reference - too high or too low?
Sample outputs and compare to SFT baseline qualitatively
Check if chosen responses are within the reference model's capability
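Several of these checks can be read off the implicit rewards $\beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$ on a held-out batch. The sketch below assumes sequence log-probabilities are already computed; the dictionary keys and example numbers are hypothetical.

```python
def dpo_diagnostics(batch, beta=0.1):
    """Summarize implicit rewards on a batch of preference pairs.

    Each example holds sequence log-probs under the policy and the reference
    for the chosen (w) and rejected (l) responses.
    """
    chosen = [beta * (ex["policy_w"] - ex["ref_w"]) for ex in batch]
    rejected = [beta * (ex["policy_l"] - ex["ref_l"]) for ex in batch]
    n = len(batch)
    return {
        "chosen_reward": sum(chosen) / n,
        "rejected_reward": sum(rejected) / n,
        "margin": sum(c - r for c, r in zip(chosen, rejected)) / n,
        "accuracy": sum(c > r for c, r in zip(chosen, rejected)) / n,
    }


batch = [
    {"policy_w": -10.0, "ref_w": -12.0, "policy_l": -15.0, "ref_l": -13.0},
    {"policy_w": -20.0, "ref_w": -18.0, "policy_l": -25.0, "ref_l": -21.0},
]
stats = dpo_diagnostics(batch)
print(stats)
```

A growing margin alone is not proof of improvement. In the second example the policy made the chosen response less likely than the reference did, and the margin is positive only because the rejected response dropped further. Tracking `chosen_reward` separately surfaces this case.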

Scoring rubric:

Grade	Criteria
Strong Hire	Identifies 3+ plausible causes, proposes systematic debugging approach, understands the distribution mismatch issue.
Lean Hire	Identifies at least 2 causes including data quality. Mentions $\beta$ tuning.
No Hire	Cannot identify plausible causes beyond "bad data." Does not understand DPO's assumptions.

Problem 3: Safety Alignment Strategy

Your company is launching a general-purpose chatbot. Design an alignment strategy that balances safety and helpfulness. You have a budget for 50K human preference annotations.

Hint 1 - Direction

Think about how to allocate your annotation budget across safety-critical and general helpfulness data. Consider what can be automated and what requires human judgment.

Hint 2 - Insight

A hybrid approach works best: use RLAIF for the bulk of general helpfulness training, reserve human annotations for safety-critical categories where AI judgment is unreliable. Consider constitutional AI for scalable safety training.

Hint 3 - Full Solution

Strategy:

Annotation budget allocation:
- 20K annotations for safety-critical comparisons (harmful content, misinformation, illegal advice)
- 20K annotations for general helpfulness
- 10K annotations for edge cases (ambiguous requests, dual-use knowledge)
Supplement with RLAIF:
- Use a strong AI judge (GPT-4 or Claude) to generate 200K+ additional preference pairs for general quality
- Focus AI feedback on style, clarity, and completeness (domains where AI judges are reliable)
- Reserve human annotation for safety judgments (where AI judges may have blind spots)
Training pipeline:
- Stage 1: SFT on high-quality demonstrations
- Stage 2: DPO on the combined human + AI preference dataset (DPO for stability and lower compute)
- Stage 3: Safety-specific DPO pass using only the 20K safety annotations with lower $\beta$ (a weaker KL constraint, so the safety preferences can move the policy further from the reference)
Evaluation:
- Red-teaming with adversarial prompts
- Helpfulness benchmarks (MT-Bench, AlpacaEval)
- Safety benchmarks (ToxiGen, BBQ, custom harm categories)
- Monitor over-refusal rate on benign prompts
Post-deployment:
- Collect user feedback (thumbs up/down) for ongoing preference tuning (binary signals fit KTO-style objectives)
- Monitor for reward hacking patterns
- Periodic human evaluation audits

Scoring rubric:

Grade	Criteria
Strong Hire	Proposes a principled budget allocation, combines human and AI feedback appropriately, addresses both safety and helpfulness, includes evaluation and monitoring.
Lean Hire	Reasonable strategy with human annotation. Mentions safety vs. helpfulness tradeoff.
No Hire	Spends all budget on one dimension. No evaluation plan. Does not consider RLAIF for scaling.

Interview Cheat Sheet

Topic	Key Fact	Why It Matters
RLHF stages	SFT, then RM, then PPO	The canonical 3-stage pipeline
Reward model loss	Bradley-Terry pairwise sigmoid loss	Most common formulation
PPO memory cost	4 models simultaneously	Major practical constraint
KL penalty	Prevents reward hacking	Goodhart's Law in action
DPO advantage	No RM, no RL, just supervised loss	Simpler and more stable
DPO loss	Log-prob ratio gap between preferred and dispreferred	Closed-form from RLHF objective
RLAIF	AI generates preferences instead of humans	Scales but risks bias amplification
Constitutional AI	Principles-based self-critique and revision	Anthropic's framework, makes criteria explicit
Reward hacking	Policy exploits RM weaknesses	Mitigate with KL, ensembles, length penalties
Safety tradeoff	Over-refusal vs. under-alignment	Context-aware, tiered approach works best
$\beta$ parameter	Controls KL constraint strength	Too high: no change. Too low: reward hacking.
DPO variants	IPO, KTO, ORPO, SimPO	Each relaxes different DPO assumptions

Spaced Repetition Checkpoints

Use these prompts to test your recall at increasing intervals:

Day 0 (Today)

What are the three stages of the RLHF pipeline?
Write the reward model (Bradley-Terry) loss function from memory
What is the role of the KL divergence constraint?

Day 3

Explain the DPO loss function and its derivation in one paragraph
Name three reward hacking patterns and one mitigation for each
What are the four models required in PPO-based RLHF?

Day 7

Compare RLHF, DPO, and RLAIF: when would you use each?
Describe the two phases of Constitutional AI
How would you allocate a 50K annotation budget between safety and helpfulness?

Day 14

Design a complete alignment pipeline for a domain-specific chatbot
Explain why reward hacking is fundamentally unavoidable (Goodhart's Law) and how to mitigate it
Discuss the safety vs. helpfulness tradeoff with concrete examples

Day 21

Teach the full RLHF-to-DPO derivation to someone unfamiliar with the topic
Critique the limitations of current alignment techniques - what problems remain unsolved?
Design an evaluation suite that measures both safety and helpfulness without over-indexing on either
