RLHF and Alignment

Alignment is the field's answer to a deceptively simple question: how do you make a language model do what humans actually want? This chapter covers the full alignment pipeline - from collecting preference data, through reward modeling and PPO optimization, to modern alternatives like DPO and constitutional AI. If you are interviewing for any role that touches LLM training, safety, or product, expect at least one alignment question.

Why Alignment Matters

Pre-trained language models learn to predict the next token. That objective produces fluent text, but fluency is not the same as helpfulness, truthfulness, or safety. A model trained only on next-token prediction will happily:

  • Generate toxic content if the prompt context makes it likely
  • Hallucinate facts with high confidence
  • Follow dangerous instructions (e.g., synthesize harmful substances)
  • Produce verbose, hedging, or sycophantic responses

Alignment techniques close the gap between "predict likely text" and "produce text humans actually prefer."

Interviewer's Perspective

Interviewers want you to articulate why pre-training alone is insufficient. The strongest candidates connect alignment to concrete failure modes - not just vague statements about "safety."

The RLHF Pipeline

The canonical RLHF pipeline, popularized by InstructGPT (Ouyang et al., 2022), has three stages:

[Figure: RLHF Pipeline]

Stage 1: Supervised Fine-Tuning (SFT)

Start with a pre-trained base model and fine-tune it on high-quality demonstration data - typically human-written responses to a diverse set of prompts.

Key details:

  • The SFT dataset is usually 10K-100K examples (much smaller than pre-training data)
  • Quality matters more than quantity - a small set of expert demonstrations outperforms a large set of noisy ones
  • The SFT model serves as the reference policy $\pi_{\text{ref}}$ in later stages
Company Variation

At OpenAI, SFT data was written by a team of ~40 contractors with detailed labeling guidelines. At Anthropic, the SFT stage uses a mix of human and AI-generated demonstrations (RLAIF-style). Smaller companies often start with open-source instruction datasets like Alpaca or OpenAssistant.

Stage 2: Reward Model Training

The reward model (RM) learns to predict which of two responses a human would prefer. It is the bridge between human judgment and automated optimization.

Data collection:

  1. Sample a prompt from the prompt distribution
  2. Generate $K$ responses from the SFT model (typically $K = 4$ to $K = 9$)
  3. Human annotators rank the responses (or provide pairwise comparisons)
  4. Convert rankings to pairwise preferences: $(x, y_w, y_l)$ where $y_w$ is preferred over $y_l$
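The conversion in step 4 can be sketched in a few lines. This is a minimal illustration, assuming each annotation arrives as a best-to-worst list with no ties; the function name `ranking_to_pairs` and the sample strings are hypothetical.

```python
from itertools import combinations


def ranking_to_pairs(prompt, ranked_responses):
    """Turn a best-to-worst ranking into (x, y_w, y_l) triples.

    ranked_responses[0] is the most preferred response.
    Ties are not handled in this sketch.
    """
    # combinations() preserves input order, so the first element of each
    # pair is always ranked above the second.
    return [(prompt, y_w, y_l) for y_w, y_l in combinations(ranked_responses, 2)]


pairs = ranking_to_pairs(
    "Explain KL divergence.",
    ["resp_a", "resp_b", "resp_c", "resp_d"],
)
print(len(pairs))  # K = 4 responses give C(4, 2) = 6 pairs
```

A single ranking of $K$ responses therefore produces $\binom{K}{2}$ training pairs, which is why ranking is an efficient way to collect pairwise data.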

Training objective (Bradley-Terry model):

$$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right]$$

where $r_\theta(x, y)$ is the scalar reward assigned by the model to response $y$ given prompt $x$, and $\sigma$ is the sigmoid function.
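To make the loss concrete, here is a minimal pure-Python sketch that evaluates it on a small batch. The reward values are made-up numbers standing in for reward model outputs; a real implementation would compute this on tensors inside a training loop.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bradley_terry_loss(chosen_rewards, rejected_rewards):
    """Mean of -log sigmoid(r_w - r_l) over a batch of preference pairs."""
    losses = [
        -log_sigmoid(r_w - r_l)
        for r_w, r_l in zip(chosen_rewards, rejected_rewards)
    ]
    return sum(losses) / len(losses)


# Hypothetical reward scores for three (y_w, y_l) pairs.
good = bradley_terry_loss([2.0, 1.5, 0.3], [0.0, -0.5, -1.0])  # RM agrees with labels
tied = bradley_terry_loss([0.0], [0.0])                        # RM is indifferent
bad = bradley_terry_loss([0.0, -0.5, -1.0], [2.0, 1.5, 0.3])   # RM disagrees
print(good < tied < bad)
```

Only the difference $r_w - r_l$ enters the loss, so adding a constant to every reward leaves it unchanged. An indifferent model scores exactly $\log 2$.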

Architecture: The reward model is typically initialized from the SFT model checkpoint, with the final unembedding layer replaced by a linear head that outputs a scalar reward.

60-Second Answer

"The reward model takes a prompt-response pair and outputs a scalar score. It's trained on human preference comparisons using a Bradley-Terry pairwise loss. The key insight is that we only need relative preferences - annotators don't need to assign absolute quality scores, just say which response is better."

Stage 3: PPO Optimization

Proximal Policy Optimization (PPO) is used to fine-tune the SFT model to maximize the reward model's score, subject to a KL divergence penalty that prevents the policy from drifting too far from the reference model.

The RL objective:

$$\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)} \left[ r_\phi(x, y) - \beta \, D_{\text{KL}}\left(\pi_\theta(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x)\right) \right]$$

where:

  • $\pi_\theta$ is the policy being optimized
  • $\pi_{\text{ref}}$ is the frozen SFT model (reference policy)
  • $r_\phi$ is the learned reward model
  • $\beta$ controls the strength of the KL penalty
  • $\mathcal{D}$ is the prompt distribution

PPO implementation details for LLMs:

| Component | Role |
| --- | --- |
| Policy model | The LLM being optimized ($\pi_\theta$) |
| Reference model | Frozen copy of the SFT model ($\pi_{\text{ref}}$) |
| Reward model | Scores completions ($r_\phi$) |
| Value model | Estimates expected future reward (critic) |

This means RLHF with PPO requires four models in memory simultaneously - a major computational cost.

PPO-specific mechanics:

  • Clipped surrogate objective: PPO clips the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\text{old}}(a_t|s_t)}$ to $[1-\epsilon, 1+\epsilon]$ to prevent destructively large updates
  • Generalized Advantage Estimation (GAE): Used to compute per-token advantages, with the bias-variance tradeoff controlled by $\lambda$
  • Mini-batch updates: Each batch of generations is used for multiple gradient steps
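The first two mechanics above can be sketched for a single response as follows. This is an illustrative pure-Python version, not a production implementation: the defaults `eps=0.2`, `gamma=1.0`, and `lam=0.95` are assumed values, and the value after the final token is assumed to be zero.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO objective for one token: min(ratio * A, clip(ratio) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one response.

    rewards[t] is the reward at token t, values[t] the critic's estimate
    at token t. The value after the last token is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# A ratio of 1.5 with a positive advantage is capped at 1 + eps = 1.2.
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)
# With a negative advantage the unclipped (more pessimistic) term is kept.
uncapped = clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0)
advs = gae(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5], lam=1.0)
```

Setting `lam=1.0` recovers the Monte Carlo return minus the value baseline (low bias, high variance). Setting `lam=0.0` keeps only the one-step TD residual (higher bias, lower variance).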
Common Trap

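Don't confuse the KL penalty in the RLHF objective with the PPO clipping mechanism. They serve different purposes: the KL penalty keeps the policy close to the frozen reference model $\pi_{\text{ref}}$ over the whole of training, which limits reward hacking (policy-level), while PPO clipping limits how far a single update can move the policy from the previous iterate $\pi_{\text{old}}$, which prevents destructive optimization steps (gradient-level). Many candidates conflate these.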

The KL Divergence Constraint

The KL penalty is arguably the most important design choice in RLHF. Without it, the policy will exploit the reward model.

Why KL Matters

[Figure: KL Divergence Constraint]

Intuition: The reward model is an imperfect proxy for human preferences. It was trained on a finite dataset and has blind spots. Without the KL constraint, the policy will find outputs that score highly according to the reward model but are clearly degenerate to a human. This is reward hacking (also called reward overoptimization), an instance of Goodhart's Law.

Mathematical detail: For a response $y$, the KL divergence accumulated over token positions is:

$$D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) = \sum_t \sum_{v \in V} \pi_\theta(v|x, y_{<t}) \log \frac{\pi_\theta(v|x, y_{<t})}{\pi_{\text{ref}}(v|x, y_{<t})}$$

In practice, this is estimated from the tokens $y_t$ actually sampled from $\pi_\theta$:

$$\hat{D}_{\text{KL}} = \sum_t \left[ \log \pi_\theta(y_t|x, y_{<t}) - \log \pi_{\text{ref}}(y_t|x, y_{<t}) \right]$$
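The sampled estimate is just a sum of log-probability differences, as the sketch below shows. The second function illustrates one common way to feed the penalty into PPO: charge $-\beta$ times the per-token term at every position and add the reward model score at the final token. That placement is an assumption of this sketch, not something the text above specifies, and the log-probabilities are made-up numbers.

```python
def kl_estimate(policy_logprobs, ref_logprobs):
    """Single-sample estimate: sum_t [log pi_theta(y_t) - log pi_ref(y_t)]."""
    return sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))


def shaped_rewards(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Per-token rewards: KL charge at every token, RM score on the last one."""
    rewards = [-beta * (p - r) for p, r in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += rm_score
    return rewards


# Hypothetical per-token log-probs of one sampled 3-token response.
policy_lp = [-1.0, -0.5, -2.0]
ref_lp = [-1.2, -0.9, -2.0]

kl = kl_estimate(policy_lp, ref_lp)
rewards = shaped_rewards(rm_score=1.0, policy_logprobs=policy_lp, ref_logprobs=ref_lp)
# The shaped rewards sum to rm_score - beta * kl, matching the RL objective.
print(round(kl, 6), round(sum(rewards), 6))
```

Individual per-token terms can be negative even though the true KL is non-negative, because this is a single-sample estimate.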

Choosing $\beta$

  • Too high $\beta$: Policy barely changes from SFT - alignment gains are minimal
  • Too low $\beta$: Reward hacking - policy exploits RM weaknesses
  • Typical range: $\beta \in [0.01, 0.2]$
  • Some implementations use adaptive KL: adjust $\beta$ dynamically to target a specific KL budget
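One simple way to implement the adaptive scheme is a proportional controller: raise $\beta$ when the observed KL exceeds the target and lower it otherwise. The sketch below is illustrative. The class name, the clip range of ±0.2, and the `horizon` smoothing constant are assumptions chosen for the example, not values taken from the text.

```python
class AdaptiveKLController:
    """Proportional controller that nudges beta toward a target KL."""

    def __init__(self, init_beta, target_kl, horizon):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon  # larger horizon = slower adjustments

    def update(self, observed_kl, n_steps):
        # Relative error, clipped so one noisy batch cannot swing beta wildly.
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta


controller = AdaptiveKLController(init_beta=0.1, target_kl=6.0, horizon=1000)
beta_up = controller.update(observed_kl=12.0, n_steps=100)   # KL too high: beta rises
beta_down = controller.update(observed_kl=3.0, n_steps=100)  # KL too low: beta falls
```

The controller replaces a grid search over a fixed $\beta$ with a search over a target KL, which is often easier to reason about because it is expressed in the same units as the drift you are trying to bound.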
Interviewer's Perspective

A strong answer on KL explains three things: (1) why it is needed (reward model is imperfect), (2) what happens without it (reward hacking), and (3) how $\beta$ is tuned in practice (grid search or adaptive targeting).

Reward Hacking and Mitigations

Reward hacking is when the optimized policy achieves high reward scores without actually improving response quality. It is one of the most studied failure modes in alignment.

Common Reward Hacking Patterns

| Pattern | Description | Example |
| --- | --- | --- |
| Length gaming | Reward model gives higher scores to longer responses | Model produces verbose, padded responses with unnecessary caveats |
| Sycophancy | RM rewards agreement with user | Model agrees with factually wrong premises |
| Repetition exploitation | Certain phrases consistently score high | Model repeats "I hope that helps!" or "Great question!" |
| Format gaming | RM prefers certain formats | Model always produces numbered lists regardless of context |
| Hedging | RM rewards cautious language | Model adds excessive disclaimers to every response |

Mitigations

  1. KL divergence constraint - The primary defense (discussed above)

  2. Reward model ensembles - Train multiple RMs and use the conservative estimate (e.g., minimum or lower confidence bound):

     $$r_{\text{ensemble}}(x, y) = \min_{i \in [N]} r_{\phi_i}(x, y)$$

  3. Length penalties - Explicitly penalize the reward for response length:

     $$r_{\text{adjusted}}(x, y) = r_\phi(x, y) - \alpha \cdot \text{len}(y)$$

  4. Reward model retraining - Periodically retrain the RM on outputs from the current policy (iterated RLHF)

  5. Process-based reward models - Reward each reasoning step, not just the final answer. This makes gaming harder since every step must look reasonable.
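The ensemble and length-penalty formulas above translate directly into code. The sketch below uses made-up scores, and `alpha=0.01` per token is an assumed value that would need tuning in practice.

```python
def conservative_reward(rm_scores):
    """Ensemble reward: the minimum over the individual reward models."""
    return min(rm_scores)


def length_penalized_reward(rm_score, num_tokens, alpha=0.01):
    """Subtract alpha per token from the reward model score."""
    return rm_score - alpha * num_tokens


# One RM in the ensemble is unconvinced, so the conservative score is low.
print(conservative_reward([1.8, 0.4, 1.5]))  # 0.4

# A padded 150-token answer versus a concise 40-token answer.
padded = length_penalized_reward(rm_score=2.0, num_tokens=150)
concise = length_penalized_reward(rm_score=1.8, num_tokens=40)
print(concise > padded)
```

Taking the minimum means a response must satisfy every reward model, so an exploit that fools only one of them earns nothing. The length penalty flips the ranking in the second example even though the padded answer has the higher raw score.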

Instant Rejection

Saying "just increase the reward model's accuracy" without acknowledging Goodhart's Law signals a fundamental misunderstanding of alignment. The issue is not that the RM is inaccurate - it is that any learned proxy will be exploited by a sufficiently powerful optimizer.

Preference Data Collection

The quality of RLHF depends entirely on the quality of preference data. This section covers practical considerations that interviewers - especially at AI labs - care deeply about.

Annotation Guidelines

What makes a good response? Annotators need clear criteria. InstructGPT used a hierarchy:

  1. Helpfulness - Does the response actually answer the question?
  2. Truthfulness - Are claims factually correct?
  3. Harmlessness - Does the response avoid generating harmful content?

When criteria conflict (e.g., a user asks "how do I pick a lock?"), the guidelines must specify priority ordering. This is where alignment philosophy meets engineering.

Pairwise vs. Rating vs. Ranking

| Method | Pros | Cons |
| --- | --- | --- |
| Pairwise comparison | Simplest for annotators, highest agreement | $O(K^2)$ comparisons for $K$ responses |
| Likert rating (1-5) | Efficient, one judgment per response | Low inter-annotator agreement, scale anchoring varies |
| Full ranking | Maximum information per annotation | Cognitively demanding for more than 4-5 responses |

Best practice: Use pairwise comparisons for the core dataset. Derive rankings from pairwise results using the Bradley-Terry model or Elo rating. InstructGPT had labelers rank $K \in \{4, \dots, 9\}$ responses per prompt, which yields $\binom{K}{2}$ pairwise comparisons per prompt.

Inter-Annotator Agreement

Human preferences are noisy. Typical inter-annotator agreement rates:

  • Easy comparisons (one response is clearly better): ~85-95% agreement
  • Hard comparisons (both responses are similar quality): ~55-65% agreement
  • Overall: ~72-77% agreement (reported by InstructGPT)
Common Trap

Candidates sometimes assume preference data is "ground truth." It is not - human preferences are stochastic, context-dependent, and vary across annotators. Strong candidates acknowledge this noise and discuss how it affects RM training (e.g., label smoothing, filtering low-confidence pairs).
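Label smoothing, mentioned above, can be sketched as a small change to the pairwise loss: assume each label is flipped with probability `eps` and mix the two directions accordingly. This is one common formulation, shown here as an illustration; `eps=0.1` is an assumed noise rate.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def smoothed_pairwise_loss(reward_margin, eps=0.1):
    """-(1 - eps) * log sigmoid(m) - eps * log sigmoid(-m), with m = r_w - r_l."""
    return (
        -(1.0 - eps) * log_sigmoid(reward_margin)
        - eps * log_sigmoid(-reward_margin)
    )


# With eps = 0 this is the standard Bradley-Terry loss.
plain = smoothed_pairwise_loss(2.0, eps=0.0)
# With eps > 0 the loss is minimized at a finite margin, log((1 - eps) / eps),
# so the reward model is no longer pushed toward unbounded confidence.
at_optimum = smoothed_pairwise_loss(math.log(9.0), eps=0.1)
overconfident = smoothed_pairwise_loss(10.0, eps=0.1)
print(at_optimum < overconfident)
```

The design choice is to stop rewarding extreme margins on pairs that annotators themselves might disagree on.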

Scaling Laws for Preference Data

More preference data helps, but with diminishing returns:

  • 10K-50K comparisons: Meaningful reward model
  • 50K-200K comparisons: Strong reward model
  • 200K+ comparisons: Marginal gains; quality and diversity matter more

Direct Preference Optimization (DPO)

DPO (Rafailov et al., 2023) eliminates the need for a separate reward model and RL loop entirely. It is one of the most important developments in post-RLHF alignment.

Core Insight

The key mathematical insight: the optimal policy under the RLHF objective (reward maximization + KL constraint) has a closed-form solution:

$$\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{1}{\beta} r^*(x, y)\right)$$

Rearranging for the reward:

$$r^*(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$

Substituting this into the Bradley-Terry preference model and noting that $Z(x)$ cancels in pairwise comparisons, we get the DPO loss:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right]$$
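For a single preference triple, the loss reduces to a few arithmetic steps on sequence log-probabilities. The sketch below is a minimal illustration: the four log-probabilities are made-up numbers that would normally come from summing token log-probs under the policy and the frozen reference, and `beta=0.1` is an assumed value.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (x, y_w, y_l) triple from sequence log-probabilities."""
    chosen_logratio = policy_logp_w - ref_logp_w      # log pi_theta/pi_ref on y_w
    rejected_logratio = policy_logp_l - ref_logp_l    # log pi_theta/pi_ref on y_l
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))


# At initialization the policy equals the reference, so the loss is log 2.
at_init = dpo_loss(-12.0, -13.0, -12.0, -13.0)
# After training raises y_w and lowers y_l relative to the reference, it drops.
after = dpo_loss(-10.0, -15.0, -12.0, -13.0)
print(after < at_init)
```

The reference log-probabilities can be computed once and cached, since the reference model is frozen. That is part of why DPO needs no generation during training.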

DPO vs. RLHF

[Figure: DPO vs RLHF]

| Dimension | RLHF (PPO) | DPO |
| --- | --- | --- |
| Reward model | Required (separate training) | Implicit (derived from policy) |
| Models in memory | 4 (policy, ref, RM, value) | 2 (policy, ref) |
| Training stability | Sensitive to hyperparameters | More stable (standard supervised loss) |
| Compute cost | High (RL loop, generation) | Lower (no generation during training) |
| Reward hacking | Explicit RM can be exploited | No explicit RM to exploit |
| Expressiveness | Can learn complex reward signals | Limited by Bradley-Terry assumption |
| Online data | Generates new responses during training | Uses fixed offline dataset |

DPO Variants

  • IPO (Identity Preference Optimization): Addresses DPO's overfitting to preference margins by using a simpler loss that does not assume the Bradley-Terry model
  • KTO (Kahneman-Tversky Optimization): Works with binary feedback (thumbs up/down) instead of pairwise comparisons - much easier to collect
  • ORPO (Odds Ratio Preference Optimization): Combines SFT and preference optimization into a single stage
  • SimPO: Simplifies DPO by removing the need for a reference model entirely
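To show how two of these variants change the loss, the sketch below writes out IPO and SimPO for a single preference pair, following their published formulations as they are commonly stated. Treat it as an illustration under stated assumptions rather than a reference implementation: the hyperparameter values (`tau`, `beta`, `gamma`) and all log-probabilities are made up for the example.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def ipo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """Squared loss pulling the log-ratio gap toward 1 / (2 * tau)."""
    h = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return (h - 1.0 / (2.0 * tau)) ** 2


def simpo_loss(policy_logp_w, policy_logp_l, len_w, len_l, beta=2.0, gamma=1.0):
    """Reference-free loss on length-normalized log-probs with margin gamma."""
    margin = beta * (policy_logp_w / len_w - policy_logp_l / len_l) - gamma
    return -log_sigmoid(margin)


# IPO has a finite target gap, so the loss is zero once the gap reaches it.
ipo_at_target = ipo_loss(-9.0, -15.0, -12.0, -13.0, tau=0.1)   # gap = 3 - (-2) = 5
ipo_overshoot = ipo_loss(-5.0, -20.0, -12.0, -13.0, tau=0.1)   # gap = 14

# SimPO compares average per-token log-probs and needs no reference model.
simpo_right = simpo_loss(-20.0, -60.0, len_w=20, len_l=30)
simpo_wrong = simpo_loss(-60.0, -20.0, len_w=30, len_l=20)
```

The IPO target explains the "overfitting" point above: unlike the DPO sigmoid loss, pushing the gap past the target is penalized. SimPO's length normalization keeps longer responses from being favored or penalized just for having more tokens.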
60-Second Answer

"DPO shows that you can skip the reward model and RL loop entirely. The math works out because the optimal RLHF policy has a closed form - you can rearrange the objective to get a supervised loss that directly increases the log-probability gap between preferred and dispreferred responses, scaled by beta."

RLAIF: Reinforcement Learning from AI Feedback

RLAIF (Bai et al., 2022; Lee et al., 2023) replaces human annotators with an AI model that provides preference judgments.

How RLAIF Works

  1. Generate response pairs for each prompt
  2. Ask a capable LLM (e.g., GPT-4, Claude) to judge which response is better
  3. Train a reward model on the AI-generated preferences
  4. Run PPO as in standard RLHF (or use DPO on AI preferences)
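Steps 1 to 3 can be sketched as a small labeling function. Here `judge` is a hypothetical callable standing in for an LLM call, and `toy_judge` is a deliberately crude stand-in so the example runs. Asking twice with the order swapped is an extra consistency check added for this sketch, not part of the steps above.

```python
def label_with_ai_judge(judge, prompt, response_a, response_b):
    """Return (x, y_w, y_l), or None if the judge is inconsistent.

    judge(prompt, first, second) is a hypothetical callable that returns
    "first" or "second". It is queried twice with the order swapped.
    """
    verdict = judge(prompt, response_a, response_b)
    swapped = judge(prompt, response_b, response_a)
    if verdict == "first" and swapped == "second":
        return (prompt, response_a, response_b)
    if verdict == "second" and swapped == "first":
        return (prompt, response_b, response_a)
    return None  # verdict changed with presentation order; drop the pair


def toy_judge(prompt, first, second):
    # Stand-in for an LLM call: prefers the longer response, ties go to "first".
    return "first" if len(first) >= len(second) else "second"


kept = label_with_ai_judge(toy_judge, "q", "short", "a longer answer")
dropped = label_with_ai_judge(toy_judge, "q", "same", "size")
print(kept)
print(dropped)
```

The returned triples have the same shape as human preference data, so they can feed either reward model training or DPO unchanged.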

Advantages

  • Scale: AI feedback can be generated at 100-1000x the rate of human annotation
  • Consistency: AI judges are more consistent (lower noise) than human annotators
  • Cost: Orders of magnitude cheaper than human annotation
  • Coverage: Can evaluate on specialized domains where human experts are scarce

Risks

  • Capability ceiling: The AI judge cannot reliably evaluate responses that exceed its own capability
  • Bias amplification: The AI judge's biases get baked into the reward model
  • Homogenization: Models trained on AI feedback may converge to similar styles
  • Sycophancy loops: If the AI judge has the same sycophantic tendencies, these get reinforced
Company Variation

Anthropic pioneered RLAIF with Constitutional AI. Google used RLAIF extensively for PaLM 2 and Gemini. OpenAI primarily uses human feedback but supplements with model-generated comparisons for scale. Meta's Llama models use a mix of human and AI feedback.

Constitutional AI (CAI)

Constitutional AI (Bai et al., 2022) is Anthropic's framework for alignment that uses a set of principles (a "constitution") to guide AI self-improvement.

The Two-Phase Process

[Figure: Constitutional AI]

Phase 1 - Critique and Revision (SL-CAI):

  1. Prompt the model to generate a potentially harmful response
  2. Ask the model to critique its own response based on a constitutional principle (e.g., "Is this response harmful?")
  3. Ask the model to revise the response to address the critique
  4. Fine-tune on the revised (improved) responses
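The Phase 1 loop can be sketched as three chained model calls. In this illustration `generate` is a hypothetical text-in, text-out callable, the prompt templates are invented for the example, and `stub_model` only exists so the code runs without a real model.

```python
def critique_and_revise(generate, prompt, principle):
    """One critique-and-revision pass, returning a fine-tuning example."""
    initial = generate(prompt)
    critique = generate(
        f"Prompt: {prompt}\nResponse: {initial}\n"
        f"Critique the response using this principle: {principle}"
    )
    revision = generate(
        f"Prompt: {prompt}\nResponse: {initial}\nCritique: {critique}\n"
        "Rewrite the response to address the critique."
    )
    # Only the revised response is kept as the supervised target.
    return {"prompt": prompt, "completion": revision}


calls = []


def stub_model(text):
    # Records each request and returns a numbered placeholder.
    calls.append(text)
    return f"output_{len(calls)}"


example = critique_and_revise(stub_model, "How do I pick a lock?", "Is this response harmful?")
print(example["completion"])  # output_3
```

The critique itself is discarded after use. The fine-tuning set contains only the original prompt paired with the revised response.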

Phase 2 - RL from AI Feedback (RL-CAI):

  1. Generate pairs of responses
  2. Ask the AI to judge which response better adheres to the constitution
  3. Train a reward model on these AI-generated preferences
  4. Optimize with PPO

Example Constitutional Principles

  • "Choose the response that is least likely to be used for harm"
  • "Choose the response that is most helpful while being honest"
  • "Choose the response that is least toxic or offensive"
  • "Choose the response that most respects individual autonomy"

Why It Matters

CAI reduces dependence on human annotators for safety-related preferences, scales better, and makes the alignment criteria explicit and auditable - you can read the constitution and understand what the model was trained to value.

Safety vs. Helpfulness Tradeoff

One of the central tensions in alignment is the tradeoff between making a model safe and making it helpful.

The Pareto Frontier

[Figure: Safety-Helpfulness Tradeoff]

Over-refusal problem: An overly aligned model refuses benign requests because they superficially resemble dangerous ones. Examples:

  • Refusing to explain how locks work (legitimate educational content)
  • Refusing to write fiction with conflict (normal storytelling)
  • Adding unnecessary safety disclaimers to cooking recipes

Under-alignment problem: An insufficiently aligned model provides harmful information freely:

  • Detailed instructions for illegal activities
  • Generating manipulative content
  • Confirming user's false medical self-diagnoses

Practical Approaches to the Tradeoff

  1. Tiered safety: Apply different safety thresholds to different harm categories. Mild topics (e.g., alcohol) get lighter restrictions than severe topics (e.g., weapons).

  2. Context-aware refusal: Instead of categorical refusal, consider the likely use case. "How does a virus spread?" is a legitimate biology question, not a bioweapons inquiry.

  3. Calibrated responses: Provide information with appropriate caveats rather than refusing outright. "Here's how X works, but Y is illegal and Z is the legal alternative."

  4. System prompt steering: Allow deployers to adjust the safety-helpfulness tradeoff for their use case (e.g., a medical AI can discuss medications more freely than a general chatbot).

Interviewer's Perspective

The best candidates discuss this tradeoff with nuance. Avoid taking extreme positions ("models should never refuse" or "safety always trumps helpfulness"). Interviewers want to see that you understand the complexity and can reason about specific cases.

RLHF vs. DPO vs. RLAIF Comparison

| Dimension | RLHF (PPO) | DPO | RLAIF |
| --- | --- | --- | --- |
| Feedback source | Human annotators | Human annotators | AI model |
| Training method | RL (PPO) | Supervised loss | RL or DPO |
| Reward model | Explicit, separate | Implicit | Explicit, separate |
| Compute cost | Very high | Moderate | High (but annotation is cheap) |
| Training stability | Low (RL is finicky) | High | Moderate |
| Data efficiency | Moderate | High | Low (AI data is noisy but abundant) |
| Scalability | Limited by human annotation | Limited by human annotation | Highly scalable |
| Reward hacking risk | High | Lower | Moderate |
| Best for | Frontier models with budget | Open-source, research | Scaling alignment cheaply |
60-Second Answer

"RLHF gives maximum control but is expensive and unstable. DPO is simpler, more stable, and cheaper - it is the default choice for most teams today. RLAIF solves the data bottleneck by using AI judges, but it introduces the risk of amplifying AI biases. In practice, most labs use a combination: AI feedback for scale, human feedback for calibration."

Practice Problems

Problem 1: Reward Model Design

You are designing a reward model for a customer service chatbot. The model should reward helpful, accurate, and polite responses. How would you structure the training data and loss function?

Hint 1 - Direction

Think about what makes customer service different from general chat. Consider domain-specific quality criteria and how to source comparison data.

Hint 2 - Insight

You need multi-dimensional quality criteria: accuracy (did the agent solve the problem?), tone (was it professional?), efficiency (did it resolve quickly?). Consider using real customer satisfaction data as a signal.

Hint 3 - Full Solution

Data collection:

  • Source prompts from real customer conversations (anonymized)
  • Generate $K=4$ responses per prompt using the SFT model
  • Have trained annotators rank responses on three axes: helpfulness (1-5), accuracy (binary), and tone (1-3)
  • Construct pairwise preferences from rankings

Loss function:

  • Use the standard Bradley-Terry pairwise loss
  • Add a length penalty to prevent verbose responses (customers want quick resolutions)
  • Consider a multi-objective reward: $r(x, y) = w_1 \cdot r_{\text{helpful}} + w_2 \cdot r_{\text{accurate}} + w_3 \cdot r_{\text{tone}}$
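The weighted combination can be sketched as below. The weights and per-axis scores are made-up values, and the sketch assumes each axis has already been scaled to a comparable range.

```python
def combined_reward(scores, weights):
    """Weighted sum of per-axis rewards: sum_i w_i * r_i."""
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical weights; in practice these are tuned against business metrics.
weights = {"helpful": 0.5, "accurate": 0.3, "tone": 0.2}

polite_but_wrong = combined_reward(
    {"helpful": 0.6, "accurate": 0.0, "tone": 1.0}, weights
)
curt_but_right = combined_reward(
    {"helpful": 0.8, "accurate": 1.0, "tone": 0.5}, weights
)
print(curt_but_right > polite_but_wrong)
```

Keeping the axes separate until the final sum makes it possible to inspect which dimension drove a score, which helps when debugging reward hacking on a single axis such as tone.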

Domain-specific considerations:

  • Include "escalation detection" - the model should know when to hand off to a human
  • Test for hallucination of policies (e.g., promising refunds the company does not offer)
  • Evaluate on resolution rate, not just response quality

Scoring rubric:

| Grade | Criteria |
| --- | --- |
| Strong Hire | Discusses multi-dimensional reward, domain-specific failure modes (hallucinated policies), and evaluation beyond preference accuracy (e.g., resolution rate). Mentions calibration between annotators. |
| Lean Hire | Covers basic RM training, mentions pairwise loss, considers at least one domain-specific issue. |
| No Hire | Describes generic RM training without tailoring to the customer service domain. Does not consider domain-specific failure modes. |

Problem 2: DPO Troubleshooting

You fine-tuned a model with DPO and the loss converges, but the model's responses are worse than the SFT baseline. What went wrong?

Hint 1 - Direction

Think about the assumptions DPO makes and what data quality issues could cause the model to learn the wrong direction.

Hint 2 - Insight

DPO assumes the preference data is consistent with the Bradley-Terry model and that the reference model generated the responses. What happens if neither assumption holds?

Hint 3 - Full Solution

Likely causes:

  1. Preference data quality: If chosen/rejected labels are noisy or inconsistent, DPO will learn to increase probability of low-quality responses. Check inter-annotator agreement.

  2. Distribution mismatch: DPO works best when the response pairs were generated by (or near) the reference model. If you are using preference data from a different model, the implicit reward function may be miscalibrated.

  3. $\beta$ too low: The model diverged too far from the reference, overfitting to noise in the preferences. Try increasing $\beta$.

  4. $\beta$ too high: The model barely changed. Check the KL divergence - if it is near zero, increase the learning rate or decrease $\beta$.

  5. Chosen responses are out-of-distribution: If the "preferred" responses are written by humans or a much more capable model, the policy may struggle to assign them high probability, leading to unstable gradients.

  6. Evaluation mismatch: The evaluation criteria may differ from the preference criteria. DPO optimized for what annotators preferred, but you are evaluating on something else (e.g., factual accuracy).

Debugging steps:

  • Plot training loss curve - look for instability
  • Measure KL divergence from reference - too high or too low?
  • Sample outputs and compare to SFT baseline qualitatively
  • Check if chosen responses are within the reference model's capability
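Several of these checks can be automated from logged sequence log-probabilities. The sketch below computes DPO's implicit rewards, $\beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$, for each pair and reports how often the chosen response wins. The field names and numbers are hypothetical.

```python
def dpo_diagnostics(batch, beta=0.1):
    """Mean implicit-reward margin and fraction of pairs ranked correctly.

    Each item holds sequence log-probs for the chosen (w) and rejected (l)
    responses under the policy and the reference model.
    """
    margins = []
    for ex in batch:
        reward_w = beta * (ex["policy_w"] - ex["ref_w"])
        reward_l = beta * (ex["policy_l"] - ex["ref_l"])
        margins.append(reward_w - reward_l)
    return {
        "mean_margin": sum(margins) / len(margins),
        "accuracy": sum(m > 0 for m in margins) / len(margins),
    }


# Hypothetical held-out pairs: the policy gets one right and one wrong.
batch = [
    {"policy_w": -10.0, "ref_w": -12.0, "policy_l": -15.0, "ref_l": -13.0},
    {"policy_w": -14.0, "ref_w": -12.0, "policy_l": -11.0, "ref_l": -13.0},
]
stats = dpo_diagnostics(batch)
print(stats["accuracy"])
```

Accuracy near 0.5 on held-out pairs despite a converged training loss points toward noisy labels or overfitting. A large margin on training pairs with poor sample quality points toward $\beta$ being too low.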

Scoring rubric:

| Grade | Criteria |
| --- | --- |
| Strong Hire | Identifies 3+ plausible causes, proposes systematic debugging approach, understands the distribution mismatch issue. |
| Lean Hire | Identifies at least 2 causes including data quality. Mentions $\beta$ tuning. |
| No Hire | Cannot identify plausible causes beyond "bad data." Does not understand DPO's assumptions. |

Problem 3: Safety Alignment Strategy

Your company is launching a general-purpose chatbot. Design an alignment strategy that balances safety and helpfulness. You have a budget for 50K human preference annotations.

Hint 1 - Direction

Think about how to allocate your annotation budget across safety-critical and general helpfulness data. Consider what can be automated and what requires human judgment.

Hint 2 - Insight

A hybrid approach works best: use RLAIF for the bulk of general helpfulness training, reserve human annotations for safety-critical categories where AI judgment is unreliable. Consider constitutional AI for scalable safety training.

Hint 3 - Full Solution

Strategy:

  1. Annotation budget allocation:

    • 20K annotations for safety-critical comparisons (harmful content, misinformation, illegal advice)
    • 20K annotations for general helpfulness
    • 10K annotations for edge cases (ambiguous requests, dual-use knowledge)
  2. Supplement with RLAIF:

    • Use a strong AI judge (GPT-4 or Claude) to generate 200K+ additional preference pairs for general quality
    • Focus AI feedback on style, clarity, and completeness (domains where AI judges are reliable)
    • Reserve human annotation for safety judgments (where AI judges may have blind spots)
  3. Training pipeline:

    • Stage 1: SFT on high-quality demonstrations
    • Stage 2: DPO on the combined human + AI preference dataset (DPO for stability and lower compute)
    • Stage 3: Safety-specific DPO pass using only the 20K safety annotations with a lower $\beta$ (a weaker KL constraint, so the policy can move further from the reference on safety behavior)
  4. Evaluation:

    • Red-teaming with adversarial prompts
    • Helpfulness benchmarks (MT-Bench, AlpacaEval)
    • Safety benchmarks (ToxiGen, BBQ, custom harm categories)
    • Monitor over-refusal rate on benign prompts
  5. Post-deployment:

    • Collect user feedback (thumbs up/down) for ongoing preference tuning (binary signals suit KTO-style training)
    • Monitor for reward hacking patterns
    • Periodic human evaluation audits

Scoring rubric:

| Grade | Criteria |
| --- | --- |
| Strong Hire | Proposes a principled budget allocation, combines human and AI feedback appropriately, addresses both safety and helpfulness, includes evaluation and monitoring. |
| Lean Hire | Reasonable strategy with human annotation. Mentions safety vs. helpfulness tradeoff. |
| No Hire | Spends all budget on one dimension. No evaluation plan. Does not consider RLAIF for scaling. |

Interview Cheat Sheet

| Topic | Key Fact | Why It Matters |
| --- | --- | --- |
| RLHF stages | SFT, then RM, then PPO | The canonical 3-stage pipeline |
| Reward model loss | Bradley-Terry pairwise sigmoid loss | Most common formulation |
| PPO memory cost | 4 models simultaneously | Major practical constraint |
| KL penalty | Prevents reward hacking | Goodhart's Law in action |
| DPO advantage | No RM, no RL, just supervised loss | Simpler and more stable |
| DPO loss | Log-prob ratio gap between preferred and dispreferred | Closed-form from RLHF objective |
| RLAIF | AI generates preferences instead of humans | Scales but risks bias amplification |
| Constitutional AI | Principles-based self-critique and revision | Anthropic's framework, makes criteria explicit |
| Reward hacking | Policy exploits RM weaknesses | Mitigate with KL, ensembles, length penalties |
| Safety tradeoff | Over-refusal vs. under-alignment | Context-aware, tiered approach works best |
| $\beta$ parameter | Controls KL constraint strength | Too high: no change. Too low: reward hacking. |
| DPO variants | IPO, KTO, ORPO, SimPO | Each relaxes different DPO assumptions |

Spaced Repetition Checkpoints

Use these prompts to test your recall at increasing intervals:

Day 0 (Today)

  • What are the three stages of the RLHF pipeline?
  • Write the reward model (Bradley-Terry) loss function from memory
  • What is the role of the KL divergence constraint?

Day 3

  • Explain the DPO loss function and its derivation in one paragraph
  • Name three reward hacking patterns and one mitigation for each
  • What are the four models required in PPO-based RLHF?

Day 7

  • Compare RLHF, DPO, and RLAIF: when would you use each?
  • Describe the two phases of Constitutional AI
  • How would you allocate a 50K annotation budget between safety and helpfulness?

Day 14

  • Design a complete alignment pipeline for a domain-specific chatbot
  • Explain why reward hacking is fundamentally unavoidable (Goodhart's Law) and how to mitigate it
  • Discuss the safety vs. helpfulness tradeoff with concrete examples

Day 21

  • Teach the full RLHF-to-DPO derivation to someone unfamiliar with the topic
  • Critique the limitations of current alignment techniques - what problems remain unsolved?
  • Design an evaluation suite that measures both safety and helpfulness without over-indexing on either
© 2026 EngineersOfAI. All rights reserved.