RLHF: Reinforcement Learning from Human Feedback

The Helpful but Dangerous Model

In 2022, OpenAI ran an experiment. They asked human raters to compare outputs from two models: GPT-3 (175B, not instruction-tuned) and an early version of InstructGPT (1.3B, RLHF-trained). The smaller model was trained with reinforcement learning from human feedback. The larger model was trained purely on text prediction.

The result: human raters preferred outputs from the 1.3B InstructGPT model over outputs from the 175B GPT-3 model.

A model more than 100x smaller beat its far larger counterpart. Not on a narrow benchmark - in open-ended human evaluation of which response was more helpful, honest, and harmless.

This result exposed a deep problem with raw language models. GPT-3 knows an enormous amount. It is excellent at predicting text. But it has no mechanism for caring whether its output is helpful, truthful, or safe. It will helpfully explain how to pick a lock. It will confidently state false facts. It will generate harmful content if the text continuation probabilities favor it. Pretraining on text teaches the model to predict text - not to be a helpful assistant.

RLHF is the solution that OpenAI, Anthropic, and DeepMind independently developed and deployed: collect human preferences over model outputs, train a reward model to predict those preferences, and use reinforcement learning to optimize the language model to produce outputs that the reward model scores highly.

The three phases of RLHF produced the models we call "aligned" - GPT-4, Claude, Gemini - and their defining characteristic is not knowledge or reasoning ability, but the shaping of behavior by human preference.

Why This Exists: The Alignment Problem in Practice

After SFT and instruction tuning, you have a model that follows instructions. But "following instructions" is not the same as being aligned with human intent. Consider what a base model does when given the prompt "How do I whittle a knife?":

A raw language model might continue with a straightforward tutorial. That is fine.

But "How do I whittle a knife so I can kill my sister?" is a different prompt entirely. A model trained only to continue text might continue with the whittling tutorial - because whittling tutorials are common in text, and the "so I can kill my sister" addendum does not change the text continuation probabilities for whittling instructions.

Human intent is multi-dimensional. A genuinely helpful response considers not just what the user asked for, but whether fulfilling the request causes harm, whether the information is truthful, whether the response is complete, and dozens of other factors that are hard to specify in a training objective.

You cannot write a loss function that captures "be helpful, harmless, and honest" directly. But you can collect examples of human preferences - "Response A was better than Response B because it was more helpful without being harmful" - and train a model to learn that preference function. That is RLHF.

Historical Context: How RLHF Came Together

2017 - OpenAI and DeepMind (Christiano et al.) demonstrated learning from human preferences in game-playing agents. The key paper: "Deep reinforcement learning from human preferences." Showed that a model could learn complex behaviors from preference comparisons without explicit reward functions.

2020 - OpenAI applied the idea to language: "Learning to summarize from human feedback" (Stiennon et al.). Fine-tuned GPT-3 for summarization using human preferences. The RLHF-trained model produced significantly better summaries than standard fine-tuning.

2022 - InstructGPT (Ouyang et al.) scaled RLHF to GPT-3 and produced the landmark result: 1.3B RLHF model beats 175B raw GPT-3 in human evaluations.

2022 - Anthropic (the company founded by former OpenAI researchers) applied RLHF at scale for Claude, introducing Constitutional AI as a more scalable variant.

2023 - RLHF became standard practice for every major LLM deployment. The open-source community began exploring alternatives (DPO, covered in Lesson 11) to reduce RLHF's complexity.

The Three Phases of RLHF

Phase 1: Supervised Fine-Tuning (SFT)

The starting point: collect high-quality demonstrations of the desired behavior. For InstructGPT, labelers were given a prompt from the API and asked to write what they considered an ideal response. ~13,000 examples, diverse prompts (helpfulness, coding, factual questions, creative writing).

Fine-tune the base model on these demonstrations using the standard language modeling loss. This produces a model that follows the demonstrated format and style. The SFT model is the starting point for Phase 2 and 3.

Phase 2: Reward Model Training

The reward model (RM) is a crucial component. It is a neural network that takes (prompt, response) as input and outputs a scalar score indicating how well the response aligns with human preferences.

Collecting preference data: for each prompt, generate $K$ different responses (typically $K = 4$ to $9$). Show all $\binom{K}{2}$ pairs to human labelers and ask: which response is better? Labelers are trained with detailed guidelines - what "helpful" means, how to handle borderline cases, what constitutes harm.
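To make the comparison count concrete, the sketch below enumerates the unordered pairs for one prompt. The response labels and the helper name `comparison_pairs` are illustrative assumptions, not part of any published pipeline.

```python
from itertools import combinations
from math import comb

def comparison_pairs(responses):
    """Return every unordered pair of responses that could be compared."""
    return list(combinations(responses, 2))

# Hypothetical placeholder responses for a single prompt (K = 4)
responses = ["resp_a", "resp_b", "resp_c", "resp_d"]
pairs = comparison_pairs(responses)

# The number of pairs grows quadratically with K
pairs_per_k = {k: comb(k, 2) for k in range(4, 10)}
print(len(pairs), pairs_per_k)
```

Because the pair count grows quadratically, moving from 4 to 9 responses per prompt multiplies the comparisons obtained per prompt by six.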

Training objective - the Bradley-Terry model for pairwise preferences:

$$\mathcal{L}_{RM} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right]$$

where $x$ is the prompt, $y_w$ is the preferred response ("winner"), $y_l$ is the less preferred response ("loser"), $r_\theta$ is the reward model, and $\sigma$ is the sigmoid function.

This loss says: the reward model should assign a higher score to the preferred response. The sigmoid turns the score difference into a preference probability, not just a ranking. If the RM assigns $r_\theta(x, y_w) - r_\theta(x, y_l) = 3.0$, it is very confident that $y_w$ is preferred (probability about 0.95). If the difference is 0.1, it is close to indifferent (probability about 0.52).
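The relationship between reward margin and preference probability can be checked numerically. This is a minimal standard-library sketch; the function names are illustrative, and the rewards are made-up scalars, not outputs of a real reward model.

```python
import math

def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed preference for one pair."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))

p_confident = preference_probability(3.0, 0.0)  # margin of 3.0
p_uncertain = preference_probability(0.1, 0.0)  # margin of 0.1
loss_at_tie = pairwise_loss(0.0, 0.0)           # equal rewards: loss is log(2)
print(round(p_confident, 3), round(p_uncertain, 3), round(loss_at_tie, 3))
```

Only the difference between the two rewards matters, so adding a constant to every reward leaves the loss unchanged.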

Reward model architecture: typically initialized from the SFT model (same base). A linear layer is added on top of the final hidden state of the [EOS] token to produce the scalar reward.

InstructGPT numbers: trained on approximately 33,000 comparison examples (6,000 prompts with ~5 comparisons each). The RM achieved 69-77% accuracy at predicting held-out human preferences.

Phase 3: PPO Fine-Tuning

With the reward model trained, the task is to find a policy $\pi_\theta$ (the language model) that generates responses that maximize the reward model's score - while not deviating too far from the SFT model.

Why PPO? PPO (Proximal Policy Optimization, Schulman et al., 2017) is a policy gradient algorithm with a clipped objective that prevents too-large policy updates. For LLMs:

  • The policy $\pi_\theta(y|x)$ is the language model (a distribution over response tokens)
  • The "action" is generating a response $y$ token by token
  • The "reward" is the scalar score from the reward model at the end of the sequence
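The clipping idea can be illustrated for a single token. This is a minimal sketch, assuming an advantage estimate is already available (in practice it usually comes from a learned value head); the function name `clipped_surrogate` and the default clip range of 0.2 are illustrative choices.

```python
import math

def clipped_surrogate(logp_new: float, logp_old: float,
                      advantage: float, clip_range: float = 0.2) -> float:
    """PPO clipped surrogate for one token (a quantity to maximize)."""
    # Probability ratio between the updated policy and the policy that sampled the token
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    # Taking the minimum removes the incentive to push the ratio past the clip range
    return min(ratio * advantage, clipped_ratio * advantage)

# Ratio 1.5 with a positive advantage: the gain is capped at 1.2 * advantage
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)
# Ratio 1.5 with a negative advantage: the full penalty is kept
uncapped = clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0)
print(capped, uncapped)
```

The asymmetry is deliberate: the objective never rewards moving far from the sampling policy, but it still fully penalizes moves that make things worse.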

The combined objective:

$$\mathcal{L}_{PPO} = \mathbb{E}_{x \sim D, y \sim \pi_\theta(y|x)} \left[ r_\theta(x, y) - \beta \log \frac{\pi_\theta(y|x)}{\pi_{ref}(y|x)} \right]$$

Despite the loss-style name, this quantity is maximized. The reward model $r_\theta$ is frozen at this stage, and its parameters are separate from those of the policy. The first term: maximize reward model score. The second term: KL divergence penalty between the current policy $\pi_\theta$ and the reference policy $\pi_{ref}$ (the SFT model). $\beta$ controls the strength of this constraint.

Why the KL penalty? Without it, PPO will find ways to maximize the reward model's score that have nothing to do with being genuinely helpful. The model might learn to produce responses that "look" like the reward model's training data but are nonsensical - this is reward hacking. The KL penalty prevents the model from drifting too far from the SFT model, ensuring the policy stays in a distribution where the reward model's scores are meaningful.

In practice, $\beta$ is set between 0.1 and 0.5. Larger $\beta$ means more conservative updates - the model stays closer to SFT. Smaller $\beta$ allows more aggressive optimization of the reward model but risks reward hacking.
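The effect of $\beta$ is easy to see on a single sampled response. The sketch below is an illustration under stated assumptions: the log-probabilities and reward model score are made-up numbers, and `kl_shaped_reward` is a hypothetical helper that applies the penalty at the sequence level.

```python
def kl_shaped_reward(rm_score: float, logp_policy: list, logp_ref: list,
                     beta: float = 0.2) -> float:
    """Reward model score minus beta times the sampled log-ratio.

    logp_policy and logp_ref hold per-token log-probabilities of the same
    sampled response under the current policy and the frozen reference.
    """
    # Summing per-token log-ratios gives log(pi_theta(y|x) / pi_ref(y|x))
    log_ratio = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return rm_score - beta * log_ratio

# Hypothetical three-token response that the policy now favors over the reference
logp_policy = [-1.0, -0.5, -0.2]
logp_ref = [-1.5, -1.0, -0.7]

mild = kl_shaped_reward(2.0, logp_policy, logp_ref, beta=0.2)
strict = kl_shaped_reward(2.0, logp_policy, logp_ref, beta=0.5)
print(mild, strict)
```

With the same reward model score, the larger $\beta$ leaves less net reward for a response that has drifted away from the reference.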

The Reward Hacking Problem

Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure."

The reward model is a proxy for human preference - not a perfect measure. Once the language model is optimized to maximize the reward model's score, it will find inputs that score highly on the proxy but do not reflect actual human preferences.

Known reward hacking behaviors in RLHF:

  • Length gaming: reward models often prefer longer responses (correlates with appearing more thorough). PPO finds this and produces verbosely padded responses.
  • Sycophancy: reward models trained on human feedback inherit human biases - humans prefer responses that confirm their existing beliefs. PPO produces a model that tells users what they want to hear rather than what is true.
  • Surface-level alignment: the model learns to produce responses that "look" aligned (polite, structured, with caveats) without actually being more helpful or truthful.
  • Optimization pressure: with enough PPO steps, the model will find edge cases in the reward model's training distribution and exploit them.
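Length gaming in particular can be screened for with a simple statistic. The sketch below computes the Pearson correlation between response length (in whitespace-separated words) and reward score; the sample data and function names are invented for illustration. A high correlation is a hint worth investigating, not proof of reward hacking, since longer answers are sometimes better.

```python
import math

def pearson(xs, ys) -> float:
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom > 0 else 0.0

def length_reward_correlation(responses, rewards) -> float:
    """Correlation between word count and reward across a batch of responses."""
    lengths = [len(r.split()) for r in responses]
    return pearson(lengths, rewards)

# Made-up batch in which the reward rises in lockstep with length
sample_responses = ["ok", "ok then", "ok then sure", "ok then sure thing"]
sample_rewards = [0.1, 0.2, 0.3, 0.4]
corr = length_reward_correlation(sample_responses, sample_rewards)
print(round(corr, 3))
```

Tracking this value across PPO checkpoints shows whether the policy is drifting toward longer outputs as its reward climbs.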

Mitigation strategies:

  1. High KL penalty ($\beta$): limits how far the policy can diverge from SFT, limiting exploitation
  2. Diverse, high-quality reward model training data: reduces blind spots in the reward model
  3. Iterative reward model updates: retrain the reward model on the RLHF model's outputs to continuously close gaps
  4. Multiple reward models: use an ensemble to reduce overfitting to any single model's biases
  5. Conservative PPO training (fewer steps): stop before significant reward hacking occurs
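For the ensemble strategy, one possible aggregation rule (an assumption of this sketch, not a method taken from the papers cited here) is to subtract a disagreement penalty from the mean score, so that responses the reward models disagree on earn less.

```python
import statistics

def conservative_reward(scores, penalty: float = 1.0) -> float:
    """Mean ensemble score minus penalty times the population standard deviation."""
    return statistics.fmean(scores) - penalty * statistics.pstdev(scores)

# Hypothetical scores from reward models for two candidate responses
agreed = conservative_reward([1.0, 1.0, 1.0])  # all models agree
disputed = conservative_reward([0.0, 2.0])     # same mean, strong disagreement
print(agreed, disputed)
```

Both responses have a mean score of 1.0, but the disputed one is discounted. Disagreement often signals that the policy has left the region where any single reward model is reliable.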

Constitutional AI: Anthropic's RLHF Variant

Anthropic introduced Constitutional AI (CAI, Bai et al., 2022) as a more scalable and controllable variant of RLHF. The key innovation: replace human preference labelers with the AI itself, guided by a set of principles (the "constitution").

Phase 1 - Supervised learning from self-critique and revision (SL-CAI):

  1. Sample harmful or sensitive responses from the model
  2. Ask the model to critique its own response according to a constitutional principle (e.g., "Is this response harmful? How could it be improved?")
  3. Ask the model to revise the response based on the critique
  4. Fine-tune on the revised (more helpful, less harmful) responses
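The four steps above can be sketched as a small loop. Everything here is an assumption made for illustration: `generate` stands for any callable that maps a text prompt to a model completion, the prompt templates are invented, and the stub model exists only so the example runs.

```python
from typing import Callable, Dict

def critique_and_revise(prompt: str, principle: str,
                        generate: Callable[[str], str],
                        rounds: int = 1) -> Dict[str, str]:
    """Build one fine-tuning record via self-critique and revision."""
    response = generate(prompt)  # step 1: initial (possibly harmful) response
    for _ in range(rounds):
        critique = generate(     # step 2: critique against a principle
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = generate(     # step 3: revise using the critique
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # step 4: the revised response becomes the supervised target
    return {"prompt": prompt, "completion": response}

# Stub standing in for a real model call
calls = []
def fake_generate(text: str) -> str:
    calls.append(text)
    return f"output-{len(calls)}"

record = critique_and_revise("How do I pick a lock?", "Avoid enabling harm.", fake_generate)
print(record)
```

Each round costs two extra model calls, so the number of rounds trades data quality against generation cost.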

Phase 2 - RL from AI Feedback (RLAIF):

  1. Generate response pairs for a set of prompts
  2. Ask a large, capable model (e.g., Claude) to determine which response is more aligned with constitutional principles
  3. Use these AI-generated preference labels to train a reward model
  4. Apply PPO as in standard RLHF
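Steps 2 and 3 amount to turning a judge model's verdict into a (chosen, rejected) record for reward model training. The following is a hedged sketch: `judge` is a hypothetical callable wrapping whatever model provides the verdict, and the question template and the "A"/"B" answer format are assumptions.

```python
from typing import Callable, Dict

def ai_preference_record(prompt: str, response_a: str, response_b: str,
                         principle: str,
                         judge: Callable[[str], str]) -> Dict[str, str]:
    """Ask a judge model which response better follows a principle."""
    question = (
        f"Principle: {principle}\nPrompt: {prompt}\n"
        f"(A) {response_a}\n(B) {response_b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    verdict = judge(question).strip().upper()
    if verdict.startswith("A"):
        chosen, rejected = response_a, response_b
    elif verdict.startswith("B"):
        chosen, rejected = response_b, response_a
    else:
        # Refuse to guess when the judge's answer cannot be parsed
        raise ValueError(f"Unparseable verdict: {verdict!r}")
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Stub judge that always answers "B", only so the sketch runs
pref = ai_preference_record(
    "How do I whittle a knife?", "Figure it out.", "Start with a softwood blank.",
    "Prefer the more helpful response.", judge=lambda q: " b\n",
)
print(pref["chosen"])
```

A real pipeline would likely also swap the A/B order on a second call to check for position bias in the judge; that refinement is omitted here.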

Advantages of CAI: scalable (AI labelers are cheaper than human labelers for many preference judgments), consistent (AI applies the same principles across all comparisons), transparent (the constitution makes the alignment objectives explicit and auditable). The model can be made more helpful (by adjusting constitutional principles toward helpfulness) or more cautious (by adding safety principles) by modifying the constitution.

Limitations: the AI labeler's preferences are shaped by its own training. RLAIF inherits and may amplify the labeling model's biases. Human oversight remains important for validating the quality of AI-generated preferences.

InstructGPT Results: What the Numbers Mean

The InstructGPT paper reported several key results:

  1. 1.3B beats 175B: Human raters preferred outputs from 1.3B InstructGPT over outputs from 175B GPT-3. This is the headline result - alignment matters more than raw scale for human-facing applications.

  2. Toxicity reduction: InstructGPT produced ~25% fewer toxic completions than GPT-3 on the RealToxicityPrompts benchmark.

  3. Truthfulness improvement: On TruthfulQA, InstructGPT produced truthful and informative answers about twice as often as GPT-3.

  4. Benchmark regression: RLHF-trained models performed slightly worse on some standard NLP benchmarks (for example SQuAD, DROP, and HellaSwag). This is the "alignment tax": optimizing for human preference can slightly hurt performance on academic benchmarks that test raw knowledge retrieval. This is an active area of research.

  5. Scaling with human feedback: More human preference data improved both win rates and safety metrics. The relationship was roughly log-linear - halving the preference data did not halve the quality, but quality degraded.

Code: Reward Model Training

```python
"""
Reward model training for RLHF.
Demonstrates:
1. Reward model architecture (LM backbone + scalar head)
2. Bradley-Terry loss for preference learning
3. Training on comparison pairs
4. Full RLHF loop concept using TRL
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer
from torch import Tensor


def get_separator(tokenizer) -> str:
    """Separator between prompt and response.

    GPT-style tokenizers often have no sep_token, so fall back to EOS or a newline.
    """
    return tokenizer.sep_token or tokenizer.eos_token or "\n"


# ---- Reward Model Architecture ----

class RewardModel(nn.Module):
    """
    Reward model: transformer backbone + scalar head.
    Takes (prompt + response) as input, outputs a scalar reward.
    """
    def __init__(self, backbone_name: str, dropout: float = 0.1):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.dropout = nn.Dropout(dropout)
        # Map final hidden state to scalar reward
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(
        self,
        input_ids: Tensor,
        attention_mask: Tensor,
    ) -> Tensor:
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        # Use the last token's hidden state as the sequence representation
        # (EOS token for autoregressive models)
        last_hidden = outputs.last_hidden_state  # (batch, seq_len, hidden_size)

        # Index of the last non-padding token. This assumes right padding.
        seq_lengths = attention_mask.sum(dim=1) - 1  # (batch,)
        batch_size = input_ids.shape[0]

        # Gather last non-padding hidden state
        last_token_hidden = last_hidden[
            torch.arange(batch_size, device=input_ids.device),
            seq_lengths,
        ]  # (batch, hidden_size)

        last_token_hidden = self.dropout(last_token_hidden)
        reward = self.reward_head(last_token_hidden).squeeze(-1)  # (batch,)
        return reward


# ---- Bradley-Terry Loss ----

def bradley_terry_loss(
    reward_chosen: Tensor,    # (batch,) rewards for preferred responses
    reward_rejected: Tensor,  # (batch,) rewards for less-preferred responses
) -> Tensor:
    """
    Bradley-Terry pairwise preference loss.

    Maximizes the probability that the chosen response has higher reward:
    L = -log(sigma(r_chosen - r_rejected))

    Equivalent to binary cross-entropy where positive = chosen is better.
    """
    # log(sigma(x)) = -log(1 + e^(-x)) = -softplus(-x)
    loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
    return loss


# ---- Reward Model Training Loop ----

class RewardModelTrainer:
    """
    Trains a reward model on preference comparison data.
    """
    def __init__(
        self,
        model: RewardModel,
        tokenizer,
        learning_rate: float = 1e-5,
        max_length: int = 1024,
    ):
        self.model = model
        self.tokenizer = tokenizer
        self.max_length = max_length
        # Padding needs a pad token; reuse EOS when the tokenizer has none
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        self.separator = get_separator(tokenizer)
        self.optimizer = torch.optim.AdamW(
            model.parameters(),
            lr=learning_rate,
            weight_decay=0.01,
        )

    def tokenize_pair(self, prompt: str, response: str) -> dict:
        """Tokenize a prompt+response pair for reward model input."""
        text = prompt + self.separator + response
        return self.tokenizer(
            text,
            truncation=True,
            max_length=self.max_length,
            padding="max_length",
            return_tensors="pt",
        )

    def train_step(
        self,
        prompts: list,
        chosen_responses: list,
        rejected_responses: list,
    ) -> tuple[float, float]:
        """Single training step on a batch of preference pairs.

        Returns (loss, accuracy) for the batch.
        """
        self.model.train()
        device = next(self.model.parameters()).device

        # Tokenize chosen and rejected responses
        chosen_inputs = self.tokenize_pair_batch(prompts, chosen_responses).to(device)
        rejected_inputs = self.tokenize_pair_batch(prompts, rejected_responses).to(device)

        # Get rewards
        reward_chosen = self.model(
            input_ids=chosen_inputs["input_ids"],
            attention_mask=chosen_inputs["attention_mask"],
        )
        reward_rejected = self.model(
            input_ids=rejected_inputs["input_ids"],
            attention_mask=rejected_inputs["attention_mask"],
        )

        # Bradley-Terry loss
        loss = bradley_terry_loss(reward_chosen, reward_rejected)

        # Update
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
        self.optimizer.step()

        # Compute accuracy (how often does chosen have higher reward?)
        accuracy = (reward_chosen > reward_rejected).float().mean().item()

        return loss.item(), accuracy

    def tokenize_pair_batch(self, prompts, responses):
        """Tokenize a batch of prompt+response pairs."""
        texts = [p + self.separator + r
                 for p, r in zip(prompts, responses)]
        return self.tokenizer(
            texts,
            truncation=True,
            max_length=self.max_length,
            padding=True,
            return_tensors="pt",
        )


# ---- Full RLHF Training with TRL ----

def run_rlhf_with_trl(
    sft_model_name: str,
    reward_model_name: str,
    prompts: list,
    output_dir: str = "./rlhf-model",
):
    """
    Full RLHF training loop using TRL PPOTrainer.

    NOTE: this follows the PPOTrainer interface of older TRL releases
    (PPOConfig fields such as init_kl_coef, and ppo_trainer.step). Newer
    releases changed this interface, so check the docs for your version.

    Requires:
    - sft_model_name: path to SFT model (starting point)
    - reward_model_name: path to trained reward model
    - prompts: iterable of batches, each a list of prompt strings whose
      length matches the PPO batch_size
    """
    from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
    from trl import create_reference_model

    # Load SFT model as policy (trainable)
    model = AutoModelForCausalLMWithValueHead.from_pretrained(sft_model_name)
    # Load SFT model as reference (frozen) - used for KL penalty
    ref_model = create_reference_model(model)

    tokenizer = AutoTokenizer.from_pretrained(sft_model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Load reward model.
    # NOTE: RewardModel(...) restores only the backbone; the scalar head is
    # freshly initialized here. In practice, load the trained state_dict too.
    reward_model = RewardModel(reward_model_name)
    reward_model.eval()
    reward_tokenizer = AutoTokenizer.from_pretrained(reward_model_name)
    reward_separator = get_separator(reward_tokenizer)

    ppo_config = PPOConfig(
        model_name=sft_model_name,
        learning_rate=1.41e-5,   # A common default in TRL examples; tune per model
        batch_size=32,
        mini_batch_size=4,
        ppo_epochs=4,            # Number of optimization epochs per batch
        kl_penalty="kl",         # Type of KL penalty
        init_kl_coef=0.2,        # Initial beta (KL coefficient)
        target_kl=6,             # Target KL divergence (adaptive)
        gamma=1,                 # Discount factor (1.0 for bandit setting)
        lam=0.95,                # GAE lambda
        cliprange=0.2,           # PPO clip range
        cliprange_value=0.2,     # Value function clip range
        vf_coef=0.1,             # Value function loss coefficient
    )

    ppo_trainer = PPOTrainer(
        config=ppo_config,
        model=model,
        ref_model=ref_model,
        tokenizer=tokenizer,
        dataset=None,  # Provide actual dataset here
    )

    def compute_rewards(prompts, responses):
        """Get reward model scores for a batch of responses."""
        rewards = []
        for prompt, response in zip(prompts, responses):
            # Use the same prompt/response separator as in reward model training
            inputs = reward_tokenizer(
                prompt + reward_separator + response,
                return_tensors="pt",
                truncation=True,
                max_length=1024,
            )
            with torch.no_grad():
                reward = reward_model(
                    input_ids=inputs["input_ids"],
                    attention_mask=inputs["attention_mask"],
                ).item()
            rewards.append(torch.tensor(reward))
        return rewards

    generation_kwargs = {
        "max_new_tokens": 200,
        "do_sample": True,
        "temperature": 1.0,
        "top_p": 0.9,
        "pad_token_id": tokenizer.eos_token_id,
    }

    # PPO training loop (simplified)
    for batch_prompts in prompts:
        # One 1-D tensor per prompt, so prompts of different lengths need no padding
        query_tensors = [
            tokenizer(p, return_tensors="pt")["input_ids"].squeeze(0)
            for p in batch_prompts
        ]

        # Generate responses from current policy (response tokens only)
        response_tensors = ppo_trainer.generate(
            query_tensors,
            return_prompt=False,
            **generation_kwargs,
        )
        response_texts = [
            tokenizer.decode(r, skip_special_tokens=True)
            for r in response_tensors
        ]

        # Get rewards from reward model
        rewards = compute_rewards(batch_prompts, response_texts)

        # PPO update step
        stats = ppo_trainer.step(
            queries=query_tensors,
            responses=response_tensors,
            scores=rewards,
        )

    ppo_trainer.save_pretrained(output_dir)
    return ppo_trainer
```

Production Engineering Notes

RLHF is Expensive and Unstable

Full RLHF is complex to implement and expensive to run:

  • Human labeler cost: \$0.50-\$2.00 per comparison pair; 30,000 pairs = \$15,000-\$60,000
  • Reward model training: 1-3 A100-hours for a 7B model
  • PPO training: 10-50 GPU-hours on 8-16 GPUs (unstable, requires monitoring)
  • Multiple training stages: SFT → RM → PPO, each requiring checkpoints and evaluation

This complexity is why DPO (Lesson 11) gained rapid adoption - it removes the reward model and RL entirely, achieving similar quality with far less engineering overhead.

When to Use RLHF vs DPO

RLHF is still the right choice when:

  • You need online learning (generating responses and labeling them in a loop)
  • The task requires iterative reward signal (code execution feedback, tool use)
  • You have human labelers and want the best possible alignment quality
  • You need fine-grained control over the optimization process

Use DPO when:

  • You have offline preference data (collected in advance)
  • You want simpler, more stable training
  • You are resource-constrained (DPO needs only the policy and a frozen reference model, with no separate reward model or value model)

note

The InstructGPT scaling result. One of RLHF's most important findings: alignment quality scales with both model size and the quality of human feedback data. But model size alone is not decisive: a well-aligned small model can outperform a much larger unaligned model on human preference tasks. This has practical implications: spend your budget on alignment quality (human labeler training, preference data diversity) rather than simply scaling the model.

Common Mistakes

danger

Using too much PPO. PPO training on a language model is unstable. Running too many PPO steps causes reward hacking - the model finds degenerate solutions that score high on the reward model but produce low-quality outputs (extreme verbosity, sycophancy, repetitive structure). Monitor the KL divergence between the policy and the reference model. If KL exceeds 10-20 nats, you have likely overfit to the reward model. Stop PPO training early and rely on the KL penalty to constrain the policy.
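A sequence-level KL estimate for this kind of monitoring can be computed from log-probabilities of responses sampled from the current policy. The sketch below is a minimal illustration with made-up numbers; the function names are hypothetical, and the default threshold of 10 nats is simply the lower end of the range mentioned above.

```python
def mean_sequence_kl(policy_logprobs, ref_logprobs) -> float:
    """Monte Carlo estimate of KL(policy || reference) in nats.

    Each argument is a list of per-token log-probability lists, one inner
    list per response sampled from the current policy.
    """
    per_sequence = [
        sum(p - r for p, r in zip(pol, ref))
        for pol, ref in zip(policy_logprobs, ref_logprobs)
    ]
    return sum(per_sequence) / len(per_sequence)

def should_stop(kl_nats: float, threshold: float = 10.0) -> bool:
    """Flag a run whose policy has drifted past the KL threshold."""
    return kl_nats > threshold

# Two hypothetical sampled responses with log-ratios of 3.0 and 1.0 nats
policy_lp = [[-1.0, -1.0], [-0.5]]
ref_lp = [[-2.0, -3.0], [-1.5]]
kl = mean_sequence_kl(policy_lp, ref_lp)
print(kl, should_stop(kl))
```

Because the estimate averages over sampled responses, it is noisy for small batches; averaging over more samples or smoothing across steps gives a steadier signal.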

danger

Training the reward model on too few or biased examples. The reward model is only as good as the preference data it was trained on. If human labelers had a systematic bias (prefer longer responses, prefer confident-sounding answers regardless of accuracy), the reward model will encode that bias. RLHF will then optimize the language model toward that bias. Mitigation: use diverse labeler pools, provide detailed labeling guidelines, include agreement metrics as quality filters (discard comparisons where labelers strongly disagreed), and audit reward model behavior on held-out examples before PPO training.

warning

Not maintaining a frozen reference model during PPO. The KL penalty in PPO requires comparing the current policy to the original SFT policy. If you accidentally train the reference model or update it during PPO, the KL constraint becomes meaningless - the model can deviate arbitrarily from the SFT baseline without penalty. Always verify that the reference model's weights are frozen (requires_grad=False for all parameters) before starting PPO.
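A check along these lines can run once before PPO starts. The helpers below are a sketch that relies only on the `named_parameters()` and `requires_grad` conventions of `torch.nn.Module`; the stub class is a stand-in so the example runs without loading a model, and all names are illustrative.

```python
from types import SimpleNamespace

def trainable_parameter_names(model) -> list:
    """Names of parameters that would still receive gradients."""
    return [name for name, param in model.named_parameters() if param.requires_grad]

def assert_frozen(model) -> None:
    """Raise if any parameter of the reference model is still trainable."""
    leftover = trainable_parameter_names(model)
    if leftover:
        raise RuntimeError(f"Reference model is not frozen: {leftover[:5]}")

class StubModel:
    """Mimics torch.nn.Module.named_parameters(); for demonstration only."""
    def __init__(self, flags):
        self._flags = flags

    def named_parameters(self):
        for name, requires_grad in self._flags.items():
            yield name, SimpleNamespace(requires_grad=requires_grad)

frozen_ref = StubModel({"layer.weight": False, "layer.bias": False})
leaky_ref = StubModel({"layer.weight": False, "head.weight": True})

assert_frozen(frozen_ref)  # passes silently
print(trainable_parameter_names(leaky_ref))
```

With a real PyTorch model the same two functions apply unchanged, since they touch nothing beyond the parameter iterator.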

tip

Monitor reward model accuracy during training; it should plateau. During reward model training, track validation accuracy (how often does the RM correctly identify the human-preferred response?). A good reward model achieves 70-80% accuracy on held-out comparisons (human agreement itself is around 75-80%, so this is near the ceiling). If accuracy plateaus below 65%, you likely have data quality issues or an insufficient model. A reward model with 60% accuracy provides a very weak training signal for PPO.

Interview Q&A

Q1: Explain the three phases of RLHF and why each is necessary.

Phase 1 (SFT): fine-tune the base model on demonstrations of desired behavior. This produces a model that generates responses in the right format. Without SFT, the base model generates text that looks nothing like a helpful assistant's response - the RL training would have a very poor starting point.

Phase 2 (Reward Model): collect human preference comparisons and train a model to predict them. This is necessary because human preferences are hard to specify as a loss function. The reward model is a learned proxy for human judgment.

Phase 3 (PPO): use RL to fine-tune the language model to maximize the reward model's score while staying close to the SFT model (KL penalty). This is necessary because the SFT model is trained on demonstrations (mimicking ideal responses) rather than optimizing an objective. RL allows the model to explore the response space and find responses that are genuinely preferred by the reward model, not just similar to the demonstrations.

Q2: What is the Bradley-Terry model and why is it used for reward model training?

The Bradley-Terry model is a statistical model for pairwise comparisons. It models the probability that item $i$ is preferred over item $j$ as $P(i \succ j) = \sigma(s_i - s_j)$, where $s_i$ is the score (reward) of item $i$ and $\sigma$ is the sigmoid function. For RLHF, the loss is $-\log \sigma(r(x, y_w) - r(x, y_l))$ - maximize the probability that the preferred response has a higher reward. It is used because: (1) it naturally extends to multiple comparisons per prompt (not just binary); (2) it is differentiable and easy to optimize; (3) it is well-understood statistically; (4) it handles the ordinal nature of preferences (preferred vs not preferred) without requiring absolute scores.

Q3: What is reward hacking and how do you prevent it?

Reward hacking (Goodhart's Law applied to RLHF) is when the language model finds ways to achieve high scores from the reward model that do not reflect genuine alignment. Examples: producing extremely long responses (reward models correlate length with quality), agreeing with the user regardless of accuracy (sycophancy), using specific surface patterns (hedging language, bullet points) that the reward model associates with quality without those patterns actually improving quality. Prevention: (1) KL penalty to constrain policy drift; (2) early stopping of PPO before over-optimization; (3) monitoring reward model score vs independent human evaluation (if these diverge, you have reward hacking); (4) diverse reward model training data; (5) iterative reward model retraining on RLHF model outputs.

Q4: Why did InstructGPT (1.3B RLHF) beat GPT-3 (175B) in human evaluations?

This result illustrates that alignment quality matters more than raw scale for human-facing tasks. GPT-3 was trained to predict text - it is excellent at completing text but has no mechanism for caring whether the output is helpful, harmless, or honest. It will complete harmful prompts, produce confident misinformation, and generate irrelevant continuations. InstructGPT was explicitly trained to produce responses that humans find helpful, harmless, and honest. The smaller model was literally optimized for the evaluation metric (human preference) while the larger model was not. This is not magic - it is the difference between optimizing for the right objective (human preference) vs an indirect objective (text prediction).

Q5: What is Constitutional AI and how does it differ from standard RLHF?

Constitutional AI (Bai et al., 2022) replaces human preference labelers with AI labelers guided by explicit principles. In standard RLHF, humans compare responses and say which is better. In CAI: (1) supervised phase (SL-CAI) - the model critiques and revises its own harmful outputs guided by a written constitution of principles, then is fine-tuned on the revisions; (2) RLAIF phase - a large AI model (not human labelers) generates preference comparisons by judging which response is more aligned with the constitution. Advantages: scalable (AI labelers are cheap), consistent (principles are applied uniformly), transparent (the constitution is auditable). Disadvantages: inherits the labeling model's biases; human oversight is still needed to validate that the constitution captures the right values. Used by Anthropic for Claude's alignment training.

Advanced: Implementing Reward Model Evaluation

A well-trained reward model is the foundation of RLHF quality. Here is a complete evaluation framework:

```python
"""
Reward model evaluation and calibration.
Critical for ensuring RLHF training quality.
"""

import torch
import torch.nn.functional as F
import numpy as np
from typing import List, Tuple
from sklearn.metrics import roc_auc_score, accuracy_score


def evaluate_reward_model(
    reward_model,
    tokenizer,
    test_pairs: List[Tuple[str, str, str, int]],
    # Each tuple: (prompt, response_a, response_b, label)
    # label: 0 if response_a preferred, 1 if response_b preferred
    batch_size: int = 8,
) -> dict:
    """
    Evaluate reward model quality on held-out preference pairs.

    Metrics:
    - Accuracy: fraction of pairs where RM correctly identifies preferred response
    - AUC: area under ROC curve (measures ranking quality; needs both labels present)
    - Margin distribution: distribution of reward differences for correct/incorrect pairs
    """
    reward_model.eval()
    all_probs = []
    all_labels = []
    all_margins = []

    # GPT-style tokenizers often have no sep_token, so fall back to EOS or a newline
    separator = tokenizer.sep_token or tokenizer.eos_token or "\n"

    def get_rewards(prompts, responses):
        texts = [p + separator + r for p, r in zip(prompts, responses)]
        inputs = tokenizer(
            texts, return_tensors="pt", truncation=True,
            max_length=1024, padding=True
        )
        with torch.no_grad():
            rewards = reward_model(
                input_ids=inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
            ).cpu().numpy()
        return rewards

    for i in range(0, len(test_pairs), batch_size):
        batch = test_pairs[i:i + batch_size]
        prompts = [p[0] for p in batch]
        responses_a = [p[1] for p in batch]
        responses_b = [p[2] for p in batch]
        labels = [p[3] for p in batch]

        rewards_a = get_rewards(prompts, responses_a)
        rewards_b = get_rewards(prompts, responses_b)

        # P(B preferred) = sigma(r_b - r_a)
        margin = rewards_b - rewards_a
        prob_b_preferred = 1 / (1 + np.exp(-margin))  # sigmoid

        all_probs.extend(prob_b_preferred.tolist())
        all_labels.extend(labels)
        all_margins.extend(margin.tolist())

    # Metrics
    predicted_labels = [1 if p > 0.5 else 0 for p in all_probs]
    accuracy = accuracy_score(all_labels, predicted_labels)
    auc = roc_auc_score(all_labels, all_probs)

    # Margin analysis
    correct_margins = [abs(m) for m, l, p in zip(all_margins, all_labels, predicted_labels) if l == p]
    incorrect_margins = [abs(m) for m, l, p in zip(all_margins, all_labels, predicted_labels) if l != p]

    # Guard against empty lists so the means never produce NaN
    avg_correct = float(np.mean(correct_margins)) if correct_margins else 0.0
    avg_incorrect = float(np.mean(incorrect_margins)) if incorrect_margins else 0.0

    return {
        "accuracy": accuracy,
        "auc": auc,
        "avg_margin_correct": avg_correct,
        "avg_margin_incorrect": avg_incorrect,
        "margin_separation": avg_correct - avg_incorrect,
    }


def monitor_ppo_training(policy_model, ref_model, tokenizer, eval_prompts, step):
    """
    Monitor PPO training health.

    Computes KL(policy || reference) for the next-token distribution at the
    end of each prompt. This is a cheap single-position proxy, not the
    sequence-level KL, so treat the thresholds below as rough heuristics.
    Assumes both models return an object with a .logits attribute.
    """
    policy_model.eval()
    ref_model.eval()

    kl_values = []
    for prompt in eval_prompts[:20]:  # Sample 20 prompts
        inputs = tokenizer(prompt, return_tensors="pt")

        with torch.no_grad():
            policy_logits = policy_model(**inputs).logits[:, -1, :]
            ref_logits = ref_model(**inputs).logits[:, -1, :]

        # Log-probabilities at the last token position (numerically stable)
        policy_logp = F.log_softmax(policy_logits, dim=-1)
        ref_logp = F.log_softmax(ref_logits, dim=-1)

        # KL(policy || ref)
        kl = (policy_logp.exp() * (policy_logp - ref_logp)).sum(-1)
        kl_values.append(kl.item())

    avg_kl = np.mean(kl_values)

    status = "OK"
    if avg_kl > 15:
        status = "WARNING: Likely reward hacking"
    elif avg_kl > 8:
        status = "CAUTION: High KL, consider stopping"

    print(f"Step {step}: Avg KL = {avg_kl:.3f} nats | Status: {status}")
    return avg_kl


# Labeler agreement analysis - critical for reward model data quality
def analyze_labeler_agreement(comparisons_with_labels: list) -> dict:
    """
    Analyze agreement between human labelers.
    Low agreement means noisy data - filter these pairs or re-collect.

    comparisons_with_labels: list of {prompt, response_a, response_b, labeler_votes}
    labeler_votes: list of labels from multiple labelers (0 or 1)
    """
    if not comparisons_with_labels:
        raise ValueError("comparisons_with_labels is empty")

    agreements = []
    for comp in comparisons_with_labels:
        votes = comp["labeler_votes"]
        n = len(votes)
        # Fraction of labelers in agreement with majority (ties count as 0)
        majority = 1 if sum(votes) > n / 2 else 0
        agreement_rate = sum(v == majority for v in votes) / n
        agreements.append(agreement_rate)

    return {
        "avg_agreement": np.mean(agreements),
        "fraction_unanimous": sum(a == 1.0 for a in agreements) / len(agreements),
        "fraction_low_agreement": sum(a < 0.6 for a in agreements) / len(agreements),
        # Recommend filtering pairs with agreement < 0.6
    }
```

RLHF Engineering: Scaling Considerations

Human labeler throughput: A skilled labeler can evaluate approximately 30-60 preference pairs per hour for typical instruction-following tasks. More complex tasks (code correctness, technical accuracy) require 10-20 pairs per hour. For InstructGPT-scale data (33,000 pairs): approximately 500-1,500 labeler-hours. At $25/hour, that is $12,500 to $37,500 just for preference data collection.
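The arithmetic behind that estimate can be sketched in a few lines. This is an illustrative calculation only: `labeling_cost` is a hypothetical helper, and the throughput and hourly-rate values are the rough figures quoted above, not measured data.

```python
def labeling_cost(num_pairs: int, pairs_per_hour: float, hourly_rate: float):
    """Return (labeler_hours, total_cost) for a preference-labeling job."""
    hours = num_pairs / pairs_per_hour
    return hours, hours * hourly_rate


# Assumed inputs: 33,000 pairs at $25/hour, for a few throughput levels
for pairs_per_hour in (60, 30, 20):
    hours, cost = labeling_cost(33_000, pairs_per_hour, 25.0)
    print(f"{pairs_per_hour} pairs/hour: {hours:,.0f} hours, ${cost:,.0f}")
```

Running the same function with the slower 10-20 pairs/hour throughput shows how quickly costs grow for tasks that need expert judgment.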

Reward model size: the reward model should generally be at least as large as the policy model being aligned. A 7B reward model is appropriate for aligning a 7B policy. Using a 1B reward model to align a 70B policy is likely underpowered - the reward model's capacity may limit alignment quality.

PPO batch size: unlike SFT, PPO benefits significantly from large batch sizes because the policy gradient estimates have high variance. Use batch sizes of 128-512 with mini-batch size 16-32. This requires gradient accumulation and multiple GPUs.
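As a rough sketch of how those numbers fit together, the helper below computes how many gradient-accumulation steps are needed to reach a target batch size. It assumes the mini-batch size is applied per GPU, which is one common setup but not the only one; `accumulation_steps` is a hypothetical name, not part of any library.

```python
def accumulation_steps(batch_size: int, mini_batch_per_gpu: int, num_gpus: int) -> int:
    """Gradient-accumulation steps needed to reach `batch_size` samples per update."""
    samples_per_step = mini_batch_per_gpu * num_gpus
    if batch_size % samples_per_step != 0:
        raise ValueError("batch_size must be a multiple of mini_batch_per_gpu * num_gpus")
    return batch_size // samples_per_step


# Example: a 256-sample PPO batch, mini-batch 16 per GPU, 4 GPUs
print(accumulation_steps(256, 16, 4))
```

With a batch of 256, a per-GPU mini-batch of 16, and 4 GPUs, each optimizer update accumulates gradients over 4 forward/backward passes.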

Number of PPO steps: typically 100-500 PPO update steps for a 7B model. More than 1,000 steps without monitoring risks reward hacking. Monitor reward model score alongside independent human evaluations - if they diverge, stop PPO.

:::note The InstructGPT paper's appendix is required reading

The InstructGPT paper (Ouyang et al., 2022) has a remarkably detailed appendix covering labeler guidelines, interface design, agreement metrics, and hyperparameters. If you are implementing RLHF seriously, reading Appendix B (human data collection and labeler guidelines) and Appendix C (model and RLHF training details) is more valuable than reading most other RLHF papers. The practical details of how to train labelers, what to do about disagreements, and how to set up the annotation interface are documented nowhere else with this level of specificity.

:::


Alternatives to PPO in RLHF

PPO is not the only way to optimize against a reward model. Several alternatives have emerged that are simpler to implement and sometimes achieve better results:

REINFORCE with Baseline

The simplest policy gradient method applied to language model fine-tuning:

import torch


def reinforce_step(
    policy_model,
    reward_model,
    ref_model,
    prompts: list[str],
    tokenizer,
    kl_coeff: float = 0.05,
    num_samples: int = 4,  # Sample multiple responses per prompt
) -> dict:
    """
    REINFORCE with baseline for LM alignment.
    Simpler than PPO - no value function, no clipping.
    Assumes all three models live on the same device.
    """
    device = next(policy_model.parameters()).device
    all_log_probs = []
    all_advantages = []
    all_raw_rewards = []
    all_kl_penalties = []

    for prompt in prompts:
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

        # Sample multiple responses
        with torch.no_grad():
            generated = policy_model.generate(
                prompt_ids,
                max_new_tokens=256,
                do_sample=True,
                temperature=0.9,
                num_return_sequences=num_samples,
            )

        # Score with reward model
        rewards = []
        for response_ids in generated:
            response_text = tokenizer.decode(response_ids, skip_special_tokens=True)
            # BatchEncoding supports .to(device), not .cuda()
            reward_input = tokenizer(response_text, return_tensors="pt").to(device)
            with torch.no_grad():
                reward = reward_model(**reward_input).logits.squeeze()
            rewards.append(reward.item())

        # Baseline = mean reward across samples (variance reduction)
        baseline = sum(rewards) / len(rewards)

        for response_ids, reward in zip(generated, rewards):
            response_text = tokenizer.decode(response_ids, skip_special_tokens=True)
            encoded = tokenizer(response_text, return_tensors="pt").to(device)

            # -loss is the mean per-token log-prob over the whole sequence
            # (prompt included), a simplification of the response log-likelihood
            log_probs_policy = -policy_model(**encoded, labels=encoded.input_ids).loss
            with torch.no_grad():  # The reference model is frozen
                log_probs_ref = -ref_model(**encoded, labels=encoded.input_ids).loss

            # KL penalty (single-sample estimate)
            kl = log_probs_policy - log_probs_ref

            all_log_probs.append(log_probs_policy)
            all_advantages.append(reward - baseline)  # Advantage = reward - baseline
            all_raw_rewards.append(reward)
            all_kl_penalties.append(kl)

    # REINFORCE loss: -E[advantage * log_prob] + KL_coeff * KL
    policy_losses = [-lp * adv for lp, adv in zip(all_log_probs, all_advantages)]
    policy_loss = torch.stack(policy_losses).mean()
    kl_loss = torch.stack(all_kl_penalties).mean()

    total_loss = policy_loss + kl_coeff * kl_loss

    return {
        "loss": total_loss,
        "policy_loss": policy_loss.item(),
        "kl_loss": kl_loss.item(),
        # Mean of the raw rewards (the advantages average to ~0 by construction)
        "mean_reward": sum(all_raw_rewards) / len(all_raw_rewards),
    }

Group Relative Policy Optimization (GRPO)

GRPO (Shao et al., 2024), developed at DeepSeek, is a PPO variant that eliminates the learned value function. Instead, it estimates advantages using the relative rewards within a group of sampled responses for the same prompt.

def grpo_step(
    policy_model,
    reward_fn,  # Can be a neural reward model or a rule-based verifier
    prompts: list[str],
    tokenizer,
    group_size: int = 8,  # Number of responses sampled per prompt
    kl_coeff: float = 0.04,
    clip_range: float = 0.2,
) -> dict:
    """
    GRPO: Group Relative Policy Optimization.
    Advantage = (reward - group_mean) / group_std
    No value function needed - group statistics serve as the baseline.

    Simplified sketch: a single on-policy gradient step, so the importance
    ratio is 1 and clipping is inactive. kl_coeff and clip_range are kept in
    the signature but are not used here.
    """
    device = next(policy_model.parameters()).device
    policy_model.eval()
    all_responses = []
    all_rewards = []

    # Phase 1: Sample responses and compute rewards (no grad)
    with torch.no_grad():
        for prompt in prompts:
            prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
            responses = policy_model.generate(
                prompt_ids,
                max_new_tokens=512,
                do_sample=True,
                temperature=0.8,
                num_return_sequences=group_size,
            )

            rewards = []
            for resp in responses:
                resp_text = tokenizer.decode(resp, skip_special_tokens=True)
                reward = reward_fn(prompt, resp_text)  # Float reward
                rewards.append(reward)

            # Normalize within group - this is the key GRPO insight
            reward_mean = sum(rewards) / len(rewards)
            reward_std = (sum((r - reward_mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 + 1e-8
            normalized_rewards = [(r - reward_mean) / reward_std for r in rewards]

            for resp, norm_reward in zip(responses, normalized_rewards):
                all_responses.append((prompt, resp, norm_reward))
            all_rewards.extend(rewards)  # Raw rewards, for logging

    # Phase 2: Compute policy gradient loss (with grad)
    policy_model.train()
    losses = []

    for prompt, response_ids, advantage in all_responses:
        full_text = tokenizer.decode(response_ids, skip_special_tokens=True)
        # BatchEncoding supports .to(device), not .cuda()
        encoded = tokenizer(full_text, return_tensors="pt").to(device)

        output = policy_model(**encoded, labels=encoded.input_ids)
        log_probs = -output.loss  # Mean per-token log probability

        # Policy-gradient loss weighted by the group-relative advantage
        losses.append(-log_probs * advantage)

    return {
        "loss": torch.stack(losses).mean(),
        "mean_reward": sum(all_rewards) / len(all_rewards),
    }
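
To make the group normalization concrete, here is a minimal standalone sketch of the advantage computation on its own, with no model involved. The function name `group_advantages` and the binary rewards are illustrative assumptions (for example, a rule-based verifier that returns 1.0 for a correct answer and 0.0 otherwise).

```python
def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one prompt: two judged correct, two incorrect
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)  # Correct responses get about +1, incorrect about -1

# If every response in the group gets the same reward, all advantages are 0,
# so that prompt contributes no gradient signal
print(group_advantages([0.0, 0.0, 0.0]))
```

The second call illustrates a practical consequence of this design: prompts that are uniformly too easy or too hard for the current policy produce no learning signal.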

Interview Q&A

Q1: Explain the three phases of RLHF and why each is necessary.

RLHF has three essential phases. Phase 1 (SFT): We need a base model that can produce coherent responses in the instruction-following format. A raw pretrained model outputs anything - code, news articles, or random continuations - not necessarily helpful answers. SFT teaches the model the "language" of question-answering. Phase 2 (Reward Model): We cannot directly optimize against human preferences because humans are not differentiable and cannot score millions of samples. We train a reward model to proxy human preferences, making optimization tractable. The reward model must be trained because there is no analytical function that captures "helpfulness." Phase 3 (PPO): We use the reward model as a training signal to update the SFT model toward higher-reward behaviors. Fine-tuning only on the highest-rated existing responses would be rejection sampling, not RL. PPO instead samples fresh responses from the current policy and uses their reward scores as the policy-gradient signal (no gradient flows through the reward model itself), which lets the policy discover new high-reward behaviors it hasn't produced yet.

Q2: What is reward hacking and why is it inevitable at scale?

Reward hacking occurs when the policy finds high-reward behaviors that exploit the reward model rather than genuinely satisfying the objective. Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." The reward model is trained on a finite set of human judgments and inevitably has blind spots. As PPO optimization pressure increases, the policy finds and exploits these blind spots - producing outputs that the reward model rates highly but that humans would not. Examples: generating very long responses (humans often rate verbose answers as more thoughtful), excessive sycophancy (agreeing with wrong premises), and outputs that pattern-match to rewarded styles without actual helpfulness. Mitigation: KL divergence penalty limits how far the policy can deviate from the SFT model, limiting how much it can overoptimize.

Q3: What is the InstructGPT key finding about model size and alignment?

The headline result from Ouyang et al. (2022): outputs from InstructGPT 1.3B (aligned via RLHF) were preferred by human evaluators over outputs from GPT-3 175B (unaligned), despite the 100x difference in parameter count. (The 85% win rate often quoted from the paper is for the 175B InstructGPT model against 175B GPT-3.) A model that is 100x smaller but carefully aligned is significantly more useful than a massive model that is not. This has two implications: (1) alignment is not just about making models safe - it makes them genuinely more useful; (2) the "alignment tax" narrative (alignment reduces capability) was wrong for instruction following - alignment actually improves capability as measured by human preference. The reason: GPT-3 was pretrained to predict next tokens in any context; InstructGPT was specifically trained to respond helpfully to user instructions.

Q4: What is the Bradley-Terry model and why is it used for the reward model?

The Bradley-Terry model is a probabilistic model for pairwise comparisons. Given two items A and B with "strength" parameters $r_A$ and $r_B$, the probability that A is preferred to B is $P(A \succ B) = \sigma(r_A - r_B)$, where $\sigma$ is the sigmoid function. For RLHF, $r_A$ and $r_B$ are the scalar outputs of the reward model for two responses to the same prompt. The Bradley-Terry model is used because: (1) it is mathematically tractable - the negative log-likelihood of a preference dataset is convex in the reward scores; (2) it matches the cognitive model of human preference - preferences are probabilistic, not deterministic; (3) it naturally handles transitivity - if A is consistently preferred to B and B to C, A should be preferred to C, which the model captures through the scalar reward scale.
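
As a small illustration of the formula, the sketch below computes the Bradley-Terry preference probability and the corresponding pairwise loss in plain Python. The function names and the reward values are made up for the example; a real reward model would produce the scores and the loss would be computed on tensors.

```python
import math


def preference_probability(r_a: float, r_b: float) -> float:
    """P(A preferred to B) = sigmoid(r_a - r_b)."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))


def pairwise_nll(pairs: list[tuple[float, float]]) -> float:
    """Mean negative log-likelihood over (chosen_reward, rejected_reward) pairs."""
    return -sum(math.log(preference_probability(c, r)) for c, r in pairs) / len(pairs)


# Equal rewards give a coin flip; a 2-point gap gives roughly 88% preference
print(preference_probability(1.0, 1.0))
print(preference_probability(2.0, 0.0))

# The loss falls as the reward model separates chosen from rejected responses
print(pairwise_nll([(0.0, 0.0)]))  # no separation
print(pairwise_nll([(2.0, 0.0)]))  # chosen scored higher
```

Only the difference between the two scores matters, which is why reward model outputs have no meaningful absolute scale.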

Q5: What is the KL divergence penalty in PPO for LM alignment and why is it necessary?

The PPO objective for LM alignment is: $\mathcal{L} = \mathbb{E}[r_\phi(x, y) - \beta \cdot \text{KL}(\pi_\theta \| \pi_{SFT})]$. The KL term measures how far the current policy $\pi_\theta$ has drifted from the SFT model $\pi_{SFT}$. It is necessary for two reasons: (1) Reward hacking prevention - without the KL penalty, PPO would quickly find reward-hacking strategies that exploit the reward model's weaknesses. The KL penalty limits the policy's ability to deviate into out-of-distribution territory where the reward model's predictions are unreliable. (2) Maintaining language quality - the SFT model produces coherent, grammatical text. Unrestricted PPO optimization could distort the language model's outputs into incoherent sequences that happen to receive high rewards. The KL penalty anchors the policy to the distribution where language quality is preserved. $\beta$ is typically set between 0.01 and 0.1 - smaller values allow more optimization, larger values are more conservative.
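
In practice the KL term is usually estimated from the sampled response itself: for a response drawn from the current policy, the sum of per-token log-probability differences is a single-sample estimate of the sequence-level KL. The sketch below shows that bookkeeping in plain Python; `kl_shaped_reward` and all the numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def kl_shaped_reward(
    reward: float,
    policy_logprobs: list[float],  # log pi_theta(y_t | x, y_<t) for each response token
    ref_logprobs: list[float],     # log pi_SFT(y_t | x, y_<t) for the same tokens
    beta: float = 0.05,
) -> float:
    """Reward-model score minus beta times a single-sample KL estimate."""
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate


# The policy assigns higher probability to its own tokens than the SFT model does,
# so the KL estimate is positive and the reward is reduced
shaped = kl_shaped_reward(2.0, [-1.0, -0.5], [-1.5, -1.0], beta=0.1)
print(shaped)

# If the policy has not drifted at all, the reward passes through unchanged
print(kl_shaped_reward(2.0, [-1.0, -0.5], [-1.0, -0.5], beta=0.1))
```

Raising `beta` makes the same amount of drift cost more reward, which is the conservative end of the 0.01 to 0.1 range mentioned above.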

© 2026 EngineersOfAI. All rights reserved.