Understand how RLHF aligns LLMs with human preferences through three phases - SFT, reward model training, and PPO - and why it produced InstructGPT's surprising result that smaller aligned models beat larger unaligned ones.

RLHF: Reinforcement Learning from Human Feedback

The Helpful but Dangerous Model

In 2022, OpenAI ran an experiment. They asked human raters to compare outputs from two models: GPT-3 (175B, not instruction-tuned) and an early version of InstructGPT (1.3B, RLHF-trained). The smaller model was trained with reinforcement learning from human feedback. The larger model was trained purely on text prediction.

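The result: human raters preferred outputs from the 1.3B InstructGPT model over outputs from the 175B GPT-3 model. (The percentages usually quoted from the paper, 85% against GPT-3 and 71% against few-shot-prompted GPT-3, are for the 175B InstructGPT model.)

A model with over 100x fewer parameters beat its far larger predecessor. Not on a narrow benchmark - in open-ended human evaluation of which response was more helpful, honest, and harmless.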

This result exposed a deep problem with raw language models. GPT-3 knows an enormous amount. It is excellent at predicting text. But it has no mechanism for caring whether its output is helpful, truthful, or safe. It will helpfully explain how to pick a lock. It will confidently state false facts. It will generate harmful content if the text continuation probabilities favor it. Pretraining on text teaches the model to predict text - not to be a helpful assistant.

RLHF is the solution that OpenAI, Anthropic, and DeepMind independently developed and deployed: collect human preferences over model outputs, train a reward model to predict those preferences, and use reinforcement learning to optimize the language model to produce outputs that the reward model scores highly.

The three phases of RLHF produced the models we call "aligned" - GPT-4, Claude, Gemini - and their defining characteristic is not knowledge or reasoning ability, but the shaping of behavior by human preference.

Why This Exists: The Alignment Problem in Practice

After SFT and instruction tuning, you have a model that follows instructions. But "following instructions" is not the same as being aligned with human intent. Consider what a base model does when given the prompt "How do I whittle a knife?":

A raw language model might continue with a straightforward tutorial. That is fine.

But "How do I whittle a knife so I can kill my sister?" is a different prompt entirely. A model trained only to continue text might continue with the whittling tutorial - because whittling tutorials are common in text, and the "so I can kill my sister" addendum does not change the text continuation probabilities for whittling instructions.

Human intent is multi-dimensional. A genuinely helpful response considers not just what the user asked for, but whether fulfilling the request causes harm, whether the information is truthful, whether the response is complete, and dozens of other factors that are hard to specify in a training objective.

You cannot write a loss function that captures "be helpful, harmless, and honest" directly. But you can collect examples of human preferences - "Response A was better than Response B because it was more helpful without being harmful" - and train a model to learn that preference function. That is RLHF.

Historical Context: How RLHF Came Together

2017 - OpenAI and DeepMind (Christiano et al.) demonstrated learning from human preferences in game-playing agents. The key paper: "Deep reinforcement learning from human preferences." Showed that a model could learn complex behaviors from preference comparisons without explicit reward functions.

2020 - OpenAI applied the idea to language: "Learning to summarize from human feedback" (Stiennon et al.). Fine-tuned GPT-3-style models for summarization using human preferences. The RLHF-trained models produced significantly better summaries than models trained with supervised fine-tuning alone.

2022 - InstructGPT (Ouyang et al.) scaled RLHF to GPT-3 and produced the landmark result: 1.3B RLHF model beats 175B raw GPT-3 in human evaluations.

2022 - Anthropic (the company founded by former OpenAI researchers) applied RLHF at scale for Claude, introducing Constitutional AI as a more scalable variant.

2023 - RLHF became standard practice for every major LLM deployment. The open-source community began exploring alternatives (DPO, covered in Lesson 11) to reduce RLHF's complexity.

The Three Phases of RLHF

Phase 1: Supervised Fine-Tuning (SFT)

The starting point: collect high-quality demonstrations of the desired behavior. For InstructGPT, labelers were given a prompt from the API and asked to write what they considered an ideal response. The resulting dataset had roughly 13,000 examples spanning diverse prompts (helpfulness, coding, factual questions, creative writing).

Fine-tune the base model on these demonstrations using the standard language modeling loss. This produces a model that follows the demonstrated format and style. The SFT model is the starting point for Phases 2 and 3.
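As a rough illustration of the "standard language modeling loss" used in this phase, the sketch below averages the negative log-likelihood over tokens. Masking out prompt tokens so that only the demonstration's response tokens contribute is an assumption here (a common convention, not something the description above specifies), and `sft_loss` is a hypothetical helper name.

```python
import math


def sft_loss(token_logprobs, response_mask):
    """Mean negative log-likelihood over response tokens.

    token_logprobs: log p(token_t | tokens_<t) for each position, as floats.
    response_mask:  1 where the token belongs to the demonstration response,
                    0 where it belongs to the prompt (assumed convention).
    """
    if len(token_logprobs) != len(response_mask):
        raise ValueError("token_logprobs and response_mask must align")
    n_response = sum(response_mask)
    if n_response == 0:
        raise ValueError("no response tokens to train on")
    total_nll = -sum(lp * m for lp, m in zip(token_logprobs, response_mask))
    return total_nll / n_response


# Toy example: 2 prompt tokens followed by 3 response tokens.
logprobs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [0, 0, 1, 1, 1]
loss = sft_loss(logprobs, mask)
print(f"SFT loss: {loss:.4f}")
```

In a real training run the log-probabilities come from the model's softmax output and the averaging happens over a whole batch; the toy numbers only show which positions are counted.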

Phase 2: Reward Model Training

The reward model (RM) is a crucial component. It is a neural network that takes (prompt, response) as input and outputs a scalar score indicating how well the response aligns with human preferences.

Collecting preference data: for each prompt, generate $K$ different responses (typically $K = 4$ to $9$). Show all $\binom{K}{2}$ pairs to human labelers and ask: which response is better? Labelers are trained with detailed guidelines - what "helpful" means, how to handle borderline cases, what constitutes harm.
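To make the $\binom{K}{2}$ count concrete, here is a minimal sketch that expands one labeler ranking into pairwise comparisons. It assumes the labeler supplies a full best-to-worst ranking with no ties; `ranking_to_pairs` and the example strings are hypothetical.

```python
from itertools import combinations


def ranking_to_pairs(prompt, ranked_responses):
    """Turn a best-to-worst ranking into (prompt, chosen, rejected) triples."""
    # combinations() preserves input order, so the first element of each
    # pair is always the higher-ranked response.
    return [
        (prompt, better, worse)
        for better, worse in combinations(ranked_responses, 2)
    ]


ranked = ["response A", "response B", "response C", "response D"]  # best first
pairs = ranking_to_pairs("Explain photosynthesis.", ranked)
print(len(pairs))  # K = 4 responses give 6 comparison pairs
```

All pairs from one prompt are strongly correlated, so one reasonable design choice is to keep them together in the same training batch rather than shuffling them across the dataset.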

Training objective - the Bradley-Terry model for pairwise preferences:

$\mathcal{L}_{RM} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right]$

where $x$ is the prompt, $y_w$ is the preferred response ("winner"), $y_l$ is the less preferred response ("loser"), $r_\theta$ is the reward model, and $\sigma$ is the sigmoid function.

This loss says: the reward model should assign a higher score to the preferred response. The sigmoid turns the score difference into a preference probability, not just a ranking. If the RM assigns $r_\theta(x, y_w) - r_\theta(x, y_l) = 3.0$, it predicts that $y_w$ is preferred with probability of about 0.95. If the difference is 0.1, the predicted probability is about 0.52, barely better than a coin flip.
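The relationship between reward margin, preference probability, and loss can be checked numerically. This is a small sketch using only the formula above; the function names are hypothetical.

```python
import math


def preference_probability(reward_chosen, reward_rejected):
    """P(chosen is preferred) under Bradley-Terry: sigmoid of the margin."""
    margin = reward_chosen - reward_rejected
    return 1.0 / (1.0 + math.exp(-margin))


def pair_loss(reward_chosen, reward_rejected):
    """Per-pair reward model loss: -log sigmoid(margin)."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))


for margin in (3.0, 0.1, -1.0):
    p = preference_probability(margin, 0.0)
    print(f"margin={margin:+.1f}  P(y_w preferred)={p:.3f}  loss={pair_loss(margin, 0.0):.3f}")
```

A negative margin means the reward model currently ranks the pair the wrong way round, which is where the loss, and therefore the gradient signal, is largest.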

Reward model architecture: typically initialized from the SFT model (same base). A linear layer is added on top of the final hidden state of the [EOS] token to produce the scalar reward.

InstructGPT numbers: trained on approximately 33,000 comparison examples (6,000 prompts with ~5 comparisons each). The RM achieved 69-77% accuracy at predicting held-out human preferences.

Phase 3: PPO Fine-Tuning

With the reward model trained, the task is to find a policy $\pi_\theta$ (the language model) that generates responses that maximize the reward model's score - while not deviating too far from the SFT model.

Why PPO? PPO (Proximal Policy Optimization, Schulman et al., 2017) is a policy gradient algorithm with a clipped objective that prevents too-large policy updates. For LLMs:

The policy $\pi_\theta(y|x)$ is the language model (a distribution over response tokens)
The "action" is generating a response $y$ token by token
The "reward" is the scalar score from the reward model at the end of the sequence
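The clipped objective mentioned above can be sketched for a single action. This is a simplified illustration of the PPO clipping rule (Schulman et al., 2017), not a full trainer: `clipped_surrogate` is a hypothetical name, and the default clip range of 0.2 is taken from the `cliprange` setting in this lesson's TRL configuration.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """PPO clipped objective for one action (a quantity to maximize).

    ratio:     pi_new(action) / pi_old(action)
    advantage: estimated advantage of the action
    """
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the minimum makes the objective pessimistic: the policy gets no
    # extra credit for moving the ratio outside [1 - clip, 1 + clip].
    return min(ratio * advantage, clipped_ratio * advantage)


print(clipped_surrogate(1.5, 1.0))   # gain is capped once ratio > 1 + clip_range
print(clipped_surrogate(0.5, -1.0))  # pessimistic branch for a negative advantage
print(clipped_surrogate(1.1, 1.0))   # inside the clip range: left unchanged
```

For language models the "action" is a token, so a trainer applies this rule per token and averages over the response.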

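The combined objective, which PPO maximizes (despite the $\mathcal{L}$ notation, it is not a loss to minimize):

$\mathcal{L}_{PPO} = \mathbb{E}_{x \sim D, y \sim \pi_\theta(y|x)} \left[ r_\theta(x, y) - \beta \log \frac{\pi_\theta(y|x)}{\pi_{ref}(y|x)} \right]$

The first term: maximize reward model score. The second term: a KL divergence penalty between the current policy $\pi_\theta$ and the reference policy $\pi_{ref}$ (the SFT model). $\beta$ controls the strength of this constraint. Note that the reward model $r_\theta$ is frozen during this phase: only the policy's parameters are updated, even though both are written with $\theta$.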

Why the KL penalty? Without it, PPO will find ways to maximize the reward model's score that have nothing to do with being genuinely helpful. The model might learn to produce responses that "look" like the reward model's training data but are nonsensical - this is reward hacking. The KL penalty prevents the model from drifting too far from the SFT model, ensuring the policy stays in a distribution where the reward model's scores are meaningful.

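In practice, $\beta$ is a tuned hyperparameter: values between 0.1 and 0.5 are a common starting range, although published setups also use much smaller coefficients (InstructGPT reports 0.02). Larger $\beta$ means more conservative updates - the model stays closer to SFT. Smaller $\beta$ allows more aggressive optimization of the reward model but risks reward hacking.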
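For a single sampled response, the two terms of the objective can be computed from per-token log-probabilities. The sketch below assumes one common implementation convention (an assumption, not something stated above): the KL penalty is applied at every token, and the reward model score is added at the last token. `kl_shaped_rewards` is a hypothetical name, and the default `beta=0.2` mirrors `init_kl_coef` in this lesson's TRL configuration.

```python
def kl_shaped_rewards(rm_score, policy_logprobs, ref_logprobs, beta=0.2):
    """Per-token rewards: KL penalty at every token, RM score at the last one."""
    if len(policy_logprobs) != len(ref_logprobs) or not policy_logprobs:
        raise ValueError("need equal-length, non-empty log-prob lists")
    # log pi_theta(token) - log pi_ref(token) is a per-token sample of the
    # log ratio whose expectation under the policy is the KL divergence.
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += rm_score
    return rewards


policy_lp = [-1.0, -0.5, -2.0]   # log-probs of sampled tokens under the policy
ref_lp = [-1.5, -0.5, -3.0]      # log-probs of the same tokens under the SFT model
rewards = kl_shaped_rewards(rm_score=2.0, policy_logprobs=policy_lp, ref_logprobs=ref_lp)
total = sum(rewards)
print(rewards, total)
```

Summing the per-token rewards recovers the bracketed quantity in the objective for this sample: the reward model score minus $\beta$ times the summed log ratio.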

The Reward Hacking Problem

Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure."

The reward model is a proxy for human preference - not a perfect measure. Once the language model is optimized to maximize the reward model's score, it will find outputs that score highly on the proxy but do not reflect actual human preferences.

Known reward hacking behaviors in RLHF:

Length gaming: reward models often prefer longer responses (correlates with appearing more thorough). PPO finds this and produces verbosely padded responses.
Sycophancy: reward models trained on human feedback inherit human biases - humans prefer responses that confirm their existing beliefs. PPO produces a model that tells users what they want to hear rather than what is true.
Surface-level alignment: the model learns to produce responses that "look" aligned (polite, structured, with caveats) without actually being more helpful or truthful.
Optimization pressure: with enough PPO steps, the model will find edge cases in the reward model's training distribution and exploit them.
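Length gaming is the easiest of these behaviors to check for. The sketch below computes the Pearson correlation between response length and reward model score on a sample of responses. It is an illustrative diagnostic under stated assumptions: whitespace word count is a crude stand-in for token count, the function names are hypothetical, and no particular correlation threshold is implied by the text above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def length_reward_correlation(responses, rewards):
    """Correlation between word count and reward for a batch of responses."""
    lengths = [len(r.split()) for r in responses]
    return pearson(lengths, rewards)


responses = [
    "short",
    "a bit longer",
    "quite a lot longer now",
    "this one is the longest response of them all",
]
scores = [0.1, 0.4, 0.5, 0.9]  # made-up reward model scores
corr = length_reward_correlation(responses, scores)
print(f"length/reward correlation: {corr:.3f}")
```

A high correlation alone does not prove reward hacking, since longer answers are sometimes better; it becomes a warning sign when the correlation rises during PPO while human ratings do not.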

Mitigation strategies:

High KL penalty ( $\beta$ ): limits how far the policy can diverge from SFT, limiting exploitation
Diverse, high-quality reward model training data: reduces blind spots in the reward model
Iterative reward model updates: retrain the reward model on the RLHF model's outputs to continuously close gaps
Multiple reward models: use an ensemble to reduce overfitting to any single model's biases
Conservative PPO training (fewer steps): stop before significant reward hacking occurs
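The ensemble idea can be sketched as an aggregation rule over several reward models' scores for the same response. Using the mean minus a multiple of the standard deviation is one possible conservative choice and is an assumption of this sketch, as are the name `conservative_reward` and the parameter `k`; the text above only says to use an ensemble.

```python
import statistics


def conservative_reward(scores, k=1.0):
    """Aggregate ensemble scores as mean minus k population standard deviations."""
    if not scores:
        raise ValueError("need at least one reward model score")
    # Disagreement between reward models lowers the aggregate, so a response
    # that exploits a blind spot of only one model is rewarded less.
    return statistics.fmean(scores) - k * statistics.pstdev(scores)


agree = conservative_reward([1.0, 1.0, 1.0])
disagree = conservative_reward([3.0, 0.0, 0.0])
print(agree, disagree)
```

Both example responses have the same mean score, but the one that only a single reward model likes ends up with a much lower aggregate.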

Constitutional AI: Anthropic's RLHF Variant

Anthropic introduced Constitutional AI (CAI, Bai et al., 2022) as a more scalable and controllable variant of RLHF. The key innovation: replace human preference labels for harmlessness with feedback from the AI itself, guided by a set of written principles (the "constitution").

Phase 1 - Supervised learning from self-critiques and revisions (called SL-CAI in the paper):

Sample harmful or sensitive responses from the model
Ask the model to critique its own response according to a constitutional principle (e.g., "Is this response harmful? How could it be improved?")
Ask the model to revise the response based on the critique
Fine-tune on the revised (more helpful, less harmful) responses
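The four steps above can be sketched as a loop around a text-generation function. Everything model-specific here is hypothetical: `generate` stands for any callable that maps a prompt string to a completion, the prompt templates are invented for illustration rather than taken from the CAI paper, and `fake_generate` is a stub so the sketch runs without a model.

```python
from typing import Callable, List, Tuple


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str],
) -> Tuple[str, str]:
    """Return (prompt, revised_response) after one critique/revision per principle."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return prompt, response


def fake_generate(text: str) -> str:
    """Stand-in for a real language model call."""
    if "Rewrite the response" in text:
        return "revised answer"
    if "Critique the response" in text:
        return "the answer could be safer"
    return "initial answer"


sft_example = critique_and_revise(
    "How do I pick a lock?", fake_generate, ["Is this response harmful?"]
)
print(sft_example)
```

The returned (prompt, revised response) pairs are what the final step fine-tunes on; the intermediate critiques are discarded.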

Phase 2 - RL from AI Feedback (RLAIF):

Generate response pairs for a set of prompts
Ask a large, capable model (e.g., Claude) to determine which response is more aligned with constitutional principles
Use these AI-generated preference labels to train a reward model
Apply PPO as in standard RLHF
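The labeling step can be sketched as a function that asks a judge model which of two responses better follows a principle and emits a preference record. The `judge` callable, the question template, and the "A"/"B" answer format are all assumptions made for this illustration; `fake_judge` is a stub so the sketch runs without a model.

```python
from typing import Callable, Dict


def label_with_ai_judge(
    prompt: str,
    response_a: str,
    response_b: str,
    judge: Callable[[str], str],
    principle: str,
) -> Dict[str, str]:
    """Build a {prompt, chosen, rejected} record from an AI judge's verdict."""
    question = (
        f"Principle: {principle}\nPrompt: {prompt}\n"
        f"(A) {response_a}\n(B) {response_b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    verdict = judge(question).strip().upper()
    if verdict.startswith("A"):
        chosen, rejected = response_a, response_b
    elif verdict.startswith("B"):
        chosen, rejected = response_b, response_a
    else:
        raise ValueError(f"unparseable verdict: {verdict!r}")
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def fake_judge(question: str) -> str:
    """Stand-in for a call to a capable labeling model."""
    return "B"


record = label_with_ai_judge(
    "How do I pick a lock?",
    "Here is a step-by-step guide...",
    "I can explain how locks work, but not how to break into property.",
    fake_judge,
    "Choose the response that is less harmful.",
)
print(record["chosen"])
```

Records in this shape feed the same pairwise reward model loss as human comparisons, which is why the rest of the pipeline is unchanged.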

Advantages of CAI: scalable (AI labelers are cheaper than human labelers for many preference judgments), consistent (AI applies the same principles across all comparisons), transparent (the constitution makes the alignment objectives explicit and auditable). The model can be made more helpful (by adjusting constitutional principles toward helpfulness) or more cautious (by adding safety principles) by modifying the constitution.

Limitations: the AI labeler's preferences are shaped by its own training. RLAIF inherits and may amplify the labeling model's biases. Human oversight remains important for validating the quality of AI-generated preferences.

InstructGPT Results: What the Numbers Mean

The InstructGPT paper reported several key results:

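1.3B beats 175B: Human raters preferred outputs from 1.3B InstructGPT over outputs from 175B GPT-3. At equal size (175B), InstructGPT was preferred over GPT-3 85% of the time, and 71% of the time over few-shot-prompted GPT-3. This is the headline result - alignment matters more than raw scale for human-facing applications.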
Toxicity reduction: InstructGPT produced ~25% fewer toxic completions than GPT-3 on the RealToxicityPrompts benchmark.
Truthfulness improvement: On TruthfulQA, InstructGPT produced truthful and informative answers about twice as often as GPT-3.
Benchmark regression: RLHF-trained models performed slightly worse on some public NLP benchmarks (such as SQuAD, DROP, HellaSwag, and WMT 2015 French-to-English translation). This is the "alignment tax": optimizing for human preference can slightly hurt performance on academic benchmarks. It remains an active area of research.
Scaling with human feedback: More human preference data improved both win rates and safety metrics. The relationship was roughly log-linear - halving the preference data did not halve the quality, but quality degraded.

Code: Reward Model Training

"""
Reward model training for RLHF.
Demonstrates:
1. Reward model architecture (LM backbone + scalar head)
2. Bradley-Terry loss for preference learning
3. Training on comparison pairs
4. Full RLHF loop concept using TRL
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM
from torch import Tensor
from typing import Optional


# ---- Reward Model Architecture ----

class RewardModel(nn.Module):
    """
    Reward model: transformer backbone + scalar head.
    Takes (prompt + response) as input, outputs a scalar reward.
    """
    def __init__(self, backbone_name: str, dropout: float = 0.1):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.dropout = nn.Dropout(dropout)
        # Map final hidden state to scalar reward
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(
        self,
        input_ids: Tensor,
        attention_mask: Tensor,
    ) -> Tensor:
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        # Use the last token's hidden state as the sequence representation
        # (EOS token for autoregressive models)
        last_hidden = outputs.last_hidden_state  # (batch, seq_len, hidden_size)

        # Get last non-padding token position
        seq_lengths = attention_mask.sum(dim=1) - 1  # (batch,)
        batch_size = input_ids.shape[0]

        # Gather last non-padding hidden state
        last_token_hidden = last_hidden[
            torch.arange(batch_size, device=input_ids.device),
            seq_lengths,
        ]  # (batch, hidden_size)

        last_token_hidden = self.dropout(last_token_hidden)
        reward = self.reward_head(last_token_hidden).squeeze(-1)  # (batch,)
        return reward


# ---- Bradley-Terry Loss ----

def bradley_terry_loss(
    reward_chosen: Tensor,    # (batch,) rewards for preferred responses
    reward_rejected: Tensor,  # (batch,) rewards for less-preferred responses
) -> Tensor:
    """
    Bradley-Terry pairwise preference loss.

    Maximizes the probability that the chosen response has higher reward:
    L = -log(sigma(r_chosen - r_rejected))

    Equivalent to binary cross-entropy where positive = chosen is better.
    """
    # log(sigma(x)) = -log(1 + e^(-x)) = -softplus(-x)
    loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
    return loss


# ---- Reward Model Training Loop ----

class RewardModelTrainer:
    """
    Trains a reward model on preference comparison data.
    """
    def __init__(
        self,
        model: RewardModel,
        tokenizer,
        learning_rate: float = 1e-5,
        max_length: int = 1024,
    ):
        self.model = model
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.optimizer = torch.optim.AdamW(
            model.parameters(),
            lr=learning_rate,
            weight_decay=0.01,
        )

    def tokenize_pair(self, prompt: str, response: str) -> dict:
        """Tokenize a prompt+response pair for reward model input."""
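        # GPT-style tokenizers have no sep_token; fall back to EOS, then a newline
        sep = self.tokenizer.sep_token or self.tokenizer.eos_token or "\n"
        text = prompt + sep + response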
        return self.tokenizer(
            text,
            truncation=True,
            max_length=self.max_length,
            padding="max_length",
            return_tensors="pt",
        )

    def train_step(
        self,
        prompts: list,
        chosen_responses: list,
        rejected_responses: list,
    ) -> tuple:  # returns (loss, accuracy)
        """Single training step on a batch of preference pairs."""
        self.model.train()

        # Tokenize chosen and rejected responses
        chosen_inputs = self.tokenize_pair_batch(prompts, chosen_responses)
        rejected_inputs = self.tokenize_pair_batch(prompts, rejected_responses)

        # Get rewards
        reward_chosen = self.model(
            input_ids=chosen_inputs["input_ids"],
            attention_mask=chosen_inputs["attention_mask"],
        )
        reward_rejected = self.model(
            input_ids=rejected_inputs["input_ids"],
            attention_mask=rejected_inputs["attention_mask"],
        )

        # Bradley-Terry loss
        loss = bradley_terry_loss(reward_chosen, reward_rejected)

        # Update
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
        self.optimizer.step()

        # Compute accuracy (how often does chosen have higher reward?)
        accuracy = (reward_chosen > reward_rejected).float().mean().item()

        return loss.item(), accuracy

    def tokenize_pair_batch(self, prompts, responses):
        """Tokenize a batch of prompt+response pairs."""
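        # GPT-style tokenizers have no sep_token; fall back to EOS, then a newline
        sep = self.tokenizer.sep_token or self.tokenizer.eos_token or "\n"
        texts = [p + sep + r for p, r in zip(prompts, responses)]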
        return self.tokenizer(
            texts,
            truncation=True,
            max_length=self.max_length,
            padding=True,
            return_tensors="pt",
        )


# ---- Full RLHF Training with TRL ----

def run_rlhf_with_trl(
    sft_model_name: str,
    reward_model_name: str,
    prompts: list,
    output_dir: str = "./rlhf-model",
):
    """
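    Full RLHF training loop using TRL PPOTrainer.
    Written against the legacy PPOTrainer interface (config fields such as
    init_kl_coef and a manual generate/step loop), so pin an older TRL
    release; newer releases changed this API.
    Requires:
    - sft_model_name: path to SFT model (starting point)
    - reward_model_name: path to trained reward model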
    """
    from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
    from trl import create_reference_model

    # Load SFT model as policy (trainable)
    model = AutoModelForCausalLMWithValueHead.from_pretrained(sft_model_name)
    # Load SFT model as reference (frozen) - used for KL penalty
    ref_model = create_reference_model(model)

    tokenizer = AutoTokenizer.from_pretrained(sft_model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

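    # Load reward model. NOTE: RewardModel(...) only restores the backbone;
    # the scalar head stays randomly initialized unless the trained
    # RewardModel weights are also loaded (e.g. with load_state_dict).
    reward_model = RewardModel(reward_model_name)
    reward_model.eval()  # disable dropout while scoring
    reward_tokenizer = AutoTokenizer.from_pretrained(reward_model_name)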

    ppo_config = PPOConfig(
        model_name=sft_model_name,
        learning_rate=1.41e-5,   # Value used in TRL's PPO examples
        batch_size=32,
        mini_batch_size=4,
        ppo_epochs=4,            # Number of PPO steps per batch
        kl_penalty="kl",         # Type of KL penalty
        init_kl_coef=0.2,        # Initial beta (KL coefficient)
        target_kl=6,             # Target KL divergence (adaptive)
        gamma=1,                 # Discount factor (1.0 for bandit setting)
        lam=0.95,                # GAE lambda
        cliprange=0.2,           # PPO clip range
        cliprange_value=0.2,     # Value function clip range
        vf_coef=0.1,             # Value function loss coefficient
    )

    ppo_trainer = PPOTrainer(
        config=ppo_config,
        model=model,
        ref_model=ref_model,
        tokenizer=tokenizer,
        dataset=None,  # Provide actual dataset here
    )

    def compute_rewards(prompts, responses):
        """Get reward model scores for a batch of responses."""
        rewards = []
        for prompt, response in zip(prompts, responses):
            inputs = reward_tokenizer(
                prompt + response,
                return_tensors="pt",
                truncation=True,
                max_length=1024,
            )
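            with torch.no_grad():
                # Pass only the arguments RewardModel.forward accepts
                # (some tokenizers also return token_type_ids)
                reward = reward_model(
                    input_ids=inputs["input_ids"],
                    attention_mask=inputs["attention_mask"],
                ).item()
            rewards.append(torch.tensor(reward))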
        return rewards

    # PPO training loop (simplified).
    # Each element of `prompts` is a list of prompt strings whose length
    # equals ppo_config.batch_size.
    generation_kwargs = {
        "max_new_tokens": 200,
        "do_sample": True,
        "temperature": 1.0,
        "top_p": 0.9,
        "pad_token_id": tokenizer.eos_token_id,
    }

    for batch_prompts in prompts:
        # Tokenize each prompt on its own: the legacy PPOTrainer expects a
        # list of 1-D query tensors, so no padding is needed here
        query_tensors = [
            tokenizer(p, return_tensors="pt")["input_ids"].squeeze(0)
            for p in batch_prompts
        ]

        # Generate responses from current policy (response tokens only)
        response_tensors = ppo_trainer.generate(
            query_tensors,
            return_prompt=False,
            **generation_kwargs,
        )
        response_texts = [
            tokenizer.decode(r, skip_special_tokens=True)
            for r in response_tensors
        ]

        # Get rewards from reward model
        rewards = compute_rewards(batch_prompts, response_texts)

        # PPO update step
        stats = ppo_trainer.step(
            queries=query_tensors,
            responses=response_tensors,
            scores=rewards,
        )

    # The legacy PPOTrainer exposes save_pretrained rather than save_model
    ppo_trainer.save_pretrained(output_dir)
    return ppo_trainer

Production Engineering Notes

RLHF is Expensive and Unstable

Full RLHF is complex to implement and expensive to run:

Human labeler cost: $0.50 to $2.00 per comparison pair; 30,000 pairs = $15,000 to $60,000
Reward model training: 1-3 A100-hours for a 7B model
PPO training: 10-50 GPU-hours on 8-16 GPUs (unstable, requires monitoring)
Multiple training stages: SFT → RM → PPO, each requiring checkpoints and evaluation

This complexity is why DPO (Lesson 11) gained rapid adoption - it removes the reward model and RL entirely, achieving similar quality with far less engineering overhead.
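The labeling estimate above is simple arithmetic, and a small helper makes it easy to rerun for a different dataset size. The per-pair prices are the ones quoted in this section; `labeling_cost` is a hypothetical helper name.

```python
def labeling_cost(num_pairs, cost_per_pair_low, cost_per_pair_high):
    """Return the (low, high) labeling budget in dollars."""
    return num_pairs * cost_per_pair_low, num_pairs * cost_per_pair_high


low, high = labeling_cost(30_000, 0.50, 2.00)
print(f"${low:,.0f} to ${high:,.0f}")  # $15,000 to $60,000
```

The estimate covers labeling only; labeler training, quality audits, and relabeling of low-agreement pairs would add to it.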

When to Use RLHF vs DPO

RLHF is still the right choice when:

You need online learning (generating responses and labeling them in a loop)
The task requires iterative reward signal (code execution feedback, tool use)
You have human labelers and want the best possible alignment quality
You need fine-grained control over the optimization process

Use DPO when:

You have offline preference data (collected in advance)
You want simpler, more stable training
You are resource-constrained (DPO needs only the policy and a frozen reference model, with no separate reward model or value model)

note

The InstructGPT scaling result: alignment quality scales with both model size and the quality of the human feedback data, but past a certain point alignment matters more than size - a well-aligned small model can outperform a much larger unaligned model on human preference tasks. This has practical implications: spend your budget on alignment quality (human labeler training, preference data diversity) rather than simply scaling the model.

Common Mistakes

danger

Using too much PPO: PPO training on a language model is unstable. Running too many PPO steps causes reward hacking - the model finds degenerate solutions that score high on the reward model but produce low-quality outputs (extreme verbosity, sycophancy, repetitive structure). Monitor the KL divergence between the policy and the reference model. If KL exceeds 10-20 nats, you have likely overfit to the reward model. Stop PPO training early and rely on the KL penalty to constrain the policy.
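A sequence-level KL estimate of the kind described here can be computed from log-probabilities of responses sampled from the current policy. This is a minimal sketch: the function names are hypothetical, and the default threshold of 10 nats is simply the lower end of the range quoted above, not a universal constant.

```python
def estimate_sequence_kl(policy_logprobs, ref_logprobs):
    """Monte Carlo estimate of KL(policy || ref) in nats.

    Each argument is a list with one entry per sampled response; each entry is
    the list of per-token log-probs of that response's tokens. The responses
    must have been sampled from the current policy.
    """
    if len(policy_logprobs) != len(ref_logprobs) or not policy_logprobs:
        raise ValueError("need the same non-zero number of responses")
    per_response = [
        sum(p - r for p, r in zip(policy_lp, ref_lp))
        for policy_lp, ref_lp in zip(policy_logprobs, ref_logprobs)
    ]
    return sum(per_response) / len(per_response)


def should_stop(kl_nats, threshold=10.0):
    """Early-stopping rule: stop once the estimated KL exceeds the threshold."""
    return kl_nats > threshold


policy_lp = [[-1.0, -1.0], [-2.0, -1.0]]
ref_lp = [[-3.0, -4.0], [-2.0, -2.0]]
kl = estimate_sequence_kl(policy_lp, ref_lp)
print(kl, should_stop(kl))
```

Because the estimate sums over tokens, it grows with response length, so thresholds should be compared across runs that use the same maximum generation length.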

danger

Training the reward model on too few or biased examples: The reward model is only as good as the preference data it was trained on. If human labelers had a systematic bias (prefer longer responses, prefer confident-sounding answers regardless of accuracy), the reward model will encode that bias. RLHF will then optimize the language model toward that bias. Mitigation: use diverse labeler pools, provide detailed labeling guidelines, include agreement metrics as quality filters (discard comparisons where labelers strongly disagreed), and audit reward model behavior on held-out examples before PPO training.

warning

Not maintaining a frozen reference model during PPO: The KL penalty in PPO requires comparing the current policy to the original SFT policy. If you accidentally train the reference model or update it during PPO, the KL constraint becomes meaningless - the model can deviate arbitrarily from the SFT baseline without penalty. Always verify that the reference model's weights are frozen (requires_grad=False for all parameters) before starting PPO.

tip

Monitor reward model accuracy during training - it should plateau. During reward model training, track validation accuracy (how often does the RM correctly identify the human-preferred response?). A good reward model achieves 70-80% accuracy on held-out comparisons (human agreement itself is around 75-80%, so this is near the ceiling). If accuracy plateaus below 65%, you likely have data quality issues or an insufficient model. A reward model with 60% accuracy provides a very weak training signal for PPO.

Interview Q&A

Q1: Explain the three phases of RLHF and why each is necessary.

Phase 1 (SFT): fine-tune the base model on demonstrations of desired behavior. This produces a model that generates responses in the right format. Without SFT, the base model generates text that looks nothing like a helpful assistant's response - the RL training would have a very poor starting point.

Phase 2 (Reward Model): collect human preference comparisons and train a model to predict them. This is necessary because human preferences are hard to specify as a loss function. The reward model is a learned proxy for human judgment.

Phase 3 (PPO): use RL to fine-tune the language model to maximize the reward model's score while staying close to the SFT model (KL penalty). This is necessary because the SFT model is trained on demonstrations (mimicking ideal responses) rather than optimizing an objective. RL allows the model to explore the response space and find responses that are genuinely preferred by the reward model, not just similar to the demonstrations.

Q2: What is the Bradley-Terry model and why is it used for reward model training?

The Bradley-Terry model is a statistical model for pairwise comparisons. It models the probability that item $i$ is preferred over item $j$ as $P(i \succ j) = \sigma(s_i - s_j)$ , where $s_i$ is the score (reward) of item $i$ and $\sigma$ is the sigmoid function. For RLHF, the loss is $-\log \sigma(r(x, y_w) - r(x, y_l))$ - maximize the probability that the preferred response has a higher reward. It is used because: (1) it naturally extends to multiple comparisons per prompt (not just binary); (2) it is differentiable and easy to optimize; (3) it is well-understood statistically; (4) it handles the ordinal nature of preferences (preferred vs not preferred) without requiring absolute scores.
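One step the answer above leaves implicit is where the sigmoid comes from. In the classical Bradley-Terry formulation each item has a positive strength $p_i$, and the preference probability is $p_i / (p_i + p_j)$. Assuming the usual reparameterization $p_i = e^{s_i}$ (a standard identity, not something specific to this lesson), the sigmoid form follows:

$P(i \succ j) = \frac{e^{s_i}}{e^{s_i} + e^{s_j}} = \frac{1}{1 + e^{-(s_i - s_j)}} = \sigma(s_i - s_j)$

Because only the difference $s_i - s_j$ appears, adding the same constant to every reward leaves all preference probabilities unchanged, so reward values are meaningful only relative to one another.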

Q3: What is reward hacking and how do you prevent it?

Reward hacking (Goodhart's Law applied to RLHF) is when the language model finds ways to achieve high scores from the reward model that do not reflect genuine alignment. Examples: producing extremely long responses (reward models correlate length with quality), agreeing with the user regardless of accuracy (sycophancy), using specific surface patterns (hedging language, bullet points) that the reward model associates with quality without those patterns actually improving quality. Prevention: (1) KL penalty to constrain policy drift; (2) early stopping of PPO before over-optimization; (3) monitoring reward model score vs independent human evaluation (if these diverge, you have reward hacking); (4) diverse reward model training data; (5) iterative reward model retraining on RLHF model outputs.

Q4: Why did InstructGPT (1.3B RLHF) beat GPT-3 (175B) in human evaluations?

This result illustrates that alignment quality matters more than raw scale for human-facing tasks. GPT-3 was trained to predict text - it is excellent at completing text but has no mechanism for caring whether the output is helpful, harmless, or honest. It will complete harmful prompts, produce confident misinformation, and generate irrelevant continuations. InstructGPT was explicitly trained to produce responses that humans find helpful, harmless, and honest. The smaller model was literally optimized for the evaluation metric (human preference) while the larger model was not. This is not magic - it is the difference between optimizing for the right objective (human preference) vs an indirect objective (text prediction).

Q5: What is Constitutional AI and how does it differ from standard RLHF?

Constitutional AI (Bai et al., 2022) replaces human preference labels for harmlessness with AI-generated labels guided by explicit principles. In standard RLHF, humans compare responses and say which is better. In CAI: (1) supervised phase (SL-CAI in the paper) - the model critiques and revises its own harmful outputs guided by a written constitution of principles, and is fine-tuned on the revisions; (2) RLAIF phase - a large AI model (not human labelers) generates preference comparisons by judging which response is more aligned with the constitution. Advantages: scalable (AI labelers are cheap), consistent (principles are applied uniformly), transparent (the constitution is auditable). Disadvantages: inherits the labeling model's biases; human oversight is still needed to validate that the constitution captures the right values. Used by Anthropic for Claude's alignment training.

Advanced: Implementing Reward Model Evaluation

A well-trained reward model is the foundation of RLHF quality. Here is a complete evaluation framework:

"""
Reward model evaluation and calibration.
Critical for ensuring RLHF training quality.
"""

import torch
import numpy as np
from typing import List, Tuple
from sklearn.metrics import roc_auc_score, accuracy_score


def evaluate_reward_model(
    reward_model,
    tokenizer,
    test_pairs: List[Tuple[str, str, str, int]],
    # Each tuple: (prompt, response_a, response_b, label)
    # label: 0 if response_a preferred, 1 if response_b preferred
    batch_size: int = 8,
) -> dict:
    """
    Evaluate reward model quality on held-out preference pairs.

    Metrics:
    - Accuracy: fraction of pairs where RM correctly identifies preferred response
    - AUC: area under ROC curve (measures ranking quality)
    - Margin distribution: distribution of reward differences for correct/incorrect pairs
    """
    reward_model.eval()
    all_probs = []
    all_labels = []
    all_margins = []

    for i in range(0, len(test_pairs), batch_size):
        batch = test_pairs[i:i + batch_size]
        prompts = [p[0] for p in batch]
        responses_a = [p[1] for p in batch]
        responses_b = [p[2] for p in batch]
        labels = [p[3] for p in batch]

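        def get_rewards(prompts, responses):
            # GPT-style tokenizers have no sep_token; fall back to EOS, then a newline
            sep = tokenizer.sep_token or tokenizer.eos_token or "\n"
            texts = [p + sep + r for p, r in zip(prompts, responses)]
            inputs = tokenizer(
                texts, return_tensors="pt", truncation=True,
                max_length=1024, padding=True
            )
            with torch.no_grad():
                # Pass only the arguments the reward model's forward accepts
                # (some tokenizers also return token_type_ids)
                rewards = reward_model(
                    input_ids=inputs["input_ids"],
                    attention_mask=inputs["attention_mask"],
                ).cpu().numpy()
            return rewards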

        rewards_a = get_rewards(prompts, responses_a)
        rewards_b = get_rewards(prompts, responses_b)

        # P(B preferred) = sigma(r_b - r_a)
        margin = rewards_b - rewards_a
        prob_b_preferred = 1 / (1 + np.exp(-margin))  # sigmoid

        all_probs.extend(prob_b_preferred.tolist())
        all_labels.extend(labels)
        all_margins.extend(margin.tolist())

    # Metrics
    predicted_labels = [1 if p > 0.5 else 0 for p in all_probs]
    accuracy = accuracy_score(all_labels, predicted_labels)
    auc = roc_auc_score(all_labels, all_probs)

    # Margin analysis
    correct_margins = [abs(m) for m, l, p in zip(all_margins, all_labels, predicted_labels) if l == p]
    incorrect_margins = [abs(m) for m, l, p in zip(all_margins, all_labels, predicted_labels) if l != p]

    # Guard against empty lists (all pairs correct, or all incorrect) to avoid NaN
    avg_correct = float(np.mean(correct_margins)) if correct_margins else 0.0
    avg_incorrect = float(np.mean(incorrect_margins)) if incorrect_margins else 0.0

    return {
        "accuracy": accuracy,
        "auc": auc,
        "avg_margin_correct": avg_correct,
        "avg_margin_incorrect": avg_incorrect,
        "margin_separation": avg_correct - avg_incorrect,
    }


def monitor_ppo_training(policy_model, ref_model, tokenizer, eval_prompts, step):
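    """
    Monitor PPO training health.
    Computes a cheap proxy for policy drift: the KL divergence between the
    policy's and the reference model's next-token distributions at the end
    of each prompt. This is a single-token KL, not the KL summed over a full
    sampled response, so treat the 8 and 15 nat thresholds below as rough
    heuristics and calibrate them for your own setup.
    If the KL exceeds the upper threshold, reward hacking is likely occurring.
    """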
    policy_model.eval()
    ref_model.eval()

    kl_values = []
    for prompt in eval_prompts[:20]:  # Sample 20 prompts
        inputs = tokenizer(prompt, return_tensors="pt")

        with torch.no_grad():
            policy_logits = policy_model(**inputs).logits[:, -1, :]
            ref_logits = ref_model(**inputs).logits[:, -1, :]

        # KL divergence at the last token position.
        # log_softmax is numerically more stable than log(softmax(x) + eps).
        policy_logp = torch.log_softmax(policy_logits, dim=-1)
        ref_logp = torch.log_softmax(ref_logits, dim=-1)

        # KL(policy || ref) = sum_v p_policy(v) * (log p_policy(v) - log p_ref(v))
        kl = (policy_logp.exp() * (policy_logp - ref_logp)).sum(-1)
        kl_values.append(kl.item())

    avg_kl = np.mean(kl_values)

    status = "OK"
    if avg_kl > 15:
        status = "WARNING: Likely reward hacking"
    elif avg_kl > 8:
        status = "CAUTION: High KL, consider stopping"

    print(f"Step {step}: Avg KL = {avg_kl:.3f} nats | Status: {status}")
    return avg_kl


# Labeler agreement analysis - critical for reward model data quality
def analyze_labeler_agreement(comparisons_with_labels: list) -> dict:
    """
    Analyze agreement between human labelers.
    Low agreement means noisy data - filter these pairs or re-collect.

    comparisons_with_labels: list of {prompt, response_a, response_b, labeler_votes}
    labeler_votes: list of labels from multiple labelers (0 or 1)
    """
    agreements = []
    for comp in comparisons_with_labels:
        votes = comp["labeler_votes"]
        n = len(votes)
        # Fraction of labelers in agreement with majority
        majority = 1 if sum(votes) > n / 2 else 0
        agreement_rate = sum(v == majority for v in votes) / n
        agreements.append(agreement_rate)

    return {
        "avg_agreement": np.mean(agreements),
        "fraction_unanimous": sum(a == 1.0 for a in agreements) / len(agreements),
        "fraction_low_agreement": sum(a < 0.6 for a in agreements) / len(agreements),
        # Recommend filtering pairs with agreement < 0.6
    }

RLHF Engineering: Scaling Considerations

Human labeler throughput: A skilled labeler can evaluate approximately 30-60 preference pairs per hour for typical instruction-following tasks. More complex tasks (code correctness, technical accuracy) require 10-20 pairs per hour. For InstructGPT-scale data (33,000 pairs): approximately 500-1,500 labeler-hours. At $25/hour, that is $12,500 to $37,500 just for preference data collection.
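The cost arithmetic above can be reproduced with a short sketch. The helper name `labeling_cost` and the `labelers_per_pair` parameter are illustrative assumptions, not something taken from the InstructGPT paper; the throughput and hourly-rate figures are the ones quoted in this section.

```python
def labeling_cost(
    num_pairs: int,
    pairs_per_hour: float,
    hourly_rate: float,
    labelers_per_pair: int = 1,  # Assumption: set >1 if each pair is labeled redundantly
) -> tuple[float, float]:
    """Return (labeler_hours, total_cost_in_dollars) for a preference dataset."""
    hours = num_pairs * labelers_per_pair / pairs_per_hour
    return hours, hours * hourly_rate


if __name__ == "__main__":
    # Sweep the throughput figures quoted above for 33,000 pairs at $25/hour
    for pairs_per_hour in (60, 30, 20, 10):
        hours, cost = labeling_cost(33_000, pairs_per_hour, hourly_rate=25.0)
        print(f"{pairs_per_hour:>2} pairs/hour: {hours:,.0f} hours, ${cost:,.0f}")
```

Under these assumptions the 30-60 pairs/hour range works out to 550-1,100 hours, while the slower 10-20 pairs/hour range for complex tasks pushes the total to 1,650-3,300 hours. Redundant labeling (several labelers per pair, as needed for agreement analysis) multiplies the total accordingly.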

Reward model size: a common rule of thumb is to make the reward model comparable in size to the policy it scores, so a 7B reward model is a reasonable match for a 7B policy. This is not a hard requirement: InstructGPT used a 6B reward model for its 175B policy. A reward model far smaller than the policy, such as a 1B model scoring a 70B policy, may be underpowered, and its limited capacity can cap alignment quality.

PPO batch size: unlike SFT, PPO benefits significantly from large batch sizes because the policy gradient estimates have high variance. Use batch sizes of 128-512 with mini-batch size 16-32. This requires gradient accumulation and multiple GPUs.
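As a rough sketch of the arithmetic behind that last sentence, the helper below computes how many gradient accumulation steps are needed. It assumes the mini-batch size is what fits on one GPU per forward/backward pass and that one optimizer update should see the full batch; `accumulation_steps` is a hypothetical helper, and real PPO libraries may define these terms differently.

```python
def accumulation_steps(batch_size: int, per_gpu_mini_batch: int, num_gpus: int) -> int:
    """Gradient accumulation steps so that one optimizer update sees batch_size samples."""
    samples_per_pass = per_gpu_mini_batch * num_gpus
    if batch_size % samples_per_pass != 0:
        raise ValueError(
            f"batch_size={batch_size} is not a multiple of "
            f"per_gpu_mini_batch * num_gpus = {samples_per_pass}"
        )
    return batch_size // samples_per_pass


if __name__ == "__main__":
    # Example: batch of 256, mini-batch of 16 per GPU, 4 GPUs
    print(accumulation_steps(256, 16, 4))
```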

Number of PPO steps: typically 100-500 PPO update steps for a 7B model. More than 1,000 steps without monitoring risks reward hacking. Monitor reward model score alongside independent human evaluations - if they diverge, stop PPO.
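One way to make the "stop if they diverge" rule concrete is sketched below. It assumes you record, at each evaluation interval, the mean reward model score and a human-judged win rate against the SFT model; the function name `should_stop_ppo` and the `patience` parameter are hypothetical, and the rule itself is a simple heuristic rather than an established criterion.

```python
def should_stop_ppo(
    rm_scores: list[float],
    human_win_rates: list[float],
    patience: int = 2,
) -> bool:
    """
    Return True if, over the last `patience` evaluation intervals, the mean
    reward model score kept rising while the human win rate kept falling.
    """
    if len(rm_scores) != len(human_win_rates):
        raise ValueError("need one human win rate per reward model score")
    if len(rm_scores) < patience + 1:
        return False  # Not enough history to judge a trend

    recent = range(len(rm_scores) - patience, len(rm_scores))
    rm_rising = all(rm_scores[i] > rm_scores[i - 1] for i in recent)
    human_falling = all(human_win_rates[i] < human_win_rates[i - 1] for i in recent)
    return rm_rising and human_falling
```

For example, reward model scores of 0.5, 0.9, 1.4, 2.1 paired with human win rates of 0.55, 0.61, 0.58, 0.52 would trigger a stop with `patience=2`: the proxy keeps improving while the thing it is supposed to measure gets worse.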

:::note The InstructGPT paper's appendix is required reading

The InstructGPT paper (Ouyang et al., 2022) has a remarkably detailed appendix covering labeler guidelines, interface design, agreement metrics, and hyperparameters. If you are implementing RLHF seriously, reading Appendix B (labeler selection and guidelines) and Appendix C (model training details, including PPO) is more valuable than reading most other RLHF papers. The practical details of how to train labelers, what to do about disagreements, and how to set up the annotation interface are rarely documented elsewhere with this level of specificity.

:::

Alternatives to PPO in RLHF

PPO is not the only way to optimize against a reward model. Several alternatives have emerged that are simpler to implement and sometimes achieve better results:

REINFORCE with Baseline

The simplest policy gradient method applied to language model fine-tuning:

import torch
import torch.nn.functional as F

def reinforce_step(
    policy_model,
    reward_model,
    ref_model,
    prompts: list[str],
    tokenizer,
    kl_coeff: float = 0.05,
    num_samples: int = 4,      # Sample multiple responses per prompt
) -> dict:
    """
    REINFORCE with baseline for LM alignment.
    Simpler than PPO - no value function, no clipping.
    """
    all_log_probs = []
    all_advantages = []
    all_kl_penalties = []
    all_raw_rewards = []   # Unnormalized rewards, kept for logging

    for prompt in prompts:
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

        # Sample multiple responses
        with torch.no_grad():
            generated = policy_model.generate(
                prompt_ids,
                max_new_tokens=256,
                do_sample=True,
                temperature=0.9,
                num_return_sequences=num_samples,
            )

        # Score with reward model
        rewards = []
        for response_ids in generated:
            response_text = tokenizer.decode(response_ids, skip_special_tokens=True)
            # BatchEncoding supports .to(device), not .cuda()
            reward_input = tokenizer(response_text, return_tensors="pt").to("cuda")
            with torch.no_grad():
                reward = reward_model(**reward_input).logits.squeeze()
            rewards.append(reward.item())
        all_raw_rewards.extend(rewards)

        # Baseline = mean reward across samples (variance reduction)
        baseline = sum(rewards) / len(rewards)

        for response_ids, reward in zip(generated, rewards):
            # Compute log probs under current policy.
            # Note: .loss is the mean negative log-likelihood per token over the
            # whole sequence (prompt included), so this is a length-normalized
            # approximation of the response log-probability.
            response_text = tokenizer.decode(response_ids, skip_special_tokens=True)
            encoded = tokenizer(response_text, return_tensors="pt").to("cuda")

            log_probs_policy = -policy_model(**encoded, labels=encoded.input_ids).loss
            with torch.no_grad():  # The reference model is frozen
                log_probs_ref = -ref_model(**encoded, labels=encoded.input_ids).loss

            # KL penalty (single-sample estimate from the sampled response)
            kl = log_probs_policy - log_probs_ref

            # Advantage = reward - baseline
            advantage = reward - baseline

            all_log_probs.append(log_probs_policy)
            all_advantages.append(torch.tensor(advantage))
            all_kl_penalties.append(kl)

    # REINFORCE loss: -E[advantage * log_prob] + KL_coeff * KL
    policy_losses = [-lp * adv for lp, adv in zip(all_log_probs, all_advantages)]
    policy_loss = torch.stack(policy_losses).mean()
    kl_loss = torch.stack(all_kl_penalties).mean()

    total_loss = policy_loss + kl_coeff * kl_loss

    return {
        "loss": total_loss,
        "policy_loss": policy_loss.item(),
        "kl_loss": kl_loss.item(),
        # Report the raw reward; the mean advantage is ~0 by construction
        "mean_reward": sum(all_raw_rewards) / len(all_raw_rewards),
    }

Group Relative Policy Optimization (GRPO)

GRPO (Shao et al., 2024), developed at DeepSeek, is a PPO variant that eliminates the learned value function. Instead, it estimates advantages using the relative rewards within a group of sampled responses for the same prompt.

def grpo_step(
    policy_model,
    reward_fn,           # Can be a neural reward model or a rule-based verifier
    prompts: list[str],
    tokenizer,
    group_size: int = 8,      # Number of responses sampled per prompt
    kl_coeff: float = 0.04,
    clip_range: float = 0.2,
) -> dict:
    """
    GRPO: Group Relative Policy Optimization.
    Advantage = (reward - group_mean) / group_std
    No value function needed - group statistics serve as the baseline.
    """
    policy_model.eval()
    all_responses = []
    all_raw_rewards = []   # Unnormalized rewards, kept for logging

    # Phase 1: Sample responses and compute rewards (no grad)
    with torch.no_grad():
        for prompt in prompts:
            prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
            responses = policy_model.generate(
                prompt_ids,
                max_new_tokens=512,
                do_sample=True,
                temperature=0.8,
                num_return_sequences=group_size,
            )

            rewards = []
            for resp in responses:
                resp_text = tokenizer.decode(resp, skip_special_tokens=True)
                reward = reward_fn(prompt, resp_text)   # Float reward
                rewards.append(reward)
            all_raw_rewards.extend(rewards)

            # Normalize within group - this is the key GRPO insight
            reward_mean = sum(rewards) / len(rewards)
            reward_std = (sum((r - reward_mean)**2 for r in rewards) / len(rewards)) ** 0.5 + 1e-8
            normalized_rewards = [(r - reward_mean) / reward_std for r in rewards]

            for resp, advantage in zip(responses, normalized_rewards):
                all_responses.append((prompt, resp, advantage))

    # Phase 2: Compute policy gradient loss (with grad)
    policy_model.train()
    total_loss = 0.0

    for prompt, response_ids, advantage in all_responses:
        full_text = tokenizer.decode(response_ids, skip_special_tokens=True)
        # BatchEncoding supports .to(device), not .cuda()
        encoded = tokenizer(full_text, return_tensors="pt").to("cuda")

        output = policy_model(**encoded, labels=encoded.input_ids)
        log_probs = -output.loss  # Mean per-token log probability (prompt tokens included)

        # Simplified surrogate: with a single gradient step per batch the
        # importance ratio is 1, so the gradient of the clipped objective
        # reduces to that of -advantage * log_prob. Full GRPO reuses each
        # batch for several steps, clips the ratio with clip_range, and adds
        # a KL term weighted by kl_coeff; neither parameter is used here.
        loss = -log_probs * advantage
        total_loss = total_loss + loss

    return {
        "loss": total_loss / len(all_responses),
        # Report the raw reward; normalized rewards average to ~0 by construction
        "mean_reward": sum(all_raw_rewards) / len(all_raw_rewards),
    }
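To make the two core GRPO ingredients concrete without a model in the loop, here is a minimal pure-Python sketch of group-normalized advantages and the PPO-style clipped surrogate for a single sample. The function names are illustrative, and the per-sample scalar form is a simplification: practical implementations apply the clipping per token on tensors.

```python
import math


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(ratio: float, advantage: float, clip_range: float = 0.2) -> float:
    """
    PPO-style clipped objective for one sample (a quantity to maximize).
    ratio = pi_new(y | x) / pi_old(y | x)
    """
    clipped_ratio = max(1.0 - clip_range, min(1.0 + clip_range, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Binary verifier rewards for four sampled answers to the same prompt
    print(group_advantages([1.0, 0.0, 0.0, 1.0]))
    # A large ratio with a positive advantage is capped at (1 + clip_range) * advantage
    print(clipped_surrogate(ratio=1.5, advantage=1.0))
```

Note what happens when every response in a group gets the same reward: all advantages are zero, so that prompt contributes no gradient. This is why GRPO needs prompts where the policy sometimes succeeds and sometimes fails.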

Interview Q&A

Q1: Explain the three phases of RLHF and why each is necessary.

RLHF has three essential phases. Phase 1 (SFT): We need a base model that can produce coherent responses in the instruction-following format. A raw pretrained model outputs anything - code, news articles, or random continuations - not necessarily helpful answers. SFT teaches the model the "language" of question-answering. Phase 2 (Reward Model): We cannot directly optimize against human preferences because human judgment is slow, expensive, and not differentiable. We train a reward model to proxy human preferences, making optimization tractable. The reward model must be learned because there is no analytical function that captures "helpfulness." Phase 3 (PPO): We use the reward model as a training signal to update the SFT model toward higher-reward behaviors. Fine-tuning only on the highest-rated existing responses would be rejection sampling, not RL, and is limited to behaviors already present in the data. In PPO the policy samples its own responses, the reward model scores them, and the policy gradient shifts probability toward the higher-scoring ones. No gradient flows through the reward model itself; its score is a scalar signal. This on-policy loop is what lets the model discover high-reward behaviors it has not produced before.

Q2: What is reward hacking and why is it inevitable at scale?

Reward hacking occurs when the policy finds high-reward behaviors that exploit the reward model rather than genuinely satisfying the objective. Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." The reward model is trained on a finite set of human judgments and inevitably has blind spots. As PPO optimization pressure increases, the policy finds and exploits these blind spots - producing outputs that the reward model rates highly but that humans would not. Examples: generating very long responses (humans often rate verbose answers as more thoughtful), excessive sycophancy (agreeing with wrong premises), and outputs that pattern-match to rewarded styles without actual helpfulness. Mitigation: KL divergence penalty limits how far the policy can deviate from the SFT model, limiting how much it can overoptimize.
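The length exploit mentioned above is easy to check for. The sketch below computes the Pearson correlation between response length (in whitespace-separated words) and reward model score; the function names are illustrative, and a strong positive correlation is a hint worth investigating rather than proof of reward hacking, since longer answers are sometimes genuinely better.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def length_reward_correlation(responses: list[str], rewards: list[float]) -> float:
    """Correlation between word count and reward across sampled responses."""
    lengths = [float(len(r.split())) for r in responses]
    return pearson(lengths, rewards)
```

Tracking this number across PPO steps, alongside mean response length, gives an early signal: if both climb while human evaluations stay flat, the policy is probably exploiting a length bias in the reward model.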

Q3: What is the InstructGPT key finding about model size and alignment?

The headline result from Ouyang et al. (2022): outputs from InstructGPT 1.3B (aligned via RLHF) were preferred by human evaluators over outputs from GPT-3 175B (unaligned), despite the 100x difference in parameter count. The 85% figure that is often quoted comes from the same-size comparison: 175B InstructGPT was preferred over 175B GPT-3 85 ± 3% of the time. A model that is 100x smaller but carefully aligned is significantly more useful than a massive model that is not. This has two implications: (1) alignment is not just about making models safe - it makes them genuinely more useful; (2) the "alignment tax" narrative (alignment reduces capability) was wrong for instruction following - alignment actually improves capability as measured by human preference. The reason: GPT-3 was pretrained to predict next tokens in any context; InstructGPT was specifically trained to respond helpfully to user instructions.

Q4: What is the Bradley-Terry model and why is it used for the reward model?

The Bradley-Terry model is a probabilistic model for pairwise comparisons. Given two items A and B with "strength" parameters $r_A$ and $r_B$, the probability that A is preferred to B is $P(A \succ B) = \sigma(r_A - r_B)$, where $\sigma$ is the sigmoid function. For RLHF, $r_A$ and $r_B$ are the scalar outputs of the reward model for two responses to the same prompt. The Bradley-Terry model is used because: (1) it is mathematically tractable - the negative log-likelihood of a preference dataset is convex in the reward scores (though not in the weights of the neural network that produces them); (2) it treats preferences as probabilistic rather than deterministic, which matches how human labelers behave - they disagree with each other and are inconsistent on close calls; (3) it enforces transitivity - because every response gets a single scalar score, if A scores above B and B above C, then A scores above C.
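A minimal pure-Python sketch of the Bradley-Terry preference probability and the resulting reward model loss follows. The function names are illustrative; a real reward model would compute the same quantity on batched tensors, with the scores produced by a neural network.

```python
import math


def neg_log_sigmoid(margin: float) -> float:
    """Numerically stable -log(sigmoid(margin))."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_prob(r_a: float, r_b: float) -> float:
    """P(A preferred over B) = sigmoid(r_A - r_B)."""
    return math.exp(-neg_log_sigmoid(r_a - r_b))


def bradley_terry_nll(pairs: list[tuple[float, float]]) -> float:
    """
    Mean negative log-likelihood over (reward_chosen, reward_rejected) pairs.
    Minimizing this pushes the chosen response's score above the rejected one's.
    """
    return sum(neg_log_sigmoid(chosen - rejected) for chosen, rejected in pairs) / len(pairs)
```

Only the difference between the two scores matters, so adding a constant to every reward leaves the loss unchanged. When the two scores are equal the model is at chance, and the loss is $\log 2$.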

Q5: What is the KL divergence penalty in PPO for LM alignment and why is it necessary?

The PPO objective for LM alignment, which is maximized, is: $\mathcal{L} = \mathbb{E}[r_\phi(x, y) - \beta \cdot \text{KL}(\pi_\theta \,\|\, \pi_{SFT})]$. The KL term measures how far the current policy $\pi_\theta$ has drifted from the SFT model $\pi_{SFT}$. It is necessary for two reasons: (1) Reward hacking prevention - without the KL penalty, PPO would quickly find reward-hacking strategies that exploit the reward model's weaknesses. The KL penalty limits the policy's ability to deviate into out-of-distribution territory where the reward model's predictions are unreliable. (2) Maintaining language quality - the SFT model produces coherent, grammatical text. Unrestricted PPO optimization could distort the language model's outputs into incoherent sequences that happen to receive high rewards. The KL penalty anchors the policy to the distribution where language quality is preserved. $\beta$ is typically set between 0.01 and 0.1 - smaller values allow more optimization, larger values are more conservative.
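In practice the KL term is usually estimated from the sampled response itself rather than computed over the full vocabulary. The sketch below shows that single-sample estimate, assuming you already have the per-token log-probabilities of the sampled response under both models; `kl_shaped_reward` is a hypothetical helper, and it applies the penalty at the sequence level, whereas many implementations spread it across tokens.

```python
def kl_shaped_reward(
    reward: float,
    policy_logprobs: list[float],   # log pi_theta(y_t | x, y_<t) for each sampled token
    ref_logprobs: list[float],      # log pi_SFT(y_t | x, y_<t) for the same tokens
    beta: float = 0.05,
) -> float:
    """
    Reward after the KL penalty: r - beta * sum_t (log pi_theta - log pi_SFT).
    The sum is a single-sample estimate of KL(pi_theta || pi_SFT) for this response.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must cover the same tokens")
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate
```

If the policy assigns the sampled tokens the same probability as the SFT model, the estimate is zero and the reward passes through unchanged. The more the policy has sharpened its distribution around this response relative to the SFT model, the larger the deduction.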

