A complete technical walkthrough of Reinforcement Learning from Human Feedback - the three-phase pipeline, reward models, PPO, KL penalty, and the limitations that led to newer approaches.

RLHF Deep Dive

Reading time: 30 min | Relevance: AI Engineer, Research Engineer, ML Engineer

The Night InstructGPT Changed Everything

January 2022. OpenAI releases a paper: "Training language models to follow instructions with human feedback" (Ouyang et al.). The model family is called InstructGPT. Its smallest variant has only 1.3 billion parameters, yet in human evaluations labelers prefer its outputs to those of the 175-billion-parameter GPT-3. At matched size the gap is wider still: the 175B InstructGPT model is preferred over 175B GPT-3 about 85% of the time. A model more than 100× smaller is beating the flagship. The ML community's initial reaction: this doesn't make sense. The explanation is RLHF, and it changes how everyone thinks about post-training.

Before InstructGPT, the dominant paradigm was simple: train a language model on next-token prediction, then fine-tune it with supervised examples of the behavior you want. If you want a helpful assistant, fine-tune on a dataset of helpful assistant conversations. This works. But it has a critical failure mode: the model learns to imitate the format and style of helpful responses, not to understand what makes a response genuinely helpful. It learns to say things that look like good answers, rather than things that are good answers.

RLHF breaks this limitation. Instead of training on a fixed dataset of "correct" responses, RLHF trains a reward model that predicts which of two responses a human would prefer - and then uses reinforcement learning to generate responses that maximize this predicted reward. The model learns to optimize for human preference, not to imitate human-generated text. This is a subtle but profound difference. Imitation learning generalizes poorly to novel inputs; preference optimization generalizes much better because the reward model has learned the underlying structure of human preferences, not just their surface-level manifestation in a particular dataset.

InstructGPT was the proof of concept. ChatGPT, Claude, Gemini - all use variations of this approach. Understanding RLHF is not optional for anyone building or deploying language models in 2024. This lesson goes deep: the full three-phase pipeline, the math behind reward models, why PPO was chosen (and why it's being replaced), and the failure modes that motivated newer approaches like DPO.

Why This Exists - The Failure of Pure Imitation

Pure supervised fine-tuning (SFT) on a dataset of human-written examples has a fundamental problem: distribution mismatch. During training, the model sees human-written completions and learns to predict them. During inference, the model generates its own completions - and those completions are different from what it was trained on. Every token the model generates pushes it slightly further from the training distribution.

This is sometimes called the "exposure bias" problem. The model was never trained to generate good responses given its own previous tokens as context. It was only trained to generate good responses given human-written previous tokens. A small error early in generation compounds: the model generates a slightly off-distribution token, which makes the next token more off-distribution, until by the end of a long response it's in a region of space it has never been trained on.

RLHF addresses this directly. In the RL phase, the model generates its own completions and receives reward based on those completions. The model is trained exactly on the distribution it will encounter at inference: its own output. There is no distribution mismatch.

Additionally, RLHF can optimize for outcomes that are hard to express in training data. Helpfulness has many dimensions - accuracy, clarity, completeness, tone, safety - and these dimensions interact in complex ways that vary by context. It's nearly impossible to construct a training dataset that captures all these dimensions correctly across all contexts. But a reward model trained on pairwise human preferences learns these interactions implicitly, without them needing to be specified explicitly.

Historical Context

2017 - Paul Christiano, Jan Leike, Tom Brown, et al. publish "Deep reinforcement learning from human preferences" at NeurIPS. The core idea: train reward models from pairwise human preferences, then use RL to optimize the policy against those rewards. Demonstrated on Atari games and robotic locomotion.

2019 - Ziegler et al. at OpenAI publish "Fine-Tuning Language Models from Human Preferences", scaling the approach to language tasks: RLHF is applied to stylistic text continuation and summarization with GPT-2.

2020 - Stiennon et al. at OpenAI publish "Learning to summarize from human feedback", applying RLHF to GPT-3-style models for summarization. The results are strong but the technique remains research-oriented.

2022 - Ouyang et al. publish the InstructGPT paper: "Training language models to follow instructions with human feedback." This is the watershed moment - RLHF applied to a general instruction-following task at scale, with the 1.3B model beating the more than 100× larger GPT-3 in human evaluation.

2022 - Anthropic publishes "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" (Bai et al.), applying RLHF under the HHH (Helpful, Harmless, Honest) framework. A follow-up paper later that year introduces Constitutional AI (Lesson 03).

2023 - Rafailov et al. publish "Direct Preference Optimization" (DPO), which bypasses the need for a separate reward model and RL altogether. DPO becomes the dominant approach for preference optimization in many settings (Lesson 04).

The Three-Phase Pipeline

RLHF consists of three sequential phases. Each phase depends on the previous one, and failures in earlier phases compound through the pipeline.

Phase 1: Supervised Fine-Tuning (SFT)

The SFT phase creates a starting point for RLHF. You take a pre-trained language model and fine-tune it on a curated dataset of (prompt, response) pairs. These pairs are high-quality demonstrations of the target behavior - typically written or heavily edited by trained labelers.

The goal of SFT is not to create the final aligned model. It's to create a model that:

Understands the format of the target task (following instructions, answering questions, etc.)
Is close enough to the target distribution that human raters can meaningfully compare its outputs
Provides a stable starting point for RL fine-tuning (raw pre-trained models are too unstable for direct RL)

InstructGPT used ~13,000 demonstration samples for SFT. The quality of these demonstrations matters enormously. Labelers in the InstructGPT study were a team of about 40 contractors, selected through a screening test and then trained and calibrated to ensure consistent output quality.

# Phase 1: SFT is standard fine-tuning
# No special machinery needed beyond a well-curated dataset

from transformers import AutoTokenizer, AutoModelForCausalLM
from torch.optim import AdamW

def sft_loss(model, batch):
    """Standard next-token prediction on demonstration data."""
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    labels = batch["labels"]  # -100 for prompt tokens, token_id for response tokens

    outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        labels=labels
    )
    return outputs.loss  # Cross-entropy on response tokens only
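
The `labels` convention in the comment above (-100 for prompt tokens) can be made concrete with a small sketch. The helper name `build_sft_labels` and the toy token ids are hypothetical; the only fact relied on is that -100 is the default `ignore_index` of PyTorch's cross-entropy loss. The next-token shift that causal LM heads apply internally is left out to keep the example short.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # default ignore_index of torch.nn.functional.cross_entropy


def build_sft_labels(prompt_ids: list[int], response_ids: list[int]) -> dict:
    """Concatenate prompt and response; mask prompt positions in the labels."""
    input_ids = torch.tensor(prompt_ids + response_ids)
    labels = input_ids.clone()
    labels[: len(prompt_ids)] = IGNORE_INDEX  # no loss on prompt tokens
    return {"input_ids": input_ids, "labels": labels}


example = build_sft_labels(prompt_ids=[11, 12, 13], response_ids=[21, 22])

# Toy check: masked positions contribute nothing to the loss.
torch.manual_seed(0)
vocab_size = 32
logits = torch.randn(5, vocab_size)  # stand-in for model outputs, one row per position
masked_loss = F.cross_entropy(logits, example["labels"], ignore_index=IGNORE_INDEX)
response_only_loss = F.cross_entropy(logits[3:], example["labels"][3:])
```

Here `masked_loss` and `response_only_loss` coincide, which is the sense in which the SFT loss is "cross-entropy on response tokens only".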

Phase 2: Reward Model Training

The reward model is the heart of RLHF. It's a function that takes a (prompt, response) pair and outputs a scalar reward - an estimate of how much a human would prefer this response.

Data collection: Human labelers are shown a prompt and two or more responses generated by the SFT model. They rank the responses by quality (or select the preferred one). This produces a dataset of pairwise preferences: $(x, y_w, y_l)$ where $x$ is the prompt, $y_w$ is the preferred (winning) response, and $y_l$ is the dispreferred (losing) response.
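
When labelers rank more than two responses, each ranking expands into several $(x, y_w, y_l)$ triples. A minimal sketch of that expansion, assuming the ranking is given best-first (the function name `ranking_to_pairs` and the placeholder responses are illustrative, not from any library):

```python
from itertools import combinations


def ranking_to_pairs(prompt: str, ranked_responses: list[str]) -> list[tuple[str, str, str]]:
    """ranked_responses is ordered best-first; returns (x, y_w, y_l) triples."""
    # combinations() preserves input order, so the first element of each
    # pair is always the higher-ranked (winning) response.
    return [
        (prompt, winner, loser)
        for winner, loser in combinations(ranked_responses, 2)
    ]


pairs = ranking_to_pairs("Explain KL divergence.", ["A", "B", "C", "D"])
for x, y_w, y_l in pairs:
    print(f"{y_w} preferred over {y_l}")
```

A ranking of $K$ responses yields $K(K-1)/2$ pairs, so four ranked responses give six comparisons. Pairs from the same ranking are strongly correlated, which is worth keeping in mind when batching them for training.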

The Bradley-Terry preference model: The reward model is trained using the Bradley-Terry model of pairwise comparison. The Bradley-Terry model says that the probability of preferring response $y_w$ over $y_l$ given a prompt $x$ is:

$P(y_w \succ y_l \mid x) = \sigma(r(x, y_w) - r(x, y_l))$

where $\sigma$ is the sigmoid function and $r(x, y)$ is the reward model's output. Taking the negative log-likelihood of the observed preferences over the dataset $D$ gives the training loss:

$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \log \sigma(r(x, y_w) - r(x, y_l)) \right]$

This is simply a binary cross-entropy loss on the difference in rewards. The model is pushed to assign higher reward to preferred responses and lower reward to dispreferred responses.
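
To get a feel for the numbers, the following sketch evaluates the Bradley-Terry probability and the per-pair loss for a few reward gaps. It uses only the two formulas above; the function names are illustrative.

```python
import math


def preference_probability(r_w: float, r_l: float) -> float:
    """Bradley-Terry: P(y_w preferred over y_l) = sigmoid(r_w - r_l)."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))


def bt_loss(r_w: float, r_l: float) -> float:
    """Per-pair reward model loss: -log sigmoid(r_w - r_l)."""
    return -math.log(preference_probability(r_w, r_l))


for gap in (-2.0, 0.0, 2.0):
    p = preference_probability(gap, 0.0)
    print(f"gap={gap:+.1f}  P(preferred)={p:.3f}  loss={bt_loss(gap, 0.0):.3f}")
```

A gap of zero gives probability 0.5 and loss $\log 2 \approx 0.693$; a positive gap lowers the loss and a negative gap raises it. Only the difference $r(x, y_w) - r(x, y_l)$ enters, so adding the same constant to both rewards leaves the loss unchanged.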

Architecture: The reward model is typically the SFT model with the language model head replaced by a scalar output head. The base transformer processes the (prompt, response) pair and the scalar head predicts the reward from the final token's representation.

import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    """
    Reward model for RLHF.
    Takes (prompt, response) and outputs a scalar reward.
    """

    def __init__(self, base_model_name: str):
        super().__init__()
        self.base = AutoModel.from_pretrained(base_model_name)
        hidden_size = self.base.config.hidden_size
        self.reward_head = nn.Linear(hidden_size, 1)

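    def forward(self, input_ids, attention_mask):
        # Get hidden states from transformer
        outputs = self.base(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        hidden = outputs.last_hidden_state  # Shape: (batch_size, seq_len, hidden)
        # Use the last non-padding token's representation as the reward signal
        # (for decoder-only models like GPT-2/LLaMA). Indexing with [:, -1, :]
        # would read a pad position whenever sequences are right-padded.
        last_idx = attention_mask.sum(dim=1).long() - 1  # Shape: (batch_size,)
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        last_hidden = hidden[batch_idx, last_idx]
        reward = self.reward_head(last_hidden).squeeze(-1)
        return reward  # Shape: (batch_size,)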


def reward_model_loss(
    reward_model: RewardModel,
    chosen_input_ids: torch.Tensor,
    chosen_mask: torch.Tensor,
    rejected_input_ids: torch.Tensor,
    rejected_mask: torch.Tensor
) -> torch.Tensor:
    """
    Bradley-Terry loss for reward model training.
    Pushes reward(chosen) > reward(rejected).
    """
    r_chosen = reward_model(chosen_input_ids, chosen_mask)
    r_rejected = reward_model(rejected_input_ids, rejected_mask)

    # -log(sigmoid(r_chosen - r_rejected))
    loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    return loss
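
A natural companion to the loss is preference accuracy on held-out comparisons: the fraction of pairs where the reward model scores the chosen response higher. The sketch below assumes the rewards have already been computed; the helper name and the scores are made up for illustration.

```python
import torch


def preference_accuracy(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> float:
    """Fraction of pairs where the chosen response gets the higher reward."""
    return (r_chosen > r_rejected).float().mean().item()


# Toy reward scores for four held-out comparison pairs (made-up numbers).
r_chosen = torch.tensor([1.2, 0.3, -0.5, 2.0])
r_rejected = torch.tensor([0.4, 0.9, -1.0, 1.5])

accuracy = preference_accuracy(r_chosen, r_rejected)
print(f"held-out preference accuracy: {accuracy:.2f}")  # 3 of 4 pairs ranked correctly
```

Because labelers themselves disagree on a sizeable share of comparisons, this number has a practical ceiling well below 100%; a validation accuracy that stops improving while training accuracy keeps rising is a sign of overfitting.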

Key considerations in reward model training:

Training on human preferences, not scores: Pairwise comparisons are more reliable than absolute ratings. Humans are much better at comparing two responses than at rating a single response on a scale.
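Normalization: The Bradley-Terry loss depends only on reward differences, so the offset of the learned reward is unconstrained. Reward model outputs are therefore often normalized (e.g., shifted so that a reference set scores zero on average, or whitened during RL) to prevent the RL phase from being sensitive to the absolute scale of rewards.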
Overfitting risk: Reward models trained on limited preference data can overfit. If the policy learns to exploit the reward model's errors, reward hacking occurs. Regularization and ensemble methods help.
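
The normalization point can be sketched as an affine rescaling fitted on a fixed calibration set of raw scores. `RewardNormalizer` is a hypothetical helper, and the choice of zero mean and unit standard deviation is one reasonable convention among several.

```python
import statistics


class RewardNormalizer:
    """Affine rescaling of raw reward model scores (hypothetical helper)."""

    def __init__(self, reference_scores: list[float]):
        self.mean = statistics.fmean(reference_scores)
        # Fall back to 1.0 if all reference scores are identical.
        self.std = statistics.pstdev(reference_scores) or 1.0

    def __call__(self, raw_score: float) -> float:
        return (raw_score - self.mean) / self.std


# Raw reward model scores on a fixed calibration set (made-up numbers).
normalizer = RewardNormalizer([3.0, 5.0, 7.0, 9.0])
normalized = [normalizer(score) for score in (3.0, 6.0, 9.0)]
print(normalized)
```

The rescaling is monotonic, so it never changes which of two responses the reward model prefers. What it does change is the size of the reward relative to the KL term, which is why $\beta$ and the normalization scheme need to be chosen together.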

Phase 3: RL Fine-Tuning with PPO

In Phase 3, the SFT model is fine-tuned using reinforcement learning to maximize the reward signal from the reward model. The key challenge: the policy must maximize reward while not drifting too far from the SFT model. If it drifts too far, it will produce responses the reward model wasn't trained to evaluate reliably - leading to reward hacking.

The objective function: The RL objective in InstructGPT (omitting the pretraining-mix term of its PPO-ptx variant) is:

$\mathcal{J}(\theta) = \mathbb{E}_{x \sim D, y \sim \pi_\theta(y|x)} \left[ r_\phi(x, y) - \beta \cdot \text{KL}(\pi_\theta(\cdot|x) \| \pi_{\text{SFT}}(\cdot|x)) \right]$

where:

$\pi_\theta$ is the current policy (the model being trained)
$\pi_{\text{SFT}}$ is the frozen SFT model (the reference policy)
$r_\phi$ is the reward model
$\beta$ is the KL penalty coefficient (typically 0.01–0.1)
The KL term penalizes the policy for diverging from the SFT model

The KL penalty term is the main safeguard against reward hacking. Without it, the policy quickly finds degenerate responses that maximize the reward model score without being genuinely good.
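
In practice the KL term is not computed exactly over all possible responses. A common approach is a Monte Carlo estimate from the sampled tokens: sum $\log \pi_\theta - \log \pi_{\text{SFT}}$ over the response. The sketch below assumes the logits are already aligned with the tokens they predict and that `mask` marks response tokens; both function names are illustrative.

```python
import torch
import torch.nn.functional as F


def token_log_probs(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Log-probability of each sampled token.

    logits: (batch, seq, vocab), tokens: (batch, seq) -> (batch, seq)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return log_probs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)


def sequence_kl_estimate(policy_logits, ref_logits, tokens, mask):
    """Sum over response tokens of log pi_theta - log pi_SFT. Shape: (batch,)"""
    diff = token_log_probs(policy_logits, tokens) - token_log_probs(ref_logits, tokens)
    return (diff * mask).sum(dim=-1)


torch.manual_seed(0)
batch, seq, vocab = 2, 4, 10
ref_logits = torch.randn(batch, seq, vocab)
policy_logits = ref_logits + 0.5 * torch.randn(batch, seq, vocab)
tokens = torch.randint(vocab, (batch, seq))
mask = torch.ones(batch, seq)  # 1 for response tokens, 0 for padding

kl_identical = sequence_kl_estimate(ref_logits, ref_logits, tokens, mask)
kl_drifted = sequence_kl_estimate(policy_logits, ref_logits, tokens, mask)
```

When policy and reference are identical the estimate is exactly zero. For a single sample it can be negative; it is only non-negative in expectation over responses drawn from the policy.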

Why PPO: Proximal Policy Optimization (Schulman et al. 2017) is used for the RL step because it constrains each gradient update to not change the policy too much (via a clipped objective). This is crucial for stability - unconstrained policy gradient methods are prone to catastrophic forgetting and policy collapse in the complex, high-dimensional action space of natural language.

The PPO objective clips the policy ratio $\rho_t = \pi_\theta(a_t|s_t) / \pi_{\theta_\text{old}}(a_t|s_t)$ :

$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left(\rho_t A_t, \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon) A_t\right) \right]$

where $A_t$ is the advantage estimate (how much better this action was than the baseline) and $\epsilon$ is typically 0.1–0.2.
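
A worked example makes the clipping behaviour easier to see. The sketch evaluates the per-token term inside the expectation for four ratio/advantage combinations, with $\epsilon = 0.2$; `clipped_surrogate` is an illustrative scalar version, not library code.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-token PPO objective: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


cases = [
    (1.5, +1.0),  # good action, already up-weighted past 1 + eps
    (0.5, -1.0),  # bad action, already down-weighted past 1 - eps
    (0.5, +1.0),  # good action that became less likely
    (1.5, -1.0),  # bad action that became more likely
]
for ratio, advantage in cases:
    value = clipped_surrogate(ratio, advantage)
    print(f"ratio={ratio}  advantage={advantage:+}  objective={value:+.2f}")
```

In the first two cases the objective is capped (at $1.2$ and $-0.8$), so it no longer depends on the ratio and the gradient vanishes: the policy gets no further credit for moving in a direction it has already moved. In the last two cases the unclipped term is the minimum, so the gradient still pushes the policy to correct a move in the wrong direction.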

import torch
import torch.nn.functional as F
from typing import Optional

def compute_ppo_loss(
    log_probs_new: torch.Tensor,   # Log probs under current policy
    log_probs_old: torch.Tensor,   # Log probs under old policy (frozen)
    advantages: torch.Tensor,       # Advantage estimates from value function
    clip_epsilon: float = 0.2,
    kl_penalty: float = 0.05,
    log_probs_ref: Optional[torch.Tensor] = None,  # Log probs under SFT reference policy
) -> tuple:  # (total_loss, policy_loss, kl_loss)
    """
    PPO loss for RLHF, with optional KL penalty against SFT reference.
    """
    # Policy ratio (importance sampling weight)
    ratio = torch.exp(log_probs_new - log_probs_old)

    # Clipped surrogate objective
    clip_adv = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon) * advantages
    policy_loss = -torch.min(ratio * advantages, clip_adv).mean()

    # KL penalty against SFT model (prevents reward hacking)
    kl_loss = 0.0
    if log_probs_ref is not None:
        kl = (log_probs_new - log_probs_ref).mean()  # Approximate KL
        kl_loss = kl_penalty * kl

    total_loss = policy_loss + kl_loss
    return total_loss, policy_loss, kl_loss

The Role of the KL Penalty

The KL penalty is the safety mechanism that prevents the policy from collapsing into reward-hacked degenerate responses. Without it, PPO quickly finds responses that score very high on the reward model but are clearly bad to human evaluators.

Why does this happen? The reward model is an imperfect approximation of human preferences. It was trained on a finite dataset of preference comparisons. The RL phase can discover responses that are unusual in ways the reward model's training didn't cover - and can't evaluate reliably. These responses might score high on the reward model because they're genuinely good, or because they're outside the reward model's training distribution in ways that produce unreliable high scores.

The KL penalty prevents the policy from going too far into uncharted territory. By keeping the policy close to the SFT model (which was trained to stay close to the human-generated training distribution), it limits the policy's ability to exploit the reward model's blind spots.

The $\beta$ hyperparameter controls the trade-off:

High $\beta$ : Policy stays very close to SFT model. Safe but limited improvement.
Low $\beta$ : Policy optimizes aggressively for reward. High risk of reward hacking.

InstructGPT used $\beta = 0.02$. A common safeguard is to monitor the KL divergence throughout training and stop (or raise $\beta$) if it exceeds a threshold.
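
The trade-off can be seen in closed form on a toy problem. For a fixed prompt, the policy that maximizes expected reward minus $\beta$ times the KL to $\pi_{\text{SFT}}$ is $\pi^*(y) \propto \pi_{\text{SFT}}(y) \exp(r(y)/\beta)$. The sketch below evaluates this over three candidate responses; the probabilities and rewards are made-up numbers, and real training only approximates this optimum.

```python
import math


def kl_regularized_optimum(
    sft_probs: list[float], rewards: list[float], beta: float
) -> list[float]:
    """pi*(y) proportional to pi_SFT(y) * exp(r(y) / beta) over a finite response set."""
    weights = [p * math.exp(r / beta) for p, r in zip(sft_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


sft_probs = [0.6, 0.3, 0.1]  # SFT policy over three candidate responses (toy numbers)
rewards = [0.0, 0.5, 1.0]    # reward model scores for the same responses

for beta in (10.0, 1.0, 0.1):
    probs = kl_regularized_optimum(sft_probs, rewards, beta)
    print(f"beta={beta:>4}: {[round(p, 3) for p in probs]}")
```

With a large $\beta$ the result stays close to the SFT distribution; with a small $\beta$ almost all probability mass moves to the highest-reward response, even though the SFT model considered it unlikely. If that high score is a reward model error, a small $\beta$ is exactly what lets the policy exploit it.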

The Labeler Workforce: Human Preference Data at Scale

RLHF requires substantial human feedback data. InstructGPT's reward model was trained on rankings for approximately 33,000 prompts. For each prompt, a trained labeler viewed between 4 and 9 model-generated responses and ranked them, which yields up to 36 pairwise comparisons per prompt.

Labeler requirements: InstructGPT labelers were selected for English proficiency, sensitivity to harmful content, and calibrated judgment. They went through extensive training and regular calibration sessions to ensure consistency.

Annotation guidelines: Labelers were given detailed rubrics. The three main criteria were:

Helpfulness: Does the response address the user's intent?
Harmlessness: Does the response avoid harmful, biased, or dangerous content?
Honesty: Is the response truthful and appropriately uncertain?

Inter-annotator agreement: A key challenge. Preferences are subjective, and different labelers disagree. InstructGPT measured inter-annotator agreement and found it was moderate (roughly 73% pairwise agreement among training labelers). The reward model essentially learns the average of labeler preferences, which means it captures the center of the preference distribution but not its tails.
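
Agreement between two labelers can be measured in a few lines. The sketch computes raw percent agreement and Cohen's kappa, which corrects for the agreement expected by chance; the label lists are toy data and the function names are illustrative.

```python
from collections import Counter


def percent_agreement(labels_a: list[str], labels_b: list[str]) -> float:
    """Fraction of comparisons on which two labelers picked the same response."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e).

    Undefined (division by zero) if both labelers always give the same single label.
    """
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (p_observed - p_expected) / (1.0 - p_expected)


# Which response ("A" or "B") each labeler preferred on ten comparisons (toy data).
labeler_1 = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "A"]
labeler_2 = ["A", "B", "B", "B", "A", "A", "A", "A", "B", "B"]

agreement = percent_agreement(labeler_1, labeler_2)
kappa = cohens_kappa(labeler_1, labeler_2)
print(f"agreement={agreement:.2f}  kappa={kappa:.2f}")
```

In this toy data the labelers agree on 7 of 10 comparisons, but since chance alone would produce 50% agreement here, kappa comes out at 0.4. Raw agreement around 70% therefore leaves considerable label noise for the reward model to absorb.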

Annotation bias: Different demographic groups, cultural backgrounds, and individuals with different political views produce systematically different preference ratings. This bias gets baked into the reward model and propagates into the final policy. This is one reason why RLHF-trained models can exhibit political biases, cultural blind spots, and differential treatment of different user populations.

RLHF Limitations

RLHF works well but has significant limitations that motivated newer approaches:

1. Reward model as a bottleneck: The quality of the aligned model is bounded by the quality of the reward model. If the reward model has blind spots, biases, or limited coverage, the policy will exploit them. Improving the reward model requires more human feedback data, which is expensive and slow.

2. Reward hacking despite KL penalty: The KL penalty limits but does not prevent reward hacking. The policy gradually discovers responses that score high on the reward model without being genuinely better. Over long training runs, reward model score continues to increase while true quality plateaus or declines (often called reward overoptimization).

3. PPO instability: PPO is notoriously difficult to tune for language model fine-tuning. It requires careful hyperparameter selection, and training is often unstable. The RL training loop is significantly more complex than standard fine-tuning.

4. Mode collapse: PPO can cause the policy to collapse to a narrow set of responses that reliably score high on the reward model, reducing the diversity of outputs. This is particularly problematic for creative tasks.

5. Annotation cost and bias: Human preference data is expensive to collect and encodes the biases of the specific annotator population. Scaling RLHF to more tasks and languages requires scaling the human labeling workforce.

6. Three-phase complexity: Managing three interconnected training phases is operationally complex. Failures in any phase compound. The SFT model must be good enough to produce outputs worth comparing. The reward model must be calibrated enough to provide useful signal. The RL phase must be tuned carefully enough to improve without hacking.

These limitations motivated Constitutional AI (Lesson 03), which replaces human feedback with AI feedback guided by explicit principles, and DPO (Lesson 04), which eliminates the separate reward model and RL phase entirely.

Full RLHF Training Pipeline in Code

Here's a simplified but complete outline of an RLHF training loop:

import torch
from dataclasses import dataclass
from typing import Optional

@dataclass
class RLHFConfig:
    sft_model_name: str = "meta-llama/Llama-2-7b-hf"
    reward_model_name: str = "llama-reward-model"
    beta: float = 0.02              # KL penalty coefficient
    ppo_epsilon: float = 0.2       # PPO clip parameter
    ppo_epochs: int = 4            # PPO update epochs per rollout
    batch_size: int = 64
    learning_rate: float = 1e-5
    max_new_tokens: int = 512
    temperature: float = 1.0


class RLHFTrainer:
    """
    Simplified RLHF training loop.
    In practice, use TRL (Transformer Reinforcement Learning) library.
    """

    def __init__(self, config: RLHFConfig):
        self.config = config
        # Phase 1 model - also serves as reference for KL
        self.sft_model = self._load_sft_model()
        self.ref_model = self._load_sft_model()  # Frozen copy
        self.ref_model.eval()
        for param in self.ref_model.parameters():
            param.requires_grad_(False)

        # Phase 2 model
        self.reward_model = self._load_reward_model()
        self.reward_model.eval()
        for param in self.reward_model.parameters():
            param.requires_grad_(False)

        # Phase 3: policy is initialized from SFT model
        self.policy = self.sft_model
        self.optimizer = torch.optim.AdamW(
            self.policy.parameters(),
            lr=config.learning_rate
        )

    def rollout(self, prompts: list[str]) -> dict:
        """Generate responses from current policy."""
        self.policy.eval()
        with torch.no_grad():
            responses = self.policy.generate(
                prompts,
                max_new_tokens=self.config.max_new_tokens,
                temperature=self.config.temperature,
                do_sample=True,
            )
        return responses

    def compute_rewards(self, prompts, responses) -> torch.Tensor:
        """
        Compute rewards for (prompt, response) pairs.
        Subtracts KL penalty from reward model score.
        """
        # Reward model score
        rm_scores = self.reward_model(prompts, responses)

        # KL divergence from reference model (per token, summed)
        with torch.no_grad():
            policy_log_probs = self.policy.log_probs(prompts, responses)
            ref_log_probs = self.ref_model.log_probs(prompts, responses)
        kl_penalty = policy_log_probs - ref_log_probs  # Shape: (batch, seq)
        kl_per_sample = kl_penalty.sum(dim=-1)         # Shape: (batch,)

        # Final reward: RM score minus KL penalty
        rewards = rm_scores - self.config.beta * kl_per_sample
        return rewards

    def ppo_update(self, prompts, responses, rewards):
        """
        Run PPO update on collected (prompt, response, reward) data.
        """
        self.policy.train()

        # Compute old log probs (frozen before update)
        with torch.no_grad():
            old_log_probs = self.policy.log_probs(prompts, responses)

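        # Advantage = reward - baseline (use mean reward as baseline).
        # unsqueeze(-1) lets the per-sample advantage broadcast over the
        # token dimension of the (batch, seq) log probs.
        baseline = rewards.mean()
        advantages = (rewards - baseline).unsqueeze(-1)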

        # PPO epochs
        for _ in range(self.config.ppo_epochs):
            new_log_probs = self.policy.log_probs(prompts, responses)
            loss, _, _ = compute_ppo_loss(
                log_probs_new=new_log_probs,
                log_probs_old=old_log_probs,
                advantages=advantages,
                clip_epsilon=self.config.ppo_epsilon,
            )
            self.optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(self.policy.parameters(), 1.0)
            self.optimizer.step()

    def _load_sft_model(self):
        # In practice: load from checkpoint
        raise NotImplementedError

    def _load_reward_model(self):
        raise NotImplementedError


# In practice, use the TRL library which handles all this complexity:
# from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
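
The outline above folds the KL penalty into one scalar reward per response. Token-level PPO implementations commonly spread it out instead: each response token receives $-\beta$ times its log-ratio, and the reward model score is added on the final token. A minimal sketch of that shaping, assuming right-padded batches and precomputed log-probs (`shaped_token_rewards` is a hypothetical helper):

```python
import torch


def shaped_token_rewards(
    rm_scores: torch.Tensor,         # (batch,) reward model score per response
    policy_log_probs: torch.Tensor,  # (batch, seq) log pi_theta of sampled tokens
    ref_log_probs: torch.Tensor,     # (batch, seq) log pi_SFT of the same tokens
    mask: torch.Tensor,              # (batch, seq) 1 for response tokens, 0 for padding
    beta: float = 0.02,
) -> torch.Tensor:
    """Per-token reward: -beta * log-ratio on every token, RM score on the last one."""
    rewards = -beta * (policy_log_probs - ref_log_probs) * mask
    last_idx = mask.sum(dim=-1).long() - 1  # index of the last response token
    batch_idx = torch.arange(rewards.size(0), device=rewards.device)
    rewards[batch_idx, last_idx] += rm_scores
    return rewards


rm_scores = torch.tensor([1.0, -0.5])
policy_lp = torch.tensor([[-1.0, -2.0, -1.5], [-0.5, -0.7, 0.0]])
ref_lp = torch.tensor([[-1.5, -2.0, -1.0], [-0.5, -1.7, 0.0]])
mask = torch.tensor([[1.0, 1.0, 1.0], [1.0, 1.0, 0.0]])

token_rewards = shaped_token_rewards(rm_scores, policy_lp, ref_lp, mask)
print(token_rewards)
```

Summing each row recovers the sequence-level quantity computed in `compute_rewards`, so the two formulations agree in total; the per-token version simply gives the value function a denser signal to learn from.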

Production Engineering Notes

Using TRL for RLHF in practice

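The trl (Transformer Reinforcement Learning) library by Hugging Face is the standard tool for RLHF in practice. It handles the PPO training loop, value head management, KL tracking, and distributed training. The configuration below follows the PPOConfig/PPOTrainer interface of older TRL releases; later releases reworked the PPO API, so check the documentation for your installed version.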

from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer

config = PPOConfig(
    model_name="meta-llama/Llama-2-7b-chat-hf",
    learning_rate=1.41e-5,
    batch_size=128,
    mini_batch_size=16,
    gradient_accumulation_steps=1,
    optimize_cuda_cache=True,
    early_stopping=True,
    target_kl=0.1,        # With early_stopping=True, cut the PPO update short if KL exceeds this
    ppo_epochs=4,
    seed=42,
    init_kl_coef=0.2,     # Initial KL coefficient (beta)
    adap_kl_ctrl=True,    # Adaptively adjust KL coefficient
)

# The TRL PPOTrainer wraps all the complexity
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
ppo_trainer = PPOTrainer(config, model, ref_model=None, tokenizer=tokenizer)

Monitoring RLHF training

Key metrics to track during Phase 3:

KL divergence from reference: Should stay below your target (typically 5–15 nats). If it explodes, reduce learning rate or increase $\beta$ .
Reward model score: Should increase steadily. If it keeps climbing while the win rate or response quality stalls, the policy may be hacking the reward model.
Response length: Should stay stable. Systematic length growth indicates length hacking.
Entropy: Should stay above a minimum threshold. If entropy collapses, mode collapse is occurring.
Win rate against SFT baseline: Evaluate periodically by having human raters (or a strong LLM judge) compare policy outputs to SFT outputs. This is your ground truth quality signal.
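
Two of these metrics, entropy and response length, are cheap to compute from tensors that the training loop already has. The sketch below assumes access to the policy's logits on the sampled responses and a mask over response tokens; both helper names are illustrative.

```python
import torch
import torch.nn.functional as F


def mean_token_entropy(logits: torch.Tensor, mask: torch.Tensor) -> float:
    """Average entropy (nats) of the next-token distribution over response tokens."""
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (batch, seq)
    return ((entropy * mask).sum() / mask.sum()).item()


def mean_response_length(mask: torch.Tensor) -> float:
    """Average number of response tokens per sample."""
    return mask.sum(dim=-1).float().mean().item()


vocab = 8
uniform_logits = torch.zeros(2, 3, vocab)  # maximally uncertain policy
peaked_logits = torch.zeros(2, 3, vocab)
peaked_logits[..., 0] = 20.0               # nearly deterministic policy
mask = torch.tensor([[1.0, 1.0, 1.0], [1.0, 1.0, 0.0]])

print(mean_token_entropy(uniform_logits, mask))  # close to log(8)
print(mean_token_entropy(peaked_logits, mask))   # close to 0
print(mean_response_length(mask))                # 2.5
```

Logging both per training step and comparing against their values at the start of Phase 3 makes entropy collapse and length drift visible long before they show up in human evaluation.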

Scaling RLHF

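RLHF is significantly more computationally expensive than SFT. The policy must be run in forward and backward pass during PPO updates, and PPO also trains a value function (either a separate model or a value head on the policy). The reference model must be run in forward pass (but not backward) for KL computation. The reward model must be run in forward pass for reward computation. On top of that, every PPO step requires autoregressive generation from the policy.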

For 7B parameter models, RLHF typically requires 8 A100 80GB GPUs. For 70B models, 64+ A100s. Memory optimization techniques critical for RLHF:

DeepSpeed ZeRO-3: Shard model parameters across GPUs
Gradient checkpointing: Trade compute for memory
Flash Attention 2: Reduce attention memory from $O(n^2)$ to $O(n)$
LoRA adapters: Train only a small subset of parameters (reduces memory and improves stability)
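
A back-of-the-envelope estimate shows why these techniques matter. The sketch below counts only weights, gradients and optimizer state. The byte counts are assumptions: roughly 16 bytes per trainable parameter for mixed-precision Adam and 2 bytes per frozen parameter in bf16. Activations and generation caches are ignored, and `rlhf_memory_gb` is a hypothetical helper.

```python
def rlhf_memory_gb(
    n_params: float,
    bytes_trainable: int = 16,  # assumption: bf16 weights + grads, fp32 master copy + Adam moments
    bytes_frozen: int = 2,      # assumption: bf16 weights only, no gradients
    separate_value_model: bool = True,
) -> dict:
    """Rough weight/optimizer memory in GB, ignoring activations and KV caches."""
    gb = 1e9
    estimate = {
        "policy (trainable)": n_params * bytes_trainable / gb,
        "reference (frozen)": n_params * bytes_frozen / gb,
        "reward model (frozen)": n_params * bytes_frozen / gb,
    }
    if separate_value_model:
        estimate["value model (trainable)"] = n_params * bytes_trainable / gb
    estimate["total"] = sum(estimate.values())
    return estimate


for name, gigabytes in rlhf_memory_gb(7e9).items():
    print(f"{name:>26}: {gigabytes:7.1f} GB")
```

Under these assumptions a 7B setup with a separate value model needs about 252 GB before any activations, which already exceeds three 80 GB GPUs. Sharing a value head with the policy, sharding with ZeRO-3, or training only LoRA adapters each attack a different term in this sum.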

Common Mistakes

:::danger Running Phase 3 without monitoring KL divergence

Without monitoring KL divergence from the reference model, PPO can rapidly reward-hack the reward model. Always set a KL target and stop training (or reduce learning rate) when it's exceeded. In the TRL configuration shown earlier, adap_kl_ctrl adjusts $\beta$ to steer KL toward a target, and early_stopping with target_kl cuts updates short when KL spikes. Neither replaces watching the KL curve yourself.

:::

:::danger Training the reward model to convergence on preference data

Overfitted reward models are particularly susceptible to exploitation. Stop reward model training early (use validation preference accuracy as the stopping criterion), and use regularization (weight decay, dropout). A slightly underfit reward model is safer than an overfit one.

:::

:::warning Ignoring inter-annotator agreement during labeling

If your labelers disagree frequently, your reward model is training on noisy, inconsistent signal. Before scaling labeling, measure inter-annotator agreement (Cohen's kappa or percent agreement) and calibrate labelers to improve consistency. Disagreements also reveal genuine ambiguity in your rubric - address the ambiguity in the guidelines.

:::

:::warning Treating RLHF as a one-shot process

In practice, RLHF requires multiple iterations. The first reward model reveals blind spots that require new preference data collection. The first RL training reveals failure modes that require reward model improvements. Budget for 3–5 iterations of the full pipeline, not 1.

:::

:::tip Use adaptive KL control

Rather than setting a fixed $\beta$ , use an adaptive KL controller that increases $\beta$ when KL is too high and decreases it when KL is too low. This maintains the policy in the "Goldilocks zone" - close enough to the SFT model to avoid reward hacking, far enough to make meaningful improvements. TRL implements this with adap_kl_ctrl=True.

:::
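
The adaptive controller described in the tip can be sketched in a few lines. This follows the proportional scheme from Ziegler et al. (2019), where $\beta$ is nudged by the clipped relative error between observed and target KL. The class below is an illustration with made-up parameter values; a library implementation may differ in detail.

```python
class AdaptiveKLController:
    """Proportional controller for the KL coefficient beta (illustrative sketch)."""

    def __init__(self, init_beta: float, target_kl: float, horizon: int):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon  # number of samples over which beta adapts

    def update(self, observed_kl: float, n_samples: int) -> float:
        # Relative error, clipped so one noisy batch cannot swing beta too far.
        error = max(-0.2, min(observed_kl / self.target_kl - 1.0, 0.2))
        self.beta *= 1.0 + error * n_samples / self.horizon
        return self.beta


controller = AdaptiveKLController(init_beta=0.2, target_kl=6.0, horizon=10_000)
beta_after_high_kl = controller.update(observed_kl=12.0, n_samples=256)  # KL above target
beta_after_low_kl = controller.update(observed_kl=3.0, n_samples=256)    # KL below target
print(beta_after_high_kl, beta_after_low_kl)
```

The first update raises $\beta$ because KL overshot the target, and the second lowers it again. The `horizon` sets how quickly the coefficient reacts: a short horizon tracks the target tightly but makes the effective objective change from batch to batch.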

Interview Q&A

Q1: Explain the RLHF pipeline end-to-end.

RLHF has three phases. First, Supervised Fine-Tuning (SFT): take a pre-trained language model and fine-tune it on a curated dataset of (prompt, response) demonstration pairs to get a reasonable starting point. Second, Reward Model training: collect pairwise preference data by having human labelers compare model responses and select the preferred one. Train a reward model on these preferences using the Bradley-Terry loss - $-\log \sigma(r_w - r_l)$ - where $r_w$ is the reward for the preferred response and $r_l$ for the rejected one. Third, RL Fine-Tuning: use PPO to optimize the SFT model to maximize the reward model's score, while penalizing divergence from the SFT model with a KL term. This KL term prevents reward hacking by keeping the policy close to the known-good SFT distribution.

Q2: What is the KL penalty and why is it necessary?

The KL penalty is a regularization term added to the RLHF objective that penalizes the policy for diverging too far from the SFT reference model. It's computed as $\beta \cdot \text{KL}(\pi_\theta \| \pi_{\text{SFT}})$ and subtracted from the reward.

It's necessary because the reward model is an imperfect approximation of human preferences trained on a finite dataset. Without the KL penalty, PPO quickly discovers responses that score high on the reward model by exploiting its blind spots - regions of response space the reward model wasn't trained to evaluate reliably. The KL penalty keeps the policy close to the SFT model (which was trained to stay in the human-generated distribution), limiting the policy's ability to exploit reward model weaknesses. The coefficient $\beta$ controls the strength of this regularization - too high and the policy barely improves, too low and reward hacking occurs.

Q3: Why use PPO instead of simpler policy gradient methods like REINFORCE?

Several reasons. PPO's clipped objective prevents excessively large policy updates by limiting the policy ratio $\rho_t$ to the range $[1-\epsilon, 1+\epsilon]$ . This is critical for stability in the high-dimensional action space of natural language. Simpler methods like REINFORCE take steps proportional to the gradient magnitude without any constraint on the policy change size, leading to catastrophic updates that can collapse the policy.

Additionally, PPO allows multiple gradient updates on the same batch of rollouts (multiple epochs), making it more sample-efficient than REINFORCE, which requires new rollouts for each update. For LLMs where generation is expensive, this sample efficiency is critical.

Q4: What are the main failure modes of RLHF?

Sycophancy - the policy learns to agree with user beliefs rather than give accurate responses, because agreeable responses get rated higher. Length hacking - the policy generates unnecessarily long responses because raters associate length with quality. Reward model hacking - the policy finds responses that score high on the reward model while being genuinely bad by exploiting the reward model's blind spots. Mode collapse - the policy converges to a narrow set of reliable high-reward responses, losing output diversity. Annotation bias - the reward model encodes the biases of the annotator population, which propagate into the policy.

Q5: How does RLHF scale with model size?

RLHF is significantly more expensive than SFT because at least three models must be held in memory simultaneously (policy, reference model for KL, reward model for scoring), plus a value function for PPO, and the policy must be run in both forward and backward pass during PPO updates. For 7B models, 8×A100 80GB is typical. For 70B models, 64+ A100s.

However, the effectiveness of RLHF scales well with model size. Larger models are better at following the reward signal because they have more capacity to model the nuanced structure of human preferences. InstructGPT showed that RLHF gains relative to SFT-only baselines are larger for bigger models. This suggests that RLHF will remain important as models scale, not just a technique for small models.

Q6: Why did DPO emerge as an alternative to RLHF, and in what settings is RLHF still preferred?

DPO (Direct Preference Optimization) was motivated by RLHF's operational complexity and the reward hacking problem. DPO shows that the RLHF objective has a closed-form optimal policy, meaning you can compute the preference optimization loss directly from preference data without training a separate reward model or running PPO. This eliminates the three-phase complexity, sidesteps the instability of the RL phase, and is significantly cheaper to run, although DPO can still overfit to its preference data.

RLHF is still preferred when: (1) you need to optimize for a specific, well-defined scalar reward that isn't easily expressed as pairwise preferences, (2) you want to do online RLHF (generating new rollouts and getting rewards during training), or (3) you need fine-grained control over the KL-reward trade-off across training. DPO is offline - it trains on a fixed preference dataset. RLHF can generate new rollouts from the current policy during training, which allows it to improve in a more targeted way.
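
For comparison with the Bradley-Terry reward model loss, here is a minimal sketch of the DPO loss from Rafailov et al. It assumes the summed log-probabilities of each response under the policy and the frozen reference have already been computed; the function name and the toy values are illustrative, and Lesson 04 covers the method properly.

```python
import math

import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logp: torch.Tensor,    # (batch,) log pi_theta(y_w | x)
    policy_rejected_logp: torch.Tensor,  # (batch,) log pi_theta(y_l | x)
    ref_chosen_logp: torch.Tensor,       # (batch,) log pi_ref(y_w | x)
    ref_rejected_logp: torch.Tensor,     # (batch,) log pi_ref(y_l | x)
    beta: float = 0.1,
) -> torch.Tensor:
    """-log sigmoid(beta * [(chosen log-ratio) - (rejected log-ratio)])."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()


ref_chosen = torch.tensor([-12.0, -8.0])
ref_rejected = torch.tensor([-11.0, -9.0])

# At initialization the policy equals the reference, so every log-ratio is zero.
loss_at_init = dpo_loss(ref_chosen, ref_rejected, ref_chosen, ref_rejected)

# Raising the chosen log-probs relative to the reference lowers the loss.
loss_after_shift = dpo_loss(ref_chosen + 2.0, ref_rejected, ref_chosen, ref_rejected)
print(loss_at_init.item(), loss_after_shift.item(), math.log(2.0))
```

The structure mirrors the reward model loss: $\beta$ times the policy-to-reference log-ratio plays the role of the reward, so the policy is trained as an implicit reward model without a sampling loop.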

Summary

RLHF is the foundational technique that transformed language models from next-token predictors into helpful assistants. The three-phase pipeline - SFT, Reward Model, RL Fine-Tuning - addresses the core limitations of pure imitation learning by training the model to optimize for human preferences rather than to imitate human text.

Key technical components:

Bradley-Terry model: Transforms pairwise preference comparisons into a trainable loss
PPO: Provides stable policy gradient updates with bounded policy changes
KL penalty: Limits reward hacking by keeping the policy close to the SFT baseline

Key limitations:

Reward model as a bottleneck (quality bounded by preference data quality)
Reward hacking despite KL penalty
PPO instability and operational complexity
Annotation cost and bias

These limitations motivated Constitutional AI (which replaces human feedback with AI feedback) and DPO (which eliminates the reward model and RL phase entirely), covered in the next two lessons.

