The complete RLHF pipeline: supervised fine-tuning, reward model training from human preferences, and PPO fine-tuning - the technique behind InstructGPT, ChatGPT, and Claude.

RL from Human Feedback - How ChatGPT Learned to Be Helpful

Reading time: ~45 minutes | Level: Reinforcement Learning | Role: MLE, AI Research Engineer, MLOps

The Real Engineering Moment

The year is 2021. GPT-3 is already a landmark achievement - 175 billion parameters, trained on roughly 300 billion tokens, capable of writing essays, code, and poetry. And yet it is, in practice, frustrating to use. Ask it to write a professional email and it might generate five irrelevant continuation sentences. Ask it to summarize a document and it might just keep writing more document. Ask it something sensitive and it may supply harmful information without hesitation, because that kind of text appears in its training data.

The problem is not capability. GPT-3 has enormous capability. The problem is alignment. The model has been trained to predict the next token on internet text. It has not been trained to be helpful, harmless, and honest. It has no concept of what the person asking actually wants. It is a very powerful autocomplete that has never been taught to be an assistant.

A team at OpenAI - led by Ryan Lowe, Jan Leike, and Long Ouyang - runs a different kind of experiment. They hire 40 human labelers. They collect about 13,000 demonstrations of labelers actually writing good responses to user prompts. They collect labeler rankings of model outputs for about 33,000 prompts, which expand into pairwise comparisons with one output marked as better. They train a reward model on those comparisons. They use PPO to fine-tune a 1.3 billion parameter GPT-3 to maximize the reward model while staying close to the original. The result, InstructGPT, gets higher human preference ratings than vanilla GPT-3 - which is more than 100 times larger. A 1.3B model outperforms a 175B model on the task that actually matters: being helpful.

The InstructGPT work is announced in January 2022. Ten months later, in November 2022, a closely related pipeline - with more data, more compute, a larger base model - becomes ChatGPT. RLHF is not just an ML technique at this point. It is the technique that made language models useful to hundreds of millions of people.

Understanding RLHF is now a requirement for anyone working on language model training, alignment, or deployment. This lesson covers the complete pipeline, the math behind each stage, the engineering decisions that matter in practice, and the failure modes you need to know about.

Why This Exists: The Gap Between Pretraining and Usefulness

Pretraining on internet text teaches a language model to be a very good distribution modeler. It learns to predict what text typically looks like. The problem is that "typical" internet text is not "helpful assistant" text.

What pretraining rewards:

Completing the statistical pattern of the training data
Writing text that looks like what followed similar text in the corpus
Hedging, rambling, or going off-topic - because that is what humans do online

What users actually want:

Concise, accurate, relevant responses
Safe outputs - no harmful or deceptive content
Following instructions precisely

The gap between "good at predicting text" and "good at being an assistant" is enormous. RLHF bridges this gap by directly teaching the model what "good" means through human feedback on actual model outputs.

An alternative framing: pretraining is self-supervised - the model sees everything and learns from everything, with no signal about which text is good. RLHF is the feedback stage where we tell the model which of its outputs are actually good. This is analogous to how humans learn - broad exposure to information first, then feedback on what actually worked.

Historical Context

Year	Paper	Key Contribution
2017	"Deep RL from Human Preferences" (Christiano et al.)	Original RLHF idea - applies to RL control tasks
2019	"Fine-Tuning LMs from Human Preferences" (Ziegler et al.)	First application to language model fine-tuning
2020	"Learning to Summarize from Human Feedback" (Stiennon et al.)	RLHF for summarization - clear quality gains over SFT
2022	InstructGPT (Ouyang et al.)	Full RLHF pipeline on GPT-3, instruction-following
2022	ChatGPT (OpenAI)	InstructGPT at production scale - 100M users in 2 months
2022	Constitutional AI (Bai et al., Anthropic)	AI-generated feedback instead of human comparisons
2023	Direct Preference Optimization (Rafailov et al.)	RLHF without RL - closed-form solution

The Three-Stage RLHF Pipeline

RLHF consists of three sequential training stages. Each stage builds on the previous one.

Stage 1: Supervised Fine-Tuning (SFT)

What It Is

Start with a pretrained LLM. Fine-tune it on a curated dataset of (prompt, high-quality response) pairs written by human labelers. Standard next-token-prediction loss:

$\mathcal{L}_{SFT} = -\sum_t \log \pi_\theta(y_t \mid x, y_{<t})$

This is identical to pretraining, but on demonstration data instead of raw internet text.
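To make the loss concrete, here is a minimal pure-Python sketch. The probabilities are made-up values, and masking out the prompt tokens so that only response tokens contribute is a common convention assumed here, not something the formula above dictates.

```python
import math

def sft_loss(token_probs, response_mask):
    """Summed negative log-likelihood over response tokens.

    token_probs[t]   : probability the model assigned to the correct token at position t
    response_mask[t] : 1 for response tokens (y_t), 0 for prompt tokens (x)
    """
    return -sum(math.log(p) for p, m in zip(token_probs, response_mask) if m)

# Hypothetical example: 2 prompt tokens followed by 3 response tokens.
probs = [0.9, 0.8, 0.5, 0.25, 0.5]
mask = [0, 0, 1, 1, 1]

loss = sft_loss(probs, mask)
print(f"SFT loss = {loss:.4f}")
```

Only the three response tokens enter the sum, so the loss here is $-\log(0.5 \cdot 0.25 \cdot 0.5) = \log 16$.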

Why This Matters

The SFT model serves two critical purposes in the RLHF pipeline:

Starting point for PPO: PPO fine-tunes the SFT model. A better SFT model means PPO starts from a better behavioral baseline - faster convergence, better final quality.
Reference model $\pi_\text{ref}$: the SFT model is frozen after Stage 1 and used throughout PPO training. The KL divergence penalty in Stage 3 measures how far the PPO policy has drifted from this reference, which limits reward hacking.

Data Collection

InstructGPT used 13,000 demonstration examples. Each example is:

A prompt drawn from a diverse distribution of user instructions (writing, summarization, Q&A, code, creative tasks, safety-sensitive requests)
A response written by a trained labeler following explicit quality rubrics

Labeler training is the critical investment. Labelers receive detailed rubrics: prefer concise and accurate responses, avoid harmful content, follow instructions precisely, be honest when uncertain. Inter-rater agreement (Cohen's kappa) is measured to ensure consistency.

:::note SFT Quality Sets the Ceiling
The SFT model defines the behavioral ceiling for RLHF. PPO can push the policy toward higher-quality responses within the distribution the SFT model covers, but it cannot invent capabilities the SFT model lacks entirely. If your SFT data quality is poor, no amount of RLHF will fix it.
:::

Stage 2: Reward Model Training

The Core Idea

We want to quantify "response quality" as a scalar. But quality is hard to define explicitly - it is far easier for humans to say "response A is better than response B" than to assign a numerical score. Pairwise comparison data is also more reliable: labelers are better calibrated on relative quality than absolute quality.

The reward model learns from pairwise comparisons. Given prompt $x$ and two responses $y_w$ (preferred/winner) and $y_l$ (rejected/loser), train a model to assign higher scores to preferred responses.

The Bradley-Terry Model

The Bradley-Terry model is the standard framework for pairwise preference learning. It models the probability that response $y_w$ is preferred over $y_l$ given prompt $x$:

$P(y_w \succ y_l \mid x) = \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)$

where:

$r_\phi(x, y)$: the reward model's scalar output for the (prompt, response) pair
$\sigma$: sigmoid function - maps the score difference to a probability
The model says: the larger the gap between the two reward scores, the more confidently we predict $y_w$ wins

Training loss - cross-entropy on comparison labels:

$\mathcal{L}_{RM} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\!\left[\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$

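Minimizing this loss pushes $r_\phi(x, y_w)$ up and $r_\phi(x, y_l)$ down relative to each other. The gradient with respect to the score gap is $-(1 - \sigma(r_\phi(x, y_w) - r_\phi(x, y_l)))$, so it shrinks toward zero once a pair is already ranked correctly with a wide margin. This removes most of the pressure to keep inflating reward values on easy pairs, although it does not strictly bound them.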
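A small numeric sketch of the loss and its gradient for a single comparison, using only the standard library. The score gaps are arbitrary illustrative values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bt_loss(r_w, r_l):
    """Bradley-Terry loss for one comparison: -log sigma(r_w - r_l)."""
    return -math.log(sigmoid(r_w - r_l))

def bt_grad_wrt_r_w(r_w, r_l):
    """d(loss)/d(r_w) = -(1 - sigma(r_w - r_l)); the gradient w.r.t. r_l is its negative."""
    return -(1.0 - sigmoid(r_w - r_l))

for gap in (-2.0, 0.0, 2.0, 6.0):
    print(
        f"gap={gap:+.1f}  P(w beats l)={sigmoid(gap):.3f}  "
        f"loss={bt_loss(gap, 0.0):.3f}  grad={bt_grad_wrt_r_w(gap, 0.0):+.3f}"
    )
```

A mis-ranked pair (negative gap) produces a large loss and a gradient close to -1, while a pair that is already well separated contributes almost nothing.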

Reward Model Architecture

The reward model uses the same transformer architecture as the LLM, initialized from the SFT model weights, with the final vocabulary head replaced by a scalar regression head:

Input: [prompt tokens | response tokens | <|endoftext|>]
       ↓
Transformer layers (shared with SFT model, fine-tuned)
       ↓
Hidden state at final token position  → shape: (batch, hidden_dim)
       ↓
Linear layer: hidden_dim → 1
       ↓
Scalar reward score r_φ(x, y)

Using the last token's hidden state is critical - it attends to the entire prompt-response sequence through the causal attention mechanism, giving the scalar head full context.

Collecting Comparison Data

For each prompt, multiple responses are generated (typically 4–9 completions). Labelers rank them by overall quality. Pairwise comparisons are extracted from rankings - ranking 4 responses gives $\binom{4}{2} = 6$ comparison pairs per prompt. This efficiently multiplies the number of training signal examples from each labeling session.
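The expansion from one ranking into pairs can be sketched with `itertools.combinations`. The function name and the best-first ordering of the input are assumptions made for this illustration.

```python
from itertools import combinations

def ranking_to_pairs(prompt, ranked_responses):
    """Expand a best-first ranking into (prompt, chosen, rejected) tuples.

    combinations() preserves input order, so the first element of every pair
    is always the higher-ranked response.
    """
    return [
        (prompt, better, worse)
        for better, worse in combinations(ranked_responses, 2)
    ]

pairs = ranking_to_pairs("What is Python?", ["A", "B", "C", "D"])
print(len(pairs))   # number of comparison pairs from one ranking of 4
print(pairs[0])     # the top two responses form the first pair
```

Pairs that come from the same prompt are strongly correlated. The InstructGPT paper reports processing all pairs from one prompt together as a single batch element, because treating them as independent examples led to overfitting.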

What labelers evaluate:

Instruction-following: does the response do what was asked?
Truthfulness: does the response contain false claims?
Harmlessness: does the response enable harm?
Coherence, clarity, appropriate length

InstructGPT scale: labeler rankings on roughly 33,000 prompts (each ranking expands into several comparison pairs), collected from contracted labelers with continuous inter-rater agreement monitoring.

Stage 3: PPO Fine-Tuning

The Objective

We want to maximize the reward model's score on generated responses, subject to staying close to the SFT reference model. The optimization problem:

$\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)}\!\left[r_\phi(x, y)\right] - \beta \cdot D_{KL}\!\left(\pi_\theta(\cdot|x) \,\|\, \pi_\text{ref}(\cdot|x)\right)$

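In practice, the KL term is estimated on the sampled response from per-token log-probability ratios, which gives the sequence-level training signal:

$R(x, y) = r_\phi(x, y) - \beta \sum_{t=1}^{T} \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_\text{ref}(y_t \mid x, y_{<t})}$

The combined reward $R(x, y)$ is the training signal for PPO. It is spread over the tokens of the response: each token $t$ receives its own KL penalty $-\beta \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_\text{ref}(y_t \mid x, y_{<t})}$, and the reward model score $r_\phi(x, y)$ is added at the final token only.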
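A minimal sketch of how the per-token rewards could be assembled from log-probabilities. The function name and all numeric values are hypothetical, and the log-probabilities are assumed to be those of the sampled tokens under each model.

```python
def per_token_rewards(rm_score, logp_policy, logp_ref, beta):
    """KL penalty at every token, with the reward model score added on the last one."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards

# Hypothetical log-probs of a sampled 4-token response under policy and reference.
logp_policy = [-1.0, -0.5, -2.0, -0.1]
logp_ref = [-1.2, -0.9, -2.0, -0.6]

rewards = per_token_rewards(
    rm_score=1.5, logp_policy=logp_policy, logp_ref=logp_ref, beta=0.1
)
total = sum(rewards)
print(f"final-token reward = {rewards[-1]:.3f}, sequence total R = {total:.3f}")
```

Summing the per-token rewards recovers the sequence-level $R(x, y)$: the log-ratios add up to 1.1, so the total is $1.5 - 0.1 \cdot 1.1$.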

Why the KL Penalty Is Essential: Reward Hacking

The reward model is a proxy for human preferences - an imperfect one. It was trained on a limited dataset of comparisons and generalizes well in distribution but can be exploited by out-of-distribution responses.

Without the KL penalty, PPO finds responses that maximize $r_\phi(x, y)$ regardless of actual quality. Classic reward hacking patterns:

Generating very long responses (if labelers slightly preferred longer answers, RM overweights length)
Generating confident-sounding but incorrect statements (if labelers preferred confident tone)
Generating nonsense outputs that exploit RM weaknesses the training data did not cover

The KL penalty prevents this by keeping $\pi_\theta$ close to $\pi_\text{ref}$ - the coherent, well-trained SFT model. The larger $\beta$ is, the stronger the regularization. Typical values: $\beta \in [0.02, 0.5]$.
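Rather than fixing $\beta$ by hand, it can be adjusted during training. The sketch below follows the proportional controller described by Ziegler et al. (2019), as I understand it: $\beta$ is nudged up when the measured KL exceeds a target and down when it falls below. The function name, the target of 6.0, the horizon, and the step size are illustrative assumptions.

```python
def update_beta(beta, observed_kl, target_kl, n_steps, horizon):
    """One proportional-control update of the KL coefficient.

    The relative error is clipped to [-0.2, 0.2] so a single noisy KL
    estimate cannot swing beta too far.
    """
    error = max(-0.2, min(0.2, observed_kl / target_kl - 1.0))
    return beta * (1.0 + error * n_steps / horizon)

beta = 0.2
for observed_kl in (12.0, 12.0, 3.0):
    beta = update_beta(beta, observed_kl, target_kl=6.0, n_steps=256, horizon=10_000)
    print(f"observed KL={observed_kl:5.1f} -> beta={beta:.5f}")
```

The controller only changes how strongly drift is penalized. It does not replace monitoring the KL divergence itself.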

:::tip Goodhart's Law in RLHF
"When a measure becomes a target, it ceases to be a good measure." - Goodhart's Law applied to RL.

The reward model is a measure of quality. Once PPO starts optimizing it, it becomes the target, and the optimization will inevitably find ways to score well on the measure without the underlying quality. The KL penalty is a structural defense against Goodhart's Law - it makes it expensive for the policy to exploit the RM.
:::

Token-Level PPO Formulation

The language model generates sequences token by token. Each token generation step is modeled as an RL action:

State: the prompt plus all tokens generated so far (context window)
Action: the next token - discrete, vocabulary size 50,000–100,000
Reward: 0 for all intermediate tokens (plus per-token KL penalty); $r_\phi(x, y)$ at the final token
Episode: one complete prompt-response pair

GAE advantage estimation is applied backwards through the token sequence. The value function (critic) is a separate model - typically another copy of the LLM backbone with a scalar head - that estimates $V(s_t) =$ expected future KL-penalized reward from the current token position.
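The backward pass can be sketched in a few lines of plain Python. The rewards and critic values below are made up, and the value after the final token is assumed to be zero because the episode ends there.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one response.

    rewards[t] : per-token reward (KL penalty, plus the RM score on the last token)
    values[t]  : critic estimate V(s_t)
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    # Advantage + value gives the regression target for the critic.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.8]
adv, ret = gae(rewards, values, gamma=1.0, lam=1.0)
print([round(a, 3) for a in adv], [round(r, 3) for r in ret])
```

With `lam=1.0` each advantage reduces to "total future reward minus the critic's estimate"; with `lam=0.0` it reduces to the one-step TD residual. Values in between trade bias against variance.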

Code: Toy RLHF Pipeline with trl

The following demonstrates the complete RLHF pipeline using HuggingFace's trl library.

"""
Complete RLHF pipeline demonstration using HuggingFace trl.
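pip install "trl>=0.9,<0.12" transformers datasets accelerate peft

Note: Stage 3 uses the legacy PPOTrainer API (ppo_trainer.generate / ppo_trainer.step),
which trl 0.12 replaced; SFTConfig requires trl >= 0.9.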
"""

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from datasets import Dataset

# ─────────────────────────────────────────────────────────────────────────────
# STAGE 1: Supervised Fine-Tuning (SFT) - data format and loss
# ─────────────────────────────────────────────────────────────────────────────

# SFT dataset: (prompt, response) pairs written by human labelers
SFT_EXAMPLES = [
    {
        "text": (
            "<|user|>Explain gradient descent in one paragraph.<|assistant|>"
            "Gradient descent is an iterative optimization algorithm that adjusts "
            "model parameters to minimize a loss function. At each step, we compute "
            "the gradient of the loss with respect to all parameters, then subtract "
            "a small fraction (the learning rate) of that gradient. Repeating this "
            "process moves parameters toward a local minimum of the loss surface."
        )
    },
    {
        "text": (
            "<|user|>What is a hash table?<|assistant|>"
            "A hash table is a data structure that maps keys to values using a hash "
            "function. The hash function converts a key into an array index, enabling "
            "O(1) average-case lookup, insertion, and deletion. Collisions (when two "
            "keys hash to the same index) are handled via chaining or open addressing."
        )
    },
]

def train_sft(model_name: str = "gpt2") -> AutoModelForCausalLM:
    """SFT training using standard cross-entropy on (prompt, response) pairs."""
    from trl import SFTTrainer, SFTConfig

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(model_name)

    sft_config = SFTConfig(
        output_dir="./sft_model",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-5,
        max_seq_length=512,
        dataset_text_field="text",
        logging_steps=10,
    )

    dataset = Dataset.from_list(SFT_EXAMPLES)
    trainer = SFTTrainer(
        model=model,
        args=sft_config,
        train_dataset=dataset,
        tokenizer=tokenizer,  # trl < 0.12; from 0.12 onward this argument is named processing_class
    )
    trainer.train()
    return model


# ─────────────────────────────────────────────────────────────────────────────
# STAGE 2: Reward Model Training
# ─────────────────────────────────────────────────────────────────────────────

class RewardModel(nn.Module):
    """
    Reward model: LLM backbone + scalar head.
    Initialized from SFT model weights.
    Maps (prompt, response) → scalar quality score.
    """

    def __init__(self, backbone_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModelForCausalLM.from_pretrained(backbone_name)
        hidden_dim = self.backbone.config.hidden_size
        # Scalar regression head on the final hidden state (the backbone's LM head is left unused)
        self.scalar_head = nn.Linear(hidden_dim, 1)
        nn.init.zeros_(self.scalar_head.bias)
        nn.init.normal_(self.scalar_head.weight, std=0.02)

    def forward(
        self,
        input_ids: torch.Tensor,
        attention_mask: torch.Tensor,
    ) -> torch.Tensor:
        """Returns scalar reward score for each item in the batch."""
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        # Last token hidden state captures full prompt+response context
        last_hidden = outputs.hidden_states[-1]  # (batch, seq_len, hidden_dim)
        # Find last non-padding token position for each sequence
        last_token_idx = attention_mask.sum(dim=1) - 1  # (batch,)
        last_token_hidden = last_hidden[
            torch.arange(last_hidden.size(0)), last_token_idx
        ]  # (batch, hidden_dim)
        return self.scalar_head(last_token_hidden).squeeze(-1)  # (batch,)


def bradley_terry_loss(
    r_chosen: torch.Tensor,
    r_rejected: torch.Tensor,
) -> torch.Tensor:
    """
    Bradley-Terry ranking loss.

    P(chosen > rejected) = σ(r_chosen - r_rejected)
    L = -E[log σ(r_chosen - r_rejected)]
    """
    return -nn.functional.logsigmoid(r_chosen - r_rejected).mean()


# Preference comparison dataset
PREFERENCE_DATA = [
    {
        "prompt": "What is Python?",
        "chosen": (
            "Python is a high-level, interpreted programming language known for "
            "its readable syntax and versatility. It supports multiple programming "
            "paradigms including object-oriented, functional, and procedural programming. "
            "Python is widely used in web development, data science, AI, and automation."
        ),
        "rejected": (
            "Python is a programming language. It is used for many things. "
            "People like Python because it is easy."
        ),
    },
    {
        "prompt": "Explain recursion.",
        "chosen": (
            "Recursion is a programming technique where a function calls itself "
            "to solve a problem by breaking it into smaller instances of the same "
            "problem. Every recursive function needs a base case (when to stop) and "
            "a recursive case (how to reduce the problem). Example: factorial(n) = "
            "n * factorial(n-1), with factorial(0) = 1 as the base case."
        ),
        "rejected": (
            "Recursion is when a function calls itself. It can cause infinite loops "
            "if not written correctly. You need to be careful with recursion."
        ),
    },
]


def train_reward_model(
    model_name: str = "gpt2",
    n_epochs: int = 5,
    lr: float = 1e-5,
) -> RewardModel:
    """Train reward model on pairwise human preference comparisons."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    reward_model = RewardModel(model_name)
    optimizer = torch.optim.AdamW(reward_model.parameters(), lr=lr, weight_decay=0.01)

    reward_model.train()
    for epoch in range(n_epochs):
        total_loss = 0.0
        for item in PREFERENCE_DATA:
            chosen_text = item["prompt"] + "\n" + item["chosen"]
            rejected_text = item["prompt"] + "\n" + item["rejected"]

            chosen_enc = tokenizer(
                chosen_text,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=256,
            )
            rejected_enc = tokenizer(
                rejected_text,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=256,
            )

            r_chosen = reward_model(**chosen_enc)
            r_rejected = reward_model(**rejected_enc)

            # Bradley-Terry loss: push r_chosen up, r_rejected down
            loss = bradley_terry_loss(r_chosen, r_rejected)

            optimizer.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(reward_model.parameters(), max_norm=1.0)
            optimizer.step()
            total_loss += loss.item()

        print(f"Epoch {epoch+1}/{n_epochs}: RM loss = {total_loss/len(PREFERENCE_DATA):.4f}")

    return reward_model


# ─────────────────────────────────────────────────────────────────────────────
# STAGE 3: PPO Fine-Tuning with KL Penalty
# ─────────────────────────────────────────────────────────────────────────────

def train_rlhf_ppo(
    sft_model_name: str = "gpt2",
    reward_model: RewardModel = None,
    n_steps: int = 100,
):
    """
    PPO fine-tuning against the reward model with KL penalty.
    Uses trl.PPOTrainer which handles:
    - Per-token KL penalty computation against reference model
    - GAE advantage estimation
    - Clipped PPO surrogate loss
    - Value function training
    """
    tokenizer = AutoTokenizer.from_pretrained(sft_model_name)
    tokenizer.pad_token = tokenizer.eos_token

    ppo_config = PPOConfig(
        # KL penalty coefficient β - higher = stronger regularization
        # Prevents policy from deviating too far from SFT model
        init_kl_coef=0.2,
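        # Adaptive KL: adjust β when the measured KL strays from this target.
        # In the legacy PPOConfig, `target` drives the controller; `target_kl` is an early-stopping threshold.
        target=6.0,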
        # PPO clipping and epochs
        cliprange=0.2,
        ppo_epochs=4,
        # Batch configuration - batch_size must equal the number of samples passed to step()
        batch_size=2,
        mini_batch_size=2,
        # Reward normalization - critical for stable training
        use_score_scaling=True,
        use_score_norm=True,
    )

    # Policy model = SFT model + value head for PPO critic
    # trl automatically creates a reference model (frozen copy of SFT model)
    model = AutoModelForCausalLMWithValueHead.from_pretrained(sft_model_name)

    ppo_trainer = PPOTrainer(
        config=ppo_config,
        model=model,
        tokenizer=tokenizer,
    )

    # Training prompts
    training_prompts = [
        "What is machine learning?",
        "Explain neural networks simply.",
        "How does backpropagation work?",
        "What is overfitting?",
        "Explain the attention mechanism.",
    ]

    reward_model.eval()

    print("Starting PPO fine-tuning with KL penalty...")
    for step in range(n_steps):
        # Sample a batch of prompts
        import random
        batch_prompts = random.sample(training_prompts, min(2, len(training_prompts)))

        queries = []
        responses = []
        rewards = []

        for prompt in batch_prompts:
            # Encode prompt
            input_ids = tokenizer.encode(prompt, return_tensors="pt")[0]
            queries.append(input_ids)

            # Generate response from current policy
            response_ids = ppo_trainer.generate(
                input_ids,
                max_new_tokens=64,
                do_sample=True,
                temperature=0.9,
                pad_token_id=tokenizer.eos_token_id,
            )
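            # generate() returns shape (1, prompt_len + new_tokens) for a single query:
            # drop the batch dimension, then keep only the newly generated tokens
            response_ids = response_ids.squeeze(0)[len(input_ids):]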
            responses.append(response_ids)

            # Score with reward model
            full_ids = torch.cat([input_ids, response_ids]).unsqueeze(0)
            attn_mask = torch.ones_like(full_ids)
            with torch.no_grad():
                reward = reward_model(full_ids, attn_mask)
            rewards.append(reward.squeeze())

        # PPO step: updates policy to maximize (reward_model_score - β·KL)
        # trl computes the per-token KL penalty against the frozen reference model
        stats = ppo_trainer.step(queries, responses, rewards)

        if step % 10 == 0:
            mean_reward = torch.stack(rewards).mean().item()
            kl = stats.get("objective/kl", float("nan"))
            print(f"Step {step:4d} | Mean reward: {mean_reward:.3f} | KL: {kl:.3f}")


# ─────────────────────────────────────────────────────────────────────────────
# Main: Run the full RLHF pipeline
# ─────────────────────────────────────────────────────────────────────────────

if __name__ == "__main__":
    print("=" * 60)
    print("RLHF Pipeline: Stage 1 - SFT")
    print("=" * 60)
    # In production: sft_model = train_sft("gpt2")
    print(f"SFT dataset: {len(SFT_EXAMPLES)} demonstration examples")
    print("SFT loss: -Σ log π(y_t | x, y_{<t})")

    print("\n" + "=" * 60)
    print("RLHF Pipeline: Stage 2 - Reward Model")
    print("=" * 60)
    reward_model = train_reward_model("gpt2", n_epochs=3)

    print("\n" + "=" * 60)
    print("RLHF Pipeline: Stage 3 - PPO Fine-Tuning")
    print("=" * 60)
    train_rlhf_ppo("gpt2", reward_model=reward_model, n_steps=20)

Production Engineering Notes

Compute Cost of RLHF

RLHF is significantly more expensive than standard fine-tuning. PPO training requires four models in GPU memory simultaneously:

Model	Purpose	Trainable
Policy $\pi_\theta$	Generates responses	Yes
Reference model $\pi_\text{ref}$	KL penalty computation	No (frozen)
Reward model $r_\phi$	Scores responses	No (frozen)
Value/critic model	Advantage estimation (GAE)	Yes

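For a 7B parameter model in bfloat16, each copy's weights take ~14GB, so four copies take ~56GB. The two trainable models also need gradients and optimizer states (Adam keeps two extra values per parameter), and every model needs activation memory during its forward pass. Even with activation checkpointing, you typically need 4–8 A100 80GB GPUs for the 7B case.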
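The arithmetic can be laid out explicitly. This is a rough lower-bound sketch under stated assumptions: bf16 weights and gradients (2 bytes each), two fp32 Adam moments per trainable parameter (8 bytes), no fp32 master weights, and no activations or KV caches. The function name and defaults are illustrative.

```python
GB = 1e9

def rlhf_memory_gb(n_params, weight_bytes=2, grad_bytes=2, adam_state_bytes=8,
                   n_trainable=2, n_frozen=2):
    """Rough lower bound on memory for PPO-style RLHF, ignoring activations."""
    weights = (n_trainable + n_frozen) * n_params * weight_bytes
    grads = n_trainable * n_params * grad_bytes
    adam = n_trainable * n_params * adam_state_bytes  # two fp32 moments per parameter
    return {
        "weights": weights / GB,
        "grads": grads / GB,
        "adam": adam / GB,
        "total": (weights + grads + adam) / GB,
    }

est = rlhf_memory_gb(7e9)
for name, gb in est.items():
    print(f"{name:8s} {gb:6.0f} GB")
```

Under these assumptions the optimizer states for the policy and critic dominate, which is why adapter-based fine-tuning and sharding are the first mitigations to reach for.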

Mitigation strategies:

LoRA/QLoRA: fine-tune only low-rank adapters (cuts trainable parameters by roughly 100× or more, and with them most gradient and optimizer-state memory)
DeepSpeed ZeRO-3: shard all model parameters and optimizer states across GPUs
vLLM for generation: use PagedAttention for efficient batched response generation
DPO as alternative: eliminates the RL loop entirely - 3–5× cheaper (covered in the next lesson)

Human Labeler Quality

The quality of comparison data is the most critical factor in RLHF success.

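Inter-rater agreement: different labelers will disagree on borderline cases. Measure Cohen's kappa continuously. If kappa falls below 0.5, the labeling guidelines are probably unclear. For reference, InstructGPT reported raw inter-labeler agreement of roughly 73% among its training labelers, which is a percentage agreement rather than a kappa value.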
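Cohen's kappa corrects raw agreement for the agreement two labelers would reach by chance. A minimal sketch for two labelers choosing between responses "A" and "B" on the same ten comparisons (the labels are invented for illustration):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same class independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1.0 - expected)

rater_1 = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "A"]
rater_2 = ["A", "B", "B", "B", "A", "B", "A", "A", "A", "A"]

kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.3f}")
```

Here the raters agree on 8 of 10 items (80% raw agreement), but chance agreement is 52%, so kappa is noticeably lower than the raw rate. The sketch does not handle the degenerate case where chance agreement equals 1.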

Labeler training time: expect 2–4 weeks before labelers become consistent. Early comparisons from new labelers are typically lower quality and should be weighted less or filtered.

Labeler diversity: biases in the labeler pool (geographic, demographic, educational background) appear in the reward model. A team of primarily US-based English speakers produces an RM biased toward US cultural norms and English-language quality signals.

Cost: expect $0.10–$1.00 per comparison pair including quality control overhead. At 33,000 pairs, that is $3,300–$33,000 just for data collection.

Reward Model Overoptimization

As PPO training progresses, the policy becomes increasingly optimized against the reward model. The RM score on training responses increases, but actual human preference quality peaks and then degrades. This is called overoptimization.

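Gao et al. (2023) studied this empirically, using a large "gold" reward model as a stand-in for human judgment. As the policy's KL divergence from the reference model grows, the proxy RM score keeps rising, while the gold score improves at first, peaks, and then deteriorates. True quality as a function of KL distance therefore follows an inverted-U curve, and where the peak falls depends on factors such as reward model size and the amount of comparison data.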

Diagnostics: monitor KL divergence throughout training - it should grow slowly and controllably. Periodically run human evaluations to detect divergence between RM score and actual quality.

Mitigations: larger $\beta$ (stronger KL penalty), ensemble of multiple reward models (harder to hack several simultaneously), iterative RLHF (periodically collect new comparison data and retrain the RM).
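One simple way to use an ensemble is to penalize disagreement among its members, so a response that only one reward model loves scores lower than one they all rate moderately well. The aggregation rule below (mean minus `k` standard deviations) is an illustrative choice, not a prescribed method, and the scores are invented.

```python
from statistics import mean, pstdev

def conservative_reward(scores, k=1.0):
    """Ensemble mean minus k times the spread across ensemble members."""
    return mean(scores) - k * pstdev(scores)

# Both responses have the same mean score of 1.0 across three reward models.
agree = conservative_reward([1.0, 1.1, 0.9])      # members agree
disagree = conservative_reward([3.0, 0.2, -0.2])  # one member is an outlier

print(f"agreeing ensemble:    {agree:.3f}")
print(f"disagreeing ensemble: {disagree:.3f}")
```

Taking the minimum over members is a stricter alternative with the same intent.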

Constitutional AI (Anthropic)

Constitutional AI (Bai et al., 2022) replaces human comparisons with AI-generated ones. In the original paper this applies to harmlessness; helpfulness comparisons still come from humans. The process:

Supervised stage (SL-CAI): the model critiques its own harmful outputs and revises them according to a "constitution" - a set of principles ("be honest", "avoid harm", "be helpful") - and is then fine-tuned on the revised responses.
RL from AI Feedback (RLAIF): instead of human labelers ranking response pairs, an AI feedback model prompted with principles from the constitution chooses the better response in each pair. These AI-generated labels train the preference model used for RL.

Advantages: scales cheaply (AI generates millions of comparisons), consistent (no inter-rater variance), transparent (the constitution is auditable and modifiable).

Limitations: the AI's preferences are constrained by what it already knows - cannot correct systematic blindspots in the pretrained model. Bootstrapping problem for weak base models.

Common Mistakes

:::danger Skipping the SFT stage
Attempting PPO directly from a pretrained language model (without SFT) almost always fails. The pretrained model's output distribution is too broad - it generates wildly diverse completions, and the reward model cannot provide a useful training signal. SFT anchors the policy to the task distribution before PPO begins.
:::

:::danger KL coefficient too small (β near 0)
Without the KL penalty, PPO exploits the reward model within a few hundred steps. The policy collapses to degenerate outputs - repetitive text, very long responses, confident-sounding nonsense - that score well on the RM but are useless. Start with β = 0.2 and monitor the actual KL divergence during training.
:::

:::warning Reward model trained on too little data
As a rough rule of thumb, a reward model trained on fewer than 5,000 comparisons is typically not reliable enough for PPO fine-tuning. It will have systematic biases that PPO will exploit. A practical minimum is around 10,000 comparisons for a 1B parameter model; production systems use on the order of 50,000–300,000.
:::

:::warning Not normalizing rewards before PPO
Raw reward model outputs can have arbitrary scale and offset depending on initialization. PPO is sensitive to reward scale - large rewards cause large gradient updates and instability. Normalize RM outputs to zero mean, unit variance using running statistics across the training batch.
:::
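Running statistics can be maintained with Welford's online algorithm, which avoids storing past rewards. The class below is a hypothetical helper written for illustration, not part of any library.

```python
class RunningRewardNormalizer:
    """Tracks a running mean and variance of rewards (Welford's algorithm)."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean
        self.eps = eps  # guards against division by zero

    def update(self, rewards):
        for r in rewards:
            self.count += 1
            delta = r - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (r - self.mean)

    def normalize(self, rewards):
        std = (self.m2 / max(self.count, 1)) ** 0.5
        return [(r - self.mean) / (std + self.eps) for r in rewards]

normalizer = RunningRewardNormalizer()
batch = [10.0, 12.0, 14.0]   # raw RM scores with an arbitrary offset and scale
normalizer.update(batch)
scaled = normalizer.normalize(batch)
print([round(s, 3) for s in scaled])
```

In a training loop, `update` would be called on every batch of raw RM scores before `normalize`, so the statistics reflect everything seen so far rather than a single batch.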

:::tip Iterative RLHF is the production standard
The best deployed systems do not run RLHF once and ship. They run iterative cycles: RLHF → collect new comparisons on the improved model → retrain RM on the new distribution → repeat. Each iteration improves both the policy and the reward model. Llama 2-Chat, for example, was trained over several such rounds.
:::

YouTube Resources

Video	Channel	Why Watch It
InstructGPT Paper Explained	Yannic Kilcher	Full InstructGPT paper walkthrough with math
RLHF Explained	Yannic Kilcher	High-level RLHF overview, excellent intuition
How ChatGPT Actually Works	Andrej Karpathy	"State of GPT" - best high-level RLHF explanation
Constitutional AI	Anthropic	CAI explanation from the research team

Interview Q&A

Q1: Walk me through the full RLHF pipeline from a pretrained model to a deployed assistant.

Answer: Three stages:

Stage 1 - SFT: fine-tune the pretrained LLM on 10,000–50,000 human-written (prompt, response) pairs with standard cross-entropy loss. This teaches the model what an assistant response looks like and creates $\pi_\text{ref}$ - the frozen reference for the KL penalty.

Stage 2 - Reward Model: collect pairwise human comparisons - for each prompt, generate multiple responses, have labelers rank them. Train a reward model using the Bradley-Terry loss: $-\log\sigma(r_\phi(x,y_w) - r_\phi(x,y_l))$. The RM learns to assign higher scores to preferred responses.

Stage 3 - PPO: initialize the policy from SFT. For each step: sample a prompt, generate a response with the policy, score with the frozen RM, compute the KL-penalized reward $R = r_\phi(x,y) - \beta \sum_t \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_\text{ref}(y_t \mid x, y_{<t})}$, run PPO updates. The KL penalty keeps the policy close to the coherent SFT model.

Q2: Explain the Bradley-Terry model and why it is used here.

Answer: The Bradley-Terry model is a probabilistic framework for pairwise comparisons. Given two items with scalar "strength" parameters, it models the probability that item $w$ beats item $l$ as $\sigma(r_w - r_l)$. For RLHF, $r_w = r_\phi(x, y_w)$ (preferred response) and $r_l = r_\phi(x, y_l)$ (rejected response). The training loss is the negative log-likelihood of the correct ranking.

It is used because: (1) humans can compare two responses more reliably than they can assign absolute scores, which also makes comparison data easier to collect, (2) it handles noisy or inconsistent labels through probabilistic modeling, and (3) it is statistically well understood.

Q3: What is reward hacking and how does the KL penalty prevent it?

Answer: Reward hacking occurs when the policy finds responses that score highly on the reward model but aren't actually good - exploiting weaknesses in the RM's generalization. Examples: very long responses, repetitive boilerplate text, confident-but-wrong claims, adversarial outputs that exploit RM training distribution gaps.

The KL penalty adds $-\beta \log(\pi_\theta / \pi_\text{ref})$ to the reward, penalizing the policy for deviating from the SFT reference model. Since $\pi_\text{ref}$ is a coherent language model, staying close to it keeps outputs within the range of reasonable text. The larger $\beta$ is, the harder it is to reward-hack, but also the harder it is to improve over SFT. Typical values are $\beta = 0.02$–$0.5$, chosen by monitoring KL divergence and running periodic human evaluations.
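To make the penalized reward concrete, here is a minimal sketch assuming the log-ratio is computed at the sequence level by summing per-token log-probabilities. The function name `kl_penalized_reward` and the default `beta=0.1` are choices made for this illustration; the default simply falls inside the typical range quoted above.

```python
from typing import Sequence


def kl_penalized_reward(
    rm_score: float,
    policy_logprobs: Sequence[float],
    ref_logprobs: Sequence[float],
    beta: float = 0.1,  # illustrative value, not a recommendation
) -> float:
    """Return R = r_phi(x, y) - beta * log(pi_theta(y|x) / pi_ref(y|x)).

    policy_logprobs and ref_logprobs hold the per-token log-probabilities that
    each model assigns to the same sampled response y.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("both models must score the same tokens")
    # The sequence log-ratio is the sum of the per-token log-ratios.
    log_ratio = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * log_ratio


if __name__ == "__main__":
    # Hypothetical numbers: the policy is more confident than the reference
    # on every token, so the penalty reduces the reward-model score.
    policy_lp = [-0.5, -0.2, -0.1]
    ref_lp = [-1.0, -0.9, -0.6]
    print(kl_penalized_reward(2.0, policy_lp, ref_lp, beta=0.1))
```

If the policy assigns exactly the same log-probabilities as the reference, the penalty vanishes and the reward equals the raw RM score. The further the policy drifts toward responses the reference finds unlikely, the more of the RM score is subtracted.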

Q4: What is Constitutional AI and how does it differ from standard RLHF?

Answer: Standard RLHF requires human labelers for comparisons, which is expensive and slow and introduces labeler biases. Constitutional AI (Anthropic, 2022) replaces human comparison data with AI-generated comparisons. The process: write a constitution (explicit principles like "avoid deception", "be helpful"), have the model critique and revise its own outputs against these principles (SL-CAI), then use an AI preference model prompted with the constitution to generate comparison rankings for RL training (RLAIF). CAI scales better, is more consistent, and the constitution is auditable. The limitation: the AI can only generalize its existing values; it cannot correct systematic model blindspots.

Q5: How would you collect high-quality comparison data for a domain-specific assistant?

Answer: Key considerations: (1) Labeler expertise - for a coding assistant, labelers must be software engineers who can evaluate code correctness; general-purpose labelers cannot. (2) Prompt distribution - prompts must match the actual production distribution. Collect real user queries if possible. (3) Evaluation rubric - define "better" explicitly per domain. For code: correctness (does it run?), efficiency, style, explanation quality. (4) Programmatic rewards - for code specifically, supplement human comparisons with execution-based rewards: run the code and check whether the tests pass. This dramatically reduces labeler burden. (5) Response diversity - generate responses at multiple temperatures so that pairs differ meaningfully; comparing two near-identical responses provides no training signal.
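The execution-based reward in point (4) could look roughly like the sketch below. It assumes each candidate response is Python source that defines a function with a known name; `execution_reward` and the test-case format are hypothetical, chosen for this example. It also calls `exec` directly, which is only acceptable for a toy demo: a real pipeline would run untrusted model output in a sandbox with time and memory limits.

```python
from typing import Any, Sequence, Tuple

TestCase = Tuple[tuple, Any]  # (positional args, expected return value)


def execution_reward(source: str, func_name: str, test_cases: Sequence[TestCase]) -> float:
    """Return the fraction of test cases a candidate solution passes (0.0 to 1.0)."""
    if not test_cases:
        return 0.0
    namespace: dict = {}
    try:
        # WARNING: exec on untrusted code is unsafe; sandbox this in practice.
        exec(source, namespace)
        func = namespace[func_name]
    except Exception:
        # Code that does not parse, or never defines the function, earns nothing.
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            # A crash on this input counts as a failed test.
            pass
    return passed / len(test_cases)


if __name__ == "__main__":
    tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
    correct = "def add(a, b):\n    return a + b\n"
    buggy = "def add(a, b):\n    return a - b\n"
    print(execution_reward(correct, "add", tests))
    print(execution_reward(buggy, "add", tests))
```

A score like this can be used directly as a reward or to pre-rank candidate pairs, so that labelers only adjudicate the cases where both responses pass the tests.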

Q6: What makes RLHF better than SFT alone?

Answer: SFT teaches the model to imitate demonstrations. The fundamental limit: labeler quality caps model quality. Labelers who lack the expertise cannot write a better explanation of quantum mechanics than GPT-4 can.

RLHF decouples generation from evaluation. Even a non-expert can reliably tell which of two responses is better (comparative evaluation), even if they cannot write the better response themselves. This allows the model to exceed demonstration quality: the model generates high-quality candidates, humans select the best ones, and the model learns from those selections. Concretely: InstructGPT's 1.3B RLHF model was preferred by human evaluators over the 175B GPT-3 (pretrained only, with no SFT or RLHF). The RLHF model learned behaviors that weren't in the SFT data - appropriate length, accurate instruction-following, safe refusals - by being steered through comparison feedback.

Key Takeaways

RLHF consists of three stages: SFT on demonstrations, reward model training from human comparisons, PPO fine-tuning against the reward model with a KL penalty
The Bradley-Terry model trains the reward model on pairwise preferences using the loss $\mathcal{L}_{RM} = -\mathbb{E}\left[\log\sigma(r_\phi(x,y_w) - r_\phi(x,y_l))\right]$
The KL penalty in the reward $R(x,y) = r_\phi(x,y) - \beta\log(\pi_\theta/\pi_\text{ref})$ is essential - without it, PPO exploits the imperfect reward model within hundreds of steps
RLHF decouples generation from evaluation, allowing models to exceed the quality of demonstration data
InstructGPT (1.3B + RLHF) outperforms GPT-3 (175B, pretrained only) in human preference - alignment matters as much as scale
Production RLHF requires 4 models in memory simultaneously and 3–5× the compute of SFT; DPO (next lesson) eliminates the RL loop for 3–5× cheaper training
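To get a feel for what "4 models in memory" means, here is a rough back-of-the-envelope sketch. It assumes the four models are the policy, the frozen reference, the reward model, and the value model, that all four have the same parameter count, and that weights are stored in 16-bit precision (2 bytes per parameter). It counts weights only; optimizer state, gradients, and activations come on top. The function name and the 7B figure are illustrative, not taken from the lesson.

```python
def rlhf_weight_memory_gb(
    n_params: float,
    bytes_per_param: int = 2,  # assumption: bf16/fp16 weights
    n_models: int = 4,  # policy, reference, reward model, value model
) -> float:
    """Estimate the memory needed just to hold the model weights, in GB (1e9 bytes)."""
    return n_models * n_params * bytes_per_param / 1e9


if __name__ == "__main__":
    # Hypothetical 7B-parameter setup where all four models are the same size.
    print(f"{rlhf_weight_memory_gb(7e9):.0f} GB of weights")
    # For comparison, a single copy of the same model.
    print(f"{rlhf_weight_memory_gb(7e9, n_models=1):.0f} GB for one copy")
```

In practice the reward and value models are often smaller than the policy, and the reference model can be offloaded or share weights with the policy through adapters, so this is an upper-end estimate for the equal-size case.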

