
RLHF and DPO for Open Models

The Production Scenario: When a Model That Knows Everything Still Says the Wrong Thing

Your team spent three months fine-tuning a 13B parameter open-source model on your company's internal knowledge base. The model scores impressively on retrieval benchmarks. It can recall product specifications, quote documentation verbatim, and answer technical questions that stumped your previous search system. QA passed it. The demo looked great. You shipped it.

Two weeks later, your support team is in your Slack. Users are complaining that the model's responses are technically accurate but weirdly unhelpful. When a frustrated customer asks "why does this keep breaking?", the model gives a four-paragraph explanation of the underlying system architecture - correct, thorough, and completely missing the emotional register of the question. When asked "what should I do first?", it lists every possible option alphabetically. It never says "I'm not sure" - it confidently hallucinates edge cases that don't exist. It has all the knowledge but none of the judgment.

This is the alignment problem in practice. You didn't train the model to be helpful in the human sense of the word. You trained it to predict the next token given previous tokens, and it's very good at that. But "helpful" means something more: it means reading what the user actually needs, ranking possible responses by how useful they are, choosing when to be concise versus detailed, and flagging uncertainty instead of papering over it. None of that comes from pre-training on text. It has to be taught separately.

The classic approach is Reinforcement Learning from Human Feedback - RLHF. You get humans to compare pairs of model responses, train a reward model on those comparisons, then use PPO to push the language model toward responses the reward model scores highly. This is how ChatGPT, Claude, and most modern aligned models were built. It works. But when you sit down to implement it for your open-source model, the engineering complexity hits you like a wall.

RLHF requires four separate models loaded in memory simultaneously during PPO training: the policy model you are updating, a frozen reference copy of that policy, the reward model, and a value head model. On a single A100 80GB GPU, running 7B parameter versions of all four is nearly impossible without extreme quantization. Training is unstable - reward hacking is a constant threat where the model learns to game the reward model rather than genuinely improve. You need careful KL-divergence penalties to prevent the policy from drifting too far. The feedback loop between reward signal and policy update creates subtle failure modes that take weeks to diagnose.

Then in 2023, Rafael Rafailov and colleagues at Stanford published a paper called "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." It changed how most practitioners think about alignment. DPO bypasses the reward model entirely, reformulates the RLHF objective as a supervised loss on preference pairs, and achieves comparable or better alignment quality with roughly half the GPU memory and none of the PPO instability. This lesson covers both approaches - because understanding why RLHF is hard is the fastest way to understand why DPO is brilliant.


Why This Exists: The Gap Between Knowledge and Judgment

Language models before alignment training are like a very well-read person who has never had a conversation. They have absorbed vast amounts of text, but text is not a dialogue. Text does not reward helpfulness or penalize arrogance. An encyclopedia entry does not become more prominent because a reader found it useful - it just exists. So a model trained purely on next-token prediction internalizes the statistical patterns of all human writing, including the frustrating, unhelpful, verbose, and wrong parts.

The early attempts to fix this were clumsy. You could add instructions to the system prompt: "be concise," "express uncertainty when unsure," "prioritize actionability." This helped but was brittle - the model had no deep reason to follow those instructions and would drift away from them when the context window filled up or when users pushed back. You could fine-tune on curated "good" examples, but curating "good" at scale is expensive, and more importantly, this approach gives the model examples of good behavior but never tells it what makes something good.

What you really want is for the model to internalize a preference ordering - a sense that response A is better than response B in a given context - and then to generalize that preference ordering to new contexts it has never seen. That is exactly what RLHF and DPO both do, using human (or AI-generated) comparison data as the supervision signal.


Historical Context: From Simple RLHF to the DPO Insight

The foundational RLHF work came from OpenAI: Christiano et al. (2017, "Deep Reinforcement Learning from Human Preferences") introduced learning a reward model from pairwise human comparisons, and Ziegler et al. (2019, "Fine-Tuning Language Models from Human Preferences") brought the recipe to language models, applied to relatively small models on narrow tasks. The key insight was that humans are much better at comparing two outputs than at writing the "correct" output from scratch - so collect pairwise preferences, train a reward model on them, then use RL to optimize the generative model against the reward model.

InstructGPT (Ouyang et al., 2022) scaled this to GPT-3. The pipeline became: supervised fine-tuning on demonstrations (SFT), reward model training on human comparisons, then PPO to optimize against the reward model. InstructGPT showed that a 1.3B parameter RLHF-tuned model was preferred by humans over the 175B GPT-3 base model on helpfulness tasks. That result shocked the field - alignment mattered more than raw scale.

The "aha moment" for DPO came from a mathematical observation. Rafailov et al. (2023) noticed that the optimal policy under the RLHF objective has an analytical form. If you know the optimal policy, you can express the implicit reward in terms of the ratio of the optimal policy to the reference policy. That means you can write the reward model training objective directly in terms of the policy model - no separate reward model needed. The training signal flows directly from preference pairs to policy updates, collapsing a three-stage pipeline into a single supervised training run. The paper appeared in July 2023 and within months DPO had become the default alignment method for most open-source model releases.


Core Concepts: Understanding RLHF Before You Can Appreciate DPO

The RLHF Pipeline in Full

RLHF has three stages. Understanding each one matters because DPO's elegance only makes sense against the backdrop of what RLHF is trying to do.

Stage 1: Supervised Fine-Tuning (SFT)

Start with a pre-trained base model and fine-tune it on high-quality (prompt, response) pairs. This teaches the model the format and register of helpful responses. Without SFT, the base model does not know how to follow instructions - it will continue your prompt rather than respond to it. The SFT stage is also what most people call "instruction tuning."

Stage 2: Reward Model Training

Collect human preference data: for each prompt, generate two responses ($y_w$ = chosen/winner and $y_l$ = rejected/loser), then have humans say which is better. Train a separate model (typically the SFT model with its language modeling head replaced by a scalar output head) to predict these preferences using the Bradley-Terry model:

$$P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$$

where $r_\phi(x, y)$ is the learned reward for prompt $x$ and response $y$, and $\sigma$ is the sigmoid function. The training loss is:

$$\mathcal{L}_{RM} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big) \right]$$
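
In PyTorch, this loss is a couple of lines once the reward model has produced scalar scores for a batch of pairs. A minimal sketch (the score tensors here are illustrative stand-ins for reward model outputs):

import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    # logsigmoid is numerically stabler than torch.log(torch.sigmoid(...))
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Illustrative reward scores for a batch of 4 preference pairs
chosen = torch.tensor([1.2, 0.8, 2.0, 0.3])
rejected = torch.tensor([0.5, 1.1, 0.0, -0.2])
print(reward_model_loss(chosen, rejected))  # ≈ 0.46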

Stage 3: PPO Fine-Tuning

Use Proximal Policy Optimization to update the language model (the "policy") to maximize expected reward while staying close to the SFT model (the "reference policy"). The maximized objective is:

$$\mathcal{L}_{PPO} = \mathbb{E}_{(x,y) \sim \pi_\theta} \left[ r_\phi(x, y) - \beta \cdot \text{KL}\left[\pi_\theta(y|x) \,\|\, \pi_{ref}(y|x)\right] \right]$$

The KL term is critical. Without it, the policy quickly learns to generate responses that game the reward model - responses that look superficially good but are qualitatively terrible. The $\beta$ hyperparameter controls this trade-off: too low and you get reward hacking, too high and the model barely moves from SFT.

The PPO stage requires all four models in memory: the policy $\pi_\theta$ (being updated), the reference policy $\pi_{ref}$ (frozen SFT model), the reward model $r_\phi$, and a value model $V_\psi$ used to estimate the baseline for variance reduction. For a 7B parameter model, that means roughly $4 \times 14\,\text{GB} = 56\,\text{GB}$ just for model weights in fp16, before activations or optimizer states.

DPO: The Reward Model Is Implicit

DPO starts from the same RLHF objective but derives it differently. The key insight is that if you solve the KL-constrained RL problem analytically, the optimal policy has this closed form:

$$\pi^*(y|x) = \frac{1}{Z(x)} \pi_{ref}(y|x) \exp\left(\frac{1}{\beta} r(x,y)\right)$$

where $Z(x)$ is a normalizing partition function. Rearranging this to solve for the reward:

$$r(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + \beta \log Z(x)$$

Now substitute this into the reward model training objective (the Bradley-Terry loss). The $\beta \log Z(x)$ terms cancel because they are the same for both $y_w$ and $y_l$. What remains is the DPO loss:

$$\mathcal{L}_{DPO}(\pi_\theta; \pi_{ref}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right) \right]$$

This is just a binary cross-entropy loss over (prompt, chosen response, rejected response) triplets. No reward model, no PPO loop, no value function, no four-model memory requirement. You need two models: the policy you are training and the frozen reference. Training is stable because it is supervised. The reference model's log probabilities are precomputed or computed in a forward pass, not through a complicated RL loop.

The intuition: DPO increases the log-probability of chosen responses relative to the reference while decreasing the log-probability of rejected responses. The $\beta$ parameter still controls how far the policy can deviate from the reference - low $\beta$ means aggressive optimization, high $\beta$ means conservative updates.
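
To make the mechanics concrete, here is a from-scratch sketch of the loss (not TRL's implementation), operating on summed per-response log probabilities like those computed in the code examples later in this lesson:

import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # sum of response-token log probs under the policy
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Binary cross-entropy over implicit reward margins (Rafailov et al. 2023)."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # beta * (chosen log-ratio - rejected log-ratio) is the implicit reward margin
    margins = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margins).mean()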

Beyond DPO: IPO, KTO, and ORPO

The DPO paper inspired a small explosion of variants that address its limitations.

IPO (Identity Preference Optimization, Azar et al. 2023) noticed that DPO can overfit if preference data is deterministic - i.e., if $y_w$ is always labeled strictly better than $y_l$ rather than better with some probability. IPO adds a regularization term that prevents the log-ratio from becoming arbitrarily large.

KTO (Kahneman-Tversky Optimization, Ethayarajh et al. 2024) works with unpaired binary feedback - a thumbs up or thumbs down on individual responses without needing a rejected counterpart for every chosen example. This is useful when your feedback data does not come in matched pairs. KTO is inspired by prospect theory: humans are more sensitive to losses than gains, and the loss function reflects this asymmetry.
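
TRL ships a KTOTrainer for this setting. A sketch of the unpaired data format and setup, assuming the column names ("prompt", "completion", boolean "label") that recent TRL versions expect - check the docs for your TRL version:

from datasets import Dataset
from trl import KTOConfig, KTOTrainer

# Unpaired feedback: each row is one response with a binary quality label
kto_data = Dataset.from_list([
    {"prompt": "What should I do first?", "completion": "Start by checking the logs for the exact error.", "label": True},
    {"prompt": "What should I do first?", "completion": "There are many possible options to consider...", "label": False},
])

# model and tokenizer loaded as in the DPO training example later in this lesson
kto_trainer = KTOTrainer(
    model=model,
    ref_model=None,
    args=KTOConfig(output_dir="./kto-model", beta=0.1),
    train_dataset=kto_data,
    tokenizer=tokenizer,
)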

ORPO (Odds Ratio Preference Optimization, Hong et al. 2024) merges SFT and alignment into a single stage. It adds an odds-ratio penalty to the standard SFT loss, penalizing the model for assigning high probability to rejected responses during regular instruction tuning. This eliminates the need for a separate alignment phase entirely.


Building Preference Datasets

The quality of your preference data determines the quality of alignment. There are three main collection strategies.

Human Annotation

Hire annotators to compare response pairs. This is the gold standard but is expensive at scale. A key design choice is the comparison interface: do you give a binary choice (A or B) or a 5-point scale? Pairwise binary comparisons are more reliable but give less signal per annotation. For production alignment of a specialized model, you typically want 10,000 to 50,000 preference pairs for meaningful results.

Good human annotation requires clear rubrics. Define what "better" means for your use case: more accurate, more concise, more empathetic, less likely to hallucinate? Annotators who disagree about what they are measuring produce noisy data, and noisy preference data trains a reward model that does not reflect actual human preferences.
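
Before scaling up annotation, it is worth double-annotating a sample and measuring inter-annotator agreement. A small self-contained sketch using Cohen's kappa, where each annotator picks "A" or "B" per pair:

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators choosing 'A' or 'B' per pair."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal rate of picking 'A'
    p_a = labels_a.count("A") / n
    p_b = labels_b.count("A") / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)

# Two annotators labeling the same six pairs
print(cohens_kappa(list("AABBAB"), list("AABBBB")))  # ≈ 0.67

As a rough rule, kappa below about 0.4 suggests the rubric needs tightening before you spend the annotation budget.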

RLAIF: AI Feedback

RLAIF (Reinforcement Learning from AI Feedback, Bai et al. 2022) uses a stronger language model as the annotator. You prompt a capable model (e.g., GPT-4 or Claude) to compare pairs of responses and explain which is better. The annotation cost drops dramatically, and you can generate hundreds of thousands of preference pairs quickly.

The risk is annotation bias. The annotating model has its own preferences and blind spots. If you use GPT-4 to annotate preferences for training your open-source model, your model will learn to behave like GPT-4 prefers - which may not match your users' actual preferences. Also, if the model being annotated is close in capability to the annotator, the annotations become unreliable.

Synthetic Preference Generation

For many specialized domains, you can generate preference pairs synthetically. Give the model a prompt, generate multiple responses with different temperatures, then use heuristics or another model to rank them. For code generation, you can execute the code and rank by correctness. For factual Q&A, you can check answers against a knowledge base. This is cheap but limited to tasks with automatic evaluation criteria.
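
A sketch of the execution-based route for code generation: rank candidates by whether they pass a test snippet, and build a pair only when there is a clear winner and loser. (Run untrusted model-generated code in a sandbox in production; the plain subprocess here is for illustration only.)

import subprocess
import tempfile

def passes_tests(code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Run a candidate solution plus its tests in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def build_pair(prompt: str, candidates: list[str], test_code: str) -> dict | None:
    """Build a preference pair from the first passing and first failing candidate."""
    results = [(c, passes_tests(c, test_code)) for c in candidates]
    passing = [c for c, ok in results if ok]
    failing = [c for c, ok in results if not ok]
    if passing and failing:
        return {"prompt": prompt, "chosen": passing[0], "rejected": failing[0]}
    return None  # no signal when all candidates pass or all fail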


Code Examples

Setting Up a Preference Dataset

from datasets import Dataset

# Format: list of dicts with "prompt", "chosen", "rejected"
raw_preferences = [
    {
        "prompt": "Explain what a neural network is to a 10-year-old.",
        "chosen": "A neural network is like a team of tiny decision-makers. Each one looks at a small piece of information, makes a guess, and passes it to the next one. Together they can recognize pictures, understand speech, and even play games - just by practicing over and over until they get good at it.",
        "rejected": "A neural network is a computational graph consisting of layers of parameterized linear transformations composed with non-linear activation functions, trained via gradient descent to minimize a loss function over a dataset.",
    },
    {
        "prompt": "My code keeps throwing a KeyError. What should I do?",
        "chosen": "A KeyError means you're trying to access a dictionary key that doesn't exist. First, print the dictionary and the key you're looking for to confirm the mismatch. Then either use .get(key, default) to safely handle missing keys, or check with 'if key in my_dict' before accessing it.",
        "rejected": "KeyError is an exception. You should handle exceptions using try/except blocks in Python. Exception handling is a fundamental concept in programming.",
    },
]

dataset = Dataset.from_list(raw_preferences)
dataset.save_to_disk("./preference_data")
print(f"Created dataset with {len(dataset)} preference pairs")
print(dataset[0].keys())
# dict_keys(['prompt', 'chosen', 'rejected'])

Computing Reference Log Probabilities (for understanding DPO internals)

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def compute_log_probs(model, tokenizer, prompt: str, response: str) -> float:
    """Compute the sum of log probabilities of a response given a prompt."""
    full_text = prompt + response

    inputs = tokenizer(full_text, return_tensors="pt").to(model.device)
    prompt_tokens = tokenizer(prompt, return_tensors="pt").input_ids
    prompt_len = prompt_tokens.shape[1]

    with torch.no_grad():
        outputs = model(**inputs)
        # outputs.logits: (1, seq_len, vocab_size)
        logits = outputs.logits[0]  # (seq_len, vocab_size)

    # Shift: logits at position i predict the token at position i+1
    shift_logits = logits[prompt_len - 1:-1]            # logits over response positions
    shift_labels = inputs["input_ids"][0, prompt_len:]  # response tokens

    log_probs = torch.nn.functional.log_softmax(shift_logits, dim=-1)
    token_log_probs = log_probs.gather(1, shift_labels.unsqueeze(1)).squeeze(1)

    return token_log_probs.sum().item()

# Example usage
model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

prompt = "Explain what a neural network is to a 10-year-old.\n\n"
chosen = "A neural network is like a team of tiny decision-makers..."
rejected = "A neural network is a computational graph..."

lp_chosen = compute_log_probs(model, tokenizer, prompt, chosen)
lp_rejected = compute_log_probs(model, tokenizer, prompt, rejected)

print(f"Log prob chosen: {lp_chosen:.2f}")
print(f"Log prob rejected: {lp_rejected:.2f}")
print(f"Model prefers chosen: {lp_chosen > lp_rejected}")

DPO Training with TRL's DPOTrainer

from datasets import load_from_disk
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import DPOTrainer, DPOConfig
import torch

# -- Configuration --
model_name = "meta-llama/Llama-3.2-3B-Instruct"
output_dir = "./dpo-aligned-model"
beta = 0.1 # DPO temperature: controls deviation from reference
max_length = 1024
max_prompt_length = 512

# -- Load model with 4-bit quantization --
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # important for DPO - pad on the left

# -- LoRA for parameter-efficient DPO --
# Pass this config to DPOTrainer below rather than wrapping the model with
# get_peft_model() yourself - wrapping twice can conflict, and DPOTrainer
# needs to manage the adapters to derive the reference model.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# -- Load preference dataset --
dataset = load_from_disk("./preference_data")
train_dataset = dataset.select(range(int(0.9 * len(dataset))))
eval_dataset = dataset.select(range(int(0.9 * len(dataset)), len(dataset)))

# -- DPO Training config --
training_args = DPOConfig(
    output_dir=output_dir,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch size = 16
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=beta,
    max_length=max_length,
    max_prompt_length=max_prompt_length,
    remove_unused_columns=False,
    logging_steps=10,
    eval_steps=100,
    save_steps=200,
    evaluation_strategy="steps",
    bf16=True,
    report_to="wandb",  # or "none"
    # Key DPO settings
    loss_type="sigmoid",    # standard DPO loss
    # loss_type="ipo",      # use for IPO variant
    # loss_type="kto_pair", # use for KTO-style pair loss
)

# -- Initialize DPOTrainer --
# With peft_config and ref_model=None, DPOTrainer applies the LoRA adapters
# and computes reference log probabilities by temporarily disabling them -
# no second copy of the model weights is needed.
dpo_trainer = DPOTrainer(
    model=model,
    ref_model=None,  # None = use the adapter-disabled model as frozen reference
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
dpo_trainer.model.print_trainable_parameters()
# trainable params: 13,631,488 || all params: 3,226,947,584 || trainable%: 0.42%

# -- Train --
dpo_trainer.train()
dpo_trainer.save_model(output_dir)
print(f"DPO training complete. Model saved to {output_dir}")

Evaluating Alignment Quality

from transformers import pipeline

def generate_eval_responses(model_path: str, eval_prompts: list[str]) -> list[dict]:
    """
    Generate responses from the aligned model on held-out prompts.
    Pair these with reference-model responses and feed them to an LLM
    judge (see the win-rate sketch below) to compute a win rate.
    """
    aligned_pipe = pipeline(
        "text-generation",
        model=model_path,
        device_map="auto",
        max_new_tokens=256,
        do_sample=False,  # greedy decoding for eval consistency
    )

    results = []
    for prompt in eval_prompts:
        response = aligned_pipe(prompt)[0]["generated_text"]
        # Strip the prompt prefix from the generated text
        response = response[len(prompt):].strip()
        results.append({"prompt": prompt, "response": response})

    return results

# Alignment metrics to track during training
# These appear in DPOTrainer logs automatically:
# - train/rewards/chosen: mean reward for chosen responses
# - train/rewards/rejected: mean reward for rejected responses
# - train/rewards/accuracies: fraction where chosen reward > rejected reward
# - train/rewards/margins: mean(chosen_reward - rejected_reward)
# - train/logps/chosen: mean log prob of chosen responses
# - train/logps/rejected: mean log prob of rejected responses

# A healthy DPO run shows:
# rewards/margins increasing over training
# rewards/accuracies converging to 0.8-0.95
# logps/chosen slightly increasing, logps/rejected decreasing
# The two curves diverging (not one collapsing to -inf)

def check_reward_hacking_symptoms(trainer_logs: list[dict]) -> list[str]:
    """Detect common DPO failure modes from training logs."""
    warnings = []

    last_log = trainer_logs[-1]

    # Symptom 1: log probs collapsing to very negative values
    if last_log.get("train/logps/rejected", 0) < -100:
        warnings.append("WARN: rejected logps very negative - possible overfit on rejection signal")

    # Symptom 2: reward margin not increasing
    first_log = trainer_logs[0]
    margin_delta = (
        last_log.get("train/rewards/margins", 0)
        - first_log.get("train/rewards/margins", 0)
    )
    if margin_delta < 0.1:
        warnings.append("WARN: reward margin barely improved - check learning rate and beta")

    # Symptom 3: accuracy too high too fast (overfit)
    if last_log.get("train/rewards/accuracies", 0) > 0.99:
        warnings.append("WARN: accuracy near 1.0 - likely overfitting preference data")

    return warnings
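
The judging step that turns these generations into a win rate can be a short function. A sketch with an OpenAI-style judge (the judge prompt is illustrative; in practice, also swap the A/B positions between calls to control for position bias):

from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY

def judge_pair(prompt: str, response_a: str, response_b: str, model: str = "gpt-4o") -> str:
    """Ask a strong model which response is better; returns 'A' or 'B'."""
    judge_prompt = (
        f"User prompt:\n{prompt}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Which response is more helpful, accurate, and appropriately concise? "
        "Answer with exactly one letter: A or B."
    )
    out = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
        max_tokens=1,
    )
    return out.choices[0].message.content.strip()

def win_rate(pairs: list[dict]) -> float:
    """pairs: [{'prompt': ..., 'aligned': ..., 'reference': ...}, ...]"""
    wins = sum(judge_pair(p["prompt"], p["aligned"], p["reference"]) == "A" for p in pairs)
    return wins / len(pairs)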

Building a Simple Synthetic Preference Dataset

from openai import OpenAI
import json

client = OpenAI()  # requires OPENAI_API_KEY

def generate_preference_pairs(
    prompts: list[str],
    good_system_prompt: str,
    bad_system_prompt: str,
    model: str = "gpt-4o-mini",
) -> list[dict]:
    """
    Generate synthetic preference pairs by varying the system prompt.
    Uses a helpful system prompt for 'chosen' and a degraded one for 'rejected'.
    """
    pairs = []

    for prompt in prompts:
        # Generate 'chosen' response - helpful, concise, accurate
        chosen_response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": good_system_prompt},
                {"role": "user", "content": prompt},
            ],
            temperature=0.7,
            max_tokens=300,
        ).choices[0].message.content

        # Generate 'rejected' response - verbose, evasive, or unhelpful
        rejected_response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": bad_system_prompt},
                {"role": "user", "content": prompt},
            ],
            temperature=0.9,
            max_tokens=400,
        ).choices[0].message.content

        pairs.append({
            "prompt": prompt,
            "chosen": chosen_response,
            "rejected": rejected_response,
        })

    return pairs

good_system = "You are a concise, helpful assistant. Give direct, accurate answers. Express uncertainty when you are not sure. Prioritize what the user actually needs."

bad_system = "You are a verbose assistant. Always give lengthy responses. Never say you are uncertain - always sound confident. Include tangential information. Use lots of caveats and qualifications."

sample_prompts = [
    "What causes inflation?",
    "How do I reverse a string in Python?",
    "Should I use Redis or Memcached for session storage?",
    "What is the difference between precision and recall?",
    "How does gradient descent work?",
]

preference_pairs = generate_preference_pairs(sample_prompts, good_system, bad_system)
print(f"Generated {len(preference_pairs)} preference pairs")

# Save to disk in TRL-compatible JSONL format
with open("synthetic_preferences.jsonl", "w") as f:
    for pair in preference_pairs:
        f.write(json.dumps(pair) + "\n")

Architecture and Flow Diagrams

(Diagrams not reproduced in this text version: the full RLHF pipeline, a simplified DPO vs RLHF comparison, and a decomposition of the DPO loss.)


Production Engineering Notes

Memory Budget for DPO with LoRA

When using LoRA adapters with DPOTrainer and ref_model=None, TRL disables the LoRA adapters to get reference model log probabilities. This means you only need one copy of the model weights in memory - the LoRA adapter weights are toggled on/off. For a 7B model in 4-bit quantization:

  • Base model: ~4GB (4-bit)
  • LoRA adapters: ~100MB (r=16 on attention projections)
  • Optimizer state: ~1GB (Adam on LoRA params only)
  • Activation memory: ~8GB (depends on sequence length)
  • Total: ~14GB - fits on a single 24GB GPU like an RTX 4090 or A10G

Without LoRA (full DPO), you need two full model copies in memory: the policy and the reference. For 7B this means 2x14GB = 28GB minimum in fp16, requiring an A100 40GB or larger.

Choosing the Beta Parameter

$\beta$ is the single most important hyperparameter in DPO. Too low: the model drifts far from the reference, often degrading general capabilities while maximizing reward on the training distribution. Too high: the model barely moves from SFT.

Typical values:

  • $\beta = 0.1$: aggressive alignment and the usual starting point; best when preference data is high quality and plentiful
  • $\beta = 0.5$: moderate alignment
  • $\beta = 1.0$: conservative; use when preference data is small or noisy

Run a sweep over [0.05, 0.1, 0.5, 1.0] and evaluate on a held-out set using win rate against the reference model.
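
A sketch of that sweep, reusing the DPOConfig/DPOTrainer setup from the code examples above (reload a fresh copy of the SFT model for each run so the betas do not contaminate each other):

for beta in [0.05, 0.1, 0.5, 1.0]:
    model = AutoModelForCausalLM.from_pretrained(  # fresh SFT checkpoint per run
        model_name, quantization_config=bnb_config, device_map="auto"
    )
    args = DPOConfig(
        output_dir=f"./dpo-beta-{beta}",
        beta=beta,
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=5e-5,
        bf16=True,
    )
    trainer = DPOTrainer(
        model=model, ref_model=None, args=args,
        train_dataset=train_dataset, eval_dataset=eval_dataset,
        tokenizer=tokenizer, peft_config=peft_config,
    )
    trainer.train()
    # Then compare each checkpoint's win rate against the reference on held-out prompts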

Preference Data Quality Over Quantity

1,000 high-quality, carefully annotated preference pairs almost always outperform 50,000 noisy AI-generated pairs. Noisy preferences where the "chosen" response is only marginally better than "rejected" train the model to make marginal improvements - not to clearly prefer helpful over unhelpful. If using AI feedback, use the strongest available model and provide detailed rubrics in the annotation prompt.

When to Use DPO vs SFT-Only

DPO adds value when:

  • Your SFT model is technically capable but behaviorally misaligned
  • You have at least 1,000 preference pairs (more is better)
  • You want to suppress specific failure modes (e.g., "stop hallucinating product names")

SFT-only is often enough when:

  • Your task is narrow and well-defined (code completion, classification)
  • You have high-quality demonstration data and do not need preference ranking
  • Your model is small (under 1B parameters) - at that scale the model often lacks the capacity for preference optimization to show clear gains

Evaluating Alignment Post-DPO

Alignment evaluation is difficult because there is no ground truth. Standard approaches:

  1. MT-Bench / AlpacaEval: standardized multi-turn benchmarks with GPT-4 as judge. MT-Bench scores on a 1-10 scale across 8 categories (reasoning, coding, math, writing, etc.).

  2. Win rate: sample 200-500 prompts from your test distribution, generate responses from the DPO model and the SFT reference, use a strong LLM judge (GPT-4) to pick the winner. Target win rate over reference: 55-65% suggests meaningful improvement.

  3. Capability regression check: run MMLU, HumanEval, or other standard benchmarks to ensure alignment training has not degraded general capabilities. A DPO run with β\beta too low can drop MMLU by 2-5 points while appearing to "align" well by win rate.


Common Mistakes

:::danger Forgetting to pad on the left side

DPO training requires tokenizer.padding_side = "left" because TRL computes log probabilities over the response portion of the sequence. If you pad on the right (the default for most tokenizers), the batch computation misaligns and you get log probabilities computed over padding tokens rather than response tokens. This causes training to appear to work (loss decreases) while actually learning nothing - a silent failure.

Always set tokenizer.padding_side = "left" before DPOTrainer initialization.

:::

:::danger Using a model that was not SFT-first as the reference

DPO is designed to refine an already instruction-following model. If you try to run DPO starting from a raw base model (no SFT stage), the reference model does not know how to follow instructions. The policy will learn to produce responses that look relatively more "chosen-like" compared to the reference, but the reference is generating garbage, so the gradient signal is meaningless. Always start DPO from an SFT checkpoint.

:::

:::warning Setting beta too low without enough data

With small preference datasets (under 2,000 pairs) and $\beta < 0.1$, the policy can overfit to the preference data rapidly. The model learns to make the specific chosen responses very likely and the specific rejected responses very unlikely, but does not generalize. You see train accuracy approach 1.0 quickly, but win rate on held-out prompts barely improves. Use $\beta \geq 0.1$ for small datasets and always monitor eval win rate separately from train accuracy.

:::

:::warning Using the same model for RLAIF annotation and alignment training

If you use GPT-4 to generate both the SFT demonstrations and the preference annotations, and then use both to train your open-source model, you are essentially distilling GPT-4 into your model twice. The preferences will reflect GPT-4's biases, not your users' needs. More importantly, if the SFT data and preference data come from the same distribution, DPO has very little room to improve - the SFT model already mimics the "chosen" responses. Use diverse sources: SFT from one distribution, preference annotations from actual user interactions or a separate process.

:::

:::warning Not checking for degenerate log probabilities

After DPO training, inspect the log probabilities of chosen and rejected responses on the eval set. If logps/rejected is extremely negative (below -200 on a 100-token response), the model has learned to assign near-zero probability to rejected responses. This is a sign of overfitting and usually means the model has partially collapsed - it will refuse to generate anything resembling a rejected response even in contexts where that content would be appropriate. Stop training earlier or increase $\beta$.

:::


Interview Q&A

Q1: Explain the DPO loss and why it works without a reward model.

DPO derives from the observation that the optimal RLHF policy has an analytical form: $\pi^*(y|x) \propto \pi_{ref}(y|x) \exp(r(x,y)/\beta)$. By rearranging this, you can express the reward as $r(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + \text{const}$. When you substitute this into the Bradley-Terry reward model objective (which compares rewards for chosen vs rejected responses), the constant terms cancel and you are left with a loss that depends only on the ratio of the policy to the reference. No separate reward model is needed because the policy implicitly represents a reward function through this ratio. Training minimizes the probability of preferring the rejected response over the chosen response under this implicit reward.

Q2: What does the beta parameter control in DPO and how do you choose it?

Beta ($\beta$) controls the strength of the KL penalty between the trained policy and the reference model. Mathematically, it scales the reward in the optimal-policy form $\exp(r/\beta)$: high $\beta$ means the policy stays close to the reference even when the reward strongly favors deviation; low $\beta$ allows large deviations for even small reward improvements. Practically: low $\beta$ (0.05-0.1) enables aggressive alignment but risks overfitting and capability regression; high $\beta$ (0.5-1.0) is conservative and safe for small datasets. Start at 0.1 and sweep, monitoring both alignment win rate and capability benchmarks (MMLU, HumanEval) as $\beta$ decreases.

Q3: What is reward hacking and how does RLHF guard against it?

Reward hacking occurs when the policy finds response patterns that score highly under the reward model but are not actually preferred by humans. The reward model is an imperfect proxy - it is a classifier trained on human comparisons, not a perfect simulator of human preference. The policy optimized against it will eventually find its blind spots. RLHF guards against this with the KL-divergence term in the PPO objective: by penalizing divergence from the reference SFT model, it limits how far the policy can drift in search of reward model exploits. DPO handles this similarly through the $\beta$ parameter, though DPO is generally more resistant to reward hacking because it never trains a separate reward model that can be exploited - the reward is implicitly defined by the data.

Q4: How does KTO differ from DPO and when would you use it?

KTO (Kahneman-Tversky Optimization) works with unpaired binary feedback: individual responses labeled as good or bad, without requiring a rejected counterpart for every chosen example. DPO requires preference pairs - for every chosen response you need a corresponding rejected response to the same prompt. This pairing requirement can be a bottleneck: in production, you may have thumbs-up/thumbs-down feedback on individual model outputs, not side-by-side comparisons. KTO uses prospect theory-inspired asymmetric loss functions: the loss for a "good" response and the loss for a "bad" response have different shapes, reflecting the human tendency to be more sensitive to losses than gains. Use KTO when your feedback data is naturally binary and unpaired; use DPO when you can collect or generate matched pairs.

Q5: Walk me through what happens when DPO training goes wrong - what metrics indicate failure?

Several patterns indicate DPO training failure:

  • Reward margins not increasing: the mean difference between chosen and rejected reward estimates should grow over training. If it stays flat, the model is not learning the preference signal - check learning rate, data formatting, and padding direction.

  • Log probabilities collapsing: if logps/rejected drops to very large negative values (-200, -500) while logps/chosen barely increases, the model is overfitting. It is memorizing which specific responses to reject rather than learning a generalizable preference. Increase $\beta$ or reduce epochs.

  • Accuracy at 1.0 on training set but poor eval win rate: classic overfitting. The model has perfectly fit the training preference pairs but learned nothing about the underlying preference structure. Reduce training steps, add data augmentation, or collect more diverse preference pairs.

  • Capability regression on benchmarks: MMLU or HumanEval drops after DPO. Caused by $\beta$ set too low - the model has drifted too far from the reference policy. Increase $\beta$ and retrain.

Q6: What is ORPO and how does it change the training workflow?

ORPO (Odds Ratio Preference Optimization) eliminates the need for a separate alignment stage after SFT by integrating preference optimization into the instruction tuning loss. The standard SFT cross-entropy loss is augmented with an odds-ratio penalty term that decreases the relative probability of generating rejected responses. The combined loss is:

$$\mathcal{L}_{ORPO} = \mathcal{L}_{SFT} - \lambda \cdot \log \sigma\left(\log \frac{\pi_\theta(y_w|x)}{1 - \pi_\theta(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{1 - \pi_\theta(y_l|x)}\right)$$

The advantage is that you go from base model to aligned model in a single training run with a single dataset that includes both instruction demonstrations and preference pairs. The disadvantage is that you cannot independently tune the SFT and alignment stages - if the SFT data and preference data pull the model in different directions, the combined loss may not learn either well. ORPO is a good choice for compute-constrained scenarios where running multiple fine-tuning stages is prohibitive.

Q7: How would you build a preference dataset for a specialized domain like medical question answering?

Start with the failure modes of your current model - run it on real medical questions and identify response categories where it fails: overconfident answers to unanswerable questions, failure to recommend professional consultation, responses that are technically accurate but clinically misleading due to omitted context. These failure modes define what "rejected" looks like.

Then generate preference pairs targeting these failures: for each test prompt, generate a response from your current model (often the "rejected" candidate), and write or prompt-engineer a better response (the "chosen" candidate) that recommends consultation when appropriate, expresses calibrated uncertainty, and does not omit clinically relevant context. Have domain experts (physicians, nurses) review a sample of pairs to validate the annotation criteria. For scale, use strong medical AI models (Med-PaLM 2, GPT-4 with medical system prompt) as secondary annotators, with human review of a random 10% sample to catch systematic annotation errors.

Aim for at least 3,000 preference pairs across diverse medical topics - not 3,000 pairs all in cardiology, which will overfit to cardiology-style communication while leaving general medicine unimproved.


Method Comparison: Choosing the Right Alignment Approach

The landscape of alignment methods can be confusing. Here is a practical decision framework based on your constraints.

When to Use Full RLHF with PPO

RLHF with PPO is justified when:

  • You are training a large frontier model (70B+) where alignment quality has high economic stakes
  • You have an established reward model infrastructure and annotation budget
  • You need fine-grained control over the reward signal - PPO lets you tune the reward function independently of the policy optimization
  • Your preference data is extremely large (100k+ pairs) - at this scale, training a dedicated reward model adds value over the implicit reward in DPO

Practically, most teams working with open-source models in the 1B-13B range should not use RLHF with PPO. The engineering complexity, memory requirements, and training instability are not justified unless you have a dedicated ML infrastructure team.

When to Use DPO

DPO is the right choice for:

  • Open-source model alignment in production teams without large ML infrastructure
  • Models where you have 1,000-100,000 preference pairs and want efficient use of that data
  • Compute-constrained environments where running four models simultaneously is impractical
  • Iterative alignment where you want to quickly experiment with different preference datasets

DPO should be your default alignment method when starting a new alignment project.

When to Use IPO

Use IPO over standard DPO when:

  • Your preference labels are deterministic (one response is always strictly better than the other, never "it depends")
  • You see the DPO log-ratio diverging to extreme values during training
  • Your preference data is small (under 2,000 pairs) - IPO's regularization helps prevent overfitting

IPO adds a squared deviation penalty on the log-ratio to prevent it from growing arbitrarily large:

$$\mathcal{L}_{IPO} = \mathbb{E} \left[ \left( \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)} - \frac{1}{2\beta} \right)^2 \right]$$

This targets the log-ratio at a finite value $1/(2\beta)$ rather than pushing it to infinity.
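
In code, the change from DPO is one line: swap the log-sigmoid for a squared error against the $1/(2\beta)$ target. A sketch operating on the same policy/reference log-ratios computed inside the DPO loss sketch earlier:

import torch

def ipo_loss(chosen_logratio: torch.Tensor, rejected_logratio: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Regress the preference margin toward a finite target instead of infinity."""
    margin = chosen_logratio - rejected_logratio  # difference of policy/reference log-ratios
    return ((margin - 1.0 / (2.0 * beta)) ** 2).mean()

In TRL, setting loss_type="ipo" in DPOConfig selects this variant, as shown in the training example earlier.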

When to Use KTO

Use KTO when:

  • Your feedback data is binary and unpaired - users give thumbs up/down on individual responses without requiring matched pairs
  • You want to leverage large volumes of production feedback logs that are not structured as comparisons
  • Your preference collection process makes it easier to label individual responses than to compare pairs

KTO has been shown (Ethayarajh et al. 2024) to match or exceed DPO quality on many tasks when unpaired data is abundant, because the absolute volume of signal compensates for the weaker per-sample signal.

When to Use ORPO

Use ORPO when:

  • You want to skip the SFT-then-alignment two-stage pipeline
  • Your compute budget is tight and you cannot afford sequential training runs
  • Your instruction data and preference data are well-aligned (both targeting the same behavior)
  • You are fine-tuning a base model that has not been instruction-tuned yet

Scaling Alignment: From Single GPU to Multi-Node

Single GPU (24GB VRAM - RTX 4090, A10G)

Full DPO training on a 7B model is not feasible on a single 24GB GPU without quantization. With 4-bit quantization and LoRA:

  • 4-bit base model weights: ~4GB
  • LoRA adapters (r=16): ~100MB
  • Activations + optimizer: ~16GB
  • Total: ~20GB - fits with careful gradient checkpointing

Use DPOConfig(beta=0.1, max_length=512) with small batch sizes and gradient accumulation.

Multi-GPU (4x A100 80GB)

With 4 A100s, you can run full-parameter DPO on 7B models without quantization:

  • Use DeepSpeed ZeRO Stage 2 or 3 to shard optimizer state and gradients
  • Set per_device_train_batch_size=4, gradient_accumulation_steps=4 for effective batch of 64
  • Enable bf16=True (not fp16 - bf16 is more stable for transformer training)
  • Use Flash Attention 2 to handle longer sequences efficiently
# DeepSpeed ZeRO Stage 2 config for multi-GPU DPO
deepspeed_config = {
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": True,
        "reduce_scatter": True,
        "overlap_comm": True,
    },
    "gradient_clipping": 1.0,
    "bf16": {"enabled": True},
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}
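
To wire this in, DPOConfig inherits from TrainingArguments, which accepts a DeepSpeed config as a dict or a JSON path; launch the run with accelerate or the deepspeed launcher:

training_args = DPOConfig(
    output_dir="./dpo-multi-gpu",
    deepspeed=deepspeed_config,  # dict or path to a DeepSpeed JSON config
    bf16=True,
)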

Monitoring Alignment Training in Production

Track these metrics in Weights & Biases or TensorBoard during every DPO run:

# Key metrics to watch - these are automatically logged by DPOTrainer
metrics_to_monitor = {
    "train/rewards/chosen": "Should increase over time",
    "train/rewards/rejected": "Should decrease over time",
    "train/rewards/margins": "Chosen - rejected; should grow",
    "train/rewards/accuracies": "Fraction where chosen > rejected; target 0.75-0.92",
    "train/logps/chosen": "Should be stable or slightly increase",
    "train/logps/rejected": "Should decrease but not collapse",
    "eval/rewards/margins": "Primary validation metric",
    "train/loss": "Should decrease steadily",
}

# Red flags during training:
# - rewards/accuracies > 0.98 after first 10% of training: overfitting
# - logps/rejected < -200 at any point: collapsed rejection signal
# - eval/rewards/margins not tracking train/rewards/margins: distribution shift
# - rewards/chosen decreasing: learning rate too high or data formatting bug

Practical Checklist: Before You Run DPO

A concise pre-flight checklist for running DPO in production:

Data preparation:

  • Dataset has exactly three fields: prompt, chosen, rejected
  • All chosen responses are actually better than rejected responses (spot-check 50 pairs)
  • Prompts are in the same format as the SFT model's training format (same chat template)
  • Dataset has at least 1,000 pairs; ideally 5,000+
  • Train/eval split is random and stratified if you have topic categories

Model setup:

  • Starting from an SFT checkpoint, NOT a raw base model
  • tokenizer.padding_side = "left" is set before DPOTrainer initialization
  • tokenizer.pad_token = tokenizer.eos_token if pad token is not set
  • If using LoRA, ref_model=None in DPOTrainer (TRL handles reference via adapter toggle)
  • If using full DPO (no LoRA), pass ref_model=SFT_model explicitly

Training config:

  • Learning rate is in range $[1 \times 10^{-6}, 1 \times 10^{-4}]$; start at $5 \times 10^{-5}$
  • Beta is in range $[0.05, 1.0]$; start at $0.1$
  • max_length set to fit sequences without excessive truncation (check the p95 sequence length; see the sketch after this checklist)
  • max_prompt_length set to roughly half max_length

Post-training validation:

  • Run eval on held-out preference pairs: rewards/accuracies should be 0.70-0.90
  • Run capability benchmarks (MMLU, HumanEval) to check for regression
  • Run 50-100 human-written test prompts and manually inspect responses
  • Compare win rate vs SFT reference using GPT-4 as judge on 200+ prompts
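
A quick sketch for the sequence-length check referenced in the training-config items, using the tokenizer and dataset from the earlier examples:

import numpy as np

def length_percentiles(dataset, tokenizer, field: str = "chosen") -> dict:
    """Token-length percentiles for sanity-checking max_length / max_prompt_length."""
    lengths = [len(tokenizer(ex["prompt"] + ex[field]).input_ids) for ex in dataset]
    return {p: int(np.percentile(lengths, p)) for p in (50, 90, 95, 99)}

print(length_percentiles(train_dataset, tokenizer))
# e.g. {50: 210, 90: 480, 95: 610, 99: 890} -> set max_length above the p95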