
Direct Preference Optimisation - RLHF Without the RL

Reading time: ~35 minutes | Level: Reinforcement Learning | Role: MLE, AI Research Engineer


The Real Interview Moment

Stanford, 2023. Rafael Rafailov and colleagues are staring at the RLHF objective. They have been studying the InstructGPT paper, which set the standard for aligning language models with human preferences. The RLHF pipeline has three stages: supervised fine-tuning on demonstration data, reward model training on human preference comparisons, and policy optimization with PPO. It requires four models simultaneously in GPU memory - the policy, the reference policy, the reward model, and the value function. It is notoriously unstable. The clip ratio, the KL coefficient, the number of update epochs, the reward normalization - all must be tuned jointly. Major labs can afford it. Researchers at smaller institutions largely cannot.

Schulman et al. had already proved, in their 2017 KL-constrained optimization work, that the optimal policy under KL-regularized reward maximization has a known closed form. The policy that maximizes $\mathbb{E}[r(x,y)] - \beta D_{\text{KL}}(\pi \| \pi_{\text{ref}})$ can be written analytically, without running a single gradient step. Rafailov's group decides to take this closed form seriously and plug it back into the Bradley-Terry preference model used to train the reward model.

Something extraordinary happens in the algebra. The partition function $Z(x)$ - a normalizing constant that seemed to make the approach intractable - appears identically in both terms of the Bradley-Terry model and cancels out. The reward function cancels out with it. What remains is a loss function that depends only on the ratio of the policy's output probability to the reference model's output probability, evaluated on the chosen and rejected responses. No reward model required. No PPO loop. One training stage. Just supervised learning on (prompt, chosen, rejected) triplets.

The paper publishes in May 2023. Within six months, it becomes the dominant fine-tuning method for open-source language models. Zephyr, Mistral-Instruct, Llama-2-Chat variants, and dozens of others are trained with DPO or its variants. The compute reduction is real: 3 to 5 times cheaper than PPO-based RLHF, with comparable or better performance on most alignment benchmarks. An insight from mathematics eliminated an entire engineering system.

This is one of those rare papers where understanding the derivation is understanding the engineering. Follow the algebra carefully - the insight is in the cancellation.


Why RLHF Needed an Alternative

The previous lesson covered the full RLHF pipeline in detail. Here is a summary of why practitioners sought an alternative, to motivate the DPO design:

Training instability: PPO is sensitive to hyperparameters. The clip ratio $\varepsilon$, the KL coefficient $\beta$, the number of update epochs $K$, the reward normalization scheme, the value function learning rate - all must be tuned jointly. A badly tuned run can collapse the policy in hours of expensive GPU time. Large organizations tune this empirically over many runs. Small teams often cannot afford the iteration cost.

Memory cost: PPO-based RLHF requires four models simultaneously in GPU memory - the policy being trained, the frozen reference policy (for KL penalty), the reward model, and the value function (baseline for variance reduction). For 7B parameter models in bfloat16, this is roughly 56GB before activations and optimizer states. This alone disqualifies RLHF for teams without access to multi-GPU clusters with large memory.
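The 56GB figure is simple arithmetic. A quick sketch, assuming 2 bytes per bfloat16 parameter and that all four models have 7B parameters (the function name is illustrative, and activations, gradients, and optimizer states are deliberately left out):

```python
def weights_gb(n_models: int, params_billions: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return n_models * params_billions * 1e9 * bytes_per_param / 1e9


# PPO-based RLHF: policy + reference + reward model + value function
ppo_gb = weights_gb(4, 7.0)
# DPO: policy + reference only
dpo_gb = weights_gb(2, 7.0)

print(f"PPO-based RLHF weights: {ppo_gb:.0f} GB, DPO weights: {dpo_gb:.0f} GB")
```

Under these assumptions the weights alone halve when the reward model and value function are dropped.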

Reward model errors compound: the reward model is trained separately on preference data and then frozen. Any errors or biases in the reward model are amplified by PPO optimization. If the reward model assigns high scores to responses that are verbose but not actually better (a known phenomenon), PPO will produce a policy that generates increasingly verbose outputs. This is called reward hacking.

Sample inefficiency: PPO is an on-policy algorithm - it generates responses from the current policy, evaluates them with the reward model, and updates. After each policy update, previously collected data goes partially stale. The policy constantly generates new rollouts, most of which are only used for one or two gradient steps.

DPO eliminates all of these by reformulating alignment as a supervised learning problem on fixed preference data.


Historical Context

| Year | Paper | Authors | Contribution |
|---|---|---|---|
| 2017 | Proximal Policy Optimization | Schulman et al. | PPO - the RL algorithm used in RLHF |
| 2022 | InstructGPT | Ouyang et al. (OpenAI) | Full RLHF pipeline - SFT + RM + PPO - the standard |
| 2023 | DPO | Rafailov et al. (Stanford) | Closed-form RLHF: no reward model, no PPO |
| 2023 | IPO | Azar et al. (DeepMind) | Fixes DPO overconfidence with squared loss |
| 2023 | KTO | Ethayarajh et al. (Stanford, Contextual AI) | Works without paired comparisons - single labels only |
| 2023 | SLiC | Zhao et al. | Sequence likelihood calibration - related derivation |
| 2023 | Zephyr | Tunstall et al. (HuggingFace) | First major open model trained with DPO |
| 2024 | ORPO | Hong et al. | Combines SFT and DPO in one pass - no reference model |
| 2024 | SimPO | Meng et al. | Length-normalized DPO - improves long-response quality |

The Full Mathematical Derivation

This is the central content of the DPO paper. The mathematical insight is the engineering insight. Follow each step carefully.

Step 1: The RLHF Objective

The standard RLHF objective finds a policy $\pi_\theta$ that maximizes the expected reward under the human preference model, subject to a KL divergence penalty that prevents the policy from drifting too far from the reference model $\pi_{\text{ref}}$ (which is the SFT checkpoint):

$$\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y|x)}\left[r_\phi(x, y)\right] - \beta\, D_{\text{KL}}\!\left(\pi_\theta(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x)\right)$$

The KL penalty serves two purposes: it prevents reward hacking (the policy cannot drift to degenerate high-reward solutions far from the reference) and it maintains language quality (the reference model already speaks coherent language).

Expanding the KL term, we can write this as:

$$\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta}\left[r_\phi(x, y) - \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}\right]$$

This is what PPO solves numerically: it estimates the expectation by generating rollouts from $\pi_\theta$, evaluates the reward model on those rollouts, and computes policy gradients. DPO will solve it analytically.

Step 2: The Analytical Optimal Policy

For the class of KL-regularized objectives above, the optimal policy has a known closed form. To derive it, take the functional derivative of the objective with respect to $\pi_\theta$ (with a Lagrange multiplier enforcing that the probabilities sum to 1) and set it to zero. The result is:

$$\pi^*(y|x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y|x)\,\exp\!\left(\frac{r(x,y)}{\beta}\right)$$

where $Z(x)$ is the partition function (a normalizing constant that ensures the probabilities sum to 1):

$$Z(x) = \sum_y \pi_{\text{ref}}(y|x)\exp\!\left(\frac{r(x,y)}{\beta}\right)$$

Intuition: the optimal policy is the reference model re-weighted by the exponentiated reward $\exp(r(x,y)/\beta)$, with $\beta$ acting as a temperature. Responses with high reward get multiplied up; responses with low reward get divided down; a smaller $\beta$ makes the re-weighting sharper. The reference model acts as an informed prior - you are not starting from scratch, you are refining an already good model.

Note: computing $\pi^*$ directly from this formula requires evaluating $Z(x)$, which involves summing over all possible responses $y$ - an intractable sum for an autoregressive LM. This is the step that previously made direct use of the closed form impractical. DPO's key insight is that $Z(x)$ can be made to cancel.
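On a toy problem where the response set is small enough to enumerate, the closed form can be checked directly. The sketch below assumes a hypothetical prompt with only three possible responses and made-up rewards. It builds $\pi^*$ from the formula and confirms that no randomly drawn policy achieves a higher value of the KL-regularized objective.

```python
import math
import random


def optimal_policy(pi_ref, rewards, beta):
    """pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z, with Z summed over all y."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z


def objective(pi, pi_ref, rewards, beta):
    """E_pi[r] - beta * KL(pi || pi_ref) for a single prompt."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return expected_reward - beta * kl


pi_ref = [0.5, 0.3, 0.2]   # toy reference distribution over three responses
rewards = [1.0, 2.0, 0.0]  # toy rewards (made up for illustration)
beta = 0.5

pi_star, z = optimal_policy(pi_ref, rewards, beta)
best = objective(pi_star, pi_ref, rewards, beta)

# No randomly drawn policy should beat the closed-form solution.
rng = random.Random(0)
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in pi_ref]
    candidate = [v / sum(raw) for v in raw]
    assert objective(candidate, pi_ref, rewards, beta) <= best + 1e-12

print("pi* =", [round(p, 4) for p in pi_star])
```

The maximum value of the objective works out to $\beta \log Z(x)$, which is one way to see why $Z(x)$ carries real information even though it cancels later. For a real language model the sum defining `z` ranges over all token sequences, which is why this enumeration only works on toy examples.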

Step 3: Invert to Express Reward in Terms of the Optimal Policy

Rearrange the expression for $\pi^*$ to isolate $r(x,y)$:

$$\pi^*(y|x) \cdot Z(x) = \pi_{\text{ref}}(y|x) \exp\!\left(\frac{r(x,y)}{\beta}\right)$$

$$\exp\!\left(\frac{r(x,y)}{\beta}\right) = \frac{\pi^*(y|x) \cdot Z(x)}{\pi_{\text{ref}}(y|x)}$$

Taking the logarithm of both sides:

$$\frac{r(x,y)}{\beta} = \log\frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \log Z(x)$$

Multiplying by $\beta$, and writing $r^*$ for the reward whose optimal policy is $\pi^*$:

$$r^*(x, y) = \beta\log\frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta\log Z(x)$$

This is the key equation. The reward $r^*(x,y)$ corresponding to the optimal policy can be expressed entirely in terms of: the optimal policy $\pi^*$, the reference model $\pi_{\text{ref}}$, and the partition function $Z(x)$ - which crucially depends only on $x$, not on $y$.

Step 4: Substitute into the Bradley-Terry Preference Model

The reward model in RLHF is trained using the Bradley-Terry model for pairwise preferences. Given a prompt $x$, a chosen response $y_w$ (preferred by the human), and a rejected response $y_l$:

$$P(y_w \succ y_l \mid x) = \sigma\!\left(r^*(x, y_w) - r^*(x, y_l)\right)$$

Now substitute our expression for $r^*$:

$$P(y_w \succ y_l \mid x) = \sigma\!\left(\left[\beta\log\frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} + \beta\log Z(x)\right] - \left[\beta\log\frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)} + \beta\log Z(x)\right]\right)$$

The $\beta\log Z(x)$ terms appear identically in both brackets and cancel:

$$P(y_w \succ y_l \mid x) = \sigma\!\left(\beta\log\frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta\log\frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)$$

This is the key cancellation. The intractable partition function is gone. The preference probability depends only on ratios of policy probabilities to reference probabilities - quantities that are straightforward to compute for any autoregressive language model.
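The cancellation can be verified numerically on a toy distribution. This sketch assumes three enumerable responses with made-up rewards. It recovers the reward from the optimal policy and checks that the Bradley-Terry probability computed from rewards equals the one computed from log-ratios alone, with no $Z(x)$ involved.

```python
import math


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


pi_ref = [0.5, 0.3, 0.2]   # toy reference distribution (illustrative)
rewards = [1.0, 2.0, 0.0]  # toy rewards (illustrative)
beta = 0.5

# Closed-form optimal policy for this toy problem.
weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
z = sum(weights)
pi_star = [w / z for w in weights]

# Step 3: the reward is recovered from the policy, up to the beta*log Z term.
log_ratios = [math.log(ps / pr) for ps, pr in zip(pi_star, pi_ref)]
recovered = [beta * lr + beta * math.log(z) for lr in log_ratios]

# Step 4: Bradley-Terry probability, computed two ways.
w, l = 1, 0  # indices of the chosen and rejected responses
p_from_rewards = sigmoid(rewards[w] - rewards[l])
p_from_policy = sigmoid(beta * log_ratios[w] - beta * log_ratios[l])

print(f"from rewards: {p_from_rewards:.6f}, from log-ratios: {p_from_policy:.6f}")
```

The second computation never touches `z`, which is the whole point: only the two log-ratios are needed.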

Step 5: The DPO Loss

Replace the optimal policy $\pi^*$ with the parameterized policy $\pi_\theta$ (the model being trained) and write the maximum likelihood objective over the preference dataset $\mathcal{D} = \{(x, y_w, y_l)\}$:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}}) = -\mathbb{E}_{(x,y_w,y_l) \sim \mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]$$

This is the DPO loss. Let us inventory what it requires:

  • $\pi_\theta$: the policy being trained (the LM we want to align)
  • $\pi_{\text{ref}}$: the frozen SFT reference model
  • $(x, y_w, y_l)$: preference triplets - prompt, chosen response, rejected response

No reward model is trained. No PPO loop is needed. No rollouts are generated. The reward model is implicitly encoded in the ratio $\beta\log(\pi_\theta / \pi_{\text{ref}})$ - the language model is its own reward model.
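A single-example version of the loss makes one property easy to see: at initialization, when the policy equals the reference, both log-ratios are zero and the loss is exactly $\log 2$. The sketch below uses made-up summed sequence log-probabilities and an illustrative function name.

```python
import math


def dpo_loss(policy_w: float, policy_l: float, ref_w: float, ref_l: float,
             beta: float = 0.1) -> float:
    """Single-example DPO loss from summed sequence log-probabilities."""
    logit = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    # -log(sigmoid(logit)) written as log(1 + exp(-logit))
    return math.log1p(math.exp(-logit))


# Policy identical to the reference: the logit is 0, so the loss is log 2.
at_init = dpo_loss(policy_w=-12.0, policy_l=-15.0, ref_w=-12.0, ref_l=-15.0)

# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(policy_w=-11.0, policy_l=-17.0, ref_w=-12.0, ref_l=-15.0)

print(f"at init: {at_init:.4f}, after improvement: {improved:.4f}")
```

A training loss that starts well away from roughly 0.693 usually points to a mismatch between the policy and reference weights rather than to anything in the data.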

Step 6: Gradient Interpretation

What does the DPO gradient actually do? Define the implicit reward:

$$\hat{r}_\theta(x, y) = \beta\log\frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$$

The DPO gradient for a single example $(x, y_w, y_l)$ is:

$$\nabla_\theta \mathcal{L}_{\text{DPO}} \propto -\sigma\!\left(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\right) \left[\nabla_\theta \log\pi_\theta(y_w|x) - \nabla_\theta \log\pi_\theta(y_l|x)\right]$$

A gradient descent step along this gradient: (1) increases the log-probability of the chosen response $y_w$, (2) decreases the log-probability of the rejected response $y_l$, (3) weights both updates by $\sigma(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w))$ - a term that is large when the model currently gives higher implicit reward to the rejected response (the wrong answer) and small when the model already correctly prefers the chosen response.

The weighting is elegant: DPO concentrates gradient on the comparisons where the model is most wrong, and applies smaller corrections where it is already right. This is similar in spirit to AdaBoost's focus on misclassified examples.
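The weighting term can be checked with a finite-difference gradient. This sketch treats the two policy log-probabilities as the free variables (a simplification, since in a real model they are both functions of $\theta$) and uses made-up numbers. It confirms that the derivative of the loss with respect to the chosen log-probability is $-\beta$ times the sigmoid weight.

```python
import math


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def dpo_loss(lp_w, lp_l, ref_w, ref_l, beta):
    logit = beta * ((lp_w - ref_w) - (lp_l - ref_l))
    return math.log1p(math.exp(-logit))


def grad_weight(lp_w, lp_l, ref_w, ref_l, beta):
    """sigma(r_hat(y_l) - r_hat(y_w)): large when the model prefers the rejected response."""
    r_w = beta * (lp_w - ref_w)
    r_l = beta * (lp_l - ref_l)
    return sigmoid(r_l - r_w)


def numeric_grad_chosen(lp_w, lp_l, ref_w, ref_l, beta, eps=1e-6):
    """Central finite difference of the loss w.r.t. the chosen log-probability."""
    up = dpo_loss(lp_w + eps, lp_l, ref_w, ref_l, beta)
    down = dpo_loss(lp_w - eps, lp_l, ref_w, ref_l, beta)
    return (up - down) / (2 * eps)


ref_w, ref_l, beta = -12.0, -15.0, 0.1
wrong = (-16.0, -11.0)  # policy currently prefers the rejected response
right = (-9.0, -20.0)   # policy already prefers the chosen response

for name, (lp_w, lp_l) in [("wrong", wrong), ("right", right)]:
    weight = grad_weight(lp_w, lp_l, ref_w, ref_l, beta)
    numeric = numeric_grad_chosen(lp_w, lp_l, ref_w, ref_l, beta)
    print(f"{name}: weight={weight:.3f}, dL/dlogp_w={numeric:.5f}, -beta*weight={-beta * weight:.5f}")
```

The derivative is negative in both cases, so gradient descent raises the chosen log-probability, and the step is larger for the pair the model currently gets wrong.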


Architecture: DPO vs RLHF Training Flow

DPO collapses three training stages to two. The entire reward model training stage disappears. PPO is replaced by a standard supervised training loop. GPU memory drops from four models to two. The data format (prompt, chosen, rejected) is simpler to collect and store than PPO rollouts.


DPO Variants

The DPO paper launched a wave of follow-on work, each addressing a specific limitation.

IPO: Identity Preference Optimization (Azar et al., 2023)

The standard DPO loss uses the logistic (sigmoid) function, which keeps rewarding a larger log-ratio difference without bound. When preference labels are noisy or near-deterministic - annotators disagreeing, crowdsourced labels with high variance - the model can become overconfident on in-distribution comparisons: the margin keeps growing, the policy drifts far from the reference, and the regularization that $\beta$ is supposed to provide weakens.

IPO fixes this by replacing the sigmoid with a squared loss:

$$\mathcal{L}_{\text{IPO}} = \mathbb{E}_{(x,y_w,y_l)}\!\left[\left(\log\frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log\frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \frac{1}{2\beta}\right)^2\right]$$

IPO directly regresses the log-ratio difference toward a target value of $1/(2\beta)$. The squared loss has a finite minimum: its gradient is proportional to the distance from the target and changes sign once the margin overshoots, so the margin is pulled back toward the target instead of being pushed toward infinity. This makes IPO more robust to label noise.

When to use: crowdsourced preference labels with high inter-annotator disagreement, any setting where you suspect label noise. In trl: DPOConfig(loss_type="ipo").
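To see how the two losses treat the margin differently, compare their derivatives with respect to the log-ratio difference $h$. The sketch below differentiates the two loss formulas by hand (the function names are illustrative). The DPO derivative stays negative for every $h$, so it always pushes the margin up, only more weakly as $h$ grows. The IPO derivative is zero at the target $1/(2\beta)$ and turns positive beyond it.

```python
import math


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def dpo_grad(h: float, beta: float) -> float:
    """d/dh of -log(sigmoid(beta * h)). Always negative: the margin is always pushed up."""
    return -beta * sigmoid(-beta * h)


def ipo_grad(h: float, beta: float) -> float:
    """d/dh of (h - 1/(2*beta))**2. Zero at the target, positive beyond it."""
    return 2.0 * (h - 1.0 / (2.0 * beta))


beta = 0.1
target = 1.0 / (2.0 * beta)  # 5.0 for beta = 0.1

for h in [-20.0, 0.0, target, 20.0, 100.0]:
    print(f"h={h:7.1f}  dpo_grad={dpo_grad(h, beta):+.6f}  ipo_grad={ipo_grad(h, beta):+.2f}")
```

With gradient descent, a positive derivative means the margin is reduced, so IPO actively shrinks margins that have grown past the target.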

KTO: Kahneman-Tversky Optimization (Ethayarajh et al., 2023)

DPO requires paired (chosen, rejected) comparisons for the same prompt. Collecting these is more expensive than collecting single-label feedback ("this response is good" or "this response is bad"). KTO eliminates the pairing requirement:

$$\mathcal{L}_{\text{KTO}} = \mathbb{E}\!\left[\lambda_w \left(1 - \sigma\!\left(\hat{r}(y_w) - z_0\right)\right) + \lambda_l \left(1 - \sigma\!\left(z_0 - \hat{r}(y_l)\right)\right)\right]$$

where $y_w$ and $y_l$ are independently labeled desirable and undesirable responses, $z_0$ is a reference point (the KTO paper estimates it from the KL divergence between the policy and the reference instead of computing $\log Z(x)$), and $\lambda_w$, $\lambda_l$ are asymmetric weights. The loss is written so that lower is better: it falls as desirable responses rise above the reference point and undesirable ones fall below it. The asymmetry is motivated by Kahneman-Tversky prospect theory: humans weight losses more heavily than gains, so the rejected response penalty can be weighted more heavily than the chosen response reward.

KTO works with thumbs-up/thumbs-down data - each response is independently labeled as accepted or rejected, without being paired against a specific alternative. This is the natural format for A/B testing data and real-user feedback logs.
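A minimal sketch of a KTO-style loss on unpaired feedback follows. It assumes the loss takes the form $\lambda(1 - \sigma(\cdot))$ so that lower is better, treats the reference point `z0` as a given number, and starts from precomputed implicit rewards. The function name, the default weights, and the feedback values are all illustrative, not taken from any library.

```python
import math


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def kto_example_loss(implicit_reward: float, desirable: bool, z0: float,
                     lambda_d: float = 1.0, lambda_u: float = 1.0) -> float:
    """Loss for ONE independently labeled response (no pairing needed)."""
    if desirable:
        # Thumbs-up: push the implicit reward above the reference point.
        return lambda_d * (1.0 - sigmoid(implicit_reward - z0))
    # Thumbs-down: push the implicit reward below the reference point.
    return lambda_u * (1.0 - sigmoid(z0 - implicit_reward))


# (implicit reward, thumbs-up?) - hypothetical values for four unrelated responses
feedback = [(0.8, True), (-0.3, True), (0.5, False), (-1.2, False)]
z0 = 0.1  # assumed reference point; estimated from data in practice

losses = [kto_example_loss(r, good, z0) for r, good in feedback]
batch_loss = sum(losses) / len(losses)
print([round(v, 3) for v in losses], round(batch_loss, 3))
```

Each response contributes on its own, which is what lets thumbs-up/thumbs-down logs be used without constructing pairs.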

ORPO: Odds Ratio Preference Optimization (Hong et al., 2024)

ORPO eliminates the need for a reference model by combining SFT and preference optimization in a single training pass:

$$\mathcal{L}_{\text{ORPO}} = \mathcal{L}_{\text{SFT}} - \lambda\, \mathbb{E}\!\left[\log\sigma\!\left(\log\frac{\text{odds}_\theta(y_w|x)}{\text{odds}_\theta(y_l|x)}\right)\right]$$

where $\text{odds}_\theta(y|x) = \pi_\theta(y|x) / (1 - \pi_\theta(y|x))$.

ORPO uses the model's own confidence as the reference instead of a frozen SFT reference model. This means only one model in memory (vs two for DPO), and only one training stage (vs two for standard DPO pipeline). The trade-off is that ORPO is more sensitive to the training data quality since there is no reference distribution to regularize against.
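The odds-ratio term is easy to compute from log-probabilities. The sketch below assumes the inputs are length-normalized (average per-token) log-probabilities, so that $\pi_\theta(y|x)$ is comfortably inside $(0, 1)$. The function names and the value of `lam` are illustrative.

```python
import math


def log_odds(logp: float) -> float:
    """log(p / (1 - p)) computed from log p, for p strictly between 0 and 1."""
    return logp - math.log1p(-math.exp(logp))


def orpo_preference_term(logp_w: float, logp_l: float) -> float:
    """-log(sigmoid(log odds ratio)) for one (chosen, rejected) pair."""
    log_odds_ratio = log_odds(logp_w) - log_odds(logp_l)
    return math.log1p(math.exp(-log_odds_ratio))


def orpo_loss(sft_nll: float, logp_w: float, logp_l: float, lam: float = 0.1) -> float:
    """SFT loss on the chosen response plus the weighted odds-ratio penalty."""
    return sft_nll + lam * orpo_preference_term(logp_w, logp_l)


# Hypothetical average per-token log-probabilities under the current policy.
logp_chosen, logp_rejected = -0.5, -1.5

good_order = orpo_preference_term(logp_chosen, logp_rejected)
bad_order = orpo_preference_term(logp_rejected, logp_chosen)
total = orpo_loss(sft_nll=-logp_chosen, logp_w=logp_chosen, logp_l=logp_rejected)
print(f"penalty when chosen is likelier: {good_order:.3f}, when rejected is likelier: {bad_order:.3f}")
```

No reference log-probabilities appear anywhere, which is where the single-model memory footprint comes from.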

DPO vs RLHF: Comparison Table

| Dimension | RLHF (PPO) | DPO | IPO | KTO |
|---|---|---|---|---|
| Training stages | 3 (SFT + RM + PPO) | 2 (SFT + DPO) | 2 (SFT + IPO) | 2 (SFT + KTO) |
| Models in memory | 4 | 2 | 2 | 2 |
| Compute cost | Baseline | 3–5× cheaper | 3–5× cheaper | 3–5× cheaper |
| Stability | Finicky | Stable | More stable | Stable |
| Data format | Rollouts + reward scores | (prompt, chosen, rejected) | (prompt, chosen, rejected) | (prompt, response, accepted/rejected) |
| Noise robustness | Depends on RM quality | Moderate | High | High |
| Requires paired data | Yes (for RM) | Yes | Yes | No |
| When to prefer | Large scale, iterative, complex reward | Default alignment | Noisy labels | Single-label feedback |

Complete DPO Implementation

"""
Full DPO training pipeline with trl.DPOTrainer.
pip install trl transformers datasets peft accelerate
"""

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import DPOTrainer, DPOConfig
from datasets import Dataset
from peft import LoraConfig, get_peft_model

# ─────────────────────────────────────────────────────────────────────────────
# 1. Preference Dataset - (prompt, chosen, rejected) triplets
# The ONLY data format DPO needs. No reward model scores required.
# ─────────────────────────────────────────────────────────────────────────────

PREFERENCE_DATA = [
    {
        "prompt": "Explain what a transformer is in neural networks.",
        "chosen": (
            "A transformer is a neural network architecture introduced in "
            "'Attention Is All You Need' (Vaswani et al., 2017). It replaces "
            "recurrent layers with self-attention mechanisms, allowing the model "
            "to weigh the importance of different input tokens simultaneously. "
            "The key components are: multi-head self-attention (which captures "
            "relationships between all token pairs), position encodings (which "
            "restore positional information lost without recurrence), and "
            "position-wise feed-forward layers. Transformers scaled to become "
            "the foundation of GPT, BERT, T5, and essentially all modern LLMs."
        ),
        "rejected": (
            "A transformer is a type of neural network that uses attention. "
            "It was made by Google. It is used for NLP tasks like translation."
        ),
    },
    {
        "prompt": "What is the vanishing gradient problem and how do residual connections fix it?",
        "chosen": (
            "The vanishing gradient problem occurs in deep networks during "
            "backpropagation: gradients are multiplied by the Jacobian of each "
            "layer as they propagate backwards. In deep networks (50+ layers), "
            "if Jacobians have spectral radius less than 1, gradients decay "
            "exponentially with depth - layers near the input receive near-zero "
            "gradients and learn negligibly.\n\n"
            "Residual connections (He et al., 2016) solve this with a simple "
            "structural change: instead of learning H(x) directly, the block "
            "learns F(x) = H(x) - x, so the output is y = F(x) + x. The gradient "
            "of the loss with respect to the input now has an additive identity "
            "term: dL/dx = dL/dy · dF/dx + dL/dy. The +dL/dy "
            "term passes the gradient directly from output to input, bypassing "
            "the learned transformation entirely. Gradients cannot vanish "
            "through skip connections - they always have a direct path."
        ),
        "rejected": (
            "Vanishing gradients happen when gradients get too small in deep "
            "networks. Residual connections help by adding shortcuts that let "
            "gradients flow better."
        ),
    },
    {
        "prompt": "How does the Adam optimizer work?",
        "chosen": (
            "Adam (Adaptive Moment Estimation, Kingma & Ba, 2014) combines "
            "momentum and adaptive learning rates in a single algorithm.\n\n"
            "At each step t, Adam computes:\n"
            "1. First moment (momentum): m_t = β₁·m_{t-1} + (1-β₁)·g_t\n"
            "2. Second moment (adaptive scaling): v_t = β₂·v_{t-1} + (1-β₂)·g_t²\n"
            "3. Bias correction (accounts for zero initialization):\n"
            " m̂_t = m_t/(1-β₁^t), v̂_t = v_t/(1-β₂^t)\n"
            "4. Parameter update: θ_t = θ_{t-1} - α·m̂_t/(√v̂_t + ε)\n\n"
            "Intuition: m̂_t is the direction to move (smoothed gradient). "
            "v̂_t is a per-parameter learning rate normalizer - parameters with "
            "high historical gradient variance get smaller effective learning "
            "rates, preventing overshooting. Parameters with small, consistent "
            "gradients get larger effective updates.\n\n"
            "Default hyperparameters: α=0.001, β₁=0.9, β₂=0.999, ε=1e-8."
        ),
        "rejected": (
            "Adam is an optimizer that combines momentum and RMSprop. "
            "It uses adaptive learning rates for each parameter."
        ),
    },
]


def build_dpo_dataset() -> Dataset:
    """Build DPO dataset from preference triplets."""
    return Dataset.from_list(PREFERENCE_DATA)


# ─────────────────────────────────────────────────────────────────────────────
# 2. Manual DPO Loss Implementation (for pedagogical understanding)
# trl.DPOTrainer handles this automatically - this is for learning
# ─────────────────────────────────────────────────────────────────────────────

def compute_dpo_loss_manual(
    policy_logps_chosen: torch.Tensor,    # log π_θ(y_w | x) - shape (batch,)
    policy_logps_rejected: torch.Tensor,  # log π_θ(y_l | x) - shape (batch,)
    ref_logps_chosen: torch.Tensor,       # log π_ref(y_w | x) - shape (batch,)
    ref_logps_rejected: torch.Tensor,     # log π_ref(y_l | x) - shape (batch,)
    beta: float = 0.1,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    DPO loss - the exact computation from Rafailov et al. (2023).

    The DPO loss is:
        L = -E[log σ(β log(π_θ(y_w)/π_ref(y_w)) - β log(π_θ(y_l)/π_ref(y_l)))]

    In log space:
    - β log(π_θ(y_w)/π_ref(y_w)) = β · (policy_logps_chosen - ref_logps_chosen)
    - β log(π_θ(y_l)/π_ref(y_l)) = β · (policy_logps_rejected - ref_logps_rejected)
    - The difference of these is the logit for the sigmoid

    The implicit reward for response y given prompt x is:
        r̂(x, y) = β · (log π_θ(y|x) - log π_ref(y|x))

    Returns:
        loss: scalar DPO loss
        chosen_rewards: implicit reward for chosen responses (for monitoring)
        rejected_rewards: implicit reward for rejected responses (for monitoring)
    """
    # Compute implicit rewards: β · log(π_θ / π_ref)
    # High implicit reward = π_θ assigns relatively more probability than π_ref
    # Low implicit reward = π_θ assigns relatively less probability than π_ref
    chosen_rewards = beta * (policy_logps_chosen - ref_logps_chosen)
    rejected_rewards = beta * (policy_logps_rejected - ref_logps_rejected)

    # DPO logit: how much more does the model prefer chosen over rejected (vs reference)?
    logits = chosen_rewards - rejected_rewards

    # DPO loss: negative log-sigmoid of the logit
    # Equivalent to binary cross-entropy where the label is "chosen > rejected"
    # log σ(x) = -log(1 + e^{-x}); logsigmoid is the numerically stable implementation
    loss = -torch.nn.functional.logsigmoid(logits).mean()

    return loss, chosen_rewards.detach(), rejected_rewards.detach()


def compute_sequence_logprobs(
    model: torch.nn.Module,
    input_ids: torch.Tensor,       # shape (batch, seq_len)
    attention_mask: torch.Tensor,  # shape (batch, seq_len)
    response_start_idx: int,       # index where the response begins (after the prompt)
) -> torch.Tensor:
    """
    Compute the total log-probability of the response portion of a sequence.

    For an autoregressive model: log π(y|x) = Σ_t log π(y_t | x, y_{<t})
    We sum over response tokens only - prompt and padding tokens are excluded.

    Assumes right-padded batches in which every row shares the same prompt
    length, and response_start_idx >= 1.

    Call this under torch.no_grad() for the reference model, and with
    gradients enabled for the policy model (to allow backpropagation).
    """
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits  # shape (batch, seq_len, vocab_size)

    # Shift: logits[t] predicts token[t+1] (next-token prediction setup)
    shift_logits = logits[:, :-1, :]  # (batch, seq_len-1, vocab)
    shift_labels = input_ids[:, 1:]   # (batch, seq_len-1)
    shift_mask = attention_mask[:, 1:].to(shift_logits.dtype)  # 0 on padding

    # Per-token log-probabilities: log softmax over vocab dimension
    log_probs = torch.nn.functional.log_softmax(shift_logits, dim=-1)

    # Gather the log-prob of the actual next token at each position
    token_log_probs = log_probs.gather(
        dim=2, index=shift_labels.unsqueeze(-1)
    ).squeeze(-1)  # (batch, seq_len-1)

    # After the shift, column t holds log π(token[t+1]), so the first response
    # token (input position response_start_idx) sits at column response_start_idx - 1.
    start = response_start_idx - 1

    # Sum log-probs over response tokens only, zeroing out padding positions
    response_log_probs = (
        token_log_probs[:, start:] * shift_mask[:, start:]
    ).sum(dim=-1)

    return response_log_probs  # shape (batch,)


# ─────────────────────────────────────────────────────────────────────────────
# 3. Full DPO Training with trl.DPOTrainer
# ─────────────────────────────────────────────────────────────────────────────

def train_dpo(
    model_name: str = "gpt2",
    beta: float = 0.1,
    n_epochs: int = 3,
    lr: float = 1e-5,
    use_lora: bool = True,
    loss_type: str = "sigmoid",  # "sigmoid" for DPO, "ipo" for IPO
) -> torch.nn.Module:
    """
    Full DPO training using trl.DPOTrainer.

    trl.DPOTrainer handles automatically:
    - Providing the reference model (with a PEFT model and ref_model=None,
      the base weights with adapters disabled act as the frozen reference)
    - Computing sequence log-probabilities for policy and reference
    - Computing DPO loss with the chosen loss_type
    - Standard training loop with logging

    Critical hyperparameters:
        beta: Controls strength of preference signal vs. reference regularization.
            Lower β → stronger preference signal → larger policy changes.
            Higher β → conservative updates → policy stays near reference.
            Typical range: 0.05–0.3 for 7B models.
        loss_type: "sigmoid" for standard DPO, "ipo" for IPO (more noise-robust)
        lr: DPO is sensitive to learning rate. Use 1e-6 to 5e-5.
            Typically lower than the SFT learning rate for the same setup.

    Metrics to monitor during training (names as logged by trl):
        rewards/chosen: should increase (model prefers chosen over reference)
        rewards/rejected: should decrease (model disfavors rejected over reference)
        rewards/margins: (chosen - rejected) should increase consistently
        rewards/accuracies: fraction of comparisons where chosen > rejected; target >90%
        logps/chosen: should stay near SFT values (not collapse or explode)
        logps/rejected: should decrease as model learns to disfavor rejected
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"  # Common choice for causal LMs

    # Load the SFT model (the starting point - must be a good SFT checkpoint)
    # DPO does NOT work well on base pretrained models - always start from SFT
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,  # bfloat16 saves memory vs float32
    )

    # LoRA for memory-efficient fine-tuning
    # With LoRA, the reference model is the base model (before LoRA adapters)
    # This is efficient: reference = base weights, policy = base + LoRA adapters
    if use_lora:
        lora_config = LoraConfig(
            r=16,                       # LoRA rank - higher = more parameters
            lora_alpha=32,              # LoRA scaling: adapter update is scaled by alpha/r
            target_modules=["c_attn"],  # GPT-2's attention projection layer
            lora_dropout=0.05,
            bias="none",
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, lora_config)
        model.print_trainable_parameters()
        # With this config on GPT-2: ~0.6M trainable / ~124M total (~0.5%)
        # For 7B models the count depends on which modules are targeted

    dpo_config = DPOConfig(
        output_dir="./dpo_output",
        num_train_epochs=n_epochs,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size = 8
        learning_rate=lr,
        beta=beta,               # The β in the DPO loss
        loss_type=loss_type,     # "sigmoid" = DPO, "ipo" = IPO, "hinge" = SLiC
        max_prompt_length=256,   # Max tokens for prompt portion
        max_length=512,          # Max total sequence length (prompt + response)
        logging_steps=5,
        warmup_ratio=0.1,
        bf16=True,               # bfloat16 training
        # No explicit reference model is configured here: with a PEFT model,
        # the base (SFT) weights with adapters disabled serve as π_ref
    )

    dataset = build_dpo_dataset()

    trainer = DPOTrainer(
        model=model,
        args=dpo_config,
        train_dataset=dataset,
        processing_class=tokenizer,  # older trl versions name this argument `tokenizer`
        # ref_model: if None and `model` is a PEFT model, trl uses the base
        # model (adapters disabled) as reference, which is what LoRA training needs.
        # For full fine-tuning, pass an explicit frozen reference model.
    )

    trainer.train()
    return trainer.model


# ─────────────────────────────────────────────────────────────────────────────
# 4. IPO Loss Variant (for reference and direct comparison)
# ─────────────────────────────────────────────────────────────────────────────

def ipo_loss(
    policy_logps_chosen: torch.Tensor,
    policy_logps_rejected: torch.Tensor,
    ref_logps_chosen: torch.Tensor,
    ref_logps_rejected: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """
    Identity Preference Optimization loss (Azar et al., 2023).

    IPO replaces DPO's sigmoid loss with an L2 loss, making it:
    - More robust to noisy preference labels (the margin is bounded by a target)
    - More stable when preference labels have high inter-annotator disagreement

    The L2 loss directly regresses the log-ratio difference toward the target 1/(2β).
    The target 1/(2β) comes from the IPO theoretical framework - it is the
    target that satisfies the IPO regularized objective at optimality.
    """
    log_ratio_chosen = policy_logps_chosen - ref_logps_chosen
    log_ratio_rejected = policy_logps_rejected - ref_logps_rejected

    # Target: the log-ratio difference should equal 1/(2β)
    # This is what the optimal policy satisfies under IPO's framework
    target = 1.0 / (2.0 * beta)

    # L2 loss - gradient is proportional to the distance from the target,
    # so margins that overshoot are pulled back
    loss = ((log_ratio_chosen - log_ratio_rejected - target) ** 2).mean()
    return loss


# ─────────────────────────────────────────────────────────────────────────────
# 5. Demonstration: DPO loss computation mechanics
# ─────────────────────────────────────────────────────────────────────────────

if __name__ == "__main__":
    torch.manual_seed(42)

    print("=== DPO Loss Mechanics Demo ===\n")

    # Simulated log-probabilities for a batch of 3 preference pairs
    # Scenario: policy has already partially learned (chosen has higher log-prob)
    policy_chosen = torch.tensor([-1.5, -2.0, -1.8])    # log π_θ(y_w | x)
    policy_rejected = torch.tensor([-3.0, -4.5, -3.2])  # log π_θ(y_l | x)
    ref_chosen = torch.tensor([-2.0, -2.5, -2.2])       # log π_ref(y_w | x)
    ref_rejected = torch.tensor([-2.8, -3.0, -2.9])     # log π_ref(y_l | x)

    for beta in [0.05, 0.1, 0.3, 1.0]:
        loss, r_w, r_l = compute_dpo_loss_manual(
            policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=beta
        )
        print(f"β={beta:.2f}: loss={loss.item():.4f} | "
              f"implicit rewards chosen={r_w.numpy()} rejected={r_l.numpy()} | "
              f"margin={(r_w - r_l).numpy()}")

    print("\n=== Gradient Weight Analysis ===")
    print("The DPO gradient is weighted by σ(r̂(y_l) - r̂(y_w))")
    print("Large weight → model currently prefers rejected (wrong) → large correction")
    print("Small weight → model already prefers chosen (right) → small correction")

    loss, r_w, r_l = compute_dpo_loss_manual(
        policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1
    )
    grad_weights = torch.sigmoid(r_l - r_w)
    print(f"Gradient weights (larger = more correction needed): {grad_weights.numpy()}")

Production Notes

SFT Checkpoint Quality

DPO is more sensitive than RLHF to the quality of the starting SFT checkpoint. The reason is architectural: DPO's reference model is the SFT model. The implicit reward $\hat{r}_\theta(x, y) = \beta\log(\pi_\theta(y|x)/\pi_{\text{ref}}(y|x))$ measures how much the trained policy deviates from the reference. If the SFT model is poor, the reference distribution is poor, and the signal becomes noisy.

In RLHF, the reward model provides an absolute signal that can grade responses independently of where the policy started. A well-trained reward model can be informative even when the SFT checkpoint is mediocre. In DPO, the signal is relative to the reference model - if the reference cannot generate the chosen responses with reasonable probability, the ratio $\pi_\theta(y_w|x)/\pi_{\text{ref}}(y_w|x)$ diverges and the loss becomes numerically unstable.

Practical rule: before running DPO, verify that the SFT model achieves reasonable perplexity on a held-out set of chosen responses. If per-token perplexity on chosen responses exceeds ~10 (for a well-formatted domain), the SFT checkpoint is likely too weak.
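The check itself is a few lines once per-token log-probabilities of the chosen responses are available. This sketch assumes they have already been computed with the SFT model. The function names and the numbers are illustrative, and the threshold of 10 is the rule of thumb from the paragraph above.

```python
import math


def per_token_perplexity(token_logprobs: list[list[float]]) -> float:
    """exp of the negative mean log-probability, pooled over all response tokens."""
    flat = [lp for response in token_logprobs for lp in response]
    return math.exp(-sum(flat) / len(flat))


def sft_checkpoint_ok(token_logprobs: list[list[float]], max_ppl: float = 10.0) -> bool:
    """True if the SFT model assigns reasonable probability to the chosen responses."""
    return per_token_perplexity(token_logprobs) <= max_ppl


# Hypothetical per-token log-probs of held-out chosen responses under two checkpoints.
strong_sft = [[-1.0, -2.0], [-1.5, -1.5]]
weak_sft = [[-3.0, -2.5], [-2.8]]

for name, logps in [("strong", strong_sft), ("weak", weak_sft)]:
    print(f"{name}: ppl={per_token_perplexity(logps):.2f}, ok={sft_checkpoint_ok(logps)}")
```

Pooling over tokens instead of averaging per-response perplexities keeps long responses from being under-weighted. Either convention works as long as it is applied consistently.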

Beta Tuning

The $\beta$ parameter controls how tightly the policy is anchored to the reference model, and therefore how far the preference signal can move it:

  • Low $\beta$ (0.01–0.05): strong preference signal. The model makes large moves away from the reference. Risk: distribution collapse - the model assigns near-zero probability to rejected responses, becoming brittle. Monitor logps/rejected: if it drops below -50 per token on average, increase $\beta$.
  • High $\beta$ (0.5–1.0): conservative updates. The model barely moves from the SFT baseline. Risk: insufficient alignment - the model performs nearly identically to the reference on benchmarks.
  • Typical values: 0.1–0.3 for 7B models, 0.05–0.1 for larger models. The Zephyr paper used $\beta = 0.1$. Llama-2-Chat variants typically use $\beta = 0.1$–$0.2$.

Monitor rewards/margin (chosen implicit reward minus rejected implicit reward) during training. It should increase steadily from near zero to a stable positive value. If it explodes, $\beta$ is too low. If it barely moves, $\beta$ is too high or the learning rate is too small.
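As a rough sketch of what these monitoring quantities are, the snippet below computes them from summed sequence log-probabilities. The function name, the `logps/rejected_per_token` key, and all numbers are hypothetical; a training library will log its own versions of these metrics, and this only illustrates the arithmetic behind them.

```python
def dpo_batch_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
                      rejected_lengths, beta=0.1):
    """Batch-mean monitoring metrics from summed sequence log-probs."""
    n = len(policy_chosen)
    # Implicit rewards: beta * (log pi_theta - log pi_ref), one per sequence
    r_w = [beta * (p - r) for p, r in zip(policy_chosen, ref_chosen)]
    r_l = [beta * (p - r) for p, r in zip(policy_rejected, ref_rejected)]
    return {
        "rewards/margin": sum(w - l for w, l in zip(r_w, r_l)) / n,
        "logps/rejected_per_token": sum(
            lp / length for lp, length in zip(policy_rejected, rejected_lengths)
        ) / n,
    }

# Hypothetical summed log-probs for a batch of two preference pairs
metrics = dpo_batch_metrics(
    policy_chosen=[-40.0, -55.0],
    policy_rejected=[-60.0, -90.0],
    ref_chosen=[-42.0, -56.0],
    ref_rejected=[-50.0, -70.0],
    rejected_lengths=[20, 30],
)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# Threshold taken from the text above
if metrics["logps/rejected_per_token"] < -50:
    print("rejected log-probs are collapsing: revisit beta")
```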

Data Quality vs Quantity

DPO is more sensitive to data quality than PPO. With PPO, on-policy generation acts as continuous data augmentation: the reward model scores fresh rollouts at every step, so a mediocre preference dataset can be partly compensated for. With DPO, the fixed offline dataset is all you have.

Data quality factors that matter most:

  • Label noise: mislabeled comparisons (chosen and rejected accidentally swapped) inject inverted gradients directly. With PPO, the reward model averages over many comparisons; with DPO, each mislabeled triplet directly poisons the gradient. For noisy labels, prefer IPO.
  • Difficulty of comparisons: easy pairs (obviously good chosen, obviously bad rejected) stop contributing early, because once the model ranks them correctly the gradient weight $\sigma(\hat{r}(y_l) - \hat{r}(y_w))$ is near zero. Include challenging comparisons where chosen is only marginally better.
  • Prompt diversity: DPO benefits from diverse prompts spanning the full deployment distribution. Narrow datasets produce models that are strong on benchmarks but brittle in deployment.
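The label-noise and difficulty points can be made concrete with the DPO gradient weight $\sigma(\hat{r}(y_l) - \hat{r}(y_w))$. The sketch below uses made-up log-ratios $\log(\pi_\theta/\pi_{\text{ref}})$ for three pairs; the helper name and values are illustrative assumptions only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_weight(logratio_chosen, logratio_rejected, beta=0.1):
    """sigma(r_hat(y_l) - r_hat(y_w)) with r_hat = beta * log(pi_theta / pi_ref)."""
    return sigmoid(beta * (logratio_rejected - logratio_chosen))

# Hypothetical log-ratios partway through training
easy_pair = gradient_weight(20.0, -20.0)     # already well separated
hard_pair = gradient_weight(1.0, -1.0)       # chosen only marginally ahead
flipped_pair = gradient_weight(-20.0, 20.0)  # the easy pair with its labels swapped

print(f"easy pair weight:    {easy_pair:.3f}")
print(f"hard pair weight:    {hard_pair:.3f}")
print(f"flipped pair weight: {flipped_pair:.3f}")
```

The well-separated pair contributes almost nothing, the hard pair still contributes, and the mislabeled pair receives close to the maximum weight while pushing in the wrong direction.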

Combining SFT and DPO Data

A common production pattern: mix a small fraction (10–20%) of SFT demonstration data into the DPO training to prevent catastrophic forgetting of language modeling skills while training on preference data. This is especially important for domains where the preference dataset is small.
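One simple way to realize such a mixture is to decide per training step whether to draw an SFT batch or a DPO batch. The sketch below is an assumption about how the mixing could be scheduled, not a description of any particular library; `mixed_schedule` and the 15% fraction are hypothetical.

```python
import random

def mixed_schedule(num_steps, sft_fraction=0.15, seed=0):
    """Return one tag per training step: 'sft' for a demonstration batch, 'dpo' otherwise."""
    rng = random.Random(seed)  # seeded for a reproducible schedule
    return ["sft" if rng.random() < sft_fraction else "dpo" for _ in range(num_steps)]

schedule = mixed_schedule(10_000)
sft_share = schedule.count("sft") / len(schedule)
print(f"SFT share of steps: {sft_share:.3f}")
```

In a training loop, an `'sft'` step would apply the usual next-token cross-entropy loss to a demonstration batch, and a `'dpo'` step would apply the DPO loss to a preference batch.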


Common Mistakes

:::danger Not starting from a good SFT checkpoint

DPO requires a solid SFT model as its reference. Fine-tuning directly from a pretrained base model with DPO rarely works - the reference distribution is too broad (the model generates near-random text), the implicit reward signal is meaningless, and training is numerically unstable. The pipeline is always: pretrain → SFT → DPO. If SFT quality is suspect, fix it before attempting DPO.

:::

:::danger Reference model distribution collapse on chosen responses

If the SFT reference model assigns near-zero probability to chosen responses (i.e., the chosen responses are out-of-distribution for the reference), the ratio $\log(\pi_\theta(y_w|x)/\pi_{\text{ref}}(y_w|x))$ diverges. The DPO loss becomes dominated by numerical instability rather than meaningful preference learning. Check that the SFT model can generate the chosen responses with reasonable perplexity before running DPO on them.

:::

:::warning Log-probability collapse on rejected sequences

A common failure mode in DPO training: the model drives $\log\pi_\theta(y_l|x) \to -\infty$ for all rejected responses. The model completely stops being able to generate rejected-style responses. While this optimizes the DPO objective, it makes the model brittle: any response format similar to the rejected training examples may be suppressed, even when appropriate. Monitor logps/rejected - if it drops below -50 per response token on average, increase $\beta$ (a stronger KL anchor to the reference) or add an SFT regularization term.

:::

:::tip IPO for noisy preference data

If your preference labels come from crowdsourcing, have high inter-annotator disagreement, or you suspect mislabeling, use IPO instead of standard DPO. IPO's L2 loss does not saturate and regresses toward a calibrated target rather than using the sigmoid that can lock in incorrect confident predictions. Switch in trl with a single config change: DPOConfig(loss_type="ipo").

:::

:::tip KTO when you only have single-label feedback

If your feedback data is in thumbs-up / thumbs-down format (each response is independently labeled, not paired against a specific alternative), use KTO. This matches the natural format of real-user feedback and A/B testing logs. KTO eliminates the need to construct paired datasets from unpaired judgments, which can introduce noise. Use trl.KTOTrainer with the same configuration structure as DPOTrainer.

:::


YouTube Resources

| Video | Creator | Why Watch |
|---|---|---|
| DPO Paper Explained | Yannic Kilcher | Full DPO derivation step by step |
| RLHF vs DPO | Interconnects AI | Practical comparison - when to choose each |
| Zephyr: DPO in Practice | HuggingFace | Training Zephyr-7B with DPO, engineering details |
| LLM Alignment Methods Survey | Sebastian Raschka | Overview of SFT, RLHF, DPO, IPO, KTO |

Interview Questions and Answers

Q1: What is the key mathematical insight behind DPO? Walk through the derivation.

The key insight is that the RLHF objective - maximize reward subject to a KL constraint against a reference model - has a known closed-form optimal policy. When you write down the KL-regularized objective and solve for the optimal policy analytically, you get:

$$\pi^*(y|x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y|x)\exp\!\left(\frac{r(x,y)}{\beta}\right)$$

Rearranging to express the reward in terms of the policy gives:

$$r^*(x, y) = \beta\log\frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta\log Z(x)$$

When you substitute this into the Bradley-Terry preference model, the $\beta\log Z(x)$ term appears identically in both $r(y_w)$ and $r(y_l)$ and cancels:

$$P(y_w \succ y_l \mid x) = \sigma\!\left(\beta\log\frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta\log\frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)$$

The intractable partition function $Z(x)$ is gone. The preference probability depends only on computable ratios of policy probabilities. This gives the DPO loss directly: $\mathcal{L}_{\text{DPO}} = -\mathbb{E}\left[\log\sigma(\beta\log(\pi_\theta(y_w)/\pi_{\text{ref}}(y_w)) - \beta\log(\pi_\theta(y_l)/\pi_{\text{ref}}(y_l)))\right]$.


Q2: Why does Z(x) cancel? Why is that the crucial step?

$Z(x)$ is the partition function - a normalizing constant that ensures the probability distribution sums to 1 over all possible responses. It appears in the analytical optimal policy $\pi^*(y|x) = \pi_{\text{ref}}(y|x)\exp(r(x,y)/\beta)/Z(x)$.

When you express the reward in terms of the policy, $Z(x)$ appears as an additive term $\beta\log Z(x)$ in the reward formula. Because $Z(x)$ depends only on the prompt $x$ - not on the specific response $y$ - it is the same value for both the chosen response $y_w$ and the rejected response $y_l$.

In the Bradley-Terry model, the preference probability depends on the difference $r(y_w) - r(y_l)$. The additive $\beta\log Z(x)$ term is identical for both responses and cancels in the difference. This cancellation is why the intractable sum over all responses disappears.

Without this cancellation, computing $Z(x)$ would require summing the reference model's probability weighted by the reward exponential over all possible responses - an infinite or astronomically large sum for an autoregressive LM. The cancellation makes the DPO loss tractable to compute and minimize.
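The cancellation can be checked numerically on a toy problem where $Z(x)$ is small enough to compute. The sketch below assumes a single prompt with three possible responses; the reference probabilities, rewards, and $\beta$ are made-up values chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

beta = 0.5
pi_ref = {"a": 0.5, "b": 0.3, "c": 0.2}   # toy reference policy for one prompt
reward = {"a": 1.0, "b": -0.5, "c": 2.0}  # toy "true" rewards

# Closed-form optimal policy: pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z
unnormalized = {y: pi_ref[y] * math.exp(reward[y] / beta) for y in pi_ref}
Z = sum(unnormalized.values())
pi_star = {y: u / Z for y, u in unnormalized.items()}

def bt_from_reward(y_w, y_l):
    """Bradley-Terry preference probability using the true rewards."""
    return sigmoid(reward[y_w] - reward[y_l])

def bt_from_policy(y_w, y_l):
    """Same probability using only policy/reference log-ratios (no Z, no reward)."""
    r_w = beta * math.log(pi_star[y_w] / pi_ref[y_w])
    r_l = beta * math.log(pi_star[y_l] / pi_ref[y_l])
    return sigmoid(r_w - r_l)

for y_w, y_l in [("c", "a"), ("a", "b"), ("c", "b")]:
    print(y_w, y_l, bt_from_reward(y_w, y_l), bt_from_policy(y_w, y_l))
```

Each implicit reward differs from the true reward by the same constant $\beta\log Z(x)$, so the two preference probabilities agree up to floating-point error.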


Q3: What are the pros and cons of DPO versus RLHF (PPO)?

DPO advantages: (1) 3–5× cheaper - no reward model training, no PPO loop, 2 models vs 4 in memory. (2) Simpler to implement and tune - standard supervised training, no PPO hyperparameter tuning, no rollout infrastructure. (3) More stable - no PPO clipping, no reward normalization, no value function training. (4) Works on fixed offline datasets - no need for real-time rollout generation. (5) The language model is its own reward model - implicit reward is useful for evaluation and ranking.

RLHF (PPO) advantages: (1) Online generation - PPO generates data from the current policy, correcting distribution shifts that DPO (operating on fixed offline data) cannot address. (2) Iterative improvement - can continuously collect preference comparisons on the updated model and retrain the reward model. (3) Complex reward signals - process reward models (PRMs) that evaluate reasoning steps require online rollouts. (4) Handles distribution shift - when the policy moves substantially from the SFT baseline, DPO's fixed offline data becomes less representative; PPO adapts.

In practice: use DPO for most alignment tasks, especially initial alignment and when compute is limited. Use PPO when you have the budget, need iterative improvement loops, or have evidence that DPO underfits (e.g., complex multi-step reasoning tasks where step-level rewards matter).


Q4: What is the role of β in DPO and how do you tune it?

$\beta$ is the KL regularization coefficient - it controls how strongly the policy is held close to the reference model. It appears in the RLHF objective as the coefficient on the KL penalty: $\max_\pi \mathbb{E}[r] - \beta D_{\text{KL}}(\pi \| \pi_{\text{ref}})$.

In the DPO loss, $\beta$ scales the implicit reward: $\hat{r}(x,y) = \beta\log(\pi_\theta(y|x)/\pi_{\text{ref}}(y|x))$. Smaller $\beta$ → a given log-ratio gap yields a smaller implicit reward margin → the sigmoid saturates later → the gradient stays large for longer → larger policy changes. Larger $\beta$ → the sigmoid saturates after a small log-ratio gap → the gradient fades quickly → conservative policy updates.

Tuning strategy: (1) Start with $\beta = 0.1$ for 7B models (Zephyr, Mistral-Instruct default). (2) Monitor rewards/margin - it should increase steadily over training. If it barely moves, reduce $\beta$ or increase learning rate. If it explodes, increase $\beta$. (3) Monitor logps/rejected - if it collapses to very negative values (distribution collapse), increase $\beta$. (4) For noisy preference data, use IPO which is less sensitive to $\beta$.
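To get a feel for the scale of $\beta$, one can ask how large the chosen-minus-rejected log-ratio gap $h$ must become before the gradient weight $\sigma(-\beta h)$ falls to a small value. The sketch below solves that for a 5% weight; the helper name and the 5% cutoff are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gap_for_weight(weight, beta):
    """Log-ratio gap h at which the gradient weight sigmoid(-beta * h) equals `weight`."""
    return math.log((1.0 - weight) / weight) / beta

for beta in (0.01, 0.1, 0.5):
    h = gap_for_weight(0.05, beta)
    print(f"beta={beta}: weight falls to 5% at log-ratio gap {h:.1f}")
```

The required gap scales as $1/\beta$, so a tenfold smaller $\beta$ lets the policy drift roughly ten times further (in log-ratio terms) before the loss stops pushing.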


Q5: What are IPO and KTO, and when would you use each over standard DPO?

IPO (Identity Preference Optimization, Azar et al., 2023) replaces DPO's sigmoid loss with an L2 loss: $\mathcal{L}_{\text{IPO}} = \mathbb{E}[(\log(\pi_\theta(y_w)/\pi_{\text{ref}}(y_w)) - \log(\pi_\theta(y_l)/\pi_{\text{ref}}(y_l)) - 1/(2\beta))^2]$. The squared loss regresses toward a calibrated target rather than maximizing the sigmoid. This prevents the saturation problem where the model becomes overconfident on in-distribution comparisons. Use IPO when: preference labels are noisy (crowdsourced, high disagreement), or you observe the DPO model overfitting and becoming overconfident.

KTO (Kahneman-Tversky Optimization, Ethayarajh et al., 2023) works with single-label feedback rather than paired comparisons. Each response is independently labeled as accepted or rejected, without being paired against a specific alternative. KTO is motivated by prospect theory - it uses asymmetric weights for accepted vs rejected responses reflecting that losses feel larger than gains. Use KTO when: your feedback data is in thumbs-up/thumbs-down format (not paired), you want to use A/B testing data directly, or collecting paired comparisons is operationally difficult.

Both are available in trl: DPOConfig(loss_type="ipo") for IPO, KTOTrainer for KTO.
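The difference between the DPO and IPO objectives is easiest to see per pair, as a function of the log-ratio gap $h = \log(\pi_\theta(y_w)/\pi_{\text{ref}}(y_w)) - \log(\pi_\theta(y_l)/\pi_{\text{ref}}(y_l))$. The sketch below evaluates both losses for a few hypothetical values of $h$; it is a standalone illustration of the formulas above, not the trl implementation.

```python
import math

def neg_log_sigmoid(z):
    """Numerically stable -log(sigmoid(z))."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def dpo_loss(h, beta):
    """Per-pair DPO loss: -log sigmoid(beta * h)."""
    return neg_log_sigmoid(beta * h)

def ipo_loss(h, beta):
    """Per-pair IPO loss: squared distance of h from the target 1 / (2 * beta)."""
    return (h - 1.0 / (2.0 * beta)) ** 2

beta = 0.1
for h in (0.0, 5.0, 50.0):
    print(f"h={h:5.1f}  dpo={dpo_loss(h, beta):.4f}  ipo={ipo_loss(h, beta):.1f}")
```

The DPO loss keeps decreasing as the gap grows, so the optimizer is always rewarded for widening it further. The IPO loss is minimized at $h = 1/(2\beta)$ and penalizes overshooting that target.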


Q6: When would you still choose RLHF over DPO in a production system?

Four concrete scenarios where RLHF maintains an advantage:

(1) Iterative online learning: If you want to continuously collect new preference data on the updated model's outputs and retrain, RLHF is naturally iterative. Each PPO iteration generates new rollouts which are evaluated by the reward model. DPO trains on a fixed offline dataset - you must manually regenerate the dataset and restart training to incorporate new preference data.

(2) Process reward models for reasoning: For tasks requiring multi-step reasoning (mathematics, coding, logical inference), process reward models (PRMs) evaluate the quality of intermediate reasoning steps, not just the final output. PRMs require online rollout generation to get step-by-step scores - something DPO cannot do.

(3) Very large distribution shifts: If the alignment task requires the model to dramatically change its behavior from the SFT baseline (e.g., learning a completely new conversational style), DPO's fixed offline data may not cover the new distribution. PPO generates on-policy data from the current (changed) policy, naturally adapting to the new distribution.

(4) Multi-dimensional reward signals: RLHF can combine multiple reward models (helpfulness, safety, factuality) with different weights, tuning the mixture dynamically during training. DPO has one preference signal per training example. Decomposed reward learning requires the RLHF framework.


Key Takeaways

  • DPO derives from the same KL-regularized RLHF objective as PPO - but solves it analytically rather than numerically via policy gradient.

  • The key mathematical insight: the partition function $Z(x)$ cancels when the optimal policy expression is substituted into the Bradley-Terry model, eliminating the need for an explicit reward model.

  • The DPO loss directly increases the probability of chosen responses and decreases rejected responses, relative to the reference model:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w)}{\pi_{\text{ref}}(y_w)} - \beta\log\frac{\pi_\theta(y_l)}{\pi_{\text{ref}}(y_l)}\right)\right]$$

  • DPO is 3–5× cheaper than RLHF and significantly simpler to implement, tune, and maintain.

  • DPO requires a high-quality SFT checkpoint - the reference model quality determines the quality of the implicit reward signal.

  • IPO is more robust to noisy preference labels; KTO eliminates the need for paired comparison data.

  • Open-source models (Zephyr, Mistral-Instruct, Llama variants) predominantly use DPO or its variants for alignment - RLHF is largely a frontier lab approach.

