Proximal Policy Optimisation - The Algorithm That Runs ChatGPT's RLHF
Reading time: ~40 minutes | Level: Reinforcement Learning | Role: MLE, AI Research Engineer, MLOps
The Real Engineering Moment
The year is 2017 and John Schulman is frustrated. He has already published TRPO - Trust Region Policy Optimization - which provably prevents destructive policy updates and dramatically stabilizes RL training. But TRPO has a serious engineering problem: computing the natural gradient requires solving a constrained optimization problem at every step. The conjugate gradient procedure is numerically finicky. Implementing it correctly takes hundreds of lines of custom code. Nobody outside OpenAI uses it in practice.
Schulman goes back to the math. TRPO works by enforcing a hard constraint: the KL divergence between old and new policy must stay below some threshold $\delta$. What if you could enforce that constraint approximately, without the second-order optimization machinery? What if a simple clipping operation on the probability ratio could achieve the same effect?
The PPO paper drops in July 2017 - two pages of math and a surprisingly clean implementation. The key insight: instead of constraining the KL divergence directly, clip the probability ratio to stay within $[1-\epsilon, 1+\epsilon]$. If the new policy wants to assign much higher or much lower probability to an action than the old policy, ignore that gradient signal. The result is an algorithm that is nearly as stable as TRPO, implemented in under 100 lines of PyTorch.
A few years pass. RLHF researchers at OpenAI are building InstructGPT - a version of GPT-3 that follows instructions instead of just completing prompts. They need an RL algorithm that can fine-tune a 175 billion parameter language model against a learned reward model. PPO is the algorithm they rely on to do it without catastrophically collapsing the language model. InstructGPT ships in 2022. ChatGPT follows. Both use PPO at their core.
That clipping trick turns out to be one of the most consequential ideas in the history of AI. If you are working anywhere near LLM training - alignment, RLHF, constitutional AI, reward modeling - you need to understand PPO deeply.
Why This Exists: The Instability Problem in Policy Gradients
In the previous lesson we saw REINFORCE and actor-critic methods. They work, but they have a fundamental instability: there is no mechanism that prevents you from taking a step so large it destroys the policy.
Consider what happens when you update with a large gradient step:
- The policy changes significantly
- The trajectory distribution changes significantly
- The gradients you computed under the old policy are now stale
- But you keep using them - you're doing gradient ascent with an outdated estimator
This is the problem. Policy gradient methods use data from $\pi_\theta$ to estimate the gradient of $J(\theta)$, then update $\theta$. But after the update, the data is no longer from $\pi_\theta$ - it's from the old policy. If the step is too large, the new policy is in a completely different region of policy space from where the gradient was estimated, and the update is not just useless but actively harmful.
TRPO solution: enforce a KL divergence constraint at every step. Computationally expensive - requires second-order optimization.
PPO solution: clip the probability ratio so updates that move the policy too far from $\pi_{\theta_\text{old}}$ get zero gradient. Computationally cheap - standard first-order gradient descent.
Historical Context
| Year | Work | Authors | Key Contribution |
|---|---|---|---|
| 1992 | REINFORCE | Williams | Policy gradient via log-probability trick |
| 2001 | Natural Policy Gradient | Kakade | Fisher information matrix for stable updates |
| 2015 | TRPO | Schulman et al. | Trust region constraint on KL divergence |
| 2016 | GAE | Schulman et al. | Generalized Advantage Estimation - bias-variance tradeoff |
| 2017 | PPO | Schulman et al. | Clipped surrogate objective - simple + stable |
| 2019 | RLHF for LMs | Ziegler et al. | First application of PPO to language model fine-tuning |
| 2022 | InstructGPT | Ouyang et al. | PPO + RLHF = instruction-following GPT-3 |
| 2022 | ChatGPT | OpenAI | PPO + RLHF at production scale |
The conceptual lineage is clean: REINFORCE → natural gradient → TRPO → PPO. Each step makes the algorithm more practical while keeping most of the stability in practice (PPO trades TRPO's formal trust-region guarantee for an empirical one).
Core Concepts
1. The Policy Gradient Objective and Its Instability
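We want to maximize expected cumulative reward:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$$

The policy gradient theorem gives us:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$$

The problem: this gradient is estimated from trajectories sampled by $\pi_\theta$. After we take a gradient step, $\theta$ changes, the policy changes, and the old trajectories are no longer valid samples from the new policy. Taking a large step means the gradient estimate is wildly off.
In practice: REINFORCE training looks like a drunk man walking. Performance collapses at seemingly random points. Training is fragile.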
2. The Probability Ratio: Reusing Old Data
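Importance sampling lets us estimate expectations under one distribution using samples from another:

$$\mathbb{E}_{x \sim p}\left[f(x)\right] = \mathbb{E}_{x \sim q}\left[\frac{p(x)}{q(x)}\, f(x)\right]$$

Apply this to policy gradients: we collected data under $\pi_{\theta_\text{old}}$ but want to estimate the gradient under $\pi_\theta$. Define the probability ratio:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}$$

This ratio tells us: how much more (or less) likely is the new policy to take action $a_t$ in state $s_t$ compared to the policy that collected the data?
- $r_t = 1$: policy unchanged - using this action exactly as much as before
- $r_t = 2$: new policy twice as likely to take this action
- $r_t = 0.5$: new policy half as likely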
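As a small illustration of the ratio, the sketch below computes it from log-probabilities, which is how implementations usually store them. The function name `probability_ratio` and the example probabilities are assumptions made for this example.

```python
import math


def probability_ratio(new_log_prob: float, old_log_prob: float) -> float:
    """r = pi_new(a|s) / pi_old(a|s), computed in log space for stability."""
    return math.exp(new_log_prob - old_log_prob)


# Hypothetical numbers: the old policy gave this action probability 0.25.
old_lp = math.log(0.25)

unchanged = probability_ratio(math.log(0.25), old_lp)     # same probability
doubled = probability_ratio(math.log(0.50), old_lp)       # twice as likely
halved = probability_ratio(math.log(0.125), old_lp)       # half as likely

print(unchanged, doubled, halved)
```

Working in log space avoids dividing two very small probabilities, which matters once actions are tokens from a large vocabulary.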
3. The Conservative Policy Iteration (CPI) Objective
Using importance sampling, we can write an objective that uses old data to estimate the new policy's performance:

$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[r_t(\theta)\, \hat{A}_t\right]$$

where $\hat{A}_t$ is the estimated advantage at timestep $t$.
Intuition: if an action had a positive advantage (it was better than expected) and the new policy assigns it higher probability ($r_t(\theta) > 1$), this is good - we should do more of that action. If $r_t(\theta) < 1$, we're doing less of it, and the objective goes down.
Problem: this objective has no constraint on how large $r_t(\theta)$ can get. The optimizer can make $r_t(\theta)$ arbitrarily large, changing the policy dramatically in one step. We're back to the instability problem.
4. The PPO Clipped Objective - The Key Innovation
PPO solves this by clipping $r_t(\theta)$ to $[1-\epsilon, 1+\epsilon]$:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\, \hat{A}_t,\ \text{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_t\right)\right]$$

The $\min$ is critical. Let's trace through the four cases:
Case 1: Positive advantage ($\hat{A}_t > 0$), ratio too high ($r_t > 1+\epsilon$)
- The action was good. New policy wants to do it much more.
- Unclipped term: $r_t \hat{A}_t$ (large, positive - would encourage more of this)
- Clipped term: $(1+\epsilon) \hat{A}_t$ (smaller, positive)
- $\min$ picks the clipped term → gradient is zeroed out for this action
- Effect: we stop increasing the probability of this action once the ratio has reached $1+\epsilon$
Case 2: Positive advantage ($\hat{A}_t > 0$), ratio in range ($1-\epsilon \le r_t \le 1+\epsilon$)
- The action was good. New policy is doing more of it, but not too much.
- Both terms equal - gradient flows normally
Case 3: Negative advantage ($\hat{A}_t < 0$), ratio too low ($r_t < 1-\epsilon$)
- The action was bad. New policy wants to avoid it much more.
- Clipping prevents over-penalizing - stops gradient once the ratio has dropped to $1-\epsilon$
Case 4: Negative advantage ($\hat{A}_t < 0$), ratio too high ($r_t > 1+\epsilon$)
- The action was bad. New policy is doing it more - this is the danger zone.
- Unclipped term: $r_t \hat{A}_t$ (large negative - strong penalty)
- Clipped term: $(1+\epsilon) \hat{A}_t$ (less negative)
- $\min$ picks the unclipped term → gradient still flows to correct this mistake
- Critical: PPO does NOT block gradients for bad actions that the policy is increasing. It blocks them only once the policy has already moved far enough in the helpful direction: good actions raised past $1+\epsilon$, or bad actions lowered past $1-\epsilon$.
This asymmetry is the mathematical heart of PPO. Because of the $\min$, $L^{CLIP}$ is a pessimistic lower bound on the unclipped surrogate $L^{CPI}$.
Probability Ratio r_t(θ)
Advantage > 0 (good action):
┌──────────────────────────────────────────────────┐
│ Gradient flows │ Gradient clipped │
│ (ratio in range)│ (ratio too high) │
├─────────────────┤──────────────────────────────────
0 1-ε 1 1+ε ∞
↑
r_old=1
Advantage < 0 (bad action):
┌────────────────────────────────────────────────────┐
│ Gradient clipped │ Gradient flows │
│ (ratio too low) │ (ratio in range or high) │
├────────────────────────┤────────────────────────────
0 1-ε 1 1+ε ∞
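The four cases can be checked numerically with a single-sample version of the clipped objective. This is a minimal sketch in plain Python; the helper name `clipped_surrogate` and the sample ratios are assumptions for illustration, and a real implementation would operate on batched tensors.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2):
    """Return (objective, gradient_flows) for one (ratio, advantage) sample."""
    unclipped = ratio * advantage
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    clipped = clipped_ratio * advantage
    objective = min(unclipped, clipped)
    # The gradient w.r.t. the ratio is non-zero only when the min selects
    # the unclipped term (ties count as unclipped).
    gradient_flows = unclipped <= clipped
    return objective, gradient_flows


cases = {
    "case 1: A>0, ratio too high": (1.5, 1.0),
    "case 2: A>0, ratio in range": (1.1, 1.0),
    "case 3: A<0, ratio too low": (0.5, -1.0),
    "case 4: A<0, ratio too high": (1.5, -1.0),
}
for name, (ratio, advantage) in cases.items():
    objective, flows = clipped_surrogate(ratio, advantage)
    print(f"{name}: objective={objective:.2f}, gradient flows={flows}")
```

Only cases 1 and 3 report that the gradient is blocked, which matches the asymmetry described above.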
5. Generalized Advantage Estimation (GAE)
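The advantage $A^\pi(s_t, a_t)$ measures how much better action $a_t$ was compared to the average action from state $s_t$:

$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$

We never know the true Q or V. We estimate them. The simplest estimator: the TD residual (1-step):

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

This has low variance but high bias (since $V$ is an approximation). The multi-step return has lower bias but higher variance. GAE (Schulman et al. 2016) interpolates:

$$\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$$

where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.
The parameter $\lambda$ controls the tradeoff:
- $\lambda = 0$: pure 1-step TD - high bias, low variance ($\hat{A}_t = \delta_t$)
- $\lambda = 1$: full Monte Carlo - low bias, high variance ($\hat{A}_t = \sum_{l=0}^{\infty} \gamma^l \delta_{t+l} = \sum_{l=0}^{\infty} \gamma^l r_{t+l} - V(s_t)$)
- $\lambda = 0.95$ (PPO default): good balance
The exponential decay means distant TD errors contribute less - we trust near-term estimates more.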
6. The Value Function Loss
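PPO uses an actor-critic architecture. The critic (value function $V_\theta$) is updated to minimize the squared error against the target returns:

$$L^{VF}(\theta) = \hat{\mathbb{E}}_t\left[\left(V_\theta(s_t) - V_t^{\text{target}}\right)^2\right]$$

where $V_t^{\text{target}}$ is typically the GAE-estimated return: $V_t^{\text{target}} = \hat{A}_t + V_{\theta_\text{old}}(s_t)$.
:::note Value Function Clipping
Some implementations also clip the value function update to prevent large changes: $V_t^{\text{clip}} = V_{\theta_\text{old}}(s_t) + \text{clip}\left(V_\theta(s_t) - V_{\theta_\text{old}}(s_t), -\epsilon, \epsilon\right)$, using the larger of the clipped and unclipped squared errors as the loss. This is less universally agreed upon than the policy clipping.
:::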
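One common form of value clipping, shown here for a single sample, is sketched below. The function name and the convention of taking the maximum of the two squared errors are assumptions based on widely used implementations rather than something the PPO paper specifies.

```python
def clipped_value_loss(v_new: float, v_old: float, target: float, eps: float = 0.2) -> float:
    """Squared error where the new prediction may move at most eps from the old one."""
    # Limit how far the new value prediction can move away from the old prediction.
    delta = max(-eps, min(v_new - v_old, eps))
    v_clipped = v_old + delta
    # Take the worse (larger) of the two errors, mirroring the pessimism
    # of the clipped policy objective.
    return max((v_new - target) ** 2, (v_clipped - target) ** 2)


# Hypothetical numbers: the critic jumped from 1.0 to exactly the target 2.0.
big_jump = clipped_value_loss(v_new=2.0, v_old=1.0, target=2.0)
small_step = clipped_value_loss(v_new=1.1, v_old=1.0, target=2.0)
print(big_jump, small_step)
```

In the first call the unclipped error is zero, but the clipped prediction (1.2) is still far from the target, so the loss stays positive and discourages the large jump.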
7. The Entropy Bonus
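Without encouragement, policies tend to become deterministic early in training - they commit to the first solution they find and stop exploring. The entropy bonus rewards high-entropy (uncertain) policies, which amounts to penalizing low-entropy (certain) ones:

$$H[\pi_\theta](s_t) = -\sum_{a} \pi_\theta(a \mid s_t) \log \pi_\theta(a \mid s_t)$$

Adding this to the objective encourages exploration throughout training.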
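A quick numeric check of the entropy of a categorical policy shows what the bonus rewards. The probability vectors below are hypothetical four-action policies.

```python
import math


def categorical_entropy(probs) -> float:
    """H(p) = -sum_a p(a) log p(a), skipping zero-probability actions."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = categorical_entropy([0.25, 0.25, 0.25, 0.25])     # maximal for 4 actions
peaked = categorical_entropy([0.97, 0.01, 0.01, 0.01])      # nearly deterministic
deterministic = categorical_entropy([1.0, 0.0, 0.0, 0.0])   # no exploration left

print(uniform, peaked, deterministic)
```

The uniform policy reaches the maximum, $\log 4$, while the deterministic policy has zero entropy and so earns no bonus.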
8. The Full PPO Objective
Combining all three terms (maximized with respect to $\theta$):

$$L(\theta) = \hat{\mathbb{E}}_t\left[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 H[\pi_\theta](s_t)\right]$$

Commonly used defaults, drawn from the original PPO paper and popular open-source implementations:
| Hyperparameter | Symbol | Default | Notes |
|---|---|---|---|
| Clip ratio | $\epsilon$ | 0.2 | How far ratio can deviate from 1 |
| Discount | $\gamma$ | 0.99 | Future reward discount |
| GAE lambda | $\lambda$ | 0.95 | Bias-variance tradeoff |
| Value loss coeff | $c_1$ | 0.5 | Weight of critic loss |
| Entropy coeff | $c_2$ | 0.01 | Weight of entropy bonus |
| Epochs per batch | $K$ | 4–10 | How many passes over the same rollout data |
| Minibatch size | - | 64–2048 | Larger is better for stability |
| Learning rate | $\alpha$ | 3e-4 | Works well with Adam |
Architecture and Training Flow
Complete PPO Implementation
The following is a full, working PPO implementation for discrete-action control (tested on CartPole and LunarLander).
```python
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import gymnasium as gym
from torch.distributions import Categorical


# ── Networks ──────────────────────────────────────────────────────────────────

class ActorNetwork(nn.Module):
    """Policy network: outputs logits over discrete actions."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh(),
        )
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.net(x))

    def get_distribution(self, obs: torch.Tensor) -> Categorical:
        logits = self.forward(obs)
        return Categorical(logits=logits)

    def evaluate_actions(self, obs: torch.Tensor, actions: torch.Tensor):
        dist = self.get_distribution(obs)
        log_probs = dist.log_prob(actions)
        entropy = dist.entropy()
        return log_probs, entropy


class CriticNetwork(nn.Module):
    """Value network: outputs scalar state value V(s)."""

    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


# ── PPO Agent ─────────────────────────────────────────────────────────────────

class PPOAgent:
    def __init__(
        self,
        obs_dim: int,
        act_dim: int,
        lr: float = 3e-4,
        gamma: float = 0.99,
        gae_lambda: float = 0.95,
        clip_eps: float = 0.2,
        value_coef: float = 0.5,
        entropy_coef: float = 0.01,
        n_epochs: int = 10,
        minibatch_size: int = 64,
    ):
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.clip_eps = clip_eps
        self.value_coef = value_coef
        self.entropy_coef = entropy_coef
        self.n_epochs = n_epochs
        self.minibatch_size = minibatch_size

        self.actor = ActorNetwork(obs_dim, act_dim)
        self.critic = CriticNetwork(obs_dim)

        # Shared optimizer - common in PPO implementations
        self.optimizer = optim.Adam(
            list(self.actor.parameters()) + list(self.critic.parameters()),
            lr=lr,
        )

    @torch.no_grad()
    def get_action(self, obs: np.ndarray):
        obs_t = torch.FloatTensor(obs)
        dist = self.actor.get_distribution(obs_t)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        value = self.critic(obs_t)
        return action.item(), log_prob.item(), value.item()

    def compute_gae(
        self,
        rewards: list,
        values: list,
        dones: list,
        next_value: float,
    ) -> tuple[list, list]:
        """
        Compute Generalized Advantage Estimates and returns.

        GAE formula: Â_t = Σ_{l=0}^∞ (γλ)^l δ_{t+l}
        where δ_t = r_t + γ V(s_{t+1}) - V(s_t)
        """
        advantages = []
        gae = 0.0

        # Append next_value for bootstrapping
        values = values + [next_value]

        # Iterate backwards through trajectory
        for t in reversed(range(len(rewards))):
            # TD error; (1 - done) removes the bootstrap at episode boundaries
            delta = rewards[t] + self.gamma * values[t + 1] * (1 - dones[t]) - values[t]
            # Accumulate GAE (exponentially weighted sum of TD errors)
            gae = delta + self.gamma * self.gae_lambda * (1 - dones[t]) * gae
            advantages.insert(0, gae)

        # Returns = advantages + values (for value function target)
        returns = [adv + val for adv, val in zip(advantages, values[:-1])]
        return advantages, returns

    def update(self, rollout: dict) -> dict:
        """
        Run K epochs of PPO updates on collected rollout data.
        Returns dict of loss statistics.
        """
        obs = torch.FloatTensor(np.array(rollout["obs"]))
        actions = torch.LongTensor(rollout["actions"])
        old_log_probs = torch.FloatTensor(rollout["log_probs"])
        advantages = torch.FloatTensor(rollout["advantages"])
        returns = torch.FloatTensor(rollout["returns"])

        # ── CRITICAL: Normalize advantages ────────────────────────────────────
        # High variance advantages cause training instability.
        # Normalize to zero mean, unit variance over the whole rollout batch.
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        n = len(obs)
        stats = {"policy_loss": [], "value_loss": [], "entropy": [], "approx_kl": []}

        for epoch in range(self.n_epochs):
            # Shuffle indices for minibatching
            indices = np.random.permutation(n)

            for start in range(0, n, self.minibatch_size):
                idx = indices[start:start + self.minibatch_size]
                mb_obs = obs[idx]
                mb_actions = actions[idx]
                mb_old_log_probs = old_log_probs[idx]
                mb_advantages = advantages[idx]
                mb_returns = returns[idx]

                # ── Actor: compute new log probs and entropy ───────────────────
                new_log_probs, entropy = self.actor.evaluate_actions(mb_obs, mb_actions)

                # ── Probability ratio r_t(θ) = π_θ(a|s) / π_θ_old(a|s) ───────
                # In log space: log(r) = log(π_new) - log(π_old)
                log_ratio = new_log_probs - mb_old_log_probs
                ratio = log_ratio.exp()

                # ── Approximate KL (for monitoring) ────────────────────────────
                # k ≈ (r - 1) - log(r) - numerically stable, always >= 0
                approx_kl = ((ratio - 1) - log_ratio).mean().item()

                # ── Clipped policy loss ─────────────────────────────────────────
                unclipped = ratio * mb_advantages
                clipped = torch.clamp(ratio, 1 - self.clip_eps, 1 + self.clip_eps) * mb_advantages
                policy_loss = -torch.min(unclipped, clipped).mean()

                # ── Value loss ─────────────────────────────────────────────────
                values_pred = self.critic(mb_obs)
                value_loss = nn.functional.mse_loss(values_pred, mb_returns)

                # ── Entropy bonus (negative because we're minimizing) ──────────
                entropy_loss = -entropy.mean()

                # ── Combined PPO loss ──────────────────────────────────────────
                loss = (
                    policy_loss
                    + self.value_coef * value_loss
                    + self.entropy_coef * entropy_loss
                )

                self.optimizer.zero_grad()
                loss.backward()
                # Gradient clipping prevents extremely large updates
                nn.utils.clip_grad_norm_(
                    list(self.actor.parameters()) + list(self.critic.parameters()),
                    max_norm=0.5,
                )
                self.optimizer.step()

                stats["policy_loss"].append(policy_loss.item())
                stats["value_loss"].append(value_loss.item())
                stats["entropy"].append(-entropy_loss.item())
                stats["approx_kl"].append(approx_kl)

        return {k: np.mean(v) for k, v in stats.items()}


# ── Training Loop ─────────────────────────────────────────────────────────────

def train_ppo(
    env_id: str = "CartPole-v1",
    total_timesteps: int = 500_000,
    rollout_length: int = 2048,
):
    # Single-environment loop; the original n_envs argument was never used.
    env = gym.make(env_id)
    obs_dim = env.observation_space.shape[0]
    act_dim = env.action_space.n

    agent = PPOAgent(obs_dim=obs_dim, act_dim=act_dim)

    total_steps = 0
    episode_rewards = []

    while total_steps < total_timesteps:
        # ── Collect rollout ────────────────────────────────────────────────────
        rollout = {
            "obs": [], "actions": [], "rewards": [],
            "log_probs": [], "values": [], "dones": []
        }

        obs, _ = env.reset()
        episode_reward = 0.0

        for step in range(rollout_length):
            action, log_prob, value = agent.get_action(obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)
            # Note: time-limit truncation is treated like termination here,
            # so no value is bootstrapped past a truncated episode.
            done = terminated or truncated

            rollout["obs"].append(obs)
            rollout["actions"].append(action)
            rollout["rewards"].append(reward)
            rollout["log_probs"].append(log_prob)
            rollout["values"].append(value)
            rollout["dones"].append(float(done))

            episode_reward += reward
            obs = next_obs
            total_steps += 1

            if done:
                episode_rewards.append(episode_reward)
                episode_reward = 0.0
                obs, _ = env.reset()

        # Bootstrap value for last state
        _, _, next_value = agent.get_action(obs)

        # ── Compute GAE advantages and returns ─────────────────────────────────
        advantages, returns = agent.compute_gae(
            rollout["rewards"],
            rollout["values"],
            rollout["dones"],
            next_value,
        )
        rollout["advantages"] = advantages
        rollout["returns"] = returns

        # ── PPO update ─────────────────────────────────────────────────────────
        stats = agent.update(rollout)

        if len(episode_rewards) > 0:
            mean_reward = np.mean(episode_rewards[-20:])
            print(
                f"Steps: {total_steps:7d} | "
                f"Mean reward: {mean_reward:7.1f} | "
                f"Policy loss: {stats['policy_loss']:6.4f} | "
                f"Value loss: {stats['value_loss']:6.4f} | "
                f"Entropy: {stats['entropy']:5.3f} | "
                f"Approx KL: {stats['approx_kl']:6.4f}"
            )

    env.close()
    return agent


if __name__ == "__main__":
    agent = train_ppo("CartPole-v1", total_timesteps=300_000)
    # Typical result: ~500 (max) within 100K steps
```
PPO in Production: RLHF for Language Models
This is where PPO's real impact becomes clear. The 2022 InstructGPT paper (Ouyang et al.) uses PPO to fine-tune a 175B parameter GPT-3 to follow human instructions. The setup differs from the RL control setting in key ways:
The policy is the language model: $\pi_\theta(y \mid x)$ - probability of generating response token sequence $y$ given prompt $x$.
The environment is a reward model: $r_\phi(x, y)$ - a neural network trained on human preference comparisons that assigns a scalar quality score to (prompt, response) pairs.
The state is the prompt + previous tokens: at each generation step, the agent "observes" the current partial response and "acts" by choosing the next token.
The reward signal is sparse: only the complete response receives a reward (not intermediate tokens).
The RLHF PPO Reward
The critical modification for RLHF is a KL penalty on top of the reward model score:

$$R(x, y) = r_\phi(x, y) - \beta\, \text{KL}\left(\pi_\theta(\cdot \mid x)\,\|\,\pi_\text{ref}(\cdot \mid x)\right)$$

Where:
- $r_\phi(x, y)$: reward model score (learned from human preferences)
- $\pi_\text{ref}$: the SFT (supervised fine-tuned) model before PPO - frozen reference
- $\beta$: KL penalty coefficient (typically 0.02–0.2)
- The KL term: penalizes the policy for deviating from the reference model
Why the KL penalty? The reward model is imperfect. Without the KL penalty, PPO finds adversarial responses that score highly on the reward model but are nonsensical or harmful - a phenomenon called reward hacking. The KL term keeps the policy close to the SFT reference model, which acts as a regularizer ensuring the output remains coherent text.
┌─────────────────────────────────────────────────────────┐
│ RLHF PPO Training Loop │
│ │
│ Prompt x → π_θ generates y (full response) │
│ → r_φ(x, y): reward model scores y │
│ → KL(π_θ || π_ref): penalize divergence │
│ → R(x,y) = r_φ(x,y) - β·KL │
│ → PPO update on token-level actions │
│ │
│ Frozen: π_ref (SFT model, for KL reference) │
│ Frozen: r_φ (reward model, evaluated but not updated) │
│ Trained: π_θ (policy = the LLM being aligned) │
└─────────────────────────────────────────────────────────┘
Key RLHF Engineering Details
Token-level PPO: each token generation is treated as an action. Only the final token's timestep receives the reward model score; every other timestep has reward 0 apart from its per-token KL penalty. GAE then propagates that terminal reward backward through the sequence as per-token advantages.
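A minimal sketch of how such a per-token reward vector could be assembled is shown below. The function name, the default `beta`, and the log-probabilities are hypothetical; the per-token KL term uses the common single-sample estimate $\log \pi_\theta - \log \pi_\text{ref}$ on the sampled token, which is an assumption about the implementation rather than something stated above.

```python
def token_level_rewards(policy_logprobs, ref_logprobs, rm_score, beta=0.1):
    """Per-token reward: -beta * (log pi_theta - log pi_ref) at every token,
    plus the reward-model score added at the final token only."""
    rewards = [
        -beta * (lp - ref_lp)
        for lp, ref_lp in zip(policy_logprobs, ref_logprobs)
    ]
    rewards[-1] += rm_score
    return rewards


# Hypothetical log-probs of three sampled tokens under policy and reference.
policy_lp = [-1.0, -2.0, -0.5]
ref_lp = [-1.0, -2.5, -1.0]

rewards = token_level_rewards(policy_lp, ref_lp, rm_score=2.0, beta=0.1)
print(rewards)
```

Tokens the policy has made more likely than the reference incur a small penalty, and only the last position carries the sequence-level score.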
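Compute cost: RLHF training requires 4 models in memory simultaneously: (1) the policy $\pi_\theta$, (2) the reference model $\pi_\text{ref}$, (3) the reward model $r_\phi$, (4) the value/critic model. For 7B parameter models, this is roughly 4× the weight memory of a single model, before counting gradients, optimizer state, and activations for the two trained models (policy and critic).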
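A back-of-the-envelope estimate of the weight memory alone, as a rough sketch: it assumes all four models have 7B parameters stored in 16-bit precision, and it deliberately ignores gradients, optimizer state, activations, and generation caches.

```python
def weights_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights only, in GiB (2 bytes per parameter = fp16/bf16)."""
    return n_params * bytes_per_param / 2**30


per_model = weights_gib(7e9)     # one 7B-parameter model
total = 4 * per_model            # policy + reference + reward model + critic

print(f"per model: {per_model:.1f} GiB, four models: {total:.1f} GiB")
```

The real footprint is higher, since the policy and critic are being trained and therefore also need gradients and optimizer state.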
Reward model overoptimization: as PPO trains longer, the KL divergence grows. Eventually, the policy finds outputs that score highly on the reward model but don't generalize - Goodhart's Law applied to RL. The KL coefficient must be tuned carefully.
Common Mistakes
:::danger Setting clip ratio too high (ε > 0.3)
With ε = 0.5 or 1.0, the clipping rarely activates. PPO degenerates into standard policy gradient - high variance, training instability, risk of catastrophic policy collapse. Keep ε between 0.1 and 0.3. The default 0.2 is a good starting point.
:::
:::danger Not normalizing advantages
Raw advantages have high variance and depend on the scale of rewards. If your reward is 1000x larger than CartPole's, your gradient will be 1000x larger. Always normalize advantages to zero mean and unit variance within each batch (the implementation above normalizes over the full rollout; per-minibatch normalization is also common). Without this, learning rate becomes reward-scale-dependent and training is extremely fragile.
:::
:::warning Too many update epochs per batch (K > 15)
PPO reuses the same rollout data for K epochs. After too many epochs, the new policy is far from the policy that collected the data - the probability ratios become large, the clipping activates constantly, and no learning happens. The approximate KL divergence is a good diagnostic: if it exceeds 0.05 consistently, reduce K.
:::
:::warning Forgetting gradient clipping
Large gradient norms can cause instability even with PPO's objective-level clipping. Always use clip_grad_norm_ with max_norm=0.5 (or 1.0 for larger models). Widely used reference implementations do this even though the PPO paper itself does not discuss it; many tutorials forget it.
:::
:::tip Early stopping on KL divergence
A common improvement: monitor the approximate KL between old and new policy after each epoch. If KL exceeds a threshold (e.g., 0.015), stop updating and start the next rollout. This prevents the data from going too stale within the K epochs.
:::
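The early-stopping rule can be sketched in a few lines. The estimator below is the same $(r - 1) - \log r$ form used for monitoring in the implementation above; the helper names and the per-epoch KL values are hypothetical.

```python
import math


def approx_kl(new_log_probs, old_log_probs) -> float:
    """Mean of (r - 1) - log r over the batch; non-negative by construction."""
    total = 0.0
    for new_lp, old_lp in zip(new_log_probs, old_log_probs):
        log_ratio = new_lp - old_lp
        total += (math.exp(log_ratio) - 1.0) - log_ratio
    return total / len(new_log_probs)


def epochs_until_stop(kl_per_epoch, target_kl: float = 0.015) -> int:
    """1-based index of the epoch after which updates stop."""
    for epoch, kl in enumerate(kl_per_epoch, start=1):
        if kl > target_kl:
            return epoch          # threshold tripped: skip the remaining epochs
    return len(kl_per_epoch)      # never tripped: all epochs run


# Hypothetical KL readings measured after each of four epochs.
stopped_at = epochs_until_stop([0.002, 0.008, 0.02, 0.05])
no_change = approx_kl([-1.0, -2.0], [-1.0, -2.0])
drifted = approx_kl([-0.5, -2.5], [-1.0, -2.0])
print(stopped_at, no_change, drifted)
```

In a training loop the same check would sit at the end of each epoch, breaking out of the epoch loop instead of returning an index.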
Production Engineering Notes
Vectorized environments: don't train on a single environment. Use gymnasium's vector API (gymnasium.vector.SyncVectorEnv or AsyncVectorEnv, or gymnasium.make_vec in recent releases) or stable-baselines3's vectorized envs. Training with 8–64 parallel environments dramatically reduces training time.
Separate actor and critic: the combined loss in the original PPO paper is written for networks that share parameters between actor and critic. Many production implementations keep them separate - easier to tune learning rates independently and the critic can be updated more aggressively.
Learning rate scheduling: PPO benefits from a learning rate that decays linearly from its initial value (e.g. 3e-4) to 0. Many implementations use LinearLR or manual decay.
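The manual-decay variant is short enough to sketch directly. The function name and step counts are assumptions; in PyTorch the resulting value would typically be written into each `param_group["lr"]` of the optimizer before the update.

```python
def linear_lr(step: int, total_steps: int, initial_lr: float = 3e-4) -> float:
    """Learning rate that decays linearly from initial_lr to 0 over training."""
    progress = min(step, total_steps) / total_steps   # fraction of training done
    return initial_lr * (1.0 - progress)


start = linear_lr(0, 1_000_000)
halfway = linear_lr(500_000, 1_000_000)
end = linear_lr(1_000_000, 1_000_000)
print(start, halfway, end)
```

Clamping `step` to `total_steps` keeps the rate at zero rather than going negative if training runs slightly past the planned budget.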
Reward normalization: normalize rewards using a running mean and standard deviation. This prevents the reward scale from changing over training (important in non-stationary environments).
Observation normalization: normalize observations using running statistics. Critical for tasks where observations have different scales (e.g., position vs. velocity in MuJoCo).
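Both reward and observation normalization rely on running statistics. Below is a minimal scalar sketch using Welford's online algorithm; the class name `RunningMeanStd` and the sample values are illustrative, and production versions usually operate on whole arrays per observation dimension.

```python
import math


class RunningMeanStd:
    """Running mean and (population) standard deviation via Welford's algorithm."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        if self.count < 2:
            return 1.0   # avoid dividing by zero before enough data arrives
        return math.sqrt(self.m2 / self.count)

    def normalize(self, x: float, eps: float = 1e-8) -> float:
        return (x - self.mean) / (self.std + eps)


stats = RunningMeanStd()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)

print(stats.mean, stats.std, stats.normalize(7.0))
```

The statistics are updated as data streams in, so no rollout history needs to be stored to keep the normalization current.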
Reference implementation: stable-baselines3 (SB3) has an excellent, well-tested PPO implementation. For RLHF specifically, trl (HuggingFace) wraps PPO for language model training.
YouTube Resources
| Video | Channel | Why Watch It |
|---|---|---|
| PPO Explained | Weights & Biases | Best visual PPO explanation with animated clipping diagrams |
| Proximal Policy Optimization | Arxiv Insights | Clean paper walkthrough with good mathematical depth |
| Deep RL with PPO | Andrej Karpathy | Building RL from scratch, PPO in context |
| RLHF and PPO for LLMs | Yannic Kilcher | InstructGPT paper analysis - PPO in the LLM context |
Interview Q&A
Q1: What problem does PPO's clipping solve, and why is the min() critical?
Answer: Without clipping, importance sampling allows the probability ratio to become arbitrarily large, corresponding to a policy that has moved far from the data-collection policy. The gradient estimate becomes invalid (high variance, wrong direction), and the policy can collapse in a single bad update.
The clipping limits $r_t(\theta)$ to $[1-\epsilon, 1+\epsilon]$. The $\min$ is critical because it makes the objective a pessimistic lower bound: when the ratio is too high and the advantage is positive (the optimizer wants to increase action probability), we clip and give zero gradient. When the ratio is too high and the advantage is negative (a bad action the optimizer is making more likely), we do NOT clip - we still penalize this. The min ensures we only ignore updates that would push the policy further in a direction we've already moved enough, while preserving corrections for policy degradation.
Q2: Derive PPO from TRPO - what is the connection?
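Answer: TRPO optimizes:

$$\max_\theta\ \hat{\mathbb{E}}_t\left[r_t(\theta)\, \hat{A}_t\right] \quad \text{subject to} \quad \hat{\mathbb{E}}_t\left[\text{KL}\left(\pi_{\theta_\text{old}}(\cdot \mid s_t)\,\|\,\pi_\theta(\cdot \mid s_t)\right)\right] \le \delta$$

In practice TRPO linearizes the objective, approximates the constraint quadratically, and uses conjugate gradient plus a line search to take a natural-gradient step - expensive.
PPO approximates this constraint differently. First, note that the KL divergence is related to how far $r_t(\theta)$ deviates from 1. If $r_t(\theta)$ stays in $[1-\epsilon, 1+\epsilon]$, the KL divergence is approximately bounded. PPO replaces the hard KL constraint with a soft one via clipping: don't let the ratio go outside the interval. It's the same idea - limit policy change - with a simpler implementation. The paper shows empirically that PPO achieves similar performance to TRPO at a fraction of the computational cost.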
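The link between the ratio interval and the KL divergence can be illustrated numerically. The two four-action distributions below are made up so that every ratio lies in $[0.8, 1.2]$; this is an illustration, not a proof, and clipping only removes the incentive to leave the interval rather than strictly enforcing it.

```python
import math


def kl_divergence(p, q) -> float:
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


old = [0.25, 0.25, 0.25, 0.25]     # hypothetical old policy at one state
new = [0.30, 0.20, 0.28, 0.22]     # new policy with every ratio inside [0.8, 1.2]

ratios = [n / o for n, o in zip(new, old)]
kl = kl_divergence(old, new)
print(ratios, kl)
```

Even with two ratios sitting at the edges of the interval, the KL divergence stays small, on the order of 0.01.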
Q3: Explain GAE intuition - why do we need $\lambda$?
Answer: The advantage $A(s_t, a_t)$ measures how good action $a_t$ was. To estimate it we need the true value function $V^\pi$, which we don't have. Our approximation $V_\theta$ is imperfect - it has estimation bias.
The 1-step TD estimate $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ has low variance (only one random step) but high bias (relies heavily on the value approximation). The Monte Carlo estimate (full trajectory return minus baseline) has low bias (no reliance on value approximation) but high variance (depends on many random steps).
GAE computes an exponentially weighted average of all n-step TD estimates: $\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$. The parameter $\lambda$ tunes the tradeoff continuously. $\lambda = 0$ gives the 1-step estimate (high bias, low variance). $\lambda = 1$ gives the full Monte Carlo estimate (low bias, high variance). $\lambda = 0.95$ is empirically good for most tasks. GAE is used because it typically performs better in practice than a fixed n-step return.
Q4: What does the entropy bonus do and when would you remove it?
Answer: The entropy bonus adds a reward for taking diverse actions. Without it, the policy tends to become very confident (low entropy/deterministic) early in training, committing to a local optimum before adequately exploring. The entropy bonus prevents premature convergence.
You might reduce or remove it in late training - once the policy is near-optimal, entropy just adds noise. Some implementations anneal $c_2$ from 0.01 to 0 over training. You'd also reduce it if the policy is already too exploratory (entropy too high), which can happen in tasks with large action spaces.
In RLHF for LLMs, the KL penalty against the reference model serves a similar regularization function to the entropy bonus, so the entropy coefficient is often set to 0.
Q5: How is PPO used in RLHF? Walk through the full setup.
Answer:
1. SFT Phase: fine-tune a pretrained LLM on high-quality demonstration data. This gives $\pi_\text{SFT}$ - a competent instruction-follower.
2. Reward Model Training: collect pairs of responses to prompts. Human labelers rank them by quality. Train a reward model $r_\phi$ using a ranking loss (Bradley-Terry) on these comparisons.
3. PPO Phase: initialize the policy $\pi_\theta$ from $\pi_\text{SFT}$. At each PPO step:
   - Sample prompts $x$ from a dataset
   - Generate responses $y$ from $\pi_\theta$ (the policy)
   - Score responses with $r_\phi$ (frozen)
   - Compute KL-penalized reward: $R = r_\phi(x, y) - \beta\, \text{KL}(\pi_\theta \,\|\, \pi_\text{ref})$
   - Run PPO update to maximize $R$
4. Token-level formulation: each token generation is an "action". The reward is only at the last token. Advantages are computed over the sequence using GAE.
5. The KL penalty prevents reward hacking - the model stays close to the coherent pretrained model while improving on the reward signal.
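The ranking loss from the reward-model step can be sketched for a single preference pair. Under the Bradley-Terry model the loss is $-\log \sigma(r_\text{chosen} - r_\text{rejected})$; the function name and the scores below are hypothetical.

```python
import math


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """-log sigmoid(score_chosen - score_rejected) for one preference pair."""
    margin = score_chosen - score_rejected
    # log(1 + exp(-margin)) written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


tied = bradley_terry_loss(1.0, 1.0)        # model cannot tell the pair apart
correct = bradley_terry_loss(2.0, 0.0)     # preferred response scored higher
wrong = bradley_terry_loss(0.0, 2.0)       # preferred response scored lower
print(tied, correct, wrong)
```

The loss shrinks as the reward model scores the human-preferred response further above the rejected one, and grows when the ordering is reversed.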
Q6: Compare PPO vs A3C vs SAC. When would you choose each?
Answer:
| Algorithm | On/Off Policy | Action Space | Key Advantage | When to Use |
|---|---|---|---|---|
| PPO | On-policy | Discrete or continuous | Stable, simple, robust to hyperparameters | Default choice. RLHF. Most RL benchmarks. |
| A3C | On-policy | Discrete or continuous | Parallelizes across many workers | When you have many CPUs and want async training |
| SAC | Off-policy | Continuous (discrete variants exist) | Very sample-efficient, max entropy | Robotics, continuous control, data-limited settings |
PPO is the default for most tasks because it is simple, stable, and works across action space types. SAC is preferred for continuous control when sample efficiency matters (expensive simulation or real-world data). A3C is largely superseded by PPO with vectorized environments.
Key Takeaways
- PPO solves the policy gradient instability problem by clipping the probability ratio to stay within $[1-\epsilon, 1+\epsilon]$
- The $\min$ in the clipped objective makes it a pessimistic lower bound - it stops rewarding further movement once a good action has been made sufficiently more likely, but still corrects bad actions
- GAE with $\lambda = 0.95$ provides a good bias-variance tradeoff for advantage estimation
- The combined PPO objective includes policy loss, value loss, and entropy bonus
- PPO is the core algorithm in RLHF - with a KL penalty added to prevent reward hacking against an imperfect reward model
- The single most common mistake: not normalizing advantages within each batch or minibatch
