PPO: the dominant policy gradient algorithm - how clipping the probability ratio prevents destructive policy updates while maintaining the efficiency of on-policy learning.

Proximal Policy Optimisation - The Algorithm That Runs ChatGPT's RLHF

Reading time: ~40 minutes | Level: Reinforcement Learning | Role: MLE, AI Research Engineer, MLOps

The Real Engineering Moment

The year is 2017 and John Schulman is frustrated. He has already published TRPO - Trust Region Policy Optimization - which is designed to prevent destructive policy updates and dramatically stabilizes RL training. But TRPO has a serious engineering problem: every update solves a KL-constrained optimization problem, using conjugate gradient to approximate the natural gradient direction and then a line search to satisfy the constraint. The conjugate gradient procedure is numerically finicky. Implementing it correctly takes hundreds of lines of custom code. Few practitioners outside a handful of research labs use it in practice.

Schulman goes back to the math. TRPO works by enforcing a hard constraint: the KL divergence between old and new policy must stay below some threshold $\delta$ . What if you could enforce that constraint approximately, without the second-order optimization machinery? What if a simple clipping operation on the probability ratio could achieve the same effect?

The PPO paper drops in July 2017 - two pages of math and a surprisingly clean implementation. The key insight: instead of constraining the KL divergence directly, clip the probability ratio $r_t(\theta)$ to stay within $[1-\varepsilon, 1+\varepsilon]$ . If the new policy wants to assign much higher or much lower probability to an action than the old policy, ignore that gradient signal. The result is an algorithm that is nearly as stable as TRPO, implemented in under 100 lines of PyTorch.

Several years pass. RLHF researchers at OpenAI are building InstructGPT - a version of GPT-3 that follows instructions instead of just completing prompts. They need an RL algorithm that can fine-tune a 175 billion parameter language model against a learned reward model. PPO proves stable enough to do it without catastrophically collapsing the language model. InstructGPT ships in 2022. ChatGPT follows. Both use PPO at their core.

That clipping trick turns out to be one of the most consequential ideas in the history of AI. If you are working anywhere near LLM training - alignment, RLHF, constitutional AI, reward modeling - you need to understand PPO deeply.

Why This Exists: The Instability Problem in Policy Gradients

In the previous lesson we saw REINFORCE and actor-critic methods. They work, but they have a fundamental instability: there is no mechanism that prevents you from taking a step so large it destroys the policy.

Consider what happens when you update $\theta$ with a large gradient step:

The policy changes significantly
The trajectory distribution changes significantly
The gradients you computed under the old policy are now stale
But you keep using them - you're doing gradient ascent with an outdated estimator

This is the problem. Policy gradient methods use data from $\pi_\theta$ to estimate the gradient of $J(\theta)$ , then update $\theta$ . But after the update, the data is no longer from $\pi_{\theta_\text{new}}$ - it's from the old policy. If the step is too large, the new policy is in a completely different region of policy space from where the gradient was estimated, and the update is not just useless but actively harmful.

TRPO solution: enforce a KL divergence constraint $D_{KL}(\pi_{\theta_\text{old}} \| \pi_\theta) \leq \delta$ at every step. Computationally expensive - requires second-order optimization.

PPO solution: clip the probability ratio so updates that move the policy too far from $\pi_{\theta_\text{old}}$ get zero gradient. Computationally cheap - standard first-order gradient descent.

Historical Context

Year	Work	Authors	Key Contribution
1992	REINFORCE	Williams	Policy gradient via log-probability trick
2001	Natural Policy Gradient	Kakade	Fisher information matrix for stable updates
2015	TRPO	Schulman et al.	Trust region constraint on KL divergence
2016	GAE	Schulman et al.	Generalized Advantage Estimation - bias-variance tradeoff
2017	PPO	Schulman et al.	Clipped surrogate objective - simple + stable
2019	RLHF for LMs	Ziegler et al.	First application of PPO to language model fine-tuning
2022	InstructGPT	Ouyang et al.	PPO + RLHF = instruction-following GPT-3
2022	ChatGPT	OpenAI	PPO + RLHF at production scale

The conceptual lineage is clean: REINFORCE → natural gradient → TRPO → PPO. Each step makes the algorithm more practical; PPO trades TRPO's formal trust-region constraint for a clipping heuristic that keeps most of its stability in practice.

Core Concepts

1. The Policy Gradient Objective and Its Instability

We want to maximize expected cumulative reward:

$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t r_t\right]$

The policy gradient theorem gives us:

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot \hat{A}_t\right]$

The problem: this gradient is estimated from trajectories sampled by $\pi_\theta$ . After we take a gradient step, $\theta$ changes, the policy changes, and the old trajectories are no longer valid samples from the new policy. Taking a large step means the gradient estimate is wildly off.

In practice: REINFORCE training looks like a drunk man walking. Episode returns swing wildly, a single oversized update can collapse the policy, and the run often never recovers. Training is fragile.

2. The Probability Ratio: Reusing Old Data

Importance sampling lets us estimate expectations under one distribution using samples from another:

$\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\left[\frac{p(x)}{q(x)} f(x)\right]$
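
The identity above can be checked numerically. The sketch below is illustrative only: the two three-outcome distributions `p` and `q`, the values in `f`, and the helper name `importance_sampling_estimate` are all made up for the example.

```python
import random

# Hypothetical illustration values (not from the lesson).
p = [0.7, 0.2, 0.1]    # target distribution we care about
q = [0.2, 0.3, 0.5]    # distribution we can actually sample from
f = [1.0, 5.0, 10.0]   # arbitrary function values f(x) for x = 0, 1, 2

# Exact expectation of f under p, for comparison.
exact = sum(p_x * f_x for p_x, f_x in zip(p, f))


def importance_sampling_estimate(n_samples: int, seed: int = 0) -> float:
    """Estimate E_p[f] from samples of q, reweighting each sample by p(x)/q(x)."""
    rng = random.Random(seed)
    xs = rng.choices(range(len(q)), weights=q, k=n_samples)
    return sum(p[x] / q[x] * f[x] for x in xs) / n_samples


estimate = importance_sampling_estimate(100_000)
print(f"exact = {exact:.3f}, importance-sampled = {estimate:.3f}")
```

The estimate is only reliable while the weights `p[x] / q[x]` stay moderate, which is the same reason PPO keeps the new policy close to the one that collected the data.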

Apply this to policy gradients: we collected data under $\pi_{\theta_\text{old}}$ but want to estimate the gradient under $\pi_\theta$ . Define the probability ratio:

$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)}$

This ratio tells us: how much more (or less) likely is the new policy to take action $a_t$ in state $s_t$ compared to the policy that collected the data?

$r_t(\theta) = 1.0$ : policy unchanged - using this action exactly as much as before
$r_t(\theta) = 2.0$ : new policy twice as likely to take this action
$r_t(\theta) = 0.5$ : new policy half as likely
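
As a minimal sketch, the ratio is usually computed from stored log-probabilities rather than raw probabilities. The function name `probability_ratio` and the probability 0.25 are assumptions chosen to reproduce the three cases listed above.

```python
import math


def probability_ratio(new_log_prob: float, old_log_prob: float) -> float:
    """r_t = pi_new(a|s) / pi_old(a|s), computed in log space."""
    return math.exp(new_log_prob - old_log_prob)


# Assume the data-collecting policy gave the action probability 0.25.
old_lp = math.log(0.25)

unchanged = probability_ratio(math.log(0.25), old_lp)     # same probability
twice = probability_ratio(math.log(0.50), old_lp)         # doubled probability
half = probability_ratio(math.log(0.125), old_lp)         # halved probability
print(unchanged, twice, half)
```

Working in log space avoids dividing two very small probabilities, which matters when actions are rare or action spaces are large.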

3. The Conservative Policy Iteration (CPI) Objective

Using importance sampling, we can write an objective that uses old data to estimate the new policy's performance:

$L^{CPI}(\theta) = \mathbb{E}_t\left[r_t(\theta) \hat{A}_t\right]$

where $\hat{A}_t$ is the estimated advantage at timestep $t$ .

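Intuition: if an action had a positive advantage (it was better than expected) and the new policy assigns it higher probability ( $r_t > 1$ ), the objective goes up - we should do more of that action. If the new policy assigns it lower probability ( $r_t < 1$ ), the objective goes down. For a negative advantage the signs flip: lowering the action's probability raises the objective.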

Problem: this objective has no constraint on how large $r_t(\theta)$ can get. The optimizer can make $r_t$ arbitrarily large, changing the policy dramatically in one step. We're back to the instability problem.

4. The PPO Clipped Objective - The Key Innovation

PPO solves this by clipping $r_t(\theta)$ to $[1-\varepsilon, 1+\varepsilon]$ :

$L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)\hat{A}_t\right)\right]$

The $\min$ is critical. Let's trace through the four cases:

Case 1: Positive advantage ( $\hat{A}_t > 0$ ), ratio too high ( $r_t > 1+\varepsilon$ )

The action was good. New policy wants to do it much more.
Unclipped term: $r_t \hat{A}_t$ (large, positive - would encourage more of this)
Clipped term: $(1+\varepsilon)\hat{A}_t$ (smaller, positive)
$\min$ picks the clipped term → gradient is zeroed out for this action
Effect: we stop increasing the probability of this action once it reaches $(1+\varepsilon)$ times its old value

Case 2: Positive advantage ( $\hat{A}_t > 0$ ), ratio in range ( $1-\varepsilon \leq r_t \leq 1+\varepsilon$ )

The action was good. New policy is doing more of it, but not too much.
Both terms equal - gradient flows normally

Case 3: Negative advantage ( $\hat{A}_t < 0$ ), ratio too low ( $r_t < 1-\varepsilon$ )

The action was bad. New policy wants to avoid it much more.
Clipping prevents over-penalizing - stops gradient once the probability has dropped to $(1-\varepsilon)$ times its old value

Case 4: Negative advantage ( $\hat{A}_t < 0$ ), ratio too high ( $r_t > 1+\varepsilon$ )

The action was bad. New policy is doing it more - this is the danger zone.
Unclipped term: $r_t \hat{A}_t$ (large negative - strong penalty)
Clipped term: $(1+\varepsilon)\hat{A}_t$ (less negative)
$\min$ picks the unclipped term → gradient still flows to correct this mistake
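Critical: PPO does NOT block gradients for bad actions that the policy is increasing (or for good actions that it is decreasing). It blocks them only when the policy has already moved far enough in the direction the advantage favors: good actions already increased enough, bad actions already decreased enough.

This asymmetry is the mathematical heart of PPO. Because of the $\min$ , $L^{CLIP}$ is a pessimistic lower bound on the unclipped objective $L^{CPI}$ .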

Probability Ratio r_t(θ)

Advantage > 0 (good action):
  ┌───────────────────────────┬──────────────────────────┐
  │ Gradient flows            │ Gradient clipped         │
  │ (ratio in range or low)   │ (ratio too high)         │
  └───────────────────────────┴──────────────────────────┘
  0           1-ε     1      1+ε                         ∞
                      ↑
                   r_old=1

Advantage < 0 (bad action):
  ┌───────────────────────┬──────────────────────────────┐
  │ Gradient clipped      │ Gradient flows               │
  │ (ratio too low)       │ (ratio in range or high)     │
  └───────────────────────┴──────────────────────────────┘
  0                      1-ε     1     1+ε               ∞
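
The four cases can be reproduced for a single sample with a few lines of plain Python. This is a sketch under stated assumptions: `ppo_clip_term` is a hypothetical helper, the ratios and advantages are made-up values, and the derivative is taken with respect to the ratio rather than the network parameters.

```python
def clip(x: float, lo: float, hi: float) -> float:
    return max(lo, min(x, hi))


def ppo_clip_term(ratio: float, advantage: float, eps: float = 0.2):
    """Return (objective, d objective / d ratio) for one sample."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1 - eps, 1 + eps) * advantage
    objective = min(unclipped, clipped)
    # The clipped term is strictly smaller only when the ratio is outside
    # the interval, where it is constant in the ratio: zero gradient.
    clipped_is_active = clipped < unclipped
    grad = 0.0 if clipped_is_active else advantage
    return objective, grad


cases = [
    ("Case 1: A>0, ratio too high", 1.5, +1.0),
    ("Case 2: A>0, ratio in range", 1.1, +1.0),
    ("Case 3: A<0, ratio too low", 0.5, -1.0),
    ("Case 4: A<0, ratio too high", 1.5, -1.0),
]
for name, ratio, adv in cases:
    value, grad = ppo_clip_term(ratio, adv)
    print(f"{name}: objective={value:+.2f}, gradient wrt ratio={grad:+.1f}")
```

Cases 1 and 3 report a zero gradient, while cases 2 and 4 pass the advantage through unchanged, matching the diagram.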

5. Generalized Advantage Estimation (GAE)

The advantage $\hat{A}_t$ measures how much better action $a_t$ was compared to the average action from state $s_t$ :

$\hat{A}_t = Q(s_t, a_t) - V(s_t)$

We never know the true Q or V. We estimate them. The simplest estimator is the TD residual (1-step), where $r_t$ now denotes the reward at timestep $t$ (not the probability ratio $r_t(\theta)$ ):

$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$

This has low variance but high bias (since $V$ is an approximation). The multi-step return has lower bias but higher variance. GAE (Schulman et al. 2016) interpolates:

$\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$

where $\delta_{t+l} = r_{t+l} + \gamma V(s_{t+l+1}) - V(s_{t+l})$ .

The $\lambda$ parameter controls the tradeoff:

$\lambda = 0$ : pure 1-step TD - high bias, low variance ( $\hat{A}_t = \delta_t$ )
$\lambda = 1$ : full Monte Carlo - low bias, high variance ( $\hat{A}_t = \sum_l \gamma^l r_{t+l} - V(s_t)$ )
$\lambda = 0.95$ (PPO default): good balance

The exponential decay $(\gamma\lambda)^l$ means distant TD errors contribute less - we trust near-term estimates more.
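
A short sketch of the backward recursion makes the two extremes of $\lambda$ concrete. The rewards, value estimates, and the helper names `td_residuals` and `gae` are illustrative assumptions, and the segment is assumed to contain no episode boundary.

```python
def td_residuals(rewards, values, next_value, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) for one unbroken segment."""
    vals = list(values) + [next_value]
    return [rewards[t] + gamma * vals[t + 1] - vals[t] for t in range(len(rewards))]


def gae(rewards, values, next_value, gamma=0.99, lam=0.95):
    """A_t = sum_l (gamma * lam)^l * delta_{t+l}, computed back to front."""
    deltas = td_residuals(rewards, values, next_value, gamma)
    advantages = [0.0] * len(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


# Made-up three-step segment.
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.8, 1.0]
next_value = 0.3
gamma = 0.9

one_step = gae(rewards, values, next_value, gamma, lam=0.0)     # equals the deltas
monte_carlo = gae(rewards, values, next_value, gamma, lam=1.0)  # return minus V(s_t)
balanced = gae(rewards, values, next_value, gamma, lam=0.95)
print(one_step, monte_carlo, balanced)
```

With `lam=1.0` the TD errors telescope, so each advantage equals the discounted return (bootstrapped from `next_value` at the end of the segment) minus $V(s_t)$.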

6. The Value Function Loss

PPO uses an actor-critic architecture. The critic (value function $V_\theta$ ) is updated to minimize the squared error against the target returns:

$L^{VF}(\theta) = \mathbb{E}_t\left[(V_\theta(s_t) - V_t^{\text{targ}})^2\right]$

where $V_t^{\text{targ}}$ is typically the GAE-estimated return: $\hat{A}_t + V(s_t)$ .

:::note Value Function Clipping
Some implementations also clip the value prediction to prevent large critic updates: $V_\text{clipped} = V_{\theta_\text{old}} + \text{clip}(V_\theta - V_{\theta_\text{old}}, -\varepsilon, \varepsilon)$ . This is less universally agreed upon than the policy clipping.
:::
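
One way the clipped value prediction is turned into a loss, sketched for a single state. Taking the maximum of the clipped and unclipped squared errors is a convention seen in several open-source PPO implementations; it is an assumption here, not something the formula above specifies, and `clipped_value_loss` is a hypothetical name.

```python
def clipped_value_loss(v_new: float, v_old: float, target: float, eps: float = 0.2) -> float:
    """Pessimistic squared error between the raw and the clipped value prediction."""
    # Limit how far the new prediction may move from the old one.
    v_clipped = v_old + max(-eps, min(v_new - v_old, eps))
    loss_unclipped = (v_new - target) ** 2
    loss_clipped = (v_clipped - target) ** 2
    return max(loss_unclipped, loss_clipped)


# The critic jumped from 1.0 straight to the target 2.0: the clipped
# prediction (1.2) is used, so the loss does not reward the large jump.
big_jump = clipped_value_loss(v_new=2.0, v_old=1.0, target=2.0)

# A small move inside the clip range is unaffected by clipping.
small_move = clipped_value_loss(v_new=1.1, v_old=1.0, target=2.0)
print(big_jump, small_move)
```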

7. The Entropy Bonus

Without encouragement, policies tend to become deterministic early in training - they commit to the first solution they find and stop exploring. The entropy bonus penalizes low-entropy (certain) policies:

$H[\pi_\theta](s) = -\sum_a \pi_\theta(a|s) \log \pi_\theta(a|s)$

Adding this to the objective encourages exploration throughout training.
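
To make the formula concrete, here is a small sketch comparing a uniform and a nearly deterministic policy over four actions. The probabilities are made-up illustration values.

```python
import math


def entropy(probs) -> float:
    """H = -sum_a p(a) log p(a); zero-probability actions contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]   # maximally exploratory
peaked = [0.97, 0.01, 0.01, 0.01]    # almost deterministic

h_uniform = entropy(uniform)
h_peaked = entropy(peaked)
print(f"uniform: {h_uniform:.3f}, peaked: {h_peaked:.3f}")
```

The uniform policy reaches the maximum possible entropy for four actions, $\log 4$ , so the bonus is largest exactly when the policy has not yet committed to any action.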

8. The Full PPO Objective

Combining all three terms:

$L^{PPO}(\theta) = \mathbb{E}_t\left[L^{CLIP}(\theta) - c_1 L^{VF}(\theta) + c_2 H[\pi_\theta](s_t)\right]$

Commonly used defaults, drawn from the original PPO paper and widely used implementations:

Hyperparameter	Symbol	Default	Notes
Clip ratio	$\varepsilon$	0.2	How far ratio can deviate from 1
Discount	$\gamma$	0.99	Future reward discount
GAE lambda	$\lambda$	0.95	Bias-variance tradeoff
Value loss coeff	$c_1$	0.5	Weight of critic loss
Entropy coeff	$c_2$	0.01	Weight of entropy bonus
Epochs per batch	$K$	4–10	How many gradient steps on same data
Minibatch size	-	64–2048	Larger is better for stability
Learning rate	$\alpha$	3e-4	Used with Adam; a common starting point
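
Because the objective above is maximized while optimizers minimize, implementations flip its sign. The sketch below shows that bookkeeping with scalar stand-ins; `ppo_loss` and the three input numbers are hypothetical.

```python
def ppo_loss(clip_objective: float, value_loss: float, entropy: float,
             c1: float = 0.5, c2: float = 0.01) -> float:
    """Loss to minimize = -(L_CLIP - c1 * L_VF + c2 * H)."""
    return -(clip_objective - c1 * value_loss + c2 * entropy)


# Made-up batch statistics.
loss = ppo_loss(clip_objective=0.10, value_loss=0.40, entropy=1.20)
print(f"{loss:.3f}")
```

Written out, the loss is `-clip_objective + c1 * value_loss - c2 * entropy`: lowering it raises the clipped objective, shrinks the critic error, and keeps entropy up.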

Architecture and Training Flow

Complete PPO Implementation

The following is a full, working PPO implementation for discrete-action control (tested on CartPole and LunarLander).

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import gymnasium as gym
from torch.distributions import Categorical

# ── Networks ──────────────────────────────────────────────────────────────────

class ActorNetwork(nn.Module):
    """Policy network: outputs one logit per discrete action."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh(),
        )
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.net(x))

    def get_distribution(self, obs: torch.Tensor) -> Categorical:
        logits = self.forward(obs)
        return Categorical(logits=logits)

    def evaluate_actions(self, obs: torch.Tensor, actions: torch.Tensor):
        dist = self.get_distribution(obs)
        log_probs = dist.log_prob(actions)
        entropy = dist.entropy()
        return log_probs, entropy


class CriticNetwork(nn.Module):
    """Value network: outputs scalar state value V(s)."""

    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


# ── PPO Agent ─────────────────────────────────────────────────────────────────

class PPOAgent:
    def __init__(
        self,
        obs_dim: int,
        act_dim: int,
        lr: float = 3e-4,
        gamma: float = 0.99,
        gae_lambda: float = 0.95,
        clip_eps: float = 0.2,
        value_coef: float = 0.5,
        entropy_coef: float = 0.01,
        n_epochs: int = 10,
        minibatch_size: int = 64,
    ):
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.clip_eps = clip_eps
        self.value_coef = value_coef
        self.entropy_coef = entropy_coef
        self.n_epochs = n_epochs
        self.minibatch_size = minibatch_size

        self.actor = ActorNetwork(obs_dim, act_dim)
        self.critic = CriticNetwork(obs_dim)

        # Shared optimizer - common in PPO implementations
        self.optimizer = optim.Adam(
            list(self.actor.parameters()) + list(self.critic.parameters()),
            lr=lr,
        )

    @torch.no_grad()
    def get_action(self, obs: np.ndarray):
        obs_t = torch.FloatTensor(obs)
        dist = self.actor.get_distribution(obs_t)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        value = self.critic(obs_t)
        return action.item(), log_prob.item(), value.item()

    def compute_gae(
        self,
        rewards: list,
        values: list,
        dones: list,
        next_value: float,
    ) -> tuple[list, list]:
        """
        Compute Generalized Advantage Estimates and returns.

        GAE formula: Â_t = Σ_{l=0}^∞ (γλ)^l δ_{t+l}
        where δ_t = r_t + γ V(s_{t+1}) - V(s_t)
        """
        advantages = []
        gae = 0.0

        # Append next_value for bootstrapping
        values = values + [next_value]

        # Iterate backwards through trajectory
        for t in reversed(range(len(rewards))):
            # TD error
            delta = rewards[t] + self.gamma * values[t + 1] * (1 - dones[t]) - values[t]
            # Accumulate GAE (exponentially weighted sum of TD errors)
            gae = delta + self.gamma * self.gae_lambda * (1 - dones[t]) * gae
            advantages.insert(0, gae)

        # Returns = advantages + values (for value function target)
        returns = [adv + val for adv, val in zip(advantages, values[:-1])]
        return advantages, returns

    def update(self, rollout: dict) -> dict:
        """
        Run K epochs of PPO updates on collected rollout data.
        Returns dict of loss statistics.
        """
        obs = torch.FloatTensor(np.array(rollout["obs"]))
        actions = torch.LongTensor(rollout["actions"])
        old_log_probs = torch.FloatTensor(rollout["log_probs"])
        advantages = torch.FloatTensor(rollout["advantages"])
        returns = torch.FloatTensor(rollout["returns"])

        # ── CRITICAL: Normalize advantages ────────────────────────────────────
        # High variance advantages cause training instability.
        # Normalize to zero mean, unit variance per batch.
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        n = len(obs)
        stats = {"policy_loss": [], "value_loss": [], "entropy": [], "approx_kl": []}

        for epoch in range(self.n_epochs):
            # Shuffle indices for minibatching
            indices = np.random.permutation(n)

            for start in range(0, n, self.minibatch_size):
                idx = indices[start:start + self.minibatch_size]

                mb_obs = obs[idx]
                mb_actions = actions[idx]
                mb_old_log_probs = old_log_probs[idx]
                mb_advantages = advantages[idx]
                mb_returns = returns[idx]

                # ── Actor: compute new log probs and entropy ───────────────────
                new_log_probs, entropy = self.actor.evaluate_actions(mb_obs, mb_actions)

                # ── Probability ratio r_t(θ) = π_θ(a|s) / π_θ_old(a|s) ───────
                # In log space: log(r) = log(π_new) - log(π_old)
                log_ratio = new_log_probs - mb_old_log_probs
                ratio = log_ratio.exp()

                # ── Approximate KL (for monitoring) ────────────────────────────
                # k ≈ (r - 1) - log(r) - numerically stable
                approx_kl = ((ratio - 1) - log_ratio).mean().item()

                # ── Clipped policy loss ─────────────────────────────────────────
                unclipped = ratio * mb_advantages
                clipped = torch.clamp(ratio, 1 - self.clip_eps, 1 + self.clip_eps) * mb_advantages
                policy_loss = -torch.min(unclipped, clipped).mean()

                # ── Value loss ─────────────────────────────────────────────────
                values_pred = self.critic(mb_obs)
                value_loss = nn.functional.mse_loss(values_pred, mb_returns)

                # ── Entropy bonus (negative because we're minimizing) ──────────
                entropy_loss = -entropy.mean()

                # ── Combined PPO loss ──────────────────────────────────────────
                loss = (
                    policy_loss
                    + self.value_coef * value_loss
                    + self.entropy_coef * entropy_loss
                )

                self.optimizer.zero_grad()
                loss.backward()
                # Gradient clipping prevents extremely large updates
                nn.utils.clip_grad_norm_(
                    list(self.actor.parameters()) + list(self.critic.parameters()),
                    max_norm=0.5,
                )
                self.optimizer.step()

                stats["policy_loss"].append(policy_loss.item())
                stats["value_loss"].append(value_loss.item())
                stats["entropy"].append(-entropy_loss.item())
                stats["approx_kl"].append(approx_kl)

        return {k: np.mean(v) for k, v in stats.items()}


# ── Training Loop ─────────────────────────────────────────────────────────────

def train_ppo(
    env_id: str = "CartPole-v1",
    total_timesteps: int = 500_000,
    rollout_length: int = 2048,
):
    env = gym.make(env_id)
    obs_dim = env.observation_space.shape[0]
    act_dim = env.action_space.n

    agent = PPOAgent(obs_dim=obs_dim, act_dim=act_dim)

    total_steps = 0
    episode_rewards = []

    while total_steps < total_timesteps:
        # ── Collect rollout ────────────────────────────────────────────────────
        rollout = {
            "obs": [], "actions": [], "rewards": [],
            "log_probs": [], "values": [], "dones": []
        }

        obs, _ = env.reset()
        episode_reward = 0.0

        for step in range(rollout_length):
            action, log_prob, value = agent.get_action(obs)
            next_obs, reward, terminated, truncated, _ = env.step(action)
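            # Simplification: time-limit truncation is treated like termination,
            # so the value of a truncated state is not bootstrapped.
            done = terminated or truncated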

            rollout["obs"].append(obs)
            rollout["actions"].append(action)
            rollout["rewards"].append(reward)
            rollout["log_probs"].append(log_prob)
            rollout["values"].append(value)
            rollout["dones"].append(float(done))

            episode_reward += reward
            obs = next_obs
            total_steps += 1

            if done:
                episode_rewards.append(episode_reward)
                episode_reward = 0.0
                obs, _ = env.reset()

        # Bootstrap value for last state
        _, _, next_value = agent.get_action(obs)

        # ── Compute GAE advantages and returns ─────────────────────────────────
        advantages, returns = agent.compute_gae(
            rollout["rewards"],
            rollout["values"],
            rollout["dones"],
            next_value,
        )
        rollout["advantages"] = advantages
        rollout["returns"] = returns

        # ── PPO update ─────────────────────────────────────────────────────────
        stats = agent.update(rollout)

        if len(episode_rewards) > 0:
            mean_reward = np.mean(episode_rewards[-20:])
            print(
                f"Steps: {total_steps:7d} | "
                f"Mean reward: {mean_reward:7.1f} | "
                f"Policy loss: {stats['policy_loss']:6.4f} | "
                f"Value loss: {stats['value_loss']:6.4f} | "
                f"Entropy: {stats['entropy']:5.3f} | "
                f"Approx KL: {stats['approx_kl']:6.4f}"
            )

    env.close()
    return agent


if __name__ == "__main__":
    agent = train_ppo("CartPole-v1", total_timesteps=300_000)
    # CartPole-v1 caps episode return at 500; how quickly the agent
    # gets there depends on the seed and hyperparameters.

PPO in Production: RLHF for Language Models

This is where PPO's real impact becomes clear. The 2022 InstructGPT paper (Ouyang et al.) uses PPO to fine-tune a 175B parameter GPT-3 to follow human instructions. The setup differs from the RL control setting in key ways:

The policy is the language model: $\pi_\theta(y|x)$ - probability of generating response token sequence $y$ given prompt $x$ .

The environment is a reward model: $r_\phi(x, y)$ - a neural network trained on human preference comparisons that assigns a scalar quality score to (prompt, response) pairs.

The state is the prompt + previous tokens: at each generation step, the agent "observes" the current partial response and "acts" by choosing the next token.

The reward signal is sparse: only the complete response receives a reward (not intermediate tokens).

The RLHF PPO Reward

The critical modification for RLHF is a KL penalty on top of the reward model score:

$R(x, y) = r_\phi(x, y) - \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$

Where:

$r_\phi(x, y)$ : reward model score (learned from human preferences)
$\pi_\text{ref}$ : the SFT (supervised fine-tuned) model before PPO - frozen reference
$\beta$ : KL penalty coefficient (typically 0.02–0.2)
The KL term: penalizes the policy for deviating from the reference model

Why the KL penalty? The reward model is imperfect. Without the KL penalty, PPO finds adversarial responses that score highly on the reward model but are nonsensical or harmful - a phenomenon called reward hacking. The KL term keeps the policy close to the SFT reference model, which acts as a regularizer ensuring the output remains coherent text.

┌─────────────────────────────────────────────────────────┐
│                  RLHF PPO Training Loop                 │
│                                                         │
│  Prompt x → π_θ generates y (full response)            │
│           → r_φ(x, y): reward model scores y           │
│           → KL(π_θ || π_ref): penalize divergence      │
│           → R(x,y) = r_φ(x,y) - β·KL                  │
│           → PPO update on token-level actions           │
│                                                         │
│  Frozen: π_ref (SFT model, for KL reference)           │
│  Frozen: r_φ (reward model, evaluated but not updated) │
│  Trained: π_θ (policy = the LLM being aligned)         │
└─────────────────────────────────────────────────────────┘

Key RLHF Engineering Details

Token-level PPO: each token generation is treated as an action. Only the final token's timestep receives the reward model score; every other timestep receives only its per-token KL penalty. GAE then propagates that terminal signal backward through the sequence, giving each token its own advantage.
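
A minimal sketch of how the per-token rewards might be assembled from this description. The function `token_level_rewards`, the log-probabilities, the reward model score, and the value of `beta` are all made-up assumptions.

```python
def token_level_rewards(policy_logprobs, ref_logprobs, rm_score, beta=0.1):
    """Per-token KL penalty everywhere, plus the reward model score on the last token."""
    rewards = [-beta * (lp - ref_lp) for lp, ref_lp in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += rm_score
    return rewards


# Hypothetical log-probs of a three-token response under each model.
policy_logprobs = [-1.0, -0.5, -2.0]
ref_logprobs = [-1.2, -0.9, -1.5]
rm_score = 3.0
beta = 0.1

rewards = token_level_rewards(policy_logprobs, ref_logprobs, rm_score, beta)
print(rewards)
```

Since the sequence log-probability is the sum of token log-probabilities, the per-token rewards add up to the sequence-level $R(x, y)$ defined earlier.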

Compute cost: RLHF training requires 4 models in memory simultaneously: (1) the policy $\pi_\theta$ , (2) the reference model $\pi_\text{ref}$ , (3) the reward model $r_\phi$ , (4) the value/critic model. With 7B parameter models, the weights alone take roughly 4× the memory of a single model, and the two trained models (policy and critic) additionally need gradients and optimizer state.
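
A back-of-the-envelope estimate shows the scale involved. Everything numeric here is an assumption for illustration: all four models are taken to have 7B parameters, weights are stored at 2 bytes per parameter, and trained models are budgeted at roughly 16 bytes per parameter for mixed-precision Adam. Activations and generation caches are ignored.

```python
def memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory in GB (1 GB = 1e9 bytes) for a given per-parameter budget."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 7e9           # assumed size of every model
WEIGHT_BYTES = 2         # assumed 16-bit weights
TRAIN_BYTES = 16         # assumed rule of thumb: weights + grads + Adam state

weights_per_model = memory_gb(N_PARAMS, WEIGHT_BYTES)
weights_all_four = 4 * weights_per_model

# Frozen models (reference, reward) hold weights only;
# trained models (policy, critic) carry the full training budget.
rough_total = 2 * weights_per_model + 2 * memory_gb(N_PARAMS, TRAIN_BYTES)
print(weights_per_model, weights_all_four, rough_total)
```

Even under these rough assumptions the training footprint is several times the weights-only figure, which is why RLHF setups typically shard models across devices.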

Reward model overoptimization: as PPO trains longer, the KL divergence grows. Eventually, the policy finds outputs that score highly on the reward model but don't generalize - Goodhart's Law applied to RL. The KL coefficient $\beta$ must be tuned carefully.

Common Mistakes

:::danger Setting clip ratio too high (ε > 0.3)
With ε = 0.5 or 1.0, the clipping rarely activates. PPO degenerates into standard policy gradient - high variance, training instability, risk of catastrophic policy collapse. Keep ε between 0.1 and 0.3. The default 0.2 is a good starting point.
:::

:::danger Not normalizing advantages
Raw advantages have high variance and depend on the scale of rewards. If your reward is 1000x larger than CartPole's, your gradient will be 1000x larger. Always normalize advantages to zero mean and unit variance, either over the whole rollout batch (as in the implementation above) or within each minibatch. Without this, learning rate becomes reward-scale-dependent and training is extremely fragile.
:::
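
The scale-invariance claim can be checked directly. In this sketch the advantage values are made up, and the population standard deviation is used for simplicity (an assumption; library defaults may use the sample standard deviation instead).

```python
import statistics


def normalize(advantages, eps: float = 1e-8):
    """Shift to zero mean and scale to unit variance."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    return [(a - mean) / (std + eps) for a in advantages]


small_scale = [1.0, 2.0, 3.0, 4.0]
large_scale = [1000.0 * a for a in small_scale]   # same task, rewards 1000x larger

norm_small = normalize(small_scale)
norm_large = normalize(large_scale)
print(norm_small)
print(norm_large)
```

Both lists normalize to (almost exactly) the same values, so the size of the policy gradient no longer depends on the reward scale.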

:::warning Too many update epochs per batch (K > 15)
PPO reuses the same rollout data for K epochs. After too many epochs, the new policy is far from the policy that collected the data - the probability ratios become large, most samples end up clipped, and little further learning happens. The approximate KL divergence is a good diagnostic: if it exceeds 0.05 consistently, reduce K.
:::

:::warning Forgetting gradient clipping
Large gradient norms can cause instability even with PPO's objective-level clipping. Use clip_grad_norm_ with max_norm=0.5 (or 1.0 for larger models). The PPO paper itself does not describe this step, but widely used reference implementations apply it, and tutorials often leave it out.
:::

:::tip Early stopping on KL divergence
A common improvement: monitor the approximate KL between old and new policy after each epoch. If KL exceeds a threshold (e.g., 0.015), stop updating and start the next rollout. This prevents the data from going too stale within the K epochs.
:::
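
The control flow of KL early stopping can be sketched without a neural network. Here `log_probs_after_each_epoch` is a hypothetical stand-in for re-evaluating the updated policy on the rollout after every epoch, and the drifting probabilities are made up.

```python
import math


def approx_kl(new_log_probs, old_log_probs) -> float:
    """Mean of (r - 1) - log r over the batch."""
    total = 0.0
    for new_lp, old_lp in zip(new_log_probs, old_log_probs):
        log_ratio = new_lp - old_lp
        total += (math.exp(log_ratio) - 1.0) - log_ratio
    return total / len(old_log_probs)


def epochs_before_stop(old_log_probs, log_probs_after_each_epoch, target_kl=0.015) -> int:
    """Count the epochs applied before the KL threshold ends the update phase."""
    epochs_run = 0
    for new_log_probs in log_probs_after_each_epoch:
        epochs_run += 1   # this epoch's gradient steps have been applied
        if approx_kl(new_log_probs, old_log_probs) > target_kl:
            break         # data is too stale: skip the remaining epochs
    return epochs_run


# Old policy gave each sampled action probability 0.5; the policy then drifts.
old = [math.log(0.5)] * 4
drift = [[math.log(p)] * 4 for p in (0.52, 0.55, 0.60, 0.65)]
n_epochs = epochs_before_stop(old, drift)
print(n_epochs)
```

In this made-up run the third epoch pushes the estimate past the threshold, so the fourth epoch is skipped and a fresh rollout would be collected.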

Production Engineering Notes

Vectorized environments: don't train on a single environment. Use Gymnasium's vector API (gymnasium.vector.SyncVectorEnv or AsyncVectorEnv) or stable-baselines3's vectorized envs. Training with 8–64 parallel environments dramatically reduces wall-clock training time.

Separate actor and critic: the combined objective in the original PPO paper is written for architectures that share parameters between actor and critic. Many production implementations keep them separate - easier to tune learning rates independently and the critic can be updated more aggressively.

Learning rate scheduling: PPO benefits from a linearly decaying learning rate from $3 \times 10^{-4}$ to 0. Many implementations use LinearLR or manual decay.
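
A manual linear decay takes only a few lines. The function name `linear_lr` and the step counts are illustrative assumptions.

```python
def linear_lr(step: int, total_steps: int, initial_lr: float = 3e-4) -> float:
    """Learning rate that falls linearly from initial_lr to 0 over total_steps."""
    progress = min(step, total_steps) / total_steps
    return initial_lr * (1.0 - progress)


# Hypothetical schedule over 100 policy updates.
start = linear_lr(0, 100)
middle = linear_lr(50, 100)
end = linear_lr(100, 100)
print(start, middle, end)
```

In a training loop the returned value would be written into the optimizer's parameter groups before each update phase.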

Reward normalization: normalize rewards using a running mean and standard deviation. This prevents the reward scale from changing over training (important in non-stationary environments).

Observation normalization: normalize observations using running statistics. Critical for tasks where observations have different scales (e.g., position vs. velocity in MuJoCo).
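
Both reward and observation normalization rely on running statistics. Below is a scalar sketch using Welford's online algorithm; the class name `RunningNorm` and the sample values are assumptions, and real observation normalizers track one mean and variance per dimension.

```python
import math


class RunningNorm:
    """Tracks a running mean and variance, one value at a time."""

    def __init__(self, eps: float = 1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0      # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        # Population standard deviation of everything seen so far.
        return math.sqrt(self.m2 / self.count) if self.count > 0 else 1.0

    def normalize(self, x: float) -> float:
        return (x - self.mean) / (self.std + self.eps)


norm = RunningNorm()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    norm.update(value)
print(norm.mean, norm.std, norm.normalize(9.0))
```

The online update avoids storing past values, which suits statistics that must keep adapting as the policy visits new states.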

Reference implementation: stable-baselines3 (SB3) has an excellent, well-tested PPO implementation. For RLHF specifically, trl (HuggingFace) wraps PPO for language model training.

Interview Q&A

Q1: What problem does PPO's clipping solve, and why is the min() critical?

Answer: Without clipping, importance sampling allows the probability ratio $r_t(\theta)$ to become arbitrarily large, corresponding to a policy that has moved far from the data-collection policy. The gradient estimate becomes invalid (high variance, wrong direction), and the policy can collapse in a single bad update.

The clipping limits $r_t(\theta)$ to $[1-\varepsilon, 1+\varepsilon]$ . The $\min$ is critical because it makes the objective a pessimistic lower bound: when the ratio is too high and the advantage is positive (the optimizer wants to increase action probability), we clip and give zero gradient. When the ratio is too high and the advantage is negative (a bad action the optimizer is making more likely), we do NOT clip - we still penalize this. The min ensures we only ignore updates that would push the policy further in a direction we've already moved enough, while preserving corrections for policy degradation.

Q2: Derive PPO from TRPO - what is the connection?

Answer: TRPO optimizes:

$\max_\theta L^{CPI}(\theta) \quad \text{subject to} \quad D_{KL}(\pi_{\theta_\text{old}} \| \pi_\theta) \leq \delta$

TRPO enforces the constraint with a second-order method: conjugate gradient to approximate the natural gradient direction, followed by a line search to satisfy the KL bound - expensive.

PPO approximates this constraint differently. First, note that $D_{KL}(\pi_{\theta_\text{old}} \| \pi_\theta)$ is related to how far $r_t(\theta) = \pi_\theta / \pi_{\theta_\text{old}}$ deviates from 1. If $r_t$ stays in $[1-\varepsilon, 1+\varepsilon]$ for every action, the KL divergence is at most $-\log(1-\varepsilon)$ . PPO replaces the hard KL constraint with a soft one via clipping: once the ratio leaves the interval in the favorable direction, the objective offers no further incentive to push it. Nothing strictly forbids the ratio from leaving the interval, which is why monitoring KL still matters. It's the same idea - limit policy change - with a simpler implementation. The paper shows empirically that PPO achieves similar performance to TRPO at a fraction of the computational cost.

Q3: Explain GAE intuition - why do we need $\lambda$ ?

Answer: The advantage $\hat{A}_t$ measures how good action $a_t$ was. To estimate it we need the true value function $V^\pi$ of the current policy, which we don't have. Our approximation $V_\theta$ is imperfect - it has estimation bias.

The 1-step TD estimate $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ has low variance (only one random step) but high bias (relies heavily on the value approximation). The Monte Carlo estimate (full trajectory return minus baseline) has low bias (no reliance on value approximation) but high variance (depends on many random steps).

GAE computes an exponentially weighted average of the n-step advantage estimates, which simplifies to a discounted sum of TD residuals: $\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}$ (truncated at the end of the rollout in practice). The $\lambda$ parameter tunes the tradeoff continuously. $\lambda=0$ gives the 1-step estimate (high bias, low variance). $\lambda=1$ gives the full Monte Carlo estimate (low bias, high variance). $\lambda=0.95$ is a common default that works well empirically on many tasks. GAE is used because it typically performs better than a fixed n-step return in practice.
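The sum above is usually computed with a backward recursion, $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$. Here is a minimal sketch for a single rollout with no episode boundaries inside it; the function name `compute_gae` and the toy rewards and values are assumptions for illustration.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one rollout.

    `values` must have len(rewards) + 1 entries: the last entry is the
    bootstrap value of the state after the final step (0.0 if terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # 1-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + gamma * lambda * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    rewards = [1.0, 0.0, 2.0]          # toy rollout
    values = [0.5, 0.4, 0.3, 0.0]      # toy value estimates, terminal bootstrap = 0
    for lam in (0.0, 0.95, 1.0):
        print(lam, compute_gae(rewards, values, gamma=0.9, lam=lam))
```

Setting `lam=0.0` returns the TD residuals themselves, and `lam=1.0` returns the discounted return minus `values[t]`, matching the two extremes described above.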

Q4: What does the entropy bonus do and when would you remove it?

Answer: The entropy bonus $c_2 H[\pi_\theta](s)$ adds a reward for taking diverse actions. Without it, the policy tends to become very confident (low entropy/deterministic) early in training, committing to a local optimum before adequately exploring. The entropy bonus prevents premature convergence.

You might reduce or remove it in late training - once the policy is near-optimal, entropy just adds noise. Some implementations anneal $c_2$ from 0.01 to 0 over training. You'd also reduce it if the policy is already too exploratory (entropy too high), which can happen in tasks with large action spaces.
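As a rough illustration of both points, the sketch below computes the entropy of a categorical policy and anneals the coefficient $c_2$ from 0.01 to 0. The linear schedule and the helper names `categorical_entropy` and `entropy_coef` are assumptions; the text above does not specify the shape of the schedule.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_coef(step, total_steps, start=0.01, end=0.0):
    """Linearly anneal the entropy coefficient c_2 (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]   # maximally exploratory over 4 actions
    peaked = [0.97, 0.01, 0.01, 0.01]    # nearly deterministic
    print("entropy uniform:", categorical_entropy(uniform))
    print("entropy peaked:", categorical_entropy(peaked))
    for step in (0, 500, 1000):
        print("step", step, "c_2 =", entropy_coef(step, total_steps=1000))
```

The uniform policy attains the maximum entropy $\log 4$, while the peaked one scores much lower, so a positive $c_2$ pushes the optimizer away from collapsing too early.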

In RLHF for LLMs, the KL penalty against the reference model serves a similar regularization function to the entropy bonus, so the entropy coefficient is often set to 0.

Q5: How is PPO used in RLHF? Walk through the full setup.

Answer:

1. SFT Phase: fine-tune a pretrained LLM on high-quality demonstration data. This gives $\pi_\text{ref}$ - a competent instruction-follower.
2. Reward Model Training: collect pairs of responses to prompts. Human labelers rank them by quality. Train a reward model $r_\phi$ using a ranking loss (Bradley-Terry) on these comparisons.
3. PPO Phase: initialize the policy $\pi_\theta = \pi_\text{ref}$ . At each PPO step:
- Sample prompts from a dataset
- Generate responses from $\pi_\theta$ (the policy)
- Score responses with $r_\phi$ (frozen)
- Compute KL-penalized reward: $R(x, y) = r_\phi(x,y) - \beta \log\big(\pi_\theta(y \mid x) / \pi_\text{ref}(y \mid x)\big)$
- Run a PPO update to maximize $R$
4. Token-level formulation: each token generation is an "action". The reward model score arrives only at the last token; in common implementations the KL penalty is applied at every token. Advantages are computed over the sequence using GAE.

The KL penalty prevents reward hacking - the model stays close to the coherent reference model while improving on the reward signal.
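Two pieces of this pipeline are small enough to sketch directly: the Bradley-Terry ranking loss for the reward model, and the per-token KL-penalized reward fed to PPO. This is an illustrative sketch under stated assumptions, not a production recipe: the function names, `beta=0.1`, and the toy log-probabilities are all hypothetical, and placing the reward model score on the final token follows the token-level convention described above.

```python
import math


def bradley_terry_loss(score_chosen, score_rejected):
    """-log(sigmoid(chosen - rejected)), computed in a numerically stable way."""
    margin = score_chosen - score_rejected
    # softplus(-margin) = max(-margin, 0) + log(1 + exp(-|margin|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_penalized_rewards(logp_policy, logp_ref, rm_score, beta=0.1):
    """Per-token rewards: KL penalty on every token, reward model score on the last.

    logp_policy / logp_ref are log-probabilities of the sampled tokens under
    the current policy and the frozen reference model.
    """
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score
    return rewards


if __name__ == "__main__":
    print("RM loss, correct ranking:", bradley_terry_loss(2.0, 0.0))
    print("RM loss, wrong ranking:", bradley_terry_loss(0.0, 2.0))
    toy_policy = [-1.0, -0.5, -2.0]   # toy token log-probs under pi_theta
    toy_ref = [-1.2, -0.5, -1.5]      # toy token log-probs under pi_ref
    print("token rewards:", kl_penalized_rewards(toy_policy, toy_ref, rm_score=1.0))
```

Summing the per-token rewards recovers the sequence-level quantity $r_\phi(x,y) - \beta \log(\pi_\theta(y \mid x) / \pi_\text{ref}(y \mid x))$, since the sequence log-ratio is the sum of the token log-ratios.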

Q6: Compare PPO vs A3C vs SAC. When would you choose each?

Answer:

| Algorithm | On/Off Policy | Action Space | Key Advantage | When to Use |
| --- | --- | --- | --- | --- |
| PPO | On-policy | Discrete or continuous | Stable, simple, reasonably sample-efficient for an on-policy method | Default choice. RLHF. Most RL benchmarks. |
| A3C | On-policy | Discrete or continuous | Parallelizes across many workers | When you have many CPUs and want async training |
| SAC | Off-policy | Continuous (discrete variants exist) | Very sample-efficient, max entropy | Robotics, continuous control, data-limited settings |

PPO is the default for most tasks because it is simple, stable, and works across action space types. SAC is preferred for continuous control when sample efficiency matters (expensive simulation or real-world data). A3C is largely superseded by PPO with vectorized environments.

Key Takeaways

- PPO solves the policy gradient instability problem by clipping the probability ratio $r_t(\theta)$ to stay within $[1-\varepsilon, 1+\varepsilon]$
- The $\min$ in the clipped objective makes it a pessimistic lower bound - it removes the incentive to keep pushing the ratio once it has moved far enough in the direction the advantage favors, but still corrects updates that moved the policy the wrong way
- GAE with $\lambda=0.95$ provides a good bias-variance tradeoff for advantage estimation
- The combined PPO objective includes policy loss, value loss, and entropy bonus
- PPO is the core algorithm in RLHF - with a KL penalty added to prevent reward hacking against an imperfect reward model
- The single most common mistake: not normalizing advantages within each minibatch
