Policy Gradient Methods
Reading time: ~50 min | Interview relevance: Very High | Target roles: ML Engineer, Research Engineer, AI Engineer
The Real Engineering Moment
The year is 2017. OpenAI is training a robot hand - Dactyl - to manipulate a Rubik's Cube in simulation. The action space is 24 continuous joint torques, each in $[-1, 1]$. DQN cannot work here. You cannot have a Q-value output node for every possible continuous action - there are uncountably many. The argmax over a continuous set requires solving a separate optimization problem at every step, which is computationally intractable.
The approach that works: parameterize the policy directly as a neural network $\pi_\theta(a \mid s)$, compute the gradient of expected return with respect to $\theta$, and ascend the gradient. The policy outputs a Gaussian distribution over joint torques. The network learns to squeeze and rotate, adapting force based on tactile feedback. No Q-function, no argmax over actions. Just gradient ascent on the policy parameters.
This is the policy gradient approach. It works for continuous and discrete actions, for stochastic and deterministic policies. And it is the direct ancestor of PPO, the algorithm used in RLHF training pipelines for large language models such as InstructGPT and ChatGPT. The REINFORCE algorithm from 1992 and the RLHF objective in 2022 share the same mathematical skeleton. Understanding one is understanding the other.
In this lesson we derive everything from first principles: where the policy gradient theorem comes from, why naive REINFORCE has high variance and how baselines fix it, how actor-critic methods improve sample efficiency, and how A2C/A3C scale to parallel environments.
Why Value-Based Methods Fail for Continuous Actions
Value-based methods (Q-learning, DQN) have a fundamental limitation: to act, you must solve $a^* = \arg\max_a Q(s, a)$.
For discrete actions: This is trivial - evaluate $Q(s, a)$ for each of the $|\mathcal{A}|$ actions, take the max. DQN handles this with one forward pass outputting $|\mathcal{A}|$ values.
For continuous actions: $a \in \mathbb{R}^d$. You cannot enumerate all actions. To find $\arg\max_a Q(s, a)$, you must run a separate optimization for every state encountered during training and inference. This is prohibitively expensive.
For stochastic optimal policies: In partially observable or adversarial settings, the optimal policy may be genuinely stochastic - a mixed strategy. Q-learning always produces a deterministic greedy policy. Rock-Paper-Scissors: the optimal strategy is uniform random. A deterministic policy is exploited by any opponent.
When policy structure is known: If you know the policy should be a Gaussian over joint torques, why learn the full Q-function (mapping every state-action pair to a value) as an intermediate step? Directly parameterize what you care about.
Policy gradient methods address all three limitations by optimizing the policy directly.
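To make the continuous-action limitation concrete, here is a small arithmetic sketch of what happens if you try to rescue DQN by discretizing each action dimension. The bin counts are hypothetical values chosen for illustration; the 24 dimensions come from the robot-hand example above.

```python
def num_joint_actions(bins_per_dim: int, action_dims: int) -> int:
    """Q-value outputs needed if every action dimension is discretized
    into `bins_per_dim` levels and all combinations are enumerated."""
    return bins_per_dim ** action_dims


# Even coarse grids explode combinatorially with 24 joints.
for bins in (3, 5, 11):
    print(f"{bins} bins per joint -> {num_joint_actions(bins, 24):.3e} joint actions")
```

A Gaussian policy sidesteps this entirely: its output size grows linearly with the number of action dimensions (a mean and a standard deviation per dimension).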
Historical Context
1992: Ronald Williams publishes "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning" - the REINFORCE algorithm. The key insight: you can compute the gradient of expected reward without knowing the environment dynamics, using the log-derivative trick. This was a theoretical breakthrough but practically unusable due to high variance.
1999–2001: Richard Sutton et al. publish the policy gradient theorem, proving that the gradient can be computed using the Q-function as a weighting factor. This connects policy gradients to value-based methods and enables actor-critic hybrids.
2015–2016: Deep actor-critic methods emerge, combining neural network function approximation with policy gradients. The A3C paper (Mnih et al., 2016) shows asynchronous parallel actors can match DQN performance on Atari with less wall-clock time.
2015: Schulman et al. introduce TRPO (Trust Region Policy Optimization) - constrained policy gradient updates that avoid catastrophic performance collapse. This leads to PPO (2017), which simplifies TRPO's constraint to a clip objective.
2022: RLHF papers (InstructGPT, Constitutional AI) use PPO on top of large language models, directly applying the policy gradient framework to language generation. The policy gradient machinery behind REINFORCE in 1992 now drives the fine-tuning of models with billions of parameters.
Policy Parameterization
Before deriving gradients, we need differentiable parameterized policies.
Softmax Policy (Discrete Actions)
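For discrete action space $\mathcal{A} = \{a_1, \dots, a_K\}$:
$$\pi_\theta(a \mid s) = \frac{\exp\big(h_\theta(s, a)\big)}{\sum_{a'} \exp\big(h_\theta(s, a')\big)}$$
where $h_\theta(s, a)$ are action preferences (logits) produced by a neural network.
Properties:
- Always produces a valid probability distribution
- Differentiable with respect to $\theta$
- As the gaps between logits $\to \infty$, $\pi_\theta$ approaches deterministic; as logits $\to 0$ (all equal), $\pi_\theta$ approaches uniform
- Exploration is natural: every action keeps non-zero probability, so any action can still be sampled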
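The limiting behaviour in the list above can be checked with a few lines of plain Python. This is a minimal sketch: the logits and the `scale` multiplier are hypothetical values used only to stretch or flatten the preferences.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for scale in (0.0, 1.0, 10.0):
    probs = softmax([scale * x for x in logits])
    # scale = 0 makes all logits equal (uniform policy);
    # a large scale widens the gaps (near-deterministic policy).
    print(scale, [round(p, 3) for p in probs])
```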
Gaussian Policy (Continuous Actions)
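For continuous action space $\mathcal{A} = \mathbb{R}^d$:
$$\pi_\theta(a \mid s) = \mathcal{N}\big(a;\ \mu_\theta(s),\ \sigma_\theta(s)^2\big)$$
The network outputs a mean vector $\mu_\theta(s)$ and (optionally) a standard deviation $\sigma_\theta(s)$.
Reparameterization trick: To sample $a \sim \pi_\theta(\cdot \mid s)$ while keeping gradients flowing:
$$a = \mu_\theta(s) + \sigma_\theta(s) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
The randomness is in $\epsilon$, which doesn't depend on $\theta$. Gradients flow through $\mu_\theta$ and $\sigma_\theta$.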
```python
import torch
import torch.nn as nn


class ContinuousPolicyNetwork(nn.Module):
    """Gaussian policy for continuous action spaces."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),  # Tanh is a common choice for small policy networks
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
        )
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        # log_std as a learnable parameter (not state-dependent).
        # Often more stable than a state-dependent std in early training.
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state: torch.Tensor):
        features = self.backbone(state)
        mean = self.mean_head(features)
        std = self.log_std.exp().expand_as(mean)
        return torch.distributions.Normal(mean, std)

    def act(self, state: torch.Tensor, deterministic: bool = False):
        dist = self.forward(state)
        if deterministic:
            action = dist.mean  # exploitation
        else:
            # rsample() draws the reparameterized sample. Detach it here:
            # log-prob-weighted (score-function) updates must treat the action
            # as a constant, otherwise the gradient w.r.t. the mean cancels out.
            action = dist.rsample().detach()
        log_prob = dist.log_prob(action).sum(-1)  # sum over action dims
        return action, log_prob
```
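One subtlety when a reparameterized sample is fed into a log-probability: if the action is allowed to move with the mean, $(a - \mu)/\sigma = \epsilon$ is constant and the log-density no longer depends on $\mu$ at all. The sketch below checks this with finite differences in plain Python; the helper names and the numbers are illustrative assumptions, not part of the lesson's code.

```python
import math


def normal_log_prob(a: float, mu: float, sigma: float) -> float:
    """Log-density of a 1-D Gaussian."""
    return -0.5 * ((a - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)


def grad_mu_fixed_action(a, mu, sigma, h=1e-5):
    """d/dmu log N(a; mu, sigma) with the sampled action held fixed."""
    return (normal_log_prob(a, mu + h, sigma) - normal_log_prob(a, mu - h, sigma)) / (2 * h)


def grad_mu_through_sample(eps, mu, sigma, h=1e-5):
    """Same derivative, but the action a = mu + sigma * eps moves with mu."""
    def f(m):
        return normal_log_prob(m + sigma * eps, m, sigma)
    return (f(mu + h) - f(mu - h)) / (2 * h)


mu, sigma, eps = 0.3, 0.5, 1.2
a = mu + sigma * eps
# Analytic score w.r.t. the mean is (a - mu) / sigma**2 = eps / sigma.
print("action held fixed :", grad_mu_fixed_action(a, mu, sigma))
# Letting the gradient flow through the sample removes the dependence on mu.
print("through the sample:", grad_mu_through_sample(eps, mu, sigma))
```

This is why score-function methods (REINFORCE, A2C, PPO) treat the sampled action as a constant, while pathwise methods such as SAC differentiate through the sample into a learned Q-function instead.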
The Policy Gradient Theorem: Full Derivation
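We want to maximize expected cumulative reward:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big], \qquad R(\tau) = \sum_{t=0}^{T-1} r_t$$
where $\tau = (s_0, a_0, r_0, \dots, s_T)$ is a trajectory sampled by running policy $\pi_\theta$.
Step 1: Write as an integral
$$J(\theta) = \int p_\theta(\tau)\, R(\tau)\, d\tau$$
where the trajectory probability is:
$$p_\theta(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
Step 2: Take the gradient
$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau$$
Step 3: Apply the log-derivative trick
The log-derivative identity: for any differentiable $p_\theta(x) > 0$:
$$\nabla_\theta p_\theta(x) = p_\theta(x)\, \nabla_\theta \log p_\theta(x)$$
Applying to $p_\theta(\tau)$:
$$\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$$
Step 4: Substitute and simplify
$$\nabla_\theta J(\theta) = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\big[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\big]$$
Step 5: Expand $\log p_\theta(\tau)$
$$\log p_\theta(\tau) = \log p(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{T-1} \log p(s_{t+1} \mid s_t, a_t)$$
Taking the gradient w.r.t. $\theta$: the first term doesn't depend on $\theta$ (initial state distribution is fixed). The last terms don't depend on $\theta$ (environment dynamics are independent of policy parameters). Only the policy log-probabilities remain:
$$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
Final Result: Policy Gradient Theorem
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$
where $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k$ is the return from time $t$ (causality: only future rewards matter for step $t$'s action). With $\gamma = 1$ this follows exactly from the derivation above; practical implementations use $\gamma < 1$ inside $G_t$.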
Why this is remarkable:
- The gradient doesn't require knowing $p(s_{t+1} \mid s_t, a_t)$ - the environment dynamics cancel out
- It can be estimated by sampling trajectories and computing log-probabilities - both accessible
- It works for any differentiable parameterization of $\pi_\theta$, including neural networks
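The claim that sampled log-probability gradients recover the true gradient can be sanity-checked on the smallest possible problem. The sketch below assumes a hypothetical two-armed bandit with a single parameter `theta` (the logit of arm 1) and deterministic rewards `r0`, `r1`; none of these names come from the lesson.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def exact_gradient(theta: float, r0: float, r1: float) -> float:
    """d/dtheta of J = p1 * r1 + (1 - p1) * r0, where p1 = sigmoid(theta)."""
    p1 = sigmoid(theta)
    return p1 * (1.0 - p1) * (r1 - r0)


def score_function_estimate(theta, r0, r1, n_samples, rng):
    """Monte Carlo average of grad log pi(a) * reward, using sampled actions only."""
    p1 = sigmoid(theta)
    total = 0.0
    for _ in range(n_samples):
        if rng.random() < p1:
            total += (1.0 - p1) * r1  # d/dtheta log pi(a=1) = 1 - p1
        else:
            total += (-p1) * r0       # d/dtheta log pi(a=0) = -p1
    return total / n_samples


rng = random.Random(0)
theta, r0, r1 = 0.4, 0.0, 1.0
print("exact    :", exact_gradient(theta, r0, r1))
print("estimated:", score_function_estimate(theta, r0, r1, 100_000, rng))
```

The estimator never looks at how rewards are generated; it only needs sampled actions, their rewards, and the derivative of the log-probability.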
REINFORCE: Monte Carlo Policy Gradient
REINFORCE (Williams, 1992) directly implements the policy gradient theorem using Monte Carlo trajectory sampling:
For each episode:
1. Sample trajectory τ = (s₀,a₀,r₀,...,s_T) by running π_θ
2. Compute returns G_t for each timestep t
3. Gradient estimate: ĝ = Σ_t ∇_θ log π_θ(a_t|s_t) · G_t
4. Update: θ ← θ + α · ĝ
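Step 2 of the loop above is a single backward pass over the rewards. A minimal standalone sketch (the function name and the toy rewards are illustrative):

```python
def discounted_returns(rewards, gamma):
    """G_t = r_t + gamma * G_{t+1}, computed from the last step backwards."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()  # restore chronological order
    return returns


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Appending and reversing once keeps the whole computation linear in the episode length.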
Full PyTorch Implementation
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import gymnasium as gym
from torch.optim import Adam


class PolicyNetwork(nn.Module):
    """Discrete-action policy network (softmax output)."""

    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Distribution:
        logits = self.net(state)
        return torch.distributions.Categorical(logits=logits)

    def act(self, state: np.ndarray) -> tuple[int, torch.Tensor]:
        state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        dist = self.forward(state_t)
        action = dist.sample()
        log_prob = dist.log_prob(action)  # shape (1,)
        return action.item(), log_prob


class REINFORCEAgent:
    """
    REINFORCE: Monte Carlo Policy Gradient.
    - Collects full episodes
    - Computes exact Monte Carlo returns G_t
    - High variance, unbiased gradient estimates
    """

    def __init__(
        self,
        state_dim: int,
        n_actions: int,
        lr: float = 1e-3,
        gamma: float = 0.99,
        normalize_returns: bool = True,
    ):
        self.gamma = gamma
        self.normalize_returns = normalize_returns
        self.policy = PolicyNetwork(state_dim, n_actions)
        self.optimizer = Adam(self.policy.parameters(), lr=lr)

    def compute_returns(self, rewards: list[float]) -> torch.Tensor:
        """
        Compute discounted returns G_t for each timestep.
        G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...
        Computed backwards in a single pass: O(T) instead of O(T²).
        """
        returns = []
        G = 0.0
        for r in reversed(rewards):
            G = r + self.gamma * G
            returns.append(G)  # append + reverse keeps this O(T)
        returns.reverse()
        returns = torch.tensor(returns, dtype=torch.float32)
        if self.normalize_returns and len(returns) > 1:
            # Subtract mean, divide by std.
            # This is NOT a formal baseline (it depends on the whole episode),
            # but it reduces variance in practice.
            # std() of a single element is NaN, hence the length check.
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        return returns

    def update(
        self,
        log_probs: list[torch.Tensor],
        rewards: list[float],
    ) -> float:
        """
        REINFORCE gradient update.
        Loss = -Σ_t log π_θ(a_t|s_t) · G_t
        (Negative because PyTorch minimizes, but we want to maximize J)
        The gradient of this loss w.r.t. θ is a sample estimate of -∇_θ J(θ).
        """
        returns = self.compute_returns(rewards)
        # Each log_prob has shape (1,); cat gives (T,).
        # torch.stack would give (T, 1) and silently broadcast against
        # returns of shape (T,) into a (T, T) matrix.
        log_probs_t = torch.cat(log_probs)  # (T,)
        policy_loss = -(log_probs_t * returns).sum()
        self.optimizer.zero_grad()
        policy_loss.backward()
        # Gradient clipping prevents occasional large gradient steps
        torch.nn.utils.clip_grad_norm_(self.policy.parameters(), max_norm=1.0)
        self.optimizer.step()
        return policy_loss.item()


def train_reinforce(
    env_name: str = "CartPole-v1",
    n_episodes: int = 2000,
    lr: float = 1e-3,
    gamma: float = 0.99,
) -> tuple[REINFORCEAgent, list[float]]:
    """Train REINFORCE on a Gymnasium environment."""
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    n_actions = env.action_space.n
    agent = REINFORCEAgent(state_dim, n_actions, lr=lr, gamma=gamma)
    episode_rewards = []
    for episode in range(n_episodes):
        state, _ = env.reset()
        log_probs, rewards = [], []
        done = False
        # Collect full episode
        while not done:
            action, log_prob = agent.policy.act(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            log_probs.append(log_prob)
            rewards.append(float(reward))
            state = next_state
        # Update after each episode
        loss = agent.update(log_probs, rewards)
        episode_rewards.append(sum(rewards))
        if (episode + 1) % 200 == 0:
            avg = np.mean(episode_rewards[-200:])
            print(f"Episode {episode+1:4d} | Avg reward: {avg:6.1f} | Loss: {loss:.4f}")
    env.close()
    return agent, episode_rewards
```
The Variance Problem: Why REINFORCE Is Slow
REINFORCE is theoretically correct but has extremely high variance in practice. Understanding why is essential for understanding why all subsequent methods (baseline, actor-critic, PPO) exist.
Source of variance: The return $G_t$ depends on all future actions and environment transitions - every random decision and stochastic environment response after time $t$. A single unlucky trajectory can make a good action look bad.
Concrete example:
- Episode 1: Agent takes the optimal action at $t = 0$, but falls into a trap at a later step (bad luck). $G_0$ is low. Gradient pushes action probability down - wrong direction.
- Episode 2: Agent takes a suboptimal action at $t = 0$, but gets lucky later. $G_0$ is high. Gradient pushes action probability up - wrong direction.
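Quantitative: The variance of the REINFORCE gradient estimator $\hat{g} = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$ is its second moment minus the squared true gradient:
$$\mathrm{Var}[\hat{g}] = \mathbb{E}\big[\|\hat{g}\|^2\big] - \big\|\nabla_\theta J(\theta)\big\|^2$$
The second moment grows with the episode length $T$ and with the magnitude of the returns $G_t$.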
This can be orders of magnitude larger than the signal. In practice, REINFORCE often requires thousands of episodes to converge on tasks where actor-critic converges in hundreds.
Baseline Subtraction: Variance Reduction Without Bias
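Key theorem: You can subtract any function $b(s_t)$ of the state (a baseline) from $G_t$ without changing the expected gradient, as long as it does not depend on the action $a_t$:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - b(s_t)\big)\right]$$
Proof That Baselines Are Unbiased
We need to show: $\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big] = 0$.
$$\mathbb{E}_{a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big] = b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) = b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s) = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0$$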
The baseline does not bias the gradient estimate - it is a free variance reduction tool.
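Both halves of this statement (same mean, different variance) can be verified exactly by enumerating the actions of a small softmax policy. The sketch below is a single-state bandit with hypothetical logits and rewards; it uses the standard softmax score, $\partial \log \pi(a) / \partial \theta_k = \mathbb{1}[a = k] - \pi_k$.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def estimator_stats(logits, rewards, baseline):
    """Exact mean and total variance of g = grad log pi(a) * (r(a) - baseline),
    computed by enumerating actions instead of sampling."""
    probs = softmax(logits)
    n = len(logits)
    mean = [0.0] * n
    second_moment = 0.0
    for a, p_a in enumerate(probs):
        score = [(1.0 if k == a else 0.0) - probs[k] for k in range(n)]
        g = [s * (rewards[a] - baseline) for s in score]
        for k in range(n):
            mean[k] += p_a * g[k]
        second_moment += p_a * sum(x * x for x in g)
    variance = second_moment - sum(m_k * m_k for m_k in mean)
    return mean, variance


logits = [0.0, 0.0, 0.0]
rewards = [10.0, 11.0, 12.0]  # large offset, small differences between actions
for b in (0.0, 11.0):
    mean, var = estimator_stats(logits, rewards, b)
    print(f"baseline={b:5.1f}  mean={[round(m, 4) for m in mean]}  variance={var:.3f}")
```

The expected gradient is identical for both baselines, while the variance shrinks sharply once the common offset in the rewards is subtracted.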
Optimal Baseline
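The variance of the gradient estimator with baseline $b$ is (per state, summed over gradient components):
$$\mathrm{Var}\big[\nabla_\theta \log \pi_\theta\, (G_t - b)\big] = \mathbb{E}\big[\|\nabla_\theta \log \pi_\theta(a_t \mid s_t)\|^2\, (G_t - b)^2\big] - \big\|\mathbb{E}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\big]\big\|^2$$
Minimizing over $b$ gives the optimal baseline:
$$b^*(s_t) = \frac{\mathbb{E}\big[\|\nabla_\theta \log \pi_\theta(a_t \mid s_t)\|^2\, G_t\big]}{\mathbb{E}\big[\|\nabla_\theta \log \pi_\theta(a_t \mid s_t)\|^2\big]}$$
In practice, the optimal baseline is hard to compute. The value function $V^\pi(s_t)$ is an excellent approximation - it removes the part of the return that doesn't depend on the specific action taken.
The Advantage Function
With $b(s_t) = V^\pi(s_t)$, and using $\mathbb{E}[G_t \mid s_t, a_t] = Q^\pi(s_t, a_t)$:
$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$
The advantage function measures how much better action $a_t$ is than the policy's average. The update becomes:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^\pi(s_t, a_t)\right]$$
- $A^\pi(s_t, a_t) > 0$: action was better than average → increase its probability
- $A^\pi(s_t, a_t) < 0$: action was worse than average → decrease its probability
- Zero-mean by construction: $\mathbb{E}_{a \sim \pi_\theta}\big[A^\pi(s, a)\big] = 0$ for all $s$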
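The zero-mean property follows directly from $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$. A tiny numeric sketch for a single state, with hypothetical action probabilities and Q-values:

```python
def advantages(probs, q_values):
    """Return V = sum_a pi(a) * Q(a) and the per-action advantages Q(a) - V."""
    v = sum(p * q for p, q in zip(probs, q_values))
    return v, [q - v for q in q_values]


probs = [0.5, 0.3, 0.2]
q_values = [1.0, 2.0, 4.0]
v, adv = advantages(probs, q_values)
expected_adv = sum(p * a for p, a in zip(probs, adv))
print("V =", v)
print("A =", adv)
print("E[A] =", expected_adv)  # zero up to floating-point rounding
```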
Actor-Critic: Online Advantage Estimation
REINFORCE waits until the end of the episode to compute exact Monte Carlo returns. This is:
- Slow: can't learn until the episode ends
- High variance: returns are noisy due to all future randomness
- Only valid for episodic tasks
Actor-Critic learns a value function $V_w(s)$ (the critic) online to estimate advantages at each step - no need to wait for the episode to end.
Actor (policy): π_θ(a|s) - decides what action to take
Critic (value): V_w(s) - estimates expected return from s
TD error as advantage estimate:
$$\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$$
This is a biased estimator of $A^\pi(s_t, a_t)$ - biased because $V_w$ is an approximation, but lower variance than Monte Carlo returns because it uses only one step of actual reward.
Bias-variance tradeoff in policy gradient estimation:
| Estimator | Bias | Variance | Update frequency |
|---|---|---|---|
| $G_t$ (Monte Carlo) | None | Very high | Episode-level |
| $\delta_t$ (1-step TD) | High when the critic is poor | Low | Step-level |
| $n$-step return | Medium | Medium | Every $n$ steps |
| GAE ($\lambda$) | Adjustable | Adjustable | Step-level |
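The last row of the table interpolates between the first two. Below is a minimal sketch of the GAE recursion $\hat{A}_t = \delta_t + \gamma\lambda\, \hat{A}_{t+1}$ in plain Python. It assumes a single rollout with no episode boundary inside it, and the function and argument names are illustrative.

```python
def gae_advantages(rewards, values, next_value, gamma, lam):
    """Generalized Advantage Estimation over one rollout.

    values[t] is V(s_t); next_value is V(s_T) and bootstraps the final step.
    Assumes no terminal state occurs inside the rollout.
    """
    n = len(rewards)
    advantages = [0.0] * n
    gae = 0.0
    for t in reversed(range(n)):
        v_next = next_value if t == n - 1 else values[t + 1]
        delta = rewards[t] + gamma * v_next - values[t]  # 1-step TD error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.4, 0.3]
print("lambda=0 (TD errors):       ", gae_advantages(rewards, values, 0.2, 0.9, 0.0))
print("lambda=1 (n-step return - V):", gae_advantages(rewards, values, 0.2, 0.9, 1.0))
```

With $\lambda = 0$ each entry is just $\delta_t$; with $\lambda = 1$ each entry equals the bootstrapped multi-step return minus $V(s_t)$.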
Full PyTorch Actor-Critic Implementation
```python
import numpy as np
import gymnasium as gym
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam


class ActorCriticNetwork(nn.Module):
    """
    Shared backbone with separate actor and critic heads.
    Sharing backbone: fewer parameters, shared representations.
    Separate heads: actor and critic can specialize independently.
    Alternative: completely separate networks (more stable but slower).
    """

    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 256):
        super().__init__()
        # Shared feature extractor
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
        )
        # Actor head: produces action distribution
        self.actor_head = nn.Linear(hidden_dim, n_actions)
        # Critic head: produces scalar state value estimate
        self.critic_head = nn.Linear(hidden_dim, 1)
        # Initialize critic head to zero (avoids large early value estimates)
        nn.init.zeros_(self.critic_head.weight)
        nn.init.zeros_(self.critic_head.bias)

    def forward(self, state: torch.Tensor) -> tuple:
        features = self.backbone(state)
        logits = self.actor_head(features)
        value = self.critic_head(features).squeeze(-1)
        return torch.distributions.Categorical(logits=logits), value

    def get_action_and_value(
        self, state: torch.Tensor, action: torch.Tensor | None = None
    ) -> tuple:
        dist, value = self.forward(state)
        if action is None:
            action = dist.sample()
        log_prob = dist.log_prob(action)
        entropy = dist.entropy()
        return action, log_prob, entropy, value


class A2CAgent:
    """
    Advantage Actor-Critic (A2C) - synchronous version.
    Collects n_steps of experience, then updates both actor and critic.
    Loss = actor_loss + value_coeff * critic_loss - entropy_coeff * entropy
    A2C vs A3C:
    - A2C: synchronous - wait for all workers, average gradients, stable
    - A3C: asynchronous - workers update global network independently, fast
    - In practice with GPUs: A2C is usually preferred (GPU batching)
    """

    def __init__(
        self,
        state_dim: int,
        n_actions: int,
        lr: float = 7e-4,
        gamma: float = 0.99,
        value_coeff: float = 0.5,     # weight for critic loss
        entropy_coeff: float = 0.01,  # entropy bonus weight
        n_steps: int = 5,             # steps before update
        max_grad_norm: float = 0.5,   # gradient clipping
    ):
        self.gamma = gamma
        self.value_coeff = value_coeff
        self.entropy_coeff = entropy_coeff
        self.n_steps = n_steps
        self.max_grad_norm = max_grad_norm
        self.network = ActorCriticNetwork(state_dim, n_actions)
        self.optimizer = Adam(self.network.parameters(), lr=lr, eps=1e-5)

    def compute_returns_and_advantages(
        self,
        rewards: list[float],
        values: torch.Tensor,  # V(s_t) for t=0..n-1
        next_value: float,     # V(s_n) - bootstrap value
        dones: list[bool],
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Compute n-step returns and advantages.
        Return at step t: G_t = r_t + γ·r_{t+1} + ... + γ^{n-t-1}·r_{n-1} + γ^{n-t}·V(s_n)
        Advantage at step t: A_t = G_t - V(s_t)
        The done flag zeros out the bootstrap when an episode ends.
        """
        returns = []
        G = next_value
        for i in reversed(range(len(rewards))):
            # If done, don't bootstrap from next state
            G = rewards[i] + self.gamma * G * (1.0 - float(dones[i]))
            returns.append(G)
        returns.reverse()
        returns = torch.tensor(returns, dtype=torch.float32)
        advantages = returns - values.detach()
        # Normalize advantages (noisy for very small batches, see Common Mistakes)
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        return returns, advantages

    def update(
        self,
        states: torch.Tensor,      # (n_steps, state_dim)
        actions: torch.Tensor,     # (n_steps,)
        rewards: list[float],      # (n_steps,)
        next_state: torch.Tensor,  # (state_dim,)
        dones: list[bool],         # (n_steps,)
    ) -> dict[str, float]:
        """Full A2C update step."""
        # Get action distributions and values for collected states
        dists, values = self.network(states)
        # Bootstrap next-state value
        with torch.no_grad():
            _, next_value = self.network(next_state.unsqueeze(0))
            next_value = next_value.item()
        returns, advantages = self.compute_returns_and_advantages(
            rewards, values, next_value, dones
        )
        # --- Actor loss: -E[log π(a|s) · A(s,a)] ---
        log_probs = dists.log_prob(actions)
        actor_loss = -(log_probs * advantages).mean()
        # --- Critic loss: MSE between value predictions and n-step returns ---
        # `returns` carries no gradient; it is weighted by value_coeff below.
        critic_loss = F.mse_loss(values, returns)
        # --- Entropy bonus: H(π) = -Σ_a π(a|s) log π(a|s) ---
        # Maximizing entropy encourages exploration.
        entropy = dists.entropy().mean()
        # --- Total loss ---
        total_loss = (
            actor_loss
            + self.value_coeff * critic_loss
            - self.entropy_coeff * entropy
        )
        self.optimizer.zero_grad()
        total_loss.backward()
        # Gradient clipping: prevents catastrophic updates
        torch.nn.utils.clip_grad_norm_(
            self.network.parameters(), max_norm=self.max_grad_norm
        )
        self.optimizer.step()
        return {
            "actor_loss": actor_loss.item(),
            "critic_loss": critic_loss.item(),
            "entropy": entropy.item(),
            "total_loss": total_loss.item(),
        }


def train_a2c(
    env_name: str = "CartPole-v1",
    n_envs: int = 4,   # parallel environments
    n_steps: int = 5,  # steps per update
    total_steps: int = 500_000,
) -> A2CAgent:
    """
    A2C training loop with parallel environments.
    Using gym.vector for vectorized environment interaction.
    """
    # Vectorized environments (parallel rollouts).
    # SyncVectorEnv is used here because gym.vector.make was removed in Gymnasium 1.0.
    envs = gym.vector.SyncVectorEnv(
        [lambda: gym.make(env_name) for _ in range(n_envs)]
    )
    state_dim = envs.single_observation_space.shape[0]
    n_actions = envs.single_action_space.n
    agent = A2CAgent(state_dim, n_actions, n_steps=n_steps)
    obs, _ = envs.reset()
    episode_rewards = np.zeros(n_envs)
    all_rewards = []
    for step in range(0, total_steps, n_steps * n_envs):
        # Collect n_steps of experience from all envs
        states_batch = []
        actions_batch = []
        rewards_batch = [[] for _ in range(n_envs)]
        dones_batch = [[] for _ in range(n_envs)]
        for t in range(n_steps):
            obs_t = torch.as_tensor(obs, dtype=torch.float32)
            states_batch.append(obs_t)
            with torch.no_grad():
                dists, _ = agent.network(obs_t)
                actions = dists.sample()
            actions_batch.append(actions)
            obs_next, rewards, terminated, truncated, infos = envs.step(
                actions.numpy()
            )
            dones = terminated | truncated
            for i in range(n_envs):
                rewards_batch[i].append(float(rewards[i]))
                dones_batch[i].append(bool(dones[i]))
                episode_rewards[i] += rewards[i]
                if dones[i]:
                    all_rewards.append(episode_rewards[i])
                    episode_rewards[i] = 0.0
            obs = obs_next
        # Update for each environment (or batch them together)
        for env_idx in range(n_envs):
            env_states = torch.stack(
                [states_batch[t][env_idx] for t in range(n_steps)]
            )
            env_actions = torch.stack(
                [actions_batch[t][env_idx] for t in range(n_steps)]
            )
            env_next_state = torch.as_tensor(obs[env_idx], dtype=torch.float32)
            agent.update(
                states=env_states,
                actions=env_actions,
                rewards=rewards_batch[env_idx],
                next_state=env_next_state,
                dones=dones_batch[env_idx],
            )
        if len(all_rewards) > 0 and step % 10_000 == 0:
            avg = np.mean(all_rewards[-50:])
            print(f"Step {step:6d} | Avg reward (last 50 eps): {avg:.1f}")
    envs.close()
    return agent
```
The Three Loss Terms: Deep Dive
Actor Loss:
$$L_{\text{actor}}(\theta) = -\mathbb{E}\big[\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\big]$$
Gradient of this loss w.r.t. $\theta$ equals $-\nabla_\theta J(\theta)$ - minimizing the loss is gradient ascent on expected return. The sign of $\hat{A}_t$ determines the update direction:
- $\hat{A}_t > 0$: Loss becomes more negative as $\log \pi_\theta(a_t \mid s_t)$ increases → the optimizer increases $\pi_\theta(a_t \mid s_t)$. Good.
- $\hat{A}_t < 0$: Loss becomes more positive as $\log \pi_\theta(a_t \mid s_t)$ increases → the optimizer decreases $\pi_\theta(a_t \mid s_t)$. Good.
Critic Loss:
$$L_{\text{critic}}(w) = \mathbb{E}\big[(V_w(s_t) - G_t)^2\big]$$
Standard mean-squared error regression. The target is the observed return: the full Monte Carlo return is an unbiased estimate of $V^\pi(s_t)$, while a bootstrapped $n$-step return accepts some bias from $V_w$ in exchange for lower variance. The critic is being trained to predict returns accurately. A well-trained critic makes the advantage estimates more accurate, which reduces policy gradient variance.
Note: The target should be detached from the computation graph, so the critic loss only adjusts the prediction $V_w(s_t)$ and never the target itself.
Entropy Bonus:
The entropy of a categorical distribution over actions:
$$H\big(\pi_\theta(\cdot \mid s)\big) = -\sum_a \pi_\theta(a \mid s) \log \pi_\theta(a \mid s)$$
Properties:
- Maximum entropy: $\log |\mathcal{A}|$ (uniform distribution)
- Minimum entropy: 0 (deterministic distribution)
- Maximizing entropy keeps the policy from collapsing to a single action prematurely
Why entropy matters: Without it, the policy converges to a near-deterministic distribution early in training, before it has explored sufficiently. This is called entropy collapse - the policy becomes too confident. The entropy coefficient (`entropy_coeff` in the code above, typically 0.01) is a hyperparameter: too high and the policy stays random; too low and entropy collapse occurs.
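The two extremes of the entropy range are easy to confirm numerically. A minimal sketch with hypothetical four-action distributions (values in nats):

```python
import math


def categorical_entropy(probs):
    """H = -sum_a p(a) log p(a); terms with p(a) = 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
deterministic = [1.0, 0.0, 0.0, 0.0]

print("uniform      :", categorical_entropy(uniform), "(log 4 =", math.log(4), ")")
print("peaked       :", categorical_entropy(peaked))
print("deterministic:", categorical_entropy(deterministic))
```

Logging this quantity during training and comparing it against $\log |\mathcal{A}|$ is a cheap way to spot entropy collapse early.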
A2C vs A3C: Synchronous vs Asynchronous
A3C (Asynchronous Advantage Actor-Critic, Mnih et al. 2016):
Global Network (θ_global, w_global)
↑ ↑ ↑
Worker 1 Worker 2 Worker 3
(env copy) (env copy) (env copy)
computes ∇θ computes ∇θ computes ∇θ
updates global asynchronously
Workers operate in parallel, independently computing gradients and applying them to the shared global network without waiting for each other. Benefits:
- Multiple actors generate diverse experience (different random seeds, exploration trajectories)
- Asynchronous updates decorrelate training data (like a distributed replay buffer)
- Efficient CPU utilization (each worker runs on one CPU core)
A2C (Synchronous version):
Global Network (θ, w)
↓
┌────────────────────┐
│ Worker 1 ... N │ ← all workers run synchronously
└────────────────────┘
↓
Average gradients from all workers
↓
Single synchronized update
Workers run in sync, gradients are averaged, one update per step. Benefits:
- More deterministic training (reproducible results)
- Better GPU utilization (larger effective batch size)
- Slightly more stable due to gradient averaging
In practice (2024): A2C with vectorized environments (gym.vector) is standard. A3C's asynchronous advantage is less important with modern GPU training since the GPU naturally batches computation.
Entropy Regularization: Maximum Entropy RL
Entropy maximization can be elevated from a regularization trick to a core objective:
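$$J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \pi}\big[r(s_t, a_t) + \alpha\, H\big(\pi(\cdot \mid s_t)\big)\big]$$
where $H$ is the policy entropy and $\alpha$ is the temperature parameter.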
Why Maximum Entropy RL?
- Exploration: the agent explores all high-reward strategies, not just one
- Robustness: a stochastic policy is harder to exploit in adversarial settings
- Multi-modality: the agent learns all ways to solve the task, not just one path
- Transferability: maximum entropy policies tend to be a better starting point when adapting to new tasks
Soft Actor-Critic (SAC) takes MaxEnt RL to its logical conclusion for continuous control. It maintains an automatically tuned temperature $\alpha$ and achieved state-of-the-art results on MuJoCo benchmarks when it was introduced. The policy is explicitly regularized to have high entropy, resulting in natural exploration without $\epsilon$-greedy.
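For intuition about what the entropy-augmented objective prefers, consider a single state with known action values. The policy that maximizes $\sum_a \pi(a) Q(a) + \alpha H(\pi)$ is the Boltzmann policy $\pi(a) \propto \exp(Q(a)/\alpha)$, and the optimum equals $\alpha \log \sum_a \exp(Q(a)/\alpha)$. The sketch below checks this on hypothetical numbers; it is a one-state simplification, not SAC itself.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def soft_objective(probs, q_values, alpha):
    """Expected value plus temperature-weighted entropy for one state."""
    expected_q = sum(p * q for p, q in zip(probs, q_values))
    return expected_q + alpha * entropy(probs)


def boltzmann_policy(q_values, alpha):
    return softmax([q / alpha for q in q_values])


q_values = [1.0, 0.9, 0.0]  # two nearly equally good actions, one bad one
alpha = 0.5
candidates = {
    "greedy": [1.0, 0.0, 0.0],
    "uniform": [1 / 3, 1 / 3, 1 / 3],
    "boltzmann": boltzmann_policy(q_values, alpha),
}
for name, probs in candidates.items():
    print(f"{name:9s} objective = {soft_objective(probs, q_values, alpha):.4f}")
```

The Boltzmann policy keeps substantial probability on both good actions, which is the multi-modality benefit listed above.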
Gaussian Policy for Continuous Actions: Full Implementation
```python
import torch
import torch.nn as nn
import numpy as np
import gymnasium as gym
from torch.optim import Adam


class GaussianActorCritic(nn.Module):
    """Actor-Critic for continuous action spaces with Gaussian policy."""

    LOG_STD_MIN = -20
    LOG_STD_MAX = 2

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
        )
        # Actor: outputs mean and log_std of Gaussian
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        self.log_std_head = nn.Linear(hidden_dim, action_dim)
        # Critic: outputs state value
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, state: torch.Tensor):
        features = self.backbone(state)
        # Actor output
        mean = self.mean_head(features)
        log_std = self.log_std_head(features).clamp(
            self.LOG_STD_MIN, self.LOG_STD_MAX
        )
        std = log_std.exp()
        dist = torch.distributions.Normal(mean, std)
        # Critic output
        value = self.value_head(features).squeeze(-1)
        return dist, value

    def act(self, state: np.ndarray, deterministic: bool = False):
        state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        dist, value = self.forward(state_t)
        if deterministic:
            action = dist.mean
        else:
            action = dist.rsample()  # reparameterized sample
        # Sum log_prob over action dimensions
        # (computed for the unclipped action, before the clamp below)
        log_prob = dist.log_prob(action).sum(-1)
        # Clip action to valid range for the environment
        action = action.clamp(-1.0, 1.0)
        return (
            action.squeeze(0).detach().numpy(),
            log_prob.item(),
            value.item(),
        )
```
Algorithm Comparison
| | REINFORCE | Actor-Critic (A2C) | PPO |
|---|---|---|---|
| Bias | None | Some (critic approx.) | Some (clipping) |
| Variance | Very high | Low-medium | Low |
| Sample efficiency | Low | Medium | High |
| Data reuse | None (on-policy) | None (on-policy) | Multiple epochs |
| Stability | Low | Medium | High |
| Hyperparameters | Few (α, γ) | More (coeff's) | Many (clip ε, coeff's) |
| Best for | Theory demos | Sequential tasks | Production, RLHF |
Connection to RLHF
The policy gradient framework maps directly to RLHF for language models:
| RL Concept | RLHF Analog |
|---|---|
| Policy | Language model |
| State | Prompt + tokens generated so far |
| Action | Next token (from vocabulary of ~50K) |
| Episode | Full response generation (until EOS) |
| Reward | Reward model score at episode end |
| Critic | Value head on top of LLM backbone |
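| KL penalty | $-\beta \log\big(p_\theta(y_t \mid x, y_{<t}) / p_{\text{ref}}(y_t \mid x, y_{<t})\big)$ added to the reward |
The REINFORCE-style update for RLHF would be:
$$\nabla_\theta J(\theta) = \mathbb{E}_{x,\ y \sim p_\theta(\cdot \mid x)}\left[r(x, y) \sum_t \nabla_\theta \log p_\theta(y_t \mid x, y_{<t})\right]$$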
Responses with high reward scores get reinforced; low-reward responses are suppressed. PPO-RLHF adds:
- A clipping constraint (prevents large policy updates that destroy alignment)
- A KL-from-reference penalty (prevents reward hacking by staying close to the base model)
- A learned value baseline (reduces variance without bias)
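As an illustration of the KL-from-reference penalty, the sketch below builds per-token rewards for one sampled response. It assumes one common convention (a KL term at every token and the reward model score added at the final token); the function name, the token log-probabilities and `beta` are all hypothetical.

```python
def kl_shaped_rewards(policy_logprobs, ref_logprobs, reward_model_score, beta):
    """Per-token rewards: -beta * (log p_theta - log p_ref) at each token,
    plus the scalar reward model score on the last token.
    Assumes a non-empty response."""
    rewards = [
        -beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)
    ]
    rewards[-1] += reward_model_score
    return rewards


# Log-probabilities of the sampled tokens under the current policy and
# under the frozen reference model (illustrative numbers).
policy_logprobs = [-1.0, -0.5, -0.2]
ref_logprobs = [-1.2, -0.5, -1.0]
print(kl_shaped_rewards(policy_logprobs, ref_logprobs, reward_model_score=2.0, beta=0.1))
```

Tokens where the policy has become much more confident than the reference model are penalized, which is what discourages drifting toward reward-hacking outputs.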
Common Mistakes
:::danger Forgetting to Detach the Critic Target
When computing the critic loss F.mse_loss(values, returns), the returns should not have gradients flowing through them. If returns are computed using the critic's own outputs (bootstrapping), you must use .detach(): target = (r + gamma * next_value).detach(). Failing to detach causes the critic to chase its own tail - it optimizes against a moving target that it itself is changing, leading to instability or divergence.
:::
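:::danger Wrong Sign on the Actor Loss
The policy gradient theorem gives $\nabla_\theta J(\theta) = \mathbb{E}\big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\big]$. To maximize $J$, you ascend the gradient. In PyTorch, which minimizes, the actor loss must be $L = -\mathbb{E}\big[\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\big]$. A common bug: forgetting the negative sign. The gradient will then descend $J$, making the agent worse over time. Symptom: rewards decrease steadily during training.
:::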
:::warning Entropy Collapse in Early Training
If the entropy coefficient is too small or the learning rate too large, the policy can collapse to a near-deterministic distribution within the first few thousand steps. Once collapsed, the policy stops exploring, and gradient estimates become extremely noisy (few actions ever tried). Always monitor entropy in your training logs. If entropy drops below ~0.5 nats in early training, increase entropy_coeff or reduce the learning rate.
:::
:::warning Advantage Normalization Can Hurt in Small Batches
Normalizing advantages to zero mean and unit variance per batch is a common trick. But if the batch is very small (n_steps=5, single environment), the normalization constants are noisy estimates of batch statistics, introducing additional variance. For small batches, skip advantage normalization or use running statistics. For large batches (32+ environments × 128+ steps), normalization is beneficial.
:::
YouTube Resources
| Video | Creator | Why Watch |
|---|---|---|
| Policy Gradient Algorithms - Deep RL Bootcamp Lecture | Pieter Abbeel & John Schulman | The canonical lecture on policy gradients - covers REINFORCE, baselines, actor-critic, TRPO |
| Lecture 8: Policy Gradient - CS285 Deep RL | Sergey Levine (UC Berkeley) | Rigorous derivation with a focus on variance reduction and practical implementation |
| A Friendly Introduction to REINFORCE | Andrej Karpathy | Intuitive walkthrough of the log-derivative trick with minimal formalism |
| Deep RL Lecture 3 - Policy Gradients | David Silver (DeepMind) | Connects DP-based policy gradient theorem to sampled Monte Carlo estimation |
| Proximal Policy Optimization - Paper Explained | Yannic Kilcher | Excellent explanation of how PPO builds on actor-critic - essential RLHF prerequisite |
Interview Q&A
Q1: Derive the policy gradient theorem. Why doesn't it require environment dynamics?
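The goal is $\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$. Writing as an integral: $J(\theta) = \int p_\theta(\tau)\, R(\tau)\, d\tau$. The trajectory probability is $p_\theta(\tau) = p(s_0) \prod_t \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$. Using the log-derivative trick - $\nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$ - we get $\nabla_\theta J(\theta) = \mathbb{E}_\tau[\nabla_\theta \log p_\theta(\tau)\, R(\tau)]$. Taking the log: $\log p_\theta(\tau) = \log p(s_0) + \sum_t \log \pi_\theta(a_t \mid s_t) + \sum_t \log p(s_{t+1} \mid s_t, a_t)$. Differentiating w.r.t. $\theta$: the $\log p(s_0)$ term is zero (initial state distribution is fixed). The $\log p(s_{t+1} \mid s_t, a_t)$ terms are zero (environment dynamics don't depend on $\theta$). Only $\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ remains. You only need to be able to evaluate $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ - never $p(s_{t+1} \mid s_t, a_t)$. This is why policy gradient methods are model-free.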
Q2: What is the variance problem in REINFORCE and how do baselines address it?
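REINFORCE estimates the gradient as $\hat{g} = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$ where $G_t$ is the Monte Carlo return. $G_t$ has high variance because it is a sum of all future random rewards and environment responses. An unlucky trajectory makes a good action look bad; a lucky one makes a bad action look good. A baseline $b(s_t)$ can be subtracted: the update uses $G_t - b(s_t)$ instead. This is valid because $\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)] = 0$ - the proof uses the fact that $\sum_a \nabla_\theta \pi_\theta(a \mid s) = \nabla_\theta \sum_a \pi_\theta(a \mid s) = \nabla_\theta 1 = 0$. So the baseline doesn't affect the expected gradient, only its variance. The standard practical baseline is $V^\pi(s_t)$, giving the advantage estimate $G_t - V^\pi(s_t)$. The resulting variance reduction can be substantial in practice.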
Q3: Explain the bias-variance tradeoff between REINFORCE and actor-critic.
REINFORCE uses Monte Carlo returns $G_t$ - unbiased (the expected value is the true $Q^\pi(s_t, a_t)$) but high variance (depends on all future random events). Actor-critic uses the TD error $\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$ - biased (because $V_w$ is an approximation, not the true $V^\pi$) but lower variance (only one step of actual randomness). This is the fundamental bias-variance tradeoff. Generalized Advantage Estimation (GAE, Schulman et al. 2016) interpolates with parameter $\lambda \in [0, 1]$: $\lambda = 0$ gives 1-step TD (low variance, high bias); $\lambda = 1$ gives Monte Carlo (high variance, no bias). In practice, values around $\lambda = 0.95$ often work well - most of the variance reduction with minimal bias increase.
Q4: What does the entropy bonus do in actor-critic training?
The entropy bonus is added to the objective: $L = L_{\text{actor}} + c_v L_{\text{critic}} - c_H\, H(\pi_\theta)$. Maximizing entropy keeps the policy stochastic - it prevents premature convergence to a deterministic policy before the agent has sufficiently explored. Without entropy regularization, the policy often collapses to one action early in training (entropy collapse), then the gradient estimates become noisy (limited action diversity), and learning stagnates. With entropy regularization: the policy maintains meaningful exploration throughout training, gradient estimates have lower variance (more diverse trajectories), and the policy discovers multiple high-reward strategies. The coefficient $c_H$ (typically 0.01–0.05) controls the exploration-exploitation tradeoff. Maximum Entropy RL (SAC) makes entropy a first-class objective with automatic temperature tuning, achieving strong results on continuous control benchmarks.
Q5: How does actor-critic differ from REINFORCE, and why is it preferred in practice?
Three key differences. First, update timing: REINFORCE collects a full episode then updates; actor-critic updates after $n$ steps (or every step). Online updates are more sample efficient - you don't discard data by waiting for episode end; in long episodes (e.g., robot with 10K-step horizon), REINFORCE is impractical. Second, advantage estimation: REINFORCE uses exact Monte Carlo returns (unbiased but high variance); actor-critic bootstraps using $V_w(s_{t+1})$, giving lower-variance but biased advantage estimates. Third, architecture: actor-critic maintains two functions (the actor $\pi_\theta$ and critic $V_w$) that interact; REINFORCE has only the policy. The bias introduced by the critic is generally a good tradeoff - empirically, actor-critic methods converge in far fewer environment interactions. The main disadvantage: critic training adds complexity (an additional learning problem), and if the critic is poorly trained, the biased advantage estimates can hurt the actor. This is why critic architecture and loss weighting are important hyperparameters.
Q6: How does the policy gradient connect to RLHF for language models? What are the LLM-specific challenges?
In RLHF, the language model is the policy, the prompt is the state, and each token is an action. The policy gradient update is $\nabla_\theta J(\theta) = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\big[r(x, y)\, \nabla_\theta \log p_\theta(y \mid x)\big]$ where $r(x, y)$ is the reward model score. This reinforces responses receiving high reward. LLM-specific challenges: (1) Sparse reward - reward is only available for the full response (end of episode), not per token. This makes credit assignment hard: which tokens contributed to the high/low reward? (2) Enormous action space - 50K possible tokens per step, making exploration harder than discrete RL with a handful of actions. (3) Reward hacking - the model quickly finds responses that score high on the reward model without being genuinely helpful (e.g., verbose responses that confuse the reward model, or responses that flatter the user). PPO-RLHF addresses this with a KL penalty term: the actual reward used is $r(x, y) - \beta\, \mathrm{KL}\big(p_\theta(\cdot \mid x)\, \|\, p_{\text{ref}}(\cdot \mid x)\big)$, where $p_{\text{ref}}$ is the reference (pre-RLHF) model. This constrains how far the policy can drift. (4) Scale - the policy has billions of parameters; naive REINFORCE gradient estimates have far too much variance. PPO's ability to reuse data across multiple gradient steps makes it much more practical than REINFORCE at this scale.
