What is policy gradient?

Directly optimize policies with gradient ascent - REINFORCE derivation, the log-derivative trick, variance reduction with baselines, actor-critic, A2C/A3C, and entropy regularization. The foundation for PPO and RLHF.

Policy Gradient Methods

Reading time: ~50 min | Interview relevance: Very High | Target roles: ML Engineer, Research Engineer, AI Engineer

The Real Engineering Moment

The year is 2019. OpenAI is training a robot hand - Dactyl - to manipulate a Rubik's Cube in simulation. The action space is 24 continuous joint torques, each in $[-1, 1]$ . DQN cannot work here. You cannot have a Q-value output node for every possible continuous action - there are uncountably many. The argmax over a continuous set requires solving a separate optimization problem at every step, which is far too expensive inside a training loop.

The approach that works: parameterize the policy directly as a neural network $\pi_\theta(a|s)$ , compute the gradient of expected return with respect to $\theta$ , and ascend the gradient. The policy outputs a Gaussian distribution over joint torques. The network learns to squeeze and rotate, adapting force based on tactile feedback. No Q-function, no argmax over actions. Just gradient ascent on the policy parameters.

This is the policy gradient approach. It works for continuous and discrete actions, for stochastic and deterministic policies. And it is the direct ancestor of PPO, the algorithm behind RLHF pipelines such as InstructGPT, the precursor to ChatGPT. The REINFORCE algorithm from 1992 and the RLHF objective in 2022 share the same mathematical skeleton. Understanding one goes a long way toward understanding the other.

In this lesson we derive everything from first principles: where the policy gradient theorem comes from, why naive REINFORCE has high variance and how baselines fix it, how actor-critic methods improve sample efficiency, and how A2C/A3C scale to parallel environments.

Why Value-Based Methods Fail for Continuous Actions

Value-based methods (Q-learning, DQN) have a fundamental limitation: to act, you must solve $\arg\max_a Q(s, a)$ .

For discrete actions: This is trivial - evaluate $Q(s, a)$ for each of the $|A|$ actions, take the max. DQN handles this with one forward pass outputting $|A|$ values.

For continuous actions: $A = \mathbb{R}^d$ . You cannot enumerate all actions. To find $\arg\max_a Q_\theta(s, a)$ , you must run a separate optimization for every state encountered during training and inference. This is prohibitively expensive.

For stochastic optimal policies: In partially observable or adversarial settings, the optimal policy may be genuinely stochastic - a mixed strategy. Q-learning always produces a deterministic greedy policy. Rock-Paper-Scissors: the optimal strategy is uniform random. A deterministic policy is exploited by any opponent.

When policy structure is known: If you know the policy should be a Gaussian over joint torques, why learn the full Q-function (mapping every state-action pair to a value) as an intermediate step? Directly parameterize what you care about.

Policy gradient methods address all three limitations by optimizing the policy directly.
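To make the Rock-Paper-Scissors point above concrete, here is a minimal sketch that computes the worst-case expected payoff of a policy against an opponent who picks the most damaging reply. The payoff matrix encoding and the helper name `worst_case_payoff` are illustrative assumptions, not part of the original lesson.

```python
# Payoff to the row player; rows and columns are (rock, paper, scissors).
PAYOFF = [
    [0, -1, 1],    # rock: ties rock, loses to paper, beats scissors
    [1, 0, -1],    # paper
    [-1, 1, 0],    # scissors
]


def worst_case_payoff(policy):
    """Expected payoff when the opponent plays the reply that hurts us most."""
    expected_vs_each_reply = [
        sum(policy[a] * PAYOFF[a][b] for a in range(3)) for b in range(3)
    ]
    return min(expected_vs_each_reply)


deterministic_rock = [1.0, 0.0, 0.0]
uniform = [1 / 3, 1 / 3, 1 / 3]
```

Under these assumptions, `worst_case_payoff(deterministic_rock)` is -1.0 (the opponent always plays paper), while the uniform policy cannot be pushed below 0, up to floating-point rounding.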

Historical Context

1992: Ronald Williams publishes "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning" - the REINFORCE algorithm. The key insight: you can compute the gradient of expected reward without knowing the environment dynamics, using the log-derivative trick. This was a theoretical breakthrough but practically unusable due to high variance.

1999–2001: Richard Sutton et al. publish the policy gradient theorem, proving that the gradient can be computed using the Q-function as a weighting factor. This connects policy gradients to value-based methods and enables actor-critic hybrids.

2014–2016: Deep actor-critic methods emerge, combining neural network function approximation with policy gradients. The A3C paper (Mnih et al., 2016) shows asynchronous parallel actors can match DQN performance on Atari with less wall-clock time.

2015: Schulman et al. introduce TRPO (Trust Region Policy Optimization) - constrained policy gradient updates that avoid catastrophic performance collapse. This leads to PPO (2017), which simplifies TRPO's constraint to a clip objective.

2022: RLHF papers (InstructGPT, Constitutional AI) use PPO on top of large language models, directly applying the policy gradient framework to language generation. The score-function gradient behind REINFORCE (1992) now drives the fine-tuning of models with billions of parameters.

Policy Parameterization

Before deriving gradients, we need differentiable parameterized policies.

Softmax Policy (Discrete Actions)

For discrete action space $A = \{0, 1, \ldots, K-1\}$ :

$\pi_\theta(a | s) = \frac{\exp(h_\theta(s, a))}{\sum_{a'} \exp(h_\theta(s, a'))}$

where $h_\theta(s, a)$ are action preferences (logits) produced by a neural network.

Properties:

Always produces a valid probability distribution
Differentiable with respect to $\theta$
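As the gap between the largest logit and the others $\to \infty$ , the policy approaches deterministic; when all logits are equal (for example, all 0), it is exactly uniform
Exploration is built in: every action keeps non-zero probability, so no action is ever ruled out entirely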
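The softmax properties listed above can be checked numerically. The following is a small standard-library sketch; the function name `softmax` and the example logits are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of action preferences."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]   # subtract max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]


equal = softmax([0.0, 0.0, 0.0])     # all logits equal -> uniform policy
mild = softmax([1.0, 0.0, 0.0])      # small gap -> mild preference for action 0
sharp = softmax([50.0, 0.0, 0.0])    # large gap -> nearly deterministic
```

What matters is the gap between logits, not their absolute size: adding the same constant to every logit leaves the distribution unchanged.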

Gaussian Policy (Continuous Actions)

For continuous action space $A = \mathbb{R}^d$ :

$\pi_\theta(a | s) = \mathcal{N}(a \mid \mu_\theta(s), \sigma_\theta(s)^2 I)$

The network outputs a mean vector $\mu_\theta(s) \in \mathbb{R}^d$ and (optionally) a standard deviation $\sigma_\theta(s) \in \mathbb{R}^d_{>0}$ .

Reparameterization trick: To sample while keeping gradients flowing:

$a = \mu_\theta(s) + \sigma_\theta(s) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$

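The randomness is in $\epsilon$ , which doesn't depend on $\theta$ . Gradients flow through $\mu_\theta$ and $\sigma_\theta$ . This pathwise gradient is what methods such as SAC rely on. The score-function estimator derived in this lesson (REINFORCE, A2C) instead treats the sampled action as a constant, so when computing $\log \pi_\theta(a|s)$ for those methods, draw a plain sample or detach the action first.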
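A minimal one-dimensional sketch of the sampling formula, using only the standard library. It also illustrates a subtlety: if the action is kept as a function of $\mu$ , the log-density evaluated at that action no longer depends on $\mu$ , which is why score-function methods treat the sampled action as a constant. The helper names are hypothetical.

```python
import math
import random


def gaussian_log_prob(a, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at a."""
    return (-0.5 * ((a - mu) / sigma) ** 2
            - math.log(sigma)
            - 0.5 * math.log(2.0 * math.pi))


def reparameterized_sample(mu, sigma, eps):
    """a = mu + sigma * eps, with eps ~ N(0, 1) drawn independently of theta."""
    return mu + sigma * eps


rng = random.Random(0)
eps = rng.gauss(0.0, 1.0)

# Action left as a function of mu: log pi(a) is the same for every mu.
lp_mu0 = gaussian_log_prob(reparameterized_sample(0.0, 0.5, eps), 0.0, 0.5)
lp_mu3 = gaussian_log_prob(reparameterized_sample(3.0, 0.5, eps), 3.0, 0.5)

# Action held fixed (as in REINFORCE): log pi(a) does depend on mu.
a_fixed = reparameterized_sample(0.0, 0.5, eps)
lp_fixed_mu0 = gaussian_log_prob(a_fixed, 0.0, 0.5)
lp_fixed_mu3 = gaussian_log_prob(a_fixed, 3.0, 0.5)
```

In an autodiff framework the first case corresponds to differentiating `log_prob` through a reparameterized sample, where the gradient with respect to the mean cancels.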

import torch
import torch.nn as nn

class ContinuousPolicyNetwork(nn.Module):
    """Gaussian policy for continuous action spaces."""
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),         # Tanh preferred over ReLU for policy networks
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
        )
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        # log_std as a learnable parameter (not state-dependent)
        # More stable than state-dependent std in early training
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state: torch.Tensor):
        features = self.backbone(state)
        mean = self.mean_head(features)
        std = self.log_std.exp().expand_as(mean)
        return torch.distributions.Normal(mean, std)

    def act(self, state: torch.Tensor, deterministic: bool = False):
        dist = self.forward(state)
        if deterministic:
            action = dist.mean          # exploitation
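        else:
            # Plain (non-reparameterized) sample: the score-function gradient
            # treats the action as a constant. With rsample(), the gradient of
            # log_prob w.r.t. the mean cancels to zero.
            action = dist.sample()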
        log_prob = dist.log_prob(action).sum(-1)   # sum over action dims
        return action, log_prob

The Policy Gradient Theorem: Full Derivation

We want to maximize expected cumulative reward:

$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[G(\tau)] = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right]$

where $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T)$ is a trajectory sampled by running policy $\pi_\theta$ .

Step 1: Write $J$ as an integral

$J(\theta) = \int_\tau p_\theta(\tau) G(\tau) \, d\tau$

where the trajectory probability is:

$p_\theta(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t | s_t) \cdot P(s_{t+1} | s_t, a_t)$

Step 2: Take the gradient

$\nabla_\theta J(\theta) = \int_\tau \nabla_\theta p_\theta(\tau) G(\tau) \, d\tau$

Step 3: Apply the log-derivative trick

The log-derivative identity: for any differentiable, strictly positive $f$ :

$\nabla_\theta f(\theta) = f(\theta) \nabla_\theta \log f(\theta)$

Applying to $p_\theta(\tau)$ :

$\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)$

Step 4: Substitute and simplify

$\nabla_\theta J(\theta) = \int_\tau p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) G(\tau) \, d\tau = \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log p_\theta(\tau) \cdot G(\tau)\right]$

Step 5: Expand $\log p_\theta(\tau)$

$\log p_\theta(\tau) = \log p(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t | s_t) + \sum_{t=0}^{T-1} \log P(s_{t+1} | s_t, a_t)$

Taking the gradient w.r.t. $\theta$ : the first term $\log p(s_0)$ doesn't depend on $\theta$ (initial state distribution is fixed). The last terms $\log P(\cdot)$ don't depend on $\theta$ (environment dynamics are independent of policy parameters). Only the policy log-probabilities remain:

$\nabla_\theta \log p_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t | s_t)$

Final Result: Policy Gradient Theorem

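$\boxed{\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right]}$

where $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k$ is the return from time $t$ . Replacing $G(\tau)$ with $G_t$ uses causality: rewards earned before step $t$ cannot depend on $a_t$ , so their contribution has zero expectation (the same argument as the baseline proof later in this lesson). Strictly, the derivation leaves a factor $\gamma^t$ on each term; most implementations, including the code in this lesson, drop it.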

Why this is remarkable:

The gradient doesn't require knowing $P(s'|s,a)$ - the environment dynamics cancel out
It can be estimated by sampling trajectories and computing log-probabilities - both accessible
It works for any differentiable parameterization of $\pi_\theta$ , including neural networks
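As a sanity check on the result above, the sketch below compares the score-function gradient with a finite-difference gradient of $J$ on a one-step problem (a three-armed bandit) with a softmax policy. The bandit setup, the reward values, and all function names are assumptions made for illustration; the expectation is computed exactly by enumerating actions rather than by sampling.

```python
import math


def softmax(logits):
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(theta, rewards):
    """J(theta) = sum_a pi_theta(a) * r(a) for a one-step problem."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def score_function_gradient(theta, rewards):
    """E_a[grad log pi(a) * r(a)], with the expectation taken exactly."""
    pi = softmax(theta)
    k = len(theta)
    grad = [0.0] * k
    for a in range(k):
        for j in range(k):
            # d/d theta_j log softmax(theta)[a] = 1[a == j] - pi[j]
            score = (1.0 if a == j else 0.0) - pi[j]
            grad[j] += pi[a] * score * rewards[a]
    return grad


def finite_difference_gradient(theta, rewards, h=1e-6):
    """Central differences on J(theta), used only as an independent check."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += h
        down[j] -= h
        diff = expected_reward(up, rewards) - expected_reward(down, rewards)
        grad.append(diff / (2.0 * h))
    return grad


theta = [0.2, -0.1, 0.5]
rewards = [1.0, 0.0, 2.0]
```

The two gradients agree to within finite-difference error, even though `score_function_gradient` never differentiates the reward itself.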

REINFORCE: Monte Carlo Policy Gradient

REINFORCE (Williams, 1992) directly implements the policy gradient theorem using Monte Carlo trajectory sampling:

For each episode:
Sample trajectory τ = (s₀,a₀,r₀,...,s_T) by running π_θ
Compute returns G_t for each timestep t
Gradient estimate: ĝ = Σ_t ∇_θ log π_θ(a_t|s_t) · G_t
Update: θ ← θ + α · ĝ

Full PyTorch Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import gymnasium as gym
from torch.optim import Adam


class PolicyNetwork(nn.Module):
    """Discrete-action policy network (softmax output)."""
    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Distribution:
        logits = self.net(state)
        return torch.distributions.Categorical(logits=logits)

    def act(self, state: np.ndarray) -> tuple[int, torch.Tensor]:
        state_t = torch.FloatTensor(state).unsqueeze(0)
        dist = self.forward(state_t)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob


class REINFORCEAgent:
    """
    REINFORCE: Monte Carlo Policy Gradient.
    - Collects full episodes
    - Computes exact Monte Carlo returns G_t
    - High variance, unbiased gradient estimates
    """
    def __init__(
        self,
        state_dim: int,
        n_actions: int,
        lr: float = 1e-3,
        gamma: float = 0.99,
        normalize_returns: bool = True,
    ):
        self.gamma = gamma
        self.normalize_returns = normalize_returns
        self.policy = PolicyNetwork(state_dim, n_actions)
        self.optimizer = Adam(self.policy.parameters(), lr=lr)

    def compute_returns(self, rewards: list[float]) -> torch.Tensor:
        """
        Compute discounted returns G_t for each timestep.
        G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ...

        Computed backwards in a single O(T) pass (a naive double loop is O(T²)).
        """
        returns = []
        G = 0.0
        for r in reversed(rewards):
            G = r + self.gamma * G
            returns.append(G)          # O(1); insert(0, G) would cost O(T) each
        returns.reverse()              # restore chronological order
        returns = torch.tensor(returns, dtype=torch.float32)

        if self.normalize_returns and len(returns) > 1:
            # Subtract mean, divide by std.
            # This is NOT a formal baseline (depends on the whole episode)
            # but reduces variance in practice. Skipped for one-step
            # episodes, where the sample std is undefined (NaN).
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)

        return returns

    def update(
        self,
        log_probs: list[torch.Tensor],
        rewards: list[float],
    ) -> float:
        """
        REINFORCE gradient update.

        Loss = -E[Σ_t log π_θ(a_t|s_t) · G_t]
        (Negative because PyTorch minimizes, but we want to maximize J)

        The gradient of this loss w.r.t. θ equals -∇_θ J(θ).
        """
        returns = self.compute_returns(rewards)

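        # Stack log probs and compute weighted loss.
        # Each log_prob has shape (1,), so stack gives (T, 1); flatten to (T,)
        # so the product with returns is elementwise, not a (T, T) broadcast.
        log_probs_t = torch.stack(log_probs).flatten()     # (T,)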
        policy_loss = -(log_probs_t * returns).sum()

        self.optimizer.zero_grad()
        policy_loss.backward()
        # Gradient clipping prevents occasional large gradient steps
        torch.nn.utils.clip_grad_norm_(self.policy.parameters(), max_norm=1.0)
        self.optimizer.step()

        return policy_loss.item()


def train_reinforce(
    env_name: str = "CartPole-v1",
    n_episodes: int = 2000,
    lr: float = 1e-3,
    gamma: float = 0.99,
) -> tuple[REINFORCEAgent, list[float]]:
    """Train REINFORCE on a Gym environment."""
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    n_actions = env.action_space.n

    agent = REINFORCEAgent(state_dim, n_actions, lr=lr, gamma=gamma)
    episode_rewards = []

    for episode in range(n_episodes):
        state, _ = env.reset()
        log_probs, rewards = [], []
        done = False

        # Collect full episode
        while not done:
            action, log_prob = agent.policy.act(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            log_probs.append(log_prob)
            rewards.append(reward)
            state = next_state

        # Update after each episode
        loss = agent.update(log_probs, rewards)
        episode_rewards.append(sum(rewards))

        if (episode + 1) % 200 == 0:
            avg = np.mean(episode_rewards[-200:])
            print(f"Episode {episode+1:4d} | Avg reward: {avg:6.1f} | Loss: {loss:.4f}")

    env.close()
    return agent, episode_rewards

The Variance Problem: Why REINFORCE Is Slow

REINFORCE is theoretically correct but has extremely high variance in practice. Understanding why is essential for understanding why all subsequent methods (baseline, actor-critic, PPO) exist.

Source of variance: The return $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k$ depends on all future actions and environment transitions - every random decision and stochastic environment response after time $t$ . A single unlucky trajectory can make a good action look bad.

Concrete example:

Episode 1: Agent takes the optimal action at $t=5$ , but falls into a trap at $t=20$ (bad luck). $G_5 < 0$ . Gradient pushes action probability down - wrong direction.
Episode 2: Agent takes a suboptimal action at $t=5$ , but gets lucky later. $G_5 > 0$ . Gradient pushes action probability up - wrong direction.

Quantitative: The variance of the REINFORCE gradient estimator is bounded by its second moment:

$\text{Var}[\nabla_\theta \hat{J}] \le \mathbb{E}\left[\left(\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right)^2\right]$

Both the number of summed terms and the magnitude of $G_t$ grow with episode length, so this can be orders of magnitude larger than the squared true gradient (the signal). In practice, REINFORCE often requires thousands of episodes to converge on tasks where actor-critic converges in hundreds.

Baseline Subtraction: Variance Reduction Without Bias

Key theorem: You can subtract any function $b(s_t)$ of the state (a baseline) from $G_t$ without changing the expected gradient, as long as it does not depend on the action $a_t$ :

$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (G_t - b(s_t))\right]$

Proof That Baselines Are Unbiased

We need to show: $\mathbb{E}\left[\nabla_\theta \log \pi_\theta(a_t|s_t) \cdot b(s_t)\right] = 0$ .

$\mathbb{E}_{a_t \sim \pi_\theta(\cdot|s_t)}\left[\nabla_\theta \log \pi_\theta(a_t|s_t) \cdot b(s_t)\right]$

$= b(s_t) \sum_{a_t} \pi_\theta(a_t|s_t) \nabla_\theta \log \pi_\theta(a_t|s_t)$

$= b(s_t) \sum_{a_t} \pi_\theta(a_t|s_t) \frac{\nabla_\theta \pi_\theta(a_t|s_t)}{\pi_\theta(a_t|s_t)}$

$= b(s_t) \sum_{a_t} \nabla_\theta \pi_\theta(a_t|s_t)$

$= b(s_t) \cdot \nabla_\theta \sum_{a_t} \pi_\theta(a_t|s_t) = b(s_t) \cdot \nabla_\theta 1 = 0$

The baseline does not bias the gradient estimate - it is a free variance reduction tool.
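The proof can be checked on a toy problem. The sketch below computes the exact mean and variance of a single-sample gradient estimate for a two-armed bandit with a softmax policy, with and without a baseline. The bandit, its rewards, and the function name `estimator_stats` are illustrative assumptions.

```python
def estimator_stats(pi, rewards, baseline):
    """Exact mean and variance of g = d/d theta_0 log pi(a) * (r(a) - b).

    pi is a softmax policy and theta_0 is the logit of action 0,
    so d/d theta_0 log pi(a) = 1[a == 0] - pi[0].
    """
    outcomes = []
    for a, (p, r) in enumerate(zip(pi, rewards)):
        score = (1.0 if a == 0 else 0.0) - pi[0]
        outcomes.append((p, score * (r - baseline)))
    mean = sum(p * g for p, g in outcomes)
    second_moment = sum(p * g * g for p, g in outcomes)
    return mean, second_moment - mean ** 2


pi = [0.5, 0.5]
rewards = [10.0, 11.0]

mean_no_baseline, var_no_baseline = estimator_stats(pi, rewards, baseline=0.0)

state_value = sum(p * r for p, r in zip(pi, rewards))     # V = 10.5
mean_with_baseline, var_with_baseline = estimator_stats(
    pi, rewards, baseline=state_value
)
```

Both estimators have mean -0.25, but the variance drops from 27.5625 to 0.0 once the value baseline is subtracted. Reaching exactly zero is an artifact of this symmetric two-action example; in general a baseline reduces variance without eliminating it.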

Optimal Baseline

The variance of the gradient estimator with baseline $b$ is:

$\text{Var}[\hat{g}] = \mathbb{E}\left[(G_t - b)^2 (\nabla_\theta \log \pi)^2\right] - \mathbb{E}[\hat{g}]^2$

Setting the derivative with respect to $b$ to zero, $-2\,\mathbb{E}[(G_t - b)(\nabla_\theta \log \pi)^2] = 0$ , gives the optimal baseline:

$b^*(s) = \frac{\mathbb{E}[G_t (\nabla_\theta \log \pi)^2]}{\mathbb{E}[(\nabla_\theta \log \pi)^2]}$

In practice, the optimal baseline is hard to compute. The value function $V^\pi(s)$ is an excellent approximation - it removes the part of the return that doesn't depend on the specific action taken.

The Advantage Function

With $b(s) = V^\pi(s)$ :

$G_t - V^\pi(s_t) \approx A^\pi(s_t, a_t)$

The advantage function $A(s,a) = Q(s,a) - V(s)$ measures how much better action $a$ is than the policy's average. The update becomes:

$\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A_t$

$A > 0$ : action was better than average → increase its probability
$A < 0$ : action was worse than average → decrease its probability
Zero-mean by construction: $\mathbb{E}_{a \sim \pi}[A(s,a)] = 0$ for all $s$
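A short numeric illustration of the advantage definition and its zero-mean property. The policy probabilities and Q-values are made-up numbers, and `advantages_from_q` is a hypothetical helper.

```python
def advantages_from_q(pi, q_values):
    """Return (V, A) where V = sum_a pi(a) * Q(a) and A(a) = Q(a) - V."""
    value = sum(p * q for p, q in zip(pi, q_values))
    return value, [q - value for q in q_values]


pi = [0.2, 0.5, 0.3]
q_values = [1.0, 2.0, 4.0]

value, advantages = advantages_from_q(pi, q_values)

# Expected advantage under the policy; zero up to floating-point rounding.
mean_advantage = sum(p * a for p, a in zip(pi, advantages))
```

Here the value is approximately 2.4, so the third action has a positive advantage and the first two have negative advantages, even though every Q-value is positive. This is the sense in which the advantage re-centers the learning signal.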

Actor-Critic: Online Advantage Estimation

REINFORCE waits until the end of the episode to compute exact Monte Carlo returns. This is:

Slow: can't learn until the episode ends
High variance: returns are noisy due to all future randomness
Only valid for episodic tasks

Actor-Critic learns a value function (the critic) online to estimate advantages at each step - no need to wait for the episode to end.

Actor (policy):  π_θ(a|s) - decides what action to take
Critic (value):  V_w(s)   - estimates expected return from s

TD error as advantage estimate:

$\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$

This is a biased estimator of $A^\pi(s_t, a_t)$ - biased because $V_w$ is an approximation, but lower variance than Monte Carlo returns because it uses only one step of actual reward.

Bias-variance tradeoff in policy gradient estimation:

Estimator	Bias	Variance	Update granularity
$G_t$ (Monte Carlo)	None	Very high	Per episode
$\delta_t$ (1-step TD)	High if the critic is poor	Low	Per step
$n$-step return	Medium	Medium	Every $n$ steps
GAE ($\lambda$)	Adjustable via $\lambda$	Adjustable via $\lambda$	Per step
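The rows of the table can be related through one recursion. The sketch below assumes the usual definition of generalized advantage estimation, $A_t = \sum_{l \ge 0} (\gamma\lambda)^l \delta_{t+l}$ , which the lesson has not introduced at this point, and applies it to a single rollout segment with no episode termination. Function names and numbers are hypothetical.

```python
def td_errors(rewards, values, next_value, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) for one unterminated segment."""
    bootstrapped = list(values) + [next_value]
    return [
        rewards[t] + gamma * bootstrapped[t + 1] - bootstrapped[t]
        for t in range(len(rewards))
    ]


def gae_advantages(rewards, values, next_value, gamma, lam):
    """A_t = sum_l (gamma * lam)^l * delta_{t+l}, accumulated backwards."""
    deltas = td_errors(rewards, values, next_value, gamma)
    advantages = [0.0] * len(deltas)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3]      # critic estimates V(s_0), V(s_1), V(s_2)
next_value = 0.2              # V(s_3), used only for bootstrapping
gamma = 0.9
```

With `lam=0.0` the result equals the one-step TD errors (low variance, relies heavily on the critic). With `lam=1.0` the first entry equals the bootstrapped 3-step return minus `values[0]`, which approaches the Monte Carlo estimate as the segment grows.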

Full PyTorch Actor-Critic Implementation

class ActorCriticNetwork(nn.Module):
    """
    Shared backbone with separate actor and critic heads.

    Sharing backbone: fewer parameters, shared representations.
    Separate heads: actor and critic can specialize independently.
    Alternative: completely separate networks (more stable but slower).
    """
    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 256):
        super().__init__()
        # Shared feature extractor
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
        )
        # Actor head: produces action distribution
        self.actor_head = nn.Linear(hidden_dim, n_actions)
        # Critic head: produces scalar state value estimate
        self.critic_head = nn.Linear(hidden_dim, 1)

        # Initialize critic head to near-zero (avoids early value overestimates)
        nn.init.zeros_(self.critic_head.weight)
        nn.init.zeros_(self.critic_head.bias)

    def forward(self, state: torch.Tensor) -> tuple:
        features = self.backbone(state)
        logits = self.actor_head(features)
        value = self.critic_head(features).squeeze(-1)
        return torch.distributions.Categorical(logits=logits), value

    def get_action_and_value(
        self, state: torch.Tensor, action: torch.Tensor = None
    ) -> tuple:
        dist, value = self.forward(state)
        if action is None:
            action = dist.sample()
        log_prob = dist.log_prob(action)
        entropy = dist.entropy()
        return action, log_prob, entropy, value


class A2CAgent:
    """
    Advantage Actor-Critic (A2C) - synchronous version.

    Collects n_steps of experience, then updates both actor and critic.
    Loss = actor_loss + value_coeff * critic_loss - entropy_coeff * entropy

    A2C vs A3C:
    - A2C: synchronous - wait for all workers, average gradients, stable
    - A3C: asynchronous - workers update global network independently, fast
    - In practice with GPUs: A2C is usually preferred (GPU batching)
    """
    def __init__(
        self,
        state_dim: int,
        n_actions: int,
        lr: float = 7e-4,
        gamma: float = 0.99,
        value_coeff: float = 0.5,      # weight for critic loss
        entropy_coeff: float = 0.01,   # entropy bonus weight
        n_steps: int = 5,              # steps before update
        max_grad_norm: float = 0.5,    # gradient clipping
    ):
        self.gamma = gamma
        self.value_coeff = value_coeff
        self.entropy_coeff = entropy_coeff
        self.n_steps = n_steps
        self.max_grad_norm = max_grad_norm

        self.network = ActorCriticNetwork(state_dim, n_actions)
        self.optimizer = Adam(self.network.parameters(), lr=lr, eps=1e-5)

    def compute_returns_and_advantages(
        self,
        rewards: list[float],
        values: torch.Tensor,       # V(s_t) for t=0..n-1
        next_value: float,          # V(s_n) - bootstrap value
        dones: list[bool],
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Compute n-step returns and advantages.

        Return at step t: G_t = r_t + γ·r_{t+1} + ... + γ^{n-t-1}·r_{n-1} + γ^{n-t}·V(s_n)
        Advantage at step t: A_t = G_t - V(s_t)

        The done flag zeros out the bootstrap when an episode ends.
        """
        returns = []
        G = next_value

        for i in reversed(range(len(rewards))):
            # If done, don't bootstrap from next state
            G = rewards[i] + self.gamma * G * (1.0 - float(dones[i]))
            returns.insert(0, G)

        returns = torch.FloatTensor(returns)
        advantages = returns - values.detach()

        # Normalize advantages for stable training
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        return returns, advantages

    def update(
        self,
        states: torch.Tensor,           # (n_steps, state_dim)
        actions: torch.Tensor,          # (n_steps,)
        rewards: list[float],           # (n_steps,)
        next_state: torch.Tensor,       # (state_dim,)
        dones: list[bool],              # (n_steps,)
    ) -> dict[str, float]:
        """Full A2C update step."""
        # Get action distributions and values for collected states
        dists, values = self.network(states)

        # Bootstrap next-state value
        with torch.no_grad():
            _, next_value = self.network(next_state.unsqueeze(0))
            next_value = next_value.item()

        returns, advantages = self.compute_returns_and_advantages(
            rewards, values, next_value, dones
        )

        # ── Actor Loss ────────────────────────────────────────────────────────
        # -E[log π(a|s) · A(s,a)]
        log_probs = dists.log_prob(actions)
        actor_loss = -(log_probs * advantages).mean()

        # ── Critic Loss ───────────────────────────────────────────────────────
        # MSE between value predictions and the n-step return targets.
        # `returns` is built from Python floats, so it carries no gradient.
        critic_loss = F.mse_loss(values, returns)

        # ── Entropy Bonus ─────────────────────────────────────────────────────
        # Maximize policy entropy to encourage exploration
        # H(π) = -Σ_a π(a|s) log π(a|s)
        entropy = dists.entropy().mean()

        # ── Total Loss ────────────────────────────────────────────────────────
        total_loss = (
            actor_loss
            + self.value_coeff * critic_loss
            - self.entropy_coeff * entropy
        )

        self.optimizer.zero_grad()
        total_loss.backward()
        # Gradient clipping: prevents catastrophic updates
        torch.nn.utils.clip_grad_norm_(
            self.network.parameters(), max_norm=self.max_grad_norm
        )
        self.optimizer.step()

        return {
            "actor_loss": actor_loss.item(),
            "critic_loss": critic_loss.item(),
            "entropy": entropy.item(),
            "total_loss": total_loss.item(),
        }


def train_a2c(
    env_name: str = "CartPole-v1",
    n_envs: int = 4,             # parallel environments
    n_steps: int = 5,            # steps per update
    total_steps: int = 500_000,
) -> A2CAgent:
    """
    A2C training loop with parallel environments.
    Using gym.vector for vectorized environment interaction.
    """
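    # Vectorized environments (parallel rollouts).
    # SyncVectorEnv steps the copies in one process; gym.vector.make is
    # deprecated and missing from recent Gymnasium releases.
    envs = gym.vector.SyncVectorEnv(
        [lambda: gym.make(env_name) for _ in range(n_envs)]
    )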
    state_dim = envs.single_observation_space.shape[0]
    n_actions = envs.single_action_space.n

    agent = A2CAgent(state_dim, n_actions)

    obs, _ = envs.reset()
    episode_rewards = np.zeros(n_envs)
    all_rewards = []

    for step in range(0, total_steps, n_steps * n_envs):
        # Collect n_steps of experience from all envs
        states_batch = []
        actions_batch = []
        rewards_batch = [[] for _ in range(n_envs)]
        dones_batch = [[] for _ in range(n_envs)]

        for t in range(n_steps):
            obs_t = torch.FloatTensor(obs)
            states_batch.append(obs_t)

            with torch.no_grad():
                dists, _ = agent.network(obs_t)
                actions = dists.sample()

            actions_batch.append(actions)

            obs_next, rewards, terminated, truncated, infos = envs.step(
                actions.numpy()
            )
            dones = terminated | truncated

            for i in range(n_envs):
                rewards_batch[i].append(float(rewards[i]))
                dones_batch[i].append(bool(dones[i]))
                episode_rewards[i] += rewards[i]
                if dones[i]:
                    all_rewards.append(episode_rewards[i])
                    episode_rewards[i] = 0.0

            obs = obs_next

        # Update for each environment (or batch them together)
        for env_idx in range(n_envs):
            env_states = torch.stack([states_batch[t][env_idx]
                                      for t in range(n_steps)])
            env_actions = torch.stack([actions_batch[t][env_idx]
                                       for t in range(n_steps)])
            env_next_state = torch.FloatTensor(obs[env_idx])

            agent.update(
                states=env_states,
                actions=env_actions,
                rewards=rewards_batch[env_idx],
                next_state=env_next_state,
                dones=dones_batch[env_idx],
            )

        if len(all_rewards) > 0 and step % 10_000 == 0:
            avg = np.mean(all_rewards[-50:]) if len(all_rewards) >= 50 else np.mean(all_rewards)
            print(f"Step {step:6d} | Avg reward (last 50 eps): {avg:.1f}")

    envs.close()
    return agent

The Three Loss Terms: Deep Dive

Actor Loss: $-\mathbb{E}[\log\pi(a|s) \cdot A(s,a)]$

Gradient of this loss w.r.t. $\theta$ equals $-\nabla_\theta J(\theta)$ - minimizing the loss is gradient ascent on expected return. The sign of $A(s,a)$ determines the update direction:

$A > 0$ : Loss becomes more negative as $\log\pi(a|s)$ increases → the optimizer increases $\pi(a|s)$ . Good.
$A < 0$ : Loss becomes more positive as $\log\pi(a|s)$ increases → the optimizer decreases $\pi(a|s)$ . Good.

Critic Loss: $\mathbb{E}[(V_w(s) - G_t)^2]$

Standard mean-squared error regression. The target $G_t$ is the observed return, which is an unbiased sample of $V^\pi(s_t)$ , the value of the current policy (not the optimal value $V^*$ ). An $n$-step target that bootstraps from $V_w$ trades some bias for lower variance. The critic is being trained to predict returns accurately. A well-trained critic makes the advantage estimates more accurate, which reduces policy gradient variance.

Note: The target $G_t$ should be detached from the computation graph, so that it is treated as a fixed regression target and no gradient flows into the network through the bootstrapped value.

Entropy Bonus: $\mathbb{E}[H(\pi(\cdot|s))]$

The entropy of a categorical distribution over $K$ actions:

$H(\pi(\cdot|s)) = -\sum_{a=1}^{K} \pi(a|s) \log \pi(a|s)$

Properties:

Maximum entropy: $\log K$ (uniform distribution)
Minimum entropy: 0 (deterministic distribution)
Maximizing entropy keeps the policy from collapsing to a single action prematurely

Why entropy matters: Without it, the policy converges to a near-deterministic distribution early in training, before it has explored sufficiently. This is called entropy collapse - the policy becomes too confident. The entropy coefficient $\beta$ (typically 0.01) is a hyperparameter: too high and the policy stays random; too low and entropy collapse occurs.
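The entropy bounds above are easy to verify directly. This is a small standard-library sketch; `categorical_entropy` is a hypothetical helper, and it uses the convention that terms with zero probability contribute nothing.

```python
import math


def categorical_entropy(probs):
    """H(pi) = -sum_a pi(a) log pi(a), skipping zero-probability actions."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


uniform_entropy = categorical_entropy([0.25, 0.25, 0.25, 0.25])   # maximum: log 4
peaked_entropy = categorical_entropy([0.97, 0.01, 0.01, 0.01])    # close to collapse
collapsed_entropy = categorical_entropy([1.0, 0.0, 0.0, 0.0])     # minimum: 0
```

Logging this quantity during training is a cheap way to spot entropy collapse: a value that falls toward zero early on suggests the entropy coefficient is too small.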

A2C vs A3C: Synchronous vs Asynchronous

A3C (Asynchronous Advantage Actor-Critic, Mnih et al. 2016):

    Global Network (θ_global, w_global)
         ↑             ↑             ↑
    Worker 1       Worker 2       Worker 3
    (env copy)     (env copy)     (env copy)
    computes ∇θ    computes ∇θ    computes ∇θ
    updates global asynchronously

Workers operate in parallel, independently computing gradients and applying them to the shared global network without waiting for each other. Benefits:

Multiple actors generate diverse experience (different random seeds, exploration trajectories)
Asynchronous updates decorrelate training data (like a distributed replay buffer)
Efficient CPU utilization (each worker runs on one CPU core)

A2C (Synchronous version):

    Global Network (θ, w)
         ↓
    ┌────────────────────┐
    │ Worker 1 ... N     │  ← all workers run synchronously
    └────────────────────┘
         ↓
    Average gradients from all workers
         ↓
    Single synchronized update

Workers run in sync, gradients are averaged, one update per step. Benefits:

More deterministic training (reproducible results)
Better GPU utilization (larger effective batch size)
Slightly more stable due to gradient averaging

In practice (2024): A2C with vectorized environments (gym.vector) is standard. A3C's asynchronous advantage is less important with modern GPU training since the GPU naturally batches computation.

Entropy Regularization: Maximum Entropy RL

Entropy maximization can be elevated from a regularization trick to a core objective:

$J_{MaxEnt}(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \left(r_t + \alpha H(\pi_\theta(\cdot|s_t))\right)\right]$

where $H(\pi) = -\sum_a \pi(a) \log \pi(a)$ is the policy entropy and $\alpha > 0$ is the temperature parameter.
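To see what the temperature does, consider the one-step special case of this objective, $\mathbb{E}_\pi[r] + \alpha H(\pi)$ . In that setting (an assumption made here to keep the example small) the maximizer is the Boltzmann policy $\pi(a) \propto \exp(r(a)/\alpha)$ . The sketch below compares it with a greedy and a uniform policy; all names and reward values are illustrative.

```python
import math


def softmax(xs):
    largest = max(xs)
    exps = [math.exp(x - largest) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def maxent_objective(pi, rewards, alpha):
    """One-step version of J_MaxEnt: E_pi[r] + alpha * H(pi)."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    return expected_reward + alpha * entropy(pi)


rewards = [1.0, 0.9, 0.0]
alpha = 0.5

greedy = [1.0, 0.0, 0.0]
uniform = [1 / 3, 1 / 3, 1 / 3]
soft = softmax([r / alpha for r in rewards])   # Boltzmann policy at temperature alpha
```

The Boltzmann policy keeps substantial probability on the second action, whose reward is nearly as good as the best one, which is the multi-modality the objective is meant to preserve. As `alpha` shrinks toward zero, `soft` approaches the greedy policy.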

Why Maximum Entropy RL?

Exploration: the agent explores all high-reward strategies, not just one
Robustness: a stochastic policy is harder to exploit in adversarial settings
Multi-modality: the agent learns all ways to solve the task, not just one path
Transferability: maximum entropy policies transfer better to new tasks

Soft Actor-Critic (SAC) takes MaxEnt RL to its logical conclusion for continuous control. It can tune the temperature $\alpha$ automatically, and it reported state-of-the-art results on MuJoCo benchmarks when it was introduced (Haarnoja et al., 2018). The policy is explicitly regularized to have high entropy, resulting in natural exploration without $\epsilon$ -greedy.

Gaussian Policy for Continuous Actions: Full Implementation

import torch
import torch.nn as nn
import numpy as np
import gymnasium as gym
from torch.optim import Adam


class GaussianActorCritic(nn.Module):
    """Actor-Critic for continuous action spaces with Gaussian policy."""

    LOG_STD_MIN = -20
    LOG_STD_MAX = 2

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
        )
        # Actor: outputs mean and log_std of Gaussian
        self.mean_head = nn.Linear(hidden_dim, action_dim)
        self.log_std_head = nn.Linear(hidden_dim, action_dim)

        # Critic: outputs state value
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, state: torch.Tensor):
        features = self.backbone(state)

        # Actor output
        mean = self.mean_head(features)
        log_std = self.log_std_head(features).clamp(
            self.LOG_STD_MIN, self.LOG_STD_MAX
        )
        std = log_std.exp()
        dist = torch.distributions.Normal(mean, std)

        # Critic output
        value = self.value_head(features).squeeze(-1)

        return dist, value

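    @torch.no_grad()  # rollout only: the returned values carry no gradients
    def act(self, state: np.ndarray, deterministic: bool = False):
        """Choose an action for a single state during environment rollout."""
        state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        dist, value = self.forward(state_t)

        if deterministic:
            action = dist.mean
        else:
            # A plain sample is enough here: the score-function gradient
            # does not differentiate through the sample itself
            action = dist.sample()

        # Sum log_prob over action dimensions. This is taken BEFORE clipping,
        # so it matches the distribution the action was drawn from.
        log_prob = dist.log_prob(action).sum(-1)

        # Clip action to the valid range before sending it to the environment
        action = action.clamp(-1.0, 1.0)

        return (action.squeeze(0).numpy(),
                log_prob.item(),
                value.item())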
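To make the "sum log_prob over action dimensions" step concrete, here is a dependency-free sketch of the quantity that `dist.log_prob(action).sum(-1)` produces for a diagonal Gaussian. The function name `gaussian_log_prob` is hypothetical and the code works on plain Python lists rather than tensors.

```python
import math


def gaussian_log_prob(action, mean, log_std):
    """Log-density of a diagonal Gaussian, summed over action dimensions.

    Per dimension: log N(a; mu, sigma) = -0.5 * z^2 - log(sigma) - 0.5 * log(2*pi)
    with z = (a - mu) / sigma. Independent dimensions multiply, so their
    log-densities add.
    """
    total = 0.0
    for a, mu, ls in zip(action, mean, log_std):
        z = (a - mu) / math.exp(ls)
        total += -0.5 * z * z - ls - 0.5 * math.log(2.0 * math.pi)
    return total


# Two-dimensional action, unit variance in each dimension (log_std = 0).
at_mean = gaussian_log_prob([0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
off_mean = gaussian_log_prob([1.0, -1.0], [0.0, 0.0], [0.0, 0.0])
print(at_mean, off_mean)
```

Actions further from the mean get a lower log-probability, and shrinking `log_std` raises the density at the mean, which is why the clamp on `log_std` matters for numerical stability.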

Algorithm Comparison

	REINFORCE	Actor-Critic (A2C)	PPO
Bias	None	Some (critic approx.)	Some (clipping)
Variance	Very high	Low-medium	Low
Sample efficiency	Low	Medium	High
Data reuse	None (on-policy)	None (on-policy)	Multiple epochs
Stability	Low	Medium	High
Hyperparameters	Few (α, γ)	More (coeff's)	Many (clip ε, coeff's)
Best for	Theory demos	Sequential tasks	Production, RLHF

Connection to RLHF

The policy gradient framework maps directly to RLHF for language models:

RL Concept	RLHF Analog
Policy $\pi_\theta(a|s)$	Language model $p_\theta(y_t|y_{<t}, x)$
State $s_t$	Prompt $x$ + tokens generated so far $y_{<t}$
Action $a_t$	Next token $y_t$ (from vocabulary of ~50K)
Episode	Full response generation (until EOS)
Reward $R(\tau)$	Reward model score $r_\phi(x, y)$ at episode end
Critic $V_w(s)$	Value head on top of LLM backbone
KL penalty	$-\beta \log\left(p_\theta(y_t|y_{<t}, x) / p_{ref}(y_t|y_{<t}, x)\right)$ per token, with $p_{ref}$ the reference model

The REINFORCE-style update for RLHF would be:

$\theta \leftarrow \theta + \alpha \nabla_\theta \log p_\theta(y|x) \cdot (r_\phi(x, y) - b)$

Responses with high reward scores get reinforced; low-reward responses are suppressed. PPO-RLHF adds:

A clipping constraint (prevents large policy updates that destroy alignment)
A KL-from-reference penalty (prevents reward hacking by staying close to the base model)
A learned value baseline (reduces variance without bias)
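The REINFORCE-style update above can be tried on a deliberately tiny stand-in for a language model: a softmax over three candidate responses, with one logit per response playing the role of $\theta$. Everything here (the three-response setup, the function names, the reward and baseline values) is an assumption for illustration, not how a real RLHF pipeline is structured.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, sampled, reward, baseline, lr):
    """theta <- theta + lr * grad log p(sampled) * (reward - baseline).

    For a softmax policy, d log p(i) / d logit_j = 1[i == j] - p_j.
    """
    probs = softmax(logits)
    advantage = reward - baseline
    return [z + lr * ((1.0 if j == sampled else 0.0) - probs[j]) * advantage
            for j, z in enumerate(logits)]


logits = [0.0, 0.0, 0.0]  # three equally likely responses to start

# Response 1 was sampled and scored above the baseline: it is reinforced.
up = reinforce_step(logits, sampled=1, reward=1.0, baseline=0.2, lr=0.5)
# Response 1 was sampled and scored below the baseline: it is suppressed.
down = reinforce_step(logits, sampled=1, reward=0.0, baseline=0.2, lr=0.5)

print(softmax(up)[1], softmax(down)[1])
```

The sign of `reward - baseline` alone decides whether the sampled response becomes more or less likely, which is why the choice of baseline affects how noisy the updates are.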

Common Mistakes

:::danger Forgetting to Detach the Critic Target

When computing the critic loss F.mse_loss(values, returns), the returns should not have gradients flowing through them. If returns are computed using the critic's own outputs (bootstrapping), you must use .detach(): target = (r + gamma * next_value).detach(). Failing to detach causes the critic to chase its own tail - it optimizes against a moving target that it itself is changing, leading to instability or divergence.

:::

:::danger Wrong Sign on the Actor Loss

The policy gradient theorem gives $\nabla_\theta J = \mathbb{E}[\nabla_\theta \log\pi \cdot A]$. To maximize $J$, you ascend the gradient. In PyTorch, which minimizes, the actor loss must be $-\mathbb{E}[\log\pi(a|s) \cdot A(s,a)]$. A common bug: forgetting the negative sign. The gradient will then descend $J$, making the agent worse over time. Symptom: rewards decrease monotonically during training.

:::

:::warning Entropy Collapse in Early Training

If the entropy coefficient is too small or the learning rate too large, the policy can collapse to a near-deterministic distribution within the first few thousand steps. Once collapsed, the policy stops exploring: the sampled data covers only a few actions, so the gradient carries little information about the alternatives. Always monitor entropy in your training logs. As a rough guide, if entropy drops below ~0.5 nats in early training, increase entropy_coeff or reduce the learning rate. The right threshold depends on the action space, since a uniform policy over $n$ discrete actions has entropy $\log n$.

:::

:::warning Advantage Normalization Can Hurt in Small Batches

Normalizing advantages to zero mean and unit variance per batch is a common trick. But if the batch is very small (n_steps=5, single environment), the normalization constants are noisy estimates of batch statistics, introducing additional variance. For small batches, skip advantage normalization or use running statistics. For large batches (32+ environments × 128+ steps), normalization is beneficial.

:::
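To see what per-batch normalization does to a very small batch, here is a minimal sketch. The helper name `normalize_advantages`, the `eps` value, and the example numbers are assumptions chosen for illustration.

```python
import math


def normalize_advantages(advantages, eps=1e-8):
    """Shift to zero mean and scale to unit (population) standard deviation.
    eps guards against division by zero when all advantages are equal."""
    mean = sum(advantages) / len(advantages)
    var = sum((a - mean) ** 2 for a in advantages) / len(advantages)
    std = math.sqrt(var)
    return [(a - mean) / (std + eps) for a in advantages]


# A 5-step batch whose advantages barely differ from each other.
small_batch = [0.10, 0.20, 0.10, 0.20, 0.15]
normalized = normalize_advantages(small_batch)
print(normalized)

# A batch with identical advantages stays at zero instead of dividing by zero.
print(normalize_advantages([1.0, 1.0]))
```

Differences of 0.05 in the raw advantages become differences of roughly one standard deviation after normalization. With only five samples, that rescaling is driven largely by noise, which is the effect the warning above describes.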

YouTube Resources

Video	Creator	Why Watch
Policy Gradient Algorithms - Deep RL Bootcamp Lecture	Pieter Abbeel & John Schulman	The canonical lecture on policy gradients - covers REINFORCE, baselines, actor-critic, TRPO
Lecture 8: Policy Gradient - CS285 Deep RL	Sergey Levine (UC Berkeley)	Rigorous derivation with a focus on variance reduction and practical implementation
A Friendly Introduction to REINFORCE	Andrej Karpathy	Intuitive walkthrough of the log-derivative trick with minimal formalism
Deep RL Lecture 3 - Policy Gradients	David Silver (DeepMind)	Connects DP-based policy gradient theorem to sampled Monte Carlo estimation
Proximal Policy Optimization - Paper Explained	Yannic Kilcher	Excellent explanation of how PPO builds on actor-critic - essential RLHF prerequisite

Interview Q&A

Q1: Derive the policy gradient theorem. Why doesn't it require environment dynamics?

The goal is $\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_\tau[G(\tau)]$. Writing as an integral: $\nabla_\theta \int p_\theta(\tau) G(\tau) d\tau$. The trajectory probability is $p_\theta(\tau) = p(s_0) \prod_t [\pi_\theta(a_t|s_t) \cdot P(s_{t+1}|s_t,a_t)]$. Using the log-derivative trick - $\nabla_\theta p_\theta(\tau) = p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)$ - we get $\mathbb{E}_\tau[\nabla_\theta \log p_\theta(\tau) \cdot G(\tau)]$. Taking the log: $\log p_\theta(\tau) = \log p(s_0) + \sum_t \log \pi_\theta(a_t|s_t) + \sum_t \log P(s_{t+1}|s_t,a_t)$. Differentiating w.r.t. $\theta$: the $\log p(s_0)$ term is zero (initial state distribution is fixed). The $\log P(\cdot)$ terms are zero (environment dynamics don't depend on $\theta$). Only $\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)$ remains, so $\nabla_\theta J(\theta) = \mathbb{E}_\tau\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G(\tau)\right]$. You only need to be able to evaluate $\log \pi_\theta(a|s)$ - never $P(s'|s,a)$. This is why policy gradient methods are model-free.
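The result can be checked numerically in the simplest possible setting: a one-step problem (a bandit) with a softmax policy, where the exact gradient is available in closed form. The sketch below is an illustration under those assumptions; the function names, the fixed rewards, and the sample count are all made up. The estimator only ever uses sampled actions, their rewards, and $\nabla_\theta \log \pi_\theta$.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits, rewards):
    """dJ/dlogit_j for J = sum_a pi(a) * R(a), which equals pi_j * (R_j - J)."""
    p = softmax(logits)
    j_value = sum(pi * r for pi, r in zip(p, rewards))
    return [pi * (r - j_value) for pi, r in zip(p, rewards)]


def score_function_estimate(logits, rewards, n_samples, seed=0):
    """Monte Carlo average of grad log pi(a) * R(a) over sampled actions."""
    rng = random.Random(seed)
    p = softmax(logits)
    actions = range(len(p))
    grad = [0.0] * len(p)
    for _ in range(n_samples):
        a = rng.choices(actions, weights=p)[0]
        for j in actions:
            # d log pi(a) / d logit_j = 1[a == j] - pi_j
            grad[j] += ((1.0 if a == j else 0.0) - p[j]) * rewards[a]
    return [g / n_samples for g in grad]


logits, rewards = [0.0, 0.0], [1.0, 0.0]
exact = exact_gradient(logits, rewards)
estimate = score_function_estimate(logits, rewards, n_samples=20000)
print(exact, estimate)
```

The two vectors agree up to sampling noise, even though the estimator never looks at how rewards are generated.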

Q2: What is the variance problem in REINFORCE and how do baselines address it?

REINFORCE estimates the gradient as $\sum_t \nabla_\theta \log\pi_\theta(a_t|s_t) \cdot G_t$ where $G_t$ is the Monte Carlo return. $G_t$ has high variance because it is a sum of all future random rewards and environment responses. An unlucky trajectory makes a good action look bad; a lucky one makes a bad action look good. A baseline $b(s_t)$ can be subtracted: the update uses $(G_t - b(s_t))$ instead. This is valid because $\mathbb{E}[\nabla_\theta \log\pi_\theta(a|s) \cdot b(s)] = 0$ - the proof uses the fact that $\sum_a \nabla_\theta \pi_\theta(a|s) = \nabla_\theta \sum_a \pi_\theta(a|s) = \nabla_\theta 1 = 0$. So the baseline doesn't affect the expected gradient, only its variance. The standard choice is $V^\pi(s_t)$, which is close to (though not exactly) the variance-minimizing baseline and gives the advantage estimate $A(s,a) = G_t - V^\pi(s_t)$. The variance reduction is often large in practice, but its size depends on the task and on how accurate the baseline is.
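Both claims (zero expected contribution, lower variance) can be verified exactly on a small softmax bandit, with no sampling involved. The sketch below is illustrative only; the function names and the reward values are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def baseline_term_expectation(logits, b):
    """E_a[grad log pi(a) * b], one entry per logit. Should be zero."""
    p = softmax(logits)
    n = len(p)
    return [sum(p[a] * ((1.0 if a == j else 0.0) - p[j]) * b for a in range(n))
            for j in range(n)]


def estimator_variance(logits, rewards, b):
    """Exact variance of grad log pi(a) * (R(a) - b), summed over logits."""
    p = softmax(logits)
    n = len(p)
    total = 0.0
    for j in range(n):
        vals = [((1.0 if a == j else 0.0) - p[j]) * (rewards[a] - b)
                for a in range(n)]
        mean = sum(p[a] * vals[a] for a in range(n))
        total += sum(p[a] * (vals[a] - mean) ** 2 for a in range(n))
    return total


logits, rewards = [0.0, 0.0], [10.0, 11.0]
expected_reward = sum(p * r for p, r in zip(softmax(logits), rewards))

print(baseline_term_expectation([0.3, -1.2, 0.7], b=5.0))
print(estimator_variance(logits, rewards, b=0.0))
print(estimator_variance(logits, rewards, b=expected_reward))
```

Both rewards are large and close together, so without a baseline the estimator is dominated by the shared offset of about 10. Subtracting the expected reward removes that offset and leaves only the part that distinguishes the two actions.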

Q3: Explain the bias-variance tradeoff between REINFORCE and actor-critic.

REINFORCE uses Monte Carlo returns $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ - unbiased (the expected value is the true $Q^\pi(s_t,a_t)$ ) but high variance (depends on all future random events). Actor-critic uses the TD error $\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$ - biased (because $V_w$ is an approximation, not the true $V^\pi$ ) but lower variance (only one step of actual randomness). This is the fundamental bias-variance tradeoff. Generalized Advantage Estimation (GAE, Schulman et al. 2016) interpolates with parameter $\lambda \in [0,1]$ : $\lambda = 0$ gives 1-step TD (low variance, high bias); $\lambda = 1$ gives Monte Carlo (high variance, no bias). In practice, $\lambda = 0.95$ often works well - most of the variance reduction with minimal bias increase.
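The interpolation that GAE performs is easiest to see in code. The following is a minimal sketch for one rollout segment with no episode boundaries inside it; the function name `gae` and the argument `last_value` (the critic's estimate for the state after the final step, or 0 if the episode ended there) are naming assumptions.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for a single segment.

    A_t = sum_l (gamma * lam)^l * delta_{t+l}
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    Computed backwards so each step reuses the running sum from t + 1.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.4, 0.3]   # critic estimates V(s_0), V(s_1), V(s_2)

td = gae(rewards, values, last_value=0.0, gamma=1.0, lam=0.0)
mc = gae(rewards, values, last_value=0.0, gamma=1.0, lam=1.0)
print(td)  # one-step TD errors
print(mc)  # Monte Carlo return minus V(s_t)
```

With `lam=0.0` each advantage is just the TD error, which relies entirely on the critic. With `lam=1.0` the critic values telescope away and each advantage is the observed return minus $V_w(s_t)$.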

Q4: What does the entropy bonus do in actor-critic training?

The entropy bonus $H(\pi(\cdot|s)) = -\sum_a \pi(a|s)\log\pi(a|s)$ is added to the objective: $J_{ent} = J + \beta \cdot \mathbb{E}[H(\pi)]$. Maximizing entropy keeps the policy stochastic - it prevents premature convergence to a deterministic policy before the agent has sufficiently explored. Without entropy regularization, the policy often collapses to one action early in training (entropy collapse); the sampled data then covers very few actions, the gradient says little about the alternatives, and learning stagnates. With entropy regularization, the policy keeps exploring throughout training and can discover multiple high-reward strategies. The coefficient $\beta$ (typically 0.01–0.05) controls the exploration-exploitation tradeoff. Maximum Entropy RL (SAC) makes entropy a first-class objective with automatic temperature tuning, and has reported strong results on continuous control benchmarks.

Q5: How does actor-critic differ from REINFORCE, and why is it preferred in practice?

Three key differences. First, update timing: REINFORCE collects a full episode then updates; actor-critic updates after $n$ steps (or every step). Online updates are more sample efficient - you don't discard data by waiting for episode end; in long episodes (e.g., robot with 10K-step horizon), REINFORCE is impractical. Second, advantage estimation: REINFORCE uses exact Monte Carlo returns (unbiased but high variance); actor-critic bootstraps using $V_w$ , giving lower-variance but biased advantage estimates. Third, architecture: actor-critic maintains two functions (the actor $\pi_\theta$ and critic $V_w$ ) that interact; REINFORCE has only the policy. The bias introduced by the critic is generally a good tradeoff - empirically, actor-critic methods converge in far fewer environment interactions. The main disadvantage: critic training adds complexity (an additional learning problem), and if the critic is poorly trained, the biased advantage estimates can hurt the actor. This is why critic architecture and loss weighting are important hyperparameters.

Q6: How does the policy gradient connect to RLHF for language models? What are the LLM-specific challenges?

In RLHF, the language model $p_\theta(y|x)$ is the policy, the prompt $x$ is the state, and each token $y_t$ is an action. The policy gradient update is $\nabla_\theta \log p_\theta(y|x) \cdot (r(x,y) - V_w(x))$ where $r$ is the reward model score. This reinforces responses receiving high reward. LLM-specific challenges: (1) Sparse reward - reward is only available for the full response (end of episode), not per token. This makes credit assignment hard: which tokens contributed to the high/low reward? (2) Enormous action space - 50K possible tokens per step, making exploration harder than discrete RL with a handful of actions. (3) Reward hacking - the model quickly finds responses that score high on the reward model without being genuinely helpful (e.g., verbose responses that confuse the reward model, or responses that flatter the user). PPO-RLHF addresses this with a KL penalty term: the actual reward used is $r(x,y) - \beta \cdot \text{KL}(p_\theta || p_{ref})$ , where $p_{ref}$ is the reference (pre-RLHF) model. This constrains how far the policy can drift. (4) Scale - the policy has billions of parameters; naive REINFORCE gradient estimates have far too much variance. PPO's ability to reuse data across multiple gradient steps makes it much more practical than REINFORCE at this scale.
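The KL-penalized reward from point (3) can be sketched as follows. This is an illustration, not a description of any particular RLHF library: the function name is hypothetical, the per-token log-probabilities are made-up numbers, and the KL term is approximated by the usual single-sample estimate on the generated response.

```python
def kl_shaped_reward(reward_score, logp_policy, logp_ref, beta):
    """r(x, y) - beta * sum_t [log p_theta(y_t | ...) - log p_ref(y_t | ...)].

    logp_policy[t] and logp_ref[t] are the log-probabilities that the current
    policy and the frozen reference model assign to the t-th generated token.
    Their summed difference is a one-sample estimate of KL(p_theta || p_ref).
    """
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return reward_score - beta * kl_estimate


# A two-token response that the policy now rates as more likely than the
# reference model does, i.e. the policy has drifted towards it.
drifted = kl_shaped_reward(
    reward_score=2.0,
    logp_policy=[-1.0, -0.5],
    logp_ref=[-1.5, -1.0],
    beta=0.1,
)
# Same reward-model score, but no drift from the reference model.
unchanged = kl_shaped_reward(2.0, [-1.0, -0.5], [-1.0, -0.5], beta=0.1)
print(drifted, unchanged)
```

The response that the policy has drifted towards receives a smaller effective reward, so a high reward-model score has to outweigh the cost of moving away from the reference model.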
