
RL in Production - Where Theory Meets Reality

Reading time: ~40 minutes | Level: Reinforcement Learning | Role: ML Engineer, AI Engineer, MLOps Engineer


The Real Engineering Moment

The year is 2016 and a team at DeepMind has a bold proposal: hand control of Google's data center cooling systems to a reinforcement learning agent. The potential upside is enormous - cooling consumes around 40% of a data center's total energy budget, and even modest gains translate to millions of dollars and tons of CO2. The engineering team has trained a deep RL agent on five years of historical sensor data: inlet temperatures, pump speeds, coolant flow rates, power consumption readings from thousands of sensors distributed across dozens of server halls.

The problem is immediately obvious to anyone who has deployed a learning system in a physical environment. The exploration strategy that makes RL work in simulation - trying random actions to discover their consequences - would be catastrophic here. A cooling pump running at the wrong speed, a chiller set to the wrong temperature set point, a cooling tower turned off at the wrong moment: any of these could overheat production hardware worth hundreds of millions of dollars. You do not get to try a bad action in a live data center and observe what happens. "Sample efficiency" is not an abstract research concern. It is a physical safety constraint.

The team's solution defines what production RL actually looks like. They trained entirely offline - never touching the live system during training, learning only from historical logs. They layered explicit safety constraints into the policy: actions outside defined operational envelopes were blocked at the system level before reaching any physical actuator. They ran in shadow mode for weeks - the agent computed recommended actions, humans compared them to what operators would have done, but the agent's commands went nowhere. When the gap between agent recommendations and human decisions became small and consistently sensible, they introduced a narrow deployment: the agent controlled one cooling loop, with a human monitoring dashboard and a one-button manual override within arm's reach.

For the first 24 hours of live control, engineers stood by with their hands on the override. Nothing broke. The agent made small, conservative adjustments - nudging set points by fractions of degrees, smoothing pump ramp rates, anticipating thermal load changes before they propagated. After three months, the system was fully autonomous. The measured result: a 40% reduction in cooling energy, which corresponds to roughly a 15% reduction in overall PUE overhead. The finding became a landmark paper (Lazic et al., 2018) and a blueprint for industrial RL deployment.

This is what production RL looks like. Not the textbook version where your agent explores a grid world, but an engineering discipline where you assume your agent will try something dangerous if given the chance, design every component to prevent that, and deploy incrementally through shadow mode, limited deployment, and full autonomy - in that order. Every design decision in this lesson derives from that discipline.


Historical Context

The milestones below trace how production RL evolved from simulation-only research to live industrial systems.

| Year | System | Org | Contribution |
|------|--------|-----|--------------|
| 2013 | DQN on Atari | DeepMind | First deep RL success - but pure simulation, unlimited exploration |
| 2016 | Data center cooling | DeepMind / Google | First large-scale offline RL in critical infrastructure |
| 2017 | AlphaGo Zero | DeepMind | Self-play RL - closed environment, no external cost to exploration |
| 2017 | CPO / SafeRL | UC Berkeley | Constrained policy optimization for safety-critical deployment |
| 2019 | YouTube recommendation RL | Google | REINFORCE for long-term user satisfaction; delayed reward in production |
| 2020 | CQL (Conservative Q-Learning) | UC Berkeley | Principled offline RL with extrapolation error bounds |
| 2020 | AlphaFold 2 refinement | DeepMind | RL-style iterative self-evaluation in protein structure prediction |
| 2021 | IQL (Implicit Q-Learning) | UC Berkeley | Offline RL without querying OOD actions - more stable |
| 2022 | Waymo self-driving RL | Waymo | Model-based RL trained in simulation, deployed in the real world |
| 2023 | Robot manipulation (RT-2) | Google | Vision-language-action model fine-tuned on robot demonstration data |

Why Production RL Is Hard

The gap between research RL and production RL is wider than the gap between research and production for almost any other ML approach. Here is a systematic account of the challenges.

Reward delay. In CartPole, reward arrives every timestep. In a real recommendation system, the reward signal is user satisfaction - measured by subscription renewal, not skip rate. That signal arrives days or months after the decision. Credit assignment across a delayed horizon is genuinely hard. The agent cannot know which recommendation it made three weeks ago caused the user to cancel.

Non-stationarity. Your RL policy changes the environment it operates in. A recommendation algorithm that learns to push certain content changes user tastes. A pricing agent that learns to raise prices in low-supply conditions changes competitor behavior. The environment shifts underneath the policy, making old value estimates incorrect. This is unlike supervised learning, where the data distribution is fixed. In RL, the policy is part of the data distribution.

Exploration risk. Every RL algorithm requires exploration - trying actions it is uncertain about to improve its value estimates. In CartPole, exploration means the pole occasionally falls. In a production system, exploration means showing a user a worse recommendation, bidding too high in an auction, or setting a price that drives customers to a competitor. The cost of a bad exploratory action is real and immediate.

Distributional shift. As your policy improves, the data distribution it generates diverges from the data distribution the value function was trained on. Q-values estimated from early behavioral policy data become inaccurate under the improved policy. This leads to overestimation of Q-values for actions that look good in the value function but were rarely tried in practice.

Sample efficiency. Academic RL algorithms routinely require tens of millions of environment interactions to converge. Your production environment may allow thousands of interactions per day at most. Model-free RL at academic scale is simply not possible in many real systems.

These five constraints define the engineering problem space of production RL. Every technique in this lesson is a direct response to one or more of them.


The Production RL Decision Tree

Before choosing an RL approach, work through these questions in order:


Offline RL (Batch RL) - Learning Without Touching the Environment

Offline RL is arguably the most important innovation for production deployment. The core idea: learn a good policy from a fixed historical dataset $\mathcal{D}$ without any online interaction.

$$
\mathcal{D} = \{(s_t, a_t, r_t, s_{t+1})\}_{t=1}^N
$$

The dataset was collected by some behavioral policy $\pi_b$ - the previous system, human operators, a heuristic rule engine. You train a new policy $\pi$ to be better than $\pi_b$ without ever trying $\pi$ in the live system.

Why this is hard. Standard Q-learning fails catastrophically in this setting. The Bellman backup needs $Q(s, a)$ for actions that never appear in $\mathcal{D}$. Because the agent cannot try those actions and observe their outcomes, nothing constrains its estimates for them, and the max in the backup picks out whichever estimates happen to be erroneously optimistic. The agent then exploits those fabricated estimates, selecting the out-of-distribution actions most aggressively. The resulting policy fails on deployment because it behaves in ways the dataset never covered. This is the extrapolation error problem, and it is the defining challenge of offline RL.

Conservative Q-Learning (CQL)

Kumar et al. (2020) address extrapolation error by adding a conservative penalty to the Q-learning objective. The penalty pushes down Q-values for actions not in the dataset, while pushing up Q-values for actions that appear in $\mathcal{D}$.

$$
\mathcal{L}_{CQL}(Q) = \alpha\,\underbrace{\left(\mathbb{E}_{s \sim \mathcal{D}}\!\left[\log\sum_a \exp(Q(s,a))\right] - \mathbb{E}_{(s,a)\sim\mathcal{D}}[Q(s,a)]\right)}_{\text{conservative penalty}} + \frac{1}{2}\mathcal{L}_{TD}(Q)
$$

The first term - the log-sum-exp minus the in-distribution expectation, weighted by the penalty coefficient $\alpha$ - is a soft maximum over all actions minus the value at the actual data actions. Minimizing this pushes down Q-values for high-Q out-of-distribution actions and pushes up Q-values for actions in $\mathcal{D}$. The second term $\frac{1}{2}\mathcal{L}_{TD}(Q)$ is the standard temporal difference loss that keeps the Q-function Bellman-consistent.
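To see why minimizing the penalty has this effect, it may help to write out its gradient for a single state $s$ with a finite action set. The derivation below assumes, purely for illustration, that the dataset contains exactly one action $a_{\mathcal{D}}$ at that state:

```latex
\frac{\partial}{\partial Q(s,a)}
\left[\log\sum_{a'}\exp Q(s,a') - Q(s,a_{\mathcal{D}})\right]
= \frac{\exp Q(s,a)}{\sum_{a'}\exp Q(s,a')} - \mathbf{1}[a = a_{\mathcal{D}}]
```

A gradient descent step therefore lowers every $Q(s,a)$ in proportion to its softmax weight, so the largest Q-values fall fastest, and raises the Q-value of the logged action.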

The result: CQL learns Q-values that are conservative for out-of-distribution actions (lower bounds on the true values, in expectation) while staying reasonably accurate for in-distribution actions. The greedy policy over these conservative Q-values is much less likely to chase phantom rewards, which is what makes it a candidate for deployment.

Implicit Q-Learning (IQL)

IQL (Kostrikov et al., 2021) takes a different approach: avoid querying out-of-distribution actions altogether. Instead of computing $\max_a Q(s', a)$ during the Bellman backup - which requires evaluating Q at potentially unseen actions - IQL uses expectile regression to fit a value function that implicitly represents the value of the best in-distribution action.

$$
\mathcal{L}_V(V) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\!\left[\mathcal{L}_\tau^{\text{expectile}}\!\left(Q(s,a) - V(s)\right)\right]
$$

where $\mathcal{L}_\tau^{\text{expectile}}(u) = |\tau - \mathbf{1}[u < 0]|\, u^2$. With $\tau$ close to 1, this fits the value function to an upper expectile of the Q-values in the data - approximating the value of the best action in $\mathcal{D}$ without evaluating Q at unseen actions. IQL is often preferred in practice because it is stable, avoids CQL's penalty coefficient tuning, and reported state-of-the-art results on several D4RL benchmarks at publication.
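A small numerical sketch may make the role of $\tau$ concrete. It fits a single scalar $V$ to a handful of made-up Q-values at one state by gradient descent on the expectile loss; the function names and the Q-values are illustrative assumptions, and real IQL fits a neural network $V(s)$ over minibatches instead.

```python
def expectile_loss(u: float, tau: float) -> float:
    """Asymmetric squared loss |tau - 1[u < 0]| * u**2 for one residual u = Q - V."""
    weight = (1.0 - tau) if u < 0 else tau
    return weight * u * u


def fit_expectile(q_values, tau, lr=0.05, steps=2000):
    """Fit a scalar V that minimizes the mean expectile loss over q_values."""
    v = sum(q_values) / len(q_values)  # start from the mean
    for _ in range(steps):
        grad = 0.0
        for q in q_values:
            u = q - v
            weight = (1.0 - tau) if u < 0 else tau
            grad += -2.0 * weight * u  # d/dv of weight * (q - v)**2
        v -= lr * grad / len(q_values)
    return v


# Hypothetical Q-values of the logged actions at a single state.
q_in_data = [0.0, 1.0, 2.0, 10.0]
v_mean = fit_expectile(q_in_data, tau=0.5)   # tau = 0.5 recovers the plain mean
v_high = fit_expectile(q_in_data, tau=0.95)  # tau near 1 moves toward the best action
print(f"tau=0.50 -> V = {v_mean:.3f}")
print(f"tau=0.95 -> V = {v_high:.3f}")
```

With $\tau = 0.5$ the loss is an ordinary squared error, so $V$ lands on the mean; raising $\tau$ penalizes underestimates more than overestimates and pulls $V$ toward the largest in-data Q-value without ever exceeding it.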

Offline to Online: Fine-Tuning After Offline Pre-Training

A powerful two-stage approach: train offline on historical data to get a reasonable policy, then fine-tune online with limited live interactions. This combines the sample efficiency of offline RL (no wasted exploratory interactions) with the ability to improve beyond the behavioral policy.

The challenge: the offline-trained policy may be overly conservative. Its Q-values were deliberately suppressed for OOD actions (by CQL). When you switch to online interaction, you need to gradually relax this conservatism to allow productive exploration without reverting to unsafe behavior.

Practical implementation:

  1. Train CQL offline until convergence - extract a policy that is ideally at least as good as $\pi_b$
  2. In the online phase, initialize from this policy and add a small amount of exploration ($\varepsilon$-greedy with $\varepsilon = 0.01$, or Thompson Sampling)
  3. Gradually decrease the CQL penalty coefficient $\alpha$ over online training - loosening conservatism as you gather more real-world data
  4. Monitor constraint satisfaction at every step - if any hard constraint is violated, pause online training and investigate before continuing
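Steps 2 to 4 above might be wired together roughly as follows. The linear schedule, the end value `alpha_end=0.1`, and both function names are illustrative assumptions; CQL itself does not prescribe how $\alpha$ should be annealed.

```python
def cql_alpha_schedule(step, total_steps, alpha_start=1.0, alpha_end=0.1):
    """Linearly anneal the CQL penalty weight from alpha_start to alpha_end."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return alpha_start + frac * (alpha_end - alpha_start)


def online_finetune_settings(step, total_steps, constraint_violated):
    """Settings for one online step, or None to signal 'pause and investigate'."""
    if constraint_violated:
        return None  # step 4: stop online training on any hard-constraint violation
    return {
        "alpha": cql_alpha_schedule(step, total_steps),  # step 3: loosen conservatism
        "epsilon": 0.01,                                 # step 2: small exploration rate
    }


print(online_finetune_settings(step=0, total_steps=100, constraint_violated=False))
print(online_finetune_settings(step=50, total_steps=100, constraint_violated=False))
print(online_finetune_settings(step=60, total_steps=100, constraint_violated=True))
```

Returning `None` rather than raising keeps the decision to halt in the caller's hands, which is one reasonable design when a human has to sign off before training resumes.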

When to Use Offline RL

Use offline RL when you have a rich historical dataset from a prior system or human operators, and online exploration is expensive, dangerous, or ethically problematic (medical treatment, industrial control, financial trading). Do not expect offline RL to extrapolate far beyond the behavioral policy - it finds the best policy within the support of your data, not beyond it.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class CQLAgent:
    """
    Conservative Q-Learning for offline RL.
    Kumar et al. (2020) - NeurIPS 2020.
    """

    def __init__(self, state_dim, action_dim, hidden_dim=256,
                 alpha=1.0, gamma=0.99, lr=3e-4):
        self.alpha = alpha  # CQL conservative penalty weight
        self.gamma = gamma

        # Double Q-networks to reduce overestimation
        self.q1 = self._build_q(state_dim, action_dim, hidden_dim)
        self.q2 = self._build_q(state_dim, action_dim, hidden_dim)
        self.q1_target = copy.deepcopy(self.q1)
        self.q2_target = copy.deepcopy(self.q2)

        self.q_optim = torch.optim.Adam(
            list(self.q1.parameters()) + list(self.q2.parameters()), lr=lr
        )

    def _build_q(self, state_dim, action_dim, hidden_dim):
        return nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

    def cql_loss(self, states, actions, rewards, next_states, dones, num_random=10):
        """
        Full CQL loss = TD loss + alpha * conservative penalty.
        The conservative penalty = logsumexp(Q over random actions) - Q(data actions).
        """
        batch_size = states.shape[0]
        # Reshape to (batch, 1) so the targets line up with the Q-network outputs
        # instead of silently broadcasting to (batch, batch).
        rewards = rewards.reshape(-1, 1).float()
        dones = dones.reshape(-1, 1).float()
        sa = torch.cat([states, actions], dim=-1)

        # --- Bellman TD loss ---
        with torch.no_grad():
            # Simplification: the logged action a_t stands in for the next action.
            # Full CQL samples next actions from the learned policy instead.
            next_sa = torch.cat([next_states, actions], dim=-1)
            target_q = rewards + self.gamma * (1 - dones) * torch.min(
                self.q1_target(next_sa), self.q2_target(next_sa)
            )

        q1_pred = self.q1(sa)
        q2_pred = self.q2(sa)
        td_loss = F.mse_loss(q1_pred, target_q) + F.mse_loss(q2_pred, target_q)

        # --- CQL conservative penalty ---
        # Sample random actions uniformly (mostly out-of-distribution).
        # Assumes actions are normalized to [-1, 1].
        random_actions = torch.empty(
            batch_size * num_random, actions.shape[-1], device=states.device
        ).uniform_(-1, 1)

        # Expand states to match random actions
        states_exp = states.unsqueeze(1).expand(-1, num_random, -1).reshape(
            batch_size * num_random, -1
        )
        random_sa = torch.cat([states_exp, random_actions], dim=-1)

        # log-sum-exp over random actions minus Q at data actions
        q1_rand = self.q1(random_sa).view(batch_size, num_random)
        q2_rand = self.q2(random_sa).view(batch_size, num_random)

        # Conservative penalty: push random Q down, push data Q up
        conservative = (
            torch.logsumexp(q1_rand, dim=1).mean() - q1_pred.mean() +
            torch.logsumexp(q2_rand, dim=1).mean() - q2_pred.mean()
        )

        total_loss = td_loss + self.alpha * conservative
        return total_loss, {"td_loss": td_loss.item(), "cql_penalty": conservative.item()}

    def update(self, batch):
        states, actions, rewards, next_states, dones = batch
        loss, info = self.cql_loss(states, actions, rewards, next_states, dones)
        self.q_optim.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(
            list(self.q1.parameters()) + list(self.q2.parameters()), 1.0
        )
        self.q_optim.step()
        # Keep the target networks slowly tracking the online networks
        self.soft_update_targets()
        return info

    def soft_update_targets(self, tau=0.005):
        for p, tp in zip(self.q1.parameters(), self.q1_target.parameters()):
            tp.data.copy_(tau * p.data + (1 - tau) * tp.data)
        for p, tp in zip(self.q2.parameters(), self.q2_target.parameters()):
            tp.data.copy_(tau * p.data + (1 - tau) * tp.data)
```

Contextual Bandits - The Pragmatic Alternative

Many real production problems do not require full RL. If the reward is immediate and there is no multi-step temporal credit assignment problem, you are dealing with a contextual bandit - not a full MDP.

The bandit formulation: At each timestep, observe context $x$ (user features, item features, session data), select action $a$ from action space $\mathcal{A}$, observe immediate reward $r(x, a)$. No state transition. No Markov assumption. No future states to worry about.

This simplification makes bandits far more tractable for production: no value function, no temporal credit assignment, no replay buffer with multi-step returns.

Upper Confidence Bound (UCB)

UCB exploits uncertainty in your reward estimates. Suppose arm $a$ has been tried $n_a$ times, yielding estimated mean reward $\hat{\mu}_a$. The UCB score is:

$$
\text{UCB}(a) = \hat{\mu}_a + c\sqrt{\frac{\ln t}{n_a}}
$$

where $t$ is the total number of rounds played so far and $c$ sets the exploration strength. The second term is the uncertainty bonus - large when arm $a$ has been tried rarely ($n_a$ small) and shrinking as more data arrives. UCB achieves $O(\sqrt{T \log T})$ worst-case cumulative regret (treating the number of arms as a constant), which is within a logarithmic factor of the best any bandit algorithm can guarantee.
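The score above translates almost directly into code. The following is a minimal sketch of the classic UCB1 variant, assuming rewards in $[0, 1]$ and $c = \sqrt{2}$; the class name and the simulated click probabilities are made up for illustration.

```python
import math
import random


class UCB1:
    """UCB1 for a fixed set of arms with rewards in [0, 1]."""

    def __init__(self, n_arms: int, c: float = math.sqrt(2)):
        self.c = c
        self.counts = [0] * n_arms   # n_a: pulls per arm
        self.means = [0.0] * n_arms  # running mean reward per arm
        self.t = 0                   # total rounds played so far

    def select_arm(self) -> int:
        # Play every arm once first so that n_a > 0 in the bonus term.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        scores = [
            mean + self.c * math.sqrt(math.log(self.t) / n)
            for mean, n in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=lambda a: scores[a])

    def update(self, arm: int, reward: float) -> None:
        self.t += 1
        self.counts[arm] += 1
        # Incremental mean update
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


# Simulated Bernoulli arms with hypothetical click probabilities.
rng = random.Random(0)
click_probs = [0.2, 0.5, 0.8]
bandit = UCB1(n_arms=len(click_probs))
for _ in range(2000):
    arm = bandit.select_arm()
    reward = 1.0 if rng.random() < click_probs[arm] else 0.0
    bandit.update(arm, reward)

print("pulls per arm:", bandit.counts)
```

Because the bonus shrinks as $1/\sqrt{n_a}$, the weaker arms keep getting occasional pulls but the bulk of the traffic concentrates on the best arm.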

Thompson Sampling

Thompson Sampling takes a Bayesian approach: maintain a posterior distribution over the reward parameter $\theta$, sample from it, then take the best action under the sample.

$$
\theta \sim P(\theta \mid \mathcal{D}), \quad a_t = \arg\max_a r(a; \theta)
$$

For Bernoulli rewards (click/no-click), the conjugate prior is Beta: $\theta_a \sim \text{Beta}(\alpha_a, \beta_a)$. After a click: $\alpha_a \leftarrow \alpha_a + 1$. After no click: $\beta_a \leftarrow \beta_a + 1$. Thompson Sampling often outperforms UCB empirically, tends to handle non-stationary rewards more gracefully, and scales naturally to large action spaces.

LinUCB - Contextual Linear Bandits

For problems where you have feature vectors, LinUCB fits a linear reward model $r(x, a) \approx \theta_a^\top x$ per arm and uses the regression uncertainty as the exploration bonus:

$$
\hat{r}(a) = \hat{\theta}_a^\top x + \alpha\sqrt{x^\top A_a^{-1} x}
$$

where $A_a = \sum_i x_i x_i^\top + I$ is the regularized Gram matrix for arm $a$'s data. The second term - the standard error of the linear prediction - is large when $x$ is in a direction poorly covered by the data, driving exploration.

```python
import numpy as np


class LinUCBAgent:
    """
    LinUCB contextual bandit.
    Li et al. (2010) - used at Yahoo! for news article recommendation.
    Achieves O(sqrt(T log T)) regret with linear reward structure.
    """

    def __init__(self, n_actions: int, n_features: int, alpha: float = 1.0):
        self.n_actions = n_actions
        self.n_features = n_features
        self.alpha = alpha  # exploration coefficient

        # Per-arm sufficient statistics
        self.A = [np.eye(n_features) for _ in range(n_actions)]    # regularized Gram matrix
        self.b = [np.zeros(n_features) for _ in range(n_actions)]  # reward-weighted feature sum

    def select_action(self, context: np.ndarray) -> int:
        """Select arm by UCB on linear reward estimate."""
        x = context.astype(float)
        ucb_scores = []
        for a in range(self.n_actions):
            A_inv = np.linalg.inv(self.A[a])
            theta_hat = A_inv @ self.b[a]
            mu = theta_hat @ x              # estimated reward
            sigma = np.sqrt(x @ A_inv @ x)  # estimation uncertainty
            ucb_scores.append(mu + self.alpha * sigma)
        return int(np.argmax(ucb_scores))

    def update(self, context: np.ndarray, action: int, reward: float) -> None:
        """Update arm model after observing reward."""
        x = context.astype(float)
        self.A[action] += np.outer(x, x)
        self.b[action] += reward * x

    def get_theta(self, action: int) -> np.ndarray:
        """Return current linear model weights for arm."""
        return np.linalg.inv(self.A[action]) @ self.b[action]


# ---- Simulation: news article recommendation ----
np.random.seed(42)
n_articles = 5
n_features = 10

agent = LinUCBAgent(n_actions=n_articles, n_features=n_features, alpha=0.5)
true_thetas = np.random.randn(n_articles, n_features)  # true (unknown) reward params

total_reward = 0
for t in range(1000):
    context = np.random.randn(n_features)
    action = agent.select_action(context)

    # True reward: linear + noise; observed as binary click
    true_reward = true_thetas[action] @ context + 0.1 * np.random.randn()
    observed_reward = float(true_reward > 0)  # click model

    agent.update(context, action, observed_reward)
    total_reward += observed_reward

print(f"Total clicks (LinUCB): {total_reward:.0f}")
print(f"Expected random policy: {1000 * 0.5:.0f}")
# LinUCB should substantially outperform random within a few hundred steps
```

Thompson Sampling in Practice - Beta-Bernoulli Bandit

For binary reward settings (click/no-click, conversion/no-conversion), Thompson Sampling with a Beta prior is analytically tractable and highly effective:

```python
import numpy as np
from scipy.stats import beta


class ThompsonSamplingBandit:
    """
    Thompson Sampling with Beta-Bernoulli conjugate model.
    For binary rewards (click/no-click, conversion/no-conversion).
    Achieves O(sqrt(T log T)) regret - same asymptotic as UCB, often faster in practice.
    """

    def __init__(self, n_arms: int, alpha_prior: float = 1.0, beta_prior: float = 1.0):
        self.n_arms = n_arms
        # Uniform prior Beta(1, 1) - no prior knowledge about any arm
        self.alphas = np.full(n_arms, alpha_prior, dtype=float)
        self.betas = np.full(n_arms, beta_prior, dtype=float)

    def select_arm(self) -> int:
        """Sample theta from each arm's posterior, select highest."""
        samples = np.array([
            np.random.beta(self.alphas[i], self.betas[i])
            for i in range(self.n_arms)
        ])
        return int(np.argmax(samples))

    def update(self, arm: int, reward: float) -> None:
        """Conjugate update: Beta + Bernoulli = Beta. Expects reward in {0, 1}."""
        self.alphas[arm] += reward        # success count
        self.betas[arm] += (1 - reward)   # failure count

    def get_posterior_mean(self, arm: int) -> float:
        """Expected value of arm reward: alpha / (alpha + beta)."""
        return self.alphas[arm] / (self.alphas[arm] + self.betas[arm])

    def get_credible_interval(self, arm: int, ci: float = 0.95) -> tuple:
        """Central credible interval (95% by default) for arm's true reward probability."""
        return beta.interval(ci, self.alphas[arm], self.betas[arm])
```

Production bandit deployments:

  • Netflix - thumbnail selection: which image to show for a show drives click-through rate significantly; immediate reward (user clicks or not)
  • Google Ads - bid strategy selection per auction context; reward is conversion within session
  • Spotify - playlist generation given listening context; reward is skip rate within session
  • Clinical trials - adaptive dosing: select treatment arm based on patient features; reward is immediate biomarker response

:::tip When to use bandits vs full RL
If the reward is observed within the same session (click, purchase, conversion within minutes), use bandits. If the reward depends on a sequence of decisions across time (monthly churn, long-term engagement, multi-step manipulation task), use full RL. What distinguishes them is state dynamics: in a bandit, the chosen action does not influence the next context, so no state persists meaningfully across steps.
:::


Reward Shaping - Making Sparse Rewards Learnable

Many real tasks have sparse rewards: a robot gets +1 for picking up an object after 10,000 timesteps of failure, or a recommendation agent gets +1 only when a user subscribes after 30 days of interaction. Sparse rewards make learning impractically slow.

Reward shaping adds an auxiliary signal $F(s, a, s')$ to the environment reward to guide the agent:

$$
r'(s, a, s') = r(s, a, s') + F(s, a, s')
$$

The critical design constraint: shaping must not change the optimal policy. A poorly designed shaping signal causes the agent to optimize the shaped reward at the expense of the true objective.

Potential-Based Shaping

Ng et al. (1999) proved that any shaping function of the form:

$$
F(s, a, s') = \gamma\Phi(s') - \Phi(s)
$$

is policy-invariant - it cannot change the optimal policy, for any potential function $\Phi : \mathcal{S} \to \mathbb{R}$. The intuition: $F$ is the TD-like difference in a "potential" function $\Phi$. Over a complete trajectory, these differences telescope, so shaping changes the discounted return only by a term that depends on the start state (it equals $-\Phi(s_0)$ when the potential vanishes at the end of the trajectory), not on the actions taken. The policy that maximizes $J' = J + \text{const}$ is therefore the same as the policy that maximizes $J$.
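The telescoping step can be written out explicitly for a trajectory $s_0, s_1, \dots, s_T$, using the definition of $F$ above:

```latex
\sum_{t=0}^{T-1} \gamma^t F(s_t, a_t, s_{t+1})
  = \sum_{t=0}^{T-1} \left(\gamma^{t+1}\Phi(s_{t+1}) - \gamma^{t}\Phi(s_t)\right)
  = \gamma^{T}\Phi(s_T) - \Phi(s_0)
```

Every intermediate potential cancels against its neighbor. What remains depends only on where the trajectory starts and ends; the $\gamma^T \Phi(s_T)$ term vanishes as $T \to \infty$ for bounded $\Phi$ and $\gamma < 1$, or whenever $\Phi$ is zero at terminal states.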

Design recipe: choose $\Phi$ to reflect proximity to goal (distance to target, number of subtasks completed, value of a hand-crafted heuristic), compute $F$ as above, add to reward.
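As a sketch of this recipe, assume a grid world in which the potential is the negative Manhattan distance to a goal cell. The grid, the goal, and the function names are hypothetical; the final lines check numerically that the shaping terms along a trajectory collapse to $\gamma^T\Phi(s_T) - \Phi(s_0)$.

```python
def potential(state, goal):
    """Hypothetical potential: negative Manhattan distance to the goal cell."""
    return -(abs(state[0] - goal[0]) + abs(state[1] - goal[1]))


def shaped_reward(reward, state, next_state, goal, gamma=0.99):
    """Return r + F, with F = gamma * Phi(s') - Phi(s)."""
    return reward + gamma * potential(next_state, goal) - potential(state, goal)


goal = (3, 3)
gamma = 0.99

# A step toward the goal earns a positive bonus; a step away earns a negative one.
toward = shaped_reward(0.0, (0, 0), (1, 0), goal, gamma)
away = shaped_reward(0.0, (1, 0), (0, 0), goal, gamma)

# Discounted sum of shaping terms along a full trajectory (environment reward = 0).
trajectory = [(0, 0), (1, 0), (1, 1), (2, 1), (3, 1), (3, 2), (3, 3)]
bonus = sum(
    gamma ** t * shaped_reward(0.0, s, s_next, goal, gamma)
    for t, (s, s_next) in enumerate(zip(trajectory, trajectory[1:]))
)
T = len(trajectory) - 1
telescoped = gamma ** T * potential(trajectory[-1], goal) - potential(trajectory[0], goal)

print(f"toward goal: {toward:+.2f}, away from goal: {away:+.2f}")
print(f"sum of shaping terms: {bonus:.6f}, telescoped form: {telescoped:.6f}")
```

The agent receives a dense signal at every step, yet the total added to any trajectory from $(0, 0)$ to the goal is the same regardless of the route taken.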

Intrinsic Motivation

Reward the agent for exploring novel states - curiosity as a training signal. This addresses the core challenge of sparse rewards in real environments: without any intermediate signal, random exploration rarely stumbles onto the terminal reward at all.

ICM (Intrinsic Curiosity Module, Pathak et al. 2017): Train a forward model to predict the next state embedding $\hat{\phi}(s_{t+1})$ given the current state encoding $\phi(s_t)$ and action $a_t$. The prediction error becomes the intrinsic reward - large in states the model has not learned to predict (novel, unexplored) and small in well-understood states.

$$
r_t^{\text{ICM}} = \frac{\eta}{2}\|\hat{\phi}(s_{t+1}) - \phi(s_{t+1})\|^2
$$

ICM also includes an inverse model (predicting $a_t$ from $\phi(s_t)$ and $\phi(s_{t+1})$) to learn features that are invariant to irrelevant environmental noise - preventing the agent from being curious about TV static or random pixel flickering.

RND (Random Network Distillation, Burda et al. 2018): A fixed randomly initialized network $f: \mathcal{S} \to \mathbb{R}^d$ defines a target embedding. A predictor network $\hat{f}$ is trained to match the target on visited states. Prediction error $\|\hat{f}(s) - f(s)\|^2$ is high for novel states (predictor has not seen them), low for frequently visited ones (predictor has converged). Simpler to train than ICM, highly effective on sparse-reward environments like Montezuma's Revenge.

Count-based exploration: For discrete state spaces, keep a visitation counter $N(s)$ and add an intrinsic reward $r_t^+ = \beta / \sqrt{N(s_t)}$ - the exploration bonus decays as the state is visited more. Impractical for continuous high-dimensional state spaces; use pseudo-counts (hash-based or density-based) as an approximation.
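The count-based bonus is simple enough to sketch in a few lines. The class below is illustrative only; the optional `bin_width` argument is an assumed, deliberately crude stand-in for hash-based pseudo-counts, bucketing continuous coordinates onto a grid before counting.

```python
import math
from collections import defaultdict


class CountBonus:
    """Intrinsic reward beta / sqrt(N(s)) from state visitation counts."""

    def __init__(self, beta: float = 0.1, bin_width=None):
        self.beta = beta
        self.bin_width = bin_width      # None: states are already discrete
        self.counts = defaultdict(int)  # N(s), keyed by (discretized) state

    def _key(self, state):
        if self.bin_width is None:
            return tuple(state)
        # Crude pseudo-count: bucket each coordinate onto a grid cell.
        return tuple(math.floor(x / self.bin_width) for x in state)

    def bonus(self, state) -> float:
        """Record one visit to `state` and return its exploration bonus."""
        key = self._key(state)
        self.counts[key] += 1
        return self.beta / math.sqrt(self.counts[key])


explorer = CountBonus(beta=1.0)
first_visit = explorer.bonus((2, 5))
second_visit = explorer.bonus((2, 5))
new_state = explorer.bonus((0, 0))
print(first_visit, second_visit, new_state)

# Nearby continuous states share a bucket when bin_width is set.
coarse = CountBonus(beta=1.0, bin_width=0.5)
coarse.bonus((0.10, 0.20))
shared = coarse.bonus((0.30, 0.40))  # same 0.5-wide cell, so this is its second visit
```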

Reward Hacking and Goodhart's Law

:::danger Goodhart's Law - the most dangerous failure mode in production RL
"When a measure becomes a target, it ceases to be a good measure." - Charles Goodhart

If you shape rewards based on a proxy metric, the agent will optimize that proxy in ways that diverge from your actual goal. Examples from production systems:

  • Watch time reward - agent recommends increasingly extreme content that is "sticky" but not what users would say they wanted
  • Engagement reward - agent learns to generate outrage-inducing content (more clicks, longer sessions, worse user outcomes)
  • Shaped distance reward - robot vibrates in place to maximize progress signal without actually moving toward the goal (classic locomotion failure)
  • Code quality score - agent writes tests that always pass rather than correct code

The fix: red-team your reward function. Before training, explicitly ask: "If my agent scored 100% on this reward, would I be happy with its behavior?" If not, you have Goodhart baked in. Run adversarial simulations; look for the most absurd way to score well.
:::


Safe RL - Hard Constraints in Continuous Control

Some actions are not just suboptimal - they are catastrophically harmful. Safety constraints must be enforced at the architecture level, not hoped for through reward design.

Constrained MDPs

The formal framework is the Constrained Markov Decision Process (CMDP). Add $K$ cost functions $c_k(s, a)$ alongside the reward $r(s, a)$. The optimization problem is:

$$
\max_\pi J(\pi) \quad \text{subject to} \quad C_k(\pi) \leq d_k \quad \forall k
$$

where $C_k(\pi) = \mathbb{E}_\pi\!\left[\sum_t \gamma^t c_k(s_t, a_t)\right]$ is the expected cumulative cost under policy $\pi$, and $d_k$ is the safety threshold. You want the highest-reward policy that keeps all safety costs below their limits simultaneously.
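In practice each $C_k(\pi)$ has to be estimated from sampled trajectories. The sketch below shows a plain Monte Carlo estimate and a threshold check; the constraint names, cost values, limits, and the small $\gamma$ are invented to keep the arithmetic easy to follow.

```python
def discounted_sum(values, gamma):
    """Sum over t of gamma**t * values[t], accumulated from the end."""
    total = 0.0
    for v in reversed(values):
        total = v + gamma * total
    return total


def estimate_costs(cost_trajectories, gamma=0.99):
    """Monte Carlo estimate of C_k(pi) for every constraint k.

    cost_trajectories maps a constraint name to a list of episodes,
    each episode being the list of per-step costs c_k(s_t, a_t).
    """
    return {
        name: sum(discounted_sum(ep, gamma) for ep in episodes) / len(episodes)
        for name, episodes in cost_trajectories.items()
    }


def violated_constraints(estimates, limits):
    """Names of constraints whose estimated cost exceeds its threshold d_k."""
    return [name for name, cost in estimates.items() if cost > limits[name]]


# Hypothetical per-step costs from two evaluation episodes.
logged_costs = {
    "overheat": [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]],
    "pump_wear": [[0.1, 0.1], [0.1, 0.1]],
}
limits = {"overheat": 0.1, "pump_wear": 0.2}

estimates = estimate_costs(logged_costs, gamma=0.5)
print(estimates)
print("violated:", violated_constraints(estimates, limits))
```

With only a handful of episodes such estimates are noisy, so a real deployment would likely compare an upper confidence bound on the cost, not the point estimate, against $d_k$.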

Lagrangian Relaxation

Convert the constrained problem to an unconstrained min-max using Lagrange multipliers $\lambda_k \geq 0$:

$$
\mathcal{L}(\pi, \lambda) = J(\pi) - \sum_k \lambda_k\!\left(C_k(\pi) - d_k\right)
$$

The saddle point $\max_\pi \min_{\lambda \geq 0} \mathcal{L}(\pi, \lambda)$ gives the optimal constrained policy. In practice, alternate between two updates:

  1. Policy update: policy gradient on $J(\pi) - \sum_k \lambda_k C_k(\pi)$ - larger $\lambda_k$ means a stronger penalty for incurring cost $k$
  2. Dual update: $\lambda_k \leftarrow \max(0,\, \lambda_k + \eta_\lambda(C_k(\pi) - d_k))$ - raise $\lambda_k$ when the constraint is violated, lower it when satisfied

The Lagrangian is simple and compatible with any policy gradient algorithm, but can oscillate and may violate constraints transiently during training.

Constrained Policy Optimization (CPO)

CPO (Achiam et al., 2017) extends the TRPO trust-region update to handle constraints directly. At each update step it solves:

$$
\pi_{k+1} = \arg\max_\pi \mathcal{L}(\pi_k, \pi) \quad \text{s.t.} \quad \bar{D}_{KL}(\pi \| \pi_k) \leq \delta,\quad J_{C_i}(\pi) \leq d_i
$$

Here $k$ indexes the policy iterate, $\mathcal{L}(\pi_k, \pi)$ is the surrogate objective (not the Lagrangian above), and $J_{C_i}(\pi)$ is the expected cumulative cost for constraint $i$, the same quantity written $C_k(\pi)$ in the CMDP definition. Both the KL constraint and cost constraints are enforced simultaneously. CPO provides monotonic improvement guarantees and approximate constraint satisfaction at every update, at the cost of a more complex optimization step involving a second-order expansion and a feasibility check.

Safety Layer

The most pragmatic production approach: train any standard RL policy, post-process its action through a safety filter before execution.

```python
import numpy as np


class LagrangianSafeRL:
    """
    Lagrangian safe RL training loop.
    Adds constraint satisfaction to any policy gradient algorithm.

    Assumes policy.act(state) returns (action, log_prob) and that env follows
    the classic Gym API: reset() -> state, step(a) -> (state, reward, done, info).
    """

    def __init__(self, policy, cost_limit: float = 0.1,
                 lambda_lr: float = 0.01, gamma: float = 0.99):
        self.policy = policy
        self.cost_limit = cost_limit  # d: maximum allowed average cost
        self.lambda_lr = lambda_lr
        self.lam = 0.0                # Lagrange multiplier (starts unconstrained)
        self.gamma = gamma

    def compute_augmented_reward(self, rewards, costs):
        """
        Augmented reward = original reward - lambda * cost.
        Agent penalized for incurring cost; penalty scales with constraint violation.
        """
        return [r - self.lam * c for r, c in zip(rewards, costs)]

    def update_lambda(self, episode_costs: list) -> float:
        """
        Dual ascent: gradient step on Lagrangian w.r.t. lambda.
        lambda_{k+1} = max(0, lambda_k + lr * (avg_cost - limit))
        Uses the per-step average cost as a simple proxy for C(pi).
        """
        avg_cost = np.mean(episode_costs)
        self.lam = max(0.0, self.lam + self.lambda_lr * (avg_cost - self.cost_limit))
        return self.lam

    def safety_project(self, action: np.ndarray, state: np.ndarray) -> np.ndarray:
        """
        Project proposed action to safe feasible set.
        Implementation is domain-specific. Examples:
        - Actuator limits: clip to physical range
        - Financial: enforce position / leverage limits
        - Robotics: solve QP for nearest kinematically feasible action
        """
        # Example: clip to actuator limits [-1, 1]
        return np.clip(action, -1.0, 1.0)

    def _compute_returns(self, rewards: list) -> list:
        """Discounted return-to-go for every timestep."""
        G, running = [], 0.0
        for r in reversed(rewards):
            running = r + self.gamma * running
            G.append(running)
        G.reverse()
        return G

    def training_step(self, env, n_steps: int = 1000) -> dict:
        """One episode of safe RL with Lagrangian penalty."""
        state = env.reset()
        rewards, costs, log_probs = [], [], []

        for _ in range(n_steps):
            action, log_prob = self.policy.act(state)
            safe_action = self.safety_project(action, state)

            next_state, reward, done, info = env.step(safe_action)
            cost = info.get("cost", 0.0)  # domain-specific safety cost

            rewards.append(reward)
            costs.append(cost)
            log_probs.append(log_prob)

            if done:
                break
            state = next_state

        # Compute augmented returns for policy update
        augmented = self.compute_augmented_reward(rewards, costs)
        G = self._compute_returns(augmented)

        # Policy gradient loss (REINFORCE; replace with PPO in practice).
        # The optimizer step is left to the caller, e.g.:
        # policy.optimizer.zero_grad(); policy_loss.backward(); policy.optimizer.step()
        policy_loss = sum(-lp * g for lp, g in zip(log_probs, G))

        # Dual: adjust lambda based on constraint violation this episode
        new_lambda = self.update_lambda(costs)

        return {
            "total_reward": sum(rewards),
            "total_cost": sum(costs),
            "lambda": new_lambda,
            "policy_loss": policy_loss,
            "constraint_satisfied": bool(np.mean(costs) <= self.cost_limit),
        }
```
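The safety-layer idea can also be sketched on its own, independent of how the policy was trained. The filter below assumes a single scalar actuator (for instance a temperature set point) with an absolute operating envelope and a limit on how far the command may move per control step; the class name and all numeric limits are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SafetyFilter:
    """Post-processes a proposed scalar action before it reaches the actuator."""
    low: float       # lower edge of the operating envelope
    high: float      # upper edge of the operating envelope
    max_step: float  # largest allowed change per control step

    def filter(self, proposed: float, previous: float) -> tuple:
        """Return (safe_action, was_modified)."""
        # 1. Rate limit: never move further than max_step from the last command.
        delta = proposed - previous
        if delta > self.max_step:
            candidate = previous + self.max_step
        elif delta < -self.max_step:
            candidate = previous - self.max_step
        else:
            candidate = proposed
        # 2. Envelope: clip to the absolute operating range.
        safe = min(self.high, max(self.low, candidate))
        return safe, safe != proposed


# Hypothetical set-point limits in degrees Celsius.
guard = SafetyFilter(low=16.0, high=27.0, max_step=0.5)
print(guard.filter(proposed=20.2, previous=20.0))  # small change passes through
print(guard.filter(proposed=30.0, previous=20.0))  # large jump is rate-limited
print(guard.filter(proposed=15.0, previous=16.2))  # clipped at the lower bound
```

Logging the `was_modified` flag is useful in its own right: a policy whose actions are frequently overridden by the filter is telling you that its training data or constraints need attention.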

Real-World RL Systems

DeepMind Data Center Cooling (2016–2018)

Architecture: Deep neural network policy trained on 5 years of sensor logs. State: ~20 sensor readings (temperatures, flow rates, power). Action: set points for pumps, chillers, cooling tower fans. Reward: negative power usage effectiveness (PUE).

Key engineering decisions:

  • Fully offline training - no online exploration during training
  • 21 hard constraints on operating parameters enforced at the actuator layer
  • Months of shadow mode comparison before any live control
  • Human override always available and monitored 24/7
  • Gradual deployment: one cooling loop first, full autonomy only after months

Result: 40% reduction in cooling energy, which DeepMind reported as equivalent to a 15% reduction in overall PUE overhead.

YouTube Recommendation (2019–present)

A recurrent RL agent (REINFORCE with a baseline) optimizes for long-term user satisfaction. The key insight: optimizing immediate engagement (clicks, raw watch time) diverges from optimizing long-term satisfaction, so the RL agent trades short-term engagement for long-term retention. The challenge is that the reward (the user returns to the platform, satisfaction surveys) arrives days after the recommendation decision. Multi-step credit assignment across this delayed reward is the core technical problem.

Waymo Self-Driving (Model-Based RL)

Waymo uses model-based RL: learn a world model from sensor data, then plan within the model. Online exploration happens in simulation (Waymo Sim), not on real roads. Real-world data refines the world model, not the policy directly. This is the highest-stakes application of offline-to-online transfer: simulation-trained policy validated offline, then deployed in controlled scenarios before full autonomy.

AlphaFold 2 (RL-Style Iterative Refinement)

AlphaFold 2 is not RL: it is trained with supervised learning. What it shares with RL is an iterative refinement pattern. The network generates a candidate structure, feeds its own output back in as input for several further passes (which the AlphaFold authors call recycling), and reports a learned confidence score for the result. This is loosely analogous to the actor-critic loop of generate (actor) → evaluate (critic) → update, although no reward signal or policy gradient is involved.

The CASP14 result (2020) - AlphaFold 2 predicting protein structures to near-experimental accuracy - is partly attributable to this iterative refinement. The lesson: the generate-evaluate-refine pattern that RL formalizes shows up well beyond the sequential decision-making framing it is usually presented in.

Model-Based RL for Sample Efficiency

When online interactions are limited but not entirely prohibited, model-based RL (MBRL) offers a middle path. The agent learns a dynamics model $\hat{T}(s' \mid s, a)$ from limited real experience, then uses this model to simulate additional training data without real interactions.

Dyna-Q (Sutton, 1991): After each real interaction $(s, a, r, s')$, update the Q-function normally. Then perform $k$ additional "imagined" updates: sample a previously seen $(s, a)$ pair, query the learned model for $\hat{s}' \sim \hat{T}(\cdot \mid s, a)$, and update Q from this imagined transition. With $k = 50$, each real interaction drives 51 Q-updates instead of one, so on small environments where the learned model is accurate, Dyna-Q typically needs far fewer real interactions than model-free Q-learning.
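The Dyna-Q loop described above can be sketched in a few lines of tabular Python. This is a minimal illustration under stated assumptions, not a reference implementation: the `env_step(state, action) -> (reward, next_state, done)` interface, the `chain_step` toy environment, and the deterministic "remember the last outcome" model are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict


def dyna_q(env_step, start_state, actions, episodes=50, k=50,
           alpha=0.1, gamma=0.95, epsilon=0.1, max_steps=200, seed=0):
    """Tabular Dyna-Q with a deterministic learned model (illustrative sketch)."""
    rng = random.Random(seed)
    q = defaultdict(float)   # q[(state, action)] -> value estimate
    model = {}               # learned model: (state, action) -> (reward, next_state, done)

    def greedy(s):
        best = max(q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if q[(s, a)] == best])

    def q_update(s, a, r, s2, done):
        target = r if done else r + gamma * max(q[(s2, b)] for b in actions)
        q[(s, a)] += alpha * (target - q[(s, a)])

    for _ in range(episodes):
        s = start_state
        for _ in range(max_steps):
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s2, done = env_step(s, a)      # one real interaction
            q_update(s, a, r, s2, done)       # direct RL update
            model[(s, a)] = (r, s2, done)     # model learning
            seen = list(model)
            for _ in range(k):                # k imagined (planning) updates
                ps, pa = rng.choice(seen)
                q_update(ps, pa, *model[(ps, pa)])
            if done:
                break
            s = s2
    return q


def chain_step(s, a):
    """Hypothetical 6-state chain: move left/right, reward 1 on reaching state 5."""
    s2 = min(max(s + a, 0), 5)
    done = s2 == 5
    return (1.0 if done else 0.0), s2, done


q_table = dyna_q(chain_step, start_state=0, actions=[-1, 1])
```

Setting `k=0` recovers plain Q-learning, which makes it easy to compare how many real steps each variant needs on the same toy problem.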

MBPO (Model-Based Policy Optimization, Janner et al. 2019): Train an ensemble of probabilistic dynamics models. Sample from the ensemble to detect uncertainty (disagreement between ensemble members = out-of-distribution state). Use short model-generated rollouts (horizon 1–5) combined with the real replay buffer to train a SAC agent. Achieves 5–40× better sample efficiency than SAC alone on MuJoCo continuous control tasks.

Production consideration: Learned dynamics models introduce a new failure mode - model bias. If the model is wrong about the consequences of an action, the agent will exploit that error (like reward hacking, but for the dynamics model). Always bound the rollout horizon and monitor the gap between predicted and observed transitions in deployment.


Offline RL Benchmarks and Algorithms Compared

The D4RL benchmark (Fu et al., 2020) standardized offline RL evaluation with five dataset quality levels per environment:

| Dataset type | Description | CQL | IQL | TD3+BC | BC baseline |
|---|---|---|---|---|---|
| random | Randomly collected transitions | ~5% | ~5% | ~8% | ~4% |
| medium | SAC policy trained to 1/3 performance, then stopped | ~44% | ~47% | ~48% | ~36% |
| medium-replay | All data from training SAC to medium level | ~45% | ~73% | ~44% | ~26% |
| medium-expert | Mix of medium and expert demonstrations | ~91% | ~87% | ~90% | ~52% |
| expert | Expert demonstrations only | ~98% | ~92% | ~98% | ~107% |

Scores are normalized: 0 = random policy, 100 = expert policy (HalfCheetah environment). Key takeaways:

  • Behavioral Cloning (BC) - supervised learning on the data - sets the baseline. It performs well on expert data but poorly on medium data where the behavioral policy itself was suboptimal.
  • CQL and IQL substantially outperform BC on medium and medium-replay datasets by extracting more information from the Q-value structure than pure imitation.
  • Medium-replay data (all training checkpoints, not just the final policy) is often the richest dataset - it contains diverse exploratory behavior that offline RL can exploit.
  • No offline RL algorithm reliably outperforms its behavioral policy by a wide margin; the ceiling is set by what is in the data.
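To make the score normalization concrete, here is a small sketch of the linear rescaling used for D4RL-style scores. The reference returns below are hypothetical placeholders chosen for the example, not the benchmark's published values. Note that a policy whose raw return exceeds the expert reference scores above 100, which is how a figure such as ~107% can appear in the table.

```python
def normalized_score(raw_return, random_return, expert_return):
    """Rescale a raw episode return so that random = 0 and expert = 100."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# Hypothetical reference returns for one environment (assumed for illustration).
RANDOM_RETURN = -300.0
EXPERT_RETURN = 12000.0

medium_policy_score = normalized_score(5000.0, RANDOM_RETURN, EXPERT_RETURN)
above_expert_score = normalized_score(12861.0, RANDOM_RETURN, EXPERT_RETURN)
```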

Choosing the Right Offline RL Algorithm

| Scenario | Recommended Algorithm | Reason |
|---|---|---|
| Continuous action space, moderate data coverage | IQL | Stable training, avoids OOD action queries |
| Discrete action space, narrow data coverage | CQL | Conservative penalty prevents extrapolation errors |
| Large dataset, near-expert demonstrations | TD3+BC (behavioral cloning regularized) | Leverages demonstrations directly |
| Unknown data quality, first experiment | BC baseline first | Establishes a floor; validates data quality |
| Very limited data (< 10K transitions) | Weighted behavior cloning | Full offline RL may overfit |

Production Deployment Checklist

# Production RL deployment protocol

deployment_checklist = {
    "offline_validation": [
        "Evaluate policy on held-out historical trajectories (not in training set)",
        "Compare against behavioral policy baseline - is it measurably better?",
        "Check for reward hacking: adversarial scenarios where high reward != good behavior",
        "Stress-test on worst-case scenarios from dataset tail",
    ],
    "shadow_mode": [
        "Log all policy recommendations alongside actual system decisions",
        "Monitor divergence: how often does policy disagree with current system?",
        "Alert on large disagreements - may indicate out-of-distribution state",
        "Run minimum 2 weeks across varied operating conditions before live control",
    ],
    "limited_deployment": [
        "Control the smallest possible subsystem first",
        "One-button manual override always available and within arm's reach",
        "Real-time constraint monitoring dashboard for all safety metrics",
        "Automatic rollback if any hard constraint is violated",
        "A/B test against current system with identical traffic",
    ],
    "full_deployment": [
        "Continuous monitoring: reward, cost metrics, constraint satisfaction rates",
        "Distribution shift detection on state features (Jensen-Shannon divergence)",
        "Periodic offline re-evaluation on new historical data",
        "Automated rollback triggers (3-sigma deviation from baseline)",
        "On-call escalation path for unexpected agent behavior",
        "Quarterly reward function red-team sessions",
    ],
}

Common Mistakes

:::danger Mistake 1 - Exploring freely in production

Never run standard epsilon-greedy with $\varepsilon > 0.01$ in a high-stakes production environment. Every exploration step has a real cost. Use offline RL and shadow mode to validate before any online exploration. When you do introduce online exploration, make it targeted (for example, uncertainty-driven Thompson Sampling), not random.

:::

:::danger Mistake 2 - Trusting your reward function without red-teaming

The most common production failure: an agent that achieves a high reward on your metric while doing something you would never want. Spend three times longer designing and stress-testing your reward function than you think necessary. Explicitly construct adversarial scenarios before training begins.

:::

:::warning Mistake 3 - Trusting offline evaluation metrics uncritically

Offline evaluation measures policy quality on historical data. A policy can score well offline while performing poorly online due to distributional shift. Offline evaluation is necessary but insufficient. Always validate with shadow-mode comparison before live deployment.

:::

:::warning Mistake 4 - Treating the training environment as static

As your policy changes user behavior, the data distribution shifts. What was optimal six months ago may be suboptimal today. RL systems need continuous monitoring and periodic retraining. Plan for this in your infrastructure before your first deployment.

:::

:::tip Start with contextual bandits

For most business ML problems - recommendation, content ranking, ad bidding - start with contextual bandits before full RL. Bandits are simpler, more interpretable, easier to debug, and often achieve 80% of the performance of full RL with 20% of the complexity. Escalate to full RL only when you need multi-step credit assignment.

:::


YouTube Resources

| Video | Channel | What You Will Learn |
|---|---|---|
| Offline RL Overview | Sergey Levine | When and why offline RL, CQL intuition, D4RL benchmark |
| DeepMind Data Center RL | DeepMind | Data center cooling case study walkthrough |
| Safe Reinforcement Learning | ICML Tutorial | Constrained MDPs, Lagrangian relaxation, CPO |
| Contextual Bandits | Microsoft Research | LinUCB and Thompson Sampling explained |

Interview Q&A

Q1: What are the key differences between offline RL and online RL, and when would you choose each?

Online RL interacts with the environment during training - the agent tries actions, observes outcomes, and updates its policy in real time. It can explore freely and recover from mistakes. Suitable when the environment is safe to interact with (simulation, game), resets are cheap, and many interactions are possible.

Offline RL (batch RL) learns from a fixed historical dataset without any environment interaction. The agent cannot try new actions - it must learn entirely from logged data of a behavioral policy. Suitable when online exploration is dangerous (medical, industrial, financial), you have rich historical data from an existing system, or you want to pre-train before limited online fine-tuning.

The defining challenge of offline RL is extrapolation error: the agent overestimates Q-values for out-of-distribution actions it has never seen. CQL addresses this with conservative penalties; IQL avoids unseen action queries entirely.

Q2: Explain CQL's key idea. Why is the log-sum-exp term in the loss necessary?

CQL adds a conservative penalty to the standard TD loss to address extrapolation error. The log-sum-exp term $\mathbb{E}_s\left[\log \sum_a \exp(Q(s,a))\right]$ is the soft maximum of Q-values over all actions at a given state - including actions not in the dataset. Minimizing this term pushes down Q-values for all actions, especially those with high but unsupported Q-estimates (out-of-distribution actions that look spuriously good). The second term $-\mathbb{E}_{(s,a)\sim\mathcal{D}}[Q(s,a)]$ pushes up Q-values specifically for actions present in the data, counterbalancing the global suppression.

The result: the learned Q-values deliberately underestimate unseen actions (a conservative lower bound) while staying accurate for seen actions. The policy that maximizes conservative Q-values will not chase phantom rewards from unexplored regions. The $\alpha$ coefficient controls how conservative to be - a hyperparameter that must be tuned per dataset.
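The two penalty terms can be computed directly for a discrete action space. The NumPy sketch below is an illustration only: the function names and the toy Q-values are assumptions, and in a full CQL loss this penalty would be multiplied by $\alpha$ and added to the usual TD error.

```python
import numpy as np


def logsumexp(q, axis=-1):
    """Numerically stable log-sum-exp along one axis."""
    m = np.max(q, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(q - m), axis=axis))


def cql_penalty(q_values, data_actions):
    """Mean of [logsumexp_a Q(s, a) - Q(s, a_data)] over a batch.

    q_values: array of shape (batch, n_actions)
    data_actions: integer array of shape (batch,), the actions logged in the dataset
    """
    soft_max = logsumexp(q_values, axis=1)
    q_data = q_values[np.arange(len(data_actions)), data_actions]
    return float(np.mean(soft_max - q_data))


# Toy example: action 2 has a high Q-estimate.
q = np.array([[1.0, 1.0, 5.0]])
penalty_when_unsupported = cql_penalty(q, np.array([0]))  # dataset took action 0
penalty_when_supported = cql_penalty(q, np.array([2]))    # dataset took action 2
```

The penalty is large when the high Q-value belongs to an action the dataset never took, and close to zero when the dataset action is already the highest-valued one, which is the asymmetry the answer above describes.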

Q3: What are contextual bandits and when are they preferable to full RL?

Contextual bandits are a simplified RL setting: observe context $x$, take action $a$, observe immediate reward $r(x, a)$, with no state transitions and no temporal credit assignment. Each decision is independent.

They are preferable to full RL when the reward is observed immediately within the same interaction (click, conversion, immediate rating), there is no meaningful long-term state that evolves based on your actions, and you want simpler training, debugging, and monitoring. Examples: news article recommendation (immediate click), ad bidding (immediate conversion), clinical dosing (immediate biomarker response).

Use full RL when decisions have long-term consequences unfolding over time: recommendation affecting 30-day retention, financial portfolio allocation, robot manipulation requiring multi-step planning.
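As a concrete picture of the bandit setting, here is a minimal sketch of disjoint LinUCB: one ridge-regression model per arm plus an upper-confidence bonus. The class layout and the two-context toy simulation are assumptions made for illustration; a production system would add feature engineering, logging, and off-policy evaluation.

```python
import numpy as np


class LinUCB:
    """Disjoint LinUCB: an independent linear reward model per arm."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha                                  # exploration coefficient
        self.A = [np.eye(dim) for _ in range(n_arms)]       # per-arm design matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]     # per-arm reward vectors

    def select(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                               # ridge estimate of arm weights
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)     # confidence width for this context
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x


# Toy simulation (hypothetical): arm 0 pays off in context 0, arm 1 in context 1.
bandit = LinUCB(n_arms=2, dim=2, alpha=1.0)
contexts = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
for t in range(200):
    x = contexts[t % 2]
    arm = bandit.select(x)
    reward = 1.0 if arm == t % 2 else 0.0
    bandit.update(arm, x, reward)
```

Each round is a single observe-act-reward step with no next state, which is exactly what separates this setting from full RL.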

Q4: What is reward shaping and how can it go wrong? What is Goodhart's Law?

Reward shaping adds an auxiliary term to the environment reward to guide agents through sparse reward settings. Potential-based shaping $F = \gamma\Phi(s') - \Phi(s)$ is guaranteed not to change the optimal policy: the discounted shaping terms telescope over a trajectory, leaving a net contribution of $-\Phi(s_0)$ (taking the potential of terminal states to be zero), which no choice of actions can change.
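The telescoping argument is easy to check numerically. In the sketch below, the potential function (negative distance to a goal at position 10) and the example trajectory are hypothetical; the point is only that the discounted sum of shaping terms equals $\gamma^T\Phi(s_T) - \Phi(s_0)$ regardless of the path taken in between.

```python
import math


def total_shaping(states, potential, gamma):
    """Discounted sum of F_t = gamma * Phi(s_{t+1}) - Phi(s_t) along a trajectory."""
    return sum(
        gamma ** t * (gamma * potential(states[t + 1]) - potential(states[t]))
        for t in range(len(states) - 1)
    )


def potential(s):
    """Hypothetical potential: negative distance to a goal at position 10."""
    return -abs(10 - s)


gamma = 0.9
trajectory = [0, 3, 2, 7, 10]            # any path from start to goal
steps = len(trajectory) - 1

shaped_sum = total_shaping(trajectory, potential, gamma)
closed_form = gamma ** steps * potential(trajectory[-1]) - potential(trajectory[0])
```

A different route between the same endpoints with the same number of steps gives the same total, so shaping changes how quickly the agent learns but not which policy is optimal.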

The failure mode is Goodhart's Law: "when a measure becomes a target, it ceases to be a good measure." A shaped reward is a proxy for the true objective. If the proxy is imperfectly aligned, the agent exploits the gap. Examples: watch time optimization leading to extreme content, engagement optimization leading to outrage-bait, shaped distance rewards leading to vibrating-in-place locomotion.

Mitigation: red-team your reward function before training - explicitly ask what happens if the agent scores 100% on this metric. Use human evaluators to audit policy behavior. Restrict the degree of optimization with KL penalties from a safe reference policy.

Q5: What is a constrained MDP and how does Lagrangian relaxation solve it?

A CMDP adds $K$ cost functions to a standard MDP. The optimization problem is: maximize expected cumulative reward subject to the constraint that expected cumulative cost stays below thresholds $d_k$ for all $k$ simultaneously.

Lagrangian relaxation converts this constrained problem to an unconstrained min-max: $\max_\pi \min_{\lambda \ge 0} \mathcal{L}(\pi, \lambda) = J(\pi) - \sum_k \lambda_k (C_k(\pi) - d_k)$.

In practice: alternate between policy gradient updates treating $-\sum_k \lambda_k C_k(\pi)$ as an additional penalty term, and dual updates $\lambda_k \leftarrow \max(0, \lambda_k + \eta(C_k(\pi) - d_k))$. Lagrange multipliers adapt automatically: if the agent keeps violating a constraint, its penalty grows until the constraint is satisfied. The main limitation is transient constraint violations during training - CPO provides stronger per-update guarantees.
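The dual step is a one-line projected gradient ascent per constraint. The sketch below assumes the measured costs $C_k(\pi)$ are available as plain numbers from recent rollouts; the function names and example values are illustrative.

```python
import numpy as np


def dual_update(lambdas, costs, limits, lr=0.01):
    """Projected dual ascent: lambda_k <- max(0, lambda_k + lr * (C_k - d_k))."""
    lambdas = np.asarray(lambdas, dtype=float)
    violation = np.asarray(costs, dtype=float) - np.asarray(limits, dtype=float)
    return np.maximum(0.0, lambdas + lr * violation)


def penalized_return(reward_return, costs, lambdas):
    """Objective seen by the policy update: J(pi) - sum_k lambda_k * C_k(pi)."""
    return float(reward_return - np.dot(lambdas, costs))


# Constraint 0 is violated (cost 3.0 > limit 1.0), constraint 1 is satisfied.
new_lambdas = dual_update([0.0, 0.5], costs=[3.0, 0.2], limits=[1.0, 1.0], lr=0.1)
```

The multiplier for the violated constraint grows while the satisfied one shrinks, and the `max(0, ...)` projection keeps a multiplier from going negative once its constraint is comfortably met.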

Q6: How did DeepMind's data center RL system work in practice?

The system had four key engineering components:

  • Fully offline training: the policy was trained on 5 years of sensor logs - tens of millions of (state, action, reward) tuples from human operator decisions. No online exploration during training.
  • Safety constraints: 21 hard operational constraints (temperature limits, pressure ranges, flow rate limits) enforced at the actuator command level - the policy's recommendations were checked and clipped before any physical command was sent.
  • Shadow mode: the system ran for months generating recommendations that were logged and compared to human decisions, but not executed. Engineers validated recommendations were sensible across a wide range of operating conditions.
  • Gradual deployment: live control was introduced one cooling loop at a time, with human monitoring and one-button override always available. Full autonomy came only after months of successful partial deployment.

The 40% cooling energy reduction demonstrates that the engineering around the algorithm mattered as much as the algorithm itself.


Monitoring RL Systems in Production

Once deployed, RL systems require a monitoring strategy fundamentally different from supervised ML. A supervised model's performance is measurable with a held-out test set. An RL system's performance depends on the policy's interaction with a non-stationary environment - the environment that the policy itself is changing.

Key Metrics to Track

Reward metrics: Track the per-episode reward distribution, not just the mean. The distribution's shape reveals whether the policy is consistently good or bimodally distributed (excellent sometimes, terrible other times). A bimodal distribution suggests the policy is sensitive to specific environment conditions that need investigation.

Constraint metrics: For safe RL deployments, track the empirical constraint violation rate at every decision step. Plot it over time. Any upward trend - even a small one - is a signal that the environment is shifting in ways the policy was not trained to handle.

Distribution shift metrics: Compare the current state feature distribution $p_{\pi_{\text{current}}}(s)$ against the training distribution $p_{\pi_b}(s)$. The Jensen-Shannon divergence between these distributions should be small. When it grows, your value estimates are becoming unreliable.

$$D_{JS}(p_{\pi_b} \| p_{\pi_\text{current}}) = \frac{1}{2}\left[D_{KL}(p_{\pi_b} \| m) + D_{KL}(p_{\pi_\text{current}} \| m)\right]$$

where $m = \frac{1}{2}(p_{\pi_b} + p_{\pi_\text{current}})$.
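One simple way to estimate this quantity in practice is to histogram a state feature over shared bins for the training data and for recent live traffic, then apply the formula to the two histograms. The sketch below assumes a single scalar feature and uses synthetic Gaussian samples as stand-ins for real logs; with natural logarithms the result is in nats and is bounded above by $\ln 2 \approx 0.693$.

```python
import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence (in nats) between two histograms over the same bins."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()          # normalize counts to probabilities
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                         # terms with a = 0 contribute 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Synthetic stand-ins for one logged state feature (illustrative only).
rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=5000)
live_feature = rng.normal(0.5, 1.0, size=5000)

bins = np.linspace(-6.0, 6.0, 41)            # shared bin edges for both samples
train_hist, _ = np.histogram(train_feature, bins=bins)
live_hist, _ = np.histogram(live_feature, bins=bins)
drift = js_divergence(train_hist, live_hist)
```

For high-dimensional states, a common compromise is to track this per feature (or on a low-dimensional embedding) rather than attempting a joint histogram.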

Action diversity metrics: If the policy is exploiting a small subset of available actions, it may have converged to a local optimum or be reward hacking. Track action entropy over time: $H(\pi(\cdot|s)) = -\sum_a \pi(a|s)\log\pi(a|s)$.
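A minimal sketch of both variants of this metric follows: the entropy of a single action distribution, and an empirical version computed from a window of logged discrete actions. The empirical version measures the marginal action distribution across states rather than the per-state entropy in the formula, which is an approximation assumed here because it needs only action logs.

```python
import numpy as np


def action_entropy(probs):
    """Shannon entropy (nats) of one action distribution pi(. | s)."""
    probs = np.asarray(probs, dtype=float)
    nonzero = probs[probs > 0]               # 0 * log(0) is treated as 0
    return float(-np.sum(nonzero * np.log(nonzero)))


def empirical_action_entropy(actions, n_actions):
    """Entropy of the action frequencies in a window of logged discrete actions."""
    counts = np.bincount(np.asarray(actions), minlength=n_actions)
    return action_entropy(counts / counts.sum())


diverse = empirical_action_entropy([0, 1, 2, 3, 0, 1, 2, 3], n_actions=4)
collapsed = empirical_action_entropy([2, 2, 2, 2, 2, 2, 2, 2], n_actions=4)
```

A sustained slide from the `diverse` end toward the `collapsed` end is the signal worth alerting on.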

Automated Rollback Triggers

Define explicit rollback criteria before deployment:

  • Reward drops more than 2 standard deviations below the rolling mean
  • Constraint violation rate exceeds threshold $d_k + \delta$ for any $k$
  • State feature distribution shift exceeds the threshold ($D_{JS} > 0.1$)
  • Any single action causes an irreversible real-world consequence (fire the exception immediately, alert on-call)

When a trigger fires, roll back to the previous policy automatically, alert the on-call team, and log the trigger state for post-hoc debugging.
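The first three triggers are simple threshold checks and can be evaluated together on every monitoring tick. The sketch below is one possible shape for that check; `RollbackConfig`, `rollback_reasons`, the constraint name, and the numeric thresholds are all hypothetical. The fourth trigger (an irreversible real-world consequence) is event-driven and would be raised from the actuator layer rather than polled here.

```python
import statistics
from dataclasses import dataclass, field


@dataclass
class RollbackConfig:
    reward_sigma: float = 2.0                          # allowed drop, in standard deviations
    cost_limits: dict = field(default_factory=dict)    # d_k per named constraint
    cost_margin: float = 0.0                           # delta
    js_threshold: float = 0.1


def rollback_reasons(reward_window, latest_reward, violation_rates, js_div, cfg):
    """Return the list of triggers that fired; an empty list means keep serving."""
    reasons = []
    mean = statistics.fmean(reward_window)
    std = statistics.pstdev(reward_window)
    if latest_reward < mean - cfg.reward_sigma * std:
        reasons.append("reward_drop")
    for name, rate in violation_rates.items():
        if rate > cfg.cost_limits[name] + cfg.cost_margin:
            reasons.append(f"constraint:{name}")
    if js_div > cfg.js_threshold:
        reasons.append("distribution_shift")
    return reasons


cfg = RollbackConfig(cost_limits={"inlet_temp": 0.01}, cost_margin=0.005)
window = [10.0, 11.0, 9.0, 10.0, 10.0]

healthy = rollback_reasons(window, 9.5, {"inlet_temp": 0.004}, 0.03, cfg)
degraded = rollback_reasons(window, 8.0, {"inlet_temp": 0.02}, 0.03, cfg)
```

Returning the reasons rather than a bare boolean gives the on-call engineer the trigger state needed for post-hoc debugging.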

Hyperparameter Reference for Production RL Algorithms

| Algorithm | Key Hyperparameters | Typical Values | Sensitivity |
|---|---|---|---|
| CQL | $\alpha$ (penalty weight) | 0.1–5.0 | High - tune on validation |
| CQL | num_random samples | 10 | Medium |
| IQL | $\tau$ (expectile) | 0.7–0.9 | Medium - higher = more optimistic |
| IQL | $\beta$ (policy extraction temp.) | 3.0–10.0 | Medium |
| LinUCB | $\alpha$ (exploration coeff.) | 0.1–2.0 | High - tune per domain |
| Thompson Sampling | Prior $(\alpha_0, \beta_0)$ | (1, 1) uniform | Low |
| Lagrangian safe RL | $\lambda_\text{lr}$ | 0.001–0.01 | High - too fast → oscillation |
| Lagrangian safe RL | Initial $\lambda$ | 0.0 | Fixed (start unconstrained) |
| PPO + agent | Clip $\varepsilon$ | 0.1–0.2 | Medium |
| PPO + agent | KL target | 0.01–0.05 | High - too large → instability |

A/B Testing RL Policies

Standard A/B testing applies to RL with important caveats: (1) the treatment and control groups must be fully isolated - if the RL policy changes prices in a market, it changes the environment for the heuristic control too; (2) the measurement period must be long enough to capture delayed rewards - for 30-day churn metrics, run for 90+ days before concluding; (3) guard against novelty effects - users may respond differently to a new policy simply because it is new, not because it is better.

Interplay Between RL and the Data Flywheel

RL systems that improve over time create a self-reinforcing data flywheel: a better policy generates better data (richer, more diverse states) which enables better training which produces a better policy. This is the same flywheel that drives AlphaGo Zero's self-play, YouTube's recommendation RL, and Waymo's simulation training loop.

The challenge: early in deployment, the behavioral policy generates poor data. The distribution is narrow, the reward signals are sparse. This is the cold start problem for RL in production. Mitigation strategies: (1) bootstrap with human demonstration data (SFT before RL), (2) use a rule-based prior policy as the behavioral policy to generate warm-start data, (3) use contextual bandits to collect initial diverse data before training a full RL agent.


Summary

Production RL is an engineering discipline, not just an algorithmic one. The five core challenges - reward delay, non-stationarity, exploration risk, distributional shift, sample efficiency - each demand specific engineering responses.

Offline RL (CQL, IQL) is the right tool when online exploration is unsafe or expensive. Contextual bandits (LinUCB, Thompson Sampling) are the right tool when the problem lacks temporal credit assignment. Reward shaping accelerates learning in sparse reward settings but must be designed with Goodhart's Law in mind. Safe RL (Lagrangian, CPO, safety layers) is required whenever actions have hard physical or ethical constraints.

The DeepMind data center story is the template: start with historical data, train offline, run in shadow mode, deploy incrementally with overrides, monitor continuously. This is how RL earns the right to operate autonomously in the real world.


Next: RL for AI Agents - Teaching Models to Act in the World

© 2026 EngineersOfAI. All rights reserved.