Process Optimization with Reinforcement Learning
Reading time: ~45 min · Interview relevance: High · Target roles: ML Engineer, Control Systems Engineer, Industrial AI Researcher
The Process Engineer and the Black Box
At a BASF chemical plant in Ludwigshafen, a team of process engineers has spent 30 years optimizing the production of a specialty polymer. The process involves 47 controllable parameters - reactor temperatures, pressure profiles, feed ratios, residence times, catalyst concentrations. The quality of the polymer is measured by its molecular weight distribution, which is not known until four hours after the batch completes. The engineers know the process intimately. They know that if Feed Rate 3 is above 2.4 L/min and Reactor 2 temperature drops below 182 °C in the third hour, the batch comes out with an excessive low-molecular-weight fraction. They carry this knowledge as intuition - patterns recognized through thousands of batches.
The problem: no engineer can hold 47 variables in their head simultaneously and optimize them jointly. They optimize heuristically - adjust one or two variables, observe the effect over several batches, adjust another. This greedy, sequential optimization explores a tiny fraction of the parameter space and converges to a local optimum that reflects decades of experience but not the global optimum. The gap between current performance and the theoretical optimum might be 5-15% in yield, energy efficiency, or quality consistency.
Reinforcement learning offers a different approach: learn a policy that maps the current process state (all 47 variables, plus historical context) directly to optimal control actions, optimizing over a multi-objective reward that captures yield, quality, energy consumption, and process stability simultaneously. The policy can consider all 47 variables jointly, exploring the high-dimensional parameter space in ways no human can. When it works, the results are striking: DeepMind's work with Google's data centers reduced cooling energy by 40%. Optimizing aluminum smelting processes with RL improved energy efficiency by 8-12%. The Freudenberg Sealing Technologies application of RL to rubber compound mixing improved first-time-right quality from 80% to 97%.
But "when it works" is doing a lot of work. RL for physical processes is hard in ways that game-playing RL is not. Mistakes cost money and can damage equipment. The environment is non-stationary. Historical data is the only safe way to learn at first. The reward signal is delayed by hours. This lesson teaches you to navigate these challenges and deploy RL in manufacturing contexts that actually matter.
Why This Exists
Why PID Controllers Are Not Enough
PID (Proportional-Integral-Derivative) control is the workhorse of industrial process control. Over 95% of industrial control loops use PID. It works well for single-input single-output (SISO) control of well-understood processes with relatively simple dynamics. The PID controller computes a correction signal as:

$$u(t) = K_p \, e(t) + K_i \int_0^t e(\tau)\, d\tau + K_d \frac{de(t)}{dt}$$

where $e(t) = r(t) - y(t)$ is the error between setpoint $r(t)$ and measurement $y(t)$.
PID has three fundamental limitations for complex manufacturing processes:
Single-loop: Standard PID handles one controlled variable at a time. A process with 10 control loops requires 10 PIDs, each tuned independently, ignoring the interactions between loops. In a chemical reactor, temperature control and pressure control are coupled - changing temperature affects pressure, requiring a coordinated response that 10 independent PIDs cannot provide.
Linear assumption: PID is designed for linear process dynamics. Real manufacturing processes are nonlinear - the gain (process response per unit of control action) changes with operating conditions. A PID tuned for nominal conditions becomes sluggish at high loads and oscillatory at low loads.
Reactive, not predictive: PID responds to errors after they occur. It cannot anticipate that a large disturbance is coming (a raw material batch change, an ambient temperature swing) and pre-emptively adjust. Model Predictive Control (MPC) addresses this, but MPC requires an accurate process model - exactly what we often do not have.
RL can address all three limitations: it naturally handles multiple inputs and outputs jointly, learns nonlinear policies from data without assuming linearity, and can implicitly learn predictive behavior by including historical context in the state representation.
Historical Context
The intersection of RL and industrial process control has a longer history than most people realize. In 1994, before the "deep learning" era, IBM demonstrated a neurofuzzy controller using RL for hot rolling mill control. The 1995 TD-Gammon paper (Tesauro) showed that RL could learn superhuman strategies in complex sequential decision problems, inspiring the first wave of industrial RL experiments.
The modern era is defined by AlphaGo (2016) and its successors demonstrating that deep RL could solve planning problems at superhuman level, followed by OpenAI's work on robotic manipulation. The industrial RL wave followed: DeepMind x Google cooling (2016), DeepMind x Genie Energy (2020), IBM industrial RL platform (2020-present), Siemens and Yokogawa process control RL pilots.
The key algorithmic developments enabling industrial RL: Proximal Policy Optimization (PPO, 2017) as a stable on-policy algorithm; Soft Actor-Critic (SAC, 2018) as a sample-efficient off-policy algorithm; Conservative Q-Learning (CQL, 2020) as the breakthrough for offline RL from historical data; and Constrained Policy Optimization (CPO, 2017) and similar safe RL methods for constraint satisfaction.
The remaining barriers to widespread industrial RL adoption: simulation fidelity (the sim-to-real gap), safety guarantees during exploration, and the cultural challenge of convincing process engineers to trust a "black box" controller.
Core Concepts
Formulating Manufacturing as an MDP
The Markov Decision Process (MDP) framework formalizes the control problem. For a manufacturing process:
State $s_t$: Everything the controller needs to observe to make an optimal decision. For a chemical reactor:
- Current sensor readings: all temperatures, pressures, flows, concentrations
- Historical context: the last N timesteps of sensor readings (captures dynamics)
- Process conditions: batch number, elapsed time, raw material batch ID
- Equipment state: maintenance flags, actuator positions
Action $a_t$: The setpoints or parameter changes the controller can make. Can be:
- Discrete: on/off decisions, recipe selection
- Continuous: setpoint adjustments (most process control applications)
- Mixed: some continuous setpoints plus discrete mode switches
Reward $r_t = r(s_t, a_t)$: A scalar signal encoding what "good" means. This is the hardest part of the MDP formulation and where the most engineering judgment is required.
Transition $P(s_{t+1} \mid s_t, a_t)$: The process dynamics. In simulation, this is the simulator. In production, it is the real process.
Episode: In batch manufacturing, an episode is one batch - from raw material loading to product discharge. In continuous processes (e.g., paper mill, oil refinery), episodes are artificial divisions of the continuous operation.
Reward Design for Manufacturing
The reward function encodes your optimization objective. Multi-objective manufacturing rewards must balance competing goals:
def manufacturing_reward(
yield_fraction: float, # 0 to 1, higher is better
quality_score: float, # 0 to 1, higher is better
energy_kwh: float, # Lower is better
off_spec_penalty: float, # 0 if in spec, large if off spec
safety_violation: bool # True if a safety limit was exceeded
) -> float:
# Safety constraint: large negative reward for violations
# This must dominate all other terms
if safety_violation:
return -100.0
# Weighted multi-objective reward
reward = (
5.0 * yield_fraction # Primary objective
+ 3.0 * quality_score # Secondary objective
- 0.5 * energy_kwh / 1000.0 # Efficiency (normalized)
- off_spec_penalty # Quality gate
)
return reward
The critical design principle: safety constraints must be inviolable, not just heavily penalized. A large negative reward still allows the policy to trade occasional safety violations for large gains elsewhere. True safety requires constraint-based approaches where the policy learns to never exceed safety limits.
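As a minimal sketch of the constraint-based alternative, the snippet below uses a Lagrangian relaxation of a constrained MDP; the class name, constraint budget, and step size are illustrative choices, not a specific library's API. The multiplier grows whenever the observed constraint cost exceeds its budget, so safety violations become progressively more expensive to the policy instead of being traded off at a fixed price.
# Sketch: Lagrangian treatment of a safety constraint (illustrative names and values).
# The policy trains on reward - lambda * constraint_cost; lambda is adapted by dual
# ascent so that the expected per-episode constraint cost stays under its budget.
class LagrangianConstraint:
    def __init__(self, constraint_limit: float = 0.0, lr: float = 0.05):
        self.constraint_limit = constraint_limit  # Allowed constraint cost per episode
        self.lr = lr                              # Dual ascent step size
        self.lagrange_multiplier = 1.0

    def penalized_reward(self, reward: float, constraint_cost: float) -> float:
        """Reward seen by the policy: base reward minus the weighted constraint cost."""
        return reward - self.lagrange_multiplier * constraint_cost

    def update(self, mean_episode_cost: float) -> None:
        """Dual ascent: raise lambda while the constraint is violated, relax it otherwise."""
        self.lagrange_multiplier = max(
            0.0,
            self.lagrange_multiplier + self.lr * (mean_episode_cost - self.constraint_limit)
        )
Methods such as CPO fold this constraint handling into the policy update itself; the external safety layer shown later in this lesson provides the complementary, deployment-time guarantee.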
Offline RL: Learning from Historical Data
The safest path to deploying RL in manufacturing is offline RL - learning a policy entirely from historical process data, without any online interaction with the real plant. The historical data contains (state, action, reward, next_state) transitions from past operations. The offline RL algorithm learns to extract the best policy implicit in this historical data.
The core challenge in offline RL is distributional shift: the policy may want to take actions in regions of the state space where historical data is sparse. The Q-function learned from offline data will be poorly calibrated in these regions - it may extrapolate to overestimate the value of actions that were never tried. CQL (Conservative Q-Learning) addresses this by adding a regularization term to the Q-learning objective that penalizes high Q-values for actions not in the dataset:

$$\min_Q \; \alpha \left( \mathbb{E}_{s \sim \mathcal{D}}\left[\log \sum_{a} \exp Q(s, a)\right] - \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[Q(s, a)\right] \right) + \tfrac{1}{2}\, \mathbb{E}_{(s, a, s') \sim \mathcal{D}}\left[\left(Q(s, a) - \mathcal{B}^{\pi}\hat{Q}(s, a)\right)^2\right]$$

where $\alpha$ controls the strength of the conservative penalty and the second term is the standard Bellman error.
The conservative penalty pushes down Q-values for unseen actions, ensuring the policy stays in the historical data support.
Code Examples
1. Manufacturing Environment as OpenAI Gym
"""
OpenAI Gym environment wrapping a manufacturing process.
This is the critical interface between the RL algorithm and the process.
Can wrap:
- A physics simulation (for training)
- A digital twin (for testing)
- The real process (for online RL after extensive offline testing)
"""
import numpy as np
import gym
from gym import spaces
from typing import Dict, Tuple, Optional
class ChemicalReactorEnv(gym.Env):
"""
Continuous stirred tank reactor (CSTR) RL environment.
State: [temperature, concentration, pressure, flow_rate,
temperature_5min_ago, concentration_5min_ago,
elapsed_time_fraction, setpoint_temperature]
Action: [delta_coolant_flow, delta_feed_rate] (continuous, bounded)
Reward: yield * quality_score - energy_penalty - safety_penalty
Episode: one batch (100 timesteps = 500 minutes at 5 min/step)
"""
# State bounds
TEMP_MIN, TEMP_MAX = 150.0, 250.0 # Celsius
CONC_MIN, CONC_MAX = 0.0, 2.0 # mol/L
PRESSURE_MIN, PRESSURE_MAX = 1.0, 5.0 # bar
FLOW_MIN, FLOW_MAX = 0.5, 5.0 # L/min
# Safety limits (hard constraints)
TEMP_SAFETY_MAX = 240.0 # Emergency shutdown above this
PRESSURE_SAFETY_MAX = 4.5 # Pressure relief valve
# Action bounds (rate-limited setpoint changes per step)
ACTION_COOLANT_DELTA = 0.2 # L/min per step
ACTION_FEED_DELTA = 0.1 # L/min per step
def __init__(self, simulation_model=None, seed: Optional[int] = None):
super().__init__()
self.simulation_model = simulation_model # Digital twin or physics sim
        self.rng = np.random.default_rng(seed)  # Persistent RNG so successive resets differ
self.episode_step = 0
self.max_steps = 100
# State space: 8 dimensional
n_state = 8
self.observation_space = spaces.Box(
low=np.array([self.TEMP_MIN, self.CONC_MIN, self.PRESSURE_MIN,
self.FLOW_MIN, self.TEMP_MIN, self.CONC_MIN, 0.0, self.TEMP_MIN]),
high=np.array([self.TEMP_MAX, self.CONC_MAX, self.PRESSURE_MAX,
self.FLOW_MAX, self.TEMP_MAX, self.CONC_MAX, 1.0, self.TEMP_MAX]),
dtype=np.float32
)
# Action space: 2-dimensional continuous
self.action_space = spaces.Box(
low=np.array([-self.ACTION_COOLANT_DELTA, -self.ACTION_FEED_DELTA]),
high=np.array([self.ACTION_COOLANT_DELTA, self.ACTION_FEED_DELTA]),
dtype=np.float32
)
# Current state variables
self._state = None
self._prev_state = None
self._coolant_flow = 2.0 # Initial coolant flow
self._feed_rate = 2.0 # Initial feed rate
def _get_observation(self) -> np.ndarray:
"""Construct observation vector from current state."""
elapsed_fraction = self.episode_step / self.max_steps
obs = np.array([
self._state["temperature"],
self._state["concentration"],
self._state["pressure"],
self._state["flow_rate"],
self._prev_state["temperature"] if self._prev_state else self._state["temperature"],
self._prev_state["concentration"] if self._prev_state else self._state["concentration"],
elapsed_fraction,
self._state.get("setpoint_temperature", 190.0) # Target temperature
], dtype=np.float32)
return obs
def _compute_reward(self) -> Tuple[float, dict]:
"""
Multi-objective reward for the CSTR.
Primary objective: maximize yield (concentration of product)
Secondary: maintain quality (narrow molecular weight distribution)
Penalty: energy consumption, off-spec conditions, safety violations
"""
temp = self._state["temperature"]
conc = self._state["concentration"]
target_temp = self._state.get("setpoint_temperature", 190.0)
# Safety violations - must trigger episode termination
safety_violated = (
temp > self.TEMP_SAFETY_MAX or
self._state["pressure"] > self.PRESSURE_SAFETY_MAX
)
if safety_violated:
return -50.0, {"safety_violation": True, "yield": 0, "quality": 0}
        # Yield reward: product concentration above the 0.5 mol/L starting level, scaled to [0, 3]
yield_reward = max(0, conc - 0.5) * 2.0
# Quality: penalty for temperature deviation from setpoint
temp_deviation = abs(temp - target_temp)
quality_reward = max(0, 1.0 - temp_deviation / 20.0)
# Energy penalty: proportional to coolant flow (higher flow = more energy)
energy_penalty = 0.1 * self._coolant_flow
# Off-spec penalty: if temperature too far from setpoint
off_spec_penalty = 2.0 if temp_deviation > 15.0 else 0.0
total_reward = (
3.0 * yield_reward
+ 2.0 * quality_reward
- energy_penalty
- off_spec_penalty
)
info = {
"safety_violation": False,
"yield": round(yield_reward, 3),
"quality": round(quality_reward, 3),
"energy_penalty": round(energy_penalty, 3),
"temperature": round(temp, 2),
"concentration": round(conc, 3)
}
return float(total_reward), info
def reset(self) -> np.ndarray:
"""Reset to initial state at start of new batch."""
self.episode_step = 0
self._prev_state = None
# Randomize initial conditions slightly to improve generalization
        rng = self.rng
self._state = {
"temperature": 180.0 + rng.normal(0, 2),
"concentration": 0.5 + rng.normal(0, 0.05),
"pressure": 2.0 + rng.normal(0, 0.1),
"flow_rate": 2.0,
"setpoint_temperature": 190.0 + rng.choice([-5, 0, 5])
}
self._coolant_flow = 2.0
self._feed_rate = 2.0
return self._get_observation()
def step(self, action: np.ndarray) -> Tuple[np.ndarray, float, bool, dict]:
"""
Execute one control step.
action: [delta_coolant_flow, delta_feed_rate]
"""
# Apply action (rate-limited)
self._coolant_flow = np.clip(
self._coolant_flow + float(action[0]),
self.FLOW_MIN, self.FLOW_MAX
)
self._feed_rate = np.clip(
self._feed_rate + float(action[1]),
self.FLOW_MIN, self.FLOW_MAX
)
self._prev_state = self._state.copy()
# Step the simulation model
if self.simulation_model is not None:
next_state = self.simulation_model.step(
self._state,
coolant_flow=self._coolant_flow,
feed_rate=self._feed_rate
)
else:
# Simple placeholder dynamics
next_state = self._simple_dynamics()
self._state = next_state
self.episode_step += 1
reward, info = self._compute_reward()
done = (
self.episode_step >= self.max_steps
or info.get("safety_violation", False)
)
return self._get_observation(), reward, done, info
def _simple_dynamics(self) -> dict:
"""
Simplified reactor dynamics for demonstration.
Replace with actual digital twin or physics simulation in production.
"""
dt = 5.0 # minutes per timestep
temp = self._state["temperature"]
conc = self._state["concentration"]
# Heat balance: coolant flow reduces temperature
dT = (
0.1 * self._feed_rate # Feed heats reactor
- 0.3 * self._coolant_flow # Coolant cools reactor
+ np.random.normal(0, 0.5) # Process noise
)
# Mass balance: reaction consumes feed
dC = (
0.05 * self._feed_rate # Feed increases concentration
- 0.02 * conc * np.exp(-5000 / (temp + 273)) # Arrhenius reaction rate
+ np.random.normal(0, 0.02)
)
return {
"temperature": np.clip(temp + dT * dt/60, self.TEMP_MIN, self.TEMP_MAX),
"concentration": np.clip(conc + dC * dt/60, self.CONC_MIN, self.CONC_MAX),
"pressure": np.clip(
self._state["pressure"] + 0.01 * dT + np.random.normal(0, 0.05),
self.PRESSURE_MIN, self.PRESSURE_MAX
),
"flow_rate": self._feed_rate,
"setpoint_temperature": self._state["setpoint_temperature"]
}
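A minimal training sketch for this environment, assuming a stable-baselines3 version compatible with the classic Gym API used above; the hyperparameters, timestep budget, and save path are illustrative rather than tuned. With simulation_model=None the agent trains against the placeholder dynamics; in practice you would pass the digital twin.
# Sketch: train a SAC policy on the simulated reactor (assumes stable-baselines3).
from stable_baselines3 import SAC

env = ChemicalReactorEnv(simulation_model=None, seed=42)
model = SAC(
    "MlpPolicy",
    env,
    learning_rate=3e-4,
    buffer_size=100_000,
    batch_size=256,
    verbose=1,
)
model.learn(total_timesteps=200_000)  # ~2,000 simulated batches at 100 steps each
model.save("sac_reactor_policy")      # Path referenced in compare_controllers() below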
2. PID Baseline vs RL Policy Comparison
"""
PID controller as baseline for comparison with RL policy.
Always compare RL against the existing control system before claiming improvement.
"""
import numpy as np
from typing import List, Tuple
class PIDController:
"""
Simple PID controller for single-loop process control.
Used as baseline for comparison with RL policies.
"""
def __init__(
self,
Kp: float = 1.0,
Ki: float = 0.1,
Kd: float = 0.01,
setpoint: float = 190.0,
output_limits: Tuple[float, float] = (-0.2, 0.2),
dt: float = 1.0
):
self.Kp = Kp
self.Ki = Ki
self.Kd = Kd
self.setpoint = setpoint
self.output_limits = output_limits
self.dt = dt
self._integral = 0.0
self._prev_error = 0.0
def compute(self, measurement: float) -> float:
"""Compute control action for current measurement."""
error = self.setpoint - measurement
self._integral += error * self.dt
# Anti-windup: clip integral term
self._integral = np.clip(self._integral, -10.0, 10.0)
derivative = (error - self._prev_error) / self.dt
self._prev_error = error
output = self.Kp * error + self.Ki * self._integral + self.Kd * derivative
output = np.clip(output, *self.output_limits)
return float(output)
def reset(self):
self._integral = 0.0
self._prev_error = 0.0
def run_evaluation_episode(
env,
controller,
n_episodes: int = 20
) -> dict:
"""
Evaluate a controller (PID or RL policy) over N episodes.
Returns average metrics for comparison.
"""
episode_rewards = []
episode_yields = []
episode_safety_violations = []
episode_quality_scores = []
for episode in range(n_episodes):
obs = env.reset()
total_reward = 0.0
total_yield = 0.0
safety_violations = 0
quality_scores = []
done = False
while not done:
if hasattr(controller, "predict"):
# RL policy (stable-baselines3 format)
action, _ = controller.predict(obs, deterministic=True)
elif hasattr(controller, "compute"):
# PID controller - only controls temperature
temp = obs[0] # Temperature is first state element
coolant_delta = controller.compute(temp)
action = np.array([coolant_delta, 0.0]) # PID only adjusts coolant
else:
raise ValueError("Unknown controller type")
obs, reward, done, info = env.step(action)
total_reward += reward
total_yield += info.get("yield", 0)
quality_scores.append(info.get("quality", 0))
if info.get("safety_violation", False):
safety_violations += 1
episode_rewards.append(total_reward)
episode_yields.append(total_yield)
episode_safety_violations.append(safety_violations)
episode_quality_scores.append(np.mean(quality_scores))
return {
"mean_reward": np.mean(episode_rewards),
"std_reward": np.std(episode_rewards),
"mean_yield": np.mean(episode_yields),
"mean_quality": np.mean(episode_quality_scores),
"safety_violation_rate": np.mean(episode_safety_violations) / 100, # Per step
"n_episodes": n_episodes
}
def compare_controllers(env):
"""Compare PID and RL policy side by side."""
    # from stable_baselines3 import SAC  # Uncomment when loading a trained policy below
# Baseline: PID controller
pid = PIDController(Kp=2.0, Ki=0.3, Kd=0.1, setpoint=190.0)
pid_metrics = run_evaluation_episode(env, pid)
print("\nPID Controller:")
for k, v in pid_metrics.items():
print(f" {k}: {v:.4f}")
# RL Policy: load trained SAC policy
# model = SAC.load("sac_reactor_policy")
# rl_metrics = run_evaluation_episode(env, model)
# print("\nRL Policy (SAC):")
# for k, v in rl_metrics.items():
# print(f" {k}: {v:.4f}")
3. Offline RL Training with CQL
"""
Conservative Q-Learning (CQL) for offline RL from historical process data.
CQL is the go-to algorithm when you want to learn from existing
operational data without interacting with the real plant.
Dataset format: (state, action, reward, next_state, done) transitions
from historical process logs.
"""
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from typing import Tuple, List
class TransitionDataset(Dataset):
"""
Dataset of process state transitions from historical operational data.
"""
def __init__(
self,
states: np.ndarray,
actions: np.ndarray,
rewards: np.ndarray,
next_states: np.ndarray,
dones: np.ndarray
):
self.states = torch.FloatTensor(states)
self.actions = torch.FloatTensor(actions)
self.rewards = torch.FloatTensor(rewards).unsqueeze(1)
self.next_states = torch.FloatTensor(next_states)
self.dones = torch.FloatTensor(dones).unsqueeze(1)
def __len__(self):
return len(self.states)
def __getitem__(self, idx):
return (
self.states[idx],
self.actions[idx],
self.rewards[idx],
self.next_states[idx],
self.dones[idx]
)
class QNetwork(nn.Module):
"""
Q-network for continuous actions: Q(s, a) -> scalar value.
"""
def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
super().__init__()
self.net = nn.Sequential(
nn.Linear(state_dim + action_dim, hidden_dim),
nn.LayerNorm(hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.LayerNorm(hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, 1)
)
def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
return self.net(torch.cat([state, action], dim=-1))
class PolicyNetwork(nn.Module):
"""
Deterministic policy: pi(s) -> action.
For stochastic policy, output mean and log_std for SAC.
"""
def __init__(
self,
state_dim: int,
action_dim: int,
hidden_dim: int = 256,
action_scale: float = 1.0
):
super().__init__()
self.net = nn.Sequential(
nn.Linear(state_dim, hidden_dim),
nn.LayerNorm(hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, hidden_dim),
nn.LayerNorm(hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, action_dim),
nn.Tanh() # Bound actions to [-1, 1]
)
self.action_scale = action_scale
def forward(self, state: torch.Tensor) -> torch.Tensor:
return self.net(state) * self.action_scale
class CQLAgent:
"""
Conservative Q-Learning for offline RL.
Based on "Conservative Q-Learning for Offline Reinforcement Learning"
(Kumar et al., NeurIPS 2020)
"""
def __init__(
self,
state_dim: int,
action_dim: int,
alpha: float = 1.0, # CQL regularization strength
gamma: float = 0.99, # Discount factor
tau: float = 0.005, # Soft target update rate
lr: float = 3e-4,
device: str = "cpu"
):
self.state_dim = state_dim
self.action_dim = action_dim
self.alpha = alpha # Conservative regularization weight
self.gamma = gamma
self.tau = tau
self.device = device
# Q-networks (two for double Q-learning)
self.q1 = QNetwork(state_dim, action_dim).to(device)
self.q2 = QNetwork(state_dim, action_dim).to(device)
self.q1_target = QNetwork(state_dim, action_dim).to(device)
self.q2_target = QNetwork(state_dim, action_dim).to(device)
self.q1_target.load_state_dict(self.q1.state_dict())
self.q2_target.load_state_dict(self.q2.state_dict())
# Policy
self.policy = PolicyNetwork(state_dim, action_dim).to(device)
# Optimizers
self.q_optimizer = torch.optim.Adam(
list(self.q1.parameters()) + list(self.q2.parameters()), lr=lr
)
self.policy_optimizer = torch.optim.Adam(
self.policy.parameters(), lr=lr
)
def _cql_loss(
self,
states: torch.Tensor,
actions: torch.Tensor,
q_network: QNetwork
) -> torch.Tensor:
"""
Conservative penalty: penalize Q-values for random actions not in dataset.
CQL loss = E_{s}[log(sum_a exp(Q(s,a))) - E_{a~data}[Q(s,a)]]
This pushes down Q-values for out-of-distribution actions.
"""
batch_size = states.shape[0]
n_random = 10 # Number of random actions to sample
# Sample random actions not in the dataset
random_actions = torch.FloatTensor(
batch_size, n_random, self.action_dim
).uniform_(-1, 1).to(self.device)
# Expand states to match n_random dimension
states_expanded = states.unsqueeze(1).expand(
-1, n_random, -1
).reshape(-1, self.state_dim)
random_actions_flat = random_actions.reshape(-1, self.action_dim)
# Q-values for random actions
q_random = q_network(states_expanded, random_actions_flat)
q_random = q_random.reshape(batch_size, n_random)
# Q-values for actions in dataset
q_data = q_network(states, actions)
# CQL conservative penalty
cql_loss = (
torch.logsumexp(q_random, dim=1, keepdim=True).mean()
- q_data.mean()
)
return cql_loss
def train_step(
self,
batch: Tuple[torch.Tensor, ...]
) -> dict:
"""Single training step."""
states, actions, rewards, next_states, dones = batch
states = states.to(self.device)
actions = actions.to(self.device)
rewards = rewards.to(self.device)
next_states = next_states.to(self.device)
dones = dones.to(self.device)
# Target Q-values
with torch.no_grad():
next_actions = self.policy(next_states)
q1_target = self.q1_target(next_states, next_actions)
q2_target = self.q2_target(next_states, next_actions)
q_target = rewards + self.gamma * (1 - dones) * torch.min(q1_target, q2_target)
# Bellman error
q1_pred = self.q1(states, actions)
q2_pred = self.q2(states, actions)
bellman_loss = F.mse_loss(q1_pred, q_target) + F.mse_loss(q2_pred, q_target)
# CQL conservative loss
cql_loss = (
self._cql_loss(states, actions, self.q1)
+ self._cql_loss(states, actions, self.q2)
)
total_q_loss = bellman_loss + self.alpha * cql_loss
self.q_optimizer.zero_grad()
total_q_loss.backward()
torch.nn.utils.clip_grad_norm_(
list(self.q1.parameters()) + list(self.q2.parameters()), 1.0
)
self.q_optimizer.step()
# Policy update: maximize Q
policy_actions = self.policy(states)
policy_loss = -self.q1(states, policy_actions).mean()
self.policy_optimizer.zero_grad()
policy_loss.backward()
self.policy_optimizer.step()
# Soft target update
for target, source in [(self.q1_target, self.q1), (self.q2_target, self.q2)]:
for t_param, s_param in zip(target.parameters(), source.parameters()):
t_param.data.copy_(
self.tau * s_param.data + (1 - self.tau) * t_param.data
)
return {
"bellman_loss": bellman_loss.item(),
"cql_loss": cql_loss.item(),
"policy_loss": policy_loss.item()
}
def act(self, state: np.ndarray) -> np.ndarray:
"""Get action from policy for deployment."""
state_tensor = torch.FloatTensor(state).unsqueeze(0).to(self.device)
with torch.no_grad():
action = self.policy(state_tensor).cpu().numpy().squeeze()
return action
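To tie the pieces together, here is a minimal offline training loop using the classes above; it assumes the historical process logs have already been flattened into NumPy arrays (the array names, shapes, and epoch count are illustrative).
# Sketch: offline CQL training from historical transitions (illustrative shapes).
# states: (N, 8), actions: (N, 2), rewards: (N,), next_states: (N, 8), dones: (N,)
def train_cql_from_logs(states, actions, rewards, next_states, dones,
                        n_epochs: int = 50, batch_size: int = 256) -> CQLAgent:
    dataset = TransitionDataset(states, actions, rewards, next_states, dones)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)
    agent = CQLAgent(state_dim=states.shape[1], action_dim=actions.shape[1], alpha=1.0)
    for epoch in range(n_epochs):
        metrics = [agent.train_step(batch) for batch in loader]
        print(f"epoch {epoch}: "
              f"bellman={np.mean([m['bellman_loss'] for m in metrics]):.3f}, "
              f"cql={np.mean([m['cql_loss'] for m in metrics]):.3f}")
    return agent
# The trained agent's act() method is what the SafetyLayer and shadow-mode harness
# in the next example wrap before any real deployment.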
4. Safety Constraint Enforcement
"""
Safety layer for RL policy deployment.
In manufacturing, certain actions are unsafe regardless of what the
policy recommends. The safety layer wraps the RL policy and modifies
actions to ensure hard constraints are always satisfied.
This is the "safety layer" approach from Dalal et al. (2018).
"""
import numpy as np
from typing import Optional
class SafetyLayer:
"""
Projects RL policy actions onto the safe feasible set.
Safety constraints are expressed as linear inequalities:
g_i(s, a) <= 0 for all i
The safety layer solves a QP to find the nearest safe action to
the policy's recommended action.
"""
def __init__(
self,
# Temperature constraint: action must not raise temp above safety_max
temp_safety_max: float = 235.0,
temp_safety_min: float = 155.0,
# Rate-of-change limits (prevents mechanical stress from rapid changes)
max_coolant_delta_per_step: float = 0.15,
max_feed_delta_per_step: float = 0.08
):
self.temp_safety_max = temp_safety_max
self.temp_safety_min = temp_safety_min
self.max_coolant_delta = max_coolant_delta_per_step
self.max_feed_delta = max_feed_delta_per_step
def is_safe(self, observation: np.ndarray, action: np.ndarray) -> bool:
"""Check if an action is safe given current state."""
temp = float(observation[0])
delta_coolant = float(action[0])
delta_feed = float(action[1])
# Rate-of-change limits
if abs(delta_coolant) > self.max_coolant_delta:
return False
if abs(delta_feed) > self.max_feed_delta:
return False
# Temperature projection (approximate: more coolant = lower temp)
projected_temp = temp - delta_coolant * 5.0 # Simplified model
if projected_temp > self.temp_safety_max:
return False
if projected_temp < self.temp_safety_min:
return False
return True
def project_to_safe(
self,
observation: np.ndarray,
action: np.ndarray
) -> np.ndarray:
"""
Project action to nearest feasible safe action.
Simple version: clip each constraint independently.
Full version: solve QP for joint constraint satisfaction.
"""
safe_action = action.copy()
temp = float(observation[0])
# Rate-of-change clipping
safe_action[0] = np.clip(safe_action[0], -self.max_coolant_delta, self.max_coolant_delta)
safe_action[1] = np.clip(safe_action[1], -self.max_feed_delta, self.max_feed_delta)
# Temperature safety: if action would drive temp too high, force more cooling
projected_temp = temp - safe_action[0] * 5.0
if projected_temp > self.temp_safety_max:
required_coolant_increase = (projected_temp - self.temp_safety_max) / 5.0
safe_action[0] = min(
safe_action[0] + required_coolant_increase,
self.max_coolant_delta
)
return safe_action
def __call__(
self,
policy,
observation: np.ndarray
) -> np.ndarray:
"""
Get safe action: run policy, then project to safe set.
"""
if hasattr(policy, "predict"):
action, _ = policy.predict(observation, deterministic=True)
elif callable(policy):
action = policy(observation)
else:
raise ValueError("Policy must have predict() method or be callable")
if self.is_safe(observation, action):
return action
else:
return self.project_to_safe(observation, action)
class ShadowModeDeployment:
"""
Shadow mode: run the RL policy in parallel with the existing control system.
Log what it would have done without actually acting.
This is how you validate an RL policy before giving it control:
- Run shadow mode for 4-8 weeks
- Compare RL recommendations vs actual actions
- Compare simulated outcomes vs actual outcomes
- Only go live when confidence is established
"""
def __init__(self, rl_policy, existing_controller, safety_layer: SafetyLayer):
self.rl_policy = rl_policy
self.existing_controller = existing_controller
self.safety_layer = safety_layer
self.shadow_log = []
def step(
self,
observation: np.ndarray,
actual_reward: Optional[float] = None
) -> np.ndarray:
"""
Run both controllers. Use existing controller output.
Log RL recommendation for analysis.
"""
# What RL would recommend (through safety layer)
rl_action = self.safety_layer(self.rl_policy, observation)
# What the existing controller actually does
if hasattr(self.existing_controller, "compute"):
temp = float(observation[0])
coolant_delta = self.existing_controller.compute(temp)
actual_action = np.array([coolant_delta, 0.0])
else:
actual_action, _ = self.existing_controller.predict(observation)
# Log for comparison
self.shadow_log.append({
"observation": observation.tolist(),
"rl_action": rl_action.tolist(),
"actual_action": actual_action.tolist(),
"action_divergence": float(np.linalg.norm(rl_action - actual_action)),
"actual_reward": actual_reward
})
# Return actual action (existing controller has real control)
return actual_action
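A short usage sketch of the shadow-mode harness against the reactor environment; the divergence threshold and episode count are illustrative. The existing PID keeps control throughout, while the RL recommendation is logged at every step for later review.
# Sketch: run shadow mode for a few simulated batches and summarize divergence.
def run_shadow_mode(env, shadow: ShadowModeDeployment, n_episodes: int = 5) -> None:
    for _ in range(n_episodes):
        obs = env.reset()
        done = False
        while not done:
            action = shadow.step(obs)                        # Existing controller stays in charge
            obs, reward, done, info = env.step(action)
            shadow.shadow_log[-1]["actual_reward"] = reward  # Attach the realized reward
    divergences = np.array([entry["action_divergence"] for entry in shadow.shadow_log])
    print(f"mean action divergence: {divergences.mean():.3f}, "
          f"steps with large disagreement (>0.1): {(divergences > 0.1).mean():.1%}")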
System Architecture
Production Engineering Notes
The Safety Challenge is Not Algorithmic - It's Organizational
Safe RL algorithms (CPO, CMDP, safety layers) are well-developed technically. The harder challenge is organizational: convincing a plant manager to hand over control of a $50M reactor to a neural network. This requires a systematic trust-building process.
Start with shadow mode for a full production cycle (typically 4-8 weeks). Log every action the RL policy would have taken versus what the existing control system did. Simulate the outcome of the RL actions using the digital twin. Present the comparison to process engineers: "In the last 6 weeks, the RL policy would have taken these different actions. Our simulation shows this would have improved yield by 3.2% with zero additional safety events." This builds the case.
Then do a graduated deployment: RL controls one or two less critical loops first, with the safety layer and human override always available. Expand RL control incrementally as confidence is established. Never deploy RL on the first attempt without extensive shadow testing and a clear rollback procedure.
Reward Hacking and Specification Gaming
RL policies famously find unexpected ways to maximize the reward function that do not align with the intended objective. A famous example: a boat racing game where the boat learned to drive in circles scoring points from bonus items rather than finishing the race. In manufacturing, this manifests as: a policy that achieves high yield in the simulation but does so by pushing operating conditions to extremes that are unrealistic or that the simulation models poorly.
The defense: (1) Use diverse simulation scenarios that span a wide range of operating conditions. (2) Validate the learned policy extensively against the digital twin before plant deployment. (3) Include a conservative penalty in the reward that discourages extreme actions. (4) Use offline RL (CQL) - by staying close to the historical data distribution, the policy is more likely to behave in ways that have been observed to work in the real plant.
:::warning Delayed Reward in Batch Manufacturing
In batch manufacturing, you typically do not know if the batch was good until analysis is complete - 2-6 hours after the batch ends. This delayed reward complicates RL training significantly. The policy must learn to associate control actions taken at hour 2 of the batch with a quality measurement received at hour 8. Use temporal credit assignment approaches: shape the reward to provide intermediate feedback (temperature control, concentration tracking), and use the final batch quality as a terminal reward. Monte Carlo return estimation (computing returns from the end of the episode backward) is more stable than TD-learning for long delayed-reward horizons.
:::
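A minimal sketch of the backward return computation described above; the discount factor and the way the lab result is credited to the final step are illustrative choices.
# Sketch: Monte Carlo returns for a batch whose quality reward arrives only at the end.
# step_rewards come from intermediate shaping (temperature/concentration tracking);
# terminal_quality_reward is the delayed lab measurement.
import numpy as np

def batch_returns(step_rewards: np.ndarray,
                  terminal_quality_reward: float,
                  gamma: float = 0.99) -> np.ndarray:
    rewards = step_rewards.astype(float).copy()
    rewards[-1] += terminal_quality_reward   # Credit the lab result to the final step
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):  # Backward pass: G_t = r_t + gamma * G_{t+1}
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns                           # Per-step training targets for the batch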
:::danger Never Deploy RL Directly to a Physical Plant Without Shadow Testing
The cost of a bad RL decision in a manufacturing plant is not a game loss - it is a damaged reactor, an off-spec batch, or a safety event. Any new RL policy must go through: (1) simulation validation, (2) digital twin evaluation, (3) shadow mode operation for at least 4 weeks, (4) limited deployment on non-critical control loops, (5) full deployment with manual override always available. Never skip steps. A policy that achieves 95% of its simulation performance on the real plant is a success. A policy that achieves negative results is a production incident and a setback for RL adoption that may take years to recover from.
:::
Interview Questions and Answers
Q1: Why is offline RL preferred over online RL for initial manufacturing deployments?
Online RL learns by interacting with the real environment, which means it must take exploratory actions to discover the reward landscape. In a manufacturing context, exploratory actions include actions that have never been tried and may cause process upsets, equipment damage, or safety events. The sample efficiency of online RL algorithms (even with millions of gradient steps) is far too low for a process where each "step" represents minutes or hours of real plant operation. Offline RL learns entirely from historical operational data - the logged records of (state, action, outcome) from the existing control system. No plant interaction is required. The resulting policy is limited to the behavioral envelope covered by historical data (hence the conservative regularization in CQL), but this is exactly what you want for initial deployment: a policy that improves on the historical operational strategy without venturing into unexplored territory. After the policy is validated in shadow mode and initial deployment, online fine-tuning can extend the policy beyond the historical data support with proper safety mechanisms.
Q2: How do you formulate a meaningful reward function for a multi-objective manufacturing process?
The reward function is where domain expertise meets ML engineering. Start by talking to process engineers and plant managers: what are the key performance indicators? Typically yield, quality, energy consumption, and on-spec rate. Then establish the priority ordering: in most manufacturing contexts, safety constraints are absolute (not just heavily penalized), quality comes before yield, yield before energy. Translate priorities into reward weights: quality failures should generate 5-10x the magnitude of equivalent yield gains. For energy, normalize carefully - raw energy in kWh may be orders of magnitude different from yield fraction (0-1), so normalize energy to the same scale. Test the reward function with a simple baseline policy (PID) to verify that the reward signal correctly rewards better PID tuning - if a better PID does not get higher reward, the reward function is wrong. Finally, add a conservative regularizer or clipping that penalizes extreme actions regardless of their projected reward - this prevents early policy pathologies during training.
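A quick sanity check of that last point, reusing the PIDController and run_evaluation_episode helpers from the code examples above; the two tunings are illustrative and the check is statistical rather than exact.
# Sketch: verify the reward function ranks a reasonable PID above a sluggish one.
env = ChemicalReactorEnv()
good_pid = PIDController(Kp=2.0, Ki=0.3, Kd=0.1, setpoint=190.0)
sluggish_pid = PIDController(Kp=0.2, Ki=0.01, Kd=0.0, setpoint=190.0)
good_metrics = run_evaluation_episode(env, good_pid)
sluggish_metrics = run_evaluation_episode(env, sluggish_pid)
# If the reward is well specified, the better controller should score higher on average.
if good_metrics["mean_reward"] <= sluggish_metrics["mean_reward"]:
    print("Warning: reward does not favor better control - revisit the weights")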
Q3: What is the sim-to-real gap in manufacturing RL, and how do you minimize it?
The sim-to-real gap is the discrepancy between a policy's performance in simulation and its performance on the real process. In manufacturing, the sources are: unmodeled nonlinearities (the simulation uses a simplified reaction kinetics model; the real process has dozens of side reactions), parameter uncertainty (material properties, heat transfer coefficients are known only approximately), environmental disturbances (ambient temperature, inlet stream composition variability), and equipment degradation (pump curves change as impellers wear). Mitigation strategies: (1) Domain randomization - during simulation training, randomize process parameters within their uncertainty range. The policy learns to be robust to parameter variation. (2) Residual modeling - fit a data-driven residual model to the mismatch between simulation and historical plant data, add it to the simulation. (3) System identification - periodically re-identify the process model from recent operational data and update the simulation. (4) Conservative policy training - use offline RL on historical data rather than simulation-based RL; this automatically ensures the policy only recommends actions that have been observed to work in the real process.
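A minimal sketch of the first mitigation, domain randomization, applied to the reactor environment from the code examples; the parameter ranges and the choice of which coefficients to randomize are illustrative.
# Sketch: domain-randomized reactor - uncertain process gains are resampled every episode.
import numpy as np

class DomainRandomizedReactor(ChemicalReactorEnv):
    def reset(self) -> np.ndarray:
        # +/-20% uncertainty on coolant effectiveness, +/-10% on feed heating.
        self._coolant_gain = 0.3 * np.random.uniform(0.8, 1.2)
        self._feed_gain = 0.1 * np.random.uniform(0.9, 1.1)
        return super().reset()

    def _simple_dynamics(self) -> dict:
        # Same structure as the base placeholder dynamics, with randomized gains
        # in the heat balance so the policy cannot overfit one fixed parameter set.
        dt = 5.0
        temp = self._state["temperature"]
        conc = self._state["concentration"]
        dT = (self._feed_gain * self._feed_rate
              - self._coolant_gain * self._coolant_flow
              + np.random.normal(0, 0.5))
        dC = (0.05 * self._feed_rate
              - 0.02 * conc * np.exp(-5000 / (temp + 273))
              + np.random.normal(0, 0.02))
        return {
            "temperature": np.clip(temp + dT * dt / 60, self.TEMP_MIN, self.TEMP_MAX),
            "concentration": np.clip(conc + dC * dt / 60, self.CONC_MIN, self.CONC_MAX),
            "pressure": np.clip(self._state["pressure"] + 0.01 * dT + np.random.normal(0, 0.05),
                                self.PRESSURE_MIN, self.PRESSURE_MAX),
            "flow_rate": self._feed_rate,
            "setpoint_temperature": self._state["setpoint_temperature"],
        }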
Q4: How does Conservative Q-Learning prevent extrapolation errors in offline RL?
In standard Q-learning, the policy maximizes the learned Q-function. If the Q-function is poorly estimated for actions not seen in the dataset (out-of-distribution actions), the policy may recommend these actions because it overestimates their value. CQL adds a regularization term to the Q-learning objective that explicitly penalizes high Q-values for out-of-distribution actions. Specifically, it minimizes Q-values for randomly sampled actions while maximizing Q-values for actions that actually appear in the dataset. This creates a Q-function that is conservative: it underestimates the value of unseen actions rather than overestimating. The resulting policy, which maximizes this conservative Q-function, naturally avoids recommending actions that were not in the historical data. The tradeoff: the policy may be suboptimal in regions where better actions exist but were never tried. This conservatism is a feature for manufacturing deployment, where we want the policy to stay close to historically validated operations.
Q5: What is the shadow mode validation process and why is it necessary?
Shadow mode is the practice of running a new control policy in parallel with the existing control system, logging what the new policy would have done, but not actually letting it act. The existing controller retains actual control. Shadow mode serves several purposes. First, it builds a record of the new policy's behavior under real operating conditions - conditions that may not have been represented in training data or simulation. Second, it allows process engineers to review the policy's recommendations and build intuition for when it agrees with and diverges from their own judgment. Third, it enables quantitative comparison: simulate the outcome of the policy's recommended actions using the digital twin and compare to actual outcomes. Fourth, it detects policy failures safely - a policy that would have caused a process upset in shadow mode has no real consequences, but alerts you to a problem before deployment. In practice, 4-8 weeks of shadow mode is a minimum for a new manufacturing RL policy, covering diverse operating conditions (different product grades, raw material batches, seasonal ambient conditions). The output is a validation report that either approves the policy for graduated deployment or identifies specific failure modes to fix.
Key Takeaways
Reinforcement learning for manufacturing process optimization follows a disciplined path: formulate the process as an MDP with carefully designed rewards that reflect real operational priorities, start with offline RL (CQL) from historical data to avoid risky exploration, validate extensively in simulation and digital twin, deploy through shadow mode before giving the policy real control, and maintain a safety layer that enforces hard constraints regardless of policy recommendation. The organizational challenge - building trust with process engineers and plant managers - is at least as hard as the technical challenge. Shadow mode reporting and graduated deployment are the tools that bridge the trust gap. Done correctly, RL unlocks multi-variable optimization capabilities that exceed what any human operator or classical control system can achieve.
