How RL enables autonomous AI agents: ReAct, tool use, MCTS planning, AlphaCode, SWE-bench, and the emerging agent-RL paradigm powering Claude, GPT-4o, and Gemini.

RL for AI Agents - Teaching Models to Act in the World

Reading time: ~40 minutes | Level: Reinforcement Learning | Role: MLE, AI Research Engineer, MLOps

The Real Engineering Moment

The year is 2024 and a benchmark called SWE-bench is becoming the standard measure of AI capability for software engineering. SWE-bench contains 2,294 real GitHub issues from popular Python repositories - Django, Flask, scikit-learn, sympy. Each issue requires an agent to read the repository, understand the codebase, identify the bug, write a fix, and verify that the fix passes all tests. The agent has access to file read/write, bash execution, and a Python interpreter - the same tools a human developer uses.

Claude 3.5 Sonnet scores 49% on SWE-bench Verified in October 2024. The number matters less than what it represents: a language model is reading Python source code, running test suites, observing failure messages, editing source files, re-running tests, observing the new output, and iterating. It is not just generating text. It is acting in an environment, receiving feedback from that environment, and adapting its behavior based on that feedback. This is the reinforcement learning interaction loop, even if the agent was not trained with explicit RL on coding tasks.

The architecture is what matters: the LLM serves as a policy. Its observations are file contents, terminal outputs, error messages. Its actions are bash commands, file edits, tool calls. Its reward signal (during training) is test pass/fail. The environment is a computer. This is the RL-for-agents paradigm - and it is the direction the entire field is moving.

Understanding how RL applies to AI agents requires connecting the abstract MDP framework from the first lesson in this module to the concrete engineering of modern AI systems. That is what this lesson does.

Why This Exists: From Language Models to Agents

A standard language model is a function $f: \text{tokens} \to \text{tokens}$. Given a prompt, it generates a response. This is powerful but limited - it cannot observe the results of its outputs, adapt to feedback, or take multi-step actions in a dynamic environment.

An agent is a different paradigm: it perceives observations from an environment, selects actions, observes consequences, and updates its behavior. The key difference from a standard LLM is the feedback loop - the agent receives information from the environment that changes what it does next.

What LLMs can do without RL:

Generate a single response to a query
Follow a fixed instruction format
Complete a pattern based on training data

What agent + RL enables:

Take multi-step actions in a computer environment
Observe tool outputs and adapt (e.g., read the error message, fix the code)
Plan over long horizons - 10, 50, 200 steps
Self-correct based on feedback
Decompose complex tasks into sub-tasks and execute them

The transition from "language model" to "agent" requires three things: (1) a loop (generate → act → observe → generate), (2) an action space beyond text (tool calls, code execution), and (3) a training signal that rewards task completion, not just text quality.

Historical Context

Year	Paper / System	Key Contribution
2021	WebGPT (OpenAI)	GPT-3 with web browsing - an early agentic LLM
2022	SayCan (Google)	LLM planning + robot actions grounded by RL feasibility scores
2022	RLAM (DeepMind)	RL for language model tool use (calculator, search)
2022	ReAct (Yao et al.)	Interleave reasoning and action in LLM prompts
2023	Toolformer (Meta)	Self-supervised training for tool use
2023	AlphaCode 2 (DeepMind)	Large-scale LLM sampling + filtering, clustering, and a learned scoring model for competitive programming - estimated 85th percentile
2023	SWE-bench (Princeton/Chicago)	Real GitHub issue benchmark for coding agents
2024	Claude 3.5 + Computer Use	Full computer interaction - screenshots, mouse/keyboard
2024	OpenAI o1	Large-scale RL for long chain-of-thought reasoning
2024	Devin (Cognition)	Full software engineering agent
2025	SWE-bench 50%+	Frontier models from Anthropic, OpenAI, and Google crossing 50% on Verified

The Agent Formulation as an MDP

Let's ground AI agents in the formal RL framework we established in Lesson 1.

State $s_t$: the full context the agent has access to at time $t$. For a coding agent this includes the original task description, the repository structure, all previous tool outputs, all previous code edits, and the current file contents.

$s_t = (\text{task}, \text{history}_{0:t-1}, \text{current\_observations}_t)$

Action $a_t$: what the agent produces at time $t$. For tool-using agents, actions are:

Text generation: write reasoning, plan, explanation
Tool calls: read_file(path), write_file(path, content), bash(command), search(query)
Termination: finish(answer) - signal task completion

Transition: the environment executes the action and returns an observation $o_{t+1}$:

bash("python test.py") → returns stdout/stderr from test execution
read_file("model.py") → returns file contents
search("how to sort a list in Python") → returns search results

Reward $r_t$: for most agent tasks, the reward is sparse and binary:

$r_T = +1$ if the final state satisfies the success criterion (tests pass, task complete)
$r_T = 0$ otherwise
$r_t = 0$ for all $t < T$ (no intermediate reward)

Episode: one complete task - from initial task description to final answer or timeout.

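Because the state is defined as the full history rather than only the latest observation, the Markov property holds by construction. The result is a standard MDP with a very large state space (arbitrary text context), a large action space (natural language plus tool calls), and sparse reward.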
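To make the formulation concrete, here is a minimal sketch of the same MDP as a Python environment. The names `ToyCodingEnv` and `StepResult`, the three actions, and the "target text" success check are hypothetical stand-ins chosen for illustration; a real coding environment would run an actual test suite.

```python
from dataclasses import dataclass, field


@dataclass
class StepResult:
    observation: str   # o_{t+1}: what the environment returns
    reward: float      # r_t: 0 everywhere except possibly the final step
    done: bool         # True once the episode has ended


@dataclass
class ToyCodingEnv:
    """Hypothetical toy task: solved when the 'file' contains the target text."""
    task: str
    target: str
    max_steps: int = 10
    file_content: str = ""
    history: list = field(default_factory=list)  # (action, observation) pairs

    def state(self) -> tuple:
        # s_t = (task, history so far, current observation)
        last_obs = self.history[-1][1] if self.history else ""
        return (self.task, tuple(self.history), last_obs)

    def step(self, action: str, arg: str = "") -> StepResult:
        if action == "write_file":
            self.file_content = arg
            obs, reward, done = f"wrote {len(arg)} characters", 0.0, False
        elif action == "read_file":
            obs, reward, done = self.file_content, 0.0, False
        elif action == "finish":
            success = self.target in self.file_content
            obs = "tests passed" if success else "tests failed"
            reward, done = float(success), True   # sparse, binary, terminal
        else:
            obs, reward, done = f"unknown action: {action}", 0.0, False
        self.history.append((action, obs))
        if not done and len(self.history) >= self.max_steps:
            done = True  # timeout ends the episode with reward 0
        return StepResult(obs, reward, done)
```

Every intermediate step returns reward 0; only `finish` can return 1, which is exactly the sparse-reward structure described above.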

ReAct: Reason and Act

ReAct (Yao et al., 2023) is the foundational prompting framework for agents. The key insight: language models are better at acting when they reason first.

The ReAct Trajectory Format

Instead of just generating actions, the agent interleaves reasoning thoughts with actions:

$\text{Trajectory} = (o_1, \text{thought}_1, a_1, o_2, \text{thought}_2, a_2, \ldots, o_T, \text{thought}_T, a_T)$

where:

$o_t$ : observation from the environment (tool output, error message, file content)
$\text{thought}_t$ : the agent's reasoning about the current state and what to do next
$a_t$ : the action taken (tool call, code write, answer)

The ReAct format turns chain-of-thought reasoning into the "thinking" component of an agent's policy. The thought is not submitted to the environment - it is internal reasoning. The action is what gets executed.

Example ReAct trajectory for a coding task:

Observation: Task: Fix the bug in sort_by_date() that returns wrong order for leap years.

Thought: I need to first understand the current implementation of sort_by_date().
Action: read_file("utils/date_utils.py")

Observation: [file content showing the sort function]

Thought: I see the issue. The leap year check uses year % 4 == 0 but doesn't exclude
century years (divisible by 100 but not 400). I need to fix this logic.
Action: bash("grep -n 'leap' utils/date_utils.py")

Observation: 45:     if year % 4 == 0:  # BUG: incomplete leap year check

Thought: Found the bug on line 45. The correct check is:
(year % 4 == 0 and year % 100 != 0) or (year % 400 == 0)
Action: write_file("utils/date_utils.py", [corrected content])

Observation: [confirmation that the file was written]

Thought: Now run the tests to verify the fix.
Action: bash("python -m pytest tests/test_date_utils.py -v")

Observation: ===== 8 passed in 0.34s =====

Thought: All tests pass. The fix is correct.
Action: finish("Fixed leap year check in sort_by_date(): line 45 in utils/date_utils.py")

Why Interleaving Reasoning and Action Helps

Research on ReAct (Yao et al., 2023) showed that:

Reasoning without action (CoT) fails when knowledge is insufficient - the model hallucinates
Action without reasoning (Act) fails when multi-step planning is needed - the model takes random actions
ReAct combines both: reasoning anchors the planning and reduces hallucination; actions ground the reasoning in real observations

The observation after each action is critical - it closes the loop. The model updates its belief about the world based on real environment feedback, not just its internal knowledge.

Monte Carlo Tree Search for Agent Planning

The Problem with Greedy Action Selection

ReAct agents with greedy (single-path) exploration can get stuck in local optima:

Choose a wrong approach early (e.g., wrong file to fix)
Execute many actions along that path
Realize too late that the approach was wrong
Have no mechanism to backtrack and try alternatives

For complex tasks (software engineering, scientific reasoning, math proofs), this is a critical failure mode. AlphaCode 2 (DeepMind, 2023) showed that searching over many solution candidates dramatically improves performance: it reached an estimated 85th percentile on competitive programming, whereas the original AlphaCode performed at roughly the level of the median competitor.

MCTS for Language Model Agents

Monte Carlo Tree Search treats the agent's trajectory as a tree of possible sequences. Each node in the tree is a state $s$ (a partial trajectory). Each edge is an action $a$. The tree is built by repeating four steps:

Selection: traverse the tree using UCT to balance exploration and exploitation:

$\text{UCT}(s, a) = Q(s,a) + c\sqrt{\frac{\ln N(s)}{N(s,a)}}$

where $Q(s,a)$ is the average return from taking action $a$ in state $s$, $N(s)$ is the visit count of state $s$, $N(s,a)$ is the visit count of the $(s,a)$ edge, and $c$ is an exploration constant.

Expansion: at a leaf node, expand by generating candidate next actions from the LLM
Simulation (rollout): from the expanded node, run a full rollout (complete the episode) using the LLM as a rollout policy
Backpropagation: propagate the rollout reward back up the tree, updating $Q$ and $N$ for all traversed edges
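A small worked example may help with the selection step. The sketch below evaluates the UCT formula for three candidate actions at one node; the action names and the (Q, N) statistics are made-up numbers for illustration only.

```python
import math


def uct(q: float, parent_visits: int, edge_visits: int, c: float = 1.41) -> float:
    """UCT(s, a) = Q(s, a) + c * sqrt(ln N(s) / N(s, a))."""
    if edge_visits == 0:
        return float("inf")  # untried actions are explored first
    return q + c * math.sqrt(math.log(parent_visits) / edge_visits)


# Hypothetical statistics for three actions at one node: (Q(s, a), N(s, a))
stats = {"read_file": (0.8, 10), "search": (0.3, 2), "bash_test": (0.6, 4)}
parent_visits = sum(n for _, n in stats.values())  # N(s) = 16

scores = {a: uct(q, parent_visits, n) for a, (q, n) in stats.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

With these numbers the rarely tried `search` action receives the highest score even though its average return is the lowest. As its visit count grows, the exploration bonus shrinks and the average return dominates.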

The Value Function: Evaluating Partial Trajectories

A key component of MCTS for coding agents is a value function $V(s)$ that estimates the probability of task completion from the current state $s$ (partial trajectory). This is trained similarly to the critic in actor-critic RL:

Collect trajectories with known outcomes (task completed or failed)
Train a classifier: $V(s)$ = probability this partial trajectory leads to success
The value function guides the MCTS selection step - prune low-value branches early
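The data preparation behind such a value function can be sketched in a few lines. Here every prefix of a logged trajectory is labeled with the final outcome, and a lookup table of empirical success rates stands in for the learned classifier. The function names and the logged episodes are hypothetical; a real system would train a neural model on the text of the partial trajectory.

```python
from collections import defaultdict


def prefix_examples(trajectories):
    """Turn (actions, succeeded) pairs into (partial_trajectory, label) examples."""
    examples = []
    for actions, succeeded in trajectories:
        for t in range(1, len(actions) + 1):
            examples.append((tuple(actions[:t]), 1.0 if succeeded else 0.0))
    return examples


def fit_tabular_value(examples):
    """Stand-in for a classifier: V(prefix) = empirical success rate of that prefix."""
    totals, counts = defaultdict(float), defaultdict(int)
    for prefix, label in examples:
        totals[prefix] += label
        counts[prefix] += 1
    return {prefix: totals[prefix] / counts[prefix] for prefix in counts}


# Hypothetical logged episodes with known outcomes
logged = [
    (["read_file", "fix_bug", "run_tests"], True),
    (["read_file", "rewrite_module", "run_tests"], False),
    (["search_api", "run_tests"], False),
]
V = fit_tabular_value(prefix_examples(logged))
print(V[("read_file",)], V[("read_file", "fix_bug")], V[("search_api",)])
```

In this toy data, starting with `read_file` succeeds half the time, while the `read_file` then `fix_bug` prefix always succeeds, so a search guided by `V` would prefer that branch.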

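AlphaCode 2 applies a related idea without an explicit search tree: the LLM samples a large number of code candidates, the candidates are filtered by execution and clustered, and a learned scoring model selects the final submissions. In an MCTS agent, the value function plays the role of that scoring model at every node of the search rather than only at the end.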

MCTS Tree for Coding Agent:

               Task Description
                      │
        ┌─────────────┼─────────────┐
     Read file     Search API    bash test
     [Q=0.8]       [Q=0.3]      [Q=0.6]
        │
    ┌───┴───┐
  Fix bug   Rewrite
  [Q=0.9]  [Q=0.2]
    │
  Test fix
  [Q=0.95] ← Best path found by UCT

Training Agents with RL: The Full Pipeline

Modern agent training follows a pattern:

Step 1 - SFT on demonstrations: collect expert human trajectories solving agent tasks. Fine-tune the LLM on these trajectories using standard cross-entropy. This teaches the model the task format, tool use syntax, and basic problem-solving patterns.

Step 2 - Rollout collection: let the agent attempt real tasks in a sandboxed environment. Collect (state, action, reward) sequences. The sandbox must be realistic: real code execution, real web access, real file systems.

Step 3 - Reward signal: binary reward (task complete / failed) is the most reliable signal. Process reward models (PRMs) can provide intermediate step rewards - a separate model trained to evaluate whether each reasoning/action step makes sense.

Step 4 - Policy update: use PPO or DPO to update the agent policy. Challenges:

Very long context (entire multi-step trajectory)
Sparse binary reward (credit assignment across 50+ steps)
Large action space (natural language tokens)

Step 5 - Iterate: the improved agent can now attempt harder tasks, which provides better training data for the next iteration.
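The credit assignment challenge in Step 4 can be illustrated with a short sketch. It shows one common way to spread a terminal reward over earlier steps (discounted return-to-go) and a simple mean baseline across rollouts. This is an illustrative assumption, not the recipe of any particular production system, and the function names are hypothetical.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return-to-go G_t = r_t + gamma * G_{t+1}, computed backwards from the end."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]


def episode_advantages(outcomes):
    """Advantage of each rollout = its final reward minus the batch mean (a simple baseline)."""
    mean = sum(outcomes) / len(outcomes)
    return [o - mean for o in outcomes]


# A 3-step episode with reward only at the end: earlier steps get discounted credit.
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))   # [0.25, 0.5, 1.0]

# Four rollouts of the same task, two successes and two failures.
print(episode_advantages([1.0, 0.0, 0.0, 1.0]))         # [0.5, -0.5, -0.5, 0.5]
```

Note that every step of a successful episode receives positive credit, including steps that were actually mistakes. That is the weakness of purely outcome-based signals over long trajectories.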

Code: ReAct Agent with Tool Use

"""
ReAct agent implementation with tool use.
Demonstrates: reason → act → observe loop with real tools.
"""

import re
from typing import Callable

# ─────────────────────────────────────────────────────────────────────────────
# Tool Definitions
# ─────────────────────────────────────────────────────────────────────────────

def calculator(expression: str) -> str:
    """Calculator using eval with a restricted namespace (demo only; eval is not a real sandbox)."""
    import math
    safe_namespace = {
        "abs": abs, "round": round, "min": min, "max": max,
        "sum": sum, "sqrt": math.sqrt, "log": math.log,
        "pi": math.pi, "e": math.e,
    }
    try:
        result = eval(expression, {"__builtins__": {}}, safe_namespace)
        return str(result)
    except Exception as exc:
        return f"Error: {exc}"


def web_search(query: str) -> str:
    """Simulated web search - returns mock results."""
    # In production: use a real search API (SerpAPI, Brave, etc.)
    mock_results = {
        "Python sort list": (
            "Use list.sort() for in-place sorting or sorted(list) for a new sorted list. "
            "Both accept a key= parameter. Default is ascending order."
        ),
        "Python list comprehension": (
            "List comprehension syntax: [expr for item in iterable if condition]. "
            "More efficient than equivalent for-loop."
        ),
        "bubble sort": (
            "Bubble sort: repeatedly compare adjacent elements and swap if out of order. "
            "O(n²) time complexity, O(1) space. Not suitable for large arrays."
        ),
    }
    for key, result in mock_results.items():
        if key.lower() in query.lower():
            return result
    return f"Search results for '{query}': No relevant results found in mock database."


def read_file(path: str) -> str:
    """Read file contents."""
    try:
        with open(path, "r") as f:
            return f.read()
    except FileNotFoundError:
        return f"Error: File '{path}' not found."
    except PermissionError:
        return f"Error: Permission denied for '{path}'."


def write_file(path: str, content: str) -> str:
    """Write content to file."""
    try:
        with open(path, "w") as f:
            f.write(content)
        return f"Successfully wrote {len(content)} characters to '{path}'."
    except Exception as exc:
        return f"Error writing file: {exc}"


# Tool registry
TOOLS = {
    "calculator": {
        "fn": calculator,
        "description": "Evaluate a mathematical expression. Input: expression string.",
        "example": 'calculator("sqrt(144) + 2**10")',
    },
    "web_search": {
        "fn": web_search,
        "description": "Search the web for information. Input: search query string.",
        "example": 'web_search("Python list comprehension syntax")',
    },
    "read_file": {
        "fn": read_file,
        "description": "Read a file from disk. Input: file path string.",
        "example": 'read_file("main.py")',
    },
    "write_file": {
        "fn": write_file,
        "description": "Write content to a file. Input: path and content.",
        "example": 'write_file("output.txt", "Hello, world!")',
    },
    "finish": {
        "fn": lambda answer: f"FINAL_ANSWER: {answer}",
        "description": "Finish the task with a final answer. Input: answer string.",
        "example": 'finish("The answer is 42")',
    },
}


# ─────────────────────────────────────────────────────────────────────────────
# ReAct Agent
# ─────────────────────────────────────────────────────────────────────────────

class ReActAgent:
    """
    ReAct agent: interleaves reasoning (Thought) and actions (Action).

    Loop:
      1. Generate Thought: reason about current state
      2. Generate Action: select tool and arguments
      3. Execute Action: call the tool, get Observation
      4. Append (Thought, Action, Observation) to context
      5. Repeat until finish() is called or max_steps reached
    """

    def __init__(
        self,
        llm_fn: Callable[[str], str],
        tools: dict = TOOLS,
        max_steps: int = 10,
        verbose: bool = True,
    ):
        self.llm = llm_fn
        self.tools = tools
        self.max_steps = max_steps
        self.verbose = verbose

    def build_system_prompt(self) -> str:
        """Build system prompt describing available tools."""
        tool_descriptions = "\n".join([
            f"- {name}: {info['description']}\n  Example: {info['example']}"
            for name, info in self.tools.items()
        ])
        return f"""You are a helpful agent that solves tasks step by step.
You have access to the following tools:
{tool_descriptions}

Format your response EXACTLY as:
Thought: [your reasoning about what to do next]
Action: [tool_name]("[arguments]")

Wait for the Observation before taking the next action.
When you have a complete answer, use: Action: finish("[your answer]")"""

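    def parse_action(self, response: str) -> tuple[str, str] | None:
        """
        Parse 'Action: tool_name("args")' from LLM response.
        Returns (tool_name, args_string) or None if not parseable.

        args_string is the raw text between the outermost quotes, so a
        two-argument call such as write_file("a.txt", "hi") yields
        'a.txt", "hi' and is split into arguments in execute_action().
        """
        # Try double-quoted arguments first, then single-quoted ones.
        for quote in ('"', "'"):
            pattern = rf"Action:\s*(\w+)\({quote}(.*?){quote}\)\s*$"
            match = re.search(pattern, response, re.DOTALL | re.MULTILINE)
            if match:
                return match.group(1), match.group(2)
        return None

    def execute_action(self, tool_name: str, args: str) -> str:
        """Execute a tool and return its observation."""
        import ast

        if tool_name not in self.tools:
            return f"Error: Unknown tool '{tool_name}'. Available tools: {list(self.tools.keys())}"
        try:
            # Re-wrap the raw text in quotes and parse it as a tuple of
            # string literals: 'a.txt", "hi' -> ("a.txt", "hi").
            call_args = [str(a) for a in ast.literal_eval(f'("{args}",)')]
        except (SyntaxError, ValueError):
            call_args = [args]  # Not valid literals: pass the text through unchanged
        try:
            return self.tools[tool_name]["fn"](*call_args)
        except TypeError as exc:
            return f"Error: Bad arguments for '{tool_name}': {exc}"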

    def run(self, task: str) -> str:
        """
        Run the ReAct loop until finish() or max_steps.
        Returns the final answer.
        """
        # Build initial context
        messages = [
            {"role": "system", "content": self.build_system_prompt()},
            {"role": "user", "content": f"Task: {task}"},
        ]

        trajectory = []

        for step in range(self.max_steps):
            # ── Generate Thought + Action ──────────────────────────────────────
            # In production: call actual LLM API here
            # For demo: use the simulated LLM
            context = self._build_context(messages, trajectory)
            response = self.llm(context)

            if self.verbose:
                print(f"\n[Step {step+1}]")
                print(f"LLM Response: {response}")

            # ── Parse Action ───────────────────────────────────────────────────
            parsed = self.parse_action(response)
            if parsed is None:
                observation = "Error: Could not parse action. Use format: Action: tool_name(\"args\")"
                if self.verbose:
                    print(f"Observation: {observation}")
                trajectory.append(("", "parse_error", observation))
                continue

            tool_name, args = parsed

            # ── Execute Action ─────────────────────────────────────────────────
            observation = self.execute_action(tool_name, args)
            if self.verbose:
                print(f"Action: {tool_name}(\"{args}\")")
                print(f"Observation: {observation}")

            # Extract thought from response
            thought_match = re.search(r"Thought:\s*(.+?)(?=Action:|$)", response, re.DOTALL)
            thought = thought_match.group(1).strip() if thought_match else ""

            trajectory.append((thought, f'{tool_name}("{args}")', observation))

            # ── Check for finish ───────────────────────────────────────────────
            if tool_name == "finish":
                return observation.replace("FINAL_ANSWER: ", "")

        return "Max steps reached without completing task."

    def _build_context(self, messages: list, trajectory: list) -> str:
        """Build the full conversation context for the LLM."""
        context = messages[0]["content"] + "\n\n"
        context += f"User: {messages[1]['content']}\n\n"
        for thought, action, observation in trajectory:
            if thought:
                context += f"Thought: {thought}\n"
            context += f"Action: {action}\n"
            context += f"Observation: {observation}\n\n"
        return context


# ─────────────────────────────────────────────────────────────────────────────
# MCTS Planning Sketch for Code Generation
# ─────────────────────────────────────────────────────────────────────────────

import math
from dataclasses import dataclass, field

@dataclass
class MCTSNode:
    """A node in the MCTS tree representing a partial agent trajectory."""
    state: str                  # Partial trajectory / context so far
    parent: "MCTSNode | None" = None
    action: str = ""            # Action that led to this node
    children: list = field(default_factory=list)
    visit_count: int = 0
    total_value: float = 0.0

    @property
    def value(self) -> float:
        """Average value from simulations through this node."""
        return self.total_value / self.visit_count if self.visit_count > 0 else 0.0

    def uct_score(self, parent_visits: int, c: float = 1.41) -> float:
        """Upper Confidence Bound for Trees score."""
        if self.visit_count == 0:
            return float("inf")  # Unvisited nodes get maximum priority
        exploitation = self.value
        exploration = c * math.sqrt(math.log(parent_visits) / self.visit_count)
        return exploitation + exploration


class AgentMCTS:
    """
    MCTS for agent planning over multiple action steps.
    Adapts the standard MCTS algorithm to language model action generation.
    """

    def __init__(
        self,
        llm_fn: Callable[[str], str],
        value_fn: Callable[[str], float],
        n_simulations: int = 50,
        max_depth: int = 10,
        n_children: int = 3,
        c: float = 1.41,
    ):
        self.llm = llm_fn
        self.value = value_fn        # Trained value model: state → P(success)
        self.n_simulations = n_simulations
        self.max_depth = max_depth
        self.n_children = n_children
        self.c = c

    def search(self, task: str) -> list[str]:
        """
        Run MCTS to find the best action sequence for the task.
        Returns the best trajectory found.
        """
        root = MCTSNode(state=task)

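        for _ in range(self.n_simulations):
            # 1. Selection: traverse tree using UCT
            node = self._select(root)

            # Depth of the selected node (number of actions taken from the root)
            depth, ancestor = 0, node
            while ancestor.parent is not None:
                depth, ancestor = depth + 1, ancestor.parent

            # 2. Expansion: once a leaf has been visited, generate its children
            #    and continue with one of them, unless max_depth has been reached
            if node.visit_count > 0 and not node.children and depth < self.max_depth:
                self._expand(node)
                if node.children:
                    node = node.children[0]

            # 3. Simulation: estimate the value of the (possibly new) node
            simulation_value = self._simulate(node)

            # 4. Backpropagation: update node statistics up to root
            self._backpropagate(node, simulation_value)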

        # Return the best path: always select child with highest average value
        return self._best_path(root)

    def _select(self, node: MCTSNode) -> MCTSNode:
        """Select leaf node using UCT."""
        while node.children and node.visit_count > 0:
            # Select child with highest UCT score
            node = max(node.children, key=lambda n: n.uct_score(node.visit_count, self.c))
        return node

    def _expand(self, node: MCTSNode) -> None:
        """Generate n_children candidate next actions from the LLM."""
        for _ in range(self.n_children):
            # In production: call LLM with current state to generate a candidate action
            # LLM returns next action + updated state
            action = self.llm(f"Given state: {node.state}\nNext action:")
            child_state = node.state + f"\nAction: {action}"
            child = MCTSNode(state=child_state, parent=node, action=action)
            node.children.append(child)

    def _simulate(self, node: MCTSNode) -> float:
        """
        Estimate value of this node.
        Option 1: use trained value function V(s) → P(success)
        Option 2: run a full rollout to completion
        """
        # Use value function (faster than full rollout)
        return self.value(node.state)

    def _backpropagate(self, node: MCTSNode, value: float) -> None:
        """Propagate simulation result up to root."""
        while node is not None:
            node.visit_count += 1
            node.total_value += value
            node = node.parent

    def _best_path(self, root: MCTSNode) -> list[str]:
        """Extract the best path from root to best leaf (greedy selection)."""
        path = []
        node = root
        while node.children:
            best_child = max(node.children, key=lambda n: n.value)
            path.append(best_child.action)
            node = best_child
        return path


# ─────────────────────────────────────────────────────────────────────────────
# Simulated LLM for demo (replace with real LLM API in production)
# ─────────────────────────────────────────────────────────────────────────────

def simulated_llm(context: str) -> str:
    """
    Simulated LLM responses for the ReAct demo.
    In production: call OpenAI API, Anthropic API, etc.
    """
    # Only inspect the task and the trajectory so far. The system prompt
    # mentions every tool name, so checking the full context would always match.
    history = context.split("User: Task:", 1)[-1]

    if "What is 15% of 847?" in history and "calculator(" not in history:
        return (
            "Thought: I need to calculate 15% of 847. I'll use the calculator tool.\n"
            'Action: calculator("0.15 * 847")'
        )
    if "127.05" in history and "finish(" not in history:
        return (
            "Thought: The calculator returned 127.05. That is 15% of 847.\n"
            'Action: finish("15% of 847 is 127.05")'
        )
    if "bubble sort" in history.lower() and "web_search(" not in history:
        return (
            "Thought: I need to look up bubble sort to explain it.\n"
            'Action: web_search("bubble sort")'
        )
    if "O(n²)" in history:
        return (
            "Thought: I found information about bubble sort. I can now answer.\n"
            'Action: finish("Bubble sort repeatedly compares adjacent elements and swaps them if out of order. Time complexity: O(n²). Space complexity: O(1). Not suitable for large arrays.")'
        )
    return (
        "Thought: Let me search for more information.\n"
        'Action: web_search("general information")'
    )


if __name__ == "__main__":
    print("=" * 60)
    print("ReAct Agent Demo")
    print("=" * 60)

    agent = ReActAgent(llm_fn=simulated_llm, max_steps=5, verbose=True)

    print("\nTask 1: Mathematical calculation")
    result1 = agent.run("What is 15% of 847?")
    print(f"\nFinal answer: {result1}")

    print("\n" + "=" * 60)
    print("Task 2: Knowledge retrieval")
    result2 = agent.run("Explain bubble sort and its time complexity.")
    print(f"\nFinal answer: {result2}")

Reward Design for Agents

Binary Task Completion Reward

The simplest and most reliable reward: did the agent complete the task successfully?

Code agents: does the test suite pass? Binary, unambiguous.
Web agents: does the final page match the target state? Compare DOM or screenshot.
Math agents: is the final answer numerically correct? Check with a verifier.

Binary rewards have the major advantage of being objective and unambiguous. They avoid reward hacking through qualitative assessments. The downside: very sparse - the agent only gets a signal at the very end of a potentially 50+ step trajectory.
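For code agents, a binary reward often reduces to the exit code of a verification command. The sketch below assumes that convention (exit code 0 means success, as with a passing test run); the function name `binary_reward` and the stand-in commands are hypothetical.

```python
import subprocess
import sys


def binary_reward(command, cwd=None, timeout_s=60.0):
    """Return 1.0 if the verification command exits with code 0, else 0.0."""
    try:
        completed = subprocess.run(command, cwd=cwd, capture_output=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        return 0.0  # a hung or unlaunchable verification run counts as failure
    return 1.0 if completed.returncode == 0 else 0.0


# Stand-in verification commands. A real setup might instead run
# [sys.executable, "-m", "pytest", "-q"] inside the task repository
# (assuming pytest is installed there).
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]
print(binary_reward(passing), binary_reward(failing))
```

Treating timeouts as failures keeps the signal binary and prevents an agent from being rewarded for a test run that never finishes.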

Process Reward Models (PRMs)

For long-horizon tasks, a process reward model evaluates each step of the agent's trajectory:

$r_t^{PRM} = f_\phi(s_t, a_t)$

where $f_\phi$ is trained to predict whether step $t$ of the trajectory is a good step toward task completion. This provides dense rewards across the trajectory, which eases (but does not eliminate) the credit assignment problem, since the PRM itself can be wrong.
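One simple way to use PRM scores, sketched below, is to add a down-weighted per-step score to the sparse terminal reward. The function name, the weight `prm_weight`, and the additive combination are assumptions for illustration; other shaping schemes are possible.

```python
def shaped_rewards(step_scores, task_succeeded, prm_weight=0.1):
    """Combine per-step PRM scores with the sparse terminal reward.

    r_t = prm_weight * f_phi(s_t, a_t), plus 1.0 on the final step if the task succeeded.
    """
    rewards = [prm_weight * score for score in step_scores]
    if rewards and task_succeeded:
        rewards[-1] += 1.0
    return rewards


# Hypothetical PRM scores in [0, 1] for a 3-step trajectory that succeeded
print(shaped_rewards([1.0, 0.0, 1.0], task_succeeded=True, prm_weight=0.5))
```

Keeping `prm_weight` small means the verified outcome still dominates the total reward, which limits how much the agent can gain by exploiting errors in the PRM.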

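PRMs for math reasoning (Lightman et al., 2023): train a classifier to evaluate whether each step in a mathematical solution is correct. In that work, a reward model trained with process supervision was more reliable than one trained with outcome supervision when selecting among sampled solutions on the MATH dataset.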

Challenges: PRMs require human annotation of process quality (expensive) or a strong automated evaluator.

Self-Evaluation and Reflection

Some agent systems use the LLM itself as a reward signal via self-evaluation:

Agent attempts the task and produces an output
A "critic" LLM (or the same LLM with a different prompt) evaluates the output quality
The evaluation score becomes the reward signal

This is computationally expensive (an additional LLM call for every evaluation) but can produce rich reward signals for tasks where objective verification is hard. Risk: the critic and the policy may share correlated failure modes, for example both accepting the same hallucinated fact.
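The three steps above can be sketched as a function that turns a critic's free-text reply into a scalar reward. The prompt wording, the `Score: <number>` reply format, and the 0 to 10 scale are assumptions made for this example; `critic_llm` is any callable that maps a prompt string to a reply string.

```python
import re
from typing import Callable


def critic_reward(task: str, output: str, critic_llm: Callable[[str], str]) -> float:
    """Ask a critic model for a 0-10 score and map it to a reward in [0, 1]."""
    prompt = (
        "Rate how well the output solves the task on a scale from 0 to 10.\n"
        f"Task: {task}\nOutput: {output}\nReply with 'Score: <number>'."
    )
    reply = critic_llm(prompt)
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", reply)
    if match is None:
        return 0.0  # unparseable critique: treat as no reward
    return min(max(float(match.group(1)) / 10.0, 0.0), 1.0)


# Usage with a stand-in critic that always answers "Score: 7"
print(critic_reward("Summarize the report", "A short summary.", lambda p: "Score: 7"))
```

Clamping the score and returning 0 on parse failures keeps a malformed critic reply from producing an out-of-range reward.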

Production Engineering Notes

Context Window Management

Agent trajectories are long. A 50-step trajectory with tool outputs can easily consume 50,000–100,000 tokens. Production agents must manage context carefully:

Sliding window: drop oldest observations once context limit is reached. Risk: lose important early information.
Summarization: periodically compress the trajectory history with an LLM. Risk: information loss.
Key-value compression: store tool outputs externally, keep only summaries in context. Risk: retrieval failures.
Memory architectures: explicit episodic memory (MemGPT, etc.) with structured retrieval.
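The sliding-window strategy is the easiest to sketch. The version below always keeps the task description and then as many of the most recent steps as fit in the budget. The whitespace word count is a crude stand-in for a real tokenizer, the omission marker is not counted against the budget, and the function name is hypothetical.

```python
def fit_to_budget(task: str, steps: list[str], max_tokens: int,
                  count_tokens=lambda s: len(s.split())) -> list[str]:
    """Keep the task plus as many of the most recent steps as fit in the budget."""
    budget = max_tokens - count_tokens(task)
    kept = []
    for step in reversed(steps):          # newest first
        cost = count_tokens(step)
        if cost > budget:
            break                         # everything older is dropped too
        kept.append(step)
        budget -= cost
    dropped = len(steps) - len(kept)
    marker = [f"[{dropped} earlier steps omitted]"] if dropped else []
    return [task] + marker + kept[::-1]   # restore chronological order


print(fit_to_budget("fix bug", ["a b c", "d e", "f"], max_tokens=5))
```

Pinning the task at the front addresses the main risk named above: the agent never loses the original goal, even when early observations are dropped.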

Sandboxing Code Execution

Coding agents execute arbitrary code. Production systems must sandbox execution to prevent:

Infinite loops consuming resources
File system access outside the task directory
Network access to unauthorized endpoints
Malicious code from adversarial tasks

Common approaches: Docker containers with resource limits, gVisor (Google's sandbox kernel), Firecracker microVMs, or cloud function sandboxes (AWS Lambda with timeout).
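As a minimal illustration of the first item (runaway loops), the sketch below runs code in a child process with a wall-clock timeout and a throwaway working directory. It is deliberately not a sandbox: it does nothing to restrict file system or network access, which is what the container and microVM approaches above are for. The function name `run_untrusted` is hypothetical.

```python
import subprocess
import sys
import tempfile


def run_untrusted(code: str, timeout_s: float = 5.0) -> str:
    """Run Python code in a child process with a timeout and a scratch directory."""
    with tempfile.TemporaryDirectory() as workdir:
        try:
            done = subprocess.run(
                # -I: isolated mode (ignores environment variables and user site-packages)
                [sys.executable, "-I", "-c", code],
                cwd=workdir, capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return f"Error: timed out after {timeout_s}s"
    if done.returncode != 0:
        return f"Error (exit {done.returncode}): {done.stderr}"
    return done.stdout


print(run_untrusted("print(2 + 2)"))
print(run_untrusted("while True: pass", timeout_s=1.0))
```

Returning errors as strings, rather than raising, lets the agent loop feed the failure back to the model as an ordinary observation.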

Latency vs Quality Tradeoff

Agent tasks involve multiple sequential LLM calls, each taking 1–10 seconds. A 20-step agent task takes 20–200 seconds to complete. This is acceptable for background tasks (software agents running overnight) but not for interactive applications.

Strategies for latency reduction:

Parallel tool calls: when multiple tools can be called simultaneously, batch them
Smaller models for simple steps: use GPT-3.5 or smaller models for observation parsing, reserve large models for planning
Caching: cache common tool outputs (web search results, file reads)
MCTS with time budget: run MCTS for a fixed time budget rather than fixed number of simulations
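The first strategy, parallel tool calls, can be sketched with the standard library's thread pool. This assumes the calls are independent of each other and I/O-bound (web requests, file reads); the function name and the `(fn, arg)` call format are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tools_in_parallel(calls, max_workers=4):
    """Run independent (fn, arg) tool calls concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fn, arg) for fn, arg in calls]
        return [future.result() for future in futures]


# Usage with two stand-in "tools"
print(run_tools_in_parallel([(str.upper, "a"), (len, "abc")]))
```

Because results are collected in submission order, the observations can be appended to the trajectory deterministically even though the calls finish in arbitrary order.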

Benchmarks for Agent Evaluation

Benchmark	Domain	Task Type	Success Metric
SWE-bench	Software Engineering	Fix GitHub issues	Test suite pass
WebArena	Web Navigation	Complete web tasks	Task completion
OSWorld	Computer Use	OS-level tasks	Goal state match
HumanEval	Code Generation	Implement functions	Test pass rate
MATH	Math Reasoning	Solve math problems	Exact answer match
ALFWorld	Household Tasks	Complete text-world tasks	Task completion

Common Mistakes

:::danger No sandboxing for code execution
Coding agents that execute arbitrary code without sandboxing will eventually encounter (or be given) malicious or destructive inputs. Always run code in isolated containers with CPU, memory, disk, and network limits. Never run agent-generated code on the host system.
:::

:::danger Infinite loop in agent trajectory
Without a maximum step limit and a timeout per step, agents can enter infinite loops: the LLM keeps calling the same tool, getting the same error, calling the tool again. Always implement: (1) max_steps hard limit, (2) per-step timeout, (3) repeated-action detection that breaks the loop.
:::
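Repeated-action detection, the third safeguard, can be as simple as the sketch below. It flags the agent as stuck when the same (action, observation) pair occurs several times in a row. The class name `LoopGuard` and the default threshold of three repeats are assumptions for illustration.

```python
from collections import deque


class LoopGuard:
    """Flags an agent that repeats the same (action, observation) pair in a row."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.recent = deque(maxlen=max_repeats)  # only the last max_repeats steps

    def is_stuck(self, action: str, observation: str) -> bool:
        self.recent.append((action, observation))
        window_full = len(self.recent) == self.max_repeats
        return window_full and len(set(self.recent)) == 1


guard = LoopGuard(max_repeats=3)
for _ in range(3):
    stuck = guard.is_stuck('bash("pytest")', "Error: module not found")
print(stuck)
```

Comparing the observation as well as the action avoids false alarms when the agent legitimately re-runs the same command (for example, the test suite) and gets a different result each time.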

:::warning Binary reward only - no intermediate signal

A pure binary reward (task complete / failed) over 50+ steps creates a severe credit assignment problem. The agent cannot distinguish between "wrong approach from step 1" and "correct approach, minor bug at step 49." Add a process reward model or at least intermediate checkpoints (test partial completion at key steps) to provide denser reward signals.

:::

:::warning Context length as training bottleneck

Long agent trajectories (50+ steps) exceed the context windows of most models during training. Either truncate trajectories (losing information), use gradient checkpointing with long context models (expensive), or decompose into shorter sub-tasks. Design your agent task structure to fit within practical context limits.

:::

:::tip Self-reflection improves agent quality significantly

Add a "reflection" step after each failed attempt: have the agent review its trajectory, identify where it went wrong, and state what it would do differently. Then restart with that reflection in context. This simple intervention can improve success rates on coding tasks by roughly 10–20% without any additional training, because it exploits the LLM's in-context learning capability. The exact gain depends on the model and the task.

:::
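The retry-with-reflection pattern can be sketched as a small control loop. Both callables here are hypothetical placeholders: `attempt` would run the agent on the task given earlier reflections, and `reflect` would ask the LLM to critique a failed trajectory.

```python
from typing import Callable, List, Tuple


def solve_with_reflection(
    attempt: Callable[[str, List[str]], Tuple[bool, str]],
    reflect: Callable[[str, str], str],
    task: str,
    max_attempts: int = 3,
) -> Tuple[bool, List[str]]:
    """Retry a task, feeding each failure's reflection into the next attempt."""
    reflections: List[str] = []
    for _ in range(max_attempts):
        # attempt returns (success flag, trajectory text)
        success, trajectory = attempt(task, reflections)
        if success:
            return True, reflections
        # Summarize what went wrong so the next attempt can avoid it.
        reflections.append(reflect(task, trajectory))
    return False, reflections
```

Keeping reflections as short text notes, rather than replaying whole failed trajectories, keeps the context cost of each retry small.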

YouTube Resources

Video	Channel	Why Watch It
AI Agents and RL	Andrej Karpathy	Agents, language models, and the RL connection
ReAct: Reasoning and Acting	Yannic Kilcher	ReAct paper walkthrough with examples
AlphaCode Explained	DeepMind	AlphaCode architecture - large-scale sampling and filtering with LLMs for programming
Building AI Agents	Harrison Chase	LangChain and agent frameworks in practice

Interview Q&A

Q1: What is the difference between ReAct and standard chain-of-thought (CoT) prompting?

Answer: Chain-of-thought prompting has the model reason through a problem step by step before producing a final answer - but the reasoning is entirely internal, using only the model's parametric knowledge. The model cannot observe external information during the reasoning process.

ReAct extends CoT by interleaving reasoning with actions that interact with the environment. The key difference: ReAct thoughts and actions alternate with observations from the real world (tool outputs, search results, code execution results). Each observation grounds the next reasoning step in reality rather than in the model's potentially hallucinated internal state.

Concretely: a CoT model solving a math problem can only use what it "knows." A ReAct agent can call a calculator, look up a formula, or execute code to verify a numerical result. The observation loop reduces hallucination by providing external ground truth at each step.

Q2: How does MCTS improve agent planning? Walk through the UCT formula.

Answer: MCTS builds a tree of possible action sequences, using simulations to estimate the value of each path. The key advantage over greedy action selection: it considers multiple alternative approaches and backtracks to explore promising branches that weren't followed initially.

The UCT formula $Q(s,a) + c\sqrt{\ln N(s) / N(s,a)}$ has two terms: (1) Exploitation: $Q(s,a)$ is the average reward observed after taking action $a$ in state $s$ - we prefer paths that have been rewarding. (2) Exploration: $c\sqrt{\ln N(s) / N(s,a)}$ increases as $N(s,a)$ decreases (rarely visited actions get a bonus) and grows slowly, logarithmically, as $N(s)$ increases (an action that stays neglected while its parent keeps being visited eventually gets tried again). The constant $c$ balances exploration vs exploitation.

For coding agents: MCTS generates multiple candidate code implementations, evaluates each using the value function (P(tests pass)), and selects the most promising path to explore further. AlphaCode 2 follows a related sample-and-select strategy rather than tree search: it samples a very large number of candidate programs, filters and clusters them by execution behavior, and ranks the survivors with a scoring model - dramatically outperforming single-shot generation.
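The UCT rule from this answer is small enough to write out directly. This is a sketch under assumptions: the statistics layout (`action -> (mean_reward, visit_count)`), the default `c = 1.4`, and the action names are all chosen for illustration.

```python
import math


def uct_score(q: float, n_parent: int, n_child: int, c: float = 1.4) -> float:
    """Q(s,a) + c * sqrt(ln N(s) / N(s,a)); unvisited actions are tried first."""
    if n_child == 0:
        return float("inf")
    return q + c * math.sqrt(math.log(n_parent) / n_child)


def select_action(stats: dict, c: float = 1.4) -> str:
    """stats maps action -> (mean_reward, visit_count)."""
    n_parent = sum(n for _, n in stats.values())
    return max(stats, key=lambda a: uct_score(stats[a][0], n_parent, stats[a][1], c))


# Hypothetical statistics for two candidate patches at one node.
stats = {"patch_a": (0.6, 10), "patch_b": (0.5, 2)}
explore_choice = select_action(stats, c=1.4)
greedy_choice = select_action(stats, c=0.0)
```

With `c = 1.4` the rarely visited `patch_b` wins despite its lower mean reward, because its exploration bonus is larger. With `c = 0` the rule collapses to greedy selection and picks `patch_a`.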

Q3: How do you design rewards for a coding agent?

Answer: Three layers: (1) Binary terminal reward: the strongest signal. If the test suite passes at the end, reward = 1. This is objective, unambiguous, and hard to game, provided the agent cannot modify the tests themselves. The downside is extreme sparsity over long trajectories. (2) Process reward model (PRM): train a model to evaluate each reasoning/action step. Does this bash command make sense? Is this code change logically connected to the bug being fixed? PRMs are expensive to train (need annotated trajectories) but provide dense signals. (3) Intermediate checkpoints: for tasks with a natural structure (test files, function stubs), reward partial completion. If 3 of 5 test cases pass, give reward 0.6. This densifies the signal without a full PRM.

The key design principle: make rewards objective and verifiable. Subjective rewards (code quality, style) lead to reward hacking. Use execution results, not human aesthetic judgment.
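Layers (1) and (3) reduce to a few lines once test outcomes are available as booleans. The function names below are assumptions made for this sketch, and how the test results are collected (for example by parsing a test runner's report) is left out.

```python
def terminal_reward(test_results: list) -> float:
    """Binary reward: 1.0 only if there is at least one test and all pass."""
    return 1.0 if test_results and all(test_results) else 0.0


def partial_reward(test_results: list) -> float:
    """Dense reward: fraction of tests that pass (0.0 for an empty suite)."""
    if not test_results:
        return 0.0
    return sum(test_results) / len(test_results)


outcomes = [True, True, True, False, False]  # 3 of 5 tests pass
binary = terminal_reward(outcomes)
dense = partial_reward(outcomes)
```

For the 3-of-5 case the binary reward is 0.0 while the partial reward is 0.6, which is the denser signal described above. Treating an empty suite as zero reward closes one simple loophole: an agent cannot score by deleting every test.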

Q4: How would you benchmark an AI coding agent? What are the limitations of SWE-bench?

Answer: SWE-bench is currently the gold standard: real GitHub issues, real test suites, objective pass/fail criterion, difficult enough to differentiate models meaningfully. To evaluate an agent, run it on the 500 issues in SWE-bench Verified (a human-validated subset with reliable tests), measure the pass rate, and break it down by repository, issue type, and difficulty. SWE-bench Lite, a 300-issue subset, is a cheaper alternative for quick iteration.
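The per-repository breakdown is a simple aggregation over per-issue outcomes. The sketch below assumes results arrive as `(repo, resolved)` pairs. The sample data is made up for illustration and is not a real benchmark result.

```python
from collections import defaultdict


def resolve_rates(results):
    """results: iterable of (repo, resolved) pairs -> (overall, per-repo rates)."""
    totals = defaultdict(lambda: [0, 0])  # repo -> [resolved count, total count]
    for repo, resolved in results:
        totals[repo][0] += int(resolved)
        totals[repo][1] += 1
    per_repo = {repo: s / n for repo, (s, n) in totals.items()}
    n_all = sum(n for _, n in totals.values())
    overall = sum(s for s, _ in totals.values()) / n_all if n_all else 0.0
    return overall, per_repo


# Made-up outcomes for four issues.
sample = [("django", True), ("django", False), ("sympy", True), ("flask", False)]
overall, per_repo = resolve_rates(sample)
```

Reporting the per-repository rates next to the overall number makes it visible when an agent's score is carried by one or two repositories.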

Limitations: (1) Training data contamination: the issues are from popular open-source repos. Large LLMs have likely seen the issue text and even the solution in training data. The benchmark measures a combination of generalization and memorization. (2) Test quality: some tests in SWE-bench are incomplete or brittle - a test passing doesn't always mean the bug is correctly fixed. (3) Task coverage: SWE-bench is biased toward Python package bugs. It underrepresents systems programming, concurrent code, and performance debugging. (4) Limited regression coverage: SWE-bench checks for regressions only through the previously passing tests it re-runs (PASS_TO_PASS), so a fix can still break behavior that those tests do not cover.

Q5: What are the main challenges of long-horizon planning for AI agents?

Answer: Five core challenges: (1) Credit assignment: with 50+ steps and a single terminal reward, it is very hard to determine which early decisions led to success or failure. A wrong choice at step 3 may only manifest as a failure at step 47. PRMs and dense reward signals partially address this. (2) Context window saturation: trajectories grow long - each observation, tool output, and reasoning step adds tokens. At 100K tokens (common for complex tasks), attention is diluted and the model loses track of early context. (3) Error accumulation: errors at early steps propagate to later steps. A wrong file read leads to a wrong hypothesis, which leads to a wrong code fix. Unlike a search problem where backtracking is free, agent errors in production have costs (time, side effects). (4) Exploration vs exploitation: should the agent commit to a promising-looking approach or explore alternatives? Without a search procedure such as MCTS, agents greedily commit and cannot easily backtrack. (5) Environment non-stationarity: in multi-agent or collaborative settings, other agents or users change the environment concurrently. The agent's world model may become stale.

Key Takeaways

AI agents are language models embedded in an action-observation loop: state = (task + history + observations), action = tool calls or text, reward = task completion signal
ReAct interleaves reasoning (Thought) and environment interaction (Action + Observation), grounding chain-of-thought in real-world feedback and reducing hallucination
MCTS extends greedy agent planning to tree search: UCT balances exploitation of promising paths with exploration of alternatives, enabling multi-hypothesis reasoning at test time
Agent training follows the SFT → rollout → reward → policy update cycle, analogous to RLHF but with sparse binary rewards and long-horizon trajectories
The key reward design principle: make rewards objective and verifiable - binary test pass/fail is more reliable than qualitative scoring
Production agent systems require: sandboxed code execution, context window management, step limits and timeouts, and self-reflection loops for quality improvement
SWE-bench Verified (real GitHub issue resolution) is the primary measure of coding-agent progress: frontier models reached roughly 49% in 2024

