The ReAct Pattern
Before ReAct, Language Models Hallucinated Facts
Ask GPT-3 in 2022 who won the 2020 US presidential election. It will tell you Joe Biden. Ask it who the current CEO of Twitter is. It will tell you Jack Dorsey (wrong - he had stepped down). Ask it what the current price of Bitcoin is. It will make up a number. The model knows facts from its training data, but its training data has a cutoff, and it has no mechanism to distinguish what it knows from what it is confabulating.
The standard solution at the time was retrieval-augmented generation: query a database, inject the result into the prompt, ask the model to answer based on the retrieved context. This worked for single lookups. It failed for multi-step tasks that required acting on information - finding a fact, using it to look up another fact, combining multiple sources.
What researchers needed was a model that could reason about when to look something up, look it up, see the result, reason about the result, decide what to look up next - an iterative reasoning-and-acting loop grounded in real observations.
In October 2022, Yao et al. from Princeton and Google Brain published "ReAct: Synergizing Reasoning and Acting in Language Models." The paper demonstrated that interleaving verbal reasoning traces with actions - specifically, reasoning before each action and after each observation - produced more grounded, less hallucination-prone trajectories than pure reasoning (Chain-of-Thought) and more reliable task completion than pure acting (without explicit reasoning). The key insight was simple and profound: ground your reasoning in reality by acting on the world and observing the results.
That paper changed how we build agents.
Why Reasoning Alone Is Not Enough
Chain-of-Thought (CoT) prompting, introduced by Wei et al. (2022), showed that asking models to reason step-by-step before answering significantly improved accuracy. Instead of answering "What is the capital of Germany?" directly, the model is prompted to write out its reasoning first: "Let me think step by step. Germany is a country in central Europe. Its major cities are Berlin, Hamburg, Munich. The capital of Germany is Berlin."
This works remarkably well for math problems and logical reasoning where all relevant information is contained in the question. It fails for factual questions where the model needs information it might not have, or might have incorrectly memorized.
PURE CHAIN-OF-THOUGHT FAILURE EXAMPLE:
Question: What is the current population of the city where the 2024 Olympics were held?
Chain-of-Thought:
Thought 1: The 2024 Olympics were held in Paris, France.
Thought 2: Paris is the capital of France with a large population.
Thought 3: I believe the population of Paris is approximately 2.1 million people.
Answer: 2.1 million.
PROBLEM: The actual city population figure may be outdated or wrong.
The model had no way to verify - it just generated a plausible-sounding number.
Without a mechanism to look things up, CoT is just confident reasoning on potentially wrong premises. The reasoning chain is internally consistent but may be grounded in false beliefs.
Why Acting Alone Is Not Enough
What if you just give the model tools and let it use them without explicit reasoning? The model receives a question, generates tool calls, gets results, generates more tool calls - no explicit thought steps.
This fails in a different way. Without reasoning, the model makes poor tool selection decisions, fails to interpret ambiguous results correctly, and cannot maintain a coherent plan across multiple steps. Each tool call is made somewhat blindly, without the context of the overall strategy.
PURE ACTING FAILURE EXAMPLE:
Question: Who wrote the paper that introduced the attention mechanism?
Action 1: search("attention mechanism paper")
Result: [Many results about attention in ML, NLP, psychology...]
Action 2: search("attention is all you need")
Result: ["Attention Is All You Need" - Vaswani et al., 2017, Google Brain]
Action 3: search("Vaswani et al 2017")
Result: [More papers by Vaswani on transformers]
(Agent continues searching without converging on the answer)
PROBLEM: Without a reasoning trace connecting observations to conclusions,
the agent doesn't recognize that it already has the answer.
ReAct: Interleaving Thought and Action
The ReAct framework structures agent behavior as alternating Thought and Action phases, where each Thought is explicitly connected to the preceding Observation.
REACT STRUCTURE:
Question: {user question}
Thought 1: {reasoning about what to do first, and why}
Action 1: {tool call}
Observation 1: {tool result}
Thought 2: {reasoning about what the observation means, and what to do next}
Action 2: {tool call}
Observation 2: {tool result}
...
Thought N: {reasoning that the task is complete and what the answer is}
Final Answer: {answer to the original question}
The crucial feature: every action is preceded by a thought that explains the reasoning behind it, and every thought refers to the preceding observation. Reasoning is grounded in real observations, not in potentially wrong memorized beliefs.
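The structure above can be sketched as a small loop that runs without any model or network access. This is only an illustration under stated assumptions: `scripted_policy` is a hypothetical stand-in for the language model, `lookup` is a toy tool backed by a hard-coded dictionary, and the `finish` action mirrors the idea of a terminating action rather than any specific API.

```python
from typing import Callable

# Hypothetical toy tool: a lookup table standing in for a real search API.
def lookup(query: str) -> str:
    facts = {
        "2024 olympics host city": "Paris",
        "population of Paris": "about 2.1 million (city proper)",
    }
    return facts.get(query, "no result")


def scripted_policy(observations: list) -> tuple:
    """Stand-in for the model: returns (thought, (action_kind, payload))."""
    if len(observations) == 0:
        return "I need the host city first.", ("lookup", "2024 olympics host city")
    if len(observations) == 1:
        city = observations[0]
        return (
            f"The host city is {city}. Next I need its population.",
            ("lookup", f"population of {city}"),
        )
    return "I have both facts, so I can answer.", ("finish", observations[1])


def react_loop(policy: Callable, tool: Callable[[str], str], max_steps: int = 5) -> list:
    """Alternate Thought -> Action -> Observation until the policy finishes."""
    trace, observations = [], []
    for i in range(1, max_steps + 1):
        # Each thought is produced with all earlier observations in view.
        thought, (kind, payload) = policy(observations)
        trace.append(f"Thought {i}: {thought}")
        if kind == "finish":
            trace.append(f"Final Answer: {payload}")
            return trace
        trace.append(f"Action {i}: {kind}[{payload}]")
        observation = tool(payload)
        observations.append(observation)
        trace.append(f"Observation {i}: {observation}")
    trace.append("Final Answer: (no answer within step budget)")
    return trace


if __name__ == "__main__":
    print("\n".join(react_loop(scripted_policy, lookup)))
```

The point of the sketch is the data flow: the policy never sees anything except the observations returned by the tool, so every later thought is conditioned on real results rather than on recalled facts.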
The Original Paper's Results
The 2022 ReAct paper (Yao et al.) evaluated on three benchmarks:
HotpotQA (multi-hop question answering): with PaLM-540B, ReAct alone reached 27.4 exact match, ahead of Act-only (25.7) but slightly behind CoT (29.4). The best HotpotQA result (about 35 exact match) came from combining ReAct with CoT self-consistency, letting each method back off to the other.
FEVER (fact verification): ReAct achieved 60.9% accuracy versus CoT's 56.3% and Act-only's 58.9%.
ALFWorld (interactive household tasks): ReAct succeeded on 71% of tasks versus Act-only at 45%. The reasoning traces were essential for navigating rooms and handling unexpected situations.
The key finding: on tasks requiring factual grounding, reasoning alone hallucinates; acting alone misses the forest for the trees; ReAct combines the two, and on knowledge-heavy tasks it works best when paired with the model's internal reasoning.
ReAct vs Chain-of-Thought: The Critical Difference
| Dimension | Chain-of-Thought | ReAct |
|---|---|---|
| Reasoning | Yes - step by step | Yes - step by step |
| Tool use | No | Yes |
| Grounded in reality | No - relies on memory | Yes - verified by tools |
| Hallucination risk | High for factual tasks | Low - facts are looked up |
| Suitable for | Math, logic, analysis | Research, data retrieval, multi-step |
| Token cost | Low | Higher (tool calls add tokens) |
| Latency | Low | Higher (tool calls take time) |
The decision between CoT and ReAct: if the task requires facts from the external world, use ReAct. If the task requires reasoning about information already in context, CoT suffices.
Reflexion: ReAct With Memory
Shinn et al. (2023) extended ReAct with a concept called Reflexion. After completing a task (successfully or not), the agent generates a verbal reflection on what worked and what failed. These reflections are stored and injected into future runs as additional context.
Reflexion substantially improves performance on complex tasks by allowing the agent to learn from its mistakes within a session. The Shinn et al. paper showed improvements of 10-20% on programming tasks and decision-making benchmarks.
ReAct With Modern Function Calling
The original ReAct paper was implemented with text-based prompting: the model generated "Action: search[query]" as text, which was then parsed and executed. Modern APIs with native function calling make this cleaner and more reliable.
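To make the text-based approach concrete, here is a minimal sketch of how an action line of the form "Action: search[query]" might be parsed. The regular expression and the `parse_action` helper are assumptions written for illustration, not the parser used in the original paper.

```python
import re
from typing import Optional

# Matches lines such as "Action 2: search[Paris population]" or "Action: finish[Berlin]".
ACTION_RE = re.compile(r"^Action(?: \d+)?:\s*(\w+)\[(.*)\]\s*$", re.MULTILINE)


def parse_action(model_output: str) -> Optional[tuple]:
    """Return (tool_name, argument) for the last Action line, or None if malformed."""
    matches = ACTION_RE.findall(model_output)
    if not matches:
        return None  # The scaffolding must decide how to recover (retry, re-prompt, abort).
    name, argument = matches[-1]
    return name, argument.strip()


if __name__ == "__main__":
    output = "Thought 1: I need the host city.\nAction 1: search[2024 Olympics host city]"
    print(parse_action(output))
    print(parse_action("Action: search 2024 Olympics"))  # missing brackets
```

The second call shows why this style was fragile: a small deviation from the expected format yields no parseable action, and the agent loop has to handle that case explicitly.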
With function calling, the ReAct pattern works like this:
- Thought: generated as a `text` content block before the tool call (or as content within the request)
- Action: generated as a `tool_use` content block
- Observation: returned as a `tool_result` content block
The explicit "Thought:" prefix is no longer necessary - the structure is encoded in the content block types. However, prompting the model to reason before acting (via the system prompt) still significantly improves performance.
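As a rough illustration of that mapping, the sketch below writes one Thought-Action-Observation exchange as plain Python dictionaries in the shape used by the Anthropic Messages API. The id `toolu_example_1` is a placeholder (real ids are generated by the API), and `label_blocks` is a hypothetical helper that exists only to show which block type plays which ReAct role.

```python
# Assistant turn: the Thought is a text block, the Action is a tool_use block.
assistant_turn = {
    "role": "assistant",
    "content": [
        {"type": "text", "text": "I need the host city of the 2024 Olympics first."},
        {
            "type": "tool_use",
            "id": "toolu_example_1",  # placeholder id for illustration only
            "name": "search_web",
            "input": {"query": "2024 Summer Olympics host city"},
        },
    ],
}

# User turn: the Observation is a tool_result block that echoes the tool_use id.
observation_turn = {
    "role": "user",
    "content": [
        {
            "type": "tool_result",
            "tool_use_id": "toolu_example_1",
            "content": "The 2024 Summer Olympics were held in Paris, France.",
        }
    ],
}

REACT_ROLE = {"text": "Thought", "tool_use": "Action", "tool_result": "Observation"}


def label_blocks(turns: list) -> list:
    """Map each content block to the ReAct phase it corresponds to."""
    return [
        REACT_ROLE.get(block["type"], "Other")
        for turn in turns
        for block in turn["content"]
    ]


if __name__ == "__main__":
    print(label_blocks([assistant_turn, observation_turn]))
```

The pairing of `id` and `tool_use_id` is what ties each observation to the action that produced it, a link that text-based ReAct could only express through step numbering.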
Full ReAct Implementation From Scratch
"""
Complete ReAct agent implementation from scratch.
Uses Anthropic API with explicit reasoning traces.
Demonstrates: thought generation, action execution, observation integration,
ReAct failure mode detection, and Reflexion-style retry.
Install: pip install anthropic
"""
import anthropic
import json
import subprocess
import os
import re
from typing import Any
from dataclasses import dataclass, field
client = anthropic.Anthropic()
# ── Data types for trajectory tracking ───────────────────────────────────────
@dataclass
class ReActStep:
    """A single Thought-Action-Observation triple."""
    thought: str
    action_name: str | None
    action_input: dict | None
    observation: str | None
    is_final: bool = False
    final_answer: str | None = None


@dataclass
class ReActTrajectory:
    """Full trajectory of a ReAct run."""
    task: str
    steps: list[ReActStep] = field(default_factory=list)
    succeeded: bool = False
    reflection: str | None = None

    def to_text(self) -> str:
        """Format trajectory as human-readable text."""
        lines = [f"Task: {self.task}", ""]
        for i, step in enumerate(self.steps, 1):
            lines.append(f"Thought {i}: {step.thought}")
            if step.action_name:
                args_str = json.dumps(step.action_input, indent=2) if step.action_input else "{}"
                lines.append(f"Action {i}: {step.action_name}({args_str})")
                lines.append(f"Observation {i}: {step.observation}")
            if step.is_final:
                lines.append(f"Final Answer: {step.final_answer}")
            lines.append("")
        return "\n".join(lines)
# ── Tools ─────────────────────────────────────────────────────────────────────
TOOLS = [
    {
        "name": "search_web",
        "description": (
            "Search for information on the web. Returns relevant text excerpts. "
            "Use for current facts, recent events, statistics, and information "
            "that may not be in training data. Be specific in your query."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query string."},
                "num_results": {
                    "type": "integer",
                    "description": "Number of results to return (1-5). Default: 3.",
                    "default": 3
                }
            },
            "required": ["query"]
        }
    },
    {
        "name": "calculator",
        "description": (
            "Evaluate a mathematical expression and return the result. "
            "Supports +, -, *, /, **, (), and basic math functions (sqrt, log, etc.). "
            "Use this instead of computing in your head to ensure accuracy."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "expression": {
                    "type": "string",
                    "description": "Math expression to evaluate. Example: '(145 * 12) / 365'"
                }
            },
            "required": ["expression"]
        }
    },
    {
        "name": "lookup_fact",
        "description": (
            "Look up a specific fact in the knowledge base. "
            "Returns structured data for: country populations, country capitals, "
            "element properties, historical dates, mathematical constants. "
            "Use when you need a specific factual value rather than explanatory text."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "category": {
                    "type": "string",
                    "description": "Category of fact to look up.",
                    "enum": ["country_population", "country_capital", "element", "constant"]
                },
                "key": {
                    "type": "string",
                    "description": "The specific item to look up. Example: 'France' for country facts."
                }
            },
            "required": ["category", "key"]
        }
    },
    {
        "name": "run_python",
        "description": "Execute Python code and return the output. Use for data processing and computation.",
        "input_schema": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Python code to execute."}
            },
            "required": ["code"]
        }
    }
]
def execute_tool(name: str, args: dict[str, Any]) -> str:
    """Execute a tool and return the result."""
    if name == "search_web":
        query = args.get("query", "")
        # In production, this would call a real search API (Tavily, SerpAPI, etc.)
        # For this demo, we simulate results
        mock_results = {
            "2024 olympics": (
                "The 2024 Summer Olympics (officially the Games of the XXXIII Olympiad) "
                "were held in Paris, France, from July 26 to August 11, 2024. "
                "Paris is the capital of France with a city population of approximately "
                "2.1 million (metropolitan area: 12 million)."
            ),
            "paris population": (
                "Paris proper population: approximately 2,165,423 (2023 census). "
                "Greater Paris metropolitan area: approximately 12.2 million (2023)."
            ),
            "gpt-4 release date": (
                "GPT-4 was released by OpenAI on March 14, 2023. "
                "It was made available via the API and ChatGPT Plus subscription."
            )
        }
        # Return the first mock result whose key shares any word with the query
        for key, result in mock_results.items():
            if any(word in query.lower() for word in key.split()):
                return f"Search results for '{query}':\n{result}"
        return (
            f"Search results for '{query}':\n"
            "No specific results found for this query. "
            "Try rephrasing or using a more specific search term."
        )

    elif name == "calculator":
        expression = args.get("expression", "")
        try:
            # Restricted evaluation - only allow characters needed for math.
            # Underscores are excluded so attribute tricks such as __class__ are rejected.
            # This is still eval(); in production, use a proper math parser.
            allowed = set("0123456789+-*/()., abcdefghijklmnopqrstuvwxyz")
            if not all(c in allowed for c in expression.lower()):
                return f"Error: Expression contains disallowed characters: {expression}"
            import math
            safe_globals = {
                "sqrt": math.sqrt,
                "log": math.log,
                "log10": math.log10,
                "sin": math.sin,
                "cos": math.cos,
                "pi": math.pi,
                "e": math.e,
                "__builtins__": {}
            }
            result = eval(expression, safe_globals)
            return f"Result: {expression} = {result}"
        except Exception as ex:
            return f"Error evaluating expression '{expression}': {ex}"

    elif name == "lookup_fact":
        category = args.get("category", "")
        key = args.get("key", "").lower()
        databases = {
            "country_population": {
                "france": "67,750,000 (2023 estimate)",
                "germany": "83,200,000 (2023 estimate)",
                "japan": "124,900,000 (2023 estimate)",
                "usa": "335,000,000 (2023 estimate)",
                "china": "1,409,000,000 (2023 estimate)",
            },
            "country_capital": {
                "france": "Paris",
                "germany": "Berlin",
                "japan": "Tokyo",
                "usa": "Washington, D.C.",
                "china": "Beijing",
            },
            "constant": {
                "pi": "3.14159265358979",
                "e": "2.71828182845905",
                "golden_ratio": "1.61803398874989",
                "speed_of_light": "299,792,458 m/s",
            }
        }
        db = databases.get(category, {})
        if key in db:
            return f"{category}[{key}] = {db[key]}"
        else:
            available = list(db.keys())[:5]
            return (
                f"'{key}' not found in {category}. "
                f"Available entries include: {', '.join(available)}"
            )

    elif name == "run_python":
        code = args.get("code", "")
        try:
            # Runs model-generated code in a subprocess; sandbox this in production
            result = subprocess.run(
                ["python3", "-c", code],
                capture_output=True, text=True, timeout=10
            )
            out = result.stdout or "(no stdout)"
            err = f"\nstderr: {result.stderr}" if result.stderr else ""
            return f"stdout: {out}{err}"
        except subprocess.TimeoutExpired:
            return "Error: Timeout after 10 seconds."
        except Exception as ex:
            return f"Error: {ex}"

    else:
        return f"Error: Unknown tool '{name}'"
# ── ReAct loop ────────────────────────────────────────────────────────────────
REACT_SYSTEM_PROMPT = """You are a careful, methodical AI agent using the ReAct framework.
For every step:
1. THINK FIRST: Before taking any action, explain your reasoning.
- What do you know so far?
- What do you need to find out?
- Why is this specific action the right next step?
2. ACT: Call a tool based on your reasoning.
3. After seeing the observation, think again:
- What does this observation tell me?
- Is this enough to answer the question, or do I need more?
- What is my next action?
Key principles:
- Never state a fact without having looked it up if it matters to the answer
- If a lookup returns unexpected results, reason about why and try differently
- When you have enough information, state your final answer clearly
- Prefer specific lookups over general searches when you know what you need
- Do not repeat tool calls that already gave you the information you need"""
def run_react(
    task: str,
    max_steps: int = 15,
    reflection: str | None = None
) -> ReActTrajectory:
    """
    Run a ReAct agent on a task.

    Args:
        task: The question or task to complete
        max_steps: Maximum number of Thought-Action-Observation triples
        reflection: Optional reflection from a previous failed attempt (Reflexion)

    Returns:
        ReActTrajectory with the full record of the run
    """
    trajectory = ReActTrajectory(task=task)

    # Build the initial message, incorporating any reflection from a prior attempt
    initial_content = task
    if reflection:
        initial_content = (
            f"{task}\n\n"
            f"[Note: You attempted this before and it did not succeed. "
            f"Here is your reflection on what went wrong:\n{reflection}\n\n"
            f"Please try a different approach this time.]"
        )
    messages = [{"role": "user", "content": initial_content}]

    print(f"\n{'='*65}")
    print(f"ReAct starting: {task}")
    if reflection:
        print("[With reflection from previous attempt]")
    print(f"{'='*65}\n")

    for step_num in range(max_steps):
        print(f"[Step {step_num + 1}]")

        # Get the next thought/action from the model
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=2048,
            system=REACT_SYSTEM_PROMPT,
            tools=TOOLS,
            messages=messages
        )
        messages.append({"role": "assistant", "content": response.content})

        # Extract the thought (first text block, if any)
        thought = next(
            (block.text for block in response.content if hasattr(block, "text")),
            "(no explicit thought)"
        )
        print(f" Thought: {thought[:150]}...")

        # Check for final answer (no more tool calls)
        if response.stop_reason == "end_turn":
            step = ReActStep(
                thought=thought,
                action_name=None,
                action_input=None,
                observation=None,
                is_final=True,
                final_answer=thought
            )
            trajectory.steps.append(step)
            trajectory.succeeded = True
            print(" [FINAL ANSWER]")
            print(f"{'='*65}")
            return trajectory

        # Process tool calls
        if response.stop_reason == "tool_use":
            tool_results = []
            step_actions = []
            step_observations = []
            for block in response.content:
                if block.type == "tool_use":
                    action_name = block.name
                    action_input = block.input
                    print(f" Action: {action_name}({json.dumps(action_input)[:60]})")

                    # Execute the tool
                    observation = execute_tool(action_name, action_input)
                    print(f" Observation: {observation[:100]}...")

                    step_actions.append((action_name, action_input))
                    step_observations.append(observation)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": observation
                    })
            messages.append({"role": "user", "content": tool_results})

            # Record this Thought-Action-Observation triple
            # (For simplicity, record the last action if multiple were made)
            step = ReActStep(
                thought=thought,
                action_name=step_actions[-1][0] if step_actions else None,
                action_input=step_actions[-1][1] if step_actions else None,
                observation=step_observations[-1] if step_observations else None
            )
            trajectory.steps.append(step)
        else:
            # Any other stop reason (for example "max_tokens") leaves the turn
            # incomplete; looping again would send two assistant turns in a row.
            print(f" [Stopped early: stop_reason={response.stop_reason}]")
            break

    # Hit max steps (or stopped early) without completion
    trajectory.succeeded = False
    return trajectory
# ── Reflexion wrapper ─────────────────────────────────────────────────────────
def generate_reflection(trajectory: ReActTrajectory) -> str:
    """
    Generate a reflection on a failed trajectory.
    This reflection will be used to improve the next attempt.
    """
    traj_text = trajectory.to_text()
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": (
                f"The following agent trajectory failed to complete the task. "
                f"Write a brief, specific reflection on:\n"
                f"1. What went wrong\n"
                f"2. What information was missing\n"
                f"3. What should be done differently\n\n"
                f"Trajectory:\n{traj_text}"
            )
        }]
    )
    return response.content[0].text


def run_react_with_reflexion(task: str, max_attempts: int = 3) -> str:
    """
    Run ReAct with Reflexion: retry failed attempts with generated reflections.
    """
    reflection = None
    for attempt in range(max_attempts):
        print(f"\n{'*'*65}")
        print(f"Attempt {attempt + 1}/{max_attempts}")
        print(f"{'*'*65}")

        trajectory = run_react(task, reflection=reflection)

        if trajectory.succeeded:
            final_step = next(
                (s for s in reversed(trajectory.steps) if s.is_final),
                None
            )
            if final_step and final_step.final_answer:
                return final_step.final_answer
            return "Task completed."

        if attempt < max_attempts - 1:
            print(f"\n[Attempt {attempt + 1} failed. Generating reflection...]")
            reflection = generate_reflection(trajectory)
            print(f"Reflection: {reflection[:200]}...")

    return f"Task not completed after {max_attempts} attempts."
# ── ReAct failure modes detection ─────────────────────────────────────────────
def detect_react_failure_modes(trajectory: ReActTrajectory) -> list[str]:
    """
    Analyze a trajectory for common ReAct failure modes.
    Returns a list of detected failure modes.
    """
    failures = []
    steps = trajectory.steps

    # Check for reasoning loops (same action repeated)
    action_history = []
    for step in steps:
        if step.action_name and step.action_input:
            key = f"{step.action_name}:{json.dumps(step.action_input, sort_keys=True)}"
            if key in action_history:
                failures.append(
                    f"Reasoning loop detected: '{step.action_name}' called with same args more than once"
                )
            action_history.append(key)

    # Check for ignoring observations
    for i, step in enumerate(steps[1:], 1):
        prev_obs = steps[i-1].observation or ""
        if prev_obs and step.thought and len(prev_obs) > 50:
            # Rough check: does the thought reference any content from the observation?
            obs_words = set(prev_obs.lower().split()[:20])  # first 20 words
            thought_words = set(step.thought.lower().split())
            overlap = obs_words & thought_words
            if len(overlap) < 2:  # Very little overlap suggests observation was ignored
                failures.append(
                    f"Step {i+1}: Thought may not have incorporated the previous observation"
                )

    # Check for too many steps on simple task
    if len(steps) > 8 and not trajectory.succeeded:
        failures.append(f"Excessive steps ({len(steps)}) without completion - strategy issue")

    return failures
# ── Run examples ──────────────────────────────────────────────────────────────
if __name__ == "__main__":
    # Example 1: Multi-hop factual question (ReAct shines here)
    result = run_react_with_reflexion(
        "What is the population of the country where the 2024 Summer Olympics were held?"
    )
    print(f"\n\nFINAL ANSWER: {result}\n")

    # Example 2: Calculation requiring facts + math
    result2 = run_react(
        "If the speed of light is divided by pi, what is the result? Show your work."
    )
    print(f"\n\nFINAL ANSWER 2: "
          f"{result2.steps[-1].final_answer if result2.steps else 'No answer'}\n")
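The calculator tool above notes that a production system should use a proper math parser instead of `eval`. One possible way to do that, sketched here under the assumption that only the operators and functions listed in the tool description are needed, is to walk the expression's syntax tree with Python's standard `ast` module and reject everything else. The name `safe_eval` and the whitelist tables are illustrative choices, not part of the implementation above.

```python
import ast
import math
import operator

# Whitelists: anything not listed here is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}
_FUNCS = {"sqrt": math.sqrt, "log": math.log, "log10": math.log10,
          "sin": math.sin, "cos": math.cos}
_CONSTS = {"pi": math.pi, "e": math.e}


def _eval(node: ast.AST):
    """Recursively evaluate a whitelisted subset of Python expression nodes."""
    if (isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval(node.operand))
    if isinstance(node, ast.Name) and node.id in _CONSTS:
        return _CONSTS[node.id]
    if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
            and node.func.id in _FUNCS and not node.keywords):
        return _FUNCS[node.func.id](*[_eval(arg) for arg in node.args])
    raise ValueError(f"Unsupported expression element: {type(node).__name__}")


def safe_eval(expression: str):
    """Evaluate an arithmetic expression without calling eval()."""
    tree = ast.parse(expression, mode="eval")
    return _eval(tree.body)


if __name__ == "__main__":
    print(safe_eval("(145 * 12) / 365"))
    print(safe_eval("sqrt(16) + 2 ** 3"))
```

Because only whitelisted node types are evaluated, attribute access, imports, and arbitrary function calls raise `ValueError` instead of executing. This sketch does not bound the cost of an expression, so very large exponents can still be slow; a real deployment would likely add limits.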
ReAct Failure Modes
ReAct is powerful but has known failure modes you need to plan for in production.
Production Engineering Notes
:::tip Explicit thought generation improves quality significantly
Even with modern function calling (where thoughts are implicit), prompting the model to "think step by step before each action" substantially improves tool selection accuracy. Add this instruction to your system prompt and you will see fewer irrelevant tool calls and better observation interpretation.
:::
:::warning ReAct is not appropriate for all tasks
If the task requires no external information (pure math, pure logic, analysis of provided text), ReAct adds latency and cost without benefit. Use Chain-of-Thought for self-contained tasks. Use ReAct when the task requires facts from the external world or multiple sequential lookups.
:::
:::danger Residual hallucination in thoughts
Even with ReAct, the model can introduce false facts in the Thought phase that do not come from any observation. Watch for thoughts like "as I know from my training, X is true" - these are hallucinations masquerading as knowledge. Prompt the model: "Only state facts you have confirmed through tool calls. Never state factual claims based on memory alone."
:::
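Prompting is the main defense, but scaffolding code can also flag suspicious thoughts. The sketch below is a deliberately crude heuristic, not an established technique: it extracts numbers from a thought and reports any that appear in no earlier observation. The helper names `NUMBER_RE` and `ungrounded_numbers` are hypothetical, and the check covers only numeric claims.

```python
import re

# Digits with optional thousands separators and an optional decimal part.
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def _numbers(text: str) -> set:
    """Extract numbers as strings, with thousands separators removed."""
    return {match.replace(",", "") for match in NUMBER_RE.findall(text)}


def ungrounded_numbers(thought: str, observations: list) -> set:
    """Return numbers stated in the thought that no observation contains."""
    seen = set()
    for observation in observations:
        seen |= _numbers(observation)
    return _numbers(thought) - seen


if __name__ == "__main__":
    obs = ["Paris proper population: approximately 2,165,423 (2023 census)."]
    print(ungrounded_numbers("The observation says 2,165,423 people as of 2023.", obs))
    print(ungrounded_numbers("Paris has about 2.1 million residents.", obs))
```

A non-empty result is a signal to inspect the step, not proof of hallucination: numbers that come from the user's question or from a correct calculation would also be flagged, so in practice the task text and calculator outputs would need to be included among the observations.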
Interview Questions
Q: What problem does ReAct solve that Chain-of-Thought does not?
Chain-of-Thought prompting improves reasoning by making it explicit and step-by-step, but it relies entirely on the model's memorized knowledge. For tasks requiring current facts, multi-hop lookups, or information not in the training data, CoT hallucinates - it generates plausible-sounding but potentially false intermediate steps, and these errors compound. ReAct solves this by grounding each reasoning step in real observations. After every thought, the agent takes an action (a tool call) and observes the actual result. The next thought is based on that real observation, not on potentially wrong memorized beliefs. This breaks the hallucination cycle because facts must be verified through tool calls before being used in further reasoning.
Q: Explain the Reflexion extension to ReAct. Why does it work?
Reflexion (Shinn et al., 2023) adds an episodic memory layer to ReAct. After a failed task attempt, the model generates a verbal reflection analyzing what went wrong and what should be done differently. This reflection is stored and prepended to the context of the next attempt. It works because LLMs are better at critiquing their own mistakes after the fact than at avoiding them in the first place - just as humans often understand an error clearly in hindsight. The reflection compresses the lessons from a failed trajectory into a compact piece of advice that fits in the context window. Results: 10-20% improvement on programming tasks and decision-making benchmarks. The main limitation is that each retry still starts from scratch execution-wise, only benefiting from the linguistic reflection.
Q: How has the implementation of ReAct evolved from the original paper to modern function calling APIs?
The original 2022 paper implemented ReAct with text-based prompting: the model generated text like "Action: Search[query]" and "Thought: Based on the search result..." as raw text, which was parsed with string matching and executed. This was fragile - the model sometimes generated malformed action text, and parsing was error-prone. Modern function calling APIs formalize this structure: thoughts are generated as text content blocks, actions are generated as structured tool_use content blocks with machine-readable JSON arguments, and observations are returned as typed tool_result blocks. The structure that ReAct encoded in text conventions is now encoded in the API protocol. The fundamental pattern - thought before action, observation before next thought - remains the same. What changed is reliability: modern function calling essentially never generates malformed tool calls, while text-based ReAct had significant parsing failures.
Q: What are the most common ReAct failure modes in production and how do you address each?
- Reasoning loops: the model calls the same tool with the same arguments repeatedly. Fix: track action history in scaffolding code, and if a repeat is detected, inject an interrupt message: "You have already tried this approach. Please try something fundamentally different."
- Observation misinterpretation: the model's next thought does not incorporate the key information from the observation. Fix: in the system prompt, require the model to explicitly quote from the observation in its next thought, ensuring it actually read it.
- Irrelevant tool calls: the model calls a tool that does not help with the current task. Fix: improve tool descriptions to explicitly say when NOT to use each tool.
- Premature termination: the model declares the task done before it actually is. Fix: add a verification step where the model checks its answer against the original requirements.
- Residual hallucination: the model introduces facts in thoughts that did not come from any tool. Fix: instruct explicitly that all factual claims must be grounded in tool results.
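The reasoning-loop fix can be sketched as a small guard that the agent loop consults before executing each tool call. `ActionHistory` and its `check` method are hypothetical names for illustration; how the interrupt text is delivered to the model (for example, as the content of the tool result) is left to the surrounding loop.

```python
import json
from typing import Optional

INTERRUPT_MESSAGE = (
    "You have already tried this approach. "
    "Please try something fundamentally different."
)


class ActionHistory:
    """Remembers which (tool, arguments) pairs have already been executed."""

    def __init__(self) -> None:
        self._seen = set()

    def check(self, tool_name: str, tool_input: dict) -> Optional[str]:
        """Return an interrupt message for a repeated call, or None if the call is new."""
        # sort_keys makes the key independent of argument order
        key = f"{tool_name}:{json.dumps(tool_input, sort_keys=True)}"
        if key in self._seen:
            return INTERRUPT_MESSAGE
        self._seen.add(key)
        return None


if __name__ == "__main__":
    history = ActionHistory()
    print(history.check("search_web", {"query": "paris population"}))
    print(history.check("search_web", {"query": "paris population"}))
```

In a loop like the one shown earlier, a non-`None` result would replace the tool execution for that call, so the model sees the interrupt instead of the same observation a second time.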
Q: ReAct was published in 2022. How does the current state of agentic AI relate to its findings?
The core insight of ReAct - that interleaving reasoning and acting in a loop is more reliable than acting alone and better grounded than reasoning alone - has held up well and is now the foundation of virtually every serious agent architecture. What has changed: (1) the action space has expanded dramatically from simple web searches to rich tool ecosystems with function calling; (2) reasoning quality has improved substantially as models got larger and better; (3) the original text-based parsing has been replaced by structured function calling APIs; (4) the pattern has been extended with memory (Reflexion), planning (ReAct + planning agents), and multi-agent orchestration (ReAct agents coordinating with each other). The benchmark numbers have also moved: the 2022 paper reported about 61% accuracy on FEVER and 71% success on ALFWorld, with HotpotQA exact match only in the high 20s to mid 30s, and later agents are reported to score considerably higher on comparable benchmarks. The original paper remains foundational reading because it identified the right abstraction. The implementations have evolved substantially.
