Raw API Agent Patterns
The Incident That Converted a Framework Believer
It is 3:14 AM when the page fires. The production research agent - the one that processes fifty requests per hour for enterprise customers - has stopped responding. The on-call engineer opens her laptop, checks the monitoring dashboard, and finds that the agent has been in a loop for forty minutes: calling the same tool repeatedly, getting the same result, calling it again. The API bill for those forty minutes is $340.
She opens the codebase and stares at the stack trace. The error originated six frames deep in the LangChain internals, passed through a callback handler, was caught and re-raised by the AgentExecutor, and then caught again by a retry decorator. By the time the exception reaches her application code, it has been transformed twice and the original error message - "tool returned None, retrying" - is buried.
She spent two hours debugging a forty-minute incident. She wrote the postmortem. Then she spent the following week rewriting the agent in raw Anthropic API calls. The rewrite was 180 lines. Every line was code she wrote. Every behavior was intentional.
The next time something went wrong - three weeks later - she fixed it in twelve minutes. The error message pointed directly to her code. The fix was four lines.
This is the core argument for raw API agents: not that frameworks are bad, but that transparency and debuggability have compounding value in production. When you control every line of the agentic loop, you understand every behavior. When you understand every behavior, you can fix problems fast. Fast fixes at 3 AM matter.
Why This Exists
What Frameworks Trade Away
Frameworks trade transparency for convenience. LangChain's AgentExecutor handles the agentic loop for you - but inside that loop are dozens of decisions: how to format tool results, whether to retry on parse errors, how many iterations to allow, what to do when no tool is called. These decisions are reasonable defaults, but they are not your decisions.
Every framework behavior you did not write is a behavior you might not understand when it surprises you. In development, surprises are inconvenient. In production at 3 AM, they are expensive.
The raw API approach inverts this trade-off: you write every behavior explicitly, which means you understand every behavior completely. The agentic loop is not magic - it is a while loop with three operations: call the model, execute tools, append results. That loop fits in forty lines of Python.
When Raw API Is Strictly Better
Simple agents (fewer than five tools, linear loop): the framework adds abstraction overhead without providing capabilities you need. A raw API agent in 60 lines is more readable and more debuggable than the equivalent LangChain agent.
Latency-critical agents: every abstraction layer adds some function call overhead. Model inference and network time usually dominate, but for agents called thousands of times per day where p99 latency matters, the raw API removes one avoidable source of delay.
High-control production systems: enterprises with strict behavior requirements, custom retry policies, specific logging formats, or compliance requirements often find that frameworks impose patterns that conflict with their requirements. Raw API gives them complete control.
Learning and debugging environments: if you want to deeply understand agentic systems, build one from scratch. The understanding transfers to all framework work.
Historical Context
Tool use in the Anthropic Messages API entered public beta in April 2024 and became generally available in May 2024. The API design was deliberately simple:
- You define tools as JSON schemas
- You send messages to the model
- The model either responds with text (stop_reason: "end_turn") or with tool calls (stop_reason: "tool_use")
- You execute the tool calls and append the results
- You send the updated messages back to the model
- Repeat
This simplicity keeps the loop visible instead of hiding it behind an abstraction. The API surface is minimal: client.messages.create() with tools and messages parameters.
The Anthropic Python SDK (v0.26+) adds type safety and convenience wrappers without hiding the API structure. Every tool use interaction follows the same pattern, making it easy to build reliable raw API agents.
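To make the pattern concrete, here is a sketch of what the message history might look like after one tool round trip, written as plain dictionaries. The tool name `get_weather`, the ID `toolu_01`, and all values are made up for illustration; real IDs come from the API response.

```python
# Hypothetical history after one tool round trip (illustrative values only).
messages = [
    {"role": "user", "content": "What is the weather in Paris?"},
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "I will check the weather."},
            {
                "type": "tool_use",
                "id": "toolu_01",          # assumed ID, normally issued by the API
                "name": "get_weather",     # assumed tool name
                "input": {"city": "Paris"},
            },
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "tool_result",
                "tool_use_id": "toolu_01",  # must match the tool_use block's id
                "content": "18 C, light rain",
            }
        ],
    },
]

# Every tool_use id in the assistant turn is answered in the next user turn.
tool_use_ids = [b["id"] for b in messages[1]["content"] if b["type"] == "tool_use"]
tool_result_ids = [b["tool_use_id"] for b in messages[2]["content"]]
print(tool_use_ids == tool_result_ids)
```

The roles alternate user, assistant, user, and the tool result travels in a user turn rather than in a dedicated "tool" role.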
The Core Agentic Loop
The fundamental loop fits in about two dozen lines:
import anthropic

client = anthropic.Anthropic()

def run_agent(messages: list, tools: list, system: str, max_turns: int = 20) -> str:
    for turn in range(max_turns):
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=4096,
            system=system,
            tools=tools,
            messages=messages
        )
        # Model is done - return its text response
        if response.stop_reason == "end_turn":
            return "\n".join(b.text for b in response.content if b.type == "text")
        # Model wants to use tools
        if response.stop_reason == "tool_use":
            messages.append({"role": "assistant", "content": response.content})
            tool_results = execute_tools(response.content)
            messages.append({"role": "user", "content": tool_results})
            continue
        # Any other stop reason (e.g. "max_tokens"): stop instead of resending the same messages
        return f"Stopped early: stop_reason={response.stop_reason}"
    return "Max turns reached without completing the task."
That is the complete agent loop. Everything else - retry logic, logging, cost tracking, streaming - is built around this core.
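One guard worth building around the core early is a check for the failure in the opening incident: the same tool called with the same arguments over and over. The following is a minimal sketch, not part of the Anthropic SDK; the helper name `is_stuck` and the threshold of three are assumptions for illustration.

```python
import json
from collections import Counter


def is_stuck(tool_calls: list[tuple[str, dict]], threshold: int = 3) -> bool:
    """Return True if any identical (name, input) call appears `threshold` or more times.

    `tool_calls` is a hypothetical record the loop keeps: one (tool name, input dict)
    pair per executed tool call.
    """
    counts = Counter(
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} compare equal
        (name, json.dumps(args, sort_keys=True))
        for name, args in tool_calls
    )
    return any(n >= threshold for n in counts.values())


calls = [("search_web", {"query": "llm agents"})] * 3
print(is_stuck(calls))      # True: three identical calls
print(is_stuck(calls[:2]))  # False: only two
```

In the loop, a check like this could run after each tool execution and end the run with an explicit error instead of spending forty minutes of API budget.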
Tool Definition Patterns
Schema-First Tool Definition
# Define tools as pure JSON schemas - the source of truth
TOOLS = [
    {
        "name": "search_web",
        "description": (
            "Search the web for current information. "
            "Use this for facts, news, recent events, or anything that might have changed. "
            "Returns a list of search results with titles, URLs, and snippets."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query. Be specific for better results."
                },
                "num_results": {
                    "type": "integer",
                    "description": "Number of results to return (1-10, default 5)",
                    "default": 5,
                    "minimum": 1,
                    "maximum": 10
                }
            },
            "required": ["query"]
        }
    },
    {
        "name": "read_file",
        "description": "Read the contents of a file at the given path.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {
                    "type": "string",
                    "description": "The file path to read"
                },
                "encoding": {
                    "type": "string",
                    "description": "File encoding (default: utf-8)",
                    "default": "utf-8"
                }
            },
            "required": ["path"]
        }
    },
    {
        "name": "write_file",
        "description": "Write content to a file. Creates the file if it does not exist.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "content": {"type": "string"},
                "mode": {
                    "type": "string",
                    "enum": ["write", "append"],
                    "default": "write"
                }
            },
            "required": ["path", "content"]
        }
    },
    {
        "name": "run_python",
        "description": "Execute a Python code snippet. Returns stdout and stderr.",
        "input_schema": {
            "type": "object",
            "properties": {
                "code": {
                    "type": "string",
                    "description": "Python code to execute"
                },
                "timeout": {
                    "type": "integer",
                    "description": "Execution timeout in seconds (max 60)",
                    "default": 30
                }
            },
            "required": ["code"]
        }
    }
]
Tool description quality matters enormously. The model reads the description to decide whether and how to call the tool. A good description answers:
- What does this tool do?
- When should I use it vs. other tools?
- What does it return?
- Are there any constraints?
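These questions can be turned into a rough automated check that runs before tools are registered. The sketch below is one possible approach; `lint_tool` and the 40-character minimum are assumptions chosen for illustration, not rules from the API.

```python
def lint_tool(tool: dict, min_description_chars: int = 40) -> list[str]:
    """Return a list of human-readable problems found in a tool definition."""
    problems = []
    for key in ("name", "description", "input_schema"):
        if key not in tool:
            problems.append(f"missing top-level key '{key}'")
    # A very short description rarely says when to use the tool or what it returns
    if len(tool.get("description", "")) < min_description_chars:
        problems.append("description is too short to say when to use the tool")
    schema = tool.get("input_schema", {})
    properties = schema.get("properties", {})
    for pname, spec in properties.items():
        if "description" not in spec:
            problems.append(f"parameter '{pname}' has no description")
    for pname in schema.get("required", []):
        if pname not in properties:
            problems.append(f"required parameter '{pname}' is not defined")
    return problems


example = {
    "name": "write_file",
    "description": "Write content to a file.",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}, "content": {"type": "string"}},
        "required": ["path", "content"],
    },
}
for problem in lint_tool(example):
    print(problem)
```

A check like this cannot judge whether a description is good, only whether it is obviously missing something, so it complements a manual review and does not replace one.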
Tool Implementation
import subprocess
import sys
import tempfile
import os
import requests
from typing import Any

def search_web(query: str, num_results: int = 5) -> str:
    """Real web search implementation (using Brave Search API)."""
    # Replace with your preferred search API
    headers = {"X-Subscription-Token": os.environ.get("BRAVE_API_KEY", "")}
    params = {"q": query, "count": num_results, "text_decorations": False}
    try:
        resp = requests.get(
            "https://api.search.brave.com/res/v1/web/search",
            headers=headers,
            params=params,
            timeout=10
        )
        # Surface HTTP errors (bad key, rate limit) instead of parsing an error body
        resp.raise_for_status()
        data = resp.json()
        results = data.get("web", {}).get("results", [])
        output = []
        for r in results:
            output.append(f"Title: {r.get('title', 'N/A')}")
            output.append(f"URL: {r.get('url', 'N/A')}")
            output.append(f"Snippet: {r.get('description', 'N/A')}")
            output.append("")
        return "\n".join(output) if output else "No results found."
    except Exception as e:
        # Return the error as text so the model can react to it
        return f"Search error: {e}"

def read_file(path: str, encoding: str = "utf-8") -> str:
    """Read file contents with error handling."""
    try:
        with open(path, encoding=encoding) as f:
            content = f.read()
        # Truncate very large files
        if len(content) > 50000:
            return content[:50000] + f"\n\n[File truncated at 50,000 chars. Total: {len(content)} chars]"
        return content
    except FileNotFoundError:
        return f"Error: File not found at path '{path}'"
    except PermissionError:
        return f"Error: Permission denied reading '{path}'"
    except UnicodeDecodeError:
        return f"Error: Cannot decode file with encoding '{encoding}'"
    except OSError as e:
        # Covers remaining cases such as the path being a directory
        return f"Error reading '{path}': {e}"

def write_file(path: str, content: str, mode: str = "write") -> str:
    """Write to file with directory creation."""
    try:
        # Create parent directories if they do not exist
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        file_mode = "w" if mode == "write" else "a"
        with open(path, file_mode, encoding="utf-8") as f:
            f.write(content)
        return f"Successfully {'wrote' if mode == 'write' else 'appended'} {len(content)} chars to '{path}'"
    except PermissionError:
        return f"Error: Permission denied writing to '{path}'"
    except Exception as e:
        return f"Error writing file: {e}"

def run_python(code: str, timeout: int = 30) -> str:
    """Execute Python code in a subprocess (basic sandbox)."""
    # In production: use Docker or e2b for proper sandboxing
    timeout = min(timeout, 60)  # Enforce maximum timeout
    with tempfile.NamedTemporaryFile(
        mode='w', suffix='.py', delete=False, encoding='utf-8'
    ) as f:
        f.write(code)
        tmpfile = f.name
    try:
        result = subprocess.run(
            [sys.executable, tmpfile],
            capture_output=True,
            text=True,
            timeout=timeout,
            # Basic safety: no stdin, limited environment
            stdin=subprocess.DEVNULL,
            env={
                "PATH": os.environ.get("PATH", ""),
                "HOME": os.environ.get("HOME", ""),
                "PYTHONPATH": os.environ.get("PYTHONPATH", ""),
            }
        )
        output = []
        if result.stdout:
            output.append(f"STDOUT:\n{result.stdout}")
        if result.stderr:
            output.append(f"STDERR:\n{result.stderr}")
        output.append(f"Exit code: {result.returncode}")
        return "\n".join(output) if output else "Code ran successfully with no output."
    except subprocess.TimeoutExpired:
        return f"Error: Code execution timed out after {timeout} seconds"
    except Exception as e:
        return f"Error executing code: {e}"
    finally:
        # Always remove the temporary script, even after a timeout
        os.unlink(tmpfile)

# Tool dispatch map
TOOL_FUNCTIONS = {
    "search_web": search_web,
    "read_file": read_file,
    "write_file": write_file,
    "run_python": run_python,
}
Tool Execution with Error Handling
import logging

def execute_tools(content_blocks: list) -> list[dict]:
    """Execute all tool calls in a response and return results."""
    results = []
    for block in content_blocks:
        if block.type != "tool_use":
            continue
        tool_fn = TOOL_FUNCTIONS.get(block.name)
        if not tool_fn:
            # Unknown tool - return error, do not crash
            result = f"Error: Unknown tool '{block.name}'. Available: {list(TOOL_FUNCTIONS.keys())}"
        else:
            try:
                result = str(tool_fn(**block.input))
            except TypeError as e:
                # Wrong arguments - tell the model what went wrong
                result = f"Error: Invalid arguments for '{block.name}': {e}"
            except Exception as e:
                # Any other error - log it, return error to model
                logging.error(f"Tool '{block.name}' failed: {e}", exc_info=True)
                result = f"Error in '{block.name}': {type(e).__name__}: {e}"
        # One tool_result per tool_use block, linked by tool_use_id
        results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": result
        })
    return results
The Complete Agent: 200 Lines
Here is a production-quality agent implementation with all the essential patterns:
import anthropic
import logging
import time
import json
from typing import Callable, Optional
from dataclasses import dataclass, field
logger = logging.getLogger(__name__)

@dataclass
class AgentConfig:
    """Configuration for the raw API agent."""
    model: str = "claude-opus-4-6"
    max_tokens: int = 4096
    max_turns: int = 25
    temperature: float = 0.0
    system_prompt: str = "You are a helpful AI assistant."


@dataclass
class AgentRun:
    """Records the complete history of an agent run."""
    task: str
    messages: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
    total_input_tokens: int = 0
    total_output_tokens: int = 0
    turns: int = 0
    success: bool = False
    error: Optional[str] = None
    final_answer: Optional[str] = None  # Set by the loop when the model finishes
    elapsed_seconds: float = 0.0

    @property
    def estimated_cost_usd(self) -> float:
        """Rough cost estimate at Claude Opus pricing."""
        # Update these for current pricing
        return (self.total_input_tokens / 1_000_000 * 15 +
                self.total_output_tokens / 1_000_000 * 75)


class RawAPIAgent:
    """
    A production-ready agent built entirely on the Anthropic API.
    No framework dependencies. Full control over every behavior.

    Features:
    - Configurable agentic loop with max_turns safeguard
    - Pluggable tool system with error isolation
    - Complete run logging for observability
    - Cost tracking per run
    - Retry logic for transient API errors
    - Streaming support (provided by the separate run_agent_streaming function)
    """

    def __init__(self, config: Optional[AgentConfig] = None):
        self.client = anthropic.Anthropic()
        self.config = config or AgentConfig()
        self.tools: list[dict] = []
        self.tool_fns: dict[str, Callable] = {}

    def register_tool(
        self,
        name: str,
        fn: Callable,
        description: str,
        input_schema: dict
    ) -> None:
        """Register a Python function as an available tool."""
        self.tools.append({
            "name": name,
            "description": description,
            "input_schema": input_schema
        })
        self.tool_fns[name] = fn

    def run(self, task: str) -> AgentRun:
        """Run the agent on a task. Returns a complete run record."""
        run = AgentRun(task=task)
        run.messages = [{"role": "user", "content": task}]
        start_time = time.time()
        try:
            self._execute_loop(run)
            # The loop records max_turns and unexpected-stop failures in run.error
            run.success = run.error is None
        except Exception as e:
            run.error = str(e)
            logger.error(f"Agent run failed: {e}", exc_info=True)
        finally:
            run.elapsed_seconds = time.time() - start_time
            self._log_run(run)
        return run

    def _execute_loop(self, run: AgentRun) -> None:
        """The core agentic loop."""
        for turn in range(self.config.max_turns):
            run.turns = turn + 1
            # Call the model with retry
            response = self._call_model_with_retry(run.messages)
            # Track tokens
            run.total_input_tokens += response.usage.input_tokens
            run.total_output_tokens += response.usage.output_tokens
            # Log this turn
            logger.info({
                "event": "agent_turn",
                "turn": turn + 1,
                "stop_reason": response.stop_reason,
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens
            })
            # Model finished - extract and return the text
            if response.stop_reason == "end_turn":
                final_text = self._extract_text(response)
                run.messages.append({
                    "role": "assistant",
                    "content": response.content
                })
                # Store final answer in run record
                run.final_answer = final_text
                return
            # Model wants to use tools
            if response.stop_reason == "tool_use":
                run.messages.append({
                    "role": "assistant",
                    "content": response.content
                })
                # Execute all requested tool calls
                tool_results = self._execute_tools(response.content, run)
                run.messages.append({
                    "role": "user",
                    "content": tool_results
                })
                continue
            # Unexpected stop reason (e.g. "max_tokens"): record it and stop
            logger.warning(f"Unexpected stop_reason: {response.stop_reason}")
            run.error = f"Unexpected stop_reason: {response.stop_reason}"
            return
        # Only reached when every turn ended in a tool call
        run.error = f"Reached max_turns ({self.config.max_turns}) without completing"

    def _call_model_with_retry(
        self,
        messages: list,
        max_retries: int = 3
    ):
        """Call the model with exponential backoff retry for transient errors."""
        for attempt in range(max_retries + 1):
            try:
                return self.client.messages.create(
                    model=self.config.model,
                    max_tokens=self.config.max_tokens,
                    temperature=self.config.temperature,
                    system=self.config.system_prompt,
                    tools=self.tools if self.tools else anthropic.NOT_GIVEN,
                    messages=messages
                )
            except anthropic.RateLimitError:
                if attempt == max_retries:
                    raise
                wait = 2 ** attempt * 5  # 5s, 10s, 20s
                logger.warning(f"Rate limit hit, waiting {wait}s (attempt {attempt + 1})")
                time.sleep(wait)
            except (anthropic.APIConnectionError, anthropic.InternalServerError):
                # Network failures and 5xx responses are usually transient
                if attempt == max_retries:
                    raise
                wait = 2 ** attempt  # 1s, 2s, 4s
                logger.warning(f"Transient API error, retrying in {wait}s")
                time.sleep(wait)
            except anthropic.BadRequestError:
                # Context too long, malformed request, etc. - do not retry
                raise

    def _execute_tools(self, content_blocks, run: AgentRun) -> list[dict]:
        """Execute tool calls and return formatted results."""
        results = []
        for block in content_blocks:
            if block.type != "tool_use":
                continue
            call_record = {
                "tool": block.name,
                "inputs": block.input,
                "tool_use_id": block.id
            }
            tool_fn = self.tool_fns.get(block.name)
            if not tool_fn:
                result_content = (
                    f"Error: Tool '{block.name}' is not registered. "
                    f"Available tools: {list(self.tool_fns.keys())}"
                )
                call_record["success"] = False
                call_record["error"] = "unregistered tool"
            else:
                try:
                    call_start = time.time()
                    result_content = str(tool_fn(**block.input))
                    call_record["elapsed_ms"] = round((time.time() - call_start) * 1000)
                    call_record["success"] = True
                except TypeError as e:
                    result_content = f"Error: Wrong arguments for '{block.name}': {e}"
                    call_record["success"] = False
                    call_record["error"] = str(e)
                except Exception as e:
                    result_content = f"Error in '{block.name}': {type(e).__name__}: {e}"
                    call_record["success"] = False
                    call_record["error"] = str(e)
                    logger.error(f"Tool {block.name} failed", exc_info=True)
            call_record["result_length"] = len(result_content)
            run.tool_calls.append(call_record)
            logger.info({
                "event": "tool_call",
                "tool": block.name,
                "inputs": block.input,
                "result_preview": result_content[:200],
                "elapsed_ms": call_record.get("elapsed_ms")
            })
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result_content
            })
        return results

    def _extract_text(self, response) -> str:
        """Extract text from a response with multiple content blocks."""
        texts = [block.text for block in response.content if hasattr(block, 'text')]
        return "\n".join(texts)

    def _log_run(self, run: AgentRun) -> None:
        """Log a complete run summary."""
        logger.info({
            "event": "agent_run_complete",
            "task": run.task[:100],
            "success": run.success,
            "turns": run.turns,
            "tool_calls": len(run.tool_calls),
            "total_tokens": run.total_input_tokens + run.total_output_tokens,
            "estimated_cost_usd": f"${run.estimated_cost_usd:.4f}",
            "elapsed_seconds": round(run.elapsed_seconds, 2),
            "error": run.error
        })

# ─── Streaming variant ─────────────────────────────────────────────────────────
def run_agent_streaming(
    task: str,
    tools: list[dict],
    tool_fns: dict[str, Callable],
    system: str,
    on_text: Callable[[str], None] = print,
    max_turns: int = 20
) -> str:
    """
    Run the agent with streaming output for long-running tasks.

    Args:
        task: The task to complete
        tools: Tool schema list
        tool_fns: Tool implementation functions
        system: System prompt
        on_text: Callback called with each text chunk as it streams
        max_turns: Maximum agentic loop iterations

    Returns:
        The text streamed across all turns, concatenated
    """
    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": task}]
    final_text = ""
    for turn in range(max_turns):
        with client.messages.stream(
            model="claude-opus-4-6",
            max_tokens=4096,
            system=system,
            tools=tools,
            messages=messages
        ) as stream:
            # text_stream yields text deltas only; the SDK accumulates the
            # partial JSON of tool calls and exposes it on the final message
            for text in stream.text_stream:
                on_text(text)
                final_text += text
            # Get the final response object while the stream is still open
            response = stream.get_final_message()
        if response.stop_reason != "tool_use":
            # "end_turn", "max_tokens", etc.: there is nothing left to execute
            break
        # Execute tools and continue
        messages.append({"role": "assistant", "content": response.content})
        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            fn = tool_fns.get(block.name)
            if fn:
                try:
                    result = str(fn(**block.input))
                except Exception as e:
                    result = f"Tool error: {e}"
            else:
                result = f"Unknown tool: {block.name}"
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result
            })
            on_text(f"\n[Tool: {block.name}] → {result[:100]}...\n")
        messages.append({"role": "user", "content": tool_results})
    return final_text

# ─── Context Management ────────────────────────────────────────────────────────
def truncate_messages_to_fit(
    messages: list[dict],
    max_tokens: int = 180_000,
    keep_first: int = 1,
    keep_last: int = 8
) -> list[dict]:
    """
    Truncate message history to fit within context limits.

    Preserves the first few messages (original task) and most recent messages
    (current context), replacing the middle section with a short text digest.

    Args:
        messages: Current message history
        max_tokens: Token budget for context. Not enforced here; the caller
            decides when to truncate based on reported token usage.
        keep_first: Number of messages to always keep from the start
        keep_last: Number of messages to keep from the end (may grow by one
            to keep a tool_use / tool_result pair together)

    Returns:
        Truncated message list
    """
    if len(messages) <= keep_first + keep_last:
        return messages

    def get(block, key):
        # Content blocks may be plain dicts or SDK objects
        return block.get(key) if isinstance(block, dict) else getattr(block, key, None)

    def has_tool_result(msg):
        content = msg["content"]
        return isinstance(content, list) and any(
            get(b, "type") == "tool_result" for b in content
        )

    # If the tail would start on a tool_result, move the cut back so the
    # matching tool_use stays in the history (the API rejects orphaned results)
    start = len(messages) - keep_last
    while keep_first < start < len(messages) and has_tool_result(messages[start]):
        start -= 1
    head = messages[:keep_first]
    tail = messages[start:]
    middle = messages[keep_first:start]
    if not middle:
        return head + tail

    # Build a digest of the middle section from its text content
    middle_content = []
    for msg in middle:
        role = msg["role"]
        content = msg["content"]
        if isinstance(content, str):
            middle_content.append(f"[{role}]: {content[:500]}")
        elif isinstance(content, list):
            for block in content:
                if get(block, "type") == "text" and get(block, "text"):
                    middle_content.append(f"[{role}]: {get(block, 'text')[:500]}")
    summary_placeholder = {
        "role": "user",
        "content": (
            f"[CONTEXT SUMMARY: {len(middle)} messages were truncated. "
            f"Summary of truncated content:\n" +
            "\n".join(middle_content[:10]) + "]"
        )
    }
    return head + [summary_placeholder] + tail
Production Usage
if __name__ == "__main__":
    import logging
    logging.basicConfig(level=logging.INFO, format="%(message)s")

    # Build the agent
    config = AgentConfig(
        model="claude-opus-4-6",
        max_tokens=4096,
        max_turns=20,
        system_prompt="""You are a research and analysis assistant.
You can search the web, read files, write files, and run Python code.
When given a research task:
1. Plan your approach
2. Use tools to gather information
3. Use code to analyze or verify data when appropriate
4. Write your findings to a file for the user
Always verify claims from multiple sources."""
    )
    agent = RawAPIAgent(config)

    # Register tools
    agent.register_tool(
        "search_web", search_web,
        "Search the web for current information",
        {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]}
    )
    agent.register_tool(
        "read_file", read_file,
        "Read a file's contents",
        {"type": "object", "properties": {"path": {"type": "string"}}, "required": ["path"]}
    )
    agent.register_tool(
        "write_file", write_file,
        "Write content to a file",
        {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "content": {"type": "string"}
            },
            "required": ["path", "content"]
        }
    )
    agent.register_tool(
        "run_python", run_python,
        "Execute Python code",
        {"type": "object", "properties": {"code": {"type": "string"}}, "required": ["code"]}
    )

    # Run
    run = agent.run(
        "Research the current state of open-source LLM fine-tuning in 2025. "
        "Find the top 5 approaches, their trade-offs, and write a summary to ./research_output.md"
    )
    print(f"\n{'='*60}")
    print(f"Run complete: {'SUCCESS' if run.success else 'FAILED'}")
    print(f"Turns: {run.turns} | Tool calls: {len(run.tool_calls)}")
    print(f"Tokens: {run.total_input_tokens + run.total_output_tokens:,}")
    print(f"Estimated cost: ${run.estimated_cost_usd:.4f}")
    print(f"Time: {run.elapsed_seconds:.1f}s")
    if run.error:
        print(f"Error: {run.error}")
When to Graduate from Raw API
The raw API approach has limits. Recognize these signals as graduation triggers:
Signal 1: You are implementing resumption logic. If you are writing code to checkpoint agent state to a database so you can resume after crashes, you are re-implementing LangGraph's checkpointing. Migrate to LangGraph.
Signal 2: Your routing logic is deeply nested. If your run() function has more than three levels of if-statements making routing decisions, the logic has outgrown the linear loop. Migrate to LangGraph's conditional edges.
Signal 3: You have multiple agents coordinating. If you have written a raw API orchestrator that calls multiple sub-agents and aggregates their results, you have reimplemented part of CrewAI or LangGraph. Choose the framework that matches your coordination model.
Signal 4: You need human-in-the-loop with persistence. If you are building a system that pauses for human approval and resumes hours later, LangGraph's interrupt() and checkpointing are significantly more robust than what you will build from scratch.
The graduation path is clear: start raw, build until the specific pain appears, then adopt the framework that solves that specific pain.
:::danger Never Use eval() for Tool Execution
A common shortcut when building raw API agents is using eval() or exec() to dynamically execute tool function calls: eval(f"{block.name}(**block.input)"). This is a critical security vulnerability. A malicious input that influences the tool name or arguments can execute arbitrary Python code in your environment.
Always use an explicit dispatch dictionary: TOOL_FUNCTIONS[block.name](**block.input). This limits what can be called to functions you have explicitly registered, and raises KeyError for unrecognized tool names rather than executing arbitrary code.
:::
:::warning Context Window Management Is Not Optional
Without explicit context management, a long-running agent will eventually hit the model's context limit and raise a BadRequestError. This happens in production at unpredictable times - the time it takes depends on how verbose the tool outputs are.
Implement context management from the beginning: track token usage per turn, trigger truncation when approaching 80% of the context limit, and preserve the initial task and recent context in the truncated version. The truncate_messages_to_fit function above is a starting point. Tune keep_first and keep_last for your specific agent's needs.
:::
Interview Questions and Answers
Q1: Why would you build an agent with raw Anthropic API calls instead of using LangChain or LangGraph?
Three reasons, in decreasing order of importance: debuggability, control, and performance.
Debuggability: when a raw API agent fails, the error traces directly to your code. Every behavior in the agent is code you wrote and understand. When a LangChain agent fails, the trace often runs through multiple layers of framework code before reaching application code, requiring framework knowledge to interpret.
Control: the raw API lets you implement exactly the retry policy, logging format, cost tracking, and error handling you need. Frameworks have opinions about each of these that you override or work around.
Performance: removing abstraction layers reduces latency, which matters for agents called at high volume. The raw API also has no dependency on the framework's update cycle - you are not blocked by framework bugs or version conflicts.
The raw API is not always the right choice - it lacks built-in checkpointing, human-in-the-loop, and stateful orchestration. The decision should be based on what your specific agent needs.
Q2: What does stop_reason: "tool_use" mean in the Anthropic API, and how do you handle it correctly?
When Claude decides to call a tool, it returns a response with stop_reason: "tool_use". The response content contains one or more tool_use blocks, each with a name, id, and input dictionary. The response may also contain text blocks before the tool use blocks - these represent Claude's reasoning before calling the tool.
Correct handling: append the full assistant response (including both text and tool_use blocks) to the message history as an assistant turn. Execute each tool call. Build a user turn containing tool_result blocks - one per tool call - with the tool_use_id matching the tool_use block id. Append this user turn. Call the model again with the updated message history.
The critical detail: tool results must be in the same user turn and must reference the tool_use_id from the assistant turn. If you send a separate message per tool result, or if the tool_use_id does not match, the API will return an error.
Q3: How do you handle context window limits in a raw API agent?
Two strategies: prevention and recovery.
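Prevention: track token usage per turn using response.usage.input_tokens and response.usage.output_tokens. The input_tokens value of the most recent call already counts the entire message history, so compare that value, not a running sum across turns, against the model's context limit. When it approaches 80% of the limit (200K tokens for many Claude models), trigger truncation before the next API call.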
Recovery: implement a truncate_messages_to_fit() function that preserves the original task (first messages), the most recent context (last N messages), and summarizes the middle section. The summarization can be done with another LLM call or with simple text extraction.
For agents with very long tool outputs, also truncate individual tool results before appending them. A web scrape that returns 50,000 characters is rarely fully useful - truncate to the first 5,000 and note the truncation.
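Both checks are small enough to sketch. The helper names `should_truncate` and `clip_tool_result` are hypothetical, and the defaults (a 200,000-token limit, an 80% threshold, a 5,000-character cap) simply mirror the numbers in this answer. The sketch assumes the input token count of the most recent call is the right measure of context size, since that count already includes the full history.

```python
def should_truncate(
    last_input_tokens: int,
    context_limit: int = 200_000,
    threshold: float = 0.8,
) -> bool:
    """True once the most recent request used `threshold` of the context window."""
    return last_input_tokens >= context_limit * threshold


def clip_tool_result(text: str, max_chars: int = 5_000) -> str:
    """Cut a long tool output and tell the model that it was cut."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + f"\n[Truncated: showing {max_chars} of {len(text)} chars]"


print(should_truncate(170_000))           # True: past 80% of 200K
print(should_truncate(100_000))           # False: plenty of room left
print(len(clip_tool_result("x" * 50_000)) < 50_000)  # True: output was clipped
```

In the loop, `clip_tool_result` would wrap each tool's return value before it is appended, and `should_truncate(response.usage.input_tokens)` would decide whether to call the truncation function before the next request.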
Q4: How do you implement retry logic for the Anthropic API in a production agent?
Handle three distinct error types differently: rate limit errors, transient connection errors, and permanent errors.
Rate limits (anthropic.RateLimitError): retry with exponential backoff. Wait 5 seconds, then 10, then 20. After three retries, raise. In production, pre-emptively check your rate limit headroom before long agent runs.
Transient connection errors (anthropic.APIConnectionError, anthropic.APIStatusError with 5xx status): retry with exponential backoff, shorter delays (1s, 2s, 4s). These indicate temporary infrastructure issues.
Permanent errors (anthropic.BadRequestError, anthropic.AuthenticationError): do not retry. These indicate problems with your request (context too long, invalid API key, malformed tool schema) that will not resolve with retrying.
Implement this in a _call_model_with_retry() function that wraps the client.messages.create() call with the appropriate exception handling. Never apply retry logic at the agentic loop level (which would restart the entire task) - only at the individual API call level.
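The call-level retry can also be written as a generic wrapper, which makes the backoff schedule testable without touching the network. This is a sketch under assumptions: `call_with_backoff` is a hypothetical helper, and `ConnectionError` stands in for whichever exception types you treat as transient.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    retryable: tuple[type[BaseException], ...],
    base_delay: float = 1.0,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on a retryable error wait base_delay * 2**attempt and try again."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_retries:
                raise  # Out of retries: let the caller see the original error
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")  # The loop always returns or raises


# Demo with a fake operation that fails twice, then succeeds.
attempts = {"n": 0}
delays: list[float] = []


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# Passing delays.append as `sleep` records the waits instead of sleeping.
print(call_with_backoff(flaky, (ConnectionError,), base_delay=5, sleep=delays.append))
print(delays)
```

Errors that are not listed in `retryable` propagate on the first attempt, which matches the rule that permanent errors should not be retried.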
Q5: What is the difference between appending the full response to the message history vs. extracting and appending just the text content?
You must append the full response content (including tool_use blocks), not just the text. The Anthropic API validates message structure: a tool_use block in an assistant turn must be followed by a corresponding tool_result block in the next user turn, with matching tool_use_id.
If you only append the text content and discard the tool_use blocks, then append tool_results referencing those discarded tool_use_ids, the API will raise a BadRequestError about missing tool references.
The correct pattern is always: messages.append({"role": "assistant", "content": response.content}) - where response.content is the full list of content blocks, not just the text. Then messages.append({"role": "user", "content": tool_results}) where each tool_result references the tool_use_id from the corresponding block in the previous assistant turn.
This is the most common mistake made by engineers building their first raw API agent. The fix is simple once you understand the validation requirement.
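A small local check can catch this mistake before the request is sent. The sketch below is one possible validator, not something the SDK provides; `check_tool_pairing` and its helpers are hypothetical names, and the check only looks at adjacent turns.

```python
def _field(block, name):
    """Read a field from either a plain dict or an SDK content-block object."""
    return block.get(name) if isinstance(block, dict) else getattr(block, name, None)


def _ids(content, block_type: str, key: str) -> list:
    """Collect `key` from every block of `block_type`; plain-string content has none."""
    if not isinstance(content, list):
        return []
    return [_field(b, key) for b in content if _field(b, "type") == block_type]


def check_tool_pairing(messages: list[dict]) -> list[str]:
    """Return a description of every tool_use / tool_result that lacks its partner."""
    problems = []
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant":
            # Each tool_use must be answered in the very next message
            nxt = messages[i + 1]["content"] if i + 1 < len(messages) else []
            answered = _ids(nxt, "tool_result", "tool_use_id")
            for tool_id in _ids(msg["content"], "tool_use", "id"):
                if tool_id not in answered:
                    problems.append(f"message {i}: tool_use '{tool_id}' has no tool_result")
        else:
            # Each tool_result must refer to a tool_use in the previous message
            prev = messages[i - 1]["content"] if i > 0 else []
            requested = _ids(prev, "tool_use", "id")
            for tool_id in _ids(msg["content"], "tool_result", "tool_use_id"):
                if tool_id not in requested:
                    problems.append(f"message {i}: tool_result '{tool_id}' has no tool_use")
    return problems


# The mistake described above: only the text of the assistant turn was kept.
broken = [
    {"role": "user", "content": "Find recent papers on agents."},
    {"role": "assistant", "content": "Let me search."},
    {"role": "user", "content": [
        {"type": "tool_result", "tool_use_id": "toolu_1", "content": "3 results"}
    ]},
]
print(check_tool_pairing(broken))
```

Running a check like this in tests, or just before each API call during development, turns a confusing API error into a message that names the offending turn.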
