Master the Observe-Think-Act loop that drives every AI agent - from the detailed mechanics of each phase to error handling, backtracking, and token management.

The Agent Loop: Observe, Think, Act

A Flight Booking Agent Hits the Real World

The task seems simple: "Book me the cheapest flight to Tokyo next month." The agent starts confidently. Tool call 1: search for flights from the user's home city to Tokyo in the target date range. Tool call 2: filter results by price. Tool call 3: check the user's calendar for conflicts.

Then things get complicated. The cheapest option requires a 19-hour layover. Tool call 4: check if there are reasonable alternatives within 20% of the lowest price. Tool call 5: search again with a filter for maximum connection time. The agent finds a better option - but the booking page requires a CAPTCHA. Tool call 6: try a different booking API. The alternative API has a different date format and the query fails. Tool call 7: correct the date format and retry. This time it works, but the price has changed since the initial search.

The agent pauses. The price changed. Was this an error? Is this the same flight? It compares the booking confirmation details to the search result. Same flight number, different price - the fare class changed. Tool call 8: check if the original fare class is still available. Tool call 9: if not, check adjacent dates for the original fare. The original fare reappears two days later at the same price. It overlaps with the calendar - one meeting that could be moved. Tool call 10: check the meeting details. It is a one-on-one the user controls. Tool call 11: propose the rescheduling (but not actually reschedule - that requires confirmation). Tool call 12: present the booking with the updated date and the pending calendar change to the user.

12 tool calls total. 4 backtracking steps. 1 successful booking option. One human decision needed before completion.

That is the Observe-Think-Act loop in the real world - not the clean diagram on the slide, but the messy, adaptive, error-recovering reality of running an agent against an actual environment.


Why This Loop Is the Right Abstraction

Before the OTA loop became the standard model for agents, there were two competing paradigms for building autonomous AI systems.

The first was plan-then-execute: generate a complete plan upfront, then execute it step by step. This works well in perfectly predictable environments (chess, for instance). It fails in the real world because the real world never matches the plan exactly. The flight booking agent above would have failed after tool call 5, when the CAPTCHA appeared - the plan had no provision for that.

The second was pure reaction: respond to each observation independently, with no planning. This is how simple rule-based agents work. It fails when tasks require maintaining state and intent across multiple steps - the agent would forget it was trying to find the cheapest flight and might book the first available option.

The OTA loop combines the strengths of both: it plans at each step (think), acts based on that plan (act), and then updates its understanding based on what actually happened (observe). The loop is both reactive (it responds to real observations) and deliberative (it reasons about each observation before acting).

This is why the OTA loop appears - in some form - in every serious agent architecture. LangChain's AgentExecutor is an OTA loop. LangGraph is an OTA loop with explicit state management. ReAct is an OTA loop with explicit reasoning traces. AutoGPT is an OTA loop with a task queue. The loop is the fundamental unit of agentic computation.

The Three Phases in Depth

Phase 1: Observe

Observation is richer than it sounds. The agent is not just reading a single tool result - it is building a composite picture of the current state of the world from multiple sources.

Tool outputs: The most direct form of observation. When the agent calls search_flights(...), the returned JSON of flight options is an observation. When it calls run_python(code), the stdout and stderr are observations. These are the most reliable form of input - they represent ground truth from the environment.

Context history: Everything in the conversation history is an observation. This includes the original task, all prior tool calls and their results, and any prior reasoning the agent generated. This is the agent's episodic memory - its record of what it has tried and what happened.

Memory retrieval: For agents with long-term memory, the observe phase includes querying a vector database for relevant past experiences or facts. "Have I seen this error before? What did I do?" This is how agents can learn across sessions.

Environment state: Sometimes the observation is not a direct tool result but an inference about what changed. "The price changed between my first query and the booking attempt" is an observation derived from comparing two tool outputs.
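A derived observation of this kind can be computed by diffing two tool outputs. The sketch below assumes both outputs are flat dictionaries; the field names and the fare values are hypothetical, chosen only to mirror the price-change example.

```python
def diff_observations(before: dict, after: dict) -> dict:
    """Return {field: (old, new)} for every field whose value changed."""
    keys = before.keys() | after.keys()
    return {
        key: (before.get(key), after.get(key))
        for key in sorted(keys)
        if before.get(key) != after.get(key)
    }


# Hypothetical tool outputs: the initial search result and the later booking quote.
search_result = {"flight": "NH007", "price": 923, "fare_class": "K"}
booking_quote = {"flight": "NH007", "price": 1010, "fare_class": "M"}

changes = diff_observations(search_result, booking_quote)
print(changes)
```

Feeding the diff to the LLM ("price and fare_class changed, flight did not") is usually more useful than asking it to spot the difference in two raw payloads.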

The critical skill in observation is interpretation. Raw tool output is often ambiguous, incomplete, or misleading. The agent must parse it correctly. A JSON error response from an API might mean the request was malformed, the credentials expired, the rate limit was hit, or the service is down - each requiring a different response.
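One way to make that interpretation less error-prone is to classify errors in the scaffolding before the LLM sees them. The sketch below uses standard HTTP status codes; the `{"error": "..."}` body shape and the strategy labels are assumptions for illustration, since real APIs differ.

```python
import json


def classify_api_error(status_code: int, body: str) -> str:
    """Map an HTTP-style error response to a recovery strategy label."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        payload = None
    # Assumed body shape: {"error": "<message>"}; anything else yields no message.
    message = str(payload.get("error", "")).lower() if isinstance(payload, dict) else ""

    if status_code == 429 or "rate limit" in message:
        return "wait_and_retry"           # rate limit was hit
    if status_code in (401, 403):
        return "refresh_credentials"      # credentials expired or insufficient
    if status_code in (400, 422):
        return "fix_parameters"           # request was malformed
    if status_code >= 500:
        return "retry_or_switch_service"  # service is down or failing
    return "ask_llm"                      # ambiguous: let the model interpret it


print(classify_api_error(400, '{"error": "invalid date format"}'))
```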

Phase 2: Think

The think phase is where the LLM does its work. This is the most computationally expensive phase (an LLM API call) and the most important one.

Parsing and understanding: The LLM first makes sense of the observations. "The tool returned a list of 15 flights. The cheapest costs $847 with a 19-hour layover. The second cheapest is $923 with a 2-hour layover in Seoul."

Planning: Based on the understanding, the LLM decides what to do next. "The user asked for cheapest. $847 is cheapest. But a 19-hour layover is probably unacceptable. I should check if the user specified any constraints on connection time. If not, I should present options and let them decide."

Reasoning chains: Modern LLMs produce better decisions when they reason step-by-step before committing to an action. This explicit reasoning (chain-of-thought) is the difference between a model that randomly selects a tool and one that systematically works toward the goal.

Self-assessment: In the think phase, the LLM also evaluates whether the task is complete. "Do I have enough information to answer the user's question? Have I achieved the stated goal? Is there anything I should verify before concluding?"

Phase 3: Act

Action is how the agent changes the world. The LLM outputs structured tool calls (JSON-formatted requests), and the scaffolding executes them.

Read actions: Calls that gather information without side effects. Reading files, querying databases, searching the web. These are safe to retry and should generally be preferred when information is uncertain.

Write actions: Calls that change state. Writing files, updating databases, sending API requests that have side effects. These should be treated with caution - many writes are irreversible.

Compute actions: Running code, performing calculations, generating artifacts. The output becomes the next observation.

State updates: Updating the agent's own memory or scratchpad. These are internal actions that prepare the agent for future steps without affecting the external environment.
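The four action types can be made explicit in the scaffolding so that retries and confirmations follow from the category rather than from ad hoc checks. This is a sketch under assumptions: the registry is hypothetical, `update_scratchpad` is an invented tool name, and a compute tool such as `run_python` can still have side effects depending on the code it runs.

```python
from enum import Enum


class ActionKind(Enum):
    READ = "read"        # no side effects, safe to retry
    WRITE = "write"      # changes external state, may be irreversible
    COMPUTE = "compute"  # runs code; output becomes the next observation
    STATE = "state"      # updates the agent's own memory or scratchpad


# Hypothetical registry mapping tool names to categories.
TOOL_KINDS = {
    "read_file": ActionKind.READ,
    "list_directory": ActionKind.READ,
    "write_file": ActionKind.WRITE,
    "run_python": ActionKind.COMPUTE,
    "update_scratchpad": ActionKind.STATE,
}


def kind_of(tool_name: str) -> ActionKind:
    # Unknown tools are treated as writes, the most cautious default.
    return TOOL_KINDS.get(tool_name, ActionKind.WRITE)


def safe_to_retry(tool_name: str) -> bool:
    return kind_of(tool_name) is ActionKind.READ


def needs_confirmation(tool_name: str) -> bool:
    return kind_of(tool_name) is ActionKind.WRITE


print(safe_to_retry("read_file"), needs_confirmation("write_file"))
```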

The Trajectory

A trajectory is the full sequence of (observation, thought, action) triples across an agent's run. It is the agent's "audit log" - a complete record of what happened, in order.

Step 1:
  Observation: User asked "book cheapest flight to Tokyo next month"
  Thought: I need to find flights. Let me search for available options.
  Action: search_flights(origin="SFO", dest="NRT", dates="2025-04-01:2025-04-30")

Step 2:
  Observation: 15 flights found. Cheapest: $847, 19hr layover. Second: $923, 2hr layover.
  Thought: Cheapest has long layover. User may not want that. Check for constraints.
  Action: check_user_preferences(category="travel")

Step 3:
  Observation: User preferences: max_connection_time=4hr, preferred_airline=null
  Thought: User wants max 4hr connection. $847 flight (19hr layover) is out. Best option: $923.
  Action: get_flight_details(flight_id="NH007")
...

The trajectory is important because:

It enables debugging: when something goes wrong, you can trace exactly what happened
It enables evaluation: you can assess whether the agent made good decisions at each step
It enables learning: trajectories from successful runs can train better agents
It is the agent's working memory: the LLM sees the full trajectory at each step
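For debugging and evaluation, a trajectory is typically persisted in a structured form. The sketch below stores one JSON object per line (JSON Lines); the `Step` dataclass and the helper names are assumptions for illustration, and the sample steps are copied from the trajectory above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Step:
    observation: str
    thought: str
    action: str


def to_jsonl(steps: list[Step]) -> str:
    """Serialize one step per line, so the log can be appended to and searched."""
    return "\n".join(json.dumps(asdict(step)) for step in steps)


def from_jsonl(text: str) -> list[Step]:
    """Rebuild the steps from a JSON Lines log, skipping blank lines."""
    return [Step(**json.loads(line)) for line in text.splitlines() if line.strip()]


trajectory = [
    Step(
        observation='User asked "book cheapest flight to Tokyo next month"',
        thought="I need to find flights. Let me search for available options.",
        action='search_flights(origin="SFO", dest="NRT", dates="2025-04-01:2025-04-30")',
    ),
    Step(
        observation="15 flights found. Cheapest: $847, 19hr layover. Second: $923, 2hr layover.",
        thought="Cheapest has long layover. User may not want that. Check for constraints.",
        action='check_user_preferences(category="travel")',
    ),
]

log = to_jsonl(trajectory)
print(log)
```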

The Termination Problem

When does the agent stop? This is harder than it sounds.

Completion detection: The agent must recognize when the task is done. For well-defined tasks ("what is the capital of France?"), this is obvious. For open-ended tasks ("make my code better"), it is not. The LLM must decide when "good enough" is reached, and LLMs are imperfect at this - they can both terminate too early (before the task is actually done) and terminate too late (continuing to make changes after the task is complete).

Max iterations: The safety net. If the agent has not completed the task after N iterations, stop. N depends on the task: simple lookups need 3-5 iterations, complex coding tasks might need 30-50. Setting N too low causes premature termination; too high wastes time and money on clearly-failed runs.

Timeout: Wall-clock time limits. Some tasks must complete within a certain time regardless of iteration count. A real-time customer service agent cannot spend 10 minutes on a simple question.

Error threshold: If the agent encounters too many consecutive errors, stop and escalate to a human. Continuing in a persistent error state usually makes things worse.

Human interrupt: For high-stakes actions (sending emails, making purchases, deleting data), pause and require human confirmation before proceeding. This is the "human in the loop" pattern for sensitive operations.

Error Handling in the Loop

Error handling is where naive agent implementations fall apart. Here are the error types and how to handle them.

Transient errors: Tool calls that fail due to network issues, rate limits, or temporary unavailability. Strategy: retry with exponential backoff.

Malformed input errors: The agent sent a tool call with invalid parameters. Strategy: the tool returns a descriptive error message, the LLM reads it and corrects the parameters.

Semantic errors: The tool executed successfully but returned unexpected results (empty list, wrong format, out-of-range values). Strategy: the LLM must interpret the result and decide whether to retry with different parameters, try a different approach, or conclude that the task cannot be completed.

Cascading errors: One failed step invalidates subsequent steps. Strategy: the agent must detect when a critical step failed and backtrack to a clean state.

Irreversible errors: The agent performed a write action that cannot be undone (deleted a file, sent an email). Strategy: prevention - require confirmation before irreversible actions, maintain backups.
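The transient-error strategy above, retry with exponential backoff, can be sketched as a small wrapper. `TransientError` is a hypothetical marker exception; in practice you would catch whatever your client library raises for timeouts and rate limits. The `sleep` parameter is injectable so the wrapper can be tested without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def retry_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on TransientError with delays of 1x, 2x, 4x, ... the base."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            delay = base_delay_s * (2 ** attempt)
            # Up to 10% random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, 0.1 * delay))
    raise AssertionError("unreachable when max_retries >= 0")
```

Only transient errors should go through this path. Malformed-input and semantic errors need a changed request, so retrying them unchanged just burns iterations.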

Token Management Across the Loop

Every iteration of the OTA loop adds more content to the conversation history. Each tool result, each reasoning chain, each observation - they all accumulate. This creates a fundamental tension: longer histories give the agent more context and continuity, but they eventually exceed the model's context window.

The growing context problem:

Iteration 1: 500 tokens
Iteration 5: 3,000 tokens (each iteration adds ~500-600 tokens)
Iteration 20: 12,000 tokens
Iteration 50: 30,000 tokens
Iteration 100: 60,000 tokens (approaching the limit for some models)
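The figures above follow a simple linear model of roughly 600 tokens added per iteration. The sketch below reproduces that arithmetic and adds one consequence: because the full history is sent on every call, the total input tokens processed over a run grow roughly quadratically. It assumes constant growth per iteration and no prompt caching; the function names are illustrative.

```python
def context_size(iteration: int, tokens_per_iteration: int = 600) -> int:
    """Tokens in the history after a given iteration, under linear growth."""
    return iteration * tokens_per_iteration


def cumulative_input_tokens(iterations: int, tokens_per_iteration: int = 600) -> int:
    """Total input tokens sent across all calls: 600 + 1200 + ... = 600 * n(n+1)/2."""
    return tokens_per_iteration * iterations * (iterations + 1) // 2


def first_iteration_over(limit: int, tokens_per_iteration: int = 600) -> int:
    """First iteration at which the history exceeds the given token limit."""
    return limit // tokens_per_iteration + 1


for n in (5, 20, 50, 100):
    print(n, context_size(n), cumulative_input_tokens(n))
```

Under these assumptions a 100-iteration run holds 60,000 tokens in context at the end but has sent about 3 million input tokens in total, which is why context management matters for cost as well as for the window limit.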

Strategies for managing context:

Summarization: Periodically compress old observations and actions into a summary. "Steps 1-10: I searched for flights, checked user preferences, and found 3 viable options. The best option is $923 on ANA departing April 15." The summary replaces the detailed history.
Selective retention: Keep only the most recent N turns in full detail. Older turns are either discarded or summarized.
Scratchpad pattern: Instead of relying on full conversation history, maintain a structured scratchpad that the agent updates at each step. The scratchpad contains only the current state, not the full history.
Tool output truncation: Tool outputs can be very large (reading a 10,000-line file, fetching an entire web page). Truncate or summarize large outputs before adding them to the context.
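Two of these strategies, selective retention and tool output truncation, are simple enough to sketch directly. The helper names and default sizes below are assumptions; the summary string would normally come from an LLM call, which is omitted here.

```python
def truncate_output(text: str, max_chars: int = 2000) -> str:
    """Keep the head and tail of a large tool output and note what was dropped."""
    if len(text) <= max_chars:
        return text
    half = max_chars // 2
    head = text[:half]
    tail = text[len(text) - half:]
    omitted = len(text) - len(head) - len(tail)
    return f"{head}\n[... {omitted} characters omitted ...]\n{tail}"


def retain_recent(turns: list[str], keep: int, summary: str) -> list[str]:
    """Keep the last `keep` turns in full; replace older turns with one summary line."""
    if len(turns) <= keep:
        return list(turns)
    dropped = len(turns) - keep
    return [f"[summary of {dropped} earlier turns] {summary}"] + turns[dropped:]


print(truncate_output("a" * 10, max_chars=4))
print(retain_recent(["t1", "t2", "t3", "t4"], keep=2, summary="searched flights"))
```

Keeping both ends of a truncated output is a design choice: errors and final results often appear at the tail, while headers and schema appear at the head.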

Synchronous vs Asynchronous Loops

The basic OTA loop is synchronous: each step must complete before the next begins. This is simple but slow - if a tool call takes 5 seconds, the entire loop is blocked for 5 seconds.

Asynchronous loops allow multiple tool calls to execute in parallel when they are independent of each other. This is significantly faster for tasks that naturally decompose into parallel subtasks.

The Anthropic API supports parallel tool calls natively - the model can return multiple tool_use blocks in a single response, which you execute concurrently and return together.
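The latency difference is easy to see with simulated tools. The sketch below uses `asyncio.sleep` as a stand-in for I/O-bound tool calls; `fake_tool` and the 0.2-second delays are hypothetical, and real speedups depend on how many independent calls a step contains.

```python
import asyncio
import time


async def fake_tool(name: str, delay_s: float) -> str:
    """Stand-in for an I/O-bound tool call such as an HTTP request."""
    await asyncio.sleep(delay_s)
    return f"{name}: done"


async def run_sequential(calls: list[tuple[str, float]]) -> list[str]:
    # Each call waits for the previous one to finish.
    return [await fake_tool(name, delay) for name, delay in calls]


async def run_parallel(calls: list[tuple[str, float]]) -> list[str]:
    # gather runs the calls concurrently and preserves input order in its results.
    return list(await asyncio.gather(*(fake_tool(name, delay) for name, delay in calls)))


def timed(make_coro) -> tuple[list[str], float]:
    start = time.perf_counter()
    results = asyncio.run(make_coro())
    return results, time.perf_counter() - start


calls = [("search_flights", 0.2), ("check_calendar", 0.2), ("get_preferences", 0.2)]
seq_results, seq_time = timed(lambda: run_sequential(calls))
par_results, par_time = timed(lambda: run_parallel(calls))
print(f"sequential: {seq_time:.2f}s, parallel: {par_time:.2f}s")
```

Sequential time is roughly the sum of the delays, while parallel time is roughly the longest single delay. This only applies when the calls do not depend on each other's results.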

Complete Working Implementation

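Here is a production-oriented OTA loop with retries for API errors, context summarization, parallel tool execution, and trajectory logging. Backtracking is not implemented as separate code; it is left to the LLM, guided by the error-handling rules in the system prompt.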

"""
Production-quality Observe-Think-Act loop implementation.
Features:
- Async tool execution for parallel tool calls
- Exponential backoff retry for transient errors
- Context window management (summarization when nearing limit)
- Backtracking guidance through the system prompt (no dedicated detector)
- Structured trajectory logging

Install: pip install anthropic
"""
import asyncio
import anthropic
import json
import time
import subprocess
import os
from typing import Any
from dataclasses import dataclass, field


# ── Data types ────────────────────────────────────────────────────────────────

@dataclass
class TrajectoryStep:
    """A single step in the agent's trajectory."""
    iteration: int
    observations: list[str]
    thought: str
    actions: list[dict]
    action_results: list[str]
    timestamp: float = field(default_factory=time.time)


@dataclass
class AgentConfig:
    """Configuration for the agent loop."""
    model: str = "claude-opus-4-6"
    max_tokens: int = 4096
    max_iterations: int = 30
    max_retries: int = 3
    retry_base_delay: float = 1.0
    context_window_limit: int = 150_000  # tokens; summarize when approaching this
    verbose: bool = True


# ── Tool definitions ──────────────────────────────────────────────────────────

TOOLS = [
    {
        "name": "read_file",
        "description": "Read the full contents of a file. Returns the text content. Use for source code, configs, data files.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "File path to read."}
            },
            "required": ["path"]
        }
    },
    {
        "name": "write_file",
        "description": "Write content to a file. Creates the file if it does not exist. Overwrites existing content.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "File path to write."},
                "content": {"type": "string", "description": "Content to write to the file."}
            },
            "required": ["path", "content"]
        }
    },
    {
        "name": "run_python",
        "description": "Execute Python code and return stdout + stderr. Use for computation, testing, data processing.",
        "input_schema": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Python code to execute."}
            },
            "required": ["code"]
        }
    },
    {
        "name": "list_directory",
        "description": "List files and subdirectories at a path. Use to understand project structure.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Directory path to list.", "default": "."}
            },
            "required": []
        }
    },
    {
        "name": "search_in_file",
        "description": "Search for a pattern in a file and return matching lines with line numbers. Use before reading entire large files.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "File path to search."},
                "pattern": {"type": "string", "description": "Text pattern to search for (case-sensitive)."}
            },
            "required": ["path", "pattern"]
        }
    }
]


# ── Tool execution ────────────────────────────────────────────────────────────

async def execute_tool_async(tool_name: str, tool_input: dict[str, Any]) -> str:
    """Execute a tool without blocking the event loop.

    The tools are synchronous and mostly I/O-bound (file reads, subprocess
    calls), so each call runs in the default thread pool.
    """
    # get_running_loop() is the supported way to get the loop inside a coroutine
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, execute_tool_sync, tool_name, tool_input)


def execute_tool_sync(tool_name: str, tool_input: dict[str, Any]) -> str:
    """Synchronous tool execution with comprehensive error handling."""

    if tool_name == "read_file":
        path = tool_input.get("path", "")
        try:
            # Truncate large files to avoid context window explosion
            with open(path, "r", encoding="utf-8") as f:
                content = f.read()
            if len(content) > 50_000:
                # Truncate and note the truncation
                content = content[:50_000]
                return f"[File truncated to 50,000 chars]\n{content}"
            return f"Contents of {path}:\n\n{content}"
        except FileNotFoundError:
            return f"Error: File '{path}' not found. Check the path and try again."
        except UnicodeDecodeError:
            return f"Error: '{path}' is a binary file and cannot be read as text."
        except Exception as e:
            return f"Error reading '{path}': {type(e).__name__}: {e}"

    elif tool_name == "write_file":
        path = tool_input.get("path", "")
        content = tool_input.get("content", "")
        try:
            # Create parent directories if needed
            os.makedirs(os.path.dirname(path) if os.path.dirname(path) else ".", exist_ok=True)
            with open(path, "w", encoding="utf-8") as f:
                f.write(content)
            return f"Successfully wrote {len(content)} characters to {path}."
        except PermissionError:
            return f"Error: Permission denied writing to '{path}'."
        except Exception as e:
            return f"Error writing to '{path}': {type(e).__name__}: {e}"

    elif tool_name == "run_python":
        code = tool_input.get("code", "")
        try:
            result = subprocess.run(
                ["python3", "-c", code],
                capture_output=True,
                text=True,
                timeout=15
            )
            parts = []
            if result.stdout:
                parts.append(f"stdout:\n{result.stdout}")
            if result.stderr:
                parts.append(f"stderr:\n{result.stderr}")
            if result.returncode != 0:
                parts.append(f"Exit code: {result.returncode} (non-zero = error)")
            return "\n".join(parts) if parts else "(no output produced)"
        except subprocess.TimeoutExpired:
            return "Error: Code timed out after 15 seconds. Break the code into smaller pieces."
        except Exception as e:
            return f"Error executing code: {type(e).__name__}: {e}"

    elif tool_name == "list_directory":
        path = tool_input.get("path", ".")
        try:
            entries = sorted(os.listdir(path))
            annotated = []
            for entry in entries:
                full_path = os.path.join(path, entry)
                entry_type = "/" if os.path.isdir(full_path) else ""
                annotated.append(f"  {entry}{entry_type}")
            return f"{path}/\n" + "\n".join(annotated)
        except FileNotFoundError:
            return f"Error: Directory '{path}' not found."
        except Exception as e:
            return f"Error listing '{path}': {type(e).__name__}: {e}"

    elif tool_name == "search_in_file":
        path = tool_input.get("path", "")
        pattern = tool_input.get("pattern", "")
        try:
            matches = []
            with open(path, "r", encoding="utf-8") as f:
                for i, line in enumerate(f, 1):
                    if pattern in line:
                        matches.append(f"  Line {i}: {line.rstrip()}")
            if not matches:
                return f"Pattern '{pattern}' not found in {path}."
            return f"Found {len(matches)} matches for '{pattern}' in {path}:\n" + "\n".join(matches)
        except FileNotFoundError:
            return f"Error: File '{path}' not found."
        except Exception as e:
            return f"Error searching '{path}': {type(e).__name__}: {e}"

    else:
        available = [t["name"] for t in TOOLS]
        return f"Error: Unknown tool '{tool_name}'. Available tools: {available}"


# ── Context window management ─────────────────────────────────────────────────

def estimate_tokens(messages: list[dict]) -> int:
    """Rough token estimate: 1 token ~ 4 chars."""
    total_chars = sum(
        len(str(m.get("content", "")))
        for m in messages
    )
    return total_chars // 4


def should_summarize(messages: list[dict], limit: int) -> bool:
    """Check if we are approaching the context window limit."""
    return estimate_tokens(messages) > limit * 0.7


def summarize_early_messages(
    client: anthropic.Anthropic,
    messages: list[dict],
    keep_recent: int = 6
) -> list[dict]:
    """
    Summarize early messages to reduce context window usage.
    Keeps the most recent N messages in full detail.
    Compresses everything before that into a summary.
    """
    if len(messages) <= keep_recent + 1:  # +1 for initial user message
        return messages

    # Messages to summarize: everything except the first (initial task)
    # and the most recent `keep_recent` messages
    first_message = messages[0]
    messages_to_summarize = messages[1:-keep_recent]
    recent_messages = messages[-keep_recent:]

    if not messages_to_summarize:
        return messages

    # Ask Claude to summarize the history
    summary_prompt = (
        "Summarize the following agent trajectory steps concisely. "
        "Include: what was tried, what was observed, and what the current state is. "
        "Be specific about file names, values, and outcomes. "
        "Format as a numbered list of completed steps.\n\n"
        # default=str: assistant turns hold SDK content-block objects that
        # json.dumps cannot serialize on its own, so fall back to their string form.
        f"History to summarize:\n{json.dumps(messages_to_summarize, indent=2, default=str)}"
    )

    summary_response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": summary_prompt}]
    )

    summary_text = summary_response.content[0].text

    # Reconstruct messages: initial task + summary + recent
    summary_message = {
        "role": "user",
        "content": f"[Previous steps summary]:\n{summary_text}"
    }
    summary_ack = {
        "role": "assistant",
        "content": "Understood. I have the context of previous steps. Continuing from where we left off."
    }

    return [first_message, summary_message, summary_ack] + recent_messages


# ── The production OTA loop ───────────────────────────────────────────────────

async def run_agent_loop(
    task: str,
    config: AgentConfig | None = None,
    system_prompt: str | None = None
) -> tuple[str, list[TrajectoryStep]]:
    """
    Run the Observe-Think-Act loop with production-quality features.

    Returns:
        Tuple of (final_result, trajectory) where trajectory is the full
        list of steps for debugging and evaluation.
    """
    if config is None:
        config = AgentConfig()

    client = anthropic.Anthropic()

    if system_prompt is None:
        system_prompt = """You are an autonomous AI agent. You have tools to explore \
your environment and complete tasks.

Approach every task systematically:
1. OBSERVE: Read tool results carefully. Understand what you have.
2. THINK: Reason about what the observation means for your goal.
3. ACT: Choose the most useful next action. Prefer read operations before write operations.
4. VERIFY: After important actions, verify the result was correct.
5. COMPLETE: When the task is done, say so clearly and summarize what you did.

Error handling:
- If a tool returns an error, read the error message carefully and correct your approach.
- If you have tried the same approach 3 times without success, try something fundamentally different.
- If a task cannot be completed, explain clearly why and what you attempted."""

    messages = [{"role": "user", "content": task}]
    trajectory: list[TrajectoryStep] = []

    if config.verbose:
        print(f"\n{'='*70}")
        print(f"Agent starting | Task: {task[:100]}...")
        print(f"{'='*70}\n")

    for iteration in range(config.max_iterations):
        if config.verbose:
            token_est = estimate_tokens(messages)
            print(f"[Iteration {iteration + 1}/{config.max_iterations}] "
                  f"~{token_est:,} tokens in context")

        # Context window management: summarize if approaching limit
        if should_summarize(messages, config.context_window_limit):
            if config.verbose:
                print("  [Context management] Summarizing early history...")
            messages = summarize_early_messages(client, messages)

        # ── THINK: Call the LLM ───────────────────────────────────────────────
        retry_count = 0
        response = None

        while retry_count < config.max_retries:
            try:
                response = client.messages.create(
                    model=config.model,
                    max_tokens=config.max_tokens,
                    system=system_prompt,
                    tools=TOOLS,
                    messages=messages
                )
                break
            except anthropic.RateLimitError:
                retry_count += 1
                delay = config.retry_base_delay * (2 ** retry_count)
                if config.verbose:
                    print(f"  Rate limited. Retrying in {delay:.1f}s...")
                await asyncio.sleep(delay)
            except anthropic.APIError as e:
                retry_count += 1
                if retry_count >= config.max_retries:
                    return f"Fatal API error after {config.max_retries} retries: {e}", trajectory
                delay = config.retry_base_delay * (2 ** retry_count)
                await asyncio.sleep(delay)

        if response is None:
            return "Failed to get response from API.", trajectory

        if config.verbose:
            print(f"  Stop reason: {response.stop_reason} | "
                  f"Output tokens: {response.usage.output_tokens}")

        # Add assistant response to message history
        messages.append({"role": "assistant", "content": response.content})

        # Extract the thought (text content from the response)
        thought = next(
            (block.text for block in response.content if hasattr(block, "text")),
            ""
        )

        # ── TERMINATION: Check if done ────────────────────────────────────────
        if response.stop_reason == "end_turn":
            step = TrajectoryStep(
                iteration=iteration,
                observations=[],
                thought=thought,
                actions=[],
                action_results=[]
            )
            trajectory.append(step)

            if config.verbose:
                print(f"\n{'='*70}")
                print(f"Agent completed after {iteration + 1} iterations.")
                print(f"{'='*70}")

            return thought, trajectory

        # ── ACT: Execute tool calls in parallel ───────────────────────────────
        if response.stop_reason == "tool_use":
            tool_calls = [b for b in response.content if b.type == "tool_use"]
            actions = [{"name": b.name, "input": b.input} for b in tool_calls]

            if config.verbose:
                for tc in tool_calls:
                    args_str = json.dumps(tc.input)[:60]
                    print(f"  Tool call: {tc.name}({args_str})")

            # Execute all tool calls concurrently (async)
            tasks = [
                execute_tool_async(tc.name, tc.input)
                for tc in tool_calls
            ]
            results = await asyncio.gather(*tasks)

            if config.verbose:
                for tc, result in zip(tool_calls, results):
                    print(f"  Result [{tc.name}]: {result[:80]}...")

            # Build tool result messages
            tool_result_content = [
                {
                    "type": "tool_result",
                    "tool_use_id": tc.id,
                    "content": result
                }
                for tc, result in zip(tool_calls, results)
            ]

            # Add results to message history (the OBSERVE phase for next iteration)
            messages.append({"role": "user", "content": tool_result_content})

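            # Record in trajectory
            step = TrajectoryStep(
                iteration=iteration,
                observations=list(results),
                thought=thought,
                actions=actions,
                action_results=list(results)
            )
            trajectory.append(step)

        else:
            # Any other stop reason (for example "max_tokens") leaves the history
            # ending on an assistant turn with no tool results to observe, so stop
            # here instead of looping on an unanswered message.
            return (
                f"Agent stopped early (stop_reason={response.stop_reason}). "
                f"Partial output: {thought}",
                trajectory
            )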

    # Exceeded max iterations
    return (
        f"Agent stopped after {config.max_iterations} iterations. "
        "Task may require manual completion or more iterations.",
        trajectory
    )


# ── Entry point ───────────────────────────────────────────────────────────────

async def main():
    config = AgentConfig(
        model="claude-opus-4-6",
        max_iterations=15,
        verbose=True
    )

    result, trajectory = await run_agent_loop(
        task=(
            "Write a Python function that implements binary search. "
            "Save it to binary_search.py, then write a test that verifies "
            "it finds elements correctly and returns -1 for missing elements. "
            "Run the test and confirm all tests pass."
        ),
        config=config
    )

    print(f"\n{'─'*70}")
    print("RESULT:")
    print(result)
    print(f"\n{'─'*70}")
    print(f"Trajectory: {len(trajectory)} steps")
    for step in trajectory:
        print(f"  Step {step.iteration + 1}: {len(step.actions)} action(s)")


if __name__ == "__main__":
    asyncio.run(main())

Backtracking

Backtracking is the ability to recognize that a chosen path is not working and try something different. It is one of the key properties that separates capable agents from simple pipelines.

Backtracking can happen explicitly (the LLM reasons "I tried approach A three times and it failed, let me try approach B") or implicitly (the LLM simply does not continue down the failing path in the next iteration).

Signs that an agent needs backtracking:

The same error appears three times in a row
Tool call results are consistently empty or null
The LLM is generating the same tool calls repeatedly
The agent's context shows no progress toward the goal
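
The first three signs above can be checked mechanically in the scaffolding. The following is a minimal sketch, assuming each tool call is recorded with its name, arguments, result, and an error flag. `StepRecord`, `needs_backtracking`, and the window of three are hypothetical names and values chosen for illustration; they are not part of the implementation above.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StepRecord:
    """Hypothetical record of one tool call (not the lesson's TrajectoryStep)."""
    tool_name: str
    arguments: dict[str, Any]
    result: Any
    is_error: bool = False


def _signature(step: StepRecord) -> str:
    # Canonical form of a call so that identical calls compare equal.
    return step.tool_name + ":" + json.dumps(step.arguments, sort_keys=True)


def needs_backtracking(history: list[StepRecord], window: int = 3) -> Optional[str]:
    """Return a reason string if the last `window` steps look stuck, else None."""
    if len(history) < window:
        return None
    recent = history[-window:]

    # Sign 1: the same error several times in a row.
    if all(s.is_error for s in recent) and len({str(s.result) for s in recent}) == 1:
        return "same error repeated"

    # Sign 2: results are consistently empty or null.
    if all(s.result in (None, "", [], {}) for s in recent):
        return "empty results"

    # Sign 3: the same tool call issued repeatedly.
    if len({_signature(s) for s in recent}) == 1:
        return "identical tool calls"

    return None
```

The fourth sign, lack of progress toward the goal, is task-specific and usually needs a programmatic success check or a judgment from the LLM itself rather than a generic rule like these.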

Production Engineering Notes

:::tip Parallel tool execution is a significant performance win
When the LLM returns multiple tool_use blocks in a single response, execute them concurrently. For I/O-bound tools (API calls, file reads), this can reduce latency by 3-5x on steps with multiple tool calls. The Anthropic API can return several tool_use blocks in one response; running them concurrently is the job of your scaffolding.
:::
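
As a rough illustration of concurrent execution, the sketch below runs every tool call from one response with `asyncio.gather`. The `TOOLS` registry, the two stand-in tools, and `execute_tool_calls` are assumptions made for this example; the call dictionaries mirror the `id`, `name`, and `input` fields of a tool_use block in simplified form.

```python
import asyncio
from typing import Any, Awaitable, Callable

ToolFn = Callable[..., Awaitable[Any]]


async def fetch_weather(city: str) -> str:
    # Stand-in for an I/O-bound tool such as an HTTP call.
    await asyncio.sleep(0.05)
    return f"Sunny in {city}"


async def read_file(path: str) -> str:
    # Stand-in for a tool that fails.
    await asyncio.sleep(0.05)
    raise FileNotFoundError(path)


# Hypothetical tool registry: tool name -> async callable.
TOOLS: dict[str, ToolFn] = {"fetch_weather": fetch_weather, "read_file": read_file}


async def execute_tool_calls(calls: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Run all tool calls from one model response concurrently."""

    async def run_one(call: dict[str, Any]) -> Any:
        return await TOOLS[call["name"]](**call["input"])

    # return_exceptions=True returns failures as values instead of raising,
    # so every call gets a result and results stay in the order of `calls`.
    outcomes = await asyncio.gather(
        *(run_one(c) for c in calls), return_exceptions=True
    )

    results = []
    for call, outcome in zip(calls, outcomes):
        failed = isinstance(outcome, Exception)
        results.append({
            "type": "tool_result",
            "tool_use_id": call["id"],
            "content": f"{type(outcome).__name__}: {outcome}" if failed else str(outcome),
            "is_error": failed,
        })
    return results
```

Reporting a failure as an error result, rather than letting the exception escape, keeps one broken tool from discarding the output of the others and gives the model something to react to in its next Think phase.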

:::warning Tool result size management
A single tool result can easily be 50,000+ tokens if it returns a large file or API response. Truncate large results before adding them to the message history. Always give the model a summary or first N lines of very large outputs, and offer a way to request more if needed.
:::
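
One possible shape for such truncation is sketched below. It uses character count as a stand-in for tokens, which is only an approximation, and the function name, the default limits, and the wording of the notice are all assumptions for illustration. The notice presumes the agent has some way to ask for a specific line range.

```python
def truncate_tool_result(text: str, max_chars: int = 8000, head_lines: int = 50) -> str:
    """Return text unchanged if small; otherwise keep leading lines plus a notice."""
    if len(text) <= max_chars:
        return text

    lines = text.splitlines()
    # Keep the first head_lines lines, then cap by characters in case lines are long.
    kept = "\n".join(lines[:head_lines])[:max_chars]
    notice = (
        f"\n[truncated: showing {len(kept)} of {len(text)} characters, "
        f"{len(lines)} lines total. Request a specific line range to see more.]"
    )
    return kept + notice
```

Telling the model how much was cut matters as much as the cut itself: without the notice, the model may treat the visible prefix as the complete output.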

:::danger Never rely solely on the LLM for termination detection
The LLM may fail to recognize task completion (continuing unnecessarily) or may prematurely declare completion (stopping too early). Always pair LLM-based completion detection with explicit max_iterations and timeout guards. For tasks with clear success criteria (tests pass, file exists, API returns 200), verify programmatically rather than trusting the LLM's self-assessment.
:::
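
The combination of guards described above might look like the following sketch. `run_with_guards`, `step`, and `verify` are hypothetical names: `step` stands for one full loop iteration that reports whether the LLM claims to be done, and `verify` stands for whatever programmatic success check the task allows.

```python
import time
from typing import Callable


def run_with_guards(
    step: Callable[[], bool],
    verify: Callable[[], bool],
    max_iterations: int = 15,
    timeout_s: float = 300.0,
) -> str:
    """Drive a loop with three independent stop conditions.

    step()   runs one iteration and returns True when the LLM claims it is done.
    verify() checks the success criterion programmatically.
    """
    deadline = time.monotonic() + timeout_s
    for _ in range(max_iterations):
        if time.monotonic() > deadline:
            return "timeout"
        claims_done = step()
        if claims_done and verify():
            return "verified"
        # A completion claim that fails verification is ignored; the loop continues.
    return "max_iterations"
```

In a real agent, a failed verification would normally be fed back to the model as an observation (for example, the failing test output) so that the next iteration can act on it.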

Interview Questions

Q: Explain the Observe-Think-Act loop in depth. What happens in each phase?

Observe: the agent collects all relevant information about the current state of the world. This includes tool outputs from the previous iteration, the conversation history, any memory retrieved from long-term storage, and inferences drawn from comparing current state to expected state. The quality of observation determines everything that follows - garbage in, garbage out.

Think: the LLM processes the observations and reasons about what to do next. It parses the observations (what do these tool results actually mean?), plans the next action (which tool should I call, with what parameters?), reasons step-by-step about implications, and self-assesses whether the task is complete.

Act: the agent executes the planned action by calling one or more tools. The tool calls change the external environment or gather more information. These results become the next iteration's observations.

Q: How do you handle the context window growing over many iterations?

There are three main strategies.

Summarization: periodically compress old iterations into a compact summary that preserves the key facts and current state without full detail. This is typically triggered when the context reaches 60-70% of the model's limit.

Selective retention: keep only the N most recent turns in full detail, discarding or summarizing older ones. This works well when recent context is more relevant than old context.

Structured scratchpad: instead of relying entirely on conversation history, maintain a structured scratchpad (a dictionary or document) that the agent updates at each step. The scratchpad contains only the current state, not the history. This is the most token-efficient approach but requires the agent to be disciplined about scratchpad maintenance.
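
The first two strategies combine naturally: once usage crosses a threshold, summarize everything except the original task and the most recent turns. The sketch below assumes messages are plain dictionaries with string content and approximates token count as characters divided by four; `compact_history`, `estimate_tokens`, and the `summarize` callback (which would typically be another LLM call) are hypothetical.

```python
from typing import Callable

Message = dict[str, str]


def estimate_tokens(messages: list[Message]) -> int:
    # Crude assumption: about 4 characters per token. Use a real tokenizer in practice.
    return sum(len(m["content"]) for m in messages) // 4


def compact_history(
    messages: list[Message],
    summarize: Callable[[list[Message]], str],
    context_limit: int,
    threshold: float = 0.65,
    keep_recent: int = 6,
) -> list[Message]:
    """Fold older turns into a summary once usage crosses threshold * context_limit."""
    if estimate_tokens(messages) < threshold * context_limit:
        return messages

    cut = len(messages) - keep_recent
    if cut <= 1:
        return messages  # nothing old enough to compress

    old, recent = messages[1:cut], messages[cut:]
    # Attach the summary to the first (task) message so the task itself is never lost.
    first = {
        "role": messages[0]["role"],
        "content": messages[0]["content"]
        + "\n\n[Summary of earlier steps] "
        + summarize(old),
    }
    return [first, *recent]
```

A production version would also need to pick the cut point so that each tool call stays paired with its result; this sketch ignores that constraint.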

Q: What is a trajectory and why does it matter for production agents?

A trajectory is the full ordered sequence of (observation, thought, action) triples across an agent's run. It is the complete audit log of everything the agent perceived, reasoned about, and did. In production, trajectories matter for three reasons: debugging (when an agent makes a bad decision, the trajectory shows exactly what it observed and how it reasoned, making root cause analysis tractable), evaluation (you can assess agent quality by examining whether decisions at each step were sensible given the observations), and learning (successful trajectories can be used as training data to improve future agents, and failed trajectories reveal where agents consistently make mistakes).

Q: How do you implement backtracking in an agent loop?

Backtracking is mostly implicit - you rely on the LLM to recognize that a path is failing and choose a different approach in the next Think phase. Three things make this reliable. Write tool error messages that state clearly what went wrong and hint at how to correct the approach. Include the last N tool calls and their results in the context so the LLM can see the pattern of failures. In the system prompt, add an explicit instruction such as "If you encounter the same error three times, stop and try a fundamentally different approach." For explicit backtracking, track consecutive failures in the scaffolding and inject a message like "You have tried approach X three times without success. Please try a different strategy."
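
The explicit variant can be as small as a counter. The sketch below is one way to do it under stated assumptions: `FailureTracker` is a hypothetical helper, and the threshold of three simply follows the wording of the answer above.

```python
from typing import Optional


class FailureTracker:
    """Counts consecutive tool failures and produces a nudge message at a threshold."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, is_error: bool) -> Optional[str]:
        """Call once per tool result. Returns text to inject, or None."""
        if not is_error:
            self.consecutive_failures = 0
            return None

        self.consecutive_failures += 1
        if self.consecutive_failures < self.threshold:
            return None

        # Reset so the nudge is not repeated on every following step.
        self.consecutive_failures = 0
        return (
            f"You have tried this approach {self.threshold} times without success. "
            "Please try a different strategy."
        )
```

When `record` returns text, the scaffolding would add it to the next user turn alongside the tool results, so the model sees the nudge at the start of its next Think phase.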

Q: What is the difference between synchronous and asynchronous OTA loops, and when should you use each?

A synchronous loop executes one tool call at a time, waiting for each to complete before proceeding. It is simple to implement and debug, and it is appropriate when tool calls are interdependent (the result of call A is input to call B) or when tools have side effects that must be sequenced. An asynchronous loop executes multiple tool calls concurrently when they are independent, so the time for a step is roughly that of the slowest call rather than the sum of all calls. Use this when the LLM returns multiple independent tool calls in a single response - executing them in parallel can reduce latency by 3-5x. The Anthropic API often returns multiple tool_use blocks when the agent's plan involves gathering several pieces of information simultaneously. Implement with asyncio.gather in Python, or Promise.all in Node.js.
