When to Use Agents

The ticket came in on a Tuesday afternoon. A senior ML engineer at a mid-size fintech had spent six weeks building an agent to automate their monthly compliance report generation. The system would read from the data warehouse, pull regulatory filings, cross-reference transaction logs against customer records, and produce a formatted PDF. The project consumed two engineers, a production Kubernetes deployment, and a five-figure compute budget over those six weeks. When it went live, the team celebrated. The compliance officer opened the first report, glanced at the numbers, and flagged it immediately: the total transaction volume was off by an order of magnitude.

The agent had run the wrong SQL query at step 3. It had joined on account ID instead of customer ID. The resulting dataset looked plausible - same column names, row counts in the right ballpark - and the agent had no mechanism to detect that its premise was wrong. It continued for 12 more tool calls, each one building on the flawed foundation, and produced a beautifully formatted document containing completely incorrect numbers with confident decimal precision.

The compliance team caught it during manual review - the same manual review the agent was supposed to eliminate. The engineers rolled back to the Python script that had taken 45 minutes to write three years earlier and had been running without incident every month since. The script was deterministic. It pulled exactly the columns you told it to pull, joined exactly the tables you specified, and formatted exactly the output you defined. It never reasoned its way into a wrong answer because it did not reason at all.

This story repeats constantly across organizations that have caught agent fever. The technology is genuinely powerful. The failure mode is not technical - it is architectural. Engineers reach for agents when they should reach for a script. They build orchestration layers where a workflow would suffice. They introduce autonomy into situations that require determinism. And they spend months debugging systems whose complexity they cannot fully observe, whose errors do not announce themselves, and whose failures compound silently across long chains of tool calls.

The question "should I build an agent?" is one of the most consequential questions in applied AI engineering. It does not have a simple yes/no answer. It requires a structured framework, an honest assessment of the task characteristics, and a clear-eyed view of what agents are actually good at versus where they introduce more problems than they solve. This lesson gives you that framework. By the end you will be able to classify any automation task quickly, identify the right architectural tier, spot agent anti-patterns before they waste months of engineering time, and make a defensible business case for or against autonomous agents in a production environment.

Why This Exists

Before agents existed as a recognized architectural pattern, software engineers faced a binary choice: write deterministic code, or accept that some tasks could not be automated at all. The code either worked exactly as specified or it did not. This was genuinely limiting - vast categories of real-world tasks involve ambiguity, variation, and judgment that resists reduction to explicit rules.

The first wave of solutions was rule-based automation: decision trees, business process management systems, robotic process automation (RPA). Vendors like UiPath, Automation Anywhere, and Blue Prism reached multi-billion-dollar valuations automating tasks by recording and replaying human actions. These tools worked well for highly stable, highly repetitive processes - data entry, form submission, report extraction from fixed UI layouts. They failed catastrophically at variation. When the UI changed, when data arrived in an unexpected format, when an exception occurred that the rule-writer had not anticipated, the automation broke and required human intervention.

Large language models changed the fundamental execution model. Instead of following explicit rules, you provide a goal and the model figures out the steps. Instead of breaking on unexpected input, it interprets and adapts. This was transformative - but it introduced a new failure mode: agents are probabilistic, not deterministic. They can reason their way to wrong conclusions. They can take actions based on misinterpretations. And crucially, their errors do not announce themselves.

The framework in this lesson exists because the engineering community needed a way to distinguish tasks where that probabilistic reasoning capability is a feature from tasks where it is a liability. The majority of production tasks fall into the latter category. Getting this distinction right is the difference between shipping something genuinely valuable and burning six weeks on a compliance report generator that produces beautiful, precisely formatted nonsense.

Historical Context

The idea of autonomous software agents predates large language models by decades. MIT's Project Oxygen (2000) envisioned pervasive computing with intelligent assistants. DARPA's PAL program (2003–2008) funded CALO, an SRI International-led effort to build personal assistants that could reason over tasks and calendars, and the ancestor of Siri. Members of the CALO team later founded Siri Inc., the startup acquired by Apple in 2010.

The academic foundations come from intelligent agent research: Russell and Norvig's Artificial Intelligence: A Modern Approach (1995, 1st edition) defined agents as systems that perceive their environment and take actions to achieve goals. BDI (Belief-Desire-Intention) agents were a formal model from Bratman, Israel, and Pollack (1988). These systems were largely brittle because they relied on hand-crafted knowledge bases that could not adapt to real-world variation.

The modern LLM agent era began with the ReAct paper (Yao et al., 2022), which demonstrated that interleaving reasoning traces with action calls in language models dramatically improved task completion on benchmarks. Then came AutoGPT in March 2023 - an open-source project that went viral on GitHub by wrapping GPT-4 in an autonomous loop with filesystem, web browsing, and code execution capabilities. AutoGPT demonstrated both the excitement and the dysfunction of unconstrained autonomy: it could browse the web and write code, but it would frequently spiral into loops, contradict itself, and consume large amounts of compute without producing results.

The engineering community's response to AutoGPT's limitations was to discipline agents rather than abandon them. Harrison Chase's LangChain, first released in October 2022 (before AutoGPT), offered structured tool-use frameworks. Andrew Ng's essay series "Agentic AI" (2024) identified patterns that reliably work in production. Anthropic's research on tool use and the Haiku/Sonnet/Opus model family brought reliability to agent behavior. The field learned that agents needed constraints - bounded action spaces, human checkpoints, explicit success criteria - to be production-worthy.

The question "when to use agents" became pressing around 2023–2024 as teams started incurring real costs on deployments that turned out to be over-engineered solutions to simple problems.

The Three Questions to Ask Before Building

Before any architectural discussion, before any code is written, three questions determine whether an agent is the right approach. Answer them honestly. If you find yourself rationalizing answers to get to the conclusion you already want, that is a sign you are building an agent for the wrong reasons.

Question 1: Can you write a deterministic program to solve this?

If the answer is yes - write the program. Agents are not substitutes for code. They are tools for handling problems that genuinely resist reduction to explicit rules.

A deterministic program is always preferable when:

Input formats are known and stable
The transformation from input to output can be specified in advance
The set of possible states is enumerable
Correctness can be verified programmatically

If you can write a Python function, a SQL query, or a shell script that solves the problem reliably, do that. It will be faster, cheaper, more reliable, and dramatically easier to debug than any agent you build.

Question 2: Is there meaningful variation that a workflow cannot handle?

Agents earn their complexity cost when tasks involve variation that a deterministic program cannot handle without becoming an enormous, unmaintainable decision tree. "Meaningful variation" means that different inputs genuinely require different reasoning paths - not just different parameter values plugged into the same pipeline.

Extracting data from a fixed PDF template: deterministic. Extracting data from PDFs submitted by 200 different vendors, each with their own layout and conventions: agent territory.

Answering a question using a lookup table: deterministic. Answering questions from customers who each describe their problem in different words, from different technical backgrounds, expecting different levels of detail: agent territory.

The crucial distinction: if the variation is in the data but not in the structure of the task, a workflow handles it. If the variation is in which steps to take and in what order, an agent is warranted.

Question 3: Can you afford to be wrong?

Agents make mistakes. This is not a flaw to be engineered away - it is a fundamental consequence of probabilistic reasoning. The question is not whether your agent will be wrong (it will) but whether your system can tolerate that.

High-stakes irreversible operations - financial transactions, production database writes, sending emails to customers, submitting legal filings - require either near-perfect reliability or a human-in-the-loop checkpoint before execution. If your agent operates in a domain where a single mistake has significant consequences, you need either increased oversight or a more deterministic architecture.

:::danger The Autonomy Trap
The most common mistake in agent design is giving an agent more autonomy than the task actually requires. Every degree of autonomy you add increases the surface area for unexpected behavior. Start with the minimum autonomy needed and add more only when you have evidence that it is required - not because it seems more impressive.
:::

The Architectural Tiers

Think of automation approaches as a ladder with four rungs. Each rung is appropriate for a specific class of problems. Climbing too high for a simple problem is over-engineering. Not climbing high enough for a complex problem is under-building. The goal is to find the lowest rung that adequately solves the problem.

Tier 1: Direct Code

Use deterministic code when the task is fully specifiable. This is almost always the right starting point. Do not reach past this tier until you have confirmed through analysis - not intuition - that code alone is insufficient.

Signals that Tier 1 is sufficient:

Fixed input schemas that you control or can normalize
Enumerable output states
Can write a comprehensive test suite
No natural language understanding required

Examples: ETL pipelines, report generation from structured data, rule-based routing, API integrations with known schemas, scheduled jobs, data validation.
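To show how little machinery a Tier 1 task needs, here is a minimal sketch of rule-based routing in plain Python. The `Invoice` fields, the thresholds, and the queue names are hypothetical and chosen only for illustration. What matters is that every rule is explicit and every output state is enumerable.

```python
from dataclasses import dataclass

# Hypothetical set of countries handled by the domestic desk.
DOMESTIC_COUNTRIES = {"US"}


@dataclass(frozen=True)
class Invoice:
    amount_usd: float
    country: str
    has_po_number: bool


def route_invoice(invoice: Invoice) -> str:
    """Return a queue name using fixed rules; the first matching rule wins."""
    if not invoice.has_po_number:
        return "reject_missing_po"
    if invoice.amount_usd >= 10_000:
        return "manager_approval"
    if invoice.country not in DOMESTIC_COUNTRIES:
        return "international_desk"
    return "auto_approve"


print(route_invoice(Invoice(amount_usd=500.0, country="US", has_po_number=True)))
```

Because the function has exactly four possible outputs, a test suite can cover every branch, which is the property the Tier 1 signals describe.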

Tier 2: Single LLM Call

A single call to a language model - no tool use, no loops, no memory - is the right solution for tasks requiring language understanding or generation but not multiple steps. This tier is dramatically underused. Engineers who have learned to build agents sometimes skip it entirely in favor of full ReAct loops for tasks a single well-crafted prompt handles perfectly.

Signals that Tier 2 is sufficient:

Task is a transformation: text in, structured or text out
No external data lookup needed at inference time
No action with side effects required
Output quality can be validated by the caller

Examples: Classifying support tickets, extracting structured data from free text, generating descriptions from attributes, summarizing documents, sentiment analysis.

import anthropic
import json

client = anthropic.Anthropic()

def classify_support_ticket(ticket_text: str) -> dict:
    """
    Tier 2: single LLM call - no agent needed.
    Classification is a pure transformation task with no external dependencies.
    A ReAct loop for this would be textbook over-engineering.
    """
    response = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": f"""Classify this support ticket.

Categories: billing, technical, account, feature_request, other
Priority: critical, high, medium, low

Respond with JSON only:
{{"category": "...", "priority": "...", "confidence": 0.0-1.0, "summary": "one sentence"}}

Ticket: {ticket_text}"""
        }]
    )
    return json.loads(response.content[0].text)

# This is NOT a use case for an agent.
# It is a transformation. One call suffices.
result = classify_support_ticket(
    "I was charged twice for my subscription this month and need an immediate refund."
)
print(result)
# {"category": "billing", "priority": "high", "confidence": 0.97,
#  "summary": "Duplicate charge requiring refund"}

Tier 3: Workflow

When a task requires multiple LLM calls whose sequence is known in advance, use a workflow - a directed graph of prompts where each step's output feeds the next. This is not an agent. The control flow is determined by the programmer, not the model.

Workflows provide predictability that agents cannot: you know exactly which steps will execute, in what order, with what inputs. They are debuggable because you can inspect each step's output independently. They are reliable because the model is not making architectural decisions - only executing within its assigned step.

Signals that Tier 3 is sufficient:

You can draw the complete flowchart before writing any code
Every path through the flowchart terminates in a bounded number of steps
The model's role is execution within each step, not navigation between steps

import anthropic

client = anthropic.Anthropic()

def content_pipeline(raw_input: str, target_audience: str) -> dict:
    """
    Tier 3 workflow: 4-step pipeline with known structure.
    The programmer owns the control flow. The model executes within each step.
    """

    # Step 1: Extract key points
    extraction = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Extract the 5 most important points from this text as a JSON array of strings:\n\n{raw_input}"
        }]
    )
    key_points = extraction.content[0].text

    # Step 2: Tailor for audience
    tailored = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Rewrite these points for a {target_audience} audience. Keep the facts but adjust vocabulary and context:\n\n{key_points}"
        }]
    )
    audience_version = tailored.content[0].text

    # Step 3: Generate headline and summary
    summary = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": f"Write a compelling headline and 2-sentence summary for this content:\n\n{audience_version}"
        }]
    )

    # Step 4: Quality check
    quality = client.messages.create(
        model="claude-opus-4-6",
        max_tokens=128,
        messages=[{
            "role": "user",
            "content": f"Rate this content 1-10 for clarity and appropriateness for {target_audience}. Return JSON: {{\"score\": N, \"issues\": []}}\n\n{audience_version}"
        }]
    )

    return {
        "key_points": key_points,
        "content": audience_version,
        "summary": summary.content[0].text,
        "quality": quality.content[0].text,
    }

# Four LLM calls, fixed sequence, deterministic control flow.
# This is a workflow. Predictable, debuggable, reliable.
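The debuggability claim is easier to see when the control flow is written as data. The sketch below is a hypothetical `run_workflow` helper that executes a fixed list of steps and records each step's input and output. The steps here are plain string functions standing in for LLM calls, so the example runs without an API key.

```python
from typing import Callable

# A step is a (name, function) pair; in a real pipeline the function would wrap an LLM call.
Step = tuple[str, Callable[[str], str]]


def run_workflow(steps: list[Step], initial_input: str) -> tuple[str, list[dict]]:
    """Run steps in a fixed order and return the final output plus a per-step trace."""
    trace: list[dict] = []
    current = initial_input
    for name, step_fn in steps:
        output = step_fn(current)
        trace.append({"step": name, "input": current, "output": output})
        current = output
    return current, trace


# Stand-in steps: the programmer, not the model, decides their order.
steps: list[Step] = [
    ("strip_whitespace", str.strip),
    ("uppercase", str.upper),
]
final, trace = run_workflow(steps, "  quarterly results  ")
for record in trace:
    print(record["step"], "->", repr(record["output"]))
```

When a run goes wrong, the trace shows which step produced the first bad output, and that step can be re-run in isolation with the recorded input.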

Tier 4: Autonomous Agent

A true autonomous agent is appropriate when the control flow itself cannot be predetermined - when the number of steps, the choice of tools, and navigation between states all depend on what the agent discovers at runtime. This is a genuinely narrow set of use cases. It requires confirming that Tiers 1–3 are insufficient, not just less elegant.

Signals that Tier 4 is genuinely required:

Tool selection depends on intermediate results, not just input characteristics
Step count varies substantially across inputs (2 steps for simple cases, 20 for complex)
The agent must handle unexpected situations by adapting strategy mid-execution
The task involves open-ended exploration where the direction is unclear at the start
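The ladder can be summarized as a small decision function. This is a sketch of the reasoning in this section, not a substitute for the analysis. The `TaskProfile` fields are hypothetical names for the signals listed under each tier.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    DIRECT_CODE = 1
    SINGLE_LLM_CALL = 2
    WORKFLOW = 3
    AGENT = 4


@dataclass(frozen=True)
class TaskProfile:
    fully_specifiable: bool              # input-to-output mapping can be written as rules
    needs_multiple_llm_steps: bool       # one prompt is not enough
    control_flow_known_in_advance: bool  # the full flowchart can be drawn up front


def lowest_sufficient_tier(task: TaskProfile) -> Tier:
    """Return the lowest rung of the ladder that fits the task profile."""
    if task.fully_specifiable:
        return Tier.DIRECT_CODE
    if not task.needs_multiple_llm_steps:
        return Tier.SINGLE_LLM_CALL
    if task.control_flow_known_in_advance:
        return Tier.WORKFLOW
    return Tier.AGENT


debugging_task = TaskProfile(
    fully_specifiable=False,
    needs_multiple_llm_steps=True,
    control_flow_known_in_advance=False,
)
print(lowest_sufficient_tier(debugging_task).name)
```

The checks run from the bottom rung upward, so Tier 4 is reached only when every cheaper option has been ruled out.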

The Cost of Agency

Every degree of autonomy comes with real, quantifiable costs. Understanding these concretely is essential for making sound architectural decisions. Never evaluate agent adoption without working through these numbers.

Latency Cost

Agents take longer than direct API calls. Each step in a ReAct loop requires an LLM inference (1–15 seconds depending on model and output length) plus tool execution time. A task requiring 5 reasoning steps typically takes 15–60 seconds, and longer when individual inferences or tool calls are slow. Sub-second response time requirements categorically rule out multi-step agents.

Direct API call:             ~500ms – 3s
Workflow (3 steps):          ~3s – 15s
Simple agent (5 steps):      ~15s – 60s
Complex agent (20+ steps):   ~2min – 10min+
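A rough way to sanity-check a latency budget is to multiply the step count by per-step bounds. The sketch below uses the 1–15 second inference range quoted above as its default. The function names and the tool-time parameter are assumptions for illustration, and the bounds are deliberately wider than the typical figures in the list above.

```python
def estimate_latency_seconds(
    steps: int,
    llm_seconds: tuple[float, float] = (1.0, 15.0),
    tool_seconds: tuple[float, float] = (0.0, 0.0),
) -> tuple[float, float]:
    """Return (best_case, worst_case) wall-clock seconds for a sequential loop."""
    best = steps * (llm_seconds[0] + tool_seconds[0])
    worst = steps * (llm_seconds[1] + tool_seconds[1])
    return best, worst


def fits_latency_budget(steps: int, budget_seconds: float) -> bool:
    """True only if even the worst case stays within the budget."""
    _, worst = estimate_latency_seconds(steps)
    return worst <= budget_seconds


best, worst = estimate_latency_seconds(steps=5)
print(f"5-step agent: {best:.0f}s to {worst:.0f}s")
print("Fits a 1-second budget:", fits_latency_budget(steps=5, budget_seconds=1.0))
```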

Error Propagation Cost

In deterministic code, an error at step N throws an exception that immediately surfaces the problem. In an agent, an error at step N produces plausible-looking but incorrect output that becomes the input for step N+1. The agent has no mechanism to detect that its reasoning was subtly wrong - it optimizes for internal consistency, not external correctness.

This is the most dangerous characteristic of autonomous agents. The longer the chain, the further any error propagates before manifesting as an observable failure. The compliance report from the opening scenario is the canonical example: step 3 was wrong, and 12 tool calls later, step 15 delivered the report.
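The wrong-join failure is easy to reproduce at small scale. The sketch below uses an in-memory SQLite database with made-up tables and rows (none of this comes from the fintech system in the story). Joining on the wrong key raises no error and returns a result with the same columns and row count, only with the numbers attached to the wrong customers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE transactions (
        txn_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        account_id INTEGER,
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bo');
    INSERT INTO transactions VALUES
        (1, 1, 2, 100.0),
        (2, 1, 2, 50.0),
        (3, 2, 1, 10.0);
""")

QUERY = """
    SELECT c.name, SUM(t.amount)
    FROM transactions t
    JOIN customers c ON t.{key} = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
"""


def totals_by_customer(join_key: str) -> list[tuple[str, float]]:
    """Total transaction amount per customer, joining on the given column."""
    if join_key not in {"customer_id", "account_id"}:
        raise ValueError(f"Unsupported join key: {join_key}")
    return conn.execute(QUERY.format(key=join_key)).fetchall()


correct = totals_by_customer("customer_id")
wrong = totals_by_customer("account_id")  # plausible shape, wrong numbers, no exception
print("correct:", correct)
print("wrong:  ", wrong)
```

Both queries succeed. Nothing in the second result signals that its premise was wrong, which is why a later step in a chain will happily build on it.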

Compound Reliability

If each step in an agent succeeds independently with probability $p$, and the agent has $n$ steps, the overall probability of success is:

$P(\text{success}) = p^n$

| Steps | 99% per step | 95% per step | 90% per step |
| --- | --- | --- | --- |
| 2 | 98.0% | 90.3% | 81.0% |
| 5 | 95.1% | 77.4% | 59.0% |
| 10 | 90.4% | 59.9% | 34.9% |
| 20 | 81.8% | 35.8% | 12.2% |

An agent with 10 steps where each step has 95% reliability produces correct output only 60% of the time. In production, a 40% failure rate is catastrophic. You cannot engineer your way out of this with "better" steps - to achieve 90% overall success over 10 steps, each step must succeed at 99%, an unrealistic bar for most open-ended tasks.
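The figures above follow directly from $p^n$ and can be reproduced in a few lines. The sketch also inverts the formula to give the per-step reliability needed for a target overall success rate, $p = P^{1/n}$. Both functions assume steps fail independently, and their names are illustrative.

```python
def overall_success(per_step: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps."""
    return per_step ** steps


def required_per_step(target_overall: float, steps: int) -> float:
    """Per-step reliability needed to reach a target overall success rate."""
    return target_overall ** (1.0 / steps)


for steps in (2, 5, 10, 20):
    row = ", ".join(f"{overall_success(p, steps):.1%}" for p in (0.99, 0.95, 0.90))
    print(f"{steps:>2} steps: {row}")

# Per-step reliability needed for 90% overall success over 10 steps (about 0.9895)
print(f"{required_per_step(0.90, 10):.4f}")
```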

Compute Cost

Agents consume significantly more tokens than direct API calls. A 20-step agent loop with tool use might consume 50,000–200,000 tokens per task execution. At production scale with thousands of daily executions, this becomes a primary budget line.

# Cost comparison: agent vs. single call at scale

SONNET_INPUT_COST_PER_1M = 3.00   # USD
SONNET_OUTPUT_COST_PER_1M = 15.00  # USD

def estimate_cost(input_tokens: int, output_tokens: int, runs_per_day: int) -> dict:
    daily_input_cost = (input_tokens * runs_per_day / 1_000_000) * SONNET_INPUT_COST_PER_1M
    daily_output_cost = (output_tokens * runs_per_day / 1_000_000) * SONNET_OUTPUT_COST_PER_1M
    daily_total = daily_input_cost + daily_output_cost
    return {
        "daily": daily_total,
        "monthly": daily_total * 30,
        "annual": daily_total * 365,
    }

# Single classification call: ~500 tokens total
single_call = estimate_cost(400, 100, runs_per_day=10_000)
print(f"Single call at 10K/day: ${single_call['monthly']:.0f}/month")
# → ~$810/month ($12 input + $15 output per day)

# 20-step agent: ~100K tokens total
agent = estimate_cost(80_000, 20_000, runs_per_day=10_000)
print(f"Agent at 10K/day: ${agent['monthly']:.0f}/month")
# → ~$162,000/month ($2,400 input + $3,000 output per day)

Task Classification Framework

Any proposed automation task can be classified on two dimensions: task complexity (how variable and open-ended is the reasoning path?) and stakes (how bad is a wrong answer, and is it reversible?).

Fully Automatable Without Agents

| Task | Right Approach | Why Not an Agent |
| --- | --- | --- |
| Sentiment classification | Single LLM call | Pure transformation |
| SQL report generation | Direct code | Fully specifiable |
| Document summarization | Single LLM call | No tool use needed |
| Email routing by rules | Direct code | Enumerable rules |
| Data format conversion | Direct code | Deterministic mapping |
| Translation | Single LLM call | No dynamic state |
| Content pipeline (outline → draft → review) | Workflow | Fixed step sequence |

Human-in-the-Loop Required

| Task | Approach | Checkpoint Location |
| --- | --- | --- |
| Contract review with tracked changes | Workflow + review | Before sending modified contract |
| Refund processing above threshold | Workflow + approval | Before issuing refund |
| Automated trading signals | Workflow + human | Before order execution |
| Production database migration | Script + human | Before schema change applies |
| Mass email campaign | Workflow + preview | Before send |

Genuinely Agentic

| Task | Why Agents Are Correct |
| --- | --- |
| Automated software debugging (run, read error, fix, repeat) | Unknown iteration count, dynamic tool selection |
| Open-ended research with citation following | Unknown depth, adaptive search strategy |
| Complex customer troubleshooting with branching | Decision tree too large to enumerate |
| Code generation in large unfamiliar codebases | Must read, understand, write, test, adapt |
| Security vulnerability investigation | Unknown attack surface, adaptive exploration |

The Agent Readiness Checklist

Before building an agent, work through this checklist. Every "no" in the first three sections is a blocker that must be addressed before continuing.

Task Qualification

The task requires dynamic decision-making that cannot be encoded as a fixed workflow
I have explicitly confirmed that Tiers 1–3 are insufficient - not just less elegant
The task has clear, evaluable success criteria defined before building starts
The task scope is bounded - there is an unambiguous definition of "done"
The expected tool-use patterns are mapped and implementable

Failure Mode Analysis

I understand consequences of a wrong action at each step in the agent loop
I have identified which actions are reversible and which are not
Every irreversible action has a human-in-the-loop checkpoint before execution
I have a graceful fallback for agent failure (not silent failure - surfaced failure)
I can detect loop conditions (same tool, same input, called twice)

Observability

Every tool call will be logged: name, inputs, outputs, timestamps, tokens
Full agent trace can be reconstructed for any given run ID
Alerting exists for error rate spikes and anomalous token consumption
Intermediate state can be inspected at any point during execution
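A minimal sketch of the logging these items call for: every tool call is recorded under a run ID so the full trace can be reconstructed later. The `TraceLog` class and its field names are hypothetical. A production system would write to durable storage rather than an in-memory list.

```python
import json
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class TraceLog:
    records: list[dict] = field(default_factory=list)

    def log_tool_call(self, run_id: str, tool: str, tool_input: dict,
                      output: str, tokens: int) -> None:
        """Record one tool call with everything needed to replay it."""
        self.records.append({
            "run_id": run_id,
            "timestamp": time.time(),
            "tool": tool,
            "input": tool_input,
            "output": output,
            "tokens": tokens,
        })

    def trace_for(self, run_id: str) -> list[dict]:
        """Return all recorded calls for one run, in the order they happened."""
        return [r for r in self.records if r["run_id"] == run_id]

    def to_jsonl(self) -> str:
        """Serialize the log as JSON Lines, one record per line."""
        return "\n".join(json.dumps(r, sort_keys=True) for r in self.records)


log = TraceLog()
run_id = str(uuid.uuid4())
log.log_tool_call(run_id, "run_sql", {"query": "SELECT 1"}, "1", tokens=120)
log.log_tool_call(run_id, "render_pdf", {"rows": 1}, "ok", tokens=80)
print(len(log.trace_for(run_id)), "calls recorded for run", run_id)
```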

Economic Justification

Per-task token cost estimated and within budget at expected daily volume
Value generated per task demonstrably exceeds compute + engineering cost
Cost control mechanisms exist: max iterations, token budget, wall-clock timeout

import anthropic
import json
import time

client = anthropic.Anthropic()

class BoundedAgentRunner:
    """
    Enforces the readiness checklist at runtime.
    Prevents the most common production agent failure modes.
    """

    def __init__(
        self,
        max_iterations: int = 15,
        max_tokens: int = 100_000,
        timeout_seconds: int = 300,
        irreversible_tools: list[str] | None = None,
        require_approval_for: list[str] | None = None,
    ):
        self.max_iterations = max_iterations
        self.max_tokens = max_tokens
        self.timeout_seconds = timeout_seconds
        self.irreversible_tools = irreversible_tools or []
        self.require_approval_for = require_approval_for or []
        self._reset()

    def _reset(self):
        self.iterations = 0
        self.tokens_used = 0
        self.start_time = time.time()
        self.call_log: list[dict] = []
        self._last_call: tuple[str, str] | None = None

    def before_tool_call(self, tool_name: str, tool_input: dict) -> None:
        """Validate all constraints before allowing a tool call to execute."""

        # Hard iteration limit
        if self.iterations >= self.max_iterations:
            raise RuntimeError(
                f"Agent stopped: exceeded {self.max_iterations} iterations. "
                "Possible loop or unexpectedly complex task. "
                "Partial trace available in call_log."
            )

        # Token budget
        if self.tokens_used >= self.max_tokens:
            raise RuntimeError(
                f"Agent stopped: exceeded token budget of {self.max_tokens:,}. "
                f"Used {self.tokens_used:,} tokens so far."
            )

        # Wall-clock timeout
        elapsed = time.time() - self.start_time
        if elapsed > self.timeout_seconds:
            raise TimeoutError(
                f"Agent stopped: exceeded {self.timeout_seconds}s timeout. "
                f"Completed {self.iterations} iterations."
            )

        # Loop detection: same tool, same input twice in a row
        input_sig = json.dumps(tool_input, sort_keys=True)
        current_call = (tool_name, input_sig)
        if current_call == self._last_call:
            raise RuntimeError(
                f"Agent stopped: loop detected. Tool '{tool_name}' called "
                "twice in a row with identical inputs."
            )
        self._last_call = current_call

        # Human approval gate for irreversible or otherwise consequential tools
        if tool_name in self.irreversible_tools or tool_name in self.require_approval_for:
            self._request_approval(tool_name, tool_input)

    def after_tool_call(self, tool_name: str, tool_input: dict,
                        result: str, tokens_used: int) -> None:
        """Log completed tool call for audit trail."""
        self.call_log.append({
            "iteration": self.iterations,
            "timestamp": time.time() - self.start_time,
            "tool": tool_name,
            "input_preview": str(tool_input)[:300],
            "result_preview": result[:300],
            "tokens": tokens_used,
        })
        self.tokens_used += tokens_used
        self.iterations += 1

    def _request_approval(self, tool_name: str, tool_input: dict) -> None:
        """In production: integrate with Slack/PagerDuty/approval UI."""
        print(f"\n[APPROVAL REQUIRED] Tool: {tool_name}")
        print(f"Input: {json.dumps(tool_input, indent=2)}")
        answer = input("Approve? (yes/no): ").strip().lower()
        if answer != "yes":
            raise PermissionError(f"Human denied execution of '{tool_name}'.")

    def audit_report(self) -> dict:
        return {
            "iterations": self.iterations,
            "tokens_used": self.tokens_used,
            "elapsed_seconds": round(time.time() - self.start_time, 2),
            "budget_pct": f"{self.tokens_used / self.max_tokens:.1%}",
            "tool_calls": self.call_log,
        }

ROI Analysis

Building and maintaining an agent is a capital investment. Calculate the ROI before committing.

The Four-Component Formula

Monthly Value  = (Human minutes saved/task × $/hour × tasks/month / 60)
               + (Quality improvement value)
               - (Mistake rate × Mistake cost × tasks/month)

Monthly Cost   = (Token cost/task × tasks/month)
               + (Maintenance hours/month × engineering $/hour)
               + (Build cost amortized over lifetime months)

ROI            = (Monthly Value - Monthly Cost) / Monthly Cost
Payback months = Build cost / Monthly net benefit

def calculate_agent_roi(
    human_minutes_per_task: float,
    human_hourly_rate: float,
    tasks_per_month: int,
    agent_token_cost_per_task: float,
    build_hours: float,
    engineering_hourly_rate: float,
    maintenance_hours_per_month: float,
    agent_mistake_rate: float,
    human_mistake_rate: float,
    mistake_cost: float,
    lifetime_months: int = 24,
) -> None:
    # Monthly human cost (what you replace)
    human_cost_per_task = (human_minutes_per_task / 60) * human_hourly_rate
    monthly_human_cost = human_cost_per_task * tasks_per_month
    human_monthly_errors = human_mistake_rate * mistake_cost * tasks_per_month

    # Monthly agent cost (what you pay)
    monthly_token_cost = agent_token_cost_per_task * tasks_per_month
    monthly_maintenance = maintenance_hours_per_month * engineering_hourly_rate
    amortized_build = (build_hours * engineering_hourly_rate) / lifetime_months
    agent_monthly_errors = agent_mistake_rate * mistake_cost * tasks_per_month

    monthly_value = monthly_human_cost + human_monthly_errors - agent_monthly_errors
    monthly_cost = monthly_token_cost + monthly_maintenance + amortized_build
    monthly_net = monthly_value - monthly_cost

    build_total = build_hours * engineering_hourly_rate
    payback = build_total / monthly_net if monthly_net > 0 else float("inf")

    print(f"Monthly value generated:  ${monthly_value:,.0f}")
    print(f"Monthly agent cost:       ${monthly_cost:,.0f}")
    print(f"Monthly net benefit:      ${monthly_net:,.0f}")
    print(f"Payback period:           {payback:.1f} months")
    print(f"Recommended:              {'YES' if monthly_net > 0 and payback < 12 else 'NO'}")

# Example: compliance report agent (illustrative numbers based on the opening scenario)
print("=== Compliance Report Agent ===")
calculate_agent_roi(
    human_minutes_per_task=120,       # 2 hours per report manually
    human_hourly_rate=100,            # $100/hr analyst
    tasks_per_month=20,               # 20 reports/month
    agent_token_cost_per_task=0.60,   # ~120K tokens at $0.005/1K
    build_hours=480,                  # 6 weeks x 2 engineers x 40 hours/week
    engineering_hourly_rate=150,
    maintenance_hours_per_month=10,
    agent_mistake_rate=0.30,          # assumed 30% silent error rate
    human_mistake_rate=0.02,          # 2% human error rate
    mistake_cost=2_000,               # $2K cost to detect and fix a wrong report
)
# With a 30% mistake rate and $2K per mistake, the agent runs at a net loss
# (about -$11,700/month with these inputs).
# Running this analysis up front would have saved six weeks of engineering.

Agent Anti-Patterns

These are the failure patterns that recur across organizations. Knowing them prevents months of wasted engineering.

Anti-Pattern 1: The Hammer Problem

After learning to build agents, engineers see every problem as an agent problem. Before building any agent, ask explicitly: "Would a junior engineer with a Python script solve this faster and more reliably?" If the answer might be yes, start with the script.

:::warning Signs You Are Over-Engineering

You can describe every step the agent will take before it runs
The task has no branching logic that depends on intermediate results
The output format is fully specified and fixed
The task needs to complete in under 5 seconds
The test cases all have exactly the same structure
:::

Anti-Pattern 2: Autonomy Without Bounds

Every production agent must have hard limits enforced by the runner, not by the model. The model cannot reliably self-terminate. A loop with no bounds will run until you run out of money or hit a provider rate limit.

# WRONG: Infinite loop with model self-termination only
def run_agent(task):
    messages = [{"role": "user", "content": task}]
    while True:                          # This will run indefinitely on stuck agents
        response = call_llm(messages)
        if response.stop_reason == "end_turn":
            return response.content
        messages.append(...)             # Accumulates context without bound

# RIGHT: Hard limits on every dimension
def run_agent_safe(task, max_steps=15, budget_tokens=100_000, timeout=300):
    runner = BoundedAgentRunner(
        max_iterations=max_steps,
        max_tokens=budget_tokens,
        timeout_seconds=timeout,
    )
    messages = [{"role": "user", "content": task}]
    while True:
        response = call_llm(messages)
        if response.stop_reason == "end_turn":
            return response.content, runner.audit_report()
        tool_name = response.tool_call.name
        tool_input = response.tool_call.input
        # The runner, not the model, enforces the limits: these hooks raise
        # RuntimeError as soon as a bound is exceeded, which ends the loop.
        runner.before_tool_call(tool_name, tool_input)
        result = execute_tool(tool_name, tool_input)
        runner.after_tool_call(tool_name, tool_input, result, response.usage.total_tokens)
        messages.append(...)             # Assistant turn and tool result (elided)

Anti-Pattern 3: Trusting Agent Output Without Validation

Agent outputs are proposals, not facts. Before acting on agent output - especially for consequential downstream actions - validate it against ground truth, through human review, or with automated checks. In the fintech scenario, the agent produced a beautifully formatted report that looked authoritative. An automated check on the totals would have caught the error before it reached compliance.

:::danger Never Trust Unvalidated Agent Output on High-Stakes Actions

The model is optimizing for internally consistent reasoning, not for ground-truth correctness. A confident, well-formatted output is not evidence of correctness. Always validate outputs that feed into consequential downstream actions.

:::
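One way to make this kind of validation concrete is an automated reconciliation check that runs before a report is released. The sketch below is illustrative only: the function name `validate_report_total`, the `control_total` argument, and the 1% tolerance are assumptions, not part of the original pipeline.

```python
def validate_report_total(report_total: float, control_total: float,
                          tolerance: float = 0.01) -> None:
    """Raise ValueError if the agent's total deviates from a trusted control total.

    control_total is assumed to come from a source outside the agent, such as a
    deterministic SQL aggregate. The 1% default tolerance is a placeholder.
    """
    if control_total == 0:
        raise ValueError("control total is zero; cannot reconcile")
    relative_error = abs(report_total - control_total) / abs(control_total)
    if relative_error > tolerance:
        raise ValueError(
            f"report total {report_total:,.2f} deviates from control total "
            f"{control_total:,.2f} by {relative_error:.1%}"
        )


# An order-of-magnitude discrepancy, like the one in the fintech scenario,
# is rejected before the report reaches a reviewer.
try:
    validate_report_total(report_total=12_500_000.0, control_total=1_250_000.0)
except ValueError as exc:
    print(f"Blocked: {exc}")
```

The check is deliberately dumb: it does not need to know why the numbers differ, only that they do.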

Anti-Pattern 4: Maximum Privilege by Default

The principle of least privilege applies to agents as strongly as to human users and service accounts. If an agent only reads from a database, give it read-only credentials. If it writes to one directory, scope its filesystem access to that directory. Never give an agent production write access "for convenience."

Anti-Pattern 5: Building for Demo Value

Agents are impressive in demos. They appear to "think." They produce streamed reasoning. They call tools. It is tempting to build them because they look powerful. Build agents when they solve a real problem better than alternatives - not because they are exciting to demonstrate. The demo success rate and the production success rate are routinely separated by 30+ percentage points.

Anti-Pattern 6: No Observability

An agent that runs without logging every tool call is an undebuggable black box in production. When an agent produces wrong output, the only way to understand why is to reconstruct the exact sequence of observations, reasoning steps, and tool calls. Without that trace, you are guessing.
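A minimal sketch of what such a trace could look like is shown below. The `ToolCallTrace` class and its record fields are hypothetical, chosen for illustration; printing JSON lines stands in for whatever log sink a real deployment uses.

```python
import json
import time
import uuid


class ToolCallTrace:
    """Collects one structured record per tool call for a single agent run.

    Hypothetical helper: the field names are illustrative, not a standard.
    """

    def __init__(self, task_id: str):
        self.run_id = str(uuid.uuid4())
        self.task_id = task_id
        self.records = []

    def call(self, tool_name, tool_fn, **tool_input):
        """Run tool_fn(**tool_input) and record inputs, outcome, and latency."""
        started = time.perf_counter()
        record = {
            "run_id": self.run_id,
            "task_id": self.task_id,
            "step": len(self.records) + 1,
            "tool": tool_name,
            "input": tool_input,
        }
        try:
            result = tool_fn(**tool_input)
            record.update(success=True, output=repr(result)[:500])
            return result
        except Exception as exc:
            record.update(success=False, error=f"{type(exc).__name__}: {exc}")
            raise
        finally:
            # Runs on both success and failure, so every call leaves a record.
            record["latency_ms"] = round((time.perf_counter() - started) * 1000, 2)
            self.records.append(record)
            print(json.dumps(record, default=str))  # stand-in for a real log sink


trace = ToolCallTrace(task_id="demo-task")
trace.call("add", lambda a, b: a + b, a=2, b=3)
```

Because every record carries the same `run_id`, the full sequence of calls for one run can be reassembled later from the log.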

Production Engineering Notes

The Three Mandatory Production Requirements

1. Structured logging of every tool call. Every production agent must log: task ID, user ID, timestamp, every LLM call (prompt hash, token count, latency), every tool call (name, inputs, outputs, success/failure, latency), final output, total tokens, total elapsed time.

2. Hard limits enforced by the runner. Maximum iterations, maximum token budget, maximum wall-clock time - all three, enforced independently, and enforced by code, not by model instruction.

3. Graceful degradation. When an agent fails, it must surface a meaningful error to the caller (not a 500), log the full trace, and optionally fall back to a simpler approach or human routing. Silent failure is never acceptable.
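The third requirement can be sketched as a thin wrapper around the agent call. Everything here is an assumption made for illustration: `AgentResult`, `run_with_fallback`, and the `handled_by` labels are hypothetical names, and the agent and fallback are passed in as plain callables.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("agent")


@dataclass
class AgentResult:
    """Structured outcome returned to the caller instead of a bare exception."""
    ok: bool
    output: Optional[str] = None
    error: Optional[str] = None
    handled_by: str = "agent"  # "agent", "fallback", or "human_review"


def run_with_fallback(task: str,
                      agent: Callable[[str], str],
                      fallback: Optional[Callable[[str], str]] = None) -> AgentResult:
    """Try the agent, then a simpler fallback, then route to a human."""
    try:
        return AgentResult(ok=True, output=agent(task))
    except Exception as exc:
        # logger.exception records the full traceback for later debugging.
        logger.exception("agent failed on task %r", task)
        if fallback is not None:
            try:
                return AgentResult(ok=True, output=fallback(task),
                                   handled_by="fallback")
            except Exception:
                logger.exception("fallback failed on task %r", task)
        # Surface a meaningful error rather than failing silently.
        return AgentResult(ok=False, error=f"{type(exc).__name__}: {exc}",
                           handled_by="human_review")


def stuck_agent(task: str) -> str:
    raise RuntimeError("exceeded 3 iterations")


result = run_with_fallback("monthly report", stuck_agent,
                           fallback=lambda task: "report from deterministic script")
print(result)
```

The caller always receives an `AgentResult`, so it can distinguish "the agent answered", "the fallback answered", and "a human needs to look at this" without parsing exception text.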

Testing Agents Before Production

import json

import pytest

# BoundedAgentRunner and run_agent_safe are the definitions shown earlier in this lesson.

def test_agent_stops_at_max_iterations():
    """Agent runner must enforce iteration limit regardless of model behavior."""
    runner = BoundedAgentRunner(max_iterations=3)

    # Simulate a model that never says it is done. Vary the query on each call
    # so that loop detection does not fire before the iteration limit is hit.
    for i in range(3):
        tool_input = {"query": f"test {i}"}
        runner.before_tool_call("search", tool_input)
        runner.after_tool_call("search", tool_input, "results", 100)

    with pytest.raises(RuntimeError, match="exceeded 3 iterations"):
        runner.before_tool_call("search", {"query": "test 3"})

def test_agent_detects_loop():
    """Runner must detect when agent calls same tool with same args twice."""
    runner = BoundedAgentRunner(max_iterations=10)
    runner.before_tool_call("read_file", {"path": "/data.csv"})
    runner.after_tool_call("read_file", {"path": "/data.csv"}, "content", 50)

    with pytest.raises(RuntimeError, match="loop detected"):
        runner.before_tool_call("read_file", {"path": "/data.csv"})

def test_agent_validates_output_schema():
    """Agent output must conform to expected schema before downstream use."""
    # Use a small fast model for testing; assert structure, not exact content
    output, _audit = run_agent_safe(
        "Extract name and email from: 'John Smith, john.smith@example.com'"
    )
    # Validate structure: must be parseable, must have required fields
    parsed = json.loads(output)
    assert "name" in parsed
    assert "email" in parsed
    assert "@" in parsed["email"]

Monitoring in Production

Track these metrics for every deployed agent:

- Success rate (per task type, rolling 7-day)
- P50/P95 latency and trend
- Token cost per task and trend
- Error rate by failure mode (timeout, loop, tool error, validation failure)
- Human override rate (for human-in-the-loop tasks; a high rate means the agent is often wrong)

Alert on: success rate drops below threshold, cost per task increases by more than 20%, any new error mode appearing.
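These alert rules are simple enough to express directly in code. The sketch below is one possible shape, not a prescribed implementation: `WindowStats`, `check_alerts`, and the 90% success-rate threshold are assumptions, while the 20% cost-increase rule comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Aggregated metrics for one monitoring window (hypothetical shape)."""
    success_rate: float      # fraction of tasks that succeeded, 0.0 to 1.0
    cost_per_task: float     # average token cost per task, in dollars
    error_modes: frozenset   # failure modes observed, e.g. {"timeout", "loop"}


def check_alerts(current: WindowStats, baseline: WindowStats,
                 min_success_rate: float = 0.90,   # assumed threshold
                 max_cost_increase: float = 0.20) -> list:
    """Return a list of human-readable alerts for the current window."""
    alerts = []
    if current.success_rate < min_success_rate:
        alerts.append(
            f"success rate {current.success_rate:.0%} below {min_success_rate:.0%}"
        )
    if baseline.cost_per_task > 0:
        increase = current.cost_per_task / baseline.cost_per_task - 1
        if increase > max_cost_increase:
            alerts.append(f"cost per task up {increase:.0%}")
    # Any failure mode not seen in the baseline window counts as new.
    for mode in sorted(current.error_modes - baseline.error_modes):
        alerts.append(f"new error mode: {mode}")
    return alerts


baseline = WindowStats(0.95, 0.50, frozenset({"timeout"}))
current = WindowStats(0.85, 0.65, frozenset({"timeout", "loop"}))
for alert in check_alerts(current, baseline):
    print(alert)
```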

Interview Q&A

Q1: How do you decide whether to build an agent or use a simpler approach?

I apply three questions in order. First: can a deterministic script solve this? If yes, write the script - it will be more reliable, cheaper, and easier to maintain than any agent. Second: if language understanding is needed, is the task a single transformation or multi-step? A single well-crafted prompt handles the vast majority of NLP classification, extraction, and generation tasks without any agent machinery. Third: if multi-step reasoning is needed, are the steps known in advance? If yes, a workflow - a fixed sequence of LLM calls - handles it better than an agent. Agents are only warranted when the control flow itself must be dynamic, determined at runtime based on what the agent discovers. That is a narrow class of tasks. The default should always be: use the simplest approach that solves the problem reliably.
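The three questions form a short decision ladder, which can be written down as a function. This is a sketch of the reasoning in the answer above; the function name and boolean parameters are assumptions, and real tasks rarely reduce to three clean yes/no inputs.

```python
from enum import Enum


class Tier(Enum):
    DIRECT_CODE = "direct code"
    SINGLE_LLM_CALL = "single LLM call"
    WORKFLOW = "workflow"
    AGENT = "autonomous agent"


def choose_architecture(deterministic_solvable: bool,
                        multi_step: bool,
                        steps_known_in_advance: bool) -> Tier:
    """Walk the three questions in order and stop at the simplest tier that fits."""
    if deterministic_solvable:
        return Tier.DIRECT_CODE          # Question 1: a script is enough
    if not multi_step:
        return Tier.SINGLE_LLM_CALL      # Question 2: one transformation
    if steps_known_in_advance:
        return Tier.WORKFLOW             # Question 3: fixed sequence of LLM calls
    return Tier.AGENT                    # control flow must be decided at runtime


# A scheduled report from structured data: fixed steps, no language understanding.
print(choose_architecture(True, True, True).value)
# An open-ended investigation whose path depends on what is found.
print(choose_architecture(False, True, False).value)
```

The ordering matters: the agent tier is only reachable after every simpler option has been ruled out.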

Q2: What is compound reliability and why does it matter for agents?

Every step in an agent's trajectory has its own success probability, and if the steps fail independently, these probabilities multiply. If each step succeeds 95% of the time and there are 10 steps, overall success is $0.95^{10} \approx 0.60$ - roughly a 40% failure rate. This is a mathematical property of sequential processes, not a fixable engineering problem. The implication is that agents are only viable in production when at least one of the following holds: the task has a short trajectory (few steps), per-step reliability is very high (99%+), the failure rate is acceptable to the business, or there is strong human-in-the-loop oversight to catch failures before they cause harm.
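The arithmetic is easy to check, and to invert: given a per-step reliability and a target end-to-end success rate, how many steps can a trajectory afford? The sketch below assumes independent steps with identical reliability, which is a simplification; the function names are illustrative.

```python
def trajectory_success(per_step: float, steps: int) -> float:
    """End-to-end success probability for independent, equally reliable steps."""
    return per_step ** steps


def max_steps_for_target(per_step: float, target: float) -> int:
    """Largest step count n such that per_step ** n still meets the target."""
    if not 0 < per_step < 1:
        raise ValueError("per_step must be strictly between 0 and 1")
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    steps = 0
    while per_step ** (steps + 1) >= target:
        steps += 1
    return steps


for reliability in (0.95, 0.99):
    overall = trajectory_success(reliability, 10)
    budget = max_steps_for_target(reliability, 0.90)
    print(f"per-step {reliability:.0%}: 10 steps -> {overall:.1%} overall, "
          f"90% target allows {budget} steps")
```

At 95% per-step reliability a 90% end-to-end target leaves room for only 2 steps; at 99% it leaves room for 10. This is why raising per-step reliability matters more than almost any other optimization.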

Q3: Describe three situations where you would NOT use an agent.

(1) Generating a scheduled daily report from structured database data - the steps are fixed, the data sources are known, the format is specified. A Python cron job with one LLM call for narrative formatting is the right answer: deterministic, reliable, cheap. (2) Answering customer FAQ questions - embed the FAQ, run semantic search, pass top results to one LLM call for a synthesized answer. This is a RAG pipeline, not an agent. (3) Translating documents from one language to another - always the same transformation, no tools needed, one LLM call per document. Treating these as agent tasks is textbook over-engineering.

Q4: How do you design an agent for a high-stakes domain like finance?

I would not deploy a fully autonomous agent for high-stakes financial operations. Instead I would use a hybrid: an agent to do the reasoning and preparation work (research, calculation, synthesis), followed by a mandatory human-in-the-loop checkpoint before any consequential action (trade execution, funds transfer, regulatory filing). The agent proposes; the human approves. This captures the reasoning benefit of agents while maintaining the reliability guarantee that the domain requires. Additionally: every tool call is logged, every output is validated against business rules before being presented to the human, and the agent's action set is scoped to read-only operations (it can read data and prepare analysis but cannot directly write to production systems without explicit approval).
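The "agent proposes, human approves" pattern can be enforced structurally, so that a consequential action cannot execute without a recorded approver. The sketch below is a minimal illustration; `ProposedAction`, `ApprovalRequired`, and the reviewer string are hypothetical, and a real system would persist approvals and authenticate the reviewer.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class ApprovalRequired(Exception):
    """Raised when a consequential action is attempted without human sign-off."""


@dataclass
class ProposedAction:
    """An action prepared by the agent but not yet executed."""
    description: str
    execute: Callable[[], str]
    approved_by: Optional[str] = None


def approve(action: ProposedAction, reviewer: str) -> ProposedAction:
    """Record which human approved the action."""
    action.approved_by = reviewer
    return action


def execute_if_approved(action: ProposedAction) -> str:
    """Run the action only if a human approver is on record."""
    if not action.approved_by:
        raise ApprovalRequired(f"'{action.description}' needs human approval")
    return action.execute()


# The agent prepares the transfer; it cannot run until a human signs off.
transfer = ProposedAction(
    description="transfer $10,000 to vendor account",
    execute=lambda: "transfer submitted",
)
try:
    execute_if_approved(transfer)
except ApprovalRequired as exc:
    print(f"Blocked: {exc}")

print(execute_if_approved(approve(transfer, reviewer="compliance-officer")))
```

The gate lives in the execution path rather than in the prompt, which is the same design principle as runner-enforced limits.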

Q5: A PM asks you to add an AI agent to your company's sales dashboard to "make it smarter." How do you respond?

I ask what specific problem the agent would solve. If the answer is "it could answer questions about the data," I would build a text-to-SQL pipeline with a single LLM call - not an agent. If the answer is "it could proactively find insights," I would build a scheduled workflow that generates a weekly insights report using a fixed sequence of analysis steps. If the answer is "it could investigate sales anomalies by pulling data from multiple sources and figuring out root cause," that might justify an agent - because the investigation path genuinely depends on what is discovered. The word "smarter" is not a task requirement. Identify the concrete user need first, then choose the simplest architecture that meets it.

Q6: What production monitoring would you set up for a deployed agent?

Minimum three layers. First, structural monitoring: every tool call logged with name, inputs, outputs, token count, and latency. Full trace reconstructable from run ID. Second, business metrics: success rate (does the agent produce a correct, complete output?) tracked per task type on a rolling 7-day basis. Cost per task tracked and alerted on significant increases. P50/P95 latency tracked. Third, safety monitoring: error rate by failure mode (timeout, loop, validation failure, tool error) with PagerDuty alerts for any new failure mode appearing. For human-in-loop agents, track override rate - if humans are overriding more than 20% of agent decisions, the agent's judgment is miscalibrated and needs retraining or constraint adjustment.
