Agentic Design Patterns

It is December 2024. Anthropic's applied research team has been watching thousands of production agent deployments for months. They have seen what works and what fails. They have seen engineers invent the same solutions independently across dozens of companies. They have seen identical architectures emerge from completely different problem domains. The patterns are not random - they are the natural shapes that capable agentic systems converge to.

The team writes a technical post cataloguing 5 core patterns. Not theory. Not aspirational architectures. Patterns extracted from real production systems, distilled into reusable shapes with clear applicability criteria.

This lesson is a deep dive into those 5 patterns. By the end you will not only understand each pattern conceptually - you will have working Python implementations of all of them, understand when to reach for each one, and know how to compose them for more complex problems.


Why Patterns Exist

Before the patterns, there was chaos. Every team building agents was solving the same problems from scratch: "how do I handle tasks too long for one prompt?" "how do I deal with different input types?" "how do I make the model verify its own output?"

These are not unique problems. They have good solutions. The patterns are those solutions, formalized.

The deeper insight is this: agents fail in predictable ways. Unconstrained agents hallucinate, loop, and drift. The patterns are structural constraints that channel the model's capability toward reliable outcomes. Each pattern is essentially an answer to one failure mode.

| Failure Mode | Pattern That Addresses It |
| --- | --- |
| Single prompt too complex | Prompt Chaining |
| Input too diverse for one handler | Routing |
| Sequential execution too slow | Parallelization |
| Task requires specialization | Orchestrator-Subagents |
| Output quality too inconsistent | Evaluator-Optimizer |


Pattern 1: Prompt Chaining

The Idea

The simplest pattern. Decompose a complex task into a sequence of simpler subtasks. Each LLM call handles one step and passes its output to the next. The chain is linear - the output of step N is the input of step N+1.

The key insight: a chain of focused prompts often outperforms one sprawling mega-prompt. Why? Because each step can be specialized, and a narrowly scoped prompt leaves the model less room to get confused about what it is supposed to do.
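A minimal sketch of the chaining idea, with plain Python functions standing in for LLM calls. The names `run_chain`, `extract_keywords`, and `write_headline` are hypothetical and exist only for illustration; in a real chain each step would wrap a model call.

```python
from typing import Callable

Step = Callable[[str], str]

def run_chain(steps: list[Step], initial_input: str) -> str:
    """Feed the output of each step into the next one, in order."""
    value = initial_input
    for step in steps:
        value = step(value)
    return value

# Hypothetical stand-ins for LLM calls: each one does a single narrow job.
def extract_keywords(text: str) -> str:
    words = [w.strip(".,").lower() for w in text.split()]
    return ", ".join(sorted({w for w in words if len(w) > 6}))

def write_headline(keywords: str) -> str:
    return f"Report on: {keywords}"

result = run_chain(
    [extract_keywords, write_headline],
    "Microsoft confirmed the partnership and integration.",
)
print(result)
```

Because every step has the same shape (string in, string out), steps can be tested, swapped, or reordered independently of each other.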

When to Use Prompt Chaining

Use when:

  • The task has clear, sequential steps where each step builds on the previous
  • Each step is independently verifiable (you can check if the intermediate output is good)
  • The full task would overwhelm a single prompt's context or attention span
  • You want to gate progress - stop the chain if an intermediate step fails

Do NOT use when the steps are parallel (use Parallelization instead) or when the "steps" are really just one complex task (use a single well-crafted prompt).

Validation Gates

The most important feature of prompt chaining: gates. A gate is a simple check between steps that validates the output before passing it forward. Gates catch errors early when they are cheap to fix, rather than late when the entire chain has run on bad data.

Gates can be:

  • Regex/structural: "does the output contain the required JSON structure?"
  • LLM-based: "ask another LLM call: is this output complete and correct?"
  • Deterministic: "is the extracted number in the valid range?"
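The structural and deterministic gate types above might look like the sketch below. The function names and the `total` key are assumptions made for illustration; an LLM-based gate is omitted because it needs a model call.

```python
import json
import re

GateResult = tuple[bool, str]  # (passed, error_message)

def structural_gate(output: str) -> GateResult:
    """Structural check: output must be a JSON object with a 'total' key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict) or "total" not in data:
        return False, "missing key: total"
    return True, ""

def regex_gate(output: str) -> GateResult:
    """Regex check: output must contain a YYYY-MM-DD date."""
    if re.search(r"\b\d{4}-\d{2}-\d{2}\b", output):
        return True, ""
    return False, "no YYYY-MM-DD date found"

def range_gate(value: float, low: float = 0.0, high: float = 100.0) -> GateResult:
    """Deterministic check: an extracted number must fall in [low, high]."""
    if low <= value <= high:
        return True, ""
    return False, f"{value} outside [{low}, {high}]"
```

Returning an error message alongside the boolean keeps the failure reason available for logging or for feeding back into a retry prompt.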

Python Implementation: Document Analysis Chain

import anthropic
import json
import re
from dataclasses import dataclass
from typing import Callable, Optional

client = anthropic.Anthropic()

@dataclass
class ChainStep:
    name: str
    prompt_template: str
    # A validator (gate) takes a step's output and returns (is_valid, error_message)
    validator: Optional[Callable[[str], tuple[bool, str]]] = None

def llm_call(prompt: str, system: str = "") -> str:
    """Single LLM call, returns text response."""
    messages = [{"role": "user", "content": prompt}]
    kwargs = {"model": "claude-opus-4-5", "max_tokens": 2048, "messages": messages}
    if system:
        kwargs["system"] = system
    response = client.messages.create(**kwargs)
    return response.content[0].text

def validate_json(output: str) -> tuple[bool, str]:
    """Gate: check if output is valid JSON with required keys."""
    try:
        data = json.loads(output)
        required_keys = ["entities", "dates", "key_claims"]
        missing = [k for k in required_keys if k not in data]
        if missing:
            return False, f"Missing required keys: {missing}"
        return True, ""
    except json.JSONDecodeError as e:
        return False, f"Invalid JSON: {e}"

def validate_summary(output: str) -> tuple[bool, str]:
    """Gate: check summary meets minimum quality bar."""
    if len(output.strip()) < 100:
        return False, "Summary too short (under 100 chars)"
    if len(output.strip()) > 2000:
        return False, "Summary too long (over 2000 chars)"
    return True, ""

def prompt_chain(document: str, max_retries: int = 2) -> dict:
    """
    3-step document analysis chain with validation gates.

    Step 1: Extract structured facts (entities, dates, claims)
        Gate: validate JSON structure
    Step 2: Summarize the extracted facts
        Gate: validate summary length
    Step 3: Draft an executive report
    """

    # ── Step 1: Extract Key Facts ──────────────────────────────────────────────
    print("Step 1: Extracting facts...")
    extract_prompt = f"""Analyze this document and extract key information.

Document:
{document}

Return a JSON object with exactly these keys:
- "entities": list of strings (people, organizations, places mentioned)
- "dates": list of strings (dates and time references)
- "key_claims": list of strings (main claims or assertions in the document)

Return ONLY the JSON object, no other text."""

    for attempt in range(max_retries + 1):
        extracted_text = llm_call(extract_prompt)
        # Strip markdown code fences if present
        extracted_text = re.sub(r'^```json\n?', '', extracted_text.strip())
        extracted_text = re.sub(r'\n?```$', '', extracted_text.strip())

        valid, error = validate_json(extracted_text)
        if valid:
            break
        if attempt == max_retries:
            raise ValueError(f"Step 1 failed validation after {max_retries} retries: {error}")
        print(f" Gate failed (attempt {attempt + 1}): {error}. Retrying...")

    extracted = json.loads(extracted_text)
    print(f" Extracted {len(extracted['entities'])} entities, "
          f"{len(extracted['dates'])} dates, "
          f"{len(extracted['key_claims'])} claims")

    # ── Step 2: Summarize Facts ────────────────────────────────────────────────
    print("Step 2: Summarizing...")
    summarize_prompt = f"""Given these extracted facts from a document, write a clear summary.

Entities mentioned: {', '.join(extracted['entities'])}
Dates referenced: {', '.join(extracted['dates'])}
Key claims:
{chr(10).join(f'- {c}' for c in extracted['key_claims'])}

Write a 2-3 paragraph summary (150-400 words) that synthesizes these facts into a coherent narrative.
Focus on what happened, who was involved, and why it matters."""

    for attempt in range(max_retries + 1):
        summary = llm_call(summarize_prompt)
        valid, error = validate_summary(summary)
        if valid:
            break
        if attempt == max_retries:
            raise ValueError(f"Step 2 failed validation: {error}")
        print(f" Gate failed: {error}. Retrying...")

    print(f" Summary: {len(summary.split())} words")

    # ── Step 3: Draft Executive Report ────────────────────────────────────────
    print("Step 3: Drafting report...")
    report_prompt = f"""You are a senior analyst. Draft a professional executive report.

SUMMARY:
{summary}

RAW FACTS:
- Entities: {', '.join(extracted['entities'])}
- Key dates: {', '.join(extracted['dates'])}

Format the report with:
1. Executive Summary (2 sentences)
2. Key Findings (bullet points)
3. Timeline (if dates present)
4. Recommendations (2-3 action items)

Use professional language appropriate for C-suite executives."""

    report = llm_call(
        report_prompt,
        system="You are a senior strategy consultant producing executive briefings."
    )
    print(" Report drafted.")

    return {
        "extracted_facts": extracted,
        "summary": summary,
        "executive_report": report
    }


# Example usage
if __name__ == "__main__":
    sample_doc = """
OpenAI announced a major partnership with Microsoft on January 23, 2023,
valued at approximately $10 billion. CEO Sam Altman stated the investment
would fund compute for training frontier models. Microsoft's Satya Nadella
confirmed integration into Azure cloud services and Office 365 products.
The deal follows an earlier $1 billion investment in 2019. Industry analysts
at Goldman Sachs projected the AI services market to reach $150 billion by 2025.
Competitors including Google DeepMind and Anthropic are watching closely.
"""

    result = prompt_chain(sample_doc)
    print("\n" + "="*60)
    print("EXECUTIVE REPORT:")
    print("="*60)
    print(result["executive_report"])

Key Design Decisions

Why gates between every step? Because errors compound. If step 1 produces malformed output and step 2 doesn't catch it, step 3 gets garbage. A gate after step 1 costs one extra check but prevents wasting three more LLM calls on bad data.

Why allow retries? LLMs are non-deterministic. A prompt that fails once often succeeds on a second attempt. Set max_retries=2 - enough to handle transient failures without infinite loops.
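The retry-behind-a-gate logic can be factored into one small helper. This is a sketch under assumptions: `call_with_gate`, `make_flaky_generator`, and `non_empty` are hypothetical names, and the generator replays canned outputs instead of calling a model.

```python
from typing import Callable

Gate = Callable[[str], tuple[bool, str]]

def call_with_gate(generate: Callable[[], str], gate: Gate, max_retries: int = 2) -> str:
    """Call generate() until the gate passes; give up after max_retries retries."""
    error = ""
    for _ in range(max_retries + 1):  # one initial attempt plus max_retries retries
        output = generate()
        ok, error = gate(output)
        if ok:
            return output
    raise ValueError(f"Gate still failing after {max_retries} retries: {error}")

def make_flaky_generator(outputs: list[str]) -> Callable[[], str]:
    """Hypothetical stand-in for a non-deterministic LLM: replays canned outputs."""
    remaining = iter(outputs)
    return lambda: next(remaining)

def non_empty(output: str) -> tuple[bool, str]:
    return (True, "") if output.strip() else (False, "empty output")

# First attempt fails the gate, second attempt passes.
print(call_with_gate(make_flaky_generator(["", "ok"]), non_empty))
```

The bounded `range` is what guarantees termination: a gate that can never pass costs at most `max_retries + 1` calls before the error surfaces.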

Why extract to JSON first? Structured intermediate representations are easier to validate, debug, and pass to subsequent steps. Plain text chains are harder to gate.


Pattern 2: Routing

The Idea

An LLM classifier reads the input and routes it to the appropriate handler. Different input types get different prompts, different tools, or different specialized agents.

When to Use Routing

Use when:

  • Inputs are genuinely diverse (billing questions need different handling than technical bugs)
  • Each input type benefits from a specialized system prompt, tool set, or model
  • A single generic prompt handles all types poorly
  • You want to send different input categories to different downstream systems

Do NOT use when all inputs need the same handling (routing overhead is not worth it) or when the "categories" are really just one flexible task.

Python Implementation: Customer Support Router

import anthropic
import json
from typing import Literal
from dataclasses import dataclass

client = anthropic.Anthropic()

# ── Route Definitions ──────────────────────────────────────────────────────────
RouteType = Literal["billing", "technical", "complaint", "general"]

@dataclass
class Route:
    name: RouteType
    description: str
    system_prompt: str
    tools: list

ROUTES: dict[str, Route] = {
    "billing": Route(
        name="billing",
        description="Payment issues, invoices, refunds, subscription changes",
        system_prompt="""You are a billing specialist. You handle payment questions with
precision and empathy. You have access to account data and can process refunds
up to $500 without escalation. Always confirm account details before taking action.""",
        tools=["lookup_account", "process_refund", "update_subscription"]
    ),
    "technical": Route(
        name="technical",
        description="Bugs, errors, integration issues, API questions",
        system_prompt="""You are a senior technical support engineer. Diagnose problems
systematically. Ask for error messages, reproduction steps, and environment details.
Provide working code examples when relevant. Escalate only if you cannot reproduce.""",
        tools=["search_docs", "run_diagnostic", "create_ticket"]
    ),
    "complaint": Route(
        name="complaint",
        description="Unhappy customers, service failures, escalations, urgent issues",
        system_prompt="""You are a senior customer success manager handling escalations.
Your first priority is acknowledgment and de-escalation. Do not be defensive.
Offer concrete remediation. You have authority to issue credits up to $1000.""",
        tools=["lookup_account", "issue_credit", "schedule_callback", "escalate"]
    ),
    "general": Route(
        name="general",
        description="Product questions, feature requests, how-to, general info",
        system_prompt="""You are a knowledgeable product specialist. Answer questions
accurately and helpfully. If you don't know something, say so and offer to
find out. Upsell naturally when genuinely relevant.""",
        tools=["search_docs", "search_faq"]
    )
}

def classify_input(user_input: str, conversation_history: list[str] = None) -> RouteType:
    """
    Use an LLM to classify the input into one of the defined routes.
    Returns the route name as a string.
    """
    route_descriptions = "\n".join(
        f'- "{name}": {route.description}'
        for name, route in ROUTES.items()
    )

    history_context = ""
    if conversation_history:
        history_context = f"\nConversation so far:\n{chr(10).join(conversation_history[-3:])}\n"

    classify_prompt = f"""Classify this customer support message into exactly one category.

Categories:
{route_descriptions}

{history_context}
Customer message: "{user_input}"

Respond with ONLY the category name (one of: billing, technical, complaint, general).
No explanation, no punctuation, just the category name."""

    response = client.messages.create(
        model="claude-haiku-4-5",  # Use fast/cheap model for classification
        max_tokens=10,
        messages=[{"role": "user", "content": classify_prompt}]
    )

    route_name = response.content[0].text.strip().lower()

    # Validate - fall back to "general" if classification is unexpected
    if route_name not in ROUTES:
        print(f" Warning: unexpected route '{route_name}', defaulting to 'general'")
        route_name = "general"

    return route_name

def handle_with_route(route_name: RouteType, user_input: str) -> str:
    """
    Route input to specialized handler and generate response.
    In production this would also set up the right tools -
    here we simulate with the system prompt.
    """
    route = ROUTES[route_name]

    response = client.messages.create(
        model="claude-opus-4-5",  # Use capable model for actual handling
        max_tokens=1024,
        system=route.system_prompt,
        messages=[{"role": "user", "content": user_input}]
    )

    return response.content[0].text

def support_router(user_input: str, history: list[str] = None) -> dict:
    """
    Full routing pipeline:
    1. Classify input → route
    2. Route to appropriate handler
    3. Return response + metadata
    """
    print(f"Input: {user_input[:60]}...")

    # Step 1: Classify
    route_name = classify_input(user_input, history)
    route = ROUTES[route_name]
    print(f" Routed to: {route_name}")
    print(f" Tools available: {route.tools}")

    # Step 2: Handle
    response = handle_with_route(route_name, user_input)

    return {
        "route": route_name,
        "tools_available": route.tools,
        "response": response
    }


# Example usage
if __name__ == "__main__":
    test_inputs = [
        "I was charged twice for my subscription last month and need a refund",
        "Getting a 401 Unauthorized error when calling your API with my token",
        "This is absolutely unacceptable. My service has been down for 3 hours!",
        "What's the difference between the Pro and Enterprise plans?",
    ]

    for input_text in test_inputs:
        result = support_router(input_text)
        print(f"\n--- Route: {result['route'].upper()} ---")
        print(result['response'][:200] + "...")
        print()

Router Quality Matters

The router is the most critical piece. A bad router sends billing questions to the technical handler - every downstream response is wrong before it starts. Invest in your classification prompt:

  1. Use few-shot examples for ambiguous categories
  2. Use a fast/cheap model (Claude Haiku) for classification - save the powerful model for handling
  3. Log all routing decisions - your classification accuracy is a key metric
  4. Build a confusion matrix over time to identify which categories are being misrouted
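Points 3 and 4 can start very simply: log each routing decision next to a human-assigned label and count the mismatches. The log below is made-up illustration data, and `routing_accuracy` and `top_confusions` are hypothetical helper names.

```python
from collections import Counter

# Hypothetical routing log: (human_label, router_prediction) pairs.
ROUTING_LOG = [
    ("billing", "billing"),
    ("billing", "complaint"),
    ("technical", "technical"),
    ("complaint", "complaint"),
    ("billing", "complaint"),
    ("general", "technical"),
]

def routing_accuracy(log: list[tuple[str, str]]) -> float:
    """Fraction of inputs the router sent to the labeled route."""
    if not log:
        return 0.0
    return sum(expected == predicted for expected, predicted in log) / len(log)

def top_confusions(log: list[tuple[str, str]], n: int = 3) -> list[tuple[tuple[str, str], int]]:
    """Most frequent (expected, predicted) mismatches: the off-diagonal cells."""
    errors = Counter((e, p) for e, p in log if e != p)
    return errors.most_common(n)

print(f"accuracy: {routing_accuracy(ROUTING_LOG):.0%}")
for (expected, predicted), count in top_confusions(ROUTING_LOG):
    print(f"{expected} -> {predicted}: {count}")
```

In this toy log the dominant error is billing messages landing in the complaint route, which is the kind of pair where few-shot examples in the classification prompt are most likely to help.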

Pattern 3: Parallelization

The Idea

When a task can be broken into independent subtasks, run them simultaneously. Two subtypes:

Sectioning: divide one large task across multiple parallel agents (e.g., analyze chapter 1, chapter 2, and chapter 3 simultaneously, then merge).

Voting: run the same task N times with different agents and take the majority vote (or best-of-N). Improves reliability for high-stakes decisions.

When to Use Parallelization

Use sectioning when: large input that needs to be processed uniformly, independent chunks, time is a constraint.
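The core of sectioning is fan-out then gather. The sketch below replaces the model call with a short sleep, so it runs without an API key; `analyze_chunk` and `analyze_sections` are hypothetical names.

```python
import asyncio

async def analyze_chunk(index: int, chunk: str) -> dict:
    """Stand-in for one LLM call over one chunk (the sleep simulates latency)."""
    await asyncio.sleep(0.05)
    return {"section": index, "words": len(chunk.split())}

async def analyze_sections(chunks: list[str]) -> list[dict]:
    """Launch one task per chunk and wait for all of them."""
    tasks = [analyze_chunk(i + 1, chunk) for i, chunk in enumerate(chunks)]
    # gather returns results in the order the tasks were passed in,
    # regardless of which one finishes first.
    return await asyncio.gather(*tasks)

results = asyncio.run(analyze_sections(["one two", "three", "four five six"]))
print(results)
```

Because the results keep their input order, the merge step can rely on section numbers without re-sorting.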

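Use voting when: output reliability is critical, the task is judgment-based (code review, classification), and you can afford N× the token cost. The gain is real but sublinear: N voters do not make the answer N× more reliable, and votes that share the same blind spots add little.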
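A rough way to estimate what voting buys, under two simplifying assumptions that real LLM voters only approximate: the task is binary, and each voter is independently correct with the same probability `p`. The function name `majority_accuracy` is hypothetical.

```python
from math import comb

def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Assumes a binary task, an odd n, and the same per-voter accuracy p.
    """
    needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(needed, n + 1))

for n in (1, 3, 5):
    print(n, round(majority_accuracy(0.8, n), 4))
```

With `p = 0.8`, three voters give about 0.896 and five give about 0.942: tripling or quintupling the cost buys a few points of accuracy, and correlated errors between voters would shrink the gain further.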

Python Implementation: Parallel Research with asyncio

import anthropic
import asyncio
import re
from dataclasses import dataclass
from typing import Any
from collections import Counter

# Note: this example uses the synchronous client and asyncio.to_thread to parallelize.
# The SDK also provides anthropic.AsyncAnthropic if you prefer native async calls.
sync_client = anthropic.Anthropic()

def sync_llm_call(prompt: str, system: str = "") -> str:
    """Synchronous LLM call - will be run in thread pool for parallelism."""
    kwargs = {
        "model": "claude-opus-4-5",
        "max_tokens": 1500,
        "messages": [{"role": "user", "content": prompt}]
    }
    if system:
        kwargs["system"] = system
    response = sync_client.messages.create(**kwargs)
    return response.content[0].text

async def async_llm_call(prompt: str, system: str = "") -> str:
    """Async wrapper - runs sync call in thread pool to not block event loop."""
    return await asyncio.to_thread(sync_llm_call, prompt, system)

# ── Sectioning: Parallel Document Analysis ────────────────────────────────────

async def analyze_section(section_num: int, content: str, focus: str) -> dict:
    """Analyze one section of a document."""
    prompt = f"""Analyze Section {section_num} of a research report.

Section content:
{content}

Focus on: {focus}

Return:
1. Key findings (3-5 bullet points)
2. Notable data points or statistics
3. Section's main argument in one sentence"""

    result = await async_llm_call(prompt)
    return {"section": section_num, "analysis": result}

async def parallel_document_analysis(document: str) -> str:
    """
    Split document into sections, analyze in parallel, then synthesize.
    """
    # Split into sentences on whitespace after end punctuation (spaces or newlines),
    # then group into at most 3 roughly equal chunks without dropping the tail.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', document.strip()) if s]
    chunk_size = max(1, -(-len(sentences) // 3))  # ceiling division
    chunks = [
        ' '.join(sentences[i:i+chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]

    focus_areas = [
        "key arguments and evidence",
        "methodology and approach",
        "conclusions and implications"
    ]

    print(f"Analyzing {len(chunks)} sections in parallel...")

    # Launch all section analyses simultaneously
    tasks = [
        analyze_section(i+1, chunk, focus)
        for i, (chunk, focus) in enumerate(zip(chunks, focus_areas))
    ]
    section_results = await asyncio.gather(*tasks)

    print("All sections analyzed. Synthesizing...")

    # Synthesize results
    sections_text = "\n\n".join(
        f"Section {r['section']}:\n{r['analysis']}"
        for r in section_results
    )

    synthesis_prompt = f"""You have received parallel analyses of three sections of a research document.
Synthesize these into a coherent executive summary.

{sections_text}

Write a 3-paragraph synthesis that:
1. Captures the document's main argument
2. Highlights the most important findings
3. Notes any tensions or contradictions between sections"""

    synthesis = await async_llm_call(synthesis_prompt)
    return synthesis

# ── Voting: Reliable Classification ───────────────────────────────────────────

async def single_vote(content: str, categories: list[str], voter_id: int) -> str:
    """One voter's classification of the content."""
    cats_formatted = ", ".join(f'"{c}"' for c in categories)

    prompt = f"""Classify this content into exactly one category.

Content: "{content}"

Categories: {cats_formatted}

Think carefully and respond with ONLY the category name, nothing else."""

    # Add slight variation in system prompt to reduce correlated errors
    system_prompts = [
        "You are a precise classification system. Be conservative.",
        "You are an expert categorizer. Focus on the primary intent.",
        "You classify content accurately. Consider the main topic.",
    ]

    result = await async_llm_call(prompt, system_prompts[voter_id % len(system_prompts)])
    return result.strip().lower()

async def voting_classify(content: str, categories: list[str], n_voters: int = 3) -> dict:
    """
    Run N voters in parallel, take majority vote.
    Returns classification + confidence (fraction of voters agreeing).
    """
    print(f"Running {n_voters} voters in parallel...")

    tasks = [single_vote(content, categories, i) for i in range(n_voters)]
    votes = await asyncio.gather(*tasks)

    print(f"Votes: {votes}")

    # Find most common vote
    vote_counts = Counter(votes)
    winner, count = vote_counts.most_common(1)[0]
    confidence = count / n_voters

    # Validate winner is a known category (handle hallucinations)
    cats_lower = [c.lower() for c in categories]
    if winner not in cats_lower:
        # Fall back to second-most-common valid vote
        for candidate, _ in vote_counts.most_common():
            if candidate in cats_lower:
                winner = candidate
                confidence = vote_counts[winner] / n_voters
                break
        else:
            # No voter returned a valid category
            winner = categories[0].lower()
            confidence = 0.0

    return {
        "classification": winner,
        "confidence": confidence,
        "all_votes": list(votes),
        "vote_distribution": dict(vote_counts)
    }


async def main():
    # Test sectioning
    long_document = """
The emergence of large language models has fundamentally altered the AI landscape.
Models like GPT-4 and Claude demonstrate reasoning capabilities once thought impossible.
These systems emerged from scaling transformer architectures on vast text corpora.
The training approach leverages gradient descent on next-token prediction objectives.
Researchers at Stanford found that capability jumps occur non-linearly with scale.
The methodology involves pre-training on diverse internet text, then fine-tuning.
Reinforcement learning from human feedback (RLHF) aligns model outputs with preferences.
Constitutional AI from Anthropic offers an alternative alignment approach using AI feedback.
Implications for software engineering are profound - code generation, review, and debugging.
The most reliable production systems combine deterministic pipelines with LLM calls judiciously.
"""

    print("=== SECTIONING DEMO ===")
    synthesis = await parallel_document_analysis(long_document)
    print("\nSynthesis:")
    print(synthesis[:400] + "...")

    # Test voting
    print("\n=== VOTING DEMO ===")
    test_text = "My credit card was charged but I never received the product"
    categories = ["billing", "shipping", "fraud", "general_inquiry"]

    result = await voting_classify(test_text, categories, n_voters=5)
    print(f"\nClassification: {result['classification']}")
    print(f"Confidence: {result['confidence']:.0%}")
    print(f"Vote distribution: {result['vote_distribution']}")

if __name__ == "__main__":
    asyncio.run(main())

Pattern 4: Orchestrator-Subagents

The Idea

One agent (the orchestrator) plans the task and delegates subtasks to specialized subagents. The orchestrator never does the work itself - it reasons about what work needs to be done, which agent should do it, and how to synthesize the results.

This mirrors how a senior engineer works: they design the architecture, assign implementation tasks to team members, review the results, and integrate everything.

When to Use Orchestrator-Subagents

Use when:

  • The task requires distinct types of expertise (research + writing + fact-checking)
  • You want specialization without a monolithic agent trying to do everything
  • The workflow has a planning phase that is genuinely separate from execution
  • You want to swap out individual subagents without redesigning the whole system

Do NOT use for simple linear tasks (use prompt chaining). The orchestrator-subagent pattern adds coordination overhead - it only pays off when specialization genuinely improves output quality.
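Stripped of model calls, the pattern is a planner plus a registry of subagents that can be swapped independently. In this sketch every function is a hypothetical stub (`make_plan`, `researcher`, `writer`, `orchestrate`); a real orchestrator would ask an LLM to produce the plan.

```python
from typing import Callable

Subagent = Callable[[str], str]

def researcher(subtask: str) -> str:
    return f"notes({subtask})"

def writer(subtask: str) -> str:
    return f"draft({subtask})"

# Registry: swapping a subagent means changing one entry here.
SUBAGENTS: dict[str, Subagent] = {"researcher": researcher, "writer": writer}

def make_plan(goal: str) -> list[tuple[str, str]]:
    """Hypothetical planner: returns (agent_name, subtask) pairs for a goal."""
    return [
        ("researcher", f"background on {goal}"),
        ("writer", f"overview of {goal}"),
    ]

def orchestrate(goal: str) -> dict:
    """Plan, delegate each subtask to the named subagent, then combine results."""
    results = []
    for agent_name, subtask in make_plan(goal):
        if agent_name not in SUBAGENTS:
            raise KeyError(f"unknown subagent: {agent_name}")
        results.append({"agent": agent_name, "output": SUBAGENTS[agent_name](subtask)})
    final = " | ".join(r["output"] for r in results)
    return {"goal": goal, "results": results, "final": final}

print(orchestrate("attention")["final"])
```

The orchestrator itself contains no domain logic: it only decides who does what and how the pieces are combined, which is what makes individual subagents replaceable.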

Python Implementation: Research Report System

import anthropic
import json
from dataclasses import dataclass
from typing import Optional

client = anthropic.Anthropic()

def llm_call(prompt: str, system: str = "", model: str = "claude-opus-4-5") -> str:
    kwargs = {
        "model": model,
        "max_tokens": 2000,
        "messages": [{"role": "user", "content": prompt}]
    }
    if system:
        kwargs["system"] = system
    return client.messages.create(**kwargs).content[0].text

# ── Specialized Subagents ──────────────────────────────────────────────────────

def researcher_agent(topic: str, focus_areas: list[str]) -> dict:
    """Finds and structures information on a topic."""
    prompt = f"""Research the following topic comprehensively.

Topic: {topic}
Focus on these specific areas: {', '.join(focus_areas)}

Provide:
1. Key facts and statistics (with approximate dates if known)
2. Major developments or milestones
3. Current state of the field
4. Controversies or open questions
5. Key players or organizations involved

Be specific and factual. Note any areas where information may be uncertain."""

    result = llm_call(
        prompt,
        system="You are a research analyst. You provide accurate, well-structured information. "
               "You clearly distinguish between established facts and estimates."
    )
    return {"agent": "researcher", "output": result, "focus": focus_areas}

def writer_agent(topic: str, research: str, target_audience: str, format_spec: str) -> str:
    """Drafts content based on research."""
    prompt = f"""Write a {format_spec} about "{topic}" for {target_audience}.

Use this research as your source material:
{research}

Requirements:
- Engaging opening that hooks the reader
- Clear structure with headers
- Concrete examples and data points
- Accessible language appropriate for {target_audience}
- Strong conclusion with key takeaways

Do not add facts not present in the research material."""

    return llm_call(
        prompt,
        system="You are a technical writer who makes complex topics accessible without "
               "sacrificing accuracy. You write with clarity and authority."
    )

def critic_agent(content: str, criteria: list[str]) -> dict:
    """Reviews content against specific quality criteria."""
    criteria_text = "\n".join(f"- {c}" for c in criteria)

    prompt = f"""Critically review this content against these criteria:
{criteria_text}

Content to review:
{content}

For each criterion, provide:
1. Pass/Fail
2. Specific issues found (if any)
3. Concrete suggestions for improvement

Finally, give an overall score 1-10 and a brief summary of the most important changes needed."""

    result = llm_call(
        prompt,
        system="You are a senior editor and fact-checker. You are rigorous and specific. "
               "You point to exact passages that need improvement, not vague suggestions."
    )
    return {"agent": "critic", "review": result}

# ── Orchestrator ───────────────────────────────────────────────────────────────

def orchestrator(goal: str) -> dict:
    """
    Plans and coordinates research, writing, and review subagents.
    """
    print(f"Orchestrator: Planning task - {goal}")

    # Phase 1: Orchestrator plans the approach
    plan_prompt = f"""You are coordinating a team to accomplish this goal:
{goal}

Create a specific execution plan with:
1. The main topic to research (one clear sentence)
2. Three specific focus areas for the researcher (each one clear phrase)
3. The target audience for the written output
4. The format specification (e.g., "1500-word blog post", "executive briefing")
5. Three quality criteria to review the output against

Respond in JSON format with keys: topic, focus_areas, audience, format, criteria"""

    plan_response = llm_call(plan_prompt)

    # Parse plan (handle markdown code fences)
    import re
    plan_text = re.sub(r'^```json\n?', '', plan_response.strip())
    plan_text = re.sub(r'\n?```$', '', plan_text.strip())

    try:
        plan = json.loads(plan_text)
    except json.JSONDecodeError:
        # Fallback plan if JSON parsing fails
        plan = {
            "topic": goal,
            "focus_areas": ["background and history", "current state", "future implications"],
            "audience": "technical professionals",
            "format": "1000-word technical overview",
            "criteria": ["accuracy", "clarity", "practical value"]
        }

    print(f" Plan: research '{plan['topic']}', write for '{plan['audience']}'")

    # Phase 2: Delegate to researcher
    print("Orchestrator: Delegating to Researcher...")
    research_result = researcher_agent(plan["topic"], plan["focus_areas"])
    print(f" Research complete ({len(research_result['output'].split())} words)")

    # Phase 3: Delegate to writer
    print("Orchestrator: Delegating to Writer...")
    draft = writer_agent(
        plan["topic"],
        research_result["output"],
        plan["audience"],
        plan["format"]
    )
    print(f" Draft complete ({len(draft.split())} words)")

    # Phase 4: Delegate to critic
    print("Orchestrator: Delegating to Critic...")
    review = critic_agent(draft, plan.get("criteria", ["accuracy", "clarity"]))
    print(" Review complete")

    # Phase 5: Orchestrator decides whether to revise
    revision_prompt = f"""You coordinated creating this content. The critic has reviewed it.

Critic's Review:
{review['review']}

Should the writer revise based on this feedback?
If YES: provide the writer with specific revision instructions (2-3 concrete changes).
If NO: explain why the current draft is acceptable.

Respond with "REVISE: [instructions]" or "ACCEPT: [reason]"."""

    decision = llm_call(revision_prompt)
    print(f" Orchestrator decision: {decision[:80]}...")

    # Tolerate leading whitespace and case differences in the verdict keyword
    if decision.strip().upper().startswith("REVISE"):
        print("Orchestrator: Requesting revision...")
        # Everything after the first colon is the instruction text
        revision_instructions = decision.split(":", 1)[-1].strip()

        revise_prompt = f"""Revise this draft based on these specific instructions:
{revision_instructions}

Original draft:
{draft}

Maintain the overall structure but address the specific feedback."""

        final_output = llm_call(
            revise_prompt,
            system="You are a technical writer. Implement the revision instructions precisely."
        )
        print(f" Revision complete ({len(final_output.split())} words)")
    else:
        final_output = draft
        print(" Draft accepted as-is")

    return {
        "goal": goal,
        "plan": plan,
        "research": research_result["output"],
        "draft": draft,
        "review": review["review"],
        "revision_decision": decision,
        "final_output": final_output
    }


if __name__ == "__main__":
    result = orchestrator(
        "Explain how transformer attention mechanisms work and why they matter for AI engineers"
    )
    print("\n" + "="*70)
    print("FINAL OUTPUT:")
    print("="*70)
    print(result["final_output"][:800] + "...")

Pattern 5: Evaluator-Optimizer

The Idea

A generator LLM produces an output. An evaluator LLM scores it against specific criteria. If the score is below a threshold, the generator tries again, using the evaluator's feedback as guidance. The loop continues until the output meets the quality bar or max iterations is reached.

This is essentially automated self-editing: the model critiques its own output (via a separate evaluator call to avoid confirmation bias) and rewrites based on the critique.

When to Use Evaluator-Optimizer

Use when:

  • Output quality is the primary constraint (not speed or cost)
  • You have well-defined quality criteria that a model can evaluate
  • First-attempt outputs are often good but not great
  • The task benefits from iterative refinement (writing, code, design)

Do NOT use when latency is critical (each iteration adds round-trip time), when quality criteria are too subjective, or when max iterations would exceed budget.
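The control flow of the loop is small enough to show with stubs before the full implementation. Everything here is a hypothetical sketch: `refine`, `stub_generate`, and `stub_evaluate` stand in for the generator and evaluator model calls, and the 0.75 threshold is just an example value.

```python
from typing import Callable

def refine(
    generate: Callable[[str], str],
    evaluate: Callable[[str], tuple[float, str]],
    threshold: float = 0.75,
    max_iterations: int = 3,
) -> tuple[str, float, int]:
    """Generate, score, and regenerate with feedback until good enough or out of budget."""
    feedback = ""
    best_output, best_score = "", -1.0
    iterations = 0
    for iterations in range(1, max_iterations + 1):
        output = generate(feedback)
        score, feedback = evaluate(output)
        if score > best_score:  # keep the best attempt, not just the last one
            best_output, best_score = output, score
        if score >= threshold:
            break
    return best_output, best_score, iterations

def stub_generate(feedback: str) -> str:
    """Stand-in generator: adds a docstring only when the feedback asks for one."""
    if "docstring" in feedback:
        return 'def add(a, b):\n    """Return a + b."""\n    return a + b'
    return "def add(a, b):\n    return a + b"

def stub_evaluate(code: str) -> tuple[float, str]:
    """Stand-in evaluator: returns (score in [0, 1], feedback)."""
    if '"""' in code:
        return 1.0, ""
    return 0.5, "add a docstring"

code, score, used = refine(stub_generate, stub_evaluate)
print(score, used)
```

Tracking the best-scoring attempt matters because a later iteration can regress; returning the last attempt unconditionally would throw away a better earlier draft.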

Python Implementation: Self-Refining Code Generator

import anthropic
import json
import re
import subprocess
import sys
from dataclasses import dataclass

client = anthropic.Anthropic()

@dataclass
class EvaluationResult:
    score: float  # 0.0 to 1.0
    passed: bool
    issues: list[str]
    suggestions: list[str]
    raw_feedback: str

def generate_code(task: str, previous_attempt: str = "", feedback: str = "") -> str:
    """Generator: produce Python code for a given task."""

    context = ""
    if previous_attempt and feedback:
        context = f"""
Previous attempt (needs improvement):
```python
{previous_attempt}
```

Evaluator feedback:
{feedback}

Address ALL the feedback points in your new version.
"""

    prompt = f"""Write production-quality Python code for this task:

{task}

{context}

Requirements:
- Add docstrings to all functions and classes
- Include type hints
- Add input validation
- Include at least 3 usage examples as doctest or in a main() function
- Handle edge cases and errors gracefully
- Code must be self-contained and runnable

Return ONLY the Python code, no explanations."""

    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=2000,
        system="You are an expert Python engineer. You write clean, production-ready code "
               "with proper error handling and documentation.",
        messages=[{"role": "user", "content": prompt}]
    )

    code = response.content[0].text
    # Strip markdown code fences
    code = re.sub(r'^```python\n?', '', code.strip())
    code = re.sub(r'\n?```$', '', code.strip())
    return code

def evaluate_code(code: str, task: str) -> EvaluationResult: """ Evaluator: assess code quality against specific criteria. Returns structured evaluation with score and feedback. """

eval_prompt = f"""You are a strict code reviewer. Evaluate this Python code against specific criteria.

Task description: {task}

Code to evaluate:

{code}

Score each criterion 0-10:

  1. Correctness: Does it correctly solve the task?
  2. Documentation: Are docstrings complete and clear?
  3. Type hints: Are all functions/methods properly typed?
  4. Error handling: Are errors and edge cases handled?
  5. Code quality: Is it clean, readable, Pythonic?
  6. Examples: Are there clear usage examples?

Return a JSON object: {{ "scores": {{ "correctness": <0-10>, "documentation": <0-10>, "type_hints": <0-10>, "error_handling": <0-10>, "code_quality": <0-10>, "examples": <0-10> }}, "issues": ["issue 1", "issue 2", ...], "suggestions": ["specific fix 1", "specific fix 2", ...], "overall_verdict": "pass" or "fail" }}"""

response = client.messages.create(
model="claude-opus-4-5",
max_tokens=1000,
system="You are a rigorous code reviewer. Be specific about issues. "
"Do not pass code that has missing docstrings, no type hints, or no examples.",
messages=[{"role": "user", "content": eval_prompt}]
)

raw_feedback = response.content[0].text

# Parse JSON evaluation
json_text = re.sub(r'^```json\n?', '', raw_feedback.strip())
json_text = re.sub(r'\n?```$', '', json_text.strip())

try:
eval_data = json.loads(json_text)
scores = eval_data.get("scores", {})
avg_score = sum(scores.values()) / len(scores) if scores else 0
normalized_score = avg_score / 10.0

return EvaluationResult(
score=normalized_score,
passed=eval_data.get("overall_verdict") == "pass" and normalized_score >= 0.75,
issues=eval_data.get("issues", []),
suggestions=eval_data.get("suggestions", []),
raw_feedback=raw_feedback
)
except (json.JSONDecodeError, KeyError):
# Fallback if JSON parsing fails
return EvaluationResult(
score=0.5,
passed=False,
issues=["Could not parse evaluation"],
suggestions=["Retry generation"],
raw_feedback=raw_feedback
)

def evaluator_optimizer( task: str, threshold: float = 0.8, max_iterations: int = 3 ) -> dict: """ Evaluator-optimizer loop: 1. Generate code 2. Evaluate against quality criteria 3. If below threshold, feed back and regenerate 4. Return best result """ best_code = "" best_score = 0.0 history = []

current_code = ""
current_feedback = ""

for iteration in range(1, max_iterations + 1):
print(f"\nIteration {iteration}/{max_iterations}")

# Generate
print(" Generating code...")
current_code = generate_code(task, current_code, current_feedback)
print(f" Generated {len(current_code.splitlines())} lines")

# Evaluate
print(" Evaluating...")
eval_result = evaluate_code(current_code, task)
print(f" Score: {eval_result.score:.2f} | Passed: {eval_result.passed}")

if eval_result.issues:
print(f" Issues: {eval_result.issues[:2]}")

# Track best
if eval_result.score > best_score:
best_score = eval_result.score
best_code = current_code

history.append({
"iteration": iteration,
"score": eval_result.score,
"passed": eval_result.passed,
"issues": eval_result.issues
})

# Check if we've met the threshold
if eval_result.passed and eval_result.score >= threshold:
print(f" Threshold met! Stopping at iteration {iteration}")
break

# Prepare feedback for next iteration
current_feedback = "\n".join([
"ISSUES TO FIX:",
*[f"- {issue}" for issue in eval_result.issues],
"\nSPECIFIC SUGGESTIONS:",
*[f"- {suggestion}" for suggestion in eval_result.suggestions]
])

return {
"task": task,
"final_code": best_code,
"final_score": best_score,
"iterations": len(history),
"history": history,
"converged": history[-1]["passed"] if history else False
}

if name == "main": task = """ Implement a thread-safe LRU (Least Recently Used) cache in Python. The cache should: - Support get(key) and put(key, value) operations in O(1) time - Have configurable max size - Automatically evict least recently used items when full - Be thread-safe for concurrent access - Support optional TTL (time-to-live) per item """

result = evaluator_optimizer(task, threshold=0.8, max_iterations=3)

print("\n" + "="*60)
print(f"FINAL SCORE: {result['final_score']:.2f}")
print(f"CONVERGED: {result['converged']}")
print(f"ITERATIONS: {result['iterations']}")
print("\nFINAL CODE:")
print("="*60)
print(result["final_code"][:500] + "...")

Composing Patterns

The real power comes from combining patterns. Two common compositions:

**Chaining + Parallelization**: use parallel analysis within one step of a chain. The chain provides the overall structure; parallelization makes one step faster.

**Orchestrator + Evaluator-Optimizer**: the orchestrator delegates writing to a subagent that uses evaluator-optimizer internally. The orchestrator handles planning; the subagent handles quality.
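The first composition, a chain whose middle step fans out in parallel, might look roughly like this. It is a sketch under stated assumptions: `extract_claims`, `analyze_claim`, `summarize`, and `run_pipeline` are hypothetical names, and each function is a plain-Python stand-in for what would be a model call in a real system.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_claims(document: str) -> list[str]:
    """Chain step 1 (stand-in): split a document into individual claims."""
    return [s.strip() for s in document.split(".") if s.strip()]


def analyze_claim(claim: str) -> dict:
    """Parallel worker (stand-in): analyze one claim independently of the others."""
    return {"claim": claim, "word_count": len(claim.split())}


def summarize(analyses: list[dict]) -> str:
    """Chain step 3 (stand-in): combine the parallel results into one output."""
    total = sum(a["word_count"] for a in analyses)
    return f"{len(analyses)} claims, {total} words"


def run_pipeline(document: str, max_workers: int = 4) -> str:
    claims = extract_claims(document)                     # step 1: sequential
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # step 2: parallel fan-out; pool.map preserves input order
        analyses = list(pool.map(analyze_claim, claims))
    return summarize(analyses)                            # step 3: sequential


print(run_pipeline("Agents loop. Patterns constrain them. Compose with care."))
# 3 claims, 8 words
```

Threads suit this shape because model calls are I/O-bound; the chain still reads top to bottom, and only step 2 changes from a loop to a fan-out.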

```mermaid
flowchart TD
    G["Goal"]:::blue --> O
    O["Orchestrator"]:::purple
    O --> R["Researcher<br/>(Parallelization internally)"]:::green
    O --> W["Writer<br/>(Evaluator-Optimizer internally)"]:::teal
    R --> O
    W --> O
    O --> OUT["Final<br/>Output"]:::blue

    classDef blue fill:#dbeafe,color:#1e293b,stroke:#2563eb
    classDef green fill:#dcfce7,color:#14532d,stroke:#16a34a
    classDef purple fill:#ede9fe,color:#4c1d95,stroke:#7c3aed
    classDef teal fill:#ccfbf1,color:#134e4a,stroke:#14b8a6
```

Anti-Patterns

Anti-Pattern 1: Unnecessary Agent-to-Agent Calls

Do not add agent layers that only pass messages through. An orchestrator that contributes no reasoning is pure overhead, because every agent-to-agent call adds latency and cost.

Anti-Pattern 2: Premature Orchestration

Do not use orchestrator-subagents for tasks that a single prompt handles fine. Ask first: "Can I write a prompt that does this directly?" If yes, start there.

Anti-Pattern 3: State Explosion

In orchestrator-subagent systems, state accumulates: research results, drafts, reviews, revisions. If you are not careful, your context window fills with state that the model cannot effectively use. Design state carefully - pass the minimum context each subagent needs.
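One way to enforce "minimum context" is an explicit allow-list per subagent. The sketch below assumes the orchestrator keeps its state in a dictionary; the agent names, state keys, and the `context_for` helper are all illustrative, not part of any library.

```python
# Hypothetical shared state accumulated by an orchestrator over several rounds.
state = {
    "goal": "Write a report on LRU caches",
    "research_notes": "...long notes...",
    "draft": "...current draft...",
    "review_comments": ["tighten intro"],
    "raw_search_results": ["..."] * 50,   # bulky, rarely useful downstream
}

# Each subagent declares the only keys it is allowed to see (illustrative names).
CONTEXT_KEYS = {
    "writer": ("goal", "research_notes"),
    "reviewer": ("goal", "draft"),
    "reviser": ("draft", "review_comments"),
}


def context_for(agent: str, state: dict) -> dict:
    """Return only the slice of state this agent needs, nothing else."""
    return {key: state[key] for key in CONTEXT_KEYS[agent] if key in state}


print(sorted(context_for("reviewer", state)))  # ['draft', 'goal']
```

Because the allow-list is data, adding a new piece of state never silently enlarges any subagent's prompt; someone has to opt it in.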

Anti-Pattern 4: No Convergence Criterion

The evaluator-optimizer loop must have a convergence criterion (threshold met or max iterations). Without it, the loop can run indefinitely on tasks where the evaluator always finds something to improve.


Pattern Selection Guide

| Task Characteristic | Best Pattern |
| --- | --- |
| Clear sequential steps | Prompt Chaining |
| Diverse input types | Routing |
| Independent parallel chunks | Parallelization (Sectioning) |
| High-stakes, needs confidence | Parallelization (Voting) |
| Requires specialization | Orchestrator-Subagents |
| Quality is the primary goal | Evaluator-Optimizer |
| Complex multi-phase work | Compose patterns |

Production Notes

  • Cost tracking: each pattern has different cost profiles. Voting multiplies token usage by N. Evaluator-optimizer can multiply by 3× or more. Track costs per pattern in production.
  • Timeouts: every subagent call and every chain step needs a timeout. Use asyncio.wait_for() in async code, or run the call in a concurrent.futures executor and wait on it with Future.result(timeout=...). Never let a stuck agent block indefinitely.
  • Observability: log the routing decision, the iteration count, each subagent's output length, and the final score at minimum. You cannot debug what you cannot observe.
  • Idempotency: design each step to be idempotent - safe to retry. If step 2 fails and you retry, it should not corrupt state.
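The timeout note can be illustrated with `asyncio.wait_for()`. In this minimal sketch, `slow_subagent` is a hypothetical stand-in that sleeps instead of calling a model, and `call_with_timeout` is an illustrative wrapper name.

```python
import asyncio


async def slow_subagent(delay: float) -> str:
    """Stand-in for a subagent call; sleeps instead of calling a model."""
    await asyncio.sleep(delay)
    return "done"


async def call_with_timeout(delay: float, timeout: float) -> str:
    """Wait for the subagent, but give up once `timeout` seconds have passed."""
    try:
        return await asyncio.wait_for(slow_subagent(delay), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for cancels the inner task before raising, so nothing keeps running
        return "timed out"


fast = asyncio.run(call_with_timeout(delay=0.01, timeout=1.0))
stuck = asyncio.run(call_with_timeout(delay=1.0, timeout=0.05))
print(fast, "|", stuck)  # done | timed out
```

Returning a sentinel string keeps the example short; a production wrapper would more likely raise a typed error so the caller can decide whether to retry.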

:::warning Max Iterations Are Not Optional
Every loop-based pattern (evaluator-optimizer, any retry logic) MUST have a hard maximum iteration count. LLMs can enter states where the evaluator always finds something wrong and the generator never satisfies it. Without a ceiling, you get infinite loops and infinite bills.
:::

:::danger Do Not Trust Agent Outputs Without Validation
Subagent outputs should be validated before being passed to the next stage. Use JSON schemas, range checks, length limits. A malformed output from a subagent poisons every downstream step in the chain.
:::
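A standard-library-only sketch of such a validation gate, applied to an evaluator-style JSON reply, could look like this. The `validate_evaluation` helper, the `REQUIRED_SCORES` subset, and the 10,000-character limit are assumptions chosen for illustration; a real system might use a JSON Schema validator instead.

```python
import json

REQUIRED_SCORES = ("correctness", "documentation", "type_hints")  # illustrative subset


def validate_evaluation(raw: str, max_chars: int = 10_000) -> dict:
    """Parse and validate a subagent's JSON output; raise ValueError on any problem."""
    if len(raw) > max_chars:                                   # length limit
        raise ValueError("output exceeds length limit")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("scores"), dict):
        raise ValueError("missing 'scores' object")            # shape check
    for name in REQUIRED_SCORES:
        value = data["scores"].get(name)
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"score '{name}' is missing or not a number")
        if not 0 <= value <= 10:                               # range check
            raise ValueError(f"score '{name}' out of range: {value}")
    return data


ok = validate_evaluation('{"scores": {"correctness": 8, "documentation": 7, "type_hints": 9}}')
print(ok["scores"]["correctness"])  # 8
```

Raising on the first problem gives the caller a single, explicit failure to log or retry on, instead of a half-valid object drifting downstream.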


Interview Q&A

Q: What are the 5 core agentic design patterns from Anthropic's research?

A: Prompt chaining (sequential decomposition with validation gates), routing (LLM classifier directs input to specialized handlers), parallelization (simultaneous subtasks - sectioning divides work, voting runs N agents for reliability), orchestrator-subagents (one planner + specialized executors), and evaluator-optimizer (generator-evaluator loop until quality threshold). Each addresses a different failure mode of single-prompt approaches.

Q: When would you use voting (parallelization) versus a single agent call?

A: Voting makes sense when (1) the decision is high-stakes and you cannot afford errors, (2) the task is inherently judgment-based where different "perspectives" add value, and (3) cost is not the primary constraint. The math: if each voter has 90% accuracy, 3 voters with majority vote get roughly 97% accuracy (assuming independent errors). Use voting for classification, content moderation, code review - not for factual queries where all voters will make the same error if the knowledge is absent.
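The 97% figure follows from the binomial distribution, and a few lines of Python reproduce it. This sketch assumes a binary right/wrong decision and fully independent voters, which real model calls only approximate; `majority_accuracy` is an illustrative helper name.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent voters is correct (n odd)."""
    if n % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    k_min = n // 2 + 1  # smallest number of correct voters that forms a majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


print(f"{majority_accuracy(0.9, 3):.3f}")  # 0.972
print(f"{majority_accuracy(0.9, 5):.3f}")  # 0.991
```

For three voters the sum is 3 × 0.9² × 0.1 + 0.9³ = 0.243 + 0.729 = 0.972. Correlated errors, such as every voter lacking the same fact, erode this gain, which is why voting helps less on factual queries.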

Q: How do you prevent evaluator-optimizer from running forever?

A: Three mechanisms: (1) hard max_iterations ceiling (always), (2) score threshold - stop when quality is "good enough" rather than perfect, (3) convergence detection - if the score does not improve across N consecutive iterations, stop. In production, also add a wall-clock timeout. The failure mode is an evaluator that always finds something to improve, leading to infinite refinement. Real systems need to define "done" precisely.
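These stopping rules can be combined into a single check that runs after every iteration. The sketch below is one possible arrangement, not a prescribed one: `stop_reason`, the `patience` parameter, and the returned labels are hypothetical names, and the caller is assumed to track elapsed wall-clock time itself.

```python
from typing import Optional


def stop_reason(
    scores: list[float],        # one evaluator score per completed iteration
    threshold: float = 0.8,
    max_iterations: int = 5,
    patience: int = 2,          # iterations allowed without a new best score
    elapsed: float = 0.0,       # seconds spent so far, measured by the caller
    max_seconds: float = 60.0,
) -> Optional[str]:
    """Return why the loop should stop, or None to keep iterating."""
    if not scores:
        return None
    if scores[-1] >= threshold:
        return "threshold_met"
    if len(scores) >= max_iterations:
        return "max_iterations"
    if elapsed >= max_seconds:
        return "timeout"
    # Convergence detection: the last `patience` scores did not beat the earlier best.
    if len(scores) > patience and max(scores[-patience:]) <= max(scores[:-patience]):
        return "no_improvement"
    return None


print(stop_reason([0.5, 0.85]))            # threshold_met
print(stop_reason([0.5, 0.6, 0.6, 0.6]))   # no_improvement
print(stop_reason([0.5, 0.6, 0.7]))        # None
```

Returning a label rather than a boolean makes it easy to log why each run ended, which helps when tuning the threshold and patience values.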

Q: What is the key difference between prompt chaining and orchestrator-subagents?

A: Prompt chaining is linear and predetermined - you know the steps before running the chain, and each step always runs in sequence. Orchestrator-subagents is dynamic - the orchestrator decides at runtime which agents to invoke, in what order, and how many times. Prompt chaining is more predictable and cheaper; orchestrator-subagents is more flexible and can adapt to task complexity. For tasks with a known, fixed workflow, use chaining. For tasks where the plan itself is part of the problem, use orchestrator.

Q: How do you handle errors in a multi-step prompt chain?

A: At minimum: (1) validate each step's output with a gate before proceeding, (2) retry failed steps with max_retries=2 before propagating errors, (3) log exactly which step failed and the raw output for debugging, (4) have an explicit error state - do not silently pass bad intermediate output to the next step. In production: consider checkpointing (save state after each successful step) so you can resume a failed chain without rerunning completed steps. This is especially important for long chains with expensive API calls.
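A compact sketch of those four points plus checkpointing is shown below. Everything here is hypothetical scaffolding: `run_chain`, `ChainError`, the `(name, run, gate)` step shape, and the JSON checkpoint file are illustrative choices, and the steps are plain functions standing in for model calls.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable

# A step is (name, run, gate): run produces output, gate validates it.
Step = tuple[str, Callable[[str], str], Callable[[str], bool]]


class ChainError(RuntimeError):
    """Explicit error state: names the failing step and keeps its raw output."""

    def __init__(self, step: str, raw_output: str):
        super().__init__(f"step '{step}' failed its gate")
        self.step, self.raw_output = step, raw_output


def run_chain(steps: list[Step], initial: str, checkpoint: Path, max_retries: int = 2) -> str:
    # Resume from the checkpoint if a previous run saved one.
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    value = initial
    for name, run, gate in steps:
        if name in done:                          # completed earlier: reuse, do not rerun
            value = done[name]
            continue
        output = ""
        for _attempt in range(1 + max_retries):   # first try plus retries
            output = run(value)
            if gate(output):
                break
        else:
            raise ChainError(name, output)        # never pass bad output downstream
        done[name] = value = output
        checkpoint.write_text(json.dumps(done))   # save after each successful step
    return value


calls = {"n": 0}


def flaky_upper(text: str) -> str:
    """Stand-in step that returns empty output once, then succeeds."""
    calls["n"] += 1
    return "" if calls["n"] == 1 else text.upper()


steps = [
    ("upper", flaky_upper, lambda out: bool(out)),
    ("exclaim", lambda text: text + "!", lambda out: out.endswith("!")),
]
with tempfile.TemporaryDirectory() as tmp:
    result = run_chain(steps, "hello", Path(tmp) / "chain.json")
print(result)  # HELLO!
```

The sketch treats a gate failure as the only failure mode; a fuller version would also catch exceptions raised by `run` and count them against the same retry budget.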

Q: How would you choose between routing and orchestrator-subagents?

A: Routing is for input classification - the agent's role is to identify what kind of problem this is and hand off to the right specialist. The router does not plan or synthesize; it classifies the input once, hands it off, and takes no further part. Orchestrator-subagents is for complex task execution - the orchestrator plans the whole task, delegates pieces, collects results, and synthesizes a final output. If your "agent" is only ever classifying one message and handing it off, that is routing. If it is coordinating multiple agents across multiple rounds to produce a unified output, that is an orchestrator.

© 2026 EngineersOfAI. All rights reserved.