Why Multi-Agent Systems?
Reading time: ~35 minutes | Relevance: High for senior AI engineers, system architects | Target roles: AI Engineer, ML Engineer, Research Engineer
The Scenario
You're building an AI system to bootstrap a complete SaaS product from a brief description. The system needs to:
- Analyze requirements and define the architecture
- Write frontend code (React, TypeScript)
- Write backend code (FastAPI, Python)
- Write database migrations
- Write tests (unit, integration, e2e)
- Set up infrastructure (Docker, CI/CD)
- Write documentation
- Review all of the above for correctness, security, and quality
Could a single agent do this? Technically, yes - if you gave it unlimited context and time. In practice, no. The context window fills with the frontend code before it starts on the backend. The agent loses coherence across 50,000 tokens of mixed concerns. It can't review its own work reliably.
This is exactly where multi-agent systems shine: tasks that exceed one agent's capacity, benefit from parallel execution, or require independent verification.
Why This Exists
Multi-agent systems didn't emerge from theory - they emerged from pain.
In 2023, teams building GPT-4-powered workflows kept hitting the same wall: their single-agent systems worked on toy problems and broke on real problems. The context filled up. The agent confused early decisions with late ones. There was no way to verify outputs without human review of every step.
The solution that emerged was obvious in retrospect: multiple agents, each with a focused job, coordinating through structured communication.
The academic tradition goes back decades - multi-agent systems in classical AI (MAS) studied cooperation, negotiation, and coordination among autonomous software agents. What's new is that LLMs made these agents capable of open-ended reasoning, tool use, and natural language coordination - transforming MAS from a research curiosity into a practical engineering discipline.
The Fundamental Case for Multi-Agent
Three independent forces make multi-agent systems valuable. You need at least one of them to justify the coordination cost.
Force 1: Parallelization
Some tasks are inherently sequential. Others are embarrassingly parallel.
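If you need to research 20 topics for a report, a single agent makes 20 research calls back to back. Twenty agents make the same 20 calls concurrently, so wall-clock time is roughly that of the slowest single call: up to 20x faster, before rate limits and the final write-up step are counted.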
Sequential: [research_1] → [research_2] → ... → [research_20] → [write_report]
Parallel: [r1][r2][r3]...[r20] simultaneously → [write_report]
Parallelism doesn't just save time. For some tasks, it changes what's possible. A news summarization system that needs to process 500 articles in under 60 seconds has a budget of 0.12 seconds per article if it runs sequentially, far less than a typical LLM call takes. Parallelism is the only option.
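A minimal sketch of that fan-out, under stated assumptions: `summarize` is a hypothetical stand-in for one LLM call (it sleeps instead of calling an API), and the concurrency cap of 50 is an arbitrary illustrative value. The semaphore is there because real APIs enforce rate limits, so "all 500 at once" is rarely what you can actually run.

```python
import asyncio
import time


async def summarize(article_id: int) -> str:
    """Hypothetical stand-in for one LLM call; sleeps instead of calling an API."""
    await asyncio.sleep(0.01)
    return f"summary of article {article_id}"


async def fan_out(article_ids: list[int], max_concurrency: int = 50) -> list[str]:
    """Run one summarize() call per article, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(article_id: int) -> str:
        async with semaphore:
            return await summarize(article_id)

    # gather() returns results in the same order as its inputs.
    return await asyncio.gather(*(bounded(i) for i in article_ids))


def run_demo(n_articles: int = 500) -> tuple[list[str], float]:
    """Summarize n_articles and report the wall-clock time taken."""
    start = time.perf_counter()
    summaries = asyncio.run(fan_out(list(range(n_articles))))
    return summaries, time.perf_counter() - start


if __name__ == "__main__":
    summaries, elapsed = run_demo()
    print(f"{len(summaries)} summaries in {elapsed:.2f}s")
```

Run sequentially, the same 500 simulated calls would take about 5 seconds (500 x 0.01 s); with 50 in flight at a time the work completes in roughly ten waves.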
Force 2: Specialization
A generalist agent answering "write me a sonnet about recursion" performs fine. A specialist poetry agent with a system prompt tuned for poetry, awareness of meter and rhyme, and examples of great sonnets performs better.
Specialization works because:
- Focused prompts are more effective than omnibus prompts - "You are a senior security engineer reviewing code for SQL injection vulnerabilities" is better than "You are a helpful assistant who sometimes looks at security"
- Context can be domain-specific - a legal analysis agent can have its context window filled with case law, not generic knowledge
- Output format can be constrained - an agent prompted (or fine-tuned) to emit one JSON schema is more reliable than a general agent occasionally asked to produce JSON
The tradeoff is coordination overhead. Adding a specialized agent means adding a communication channel, a failure mode, and cognitive complexity for the engineer maintaining the system.
Force 3: Verification
This is the most underappreciated force. An agent is bad at catching its own mistakes. A second agent reviewing the same output is often significantly better at catching them.
The reason is mechanical: when an LLM generates output, it has already "committed" to a direction. Subsequent tokens follow the direction established by earlier tokens. Self-correction requires fighting that momentum.
A fresh agent evaluating the same output has no prior commitment. It sees the output cold. This is why the critic/reviewer pattern - one agent writes, another reviews - reliably improves quality on tasks where quality matters.
Three Scenarios Where Multi-Agent Wins
Scenario 1: Task Too Large for One Context Window
The problem: Writing a complete technical specification, auditing a large codebase, or processing a 300-page document exceeds what fits in one context.
The multi-agent solution: Split the task into chunks. Each agent handles one chunk. An aggregator synthesizes the results.
```
Document (300 pages)
        ↓ split
[Chunk Agent 1: pages 1-100]
[Chunk Agent 2: pages 101-200]   ← parallel execution
[Chunk Agent 3: pages 201-300]
        ↓ merge
[Synthesis Agent: integrate findings]
        ↓
Final Report
```
This isn't just about token count. It's about coherence. An agent analyzing a 100-page section will notice more detail than one skimming 300 pages at once.
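The split / process / merge flow above can be sketched in a few lines. This is an illustration rather than a reference implementation: `fake_chunk_agent` and `fake_synthesis_agent` are hypothetical stand-ins for LLM-backed agents, and the chunk agents run one after another here for readability, whereas the diagram runs them in parallel.

```python
from typing import Callable


def split_pages(pages: list[str], chunk_size: int) -> list[list[str]]:
    """Split a list of pages into consecutive chunks of at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]


def map_reduce(
    pages: list[str],
    chunk_size: int,
    chunk_agent: Callable[[list[str]], str],
    synthesis_agent: Callable[[list[str]], str],
) -> str:
    """Run chunk_agent on every chunk, then hand all findings to synthesis_agent."""
    findings = [chunk_agent(chunk) for chunk in split_pages(pages, chunk_size)]
    return synthesis_agent(findings)


# Hypothetical stand-ins for LLM-backed agents, so the sketch runs offline.
def fake_chunk_agent(chunk: list[str]) -> str:
    return f"{len(chunk)} pages reviewed"


def fake_synthesis_agent(findings: list[str]) -> str:
    return " | ".join(findings)


if __name__ == "__main__":
    document = [f"page {n}" for n in range(1, 301)]
    print(map_reduce(document, 100, fake_chunk_agent, fake_synthesis_agent))
```

Passing the agents in as plain callables keeps the splitting logic testable without any API access; swapping in real agents only changes the two functions handed to `map_reduce`.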
Scenario 2: Independent Parallel Work
The problem: A multi-step research report where each section is independent.
The multi-agent solution: Assign each section to a specialized agent. Run them simultaneously.
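```python
import asyncio

# Hypothetical stand-in for an agent call; replace with a real async LLM call.
async def write_section(title: str) -> str:
    await asyncio.sleep(0.1)
    return f"[{title} section text]"

async def write_sequential() -> list[str]:
    # Instead of this (each call waits for the previous one to finish):
    return [await write_section(t) for t in ("Introduction", "Background", "Analysis")]

async def write_parallel() -> list[str]:
    # Do this (all three calls are in flight at once):
    return await asyncio.gather(
        write_section("Introduction"),
        write_section("Background"),
        write_section("Analysis"),
    )

sections = asyncio.run(write_parallel())
```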
The key constraint: the sections must be truly independent. If section 3 depends on conclusions from section 2, you can't parallelize without first getting section 2's conclusions.
Scenario 3: Mutual Verification
The problem: Code, analysis, or writing that needs to be correct, not just plausible.
The multi-agent solution: Producer + critic + judge.
```
[Producer Agent] → draft_output
        ↓
[Critic Agent]   → critique (errors, gaps, improvements)
        ↓
[Producer Agent] → revised_output (with critique context)
        ↓
[Judge Agent]    → final_decision (accept / request_another_round)
```
This is the multi-agent pattern with the strongest empirical support. Studies on LLM debate and critique often report that a separate critic improves output quality on tasks where quality is measurable - code correctness, factual accuracy, logical coherence - though the size of the gain varies by task and model.
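One way to wire the produce / critique / revise / judge loop, as a sketch under stated assumptions. The three `fake_*` functions are hypothetical stand-ins for separate LLM-backed agents, and `max_rounds=3` is an arbitrary cap. The cap matters: without it, a judge that never accepts keeps the loop running forever.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ReviewOutcome:
    draft: str
    rounds: int
    accepted: bool


def review_loop(
    task: str,
    produce: Callable[[str, str], str],   # (task, critique) -> draft
    critique: Callable[[str, str], str],  # (task, draft) -> critique
    judge: Callable[[str, str], bool],    # (task, draft) -> accept?
    max_rounds: int = 3,
) -> ReviewOutcome:
    """Produce, critique, revise, then let the judge accept or ask for another round."""
    draft = produce(task, "")
    for round_number in range(1, max_rounds + 1):
        feedback = critique(task, draft)
        draft = produce(task, feedback)
        if judge(task, draft):
            return ReviewOutcome(draft, round_number, accepted=True)
    # Hard stop: return the latest draft, flagged as not accepted.
    return ReviewOutcome(draft, max_rounds, accepted=False)


# Hypothetical stand-ins: the first draft has a deliberate off-by-one bug.
def fake_producer(task: str, feedback: str) -> str:
    if "off-by-one" in feedback:
        return "for i in range(len(items)):"
    return "for i in range(len(items) + 1):"


def fake_critic(task: str, draft: str) -> str:
    return "off-by-one in loop bound" if "+ 1" in draft else "no issues found"


def fake_judge(task: str, draft: str) -> bool:
    return "+ 1" not in draft


if __name__ == "__main__":
    print(review_loop("iterate over items", fake_producer, fake_critic, fake_judge))
```

The `accepted` flag lets the caller distinguish "the judge approved this" from "we ran out of rounds", which is the point at which a human escalation or fallback would typically kick in.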
The Coordination Overhead Cost
Multi-agent is not free. Every agent you add introduces:
Latency overhead: Each agent-to-agent handoff is at least one more LLM call. If a task requires 5 sequential handoffs, you've added roughly 5 calls' worth of latency on top of a single pass.
Token cost overhead: Every agent needs context. The orchestrator's context includes task descriptions, subagent outputs, and coordination state. Token costs multiply with complexity.
Failure surface area: A system with 5 agents has at least 5 points of failure, plus the orchestrator and every handoff between them. If any agent fails, the orchestrator must handle it.
Debugging complexity: When a single-agent system produces bad output, you have one LLM call to inspect. When a 5-agent system produces bad output, you have to trace through 5 calls, each with its own context, to find where the error originated.
The honest question: Would this task be better served by a single, well-prompted agent with enough tokens?
For many tasks, the answer is yes. Multi-agent adds value when:
- The task genuinely exceeds one context window
- Independent subtasks make parallelism safe
- Verification quality matters enough to justify the cost
- Specialization provides measurable improvement
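To make the token overhead concrete, here is a back-of-the-envelope estimate. All numbers are illustrative assumptions, not measurements: a 500-token task, a 300-token system prompt per agent, and 1,000 output tokens per agent, in a pipeline where each agent reads the task plus every earlier agent's output.

```python
def pipeline_tokens(
    n_agents: int, task_tokens: int, system_tokens: int, output_tokens: int
) -> int:
    """Total tokens (input + output) when agent k reads the task plus all prior outputs."""
    total = 0
    for prior_agents in range(n_agents):
        input_tokens = system_tokens + task_tokens + prior_agents * output_tokens
        total += input_tokens + output_tokens
    return total


def single_agent_tokens(task_tokens: int, system_tokens: int, output_tokens: int) -> int:
    """Total tokens for one agent that reads the task once and writes everything."""
    return system_tokens + task_tokens + output_tokens


if __name__ == "__main__":
    multi = pipeline_tokens(n_agents=5, task_tokens=500, system_tokens=300, output_tokens=1000)
    # Fair comparison: the single agent writes the same 5,000 output tokens in one pass.
    single = single_agent_tokens(task_tokens=500, system_tokens=300, output_tokens=5000)
    print(multi, single, f"{multi / single:.1f}x")  # 19000 5800 3.3x
```

The growth comes from the `prior_agents * output_tokens` term: every downstream agent pays again for everything upstream of it. Passing summaries instead of full outputs between agents shrinks exactly that term.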
Types of Multi-Agent Architectures
Pipeline Architecture
Agents run in sequence. Agent A's output is Agent B's input. Agent B's output is Agent C's input.
When to use: When each step transforms the output of the previous step. Writing pipelines (brainstorm → draft → edit → polish), data processing pipelines (extract → transform → validate → load).
Limitations: No parallelism. An error in step 2 corrupts steps 3, 4, 5. Latency is additive.
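A pipeline reduces to a loop over stages. The sketch below assumes each stage is a plain `str -> str` callable (the lambdas are hypothetical stand-ins for agents) and stops at the first failure instead of passing a bad intermediate result downstream, which is one simple way to contain the error-propagation problem.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Stage = Callable[[str], str]


@dataclass
class PipelineResult:
    output: str
    completed: list[str] = field(default_factory=list)
    failed_stage: Optional[str] = None
    error: Optional[str] = None


def run_pipeline(stages: list[tuple[str, Stage]], text: str) -> PipelineResult:
    """Feed each stage's output into the next; stop at the first stage that raises."""
    result = PipelineResult(output=text)
    for name, stage in stages:
        try:
            result.output = stage(result.output)
        except Exception as exc:
            # Fail fast: later stages never see a corrupted input.
            result.failed_stage = name
            result.error = str(exc)
            return result
        result.completed.append(name)
    return result


# Hypothetical stand-ins for the four writing agents.
STAGES: list[tuple[str, Stage]] = [
    ("brainstorm", lambda text: f"ideas({text})"),
    ("draft", lambda text: f"draft({text})"),
    ("edit", lambda text: f"edited({text})"),
    ("polish", lambda text: f"polished({text})"),
]

if __name__ == "__main__":
    print(run_pipeline(STAGES, "topic").output)
```

On failure, `completed` records which stages already ran, so a retry can resume from `failed_stage` with the last good `output` rather than starting over.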
Hierarchical Architecture
One orchestrator manages multiple subagents. The orchestrator decides which agent to call and when. Subagents execute tasks and return results.
When to use: Most real-world multi-agent systems. The orchestrator has the "brain" (planning, reasoning, coordination). Subagents have the "skills" (execution, specialization).
Limitations: Single point of failure at the orchestrator. Orchestrator context grows as it accumulates subagent results.
Peer-to-Peer Architecture
Agents communicate directly with each other without a central orchestrator. Any agent can call any other agent.
When to use: Simulation, debate, collaborative writing where agents need to respond to each other.
Limitations: Hard to reason about. Emergent behavior is unpredictable. Deadlocks are possible (A waits for B, B waits for A).
Blackboard Architecture
All agents read and write to a shared state (the "blackboard"). Agents are triggered by state changes.
When to use: When multiple agents need to contribute incrementally to a shared artifact. Example: multiple research agents adding findings to a shared document.
Limitations: Requires careful state management. Write conflicts need resolution. Less common in LLM systems.
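A toy blackboard, to show the trigger-on-state-change idea. This is a sketch with several simplifying assumptions: everything is synchronous and in-process, the last write to a key wins (no conflict resolution), and the two agents are hypothetical stand-ins that format strings instead of calling an LLM.

```python
from collections import defaultdict
from typing import Callable


class Blackboard:
    """Shared state; agents subscribe to keys and run when those keys are written."""

    def __init__(self) -> None:
        self.state: dict[str, str] = {}
        self.history: list[tuple[str, str]] = []  # ordered log of every write
        self._subscribers: dict[str, list[Callable[["Blackboard"], None]]] = defaultdict(list)

    def subscribe(self, key: str, agent: Callable[["Blackboard"], None]) -> None:
        self._subscribers[key].append(agent)

    def write(self, key: str, value: str) -> None:
        """Store a value, log it, then trigger every agent watching this key."""
        self.state[key] = value
        self.history.append((key, value))
        for agent in list(self._subscribers[key]):
            agent(self)


def research_agent(board: Blackboard) -> None:
    board.write("findings", f"3 facts about {board.state['topic']}")


def summary_agent(board: Blackboard) -> None:
    board.write("summary", f"summary of: {board.state['findings']}")


def build_demo_board() -> Blackboard:
    board = Blackboard()
    board.subscribe("topic", research_agent)
    board.subscribe("findings", summary_agent)
    return board


if __name__ == "__main__":
    board = build_demo_board()
    board.write("topic", "blackboard systems")
    print(board.state["summary"])
```

Note the hazard this design carries: if two agents subscribe to each other's keys, `write` recurses without end. A production version would need a write budget or cycle detection on top of this.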
Specialization vs Generalization
A specialized agent has a system prompt that tightly defines its role, capabilities, and output format. A generalist agent handles anything.
Specialist: "You are a senior security engineer. Your only job is to review
Python code for SQL injection, command injection, and path traversal
vulnerabilities. Output a JSON list of findings."
Generalist: "You are a helpful AI assistant."
When does specialization win?
- High-stakes outputs: Security reviews, legal analysis, medical summaries
- Consistent format requirements: The specialist always outputs JSON, always in the same schema
- Domain depth: The specialist's entire context window is filled with domain-specific knowledge, not generic helpfulness
- High-volume production: Specialized agents can be fine-tuned on domain data
When does generalization win?
- Exploratory tasks: You don't know what you need yet
- Low task volume: Not worth the engineering cost of specialization
- Cross-domain synthesis: The task requires connecting multiple domains (a specialist can't do this)
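A fixed output schema only pays off if the caller enforces it. The sketch below validates the kind of JSON findings list the security specialist above is asked to produce. The field names and category values are assumptions made up for this example; the original prompt only says "a JSON list of findings".

```python
import json

# Hypothetical schema for one finding; adjust to whatever the specialist's prompt specifies.
REQUIRED_FIELDS = {"file": str, "line": int, "category": str, "description": str}
ALLOWED_CATEGORIES = {"sql_injection", "command_injection", "path_traversal"}


def parse_findings(raw: str) -> list[dict]:
    """Parse and validate a specialist's JSON output; raise ValueError on any mismatch."""
    try:
        findings = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(findings, list):
        raise ValueError("expected a JSON list of findings")
    for index, finding in enumerate(findings):
        if not isinstance(finding, dict):
            raise ValueError(f"finding {index}: expected an object")
        for name, expected_type in REQUIRED_FIELDS.items():
            if not isinstance(finding.get(name), expected_type):
                raise ValueError(
                    f"finding {index}: field {name!r} missing or not {expected_type.__name__}"
                )
        if finding["category"] not in ALLOWED_CATEGORIES:
            raise ValueError(f"finding {index}: unknown category {finding['category']!r}")
    return findings


if __name__ == "__main__":
    sample = (
        '[{"file": "app.py", "line": 42, "category": "sql_injection", '
        '"description": "query built with string concatenation"}]'
    )
    print(parse_findings(sample))
```

A `ValueError` here is a natural trigger for a retry with the validation message appended to the prompt, which is usually cheaper than letting malformed output flow into the next agent.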
Current Limitations
Multi-agent systems are powerful but immature. Honest limitations as of 2025:
Error propagation: A mistake in step 2 of a 5-step pipeline corrupts steps 3-5. Error correction is hard. Most frameworks don't have robust recovery.
Coordination complexity: Debugging a 5-agent system is significantly harder than debugging a single agent. You need full tracing.
Emergent failures: In conversational systems (like AutoGen), agents can get into loops, escalate disagreements, or lose the original goal. These failures are hard to predict.
Cost unpredictability: Agentic loops where agents call each other can spiral into many more LLM calls than expected. Production systems need hard token/cost limits.
Consistency: Different agents may make different assumptions about the task, leading to inconsistent outputs that need reconciliation.
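The cost-unpredictability problem has a simple first line of defense: a budget object that every agent call must pass through. The sketch below is a minimal illustration with made-up limits; `ping_pong` simulates two agents that would hand off to each other forever, and no real LLM calls are made.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a pipeline run goes past its call or token limit."""


@dataclass
class RunBudget:
    max_calls: int
    max_tokens: int
    calls: int = 0
    tokens: int = 0

    def charge(self, tokens_used: int) -> None:
        """Record one agent call; raise once either limit is exceeded."""
        self.calls += 1
        self.tokens += tokens_used
        if self.calls > self.max_calls:
            raise BudgetExceeded(f"call limit of {self.max_calls} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token limit of {self.max_tokens} exceeded")


def ping_pong(budget: RunBudget, tokens_per_turn: int = 200) -> int:
    """Simulate two agents handing off forever; return turns completed before the stop."""
    turns = 0
    try:
        while True:
            budget.charge(tokens_per_turn)  # a real loop would make the LLM call here
            turns += 1
    except BudgetExceeded:
        return turns


if __name__ == "__main__":
    print(ping_pong(RunBudget(max_calls=10, max_tokens=100_000)))  # 10
    print(ping_pong(RunBudget(max_calls=100, max_tokens=1_000)))   # 5
```

Sharing one `RunBudget` instance across the orchestrator and all subagents is what makes the limit apply to the whole run rather than to each agent separately.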
Full Python Code: Simple Orchestrator + Subagents Pipeline
```python
"""
simple_multi_agent.py - A 3-subagent pipeline with an orchestrator.

Orchestrator decomposes a research task into:
1. Research agent: gathers facts
2. Analysis agent: identifies insights
3. Writing agent: produces final report

Uses Anthropic SDK directly - no framework required.
"""
import json
from dataclasses import dataclass, field
from typing import Optional

import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-5"


@dataclass
class AgentResult:
    agent_name: str
    task: str
    output: str
    success: bool
    error: Optional[str] = None


@dataclass
class OrchestratorState:
    original_task: str
    subtasks: list[dict] = field(default_factory=list)
    results: list[AgentResult] = field(default_factory=list)
    final_output: Optional[str] = None


# ─── Subagent implementations ────────────────────────────────────────────────

def research_agent(topic: str) -> AgentResult:
    """Gathers factual information about a topic."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=(
            "You are a research agent. Your job is to gather accurate, "
            "specific facts about a topic. Focus on statistics, dates, "
            "key figures, and verifiable claims. Output structured findings."
        ),
        messages=[{
            "role": "user",
            "content": f"Research this topic thoroughly: {topic}"
        }]
    )
    return AgentResult(
        agent_name="ResearchAgent",
        task=f"Research: {topic}",
        output=response.content[0].text,
        success=True
    )


def analysis_agent(research_output: str, focus: str) -> AgentResult:
    """Identifies patterns, insights, and implications from research."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=(
            "You are an analysis agent. You receive research findings and "
            "identify key insights, patterns, cause-effect relationships, "
            "and implications. Be specific. Avoid restating facts - identify "
            "what they mean."
        ),
        messages=[{
            "role": "user",
            "content": (
                f"Analyze these research findings with focus on: {focus}\n\n"
                f"Research:\n{research_output}"
            )
        }]
    )
    return AgentResult(
        agent_name="AnalysisAgent",
        task=f"Analyze: {focus}",
        output=response.content[0].text,
        success=True
    )


def writing_agent(research: str, analysis: str, format_spec: str) -> AgentResult:
    """Produces a well-written final document."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        system=(
            "You are a writing agent. You receive research and analysis, "
            "then produce a polished, well-structured final document. "
            "Focus on clarity, flow, and actionable takeaways."
        ),
        messages=[{
            "role": "user",
            "content": (
                f"Write a {format_spec} using the following inputs:\n\n"
                f"RESEARCH:\n{research}\n\n"
                f"ANALYSIS:\n{analysis}"
            )
        }]
    )
    return AgentResult(
        agent_name="WritingAgent",
        task=f"Write: {format_spec}",
        output=response.content[0].text,
        success=True
    )


# ─── Orchestrator ─────────────────────────────────────────────────────────────

class Orchestrator:
    """
    Decomposes tasks and coordinates subagents.
    Maintains state for tracing. Error handling is intentionally omitted
    to keep the example short: any API or JSON error propagates to the caller.
    """

    def __init__(self, task: str):
        self.state = OrchestratorState(original_task=task)

    def decompose_task(self) -> dict:
        """Ask the LLM to decompose the task into subtasks for subagents."""
        response = client.messages.create(
            model=MODEL,
            max_tokens=512,
            system=(
                "You are an orchestrator. Decompose a complex task into "
                "exactly 3 subtasks for: (1) a research agent, "
                "(2) an analysis agent, (3) a writing agent. "
                "Output ONLY valid JSON: "
                '{"research_topic": "...", "analysis_focus": "...", "output_format": "..."}'
            ),
            messages=[{
                "role": "user",
                "content": f"Decompose this task: {self.state.original_task}"
            }]
        )
        raw = response.content[0].text
        # Extract JSON if the model wrapped it in a markdown code fence
        if "```json" in raw:
            raw = raw.split("```json")[1].split("```")[0].strip()
        elif "```" in raw:
            raw = raw.split("```")[1].split("```")[0].strip()
        subtasks = json.loads(raw)
        self.state.subtasks = [subtasks]
        return subtasks

    def run(self) -> str:
        """Execute the full pipeline."""
        print(f"\n[Orchestrator] Task: {self.state.original_task}")
        print("[Orchestrator] Decomposing task...")
        subtasks = self.decompose_task()
        print(f"[Orchestrator] Subtasks: {json.dumps(subtasks, indent=2)}")

        # Step 1: Research
        print("\n[Orchestrator] → Dispatching to ResearchAgent...")
        research_result = research_agent(subtasks["research_topic"])
        self.state.results.append(research_result)
        print(f"[ResearchAgent] ✓ Complete ({len(research_result.output)} chars)")

        # Step 2: Analysis (depends on research)
        print("[Orchestrator] → Dispatching to AnalysisAgent...")
        analysis_result = analysis_agent(
            research_output=research_result.output,
            focus=subtasks["analysis_focus"]
        )
        self.state.results.append(analysis_result)
        print(f"[AnalysisAgent] ✓ Complete ({len(analysis_result.output)} chars)")

        # Step 3: Writing (depends on research + analysis)
        print("[Orchestrator] → Dispatching to WritingAgent...")
        writing_result = writing_agent(
            research=research_result.output,
            analysis=analysis_result.output,
            format_spec=subtasks["output_format"]
        )
        self.state.results.append(writing_result)
        print(f"[WritingAgent] ✓ Complete ({len(writing_result.output)} chars)")

        # Aggregate results
        self.state.final_output = writing_result.output
        print("\n[Orchestrator] Pipeline complete.")
        return self.state.final_output

    def get_trace(self) -> dict:
        """Return full execution trace for debugging."""
        return {
            "task": self.state.original_task,
            "subtasks": self.state.subtasks,
            "agent_results": [
                {
                    "agent": r.agent_name,
                    "task": r.task,
                    "output_length": len(r.output),
                    "success": r.success
                }
                for r in self.state.results
            ],
            "final_output_length": len(self.state.final_output or "")
        }


# ─── Usage ────────────────────────────────────────────────────────────────────

def main():
    task = (
        "Create a comprehensive briefing on the current state of "
        "AI agents in 2025: capabilities, limitations, and near-term trajectory."
    )
    orchestrator = Orchestrator(task)
    final_output = orchestrator.run()

    print("\n" + "=" * 60)
    print("FINAL OUTPUT:")
    print("=" * 60)
    print(final_output)

    print("\n" + "=" * 60)
    print("EXECUTION TRACE:")
    print("=" * 60)
    print(json.dumps(orchestrator.get_trace(), indent=2))


if __name__ == "__main__":
    main()
```
Production Notes
Start simple: Before reaching for multi-agent, ask whether a single agent with a longer context window and better prompt solves the problem. Multi-agent adds real complexity.
Hard limits are mandatory: Always set maximum token budgets, maximum agent call counts, and maximum pipeline depth. Without limits, runaway loops are possible.
Trace everything: Every agent call should be logged with its input, output, latency, and token count. You cannot debug multi-agent systems without traces.
Test failure modes: What happens when the research agent returns garbage? What happens when the writing agent times out? Test every failure path explicitly.
Cost estimation: Estimate cost per run before deploying. Multi-agent systems can be 5-10x more expensive than single-agent systems on the same task.
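The "trace everything" note can be implemented with a small decorator. The sketch below is one possible shape, not a prescribed design: it assumes agents are `str -> str` functions, keeps the trace in an in-memory list, and records input, output, latency, and success. Token counts are left out because the hypothetical stand-in agent makes no API call; with a real call they would come from the usage metadata on the API response.

```python
import functools
import time
from typing import Callable

TRACE: list[dict] = []  # in-memory trace; a real system would ship this to a log store


def traced(agent_name: str) -> Callable[[Callable[[str], str]], Callable[[str], str]]:
    """Decorator that records input, output, latency, and success for each agent call."""

    def decorator(agent_fn: Callable[[str], str]) -> Callable[[str], str]:
        @functools.wraps(agent_fn)
        def wrapper(prompt: str) -> str:
            record: dict = {"agent": agent_name, "input": prompt}
            start = time.perf_counter()
            try:
                output = agent_fn(prompt)
                record.update(output=output, success=True)
                return output
            except Exception as exc:
                record.update(error=repr(exc), success=False)
                raise
            finally:
                # Runs on both success and failure, so failed calls are traced too.
                record["latency_s"] = time.perf_counter() - start
                TRACE.append(record)

        return wrapper

    return decorator


@traced("ResearchAgent")
def fake_research_agent(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call.
    return f"findings for: {prompt}"


if __name__ == "__main__":
    fake_research_agent("AI agents in 2025")
    print(TRACE)
```

Because the record is appended in `finally`, a call that raises still leaves a trace entry with `success=False`, which is usually the entry you most need when debugging.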
:::warning Coordination Overhead is Real Multi-agent systems frequently perform worse than single-agent systems on tasks that don't genuinely require them. The added complexity - more LLM calls, more failure points, more latency - is not free. Always benchmark multi-agent against a well-optimized single-agent baseline before committing to the architecture. :::
:::danger Runaway Agent Loops Multi-agent systems can get into infinite loops where agents keep calling each other, and token costs climb quickly because each turn typically re-sends a longer history. Always implement hard limits: max_agent_calls, max_tokens_per_pipeline, circuit breakers on error rates. Left unchecked, a runaway 5-agent loop can run up a large API bill before anyone notices. :::
Interview Q&A
Q: When should you use a multi-agent system vs a single agent?
A: Multi-agent is justified when the task exceeds one context window (requiring handoffs between agents), when subtasks are independent enough to run in parallel (providing meaningful speed gains), or when independent verification materially improves output quality. For most tasks, a well-prompted single agent with enough context is simpler and cheaper. Always benchmark.
Q: What are the main failure modes in multi-agent systems?
A: Error propagation (a mistake in step 2 corrupts steps 3-5), coordination loops (agents calling each other indefinitely), context loss (agents losing track of the original goal across many handoffs), and emergent inconsistency (different agents making different assumptions). Production systems need hard limits, full tracing, and explicit error handling at every agent boundary.
Q: What's the difference between pipeline, hierarchical, and peer-to-peer multi-agent architectures?
A: Pipeline is sequential - Agent A's output is Agent B's input, serially. Hierarchical has one orchestrator coordinating multiple subagents in parallel or sequentially as needed. Peer-to-peer has agents communicating directly without a coordinator, enabling more emergent behavior but making the system harder to reason about and debug.
Q: How do you handle partial failures in a multi-agent pipeline?
A: Depends on the failure type. Transient failures (rate limits, timeouts) warrant retry with exponential backoff. Quality failures (agent produced bad output) warrant retry with a different prompt or model. Fatal failures (agent can't complete the task) warrant fallback to a simpler approach or human escalation. The orchestrator should track which steps succeeded so retries don't re-run completed work.
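A sketch of the transient-failure branch of that answer: retry with exponential backoff and jitter. `TransientError` is a hypothetical exception standing in for whatever rate-limit or timeout errors your client library raises, and the attempt count and delays are illustrative defaults.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (rate limits, timeouts)."""


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on TransientError wait base_delay * 2**(attempt-1) plus jitter, then retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: let the orchestrator fall back or escalate
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))  # jitter avoids synchronized retries
    raise RuntimeError("max_attempts must be at least 1")


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_agent() -> str:
        # Fails twice, then succeeds, to exercise the retry path.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise TransientError("rate limited")
        return "ok"

    print(call_with_retry(flaky_agent, base_delay=0.01), attempts["count"])
```

Only `TransientError` is retried on purpose: quality failures and fatal failures need a different response (a changed prompt, a fallback, or a human), so blindly retrying them just spends tokens.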
Q: What is the specialization vs generalization tradeoff in multi-agent design?
A: Specialized agents (tightly scoped system prompts, domain-specific context) perform better within their domain but can't handle tasks outside it. Generalist agents handle variety but perform worse at depth. Multi-agent systems often use specialized agents for execution (researcher, critic, writer) and a generalist orchestrator for coordination - because coordination requires flexibility, while execution benefits from specialization.
Q: How does multi-agent affect token costs?
A: Significantly. Each agent needs context - which may include the original task, prior agent outputs, and its own system prompt. A 5-agent pipeline where each agent sees prior outputs can multiply total token count by 3-5x vs a single agent. Plus, orchestrators themselves make LLM calls. Production multi-agent systems need rigorous cost measurement per pipeline run and circuit breakers when costs exceed budget.
