Agent Architectures
Agentic AI is one of the fastest-growing areas in production LLM systems. Interviewers want to know that you understand not just the patterns (ReAct, plan-and-execute) but the engineering realities: how to make agents reliable, debuggable, and safe. This page covers the full spectrum from single-agent tool use to multi-agent orchestration.
Why Interviewers Care
"Everyone can build a demo agent. I want to see whether you understand why agents fail in production, how to evaluate them, and how to architect systems that degrade gracefully. The difference between a demo and a product is reliability - and that is what I am hiring for."
1. Foundations: What Makes an Agent
An agent is an LLM-powered system that can:
- Observe its environment (read inputs, tool outputs, memory).
- Reason about what to do next.
- Act on the environment (call tools, write outputs).
- Loop until the task is complete or a stop condition is met.
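The four capabilities above can be sketched as a small loop. This is a minimal illustration, not a reference implementation: `toy_policy` stands in for the LLM call, and `TOOLS`, `add_one`, and `run_agent` are hypothetical names chosen for the example.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> plain Python function.
TOOLS: dict[str, Callable[[str], str]] = {
    "add_one": lambda x: str(int(x) + 1),
}

def toy_policy(observations: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM: returns (tool_name, argument) or ("finish", answer)."""
    last = int(observations[-1])
    if last >= 3:
        return ("finish", f"reached {last}")
    return ("add_one", str(last))

def run_agent(task_input: str, policy=toy_policy, max_steps: int = 10) -> str:
    observations = [task_input]                   # observe the initial input
    for _ in range(max_steps):                    # loop with a hard stop condition
        action, arg = policy(observations)        # reason about the next step
        if action == "finish":
            return arg
        observations.append(TOOLS[action](arg))   # act, then observe the result
    return "stopped: step budget exhausted"

print(run_agent("0"))
```

The `max_steps` budget is the simplest stop condition; without it, a policy that never returns "finish" would loop forever.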
Agent vs. Chain vs. RAG
| Feature | Chain | RAG | Agent |
|---|---|---|---|
| Fixed steps | Yes | Yes | No (dynamic) |
| Tool use | No | Retriever only | Multiple tools |
| Loops | No | No | Yes |
| Planning | No | No | Yes |
| Failure modes | Predictable | Predictable | Complex |
| Latency | Low | Medium | High (variable) |
"An agent is an LLM system with a dynamic control flow - it decides at each step whether to call a tool, reason further, or return a final answer. Unlike chains (fixed sequence) or RAG (retrieve-then-generate), agents can loop, backtrack, and adapt their strategy. The core loop is observe-reason-act-repeat."
2. ReAct: Reasoning + Acting
The ReAct Pattern
ReAct (Yao et al., 2022) interleaves reasoning traces with actions in a single prompt:
Thought: I need to find the current stock price of NVIDIA.
Action: search("NVIDIA stock price today")
Observation: NVIDIA (NVDA) is trading at $892.43 as of March 6, 2026.
Thought: Now I need to calculate the market cap. I know they have about 24.5B shares outstanding.
Action: calculator("892.43 * 24500000000")
Observation: 21,864,535,000,000
Thought: The market cap is approximately $21.86 trillion. I have enough information to answer.
Final Answer: NVIDIA's current market cap is approximately $21.86 trillion.
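To drive a trace like the one above, the orchestrator has to parse each model turn into either a tool call or a final answer. The sketch below assumes the exact `Action: tool("arg")` and `Final Answer:` line formats shown in the trace; `parse_step` is a hypothetical helper, and production systems usually rely on native function calling instead of regex parsing.

```python
import re

# Patterns match the line formats used in the example trace.
ACTION_RE = re.compile(r'^Action:\s*(\w+)\((.*)\)\s*$', re.MULTILINE)
FINAL_RE = re.compile(r'^Final Answer:\s*(.+)$', re.MULTILINE)

def parse_step(llm_output: str):
    """Return ("final", answer) or ("action", tool_name, argument)."""
    final = FINAL_RE.search(llm_output)
    if final:
        return ("final", final.group(1).strip())
    action = ACTION_RE.search(llm_output)
    if action is None:
        raise ValueError("no Action or Final Answer found")
    # Strip the surrounding double quotes from the single string argument.
    return ("action", action.group(1), action.group(2).strip().strip('"'))

turn = 'Thought: I need the price.\nAction: search("NVIDIA stock price today")'
print(parse_step(turn))
```

The orchestrator would run the named tool, append an `Observation:` line with the result, and call the model again until `parse_step` returns a final answer.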
Why ReAct Works
- Reasoning traces help the LLM decompose complex problems step by step.
- Actions ground the reasoning in real data, reducing hallucination.
- Observations provide feedback that guides subsequent reasoning.
ReAct vs. Chain-of-Thought vs. Act-Only
| Approach | Reasoning | Acting | Accuracy | Interpretability |
|---|---|---|---|---|
| Act-only | No explicit | Yes | Low | Low |
| CoT-only | Yes | No | Medium (hallucinates facts) | High |
| ReAct | Yes | Yes | High | High |
Candidates often conflate ReAct with "just prompting the LLM to use tools." ReAct's key insight is that explicit reasoning traces before each action dramatically improve tool selection accuracy and multi-step planning. Without the "Thought" step, agents make more errors in tool selection and argument construction.
3. Function Calling and Tool Use
Function Calling Mechanism
Modern LLMs support native function calling, where the model outputs a structured tool invocation rather than free text.
Tool Definition Best Practices
{
  "name": "search_database",
  "description": "Search the product database. Use this when the user asks about product availability, pricing, or specifications. Do NOT use for general knowledge questions.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "Natural language search query"
      },
      "category": {
        "type": "string",
        "enum": ["electronics", "clothing", "food", "all"],
        "description": "Product category to filter by. Use 'all' if unsure."
      },
      "max_results": {
        "type": "integer",
        "description": "Maximum results to return (1-20)",
        "default": 5
      }
    },
    "required": ["query"]
  }
}
Key principles:
- Descriptive names: search_database, not tool_1.
- When-to-use guidance: Tell the model when to use and when NOT to use the tool.
- Parameter descriptions: Each parameter should explain what it does and valid values.
- Sensible defaults: Reduce the decisions the model needs to make.
- Constrained types: Use enums, ranges, and required fields to prevent invalid calls.
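Constrained schemas only help if the application validates the model's arguments before executing the tool. Below is a minimal hand-rolled validator for the `search_database` parameters. It is a sketch: the `minimum`/`maximum` keys are an assumption added here to encode the "1-20" range from the description, `validate_args` is a hypothetical helper, and a real system would more likely use a JSON Schema library.

```python
SEARCH_DATABASE_PARAMS = {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "category": {"type": "string",
                     "enum": ["electronics", "clothing", "food", "all"]},
        # minimum/maximum are assumed here to make the 1-20 range checkable.
        "max_results": {"type": "integer", "default": 5,
                        "minimum": 1, "maximum": 20},
    },
    "required": ["query"],
}

PY_TYPES = {"string": str, "integer": int}

def validate_args(args: dict, schema: dict) -> dict:
    """Check model-supplied arguments and fill in defaults; raise ValueError if invalid."""
    props = schema["properties"]
    unknown = set(args) - set(props)
    if unknown:
        raise ValueError(f"unknown arguments: {sorted(unknown)}")
    missing = [k for k in schema.get("required", []) if k not in args]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    clean = {}
    for name, spec in props.items():
        if name not in args:
            if "default" in spec:
                clean[name] = spec["default"]
            continue
        value = args[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, PY_TYPES[spec["type"]]):
            raise ValueError(f"{name}: expected {spec['type']}")
        if "enum" in spec and value not in spec["enum"]:
            raise ValueError(f"{name}: must be one of {spec['enum']}")
        if "minimum" in spec and value < spec["minimum"]:
            raise ValueError(f"{name}: below minimum {spec['minimum']}")
        if "maximum" in spec and value > spec["maximum"]:
            raise ValueError(f"{name}: above maximum {spec['maximum']}")
        clean[name] = value
    return clean

print(validate_args({"query": "laptops"}, SEARCH_DATABASE_PARAMS))
```

Returning the validation error text to the model as the tool result often lets it correct the call on the next turn.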
Parallel Tool Calling
Many APIs support calling multiple tools simultaneously:
User: Compare the weather in Tokyo and London.
Assistant: [
{"tool": "get_weather", "args": {"city": "Tokyo"}},
{"tool": "get_weather", "args": {"city": "London"}}
]
This reduces latency by executing independent tool calls concurrently.
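On the application side, concurrent execution of independent calls can be sketched with `asyncio.gather`. The `get_weather` coroutine here is a hypothetical stand-in (the sleep simulates a network call), and the `{"tool": ..., "args": ...}` shape mirrors the example above rather than any specific provider's wire format.

```python
import asyncio

async def get_weather(city: str) -> dict:
    """Hypothetical tool; the sleep stands in for a network request."""
    await asyncio.sleep(0.1)
    return {"city": city, "forecast": "placeholder"}

TOOLS = {"get_weather": get_weather}

async def run_tool_calls(calls: list[dict]) -> list[dict]:
    # Start every call at once; gather returns results in the order of the inputs.
    coroutines = [TOOLS[c["tool"]](**c["args"]) for c in calls]
    return await asyncio.gather(*coroutines)

calls = [
    {"tool": "get_weather", "args": {"city": "Tokyo"}},
    {"tool": "get_weather", "args": {"city": "London"}},
]
results = asyncio.run(run_tool_calls(calls))
print(results)
```

Both calls finish in roughly the time of one, because the waits overlap instead of running back to back.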
"Function calling lets the LLM output structured tool invocations instead of free text. The key to reliable function calling is well-designed tool definitions: clear names, descriptions that specify when to use the tool, constrained parameter types, and sensible defaults. Parallel tool calling further reduces latency by executing independent calls concurrently."
4. Model Context Protocol (MCP)
What Is MCP
MCP (Anthropic, 2024) is an open standard for connecting LLMs to external tools and data sources. It defines a client-server protocol where:
- MCP Servers expose tools, resources, and prompts via a standardized interface.
- MCP Clients (LLM applications) discover and invoke these capabilities.
- Transport: JSON-RPC 2.0 over stdio or HTTP (HTTP+SSE in the original spec; later revisions define a Streamable HTTP transport).
MCP Capabilities
| Capability | Description | Example |
|---|---|---|
| Tools | Functions the LLM can call | query_database, send_email |
| Resources | Data the LLM can read | File contents, database schemas |
| Prompts | Pre-built prompt templates | "Summarize this codebase" |
| Sampling | Server-initiated LLM calls | Server asks LLM to analyze a result |
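To make the wire format concrete, here is a rough sketch of the JSON-RPC 2.0 messages a client sends to list and call tools. The `tools/list` and `tools/call` method names come from the MCP specification; the `jsonrpc_request` helper and the `sql` argument are hypothetical, and a real client would first perform the initialization handshake (typically via an official SDK rather than hand-built messages).

```python
import json
from itertools import count
from typing import Optional

_ids = count(1)  # JSON-RPC requests carry an id so responses can be matched

def jsonrpc_request(method: str, params: Optional[dict] = None) -> str:
    """Serialize one JSON-RPC 2.0 request."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)

# Discover the tools a server exposes.
list_tools = jsonrpc_request("tools/list")

# Invoke one tool by name; the argument shape is defined by the tool's schema.
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "query_database", "arguments": {"sql": "SELECT 1"}},  # hypothetical args
)

print(list_tools)
print(call_tool)
```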
Why MCP Matters
- Standardization: One protocol to connect any LLM to any tool.
- Composability: Mix and match servers (database + file system + API).
- Security: Servers control access; clients request capabilities.
- Ecosystem: Growing library of pre-built MCP servers.
Anthropic expects deep MCP knowledge, since it is their protocol. OpenAI may ask you to compare MCP with their plugin/function calling approach. Startups care about practical integration: how to build and deploy MCP servers. Google may ask about comparison with their Genkit tooling.
5. Multi-Agent Systems
Why Multiple Agents
Single agents struggle with:
- Complex tasks requiring diverse expertise (research + coding + review).
- Long-running tasks where context windows overflow.
- Reliability: a single agent is a single point of failure.
Multi-agent systems decompose work across specialized agents that communicate and coordinate.
Architectures
| Architecture | When to Use | Strengths | Weaknesses |
|---|---|---|---|
| Hierarchical | Complex multi-domain tasks | Clear accountability, scalable | Single point of failure at manager |
| Peer-to-peer | Debate, brainstorming | Diverse perspectives | Coordination overhead, circular discussions |
| Pipeline | Sequential refinement | Predictable flow | No parallelism, long latency |
| Dynamic | Unpredictable tasks | Flexible | Hard to debug |
Framework Comparison
| Framework | Architecture | Key Feature | Best For |
|---|---|---|---|
| AutoGen | Flexible (conversation-based) | Agent-to-agent chat | Research, prototyping |
| CrewAI | Role-based hierarchical | Role + goal + backstory per agent | Business workflows |
| LangGraph | Graph-based state machine | Explicit state and transitions | Production systems |
| OpenAI Swarm | Handoff-based | Lightweight agent transfers | Customer service routing |
| Anthropic MCP | Server-based tool providers | Standardized protocol | Tool integration |
LangGraph Deep Dive
LangGraph models agents as state machines with explicit nodes and edges:
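from typing import TypedDict

from langgraph.graph import StateGraph, START, END

# Define state
class AgentState(TypedDict):
    messages: list
    plan: str
    results: list

# Define nodes (each is a function that takes the state and returns a partial update)
def planner(state: AgentState) -> dict:
    # LLM call to create a plan
    ...

def executor(state: AgentState) -> dict:
    # Execute the plan step by step
    ...

def reviewer(state: AgentState) -> dict:
    # Check quality, decide if done
    ...

def should_continue(state: AgentState) -> str:
    # Routing function: return "plan" to replan or "end" to finish
    ...

# Build graph
graph = StateGraph(AgentState)
graph.add_node("plan", planner)
graph.add_node("execute", executor)
graph.add_node("review", reviewer)
graph.add_edge(START, "plan")  # entry point
graph.add_edge("plan", "execute")
graph.add_edge("execute", "review")
graph.add_conditional_edges(
    "review",
    should_continue,  # returns "plan" or "end"; keys below must match
    {"plan": "plan", "end": END},
)
app = graph.compile()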
Why LangGraph for production:
- Persistence: State can be saved and resumed (human-in-the-loop).
- Streaming: Token-level streaming from any node.
- Debugging: Full state history and visualization.
- Checkpointing: Roll back to any previous state.
"For production multi-agent systems, I favor LangGraph because it models agents as explicit state machines with typed state, conditional edges, and built-in persistence. AutoGen is great for research prototyping with its conversation-based approach. CrewAI excels at business workflows with its role-based abstraction. The key architectural decision is whether you need a hierarchical controller, peer-to-peer communication, or a sequential pipeline."
6. Planning Strategies
Task Decomposition
Break complex tasks into manageable sub-tasks that can be planned, executed, and checked individually.
Plan-and-Execute
Separate the planning step from execution:
- Planner: Generate a full plan before taking any action.
- Executor: Execute each step, feeding results back.
- Replanner: After each step, optionally revise the remaining plan based on new information.
Advantages over pure ReAct:
- More coherent multi-step strategies.
- The plan provides a progress tracker.
- Easier to add human approval at the plan stage.
Disadvantages:
- Initial plan may be wrong and require replanning.
- Slower to start (must plan before acting).
- Planning LLM call may be expensive.
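The planner/executor/replanner split can be sketched as a short control loop. Everything here is illustrative: `plan_and_execute` and the three `toy_*` functions are hypothetical stand-ins for LLM calls and tool execution.

```python
def plan_and_execute(goal, planner, executor, replanner, max_steps=10):
    """Plan once, then execute step by step, revising the remaining plan after each step."""
    plan = list(planner(goal))        # full plan before any action
    done = []                         # (step, result) pairs double as a progress tracker
    while plan and len(done) < max_steps:
        step = plan.pop(0)
        result = executor(step)
        done.append((step, result))
        plan = replanner(goal, done, plan)   # may return the plan unchanged
    return done

def toy_planner(goal):
    return ["fetch data", "summarize"]

def toy_executor(step):
    # Simulate a failure on the primary data source.
    return "error" if step == "fetch data" else "ok"

def toy_replanner(goal, done, remaining):
    last_step, last_result = done[-1]
    if last_result == "error":
        return ["fetch data from backup"] + remaining   # insert a recovery step
    return remaining

for step, result in plan_and_execute("weekly report", toy_planner, toy_executor, toy_replanner):
    print(step, "->", result)
```

A human approval gate fits naturally between `planner(goal)` and the loop, since the whole plan exists before anything is executed.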
Tree-of-Thought (ToT)
Explore multiple reasoning paths simultaneously instead of committing to a single chain.
Key components:
- Thought generator: Propose multiple candidate next steps.
- Evaluator: Score each candidate (LLM-as-judge or heuristic).
- Search strategy: BFS (explore breadth) or DFS (explore depth).
- Pruning: Discard low-scoring branches early.
When to use ToT:
- Problems with multiple valid approaches (math, coding, planning).
- When the cost of exploring is less than the cost of backtracking.
- NOT for simple, well-defined tasks (overkill).
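The four components map onto a breadth-first beam search. The sketch below uses a toy arithmetic puzzle (pick numbers from 1, 3, 4 that sum to 10) so that it runs without an LLM; in a real system `generate` and `score` would be model calls, and all function names here are hypothetical.

```python
def tree_of_thought_bfs(root, generate, score, is_goal, beam_width=2, max_depth=4):
    """Breadth-first search that keeps only the best `beam_width` states per level."""
    frontier = [root]
    for _ in range(max_depth):
        # Thought generator: expand every state on the frontier.
        candidates = [child for state in frontier for child in generate(state)]
        if not candidates:
            return None
        for candidate in candidates:
            if is_goal(candidate):
                return candidate
        # Evaluator + pruning: keep the highest-scoring branches only.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
    return None

TARGET = 10

def generate(state):
    return [state + (move,) for move in (1, 3, 4)]

def score(state):
    return -abs(TARGET - sum(state))   # closer to the target is better

def is_goal(state):
    return sum(state) == TARGET

solution = tree_of_thought_bfs((), generate, score, is_goal)
print(solution)
```

Cost grows with `beam_width` times the branching factor per level, which is why ToT is overkill when a single chain of thought usually succeeds.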
Reflection and Self-Critique
After each action or plan step, the agent reflects on its work:
Action: Wrote function to parse CSV file.
Reflection: The function doesn't handle quoted commas or
encoding issues. It also lacks error handling for missing
files. I should revise it before moving on.
Revised Action: Rewrote function with proper CSV parsing,
UTF-8 handling, and try/except blocks.
Reflexion (Shinn et al., 2023) formalizes this as a loop:
- Act on the task.
- Evaluate the result (test, LLM judge, or environment feedback).
- If failed, generate a verbal reflection on what went wrong.
- Store the reflection in memory and retry with that context.
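A minimal sketch of that loop follows. It is an illustration of the act/evaluate/reflect/retry structure only, not the paper's implementation: `reflexion_loop` and the `toy_*` functions are hypothetical, and in practice `act` and `reflect` would be LLM calls while `evaluate` would run tests or a judge.

```python
def reflexion_loop(task, act, evaluate, reflect, max_trials=3):
    """Retry a task, carrying verbal reflections from failed trials into the next attempt."""
    reflections = []                              # episodic memory of what went wrong
    for trial in range(1, max_trials + 1):
        attempt = act(task, reflections)          # act with past reflections as context
        ok, feedback = evaluate(attempt)          # test, judge, or environment signal
        if ok:
            return attempt, trial, reflections
        reflections.append(reflect(attempt, feedback))
    return None, max_trials, reflections

def toy_act(task, reflections):
    # First try the naive approach; switch once a reflection exists.
    return "csv module" if reflections else "naive split"

def toy_evaluate(attempt):
    if attempt == "csv module":
        return True, ""
    return False, "fails on quoted commas"

def toy_reflect(attempt, feedback):
    return f"Previous attempt '{attempt}' failed: {feedback}"

print(reflexion_loop("parse CSV", toy_act, toy_evaluate, toy_reflect))
```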
7. Memory Systems
Memory Types
| Memory Type | Storage | Retrieval | Persistence | Use Case |
|---|---|---|---|---|
| Short-term | Context window | Implicit (in prompt) | Session only | Current conversation |
| Long-term (semantic) | Vector DB | Similarity search | Persistent | Knowledge base, past conversations |
| Long-term (structured) | Graph DB / SQL | Query | Persistent | User profiles, relationships |
| Episodic | Vector DB + metadata | Similarity + filter | Persistent | Learning from past tasks |
Short-Term Memory Management
The context window is finite. Strategies for managing it:
- Sliding window: Keep only the last N messages. Simple but loses early context.
- Summarization: Periodically summarize older messages into a compact summary.
- Token counting: Track token usage and compress when approaching the limit.
- Importance scoring: Keep high-importance messages (tool results, key decisions) and drop filler.
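Token counting, a sliding window, and summarization can be combined in a few lines. This is a rough sketch under stated assumptions: `count_tokens` uses a whitespace word count as a stand-in for a real tokenizer, `naive_summary` is a placeholder for an LLM summarization call, and `compress_history` is a hypothetical helper name.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())

def naive_summary(old_messages: list[str]) -> str:
    """Placeholder for an LLM call that summarizes older messages."""
    return f"[summary of {len(old_messages)} earlier messages]"

def compress_history(messages, budget, keep_last=4, summarize=naive_summary):
    """Return messages unchanged if under budget; otherwise summarize all but the last few."""
    total = sum(count_tokens(m) for m in messages)
    if total <= budget:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]   # assumes keep_last >= 1
    if not old:
        return recent
    return [summarize(old)] + recent

history = [f"message number {i}" for i in range(10)]   # 3 "tokens" each
print(compress_history(history, budget=20))
```

Importance scoring would slot in where `old` is computed: instead of cutting purely by position, keep messages flagged as tool results or key decisions.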
Long-Term Memory with Vector Stores
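# Store a memory.
# embed() and vector_store are placeholders for your embedding model and vector DB client.
memory_text = "User prefers Python over JavaScript for backend tasks."
embedding = embed(memory_text)
vector_store.upsert(
    id="mem_001",
    vector=embedding,
    # Keep the text alongside the vector so it can be shown to the LLM at retrieval time.
    metadata={"type": "preference", "user": "alice", "text": memory_text},
)

# Retrieve relevant memories, restricted to the current user
query = "What language should I use for the API?"
query_embedding = embed(query)
results = vector_store.search(
    query_embedding,
    top_k=5,
    filter={"user": "alice"},
)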
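The snippet above leaves `embed` and `vector_store` abstract. For experimentation, a self-contained toy version might look like the following. It assumes a bag-of-words "embedding" and cosine similarity, which is only a stand-in for a real embedding model and vector database; `InMemoryVectorStore` is a hypothetical class, not any library's API.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().replace(".", " ").replace("?", " ").split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class InMemoryVectorStore:
    def __init__(self):
        self._rows = {}   # id -> (vector, metadata, text)

    def upsert(self, id, vector, metadata, text):
        self._rows[id] = (vector, metadata, text)

    def search(self, query_vector, top_k=5, filter=None):
        filter = filter or {}
        hits = [
            (cosine(query_vector, vector), text)
            for vector, metadata, text in self._rows.values()
            if all(metadata.get(k) == v for k, v in filter.items())   # metadata pre-filter
        ]
        return sorted(hits, reverse=True)[:top_k]   # highest similarity first

store = InMemoryVectorStore()
for mem_id, user, text in [
    ("mem_001", "alice", "User prefers Python over JavaScript for backend tasks."),
    ("mem_002", "bob", "User prefers Go for backend tasks."),
]:
    store.upsert(mem_id, embed(text), {"type": "preference", "user": user}, text)

results = store.search(embed("What language should I use for the API?"),
                       top_k=5, filter={"user": "alice"})
print(results)
```

The metadata filter runs before ranking, so one user's memories never leak into another user's results regardless of similarity.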
Episodic Memory
Store traces of past agent executions to learn from experience:
{
  "task": "Deploy ML model to production",
  "outcome": "success",
  "steps_taken": 12,
  "key_decisions": [
    "Used Docker instead of bare metal - faster iteration",
    "Added health check endpoint - caught OOM early"
  ],
  "mistakes": [
    "Initially forgot to set resource limits - caused OOM"
  ],
  "lessons": [
    "Always set CPU/memory limits in container configs"
  ]
}
When facing a similar task, the agent retrieves relevant episodes and incorporates lessons learned into its planning.
Candidates often describe memory as "just a vector store." Interviewers want to hear about the write side (what to store, when to update, how to handle contradictions) as much as the read side (retrieval). Memory management - deciding what to remember and what to forget - is the hard engineering problem.
8. Agent Evaluation and Debugging
Why Agent Evaluation Is Hard
- Non-deterministic: Same input can produce different action sequences.
- Multi-step: Errors compound over multiple steps.
- Tool-dependent: Tool failures are not the agent's fault but affect outcomes.
- Subjective: "Good" agent behavior is often domain-specific.
Evaluation Dimensions
| Dimension | Metric | How to Measure |
|---|---|---|
| Task completion | Success rate | Automated tests, human evaluation |
| Efficiency | Steps taken, tokens used, latency | Instrumentation |
| Tool accuracy | Correct tool selection, valid arguments | Log analysis |
| Reasoning quality | Coherence of thought chains | LLM-as-judge |
| Safety | No harmful actions, respects permissions | Red teaming |
| Cost | Total API cost per task | Token tracking |
Evaluation Frameworks
Trajectory evaluation (not just outcome):
- Did the agent take reasonable steps?
- Did it recover from errors?
- Did it avoid unnecessary tool calls?
- Did it use the right tools for the right reasons?
Debugging Techniques
- Trace logging: Log every LLM call, tool call, and state transition.
- State snapshots: Save the full state at each step for replay.
- Step-through execution: Pause after each step for human inspection.
- Counterfactual analysis: "What if the agent had chosen a different tool at step 3?"
- Failure categorization: Classify failures as planning errors, tool errors, or reasoning errors.
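Trace logging is the foundation for the other techniques, so it is worth seeing how little code it needs. The sketch below records events in memory and derives a few trajectory metrics; `TraceEvent`, `Trace`, and the event kinds are hypothetical names, and a production system would typically ship these records to a tracing backend.

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class TraceEvent:
    step: int
    kind: str            # "llm_call", "tool_call", or "state"
    name: str
    payload: dict
    ok: bool = True
    ts: float = field(default_factory=time.time)

class Trace:
    def __init__(self):
        self.events: list[TraceEvent] = []

    def log(self, kind: str, name: str, payload: dict, ok: bool = True) -> None:
        self.events.append(TraceEvent(len(self.events), kind, name, payload, ok))

    def to_jsonl(self) -> str:
        """One JSON object per line, suitable for replay or offline analysis."""
        return "\n".join(json.dumps(asdict(e)) for e in self.events)

    def summary(self) -> dict:
        tool_calls = [e for e in self.events if e.kind == "tool_call"]
        # Count back-to-back identical tool calls, a common symptom of looping.
        repeated = sum(
            1 for a, b in zip(tool_calls, tool_calls[1:])
            if (a.name, a.payload) == (b.name, b.payload)
        )
        return {
            "steps": len(self.events),
            "tool_calls": len(tool_calls),
            "tool_errors": sum(not e.ok for e in tool_calls),
            "repeated_calls": repeated,
        }

trace = Trace()
trace.log("llm_call", "planner", {"prompt_tokens": 120})
trace.log("tool_call", "search", {"q": "x"})
trace.log("tool_call", "search", {"q": "x"})
trace.log("tool_call", "calculator", {"expr": "1+1"}, ok=False)
print(trace.summary())
```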
Common Agent Failure Modes
| Failure | Symptom | Fix |
|---|---|---|
| Infinite loop | Agent repeats the same action | Max iteration limit, loop detection |
| Wrong tool selection | Agent uses search when it should use calculator | Better tool descriptions, few-shot examples |
| Argument hallucination | Agent makes up tool arguments | Constrained schemas, validation |
| Context overflow | Agent loses track of earlier steps | Summarization, scratchpad |
| Premature termination | Agent stops before completing the task | Completion criteria in prompt |
| Over-planning | Agent plans extensively but never acts | Planning budget, force action after N thoughts |
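The first row's fixes (iteration limit plus loop detection) can be sketched as a small guard that the orchestrator calls before every tool invocation. `LoopGuard` and its thresholds are hypothetical choices for illustration.

```python
import json
from collections import Counter

class LoopGuard:
    """Raise if the agent exceeds its step budget or repeats an identical call too often."""

    def __init__(self, max_steps: int = 20, max_repeats: int = 3):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen = Counter()

    def check(self, tool: str, args: dict) -> None:
        self.steps += 1
        if self.steps > self.max_steps:
            raise RuntimeError("step budget exceeded")
        # Canonicalize arguments so key order does not hide a repeat.
        key = (tool, json.dumps(args, sort_keys=True))
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            raise RuntimeError(f"loop detected: {tool} called {self.seen[key]} times with the same arguments")

guard = LoopGuard(max_steps=10, max_repeats=2)
guard.check("search", {"q": "nvda price"})
guard.check("search", {"q": "nvda price"})
try:
    guard.check("search", {"q": "nvda price"})
except RuntimeError as err:
    print(err)
```

Instead of aborting, the orchestrator could also feed the error back to the model as an observation so it can change strategy.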
"I care more about how you debug agents than how you build them. Walk me through how you would diagnose why an agent that works 90% of the time fails on the other 10%. Show me your systematic approach to failure analysis."
9. Safety: Sandboxing and Permission Systems
Why Agent Safety Matters
Agents can:
- Execute arbitrary code.
- Access databases and file systems.
- Call external APIs with side effects.
- Spend money (API calls, purchases).
- Leak sensitive information.
Defense in Depth
Permission Systems
Principle of least privilege: Agents should only have access to the tools and data they need for the current task.
| Permission Level | Description | Example |
|---|---|---|
| Read-only | Can read data, no side effects | Search, lookup |
| Write with approval | Can propose writes, human approves | File edits, emails |
| Write with limits | Can write within constraints | Under 100 lines, approved directories |
| Full access | Unrestricted (dangerous) | Admin tasks, emergency response |
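One way to encode the table is a gate that every tool call passes through before execution. This is a sketch under assumptions: the `Permission` enum mirrors the four levels above, while `TOOL_KIND`, the tool names, and the `within_limits` flag are hypothetical.

```python
from enum import Enum

class Permission(Enum):
    READ_ONLY = "read-only"
    WRITE_WITH_APPROVAL = "write with approval"
    WRITE_WITH_LIMITS = "write with limits"
    FULL_ACCESS = "full access"

# Hypothetical classification of tools by side effect.
TOOL_KIND = {"search": "read", "lookup": "read", "send_email": "write", "edit_file": "write"}

def authorize(tool: str, level: Permission, within_limits: bool = False) -> str:
    """Return "allow", "needs_approval", or "deny" for a proposed tool call."""
    kind = TOOL_KIND.get(tool)
    if kind is None:
        return "deny"                      # unknown tools are denied by default
    if kind == "read":
        return "allow"                     # reads have no side effects
    if level is Permission.READ_ONLY:
        return "deny"
    if level is Permission.WRITE_WITH_APPROVAL:
        return "needs_approval"            # a human must confirm the write
    if level is Permission.WRITE_WITH_LIMITS:
        return "allow" if within_limits else "deny"
    return "allow"                         # FULL_ACCESS

print(authorize("send_email", Permission.WRITE_WITH_APPROVAL))
```

Keeping the decision in application code, outside the prompt, means a prompt injection cannot talk the agent into a higher permission level.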
Sandboxing Strategies
- Container isolation: Run code in Docker containers with resource limits.
- VM isolation: Run untrusted code in separate virtual machines or hosted sandboxes (e.g. E2B, Modal).
- API allowlisting: Only permit calls to approved API endpoints.
- File system restrictions: Chroot or namespace isolation.
- Network restrictions: Block outbound network access by default.
- Time and resource limits: Kill processes that exceed CPU, memory, or time budgets.
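As one concrete example from the list, API allowlisting can be a short check in front of the agent's HTTP tool. The host `api.example.com` and the helper name `is_allowed` are placeholders; the sketch compares the parsed hostname exactly, because substring matching is easy to bypass.

```python
from urllib.parse import urlparse

ALLOWED_HOSTS = {"api.example.com"}   # hypothetical approved endpoint

def is_allowed(url: str) -> bool:
    """Permit only HTTPS requests whose hostname is exactly on the allowlist."""
    parsed = urlparse(url)
    return parsed.scheme == "https" and parsed.hostname in ALLOWED_HOSTS

for url in [
    "https://api.example.com/v1/items",
    "http://api.example.com/v1/items",          # wrong scheme
    "https://api.example.com.evil.test/",       # lookalike host
    "https://evil.test/?next=api.example.com",  # allowlisted name only in the query
]:
    print(url, "->", is_allowed(url))
```

An application-level check like this complements, rather than replaces, network-level egress rules on the sandbox itself.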
Human-in-the-Loop Patterns
| Pattern | When to Use | Trade-off |
|---|---|---|
| Approve every action | High-stakes, early deployment | Safe but slow |
| Approve high-risk actions only | Medium-stakes, established tools | Balanced |
| Notify and proceed | Low-stakes, trusted agent | Fast but less control |
| Full autonomy with audit | Well-tested, low-risk tasks | Fastest but requires monitoring |
"We can just let the agent run freely and review the outputs later." This shows a fundamental misunderstanding of agent safety. Agents with write access can cause irreversible damage - deleting data, sending emails, or leaking secrets. Defense in depth is not optional.
Practice Problems
Problem 1: Design an Agent Architecture
Design an agent that can help a data analyst write SQL queries, execute them against a database, visualize results, and iterate based on feedback. What tools, memory, and safety mechanisms would you include?
Hint 1 - Direction
Think about the tools needed (SQL executor, chart generator), memory (conversation + past queries), and safety (read-only DB access, query validation).
Hint 2 - Insight
Consider a plan-and-execute architecture with a reviewer step. The SQL executor should be sandboxed (read-only, query timeout, result size limits). Memory should include the schema and past successful queries.
Full Solution + Rubric
Architecture: Plan-and-Execute with LangGraph
Tools:
- get_schema(table_name) - Returns table schema (columns, types, relationships).
- execute_sql(query) - Runs SQL query against read-only replica.
- create_chart(data, chart_type, title) - Generates visualization.
- explain_query(query) - Returns query execution plan.
Memory:
- Short-term: Conversation history + current query context.
- Long-term: Database schema (cached), past successful queries (vector store for semantic search).
- Working memory: Current data results, intermediate analysis.
Safety:
- Read-only database replica (no INSERT/UPDATE/DELETE).
- Query timeout (30 seconds max).
- Result size limit (10K rows max).
- SQL injection prevention (parameterized queries).
- PII detection on output before showing to user.
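A few of these safeguards can be sketched with the standard-library `sqlite3` module standing in for the read-only replica. This is an illustration under assumptions: the `sales` table is made up, the statement-prefix check is only a cheap first filter, and the actual guarantee comes from the connection being read-only (`PRAGMA query_only` here). Query timeouts and PII detection are not shown.

```python
import sqlite3

MAX_ROWS = 10_000

def execute_sql(conn: sqlite3.Connection, query: str, max_rows: int = MAX_ROWS) -> list:
    """Run a read query and cap the number of rows returned to the agent."""
    # Cheap first filter; the read-only connection below is the real safeguard.
    if not query.lstrip().lower().startswith(("select", "with")):
        raise PermissionError("only read queries are allowed")
    cursor = conn.execute(query)
    return cursor.fetchmany(max_rows)

# Set up a toy database, then switch the connection to read-only mode.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [("EU", 10.0), ("US", 20.0)])
conn.commit()
conn.execute("PRAGMA query_only = ON")   # writes on this connection now fail

print(execute_sql(conn, "SELECT region, amount FROM sales ORDER BY amount"))
```

With a production database, the equivalent of the pragma is a dedicated database role with SELECT-only grants pointed at a replica.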
Flow:
- User describes what they want to analyze.
- Planner creates a multi-step SQL analysis plan.
- For each step: generate SQL, explain the plan, execute, validate results.
- Reviewer checks if results answer the question.
- If not, replan with the new information.
- Generate visualization and narrative summary.
Scoring:
- Strong Hire: Complete architecture with tools, memory, safety, and error handling. Discusses read-only replicas, query optimization, and how to handle ambiguous user requests.
- Lean Hire: Reasonable architecture but missing safety mechanisms or memory strategy.
- No Hire: Just describes "an LLM that writes SQL" without agent architecture.
Problem 2: Multi-Agent Coordination
You need to build a system that reviews pull requests: checks code quality, runs tests, verifies documentation, and provides a summary. Design the multi-agent architecture.
Hint 1 - Direction
Consider which tasks can run in parallel (code review, test running, doc checking) and which are sequential (summary must come last).
Hint 2 - Insight
A manager agent with three specialist agents (code reviewer, test runner, doc checker) followed by a synthesizer. Use LangGraph with a fan-out/fan-in pattern.
Full Solution + Rubric
Architecture: Hierarchical with fan-out/fan-in
Manager Agent
|
+-- Code Review Agent (parallel)
| - Analyzes diff for bugs, style, security
| - Tools: get_diff, search_codebase, lint
|
+-- Test Agent (parallel)
| - Runs test suite, analyzes failures
| - Tools: run_tests, get_coverage, analyze_failure
|
+-- Doc Agent (parallel)
| - Checks if docs match code changes
| - Tools: get_docs, compare_api_changes
|
+-- Synthesizer Agent (after all complete)
- Combines all reviews into a coherent summary
- Assigns overall risk level
- Provides approval recommendation
State Machine (LangGraph):
- Parse PR (get diff, files changed, PR description).
- Fan-out: Send to Code, Test, and Doc agents in parallel.
- Fan-in: Collect all three reviews.
- Synthesize: Combine into final review with risk assessment.
- Human approval gate: Present summary for human sign-off before posting.
Error handling:
- If a specialist agent fails, the synthesizer notes it as "not reviewed."
- Timeout per agent (5 minutes max).
- Retry once on failure, then skip with warning.
Scoring:
- Strong Hire: Parallel execution design, error handling, human-in-the-loop, specific tools per agent, and state machine definition.
- Lean Hire: Sequential design or missing error handling but reasonable agent decomposition.
- No Hire: Single agent that tries to do everything, or no concrete architecture.
Problem 3: Memory System Design
Your customer support agent handles 10K conversations per day. Users often return with follow-up questions days later. Design the memory system.
Hint 1 - Direction
You need both per-conversation memory and cross-conversation user memory. Think about what to store, how to retrieve, and how to handle stale information.
Hint 2 - Insight
Layer the memory: conversation buffer (short-term), user profile (structured long-term), conversation summaries (semantic long-term), and a resolution knowledge base (episodic).
Full Solution + Rubric
Memory Architecture:
- Conversation buffer (Redis, TTL 24h):
  - Full message history for active conversations.
  - Token count tracking for context window management.
- User profile (PostgreSQL):
  - Structured data: name, account type, subscription, past issues.
  - Updated after each conversation.
  - Retrieved at conversation start.
- Conversation summaries (Vector DB):
  - After each conversation ends, generate a summary.
  - Store with user ID, timestamp, topic tags, resolution status.
  - Retrieved when user returns: "Looks like we last spoke about X on date Y."
- Resolution knowledge base (Vector DB):
  - Successful resolutions stored as templates.
  - When a new issue matches a past resolution, suggest it.
  - Updated by support team (human-in-the-loop).
- Entity memory (Knowledge graph):
  - Track relationships: user owns product X, product X has known issue Y.
  - Update when new information is learned.
Retrieval strategy at conversation start:
- Fetch user profile (SQL lookup).
- Fetch last 5 conversation summaries (vector search + user filter).
- Inject into system prompt: "Returning user. Previous interactions: ..."
Memory hygiene:
- Conversation summaries older than 1 year are archived.
- PII is encrypted at rest.
- Users can request memory deletion (GDPR).
- Contradictions are resolved by recency (newer info wins).
Scoring:
- Strong Hire: Layered architecture with specific storage choices, retrieval strategy, PII handling, and staleness management.
- Lean Hire: Vector store + user profile but missing conversation summaries or staleness handling.
- No Hire: "Just use a vector store for everything."
Interview Cheat Sheet
| Topic | Key Fact | Typical Question |
|---|---|---|
| ReAct | Interleave reasoning traces with actions | "What is ReAct and why does it work?" |
| Function Calling | Structured tool invocations; well-defined schemas | "How do you design reliable tool definitions?" |
| MCP | Anthropic's open standard; JSON-RPC; tools + resources + prompts | "What is MCP and why does it matter?" |
| Multi-Agent | Hierarchical, peer, pipeline architectures | "When would you use multiple agents?" |
| LangGraph | Graph-based state machine; persistence; checkpointing | "How do you build a production agent?" |
| Planning | Plan-and-execute separates planning from execution | "Plan-and-execute vs. ReAct?" |
| Tree-of-Thought | Explore multiple reasoning paths; evaluate and prune | "When is ToT better than CoT?" |
| Memory (short) | Context window; sliding window or summarization | "How do you manage conversation memory?" |
| Memory (long) | Vector store + structured store; write side matters | "How do you build long-term agent memory?" |
| Episodic Memory | Past task traces with lessons learned | "How can agents learn from experience?" |
| Evaluation | Trajectory + outcome; LLM-as-judge + automated | "How do you evaluate agent quality?" |
| Safety | Least privilege; sandboxing; human-in-the-loop | "How do you make agents safe?" |
Spaced Repetition Checkpoints
Day 0 (Today)
- Explain the ReAct pattern with an example
- List 5 principles for designing good tool definitions
- Describe the three types of agent memory
- Name 4 common agent failure modes and their fixes
Day 3
- Compare AutoGen, CrewAI, and LangGraph architectures
- Explain plan-and-execute vs. ReAct trade-offs
- Describe MCP and its four capability types
- Design a permission system for an agent with database access
Day 7
- Design a multi-agent code review system
- Explain tree-of-thought and when to use it
- Describe episodic memory and how it enables agent learning
- List evaluation dimensions for agent quality
Day 14
- Whiteboard a complete agent architecture for a customer support system
- Explain how LangGraph handles persistence and checkpointing
- Design a memory system for an agent handling 10K daily conversations
- Discuss agent sandboxing strategies with trade-offs
Day 21
- Present a 30-minute deep dive on agent reliability engineering
- Compare MCP with alternative tool integration approaches
- Design an agent evaluation pipeline with trajectory scoring
- Critique a given multi-agent architecture and propose improvements
Cross-References
- RAG Systems - Retrieval as a tool in agent systems
- Prompt Engineering - Prompt design for agent system prompts and tool descriptions
- LLM Evaluation - Evaluation methods applicable to agent trajectories
- Inference Optimization - Optimizing multi-step agent latency
- Safety and Guardrails - Deep dive on agent safety mechanisms
- LLM Interview Questions Bank - Additional agent architecture questions
