
Agent Architectures

Agentic AI is one of the fastest-growing areas in production LLM systems. Interviewers want to know that you understand not just the patterns (ReAct, plan-and-execute) but also the engineering realities: how to make agents reliable, debuggable, and safe. This page covers the full spectrum from single-agent tool use to multi-agent orchestration.

Why Interviewers Care

Interviewer's Perspective

"Everyone can build a demo agent. I want to see whether you understand why agents fail in production, how to evaluate them, and how to architect systems that degrade gracefully. The difference between a demo and a product is reliability - and that is what I am hiring for."

1. Foundations: What Makes an Agent

An agent is an LLM-powered system that can:

  1. Observe its environment (read inputs, tool outputs, memory).
  2. Reason about what to do next.
  3. Act on the environment (call tools, write outputs).
  4. Loop until the task is complete or a stop condition is met.
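The four steps above can be sketched as a small loop. This is a minimal illustration, not a reference implementation: `run_agent`, `toy_decide`, and `toy_tools` are hypothetical names, and `toy_decide` stands in for the LLM call.

```python
from typing import Callable

def run_agent(task: str,
              decide: Callable[[list[str]], dict],
              tools: dict[str, Callable[[str], str]],
              max_steps: int = 5) -> str:
    """Observe-reason-act loop with a hard step limit as the stop condition."""
    observations = [f"task: {task}"]                  # 1. observe the input
    for _ in range(max_steps):
        decision = decide(observations)               # 2. reason about the next step
        if decision["type"] == "final":
            return decision["answer"]
        tool = tools[decision["tool"]]                # 3. act by calling a tool
        observations.append(tool(decision["input"]))  # feed the result back
    return "stopped: step limit reached"              # 4. stop condition

# Hypothetical stand-in for the LLM: look something up once, then answer.
def toy_decide(observations: list[str]) -> dict:
    if len(observations) == 1:
        return {"type": "tool", "tool": "lookup", "input": "capital of France"}
    return {"type": "final", "answer": observations[-1]}

toy_tools = {"lookup": lambda q: "Paris" if "France" in q else "unknown"}
answer = run_agent("What is the capital of France?", toy_decide, toy_tools)
print(answer)
```

The step limit matters as much as the loop itself: without it, a policy that never returns a final answer would run forever.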

[Figure: Agent Loop]

Agent vs. Chain vs. RAG

Feature | Chain | RAG | Agent
Fixed steps | Yes | Yes | No (dynamic)
Tool use | No | Retriever only | Multiple tools
Loops | No | No | Yes
Planning | No | No | Yes
Failure modes | Predictable | Predictable | Complex
Latency | Low | Medium | High (variable)

60-Second Answer

"An agent is an LLM system with a dynamic control flow - it decides at each step whether to call a tool, reason further, or return a final answer. Unlike chains (fixed sequence) or RAG (retrieve-then-generate), agents can loop, backtrack, and adapt their strategy. The core loop is observe-reason-act-repeat."

2. ReAct: Reasoning + Acting

The ReAct Pattern

ReAct (Yao et al., 2022) interleaves reasoning traces with actions in a single prompt:

Thought: I need to find the current stock price of NVIDIA.
Action: search("NVIDIA stock price today")
Observation: NVIDIA (NVDA) is trading at $892.43 as of March 6, 2026.
Thought: Now I need to calculate the market cap. I know they have about 24.5B shares outstanding.
Action: calculator("892.43 * 24500000000")
Observation: 21,864,535,000,000
Thought: The market cap is approximately $21.86 trillion. I have enough information to answer.
Final Answer: NVIDIA's current market cap is approximately $21.86 trillion.
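A runtime has to turn each `Action:` line in a trace like this into a real tool call and append the `Observation:` itself. The sketch below assumes the exact `tool("argument")` syntax shown above; `ACTION_RE`, `step`, and the toy `calculator` are illustrative names, not part of any library.

```python
import re
from decimal import Decimal

# Matches lines such as: Action: calculator("892.43 * 24500000000")
ACTION_RE = re.compile(r'^Action:\s*(\w+)\("(.*)"\)\s*$')

def calculator(expression: str) -> str:
    # Toy tool: handles only "a * b". Decimal avoids float rounding, and
    # nothing produced by the model is ever passed to eval().
    left, right = expression.split("*")
    return str(Decimal(left.strip()) * Decimal(right.strip()))

TOOLS = {"calculator": calculator}

def step(line: str) -> str:
    """Turn one model-emitted line into the next Observation, or pass it through."""
    match = ACTION_RE.match(line)
    if not match:
        return line  # Thought and Final Answer lines need no tool call
    name, argument = match.groups()
    if name not in TOOLS:
        return f"Observation: unknown tool {name!r}"
    return f"Observation: {TOOLS[name](argument)}"

print(step('Action: calculator("892.43 * 24500000000")'))
```

Returning an observation for an unknown tool, rather than raising, gives the model a chance to correct itself on the next step.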

Why ReAct Works

  1. Reasoning traces help the LLM decompose complex problems step by step.
  2. Actions ground the reasoning in real data, reducing hallucination.
  3. Observations provide feedback that guides subsequent reasoning.

ReAct vs. Chain-of-Thought vs. Act-Only

Approach | Reasoning | Acting | Accuracy | Interpretability
Act-only | No explicit | Yes | Low | Low
CoT-only | Yes | No | Medium (hallucinates facts) | High
ReAct | Yes | Yes | High | High

Common Trap

Candidates often conflate ReAct with "just prompting the LLM to use tools." ReAct's key insight is that explicit reasoning traces before each action dramatically improve tool selection accuracy and multi-step planning. Without the "Thought" step, agents make more errors in tool selection and argument construction.

3. Function Calling and Tool Use

Function Calling Mechanism

Modern LLMs support native function calling, where the model outputs a structured tool invocation rather than free text:

[Figure: Function Calling]

Tool Definition Best Practices

{
  "name": "search_database",
  "description": "Search the product database. Use this when the user asks about product availability, pricing, or specifications. Do NOT use for general knowledge questions.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "Natural language search query"
      },
      "category": {
        "type": "string",
        "enum": ["electronics", "clothing", "food", "all"],
        "description": "Product category to filter by. Use 'all' if unsure."
      },
      "max_results": {
        "type": "integer",
        "description": "Maximum results to return (1-20)",
        "default": 5
      }
    },
    "required": ["query"]
  }
}

Key principles:

  1. Descriptive names: search_database not tool_1.
  2. When-to-use guidance: Tell the model when to use and when NOT to use the tool.
  3. Parameter descriptions: Each parameter should explain what it does and valid values.
  4. Sensible defaults: Reduce the decisions the model needs to make.
  5. Constrained types: Use enums, ranges, and required fields to prevent invalid calls.
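Constraints only help if the application enforces them before executing a call. The sketch below is a hand-rolled check for the `search_database` parameters, using only the standard library. The `SCHEMA` layout and its `min`/`max` keys are simplifications assumed for this example, not JSON Schema keywords; a production system would more likely use a JSON Schema validator.

```python
SCHEMA = {
    "required": ["query"],
    "properties": {
        "query": {"type": str},
        "category": {"type": str, "enum": ["electronics", "clothing", "food", "all"]},
        "max_results": {"type": int, "default": 5, "min": 1, "max": 20},
    },
}

def validate_call(args: dict, schema: dict = SCHEMA) -> dict:
    """Return args with defaults applied, or raise ValueError for an invalid call."""
    props = schema["properties"]
    for name in schema["required"]:
        if name not in args:
            raise ValueError(f"missing required argument: {name}")
    # Start from the declared defaults, then overlay the model's arguments.
    clean = {k: spec["default"] for k, spec in props.items() if "default" in spec}
    for name, value in args.items():
        spec = props.get(name)
        if spec is None:
            raise ValueError(f"unknown argument: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, spec["type"]):
            raise ValueError(f"{name} must be of type {spec['type'].__name__}")
        if "enum" in spec and value not in spec["enum"]:
            raise ValueError(f"{name} must be one of {spec['enum']}")
        if "min" in spec and not spec["min"] <= value <= spec["max"]:
            raise ValueError(f"{name} must be between {spec['min']} and {spec['max']}")
        clean[name] = value
    return clean

print(validate_call({"query": "usb cable"}))
```

Returning the validation error text to the model as the tool result often lets it repair the call on the next turn.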

Parallel Tool Calling

Many APIs let the model request several tool calls in a single turn:

User: Compare the weather in Tokyo and London.
Assistant: [
  {"tool": "get_weather", "args": {"city": "Tokyo"}},
  {"tool": "get_weather", "args": {"city": "London"}}
]

This reduces latency by executing independent tool calls concurrently.
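On the application side, concurrency is the caller's job: the model only emits the list of calls. A minimal sketch with `asyncio.gather`, where `get_weather` is a stub that sleeps instead of calling a real weather API and `run_tool_calls` is a hypothetical helper name:

```python
import asyncio

async def get_weather(city: str) -> dict:
    await asyncio.sleep(0.1)  # simulated network latency
    return {"city": city, "forecast": "placeholder"}

TOOLS = {"get_weather": get_weather}

async def run_tool_calls(calls: list[dict]) -> list[dict]:
    # Start every call first, then await them together; results keep call order.
    pending = [TOOLS[call["tool"]](**call["args"]) for call in calls]
    return await asyncio.gather(*pending)

calls = [
    {"tool": "get_weather", "args": {"city": "Tokyo"}},
    {"tool": "get_weather", "args": {"city": "London"}},
]
results = asyncio.run(run_tool_calls(calls))
print(results)
```

With two 0.1-second calls, the batch takes roughly 0.1 seconds instead of 0.2. This only applies when the calls are independent; a call that needs another call's result still has to wait for it.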

60-Second Answer

"Function calling lets the LLM output structured tool invocations instead of free text. The key to reliable function calling is well-designed tool definitions - clear names, descriptions that specify when to use the tool, constrained parameter types, and sensible defaults. Parallel tool calling further reduces latency by executing independent calls concurrently."

4. Model Context Protocol (MCP)

What Is MCP

MCP (Anthropic, 2024) is an open standard for connecting LLMs to external tools and data sources. It defines a client-server protocol where:

  • MCP Servers expose tools, resources, and prompts via a standardized interface.
  • MCP Clients (LLM applications) discover and invoke these capabilities.
  • Transport: JSON-RPC 2.0 over stdio or HTTP (HTTP+SSE in the original specification; later revisions define a Streamable HTTP transport).
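To make the wire format concrete, the sketch below builds two JSON-RPC 2.0 requests of the kind an MCP client sends. The method names `tools/list` and `tools/call` follow the MCP specification as I understand it. The `query_database` tool and its `sql` argument are hypothetical, and the initialization handshake that precedes these calls is omitted.

```python
import json
from itertools import count
from typing import Optional

_ids = count(1)  # JSON-RPC requests carry an id so responses can be matched

def jsonrpc_request(method: str, params: Optional[dict] = None) -> str:
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)

# Discover which tools the server exposes.
list_tools = jsonrpc_request("tools/list")

# Invoke one tool by name with structured arguments.
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "query_database", "arguments": {"sql": "SELECT 1"}},
)

print(list_tools)
print(call_tool)
```

Over the stdio transport these messages are written to the server process's standard input; over HTTP they are sent as request bodies.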

[Figure: MCP Architecture]

MCP Capabilities

Capability | Description | Example
Tools | Functions the LLM can call | query_database, send_email
Resources | Data the LLM can read | File contents, database schemas
Prompts | Pre-built prompt templates | "Summarize this codebase"
Sampling | Server-initiated LLM calls | Server asks LLM to analyze a result

Why MCP Matters

  1. Standardization: One protocol to connect any LLM to any tool.
  2. Composability: Mix and match servers (database + file system + API).
  3. Security: Servers control access; clients request capabilities.
  4. Ecosystem: Growing library of pre-built MCP servers.

Company Variation

Anthropic expects deep MCP knowledge - it is their protocol. OpenAI may ask you to compare MCP with their plugin/function calling approach. Startups care about practical integration - how to build and deploy MCP servers. Google may ask about comparison with their Genkit tooling.

5. Multi-Agent Systems

Why Multiple Agents

Single agents struggle with:

  • Complex tasks requiring diverse expertise (research + coding + review).
  • Long-running tasks where context windows overflow.
  • Reliability - a single point of failure.

Multi-agent systems decompose work across specialized agents that communicate and coordinate.

Architectures

[Figure: Multi-Agent Architectures]

Architecture | When to Use | Strengths | Weaknesses
Hierarchical | Complex multi-domain tasks | Clear accountability, scalable | Single point of failure at manager
Peer-to-peer | Debate, brainstorming | Diverse perspectives | Coordination overhead, circular discussions
Pipeline | Sequential refinement | Predictable flow | No parallelism, long latency
Dynamic | Unpredictable tasks | Flexible | Hard to debug

Framework Comparison

Framework | Architecture | Key Feature | Best For
AutoGen | Flexible (conversation-based) | Agent-to-agent chat | Research, prototyping
CrewAI | Role-based hierarchical | Role + goal + backstory per agent | Business workflows
LangGraph | Graph-based state machine | Explicit state and transitions | Production systems
OpenAI Swarm | Handoff-based | Lightweight agent transfers | Customer service routing
Anthropic MCP | Server-based tool providers | Standardized protocol | Tool integration

LangGraph Deep Dive

LangGraph models agents as state machines with explicit nodes and edges:

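from typing import TypedDict

from langgraph.graph import StateGraph, END

# Define state shared by all nodes
class AgentState(TypedDict):
    messages: list
    plan: str
    results: list

# Define nodes (each is a function that returns a partial state update)
def planner(state: AgentState) -> dict:
    # LLM call to create a plan (stubbed)
    return {"plan": "..."}

def executor(state: AgentState) -> dict:
    # Execute the plan step by step (stubbed)
    return {"results": state["results"] + ["..."]}

def reviewer(state: AgentState) -> dict:
    # Check quality and record the verdict (stubbed)
    return {"messages": state["messages"] + ["review: ..."]}

def should_continue(state: AgentState) -> str:
    # Routing function: must return a key of the mapping passed below
    return "end" if state["results"] else "replan"

# Build graph (node names must not collide with state keys such as "plan")
graph = StateGraph(AgentState)
graph.add_node("planner", planner)
graph.add_node("executor", executor)
graph.add_node("reviewer", reviewer)

graph.set_entry_point("planner")
graph.add_edge("planner", "executor")
graph.add_edge("executor", "reviewer")
graph.add_conditional_edges(
    "reviewer",
    should_continue,  # returns "replan" or "end"
    {"replan": "planner", "end": END},
)

app = graph.compile()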

Why LangGraph for production:

  • Persistence: State can be saved and resumed (human-in-the-loop).
  • Streaming: Token-level streaming from any node.
  • Debugging: Full state history and visualization.
  • Checkpointing: Roll back to any previous state.

60-Second Answer

"For production multi-agent systems, I favor LangGraph because it models agents as explicit state machines with typed state, conditional edges, and built-in persistence. AutoGen is great for research prototyping with its conversation-based approach. CrewAI excels at business workflows with its role-based abstraction. The key architectural decision is whether you need a hierarchical controller, peer-to-peer communication, or a sequential pipeline."

6. Planning Strategies

Task Decomposition

Break complex tasks into manageable sub-tasks:

[Figure: Task Decomposition]

Plan-and-Execute

Separate the planning step from execution:

  1. Planner: Generate a full plan before taking any action.
  2. Executor: Execute each step, feeding results back.
  3. Replanner: After each step, optionally revise the remaining plan based on new information.
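The three roles above can be sketched as a plain control loop. All names here (`plan_and_execute`, the toy planner, executor, and replanner) are hypothetical; in a real system each would wrap an LLM or tool call.

```python
def plan_and_execute(goal, planner, executor, replanner, max_steps=10):
    """Run a plan step by step, letting the replanner revise what remains."""
    plan = planner(goal)                    # full plan before any action
    done = []                               # (step, result) pairs act as a progress tracker
    while plan and len(done) < max_steps:
        current = plan.pop(0)
        result = executor(current)
        done.append((current, result))
        plan = replanner(goal, done, plan)  # optionally revise the remaining steps
    return done

# Toy components: the primary data source fails, so the replanner adds a fallback.
def toy_planner(goal):
    return ["fetch primary", "summarize"]

def toy_executor(current):
    return "error" if current == "fetch primary" else "ok"

def toy_replanner(goal, done, remaining):
    last_step, last_result = done[-1]
    if last_step == "fetch primary" and last_result == "error":
        return ["fetch backup"] + remaining
    return remaining

history = plan_and_execute("weekly report", toy_planner, toy_executor, toy_replanner)
print(history)
```

A human approval step fits naturally between `planner(goal)` and the loop, which is the advantage noted in the list that follows.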

Advantages over pure ReAct:

  • More coherent multi-step strategies.
  • The plan provides a progress tracker.
  • Easier to add human approval at the plan stage.

Disadvantages:

  • Initial plan may be wrong - requires replanning.
  • Slower to start (must plan before acting).
  • Planning LLM call may be expensive.

Tree-of-Thought (ToT)

Explore multiple reasoning paths simultaneously:

[Figure: Tree-of-Thought for Agents]

Key components:

  1. Thought generator: Propose multiple candidate next steps.
  2. Evaluator: Score each candidate (LLM-as-judge or heuristic).
  3. Search strategy: BFS (explore breadth) or DFS (explore depth).
  4. Pruning: Discard low-scoring branches early.
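The four components map onto a breadth-first beam search. The sketch below uses a toy arithmetic puzzle (reach a target number using "+3" and "*2") in place of LLM-generated thoughts. `propose` plays the thought generator, `score` the evaluator, and `beam_width` controls pruning. All of these names are illustrative.

```python
def propose(value: int) -> list[tuple[str, int]]:
    # Thought generator: candidate next steps from the current state.
    return [("+3", value + 3), ("*2", value * 2)]

def score(value: int, target: int) -> int:
    # Evaluator: a heuristic here; an LLM judge would play this role in practice.
    return -abs(target - value)

def tree_of_thought(start: int, target: int, beam_width: int = 2, max_depth: int = 6):
    frontier = [(start, [])]                 # (state, path of steps taken)
    for _ in range(max_depth):               # BFS: expand one level at a time
        candidates = []
        for value, path in frontier:
            for label, nxt in propose(value):
                if nxt == target:
                    return path + [label]
                candidates.append((nxt, path + [label]))
        # Pruning: keep only the highest-scoring branches.
        candidates.sort(key=lambda c: score(c[0], target), reverse=True)
        frontier = candidates[:beam_width]
    return None

path = tree_of_thought(start=1, target=14)
print(path)
```

With `beam_width=1` this degenerates into greedy search, which is why pruning too aggressively can discard the branch that would have succeeded.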

When to use ToT:

  • Problems with multiple valid approaches (math, coding, planning).
  • When the cost of exploring is less than the cost of backtracking.
  • NOT for simple, well-defined tasks (overkill).

Reflection and Self-Critique

After each action or plan step, the agent reflects on its work:

Action: Wrote function to parse CSV file.
Reflection: The function doesn't handle quoted commas or
encoding issues. It also lacks error handling for missing
files. I should revise it before moving on.
Revised Action: Rewrote function with proper CSV parsing,
UTF-8 handling, and try/except blocks.

Reflexion (Shinn et al., 2023) formalizes this as a loop:

  1. Act on the task.
  2. Evaluate the result (test, LLM judge, or environment feedback).
  3. If failed, generate a verbal reflection on what went wrong.
  4. Store the reflection in memory and retry with that context.
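A minimal sketch of that loop, reusing the CSV-parsing scenario from the example above. The function names are hypothetical, and the `act`, `evaluate`, and `reflect` stubs stand in for LLM calls and a test harness.

```python
import csv

def reflexion_loop(task, act, evaluate, reflect, max_attempts=3):
    reflections = []                         # verbal memory carried across attempts
    for _ in range(max_attempts):
        output = act(task, reflections)      # 1. act
        ok, feedback = evaluate(output)      # 2. evaluate
        if ok:
            return output, reflections
        reflections.append(reflect(output, feedback))  # 3-4. reflect, store, retry
    return None, reflections

def act(line, reflections):
    # First attempt is naive; a stored reflection switches to a real CSV parser.
    if any("quoted" in note for note in reflections):
        return next(csv.reader([line]))
    return line.split(",")

def evaluate(fields):
    ok = fields == ["a", "b,c", "d"]
    return ok, "" if ok else f"expected 3 fields, got {len(fields)}"

def reflect(output, feedback):
    return f"Naive split broke on quoted commas ({feedback}); use a CSV parser."

result, notes = reflexion_loop('a,"b,c",d', act, evaluate, reflect)
print(result, notes)
```

The important design point is that the reflection is text stored in memory, not a weight update: the second attempt succeeds only because the note is fed back into `act`.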

7. Memory Systems

Memory Types

[Figure: Agent Memory Types]

Memory Type | Storage | Retrieval | Persistence | Use Case
Short-term | Context window | Implicit (in prompt) | Session only | Current conversation
Long-term (semantic) | Vector DB | Similarity search | Persistent | Knowledge base, past conversations
Long-term (structured) | Graph DB / SQL | Query | Persistent | User profiles, relationships
Episodic | Vector DB + metadata | Similarity + filter | Persistent | Learning from past tasks

Short-Term Memory Management

The context window is finite. Strategies for managing it:

  1. Sliding window: Keep only the last N messages. Simple but loses early context.
  2. Summarization: Periodically summarize older messages into a compact summary.
  3. Token counting: Track token usage and compress when approaching the limit.
  4. Importance scoring: Keep high-importance messages (tool results, key decisions) and drop filler.
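The sketch below combines three of these strategies: it counts tokens, and once the budget is exceeded it summarizes everything except a sliding window of recent messages. Whitespace splitting is a crude stand-in for a real tokenizer, and `summarize` is a stub where an LLM call would go. Both are assumptions of this example.

```python
def count_tokens(text: str) -> int:
    return len(text.split())  # crude proxy; use the model's tokenizer in practice

def summarize(messages: list[str]) -> str:
    return f"[summary of {len(messages)} earlier messages]"  # LLM call in practice

def compress_history(messages: list[str], max_tokens: int, keep_last: int) -> list[str]:
    """Summarize older messages once the history exceeds the token budget."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    total = sum(count_tokens(m) for m in messages)
    if total <= max_tokens or len(messages) <= keep_last:
        return messages
    older, recent = messages[:-keep_last], messages[-keep_last:]
    return [summarize(older)] + recent

history = ["one two three", "four five", "six", "seven eight"]
print(compress_history(history, max_tokens=5, keep_last=2))
```

Importance scoring would slot in where `older` is built, by exempting tool results and key decisions from summarization.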

Long-Term Memory with Vector Stores

# `embed` and `vector_store` are placeholders for an embedding model and a
# vector database client; method names vary by library.

# Store a memory
memory_text = "User prefers Python over JavaScript for backend tasks."
embedding = embed(memory_text)
vector_store.upsert(id="mem_001", vector=embedding,
                    metadata={"type": "preference", "user": "alice"})

# Retrieve relevant memories
query = "What language should I use for the API?"
query_embedding = embed(query)
results = vector_store.search(query_embedding, top_k=5,
                              filter={"user": "alice"})
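To see the upsert, filter, and search mechanics end to end without external services, here is a toy in-memory version. `TinyVectorStore` and the bag-of-words `embed` are illustrative stand-ins invented for this sketch. Real embeddings capture meaning, whereas this one only matches shared words.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: word counts. A real system would call an embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class TinyVectorStore:
    def __init__(self):
        self.rows = {}

    def upsert(self, id, vector, metadata):
        self.rows[id] = (vector, metadata)   # same id overwrites: the "update" path

    def search(self, vector, top_k=5, filter=None):
        filter = filter or {}
        hits = [
            (cosine(vector, stored), id)
            for id, (stored, meta) in self.rows.items()
            if all(meta.get(k) == v for k, v in filter.items())  # metadata pre-filter
        ]
        hits.sort(reverse=True)
        return [id for _, id in hits[:top_k]]

store = TinyVectorStore()
store.upsert("mem_001", embed("User prefers Python over JavaScript for backend tasks."), {"user": "alice"})
store.upsert("mem_002", embed("User prefers Go for backend services."), {"user": "bob"})
store.upsert("mem_003", embed("User is allergic to peanuts."), {"user": "alice"})

results = store.search(embed("Which language for backend tasks?"), top_k=5, filter={"user": "alice"})
print(results)
```

The metadata filter runs before ranking, so another user's memories can never appear in the results regardless of similarity.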

Episodic Memory

Store traces of past agent executions to learn from experience:

{
  "task": "Deploy ML model to production",
  "outcome": "success",
  "steps_taken": 12,
  "key_decisions": [
    "Used Docker instead of bare metal - faster iteration",
    "Added health check endpoint - caught OOM early"
  ],
  "mistakes": [
    "Initially forgot to set resource limits - caused OOM"
  ],
  "lessons": [
    "Always set CPU/memory limits in container configs"
  ]
}

When facing a similar task, the agent retrieves relevant episodes and incorporates lessons learned into its planning.

Common Trap

Candidates often describe memory as "just a vector store." Interviewers want to hear about the write side (what to store, when to update, how to handle contradictions) as much as the read side (retrieval). Memory management - deciding what to remember and what to forget - is the hard engineering problem.

8. Agent Evaluation and Debugging

Why Agent Evaluation Is Hard

  1. Non-deterministic: Same input can produce different action sequences.
  2. Multi-step: Errors compound over multiple steps.
  3. Tool-dependent: Tool failures are not the agent's fault but affect outcomes.
  4. Subjective: "Good" agent behavior is often domain-specific.

Evaluation Dimensions

Dimension | Metric | How to Measure
Task completion | Success rate | Automated tests, human evaluation
Efficiency | Steps taken, tokens used, latency | Instrumentation
Tool accuracy | Correct tool selection, valid arguments | Log analysis
Reasoning quality | Coherence of thought chains | LLM-as-judge
Safety | No harmful actions, respects permissions | Red teaming
Cost | Total API cost per task | Token tracking

Evaluation Frameworks

[Figure: Agent Evaluation Frameworks]

Trajectory evaluation (not just outcome):

  • Did the agent take reasonable steps?
  • Did it recover from errors?
  • Did it avoid unnecessary tool calls?
  • Did it use the right tools for the right reasons?
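Some of these questions can be answered mechanically from a logged trajectory. The sketch below assumes a hypothetical step format (dicts with `type`, `tool`, `input`, and an optional `error` flag). It checks the step budget, tool allowlist, redundant calls, and error recovery. Whether a tool was used "for the right reasons" still needs a human or LLM judge.

```python
def score_trajectory(steps: list[dict], allowed_tools: set[str], max_steps: int) -> dict:
    calls = [s for s in steps if s["type"] == "tool"]
    # Only successful calls count toward redundancy, so retries are not penalized.
    ok_calls = [(s["tool"], s["input"]) for s in calls if not s.get("error")]
    error_idx = [i for i, s in enumerate(steps) if s.get("error")]
    return {
        "within_budget": len(steps) <= max_steps,
        "wrong_tool_calls": sum(s["tool"] not in allowed_tools for s in calls),
        "redundant_calls": len(ok_calls) - len(set(ok_calls)),
        # Recovered means some later step succeeded after each error.
        "recovered_from_errors": all(
            any(not later.get("error") for later in steps[i + 1:]) for i in error_idx
        ),
    }

trajectory = [
    {"type": "tool", "tool": "search", "input": "q1", "error": True},
    {"type": "tool", "tool": "search", "input": "q1"},
    {"type": "tool", "tool": "calculator", "input": "2+2"},
    {"type": "final"},
]
report = score_trajectory(trajectory, allowed_tools={"search", "calculator"}, max_steps=6)
print(report)
```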

Debugging Techniques

  1. Trace logging: Log every LLM call, tool call, and state transition.
  2. State snapshots: Save the full state at each step for replay.
  3. Step-through execution: Pause after each step for human inspection.
  4. Counterfactual analysis: "What if the agent had chosen a different tool at step 3?"
  5. Failure categorization: Classify failures as planning errors, tool errors, or reasoning errors.
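Trace logging is the technique the others depend on. One lightweight way to get it is a decorator that records every call, its result or error, and its duration. `TRACE`, `traced`, and the two toy functions are hypothetical names; a production system would write to a tracing backend rather than a list.

```python
import functools
import time

TRACE: list[dict] = []  # in-memory trace; replace with a real log sink

def traced(kind: str):
    """Record each call to the wrapped function, including failures."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            record = {"kind": kind, "name": fn.__name__, "args": args, "kwargs": kwargs}
            start = time.perf_counter()
            try:
                record["result"] = fn(*args, **kwargs)
                return record["result"]
            except Exception as exc:
                record["error"] = repr(exc)   # keep the failure for categorization
                raise
            finally:
                record["seconds"] = time.perf_counter() - start
                TRACE.append(record)
        return wrapper
    return decorator

@traced("tool")
def search(query: str) -> str:
    return f"results for {query}"

@traced("tool")
def flaky_api(payload: str) -> str:
    raise TimeoutError("upstream timed out")

search("agent evaluation")
try:
    flaky_api("ping")
except TimeoutError:
    pass

for entry in TRACE:
    print(entry["name"], entry.get("result"), entry.get("error"))
```

Because errors are recorded with the same structure as successes, failure categorization becomes a query over the trace rather than a manual log read.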

Common Agent Failure Modes

Failure | Symptom | Fix
Infinite loop | Agent repeats the same action | Max iteration limit, loop detection
Wrong tool selection | Agent uses search when it should use calculator | Better tool descriptions, few-shot examples
Argument hallucination | Agent makes up tool arguments | Constrained schemas, validation
Context overflow | Agent loses track of earlier steps | Summarization, scratchpad
Premature termination | Agent stops before completing the task | Completion criteria in prompt
Over-planning | Agent plans extensively but never acts | Planning budget, force action after N thoughts
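The first fix, an iteration limit plus loop detection, is simple enough to sketch. `LoopGuard` is a hypothetical helper and the thresholds are arbitrary: call `check` before every tool invocation and treat the exception as a signal to stop or escalate.

```python
import json
from collections import Counter

class LoopGuard:
    def __init__(self, max_steps: int = 20, max_repeats: int = 2):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen = Counter()

    def check(self, tool: str, args: dict) -> None:
        """Raise RuntimeError if the agent exceeds its budget or repeats itself."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise RuntimeError("max iteration limit reached")
        # Canonical key: the same tool with the same arguments counts as a repeat.
        key = (tool, json.dumps(args, sort_keys=True))
        self.seen[key] += 1
        if self.seen[key] > self.max_repeats:
            raise RuntimeError(f"loop detected: {tool} repeated with identical arguments")

guard = LoopGuard(max_steps=10, max_repeats=2)
guard.check("search", {"q": "nvidia stock"})
guard.check("search", {"q": "nvidia stock"})
try:
    guard.check("search", {"q": "nvidia stock"})
except RuntimeError as exc:
    print(exc)
```

Exact-match detection misses loops where the agent rephrases its arguments slightly, so the hard step limit remains the backstop.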

Interviewer's Perspective

"I care more about how you debug agents than how you build them. Walk me through how you would diagnose why an agent that works 90% of the time fails on the other 10%. Show me your systematic approach to failure analysis."

9. Safety: Sandboxing and Permission Systems

Why Agent Safety Matters

Agents can:

  • Execute arbitrary code.
  • Access databases and file systems.
  • Call external APIs with side effects.
  • Spend money (API calls, purchases).
  • Leak sensitive information.

Defense in Depth

[Figure: Defense in Depth for Agents]

Permission Systems

Principle of least privilege: Agents should only have access to the tools and data they need for the current task.

Permission Level | Description | Example
Read-only | Can read data, no side effects | Search, lookup
Write with approval | Can propose writes, human approves | File edits, emails
Write with limits | Can write within constraints | Under 100 lines, approved directories
Full access | Unrestricted (dangerous) | Admin tasks, emergency response
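A permission check belongs in the tool dispatcher, not in the prompt. The sketch below covers the first two levels with a deny-by-default policy table. `TOOL_POLICY`, `authorize`, and the `approve` callback (which would prompt a human) are assumptions of this example.

```python
from typing import Callable

# Each tool is registered with the minimum policy it needs (least privilege).
TOOL_POLICY = {
    "search": "read_only",
    "send_email": "write_with_approval",
}

def authorize(tool: str, approve: Callable[[str], bool]) -> bool:
    """Decide whether a tool call may proceed; unknown tools are denied."""
    policy = TOOL_POLICY.get(tool)
    if policy == "read_only":
        return True
    if policy == "write_with_approval":
        return approve(tool)  # human-in-the-loop gate
    return False

def reject_all(tool: str) -> bool:
    return False

def approve_all(tool: str) -> bool:
    return True

print(authorize("search", reject_all))      # read-only needs no approval
print(authorize("send_email", reject_all))  # blocked without approval
print(authorize("delete_db", approve_all))  # unregistered tool is denied
```

Denying unregistered tools by default means a newly added tool is unusable until someone assigns it a level deliberately.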

Sandboxing Strategies

  1. Container isolation: Run code in Docker containers with resource limits.
  2. VM isolation: Separate virtual machines or hosted sandboxes for untrusted code (e.g., E2B, Modal).
  3. API allowlisting: Only permit calls to approved API endpoints.
  4. File system restrictions: Chroot or namespace isolation.
  5. Network restrictions: Block outbound network access by default.
  6. Time and resource limits: Kill processes that exceed CPU, memory, or time budgets.
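As a small illustration of item 6 only, the sketch below runs generated Python in a child process with a wall-clock timeout. This is not a sandbox: the child still has the parent's file system and network access, so in practice it would sit inside the container or VM layers listed above. `run_untrusted` is a hypothetical helper name.

```python
import subprocess
import sys

def run_untrusted(code: str, timeout_s: float = 2.0) -> dict:
    """Run code in a separate interpreter and kill it if it exceeds the time budget."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores user site/env
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "stdout": ""}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout}

print(run_untrusted("print(2 + 2)"))
print(run_untrusted("while True: pass", timeout_s=0.5))
```

CPU and memory limits need operating-system support (for example cgroups via a container runtime), which is why the time limit alone is shown here.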

Human-in-the-Loop Patterns

Pattern | When to Use | Trade-off
Approve every action | High-stakes, early deployment | Safe but slow
Approve high-risk actions only | Medium-stakes, established tools | Balanced
Notify and proceed | Low-stakes, trusted agent | Fast but less control
Full autonomy with audit | Well-tested, low-risk tasks | Fastest but requires monitoring

Instant Rejection

"We can just let the agent run freely and review the outputs later." This shows a fundamental misunderstanding of agent safety. Agents with write access can cause irreversible damage - deleting data, sending emails, or leaking secrets. Defense in depth is not optional.

Practice Problems

Problem 1: Design an Agent Architecture

Design an agent that can help a data analyst write SQL queries, execute them against a database, visualize results, and iterate based on feedback. What tools, memory, and safety mechanisms would you include?

Hint 1 - Direction

Think about the tools needed (SQL executor, chart generator), memory (conversation + past queries), and safety (read-only DB access, query validation).

Hint 2 - Insight

Consider a plan-and-execute architecture with a reviewer step. The SQL executor should be sandboxed (read-only, query timeout, result size limits). Memory should include the schema and past successful queries.

Full Solution + Rubric

Architecture: Plan-and-Execute with LangGraph

Tools:

  1. get_schema(table_name) - Returns table schema (columns, types, relationships).
  2. execute_sql(query) - Runs SQL query against read-only replica.
  3. create_chart(data, chart_type, title) - Generates visualization.
  4. explain_query(query) - Returns query execution plan.

Memory:

  • Short-term: Conversation history + current query context.
  • Long-term: Database schema (cached), past successful queries (vector store for semantic search).
  • Working memory: Current data results, intermediate analysis.

Safety:

  • Read-only database replica (no INSERT/UPDATE/DELETE).
  • Query timeout (30 seconds max).
  • Result size limit (10K rows max).
  • SQL injection prevention (parameterized queries for user-supplied values; reject anything other than a single SELECT statement).
  • PII detection on output before showing to user.

Flow:

  1. User describes what they want to analyze.
  2. Planner creates a multi-step SQL analysis plan.
  3. For each step: generate SQL, explain the plan, execute, validate results.
  4. Reviewer checks if results answer the question.
  5. If not, replan with the new information.
  6. Generate visualization and narrative summary.

Scoring:

  • Strong Hire: Complete architecture with tools, memory, safety, and error handling. Discusses read-only replicas, query optimization, and how to handle ambiguous user requests.
  • Lean Hire: Reasonable architecture but missing safety mechanisms or memory strategy.
  • No Hire: Just describes "an LLM that writes SQL" without agent architecture.

Problem 2: Multi-Agent Coordination

You need to build a system that reviews pull requests: checks code quality, runs tests, verifies documentation, and provides a summary. Design the multi-agent architecture.

Hint 1 - Direction

Consider which tasks can run in parallel (code review, test running, doc checking) and which are sequential (summary must come last).

Hint 2 - Insight

A manager agent with three specialist agents (code reviewer, test runner, doc checker) followed by a synthesizer. Use LangGraph with a fan-out/fan-in pattern.

Full Solution + Rubric

Architecture: Hierarchical with fan-out/fan-in

Manager Agent
|
+-- Code Review Agent (parallel)
|     - Analyzes diff for bugs, style, security
|     - Tools: get_diff, search_codebase, lint
|
+-- Test Agent (parallel)
|     - Runs test suite, analyzes failures
|     - Tools: run_tests, get_coverage, analyze_failure
|
+-- Doc Agent (parallel)
|     - Checks if docs match code changes
|     - Tools: get_docs, compare_api_changes
|
+-- Synthesizer Agent (after all complete)
      - Combines all reviews into a coherent summary
      - Assigns overall risk level
      - Provides approval recommendation

State Machine (LangGraph):

  1. Parse PR (get diff, files changed, PR description).
  2. Fan-out: Send to Code, Test, and Doc agents in parallel.
  3. Fan-in: Collect all three reviews.
  4. Synthesize: Combine into final review with risk assessment.
  5. Human approval gate: Present summary for human sign-off before posting.

Error handling:

  • If a specialist agent fails, the synthesizer notes it as "not reviewed."
  • Timeout per agent (5 minutes max).
  • Retry once on failure, then skip with warning.
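The fan-out/fan-in flow and these error-handling rules can be sketched with plain `asyncio`, independent of any agent framework. The three specialist coroutines are stubs with hard-coded behavior. All names are hypothetical, and the human approval gate is omitted.

```python
import asyncio

async def run_specialist(name, agent, pr, timeout_s=300.0, retries=1):
    """Run one specialist with a timeout; retry once, then mark as not reviewed."""
    for _ in range(retries + 1):
        try:
            return name, await asyncio.wait_for(agent(pr), timeout_s)
        except Exception:  # includes timeouts raised by wait_for
            continue
    return name, "not reviewed"

async def review_pr(pr, specialists, synthesize):
    # Fan-out: all specialists run concurrently. Fan-in: gather collects them.
    results = await asyncio.gather(
        *(run_specialist(name, agent, pr) for name, agent in specialists.items())
    )
    return synthesize(dict(results))

attempts = {"tests": 0}

async def code_agent(pr):
    return "2 style issues"

async def flaky_tests_agent(pr):
    attempts["tests"] += 1
    if attempts["tests"] == 1:
        raise RuntimeError("runner crashed")  # first attempt fails, retry succeeds
    return "all tests pass"

async def broken_docs_agent(pr):
    raise RuntimeError("docs service down")   # fails on every attempt

specialists = {"code": code_agent, "tests": flaky_tests_agent, "docs": broken_docs_agent}
summary = asyncio.run(review_pr("PR-demo", specialists, lambda reviews: reviews))
print(summary)
```

Because each specialist converts its own failure into a "not reviewed" result, one broken agent degrades the summary instead of failing the whole review.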

Scoring:

  • Strong Hire: Parallel execution design, error handling, human-in-the-loop, specific tools per agent, and state machine definition.
  • Lean Hire: Sequential design or missing error handling but reasonable agent decomposition.
  • No Hire: Single agent that tries to do everything, or no concrete architecture.

Problem 3: Memory System Design

Your customer support agent handles 10K conversations per day. Users often return with follow-up questions days later. Design the memory system.

Hint 1 - Direction

You need both per-conversation memory and cross-conversation user memory. Think about what to store, how to retrieve, and how to handle stale information.

Hint 2 - Insight

Layer the memory: conversation buffer (short-term), user profile (structured long-term), conversation summaries (semantic long-term), and a resolution knowledge base (episodic).

Full Solution + Rubric

Memory Architecture:

  1. Conversation buffer (Redis, TTL 24h):

    • Full message history for active conversations.
    • Token count tracking for context window management.
  2. User profile (PostgreSQL):

    • Structured data: name, account type, subscription, past issues.
    • Updated after each conversation.
    • Retrieved at conversation start.
  3. Conversation summaries (Vector DB):

    • After each conversation ends, generate a summary.
    • Store with user ID, timestamp, topic tags, resolution status.
    • Retrieved when user returns: "Looks like we last spoke about X on date Y."
  4. Resolution knowledge base (Vector DB):

    • Successful resolutions stored as templates.
    • When a new issue matches a past resolution, suggest it.
    • Updated by support team (human-in-the-loop).
  5. Entity memory (Knowledge graph):

    • Track relationships: user owns product X, product X has known issue Y.
    • Update when new information is learned.

Retrieval strategy at conversation start:

  1. Fetch user profile (SQL lookup).
  2. Fetch up to 5 conversation summaries for this user (user-ID filter, ranked by recency or by vector similarity to the opening message).
  3. Inject into system prompt: "Returning user. Previous interactions: ..."

Memory hygiene:

  • Conversation summaries older than 1 year are archived.
  • PII is encrypted at rest.
  • Users can request memory deletion (GDPR).
  • Contradictions are resolved by recency (newer info wins).

Scoring:

  • Strong Hire: Layered architecture with specific storage choices, retrieval strategy, PII handling, and staleness management.
  • Lean Hire: Vector store + user profile but missing conversation summaries or staleness handling.
  • No Hire: "Just use a vector store for everything."

Interview Cheat Sheet

Topic | Key Fact | Typical Question
ReAct | Interleave reasoning traces with actions | "What is ReAct and why does it work?"
Function Calling | Structured tool invocations; well-defined schemas | "How do you design reliable tool definitions?"
MCP | Anthropic's open standard; JSON-RPC; tools + resources + prompts | "What is MCP and why does it matter?"
Multi-Agent | Hierarchical, peer, pipeline architectures | "When would you use multiple agents?"
LangGraph | Graph-based state machine; persistence; checkpointing | "How do you build a production agent?"
Planning | Plan-and-execute separates planning from execution | "Plan-and-execute vs. ReAct?"
Tree-of-Thought | Explore multiple reasoning paths; evaluate and prune | "When is ToT better than CoT?"
Memory (short) | Context window; sliding window or summarization | "How do you manage conversation memory?"
Memory (long) | Vector store + structured store; write side matters | "How do you build long-term agent memory?"
Episodic Memory | Past task traces with lessons learned | "How can agents learn from experience?"
Evaluation | Trajectory + outcome; LLM-as-judge + automated | "How do you evaluate agent quality?"
Safety | Least privilege; sandboxing; human-in-the-loop | "How do you make agents safe?"

Spaced Repetition Checkpoints

Day 0 (Today)

  • Explain the ReAct pattern with an example
  • List 5 principles for designing good tool definitions
  • Describe the three types of agent memory
  • Name 4 common agent failure modes and their fixes

Day 3

  • Compare AutoGen, CrewAI, and LangGraph architectures
  • Explain plan-and-execute vs. ReAct trade-offs
  • Describe MCP and its four capability types
  • Design a permission system for an agent with database access

Day 7

  • Design a multi-agent code review system
  • Explain tree-of-thought and when to use it
  • Describe episodic memory and how it enables agent learning
  • List evaluation dimensions for agent quality

Day 14

  • Whiteboard a complete agent architecture for a customer support system
  • Explain how LangGraph handles persistence and checkpointing
  • Design a memory system for an agent handling 10K daily conversations
  • Discuss agent sandboxing strategies with trade-offs

Day 21

  • Present a 30-minute deep dive on agent reliability engineering
  • Compare MCP with alternative tool integration approaches
  • Design an agent evaluation pipeline with trajectory scoring
  • Critique a given multi-agent architecture and propose improvements

© 2026 EngineersOfAI. All rights reserved.