Agent Framework History Timeline

From the ReAct paper to production-grade agent toolkits

2022 Q3 ReAct: Synergizing Reasoning and Acting in Language Models Foundations
Yao et al. (Princeton and Google Brain) publish the ReAct paper. The key insight: interleave reasoning traces ("Thought: I need to find...") with actions ("Action: Search[query]") and observations ("Observation: Result..."). This is the loop that every agent framework implements today. The paper shows the pattern outperforms both pure reasoning (chain-of-thought) and pure action (act-only) on knowledge-intensive tasks.
Foundation: The Thought/Action/Observation loop used by all three frameworks traces back to this paper.
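The Thought/Action/Observation loop fits in a few lines of plain Python. This is a minimal sketch, not any framework's API: the stub model, the `Search` tool, and the regex-based action parsing are all stand-ins for an LLM call and real tool dispatch.

```python
# Minimal sketch of the ReAct Thought/Action/Observation loop.
# The "model" is a stub; a real agent calls an LLM here instead.
import re

def stub_model(transcript: str) -> str:
    """Pretend LLM: searches once, then answers from the observation."""
    if "Observation:" in transcript:
        return "Thought: I have the answer.\nFinal Answer: Paris"
    return "Thought: I need to look this up.\nAction: Search[capital of France]"

def search_tool(query: str) -> str:
    return "Paris is the capital of France."  # canned result

def react_loop(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        output = stub_model(transcript)
        transcript += output + "\n"
        if "Final Answer:" in output:
            return output.split("Final Answer:")[1].strip()
        match = re.search(r"Action: (\w+)\[(.*)\]", output)
        if match:
            _tool, arg = match.groups()        # single tool here; real
            observation = search_tool(arg)     # agents dispatch by name
            transcript += f"Observation: {observation}\n"
    return "no answer"

print(react_loop("What is the capital of France?"))  # Paris
```

Note that the action parsing is a regex over free text: this fragility is exactly what OpenAI's later function-calling API (2023 Q2 below) removes.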
2022 Q4 LangChain launches — chains, tools, and early agents Foundations
Harrison Chase releases LangChain on GitHub. The initial focus is chains — composable sequences of LLM calls. Within weeks, the community adds Tool support and the first agent implementations. LangChain's growth is unlike anything the ML OSS ecosystem had seen: 50k GitHub stars in six months. The velocity attracts contributors faster than the core team can review PRs, which sets up the API instability and dependency weight problems that appear in later benchmarks.
Created the vocabulary: chains, agents, tools, memory. Every framework since has used LangChain as the reference point.
2023 Q1 LlamaIndex (formerly GPT Index) adds agent capabilities Chain Era
Jerry Liu's GPT Index project — focused on indexing documents for LLM retrieval — rebrands as LlamaIndex and adds agent functionality. The framework's original strength is its indexing abstractions: VectorStoreIndex, SummaryIndex, KnowledgeGraphIndex. Agents are added to keep pace with LangChain and meet user demand. The design shows it: agent APIs feel like additions to a retrieval framework rather than first-class primitives. This architectural decision explains LlamaIndex's weak week-3 agent scores three years later.
Set LlamaIndex's identity: the best retrieval framework, not the best agent framework. That identity holds through 2026.
2023 Q2 OpenAI releases function calling — changes how agents work Chain Era
GPT-3.5-turbo and GPT-4 gain native function calling support. Instead of parsing tool calls from free-text LLM output, agents can now receive structured JSON tool invocations directly. This eliminates an entire class of parsing failures — the "malformed output" problem that LangChain later addresses with handle_parsing_errors=True. Frameworks scramble to adopt the new API. LangChain's StructuredTool and SynapseKit's schema-first design both trace back to adapting for this interface.
Cut agent failure rates on function-calling models by ~40% (fewer malformed outputs). Created the multi-format schema problem: OpenAI format vs Anthropic format.
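Concretely, the shift is from regex-parsing prose to deserializing structured arguments. The sketch below uses OpenAI's documented function-tool schema shape; the `calculator` tool and the hardcoded model reply are made up for illustration.

```python
# A tool definition in OpenAI's function-calling format: the model
# returns structured JSON arguments instead of free text, so there is
# nothing to regex-parse and no malformed-output retry loop.
import json

calculator_tool = {
    "type": "function",
    "function": {
        "name": "calculator",
        "description": "Evaluate a basic arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {
                "expression": {"type": "string", "description": "e.g. '2 + 2'"},
            },
            "required": ["expression"],
        },
    },
}

# What the model sends back for this tool: JSON arguments, not prose.
model_tool_call = '{"expression": "2 + 2"}'
args = json.loads(model_tool_call)
print(args["expression"])  # 2 + 2
```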
2023 Q4 LangGraph released — graph-based agent orchestration Agent Era
LangChain releases LangGraph: a framework for building stateful, multi-actor applications with LLMs as a directed graph. Nodes are processing steps; edges are transitions. Unlike a DAG, the graph may contain cycles, which lets the agent loop be expressed natively. LangGraph becomes the most powerful tool for complex conditional agent workflows — but at the cost of setup verbosity. The "supervisor + worker" pattern from benchmark #18 is where LangGraph's graph flexibility pays off most clearly.
LangGraph wins the multi-agent flexibility benchmark in notebook #18. It's the reason LangChain still leads on complex orchestration patterns despite losing on line count.
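The node/edge/cycle idea can be shown without LangGraph itself. This is a conceptual sketch in plain Python, not LangGraph's StateGraph API: the node names, the shared state dict, and the `run_graph` helper are all invented here to illustrate why allowing cycles makes the agent loop a native construct.

```python
# Conceptual sketch of a cyclic agent graph: nodes transform a shared
# state, conditional edges pick the next node, and a cycle back to the
# agent node replaces a fixed one-pass DAG.

def agent_node(state):
    state["steps"] += 1
    # Conditional edge: call the tool until we have a result, then stop.
    state["next"] = "tool" if state["result"] is None else "end"
    return state

def tool_node(state):
    state["result"] = 42      # stand-in for a real tool call
    state["next"] = "agent"   # the cycle: hand control back to the agent
    return state

NODES = {"agent": agent_node, "tool": tool_node}

def run_graph(state, entry="agent", max_steps=10):
    node = entry
    while node != "end" and state["steps"] < max_steps:
        state = NODES[node](state)
        node = state["next"]
    return state

final = run_graph({"steps": 0, "result": None})
print(final["result"], final["steps"])  # 42 2
```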
2024 Q1 SynapseKit 1.0 — Crew/Task API and 30 built-in tools Agent Era
SynapseKit launches with a different design philosophy: agent ergonomics first, batteries included. The Crew + Task(context_from=[...]) pattern makes inter-agent dependencies explicit. 30 built-in tools (12 zero-config) means you rarely pip install a calculator. The schema() + anthropic_schema() approach exports from one definition to both OpenAI and Anthropic formats. Built by engineers frustrated with LangChain's verbosity and LlamaIndex's agent limitations. The LLM Showdown series puts the claims to a reproducible test.
Wins 4 of 6 Week-3 agent benchmarks. The batteries-included design shows up most clearly in notebook #17 (30 vs 17 vs 3 built-in tools).
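The Crew/Task dependency pattern can be illustrated with a toy implementation. Only the names `Crew`, `Task`, and `context_from` come from the description above; every signature and behavior below (the callable task body, the `run` method, the assumption that tasks arrive pre-ordered) is this sketch's invention, not SynapseKit's actual API.

```python
# Toy illustration of the Crew + Task(context_from=[...]) pattern:
# each task declares which earlier tasks' outputs it consumes, making
# inter-agent dependencies explicit instead of implicit in prompt text.

class Task:
    def __init__(self, name, fn, context_from=()):
        self.name = name
        self.fn = fn                      # stand-in for an agent call
        self.context_from = list(context_from)

class Crew:
    def __init__(self, tasks):
        self.tasks = tasks

    def run(self):
        results = {}
        for task in self.tasks:           # assumes dependency order
            context = [results[dep.name] for dep in task.context_from]
            results[task.name] = task.fn(context)
        return results

research = Task("research", lambda ctx: "notes on agent frameworks")
write = Task("write", lambda ctx: f"draft based on: {ctx[0]}",
             context_from=[research])

print(Crew([research, write]).run()["write"])
```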
2024 Q3 Anthropic releases Claude's tool use — Anthropic schema format Agent Era
Anthropic launches official tool use support for Claude. The schema format differs from OpenAI's — different field names, different structure for required fields and descriptions. This creates the multi-format problem: teams maintaining one tool definition need to keep two schema formats synchronized. SynapseKit's single .anthropic_schema() call solves this at the source. LangChain requires separate conversion utilities. The schema portability benchmark in notebook #16 measures exactly this problem.
Created the multi-provider schema portability problem. The framework that solves it best wins benchmark #16.
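The divergence is narrow but real: both providers accept a JSON Schema for the tool's inputs, but OpenAI nests it under `parameters` inside a `function` wrapper while Anthropic puts it directly under `input_schema`. The tool definition below is illustrative; the field names follow each provider's documented format.

```python
# The same tool in both provider formats, plus the one-way conversion
# that frameworks automate so teams keep a single source definition.

json_schema = {
    "type": "object",
    "properties": {"query": {"type": "string"}},
    "required": ["query"],
}

openai_tool = {
    "type": "function",
    "function": {
        "name": "search",
        "description": "Search the web.",
        "parameters": json_schema,       # OpenAI: "parameters"
    },
}

anthropic_tool = {
    "name": "search",
    "description": "Search the web.",
    "input_schema": json_schema,         # Anthropic: "input_schema"
}

def openai_to_anthropic(tool):
    """Rename fields and drop the wrapper; the schema body is shared."""
    fn = tool["function"]
    return {"name": fn["name"],
            "description": fn["description"],
            "input_schema": fn["parameters"]}

assert openai_to_anthropic(openai_tool) == anthropic_tool
print("formats equivalent after conversion")
```

Keeping these two definitions synchronized by hand is the drift risk that benchmark #16 measures; export-from-one-definition designs remove it.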
2025 Q1 LangSmith GA — observability goes platform, not local Production
LangSmith reaches general availability. LangChain's bet is clear: local verbose mode for development, LangSmith for production observability. Step latency, cost tracking per run, trace replay, regression testing — all live in the platform. This is an intentional architectural split. It explains why set_verbose(True) doesn't expose timing data locally (benchmark #19 — LangChain loses on local feature depth) and why SynapseKit and LlamaIndex both invest in local Tracer/callback APIs.
Explains the observability benchmark result: LangChain wins on LoC (1 line), loses on local depth. The missing features are in LangSmith.
2025 Q4 MCP (Model Context Protocol) — new agent tool standard Production
Anthropic releases the Model Context Protocol — a standardized interface for connecting LLMs to external tools and data sources. MCP support becomes a differentiator: frameworks that implement the client/server protocol can connect to any MCP-compatible tool without custom integration code. This is tested in notebook #27 (Week 4). The winner in that benchmark will likely set the production standard for how agent tool ecosystems evolve through 2026 and beyond.
Upcoming Week 4 benchmark. Expected to be a key differentiator for production agent deployment at scale.
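On the wire, MCP is JSON-RPC 2.0. A rough sketch of a tool invocation, assuming the `tools/call` method and params shape from the published protocol: the tool name, arguments, and server reply here are hypothetical.

```python
# Shape of an MCP tool call over JSON-RPC 2.0. Any MCP-compatible
# client can send this to any MCP server without bespoke glue code.
import json

request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_docs",                  # hypothetical tool
        "arguments": {"query": "agent loops"},
    },
}

# A conforming server replies with content blocks in the result:
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": "3 matches found"}]},
}

print(json.dumps(request)["method" in request and 0:60])
```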
2026 Q1 LLM Showdown Week 3 Scorecard — SynapseKit 45, LC 31, LI 26 Production
After 21 benchmarks across 3 weeks, the cumulative standings: SynapseKit 45, LangChain 31, LlamaIndex 26. SynapseKit dominates agent ergonomics (wins 4 of 6 Week-3 benchmarks). LangChain wins the most production-critical single benchmark: per-tool error handling. LlamaIndex's agent story is exposed as incomplete — strong in retrieval week, weak in agent week. Week 4 (production: async, graph workflows, evaluation, cost tracking, guardrails, MCP) begins next.
The ergonomics leader is clear. Whether it holds when Week 4 tests production concerns is the open question.
www.engineersofai.com · AI Letters #29 · LLM Showdown #21