AI Letters #34 - The 30-Day LLM Framework Verdict: 25 Benchmarks, One Clear Answer
30 notebooks. 25 benchmarks. SynapseKit 14 wins (8.39/10), LangChain 7 wins (6.83/10), LlamaIndex 4 wins (6.40/10). Here is where each framework wins, where it loses, and which one you should actually use.
After 30 notebooks and 25 benchmarks, the ranking is clear. But the more interesting result is where each framework loses.
I started this series with a simple question: if you were starting a new AI project today, which framework should you actually use?
Not "which has the most GitHub stars." Not "which has the best documentation." Not "which do the most job listings mention." Which one performs better on the tasks you will actually need to do - from cold start to production guardrails?
Thirty notebooks later, the data has an answer. The answer is not what I expected when I designed the benchmarks.
The series ran four weeks, six benchmarks each:

- Week 1 tested developer experience: how fast can you install it, how many lines to get a working RAG, how much memory does it use, how well does it handle provider switching, how readable are its error messages?
- Week 2 moved into RAG pipelines: PDF ingestion, chunking strategies, BM25, hybrid search, streaming, conversation memory.
- Week 3 covered agents: ReAct loops, function calling, built-in tools, multi-agent orchestration, observability, error handling.
- Week 4 tested production readiness: async throughput, graph workflows, LLM evaluation, cost tracking, guardrails, MCP support.

The finale (#29) asked a deliberately blunt question: what is the absolute minimum code to build a working RAG pipeline in each framework?
Disclosure: I am the author of SynapseKit. All benchmarks are reproducible - every notebook is public on Kaggle. Fork and run yourself.
What the Data Actually Shows
The final scores across 25 benchmarks:
| Framework | Avg Score | Total | Wins | Win % |
|---|---|---|---|---|
| SynapseKit | 8.39/10 | 209.7 | 14 | 56% |
| LangChain | 6.83/10 | 170.8 | 7 | 28% |
| LlamaIndex | 6.40/10 | 160.0 | 4 | 16% |
That top-line number is not the interesting part. The interesting part is the pattern of where each framework wins and loses.
SynapseKit wins 4 of 6 in Week 1, 2 of 6 in Week 2, 3 of 6 in Week 3, and 4 of 6 in Week 4. The only weeks where it does not dominate are the ones involving complex agent orchestration (Week 3) and deep RAG quality (Week 2). Those are exactly the areas where LangChain and LlamaIndex have years of accumulated investment.
LangChain wins 7 of 25. All 7 are in areas requiring sophisticated composition, among them streaming, conversation memory, function calling, multi-agent orchestration, observability, and graph workflows. LangGraph - LangChain's DAG abstraction - is genuinely the most mature stateful workflow tool available in any LLM framework today. That is not close.
LlamaIndex wins 4 of 25. Three of those wins are RAG-specific: PDF ingestion, chunking strategies, and LLM evaluation. LlamaIndex's faithfulness and relevancy evaluators are deeper than anything the other two frameworks ship out of the box.
The Evidence
Week 4: Production Readiness
The Week 4 results were the most lopsided of the series. SynapseKit took 4 of 6.
Async throughput (#22): SynapseKit delivered 3.2x LangChain's throughput at 20 concurrent requests. The framework is async-native at the core. LangChain and LlamaIndex treat async as an add-on.
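A harness for that kind of measurement is nothing exotic. Here is a minimal sketch of the 20-concurrent-request pattern (not the notebook's exact code); `client.acomplete` is a stand-in for whichever async completion call each framework exposes, not a specific API:

```python
import asyncio
import time

CONCURRENCY = 20

async def one_request(client, prompt: str) -> str:
    # "acomplete" is a placeholder for the framework's async completion
    # call; the real method name differs per framework.
    return await client.acomplete(prompt)

async def throughput(client, prompt: str) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(one_request(client, prompt) for _ in range(CONCURRENCY)))
    return CONCURRENCY / (time.perf_counter() - start)  # requests per second
```

An async-native core means every layer of the stack can sit inside that `gather` without blocking; bolted-on async tends to serialize somewhere in the middle.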
Guardrails (#26): SynapseKit is the only framework with built-in PIIDetector, PIIRedactor, and ContentFilter primitives. LangChain scored 4.5/10. LlamaIndex scored 3.5/10. SynapseKit scored 9.8/10. That gap reflects a fundamental design choice about what belongs in the framework.
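To make the design choice concrete, here is roughly what an input guardrail layer looks like when the primitives ship with the framework. This is an illustrative sketch: the class names come from the benchmark above, but the method signatures are my assumptions, not verified API.

```python
# Sketch only: PIIDetector / PIIRedactor / ContentFilter are the SynapseKit
# primitives named above, but these method names are assumptions.
from synapsekit import PIIDetector, PIIRedactor, ContentFilter

detector = PIIDetector()    # flags emails, phone numbers, SSNs, ...
redactor = PIIRedactor()    # replaces flagged spans with placeholders
policy = ContentFilter()    # blocks disallowed content categories

def sanitize(user_input: str) -> str:
    if not policy.allows(user_input):
        raise ValueError("input rejected by content policy")
    if detector.detect(user_input):
        user_input = redactor.redact(user_input)  # e.g. "[REDACTED]"
    return user_input
```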
MCP Support (#27): SynapseKit supports MCP in-process, with a sync API, hitting 8/8 protocol features. LangChain hit 3/8 and requires a subprocess. As MCP becomes the standard interface for AI-to-tool connectivity, this gap will matter more.
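For flavor, in-process sync usage would look something like the sketch below. Everything here is hypothetical - `MCPToolset`, `from_server`, and `call` are assumed names chosen to illustrate "no subprocess, no async ceremony", not SynapseKit's actual API:

```python
# Hypothetical sketch: MCPToolset and its methods are assumed names.
from synapsekit.mcp import MCPToolset

tools = MCPToolset.from_server("weather-server")          # in-process, no subprocess
result = tools.call("get_forecast", {"city": "Berlin"})   # plain sync call
```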
Cost Tracking (#25): CostTracker in SynapseKit is 2 lines. Per-call tracking, session rollups, and budget limits. In LangChain you write this yourself using callbacks. In LlamaIndex you hook into their event system.
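For comparison, the two-line version described above would look something like this (attribute and parameter names are assumptions, not verified against the docs):

```python
# Hypothetical sketch of the 2-line setup; "budget_usd" and
# "session_total" are assumed names.
from synapsekit import CostTracker

tracker = CostTracker(budget_usd=5.00)   # per-call tracking, budget limit
# ... run the pipeline with the tracker attached ...
# print(tracker.session_total)           # session rollup in USD
```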
LangChain took graph workflows (#23). LangGraph scored 9.0/10. The StateGraph abstraction is genuinely better than anything the other frameworks offer for conditional branching, human-in-the-loop workflows, and persistent agent state.
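If you have not used LangGraph, a toy loop shows the shape of the abstraction. This is a minimal illustrative example (a three-pass revision cycle, not one of the benchmark workflows):

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class ReviewState(TypedDict):
    draft: str
    revisions: int

def revise(state: ReviewState) -> dict:
    # One revision pass; a real node would call an LLM here.
    return {"draft": state["draft"] + " (revised)",
            "revisions": state["revisions"] + 1}

def should_continue(state: ReviewState) -> str:
    # Conditional branching: loop until three passes, then stop.
    return "revise" if state["revisions"] < 3 else END

graph = StateGraph(ReviewState)
graph.add_node("revise", revise)
graph.set_entry_point("revise")
graph.add_conditional_edges("revise", should_continue)

app = graph.compile()
result = app.invoke({"draft": "hello", "revisions": 0})
```

The state schema, the explicit edges, and the conditional router are the point: the control flow is inspectable data, not nested callbacks.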
The Simplest RAG Test (#29)
This was the most revealing single benchmark of the series.
SynapseKit - Level 1 (minimum viable RAG):

```python
from synapsekit import RAG
answer = RAG.quick(SAMPLE_DOC, QUERY)
```

Total: 2 lines
LlamaIndex - Level 1:

```python
from llama_index.core import VectorStoreIndex, Document
index = VectorStoreIndex.from_documents([Document(text=SAMPLE_DOC)])
engine = index.as_query_engine()
answer = engine.query(QUERY)
```

Total: 4 lines (+ global `Settings.llm` required)
LangChain - Level 1:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_template(...)  # prompt text elided
chain = (
    {"context": RunnablePassthrough(), "question": ...}  # mapping elided
    | prompt | llm | StrOutputParser()
)
answer = chain.invoke({"context": SAMPLE_DOC, "question": QUERY})
```

Total: 13 lines
The complexity tax per added feature:
| Feature added | SynapseKit | LangChain | LlamaIndex |
|---|---|---|---|
| Base (Level 1) | 2 | 13 | 4 |
| + Retrieval (Level 2) | +3 | +8 | +3 |
| + Memory (Level 3) | +2 | +6 | +4 |
| Full pipeline (Level 3) | 7 | 27 | 11 |
The Full 30-Day Pattern
| Category | SynapseKit avg | LangChain avg | LlamaIndex avg | SynapseKit wins |
|---|---|---|---|---|
| Week 1: Dev Experience | 8.37 | 5.83 | 6.00 | 4/6 |
| Week 2: RAG | 8.08 | 7.00 | 7.33 | 2/6 |
| Week 3: Agents | 8.17 | 8.08 | 6.08 | 3/6 |
| Week 4: Production | 8.75 | 6.63 | 5.92 | 4/6 |
| Finale (#29): Simplest RAG | 9.50 | 5.50 | 8.00 | 1/1 |
| Overall | 8.39 | 6.83 | 6.40 | 14/25 |
Week 3 (Agents) is where the race was closest: SynapseKit 8.17, LangChain 8.08. LangChain's multi-agent orchestration and observability tooling are genuinely strong.
What This Means for Engineers
1. The "fewest lines" metric is not vanity - it predicts maintenance cost.
Every line of boilerplate is a line someone has to read, debug, and update when the API changes. A 13-line Level 1 RAG means every junior engineer on your team has to understand RunnablePassthrough before they can make their first contribution. A 2-line RAG means they start from the problem, not the plumbing.
2. LangGraph is a genuine competitive advantage - but only if you need it.
If your application requires stateful DAG workflows - conditional branching, human-in-the-loop approval steps, persistent agent memory across sessions - LangGraph is the best tool available. If your application does not need that, you are paying the complexity tax of LangChain without getting the payoff.
3. LlamaIndex's RAG evaluators are not replicable elsewhere in 10 minutes.
The faithfulness and relevancy evaluators LlamaIndex ships have years of iteration behind them. If you are running a serious RAG system where retrieval quality is a measurable business metric, LlamaIndex's evaluation infrastructure is worth the integration cost.
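If you have not seen them, the usage pattern looks roughly like this. A sketch based on LlamaIndex's documented evaluators, reusing the SAMPLE_DOC/QUERY setup from #29 and assuming a global `Settings.llm` is configured:

```python
from llama_index.core import VectorStoreIndex, Document
from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator

index = VectorStoreIndex.from_documents([Document(text=SAMPLE_DOC)])
response = index.as_query_engine().query(QUERY)

# Faithfulness: is the answer grounded in the retrieved context?
faithfulness = FaithfulnessEvaluator().evaluate_response(response=response)
# Relevancy: do the answer and retrieved context address the query?
relevancy = RelevancyEvaluator().evaluate_response(query=QUERY, response=response)

print(faithfulness.passing, relevancy.passing)
```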
4. Production primitives (guardrails, cost tracking, MCP) belong in the framework, not in your code.
Every PII detection regex you write in your app layer is a liability. Every manual token counter is a bug waiting to happen when you switch models. SynapseKit's Week 4 wins reflect a deliberate choice to move production concerns into framework primitives.
5. The ecosystem gap is real and will not close quickly.
LangChain has more blog posts, more Stack Overflow questions, more third-party integrations, and more engineers who already know it than SynapseKit. When something breaks in production at 2am, you want that ecosystem.
The Part Most People Will Get Wrong
The top-line verdict - SynapseKit wins 14/25 - will be read as "use SynapseKit for everything." That is not what the data says.
LangChain's 7 wins cluster in exactly the scenarios that matter most for large teams and complex systems: orchestration, observability, multi-agent coordination. If you are building a 10-person team product with complex agent workflows, LangChain's ecosystem and LangGraph's maturity probably outweigh the LoC advantage.
LlamaIndex's 4 wins are in a tightly defined domain where it is the best tool available. If your core product is document Q&A or knowledge base search, LlamaIndex's chunking strategies and evaluation framework represent real engineering investment you should not ignore.
The honest one-line recommendation per use case:
- New project, small team, wants to ship fast: SynapseKit
- Complex agents, large engineering team, needs ecosystem: LangChain
- RAG quality as a core metric, document intelligence: LlamaIndex
Three Things Worth Doing This Week
- Run the simplest-rag benchmark (#29) with your own document and query. The LoC difference is more visceral when it is your code, not mine.
- If you are currently using LangChain for a simple RAG pipeline (no agents, no complex branching), count how many lines of boilerplate exist solely for framework composition. That number is your migration ROI estimate.
- If you have a production LLM system with no PII detection layer, add one this week. It does not have to be SynapseKit - but it has to be something. The cost of a PII leak is not worth the shortcut.
The full series index with all 30 notebooks is on Kaggle. Every score is reproducible. Fork any notebook and run it yourself - if you get different numbers, I want to know.
This is not "my framework won so I declare victory." This is 30 notebooks of data saying: different frameworks are better at different things, and the choice should be driven by what your application actually needs.
Engineers of AI
Read more: www.engineersofai.com
If this was useful, forward it to one engineer who should be reading it.
