Skip to main content

9 posts tagged with "RAG"

Retrieval-Augmented Generation — chunking, embedding, retrieval, and production RAG systems.

View All Tags

Want to Think Like an AI Architect?

Join engineers receiving weekly breakdowns of AI systems, production failures, and architectural decisions.

I Built a Lightweight LLM Framework Because LangChain Frustrated Me - Here's What I Learned

· 15 min read
EngineersOfAI
AI Engineering Education

There's a moment every LLM developer knows. You've got a working prototype. It's elegant, fast, and does exactly what you need. Then you try to deploy it. And suddenly you're debugging a chain inside a runnable inside a callback inside an abstraction that didn't exist six months ago.

That moment happened one too many times. So something else got built.

This is the story of SynapseKit - why it exists, what it does differently, and what 18 (and counting) objective benchmarks against LangChain and LlamaIndex actually revealed.

Want to Think Like an AI Architect?

Join engineers receiving weekly breakdowns of AI systems, production failures, and architectural decisions.

AI Letters #23 - The RAG Scorecard: Six Benchmarks, Three Frameworks, One Clear Pattern

· 8 min read
EngineersOfAI
AI Engineering Education

"Batteries-included beats fully-composable on conciseness every time. Fully-composable beats batteries-included on control every time. You just have to know which problem you're solving."

Six notebooks. Six benchmarks. Three frameworks measured on the same RAG workloads, back to back, reproducible on Kaggle.

Week 1 of the LLM Showdown covered setup overhead: environment spin-up, indexing speed, basic retrieval, reranking, evaluation harnesses, and the Week 1 scorecard. SynapseKit won that one 15–7–8 (SK–LC–LI).

Week 2 went deeper into the RAG stack: PDF ingestion, chunking strategies, BM25 availability, hybrid search RRF, streaming time-to-first-token, and conversation memory. Same methodology. 3-2-1 points for rank 1-2-3 across each benchmark, ties split.

The results are not a surprise if you've been paying attention. But the magnitude of the gap on some dimensions is.

Want to Think Like an AI Architect?

Join engineers receiving weekly breakdowns of AI systems, production failures, and architectural decisions.

AI Letters #22 - Conversation Memory in RAG: One Param vs Forty Lines of Boilerplate

· 11 min read
EngineersOfAI
AI Engineering Education

RAG gives the model context from documents. Memory gives it context from the conversation. Without both, your chatbot doesn't know what it just said.

Every RAG system eventually faces the same question: what happens on the second turn? The user asks a follow-up. "What did you mean by that?" "Can you give me an example?" "How does that compare to what you said earlier?" Without memory, the model treats each question as the first. Context from the previous turn is gone. The answer it gives to the follow-up is either wrong, generic, or disconnected from what came before.

Conversation memory is the fix. A buffer of past exchanges gets prepended to the retrieved context and injected into the prompt. The model now has the document context and the conversation context. It can use both. The question is how much it costs to add this to your pipeline - and what happens when the conversation gets long enough that you have to start dropping old messages.

We wired identical multi-turn memory into RAG pipelines across SynapseKit 1.4, LangChain 1.2, and LlamaIndex Core 0.14. Same conversation, same task, same question: how many lines does it take to add memory, and what happens at the edge cases? The LoC gap is the widest of any benchmark in this series. The persistence and window-strategy differences are what will matter in your production system.

Want to Think Like an AI Architect?

Join engineers receiving weekly breakdowns of AI systems, production failures, and architectural decisions.

AI Letters #21 - Streaming RAG: Time to First Token Across Three Frameworks

· 10 min read
EngineersOfAI
AI Engineering Education

When users wait for an LLM, the number that matters is time-to-first-token, not total time. 200ms to first token feels instant. 2 seconds to first token feels broken - even if the full answer arrives faster.

Every LLM UI eventually learns the same lesson. Users don't measure latency the way your dashboard does. They don't care about tokens-per-second, p99 tail latency, or median completion time. They care about one thing: how long until something appears on screen. That number is TTFT - time to first token - and it dominates perceived performance more than any other metric in LLM serving.

The catch is that when you're building a streaming RAG pipeline, the framework itself sits between your .stream() call and the first token your user sees. Every async for, every LCEL graph traversal, every callback dispatch adds latency before a single character leaves the server. In production that overhead is invisible because network latency to OpenAI or Anthropic is 100–1000x larger. But strip out the network with a mock LLM and you can finally see what the framework itself costs you.

We built identical streaming RAG pipelines across SynapseKit 1.4, LangChain 1.2, and LlamaIndex Core 0.14. Same documents, same query, same mock LLM that yields the exact same token list with zero network latency. The result: all three clear the sub-millisecond bar comfortably. Nobody loses on the number. The interesting split is elsewhere - in the shape of the streaming API itself.

Want to Think Like an AI Architect?

Join engineers receiving weekly breakdowns of AI systems, production failures, and architectural decisions.

AI Letters #20 - Hybrid Search: RRF Fusion Across Three Frameworks

· 8 min read
EngineersOfAI
AI Engineering Education

Pure vector search misses exact matches. Pure BM25 misses semantics. Hybrid search almost always wins - the question is how much control you get over the fusion.

Every production RAG system eventually hits the same wall. Vector search retrieves semantically similar documents, but it fails on exact-match queries: model names, version numbers, function names, error codes. The query "GPT-4o" and the document "GPT-4o" don't reliably produce close vectors. BM25 doesn't have this problem. It matches terms, weighs them by rarity, and returns the right document.

Reciprocal Rank Fusion - RRF - is the standard way to combine both. It takes two ranked lists, assigns each document a score of 1 / (k + rank), sums the scores, and re-ranks. The parameter k controls how much the top ranks dominate. It requires no score normalisation, works across retrieval algorithms with incompatible score scales, and runs in microseconds.

We built identical hybrid pipelines across SynapseKit 1.4, LangChain 1.2, and LlamaIndex Core 0.14. Same corpus, same query, same task: BM25 + vector, top-3 via RRF. The LoC gap is smaller than the BM25-only benchmark. The configurability gap is not.

Want to Think Like an AI Architect?

Join engineers receiving weekly breakdowns of AI systems, production failures, and architectural decisions.

AI Letters #19 - The BM25 Test: One Framework Silently Fails

· 9 min read
EngineersOfAI
AI Engineering Education

BM25 ships in one framework, requires a hidden install in another, and silently fails at runtime in the third. Same class name, three very different experiences.

Pure vector search has a blind spot. Exact-match queries - model names, function names, version numbers, proper nouns - embed poorly. The query "GPT-4o" and the document "GPT-4o" don't always produce similar vectors. BM25 does not have this problem. It matches terms, weighs them by rarity, and returns the right document.

Production RAG systems almost always use hybrid search: BM25 for precision on exact matches, vector search for semantic recall, reciprocal rank fusion to merge them. The question of whether BM25 ships out of the box is not academic. It determines whether your pipeline works on day one or fails at 2am in a customer demo.

We tested all three frameworks on an identical task: index five documents, run a BM25 query, get top-3 results. One framework's BM25Retriever class is in its package but silently throws a ModuleNotFoundError at runtime unless you've separately installed a library it doesn't list as a dependency.

Want to Think Like an AI Architect?

Join engineers receiving weekly breakdowns of AI systems, production failures, and architectural decisions.

AI Letters #18 - The Chunking Test: Two Frameworks Are Identical, One Is Not

· 9 min read
EngineersOfAI
AI Engineering Education

How you split documents determines what your retriever finds. Most tutorials spend two lines on this. They shouldn't.

Every RAG tutorial reaches the chunking step and sprints past it. "Split into chunks of 500 characters with 50 overlap - done." The code runs. The demo works. The demo is not production.

The split you choose affects embedding quality, retrieval precision, and whether your LLM gets enough context to say something useful. Chunking is not configuration. It's architecture.

We ran all three frameworks against the same document with identical parameters. The line counts came out nearly equal. The chunk outputs did not. One framework's default splitter interprets chunk_size=300 as tokens, not characters - producing 2 chunks averaging 986 characters each instead of 12 chunks averaging 163 characters. Same parameter name, different semantics.

Want to Think Like an AI Architect?

Join engineers receiving weekly breakdowns of AI systems, production failures, and architectural decisions.

AI Letters #17 - The PDF Test: Which Framework Makes You Learn a New API?

· 7 min read
EngineersOfAI
AI Engineering Education

The question isn't how many lines it takes to load a PDF. It's how many new concepts you need to know.

Three frameworks. Same task: take a PDF off disk, build a queryable index, answer a question. We measured lines of code and wall-clock indexing time.

The line counts are almost beside the point. What matters is this: two of the three frameworks require you to learn a loader API that didn't exist in the string version. One doesn't.

Want to Think Like an AI Architect?

Join engineers receiving weekly breakdowns of AI systems, production failures, and architectural decisions.