13 posts tagged with "LLM Engineering"

Building production systems with large language models — RAG, agents, evaluation, and cost optimization.

AI Letters #22 - Conversation Memory in RAG: One Param vs Forty Lines of Boilerplate

· 11 min read
EngineersOfAI
AI Engineering Education

RAG gives the model context from documents. Memory gives it context from the conversation. Without both, your chatbot doesn't know what it just said.

Every RAG system eventually faces the same question: what happens on the second turn? The user asks a follow-up. "What did you mean by that?" "Can you give me an example?" "How does that compare to what you said earlier?" Without memory, the model treats each question as the first. Context from the previous turn is gone. The answer it gives to the follow-up is either wrong, generic, or disconnected from what came before.

Conversation memory is the fix. A buffer of past exchanges gets prepended to the retrieved context and injected into the prompt. The model now has the document context and the conversation context. It can use both. The question is how much it costs to add this to your pipeline - and what happens when the conversation gets long enough that you have to start dropping old messages.
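To make the mechanics concrete, here is a minimal sketch of a sliding-window buffer, assuming the simplest window strategy: keep the last N exchanges and drop the rest. `ConversationBuffer` and `build_prompt` are hypothetical names for illustration, not the API of any of the three frameworks.

```python
from collections import deque


class ConversationBuffer:
    """Hypothetical sliding-window memory that keeps the last `max_turns` exchanges."""

    def __init__(self, max_turns: int = 3) -> None:
        # A deque with maxlen discards the oldest exchange once the window is full.
        self.turns = deque(maxlen=max_turns)

    def add(self, user: str, assistant: str) -> None:
        self.turns.append((user, assistant))

    def render(self) -> str:
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)


def build_prompt(memory: ConversationBuffer, retrieved_context: str, question: str) -> str:
    # Conversation history is prepended to the retrieved context, then the question follows.
    parts = []
    history = memory.render()
    if history:
        parts.append("Conversation so far:\n" + history)
    parts.append("Context:\n" + retrieved_context)
    parts.append("Question: " + question)
    return "\n\n".join(parts)
```

With `max_turns=2`, adding a third exchange drops the first one without warning. That is the edge case the benchmark cares about: a follow-up that refers back to a dropped turn gets answered as if that turn never happened.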

We wired identical multi-turn memory into RAG pipelines across SynapseKit 1.4, LangChain 1.2, and LlamaIndex Core 0.14. Same conversation, same task, same question: how many lines does it take to add memory, and what happens at the edge cases? The LoC gap is the widest of any benchmark in this series. The persistence and window-strategy differences are what will matter in your production system.

Want to Think Like an AI Architect?

Join engineers receiving weekly breakdowns of AI systems, production failures, and architectural decisions.

AI Letters #21 - Streaming RAG: Time to First Token Across Three Frameworks

· 10 min read
EngineersOfAI
AI Engineering Education

When users wait for an LLM, the number that matters is time-to-first-token, not total time. 200 ms to first token feels instant. 2 seconds to first token feels broken - even if the full answer arrives faster.

Every LLM UI eventually learns the same lesson. Users don't measure latency the way your dashboard does. They don't care about tokens-per-second, p99 tail latency, or median completion time. They care about one thing: how long until something appears on screen. That number is TTFT - time to first token - and it dominates perceived performance more than any other metric in LLM serving.

The catch is that when you're building a streaming RAG pipeline, the framework itself sits between your .stream() call and the first token your user sees. Every async for, every LCEL graph traversal, every callback dispatch adds latency before a single character leaves the server. In production that overhead is invisible because network latency to OpenAI or Anthropic is 100–1000x larger. But strip out the network with a mock LLM and you can finally see what the framework itself costs you.
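A rough sketch of how that measurement can be set up, assuming a synchronous generator as the mock LLM. `mock_llm_stream` and `measure_ttft` are illustrative names, not part of any framework, and the benchmark's actual harness may differ.

```python
import time
from typing import Iterable, Iterator, List, Optional, Tuple


def mock_llm_stream(tokens: Iterable[str]) -> Iterator[str]:
    # Stand-in for a streaming LLM: yields a fixed token list with no network wait.
    for token in tokens:
        yield token


def measure_ttft(stream: Iterator[str]) -> Tuple[Optional[float], float, List[str]]:
    """Return (seconds to first token, total seconds, tokens received)."""
    start = time.perf_counter()
    ttft = None
    received = []
    for token in stream:
        if ttft is None:
            # Everything before this point is overhead between the call and the first token.
            ttft = time.perf_counter() - start
        received.append(token)
    total = time.perf_counter() - start
    return ttft, total, received
```

Wrapping a framework's streaming call in `measure_ttft` instead of the bare generator isolates what the framework adds on top of the mock. An empty stream returns `None` for the first-token time rather than a misleading zero.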

We built identical streaming RAG pipelines across SynapseKit 1.4, LangChain 1.2, and LlamaIndex Core 0.14. Same documents, same query, same mock LLM that yields the exact same token list with zero network latency. The result: all three clear the sub-millisecond bar comfortably. Nobody loses on the number. The interesting split is elsewhere - in the shape of the streaming API itself.

AI Letters #20 - Hybrid Search: RRF Fusion Across Three Frameworks

· 8 min read
EngineersOfAI
AI Engineering Education

Pure vector search misses exact matches. Pure BM25 misses semantics. Hybrid search almost always wins - the question is how much control you get over the fusion.

Every production RAG system eventually hits the same wall. Vector search retrieves semantically similar documents, but it fails on exact-match queries: model names, version numbers, function names, error codes. The query "GPT-4o" and the document "GPT-4o" don't reliably produce close vectors. BM25 doesn't have this problem. It matches terms, weighs them by rarity, and returns the right document.

Reciprocal Rank Fusion - RRF - is the standard way to combine both. It takes two ranked lists, assigns each document a score of 1 / (k + rank), sums the scores, and re-ranks. The parameter k controls how much the top ranks dominate. It requires no score normalisation, works across retrieval algorithms with incompatible score scales, and runs in microseconds.
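The scoring rule fits in a few lines. This is a minimal sketch, not any framework's implementation; `rrf_fuse` is a hypothetical helper, and k=60 is assumed here only because it is a commonly used default.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60, top_n: int = 3) -> List[str]:
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; each list contributes 1 / (k + rank) for every doc it contains.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; ties are broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


bm25_ranking = ["a", "b", "c"]
vector_ranking = ["b", "c", "d"]
print(rrf_fuse([bm25_ranking, vector_ranking]))  # ['b', 'c', 'a']
```

Documents "b" and "c" appear in both lists, so they outrank "a" even though "a" was the top BM25 hit. Only ranks enter the formula, which is why no score normalisation is needed.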

We built identical hybrid pipelines across SynapseKit 1.4, LangChain 1.2, and LlamaIndex Core 0.14. Same corpus, same query, same task: BM25 + vector, top-3 via RRF. The LoC gap is smaller than the BM25-only benchmark. The configurability gap is not.

AI Letters #19 - The BM25 Test: One Framework Silently Fails

· 9 min read
EngineersOfAI
AI Engineering Education

BM25 ships in one framework, requires a hidden install in another, and silently fails at runtime in the third. Same class name, three very different experiences.

Pure vector search has a blind spot. Exact-match queries - model names, function names, version numbers, proper nouns - embed poorly. The query "GPT-4o" and the document "GPT-4o" don't always produce similar vectors. BM25 does not have this problem. It matches terms, weighs them by rarity, and returns the right document.

Production RAG systems almost always use hybrid search: BM25 for precision on exact matches, vector search for semantic recall, reciprocal rank fusion to merge them. The question of whether BM25 ships out of the box is not academic. It determines whether your pipeline works on day one or fails at 2am in a customer demo.

We tested all three frameworks on an identical task: index five documents, run a BM25 query, get top-3 results. One framework's BM25Retriever class ships in its package but fails only at runtime, raising a ModuleNotFoundError unless you've separately installed a library it doesn't list as a dependency.
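One way to catch this class of failure before the demo is to check for optional modules at startup. The sketch below uses only the standard library. `missing_modules` and `require_modules` are hypothetical helpers, and the module name you pass in is whatever your retriever actually needs, which this excerpt does not name.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(required: Iterable[str]) -> List[str]:
    # find_spec returns None when a top-level module cannot be located,
    # without importing it. Pass top-level names only (no dots).
    return [name for name in required if importlib.util.find_spec(name) is None]


def require_modules(required: Iterable[str], feature: str) -> None:
    """Fail at startup, with an actionable message, if a feature's modules are absent."""
    missing = missing_modules(required)
    if missing:
        raise RuntimeError(
            f"{feature} needs modules that are not installed: {', '.join(missing)}. "
            "Install them before starting the pipeline."
        )
```

Calling `require_modules([...], "BM25 retrieval")` while the application boots moves the error from the first query to the first second of startup.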

AI Letters #18 - The Chunking Test: Two Frameworks Are Identical, One Is Not

· 9 min read
EngineersOfAI
AI Engineering Education

How you split documents determines what your retriever finds. Most tutorials spend two lines on this. They shouldn't.

Every RAG tutorial reaches the chunking step and sprints past it. "Split into chunks of 500 characters with 50 overlap - done." The code runs. The demo works. The demo is not production.

The split you choose affects embedding quality, retrieval precision, and whether your LLM gets enough context to say something useful. Chunking is not configuration. It's architecture.

We ran all three frameworks against the same document with identical parameters. The line counts came out nearly equal. The chunk outputs did not. One framework's default splitter interprets chunk_size=300 as tokens, not characters - producing 2 chunks averaging 986 characters each instead of 12 chunks averaging 163 characters. Same parameter name, different semantics.
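The unit mismatch is easy to reproduce without any framework. A minimal sketch, assuming whitespace-separated words as a stand-in for real tokenizer tokens; both splitter functions are hypothetical and much simpler than the splitters the frameworks ship.

```python
from typing import List


def split_by_chars(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    # chunk_size counts characters here.
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def split_by_tokens(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    # chunk_size counts tokens here; whitespace words approximate a real tokenizer.
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]


doc = "word " * 600  # 3000 characters, 600 whitespace tokens
print(len(split_by_chars(doc, chunk_size=300)))   # 10
print(len(split_by_tokens(doc, chunk_size=300)))  # 2
```

The argument is identical and the chunk count differs by a factor of five. These toy numbers are not the benchmark's figures, but the cause is the same: the parameter name says nothing about its unit.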

AI Letters #17 - The PDF Test: Which Framework Makes You Learn a New API?

· 7 min read
EngineersOfAI
AI Engineering Education

The question isn't how many lines it takes to load a PDF. It's how many new concepts you need to know.

Three frameworks. Same task: take a PDF off disk, build a queryable index, answer a question. We measured lines of code and wall-clock indexing time.

The line counts are almost beside the point. What matters is this: two of the three frameworks require you to learn a loader API that the earlier string-based version of the pipeline never needed. One doesn't.

AI Letters #16 - Week 1 Scorecard: Which LLM Framework Wins on Developer Experience?

· 6 min read
EngineersOfAI
AI Engineering Education

The simplest framework won on developer experience. That's the expected result. The interesting part is which benchmark it lost - and why.

Six weeks of benchmarks. Six dimensions of the developer experience you have on day one of a new project. One aggregate score.

SynapseKit wins 5 of 6. The one it loses - raw import speed - is a deliberate architectural choice by LangChain, not a gap in capability. Understanding why reveals more about framework design than any of the wins.

AI Letters #15 - The Framework That Tells You What Went Wrong

· 9 min read
EngineersOfAI
AI Engineering Education

A good error message is a senior engineer on call. A bad one is a stack trace pointing at framework internals you've never seen.

You paste the wrong provider name. Or forget the API key. Or pass a string where a list was expected. These are not edge cases - they're the first five minutes of every new project.

What separates frameworks at this moment isn't features. It's whether the error message tells you what to do next or forces you to read source code.

We triggered five common mistakes across SynapseKit, LangChain, and LlamaIndex. We scored each error on clarity, actionability, and - most importantly - whether it fails at configuration time or only when you run your first query.

One framework lets you build a complete index before it tells you the API key was wrong.
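Failing at configuration time can be as simple as validating in the constructor. The sketch below is an illustration only: `LLMConfig` and the provider set are hypothetical, and it catches an unknown provider or a missing key, not a key the provider would reject.

```python
from dataclasses import dataclass

# Illustrative provider names, not an exhaustive or framework-specific list.
KNOWN_PROVIDERS = frozenset({"openai", "anthropic", "groq"})


@dataclass(frozen=True)
class LLMConfig:
    provider: str
    api_key: str

    def __post_init__(self) -> None:
        # Runs when the object is constructed, long before any index is built.
        if self.provider not in KNOWN_PROVIDERS:
            raise ValueError(
                f"Unknown provider {self.provider!r}. "
                f"Expected one of: {', '.join(sorted(KNOWN_PROVIDERS))}."
            )
        if not self.api_key.strip():
            raise ValueError(
                f"No API key supplied for provider {self.provider!r}. "
                "Pass api_key explicitly or load it from your environment first."
            )
```

A typo like `LLMConfig("opnai", "sk-...")` now fails on the line that contains the typo, and the message lists the valid options.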

AI Letters #14 - Switching LLM Providers Shouldn't Cost You 3 Files

· 7 min read
EngineersOfAI
AI Engineering Education

Switching providers shouldn't require switching imports. But for most frameworks, it does - every time.

You're three months into production. The OpenAI bill lands and it's 40% higher than projected. Someone mentions Groq is 10x faster and a fraction of the cost for your workload. You decide to try it. Then you find out what "switching providers" actually means for your framework.

For LangChain: uninstall langchain-openai, install langchain-groq, update the import line, update the constructor class name. Then do the same for every downstream test file that instantiated the LLM. For LlamaIndex: similar story - llama-index-llms-openai out, llama-index-llms-groq in, import changed, Settings.llm assignment changed. For SynapseKit: change one string.
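The "change one string" pattern usually comes down to a registry keyed by provider name. What follows is a hypothetical sketch of that idea, not SynapseKit's actual API: `FakeLLM`, `PROVIDERS`, `make_llm`, and the model names are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Type


@dataclass
class FakeLLM:
    """Stand-in client; a real registry would map names to real provider clients."""
    provider: str
    model: str

    def complete(self, prompt: str) -> str:
        return f"[{self.provider}/{self.model}] {prompt}"


# One lookup table owns the provider-to-class mapping, so call sites never import a provider.
PROVIDERS: Dict[str, Type[FakeLLM]] = {"openai": FakeLLM, "groq": FakeLLM}


def make_llm(spec: str) -> FakeLLM:
    """Build a client from a 'provider/model' string."""
    provider, sep, model = spec.partition("/")
    if not sep or not model:
        raise ValueError(f"Expected 'provider/model', got {spec!r}.")
    if provider not in PROVIDERS:
        raise ValueError(f"Unknown provider {provider!r}. Known: {', '.join(sorted(PROVIDERS))}.")
    return PROVIDERS[provider](provider=provider, model=model)


llm = make_llm("openai/model-a")  # switching providers means editing only this string
```

Downstream code and tests depend on `make_llm` and the `complete` method, so a provider switch touches one string instead of an install, an import, and a constructor.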

That gap - 1 line vs 3 changes - holds for every provider switch I tested. Five providers. Three frameworks. Same pattern every time.

AI Letters #13 - The RAM Cost of LLM Frameworks. Measured.

· 7 min read
EngineersOfAI
AI Engineering Education

The framework doesn't own your memory budget. The embedding model does. Pick your framework on other criteria.

Your RAG app is using 380 MB. First instinct: blame the framework. Switch to something leaner. Maybe that'll bring it down to 200. It won't - the 380 MB isn't the framework. It's the embedding model sitting in RAM. The framework overhead, across the full lifecycle from import to active use, accounts for 20–30 MB of difference between your best and worst options. You were optimizing the wrong variable.

The number that actually matters is the import floor - what you pay before any LLM work starts, before you index a single document, before the embedding model loads. I measured RSS at three stages across SynapseKit, LangChain, and LlamaIndex to find it.
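A staged measurement along these lines can be sketched with the standard library alone. Assumptions to note: this uses the Unix-only `resource` module and reads peak RSS rather than current RSS, so it may not match how the benchmark took its readings. `peak_rss_mb` and `measure_stage` are hypothetical helpers.

```python
import resource  # Unix-only; not available on Windows
import sys
from typing import Callable, Tuple


def peak_rss_mb() -> float:
    """Peak resident set size of this process, in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def measure_stage(label: str, stage: Callable[[], object]) -> Tuple[str, float]:
    """Run one stage and return how much the peak RSS grew, in MB."""
    before = peak_rss_mb()
    stage()
    return label, peak_rss_mb() - before


def import_stage() -> None:
    import json  # swap in the framework import you want to measure


print(measure_stage("import", import_stage))
```

Run each framework in a fresh interpreter process. A module that is already imported costs nothing the second time, and peak RSS never goes down within a process.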

AI Letters #12 - How Much Code Does RAG Actually Take? We Measured.

· 9 min read
EngineersOfAI
AI Engineering Education

Lines of code is a proxy for cognitive load - how much API surface you need to hold in your head before you can get something working.


You've read the benchmarks. You've seen the "getting started" examples. Every framework looks reasonable in a five-line README snippet. You don't find out what you actually signed up for until you're building something real - wiring together five objects to do one thing, reading three docs pages to understand why the chain API works the way it does, debugging a provider mismatch at 11pm the night before a demo.

The real question when evaluating an LLM framework isn't capability. They all support RAG, tool calling, and retrieval. It's: how much of this framework do I need to understand before my first pipeline runs?

I built the same RAG pipeline in SynapseKit, LangChain, and LlamaIndex. Same task, same document, same query. Then counted the lines.

Disclosure: I'm the author of SynapseKit. All benchmarks are reproducible - the Kaggle notebook is public: LLM Showdown #3 - Hello RAG: LoC.

