
AI Letters #22 - Conversation Memory in RAG: One Param vs Forty Lines of Boilerplate

11 min read
EngineersOfAI
AI Engineering Education

RAG gives the model context from documents. Memory gives it context from the conversation. Without both, your chatbot doesn't know what it just said.

Every RAG system eventually faces the same question: what happens on the second turn? The user asks a follow-up. "What did you mean by that?" "Can you give me an example?" "How does that compare to what you said earlier?" Without memory, the model treats each question as the first. Context from the previous turn is gone. The answer it gives to the follow-up is either wrong, generic, or disconnected from what came before.

Conversation memory is the fix. A buffer of past exchanges gets prepended to the retrieved context and injected into the prompt. The model now has the document context and the conversation context. It can use both. The question is how much it costs to add this to your pipeline - and what happens when the conversation gets long enough that you have to start dropping old messages.
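
Mechanically, the injection is simple. Here is a framework-agnostic sketch of it; names like build_prompt and retrieved_chunks are illustrative, not any library's API:

# Framework-agnostic sketch: both kinds of context end up in one prompt.
def build_prompt(history: list[dict], retrieved_chunks: list[str], question: str) -> list[dict]:
    context = "\n\n".join(retrieved_chunks)
    messages = [{"role": "system", "content": f"Answer using this context:\n{context}"}]
    messages.extend(history)  # past user/assistant turns, oldest first
    messages.append({"role": "user", "content": question})
    return messages

history = [
    {"role": "user", "content": "What is RAG?"},
    {"role": "assistant", "content": "Retrieval-augmented generation: retrieve relevant chunks, then answer."},
]
prompt = build_prompt(history, ["RAG grounds the model in retrieved documents..."], "Can you give me an example?")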

We wired identical multi-turn memory into RAG pipelines across SynapseKit 1.4, LangChain 1.2, and LlamaIndex Core 0.14. Same conversation, same task, same question: how many lines does it take to add memory, and what happens at the edge cases? The LoC gap is the widest of any benchmark in this series. The persistence and window-strategy differences are what will matter in your production system.

What We Measured

Task: Build a multi-turn RAG pipeline with conversation memory. Run 5 turns of questions. Measure lines of code, window strategy, and message retention at different window sizes.

Metric              What it captures
──────────────────────────────────────────────────────────────────────────
Lines of code       Code to add multi-turn memory to an existing RAG pipeline
Window strategy     How old messages get dropped - turn count vs token limit
Message retention   Messages kept after 5 turns at window sizes 1, 2, 3, 5

Frameworks: SynapseKit 1.4, LangChain 1.2, LlamaIndex Core 0.14. Kaggle CPU. Disclosure: I'm the author of SynapseKit. All code is on Kaggle - fork and run yourself.

The Code

SynapseKit - 1 constructor argument:

from synapsekit import RAG

rag = RAG(model="gpt-4o-mini", api_key=KEY, memory_window=5)
await rag.add_documents(DOCS)
r1 = await rag.ask("What is RAG?")
r2 = await rag.ask("How does it improve accuracy?")
r3 = await rag.ask("Which retrieval method is fastest?")

Memory is a single parameter on the RAG constructor. memory_window=5 keeps the last 5 turns. Every subsequent .ask() call automatically prepends the conversation history to the retrieved context. Zero additional setup. The tradeoff: in-memory only, no persistence across sessions.

LangChain - session store + getter + LCEL wiring:

from langchain_community.retrievers import BM25Retriever
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.chat_history import InMemoryChatMessageHistory

store = {}
def get_session_history(session_id: str) -> InMemoryChatMessageHistory:
    if session_id not in store:
        store[session_id] = InMemoryChatMessageHistory()
    return store[session_id]

retriever = BM25Retriever.from_texts(DOCS, k=3)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Context: {ctx}"),
    MessagesPlaceholder("history"),
    ("human", "{question}"),
])
# Pass question and injected history through untouched; feed only the question string to the retriever
chain = (RunnablePassthrough.assign(ctx=lambda x: retriever.invoke(x["question"])) | prompt | ChatOpenAI(model="gpt-4o-mini"))
chain_with_history = RunnableWithMessageHistory(
    chain, get_session_history,
    input_messages_key="question", history_messages_key="history"
)
r1 = chain_with_history.invoke({"question": "What is RAG?"}, config={"configurable": {"session_id": "s1"}})

RunnableWithMessageHistory is the canonical LangChain pattern. You define a session store (here in-memory, but can be Redis/DynamoDB/Postgres), a getter function, and wire it around your chain. Twelve lines before a single question is asked. The payoff: swap InMemoryChatMessageHistory for RedisChatMessageHistory and you have persistent multi-user memory with no other changes.

LlamaIndex - token-budget buffer on the chat engine:

from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-4o-mini")
index = VectorStoreIndex.from_documents([Document(text=d) for d in DOCS])
memory = ChatMemoryBuffer.from_defaults(token_limit=1500)
engine = index.as_chat_engine(memory=memory, chat_mode="context")
r1 = engine.chat("What is RAG?")
r2 = engine.chat("How does it improve accuracy?")
r3 = engine.chat("Which retrieval method is fastest?")

ChatMemoryBuffer takes a token_limit instead of a turn count. The engine drops old messages when the buffer exceeds the limit. Clean API - comparable conciseness to SynapseKit at the chat engine level. Can serialize to SimpleChatStore for lightweight persistence.
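
If the history needs to survive a process restart, the lightweight route is to back the buffer with SimpleChatStore and persist it to JSON. A minimal sketch, assuming current LlamaIndex Core APIs (check the docs for exact signatures); the key and path are illustrative:

from llama_index.core.storage.chat_store import SimpleChatStore
from llama_index.core.memory import ChatMemoryBuffer

chat_store = SimpleChatStore()
memory = ChatMemoryBuffer.from_defaults(
    token_limit=1500,
    chat_store=chat_store,
    chat_store_key="user-42",  # illustrative: one key per user/session
)
# ... run engine.chat(...) turns as above ...
chat_store.persist(persist_path="chat_store.json")

# Later session: reload the store and rebuild the buffer on top of it
restored = SimpleChatStore.from_persist_path("chat_store.json")
memory = ChatMemoryBuffer.from_defaults(
    token_limit=1500, chat_store=restored, chat_store_key="user-42"
)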

The Numbers

Framework Imports Functional Total
──────────────────────────────────────────
SynapseKit 1 5 6
LlamaIndex 3 6 9
LangChain 5 12 17

This is the widest LoC gap in the series. LangChain's session store pattern adds 5 lines of boilerplate before the chain is even built - the store dict and the getter function - plus the RunnableWithMessageHistory wrapper around the finished chain. That boilerplate is the price of flexibility. You get pluggable backends. SynapseKit gives you the same result in one argument, but you're locked to in-memory.

Window Strategy: The Detail That Matters

All three frameworks bound how much history reaches the prompt, but they differ in what "window" means and who does the dropping:

Framework    Strategy                   Reasoning unit   Control
──────────────────────────────────────────────────────────────────
SynapseKit   Sliding window             Turns            memory_window=N
LangChain    Store all; trim manually   Turns            slice last N*2
LlamaIndex   Token budget               Tokens           token_limit=N

Turn-count windows (SynapseKit, LangChain) are easy to reason about: "keep the last 3 exchanges." The problem is that turns vary wildly in length. A 3-turn window might be 200 tokens or 2,000 tokens depending on the conversation. At scale, that variance creates unpredictable prompt sizes.

Token-limit windows (LlamaIndex) are harder to reason about - "keep 1,500 tokens of history" doesn't tell you how many turns that is. But they're more predictable in terms of prompt size, which is what actually matters for LLM API cost and latency. You know exactly how much context you're sending.
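
To make the token-budget behaviour concrete, here is a framework-agnostic sketch of the trimming loop. It uses a rough word-count estimate purely for illustration; a real buffer counts with the model's tokenizer:

def trim_to_token_budget(messages: list[dict], token_limit: int) -> list[dict]:
    """Drop the oldest messages until the history fits the budget."""
    def estimate_tokens(msg: dict) -> int:
        # Crude estimate (~1.3 tokens per word), for illustration only
        return int(len(msg["content"].split()) * 1.3)

    kept = list(messages)
    while kept and sum(estimate_tokens(m) for m in kept) > token_limit:
        kept.pop(0)  # oldest message is dropped first
    return kept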

Message retention after 5 turns at different window sizes:

Window SynapseKit LangChain LlamaIndex
─────────────────────────────────────────────
w=1 2 msg 2 msg ~2 msg
w=2 4 msg 4 msg ~4 msg
w=3 6 msg 6 msg ~6 msg
w=5 10 msg 10 msg 10 msg

At equivalent settings, all three retain the same number of messages. The difference surfaces when conversations are long and token-dense: LlamaIndex's token budget starts dropping messages earlier than a turn-count window that nominally covers the same number of turns.

Persistence: Where They Truly Split

Feature            SynapseKit   LangChain   LlamaIndex
────────────────────────────────────────────────────────
In-memory          Yes          Yes         Yes
Redis              No           Yes         No
DynamoDB           No           Yes         No
Postgres           No           Yes         No
JSON file          No           Yes         Yes (SimpleChatStore)
Custom backend     No           Yes         Partial
clear()            Yes          Yes         Yes
Format to string   Yes          Yes         Yes

LangChain's persistence ecosystem is the clear winner. Swap one import and your session store moves from in-memory to Redis. This is the critical path for any multi-user production app - users expect their conversation to persist across sessions, across devices, across server restarts.
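
A minimal sketch of that swap, assuming a local Redis and the langchain-community history class (the URL is illustrative): only the getter changes; the chain and the RunnableWithMessageHistory wrapper stay exactly as written above.

from langchain_community.chat_message_histories import RedisChatMessageHistory

# Same getter signature as before - only the storage backend changes
def get_session_history(session_id: str) -> RedisChatMessageHistory:
    return RedisChatMessageHistory(session_id, url="redis://localhost:6379/0")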

SynapseKit's in-memory limitation is the one place where its simplicity becomes a real constraint. For a single-user, single-session chatbot, it's fine. For a production app with multiple users, you'll either fork the memory implementation or migrate to LangChain for this layer.

What This Means for Engineers

  1. Don't build your own memory layer. All three frameworks provide one. Rolling your own conversation buffer means reinventing trimming logic, format conversion, and history injection - work that's already done for you.

  2. Choose turn-count windows for simple apps, token-budget windows for production. Turn count is easy to explain to stakeholders. Token budget is what keeps your API costs predictable at scale. If you're serving real users, measure the token distribution of your turns before deciding.

  3. LangChain's RunnableWithMessageHistory is boilerplate, but it's good boilerplate. The session getter pattern decouples your chain from the storage backend. When you move to Redis in production, you change one line. That's worth 7 extra lines at setup time.

  4. LlamaIndex's chat_engine is the fastest path to a working multi-turn RAG demo. Two lines - memory and engine. If you're building a prototype or an internal tool where persistence doesn't matter, this is the fastest start.

  5. Memory and RAG interact in ways that will surprise you. When the retrieved context changes and the memory context contradicts it, the model has to reconcile them. This creates subtle failures - confident-sounding answers that combine stale memory context with fresh document context incorrectly. Test multi-turn RAG with contradictory document updates before shipping.

The Corollary Most People Miss

The memory problem compounds. A single-turn RAG pipeline has one context window to manage: the retrieved documents. A multi-turn RAG pipeline has two: the documents and the conversation history. They compete for the same token budget.

Most teams add memory and don't adjust their retrieval budget. The result: the total context grows until it hits the model's context limit and something gets truncated - usually silently. The retrieved documents get cut first because they're appended after the history. The model starts answering from memory rather than documents. Retrieval quality degrades. Nobody notices because the answers still sound coherent.

The fix is explicit: set max_tokens_for_context = total_budget - memory_tokens - system_prompt_tokens and cap your retriever's top_k accordingly. None of the three frameworks do this automatically.

Context budget allocation (simplified):
────────────────────────────────────────────────
Total context window 128,000 tokens
System prompt ~500 tokens
Conversation memory ~2,000 tokens (10 turns × ~200 tokens/turn)
Retrieved documents ~4,000 tokens (top-5 chunks × ~800 tokens)
LLM response budget ~2,000 tokens
────────────────────────────────────────────────
Remaining buffer 119,500 tokens

Do the maths before you hit the limit, not after.
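
A back-of-the-envelope version of that budgeting. Every number below is an assumption to replace with your own measurements, and the total is deliberately set below the model maximum because cost and latency usually bind before the 128k window does:

# Illustrative numbers - substitute your own measurements
TOTAL_BUDGET       = 8_000   # tokens you're willing to send per call
SYSTEM_PROMPT_TOKS = 500
MEMORY_TOKENS      = 2_000   # e.g. 10 turns x ~200 tokens/turn
AVG_CHUNK_TOKENS   = 800

max_tokens_for_context = TOTAL_BUDGET - MEMORY_TOKENS - SYSTEM_PROMPT_TOKS
top_k = max(1, max_tokens_for_context // AVG_CHUNK_TOKENS)
print(max_tokens_for_context, top_k)  # 5500 tokens left for documents -> cap the retriever at top_k=6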

Three Things Worth Doing This Week

  1. Add memory_window or token_limit to your RAG pipeline today. If you're building a chat interface on top of RAG and not passing history into the prompt, every follow-up question is being answered in isolation. That's a worse user experience than a basic chatbot.

  2. Measure your average conversation length in tokens. Pull a sample of real conversations, tokenize them, and see what percentile hits 1,500 tokens. That's your token_limit starting point - a minimal measurement sketch follows this list. A turn-count window of 5 in a technical conversation can hit 3,000 tokens easily.

  3. Read the Kaggle notebook. Full code, retention tables at different window sizes, and the live demo: LLM Showdown #13 - Conversation Memory in RAG
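
The measurement sketch referenced in point 2, using tiktoken; the encoding name and the transcript file format are assumptions to adapt to your own logs:

import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # pick the encoding that matches your model

def conversation_tokens(messages: list[dict]) -> int:
    return sum(len(enc.encode(m["content"])) for m in messages)

# Assumes one JSON file holding a list of conversations, each a list of
# {"role": ..., "content": ...} messages - adapt to your own log format
with open("conversations.json") as f:
    conversations = json.load(f)

lengths = sorted(conversation_tokens(c) for c in conversations)
p50 = lengths[len(lengths) // 2]
p90 = lengths[int(len(lengths) * 0.9)]
print(f"median={p50} tokens, p90={p90} tokens")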


Memory is the difference between a search engine with an LLM frontend and an actual conversational AI. The frameworks all provide it. The split is in how they drop old messages and whether they persist across sessions. One approach gives you a single argument and no persistence. One gives you a token budget and lightweight JSON persistence. One gives you full production backends at the cost of boilerplate. Pick the one that matches where your app needs to be in six months, not where it is today.

Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

Want to Think Like an AI Architect?

Join engineers receiving weekly breakdowns of AI systems, production failures, and architectural decisions.