
AI Letters #22 - Conversation Memory in RAG: One Param vs Forty Lines of Boilerplate

11 min read
EngineersOfAI
AI Engineering Education

RAG gives the model context from documents. Memory gives it context from the conversation. Without both, your chatbot doesn't know what it just said.

Every RAG system eventually faces the same question: what happens on the second turn? The user asks a follow-up. "What did you mean by that?" "Can you give me an example?" "How does that compare to what you said earlier?" Without memory, the model treats each question as the first. Context from the previous turn is gone. The answer it gives to the follow-up is either wrong, generic, or disconnected from what came before.

Conversation memory is the fix. A buffer of past exchanges gets prepended to the retrieved context and injected into the prompt. The model now has the document context and the conversation context. It can use both. The question is how much it costs to add this to your pipeline - and what happens when the conversation gets long enough that you have to start dropping old messages.
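
Mechanically, the injection is simple. Here is a framework-agnostic sketch of it; names like build_prompt and retrieved_chunks are illustrative, not any library's API:

# Framework-agnostic sketch: both kinds of context end up in one prompt.
def build_prompt(history: list[dict], retrieved_chunks: list[str], question: str) -> list[dict]:
    context = "\n\n".join(retrieved_chunks)
    messages = [{"role": "system", "content": f"Answer using this context:\n{context}"}]
    messages.extend(history)  # past user/assistant turns, oldest first
    messages.append({"role": "user", "content": question})
    return messages

history = [
    {"role": "user", "content": "What is RAG?"},
    {"role": "assistant", "content": "Retrieval-augmented generation: retrieve relevant chunks, then answer."},
]
prompt = build_prompt(history, ["RAG grounds the model in retrieved documents..."], "Can you give me an example?")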

We wired identical multi-turn memory into RAG pipelines across SynapseKit 1.4, LangChain 1.2, and LlamaIndex Core 0.14. Same conversation, same task, same question: how many lines does it take to add memory, and what happens at the edge cases? The LoC gap is the widest of any benchmark in this series. The persistence and window-strategy differences are what will matter in your production system.

What We Measured

Task: Build a multi-turn RAG pipeline with conversation memory. Run 5 turns of questions. Measure lines of code, window strategy, and message retention at different window sizes.

Metric              What it captures
──────────────────────────────────────────────────────────────────────────
Lines of code       Code to add multi-turn memory to an existing RAG pipeline
Window strategy     How old messages get dropped - turn count vs token limit
Message retention   Messages kept after 5 turns at window sizes 1, 2, 3, 5

Frameworks: SynapseKit 1.4, LangChain 1.2, LlamaIndex Core 0.14. Kaggle CPU. Disclosure: I'm the author of SynapseKit. All code is on Kaggle - fork and run yourself.

The Code

SynapseKit - 1 constructor argument:

from synapsekit import RAG

rag = RAG(model="gpt-4o-mini", api_key=KEY, memory_window=5)
await rag.add_documents(DOCS)
r1 = await rag.ask("What is RAG?")
r2 = await rag.ask("How does it improve accuracy?")
r3 = await rag.ask("Which retrieval method is fastest?")

Memory is a single parameter on the RAG constructor. memory_window=5 keeps the last 5 turns. Every subsequent .ask() call automatically prepends the conversation history to the retrieved context. Zero additional setup. The tradeoff: in-memory only, no persistence across sessions.

LangChain - session store + getter + LCEL wiring:

from langchain_community.retrievers import BM25Retriever
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.chat_history import InMemoryChatMessageHistory

store = {}
def get_session_history(session_id: str) -> InMemoryChatMessageHistory:
    if session_id not in store:
        store[session_id] = InMemoryChatMessageHistory()
    return store[session_id]

retriever = BM25Retriever.from_texts(DOCS, k=3)
prompt = ChatPromptTemplate.from_messages([
    ("system", "Context: {ctx}"),
    MessagesPlaceholder("history"),
    ("human", "{question}"),
])
# Pass question and injected history through untouched; feed only the question string to the retriever
chain = (RunnablePassthrough.assign(ctx=lambda x: retriever.invoke(x["question"])) | prompt | ChatOpenAI(model="gpt-4o-mini"))
chain_with_history = RunnableWithMessageHistory(
    chain, get_session_history,
    input_messages_key="question", history_messages_key="history"
)
r1 = chain_with_history.invoke({"question": "What is RAG?"}, config={"configurable": {"session_id": "s1"}})

RunnableWithMessageHistory is the canonical LangChain pattern. You define a session store (here in-memory, but can be Redis/DynamoDB/Postgres), a getter function, and wire it around your chain. Twelve lines before a single question is asked. The payoff: swap InMemoryChatMessageHistory for RedisChatMessageHistory and you have persistent multi-user memory with no other changes.

LlamaIndex - token-budget buffer on the chat engine:

from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-4o-mini")
index = VectorStoreIndex.from_documents([Document(text=d) for d in DOCS])
memory = ChatMemoryBuffer.from_defaults(token_limit=1500)
engine = index.as_chat_engine(memory=memory, chat_mode="context")
r1 = engine.chat("What is RAG?")
r2 = engine.chat("How does it improve accuracy?")
r3 = engine.chat("Which retrieval method is fastest?")

ChatMemoryBuffer takes a token_limit instead of a turn count. The engine drops old messages when the buffer exceeds the limit. Clean API - comparable conciseness to SynapseKit at the chat engine level. Can serialize to SimpleChatStore for lightweight persistence.
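
If the history needs to survive a process restart, the lightweight route is to back the buffer with SimpleChatStore and persist it to JSON. A minimal sketch, assuming current LlamaIndex Core APIs (check the docs for exact signatures); the key and path are illustrative:

from llama_index.core.storage.chat_store import SimpleChatStore
from llama_index.core.memory import ChatMemoryBuffer

chat_store = SimpleChatStore()
memory = ChatMemoryBuffer.from_defaults(
    token_limit=1500,
    chat_store=chat_store,
    chat_store_key="user-42",  # illustrative: one key per user/session
)
# ... run engine.chat(...) turns as above ...
chat_store.persist(persist_path="chat_store.json")

# Later session: reload the store and rebuild the buffer on top of it
restored = SimpleChatStore.from_persist_path("chat_store.json")
memory = ChatMemoryBuffer.from_defaults(
    token_limit=1500, chat_store=restored, chat_store_key="user-42"
)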

The Numbers

Framework Imports Functional Total
──────────────────────────────────────────
SynapseKit 1 5 6
LlamaIndex 3 6 9
LangChain 5 12 17

This is the widest LoC gap in the series. LangChain's session store pattern adds 5 lines of boilerplate before the chain is even built - the store dict and the getter function - plus the RunnableWithMessageHistory wrapper around the finished chain. That boilerplate is the price of flexibility. You get pluggable backends. SynapseKit gives you the same result in one argument, but you're locked to in-memory.

Window Strategy: The Detail That Matters

All three frameworks bound how much history reaches the prompt, but they differ in what "window" means and who does the dropping:

Framework    Strategy                   Reasoning unit   Control
──────────────────────────────────────────────────────────────────
SynapseKit   Sliding window             Turns            memory_window=N
LangChain    Store all; trim manually   Turns            slice last N*2
LlamaIndex   Token budget               Tokens           token_limit=N

Turn-count windows (SynapseKit, LangChain) are easy to reason about: "keep the last 3 exchanges." The problem is that turns vary wildly in length. A 3-turn window might be 200 tokens or 2,000 tokens depending on the conversation. At scale, that variance creates unpredictable prompt sizes.

Token-limit windows (LlamaIndex) are harder to reason about - "keep 1,500 tokens of history" doesn't tell you how many turns that is. But they're more predictable in terms of prompt size, which is what actually matters for LLM API cost and latency. You know exactly how much context you're sending.
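
To make the token-budget behaviour concrete, here is a framework-agnostic sketch of the trimming loop. It uses a rough word-count estimate purely for illustration; a real buffer counts with the model's tokenizer:

def trim_to_token_budget(messages: list[dict], token_limit: int) -> list[dict]:
    """Drop the oldest messages until the history fits the budget."""
    def estimate_tokens(msg: dict) -> int:
        # Crude estimate (~1.3 tokens per word), for illustration only
        return int(len(msg["content"].split()) * 1.3)

    kept = list(messages)
    while kept and sum(estimate_tokens(m) for m in kept) > token_limit:
        kept.pop(0)  # oldest message is dropped first
    return kept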

Message retention after 5 turns at different window sizes:

Window SynapseKit LangChain LlamaIndex
─────────────────────────────────────────────
w=1 2 msg 2 msg ~2 msg
w=2 4 msg 4 msg ~4 msg
w=3 6 msg 6 msg ~6 msg
w=5 10 msg 10 msg 10 msg

At equivalent settings, all three retain the same number of messages. The difference surfaces when conversations are long and token-dense: LlamaIndex's token budget starts dropping messages earlier than a turn-count window that nominally covers the same number of turns.

Persistence: Where They Truly Split

Feature            SynapseKit   LangChain   LlamaIndex
────────────────────────────────────────────────────────
In-memory          Yes          Yes         Yes
Redis              No           Yes         No
DynamoDB           No           Yes         No
Postgres           No           Yes         No
JSON file          No           Yes         Yes (SimpleChatStore)
Custom backend     No           Yes         Partial
clear()            Yes          Yes         Yes
Format to string   Yes          Yes         Yes

LangChain's persistence ecosystem is the clear winner. Swap one import and your session store moves from in-memory to Redis. This is the critical path for any multi-user production app - users expect their conversation to persist across sessions, across devices, across server restarts.
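
A minimal sketch of that swap, assuming a local Redis and the langchain-community history class (the URL is illustrative): only the getter changes; the chain and the RunnableWithMessageHistory wrapper stay exactly as written above.

from langchain_community.chat_message_histories import RedisChatMessageHistory

# Same getter signature as before - only the storage backend changes
def get_session_history(session_id: str) -> RedisChatMessageHistory:
    return RedisChatMessageHistory(session_id, url="redis://localhost:6379/0")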

SynapseKit's in-memory limitation is the one place where its simplicity becomes a real constraint. For a single-user, single-session chatbot, it's fine. For a production app with multiple users, you'll either fork the memory implementation or migrate to LangChain for this layer.

What This Means for Engineers

  1. Don't build your own memory layer. All three frameworks provide one. Rolling your own conversation buffer means reinventing trimming logic, format conversion, and history injection - work that's already done for you.

  2. Choose turn-count windows for simple apps, token-budget windows for production. Turn count is easy to explain to stakeholders. Token budget is what keeps your API costs predictable at scale. If you're serving real users, measure the token distribution of your turns before deciding.

  3. LangChain's RunnableWithMessageHistory is boilerplate, but it's good boilerplate. The session getter pattern decouples your chain from the storage backend. When you move to Redis in production, you change one line. That's worth 7 extra lines at setup time.

  4. LlamaIndex's chat_engine is the fastest path to a working multi-turn RAG demo. Two lines - memory and engine. If you're building a prototype or an internal tool where persistence doesn't matter, this is the fastest start.

  5. Memory and RAG interact in ways that will surprise you. When the retrieved context changes and the memory context contradicts it, the model has to reconcile them. This creates subtle failures - confident-sounding answers that combine stale memory context with fresh document context incorrectly. Test multi-turn RAG with contradictory document updates before shipping.

The Corollary Most People Miss

The memory problem compounds. A single-turn RAG pipeline has one context window to manage: the retrieved documents. A multi-turn RAG pipeline has two: the documents and the conversation history. They compete for the same token budget.

Most teams add memory and don't adjust their retrieval budget. The result: the total context grows until it hits the model's context limit and something gets truncated - usually silently. The retrieved documents get cut first because they're appended after the history. The model starts answering from memory rather than documents. Retrieval quality degrades. Nobody notices because the answers still sound coherent.

The fix is explicit: set max_tokens_for_context = total_budget - memory_tokens - system_prompt_tokens and cap your retriever's top_k accordingly. None of the three frameworks do this automatically.

Context budget allocation (simplified):
────────────────────────────────────────────────
Total context window 128,000 tokens
System prompt ~500 tokens
Conversation memory ~2,000 tokens (10 turns × ~200 tokens/turn)
Retrieved documents ~4,000 tokens (top-5 chunks × ~800 tokens)
LLM response budget ~2,000 tokens
────────────────────────────────────────────────
Remaining buffer 119,500 tokens

Do the maths before you hit the limit, not after.
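
A back-of-the-envelope version of that budgeting. Every number below is an assumption to replace with your own measurements, and the total is deliberately set below the model maximum because cost and latency usually bind before the 128k window does:

# Illustrative numbers - substitute your own measurements
TOTAL_BUDGET       = 8_000   # tokens you're willing to send per call
SYSTEM_PROMPT_TOKS = 500
MEMORY_TOKENS      = 2_000   # e.g. 10 turns x ~200 tokens/turn
AVG_CHUNK_TOKENS   = 800

max_tokens_for_context = TOTAL_BUDGET - MEMORY_TOKENS - SYSTEM_PROMPT_TOKS
top_k = max(1, max_tokens_for_context // AVG_CHUNK_TOKENS)
print(max_tokens_for_context, top_k)  # 5500 tokens left for documents -> cap the retriever at top_k=6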

Three Things Worth Doing This Week

  1. Add memory_window or token_limit to your RAG pipeline today. If you're building a chat interface on top of RAG and not passing history into the prompt, every follow-up question is being answered in isolation. That's a worse user experience than a basic chatbot.

  2. Measure your average conversation length in tokens. Pull a sample of real conversations, tokenize them, and see what percentile hits 1,500 tokens. That's your token_limit starting point - a minimal measurement sketch follows this list. A turn-count window of 5 in a technical conversation can hit 3,000 tokens easily.

  3. Read the Kaggle notebook. Full code, retention tables at different window sizes, and the live demo: LLM Showdown #13 - Conversation Memory in RAG
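
The measurement sketch referenced in point 2, using tiktoken; the encoding name and the transcript file format are assumptions to adapt to your own logs:

import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # pick the encoding that matches your model

def conversation_tokens(messages: list[dict]) -> int:
    return sum(len(enc.encode(m["content"])) for m in messages)

# Assumes one JSON file holding a list of conversations, each a list of
# {"role": ..., "content": ...} messages - adapt to your own log format
with open("conversations.json") as f:
    conversations = json.load(f)

lengths = sorted(conversation_tokens(c) for c in conversations)
p50 = lengths[len(lengths) // 2]
p90 = lengths[int(len(lengths) * 0.9)]
print(f"median={p50} tokens, p90={p90} tokens")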


Memory is the difference between a search engine with an LLM frontend and an actual conversational AI. The frameworks all provide it. The split is in how they drop old messages and whether they persist across sessions. One approach gives you a single argument and no persistence. One gives you a token budget and lightweight JSON persistence. One gives you full production backends at the cost of boilerplate. Pick the one that matches where your app needs to be in six months, not where it is today.

Engineers of AI

Read more: www.engineersofai.com

If this was useful, forward it to one engineer who should be reading it.

Want to Think Like an AI Architect?

Join engineers receiving weekly breakdowns of AI systems, production failures, and architectural decisions.