Context Window Management
The Support Chat That Ate a Budget
A fintech startup deployed a customer support chatbot in March. By May, the finance team was asking hard questions. The LLM API bill had grown to $31,000 per month, yet monthly active users had only increased 40%. Something was deeply wrong with the math.
The engineering team traced the issue. Their support conversations averaged 22 turns before resolution. At turn 1, the prompt was 800 tokens. At turn 22, the prompt was 18,400 tokens - the full conversation history plus the system prompt, sent in its entirety on every message. The 22nd message cost 23x what the first message cost. And because support conversations were getting longer (more complex issues, more AI engagement), the average conversation length was creeping up month-over-month, compounding the cost growth.
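The compounding is easy to underestimate, so here is the arithmetic as a small sketch. It assumes the prompt grows linearly from 800 to 18,400 tokens across the 22 turns (the story gives only the two endpoints), and the function name is illustrative.

```python
def cumulative_input_tokens(first_prompt: int, last_prompt: int, turns: int) -> int:
    """Total input tokens billed over one conversation when the full history
    is resent on every turn.

    Assumes prompt size grows linearly from first_prompt to last_prompt,
    so the total is an arithmetic series.
    """
    if turns < 1:
        raise ValueError("turns must be >= 1")
    return turns * (first_prompt + last_prompt) // 2


total = cumulative_input_tokens(800, 18_400, 22)
flat = 800 * 22  # what the conversation would bill if every prompt stayed at 800 tokens
print(total, flat, total / flat)
```

Under that linear-growth assumption, one 22-turn conversation bills about 211,200 input tokens, twelve times the 17,600 it would cost if the prompt stayed at its turn-1 size.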
But there was a subtler problem they found when they audited response quality: the model was actually giving worse answers at turn 22 than at turn 5. The conversations were long enough to trigger the "lost in the middle" phenomenon - the model was attending well to the most recent few turns and to the system prompt, but was effectively ignoring the context from turns 5-18. All those tokens were being paid for, but the model was barely reading them.
Context management sits at the intersection of cost, latency, and quality. Unlike most engineering tradeoffs where you optimize one axis at the expense of another, bad context management makes all three axes worse simultaneously. Good context management is one of the most impactful engineering investments you can make in a production LLM application.
Why This Exists
When Transformers were introduced, the quadratic attention complexity meant context windows were small - 512 to 2,048 tokens. Applications were forced to think carefully about what information to include. As compute scaled and engineering improved (sparse attention, FlashAttention, positional encodings), context windows grew: 4K, 8K, 32K, 128K, and now 1M+ tokens in some models.
This created a dangerous illusion: "context windows are now infinite, so I can stop thinking about context management." The illusion fails for three reasons:
- Cost scales linearly with context length. At GPT-4o rates, sending 128K tokens of context on every request at high volume can reach $32,000/day from input tokens alone.
- Quality degrades in long contexts. The "lost in the middle" finding (Liu et al., 2023) showed that models perform significantly worse when the relevant information is in the middle of a long context than when it is at the beginning or end. Bigger context windows do not eliminate this problem.
- Latency scales with context length. Prefill time grows at least linearly with context length. A 128K-token context can take on the order of 10 seconds to prefill, depending on model and hardware, before a single output token is generated.
Context management is not a solved problem. It is an active engineering discipline.
The Lost-in-the-Middle Effect
Liu et al. (2023) at Stanford ran a systematic study: given a relevant document and many irrelevant ones, how does performance change as you vary the position of the relevant document?
The findings were stark:
| Position of Relevant Document | Accuracy (20 docs total) |
|---|---|
| First position (beginning) | 71.0% |
| Middle position (10 of 20) | 56.1% |
| Last position (end) | 73.2% |
That is a drop of roughly 15 percentage points (17 relative to the last position) from simply placing the relevant document in the middle of the context. How you order information in your prompt matters, not just what information you include.
Practical implications:
- Place the most important information at the beginning or end of the context
- For RAG: put the most relevant chunks last (just before the user question)
- For conversation history: the most recent turns are naturally most important (they are already last)
- For few-shot examples: the most relevant/similar examples should be closest to the query
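One way to check whether your own model and prompts show this effect is to sweep the position of a known-relevant document among distractors and grade the answer at each position. The helper below only builds the prompts. The document strings and the prompt template are illustrative assumptions, and the model call and grading are left to your evaluation harness.

```python
from typing import Iterator


def place_at_position(distractors: list[str], relevant: str, position: int) -> list[str]:
    """Return the documents with `relevant` inserted at a 0-based position."""
    if not 0 <= position <= len(distractors):
        raise ValueError("position out of range")
    return distractors[:position] + [relevant] + distractors[position:]


def build_position_sweep(
    distractors: list[str], relevant: str, question: str
) -> Iterator[tuple[int, str]]:
    """Yield (position, prompt) pairs, one per possible slot for the relevant document."""
    for pos in range(len(distractors) + 1):
        docs = place_at_position(distractors, relevant, pos)
        body = "\n\n".join(f"Document {i + 1}:\n{d}" for i, d in enumerate(docs))
        yield pos, f"{body}\n\nQuestion: {question}"


sweep = list(
    build_position_sweep(["noise A", "noise B"], "the answer is 42", "What is the answer?")
)
```

Plotting accuracy against `position` for your own documents tells you whether reordering is worth the effort for your model.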
Context Budget Allocation
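Think of the context window as a budget. Every token costs money and affects quality. Four input slots (system prompt, conversation history, RAG context, current turn) share the window with a reserved output buffer.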
A typical allocation for a 16K context window:
| Slot | Tokens | Percentage | Notes |
|---|---|---|---|
| System prompt | 800 | 5% | Instructions, persona, few-shot |
| Conversation history | 6,000 | 37.5% | Truncated with strategy |
| RAG context | 6,000 | 37.5% | Top-k retrieved chunks |
| Current turn | 1,000 | 6.25% | Current user message |
| Output buffer | 2,200 | 13.75% | Room for model response |
Key principle: Always reserve output buffer. If you fill the context window completely with input, the model has no room to generate tokens. Always reserve at least 1.5-2x your expected output length.
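A budget like the one in the table can be checked in code before any request is sent. The sketch below is one possible shape, not a library API: `ContextBudget` and its fields are hypothetical names, and the 1,400-token expected output is an assumed figure chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextBudget:
    window: int
    system_prompt: int
    history: int
    rag: int
    current_turn: int
    expected_output: int
    output_safety_factor: float = 1.5  # the low end of the 1.5-2x reserve

    @property
    def output_buffer(self) -> int:
        """Tokens reserved for the model's response."""
        return int(self.expected_output * self.output_safety_factor)

    @property
    def input_total(self) -> int:
        return self.system_prompt + self.history + self.rag + self.current_turn

    def headroom(self) -> int:
        """Tokens left after inputs and the output reserve; negative means over budget."""
        return self.window - self.input_total - self.output_buffer


budget = ContextBudget(
    window=16_000,
    system_prompt=800,
    history=6_000,
    rag=6_000,
    current_turn=1_000,
    expected_output=1_400,  # assumed typical response length
)
print(budget.output_buffer, budget.headroom())
```

A negative `headroom()` is the signal to trim history or retrieved chunks before calling the model, rather than discovering the overflow as an API error.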
Conversation History Strategies
Strategy 1: Full History (Baseline)
Send all turns every time. Simple to implement. Works fine for short conversations, fails catastrophically for long ones.
def build_messages_full_history(
    system_prompt: str,
    history: list[dict],
    current_message: str,
) -> list[dict]:
    return [
        {"role": "system", "content": system_prompt},
        *history,
        {"role": "user", "content": current_message},
    ]

# Works until the context window is exceeded, then the API call raises an error.
# No graceful degradation.
When to use: Prototypes only. Never in production for conversations that can exceed 10-15 turns.
Strategy 2: Sliding Window
Keep only the most recent N turns. Simple, predictable, and effective for most use cases.
import tiktoken


def build_messages_sliding_window(
    system_prompt: str,
    history: list[dict],
    current_message: str,
    max_history_tokens: int = 6000,
    model: str = "gpt-4o",
) -> list[dict]:
    enc = tiktoken.encoding_for_model(model)

    def count_tokens(text: str) -> int:
        return len(enc.encode(text))

    # Start from the most recent turn and work backwards
    selected_turns = []
    token_count = 0
    for turn in reversed(history):
        turn_tokens = count_tokens(turn["content"]) + 4  # ~4 tokens of per-message overhead
        if token_count + turn_tokens > max_history_tokens:
            break
        selected_turns.insert(0, turn)
        token_count += turn_tokens

    return [
        {"role": "system", "content": system_prompt},
        *selected_turns,
        {"role": "user", "content": current_message},
    ]
Tradeoff: The model loses context from earlier in the conversation. For many support/Q&A use cases, this is acceptable - users rarely need the model to remember something from 20 turns ago. For emotional support or long-running project assistance, context loss is more harmful.
Strategy 3: Rolling Summarization
Periodically compress old turns into a summary. Maintains semantic continuity while bounding token count.
from openai import AsyncOpenAI


async def build_messages_with_summary(
    system_prompt: str,
    history: list[dict],
    current_message: str,
    recent_turns_to_keep: int = 8,
    summary_model: str = "gpt-4o-mini",  # Use a cheap model for summarization
) -> list[dict]:
    # A "turn" here is a user/assistant pair, i.e. two messages
    if len(history) <= recent_turns_to_keep * 2:
        # Not long enough to need summarization
        return build_messages_sliding_window(system_prompt, history, current_message)

    # Split: old turns (to summarize) + recent turns (to keep verbatim)
    cutoff = len(history) - (recent_turns_to_keep * 2)
    old_turns = history[:cutoff]
    recent_turns = history[cutoff:]

    # Summarize old turns using a cheap model
    summary = await summarize_history(old_turns, summary_model)
    summary_message = {
        "role": "system",
        "content": (
            f"Summary of earlier conversation:\n{summary}\n\n"
            "The conversation continues below:"
        ),
    }
    return [
        {"role": "system", "content": system_prompt},
        summary_message,
        *recent_turns,
        {"role": "user", "content": current_message},
    ]


async def summarize_history(
    turns: list[dict],
    model: str = "gpt-4o-mini",
) -> str:
    client = AsyncOpenAI()
    conversation_text = "\n".join(
        f"{t['role'].upper()}: {t['content']}" for t in turns
    )
    response = await client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": (
                    "Summarize the following conversation concisely. "
                    "Preserve: key facts established, user goals, decisions made, "
                    "important context. Be brief (under 200 words)."
                ),
            },
            {"role": "user", "content": conversation_text},
        ],
        max_tokens=250,
    )
    # content can be None (e.g. on a refusal); fall back to an empty summary
    return response.choices[0].message.content or ""
Cost consideration: Summarization costs an extra LLM call on a cheap model. At scale, summarizing every 10 turns adds on the order of $100/day, which is typically worth it compared with the savings from shorter prompts.
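The function above re-summarizes every old turn on each call. A cumulative variant carries the previous summary forward and folds in only the turns that have aged out since the last call. The sketch below assumes you supply a `summarize` callable (for example, a synchronous wrapper around your summarization call). `RollingSummary` and its fields are hypothetical names, not part of any library.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RollingSummary:
    summarize: Callable[[str], str]  # hypothetical: prompt text in, summary out
    keep_recent: int = 16            # number of messages kept verbatim
    summary: str = ""
    summarized_upto: int = 0         # history[:summarized_upto] is already in the summary

    def update(self, history: list[dict]) -> tuple[str, list[dict]]:
        """Fold newly aged-out messages into the summary; return (summary, recent messages)."""
        cutoff = max(len(history) - self.keep_recent, 0)
        if cutoff > self.summarized_upto:
            new_turns = history[self.summarized_upto:cutoff]
            new_text = "\n".join(
                f"{t['role'].upper()}: {t['content']}" for t in new_turns
            )
            # Pass the previous summary in so earlier facts are carried forward
            prompt = f"Previous summary:\n{self.summary}\n\nNew turns:\n{new_text}"
            self.summary = self.summarize(prompt)
            self.summarized_upto = cutoff
        return self.summary, history[cutoff:]
```

Each message is summarized once instead of on every request, at the price of keeping `summary` and `summarized_upto` in per-session state. Repeatedly summarizing a summary can still lose detail, so it is worth testing quality over several rounds.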
Strategy 4: Selective Retention
Score each turn by information value and discard low-value turns. This is the most sophisticated strategy and the most expensive, but it can retain important early turns that a sliding window would drop.
import json

import tiktoken
from openai import AsyncOpenAI

_ENC = tiktoken.encoding_for_model("gpt-4o")


def count_tokens_in_turn(turn: dict) -> int:
    # Content tokens plus ~4 tokens of per-message overhead
    return len(_ENC.encode(turn["content"])) + 4


def count_total_tokens(turns: list[dict]) -> int:
    return sum(count_tokens_in_turn(t) for t in turns)


async def build_messages_selective_retention(
    system_prompt: str,
    history: list[dict],
    current_message: str,
    max_history_tokens: int = 6000,
    selection_model: str = "gpt-4o-mini",
) -> list[dict]:
    """Keep turns that are informationally dense, discard filler."""
    if not history or count_total_tokens(history) < max_history_tokens:
        return [
            {"role": "system", "content": system_prompt},
            *history,
            {"role": "user", "content": current_message},
        ]

    # Score each turn's information value
    scored_turns = await score_turn_importance(history, current_message, selection_model)

    # Sort by importance, keep the highest-value turns that fit the budget
    sorted_turns = sorted(scored_turns, key=lambda x: x["score"], reverse=True)
    selected = []
    token_budget = max_history_tokens
    for item in sorted_turns:
        turn_tokens = count_tokens_in_turn(item["turn"])
        if token_budget - turn_tokens >= 0:
            selected.append(item["turn"])
            token_budget -= turn_tokens

    # Restore chronological order using each turn's position in history
    position = {id(t): i for i, t in enumerate(history)}
    selected.sort(key=lambda t: position[id(t)])

    return [
        {"role": "system", "content": system_prompt},
        *selected,
        {"role": "user", "content": current_message},
    ]


async def score_turn_importance(
    history: list[dict],
    current_message: str,
    model: str,
) -> list[dict]:
    """Use an LLM to score the information density of each turn."""
    client = AsyncOpenAI()
    history_text = "\n".join(
        f"[Turn {i}] {t['role'].upper()}: {t['content'][:200]}..."
        for i, t in enumerate(history)
    )
    response = await client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": (
                    "Score each conversation turn for information value relevant "
                    "to answering the current question. "
                    # json_object mode returns an object, so wrap the list in a "turns" key
                    "Output a JSON object of the form "
                    '{"turns": [{"turn_index": 0, "score": 0.5, "reason": "..."}]} '
                    "with each score between 0.0 and 1.0."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Current question: {current_message}\n\n"
                    f"History:\n{history_text}"
                ),
            },
        ],
        response_format={"type": "json_object"},
        max_tokens=500,
    )
    scores = json.loads(response.choices[0].message.content)
    return [
        {"turn": history[s["turn_index"]], "score": s["score"]}
        for s in scores.get("turns", [])
        if 0 <= s["turn_index"] < len(history)
    ]
RAG Context Management
When using retrieval-augmented generation, the retrieved chunks consume a significant portion of the context budget. Managing this well requires three decisions.
How Many Chunks?
More chunks = more potential relevant information, but also more tokens and more noise. The optimal number depends on the type of query:
| Query Type | Typical Optimal k | Reasoning |
|---|---|---|
| Single fact lookup | 2-3 | One correct answer, noise hurts |
| Comparative question | 4-6 | Need multiple sources |
| Synthesis/overview | 6-10 | Broad coverage needed |
| Document analysis | 1-2 large chunks | Depth over breadth |
A practical heuristic: start with k=4, measure retrieval precision on your evaluation set, increase k if recall is poor, decrease if precision is poor.
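The heuristic depends on measuring precision and recall at each candidate k. A minimal sketch of that measurement for a single query, assuming you have document IDs for the retrieved list and a labeled set of relevant IDs (the IDs below are made up):

```python
def precision_recall_at_k(
    retrieved: list[str], relevant: set[str], k: int
) -> tuple[float, float]:
    """Precision and recall of the top-k retrieved IDs against a labeled relevant set."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


retrieved = ["d1", "d7", "d3", "d9", "d2", "d8"]  # ranked retriever output
relevant = {"d1", "d2", "d3"}                      # labeled ground truth
for k in (2, 4, 6):
    print(k, precision_recall_at_k(retrieved, relevant, k))
```

In this toy example precision stays at 0.5 while recall climbs from one third at k=2 to 1.0 at k=6. Averaged over an evaluation set, curves like this show where raising k stops adding relevant material and only adds noise.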
How to Order Chunks?
Given the lost-in-the-middle effect, ordering matters. Two strategies:
Strategy A: Relevance-ordered, most relevant last. The last chunk is adjacent to the user's question and receives the most attention. Place the highest-scored chunk just before the question.
Strategy B: Chronological order for temporal documents. For documents with temporal structure (meeting notes, changelogs), preserve chronological order even if it doesn't match relevance scores.
def order_chunks_for_context(
    chunks: list[dict],
    strategy: str = "most_relevant_last",
) -> list[dict]:
    """Order retrieved chunks to maximize model attention to relevant content."""
    if strategy == "most_relevant_last":
        # Sort ascending by relevance (least relevant first, most relevant last)
        return sorted(chunks, key=lambda c: c["score"])
    elif strategy == "sandwich":
        # Most relevant first, second most relevant last, the rest in the middle
        sorted_chunks = sorted(chunks, key=lambda c: c["score"], reverse=True)
        if len(sorted_chunks) <= 2:
            return sorted_chunks
        best = sorted_chunks[0]
        second_best = sorted_chunks[1]
        middle = sorted_chunks[2:]
        return [best] + middle + [second_best]
    else:
        return chunks  # original order
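The function above covers the relevance-based orderings. Strategy B (chronological order) could be sketched along these lines. The `timestamp` field is an assumption about your chunk metadata, taken here to be an ISO 8601 date string.

```python
from datetime import datetime


def order_chunks_chronologically(chunks: list[dict]) -> list[dict]:
    """Oldest first, so temporal documents read in order; ties keep retrieval order."""
    return sorted(chunks, key=lambda c: datetime.fromisoformat(c["timestamp"]))


chunks = [
    {"text": "v1.2 released", "timestamp": "2024-03-10", "score": 0.91},
    {"text": "v1.0 released", "timestamp": "2024-01-05", "score": 0.40},
    {"text": "v1.1 released", "timestamp": "2024-02-01", "score": 0.75},
]
ordered = order_chunks_chronologically(chunks)
print([c["text"] for c in ordered])
```

Note that the highest-scored chunk in this example happens to be the most recent, so it still ends up adjacent to the question. When that is not the case, chronological order trades some attention on the best chunk for a coherent timeline.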
Chunk Size vs. Number
A design choice: retrieve 5 chunks of 500 tokens each, or 10 chunks of 250 tokens each? At equal total token budget, smaller chunks give more diverse coverage but more fragmented context. Larger chunks preserve more local context but may include irrelevant material.
The optimal chunk size depends on your documents:
- Short, dense documents (FAQ, technical docs): 256-512 tokens
- Long narrative documents (reports, papers): 512-1024 tokens
- Structured data (tables, code): preserve structural units (e.g., one function per chunk)
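A minimal sketch of size-bounded chunking that respects structural units. It assumes paragraphs separated by blank lines are the unit to preserve and approximates token counts by whitespace-separated words, so swap in a real tokenizer for production use. The function name is illustrative.

```python
def chunk_paragraphs(text: str, max_tokens: int = 512) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_tokens (approximate)."""
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = len(para.split())  # crude token estimate: one token per word
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        # A paragraph longer than max_tokens is kept whole as its own chunk
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


print(chunk_paragraphs("a b c\n\nd e\n\nf g h i", max_tokens=5))
```

The same packing loop works for code if you split on function boundaries instead of blank lines.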
System Prompt Optimization
The system prompt is paid for on every single request. A 2,000-token system prompt with 100K daily requests is 6 billion input tokens per month. At a few dollars per million input tokens, that is tens of thousands of dollars per month just for the system prompt, unless you use prompt caching.
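The arithmetic is worth scripting so you can rerun it when prices or traffic change. In the sketch below, the per-million-token price is a placeholder assumption, not a quoted rate. Check your provider's current pricing before relying on the output.

```python
def monthly_prompt_cost(
    prompt_tokens: int,
    requests_per_day: int,
    usd_per_million_input: float,
    days: int = 30,
) -> float:
    """Monthly spend attributable to a fixed prompt prefix, ignoring caching."""
    tokens_per_month = prompt_tokens * requests_per_day * days
    return tokens_per_month / 1_000_000 * usd_per_million_input


# Placeholder rate of $2.50 per million input tokens (assumption)
print(monthly_prompt_cost(2_000, 100_000, usd_per_million_input=2.50))
```

With the placeholder rate this prints 15000.0. The result scales linearly with the rate you plug in, so halving the system prompt halves this line item regardless of the price.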
Audit your system prompt regularly:
- Remove instructions that are no longer relevant
- Remove repetitive instructions (if you say "be concise" 5 times, say it once)
- Remove few-shot examples that aren't pulling their weight (measure their impact before removing)
Target length:
- Minimal system prompt: 100-300 tokens
- Standard chat assistant: 300-600 tokens
- Complex multi-capability assistant: 600-1,200 tokens
- Anything over 1,200 tokens should be critically reviewed
Prompt Caching Design
Both Anthropic and OpenAI support prefix caching, which stores the computed KV states for a repeated prompt prefix. This means the model doesn't need to reprocess the beginning of the prompt on every request.
The design rule: Everything that varies goes at the end; everything stable goes at the beginning.
┌─────────────────────────────────────────────┐
│ STABLE PREFIX (cached) │
│ ───────────────────────────────────────── │
│ System prompt (instructions, persona) │
│ Few-shot examples │
│ Tool definitions │
│ Company knowledge base (if static) │
│ ← 1024+ tokens here │
├─────────────────────────────────────────────┤
│ DYNAMIC SUFFIX (not cached) │
│ ───────────────────────────────────────── │
│ Retrieved RAG context (query-specific) │
│ Conversation history (session-specific) │
│ Current user message │
└─────────────────────────────────────────────┘
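The layout above can be enforced in code by assembling the prompt from two explicit lists and fingerprinting the stable part. The sketch below is an assumption about how you might structure this, with illustrative names. Logging the fingerprint makes accidental prefix changes (a timestamp, a per-user greeting) visible as a sudden change in value.

```python
import hashlib


def build_prompt(stable_parts: list[str], dynamic_parts: list[str]) -> tuple[str, str]:
    """Return (prompt, prefix_fingerprint). Stable content always precedes dynamic content."""
    prefix = "\n\n".join(stable_parts)
    suffix = "\n\n".join(dynamic_parts)
    fingerprint = hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]
    return f"{prefix}\n\n{suffix}", fingerprint


stable = ["You are a support assistant.", "Tool definitions: ..."]
_, fp1 = build_prompt(stable, ["History: ...", "User: where is my refund?"])
_, fp2 = build_prompt(stable, ["History: ...", "User: cancel my card"])

# Putting a date in the system prompt changes the prefix on every new day
_, fp3 = build_prompt(["You are a support assistant. Today is Monday."], ["User: hi"])
print(fp1 == fp2, fp1 == fp3)
```

Requests that share a fingerprint are candidates for a cache hit. If the fingerprint changes more often than you deploy prompt changes, something dynamic has leaked into the prefix.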
Implementing Anthropic prompt caching:
import anthropic

client = anthropic.AsyncAnthropic()  # async client, so the call below can be awaited


async def chat_with_caching(
    system_prompt: str,
    static_knowledge: str,
    conversation_history: list[dict],
    user_message: str,
) -> str:
    """Use Anthropic prompt caching for stable system content."""
    response = await client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        system=[
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},  # Cache this block
            },
            {
                "type": "text",
                "text": f"Knowledge base:\n{static_knowledge}",
                "cache_control": {"type": "ephemeral"},  # Cache this block too
            },
        ],
        messages=[
            *conversation_history,  # Dynamic, not cached
            {"role": "user", "content": user_message},
        ],
    )

    # Check cache performance (treat missing counts as zero)
    usage = response.usage
    cache_read = usage.cache_read_input_tokens or 0
    cache_written = usage.cache_creation_input_tokens or 0
    cache_hit = cache_read > 0
    cache_savings = cache_read * (1 - 0.1)  # 90% discount on cached reads
    print(f"Cache hit: {cache_hit}")
    print(f"Tokens read from cache: {cache_read}")
    print(f"Tokens written to cache: {cache_written}")
    print(f"Token savings this request: {int(cache_savings)}")
    return response.content[0].text
Cache hit rate optimization: Anthropic's ephemeral cache has a 5-minute TTL, so you need another request with the same prefix within 5 minutes of the previous one to get a cache hit. For high-traffic applications this is trivially satisfied. For low-traffic applications, consider whether caching is worth designing around.
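Whether your traffic clears that bar can be estimated from request logs before you commit to a cache-oriented prompt design. The simulation below is a rough sketch under stated assumptions: a single shared prefix, and every request (hit or miss) leaving a fresh cache entry that lives for the full TTL. The timestamps are invented.

```python
def simulate_cache_hit_rate(request_times: list[float], ttl_seconds: float = 300.0) -> float:
    """Fraction of requests that arrive while the cached prefix is still live.

    Assumes one shared prefix and that each request resets the expiry clock.
    """
    hits = 0
    expires_at = None
    for t in sorted(request_times):
        if expires_at is not None and t <= expires_at:
            hits += 1
        expires_at = t + ttl_seconds  # a hit refreshes the entry, a miss rewrites it
    return hits / len(request_times) if request_times else 0.0


# Seconds since the start of the log: a burst, a long gap, then another burst
print(simulate_cache_hit_rate([0, 60, 120, 1000, 1100]))
```

Here three of the five requests land inside a live window, giving 0.6. Running this over real timestamps, grouped by prefix, gives a quick estimate of the hit rate to expect.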
Context Window for Agents
Agents accumulate context rapidly: tool call requests, tool results, reasoning steps, intermediate outputs. An agent that makes 10 tool calls, each with 500 tokens of results, has added 5,000 tokens of context before generating a single word of final response.
Pruning strategies for agent context:
class AgentContextManager:
    """Manages context for a running agent session."""

    def __init__(
        self,
        max_context_tokens: int = 24000,
        keep_n_recent_steps: int = 5,
    ):
        self.max_tokens = max_context_tokens
        self.keep_recent = keep_n_recent_steps
        self.steps: list[dict] = []

    def add_step(self, step_type: str, content: str, tokens: int):
        self.steps.append({
            "type": step_type,  # "tool_call" | "tool_result" | "reasoning"
            "content": content,
            "tokens": tokens,
        })

    def get_pruned_context(self) -> list[dict]:
        """Return steps that fit within the token budget."""
        total_tokens = sum(s["tokens"] for s in self.steps)
        if total_tokens <= self.max_tokens:
            return self.steps

        # Always keep the last N steps (split by index so N = 0 also works)
        split = max(len(self.steps) - self.keep_recent, 0)
        older_steps = self.steps[:split]
        recent_steps = self.steps[split:]

        # From older steps, only keep tool results (high information density)
        # Discard raw reasoning steps (lower value for final answer)
        filtered_older = [
            s for s in older_steps
            if s["type"] == "tool_result"
        ]

        # Keep as many older tool results as the remaining budget allows,
        # preferring the newest; anything that does not fit is dropped
        remaining_budget = self.max_tokens - sum(s["tokens"] for s in recent_steps)
        selected_older = []
        budget = remaining_budget
        for step in reversed(filtered_older):
            if budget - step["tokens"] >= 0:
                selected_older.insert(0, step)
                budget -= step["tokens"]

        return selected_older + recent_steps

    @property
    def step_count(self) -> int:
        return len(self.steps)

    @property
    def total_tokens_used(self) -> int:
        return sum(s["tokens"] for s in self.steps)
Production Engineering Notes
Monitor context utilization, not just token count. "Average 4,000 input tokens" tells you little. "Average context utilization 67% of 8K window" tells you whether you have headroom or are near limits.
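A utilization metric of this kind takes only a few lines to compute from request logs. The sketch below counts the reserved output buffer as used space, which is one reasonable convention among several. The function names and sample values are illustrative.

```python
def context_utilization(input_tokens: int, reserved_output_tokens: int, window: int) -> float:
    """Share of the context window committed to input plus the output reserve."""
    return (input_tokens + reserved_output_tokens) / window


def summarize_utilization(samples: list[float]) -> dict[str, float]:
    """Mean, approximate 95th percentile, and max of per-request utilization."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "mean": sum(ordered) / len(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }


samples = [context_utilization(n, 1_000, 8_000) for n in (2_000, 3_000, 4_000, 6_500)]
print(summarize_utilization(samples))
```

The tail matters more than the mean: a comfortable average can hide a minority of requests that are already brushing the limit.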
Track cache hit rates. If you've designed for prompt caching but your cache hit rate is below 50%, something is wrong - either your prompts are changing too frequently, or your traffic patterns don't generate cache hits.
Measure quality at different context lengths. Run your evaluation set at different history lengths (2 turns, 5 turns, 10 turns, 20 turns). If quality plateaus or degrades after 5 turns, there is no quality benefit to keeping more history - just cost.
The summarization cascade problem. If you summarize every 10 turns, and then the summary becomes part of the history that gets summarized again 10 turns later, information degrades progressively. Test multi-hop summarization quality explicitly.
Common Mistakes
:::danger Mistake: Assuming more context = better answers
It is tempting to add more context - more retrieved documents, more history, more examples. But the lost-in-the-middle effect means that context beyond a certain point starts hurting quality. Run experiments: does keeping all of turns 1-20 actually improve answer quality compared with keeping only turns 12-20? Often the answer is no. Be skeptical of context additions and measure their impact.
:::
:::danger Mistake: Not reserving an output buffer
If your context budget calculation fills the entire context window with input, the model cannot generate a response and will either truncate its response or raise an error. Always reserve at least 1.5x your expected output length as a buffer. For models with a 128K context window, this is rarely an issue; for 4K or 8K window models, it is a real constraint.
:::
:::warning Mistake: Always starting a new summary from scratch
When you implement rolling summarization, a common error is to summarize only the current batch of old turns without including the previous summary. This causes context loss: information from turns 1-10 disappears when you summarize turns 11-20. The correct approach: include the previous summary in the input to the new summarization call, producing a cumulative summary.
:::
:::warning Mistake: Placing the most important RAG chunks in the middle
Given the lost-in-the-middle effect, the middle of a long context list is the worst position for your highest-scored retrieval chunks. Use the sandwich ordering or most-relevant-last ordering to ensure critical context gets the attention it deserves.
:::
Interview Q&A
Q1: Explain the "lost in the middle" problem and how it affects system design.
Lost in the middle (Liu et al., 2023) is the empirical finding that language models pay disproportionately more attention to context at the beginning and end of a long prompt, and less attention to content in the middle. The effect is significant: in one study, placing the relevant document in the middle of a 20-document context reduced accuracy by 15 percentage points compared to placing it first or last.
The system design implication is that position is a resource. You should place critical information - the most relevant RAG chunks, key instructions, the current user question - at the end of the context. Stable but less turn-specific content (system prompt, few-shot examples) goes at the beginning. Pad the middle only with content where moderate attention loss is acceptable.
For conversation history, this is naturally handled: the most recent turns are at the end, closest to the question, and receive the most attention. But for RAG context injected before the question, explicit ordering matters.
Q2: What are the four conversation history strategies, and how do you choose between them?
The four strategies are: (1) full history, (2) sliding window, (3) rolling summarization, and (4) selective retention.
Full history works for prototypes but fails in production for long conversations due to unbounded cost and quality degradation. It is the baseline to compare against.
Sliding window is the right default for production: keep the last N turns by token budget, discard older turns. It is simple, predictable, and handles most use cases well. The quality tradeoff is minimal for most query types.
Rolling summarization is better when the application requires long-term continuity - e.g., a personal assistant that needs to remember goals set earlier in the conversation. The cost is one extra LLM call per summary (using a cheap model). The quality benefit is preserving semantic context that sliding window would discard.
Selective retention is for applications where specific past turns have permanently high relevance - e.g., a coding assistant where the user defined a requirement early in the session that every subsequent message depends on. It requires an extra LLM scoring call and is the most expensive strategy, reserved for high-stakes context management.
Q3: How does prompt caching work, and how do you design prompts to maximize cache hit rate?
Prompt caching (Anthropic and OpenAI) works by storing the KV cache computed for a repeated prefix in fast memory. When the same prefix is encountered again, the model skips recomputing it, reducing both latency and cost. Cached input tokens are billed at a discount: about 90% on Anthropic, which also charges a premium on cache writes, and about 50% on OpenAI at the time of writing.
To maximize cache hit rate: put all stable content first (system instructions, persona, tool definitions, few-shot examples, static knowledge), and all dynamic content last (retrieved context, conversation history, current user message). The cache is invalidated by any change to the prefix - even adding a single token to the system prompt resets it.
For Anthropic, the minimum cacheable prefix is 1024 tokens and the TTL is 5 minutes. For a high-traffic chat application where users send messages every few minutes, this means nearly every request after the first in a session hits the cache. For a product with infrequent usage (daily planning tool), TTL-based cache misses are more common, and the ROI of careful cache design is lower.
Q4: You have a RAG application where users ask complex questions requiring synthesis across 20 documents. How do you manage context given the 8K window?
This is a reranking + chunking problem. I would use a two-stage retrieval approach:
Stage 1: Retrieve the top 20 candidate chunks using embedding similarity search (fast, cheap). Do not send all 20 to the LLM.
Stage 2: Rerank with a cross-encoder model (e.g., Cohere Rerank, or a local cross-encoder from sentence-transformers). Cross-encoders score each query-chunk pair jointly and produce much more accurate relevance scores than bi-encoder embedding similarity. Select the top 5-6 chunks from the reranker.
Context ordering: use most-relevant-last ordering for the selected chunks. Place the highest-scored chunk immediately before the user question.
If 20 documents is genuinely the requirement (synthesis task), use an iterative map-reduce approach: first pass generates a mini-summary for each document (20 parallel cheap calls), second pass synthesizes the summaries (one call, much shorter context). This is the "map-reduce" pattern for long-context synthesis.
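The map-reduce pass can be sketched with `asyncio`. The two model calls are injected as callables so the control flow is visible without tying it to a provider. `summarize_one` and `synthesize` are hypothetical async functions you would implement around your LLM client, and the prompt wording is illustrative.

```python
import asyncio
from typing import Awaitable, Callable

LLMCall = Callable[[str], Awaitable[str]]  # prompt in, completion out


async def map_reduce_synthesis(
    documents: list[str],
    question: str,
    summarize_one: LLMCall,
    synthesize: LLMCall,
    max_concurrency: int = 5,
) -> str:
    """Map: summarize each document against the question. Reduce: synthesize the summaries."""
    semaphore = asyncio.Semaphore(max_concurrency)  # cap parallel calls to respect rate limits

    async def mapped(doc: str) -> str:
        async with semaphore:
            return await summarize_one(f"Question: {question}\n\nDocument:\n{doc}")

    # gather preserves input order, so summary i belongs to document i
    summaries = await asyncio.gather(*(mapped(d) for d in documents))
    combined = "\n\n".join(f"[Doc {i + 1}] {s}" for i, s in enumerate(summaries))
    return await synthesize(f"Question: {question}\n\nSummaries:\n{combined}")
```

Conditioning each mini-summary on the question keeps the map step from discarding the details the final answer needs. The reduce prompt then holds 20 short summaries rather than 20 documents, which fits comfortably in an 8K window.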
Q5: How does context window management differ for agents compared to chat applications?
Agents have two context management challenges that chat applications do not:
First, tool result accumulation. Each tool call adds the call request plus the result to the context. A tool result from a database query might return 2,000 tokens of data. With 10 tool calls, you've added 20,000 tokens of context before generating any output. The fix: compress or summarize tool results immediately after processing them. Store the original result in external storage (database/cache) and inject only the summarized key findings into context.
Second, reasoning chain accumulation. Models like Claude with extended thinking or chain-of-thought generate internal reasoning that also consumes tokens. For long agent sessions, the reasoning chain from early steps is rarely useful by step 15. Prune reasoning chain entries older than the last 3-5 steps, keeping only tool results and key findings.
The pruning strategy: always keep the original task specification and the last N action-result pairs. Summarize the rest into a "progress summary" that captures what has been learned and what has been tried. For cost-sensitive agents, set a hard step budget and terminate with a best-effort response when the budget is exceeded.
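The "store the original, inject only the key findings" idea from the first point can be sketched as below. Truncation to a preview stands in for real summarization, and an in-memory dict stands in for external storage. `ToolResultStore` is a hypothetical helper, not part of any agent framework.

```python
import hashlib


class ToolResultStore:
    """Keeps full tool results out of the prompt; the context gets a short stub."""

    def __init__(self, preview_chars: int = 200):
        self.preview_chars = preview_chars
        self._results: dict[str, str] = {}  # stand-in for a database or cache

    def put(self, tool_name: str, result: str) -> str:
        """Store the full result and return the compact text to place in context."""
        key = hashlib.sha256(result.encode("utf-8")).hexdigest()[:10]
        self._results[key] = result
        preview = result[: self.preview_chars]
        omitted = len(result) - len(preview)
        suffix = f" ... [{omitted} chars omitted, ref={key}]" if omitted else ""
        return f"{tool_name} result: {preview}{suffix}"

    def get(self, key: str) -> str:
        """Fetch the full result again, e.g. from a follow-up tool call."""
        return self._results[key]


store = ToolResultStore(preview_chars=5)
stub = store.put("db_query", "0123456789")
print(stub)
```

The `ref=` key lets the agent request the full payload later if it turns out to matter, so pruning the context does not lose the data.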
