Lost in the Middle - How LLMs Use Long Contexts
The Bug That Wasn't a Bug
A team at a legal tech company deployed a 128K context model to process case law. The system would receive 50 relevant case documents, retrieved and concatenated, then answer a legal question. They ran hundreds of test cases and found something strange: accuracy was high when the case containing the answer sat near the start of the context, dropped substantially when it sat in the middle, and rose again when it sat near the end. The pattern was consistent. They checked the retrieval pipeline, the chunking strategy, the prompt format. Everything looked correct.
The issue wasn't in their code. It was in the model itself.
Nelson Liu and colleagues at Stanford published "Lost in the Middle: How Language Models Use Long Contexts" in 2023, and it described exactly this pattern. Give an LLM a question and a set of retrieved documents, only one of which contains the answer, place that document at different positions, and measure accuracy. The result is a U-shaped curve: accuracy is highest when the answer-containing document is at the very beginning or very end of the context, and drops significantly when it's in the middle.
The model wasn't broken. It was performing as designed - and as designed turned out to be insufficient for the task.
The Experimental Setup
Liu et al.'s experiment was carefully controlled:
- Task: Multi-document question answering. Given a question and k documents (exactly one contains the answer), find the answer.
- Models tested: GPT-3.5-Turbo, GPT-4, Claude-1.3, LongChat-13B, MPT-30B-Instruct (various context lengths)
- Variable: Position of the answer-containing document (first, second, ..., last)
- Control: Total context length held constant; only the position of the relevant document changes
The finding: across all models tested and all context lengths, the U-shape pattern appeared. Accuracy was typically 10-20 percentage points higher when the relevant document was at position 0 (first) or position k-1 (last) compared to the middle positions.
def u_shape_performance(
    n_documents: int,
    peak_accuracy: float = 0.85,
    trough_accuracy: float = 0.65,
    curve_sharpness: float = 2.0,
) -> list[tuple[int, float]]:
    """
    Model the U-shape performance curve from Liu et al. 2023.
    This is a toy symmetric curve, not a fit to the paper's data.
    Returns (document_position, accuracy) pairs.
    """
    if n_documents < 2:
        raise ValueError("n_documents must be at least 2")
    results = []
    for pos in range(n_documents):
        # Distance from the middle: 1 at both edges, 0 at the middle
        normalized = abs(2 * pos / (n_documents - 1) - 1)
        # U-shape: high at edges, low at middle
        accuracy = trough_accuracy + (peak_accuracy - trough_accuracy) * (normalized ** curve_sharpness)
        results.append((pos, accuracy))
    return results


# Simulate a U-shaped curve for a 20-document context
results = u_shape_performance(n_documents=20, peak_accuracy=0.82, trough_accuracy=0.63)
print("Position | Accuracy")
print("-" * 25)
for pos, acc in results:
    bar = "█" * int(acc * 30)
    print(f" {pos:2d} | {acc:.2f} {bar}")

# Illustrative values for the kind of curve Liu et al. report for
# GPT-3.5-Turbo with 20 documents (approximate, not exact; see the paper's
# tables for the measured figures). Note that the real curve is asymmetric:
# the first position does better than the last.
# Position 0 (first): ~83% accuracy
# Position 3: ~70%
# Position 9 (middle): ~64%
# Position 16: ~68%
# Position 19 (last): ~78%
Why This Happens - The Attention Mechanism
Recency and Primacy Effects
The U-shape pattern resembles the serial position effect from human memory research - primacy (remembering things at the beginning) and recency (remembering things at the end) - showing up in an attention mechanism rather than in human memory. The explanations below are plausible accounts, not settled mechanisms.
Recency effect: In autoregressive language models, each token attends to all previous tokens. When generating the answer, the model's final token representations tend to be most strongly influenced by the tokens that appear just before them - the later documents. This bias is commonly attributed to training data, where the text most useful for predicting the next token is usually nearby.
Primacy effect: During instruction fine-tuning, the model sees instructions, definitions, and key constraints at the start of prompts and learns to attend carefully to that region. This attention pattern appears to transfer to long-context inference, so the system prompt and the first few documents get more weight than their content alone would justify.
Middle disadvantage: Documents in the middle of the context are far from the generation point (recency effect is weak) and don't benefit from the "pay attention to this" signal that the beginning receives. They're in the statistical dead zone.
Attention Score Patterns
You can visualize this directly by examining attention weights:
import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def analyze_attention_patterns(
    model_name: str = "meta-llama/Meta-Llama-3-8B",
    prompt: str = "Answer the question based on the context: ...",
) -> np.ndarray:
    """
    Analyze where attention concentrates in a long-context prompt.
    Returns a 1-D array: attention from the last input token to every
    input position, averaged over layers and heads.
    """
    # Note: This requires output_attentions=True which is slow
    # and memory-intensive for long contexts. Use with short examples.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float32,
        device_map="cpu",
        # Recent transformers versions only return attention weights
        # with the eager attention implementation
        attn_implementation="eager",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=4096)
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)
    # outputs.attentions: tuple of (batch, n_heads, seq_len, seq_len) per layer
    # Average over layers, batch, and heads to get overall attention distribution
    all_attn = torch.stack(outputs.attentions)  # (n_layers, batch, n_heads, seq_len, seq_len)
    avg_attn = all_attn.mean(dim=[0, 1, 2])  # (seq_len, seq_len)
    # For the last input token, show attention over input positions
    last_token_attn = avg_attn[-1]  # attention from the last position
    return last_token_attn.numpy()
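The function above returns one weight per token, which is hard to read against document boundaries. A minimal sketch of the aggregation step follows, assuming you have already worked out which token span each document occupies in the prompt. The function name and the `doc_spans` argument are illustrative, not part of any library.

```python
import numpy as np


def attention_mass_per_document(
    token_attention: np.ndarray,
    doc_spans: list[tuple[int, int]],
) -> list[float]:
    """
    Sum a 1-D attention vector over each document's token span.

    token_attention: attention from the final position to every input token
    doc_spans: hypothetical (start, end) token indices per document, end exclusive
    Returns the share of total attention that falls inside each document.
    """
    total = float(token_attention.sum())
    if total == 0.0:
        raise ValueError("attention vector sums to zero")
    return [float(token_attention[start:end].sum()) / total for start, end in doc_spans]


# Toy example: 9 tokens split into 3 documents of 3 tokens each
attn = np.array([0.20, 0.15, 0.10, 0.03, 0.02, 0.05, 0.10, 0.15, 0.20])
spans = [(0, 3), (3, 6), (6, 9)]
shares = attention_mass_per_document(attn, spans)
print([round(s, 2) for s in shares])
```

Treat the result as a coarse diagnostic: averaging over layers and heads can hide individual heads that do attend to the middle of the context.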
The Serial Position Effect in Transformer Attention
Shi et al. 2023 ("Large Language Models Can Be Easily Distracted by Irrelevant Context") showed that irrelevant context alone can degrade accuracy. Separately, analyses of transformer attention weights commonly report that attention concentrates at:
- Delimiter tokens (special tokens marking document boundaries)
- Early context positions (system prompt, first few documents)
- Recent positions (last few documents before generation)
The middle positions receive substantially lower average attention weight, not because the model "can't reach" them, but because learned attention patterns from training data (where important context is usually near the query) create this distribution.
Quantifying the Effect
The performance gap between best-position and middle-position varies by:
Number of documents: More documents = larger gap. With around 5 documents, the gap tends to be modest (roughly 5-8 percentage points). At 20 documents, it can widen to roughly 15-20 points. Treat these as rough figures that vary by model.
Context length: Longer total context = larger gap. The middle becomes proportionally more "lost" as context grows.
Task type: Extraction tasks (find the specific passage) show the largest effect. Synthesis tasks (aggregate information across documents) show smaller but still present effects, because the model must attend to multiple positions.
Model quality: Larger, more capable models show smaller gaps. GPT-4 has a smaller U-shape than GPT-3.5-Turbo, but the gap doesn't disappear.
import torch


def measure_lost_in_middle(
    model,
    tokenizer,
    question: str,
    documents: list[str],
    gold_answer: str,
    gold_index: int = 0,
) -> dict[int, bool]:
    """
    Test whether the model can retrieve the answer when the gold document
    is placed at each position in a list of documents.
    documents[gold_index] is the gold document; the rest are distractors.
    Returns: {position: correct_answer (bool)}
    """
    results = {}
    gold_doc = documents[gold_index]
    distractors = documents[:gold_index] + documents[gold_index + 1:]
    n_docs = len(documents)
    for gold_pos in range(n_docs):
        # Move the gold document to position gold_pos;
        # distractors keep their relative order around it
        doc_order = distractors[:gold_pos] + [gold_doc] + distractors[gold_pos:]
        # Format as multi-document prompt
        context = "\n\n".join(
            f"Document {i+1}:\n{doc}"
            for i, doc in enumerate(doc_order)
        )
        prompt = (
            f"Based only on the documents provided, answer this question:\n"
            f"Question: {question}\n\n"
            f"Documents:\n{context}\n\n"
            f"Answer:"
        )
        # Generate answer with greedy decoding (no temperature needed)
        inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(model.device)
        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=50,
                do_sample=False,
            )
        # Decode only the newly generated tokens
        answer = tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True,
        ).strip()
        # Check if answer is correct (simple substring match)
        correct = gold_answer.lower() in answer.lower()
        results[gold_pos] = correct
    return results
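A single question gives one true/false value per position, which is too noisy to show a curve. A minimal sketch of the aggregation step follows, assuming you have collected one `{position: bool}` dictionary per question. The helper names are illustrative and the sample data is made up.

```python
from collections import defaultdict


def position_accuracy(runs: list[dict[int, bool]]) -> dict[int, float]:
    """Average per-position correctness over many questions."""
    hits: dict[int, int] = defaultdict(int)
    counts: dict[int, int] = defaultdict(int)
    for run in runs:
        for pos, correct in run.items():
            hits[pos] += int(correct)
            counts[pos] += 1
    return {pos: hits[pos] / counts[pos] for pos in sorted(counts)}


def boundary_middle_gap(accuracy: dict[int, float]) -> float:
    """Best boundary accuracy minus the worst interior accuracy."""
    positions = sorted(accuracy)
    if len(positions) < 3:
        raise ValueError("need at least 3 positions to have a middle")
    boundary = max(accuracy[positions[0]], accuracy[positions[-1]])
    middle = min(accuracy[p] for p in positions[1:-1])
    return boundary - middle


# Made-up results for 4 questions and 5 positions, purely for illustration
runs = [
    {0: True, 1: True, 2: False, 3: False, 4: True},
    {0: True, 1: False, 2: False, 3: True, 4: True},
    {0: True, 1: True, 2: True, 3: False, 4: False},
    {0: False, 1: False, 2: False, 3: True, 4: True},
]
acc = position_accuracy(runs)
gap = boundary_middle_gap(acc)
print(acc)
print(f"boundary-vs-middle gap: {gap:.2f}")
```

With this toy data the accuracy is 0.75 at both ends and 0.25 at position 2, so the gap is 0.50. On real data you would want many more questions per position before reading anything into the shape.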
Mitigation Strategies
1. Strategic Placement of Critical Information
The most direct fix: put the most important information first or last.
def reorder_documents_for_retrieval(
    documents: list[dict],  # [{"text": ..., "score": float}]
    strategy: str = "boundary",  # "boundary", "front", "back"
) -> list[dict]:
    """
    Reorder retrieved documents to maximize recall of important information.
    Strategies:
    - "boundary": highest-scoring at beginning and end (U-shape exploit)
    - "front": highest-scoring at the beginning
    - "back": highest-scoring at the end
    """
    sorted_docs = sorted(documents, key=lambda x: x["score"], reverse=True)
    if strategy == "front":
        # Best documents first
        return sorted_docs
    elif strategy == "back":
        # Best documents last
        return list(reversed(sorted_docs))
    elif strategy == "boundary":
        # Place top-ranked documents at boundaries (first and last)
        # This exploits the U-shape: high relevance at positions with highest recall
        n = len(sorted_docs)
        result = [None] * n
        # Alternate placing at front and back
        front_idx = 0
        back_idx = n - 1
        for i, doc in enumerate(sorted_docs):
            if i % 2 == 0:
                result[front_idx] = doc
                front_idx += 1
            else:
                result[back_idx] = doc
                back_idx -= 1
        return result
    else:
        raise ValueError(f"Unknown strategy: {strategy}")


# Example: RAG pipeline with lost-in-middle mitigation
def rag_with_position_optimization(
    query: str,
    retrieved_docs: list[dict],  # [{"text": ..., "score": float}]
    top_k: int = 10,
) -> str:
    """
    RAG pipeline that places most relevant documents at boundary positions.
    """
    # Take top-k retrieved documents
    top_docs = sorted(retrieved_docs, key=lambda x: -x["score"])[:top_k]
    # Reorder: best documents at boundaries
    ordered = reorder_documents_for_retrieval(top_docs, strategy="boundary")
    # Build prompt with strategic ordering
    context_parts = [
        f"[Document {i+1} - Relevance: {doc['score']:.2f}]\n{doc['text']}"
        for i, doc in enumerate(ordered)
    ]
    context = "\n\n".join(context_parts)
    return f"Based on the following documents:\n\n{context}\n\nQuestion: {query}\n\nAnswer:"
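To see what the "boundary" strategy actually produces, it helps to trace it on a tiny ranked list. The sketch below is a compact equivalent of the alternating loop, written with slices. The `boundary_order` name is illustrative, and the input is assumed to be sorted best-first already.

```python
def boundary_order(ranked: list[str]) -> list[str]:
    """
    ranked: items sorted best-first.
    Even ranks (0, 2, 4, ...) fill the front in order; odd ranks
    (1, 3, 5, ...) fill the back, so the weakest items meet in the middle.
    """
    front = ranked[0::2]
    back = ranked[1::2][::-1]
    return front + back


ranked = ["rank1", "rank2", "rank3", "rank4", "rank5"]
ordered = boundary_order(ranked)
print(ordered)
```

For five documents this yields `rank1, rank3, rank5, rank4, rank2`: the two strongest documents sit at the first and last positions, and the weakest one lands in the middle.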
2. Explicit Instruction to the Model
A simple but effective mitigation: explicitly tell the model where to look.
def build_long_context_prompt(
    question: str,
    documents: list[str],
    hint_position: int | None = None,
) -> str:
    """
    Build a long-context prompt with explicit guidance about document relevance.
    hint_position: if known, tell the model which document is most relevant
    """
    if hint_position is not None:
        # Explicit position hint
        instruction = (
            f"The answer to the question is contained in Document {hint_position + 1}. "
            f"All other documents are distractors.\n\n"
            f"Question: {question}\n\n"
        )
    else:
        # General instruction to attend to all positions
        instruction = (
            f"Carefully read ALL documents below, including those in the middle of the list. "
            f"The relevant information may appear anywhere.\n\n"
            f"Question: {question}\n\n"
        )
    context = "\n\n".join(
        f"Document {i+1}:\n{doc}"
        for i, doc in enumerate(documents)
    )
    return instruction + context + "\n\nAnswer:"
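An instruction such as "carefully read all documents including those in the middle" can raise middle-position accuracy by roughly 5-10 percentage points in some setups. It is not a complete fix, and the size of the gain depends on the model, so measure it on your own workload.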
3. Chunked Attention / Hierarchical Processing
For very long contexts, process in chunks and hierarchically:
import torch


def hierarchical_qa(
    question: str,
    long_document: str,
    chunk_size: int = 4096,
    model=None,
    tokenizer=None,
) -> str:
    """
    Process long documents hierarchically to avoid lost-in-middle effects.
    Step 1: Split document into chunks
    Step 2: Extract relevant passages from each chunk
    Step 3: Combine relevant passages and answer the question
    This converts a lost-in-middle problem into multiple shorter-context problems.
    """
    # Step 1: Tokenize and chunk the document
    tokens = tokenizer(long_document, return_tensors="pt")["input_ids"][0]
    chunks = []
    for start in range(0, len(tokens), chunk_size):
        chunk_tokens = tokens[start:start + chunk_size]
        chunk_text = tokenizer.decode(chunk_tokens, skip_special_tokens=True)
        chunks.append(chunk_text)
    # Step 2: Extract relevant passages from each chunk
    relevant_passages = []
    for i, chunk in enumerate(chunks):
        extraction_prompt = (
            f"From the following text, extract any passages that might help answer: "
            f"'{question}'\n\n"
            f"Text:\n{chunk}\n\n"
            f"Relevant passages (or 'None' if not relevant):"
        )
        response = generate_with_model(model, tokenizer, extraction_prompt, max_tokens=300)
        # Treat a response that opens with "none" as "nothing relevant here"
        if "none" not in response.lower()[:20]:
            relevant_passages.append(f"[From section {i+1}]\n{response.strip()}")
    if not relevant_passages:
        return "No relevant information found in the document."
    # Step 3: Synthesize answer from extracted passages
    combined_passages = "\n\n".join(relevant_passages)
    synthesis_prompt = (
        f"Based on the following extracted passages, answer: '{question}'\n\n"
        f"{combined_passages}\n\n"
        f"Answer:"
    )
    return generate_with_model(model, tokenizer, synthesis_prompt, max_tokens=500)


def generate_with_model(model, tokenizer, prompt, max_tokens=200):
    """Helper for greedy generation; returns only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_tokens, do_sample=False)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
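One caveat with fixed-size chunking: a passage that straddles a chunk boundary is split in two, and neither chunk may contain enough of it to be extracted. A common workaround is to let consecutive chunks overlap. A minimal sketch over a plain list of token ids follows. The function name and the `overlap` parameter are illustrative additions, not part of the code above.

```python
def chunk_with_overlap(
    tokens: list[int],
    chunk_size: int,
    overlap: int,
) -> list[list[int]]:
    """
    Split tokens into windows of chunk_size that share `overlap` tokens
    with the previous window.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end
    return chunks


tokens = list(range(10))
chunks = chunk_with_overlap(tokens, chunk_size=4, overlap=1)
print(chunks)
```

The overlap trades extra model calls for fewer split passages, so it is worth keeping small relative to `chunk_size`.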
4. Retrieval-Augmented Generation
The most robust fix: don't put all documents in the context. Use RAG to retrieve the most relevant chunks and place them strategically (Lesson 05 covers this in detail).
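The selection step can be sketched independently of any particular retriever. The example below assumes each retrieved item carries a `score` where higher means more relevant. The function name, the `min_score` cutoff, and its default value are illustrative assumptions that you would tune per retriever.

```python
def select_context_docs(
    retrieved: list[dict],
    top_k: int = 5,
    min_score: float = 0.5,
) -> list[dict]:
    """Keep at most top_k documents whose score clears min_score, best first."""
    relevant = [d for d in retrieved if d["score"] >= min_score]
    return sorted(relevant, key=lambda d: d["score"], reverse=True)[:top_k]


# Made-up retrieval results for illustration
retrieved = [
    {"text": "doc A", "score": 0.91},
    {"text": "doc B", "score": 0.42},
    {"text": "doc C", "score": 0.77},
    {"text": "doc D", "score": 0.18},
]
selected = select_context_docs(retrieved, top_k=3, min_score=0.5)
print([d["text"] for d in selected])
```

Only two documents pass the cutoff here even though `top_k` is 3. Sending fewer documents is usually preferable to padding the context with weak matches.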
Benchmarking Long-Context Models Properly
Needle in a Haystack
The Needle in a Haystack (NIAH) test is the standard sanity check for long-context models:
def create_needle_haystack_test(
    needle: str = "The secret passcode is XLMQR-7734.",
    haystack_text: str | None = None,  # Long filler text
    haystack_size_tokens: int = 100_000,
    needle_position: float = 0.5,  # 0.0 = beginning, 1.0 = end
    question: str = "What is the secret passcode?",
    chars_per_token: int = 4,  # rough approximation for English text
) -> str:
    """
    Create a Needle in a Haystack test prompt.
    needle_position: where to insert the needle (0.0 = very beginning, 1.0 = very end)
    """
    # Use default haystack if not provided
    if haystack_text is None:
        # In practice, use a long book, article, or filler text
        haystack_text = "This is placeholder text. "
    # Simplified sizing: repeat or trim the filler to roughly the requested
    # token count, using a character-level approximation
    target_chars = haystack_size_tokens * chars_per_token
    repeats = target_chars // len(haystack_text) + 1
    haystack_text = (haystack_text * repeats)[:target_chars]
    insert_at = int(len(haystack_text) * needle_position)
    # Insert needle at position
    haystack_with_needle = (
        haystack_text[:insert_at] +
        f"\n\n{needle}\n\n" +
        haystack_text[insert_at:]
    )
    prompt = (
        f"{haystack_with_needle}\n\n"
        f"Question: {question}\n\n"
        f"Answer:"
    )
    return prompt


def run_niah_sweep(
    model,
    tokenizer,
    needle: str,
    positions: tuple[float, ...] = (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9),
    context_lengths: tuple[int, ...] = (8192, 32768, 65536, 131072),
    gold_answer: str = "XLMQR-7734",
) -> dict:
    """
    Full NIAH sweep: test all position × context_length combinations.
    Returns accuracy heatmap data as {context_length: {position: bool}}.
    """
    results = {}
    for ctx_len in context_lengths:
        results[ctx_len] = {}
        for pos in positions:
            prompt = create_needle_haystack_test(
                needle=needle,
                haystack_size_tokens=ctx_len,
                needle_position=pos,
            )
            answer = generate_with_model(model, tokenizer, prompt, max_tokens=20)
            correct = gold_answer.lower() in answer.lower()
            results[ctx_len][pos] = correct
            status = "✓" if correct else "✗"
            print(f" ctx={ctx_len:>8,}, pos={pos:.1f}: {status} | Answer: {answer[:30]}")
    return results
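The sweep returns a nested dictionary of booleans, which is easier to act on once it is reduced to one row per context length. A minimal sketch follows, assuming the `{context_length: {position: bool}}` shape returned by the sweep. The `summarize_niah` name and the sample data are illustrative.

```python
def summarize_niah(results: dict[int, dict[float, bool]]) -> dict[int, dict]:
    """Per context length: overall accuracy and the positions that failed."""
    summary = {}
    for ctx_len, by_pos in results.items():
        failures = sorted(pos for pos, ok in by_pos.items() if not ok)
        summary[ctx_len] = {
            "accuracy": sum(by_pos.values()) / len(by_pos),
            "failed_positions": failures,
        }
    return summary


# Made-up sweep output for illustration
results = {
    8192: {0.1: True, 0.5: True, 0.9: True},
    131072: {0.1: True, 0.5: False, 0.9: True},
}
summary = summarize_niah(results)
for ctx_len, row in summary.items():
    print(f"{ctx_len:>8,}: acc={row['accuracy']:.2f} failed at {row['failed_positions']}")
```

Each cell in a sweep like this is a single trial, so repeat it with several different needles before concluding that a position is reliably good or bad.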
RULER - A More Comprehensive Benchmark
Needle in a Haystack tests only simple retrieval. RULER (Hsieh et al. 2024) includes harder tasks:
- Multi-needle retrieval: 2-3 needles scattered across the context
- Multi-hop reasoning: Answer requires combining information from multiple locations
- Aggregation: Count occurrences spread across the context
- Distracting content: Needles that look relevant but aren't
RULER shows that models claiming 128K context often have effective performance only up to much shorter lengths on harder tasks.
RULER results (scores in %, approximate, models as of late 2024; check the published RULER results for exact figures):

| Model | 4K | 8K | 16K | 32K | 64K | 128K |
|---|---|---|---|---|---|---|
| GPT-4-1106 | 96 | 96 | 95 | 93 | 87 | 81 |
| Claude-3-Sonnet | 97 | 97 | 96 | 94 | 91 | 87 |
| Llama-3.1-70B | 95 | 95 | 94 | 91 | 88 | 79 |
| Llama-3.1-8B | 91 | 90 | 88 | 84 | 76 | 68 |
| Mistral-7B-v0.2 | 89 | 87 | 80 | 68 | 43 | 17 |

Key takeaway: all models degrade at long contexts; smaller models degrade faster; "supports 128K" ≠ "reliable at 128K".
Production Implications
The lost-in-middle effect has direct implications for system design:
def optimize_context_for_reliability(
    docs_with_scores: list[dict],  # {"text": ..., "score": float, "is_critical": bool}
    max_context_tokens: int = 100_000,
    tokenizer=None,
) -> str:
    """
    Build context optimized for LLM recall reliability.
    Rules:
    1. Critical documents go to beginning and end
    2. Fill middle with supporting documents, best first
    3. Keep total within max_context_tokens, dropping supporting
       documents before critical ones
    """
    def count_tokens(text: str) -> int:
        # Fall back to a rough 4-characters-per-token estimate
        return len(tokenizer.encode(text)) if tokenizer else len(text) // 4

    # Sort by importance
    critical = [d for d in docs_with_scores if d.get("is_critical", False)]
    supporting = sorted(
        [d for d in docs_with_scores if not d.get("is_critical", False)],
        key=lambda x: -x["score"],
    )
    # Apply the token budget in priority order (critical first), so that
    # running out of budget drops low-priority documents rather than
    # whichever documents sit at the end of the final ordering
    kept_critical, kept_supporting = [], []
    total_tokens = 0
    for doc in critical + supporting:
        # Count the text together with a placeholder source label
        tokens = count_tokens(f"[Source 0]\n{doc['text']}")
        if total_tokens + tokens > max_context_tokens:
            continue
        total_tokens += tokens
        if doc.get("is_critical", False):
            kept_critical.append(doc)
        else:
            kept_supporting.append(doc)
    # Place critical at boundaries: first half at the front, rest at the back
    split = (len(kept_critical) + 1) // 2
    ordered_docs = kept_critical[:split] + kept_supporting + kept_critical[split:]
    return "\n\n".join(
        f"[Source {i+1}]\n{doc['text']}" for i, doc in enumerate(ordered_docs)
    )
Common Mistakes
:::danger Don't assume uniform context retrieval - test it
Never deploy a long-context application without testing retrieval accuracy at different positions within the context. Run a position sweep (similar to the NIAH test) on a sample of your production queries. The performance difference between position 0 and position 50 (in a 100-document context) can exceed 20 percentage points.
:::

:::warning Don't fill the context window indiscriminately
More context is not always better. Adding irrelevant documents to pad out the context actually hurts performance on the relevant documents - more noise, more positions to ignore, more chances for the model to attend to the wrong thing. Retrieve and use only documents that are genuinely relevant to the query.
:::

:::tip For multi-hop reasoning, consider breaking into steps
Tasks that require chaining information from multiple far-apart positions are especially susceptible to lost-in-middle. If your task requires "find fact A at position 10K, find fact B at position 80K, combine them," consider breaking this into two separate queries or using a hierarchical approach where each step produces a shorter context for the next.
:::
Interview Q&A
Q: Describe the lost-in-middle finding from Liu et al. 2023.
A: Liu et al. found that when models are given a question with multiple documents, the accuracy of retrieving the relevant answer follows a U-shaped curve as a function of the relevant document's position in the context. Models perform best when the relevant document is first or last, and worst when it's in the middle. This was consistent across all tested models (GPT-3.5-Turbo, GPT-4, Claude) and all context lengths. The performance gap between best position (boundary) and worst position (middle) was typically 10-20 percentage points, and it grew with more documents and longer total contexts.
Q: Why do LLMs perform better on information at the beginning and end of long contexts?
A: Two mechanisms. Primacy effect: during instruction fine-tuning, models learn to pay close attention to early context because instructions and system prompts always appear at the beginning. This attention pattern persists - early tokens receive disproportionate attention weight. Recency effect: in autoregressive generation, each position depends on all previous positions, but tokens that appear closer to the generation point have stronger influence on the final hidden states. The last few documents before the generation are most "fresh" in the model's representations. Middle positions suffer from both effects: they're not recent (recency effect is weak) and they don't carry the "this is where the important instructions are" signal.
Q: What are three practical strategies to mitigate lost-in-middle effects?
A: First, strategic placement: put the most relevant retrieved documents at the beginning and end of the context rather than in the middle. Use a "boundary" ordering that places the highest-ranked documents alternately at the front and back. Second, explicit instruction: tell the model to "carefully read all documents, including those in the middle" and that relevant information may appear anywhere. In some setups this recovers roughly 5-10 percentage points of middle-position accuracy. Third, hierarchical processing: rather than giving the model all documents at once, process in chunks, extract relevant passages from each chunk, then do a final synthesis pass over just the extracted passages. This converts the long-context problem into multiple short-context problems where the U-shape effect is much smaller.
Q: How does the RULER benchmark differ from Needle in a Haystack, and why is the difference important?
A: Needle in a Haystack tests simple single-piece retrieval: find one specific piece of information at a given position. It's a necessary sanity check but insufficient as a capability benchmark. RULER includes harder variants: multi-needle retrieval (find 2-3 items scattered across the context), multi-hop reasoning (combine information from multiple positions), aggregation (count or collect entities spread throughout), and tasks with distractor needles. Models that score near-perfect on NIAH often show significant degradation on RULER's harder tasks, especially at long contexts. This is important because real applications (legal analysis, code review, research synthesis) almost always require multi-position information gathering, not single-needle retrieval.
Q: How does context length affect the severity of lost-in-middle effects?
A: Lost-in-middle effects become more severe as total context length increases, for two reasons. First, the "middle" becomes a larger fraction of the context - at 10 documents, the middle is 8 positions (80%); at 50 documents, it's 48 positions (96%). The model must span a wider attention gap to reach middle information. Second, longer contexts dilute attention - with more tokens to attend to, the average attention weight per token decreases, making it harder for any single middle token to receive sufficient attention. In practice, the gap between boundary and middle positions tends to widen as the number of documents grows.
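Both reasons in the answer above reduce to simple arithmetic, sketched below. The uniform-attention figure is an idealization used only to show the scaling; real softmax attention is far from uniform. The function names are illustrative.

```python
def middle_fraction(n_documents: int) -> float:
    """Share of document positions that are neither first nor last."""
    if n_documents < 2:
        raise ValueError("need at least 2 documents")
    return (n_documents - 2) / n_documents


def uniform_attention_per_token(n_tokens: int) -> float:
    """Average attention weight per token if attention were spread evenly."""
    return 1.0 / n_tokens


for n in (10, 50):
    print(f"{n} documents: {middle_fraction(n):.0%} of positions are in the middle")

# Going from an 8K to a 128K context divides the average weight by 16
dilution = uniform_attention_per_token(8192) / uniform_attention_per_token(131072)
print(f"dilution factor: {dilution:.0f}x")
```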
