Context Compression Techniques
The Middle Path
You've worked through the tradeoffs: RAG is cheap but may miss relevant context; long-context inference is powerful but expensive. Is there a middle path?
Context compression says yes. The core insight: not all tokens in a long context are equally useful. A 50-page research paper doesn't need 25,000 tokens to convey its key claims to an LLM. Much of it is hedging, transitions, redundant examples, repetition. If you can identify and remove the low-value tokens before passing the text to your expensive LLM, you get most of the information at a fraction of the cost.
This isn't summarization. Summarization rewrites the content in fewer words - useful but lossy, and bounded by the generation speed of the LLM doing the rewriting. Context compression is more surgical: a smaller, cheaper model identifies tokens that are redundant or low-information and removes them, and the compressed sequence is then passed to the larger model. The goal is a compression ratio of 2-6× with an accuracy loss of no more than about 3-5%.
This lesson covers the main approaches: selective retention (Selective Context), token-level compression (LLMLingua), soft prompt compression (AutoCompressors, GIST tokens), and retrieval-specific compression (RECOMP).
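To make the cost arithmetic concrete, here is a minimal sketch. The function name `compression_savings` and the price of $0.01 per 1K input tokens are assumptions chosen for illustration, not figures from any provider, and the cost of running the compressor itself is ignored.

```python
def compression_savings(original_tokens: int, ratio: float, usd_per_1k_tokens: float) -> dict:
    """Estimate prompt size and input cost before and after compression."""
    if ratio < 1.0:
        raise ValueError("ratio must be >= 1.0 (1.0 means no compression)")
    compressed_tokens = round(original_tokens / ratio)
    cost_before = original_tokens / 1000 * usd_per_1k_tokens
    cost_after = compressed_tokens / 1000 * usd_per_1k_tokens
    return {
        "compressed_tokens": compressed_tokens,
        "cost_before_usd": cost_before,
        "cost_after_usd": cost_after,
        "saved_usd": cost_before - cost_after,
    }


# A 25,000-token paper compressed 4x, at a hypothetical $0.01 per 1K input tokens
report = compression_savings(25_000, ratio=4.0, usd_per_1k_tokens=0.01)
print(report)
```

The saving only matters if the accuracy loss stays within budget, which is why the rest of this lesson focuses on how tokens are chosen for removal.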
Why Context Compression Is Hard
The Token Value Distribution Problem
Compressing a prompt requires deciding which tokens are "necessary" for the downstream LLM to produce a correct answer. This is inherently query-dependent: the token "January" in a document might be critical if the question is about timing, and irrelevant if the question is about methodology.
A good compressor must estimate, for each token, how much removing it would affect the final answer - without running the final LLM (which defeats the purpose). This is the core difficulty.
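The ideal importance measure can be sketched as a leave-one-out ablation: remove each token in turn and check whether the answer changes. The two `answer_*` functions below are hypothetical toy stand-ins for the expensive LLM, one per question type. They show both why importance is query-dependent and why the exact measure is impractical: it needs one full LLM call per token.

```python
from typing import Callable, List, Optional, Sequence

MONTHS = {"January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"}


def answer_when(tokens: Sequence[str]) -> Optional[str]:
    """Toy stand-in for an LLM answering 'When did it start?'."""
    return next((t for t in tokens if t in MONTHS), None)


def answer_what(tokens: Sequence[str]) -> Optional[str]:
    """Toy stand-in for an LLM answering 'What began?'."""
    return "trial" if "trial" in tokens else None


def leave_one_out_importance(
    tokens: Sequence[str],
    answer_fn: Callable[[Sequence[str]], Optional[str]],
) -> List[float]:
    """Score 1.0 for each token whose removal changes the answer, else 0.0."""
    baseline = answer_fn(tokens)
    scores = []
    for i in range(len(tokens)):
        ablated = list(tokens[:i]) + list(tokens[i + 1:])
        scores.append(1.0 if answer_fn(ablated) != baseline else 0.0)
    return scores


tokens = ["The", "trial", "began", "in", "January", "2021"]
when_scores = leave_one_out_importance(tokens, answer_when)
what_scores = leave_one_out_importance(tokens, answer_what)
print(list(zip(tokens, when_scores)))
print(list(zip(tokens, what_scores)))
```

The same document yields different important tokens for the two questions. The methods in this lesson all approximate this signal with something much cheaper than N + 1 calls to the large model.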
Approaches at a Glance
Selective Context - The Baseline Approach
Before specialized compression models, the simplest approach is selective context: score sentences or passages by information content, then keep the highest-scoring ones.
Self-Information Scoring
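Each token's "information content" is measured by how surprising it is to a small language model, that is, by its negative log-probability given the preceding tokens:
$$I(x_t) = -\log P(x_t \mid x_{<t})$$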
High information content = the small LM was surprised = the token is non-redundant, non-predictable, potentially important.
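Before bringing in a neural scorer, the idea can be illustrated with a toy model. The sketch below uses a smoothed unigram model as a stand-in for the small LM. That is an assumption made to keep the example dependency-free: a real scorer conditions on the preceding tokens, which a unigram model does not.

```python
import math
from collections import Counter

corpus = "the cat sat on the mat and the dog sat on the rug".split()
counts = Counter(corpus)
total = len(corpus)
vocab_size = len(counts) + 1  # +1 slot for any unseen token


def self_information(token: str) -> float:
    """Self-information in bits under an add-one-smoothed unigram model."""
    p = (counts[token] + 1) / (total + vocab_size)
    return -math.log2(p)


for tok in ["the", "sat", "quasar"]:
    print(f"{tok!r}: {self_information(tok):.2f} bits")
```

Frequent tokens carry few bits and an unseen token carries the most, which is the ordering a selective compressor relies on when deciding what to drop.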
import re

import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class SelectiveContextCompressor:
    """
    Selective Context: compress by keeping high-self-information tokens.

    Based on: Li et al. (2023), "Compressing Context to Enhance Inference
    Efficiency of Large Language Models"

    Uses a small LM to score each sentence's information content,
    then keeps only the highest-scoring sentences.
    """

    def __init__(
        self,
        scorer_model: str = "gpt2",  # Small, fast scoring model
        device: str = "cuda" if torch.cuda.is_available() else "cpu",
    ):
        print(f"Loading scorer: {scorer_model}")
        self.model = AutoModelForCausalLM.from_pretrained(scorer_model)
        self.tokenizer = AutoTokenizer.from_pretrained(scorer_model)
        self.model.eval().to(device)
        self.device = device

    def compute_self_information(self, text: str) -> list[tuple[str, float]]:
        """
        Compute per-token self-information scores.

        Returns list of (token_text, information_score) tuples. The first
        token has no preceding context to be predicted from, so it is not
        scored and does not appear in the result. Text longer than the
        scorer's context window (1024 tokens for GPT-2) must be chunked first.
        """
        inputs = self.tokenizer(text, return_tensors="pt").to(self.device)
        with torch.no_grad():
            outputs = self.model(**inputs)

        # Per-token log probabilities
        logits = outputs.logits[:, :-1, :]  # predictions for all but last
        targets = inputs["input_ids"][:, 1:]  # actual next tokens
        log_probs = torch.nn.functional.log_softmax(logits, dim=-1)
        token_log_probs = log_probs.gather(
            dim=-1,
            index=targets.unsqueeze(-1),
        ).squeeze(-1)

        tokens = self.tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        # Self-information = -log_prob (higher = more surprising = more informative)
        self_info = -token_log_probs[0].cpu().numpy()
        return list(zip(tokens[1:], self_info))

    def compress(
        self,
        text: str,
        keep_ratio: float = 0.5,
        unit: str = "sentence",  # "sentence" or "token"
    ) -> str:
        """
        Compress text by keeping highest-information-content units.

        keep_ratio: fraction of sentences/tokens to keep
        """
        if unit == "sentence":
            return self._compress_by_sentence(text, keep_ratio)
        return self._compress_by_token(text, keep_ratio)

    def _compress_by_sentence(self, text: str, keep_ratio: float) -> str:
        """Keep highest-scoring sentences by average token self-information."""
        sentences = re.split(r'(?<=[.!?])\s+', text)
        if len(sentences) < 2:
            return text

        sentence_scores = []
        for sent in sentences:
            if len(sent) < 10:  # Skip very short fragments
                sentence_scores.append(0.0)
                continue
            token_info = self.compute_self_information(sent)
            # Score = mean self-information of all scored tokens
            avg_info = np.mean([score for _, score in token_info]) if token_info else 0.0
            sentence_scores.append(avg_info)

        # Keep top keep_ratio fraction of sentences
        n_keep = max(1, int(len(sentences) * keep_ratio))
        keep_indices = set(np.argsort(sentence_scores)[-n_keep:])

        # Reconstruct maintaining original order
        kept_sentences = [sent for i, sent in enumerate(sentences) if i in keep_indices]
        return " ".join(kept_sentences)

    def _compress_by_token(self, text: str, keep_ratio: float) -> str:
        """Keep highest-information individual tokens."""
        token_info = self.compute_self_information(text)
        if not token_info:  # Single-token input: nothing to score
            return text
        n_keep = max(1, int(len(token_info) * keep_ratio))
        threshold = sorted(score for _, score in token_info)[-n_keep]
        # Ties at the threshold are all kept, so slightly more than n_keep
        # tokens may survive. The unscored first token is always dropped.
        kept_tokens = [token for token, score in token_info if score >= threshold]
        # Note: this produces fragmented text - only use as context, not as readable prose
        return self.tokenizer.convert_tokens_to_string(kept_tokens)
LLMLingua - Coarse-to-Fine Token Pruning
LLMLingua (Jiang et al. 2023, "LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models") is one of the most widely used context compression methods. It extends the self-information approach with:
- Budget allocation: assign per-segment compression budgets based on the segment's importance
- Coarse-to-fine compression: first decide which sentences to keep (coarse), then which tokens within kept sentences (fine)
- Conditional compression: condition the small LM on the query to estimate token relevance given the specific question (this question-aware scoring is developed most fully in the follow-up work, LongLLMLingua)
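The budget allocation step can be sketched in isolation. The proportional scheme below is a simplification assumed for illustration, not the paper's budget controller, and `allocate_keep_ratios` is a hypothetical helper name. Each segment's keep ratio scales with its share of the total importance score, so that the average keep ratio equals the overall budget before clipping.

```python
from typing import List, Sequence


def allocate_keep_ratios(
    segment_scores: Sequence[float],
    budget_ratio: float,
    min_ratio: float = 0.1,
    max_ratio: float = 1.0,
) -> List[float]:
    """Give each segment a keep ratio proportional to its importance score."""
    n = len(segment_scores)
    if n == 0:
        return []
    total = sum(segment_scores)
    if total <= 0:
        # No usable signal: fall back to a uniform budget
        return [min(max(budget_ratio, min_ratio), max_ratio)] * n
    ratios = []
    for score in segment_scores:
        raw = (score / total) * budget_ratio * n
        ratios.append(min(max(raw, min_ratio), max_ratio))
    return ratios


# Three segments; the third is twice as important as the others
ratios = allocate_keep_ratios([1.0, 1.0, 2.0], budget_ratio=0.5)
print(ratios)
```

Clipping keeps every segment between a floor and 1.0, so the realized budget can drift from the target when scores are very uneven.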
The Coarse-to-Fine Algorithm
import re

import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class LLMLinguaCompressor:
    """
    LLMLingua: Budget-aware, query-conditioned prompt compression.

    Based on: Jiang et al. (2023), "LLMLingua: Compressing Prompts for
    Accelerated Inference of Large Language Models"

    Architecture:
    - Small LM (e.g., Llama-2-7B or GPT-2-XL) for perplexity scoring
    - Query-conditioned scoring: compress in context of the actual question
    - Coarse-to-fine: sentence budget allocation → token pruning

    This is a simplified illustration, not the paper's exact algorithm.
    For production use, install the official package:
        pip install llmlingua
    """

    def __init__(
        self,
        small_lm_name: str = "NousResearch/Llama-2-7b-hf",
        device: str = "cuda",
    ):
        self.model = AutoModelForCausalLM.from_pretrained(
            small_lm_name, torch_dtype=torch.bfloat16
        ).to(device)
        self.model.eval()
        self.tokenizer = AutoTokenizer.from_pretrained(small_lm_name)
        self.device = device

    def _context_token_nll(self, context_text: str, query: str):
        """
        Score the context tokens with the query as a prefix.

        Returns (context_token_ids, per-token negative log-likelihood).
        """
        prefix = f"{query}\n"
        inputs = self.tokenizer(prefix + context_text, return_tensors="pt").to(self.device)
        # Approximate split point: assumes the prefix tokenizes the same way
        # on its own as it does at the start of the combined string.
        query_len = len(self.tokenizer.encode(prefix))
        with torch.no_grad():
            logits = self.model(**inputs).logits
        log_probs = torch.nn.functional.log_softmax(logits.float(), dim=-1)
        context_token_ids = inputs["input_ids"][0, query_len:]
        # The logits at position i predict token i + 1, hence the shift by one
        nll = -log_probs[0, query_len - 1:-1].gather(
            dim=-1,
            index=context_token_ids.unsqueeze(-1),
        ).squeeze(-1)
        return context_token_ids, nll

    def compute_conditional_ppl(self, context_text: str, query: str) -> float:
        """
        Compute perplexity of context conditioned on the query.

        In this simplified sketch, higher perplexity = the small LM is more
        surprised by this context given the query = the context is treated
        as carrying more information worth keeping.

        Perplexity is computed over the context tokens only, not the query.
        """
        _, nll = self._context_token_nll(context_text, query)
        if nll.numel() == 0:
            return 0.0
        return float(torch.exp(nll.mean()))

    def coarse_compress(
        self,
        segments: list[str],  # List of sentences or paragraphs
        query: str,
        budget_ratio: float = 0.5,
    ) -> list[tuple[str, float]]:
        """
        Coarse stage: score each segment and allocate compression budget.

        Returns (segment, allocated_ratio) pairs - segments with higher
        perplexity get higher keep_ratio (less compressed).
        """
        # Score each segment
        segment_ppls = [self.compute_conditional_ppl(seg, query) for seg in segments]

        # Normalize perplexity scores to get importance weights
        ppls = np.array(segment_ppls)
        weights = ppls / ppls.sum()

        # Allocate more budget (higher keep_ratio) to high-importance segments.
        # Before clipping, the mean keep_ratio equals budget_ratio.
        allocated_ratios = weights * budget_ratio * len(segments)
        allocated_ratios = np.clip(allocated_ratios, 0.1, 1.0)
        return list(zip(segments, allocated_ratios))

    def fine_compress(self, text: str, keep_ratio: float, query: str) -> str:
        """
        Fine stage: prune individual tokens from a segment.

        Removes the tokens the small LM finds most predictable given the query.
        """
        context_token_ids, nll = self._context_token_nll(text, query)
        if nll.numel() == 0:
            return text

        # Tokens with low log prob (high surprise) are more informative
        info_scores = nll.cpu().numpy()

        # Keep top keep_ratio by information score
        n_keep = max(1, int(len(info_scores) * keep_ratio))
        threshold = np.sort(info_scores)[-n_keep]
        keep_mask = info_scores >= threshold

        # Reconstruct compressed text
        context_tokens = self.tokenizer.convert_ids_to_tokens(context_token_ids.tolist())
        kept_tokens = [t for t, keep in zip(context_tokens, keep_mask) if keep]
        return self.tokenizer.convert_tokens_to_string(kept_tokens)

    def compress(
        self,
        context: str,
        query: str,
        target_ratio: float = 0.3,  # Keep 30% of original tokens
    ) -> str:
        """
        Full LLMLingua compression pipeline.

        1. Split into segments
        2. Coarse: allocate budget per segment
        3. Fine: prune tokens within each segment
        """
        segments = [s for s in re.split(r'(?<=[.!?])\s+', context) if s.strip()]
        if not segments:
            return context

        # Coarse stage
        coarse_results = self.coarse_compress(segments, query, target_ratio)

        # Fine stage: compress each segment with its allocated ratio
        compressed_parts = []
        for segment, ratio in coarse_results:
            if ratio < 0.2:
                # Very low budget: skip this segment entirely
                continue
            elif ratio > 0.9:
                # High budget: keep segment intact
                compressed_parts.append(segment)
            else:
                compressed = self.fine_compress(segment, ratio, query)
                if compressed.strip():
                    compressed_parts.append(compressed)
        return " ".join(compressed_parts)
Using the Official LLMLingua Package
# In production, use the official package
# pip install llmlingua
from llmlingua import PromptCompressor

# Initialize with a small LM (LLMLingua-2 uses a specialized model)
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-large-multilingual-cased-meetingbank",
    use_llmlingua2=True,
)

# Original long context
original_prompt = """
System: You are a helpful assistant.
Context: [... 8000 tokens of context ...]
Question: What were the main causes of the 2008 financial crisis?
"""

# Compress with LLMLingua-2 (task-agnostic, 4-6× compression).
# The question-aware options (condition_compare, condition_in_question,
# reorder_context, dynamic_context_compression_ratio) belong to the
# LongLLMLingua mode and are not expected to take effect when
# use_llmlingua2=True, so they are omitted here.
compressed = compressor.compress_prompt(
    original_prompt,
    instruction="",
    question="What were the main causes of the 2008 financial crisis?",
    target_token=1500,  # Target 1500 tokens (from ~8000)
)

# Compute the ratio from the token counts rather than formatting
# compressed['ratio'], which the package may return as a preformatted string.
ratio = compressed["origin_tokens"] / max(1, compressed["compressed_tokens"])
print(f"Original tokens: {compressed['origin_tokens']}")
print(f"Compressed tokens: {compressed['compressed_tokens']}")
print(f"Ratio: {ratio:.1f}x compression")
print(f"\nCompressed prompt:\n{compressed['compressed_prompt'][:500]}...")
LLMLingua-2 Improvements
LLMLingua-2 (Pan et al. 2024) improved over the original by:
- Task-agnostic compression: doesn't require a query at compression time (useful when you don't know the query in advance)
- Training-based approach: fine-tunes a small BERT-style model specifically for token importance prediction
- Higher compression ratios: 4-6× with less accuracy degradation than LLMLingua-1
- Better fluency: the compressed output is more grammatically coherent (important for LLM comprehension)
| Method | Compression Ratio | Accuracy Drop |
|---|---|---|
| No compression | 1× | 0% |
| Selective Context | 2-3× | 5-8% |
| LLMLingua | 3-5× | 4-7% |
| LLMLingua-2 | 4-6× | 3-5% |
| Naive truncation | 4-6× | 15-25% |
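LLMLingua-2's training-based idea treats compression as per-token classification: a small encoder predicts, for each token, the probability that it should be kept, and the highest-probability tokens are retained in their original order. A minimal sketch of that selection step follows. The probabilities are made up for illustration (a real system would get them from the trained classifier), and `prune_by_keep_probability` is a hypothetical helper name.

```python
from typing import List, Sequence


def prune_by_keep_probability(
    tokens: Sequence[str],
    keep_probs: Sequence[float],
    rate: float,
) -> List[str]:
    """Keep the `rate` fraction of tokens with the highest keep probability."""
    if len(tokens) != len(keep_probs):
        raise ValueError("tokens and keep_probs must have the same length")
    n_keep = max(1, round(len(tokens) * rate))
    ranked = sorted(range(len(tokens)), key=lambda i: keep_probs[i], reverse=True)
    kept_indices = sorted(ranked[:n_keep])  # restore original word order
    return [tokens[i] for i in kept_indices]


tokens = ["The", "meeting", "was", "moved", "to", "Friday", "at", "noon"]
# Hypothetical classifier outputs: function words get low keep probability
keep_probs = [0.10, 0.90, 0.20, 0.85, 0.30, 0.95, 0.25, 0.80]
kept = prune_by_keep_probability(tokens, keep_probs, rate=0.5)
print(" ".join(kept))
```

Because selection never reorders or rewrites tokens, the output stays faithful to the source wording, which is the property that distinguishes this family from summarization.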
AutoCompressors - Soft Prompt Compression
All the techniques above produce compressed text - a shorter version of the original that's still readable by the LLM. AutoCompressors take a fundamentally different approach: compress the context into soft prompt vectors that the LLM can condition on, without the compressed representation being human-readable.
The Architecture
AutoCompressor (Chevalier et al. 2023, "Adapting Language Models to Compress Contexts") fine-tunes a language model to:
- Process a long context
- Produce a fixed set of summary embeddings (soft prompt vectors)
- Use those summary embeddings as a prefix for subsequent generation
The summary embeddings are not tokens - they're continuous vectors in the model's embedding space. They can encode far more information per "slot" than discrete tokens.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM


class AutoCompressor(nn.Module):
    """
    AutoCompressor: compresses context into soft prompt vectors.

    Fine-tunes a base LLM with:
    1. Special [SUMMARY] tokens appended to each segment
    2. Summary token embeddings learned to encode the segment's content
    3. Summary embeddings from one segment reused as context for the next

    This is a simplified conceptual implementation. Without the fine-tuning
    step, the vectors it produces are not useful summaries.
    For production, use the official implementation from princeton-nlp/AutoCompressors.
    """

    def __init__(
        self,
        base_model_name: str,
        n_summary_tokens: int = 50,  # Number of soft prompt vectors to produce
        segment_length: int = 512,  # Process context in this-sized segments
    ):
        super().__init__()
        self.model = AutoModelForCausalLM.from_pretrained(base_model_name)
        self.n_summary = n_summary_tokens
        self.segment_length = segment_length
        # Placeholder summary token ID (reuses the last vocabulary entry).
        # In practice, you'd add a dedicated token to the tokenizer and
        # resize the embeddings during training.
        self.summary_token_id = self.model.config.vocab_size - 1

    def compress_segment(
        self,
        segment_input_ids: torch.Tensor,
        prev_summary: torch.Tensor | None = None,
    ) -> torch.Tensor:
        """
        Process one segment and produce summary embeddings.

        segment_input_ids: (batch, segment_len)
        prev_summary: (batch, n_summary, hidden_size) or None
        Returns summary: (batch, n_summary, hidden_size)
        """
        # Append n_summary [SUMMARY] tokens to segment
        summary_ids = torch.full(
            (segment_input_ids.shape[0], self.n_summary),
            fill_value=self.summary_token_id,
            dtype=segment_input_ids.dtype,
            device=segment_input_ids.device,
        )
        combined_ids = torch.cat([segment_input_ids, summary_ids], dim=1)
        inputs_embeds = self.model.get_input_embeddings()(combined_ids)

        # If we have a previous summary, prepend it as soft prompts.
        # (This sketch prepends input embeddings; the official
        # implementation may differ in detail.)
        if prev_summary is not None:
            inputs_embeds = torch.cat(
                [prev_summary.to(inputs_embeds.dtype), inputs_embeds], dim=1
            )

        outputs = self.model(inputs_embeds=inputs_embeds, output_hidden_states=True)

        # Extract the hidden states at [SUMMARY] token positions
        # These are the compressed representations of the segment
        last_hidden = outputs.hidden_states[-1]  # (batch, seq_len, hidden_size)
        summary_vectors = last_hidden[:, -self.n_summary:, :]  # (batch, n_summary, hidden_size)
        return summary_vectors

    def compress_long_context(self, input_ids: torch.Tensor) -> torch.Tensor:
        """
        Compress an entire long context into soft prompt vectors.

        Processes the context in segments, accumulating summaries. The number
        of vectors grows with the number of segments.
        """
        seq_len = input_ids.shape[1]
        all_summaries = []
        prev_summary = None
        for start in range(0, seq_len, self.segment_length):
            segment = input_ids[:, start:start + self.segment_length]
            summary = self.compress_segment(segment, prev_summary)
            all_summaries.append(summary)
            prev_summary = summary  # Pass to next segment

        # Concatenate all summaries
        # Shape: (batch, n_segments * n_summary, hidden_size)
        return torch.cat(all_summaries, dim=1)

    def generate_with_compressed_context(
        self,
        compressed_context: torch.Tensor,
        query_input_ids: torch.Tensor,
        max_new_tokens: int = 200,
    ) -> torch.Tensor:
        """
        Generate an answer conditioned on compressed context and query.

        The compressed context serves as a soft prompt prefix.
        """
        # In a full implementation, the compressed context vectors are
        # prepended to the key-value cache of the model, allowing the
        # model to attend to them during generation without re-encoding.
        #
        # This is architecturally similar to prefix tuning but the prefix
        # is dynamically computed from the context rather than fixed learned parameters.
        raise NotImplementedError(
            "Generation requires a model fine-tuned to read summary vectors"
        )
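The size of the soft prompt produced by a segment-wise scheme like the one above follows from simple arithmetic: one block of summary vectors per segment. The sketch below works through it using the defaults from the class (512-token segments, 50 summary vectors each); `soft_prompt_size` is a hypothetical helper name.

```python
import math
from typing import Tuple


def soft_prompt_size(
    context_tokens: int,
    segment_length: int = 512,
    n_summary_tokens: int = 50,
) -> Tuple[int, int, float]:
    """Return (n_segments, n_summary_vectors, compression_ratio)."""
    n_segments = math.ceil(context_tokens / segment_length)
    n_vectors = n_segments * n_summary_tokens
    return n_segments, n_vectors, context_tokens / n_vectors


n_segments, n_vectors, ratio = soft_prompt_size(6144)
print(f"{n_segments} segments -> {n_vectors} summary vectors ({ratio:.2f}x)")
```

With these defaults the ratio is about 10×. Higher ratios require fewer summary vectors per segment or longer segments, both of which ask each vector to carry more information.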
AutoCompressor Tradeoffs
| Property | Text Compression (LLMLingua) | Soft Compression (AutoCompressors) |
|---|---|---|
| Compression ratio | 3-6× | 10-100× (variable) |
| Works with any LLM | Yes | No - requires AutoCompressor-specific model |
| Interpretable output | Yes (readable text) | No (embedding vectors) |
| Losslessness | Moderate | High (trained end-to-end) |
| Training required | No (zero-shot) | Yes (fine-tuning the base model) |
| Production complexity | Low-medium | High |
GIST Tokens - Learned Prompt Compression
GIST tokens (Mu et al. 2023, "Learning to Compress Prompts with Gist Tokens") are a more parameter-efficient approach to soft compression.
The Concept
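GIST training teaches the model to compress a prompt into a small number of special "gist token" positions. During fine-tuning:
- A few GIST token positions are inserted after the prompt
- The model is trained to produce the same outputs when it can see only the GIST tokens as when it can see the full prompt (in Mu et al. this is enforced with an attention mask that hides the prompt from everything after the GIST tokens, rather than with a separate distillation loss)
- At inference: compress the prompt to just the GIST tokens + the query
The GIST tokens learn to encode the most important information from the full prompt, compressing it to a small fixed number of learned slots. The sketch below expresses the same objective as an explicit distillation loss for clarity.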
import torch
import torch.nn as nn


class GISTCompressor:
    """
    GIST Tokens: learned generalizable prompt compression.

    Key properties:
    - Fixed number of GIST tokens (e.g., 40) regardless of prompt length
    - Trained end-to-end so that GIST-only output matches full-prompt output
    - Written here as an explicit distillation loss for readability

    For production use, see the official GIST implementation.
    This is a conceptual illustration of the training setup, not the
    attention-masking procedure used in the paper.
    """

    def __init__(
        self,
        base_model,
        tokenizer,
        n_gist_tokens: int = 40,
    ):
        self.model = base_model
        self.tokenizer = tokenizer
        self.n_gist = n_gist_tokens
        # Initialize GIST token embeddings (learned parameters)
        embed_dim = base_model.config.hidden_size
        self.gist_embeddings = nn.Embedding(n_gist_tokens, embed_dim).to(base_model.device)
        nn.init.normal_(self.gist_embeddings.weight, std=0.02)

    def gist_loss(
        self,
        full_prompt_ids: torch.Tensor,  # The full prompt: (batch, prompt_len)
        gist_prefix_ids: torch.Tensor,  # Just the instruction/query part: (batch, query_len)
        target_ids: torch.Tensor,  # Expected outputs: (batch, target_len)
    ) -> torch.Tensor:
        """
        GIST training objective: the model with GIST tokens should produce
        the same output distribution as with the full prompt.

        Loss = KL divergence between full-prompt and gist-token predictions
        at the target positions.
        """
        n_target = target_ids.shape[1]
        embed = self.model.get_input_embeddings()

        # Teacher pass: full prompt followed by the target (no gradients)
        with torch.no_grad():
            full_ids = torch.cat([full_prompt_ids, target_ids], dim=1)
            # Logits at position i predict token i + 1, so this slice holds
            # the predictions for exactly the target tokens.
            full_logits = self.model(input_ids=full_ids).logits[:, -n_target - 1:-1]

        # Student pass: GIST embeddings + query + target, fed as embeddings
        batch = gist_prefix_ids.shape[0]
        gist_ids = torch.arange(self.n_gist, device=gist_prefix_ids.device)
        gist_embeds = self.gist_embeddings(gist_ids)  # (n_gist, embed_dim)
        gist_embeds = gist_embeds.unsqueeze(0).expand(batch, -1, -1)
        gist_embeds = gist_embeds.to(embed.weight.dtype)
        student_embeds = torch.cat(
            [gist_embeds, embed(gist_prefix_ids), embed(target_ids)], dim=1
        )
        gist_logits = self.model(inputs_embeds=student_embeds).logits[:, -n_target - 1:-1]

        # Distillation loss: match full-prompt distribution, averaged per token
        kl_loss = nn.functional.kl_div(
            gist_logits.log_softmax(dim=-1).flatten(0, 1),
            full_logits.softmax(dim=-1).flatten(0, 1),
            reduction="batchmean",
        )
        return kl_loss
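Mu et al. train gisting by changing the attention mask rather than adding a loss term: tokens that come after the gist positions are prevented from attending to the original prompt, so any information they need must pass through the gist tokens. The sketch below builds such a mask with plain Python lists. It assumes a layout of prompt tokens, then gist tokens, then the remaining tokens, and `gist_attention_mask` is a hypothetical helper name.

```python
from typing import List


def gist_attention_mask(n_prompt: int, n_gist: int, n_rest: int) -> List[List[bool]]:
    """
    Build a causal attention mask with gist masking.

    mask[q][k] is True when query position q may attend to key position k.
    Layout: [prompt tokens][gist tokens][remaining tokens].
    """
    total = n_prompt + n_gist + n_rest
    gist_start = n_prompt
    gist_end = n_prompt + n_gist
    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            causal = k <= q
            # Positions after the gist tokens cannot see the original prompt
            hidden_prompt = q >= gist_end and k < gist_start
            row.append(causal and not hidden_prompt)
        mask.append(row)
    return mask


mask = gist_attention_mask(n_prompt=3, n_gist=2, n_rest=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The gist positions themselves still see the full prompt, which is how they absorb its content. At inference time, only their cached activations need to be kept.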
When GIST Tokens Are Appropriate
GIST tokens work best when:
- You have many queries against the same prompt (the GIST compression is learned per-prompt-type)
- You can afford fine-tuning on the target task
- The prompt is static enough that a fixed compression is valid
GIST is less useful when:
- The prompt changes significantly with each query
- You need zero-shot compression without fine-tuning
- The LLM you're using cannot be modified (API-only access)
RECOMP - Compression for RAG
RECOMP (Xu et al. 2023, "RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation") specifically targets the RAG use case: given retrieved documents, compress them into a short summary that's still useful for the final query.
Two variants:
- Extractive RECOMP: select the most relevant sentences from each retrieved document.
- Abstractive RECOMP: summarize each retrieved document with a small summarizer model.
import re

import torch


def score_relevance(query: str, text: str, model, tokenizer) -> float:
    """Score text relevance to query using a cross-encoder."""
    inputs = tokenizer(
        query, text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    with torch.no_grad():
        output = model(**inputs)
    return output.logits[0].item()


def recomp_extractive(
    documents: list[str],
    query: str,
    n_sentences_per_doc: int = 3,
    scorer_model=None,
    scorer_tokenizer=None,
) -> list[str]:
    """
    Extractive RECOMP: select top-N sentences from each document.

    For each document, score each sentence by relevance to the query
    using a cross-encoder, then keep the top-N. (The paper trains a
    dual-encoder for this step; a cross-encoder is used here as an
    off-the-shelf stand-in.)
    """
    compressed_docs = []
    for doc in documents:
        sentences = re.split(r'(?<=[.!?])\s+', doc)
        if len(sentences) <= n_sentences_per_doc:
            compressed_docs.append(doc)
            continue

        # Score sentences by query relevance
        # (In production, use a cross-encoder like ms-marco-MiniLM)
        scores = []
        for sent in sentences:
            if scorer_model is not None:
                score = score_relevance(query, sent, scorer_model, scorer_tokenizer)
            else:
                # Naive: use sentence length as proxy (not recommended for production)
                score = len(sent)
            scores.append(score)

        # Keep top-N sentences in original order
        top_indices = sorted(
            sorted(range(len(scores)), key=lambda i: -scores[i])[:n_sentences_per_doc]
        )
        kept_sentences = [sentences[i] for i in top_indices]
        compressed_docs.append(" ".join(kept_sentences))
    return compressed_docs
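For quick experiments without a neural scorer, extractive selection can be tried with a lexical stand-in. The sketch below scores sentences by word overlap with the query. This is an assumption made to keep the example dependency-free, and it is far weaker than a trained relevance model; `overlap_score` and `select_sentences` are hypothetical helper names.

```python
import re


def overlap_score(query: str, sentence: str) -> float:
    """Fraction of distinct query words that also appear in the sentence."""
    query_words = set(re.findall(r"\w+", query.lower()))
    sentence_words = set(re.findall(r"\w+", sentence.lower()))
    if not query_words or not sentence_words:
        return 0.0
    return len(query_words & sentence_words) / len(query_words)


def select_sentences(document: str, query: str, n_sentences: int = 2) -> str:
    """Keep the n highest-scoring sentences, preserving document order."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: overlap_score(query, sentences[i]),
        reverse=True,
    )
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)


doc = (
    "The bank was founded in 1901. "
    "Subprime mortgage defaults triggered the crisis. "
    "Its headquarters are in Ohio. "
    "Excessive leverage made the crisis worse."
)
compressed_doc = select_sentences(doc, "What caused the crisis?", n_sentences=2)
print(compressed_doc)
```

Lexical overlap misses paraphrases (here, "caused" never matches "triggered"), which is the gap a trained scorer is meant to close.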
Practical Compression Pipeline
class ContextCompressionPipeline:
    """
    Context compression pipeline with a pluggable strategy.

    Wraps one of the compressors from this lesson behind a common
    interface and reports approximate token statistics.
    Assumes SelectiveContextCompressor (defined earlier) is in scope.
    """

    def __init__(
        self,
        strategy: str = "llmlingua2",  # "llmlingua2" or "selective"
        target_compression_ratio: float = 4.0,  # Compress 4×
    ):
        self.strategy = strategy
        self.target_ratio = target_compression_ratio
        self._setup()

    def _setup(self):
        """Initialize the appropriate compression model."""
        if self.strategy == "llmlingua2":
            from llmlingua import PromptCompressor
            self.compressor = PromptCompressor(
                model_name="microsoft/llmlingua-2-bert-large-multilingual-cased-meetingbank",
                use_llmlingua2=True,
                device_map="auto",
            )
        elif self.strategy == "selective":
            self.compressor = SelectiveContextCompressor()
        else:
            raise ValueError(f"Unknown strategy: {self.strategy}")

    def compress(
        self,
        context: str,
        query: str | None = None,
        max_output_tokens: int | None = None,
    ) -> dict:
        """
        Compress context and return statistics.

        Returns:
        - compressed_text: the compressed context
        - original_tokens: approximate original token count
        - compressed_tokens: approximate compressed token count
        - compression_ratio: actual compression achieved
        """
        original_tokens = len(context) // 4  # rough approximation (~4 chars per token)
        if max_output_tokens is None:
            max_output_tokens = int(original_tokens / self.target_ratio)

        if self.strategy == "llmlingua2":
            result = self.compressor.compress_prompt(
                context,
                instruction="",
                question=query or "",
                target_token=max_output_tokens,
            )
            compressed = result["compressed_prompt"]
            compressed_tokens = result["compressed_tokens"]
        else:  # "selective" (other values are rejected in _setup)
            keep_ratio = 1.0 / self.target_ratio
            compressed = self.compressor.compress(context, keep_ratio)
            compressed_tokens = len(compressed) // 4

        return {
            "compressed_text": compressed,
            "original_tokens": original_tokens,
            "compressed_tokens": compressed_tokens,
            "compression_ratio": original_tokens / max(1, compressed_tokens),
        }


# Usage example
pipeline = ContextCompressionPipeline(strategy="llmlingua2", target_compression_ratio=4.0)

# Compress a long document before sending to expensive LLM
long_context = "... 20,000 tokens of retrieved documents ..."
query = "What are the key risk factors mentioned?"
result = pipeline.compress(long_context, query=query, max_output_tokens=5000)
print(f"Compressed {result['original_tokens']:,} → {result['compressed_tokens']:,} tokens "
      f"({result['compression_ratio']:.1f}× compression)")

# Now use the compressed context with any LLM
compressed_prompt = f"{result['compressed_text']}\n\nQuestion: {query}\n\nAnswer:"
Common Mistakes
:::danger Don't use compression for short contexts Context compression adds latency (running a small model to score tokens) and introduces accuracy risk (the compression may remove relevant information). For short contexts (as a rule of thumb, under about 4K tokens), the overhead usually outweighs the savings. Reserve compression for contexts that are genuinely long or expensive for your LLM. :::
:::warning Compression accuracy depends heavily on the compressor model quality "Context compression" is not plug-and-play. The quality of the compressor model (particularly for LLMLingua variants) matters significantly. A poor scoring model produces compressed contexts that remove important information. Always evaluate compression accuracy on a held-out set before deploying in production. :::
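A held-out evaluation can be as simple as running the same question set with and without compression and comparing accuracy. The harness below is a minimal sketch: `toy_answer` and `truncate_half` are toy stand-ins (assumptions for illustration) for your real LLM call and your real compressor, and exact-match scoring is assumed.

```python
from typing import Callable, Dict, List, Optional

Example = Dict[str, str]  # keys: "context", "question", "answer"


def accuracy(
    examples: List[Example],
    answer_fn: Callable[[str, str], str],
    compress_fn: Optional[Callable[[str, str], str]] = None,
) -> float:
    """Exact-match accuracy, optionally compressing each context first."""
    correct = 0
    for ex in examples:
        context = ex["context"]
        if compress_fn is not None:
            context = compress_fn(context, ex["question"])
        if answer_fn(context, ex["question"]) == ex["answer"]:
            correct += 1
    return correct / len(examples)


def compression_accuracy_drop(examples, answer_fn, compress_fn) -> Dict[str, float]:
    baseline = accuracy(examples, answer_fn)
    compressed = accuracy(examples, answer_fn, compress_fn)
    return {"baseline": baseline, "compressed": compressed, "drop": baseline - compressed}


# Toy stand-ins so the harness runs without a model
CANDIDATES = {"capital?": "Paris", "when?": "March"}


def toy_answer(context: str, question: str) -> str:
    guess = CANDIDATES[question]
    return guess if guess in context.split() else "unknown"


def truncate_half(context: str, question: str) -> str:
    words = context.split()
    return " ".join(words[: len(words) // 2])


examples = [
    {"context": "Paris is the capital of France", "question": "capital?", "answer": "Paris"},
    {"context": "The launch happened in March", "question": "when?", "answer": "March"},
]
report = compression_accuracy_drop(examples, toy_answer, truncate_half)
print(report)
```

In the toy run, naive truncation loses the second answer because the key token sits at the end of the context. A real evaluation should use enough examples for the measured drop to be meaningful.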
:::tip Combine compression with strategic placement After compressing documents, apply the boundary placement strategy from Lesson 04. Place the highest-relevance compressed chunks at the beginning and end of the compressed context, not in the order they were retrieved. Compression reduces the total token count; boundary placement reduces the lost-in-the-middle risk. :::
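One way to realize this placement is to alternate ranked chunks between the front and the back of the context, so relevance falls off toward the middle. The sketch below is one assumed interpretation of boundary placement, not necessarily the exact procedure from Lesson 04, and `boundary_order` is a hypothetical helper name.

```python
from typing import List, Tuple

Chunk = Tuple[str, float]  # (text, relevance score)


def boundary_order(chunks: List[Chunk]) -> List[Chunk]:
    """Place the most relevant chunks at both ends, the least relevant in the middle."""
    ranked = sorted(chunks, key=lambda c: c[1], reverse=True)
    front: List[Chunk] = []
    back: List[Chunk] = []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


chunks = [("c", 3.0), ("a", 5.0), ("e", 1.0), ("b", 4.0), ("d", 2.0)]
ordered = boundary_order(chunks)
print([text for text, _ in ordered])
```

The top-ranked chunk opens the context and the runner-up closes it, leaving the weakest chunk in the position where models are most likely to overlook it.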
Interview Q&A
Q: What is LLMLingua and how does it differ from simple summarization?
A: LLMLingua is a token-level compression method that removes individual tokens from a long context rather than rewriting it. A small language model (e.g., LLaMA-2-7B) scores each token's "conditional perplexity" - how surprising the token is given both the preceding context and the user's query. High-perplexity (high-surprise) tokens are retained; low-perplexity (predictable, redundant) tokens are dropped. The key differences from summarization: (1) no rewriting - the remaining tokens are in the original order with original wording; (2) it's query-conditioned - tokens are scored relative to the specific question; (3) it's faster to apply than generating a summary.
Q: What is the difference between hard (token-level) and soft (embedding-level) context compression?
A: Hard compression produces a shorter sequence of actual tokens - human-readable compressed text that any LLM can use. Soft compression maps the long context to a small set of continuous embedding vectors that serve as a "soft prompt" prefix for the LLM. AutoCompressors and GIST tokens use soft compression. Hard compression (LLMLingua, Selective Context) works with any LLM out of the box. Soft compression can achieve higher compression ratios (roughly 10-100× vs 3-6× for hard) but requires a specially fine-tuned model and isn't interpretable or transferable across models.
Q: How does query-conditioned compression (LLMLingua) improve over query-independent compression?
A: Query-independent compression (like Selective Context) scores token importance based solely on the token's self-information in the document - how surprising it is to a language model. This gives equal priority to any surprising token regardless of whether it's relevant to the query. Query-conditioned compression (LLMLingua) computes the token's perplexity given both the document and the query. A technical detail about an unrelated topic might be high-information (surprising) but not query-relevant; query-conditioned scoring can identify this and remove it. Empirically, query-conditioned compression achieves significantly better accuracy at the same compression ratio.
Q: What is RECOMP and when would you use it in a RAG pipeline?
A: RECOMP (Retrieve, Compress, Prepend) compresses retrieved documents before adding them to the LLM context. The extractive variant selects the most query-relevant sentences using a trained relevance scorer; the abstractive variant generates a short summary focused on the query. RECOMP is particularly useful in RAG when: (1) retrieved documents are long but only partially relevant to the query; (2) you're concatenating many retrieved documents and need to fit them within a limited context window; (3) you want to improve the signal-to-noise ratio by removing irrelevant parts of retrieved passages before presenting them to the expensive LLM.
Q: What compression ratio can you expect from LLMLingua-2 and at what accuracy cost?
A: LLMLingua-2 typically achieves 4-6× compression in practice. As a rough guide, expect an accuracy drop of around 3-5% at 4× compression and a larger one (on the order of 7-10%) at 6×, though the exact figures depend heavily on the task and the downstream model, so measure on your own data. Pan et al. (2024) evaluate on MeetingBank (in-domain) and on LongBench, ZeroSCROLLS, GSM8K, and BBH (out-of-domain). For comparison, naive truncation (simply cutting the context to the target length) typically causes much larger drops, in the range of 15-25% at the same compression ratios. The training-based approach in LLMLingua-2 (using a fine-tuned BERT-style model for token importance prediction, rather than a general language model) improves both compression quality and inference speed compared to LLMLingua-1.
