Context Compression Techniques

The Middle Path

You've worked through the tradeoffs: RAG is cheap but may miss relevant context; long-context inference is powerful but expensive. Is there a middle path?

Context compression says yes. The core insight: not all tokens in a long context are equally useful. A 50-page research paper doesn't need 25,000 tokens to convey its key claims to an LLM. Much of it is hedging, transitions, redundant examples, repetition. If you can identify and remove the low-value tokens before passing the text to your expensive LLM, you get most of the information at a fraction of the cost.

This isn't summarization. Summarization rewrites the content in fewer words - useful but lossy, and bounded by the generation speed of the summarizing LLM. Context compression is more surgical: a smaller, cheaper model identifies tokens that are redundant or low-information, those tokens are removed, and the compressed sequence is passed to the larger model. The goal is a compression ratio of 2-6× with an accuracy loss of no more than about 3-5%.
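
To make the cost argument concrete, here is a back-of-the-envelope sketch. The helper name and the per-token prices are placeholders invented for illustration, not real provider prices; it assumes the small model reads the full context once and the large model reads only the compressed version.

```python
def compression_savings(
    context_tokens: int,
    compression_ratio: float,
    large_price_per_1k: float,
    small_price_per_1k: float,
) -> dict:
    """Estimate input-token cost with and without compression.

    Assumes the small model scores the full context once and the
    large model then reads only the compressed context.
    """
    compressed_tokens = int(context_tokens / compression_ratio)
    baseline = context_tokens / 1000 * large_price_per_1k
    with_compression = (
        context_tokens / 1000 * small_price_per_1k
        + compressed_tokens / 1000 * large_price_per_1k
    )
    return {
        "compressed_tokens": compressed_tokens,
        "baseline_cost": baseline,
        "compressed_cost": with_compression,
        "saving_fraction": 1 - with_compression / baseline,
    }


# Placeholder prices: 10.0 per 1K tokens (large model), 0.5 per 1K (small model).
estimate = compression_savings(
    25_000, 4.0, large_price_per_1k=10.0, small_price_per_1k=0.5
)
```

The saving shrinks as the scoring model gets more expensive relative to the large model, which is one reason compressors rely on small scorers.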

This lesson covers the main approaches: token-level compression (LLMLingua), soft prompt compression (AutoCompressors, GIST tokens), and selective retention (Selective Context).

Why Context Compression Is Hard

The Token Value Distribution Problem

Compressing a prompt requires deciding which tokens are "necessary" for the downstream LLM to produce a correct answer. This is inherently query-dependent: the token "January" in a document might be critical if the question is about timing, and irrelevant if the question is about methodology.

A good compressor must estimate, for each token, how much removing it would affect the final answer - without running the final LLM (which defeats the purpose). This is the core difficulty.
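
One way to make this precise (the notation here is ours, not taken from the cited papers) is to define the ideal importance of token $t_i$ as the change in the large model's answer distribution when that token is deleted:

$\text{importance}(t_i \mid q) = D_{\mathrm{KL}}\left( P_{\text{LLM}}(y \mid x, q) \,\|\, P_{\text{LLM}}(y \mid x_{\setminus i}, q) \right)$

Here $x$ is the full context, $x_{\setminus i}$ is the context with $t_i$ removed, $q$ is the query, and $y$ is the answer. Computing this exactly takes one large-model call per token, which is why practical compressors approximate it with a small LM.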

Approaches at a Glance

Selective Context - The Baseline Approach

Before specialized compression models, the simplest approach is selective context: score sentences or passages by information content, then keep the highest-scoring ones.

Self-Information Scoring

Each token's "information content" is measured by how surprising it is to a small language model:

$\text{information}(t_i) = -\log P_{\text{small LM}}(t_i | t_1, \ldots, t_{i-1})$

High information content = the small LM was surprised = the token is non-redundant, non-predictable, potentially important.
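
A quick numeric illustration of the formula, using made-up next-token probabilities rather than real model outputs:

```python
import math


def self_information(prob: float) -> float:
    """Self-information in nats: -log P(token | prefix)."""
    return -math.log(prob)


# Hypothetical probabilities a small LM might assign to the next token.
predictable = self_information(0.95)   # e.g. a function word the LM expects
surprising = self_information(0.002)   # e.g. a specific date or figure
```

The predictable token carries about 0.05 nats and the surprising one about 6.2 nats, so a score-based compressor drops the former long before the latter.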

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import numpy as np

class SelectiveContextCompressor:
    """
    Selective Context: compress by keeping high-self-information tokens.

    Based on: Li et al. (2023), "Compressing Context to Enhance Inference
    Efficiency of Large Language Models"

    Uses a small LM to score each sentence's information content,
    then keeps only the highest-scoring sentences.
    """

    def __init__(
        self,
        scorer_model: str = "gpt2",  # Small, fast scoring model
        device: str = "cuda" if torch.cuda.is_available() else "cpu",
    ):
        print(f"Loading scorer: {scorer_model}")
        self.model = AutoModelForCausalLM.from_pretrained(scorer_model)
        self.tokenizer = AutoTokenizer.from_pretrained(scorer_model)
        self.model.eval().to(device)
        self.device = device

    def compute_self_information(self, text: str) -> list[tuple[str, float]]:
        """
        Compute per-token self-information scores.

        Returns list of (token_text, information_score) tuples.
        """
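        # Truncate to the scorer's context window (1024 tokens for GPT-2);
        # longer inputs should be split into chunks before scoring.
        inputs = self.tokenizer(
            text,
            return_tensors="pt",
            truncation=True,
            max_length=self.model.config.max_position_embeddings,
        ).to(self.device)

        with torch.no_grad():
            outputs = self.model(**inputs)
            # Per-token log probabilities
            logits = outputs.logits[:, :-1, :]  # predictions for all but last
            targets = inputs["input_ids"][:, 1:]  # actual next tokens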

            log_probs = torch.nn.functional.log_softmax(logits, dim=-1)
            token_log_probs = log_probs.gather(
                dim=-1,
                index=targets.unsqueeze(-1)
            ).squeeze(-1)

        tokens = self.tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        # Self-information = -log_prob (higher = more surprising = more informative)
        self_info = -token_log_probs[0].cpu().numpy()

        return list(zip(tokens[1:], self_info))

    def compress(
        self,
        text: str,
        keep_ratio: float = 0.5,
        unit: str = "sentence",  # "sentence" or "token"
    ) -> str:
        """
        Compress text by keeping highest-information-content units.

        keep_ratio: fraction of sentences/tokens to keep
        """
        if unit == "sentence":
            return self._compress_by_sentence(text, keep_ratio)
        else:
            return self._compress_by_token(text, keep_ratio)

    def _compress_by_sentence(self, text: str, keep_ratio: float) -> str:
        """Keep highest-scoring sentences by average token self-information."""
        import re
        sentences = re.split(r'(?<=[.!?])\s+', text)
        if len(sentences) < 2:
            return text

        sentence_scores = []
        for sent in sentences:
            if len(sent) < 10:  # Skip very short fragments
                sentence_scores.append(0.0)
                continue
            token_info = self.compute_self_information(sent)
            # Score = mean self-information of all tokens
            avg_info = np.mean([score for _, score in token_info]) if token_info else 0.0
            sentence_scores.append(avg_info)

        # Keep top keep_ratio fraction of sentences
        n_keep = max(1, int(len(sentences) * keep_ratio))
        keep_indices = set(np.argsort(sentence_scores)[-n_keep:])

        # Reconstruct maintaining original order
        kept_sentences = [sent for i, sent in enumerate(sentences) if i in keep_indices]
        return " ".join(kept_sentences)

    def _compress_by_token(self, text: str, keep_ratio: float) -> str:
        """Keep highest-information individual tokens."""
        token_info = self.compute_self_information(text)
        if not token_info:
            return text
        n_keep = max(1, int(len(token_info) * keep_ratio))
        scores = np.array([score for _, score in token_info])
        # Select exactly n_keep indices, then restore the original order
        keep_indices = np.sort(np.argsort(scores)[-n_keep:])

        kept_tokens = [token_info[i][0] for i in keep_indices]
        # Note: the first token has no score (nothing precedes it) and is dropped.
        # This produces fragmented text - only use as context, not as readable prose
        return self.tokenizer.convert_tokens_to_string(kept_tokens)
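
Selecting by a score threshold can keep more items than intended when several scores tie at the threshold. A minimal, dependency-free sketch of index-based selection that keeps exactly `k` items in their original order (the function name and example scores are illustrative only):

```python
def keep_top_k_in_order(items: list[str], scores: list[float], k: int) -> list[str]:
    """Keep exactly k items with the highest scores, preserving input order.

    Ties are broken in favor of earlier items.
    """
    if len(items) != len(scores):
        raise ValueError("items and scores must have the same length")
    k = max(0, min(k, len(items)))
    # sorted() is stable, so among equal scores the earlier index wins.
    ranked = sorted(range(len(items)), key=lambda i: -scores[i])
    keep = sorted(ranked[:k])
    return [items[i] for i in keep]


tokens = ["Revenue", "rose", "to", "4.2", "billion", "in", "2023"]
scores = [3.1, 2.0, 0.2, 6.5, 2.0, 0.3, 5.8]
kept = keep_top_k_in_order(tokens, scores, k=4)
```

Here "rose" and "billion" tie at 2.0, and only the earlier one is kept so the budget of four tokens is respected.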

LLMLingua - Coarse-to-Fine Token Pruning

LLMLingua (Jiang et al. 2023, "LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models") is one of the most widely used context compression methods. It extends the self-information approach with:

Budget allocation: assign per-segment compression budgets based on the segment's importance
Coarse-to-fine compression: first decide which sentences to keep (coarse), then which tokens within kept sentences (fine)
Conditional compression: condition the small LM on the query to estimate token relevance given the specific question (this question-aware scoring was developed further in the follow-up LongLLMLingua work)

The Coarse-to-Fine Algorithm

class LLMLinguaCompressor:
    """
    LLMLingua: Budget-aware, query-conditioned prompt compression.

    Based on: Jiang et al. (2023), "LLMLingua: Compressing Prompts for
    Accelerated Inference of Large Language Models"

    Architecture:
    - Small LM (e.g., Llama-2-7B or GPT-2-XL) for perplexity scoring
    - Query-conditioned scoring: compress in context of the actual question
    - Coarse-to-fine: sentence budget allocation → token pruning

    For production use, install the official package:
    pip install llmlingua
    """

    def __init__(
        self,
        small_lm_name: str = "NousResearch/Llama-2-7b-hf",
        device: str = "cuda",
    ):
        self.model = AutoModelForCausalLM.from_pretrained(
            small_lm_name, torch_dtype=torch.bfloat16
        ).to(device)
        self.tokenizer = AutoTokenizer.from_pretrained(small_lm_name)
        self.device = device

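    def compute_conditional_ppl(
        self,
        context_text: str,
        query: str,
    ) -> float:
        """
        Compute the log-perplexity (mean negative log-likelihood) of the
        context tokens, conditioned on the query.

        Higher value = the small LM is more surprised by this context
        given the query. This sketch treats that surprise as a proxy for
        importance, as in the self-information approach above.

        Query-conditioned perplexity is intended to be a better signal of
        token importance than unconditional perplexity.
        """
        # Concatenate query + context for conditional scoring
        prefix = f"{query}\n"
        inputs = self.tokenizer(prefix + context_text, return_tensors="pt").to(self.device)
        # Approximate: assumes the prefix tokenizes the same alone and in context
        prefix_len = len(self.tokenizer.encode(prefix))

        with torch.no_grad():
            logits = self.model(**inputs).logits

        # Per-token log probs; logits at position i-1 predict token i
        log_probs = torch.nn.functional.log_softmax(logits[0, :-1].float(), dim=-1)
        targets = inputs["input_ids"][0, 1:]
        token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

        # Average only over the context tokens, not the query
        context_log_probs = token_log_probs[prefix_len - 1:]
        if context_log_probs.numel() == 0:
            return 0.0
        return float(-context_log_probs.mean())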

    def coarse_compress(
        self,
        segments: list[str],  # List of sentences or paragraphs
        query: str,
        budget_ratio: float = 0.5,
    ) -> list[tuple[str, float]]:
        """
        Coarse stage: score each segment and allocate compression budget.

        Returns (segment, allocated_ratio) pairs - segments with higher
        perplexity get higher keep_ratio (less compressed).
        """
        # Score each segment
        segment_ppls = []
        for seg in segments:
            ppl = self.compute_conditional_ppl(seg, query)
            segment_ppls.append(ppl)

        # Normalize perplexity scores to get importance weights
        ppls = np.array(segment_ppls)
        weights = ppls / ppls.sum()

        # Allocate more budget (higher keep_ratio) to high-importance segments.
        # The mean ratio equals budget_ratio before clipping; segment lengths
        # are not taken into account, so the token budget is only approximate.
        allocated_ratios = weights * budget_ratio * len(segments)
        allocated_ratios = np.clip(allocated_ratios, 0.1, 1.0)

        return list(zip(segments, allocated_ratios))

    def fine_compress(
        self,
        text: str,
        keep_ratio: float,
        query: str,
    ) -> str:
        """
        Fine stage: prune individual tokens from a segment.

        Removes tokens with lowest conditional perplexity contribution.
        """
        # Get per-token perplexity contribution conditioned on query
        prefix = f"{query}\n"
        inputs = self.tokenizer(prefix + text, return_tensors="pt").to(self.device)
        # Approximate: assumes the prefix tokenizes the same alone and in context
        query_len = len(self.tokenizer.encode(prefix))

        with torch.no_grad():
            logits = self.model(**inputs).logits
            # Cast to float32: bfloat16 tensors cannot be converted to NumPy
            log_probs = torch.nn.functional.log_softmax(logits.float(), dim=-1)

        # Extract per-token log prob for the context (not query) tokens
        context_token_ids = inputs["input_ids"][0, query_len:]
        context_log_probs = log_probs[0, query_len-1:-1].gather(
            dim=-1,
            index=context_token_ids.unsqueeze(-1)
        ).squeeze(-1)

        # Tokens with low log prob (high perplexity) are more informative
        info_scores = -context_log_probs.cpu().numpy()

        # Keep top keep_ratio by information score
        n_keep = max(1, int(len(info_scores) * keep_ratio))
        threshold = np.sort(info_scores)[-n_keep]
        keep_mask = info_scores >= threshold

        # Reconstruct compressed text
        context_tokens = self.tokenizer.convert_ids_to_tokens(context_token_ids.tolist())
        kept_tokens = [t for t, keep in zip(context_tokens, keep_mask) if keep]
        return self.tokenizer.convert_tokens_to_string(kept_tokens)

    def compress(
        self,
        context: str,
        query: str,
        target_ratio: float = 0.3,  # Keep 30% of original tokens
    ) -> str:
        """
        Full LLMLingua compression pipeline.

        1. Split into segments
        2. Coarse: allocate budget per segment
        3. Fine: prune tokens within each segment
        """
        import re
        segments = re.split(r'(?<=[.!?])\s+', context)

        # Coarse stage
        coarse_results = self.coarse_compress(segments, query, target_ratio)

        # Fine stage: compress each segment with its allocated ratio
        compressed_parts = []
        for segment, ratio in coarse_results:
            if ratio < 0.2:
                # Very low budget: skip this segment entirely
                continue
            elif ratio > 0.9:
                # High budget: keep segment mostly intact
                compressed_parts.append(segment)
            else:
                compressed = self.fine_compress(segment, ratio, query)
                if compressed.strip():
                    compressed_parts.append(compressed)

        return " ".join(compressed_parts)
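
The budget allocation in `coarse_compress` ignores segment length, so the overall token budget is only met on average. A length-aware variant can be sketched without any model. This is an illustration under our own assumptions, not the allocation rule from the LLMLingua paper; the function name and scores are hypothetical.

```python
def allocate_keep_ratios(
    token_counts: list[int],
    scores: list[float],
    budget_ratio: float,
    min_ratio: float = 0.0,
) -> list[float]:
    """Give each segment a keep ratio proportional to its score, scaled so
    the kept tokens sum to roughly budget_ratio * total tokens."""
    total_budget = budget_ratio * sum(token_counts)
    weighted = sum(n * s for n, s in zip(token_counts, scores))
    if weighted == 0:
        return [budget_ratio] * len(token_counts)
    scale = total_budget / weighted
    # Clipping to [min_ratio, 1.0] can move the total slightly off budget.
    return [min(1.0, max(min_ratio, s * scale)) for s in scores]


token_counts = [100, 300, 100]
scores = [2.0, 1.0, 4.0]   # made-up importance scores per segment
ratios = allocate_keep_ratios(token_counts, scores, budget_ratio=0.3)
kept_tokens = sum(n * r for n, r in zip(token_counts, ratios))
```

With these numbers the kept tokens add up to the 150-token budget, and the long, low-scoring middle segment receives the smallest keep ratio.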

Using the Official LLMLingua Package

# In production, use the official package
# pip install llmlingua

from llmlingua import PromptCompressor

# Initialize with a small LM (LLMLingua-2 uses a specialized model)
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-large-multilingual-cased-meetingbank",
    use_llmlingua2=True,
)

# Original long context
original_prompt = """
System: You are a helpful assistant.

Context: [... 8000 tokens of context ...]

Question: What were the main causes of the 2008 financial crisis?
"""

# Compress with LLMLingua-2 (task-agnostic, 4-6× compression)
# The question-aware options (condition_compare, condition_in_question,
# reorder_context, dynamic_context_compression_ratio) belong to the
# perplexity-based LLMLingua/LongLLMLingua path and are, to the best of our
# knowledge, not used when use_llmlingua2=True, so they are omitted here.
compressed = compressor.compress_prompt(
    original_prompt,
    target_token=1500,  # Target 1500 tokens (from ~8000)
)

print(f"Original tokens: {compressed['origin_tokens']}")
print(f"Compressed tokens: {compressed['compressed_tokens']}")
# 'ratio' may come back as a preformatted string, so compute the number directly
ratio = compressed['origin_tokens'] / max(1, compressed['compressed_tokens'])
print(f"Ratio: {ratio:.1f}x compression")
print(f"\nCompressed prompt:\n{compressed['compressed_prompt'][:500]}...")

LLMLingua-2 Improvements

LLMLingua-2 (Pan et al. 2024) improved over the original by:

Task-agnostic compression: doesn't require a query at compression time (useful when you don't know the query in advance)
Training-based approach: fine-tunes a small BERT-style model specifically for token importance prediction
Higher compression ratios: 4-6× with less accuracy degradation than LLMLingua-1
Better fluency: the compressed output is more grammatically coherent (important for LLM comprehension)
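
LLMLingua-2 frames compression as token classification: each token gets a probability of being kept. The sketch below shows one plausible way such subword probabilities could be turned into word-level decisions by averaging. The probabilities are invented, the function name is hypothetical, and this is not the library's actual code.

```python
def select_words(
    words: list[str],
    subword_probs: list[list[float]],
    rate: float,
) -> str:
    """Keep the top `rate` fraction of words by mean keep-probability."""
    word_scores = [sum(p) / len(p) for p in subword_probs]
    n_keep = max(1, round(len(words) * rate))
    ranked = sorted(range(len(words)), key=lambda i: -word_scores[i])
    keep = sorted(ranked[:n_keep])  # restore original word order
    return " ".join(words[i] for i in keep)


words = ["The", "committee", "approved", "the", "budget", "on", "Tuesday"]
# Made-up classifier outputs: one keep-probability per subword token.
probs = [[0.10], [0.90, 0.80], [0.95], [0.05], [0.90], [0.20], [0.70, 0.90]]
compressed = select_words(words, probs, rate=0.6)
```

Deciding at the word level rather than the subword level avoids emitting half-words, which is part of why the output reads more fluently.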

| Method | Compression Ratio | Accuracy Drop |
| --- | --- | --- |
| No compression | 1× | 0% |
| Selective Context | 2-3× | 5-8% |
| LLMLingua | 3-5× | 4-7% |
| LLMLingua-2 | 4-6× | 3-5% |
| Naive truncation | 4-6× | 15-25% |

AutoCompressors - Soft Prompt Compression

All the techniques above produce compressed text - a shorter version of the original that's still readable by the LLM. AutoCompressors take a fundamentally different approach: compress the context into soft prompt vectors that the LLM can condition on, without the compressed representation being human-readable.

The Architecture

AutoCompressor (Chevalier et al. 2023, "Adapting Language Models to Compress Contexts") fine-tunes a language model to:

Process a long context
Produce a fixed set of summary embeddings (soft prompt vectors)
Use those summary embeddings as a prefix for subsequent generation

The summary embeddings are not tokens - they're continuous vectors in the model's embedding space. They can encode far more information per "slot" than discrete tokens.

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class AutoCompressor(nn.Module):
    """
    AutoCompressor: compresses context into soft prompt vectors.

    Fine-tunes a base LLM with:
    1. Special [SUMMARY] tokens appended to each segment
    2. Summary token embeddings learned to encode the segment's content
    3. Summary embeddings from one segment reused as context for the next

    This is a simplified conceptual implementation.
    For production, use the official implementation from princeton-nlp/AutoCompressors.
    """

    def __init__(
        self,
        base_model_name: str,
        n_summary_tokens: int = 50,  # Number of soft prompt vectors to produce
        segment_length: int = 512,   # Process context in this-sized segments
    ):
        super().__init__()
        self.model = AutoModelForCausalLM.from_pretrained(base_model_name)
        self.n_summary = n_summary_tokens
        self.segment_length = segment_length

        # Special summary token ID (added to tokenizer during training)
        # In practice, you'd add this to the tokenizer and resize embeddings
        self.summary_token_id = self.model.config.vocab_size - 1

    def compress_segment(
        self,
        segment_input_ids: torch.Tensor,
        prev_summary: torch.Tensor | None = None,
    ) -> torch.Tensor:
        """
        Process one segment and produce summary embeddings.

        segment_input_ids: (batch, segment_len)
        prev_summary: (batch, n_summary, hidden_size) or None

        Returns summary: (batch, n_summary, hidden_size)
        """
        # Append n_summary [SUMMARY] tokens to segment
        summary_ids = torch.full(
            (segment_input_ids.shape[0], self.n_summary),
            fill_value=self.summary_token_id,
            device=segment_input_ids.device,
        )
        combined_ids = torch.cat([segment_input_ids, summary_ids], dim=1)

        # Embed the tokens; if we have a previous summary, prepend it as soft prompts
        inputs_embeds = self.model.get_input_embeddings()(combined_ids)
        if prev_summary is not None:
            inputs_embeds = torch.cat(
                [prev_summary.to(inputs_embeds.dtype), inputs_embeds], dim=1
            )
        outputs = self.model(inputs_embeds=inputs_embeds, output_hidden_states=True)

        # Extract the hidden states at [SUMMARY] token positions
        # These are the compressed representations of the segment
        last_hidden = outputs.hidden_states[-1]  # (batch, seq_len, hidden_size)
        summary_vectors = last_hidden[:, -self.n_summary:, :]  # (batch, n_summary, hidden_size)

        return summary_vectors

    def compress_long_context(
        self,
        input_ids: torch.Tensor,
    ) -> torch.Tensor:
        """
        Compress an entire long context into a fixed set of soft prompt vectors.

        Processes the context in segments, accumulating summaries.
        """
        batch_size = input_ids.shape[0]
        seq_len = input_ids.shape[1]

        all_summaries = []
        prev_summary = None

        for start in range(0, seq_len, self.segment_length):
            segment = input_ids[:, start:start + self.segment_length]
            summary = self.compress_segment(segment, prev_summary)
            all_summaries.append(summary)
            prev_summary = summary  # Pass to next segment

        # Concatenate all summaries
        # Shape: (batch, n_segments * n_summary, hidden_size)
        return torch.cat(all_summaries, dim=1)

    def generate_with_compressed_context(
        self,
        compressed_context: torch.Tensor,
        query_input_ids: torch.Tensor,
        max_new_tokens: int = 200,
    ) -> torch.Tensor:
        """
        Generate an answer conditioned on compressed context and query.

        The compressed context serves as a soft prompt prefix.
        """
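        # The compressed context vectors are prepended to the query's token
        # embeddings as a soft prompt. This is architecturally similar to
        # prefix tuning, but the prefix is computed dynamically from the
        # context rather than being fixed learned parameters.
        # Note: the output is only meaningful with a model fine-tuned to read
        # summary vectors (e.g. an official AutoCompressor checkpoint).
        query_embeds = self.model.get_input_embeddings()(query_input_ids)
        inputs_embeds = torch.cat(
            [compressed_context.to(query_embeds.dtype), query_embeds], dim=1
        )
        return self.model.generate(
            inputs_embeds=inputs_embeds,
            max_new_tokens=max_new_tokens,
        )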
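
For a sense of scale, the compression ratio of segment-wise soft compression follows directly from the segment length and the number of summary vectors. A small arithmetic sketch using the default values from the class above (the helper name is illustrative):

```python
import math


def soft_compression_ratio(
    context_tokens: int, segment_length: int, n_summary: int
) -> float:
    """Input tokens divided by the number of summary vectors produced."""
    n_segments = math.ceil(context_tokens / segment_length)
    return context_tokens / (n_segments * n_summary)


# 6144 tokens -> 12 segments of 512 -> 12 * 50 = 600 summary vectors
ratio = soft_compression_ratio(context_tokens=6144, segment_length=512, n_summary=50)
```

With 512-token segments and 50 summary vectors per segment, the ratio is about 10×; fewer summary vectors or longer segments push it higher, at the cost of packing more information into each vector.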

AutoCompressor Tradeoffs

| Property | Text Compression (LLMLingua) | Soft Compression (AutoCompressors) |
| --- | --- | --- |
| Compression ratio | 3-6× | 10-100× (variable) |
| Works with any LLM | Yes | No - requires AutoCompressor-specific model |
| Interpretable output | Yes (readable text) | No (embedding vectors) |
| Losslessness | Moderate | High (trained end-to-end) |
| Training required | No (zero-shot) | Yes (fine-tuning the base model) |
| Production complexity | Low-medium | High |

GIST Tokens - Learned Prompt Compression

GIST tokens (Mu et al. 2023, "Learning to Compress Prompts with Gist Tokens") are a more parameter-efficient approach to soft compression.

The Concept

GIST training adds a small set of "gist token" embeddings to the model. During fine-tuning:

A few GIST token positions are inserted into the prompt
The model is trained to produce the same outputs when given [prompt + gist tokens] as when given [full prompt]
At inference: compress the prompt to just the GIST tokens + the query

The GIST tokens learn to encode the most important information from the full prompt, compressing it to a small fixed number of learned slots.
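
In Mu et al.'s formulation, as we understand it, the compression is enforced with an attention mask: tokens that come after the gist span are prevented from attending to the original prompt, so everything they need must pass through the gist positions. A minimal pure-Python sketch of such a mask (the function name and layout are our own illustration):

```python
def gist_attention_mask(n_prompt: int, n_gist: int, n_rest: int) -> list[list[int]]:
    """Causal mask where tokens after the gist span cannot see the prompt.

    Layout: [prompt tokens][gist tokens][remaining tokens].
    mask[i][j] == 1 means position i may attend to position j.
    """
    total = n_prompt + n_gist + n_rest
    gist_end = n_prompt + n_gist
    mask = [[0] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):  # causal: current and earlier positions only
            after_gist = i >= gist_end
            is_prompt = j < n_prompt
            if not (after_gist and is_prompt):
                mask[i][j] = 1
    return mask


mask = gist_attention_mask(n_prompt=3, n_gist=2, n_rest=2)
```

The gist positions can still read the whole prompt, while later positions see only the gist tokens and each other. At inference, the prompt can then be dropped and only the gist activations kept.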

class GISTCompressor:
    """
    GIST Tokens: learned generalizable prompt compression.

    Key properties:
    - Fixed number of GIST tokens (e.g., 40) regardless of prompt length
    - Trained end-to-end with distillation objective
    - Works by distilling full-prompt output into GIST-token-only output

    For production use, see the official GIST implementation.
    This is a conceptual illustration of the training setup.
    """

    def __init__(
        self,
        base_model,
        tokenizer,
        n_gist_tokens: int = 40,
    ):
        self.model = base_model
        self.tokenizer = tokenizer
        self.n_gist = n_gist_tokens

        # Initialize GIST token embeddings (learned parameters)
        embed_dim = base_model.config.hidden_size
        self.gist_embeddings = nn.Embedding(n_gist_tokens, embed_dim)
        nn.init.normal_(self.gist_embeddings.weight, std=0.02)

    def gist_loss(
        self,
        full_prompt_ids: torch.Tensor,  # The full prompt
        gist_prefix_ids: torch.Tensor,  # Just the instruction/query part
        target_ids: torch.Tensor,       # Expected outputs
    ) -> torch.Tensor:
        """
        Distillation-style objective: the model with GIST tokens should produce
        the same output distribution over the target as with the full prompt.

        Loss = KL divergence between full-prompt and gist-token outputs,
        computed at the target positions only.

        Note: this is a simplified stand-in. Mu et al. instead train with a
        modified attention mask during standard instruction fine-tuning.
        """
        embed = self.model.get_input_embeddings()
        batch_size, n_target = target_ids.shape
        device = full_prompt_ids.device

        # Teacher pass: full prompt followed by the target (no gradients).
        # Logits at positions [-n_target-1, -1) predict the target tokens.
        with torch.no_grad():
            full_ids = torch.cat([full_prompt_ids, target_ids], dim=1)
            full_logits = self.model(full_ids).logits[:, -n_target - 1:-1, :]

        # Student pass: gist embeddings + query embeddings + target embeddings
        gist_ids = torch.arange(self.n_gist, device=device)
        gist_embeds = self.gist_embeddings.to(device)(gist_ids)  # (n_gist, embed_dim)
        gist_embeds = gist_embeds.unsqueeze(0).expand(batch_size, -1, -1)
        query_embeds = embed(gist_prefix_ids)
        student_embeds = torch.cat(
            [gist_embeds.to(query_embeds.dtype), query_embeds, embed(target_ids)],
            dim=1,
        )
        gist_logits = self.model(inputs_embeds=student_embeds).logits[:, -n_target - 1:-1, :]

        # Distillation loss: match full-prompt distribution
        kl_loss = nn.functional.kl_div(
            gist_logits.log_softmax(dim=-1).flatten(0, 1),
            full_logits.softmax(dim=-1).flatten(0, 1),
            reduction="batchmean",
        )
        return kl_loss

When GIST Tokens Are Appropriate

GIST tokens work best when:

You have many queries against the same prompt (the GIST compression is learned per-prompt-type)
You can afford fine-tuning on the target task
The prompt is static enough that a fixed compression is valid

GIST is less useful when:

The prompt changes significantly with each query
You need zero-shot compression without fine-tuning
The LLM you're using cannot be modified (API-only access)

RECOMP - Compression for RAG

RECOMP (Xu et al. 2023, "RECOMP: Improving Retrieval-Augmented LMs with Context Compression and Selective Augmentation") specifically targets the RAG use case: given retrieved documents, compress each one to a short summary that's still useful for the final query.

Two variants:

Extractive RECOMP: select the most relevant sentences from each retrieved document.
Abstractive RECOMP: summarize each retrieved document with a small summarizer model.

def recomp_extractive(
    documents: list[str],
    query: str,
    n_sentences_per_doc: int = 3,
    scorer_model = None,
    scorer_tokenizer = None,
) -> list[str]:
    """
    Extractive RECOMP: select top-N sentences from each document.

    For each document, score each sentence by relevance to the query,
    then keep the top-N. A cross-encoder is used here for simplicity;
    the RECOMP paper trains a dual-encoder scorer.
    """
    import re

    compressed_docs = []
    for doc in documents:
        sentences = re.split(r'(?<=[.!?])\s+', doc)
        if len(sentences) <= n_sentences_per_doc:
            compressed_docs.append(doc)
            continue

        # Score sentences by query relevance
        # (In production, use a cross-encoder like ms-marco-MiniLM)
        scores = []
        for sent in sentences:
            if scorer_model is not None:
                score = score_relevance(query, sent, scorer_model, scorer_tokenizer)
            else:
                # Naive: use sentence length as proxy (not recommended for production)
                score = len(sent)
            scores.append(score)

        # Keep top-N sentences in original order
        top_indices = sorted(
            sorted(range(len(scores)), key=lambda i: -scores[i])[:n_sentences_per_doc]
        )
        kept_sentences = [sentences[i] for i in top_indices]
        compressed_docs.append(" ".join(kept_sentences))

    return compressed_docs


def score_relevance(query: str, text: str, model, tokenizer) -> float:
    """Score text relevance to query using a cross-encoder."""
    inputs = tokenizer(
        query, text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    ).to(model.device)  # keep inputs on the same device as the model
    with torch.no_grad():
        output = model(**inputs)
    # Assumes a single-logit relevance head (e.g. an MS MARCO cross-encoder)
    return output.logits[0, 0].item()
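
When no scorer model is available, sentence length is a weak fallback because long off-topic sentences win. A dependency-free lexical-overlap score is a slightly better stand-in. This is a sketch for illustration only, not part of RECOMP, and the function name is hypothetical.

```python
import re


def lexical_overlap_score(query: str, sentence: str) -> float:
    """Fraction of distinct query words that also appear in the sentence."""
    query_words = set(re.findall(r"\w+", query.lower()))
    sentence_words = set(re.findall(r"\w+", sentence.lower()))
    if not query_words:
        return 0.0
    return len(query_words & sentence_words) / len(query_words)


query = "What caused the 2008 financial crisis?"
relevant = "Subprime lending caused the financial crisis of 2008."
off_topic = "The committee met on a rainy Tuesday afternoon in a very long and tedious session."

relevant_score = lexical_overlap_score(query, relevant)
off_topic_score = lexical_overlap_score(query, off_topic)
```

The off-topic sentence is longer, so the length proxy would prefer it, while the overlap score ranks the relevant sentence higher. A trained scorer is still preferable, since word overlap misses paraphrases.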

Practical Compression Pipeline

class ContextCompressionPipeline:
    """
    Context compression pipeline sketch.

    Wraps a single compression strategy, chosen at construction time,
    behind a common interface that also reports token statistics.
    """

    def __init__(
        self,
        strategy: str = "llmlingua2",  # "llmlingua2" or "selective" (others raise ValueError)
        target_compression_ratio: float = 4.0,  # Compress 4×
    ):
        self.strategy = strategy
        self.target_ratio = target_compression_ratio
        self._setup()

    def _setup(self):
        """Initialize the appropriate compression model."""
        if self.strategy == "llmlingua2":
            from llmlingua import PromptCompressor
            self.compressor = PromptCompressor(
                model_name="microsoft/llmlingua-2-bert-large-multilingual-cased-meetingbank",
                use_llmlingua2=True,
                device_map="auto",
            )
        elif self.strategy == "selective":
            self.compressor = SelectiveContextCompressor()
        else:
            raise ValueError(f"Unknown strategy: {self.strategy}")

    def compress(
        self,
        context: str,
        query: str | None = None,
        max_output_tokens: int | None = None,
    ) -> dict:
        """
        Compress context and return statistics.

        Returns:
        - compressed_text: the compressed context
        - original_tokens: approximate original token count
        - compressed_tokens: approximate compressed token count
        - compression_ratio: actual compression achieved
        """
        original_tokens = len(context) // 4  # rough approximation

        if max_output_tokens is None:
            max_output_tokens = int(original_tokens / self.target_ratio)

        if self.strategy == "llmlingua2":
            result = self.compressor.compress_prompt(
                context,
                question=query or "",
                target_token=max_output_tokens,
            )
            compressed = result["compressed_prompt"]
            compressed_tokens = result["compressed_tokens"]
        elif self.strategy == "selective":
            keep_ratio = 1.0 / self.target_ratio
            compressed = self.compressor.compress(context, keep_ratio)
            compressed_tokens = len(compressed) // 4
        else:
            compressed = context
            compressed_tokens = original_tokens

        return {
            "compressed_text": compressed,
            "original_tokens": original_tokens,
            "compressed_tokens": compressed_tokens,
            "compression_ratio": original_tokens / max(1, compressed_tokens),
        }


# Usage example
pipeline = ContextCompressionPipeline(strategy="llmlingua2", target_compression_ratio=4.0)

# Compress a long document before sending to expensive LLM
long_context = "... 20,000 tokens of retrieved documents ..."
query = "What are the key risk factors mentioned?"

result = pipeline.compress(long_context, query=query, max_output_tokens=5000)
print(f"Compressed {result['original_tokens']:,} → {result['compressed_tokens']:,} tokens "
      f"({result['compression_ratio']:.1f}× compression)")

# Now use the compressed context with any LLM
compressed_prompt = f"{result['compressed_text']}\n\nQuestion: {query}\n\nAnswer:"

Common Mistakes

:::danger Don't use compression for short contexts

Context compression adds latency (running a small model to score tokens) and introduces accuracy risk (the compression may remove relevant information). For contexts under roughly 4K tokens, the overhead usually outweighs the benefit. Reserve compression for contexts that are genuinely too long for your LLM.

:::
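
A simple guard makes this rule explicit in code. The sketch below takes the compressor and token counter as parameters so it stays independent of any particular library; the 4,000-token default mirrors the guidance above, and the toy stand-ins are for illustration only.

```python
from typing import Callable


def maybe_compress(
    context: str,
    compress_fn: Callable[[str], str],
    count_tokens: Callable[[str], int],
    min_tokens: int = 4000,
) -> str:
    """Return the context unchanged when it is short; compress otherwise."""
    if count_tokens(context) < min_tokens:
        return context
    return compress_fn(context)


# Toy stand-ins: ~4 characters per token, and a "compressor" that keeps half.
def approx_tokens(text: str) -> int:
    return len(text) // 4


def toy_compress(text: str) -> str:
    return text[: len(text) // 2]


short_result = maybe_compress("short context", toy_compress, approx_tokens)
long_result = maybe_compress("x" * 40_000, toy_compress, approx_tokens)
```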

:::warning Compression accuracy depends heavily on the compressor model quality

"Context compression" is not plug-and-play. The quality of the compressor model (particularly for LLMLingua variants) matters significantly. A poor scoring model produces compressed contexts that remove important information. Always evaluate compression accuracy on a held-out set before deploying in production.

:::
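
A held-out evaluation can be as simple as running the same questions with and without compression and comparing accuracy. The harness below is a minimal sketch: `answer_fn` and `compress_fn` are placeholders for your own LLM call and compressor, and the toy versions exist only to make the example self-contained.

```python
from typing import Callable


def accuracy_drop(
    examples: list[dict],
    answer_fn: Callable[[str, str], str],
    compress_fn: Callable[[str, str], str],
) -> dict:
    """Compare exact-match accuracy with and without compression.

    Each example is {"context": ..., "question": ..., "answer": ...}.
    """
    def accuracy(use_compression: bool) -> float:
        correct = 0
        for ex in examples:
            context = ex["context"]
            if use_compression:
                context = compress_fn(context, ex["question"])
            prediction = answer_fn(context, ex["question"])
            correct += prediction.strip().lower() == ex["answer"].strip().lower()
        return correct / len(examples)

    full, compressed = accuracy(False), accuracy(True)
    return {"full": full, "compressed": compressed, "drop": full - compressed}


# Toy stand-ins: a "model" that returns the first year it sees, and a
# "compressor" that keeps only the first sentence.
def toy_answer(context: str, question: str) -> str:
    years = [w.strip(".") for w in context.split() if w.strip(".").isdigit()]
    return years[0] if years else "unknown"


def toy_compress(context: str, question: str) -> str:
    return context.split(". ")[0]


examples = [
    {"context": "The bridge opened in 1932. It was repainted later.",
     "question": "When did it open?", "answer": "1932"},
    {"context": "The town is small. Its mill closed in 1987.",
     "question": "When did the mill close?", "answer": "1987"},
]
report = accuracy_drop(examples, toy_answer, toy_compress)
```

In the toy run the compressor discards the sentence holding the second answer, so accuracy falls from 1.0 to 0.5. That is exactly the kind of failure a held-out check is meant to surface before deployment.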

:::tip Combine compression with strategic placement

After compressing documents, apply the boundary placement strategy from Lesson 04. Place the highest-relevance compressed chunks at the beginning and end of the compressed context, not in the order they were retrieved. Compression reduces the total token count; boundary placement reduces the lost-in-middle risk.

:::
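
One plausible reading of boundary placement, sketched without reference to the Lesson 04 code (which is not shown here): rank chunks by relevance, then deal them alternately to the front and the back so the weakest chunks land in the middle. The function name and scores are illustrative.

```python
def boundary_order(chunks: list[str], relevance: list[float]) -> list[str]:
    """Place the most relevant chunks at the start and end of the context.

    Chunks are ranked by relevance, then dealt alternately to the front
    and the back, so the least relevant ones end up in the middle.
    """
    ranked = sorted(range(len(chunks)), key=lambda i: -relevance[i])
    front, back = [], []
    for rank, idx in enumerate(ranked):
        (front if rank % 2 == 0 else back).append(chunks[idx])
    return front + back[::-1]


ordered = boundary_order(["A", "B", "C", "D", "E"], [0.9, 0.1, 0.5, 0.7, 0.3])
```

Here the two most relevant chunks ("A" and "D") end up first and last, and the least relevant chunk ("B") sits in the middle.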

Interview Q&A

Q: What is LLMLingua and how does it differ from simple summarization?

A: LLMLingua is a token-level compression method that removes individual tokens from a long context rather than rewriting it. A small language model (e.g., LLaMA-2-7B) scores each token's "conditional perplexity" - how surprising the token is given both the preceding context and the user's query. High-perplexity (high-surprise) tokens are retained; low-perplexity (predictable, redundant) tokens are dropped. The key differences from summarization: (1) no rewriting - the remaining tokens are in the original order with original wording; (2) it's query-conditioned - tokens are scored relative to the specific question; (3) it's faster to apply than generating a summary.

Q: What is the difference between hard (token-level) and soft (embedding-level) context compression?

A: Hard compression produces a shorter sequence of actual tokens - human-readable compressed text that any LLM can use. Soft compression maps the long context to a fixed set of continuous embedding vectors that serve as a "soft prompt" prefix for the LLM. AutoCompressors and GIST tokens use soft compression. Hard compression (LLMLingua, Selective Context) works with any LLM out of the box. Soft compression achieves higher compression ratios (roughly 10-100× vs 3-6× for hard) but requires a specially fine-tuned model and isn't interpretable or transferable across models.

Q: How does query-conditioned compression (LLMLingua) improve over query-independent compression?

A: Query-independent compression (like Selective Context) scores token importance based solely on the token's self-information in the document - how surprising it is to a language model. This gives equal priority to any surprising token regardless of whether it's relevant to the query. Query-conditioned compression (LLMLingua) computes the token's perplexity given both the document and the query. A technical detail about an unrelated topic might be high-information (surprising) but not query-relevant; query-conditioned scoring can identify this and remove it. Empirically, query-conditioned compression achieves significantly better accuracy at the same compression ratio.

Q: What is RECOMP and when would you use it in a RAG pipeline?

A: RECOMP (Retrieve, Compress, Prepend) compresses each retrieved document before adding it to the LLM context. The extractive variant selects the most query-relevant sentences from each document using a trained relevance scorer (a dual encoder in the paper; the sketch in this lesson uses a cross-encoder); the abstractive variant generates a short summary of each document focused on the query. RECOMP is particularly useful in RAG when: (1) retrieved documents are long but only partially relevant to the query; (2) you're concatenating many retrieved documents and need to fit them within a limited context window; (3) you want to improve the signal-to-noise ratio by removing irrelevant parts of retrieved passages before presenting them to the expensive LLM.

Q: What compression ratio can you expect from LLMLingua-2 and at what accuracy cost?

A: LLMLingua-2 is typically run at 4-6× compression in practice. As a rough guide, expect accuracy drops of around 3-5% on question answering tasks at 4× compression, rising to around 7-10% at 6×; treat these as indicative figures and measure on your own data. Pan et al. (2024) evaluate on MeetingBank, LongBench, ZeroSCROLLS, GSM8K, and BBH. For comparison, naive truncation (simply cutting the context to the target length) causes 15-25% accuracy drops at the same compression ratios. The training-based approach in LLMLingua-2 (using a fine-tuned BERT-style model for token importance prediction, rather than a general language model) improves both compression quality and inference speed compared to LLMLingua-1.

