Long-Context Evaluation

A Legal Team's Near-Miss

The contract was 847 pages. Standard M&A due diligence - the kind that legal teams at major firms process dozens of times a year. The AI system your company deployed six months ago had been a productivity multiplier. Associates could ask questions about contract terms, cross-reference clauses, surface risks. The model claimed a 128k token context window. The contract fit easily.

Then the acquirer's counsel asked a pointed question: did the indemnification clause in Schedule 7 align with the limitation-of-liability carve-outs in Section 14.3? Your associate ran the query. The model said yes, they were consistent. The deal closed. Six weeks later, during integration, someone actually read both clauses side by side. They did not align. The discrepancy exposed the acquirer to an uncapped liability in a specific IP infringement scenario. The legal bill to remediate it ran into seven figures.

The model had not hallucinated the clauses. Both were real, quoted accurately. The problem was positional. Section 14.3 appeared near the beginning of the document. Schedule 7 appeared near the end - deep in the middle of the token stream, around token 61,000 of a 95,000-token input. The model retrieved the beginning and end accurately. The middle? It effectively did not exist for the model's attention mechanism at inference time.

This is not a theoretical concern. It is not an edge case. It is a reproducible, measurable failure mode that affects every major language model to varying degrees. The phenomenon has a name: "lost in the middle," described precisely by Liu et al. in their 2023 paper. And the evaluation method that surfaces it reliably - Needle in a Haystack (NIAH) - has become one of the most important tools in the applied AI engineer's evaluation toolkit.

Understanding long-context evaluation is not optional if you are deploying models on real documents. It is the difference between a system that is genuinely capable at 100k tokens and one that performs a convincing impression of capability while silently failing on anything positioned in the middle third of the input. This lesson teaches you to tell the difference.

Why This Exists - The Gap Between Claimed and Actual Context

For the first three years of the transformer era, context length was a solved problem in a simple sense: nobody pretended the limits were not real. GPT-2 had a 1,024-token window. GPT-3 had 2,048 tokens. Models were evaluated at their actual limits and nobody claimed otherwise.

Then two things happened simultaneously. The technical community figured out how to extend context windows significantly - through rotary positional embeddings (RoPE), ALiBi, sliding window attention, and eventually full attention with flash attention optimizations. And the commercial pressure to market larger context windows became intense. "128k context" sounds better than "32k context" in a product comparison table.

The problem is that extending the technical context window - the number of tokens a model can process without crashing - does not automatically extend the effective context window - the range over which a model can reliably retrieve and reason about information. These two numbers can differ by a factor of 10 or more. A model with a 128k technical context window might have an effective retrieval capability that degrades sharply beyond 16k tokens, particularly for information positioned in the middle of the input.

Before NIAH and similar targeted evaluations existed, this gap was invisible. Standard benchmarks did not test it. MMLU, HellaSwag, ARC - these are short-context tasks. Even most RAG evaluations use chunked retrieval, which sidesteps the long-context problem entirely. The failure mode was hiding in plain sight.

The "lost in the middle" paper (Liu et al., 2023) made the invisible visible. They ran a systematic study: give models a set of documents, embed the answer in documents at different positions in the input (beginning, middle, end), and measure retrieval accuracy as a function of position. The results were striking. Models performed best when the relevant document was at the very beginning or very end of the input. Performance dropped dramatically - sometimes by 30 to 40 percentage points - when the relevant document was positioned in the middle. The U-shaped performance curve became a defining diagram in the field.

What this means practically: a model evaluated only on its average performance across a long-context task will look much stronger than it actually is, because the good performance on beginning/end positions masks the failure in the middle. You need position-aware evaluation to see the real picture.

Historical Context - From Attention Limits to Needle Tests

The transformer architecture described in "Attention is All You Need" (Vaswani et al., 2017) had quadratic complexity in sequence length - processing a sequence twice as long cost four times as much compute. This made long contexts practically impossible at scale. The original BERT models used 512 tokens. The early GPT models used 1,024 to 2,048.

The first serious attempt to break the quadratic barrier came from Longformer (Beltagy et al., 2020) at the Allen Institute, which introduced a sliding window attention mechanism allowing efficient processing of documents up to 4,096 tokens. Big Bird (Zaheer et al., 2020) from Google extended this with sparse attention patterns to 4,096 tokens as well. These were academic advances - important but not yet commercially mainstream.

The real shift came in 2022-2023 with several simultaneous developments. Flash Attention (Dao et al., 2022) made exact full attention dramatically faster and more memory-efficient through IO-aware computation. RoPE (Su et al., 2022) provided a relative positional encoding that could be extended beyond training lengths through interpolation. YaRN (Peng et al., 2023) improved RoPE extension significantly. These advances made 8k, 16k, 32k context windows viable at inference time.

The marketing race followed immediately. Anthropic released Claude with a 100k context window in May 2023. GPT-4 Turbo launched with 128k context in November 2023. The open-source community followed: Llama 2 Long, Mistral with sliding window attention, Yi-34B with 200k context claims.

The Needle in a Haystack evaluation was developed by Gregory Kamradt in late 2023 as a direct response to these claims. The idea is elegant in its simplicity: take a long text corpus (typically Paul Graham essays), insert a single specific sentence - the "needle" - at a specific position in the document, ask the model to find it, and measure whether it succeeds. Vary the needle position and the document length systematically to produce a 2D heatmap. The heatmap reveals exactly where a model's attention degrades.

When Kamradt published his first NIAH results in November 2023, the AI community had its first reliable tool for cutting through context window marketing claims. The results were humbling: models that claimed 100k+ context windows showed dramatic performance degradation in the middle of long documents. The NIAH heatmap - with its characteristic drop in the center - became one of the defining images of the 2023-2024 LLM evaluation era.

Core Concepts

The Effective Context Window

The most important concept in long-context evaluation is the distinction between the technical context window and the effective context window.

The technical context window is the maximum number of tokens a model can process in a single forward pass without exceeding memory limits or causing positional encoding errors. This is what model cards report.

The effective context window is the range over which a model can reliably retrieve and reason about specific information with accuracy above some acceptable threshold. This is what actually matters for applications.

Mathematically, define the retrieval accuracy function $R(p, L)$ where $p$ is the position of the target information (as a fraction of total context, $0 \leq p \leq 1$ ) and $L$ is the total context length. For a model with claimed context length $C_{max}$ , the effective context length $C_{eff}$ is:

$C_{eff} = \max\{L : \min_{p \in [0.2, 0.8]} R(p, L) \geq \theta\}$

where $\theta$ is your accuracy threshold (typically 0.8 or 0.9) and the $[0.2, 0.8]$ range focuses on the middle of the context, which is typically where performance degrades first.

This definition says: the effective context length is the longest document at which you can reliably retrieve information even when it is positioned in the middle third of the document. By this definition, many models claiming 128k context have effective context lengths of 16k to 32k.

The Lost-in-the-Middle Phenomenon

Liu et al. (2023) ran a multi-document question answering experiment. Given a question and $k$ documents (only one containing the answer), they varied which position the relevant document occupied. They tested with $k = 10$ and $k = 20$ documents.

The key finding: when the relevant document was at position 1 (the beginning) or position $k$ (the end), accuracy was high. When it was at positions in the middle - particularly positions 3 through $k-3$ - accuracy dropped substantially. For GPT-3.5-Turbo with 20 documents, accuracy at position 1 was ~75%, dropped to ~55% at middle positions, and recovered to ~72% at the final position. The pattern held across multiple models.

Why does this happen? The prevailing explanation involves attention recency bias and primacy bias. Transformer attention mechanisms naturally attend more strongly to the beginning of the sequence (primacy - because earlier tokens have been processed through more attention layers) and to recent tokens (recency - because they appear in the local neighborhood for many attention heads). Information in the middle of a long sequence falls into a relative attention dead zone.

Flash Attention and improved positional encodings help at the architecture level, but do not fully eliminate the pattern. It persists even in models specifically trained on long-context data.

Needle in a Haystack (NIAH) Evaluation

NIAH is a position-controlled retrieval test. The mechanics:

Take a background corpus - typically a long, coherent text like Paul Graham's essays or Wikipedia articles (the "haystack")
Construct the haystack to a target length $L$ (e.g., 32k tokens)
Insert a specific "needle" sentence at position $p$ (e.g., "The secret passcode is: BLUE-FALCON-7749")
Prompt the model: "Based on the document below, what is the secret passcode? If you cannot find it, say 'Not found.'"
Record whether the model returns the correct answer
Repeat across a grid of $(p, L)$ values: positions from 0% to 100% of document depth, lengths from 1k to the maximum tested

The output is a 2D heatmap. The x-axis is document length. The y-axis is needle depth (position in the document, expressed as a percentage). Each cell is colored green (found) or red (not found), often with gradient colors for partial accuracy across multiple runs.

A perfect model produces an all-green heatmap. Real models produce heatmaps with characteristic patterns: a green "U" shape (good at beginning and end, poor in the middle), or a green border with a red interior, or good performance up to a certain length and then degradation everywhere.

Multi-Hop Reasoning at Long Context

Single-fact retrieval (NIAH) is the simplest case. Real-world tasks often require multi-hop reasoning: retrieving fact A from one part of the document, fact B from another part, and combining them to answer a question.

The RULER benchmark (Hsieh et al., 2024) extends NIAH to cover a taxonomy of long-context tasks:

Single NIAH (S-NIAH): retrieve one needle from one position
Multi-key NIAH (MK-NIAH): retrieve one needle, but there are multiple needles inserted; retrieve the specific one asked about
Multi-value NIAH (MV-NIAH): retrieve multiple values associated with a single key
Multi-query NIAH (MQ-NIAH): answer multiple questions about different needles in one response
Variable tracking (VT): track a variable through a sequence of assignments and report its final value
Common/frequent words (CWE/FWE): identify the most or least common word in a long context
Question answering (QA): multi-document QA similar to Liu et al.

RULER provides a composite score across all task types, giving a much richer picture of long-context capability than single-fact NIAH alone. Models that look strong on basic NIAH often show significant degradation on the multi-hop reasoning tasks.

Position-Relative Performance Curves

When plotting model performance vs. document length, a naive analysis averages over all positions. This hides the structure. Better analysis computes:

Beginning performance: accuracy when target is in the first 20% of context
Middle performance: accuracy when target is in positions 20%-80%
End performance: accuracy when target is in the final 20% of context
Degradation slope: how quickly middle performance falls as document length increases

For production evaluation, middle performance is the number that matters. Beginning and end performance are often misleadingly high.

The performance gap between beginning/end and middle is called the positional degradation of the model. A positional degradation below 10 percentage points is considered good. Above 30 points is a serious concern for applications that place important information in the middle of long contexts.

Code Examples

Implementing a Basic NIAH Evaluator

import json
import time
from dataclasses import dataclass, field
from typing import Optional
import numpy as np

@dataclass
class NIAHConfig:
    """Configuration for Needle in a Haystack evaluation."""
    needle: str = "The secret code phrase is: INDIGO-CRANE-4491"
    question: str = "What is the secret code phrase mentioned in the document?"
    expected_answer: str = "INDIGO-CRANE-4491"
    context_lengths: list = field(default_factory=lambda: [
        4096, 8192, 16384, 32768, 65536, 131072
    ])
    depth_percents: list = field(default_factory=lambda: [
        0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0
    ])
    n_runs_per_cell: int = 3  # Average over multiple runs per (length, depth) cell
    haystack_text: Optional[str] = None
    haystack_path: Optional[str] = None


def load_haystack(config: NIAHConfig, tokenizer) -> str:
    """Load a long background text to use as the haystack."""
    if config.haystack_text:
        return config.haystack_text
    if config.haystack_path:
        with open(config.haystack_path, "r") as f:
            return f.read()
    # Default: use a repeated placeholder (replace with real corpus in production)
    # In practice, use Paul Graham essays or Wikipedia text
    base = (
        "The study of artificial intelligence has a long history, beginning with "
        "the formal work of Alan Turing, who proposed the Turing Test in 1950 as a "
        "measure of machine intelligence. Since then, researchers have developed "
        "increasingly capable systems, from early expert systems to modern neural "
        "networks trained on massive datasets. "
    ) * 500
    return base


def build_context(haystack: str, needle: str, depth_percent: float,
                  target_tokens: int, tokenizer) -> str:
    """
    Build a context string of approximately target_tokens tokens,
    with the needle inserted at depth_percent into the document.
    """
    # Tokenize haystack to find correct insertion point
    haystack_tokens = tokenizer.encode(haystack)
    needle_tokens = tokenizer.encode(needle)

    # How many haystack tokens we need
    n_haystack_tokens = target_tokens - len(needle_tokens) - 50  # 50 token buffer

    if n_haystack_tokens < 100:
        raise ValueError(f"Target length {target_tokens} too short for needle.")

    # Truncate haystack to needed length
    haystack_tokens = haystack_tokens[:n_haystack_tokens]

    # Find insertion position
    insert_pos = int(len(haystack_tokens) * depth_percent / 100.0)
    insert_pos = max(0, min(insert_pos, len(haystack_tokens)))

    # Assemble: beginning + needle + rest
    combined_tokens = (
        haystack_tokens[:insert_pos]
        + needle_tokens
        + haystack_tokens[insert_pos:]
    )

    return tokenizer.decode(combined_tokens)


def evaluate_niah_cell(
    model,
    tokenizer,
    config: NIAHConfig,
    context_length: int,
    depth_percent: float,
    haystack: str,
) -> dict:
    """
    Evaluate a single (context_length, depth_percent) cell.
    Returns accuracy across n_runs_per_cell independent runs.
    """
    results = []

    for run_idx in range(config.n_runs_per_cell):
        # Build the context for this cell
        context = build_context(
            haystack=haystack,
            needle=config.needle,
            depth_percent=depth_percent,
            target_tokens=context_length,
            tokenizer=tokenizer,
        )

        prompt = (
            f"Below is a long document. Read it carefully and answer the question.\n\n"
            f"Document:\n{context}\n\n"
            f"Question: {config.question}\n"
            f"Answer concisely. If not found, say 'Not found.'\nAnswer:"
        )

        # Get model response
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with __import__("torch").no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=50,
                temperature=0.0,
                do_sample=False,
            )
        response = tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:],
            skip_special_tokens=True
        ).strip()

        # Score: did the expected answer appear in the response?
        correct = config.expected_answer.lower() in response.lower()
        results.append({
            "run": run_idx,
            "response": response,
            "correct": correct,
        })

    accuracy = sum(r["correct"] for r in results) / len(results)
    return {
        "context_length": context_length,
        "depth_percent": depth_percent,
        "accuracy": accuracy,
        "runs": results,
    }


def run_niah_evaluation(model, tokenizer, config: NIAHConfig) -> dict:
    """
    Run the full NIAH evaluation grid and return results.
    """
    haystack = load_haystack(config, tokenizer)
    results = []

    total_cells = len(config.context_lengths) * len(config.depth_percents)
    cell_idx = 0

    for ctx_len in config.context_lengths:
        for depth in config.depth_percents:
            cell_idx += 1
            print(f"[{cell_idx}/{total_cells}] ctx={ctx_len} depth={depth:.0f}%")

            try:
                cell_result = evaluate_niah_cell(
                    model=model,
                    tokenizer=tokenizer,
                    config=config,
                    context_length=ctx_len,
                    depth_percent=depth,
                    haystack=haystack,
                )
                results.append(cell_result)
            except Exception as e:
                print(f"  ERROR: {e}")
                results.append({
                    "context_length": ctx_len,
                    "depth_percent": depth,
                    "accuracy": None,
                    "error": str(e),
                })

    return {
        "config": {
            "needle": config.needle,
            "question": config.question,
            "n_runs_per_cell": config.n_runs_per_cell,
        },
        "results": results,
    }

Visualizing NIAH Results as a Heatmap

import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import numpy as np
import json
from typing import Optional


def plot_niah_heatmap(
    results: dict,
    model_name: str = "Model",
    save_path: Optional[str] = None,
    figsize: tuple = (14, 8),
) -> None:
    """
    Plot a NIAH heatmap from evaluation results.
    X-axis: context length (tokens)
    Y-axis: needle depth (% into document)
    Color: accuracy (green = high, red = low)
    """
    raw = results["results"]

    # Extract unique axes
    ctx_lengths = sorted(set(r["context_length"] for r in raw if r.get("accuracy") is not None))
    depths = sorted(set(r["depth_percent"] for r in raw if r.get("accuracy") is not None))

    # Build the accuracy matrix
    accuracy_matrix = np.full((len(depths), len(ctx_lengths)), np.nan)
    depth_idx = {d: i for i, d in enumerate(depths)}
    ctx_idx = {c: i for i, c in enumerate(ctx_lengths)}

    for r in raw:
        if r.get("accuracy") is not None:
            i = depth_idx[r["depth_percent"]]
            j = ctx_idx[r["context_length"]]
            accuracy_matrix[i, j] = r["accuracy"]

    # Create heatmap
    fig, ax = plt.subplots(figsize=figsize)
    cmap = mcolors.LinearSegmentedColormap.from_list(
        "niah", [(0.0, "#dc2626"), (0.5, "#f59e0b"), (1.0, "#16a34a")]
    )

    im = ax.imshow(
        accuracy_matrix,
        cmap=cmap,
        aspect="auto",
        vmin=0.0,
        vmax=1.0,
        origin="upper",
    )

    # Annotate cells with accuracy values
    for i in range(len(depths)):
        for j in range(len(ctx_lengths)):
            val = accuracy_matrix[i, j]
            if not np.isnan(val):
                text_color = "white" if val < 0.4 or val > 0.75 else "black"
                ax.text(
                    j, i, f"{val:.0%}",
                    ha="center", va="center",
                    fontsize=8, color=text_color, fontweight="bold"
                )

    # Axes
    ax.set_xticks(range(len(ctx_lengths)))
    ax.set_xticklabels([f"{c // 1000}k" for c in ctx_lengths], fontsize=10)
    ax.set_yticks(range(len(depths)))
    ax.set_yticklabels([f"{d:.0f}%" for d in depths], fontsize=10)

    ax.set_xlabel("Context Length (tokens)", fontsize=12)
    ax.set_ylabel("Needle Position (% into document)", fontsize=12)
    ax.set_title(f"NIAH Heatmap - {model_name}", fontsize=14, fontweight="bold")

    plt.colorbar(im, ax=ax, label="Retrieval Accuracy")
    plt.tight_layout()

    if save_path:
        plt.savefig(save_path, dpi=150, bbox_inches="tight")
        print(f"Saved heatmap to {save_path}")
    else:
        plt.show()


def compute_niah_summary_stats(results: dict) -> dict:
    """
    Compute position-stratified summary statistics from NIAH results.
    """
    raw = [r for r in results["results"] if r.get("accuracy") is not None]

    beginning = [r["accuracy"] for r in raw if r["depth_percent"] <= 20.0]
    middle = [r["accuracy"] for r in raw if 20.0 < r["depth_percent"] < 80.0]
    end = [r["accuracy"] for r in raw if r["depth_percent"] >= 80.0]
    all_vals = [r["accuracy"] for r in raw]

    def safe_mean(lst):
        return float(np.mean(lst)) if lst else None

    summary = {
        "overall_accuracy": safe_mean(all_vals),
        "beginning_accuracy": safe_mean(beginning),  # positions 0-20%
        "middle_accuracy": safe_mean(middle),          # positions 20-80%
        "end_accuracy": safe_mean(end),                # positions 80-100%
        "positional_degradation": (
            (safe_mean(beginning) + safe_mean(end)) / 2 - safe_mean(middle)
            if beginning and middle and end else None
        ),
        "n_cells_evaluated": len(raw),
    }

    # Per-context-length breakdown
    ctx_breakdown = {}
    for r in raw:
        cl = r["context_length"]
        if cl not in ctx_breakdown:
            ctx_breakdown[cl] = []
        ctx_breakdown[cl].append(r["accuracy"])
    summary["per_context_length"] = {
        k: float(np.mean(v)) for k, v in sorted(ctx_breakdown.items())
    }

    return summary

Implementing a RULER-Style Multi-Hop Evaluation

import random
import string
from typing import Callable


def generate_variable_tracking_task(
    n_vars: int = 5,
    n_assignments: int = 20,
    context_filler_tokens: int = 8000,
    tokenizer=None,
) -> dict:
    """
    RULER variable tracking task: track variable assignments through
    a long context and report final values.

    Example:
      x = 7
      y = 3
      ... (filler) ...
      x = 14
      ... (filler) ...
      y = 99
      Question: What is the final value of x?
      Answer: 14
    """
    var_names = random.sample(string.ascii_lowercase, n_vars)
    final_values = {}
    assignment_sequence = []

    # Generate random reassignments
    for _ in range(n_assignments):
        var = random.choice(var_names)
        val = random.randint(1, 999)
        assignment_sequence.append((var, val))
        final_values[var] = val  # Track the latest value

    # Choose query variable
    query_var = random.choice(var_names)
    expected_answer = str(final_values[query_var])

    # Build context: assignments interspersed with filler text
    filler_sentence = (
        "Research shows that language models exhibit complex emergent behaviors "
        "when trained at scale on diverse data sources. "
    )

    # Estimate tokens per sentence
    filler_tokens_per_sentence = 25  # Approximate
    filler_sentences_needed = context_filler_tokens // filler_tokens_per_sentence

    context_parts = []
    sentences_per_gap = filler_sentences_needed // (n_assignments + 1)

    for i, (var, val) in enumerate(assignment_sequence):
        # Add filler before each assignment
        filler = filler_sentence * sentences_per_gap
        context_parts.append(filler)
        context_parts.append(f"\n{var} = {val}\n")

    context_parts.append(filler_sentence * sentences_per_gap)
    context = "".join(context_parts)

    prompt = (
        f"{context}\n\n"
        f"Based on the assignments above, what is the final value of variable '{query_var}'?\n"
        f"Answer with just the number:"
    )

    return {
        "task_type": "variable_tracking",
        "prompt": prompt,
        "expected_answer": expected_answer,
        "query_variable": query_var,
        "all_final_values": final_values,
    }


def generate_multi_needle_task(
    n_needles: int = 5,
    haystack_tokens: int = 32768,
    haystack_text: str = "",
    tokenizer=None,
) -> dict:
    """
    RULER multi-key NIAH: multiple needles are inserted at different positions.
    The model must retrieve the one asked about.
    """
    # Generate distinct needles
    needles = {}
    for i in range(1, n_needles + 1):
        code = f"CODE-{random.randint(1000, 9999)}-{chr(64 + i)}"
        needles[f"entity_{i}"] = code

    # Choose which needle to ask about
    query_entity = random.choice(list(needles.keys()))
    expected_answer = needles[query_entity]

    # Distribute needles evenly through the context
    positions = np.linspace(5, 95, n_needles).tolist()

    needle_sentences = {
        entity: f"The special identifier for {entity} is: {code}"
        for entity, code in needles.items()
    }

    prompt = (
        f"[CONTEXT - contains {n_needles} identifiers]\n"
        f"... (long context with needles at various positions) ...\n\n"
        f"What is the special identifier for {query_entity}?\n"
        f"Answer:"
    )

    return {
        "task_type": "multi_needle",
        "expected_answer": expected_answer,
        "query_entity": query_entity,
        "all_needles": needles,
        "needle_positions": positions,
        "prompt": prompt,
    }


def run_ruler_evaluation(
    model,
    tokenizer,
    context_lengths: list,
    n_samples_per_cell: int = 10,
) -> dict:
    """
    Run a simplified RULER evaluation across task types and context lengths.
    """
    task_generators = {
        "variable_tracking": generate_variable_tracking_task,
        "multi_needle": generate_multi_needle_task,
    }

    results = {}

    for ctx_len in context_lengths:
        results[ctx_len] = {}
        for task_name, generator in task_generators.items():
            task_results = []
            for _ in range(n_samples_per_cell):
                sample = generator(context_filler_tokens=ctx_len)
                # Run model inference here
                # response = run_inference(model, tokenizer, sample["prompt"])
                # correct = sample["expected_answer"] in response
                # task_results.append(correct)
                pass  # Placeholder - integrate with your model inference
            results[ctx_len][task_name] = {
                "accuracy": float(np.mean(task_results)) if task_results else None,
                "n_samples": n_samples_per_cell,
            }

    return results

Building a Long-Context Test Suite for Document Processing

from pathlib import Path
import hashlib


class LongContextTestSuite:
    """
    A practical test suite for evaluating a document processing application.
    Tests real-world long-context scenarios beyond synthetic NIAH.
    """

    def __init__(self, model_runner: Callable, tokenizer):
        self.model_runner = model_runner
        self.tokenizer = tokenizer
        self.results = []

    def test_cross_reference_retrieval(
        self,
        document: str,
        test_cases: list[dict],
    ) -> list[dict]:
        """
        Test whether the model correctly cross-references facts from
        different positions in a single document.

        Each test case: {
            "question": str,
            "answer": str,
            "fact_positions": [0.15, 0.72]  # approximate positions of relevant facts
        }
        """
        case_results = []
        for case in test_cases:
            prompt = (
                f"Document:\n{document}\n\n"
                f"Question: {case['question']}\n"
                f"Answer:"
            )
            response = self.model_runner(prompt)
            correct = case["answer"].lower() in response.lower()
            case_results.append({
                "question": case["question"],
                "expected": case["answer"],
                "response": response[:200],
                "correct": correct,
                "fact_positions": case.get("fact_positions", []),
            })
        return case_results

    def test_document_summarization_coverage(
        self,
        document: str,
        key_facts: list[str],
        summary_prompt: str = "Summarize the key points of this document in 5 bullet points:",
    ) -> dict:
        """
        Evaluate whether a summary covers key facts distributed throughout
        a long document. Measures content coverage, not just coherence.
        """
        prompt = f"Document:\n{document}\n\n{summary_prompt}"
        summary = self.model_runner(prompt)

        covered = []
        missing = []
        for fact in key_facts:
            # Simple substring match - replace with semantic similarity in production
            if any(keyword.lower() in summary.lower() for keyword in fact.split()):
                covered.append(fact)
            else:
                missing.append(fact)

        return {
            "coverage_rate": len(covered) / len(key_facts) if key_facts else 0.0,
            "covered_facts": covered,
            "missing_facts": missing,
            "summary": summary[:500],
        }

    def test_instruction_following_at_length(
        self,
        base_document: str,
        instructions: list[str],
        expected_outputs: list[str],
    ) -> list[dict]:
        """
        Test whether the model follows specific instructions embedded at
        various points in a long document alongside the main content.
        This tests whether instruction-following degrades at long context.
        """
        results = []
        for instruction, expected in zip(instructions, expected_outputs):
            prompt = (
                f"INSTRUCTION: {instruction}\n\n"
                f"Document:\n{base_document}\n\n"
                f"Follow the instruction above. Response:"
            )
            response = self.model_runner(prompt)
            results.append({
                "instruction": instruction[:100],
                "expected": expected,
                "response": response[:200],
                "followed": expected.lower() in response.lower(),
            })
        return results

    def run_full_suite(
        self,
        document: str,
        cross_ref_cases: list[dict],
        key_facts: list[str],
        instructions: list[str],
        expected_outputs: list[str],
    ) -> dict:
        """Run all test categories and aggregate results."""
        cross_ref = self.test_cross_reference_retrieval(document, cross_ref_cases)
        summarization = self.test_document_summarization_coverage(document, key_facts)
        instruction_following = self.test_instruction_following_at_length(
            document, instructions, expected_outputs
        )

        return {
            "cross_reference_accuracy": (
                sum(r["correct"] for r in cross_ref) / len(cross_ref)
                if cross_ref else None
            ),
            "summarization_coverage": summarization["coverage_rate"],
            "instruction_following_rate": (
                sum(r["followed"] for r in instruction_following) / len(instruction_following)
                if instruction_following else None
            ),
            "cross_reference_details": cross_ref,
            "summarization_details": summarization,
            "instruction_details": instruction_following,
            "document_token_length": len(self.tokenizer.encode(document)),
        }

Mermaid Diagrams

The NIAH Evaluation Pipeline

The Lost-in-the-Middle Performance Pattern

RULER Task Taxonomy

Production Engineering Notes

Choosing Context Length Test Points

Do not test at every possible context length. The relationship between accuracy and context length is monotonically non-increasing (performance stays the same or gets worse as context grows), so you can binary search for the effective context boundary:

Start with coarse grid: 4k, 8k, 16k, 32k, 64k, 128k
Find the first length where middle-position accuracy drops below threshold
Refine with a finer grid around that boundary (e.g., 20k, 24k, 28k, 32k)
Report the effective context window as the length at which middle accuracy is still acceptable

This reduces evaluation cost by 60-70% compared to a uniform fine-grained grid.

Temperature and Determinism

Always run NIAH with temperature=0 and greedy decoding. The needle retrieval task is a deterministic lookup - there is one correct answer. Temperature introduces variance that makes it harder to distinguish genuine retrieval failure from sampling noise.

If you run multiple trials per cell (recommended to detect flakiness), run them all at temperature=0. If the model is truly capable at that position/length, it should answer correctly on every trial. If you see inconsistent results at temperature=0, that itself is a signal of marginal capability at that context position.

Haystack Quality Matters

The background text (haystack) significantly affects results. Using repetitive or semantically homogeneous text makes the task easier because the needle stands out more. Real evaluation should use:

Domain-matched text: if your application processes legal contracts, use legal text as the haystack
Semantically coherent text: real documents, not repeated sentences
Text that does not contain your needle: verify the needle phrase does not appear naturally in the haystack before insertion
Multiple distinct haystacks: test on at least 3-5 different background documents to ensure results generalize

Handling Models with Sliding Window Attention

Some open-source models use sliding window attention (Mistral's architecture being the canonical example). These models have a fundamentally different failure mode: instead of gradual degradation in the middle, they exhibit sharp cutoffs. Information outside the effective window is simply inaccessible regardless of position.

For sliding window models, the NIAH heatmap will show a diagonal or step-function pattern rather than the U-shape. Context beyond the window size will fail uniformly, not preferentially in the middle. The evaluation methodology is the same; the interpretation differs.

RAG Context Window Evaluation

For RAG applications, the relevant question is not "can the model retrieve a needle from a 100k-token context" but rather "at what context size does retrieval from a multi-document input become unreliable."

A practical RAG-specific NIAH variant:

Take your actual retrieval chunks (typical sizes: 512-1024 tokens each)
Build contexts of $k$ chunks (e.g., k = 5, 10, 20, 50)
Place the relevant chunk at different positions in the $k$ -chunk set
Measure retrieval accuracy vs. number of chunks

This tells you the effective $k$ for your specific chunk size - more useful than raw token counts for RAG system design.

Cost Estimation for NIAH at Scale

Running a full NIAH grid is expensive. Estimate before running:

For a grid of 6 context lengths x 11 depth percentages x 3 runs = 198 cells.

At 128k tokens per cell, that is approximately 25 million tokens total. For an API-served model at $15 per million input tokens, that is approximately$ 375 for one full NIAH evaluation. For local inference, estimate GPU-hours based on your hardware's tokens-per-second throughput.

Optimization: run a partial grid first (3 context lengths x 5 depths x 1 run) to get a rough picture, then spend the remaining budget on the interesting cells near the degradation boundary.

Common Mistakes

:::danger Trusting the Claimed Context Window

The single most dangerous mistake in long-context evaluation is accepting the vendor's claimed context window as your system's actual capability. Model providers report the technical context window - the maximum input length without crashing. They do not typically report the effective context window at any specific accuracy threshold.

A model claiming 128k context may have an effective context window of 16k for reliable middle-position retrieval. Deploying it on 80k-token documents without verification is a system design error, not just a model limitation.

Always run NIAH or equivalent evaluation on your target context lengths before deploying. Budget this time into your deployment timeline.

:::

:::danger Evaluating Only Beginning and End Positions

Many teams test long-context capability by placing test information at the beginning of the document ("does the model see this instruction at the top?") or by appending test information at the end. This systematically overestimates model capability.

The U-shaped performance curve means beginning and end accuracy can be 15-40 percentage points higher than middle accuracy at the same document length. If your evaluation only samples beginning and end positions, you will see artificially high accuracy and conclude the model handles your document length - and then be surprised when it fails on real documents where relevant information is distributed throughout.

Always include middle positions (30%-70% depth) in your evaluation grid. They are the positions that matter most and fail first.

:::

:::warning Ignoring Multi-Hop Reasoning Degradation

Single-fact NIAH evaluates whether a model can find one piece of information. It does not evaluate whether a model can reason across two pieces of information from different positions.

Multi-hop retrieval degrades much faster than single-fact retrieval. A model that scores 90% on single NIAH at 32k tokens might score only 50% on two-hop reasoning at the same context length.

If your application requires cross-referencing facts (e.g., comparing a clause in section 3 to a definition in appendix B), run multi-hop evaluation, not just single-fact NIAH. The RULER benchmark provides ready-made multi-hop tasks.

:::

:::warning Using the Same Haystack Every Time

If you run multiple NIAH evaluations using the same background text (e.g., always the same Paul Graham essays), your results may not generalize. The model may have seen this specific text in training and have anomalously good or poor performance on it.

Rotate through at least 3-5 different haystack documents. If you are evaluating for a specific domain, use domain-appropriate text. Legal models should be evaluated with legal haystacks, not general web text.

:::

Interview Q&A

Q1: Explain the "lost in the middle" phenomenon and why it occurs in transformer models.

The lost-in-the-middle phenomenon is the empirically observed pattern where language models reliably retrieve information placed at the beginning or end of a long context but show significant accuracy degradation when the same information is placed in the middle of the context. Liu et al. (2023) documented this systematically in multi-document QA tasks: with 20 documents, accuracy at position 1 was around 75%, dropped to around 52% at the middle positions, and recovered near 70% at the last position.

The underlying cause involves two competing biases in transformer attention mechanisms. Primacy bias arises because early tokens attend to all subsequent tokens during self-attention, so they appear in the key-value cache for every subsequent attention computation. This makes them "sticky" in the model's internal representations. Recency bias arises because local attention patterns naturally weight nearby tokens more heavily, making recent tokens (near the end of the sequence) more accessible. Middle tokens benefit from neither effect, falling into an effective attention dead zone.

Fine-tuning on long-context data and architectural improvements like RoPE interpolation help reduce but do not eliminate the phenomenon. For production systems, this means assuming uniform capability across a claimed context window is incorrect - you need position-stratified evaluation to understand actual capability.

Q2: How would you design a Needle in a Haystack evaluation for a legal document processing system?

I would adapt the standard NIAH framework in three ways for legal documents.

First, use domain-appropriate haystacks. Instead of Paul Graham essays, use actual (anonymized) legal documents: contracts, briefs, regulatory filings. Legal text has a specific lexical density and structure - evaluation results on general web text often do not transfer.

Second, design needles that represent actual failure modes. For a legal system, the critical failure case is not finding a random phrase - it is correctly resolving a cross-reference: "the indemnification limitations in Section 14.3 apply to all claims described in Schedule 7." The needle should require the model to hold one piece of information in mind while finding related information at a different position.

Third, test at realistic document lengths. Real legal contracts for M&A transactions often run 200-400 pages, which is 50k-150k tokens. Test at your actual working lengths, not at convenient round numbers. If your contracts average 80k tokens, test at 80k, not at 64k or 128k.

For metrics, I would track: single-clause retrieval accuracy at different positions, cross-reference accuracy (fact A at position X with fact B at position Y), and specifically test the 30%-70% middle range where degradation is worst. I would also measure how often the model confidently returns wrong answers vs. saying "not found" - overconfident wrong answers are worse than admissions of uncertainty in legal applications.

Q3: What is the difference between the technical context window and the effective context window, and how do you measure the latter?

The technical context window is the maximum number of tokens a model can process in a single forward pass without causing memory overflow or positional encoding errors. It is determined by the model architecture and is the number vendors report in product documentation.

The effective context window is the maximum context length at which the model can reliably retrieve and reason about information regardless of where in the context that information appears. It is always less than or equal to the technical context window and is not reported by default.

To measure the effective context window: run a position-stratified retrieval evaluation (NIAH or equivalent) across a range of context lengths. For each length, compute accuracy specifically on middle-position targets (positions 20%-80% through the document). The effective context window is the longest length at which middle-position accuracy exceeds your acceptable threshold - commonly 80% or 90%.

In practice, this measurement typically reveals effective context windows 3x to 8x smaller than the claimed technical context window. A model claiming 128k technical context may have an effective context window of 16k-32k at a 0.85 accuracy threshold. This is the number that matters for production system design.

Q4: How do you adapt NIAH evaluation for a RAG (Retrieval-Augmented Generation) system?

Standard NIAH uses a single continuous document with a needle at a specific token position. RAG systems work differently: the context is a collection of discrete chunks (typically 512-1024 tokens each), retrieved and concatenated. The relevant question changes from "can the model find a fact at token position X in a 100k-token document" to "can the model use the right chunk when it is chunk K out of N retrieved chunks."

RAG-specific NIAH: take your application's actual chunk size and chunk the haystack into $N$ chunks. Insert the relevant information into a specific chunk $k$ (varying $k$ from 1 to $N$ ). Ask the model to answer using the provided chunks. Measure accuracy as a function of $k$ (the position of the relevant chunk in the retrieved set) and $N$ (the total number of retrieved chunks).

This reveals two practical parameters: the maximum $N$ (number of chunks) at which your model reliably uses all of them, and whether chunk ordering matters (does putting the relevant chunk first improve accuracy?).

For most production RAG systems, I find the answer is: chunk position matters even with small $N$ (5-10 chunks), the first and last positions outperform the middle, and the safe maximum $N$ before accuracy degrades meaningfully is 10-15 chunks for most models. This informs the top- $k$ hyperparameter in your retriever.

Q5: A model claims 128k context. Your NIAH evaluation shows 95% accuracy at 32k but only 60% accuracy in the middle of 64k contexts. How do you communicate this to your engineering team and what do you recommend?

I would communicate it as: the model has an effective context window of approximately 32k tokens for middle-position retrieval at our target accuracy threshold, not 128k.

The practical recommendation depends on the application. If the task involves documents under 32k tokens, the 128k model is fine - deploy it. If the task involves 64k-token documents, you have three options: (1) switch to a model with better long-context performance at 64k (if one exists at acceptable cost), (2) use chunked processing with a strong retrieval step to ensure relevant information appears near the beginning or end of the context the model sees, (3) extend the model's effective context window through additional fine-tuning on long-context data using YaRN or similar techniques.

For option 2, I would implement a "context packing" strategy: use the retriever to identify the most relevant chunks, then place them at the beginning of the context (primacy position) with supporting context after them. This leverages the model's strength (beginning position) while working around its weakness (middle degradation). The downside is that it requires more sophisticated retrieval and means you are not using the full 64k context effectively.

I would specifically avoid the choice of ignoring the evaluation result and deploying on 64k documents expecting 95% accuracy - that is the decision that leads to the legal liability scenario at the start of this lesson.

Q6: How does RULER differ from basic NIAH, and when should you use each?

NIAH tests a single capability: can the model retrieve one specific fact from a specific position in a long document? It answers the question "does the attention mechanism reach the needle at this position and length." It is fast to run (a few hundred cells covers most use cases), easy to interpret (the heatmap is self-explanatory), and directly reveals positional degradation.

RULER extends NIAH into a taxonomy of increasingly complex long-context tasks. Single NIAH is just the base case. The harder tasks - multi-key NIAH, variable tracking, multi-document QA - require the model to do more than retrieve: it must track, integrate, and reason across multiple pieces of information at different positions. These tasks degrade faster than single NIAH because each additional retrieval or reasoning step multiplies the probability of a miss.

Use basic NIAH when: (1) you want a quick sanity check on a model's effective context window, (2) your application is primarily retrieval-based (find this specific thing in this document), (3) you have limited evaluation budget.

Use RULER when: (1) your application requires multi-hop reasoning or cross-referencing, (2) you are comparing multiple models and want a comprehensive capability ranking, (3) you want to understand not just "does performance degrade" but "what type of task degrades first and fastest." For most production applications that use long context for complex reasoning, RULER gives a more accurate prediction of real-world failure modes.

A Legal Team's Near-Miss​

Why This Exists - The Gap Between Claimed and Actual Context​

Historical Context - From Attention Limits to Needle Tests​

Core Concepts​

The Effective Context Window​

The Lost-in-the-Middle Phenomenon​

Needle in a Haystack (NIAH) Evaluation​

Multi-Hop Reasoning at Long Context​

Position-Relative Performance Curves​

Code Examples​

Implementing a Basic NIAH Evaluator​

Visualizing NIAH Results as a Heatmap​

Implementing a RULER-Style Multi-Hop Evaluation​

Building a Long-Context Test Suite for Document Processing​

Mermaid Diagrams​

The NIAH Evaluation Pipeline​

The Lost-in-the-Middle Performance Pattern​

RULER Task Taxonomy​

Production Engineering Notes​

Choosing Context Length Test Points​

Temperature and Determinism​

Haystack Quality Matters​

Handling Models with Sliding Window Attention​

RAG Context Window Evaluation​

Cost Estimation for NIAH at Scale​

Common Mistakes​

Interview Q&A​