Module 15 - Long Context Strategies

Your model supports 128K tokens. A user sends in a 400-page PDF. Your system processes it. Then you discover: the model answered a question about page 3 correctly, nailed something from page 387, and completely missed the critical information on page 201. Not because the information wasn't in the context - it was. The model just didn't attend to it.

Long context is one of the most actively evolving areas of LLM engineering. Context windows have expanded from 1K tokens (GPT-2) to 2K (GPT-3) to 128K (GPT-4o) to 1M and then 2M (Gemini 1.5 Pro) in roughly five years. Each expansion brings new challenges: how to train models to actually use long contexts, how to make attention tractable at scale, and how to get models to reliably attend to information wherever it appears in the context.


Lessons at a Glance

Lesson | Topic               | Key Concepts
01     | Attention Scaling   | O(n²) cost, KV cache bytes, FlashAttention
02     | RoPE and ALiBi      | Rotary encodings, theta parameter, ALiBi extrapolation
03     | Context Extension   | Position interpolation, NTK scaling, YaRN, LongLoRA
04     | Lost in the Middle  | U-shape curve, multi-doc QA, mitigation strategies
05     | RAG vs Long Context | Cost/accuracy/latency comparison, hybrid approach
06     | Context Compression | LLMLingua, AutoCompressors, GIST tokens
07     | Practical Guide     | GPT-4o/Claude/Gemini comparison, prompt structure, costs

Key Concepts

KV Cache - The key and value tensors from attention are cached to avoid recomputation during autoregressive generation. Its size grows linearly with context length, number of layers, number of key/value heads, head dimension, and batch size.
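To make the growth concrete, here is a minimal back-of-the-envelope sketch. The function name `kv_cache_bytes` and the model configuration (32 layers, 8 key/value heads, head dimension 128, 16-bit values) are illustrative assumptions, not the specification of any particular model.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch_size=1):
    """Estimate KV cache size in bytes (hypothetical helper).

    Each layer stores two tensors (K and V), each holding
    batch_size * n_kv_heads * seq_len * head_dim elements.
    """
    return (2 * batch_size * n_layers * n_kv_heads
            * head_dim * seq_len * bytes_per_elem)


# Assumed configuration for illustration only: 16-bit cache, 128K-token context.
size = kv_cache_bytes(seq_len=128_000, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.1f} GiB")  # prints "15.6 GiB"
```

Under these assumptions a single 128K-token sequence needs about 15.6 GiB for the cache alone, and doubling the context length doubles that figure.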

RoPE - Rotary Position Embedding. Encodes position by rotating query and key vectors by position-dependent angles, so that attention scores depend only on the relative offset between tokens. It is the foundation for most context extension techniques.
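The relative-position property can be checked numerically. The following is a minimal NumPy sketch, assuming the common formulation in which consecutive pairs of dimensions are rotated with frequencies theta^(-2i/d); the helper name `rope` and the vector size are illustrative choices, and production implementations differ in layout and batching.

```python
import numpy as np


def rope(x, pos, theta=10000.0):
    """Rotate consecutive pairs of a 1-D vector x by angles pos * theta^(-2i/d)."""
    d = x.shape[-1]
    freqs = theta ** (-np.arange(0, d, 2) / d)  # one frequency per dimension pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out


rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)

# The same offset (3) at very different absolute positions.
score_near = rope(q, 5) @ rope(k, 2)
score_far = rope(q, 1005) @ rope(k, 1002)
print(np.isclose(score_near, score_far))  # scores match up to float error
```

Because the score depends only on the offset, changing how positions map to rotation angles (for example by rescaling them) is the lever that context extension methods pull.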

YaRN - Yet another RoPE extensioN. A frequency-aware interpolation strategy that extends context windows significantly without full retraining.

Lost in the Middle - The empirical observation (Liu et al. 2023) that LLMs recall information from the beginning and end of long contexts well, but struggle with information in the middle.
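One commonly used mitigation is to reorder retrieved documents so that the most relevant ones sit at the start and end of the prompt, leaving the least relevant in the middle. The sketch below assumes the input is already sorted from most to least relevant; the function name `reorder_for_long_context` is hypothetical, and this is an illustration of the idea rather than a method taken from Liu et al.

```python
def reorder_for_long_context(docs_by_relevance):
    """Place high-relevance docs at the edges and low-relevance docs in the middle.

    Expects a list sorted from most to least relevant.
    """
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        # Alternate: even ranks fill the front, odd ranks fill the back.
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]


docs = ["d1", "d2", "d3", "d4", "d5"]  # d1 is the most relevant
print(reorder_for_long_context(docs))  # ['d1', 'd3', 'd5', 'd4', 'd2']
```

Here the two most relevant documents (`d1`, `d2`) end up first and last, where recall tends to be strongest, while the least relevant (`d5`) lands in the middle.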

LLMLingua - A compression technique that uses a smaller LM to identify and remove redundant tokens from a long prompt before passing it to a large LM. Typically achieves 2-6× compression.

RULER - A long-context benchmark testing retrieval, multi-hop reasoning, and aggregation at various context lengths. More challenging than Needle-in-a-Haystack.


Prerequisites

  • Transformer attention mechanism (Q, K, V, softmax)
  • Positional encoding basics (absolute, relative)
  • Basic knowledge of GPT/Llama architecture
  • Familiarity with tokenization and KV cache

Why This Module Matters

The shift from 4K to 128K context windows is not just a quantitative change - it's a qualitative change in what applications are possible. Entire codebases, full research papers, multi-hour transcripts: all can now fit in a single context. But fitting information in the context and reliably using it are different challenges. This module gives you the tools to understand both.

© 2026 EngineersOfAI. All rights reserved.