Module 15 - Long Context Strategies
Your model supports 128K tokens. A user sends in a 400-page PDF. Your system processes it. Then you discover: the model answered a question about page 3 correctly, nailed something from page 387, and completely missed the critical information on page 201. Not because the information wasn't in the context - it was. The model just didn't attend to it.
Long context is one of the most actively evolving areas of LLM engineering. Context windows have expanded from 1K tokens (GPT-2) to 2K (GPT-3) to 128K (GPT-4o) to 1M and then 2M (Gemini 1.5 Pro) in roughly five years. Each expansion brings new challenges: how to train models to actually use long contexts, how to make attention tractable at scale, and how to get models to reliably attend to information wherever it appears in the context.
Lessons at a Glance
| Lesson | Topic | Key Concepts |
|---|---|---|
| 01 | Attention Scaling | O(n²) cost, KV cache bytes, FlashAttention |
| 02 | RoPE and ALiBi | Rotary encodings, theta parameter, ALiBi extrapolation |
| 03 | Context Extension | Position interpolation, NTK scaling, YaRN, LongLoRA |
| 04 | Lost in the Middle | U-shape curve, multi-doc QA, mitigation strategies |
| 05 | RAG vs Long Context | Cost/accuracy/latency comparison, hybrid approach |
| 06 | Context Compression | LLMLingua, AutoCompressors, GIST tokens |
| 07 | Practical Guide | GPT-4o/Claude/Gemini comparison, prompt structure, costs |
Key Concepts
KV Cache - The key and value tensors from attention are cached to avoid recomputation during autoregressive generation. Size grows linearly with context length, and also scales with the number of layers, the number of KV heads, the head dimension, and the batch size.
RoPE - Rotary Position Embedding. Encodes position by rotating query and key vectors. Enables relative position attention and is the foundation for most context extension techniques.
YaRN - Yet another RoPE extensioN. A frequency-aware interpolation strategy that scales RoPE dimensions differently depending on their wavelength, extending context windows with a short fine-tuning run rather than full retraining.
Lost in the Middle - The empirical observation (Liu et al. 2023) that LLMs recall information from the beginning and end of long contexts well, but struggle with information in the middle.
LLMLingua - A compression technique that uses a smaller LM to score tokens and drop low-information ones from a long prompt before passing it to a large LM. Typical compression ratios are 2-6×; the original paper reports up to 20× on some tasks.
RULER - A long-context benchmark testing retrieval, multi-hop reasoning, and aggregation at various context lengths. More challenging than Needle-in-a-Haystack.
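Two of the concepts above can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: the function names (`kv_cache_bytes`, `rope_rotate`, `dot`) and the model configuration (32 layers, 8 KV heads, head dimension 128, fp16) are assumptions chosen for the example, not values taken from any specific model in this module. The first part estimates KV cache size. The second checks the property that makes RoPE a relative encoding: the query-key dot product depends only on the distance between positions.

```python
import math


def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch_size=1):
    """Estimate KV cache size in bytes (hypothetical helper)."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return (2 * batch_size * seq_len * n_layers
            * n_kv_heads * head_dim * bytes_per_elem)


def rope_rotate(vec, position, theta=10000.0):
    """Apply RoPE to an even-length vector by rotating adjacent pairs."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        # Pair (i, i+1) rotates at frequency theta^(-i/d).
        angle = position * theta ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Assumed configuration: 128K tokens, 32 layers, 8 KV heads, head_dim 128, fp16.
size = kv_cache_bytes(seq_len=131072, n_layers=32, n_kv_heads=8, head_dim=128)
print(size / 2**30, "GiB")  # 16.0 GiB

# Same relative offset (3) at very different absolute positions.
q = [0.3, -1.2, 0.7, 2.0]
k = [1.5, 0.4, -0.6, 0.9]
near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
far = dot(rope_rotate(q, 103), rope_rotate(k, 100))
print(abs(near - far) < 1e-9)  # True: the score depends only on the offset
```

Under these assumptions a single 128K-token sequence needs about 16 GiB for the cache alone, which is why KV cache size is often the binding constraint at long context. The RoPE check also hints at why context extension methods work by rescaling positions or frequencies: the attention score only ever sees rotation angles.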
Prerequisites
- Transformer attention mechanism (Q, K, V, softmax)
- Positional encoding basics (absolute, relative)
- Basic knowledge of GPT/Llama architecture
- Familiarity with tokenization and KV cache
Why This Module Matters
The shift from 4K to 128K context windows is not just a quantitative change - it's a qualitative change in what applications are possible. Entire codebases, full research papers, multi-hour transcripts: all can now fit in a single context. But fitting information in the context and reliably using it are different challenges. This module gives you the tools to understand both.
