Module 15 - Long Context Strategies

Your model supports 128K tokens. A user sends in a 400-page PDF. Your system processes it. Then you discover: the model answered a question about page 3 correctly, nailed something from page 387, and completely missed the critical information on page 201. Not because the information wasn't in the context - it was. The model just didn't attend to it.

Long context is one of the most actively evolving areas of LLM engineering. Context windows have expanded from 1K tokens (GPT-2) to 2K (GPT-3) to 128K (GPT-4o) to 1M and then 2M (Gemini 1.5 Pro) in roughly five years. Each expansion brings new challenges: how to train models to actually use long contexts, how to make attention tractable at scale, and how to get models to reliably attend to information wherever it appears in the context.

Module Map

Lessons at a Glance

Lesson	Topic	Key Concepts
01	Attention Scaling	O(n²) cost, KV cache bytes, FlashAttention
02	RoPE and ALiBi	Rotary encodings, theta parameter, ALiBi extrapolation
03	Context Extension	Position interpolation, NTK scaling, YaRN, LongLoRA
04	Lost in the Middle	U-shape curve, multi-doc QA, mitigation strategies
05	RAG vs Long Context	Cost/accuracy/latency comparison, hybrid approach
06	Context Compression	LLMLingua, AutoCompressors, GIST tokens
07	Practical Guide	GPT-4o/Claude/Gemini comparison, prompt structure, costs

Key Concepts

KV Cache - The key and value tensors from attention are cached to avoid recomputation during autoregressive generation. Its size grows linearly with context length, number of layers, number of key/value heads, and head dimension.
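To make the scaling concrete, here is a minimal sketch of the cache-size arithmetic. The function name `kv_cache_bytes` and the 7B-class configuration (32 layers, 32 KV heads, head dimension 128, fp16) are illustrative assumptions, not figures for any specific model covered in this module.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to cache keys and values for `seq_len` tokens."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


# Hypothetical 7B-class config: 32 layers, 32 KV heads, head_dim 128, fp16.
for seq_len in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(32, 32, 128, seq_len) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:.1f} GiB")
```

Under these assumptions each token costs 512 KiB of cache, so a 128K-token context needs about 64 GiB for a single sequence. Models that share KV heads across query heads (grouped-query attention) shrink `n_kv_heads` and the cache with it.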

RoPE - Rotary Position Embedding. Encodes position by rotating query and key vectors. Enables relative position attention and is the foundation for most context extension techniques.
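The relative-position property can be checked numerically. The sketch below is a pure-Python illustration, not a production implementation: it rotates adjacent pairs of dimensions (the interleaved layout; some libraries pair the first and second halves of the vector instead), and the helper names `rope` and `dot` are made up for this example.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by the angle pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # lower pair index -> higher rotation frequency
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# Same offset (2) at very different absolute positions: scores match.
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 105), rope(k, 103))
print(abs(near - far) < 1e-9)
```

Because the attention score depends only on the offset between query and key positions, context extension methods can work by rescaling the rotation angles rather than relearning position embeddings.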

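YaRN - Yet another RoPE extensioN. A frequency-aware interpolation strategy: it stretches the low-frequency RoPE dimensions while leaving the high-frequency ones largely untouched, extending the context window with a small amount of fine-tuning rather than full retraining.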
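A rough sketch of the frequency-aware idea follows. It blends plain position interpolation (divide the frequency by the scale factor) with the unmodified frequency, based on how many full rotations a dimension completes within the original context. The function names are hypothetical, and the ramp bounds `alpha=1`, `beta=32` and the `0.1 * ln(scale) + 1` attention factor are values reported for Llama-family models; treat them as assumptions to verify against the implementation you actually use.

```python
import math


def yarn_inv_freqs(head_dim, orig_ctx, scale, base=10000.0,
                   alpha=1.0, beta=32.0):
    """Per-pair RoPE frequencies after a YaRN-style blended interpolation."""
    freqs = []
    for i in range(0, head_dim, 2):
        theta = base ** (-i / head_dim)
        wavelength = 2 * math.pi / theta
        rotations = orig_ctx / wavelength  # full turns within the original context
        # gamma = 1: keep the frequency as-is; gamma = 0: fully interpolate.
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        freqs.append((1 - gamma) * theta / scale + gamma * theta)
    return freqs


def yarn_attention_factor(scale):
    """Factor applied to the rotated queries and keys (assumed default)."""
    return 0.1 * math.log(scale) + 1.0


freqs = yarn_inv_freqs(head_dim=128, orig_ctx=4096, scale=8.0)
print(freqs[0], freqs[-1], yarn_attention_factor(8.0))
```

With these settings the fastest-rotating pair keeps its original frequency, the slowest-rotating pair is slowed by the full factor of 8, and the pairs in between are blended linearly.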

Lost in the Middle - The empirical observation (Liu et al. 2023) that LLMs recall information from the beginning and end of long contexts well, but struggle with information in the middle.
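One commonly used mitigation is to reorder retrieved documents so the most relevant ones sit at the edges of the prompt and the least relevant ones land in the middle. The sketch below assumes the input list is already sorted from most to least relevant; `reorder_for_edges` is a hypothetical helper name, and this is one heuristic among several rather than a guaranteed fix.

```python
def reorder_for_edges(docs_by_relevance):
    """Place top-ranked docs at the start and end, weakest in the middle."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        # Alternate: ranks 1, 3, 5, ... fill the front; ranks 2, 4, ... the back.
        (front if i % 2 == 0 else back).append(doc)
    # Reverse the back half so the second-best document ends up last.
    return front + back[::-1]


ranked = ["d1", "d2", "d3", "d4", "d5"]  # d1 is the most relevant
print(reorder_for_edges(ranked))
# ['d1', 'd3', 'd5', 'd4', 'd2']
```

The best document opens the context and the second best closes it, which matches the positions where recall is strongest on the U-shaped curve.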

LLMLingua - A compression technique that uses a smaller LM to identify and remove low-information tokens from a long prompt before passing it to a larger LM. Typical settings compress prompts by roughly 2-6×; the original paper reports up to 20× on some tasks.

RULER - A long-context benchmark testing retrieval, multi-hop reasoning, and aggregation at various context lengths. More challenging than Needle-in-a-Haystack.

Prerequisites

Transformer attention mechanism (Q, K, V, softmax)
Positional encoding basics (absolute, relative)
Basic knowledge of GPT/Llama architecture
Familiarity with tokenization and KV cache

Why This Module Matters

The shift from 4K to 128K context windows is not just a quantitative change - it's a qualitative change in what applications are possible. Entire codebases, full research papers, multi-hour transcripts: all can now fit in a single context. But fitting information in the context and reliably using it are different challenges. This module gives you the tools to understand both.
