8 docs tagged with "long-context"

Context Compression Techniques

How LLMLingua, AutoCompressors, GIST tokens, and selective compression reduce long contexts to fewer tokens while preserving the information needed to answer queries.

Context Window Extension - YaRN, LongRoPE, LongLoRA

How position interpolation, NTK-aware scaling, YaRN, and LongLoRA extend pretrained models to context windows far beyond their original training length.

Lost in the Middle - How LLMs Use Long Contexts

The empirical finding that LLMs recall information most reliably at the beginning and end of long contexts and often miss information in the middle, and strategies to mitigate this U-shaped performance degradation.

Module 15 Overview - Long Context Strategies

How modern LLMs handle extremely long inputs - from the fundamental O(n²) attention problem to RoPE scaling, context compression, and production engineering for 128K+ context windows.

RAG vs Long Context - When to Use Each

A rigorous cost, latency, and accuracy comparison of retrieval-augmented generation versus long-context stuffing, with decision frameworks for production use cases.

RoPE and ALiBi - Positional Encoding for Long Context

How Rotary Position Embedding encodes relative positions through complex-plane rotations, why ALiBi achieves length extrapolation with linear biases, and why RoPE became the dominant approach for long-context models.

The Challenge of Attention at Long Contexts

Why attention is O(n²) in memory and compute, how the KV cache grows with context length, and how FlashAttention addresses the memory IO bottleneck while still computing exact attention.

Working with 128K+ Context Windows in Production

A complete production engineering guide for building applications with long-context LLMs - model selection, cost management, prompt structure, multi-turn conversation, and memory-augmented systems.