RoPE and ALiBi - Positional Encoding for Long Context
The Problem With Learned Position Embeddings
Early transformer language models such as BERT and GPT-2 used learned absolute position embeddings: one trainable vector per position index, added to the token embedding at that position. This works well at the training context length but has a fundamental limitation: position 2048 has a learned embedding, but position 2049 does not. A sequence longer than the training maximum has no embedding to look up for its extra positions, so the model either fails outright or produces meaningless output.
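To make the limitation concrete, here is a minimal sketch in plain Python. The table size `MAX_POS`, the width `DIM`, and the helper `lookup_position` are illustrative assumptions rather than any specific model's code. Positions are 0-indexed here, so the last valid index is `MAX_POS - 1`.

```python
import random

MAX_POS = 2048  # hypothetical training context length
DIM = 4         # tiny embedding width, for illustration only

rng = random.Random(0)
# Stand-in for a learned table: one vector per position index 0..MAX_POS-1.
position_table = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(MAX_POS)]


def lookup_position(pos: int) -> list[float]:
    """Return the embedding for `pos`, or raise if the table has no such row."""
    if not 0 <= pos < len(position_table):
        raise IndexError(f"no learned embedding for position {pos}")
    return position_table[pos]


last_ok = lookup_position(MAX_POS - 1)  # fine: the last trained position
try:
    lookup_position(MAX_POS)            # one past the end of the table
    overflowed = False
except IndexError:
    overflowed = True
```

The lookup has nothing sensible to return past the end of the table, which is the gap that relative schemes such as RoPE and ALiBi are designed to close.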
The problem was obvious enough that researchers started looking for alternatives almost immediately. The goal was a position encoding scheme that:
- Doesn't require a fixed vocabulary of position indices
- Allows attention scores to depend on relative positions, not just absolute ones
- Extrapolates smoothly to positions beyond the training range
- Is computationally efficient (no additional learned parameters and negligible extra computation)
Two approaches emerged that dominate long-context models today: ALiBi (2021) and RoPE (2021). They solve the problem differently, with different tradeoffs that explain why RoPE became the dominant approach for extending context beyond 32K tokens.
Rotary Position Embedding (RoPE)
The Core Idea
RoPE was introduced by Jianlin Su and colleagues in "RoFormer: Enhanced Transformer with Rotary Position Embedding" (2021). The key insight is elegant: encode position by rotating the query and key vectors in the complex plane.
Instead of adding a position vector to the token embedding (as absolute position embeddings do), RoPE applies a rotation matrix $R_m$ to the query vector at position $m$ and a rotation matrix $R_n$ to the key vector at position $n$. The resulting inner product depends only on the relative position $n - m$.
This is the crucial property: attention scores in RoPE automatically encode relative position, not absolute position.
The Mathematics
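For a two-dimensional case (easy to visualize), the rotation at position $m$ with frequency $\theta$ is:
$$R_m(\theta) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}$$
For a $d$-dimensional query/key vector (where $d$ must be even), the vector is partitioned into $d/2$ pairs, and each pair is rotated by a different frequency:
$$R_m = \mathrm{diag}\big(R_m(\theta_0),\ R_m(\theta_1),\ \dots,\ R_m(\theta_{d/2-1})\big)$$
Where each dimension pair $i$ has its own frequency $\theta_i$:
$$\theta_i = 10000^{-2i/d}, \quad i = 0, 1, \dots, d/2 - 1$$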
This is the same frequency formula as the original Transformer's sinusoidal position encoding, but used differently - as rotation angles rather than additive embeddings.
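The pairwise rotation can be sketched in a few lines of plain Python. This is a minimal illustration that assumes the standard choice $\theta_i = \text{base}^{-2i/d}$ and pairs adjacent components; the function name `rope_rotate` is hypothetical.

```python
import math


def rope_rotate(vec: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate each adjacent pair (vec[2i], vec[2i+1]) by the angle pos * theta_i."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("dimension must be even")
    out = []
    for i in range(d // 2):
        theta_i = base ** (-2.0 * i / d)  # per-pair frequency
        angle = pos * theta_i
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        # 2D rotation: [c -s; s c] @ [x; y]
        out.extend([x * c - y * s, x * s + y * c])
    return out


v = [1.0, 0.0, 1.0, 0.0]
rotated = rope_rotate(v, pos=5)
norm_before = math.sqrt(sum(a * a for a in v))
norm_after = math.sqrt(sum(a * a for a in rotated))
```

Because every pair undergoes a pure rotation, the vector's norm is unchanged. Only the direction carries position information, which is one way RoPE differs from additive encodings.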
Why It Gives Relative Position Attention
The inner product between a rotated query $R_m q$ at position $m$ and a rotated key $R_n k$ at position $n$ is:
$$\langle R_m q,\ R_n k \rangle = q^\top R_m^\top R_n\, k = q^\top R_{n-m}\, k$$
The last step holds because $R_m^\top = R_{-m}$ and rotations compose by adding their angles. The key observation: this depends only on $n - m$ (the relative position), not on $m$ or $n$ individually. Two token pairs at positions (3, 7) and (103, 107) both have relative position 4 and receive identical attention score contributions from the positional component.
This relative-position property is what makes RoPE suitable for length generalization: the model has seen relative positions up to the training length $L_{\text{train}}$ during training, and these relative distances remain meaningful at inference time.
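The (3, 7) versus (103, 107) claim can be checked numerically. The sketch below treats each dimension pair as one complex number, so rotating by an angle is a multiplication by $e^{i \cdot \text{angle}}$. The helper names and the sample vectors are made up for illustration.

```python
import cmath


def rotate_pairs(pairs: list[complex], pos: int, base: float = 10000.0) -> list[complex]:
    """Rotate pair i by the angle pos * base^(-i / n_pairs), i.e. base^(-2i/d)."""
    n_pairs = len(pairs)
    return [z * cmath.exp(1j * pos * base ** (-i / n_pairs)) for i, z in enumerate(pairs)]


def rope_score(q: list[complex], k: list[complex], m: int, n: int) -> float:
    """Real inner product of the query rotated to position m and the key rotated to n."""
    q_rot = rotate_pairs(q, m)
    k_rot = rotate_pairs(k, n)
    # Re(a * conj(b)) is the ordinary dot product of the two 2D vectors.
    return sum((a * b.conjugate()).real for a, b in zip(q_rot, k_rot))


q = [1 + 2j, 0.5 - 1j]        # arbitrary 4-dimensional query (two pairs)
k = [-0.3 + 0.7j, 2 + 0.1j]   # arbitrary 4-dimensional key

score_near = rope_score(q, k, 3, 7)      # offset 4, early in the sequence
score_far = rope_score(q, k, 103, 107)   # offset 4, one hundred tokens later
score_other = rope_score(q, k, 3, 8)     # offset 5
```

`score_near` and `score_far` agree up to floating-point error, while `score_other` differs, because only the offset between the two positions enters the score.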
import torch
import numpy as np


def precompute_rope_frequencies(
    dim: int,
    max_seq_len: int = 8192,
    theta: float = 10000.0,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Precompute the cos and sin tables for RoPE.

    Parameters
    ----------
    dim : head dimension (must be even)
    max_seq_len : maximum sequence length to precompute for
    theta : RoPE base (default 10000; Llama-3 uses 500000)

    Returns
    -------
    cos, sin tensors of shape (max_seq_len, dim//2)
    """
    # Frequencies for each dimension pair:
    # theta_i = 1 / (base^(2i/dim)) for i in 0..dim/2-1
    # `i` below holds the even indices 0, 2, ..., dim-2, i.e. the "2i" term.
    i = torch.arange(0, dim, 2).float()
    inv_freq = 1.0 / (theta ** (i / dim))  # shape: (dim//2,)

    # Position indices
    t = torch.arange(max_seq_len, dtype=torch.float)  # shape: (max_seq_len,)

    # Outer product: t[m] * inv_freq[i] = m * theta_i
    freqs = torch.outer(t, inv_freq)  # shape: (max_seq_len, dim//2)

    # Precompute cos and sin for efficiency
    cos = freqs.cos()  # shape: (max_seq_len, dim//2)
    sin = freqs.sin()
    return cos, sin

def apply_rope(
    x: torch.Tensor,
    cos: torch.Tensor,
    sin: torch.Tensor,
    position_ids: torch.Tensor | None = None,
) -> torch.Tensor:
    """
    Apply Rotary Position Embedding to a query or key tensor.

    Parameters
    ----------
    x : query or key tensor, shape (batch, seq_len, n_heads, head_dim)
    cos, sin : precomputed RoPE tables, shape (max_seq_len, head_dim//2)
    position_ids : optional explicit position indices, shape (batch, seq_len)

    Returns
    -------
    Rotated tensor of same shape as x.
    """
    seq_len = x.shape[1]
    if position_ids is None:
        # Default: sequential positions 0, 1, 2, ..., seq_len-1
        cos_seq = cos[:seq_len]  # (seq_len, head_dim//2)
        sin_seq = sin[:seq_len]
    else:
        cos_seq = cos[position_ids]  # (batch, seq_len, head_dim//2)
        sin_seq = sin[position_ids]

    # Insert a singleton head axis just before the last dimension so the
    # tables broadcast over n_heads in both branches:
    # (seq_len, 1, head_dim//2) or (batch, seq_len, 1, head_dim//2)
    cos_seq = cos_seq.unsqueeze(-2)
    sin_seq = sin_seq.unsqueeze(-2)

    # Split into pairs for rotation
    x1, x2 = x[..., ::2], x[..., 1::2]  # even and odd indices

    # Rotate: [x1, x2] -> [x1*cos - x2*sin, x1*sin + x2*cos]
    # This is 2D rotation: [cos -sin; sin cos] @ [x1; x2]
    rotated_x1 = x1 * cos_seq - x2 * sin_seq
    rotated_x2 = x1 * sin_seq + x2 * cos_seq

    # Interleave back
    result = torch.stack([rotated_x1, rotated_x2], dim=-1)
    return result.flatten(-2)

# Example: compute attention with RoPE
def rope_attention(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    theta: float = 10000.0,
) -> torch.Tensor:
    """
    Self-attention with RoPE position encoding.

    query/key/value: (batch, seq_len, n_heads, head_dim)
    """
    batch, seq_len, n_heads, head_dim = query.shape
    scale = head_dim ** -0.5

    # Precompute frequencies
    cos, sin = precompute_rope_frequencies(head_dim, seq_len, theta)

    # Apply RoPE to queries and keys (values are not rotated)
    query = apply_rope(query, cos, sin)
    key = apply_rope(key, cos, sin)

    # Standard attention
    # Reshape for batched matmul: (batch, n_heads, seq_len, head_dim)
    q = query.transpose(1, 2)
    k = key.transpose(1, 2)
    v = value.transpose(1, 2)

    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    weights = torch.softmax(scores, dim=-1)
    output = torch.matmul(weights, v)
    return output.transpose(1, 2)
The Theta Hyperparameter and Context Length
The base $\theta$ in RoPE (the constant 10000 in the frequency formula) controls the relationship between position and rotation angle. With $\theta = 10{,}000$ (the original value from Su et al., also used in Llama-1/2):
- Dimension pair 0 has frequency $\theta_0 = 1$ (fast-changing, high frequency)
- Dimension pair $d/4$ has frequency $\theta_{d/4} = 10000^{-1/2} = 0.01$ (medium)
- Dimension pair $d/2 - 1$ has frequency $\theta_{d/2-1} \approx 10000^{-1} = 10^{-4}$ (slow-changing, low frequency)
The lowest-frequency dimensions complete one full rotation cycle over roughly $2\pi \times 10{,}000 \approx 62{,}800$ positions. This means at $\theta = 10{,}000$, the slow-frequency dimensions can tell positions apart across a span of about 62K tokens - but the model was only trained on a few thousand positions (2,048 for Llama-1, 4,096 for Llama-2). Beyond that, the rotation angles are unseen during training.
Llama-3's $\theta = 500{,}000$: Meta increased the base by 50× for Llama-3. With $\theta = 500{,}000$, the slow-frequency dimensions complete one cycle over roughly $2\pi \times 500{,}000 \approx 3.14$ million positions - providing a much longer effective position range before the model encounters out-of-distribution rotation angles. This is one reason the Llama family extended from 4K (Llama-2) to 8K (Llama-3 base) to 128K (Llama-3.1) context relatively smoothly.
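The wavelength figures can be reproduced with a short calculation. This sketch assumes the per-pair frequency $\theta_i = \text{base}^{-2i/d}$ and a head dimension of 128; the helper name `rope_wavelengths` is hypothetical. For a finite head dimension the slowest pair has exponent $(d-2)/d$ rather than 1, so its wavelength sits somewhat below the $2\pi \times \text{base}$ upper bound.

```python
import math


def rope_wavelengths(dim: int, base: float) -> list[float]:
    """Positions per full rotation for each dimension pair: 2*pi / theta_i."""
    return [2 * math.pi * base ** (2 * i / dim) for i in range(dim // 2)]


for base in (10_000.0, 500_000.0):
    w = rope_wavelengths(dim=128, base=base)
    print(
        f"base={base:>9,.0f}  fastest={w[0]:.2f}  slowest={w[-1]:,.0f}  "
        f"upper bound 2*pi*base={2 * math.pi * base:,.0f}"
    )
```

Raising the base stretches the slow end of the spectrum while leaving the fastest pair at a wavelength of $2\pi$ positions.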
import numpy as np


def analyze_rope_frequency_spectrum(
    dim: int,
    theta: float,
    context_lengths: tuple[int, ...] = (4096, 8192, 32768, 128000),
) -> None:
    """
    Count how many RoPE frequency bands complete at least one full
    rotation cycle within each context length.

    A band that completes a full cycle inside the training context has had
    all of its rotation angles seen during training.
    """
    print(f"RoPE frequency analysis (dim={dim}, theta={theta:.0f})")
    print("-" * 60)
    i_vals = np.arange(0, dim, 2)
    inv_freqs = 1.0 / (theta ** (i_vals / dim))
    wavelengths = (2 * np.pi) / inv_freqs  # positions per full rotation
    for ctx_len in context_lengths:
        n_active = np.sum(wavelengths <= ctx_len)
        n_total = len(wavelengths)
        pct = n_active / n_total * 100
        print(f" Context {ctx_len:>7,}: {n_active}/{n_total} ({pct:.0f}%) "
              f"frequency bands complete ≥1 cycle")


# theta=10000 (Llama-1/2):
analyze_rope_frequency_spectrum(dim=128, theta=10000)
# Context   4,096: 46/64 (72%) frequency bands complete ≥1 cycle
# Context   8,192: 50/64 (78%) frequency bands complete ≥1 cycle
# Context  32,768: 60/64 (94%) frequency bands complete ≥1 cycle
# Context 128,000: 64/64 (100%) frequency bands complete ≥1 cycle
print()
# theta=500000 (Llama-3):
analyze_rope_frequency_spectrum(dim=128, theta=500000)
# Context   4,096: 32/64 (50%) frequency bands complete ≥1 cycle
# Context   8,192: 35/64 (55%) frequency bands complete ≥1 cycle
# Context  32,768: 42/64 (66%) frequency bands complete ≥1 cycle
# Context 128,000: 49/64 (77%) frequency bands complete ≥1 cycle
Higher theta means fewer frequencies complete a full rotation within the training range, which paradoxically improves long-context extrapolation because the slow frequencies remain within "seen" rotation angles for longer sequences.
ALiBi - Attention with Linear Biases
The Core Idea
ALiBi (Press et al. 2021, "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation") takes a completely different approach: instead of encoding position in the Q and K vectors, it adds a fixed (not learned) bias to the attention logits that penalizes attending to distant tokens.
For each head $h$, ALiBi subtracts a constant slope $m_h$ times the distance between tokens from the attention score:
$$\text{score}_h(i, j) = \frac{q_i \cdot k_j}{\sqrt{d}} - m_h\,|i - j|$$
The head-specific slopes $m_h$ are fixed (not learned): they form a geometric sequence from $2^{-1}$ to $2^{-8}$ for 8 heads. A 16-head model uses the sequence from $2^{-0.5}$ to $2^{-8}$ with ratio $2^{-0.5}$, which adds 8 intermediate slopes ($2^{-0.5}, 2^{-1.5}, \dots, 2^{-7.5}$) to the 8-head values.
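The slope schedule can be written compactly. The sketch below assumes a power-of-two head count, where the first slope and the ratio are both $2^{-8/n}$; the function name `alibi_slopes` is illustrative and the non-power-of-two case is deliberately left out.

```python
def alibi_slopes(n_heads: int) -> list[float]:
    """Geometric slope sequence for a power-of-two head count.

    The first slope and the common ratio are both 2^(-8 / n_heads),
    so head h (0-indexed) gets 2^(-8 * (h + 1) / n_heads).
    """
    if n_heads <= 0 or n_heads & (n_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]


slopes_8 = alibi_slopes(8)    # 1/2, 1/4, ..., 1/256
slopes_16 = alibi_slopes(16)  # starts at 2^-0.5 and ends at 2^-8
```

Every second entry of `slopes_16` coincides with an entry of `slopes_8`, and the remaining eight values fall in between.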
Why ALiBi Extrapolates
ALiBi's key claim: models trained with ALiBi on short sequences (e.g., 1024 tokens) can attend to longer sequences at inference (e.g., 2048 tokens) without any fine-tuning.
The mechanism: the linear bias term naturally penalizes very distant tokens. Even for positions beyond the training range, the linear distance penalty remains meaningful and monotonically increasing. There's no "unknown position" - just a large linear penalty. The model has learned to interpret "high linear penalty = far away" during training; this interpretation extrapolates to even larger distances.
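A small sketch shows why no distance is ever "unknown" under this scheme. The helper `alibi_bias_row` and the chosen slope are illustrative assumptions; the row is the bias one query adds to the logits of all earlier keys.

```python
def alibi_bias_row(query_pos: int, slope: float) -> list[float]:
    """Bias added to the logits of keys 0..query_pos for one query (causal)."""
    return [-slope * (query_pos - j) for j in range(query_pos + 1)]


slope = 2.0 ** -4  # slope of one hypothetical head

train_row = alibi_bias_row(query_pos=7, slope=slope)   # last query in an 8-token window
long_row = alibi_bias_row(query_pos=15, slope=slope)   # last query in a 16-token window

# The 8 nearest keys receive exactly the biases seen at the shorter length...
nearest_match = long_row[-8:] == train_row
# ...and the farther keys simply continue the same straight line downward.
farther = long_row[:8]
```

The longer sequence reuses every bias value from the shorter one and extends the line to more negative values, so nothing falls outside the function the model was trained with.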
import numpy as np
import torch
import torch.nn.functional as F


def get_alibi_slopes(n_heads: int) -> torch.Tensor:
    """
    Compute ALiBi slopes for each attention head.

    Following the formula from Press et al. 2021:
    slopes form a geometric sequence in powers of 2.
    """
    def get_slopes_power_of_2(n: int) -> list[float]:
        # For n heads the first slope is 2^(-8/n), which is also the ratio.
        start = 2 ** (-(2 ** -(np.log2(n) - 3)))
        ratio = start
        return [start * ratio**i for i in range(n)]

    if np.log2(n_heads).is_integer():
        slopes = get_slopes_power_of_2(n_heads)
    else:
        # Handle non-power-of-2 head counts: take the slopes for the largest
        # power of 2 below n_heads, then fill the remainder with every other
        # slope from the next power of 2.
        closest_power_of_2 = int(2 ** np.floor(np.log2(n_heads)))
        slopes = (
            get_slopes_power_of_2(closest_power_of_2)
            + get_slopes_power_of_2(2 * closest_power_of_2)[0::2][:n_heads - closest_power_of_2]
        )
    return torch.tensor(slopes, dtype=torch.float32)

def alibi_attention(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
) -> torch.Tensor:
    """
    Multi-head causal attention with ALiBi positional bias.

    query/key/value: (batch, n_heads, seq_len, head_dim)
    """
    batch, n_heads, seq_len, head_dim = query.shape
    scale = head_dim ** -0.5

    # Compute attention scores (no position encoding in Q, K)
    scores = torch.matmul(query, key.transpose(-2, -1)) * scale
    # scores shape: (batch, n_heads, seq_len, seq_len)

    # Compute ALiBi bias
    slopes = get_alibi_slopes(n_heads).to(query.device)
    # slopes shape: (n_heads,)

    # Create relative distance matrix
    positions = torch.arange(seq_len, device=query.device)
    # distances[i, j] = |i - j| (causal: only care about j <= i)
    distances = (positions.unsqueeze(0) - positions.unsqueeze(1)).abs()
    distances = distances.unsqueeze(0).unsqueeze(0)  # (1, 1, seq_len, seq_len)

    # Bias = -m_h * distance
    alibi_bias = -slopes.view(1, n_heads, 1, 1) * distances.float()
    # alibi_bias shape: (1, n_heads, seq_len, seq_len)

    # Add bias to attention scores
    scores = scores + alibi_bias

    # Causal masking: block attention to future positions (j > i)
    causal_mask = torch.triu(torch.ones(seq_len, seq_len, device=query.device), diagonal=1)
    scores = scores.masked_fill(causal_mask.bool(), float('-inf'))

    weights = F.softmax(scores, dim=-1)
    output = torch.matmul(weights, value)
    return output
ALiBi's Extrapolation Limits
ALiBi extrapolates well to 2-3× its training length. Press et al. showed that a model trained on 1024 tokens maintains competitive perplexity at 2048 tokens. But the linear decay assumption becomes problematic at very long contexts:
- All long distances look the same: At very long distances, the linear penalty dominates all other attention score components. Tokens at position 5000 and 50000 are both "very far away" but the model must still distinguish them semantically.
- No frequency bands for long-range patterns: RoPE's multi-frequency spectrum allows the model to learn different attention patterns at different scales (local syntax vs. global structure). ALiBi's single linear decay per head doesn't support this hierarchy.
- Headcount constraint: The number of slope values is fixed by the number of heads. With 32 heads, there are 32 distinct distance-decay rates - a limited vocabulary for position-dependent patterns.
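The first limitation is easy to quantify. Inside the softmax, a key's unnormalized weight is $\exp(\text{content score} - m_h \cdot \text{distance})$, so the bias multiplies the weight by $\exp(-m_h \cdot \text{distance})$. The sketch below evaluates that factor for the steepest and gentlest of the 8-head slopes; the helper name and the chosen distances are illustrative.

```python
import math


def alibi_weight_factor(slope: float, distance: int) -> float:
    """Factor exp(-slope * distance) that the bias applies to a key's
    unnormalized attention weight."""
    return math.exp(-slope * distance)


slopes = [2.0 ** -k for k in range(1, 9)]  # 8-head slopes: 1/2 ... 1/256

for distance in (50, 5_000, 50_000):
    steepest = alibi_weight_factor(slopes[0], distance)
    gentlest = alibi_weight_factor(slopes[-1], distance)
    print(f"distance {distance:>6}: steepest head {steepest:.3e}, "
          f"gentlest head {gentlest:.3e}")
```

Under these assumptions even the gentlest head suppresses a key 5,000 tokens away by many orders of magnitude, so content differences between far-away tokens have very little room to matter.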
These limitations explain why ALiBi is found primarily in models with moderate context lengths (MPT-7B: 2K-8K, BLOOMZ: 2K) and RoPE dominates in models designed for very long contexts (Llama-3.1: 128K, Mistral: 32K, Gemini).
RoPE vs ALiBi - Direct Comparison
| Property | ALiBi | RoPE |
|---|---|---|
| Extrapolation (no fine-tune) | Good (2-3×) | Limited (1.2-1.5×) |
| Extrapolation (with fine-tune) | Limited | Excellent (10-32×+) |
| Relative position encoding | Implicit (via bias) | Explicit (via rotation) |
| Learned parameters | None (slopes are fixed) | None (angles are computed) |
| Integration with Flash Attn | Requires modification | Native support |
| Support in modern frameworks | Decreasing | Dominant |
| Multi-frequency position info | No (single linear decay) | Yes (frequency spectrum) |
Why RoPE Dominates Modern Long-Context Models
The choice of RoPE over ALiBi for modern long-context models comes down to one key property: RoPE is systematically extensible.
The failure mode of RoPE at long context (perplexity degradation at positions beyond training range) has a principled fix: modify the rotation frequencies to spread the training range over a larger position space. This can be done through:
- Position interpolation (linear scaling): multiply all position indices by a factor below 1 (training length divided by target length) so they fall back inside the trained range
- NTK-aware scaling: scale high-frequency components differently from low-frequency ones
- YaRN: apply a ramp function that scales each frequency band based on its wavelength
- Increasing theta: pre-train with a larger theta so the slow frequencies cover a wider position range
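Two of these options can be sketched with a few lines of arithmetic. The code below is a rough illustration, not a reference implementation: the helper names are hypothetical, the lengths are example values, and the NTK-aware base change uses the commonly cited form $\text{base} \cdot s^{d/(d-2)}$, which is an assumption here. YaRN's per-band ramp is not shown.

```python
def rope_angle(pos: float, pair: int, dim: int, base: float = 10000.0) -> float:
    """Rotation angle of dimension pair `pair` at (possibly fractional) position `pos`."""
    return pos * base ** (-2.0 * pair / dim)


def interpolate_position(pos: int, train_len: int, target_len: int) -> float:
    """Position interpolation: squeeze [0, target_len] into [0, train_len]."""
    return pos * train_len / target_len


def ntk_scaled_base(base: float, scale: float, dim: int) -> float:
    """NTK-aware base change (commonly cited form): base * scale^(dim / (dim - 2))."""
    return base * scale ** (dim / (dim - 2))


dim, train_len, target_len = 128, 4096, 16384  # example values only
scale = target_len / train_len
slow = dim // 2 - 1  # index of the lowest-frequency pair

# Linear interpolation: the last target position lands on an angle seen in training.
pi_angle = rope_angle(interpolate_position(target_len, train_len, target_len), slow, dim)
trained_angle = rope_angle(train_len, slow, dim)

# NTK-aware: the fastest pair is untouched, the slowest pair is slowed by `scale`.
new_base = ntk_scaled_base(10000.0, scale, dim)
fast_ratio = rope_angle(1, 0, dim, new_base) / rope_angle(1, 0, dim)
slow_ratio = rope_angle(1, slow, dim, new_base) / rope_angle(1, slow, dim)
```

Linear interpolation compresses every band equally, which blurs neighbouring tokens in the fast bands. The base change keeps local resolution and spends the stretch on the slow bands, which is the motivation usually given for the NTK-aware variant.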
ALiBi's linear decay structure doesn't have the same principled extensibility. You can fine-tune on longer sequences, but there's no analogue of RoPE's frequency interpolation that gives researchers a well-motivated scaling recipe.
Additionally, RoPE integrates naturally with Flash Attention implementations, while ALiBi's bias terms require special handling.
Practical Notes on Theta Values
Different models use different theta values for RoPE:
| Model | theta | Training Context |
|---|---|---|
| Llama-1 | 10,000 | 2,048 |
| Llama-2 | 10,000 | 4,096 |
| Code Llama | 1,000,000 | 16,384 |
| Llama-3 base | 500,000 | 8,192 |
| Llama-3.1 | 500,000 + YaRN-style frequency scaling | 128,000 |
| Mistral-7B | 10,000 | 32,768 |
| Gemma-2 | 10,000 | 8,192 |
Code Llama's extreme theta (1M) was a deliberate choice for code contexts: code files can be very long, and positions far into a file should be distinguishable. With theta=1M, the slow-frequency dimensions have a wavelength of roughly $2\pi \times 10^6 \approx 6.3$ million positions - far beyond any practical code file length.
Common Mistakes
:::danger Don't assume RoPE automatically extrapolates to arbitrary lengths
RoPE provides relative position encoding, but it does NOT automatically generalize to lengths much beyond the training maximum. The rotation angles for positions far outside the training range have never been seen during training. Models degrade sharply at 2× and catastrophically at 5× their training context without the scaling techniques described in Lesson 03.
:::
:::warning Don't confuse theta (base frequency) with context length
Increasing theta extends the wavelength of RoPE's slowest frequencies, which helps with long-context extrapolation - but it doesn't directly set the context length. A model with theta=500,000 still needs to be trained (or fine-tuned) on long contexts to actually perform well at long contexts. Theta is one component of the recipe, not the complete solution.
:::
:::tip Check which RoPE variant your model uses
Models using YaRN, LongRoPE, or other RoPE modifications have specific configuration parameters that must be set correctly at inference time. When loading a model like Llama-3.1 that ships with a RoPE scaling configuration, ensure your transformers version supports the model's RoPE scaling type. Loading with an older version may fail or silently fall back to standard RoPE, giving degraded performance at long contexts.
:::
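One way to act on this tip is to inspect the model's `config.json` before running long-context inference. The sketch below assumes the settings live under the `rope_theta` and `rope_scaling` keys, with the variant name under `rope_type` or the older `type`. The helper `describe_rope_config` and the example contents are hypothetical, so check your model's actual configuration file.

```python
import json


def describe_rope_config(config_text: str) -> str:
    """Summarize the RoPE settings found in a config.json-style string."""
    cfg = json.loads(config_text)
    theta = cfg.get("rope_theta", 10000.0)  # assumed default when the key is absent
    scaling = cfg.get("rope_scaling")       # missing or null means plain RoPE
    if not scaling:
        return f"theta={theta:g}, no RoPE scaling"
    # Assumption: the variant name is stored under "rope_type" or the older "type".
    kind = scaling.get("rope_type", scaling.get("type", "unknown"))
    factor = scaling.get("factor", "n/a")
    return f"theta={theta:g}, scaling={kind}, factor={factor}"


# Hypothetical config contents, for illustration only.
example = '{"rope_theta": 500000.0, "rope_scaling": {"rope_type": "yarn", "factor": 4.0}}'
summary = describe_rope_config(example)
plain = describe_rope_config('{"rope_theta": 10000.0}')
```

If the summary names a scaling variant that your installed library version does not recognize, that is the signal to upgrade before trusting long-context results.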
Interview Q&A
Q: How does RoPE encode relative position information? Why is this useful for long context?
A: RoPE encodes position by applying a rotation matrix to the query and key vectors: the query at position $m$ is rotated by $R_m$ and the key at position $n$ is rotated by $R_n$. The inner product $q^\top R_{n-m}\, k$ depends only on the relative position $n - m$, not on the absolute values of $m$ and $n$. This is useful for long context because the model learns patterns based on relative distances - "token A is 5 positions before token B" - rather than absolute positions. As long as the relative distances at inference time are similar to those seen during training, the model can apply its learned patterns regardless of where in the sequence the tokens appear.
Q: What is the theta hyperparameter in RoPE and how does it affect long-context behavior?
A: Theta is the base that determines the rotation speed of each dimension pair. With theta=10,000, dimension pair $i$ has rotation frequency $\theta_i = 10000^{-2i/d}$. Higher theta means slower-changing frequencies: the slowest frequency has a wavelength of roughly $2\pi \times \text{theta}$ positions. Increasing theta (e.g., from 10,000 to 500,000 in Llama-3) extends the effective position range before models encounter "out-of-distribution" rotation angles. Llama-3 uses theta=500,000 to support 8K training context more cleanly, and this larger theta also helps with the subsequent extension to 128K in Llama-3.1.
Q: How does ALiBi achieve length extrapolation and why doesn't it scale to very long contexts?
A: ALiBi adds a linear bias $-m_h\,|i - j|$ to attention scores, where $m_h$ is a head-specific slope. This naturally penalizes distant tokens. Because the penalty is purely linear in distance, even positions beyond the training range receive a meaningful (and monotonically larger) penalty - the model has learned to interpret "large penalty = far away" during training, and this interpretation extends to larger distances. However, ALiBi doesn't scale to very long contexts because: (1) the linear decay becomes so dominant at large distances that distant tokens are effectively ignored regardless of their content; (2) there's no multi-frequency structure to encode long-range vs. short-range patterns with different decay rates; (3) fine-tuning on very long contexts doesn't help as much as with RoPE because the linear structure can't be systematically modified.
Q: Why did RoPE become the dominant position encoding for long-context models over ALiBi?
A: Three main reasons. First, RoPE is systematically extensible: its frequency-based structure allows principled modifications (theta scaling, interpolation, YaRN) that extend context length without full retraining. ALiBi's linear structure doesn't have analogous extension techniques. Second, RoPE integrates naturally with FlashAttention kernels: the rotation is applied to Q and K before the kernel runs, so the kernel itself needs no changes, while ALiBi's additive bias term has to be supported inside the kernel. Third, empirically RoPE with fine-tuning achieves better performance at very long contexts (128K+) than ALiBi with comparable fine-tuning, partly because the multi-frequency spectrum supports richer position-dependent attention patterns.
Q: What is the frequency spectrum of RoPE and why does it matter for long-range dependencies?
A: RoPE applies different rotation speeds to different dimension pairs. High-frequency dimensions (small wavelength) rotate quickly through their cycle - they change significantly between adjacent tokens and encode short-range positional information. Low-frequency dimensions (large wavelength) rotate slowly - they change minimally between adjacent tokens but distinguish positions separated by thousands of tokens. This multi-scale structure means the model can learn different patterns for local (syntactic) and global (semantic) dependencies simultaneously. The high-frequency bands handle "this token is the next one after" distinctions; the low-frequency bands handle "this token is in section 3 vs section 10" distinctions. ALiBi's single linear decay rate per head doesn't support this hierarchy.
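The two ends of the spectrum can be compared by asking how far each band rotates per token. This sketch assumes a head dimension of 128 and a base of 10,000; the helper name `angle_step_degrees` is illustrative.

```python
import math


def angle_step_degrees(pair: int, dim: int, base: float = 10000.0) -> float:
    """Rotation (in degrees) that one position step applies to dimension pair `pair`."""
    return math.degrees(base ** (-2.0 * pair / dim))


dim = 128
fast_step = angle_step_degrees(0, dim)             # highest-frequency pair
slow_step = angle_step_degrees(dim // 2 - 1, dim)  # lowest-frequency pair

print(f"fast pair: {fast_step:.2f} deg per token, "
      f"{fast_step * 1000:.0f} deg per 1000 tokens")
print(f"slow pair: {slow_step:.5f} deg per token, "
      f"{slow_step * 1000:.2f} deg per 1000 tokens")
```

The fast pair turns by a large angle on every step, so it separates neighbouring tokens but wraps around many times over long spans. The slow pair barely moves per token yet still has not completed a small fraction of a turn after a thousand tokens, which is what lets it mark coarse location.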
