How Rotary Position Embedding encodes relative positions through complex-plane rotations, why ALiBi achieves length extrapolation with linear biases, and why RoPE became the dominant approach for long-context models.

RoPE and ALiBi - Positional Encoding for Long Context

The Problem With Learned Position Embeddings

Early transformer language models such as BERT and GPT-2 used learned absolute position embeddings: learned vectors $\mathbf{p}_i$ added to token embeddings at each position $i$ . (The original Transformer itself used fixed sinusoidal encodings; learned tables became common soon after.) Learned tables worked well at the training context length but had a fundamental limitation: position 2048 has a learned embedding, but position 2049 does not. A sequence longer than the training maximum has no embedding to look up, so the model cannot represent those positions at all.

The problem was obvious enough that researchers started looking for alternatives almost immediately. They wanted a position encoding scheme that:

Doesn't require a fixed vocabulary of position indices
Allows attention scores to depend on relative positions, not just absolute ones
Extrapolates smoothly to positions beyond the training range
Is computationally efficient (no additional learned parameters and negligible extra computation)

Two approaches emerged that dominate long-context models today: ALiBi (2021) and RoPE (2021). They solve the problem differently, with different tradeoffs that explain why RoPE became the dominant approach for extending context beyond 32K tokens.

Rotary Position Embedding (RoPE)

The Core Idea

RoPE was introduced by Jianlin Su and colleagues in "RoFormer: Enhanced Transformer with Rotary Position Embedding" (2021). The key insight is elegant: encode position by rotating the query and key vectors in the complex plane.

Instead of adding a position vector to the token embedding (as absolute position embeddings do), RoPE applies a rotation matrix $R_m$ to the query vector at position $m$ and a rotation matrix $R_n$ to the key vector at position $n$ . The resulting inner product $\langle R_m \mathbf{q}, R_n \mathbf{k} \rangle$ depends only on the relative position $m - n$ .

This is the crucial property: attention scores in RoPE automatically encode relative position, not absolute position.

The Mathematics

For a two-dimensional case (easy to visualize), the rotation at position $m$ is:

$R_m \mathbf{x} = \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$
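The "complex plane" reading of this matrix can be checked numerically: rotating $(x_1, x_2)$ by $m\theta$ should match multiplying $x_1 + i x_2$ by $e^{im\theta}$ . The following is a minimal sketch in plain Python; the helper names `rotate_matrix` and `rotate_complex` and the sample values are illustrative assumptions, not part of the RoFormer paper.

```python
import cmath
import math


def rotate_matrix(x1: float, x2: float, m: int, theta: float) -> tuple[float, float]:
    """Apply the 2x2 rotation matrix R_m to the pair (x1, x2)."""
    c, s = math.cos(m * theta), math.sin(m * theta)
    return (c * x1 - s * x2, s * x1 + c * x2)


def rotate_complex(x1: float, x2: float, m: int, theta: float) -> tuple[float, float]:
    """The same rotation written as multiplication by e^{i m theta}."""
    z = complex(x1, x2) * cmath.exp(1j * m * theta)
    return (z.real, z.imag)


x1, x2, theta = 0.6, -1.3, 0.25  # arbitrary sample values
for m in (0, 1, 7, 100):
    via_matrix = rotate_matrix(x1, x2, m, theta)
    via_complex = rotate_complex(x1, x2, m, theta)
    # The two results agree, and the vector length never changes.
    print(m, via_matrix, via_complex, round(math.hypot(*via_matrix), 6))
```

Because a rotation preserves length, RoPE changes only the direction of each pair, never the magnitude of the query or key.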

For a $d$ -dimensional query/key vector (where $d$ must be even), the vector is partitioned into $d/2$ pairs, and each pair is rotated by a different frequency:

$R_m^{(i)} \mathbf{x}_{2i:2i+1} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}$

Where each dimension pair $i$ has its own frequency $\theta_i$ :

$\theta_i = \frac{1}{10000^{2i/d}}$

This is the same frequency formula as the original Transformer's sinusoidal position encoding, but used differently - as rotation angles rather than additive embeddings.
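To make the frequency formula concrete, the sketch below evaluates $\theta_i$ for a toy head dimension and converts each value into a wavelength (positions per full rotation). The function name `rope_inv_freq` and the choice of `dim = 8` are assumptions made for readability.

```python
import math


def rope_inv_freq(dim: int, base: float = 10000.0) -> list[float]:
    """Return theta_i = base^(-2i/dim) for i = 0 .. dim/2 - 1."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    return [base ** (-2 * i / dim) for i in range(dim // 2)]


dim = 8  # toy head dimension so the output stays short
for i, theta_i in enumerate(rope_inv_freq(dim)):
    wavelength = 2 * math.pi / theta_i  # positions per full rotation
    print(f"pair {i}: theta_i = {theta_i:.4f} rad/position, "
          f"wavelength = {wavelength:,.1f} positions")
```

With `dim = 8` the four frequencies are spaced a factor of ten apart, which shows the geometric spacing that real head dimensions (64 or 128) fill in more finely.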

Why It Gives Relative Position Attention

The inner product between a rotated query at position $m$ and a rotated key at position $n$ is:

$\langle R_m \mathbf{q}, R_n \mathbf{k} \rangle = \text{Re}\left[\sum_j (q_{2j} + iq_{2j+1})(k_{2j} - ik_{2j+1})e^{i(m-n)\theta_j}\right]$

The key observation: this depends only on $m - n$ (the relative position), not on $m$ or $n$ individually. Two query-key pairs at positions (3, 7) and (103, 107) both have a relative offset of 4, so for the same $\mathbf{q}$ and $\mathbf{k}$ they receive identical attention scores.
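A short sketch of why the identity holds, using the standard fact that the real inner product of two 2D vectors equals $\text{Re}[a\bar{b}]$ for their complex forms. Write each pair as $\tilde{q}_j = q_{2j} + iq_{2j+1}$ and $\tilde{k}_j = k_{2j} + ik_{2j+1}$ (this tilde notation is introduced here for convenience). Rotating by $m\theta_j$ is multiplication by $e^{im\theta_j}$ , so for a single pair:

$\text{Re}\left[\left(\tilde{q}_j e^{im\theta_j}\right)\overline{\left(\tilde{k}_j e^{in\theta_j}\right)}\right] = \text{Re}\left[\tilde{q}_j \bar{\tilde{k}}_j \, e^{im\theta_j} e^{-in\theta_j}\right] = \text{Re}\left[\tilde{q}_j \bar{\tilde{k}}_j \, e^{i(m-n)\theta_j}\right]$

Since $\bar{\tilde{k}}_j = k_{2j} - ik_{2j+1}$ , summing over $j$ recovers the expression above. The absolute positions cancel in the exponent, leaving only $m - n$ .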

This relative-position property is what makes RoPE suitable for length generalization: the model has seen relative positions up to $L_{train} - 1$ during training, and these relative distances remain meaningful at inference time.
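The relative-position property is easy to verify numerically. The sketch below is plain Python with hypothetical helpers `rope_rotate` and `dot`; it assumes a 16-dimensional vector and random values. It rotates the same query and key at two different absolute positions with the same offset and compares the scores.

```python
import math
import random


def rope_rotate(x: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive pairs (x[2i], x[2i+1]) by the angle pos * theta_i."""
    dim = len(x)
    out = []
    for i in range(dim // 2):
        angle = pos * base ** (-2 * i / dim)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out


def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


rng = random.Random(0)
q = [rng.gauss(0, 1) for _ in range(16)]
k = [rng.gauss(0, 1) for _ in range(16)]

# Same offset (4) at very different absolute positions.
near = dot(rope_rotate(q, 7), rope_rotate(k, 3))
far = dot(rope_rotate(q, 107), rope_rotate(k, 103))
# A different offset (5) for contrast.
other = dot(rope_rotate(q, 7), rope_rotate(k, 2))

print(f"offset 4 at (7, 3):     {near:.6f}")
print(f"offset 4 at (107, 103): {far:.6f}")
print(f"offset 5 at (7, 2):     {other:.6f}")
```

The first two scores agree up to floating-point error, while the third generally differs.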

import torch
import numpy as np

def precompute_rope_frequencies(
    dim: int,
    max_seq_len: int = 8192,
    theta: float = 10000.0,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Precompute the cos and sin frequency matrices for RoPE.

    Parameters
    ----------
    dim         : head dimension (must be even)
    max_seq_len : maximum sequence length to precompute for
    theta       : base frequency (default 10000, Llama-3 uses 500000)

    Returns
    -------
    cos, sin tensors of shape (max_seq_len, dim//2)
    """
    # Frequencies for each dimension pair
    # theta_i = 1 / (base^(2i/dim)) for i in 0..dim/2-1
    i = torch.arange(0, dim, 2).float()
    inv_freq = 1.0 / (theta ** (i / dim))  # shape: (dim//2,)

    # Position indices
    t = torch.arange(max_seq_len, dtype=torch.float)  # shape: (max_seq_len,)

    # Outer product: t[m] * inv_freq[i] = m * theta_i
    freqs = torch.outer(t, inv_freq)  # shape: (max_seq_len, dim//2)

    # Precompute cos and sin for efficiency
    cos = freqs.cos()  # shape: (max_seq_len, dim//2)
    sin = freqs.sin()

    return cos, sin


def apply_rope(
    x: torch.Tensor,
    cos: torch.Tensor,
    sin: torch.Tensor,
    position_ids: torch.Tensor | None = None,
) -> torch.Tensor:
    """
    Apply Rotary Position Embedding to query or key tensor.

    Parameters
    ----------
    x           : query or key tensor, shape (batch, seq_len, n_heads, head_dim)
    cos, sin    : precomputed RoPE frequencies, shape (max_seq_len, head_dim//2)
    position_ids: optional explicit position indices, shape (batch, seq_len)

    Returns
    -------
    Rotated tensor of same shape as x.
    """
    seq_len = x.shape[1]

    if position_ids is None:
        # Default: sequential positions 0, 1, 2, ..., seq_len-1
        cos_seq = cos[:seq_len]  # (seq_len, head_dim//2)
        sin_seq = sin[:seq_len]
    else:
        cos_seq = cos[position_ids]  # (batch, seq_len, head_dim//2)
        sin_seq = sin[position_ids]

    # Split into pairs for rotation
    x1, x2 = x[..., ::2], x[..., 1::2]  # even and odd indices

    # Rotate: [x1, x2] -> [x1*cos - x2*sin, x1*sin + x2*cos]
    # This is 2D rotation: [cos -sin; sin cos] @ [x1; x2]
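    # unsqueeze(-2) inserts the head axis, so this broadcasts for both
    # (seq_len, head_dim//2) and (batch, seq_len, head_dim//2) inputs
    rotated_x1 = x1 * cos_seq.unsqueeze(-2) - x2 * sin_seq.unsqueeze(-2)
    rotated_x2 = x1 * sin_seq.unsqueeze(-2) + x2 * cos_seq.unsqueeze(-2)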

    # Interleave back
    result = torch.stack([rotated_x1, rotated_x2], dim=-1)
    return result.flatten(-2)


# Example: compute attention with RoPE
def rope_attention(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    theta: float = 10000.0,
) -> torch.Tensor:
    """
    Self-attention with RoPE position encoding.

    query/key/value: (batch, seq_len, n_heads, head_dim)
    """
    batch, seq_len, n_heads, head_dim = query.shape
    scale = head_dim ** -0.5

    # Precompute frequencies
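    cos, sin = precompute_rope_frequencies(head_dim, seq_len, theta)
    # Match the device and dtype of the inputs (the tables are built on CPU in float32)
    cos = cos.to(device=query.device, dtype=query.dtype)
    sin = sin.to(device=query.device, dtype=query.dtype)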

    # Apply RoPE to queries and keys
    query = apply_rope(query, cos, sin)
    key = apply_rope(key, cos, sin)

    # Standard attention
    # Reshape for batched matmul: (batch, n_heads, seq_len, head_dim)
    q = query.transpose(1, 2)
    k = key.transpose(1, 2)
    v = value.transpose(1, 2)

    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    weights = torch.softmax(scores, dim=-1)
    output = torch.matmul(weights, v)

    return output.transpose(1, 2)

The Theta Hyperparameter and Context Length

The base frequency $\theta$ in RoPE controls the relationship between position and rotation angle. With $\theta = 10000$ (the original value from Su et al. and used in Llama-1/2):

Dimension pair 0 has frequency $\theta_0 = 1/10000^0 = 1$ (fast-changing, high frequency)
Dimension pair $d/4$ has frequency $\theta_{d/4} = 1/10000^{1/2} = 0.01$ (medium)
Dimension pair $d/2-1$ has frequency $\theta_{d/2-1} = 1/10000^{(d-2)/d} \approx 0.0001$ (slow-changing, low frequency)

The lowest-frequency pair completes one full rotation over $2\pi / \theta_{d/2-1} \approx 2\pi \times 10000 \approx 62,832$ positions. So at $\theta = 10000$ , the slowest dimensions could in principle distinguish positions across roughly 62K tokens - but the model was only trained up to $L_{train}$ positions. Beyond that, the slow dimensions reach rotation angles that never appeared during training.

Llama-3's $\theta = 500,000$ : Meta increased the base by 50× for Llama-3. With $\theta = 500,000$ , the slowest dimensions complete one cycle over roughly $2\pi \times 500,000 \approx 3.14M$ positions, so rotation angles grow far more slowly with position and a much longer range passes before the model encounters out-of-distribution angles. This is one reason the Llama family moved from 4K (Llama-2) to 8K (Llama-3 base) to 128K (Llama-3.1) context relatively smoothly.
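The wavelength arithmetic can be reproduced in a few lines. This sketch assumes a head dimension of 128 and uses a hypothetical helper `slowest_wavelength`. Note that the exact exponent for the last pair is $(d-2)/d$ rather than 1, so the $2\pi \times \theta$ figures quoted in the text are close upper bounds.

```python
import math


def slowest_wavelength(dim: int, base: float) -> float:
    """Wavelength (in positions) of the last pair, i = dim/2 - 1."""
    theta_last = base ** (-(dim - 2) / dim)
    return 2 * math.pi / theta_last


dim = 128  # assumed head dimension
for name, base in [("theta=10,000", 10_000.0), ("theta=500,000", 500_000.0)]:
    exact = slowest_wavelength(dim, base)
    upper = 2 * math.pi * base  # the rounded figure used in the text
    print(f"{name}: exact {exact:,.0f} positions, 2*pi*theta = {upper:,.0f}")
```

Either way, raising the base stretches the slowest wavelength by roughly the same factor as the base itself.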

def analyze_rope_frequency_spectrum(
    dim: int,
    theta: float,
    context_lengths: tuple[int, ...] = (4096, 8192, 32768, 128000),
) -> None:
    """
    Analyze which RoPE frequencies are "active" vs "out-of-distribution"
    for different context lengths.

    A frequency is "active" if the model has seen at least one full rotation
    cycle during training.
    """
    print(f"RoPE frequency analysis (dim={dim}, theta={theta:.0f})")
    print("-" * 60)

    i_vals = np.arange(0, dim, 2)
    inv_freqs = 1.0 / (theta ** (i_vals / dim))
    wavelengths = (2 * np.pi) / inv_freqs  # positions per full rotation

    for ctx_len in context_lengths:
        n_active = np.sum(wavelengths <= ctx_len)
        n_total = len(wavelengths)
        pct = n_active / n_total * 100
        print(f"  Context {ctx_len:>7,}: {n_active}/{n_total} ({pct:.0f}%) "
              f"frequency bands complete ≥1 cycle")

# theta=10000 (Llama-1/2):
analyze_rope_frequency_spectrum(dim=128, theta=10000)
#   Context   4,096: 46/64 (72%) frequency bands complete ≥1 cycle
#   Context   8,192: 50/64 (78%) frequency bands complete ≥1 cycle
#   Context  32,768: 60/64 (94%) frequency bands complete ≥1 cycle
#   Context 128,000: 64/64 (100%) frequency bands complete ≥1 cycle

print()
# theta=500000 (Llama-3):
analyze_rope_frequency_spectrum(dim=128, theta=500000)
#   Context   4,096: 32/64 (50%) frequency bands complete ≥1 cycle
#   Context   8,192: 35/64 (55%) frequency bands complete ≥1 cycle
#   Context  32,768: 42/64 (66%) frequency bands complete ≥1 cycle
#   Context 128,000: 49/64 (77%) frequency bands complete ≥1 cycle

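Higher theta means fewer frequency bands complete a full rotation within a given context length. That sounds like a drawback, but the slow bands then change very little per position, so the angles reached at positions beyond the training range stay closer to angles the model has already seen. This is why a larger theta tends to help long-context extension.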

ALiBi - Attention with Linear Biases

The Core Idea

ALiBi (Press et al. 2021, "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation") takes a completely different approach: instead of encoding position in the Q and K vectors, it adds a fixed (non-learned) bias to the attention logits that penalizes attending to distant tokens.

For each head $h$ , ALiBi subtracts a constant $m_h$ times the distance between tokens from the attention score:

$\text{attention\_score}(q_i, k_j) = \frac{q_i \cdot k_j}{\sqrt{d_k}} - m_h \cdot |i - j|$

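The head-specific slopes $m_h$ are fixed (not learned) and form a geometric sequence. For $n$ heads (with $n$ a power of two), the sequence starts at $2^{-8/n}$ and uses the same value as its ratio: with 8 heads the slopes run from $\frac{1}{2^1}$ to $\frac{1}{2^8}$ , and with 16 heads they run from $\frac{1}{2^{0.5}}$ to $\frac{1}{2^{8}}$ in steps of half a power of two.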
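As a quick illustration of the slope schedule, here is a minimal sketch for power-of-two head counts, where the first slope and the ratio are both $2^{-8/n}$ . The function name `alibi_slopes` is an assumption, and non-power-of-two head counts are deliberately left out of this sketch.

```python
def alibi_slopes(n_heads: int) -> list[float]:
    """Slopes for a power-of-two head count: start = ratio = 2^(-8 / n_heads)."""
    if n_heads < 1 or n_heads & (n_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]


print(alibi_slopes(8))       # 1/2, 1/4, ..., 1/256
print(alibi_slopes(16)[:4])  # 2^-0.5, 2^-1, 2^-1.5, 2^-2
```

Heads with large slopes attend almost only to nearby tokens, while heads with small slopes keep a much wider window.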

Why ALiBi Extrapolates

ALiBi's key claim: models trained with ALiBi on short sequences (e.g., 1024 tokens) can attend to longer sequences at inference (e.g., 2048 tokens) without any fine-tuning.

The mechanism: the linear bias term naturally penalizes very distant tokens. Even for positions beyond the training range, the linear distance penalty remains meaningful and monotonically increasing. There's no "unknown position" - just a large linear penalty. The model has learned to interpret "high linear penalty = far away" during training; this interpretation extrapolates to even larger distances.
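One way to see this mechanism is to look at the attention weights produced by the bias alone. The sketch below makes a strong simplifying assumption: all content scores $q_i \cdot k_j$ are set to zero, so only the ALiBi term remains. It uses a single slope of $2^{-4}$ and a hypothetical helper `alibi_weights`.

```python
import math


def alibi_weights(seq_len: int, slope: float) -> list[float]:
    """Softmax over ALiBi biases for the last query, with content scores set to 0.

    Index d of the result is the weight on the key d positions back.
    """
    logits = [-slope * d for d in range(seq_len)]
    total = sum(math.exp(x) for x in logits)
    return [math.exp(x) / total for x in logits]


slope = 2.0 ** -4  # one of the 8-head slopes
for seq_len in (1024, 2048, 8192):
    w = alibi_weights(seq_len, slope)
    print(f"len {seq_len:>5}: mass on nearest 64 keys = {sum(w[:64]):.6f}, "
          f"weight at max distance = {w[-1]:.3e}")
```

Under these assumptions the share of attention on nearby keys is essentially unchanged as the sequence grows, which is the sense in which the bias extends past the training length without introducing an unseen position.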

import numpy as np
import torch
import torch.nn.functional as F

def get_alibi_slopes(n_heads: int) -> torch.Tensor:
    """
    Compute ALiBi slopes for each attention head.

    Following the formula from Press et al. 2021:
    Slopes form a geometric sequence in powers of 2.
    """
    def get_slopes_power_of_2(n: int) -> list[float]:
        start = 2 ** (-(2 ** -(np.log2(n) - 3)))
        ratio = start
        return [start * ratio**i for i in range(n)]

    if np.log2(n_heads).is_integer():
        slopes = get_slopes_power_of_2(n_heads)
    else:
        # Handle non-power-of-2 head counts
        closest_power_of_2 = 2 ** np.floor(np.log2(n_heads))
        slopes = (
            get_slopes_power_of_2(int(closest_power_of_2))
            + get_slopes_power_of_2(2 * int(closest_power_of_2))[0::2][:n_heads - int(closest_power_of_2)]
        )

    return torch.tensor(slopes, dtype=torch.float32)


def alibi_attention(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
) -> torch.Tensor:
    """
    Multi-head attention with ALiBi positional bias.

    query/key/value: (batch, n_heads, seq_len, head_dim)
    """
    batch, n_heads, seq_len, head_dim = query.shape
    scale = head_dim ** -0.5

    # Compute attention scores (no position encoding in Q, K)
    scores = torch.matmul(query, key.transpose(-2, -1)) * scale
    # scores shape: (batch, n_heads, seq_len, seq_len)

    # Compute ALiBi bias
    slopes = get_alibi_slopes(n_heads).to(query.device)
    # slopes shape: (n_heads,)

    # Create relative distance matrix
    positions = torch.arange(seq_len, device=query.device)
    # distances[i, j] = |i - j| (causal: only care about j <= i)
    distances = (positions.unsqueeze(0) - positions.unsqueeze(1)).abs()
    distances = distances.unsqueeze(0).unsqueeze(0)  # (1, 1, seq_len, seq_len)

    # Bias = -m_h * distance
    alibi_bias = -slopes.view(1, n_heads, 1, 1) * distances.float()
    # alibi_bias shape: (1, n_heads, seq_len, seq_len)

    # Add bias to attention scores
    scores = scores + alibi_bias

    # Causal masking
    causal_mask = torch.triu(torch.ones(seq_len, seq_len, device=query.device), diagonal=1)
    scores = scores.masked_fill(causal_mask.bool(), float('-inf'))

    weights = F.softmax(scores, dim=-1)
    output = torch.matmul(weights, value)

    return output

ALiBi's Extrapolation Limits

ALiBi extrapolates well to 2-3× its training length. Press et al. showed that a model trained on 1024 tokens maintains competitive perplexity at 2048 tokens. But the linear decay assumption becomes problematic at very long contexts:

All long distances look the same: At very long distances, the linear penalty dominates all other attention score components. Tokens 5,000 and 50,000 positions back are both pushed toward zero attention weight, even when the model needs to retrieve content from one of them.
No frequency bands for long-range patterns: RoPE's multi-frequency spectrum allows the model to learn different attention patterns at different scales (local syntax vs. global structure). ALiBi's single linear decay per head doesn't support this hierarchy.
Headcount constraint: The number of slope values is fixed by the number of heads. With 32 heads, there are 32 distinct distance-decay rates - a limited vocabulary for position-dependent patterns.
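The first limitation can be put in numbers. The sketch below computes how much larger the content score $q \cdot k / \sqrt{d_k}$ must be for a distant key to tie with a nearby one under the 8-head slope schedule. The helper `required_content_advantage` and the distances are illustrative assumptions.

```python
def required_content_advantage(slope: float, near: int, far: int) -> float:
    """Extra content logit a key at distance `far` needs to tie one at `near`."""
    return slope * (far - near)


slopes_8_heads = [2.0 ** -(h + 1) for h in range(8)]
for far in (5_000, 50_000):
    gaps = [required_content_advantage(m, near=10, far=far) for m in slopes_8_heads]
    print(f"distance {far:>6}: steepest head needs +{gaps[0]:,.1f} logits, "
          f"shallowest head needs +{gaps[-1]:,.1f}")
```

Even the shallowest head requires a content advantage that grows linearly with distance, so retrieval from very distant tokens becomes increasingly unlikely.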

These limitations explain why ALiBi is found primarily in models with moderate context lengths (MPT-7B: 2K-8K, BLOOMZ: 2K) and RoPE dominates in models designed for very long contexts (Llama-3.1: 128K, Mistral: 32K).

RoPE vs ALiBi - Direct Comparison

| Property | ALiBi | RoPE |
| --- | --- | --- |
| Extrapolation (no fine-tune) | Good (2-3×) | Limited (1.2-1.5×) |
| Extrapolation (with fine-tune) | Limited | Excellent (10-32×+) |
| Relative position encoding | Implicit (via bias) | Explicit (via rotation) |
| Learned parameters | None (slopes are fixed) | None (angles are computed) |
| Integration with Flash Attn | Requires modification | Native support |
| Support in modern frameworks | Decreasing | Dominant |
| Multi-frequency position info | No (single linear decay) | Yes (frequency spectrum) |

Why RoPE Dominates Modern Long-Context Models

The choice of RoPE over ALiBi for modern long-context models comes down to one key property: RoPE is systematically extensible.

The failure mode of RoPE at long context (perplexity degradation at positions beyond training range) has a principled fix: modify the rotation frequencies to spread the training range over a larger position space. This can be done through:

Linear scaling (position interpolation): multiply all position indices by a factor below 1, so that longer sequences map back into the trained position range
NTK-aware scaling: scale high-frequency components differently from low-frequency ones
YaRN: apply a ramp function that scales each frequency band based on its wavelength
Increasing theta: pre-train with larger theta to have wider base range
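Two of these recipes fit in a few lines. The sketch below assumes a head dimension of 128, a base of 10,000 and a 4× extension. `ntk_scaled_base` uses the commonly cited NTK-aware form `base * scale ** (dim / (dim - 2))`; treat that formula and the helper names as assumptions rather than a reference implementation. YaRN is omitted.

```python
def inv_freq(dim: int, base: float) -> list[float]:
    """theta_i = base^(-2i/dim) for i = 0 .. dim/2 - 1."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]


def angles_position_interpolation(pos: int, dim: int, base: float, scale: float) -> list[float]:
    """Linear scaling: squeeze positions by `scale` so pos/scale stays in the trained range."""
    return [(pos / scale) * f for f in inv_freq(dim, base)]


def ntk_scaled_base(dim: int, base: float, scale: float) -> float:
    """NTK-aware scaling: enlarge the base instead of shrinking positions."""
    return base * scale ** (dim / (dim - 2))


dim, base, scale = 128, 10_000.0, 4.0
orig = inv_freq(dim, base)
ntk = inv_freq(dim, ntk_scaled_base(dim, base, scale))

print("fastest pair ratio (NTK / original):", ntk[0] / orig[0])    # unchanged
print("slowest pair ratio (NTK / original):", ntk[-1] / orig[-1])  # about 1/scale
```

Linear scaling slows every frequency by the same factor. The NTK-aware variant leaves the fastest pair untouched and slows the slowest pair by about `1/scale`, which preserves fine-grained local position information.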

ALiBi's linear decay structure doesn't have the same principled extensibility. You can fine-tune on longer sequences, but there's no analogy to RoPE's frequency interpolation that gives researchers a well-motivated scaling recipe.

Additionally, RoPE integrates naturally with Flash Attention implementations because the rotation is applied to Q and K before the attention kernel runs, while ALiBi's bias terms require special handling inside the kernel.

Practical Notes on Theta Values

Different models use different theta values for RoPE:

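| Model | theta | Training context (tokens) |
| --- | --- | --- |
| Llama-1 | 10,000 | 2,048 |
| Llama-2 | 10,000 | 4,096 |
| Code Llama | 1,000,000 | 16,384 |
| Llama-3 base | 500,000 | 8,192 |
| Llama-3.1 | 500,000 + frequency-dependent RoPE scaling | 128,000 |
| Mistral-7B | 10,000 | 32,768 |
| Gemma-2 | 10,000 | 8,192 |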

Code Llama's extreme theta (1M) was a deliberate choice for code contexts: code files can be very long, and positions far into a file should be distinguishable. With theta=1M, the slow-frequency dimensions have a wavelength of $2\pi \times 1,000,000 \approx 6.3M$ positions - far beyond any practical code file length.

Common Mistakes

:::danger Don't assume RoPE automatically extrapolates to arbitrary lengths

RoPE provides relative position encoding, but it does NOT automatically generalize to lengths much beyond the training maximum. The rotation angles for positions far outside the training range have never been seen during training. Models degrade sharply at 2× and catastrophically at 5× their training context without the scaling techniques described in Lesson 03.

:::

:::warning Don't confuse theta (base frequency) with context length

Increasing theta extends the wavelength of RoPE's slowest frequencies, which helps with long-context extrapolation - but it doesn't directly set the context length. A model with theta=500,000 still needs to be trained (or fine-tuned) on long contexts to actually perform well at long contexts. Theta is one component of the recipe, not the complete solution.

:::

:::tip Check which RoPE variant your model uses

Models using YaRN, LongRoPE, or other RoPE modifications have specific configuration parameters that must be set correctly at inference time. When loading a model like Llama-3.1 that ships with a RoPE scaling configuration, ensure your transformers version supports the model's RoPE scaling type. An older version may reject the configuration or ignore it, giving degraded performance at long contexts.

:::
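A quick way to act on the last tip is to inspect the model's configuration before running long-context inference. The sketch below works on a plain dictionary and assumes the key names `rope_theta` and `rope_scaling` (with `rope_type` or `type` inside it), as commonly found in Llama-family `config.json` files. Other architectures may use different names, and the example values are hypothetical.

```python
import json


def describe_rope_config(config: dict) -> str:
    """Summarize RoPE settings from a Hugging Face style config dict.

    The key names used here are assumptions based on common Llama-family
    configs; check your model's own config.json.
    """
    theta = config.get("rope_theta", "unspecified")
    scaling = config.get("rope_scaling")
    if not scaling:
        return f"rope_theta={theta}, no RoPE scaling configured"
    kind = scaling.get("rope_type") or scaling.get("type") or "unknown"
    factor = scaling.get("factor", "unspecified")
    return f"rope_theta={theta}, scaling type={kind}, factor={factor}"


# Hypothetical config fragment; real values come from the model's config.json.
example = json.loads(
    '{"rope_theta": 500000.0, "rope_scaling": {"rope_type": "llama3", "factor": 8.0}}'
)
print(describe_rope_config(example))
print(describe_rope_config({"rope_theta": 10000.0}))
```

If the reported scaling type is one your inference stack does not recognize, that is a signal to upgrade before trusting long-context results.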

Interview Q&A

Q: How does RoPE encode relative position information? Why is this useful for long context?

A: RoPE encodes position by applying a rotation matrix to the query and key vectors: query at position $m$ is rotated by $R_m$ and key at position $n$ is rotated by $R_n$ . The inner product $\langle R_m q, R_n k \rangle$ depends only on the relative position $m - n$ , not on the absolute values of $m$ and $n$ . This is useful for long context because the model learns patterns based on relative distances - "token A is 5 positions before token B" - rather than absolute positions. As long as the relative distances at inference time are similar to those seen during training, the model can apply its learned patterns regardless of where in the sequence the tokens appear.

Q: What is the theta hyperparameter in RoPE and how does it affect long-context behavior?

A: Theta is the base that determines the rotation speed of each dimension pair. With theta=10,000, dimension pair $i$ has rotation frequency $\theta_i = 1/10000^{2i/d}$ . Higher theta means slower-changing frequencies: the slowest frequency has a wavelength of roughly $2\pi \times \text{theta}$ positions. Increasing theta (e.g., from 10,000 to 500,000 in Llama-3) extends the position range covered before the model encounters "out-of-distribution" rotation angles. Llama-3 uses theta=500,000 for its 8K training context, and this larger theta also helps with the subsequent extension to 128K in Llama-3.1, which adds a frequency-dependent RoPE scaling (similar in spirit to YaRN) on top.

Q: How does ALiBi achieve length extrapolation and why doesn't it scale to very long contexts?

A: ALiBi adds a linear bias $-m_h \cdot |i-j|$ to attention scores, where $m_h$ is a head-specific slope. This naturally penalizes distant tokens. Because the penalty is purely linear in distance, even positions beyond the training range receive a meaningful (and monotonically larger) penalty - the model has learned to interpret "large penalty = far away" during training, and this interpretation extends to larger distances. However, ALiBi doesn't scale to very long contexts because: (1) the linear decay becomes so dominant at large distances that distant tokens are effectively ignored regardless of their content; (2) there's no multi-frequency structure to encode long-range vs. short-range patterns with different decay rates; (3) fine-tuning on very long contexts doesn't help as much as with RoPE because the linear structure can't be systematically modified.

Q: Why did RoPE become the dominant position encoding for long-context models over ALiBi?

A: Three main reasons. First, RoPE is systematically extensible: its frequency-based structure allows principled modifications (theta scaling, interpolation, YaRN) that extend context length without full retraining. ALiBi's linear structure doesn't have analogous extension techniques. Second, RoPE is applied to Q and K before the attention kernel runs, so it works with FlashAttention kernels unchanged, while ALiBi's additive bias term has to be supported inside the kernel. Third, empirically RoPE with fine-tuning achieves better performance at very long contexts (128K+) than ALiBi with comparable fine-tuning, partly because the multi-frequency spectrum supports richer position-dependent attention patterns.

Q: What is the frequency spectrum of RoPE and why does it matter for long-range dependencies?

A: RoPE applies different rotation speeds to different dimension pairs. High-frequency dimensions (small wavelength) rotate quickly through their cycle - they change significantly between adjacent tokens and encode short-range positional information. Low-frequency dimensions (large wavelength) rotate slowly - they change minimally between adjacent tokens but distinguish positions separated by thousands of tokens. This multi-scale structure means the model can learn different patterns for local (syntactic) and global (semantic) dependencies simultaneously. The high-frequency bands handle "this token is the next one after" distinctions; the low-frequency bands handle "this token is in section 3 vs section 10" distinctions. ALiBi's single linear decay rate per head doesn't support this hierarchy.
