Positional Encoding
Reading time: ~35 min · Interview relevance: High · Target roles: ML Engineer, AI Engineer, Research Engineer
The Silent Failure
The bug report came in at 11pm. The deployed translation service was producing outputs where word order seemed scrambled in a subtle way: "The dog bit the man" was being translated identically to "The man bit the dog." The semantic content was right, but order was wrong.
The engineer on call, Priya, pulled the model config. She found the issue in 10 minutes. Three weeks earlier, a junior engineer had "cleaned up" the preprocessing pipeline. In doing so, they'd quietly removed the line that added positional encodings to the input embeddings. The model was receiving token embeddings with no position information at all.
The result was exactly what you'd expect: the model treated "dog bit man" and "man bit dog" as identical bags of tokens. The attention mechanism saw the same content - just different orderings - and produced similar outputs. The loss during evaluation was actually lower (the model was "more certain" about its answers), which is why the regression hadn't been caught in automated testing.
This incident illustrates something fundamental about transformers: unlike RNNs, they have zero built-in notion of sequence order. An RNN processes token 1, then token 2, then token 3 - position is implicit in the processing order. A transformer processes all tokens simultaneously. Without explicit positional information, self-attention is permutation-equivariant: shuffling the input tokens merely shuffles the outputs the same way, so the model cannot distinguish "A then B" from "B then A."
Positional encoding is the mechanism that fixes this. It is not a clever trick - it is a fundamental requirement.
Why Transformers Need Explicit Position Information
The attention operation:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$

treats the sequence as a set, not a sequence. The output for position $i$ depends only on the content of the tokens, not their positions. If you permute the input tokens, you permute the outputs by the same permutation - the model is equivariant to permutation.
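The equivariance claim can be checked numerically. The following is a minimal sketch, assuming a single unmasked attention head with random weights; the helper name `self_attention` and the sizes are illustrative, not part of any library.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head, unmasked self-attention over x of shape (seq, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d = 5, 8
x = rng.normal(size=(seq_len, d))  # token embeddings with no position info
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(seq_len)
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)

# Permuting the inputs permutes the outputs in exactly the same way.
equivariant = np.allclose(out[perm], out_perm)
print(f"Permutation-equivariant: {equivariant}")
```

Note that a causal mask breaks this symmetry on its own, so the check applies to unmasked (encoder-style) attention.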
For many tasks, order is everything:
- "The cat ate the fish" vs "The fish ate the cat"
- "not good" vs "good not" (word order changes negation)
- Code: `return x + y` vs `return y + x` (equivalent) vs `x = y + return` (invalid)
To inject position information, the transformer adds a positional encoding vector $PE_{pos}$ to the token embedding $e_{pos}$:

$$
x_{pos} = e_{pos} + PE_{pos}
$$
This simple addition gives the model two types of information simultaneously: what the token is (from the embedding) and where it sits in the sequence (from the positional encoding).
Sinusoidal Positional Encoding (Original Transformer)
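Vaswani et al. (2017) used a fixed, non-learned function based on sine and cosine waves of different frequencies:

$$
PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

where:
- $pos$ is the position in the sequence (0, 1, 2, ...)
- $i$ is the dimension-pair index (0, 1, ..., $d_{model}/2 - 1$)
- $d_{model}$ is the embedding dimension
This creates a matrix of shape $(\text{max\_seq\_len}, d_{model})$ where each row is the encoding for one position.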
Intuition: Binary Counting Analogy
Think of binary numbers:
position 0: 000
position 1: 001
position 2: 010
position 3: 011
position 4: 100
The least significant bit flips every 1 step, the next bit every 2 steps, the next every 4 steps. Each bit is a different "frequency" for encoding position.
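A small sketch makes the flip rates concrete. The helper `bit_flip_period` is a name made up for this illustration; it reports how many positions pass before a given bit first changes value.

```python
def bit_flip_period(bit: int, num_positions: int = 16) -> int:
    """Return how many steps bit `bit` holds its value before it first flips."""
    values = [(pos >> bit) & 1 for pos in range(num_positions)]
    # First position at which the bit differs from the previous position
    return next(p for p in range(1, num_positions) if values[p] != values[p - 1])

for pos in range(5):
    print(f"position {pos}: {pos:03b}")

periods = [bit_flip_period(b) for b in range(3)]
print(periods)  # [1, 2, 4]
```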
Sinusoidal encoding is the continuous version: instead of bits that flip between 0 and 1, you have sine/cosine waves at different frequencies. Dimension pair $i$ uses angular frequency $\omega_i = 1/10000^{2i/d_{model}}$ - lower dimensions have high frequency (oscillate often), higher dimensions have low frequency (change slowly).
For $d_{model} = 512$ and positions up to 10,000:
- Dimension 0 (pair $i = 0$): oscillates with period $2\pi \approx 6.28$ positions (extremely fast)
- Dimensions 510-511 (last pair, $i = 255$): oscillate with a period just under $2\pi \cdot 10000 \approx 62{,}832$ positions (extremely slow - only about a sixth of a cycle over 10K positions)
This gives each position a unique "fingerprint" - a vector in $d_{model}$-dimensional space that uniquely identifies it.
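To put numbers on the fast and slow ends of the spectrum, here is a short sketch that computes the period of a sine/cosine pair. It assumes $d_{model} = 512$ and the standard base of 10000; the function name `wavelength` is illustrative.

```python
import math

def wavelength(pair_index: int, d_model: int, base: float = 10000.0) -> float:
    """Period (in positions) of the sin/cos pair at dimensions 2i and 2i+1."""
    return 2 * math.pi * base ** (2 * pair_index / d_model)

d_model = 512  # assumed for this example
fastest = wavelength(0, d_model)                 # first pair
slowest = wavelength(d_model // 2 - 1, d_model)  # last pair

print(f"fastest pair: period ~ {fastest:.2f} positions")
print(f"slowest pair: period ~ {slowest:.0f} positions")
```

The periods grow geometrically between these two extremes, which is what lets a handful of dimensions resolve neighbouring positions while others track coarse location.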
Why Sinusoidal? The Relative Position Property
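A key mathematical property: for any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$. For each sine/cosine pair:

$$
\begin{pmatrix} \sin(\omega_i (pos + k)) \\ \cos(\omega_i (pos + k)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix} \begin{pmatrix} \sin(\omega_i\, pos) \\ \cos(\omega_i\, pos) \end{pmatrix}
$$

where $\omega_i = 1/10000^{2i/d_{model}}$. The matrix depends only on the offset $k$, not on $pos$.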
This means the attention mechanism can learn to compute relative positions by learning a linear transformation. A head that wants to "look back 3 tokens" can represent this as a linear operation on positional encodings. This is a significant advantage over arbitrary learned embeddings.
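The linear-offset property is easy to verify numerically for one frequency. This is a sketch under the standard sinusoidal definition; `pe_pair` and `offset_matrix` are hypothetical helper names, and the values of `d_model`, `i`, and `k` are arbitrary.

```python
import numpy as np

def pe_pair(pos: float, omega: float) -> np.ndarray:
    """The (sin, cos) pair for one frequency at one position."""
    return np.array([np.sin(omega * pos), np.cos(omega * pos)])

def offset_matrix(k: float, omega: float) -> np.ndarray:
    """2x2 matrix that maps the pair at `pos` to the pair at `pos + k`."""
    return np.array([
        [np.cos(omega * k), np.sin(omega * k)],
        [-np.sin(omega * k), np.cos(omega * k)],
    ])

d_model, i, k = 64, 3, 5
omega = 1.0 / 10000 ** (2 * i / d_model)
M = offset_matrix(k, omega)  # depends on k and omega only, never on pos

holds = all(
    np.allclose(M @ pe_pair(pos, omega), pe_pair(pos + k, omega))
    for pos in range(100)
)
print(f"PE(pos + k) == M_k @ PE(pos) for all tested pos: {holds}")
```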
Implementation: Sinusoidal Encoding
```python
import numpy as np


def sinusoidal_positional_encoding(
    max_seq_len: int,
    d_model: int,
) -> np.ndarray:
    """
    Compute sinusoidal positional encodings.

    Returns:
        PE: shape (max_seq_len, d_model)
    """
    # Position indices: (max_seq_len, 1)
    position = np.arange(max_seq_len)[:, np.newaxis]
    # Division terms: (d_model/2,) - one per sine/cosine pair
    # 10000^(2i/d_model) for i = 0, 1, ..., d_model/2 - 1
    i = np.arange(0, d_model, 2)  # Even indices only (this array already holds 2i)
    div_term = np.power(10000.0, i / d_model)  # (d_model/2,)
    # Compute PE
    PE = np.zeros((max_seq_len, d_model))
    PE[:, 0::2] = np.sin(position / div_term)  # Even dims: sine
    PE[:, 1::2] = np.cos(position / div_term)  # Odd dims: cosine
    return PE


# Inspect the encoding
PE = sinusoidal_positional_encoding(max_seq_len=100, d_model=64)
print(f"Positional encoding shape: {PE.shape}")  # (100, 64)
print(f"Position 0, first 8 dims: {PE[0, :8].round(3)}")
print(f"Position 1, first 8 dims: {PE[1, :8].round(3)}")
print(f"Position 99, first 8 dims: {PE[99, :8].round(3)}")
# Each position has a unique signature
diffs = np.abs(PE[0] - PE[1]).sum()
print(f"L1 distance between pos 0 and pos 1: {diffs:.3f}")  # Should be non-zero
```
```python
# PyTorch module version
import math

import torch
import torch.nn as nn


class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        # Compute once at init, register as buffer (not a parameter)
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # (1, max_len, d_model)
        # register_buffer: saved with model, moved with .to(device), not a parameter
        self.register_buffer('pe', pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: (batch, seq, d_model) - token embeddings
        Returns:
            x + PE: (batch, seq, d_model)
        """
        seq_len = x.size(1)
        x = x + self.pe[:, :seq_len, :]
        return self.dropout(x)


# Test
pe_module = SinusoidalPositionalEncoding(d_model=512, max_len=1024)
x = torch.randn(2, 32, 512)  # (batch, seq, d_model)
out = pe_module(x)
print(f"\nAfter PE addition: {out.shape}")  # (2, 32, 512)
```
Learned Positional Embeddings (GPT Style)
Instead of a fixed function, learned positional embeddings are trainable parameters:
```python
import torch
import torch.nn as nn


class LearnedPositionalEmbedding(nn.Module):
    def __init__(self, max_seq_len: int, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.embedding = nn.Embedding(max_seq_len, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (batch, seq, d_model)"""
        batch, seq_len, _ = x.shape
        positions = torch.arange(seq_len, device=x.device).unsqueeze(0)  # (1, seq)
        pos_embeddings = self.embedding(positions)  # (1, seq, d_model)
        return self.dropout(x + pos_embeddings)
```
Differences from sinusoidal:
- Learned during training - the model can adapt position representations to the task
- Cannot generalize beyond `max_seq_len` - position 513 in a model trained on 512-length sequences is undefined
- GPT-2 uses learned embeddings with `max_seq_len = 1024`; GPT-3 uses `max_seq_len = 2048`
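The hard limit of a learned table can be seen directly with `nn.Embedding`. The following sketch assumes a 512-entry table on CPU; the sizes are chosen only for illustration.

```python
import torch
import torch.nn as nn

max_seq_len, d_model = 512, 16
pos_embedding = nn.Embedding(max_seq_len, d_model)

in_range = pos_embedding(torch.tensor([0, 511]))  # valid indices are 0..511
print(in_range.shape)  # torch.Size([2, 16])

try:
    pos_embedding(torch.tensor([512]))  # first index with no learned row
    raised = False
except IndexError as err:
    raised = True
    print(f"IndexError: {err}")
```

There is simply no row to look up past the end of the table, so extending the context requires enlarging the table and training the new rows.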
When to use learned vs sinusoidal:
- Sinusoidal: when you need to generalize to longer sequences than seen in training
- Learned: when training sequence length matches inference length, and you want maximum flexibility
In practice, for modern LLMs, neither approach works well for long-context generalization. This motivated newer methods.
RoPE: Rotary Position Embedding
RoPE (Su et al., 2021) is the position encoding used by LLaMA, LLaMA-2, PaLM, GPT-NeoX, and many other modern open-source LLMs. It is fundamentally different from additive approaches.
The Core Idea
Instead of adding a position vector to the token embedding, RoPE rotates the query and key vectors by an angle that depends on their position. The rotation is applied such that the dot product $\langle f(q, m), f(k, n) \rangle$ (between query at position $m$ and key at position $n$) depends only on the relative position $m - n$, not on absolute positions.
For a 2D case:

$$
f(x, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
$$

This is a rotation by angle $m\theta$ in 2D. For $d$-dimensional vectors, we apply this to consecutive pairs of dimensions, each with a different frequency $\theta_i = 10000^{-2i/d}$.
The key property: when you compute $\langle f(q, m), f(k, n) \rangle$, the result depends only on $m - n$ (relative position) and the content of $q$ and $k$ - not on absolute $m$ or $n$.
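The relative-position property can be checked with a few lines of NumPy. This is a minimal sketch for a single vector, assuming the interleaved-pair layout and base 10000; the helper `rope` is written for this example only.

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive pairs of a 1-D vector x by angles pos * theta_i."""
    d = x.shape[-1]
    theta = 1.0 / base ** (np.arange(0, d, 2) / d)  # (d/2,)
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[1::2] * cos + x[0::2] * sin
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)

# Same offset m - n = 3 at two very different absolute positions
score_near = rope(q, 5) @ rope(k, 2)
score_far = rope(q, 105) @ rope(k, 102)
# A different offset (m - n = 1) generally gives a different score
score_other = rope(q, 5) @ rope(k, 4)

print(np.isclose(score_near, score_far))  # True
print(score_near, score_other)
```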
Why RoPE Works Better for Long Contexts
- Relative positions: The attention score inherently encodes relative position, making the model more robust to positional shift.
- Generalization: RoPE can be extended to longer sequences using "RoPE scaling" techniques (position interpolation, YaRN, LongRoPE) that rescale positions or per-dimension frequencies. LLaMA-2 is trained on 4K context but can be extended to 32K-100K with position interpolation, typically after a short fine-tuning run.
- No wasted dimensions: Sinusoidal encoding uses the embedding dimension for both content and position. RoPE encodes position in the rotation of existing vectors, leaving the full dimension for content.
```python
import torch


def apply_rope(
    x: torch.Tensor,
    position_ids: torch.Tensor,
    base: float = 10000.0,
) -> torch.Tensor:
    """
    Apply Rotary Position Embedding to query or key tensor.

    Args:
        x: (batch, heads, seq, head_dim)
        position_ids: (batch, seq) - absolute positions
        base: frequency base (10000 for standard RoPE)
    Returns:
        x_rotated: same shape as x
    """
    head_dim = x.shape[-1]
    device = x.device
    # Compute frequencies for each pair of dimensions
    # theta_i = 1 / base^(2i/d) for i = 0, ..., d/2 - 1
    dim_indices = torch.arange(0, head_dim, 2, device=device).float()
    theta = 1.0 / (base ** (dim_indices / head_dim))  # (head_dim/2,)
    # Compute rotation angles for each position
    # position_ids: (batch, seq) -> (batch, seq, 1)
    # theta: (head_dim/2,)
    # outer product -> (batch, seq, head_dim/2)
    angles = position_ids.float().unsqueeze(-1) * theta.unsqueeze(0).unsqueeze(0)
    # Expand for all heads: (batch, 1, seq, head_dim/2)
    angles = angles.unsqueeze(1)
    # Create cos and sin: (batch, 1, seq, head_dim/2)
    cos_a = torch.cos(angles)
    sin_a = torch.sin(angles)
    # Split x into pairs: x_even, x_odd (batch, heads, seq, head_dim/2) each
    x_even = x[..., 0::2]
    x_odd = x[..., 1::2]
    # Apply rotation: [x_even, x_odd] * [cos, cos] + [-x_odd, x_even] * [sin, sin]
    x_rotated_even = x_even * cos_a - x_odd * sin_a
    x_rotated_odd = x_odd * cos_a + x_even * sin_a
    # Interleave back
    x_rotated = torch.stack([x_rotated_even, x_rotated_odd], dim=-1).flatten(-2)
    return x_rotated


# Quick test
torch.manual_seed(42)
batch, heads, seq, head_dim = 2, 8, 16, 64
q = torch.randn(batch, heads, seq, head_dim)
position_ids = torch.arange(seq).unsqueeze(0).expand(batch, -1)
q_rope = apply_rope(q, position_ids)
print(f"Q shape: {q.shape}")  # (2, 8, 16, 64)
print(f"Q_rope shape: {q_rope.shape}")  # (2, 8, 16, 64)
print(f"Norm preserved: {torch.allclose(q.norm(dim=-1), q_rope.norm(dim=-1), atol=1e-5)}")  # True
```
ALiBi: Attention with Linear Biases
ALiBi (Press et al., 2022) takes a different approach: instead of modifying the input embeddings or Q/K vectors, it adds a static bias to the attention scores based on distance.
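For a query at position $i$ and a key at position $j \le i$ (scaling by $\sqrt{d_k}$ omitted for brevity):

$$
\text{score}(i, j) = q_i \cdot k_j - m \cdot (i - j)
$$

where $m$ is a head-specific slope (decreasing geometric sequence across heads: $2^{-8/n}, 2^{-16/n}, \dots, 2^{-8}$ for $n$ heads).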
The effect: tokens farther away get a linearly increasing penalty to their attention score. Nearby tokens are preferred; distant tokens are suppressed. The strength of this preference varies across heads (some heads attend more globally, others more locally).
Advantage for long-context generalization: ALiBi is designed to extrapolate to sequences longer than training. At test time with 4× the training length, the bias slopes simply extend linearly - the model has seen the bias shape at every distance up to its training length, and extrapolates naturally.
Disadvantage: The model is biased toward local attention. Tasks that require long-range dependencies may underperform compared to RoPE-based models.
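The bias itself is cheap to build. Below is a minimal sketch of the slopes and the causal bias matrix, assuming the number of heads is a power of two (the case with the simple geometric sequence); `alibi_slopes` and `alibi_bias` are illustrative names, not a library API.

```python
import numpy as np

def alibi_slopes(num_heads: int) -> np.ndarray:
    """Slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8); assumes n is a power of two."""
    ratio = 2.0 ** (-8.0 / num_heads)
    return ratio ** np.arange(1, num_heads + 1)

def alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    """Bias of shape (heads, seq, seq): -slope * (i - j) for keys j <= query i."""
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]  # i - j
    distance = np.maximum(distance, 0)      # future keys are handled by the causal mask
    return -alibi_slopes(num_heads)[:, None, None] * distance

slopes = alibi_slopes(8)
bias = alibi_bias(num_heads=8, seq_len=4)
print(slopes[:3])  # 0.5, 0.25, 0.125
print(bias[0])     # head 0: zero on the diagonal, -0.5 per step of distance
```

The bias is added to the attention scores before the softmax. Because it depends only on distance, the same function can be evaluated for any sequence length at inference time.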
Comparison: When to Use What
| Encoding | Parameters | Generalization | Used by |
|---|---|---|---|
| Sinusoidal (fixed) | 0 | Good (designed for it) | Original transformer, some BERT variants |
| Learned | `max_seq_len × d_model` | Poor (extrapolation fails) | GPT-2, GPT-3 |
| RoPE | 0 (computed) | Excellent with scaling | LLaMA, LLaMA-2, Mistral, GPT-NeoX |
| ALiBi | 0 (computed) | Good (linear extrapolation) | MPT, BLOOM |
| xPos | 0 (computed) | Excellent | Used in some research models |
Production Engineering Notes
RoPE Scaling for Long Context
LLaMA-2 is pretrained with 4096 max context. To extend to 32K or 100K:
Linear position interpolation (Chen et al., 2023): Scale all position indices by scale = original_max / target_max. A position of 32768 becomes position 32768 * (4096/32768) = 4096 - within the training range.
YaRN (Peng et al., 2023): More sophisticated - different frequency components of RoPE are scaled differently (low-frequency components scale more, high-frequency less). Achieves better perplexity at extended contexts than linear interpolation.
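```python
def apply_rope_scaled(x, position_ids, base=10000.0, scale=1.0):
    """Linear scaled RoPE for context extension.

    scale = original_max / target_max (e.g. 4096 / 32768); 1.0 means no scaling.
    Relies on apply_rope() defined earlier.
    """
    # Compress position ids into the training range. Keep them as floats:
    # casting back to integers would collapse neighbouring positions together.
    scaled_positions = position_ids.float() * scale
    return apply_rope(x, scaled_positions, base)
```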
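The arithmetic of linear interpolation is worth seeing on its own. This sketch uses plain Python and the 4096 to 32768 example from the text; `interpolate_positions` is a hypothetical helper name.

```python
def interpolate_positions(seq_len: int, original_max: int) -> list:
    """Map positions 0..seq_len-1 into the trained range [0, original_max)."""
    if seq_len <= original_max:
        return [float(p) for p in range(seq_len)]  # nothing to compress
    factor = original_max / seq_len
    return [p * factor for p in range(seq_len)]

positions = interpolate_positions(seq_len=32768, original_max=4096)
print(positions[:4])   # [0.0, 0.125, 0.25, 0.375]
print(positions[-1])   # 4095.875
```

The interpolated positions are fractional, so eight neighbouring tokens now share one unit of the original position range. The rotation angles must be computed from these float values; rounding them to integers would give several tokens the same position.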
Caching Positional Encodings
For sinusoidal encoding, compute once and cache - it's a constant tensor. Use register_buffer to move it to GPU automatically:
```python
# Inside an nn.Module's __init__: registered buffers follow .to(device)
self.register_buffer('pe', precomputed_pe_tensor)
```
For RoPE, the cos/sin values are typically precomputed for the maximum sequence length at model init time and indexed at runtime.
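One way to structure that cache is sketched below. `RotaryCache` is a hypothetical helper written for this example, assuming base 10000 and integer position IDs; real model code often organizes this differently.

```python
import torch
import torch.nn as nn

class RotaryCache(nn.Module):
    """Hypothetical helper: precompute RoPE cos/sin tables once, index at runtime."""

    def __init__(self, head_dim: int, max_seq_len: int, base: float = 10000.0):
        super().__init__()
        theta = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
        angles = torch.outer(torch.arange(max_seq_len).float(), theta)  # (max_seq_len, head_dim/2)
        # persistent=False: the tables are recomputable, so keep them out of the state_dict
        self.register_buffer("cos", angles.cos(), persistent=False)
        self.register_buffer("sin", angles.sin(), persistent=False)

    def forward(self, position_ids: torch.Tensor):
        """position_ids: (batch, seq) integers -> two (batch, seq, head_dim/2) tensors."""
        return self.cos[position_ids], self.sin[position_ids]

cache = RotaryCache(head_dim=64, max_seq_len=4096)
position_ids = torch.arange(16).unsqueeze(0)  # (1, 16)
cos, sin = cache(position_ids)
print(cos.shape, sin.shape)  # torch.Size([1, 16, 32]) torch.Size([1, 16, 32])
```

Indexing by `position_ids` rather than slicing `[:seq_len]` keeps the cache usable during incremental decoding, where each step may carry only the newest position.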
Common Mistakes
:::danger Using additive PE in a model designed for RoPE
If you load a pretrained LLaMA checkpoint (which uses RoPE) but apply sinusoidal additive PE in your code, the attention patterns will be completely wrong. The model was trained with rotated Q and K vectors - queries and keys have position encoded in their orientation, not their magnitude. Always match the positional encoding scheme to the pretrained model.
:::
:::danger Forgetting that learned PE cannot generalize beyond max_seq_len
GPT-2 has `max_seq_len = 1024`. If you feed it a 2048-token sequence, position IDs 1024-2047 are out of bounds for the embedding table. In PyTorch, an out-of-range index into `nn.Embedding` raises an `IndexError` on CPU (and a device-side assert on CUDA); other implementations may clamp or wrap silently. Always check the model's maximum sequence length before inference.
:::
:::warning Applying RoPE to the wrong vectors
RoPE is applied to Q and K only, not to V. Applying it to V is a common mistake that degrades performance. The intuition: RoPE makes the attention score sensitive to relative position, but the value vectors (what information is retrieved) should not be position-dependent - they carry content, not position.
:::
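As a concrete reference for where the rotation goes, here is a minimal attention sketch with RoPE on the queries and keys only. It assumes positions 0..seq-1, no causal mask, and base 10000; `rope` and `rope_attention` are names made up for this example.

```python
import math
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate consecutive dimension pairs of x: (batch, heads, seq, head_dim)."""
    seq_len, head_dim = x.shape[-2], x.shape[-1]
    theta = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), theta)  # (seq, head_dim/2)
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack(
        [x_even * cos - x_odd * sin, x_odd * cos + x_even * sin], dim=-1
    )
    return rotated.flatten(-2)

def rope_attention(q, k, v):
    """Scaled dot-product attention with RoPE on q and k; v is left untouched."""
    q, k = rope(q), rope(k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
q, k, v = (torch.randn(2, 4, 16, 32) for _ in range(3))
out = rope_attention(q, k, v)
print(out.shape)  # torch.Size([2, 4, 16, 32])
```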
:::tip Testing positional encoding correctness
A simple test for a bidirectional (unmasked) model: run it on "A B C" and "C B A" (reversed order). If the per-token outputs for the reversed input are just the original outputs in reverse order (or pooled outputs are identical), your positional encoding is broken. If the outputs genuinely differ, it's working. Add this as a unit test for any transformer implementation.
:::
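A reversal test along these lines might look like the sketch below. It assumes a single `nn.MultiheadAttention` layer as a stand-in for the model and random vectors as stand-ins for token embeddings; `sinusoidal_pe` and `reversal_is_visible` are hypothetical helpers.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_pe(seq_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal encoding of shape (seq_len, d_model)."""
    position = torch.arange(seq_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

def reversal_is_visible(use_pe: bool) -> bool:
    """True if reversing the input changes more than just the order of the outputs."""
    torch.manual_seed(0)
    d_model, seq_len = 32, 6
    attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True).eval()
    tokens = torch.randn(1, seq_len, d_model)  # stand-in for token embeddings
    pe = sinusoidal_pe(seq_len, d_model) if use_pe else torch.zeros(seq_len, d_model)
    x = tokens + pe
    x_rev = tokens.flip(dims=[1]) + pe         # same tokens, reversed order
    with torch.no_grad():
        out, _ = attn(x, x, x)
        out_rev, _ = attn(x_rev, x_rev, x_rev)
    # Without position info, out_rev is just out in reverse order.
    return not torch.allclose(out.flip(dims=[1]), out_rev, atol=1e-5)

assert reversal_is_visible(use_pe=True)       # order affects the result
assert not reversal_is_visible(use_pe=False)  # broken: order is invisible
```

The second assertion documents the failure mode itself: with the encoding removed, the layer cannot tell the two orderings apart.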
Interview Q&A
Q1: Why do transformers need positional encoding? What happens without it?
Answer: Transformers process all tokens simultaneously via attention. The attention operation is permutation-equivariant: if you permute the input tokens, the output tokens permute by the same permutation. There is no inherent notion of "earlier" or "later" in the sequence.
Without positional encoding:
- "The dog bit the man" and "The man bit the dog" produce identical intermediate representations (same tokens, just reordered)
- The model can only learn bag-of-words style statistics
- Word order, syntax, and grammatical structure become invisible
With positional encoding, each token's representation incorporates position information, breaking the permutation symmetry and allowing the model to learn order-dependent patterns.
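Empirically: ablation studies on encoder-decoder translation models show significant drops (5-10+ BLEU points is the figure often quoted) when positional encoding is removed. Causal decoder-only models are affected less, because the causal mask itself leaks some order information.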
Q2: Explain the sinusoidal positional encoding formula and why each design choice was made.
Answer: $PE_{(pos,\,2i)} = \sin(pos / 10000^{2i/d_{model}})$, $PE_{(pos,\,2i+1)} = \cos(pos / 10000^{2i/d_{model}})$
Why sine and cosine? Paired sin/cos at the same frequency gives a 2D rotation - mathematically, the encoding at position $pos + k$ is a linear rotation of the encoding at position $pos$. This means "look back $k$ tokens" can be expressed as a linear transformation, which the model can learn.
Why $10000^{2i/d_{model}}$? This creates wavelengths from $2\pi$ (fastest, dimension 0) to $10000 \cdot 2\pi$ (slowest, highest dimension). The spread ensures no two positions have the same encoding, and different positions differ in both high-frequency and low-frequency components - making them easy to distinguish.
Why addition (not concatenation)? Addition keeps the dimension fixed. The model can learn to separate content and position in the combined vector through the learned weight matrices.
Why not learned? The paper reports that learned positional embeddings gave nearly identical results. The sinusoidal version was preferred for two reasons: (1) fixed encodings may let the model extrapolate to sequences longer than those seen in training, (2) the sinusoidal form has the mathematically nice relative-position property.
Q3: What is RoPE and why is it better than sinusoidal for large language models?
Answer: RoPE (Rotary Position Embedding) encodes position by rotating query and key vectors in frequency-specific 2D subspaces, rather than adding a fixed vector to the embedding.
The key advantages:
Relative position sensitivity: The inner product $\langle f(q, m), f(k, n) \rangle$ depends only on $m - n$ (relative distance) and token content. This means attention naturally uses relative positions - "two tokens apart" has the same representation regardless of where in the sequence.
No wasted dimensions: Sinusoidal PE adds position information on top of token embeddings. RoPE applies position as a rotation - the same capacity is used for position-aware content.
Long-context extrapolation: RoPE supports "scaling" techniques (linear interpolation, YaRN) that allow extending context beyond training length with fine-tuning. LLaMA-2 is trained on 4K context and deployed at 32K+ with position interpolation.
No extra parameters: Like sinusoidal, RoPE requires no learned parameters - frequencies are computed from fixed formulas.
This is why most major open-source LLMs (LLaMA, Mistral, Gemma, Falcon, Qwen) use RoPE; notable exceptions such as BLOOM and MPT use ALiBi.
Q4: How does a transformer handle a sequence that is longer than what it was trained on?
Answer: It depends on the positional encoding type:
- Learned embeddings (GPT-2): Position IDs beyond `max_seq_len` are out of the embedding table bounds. You cannot directly extend. Requires fine-tuning with longer sequences.
- Sinusoidal (original transformer): Can technically compute encodings for any length. However, the model was not trained to use very large position values in the dot products, so long-context behavior degrades unpredictably.
- RoPE (modern LLMs): Two techniques:
  - Linear interpolation: Scale position IDs by `original_max / target_max`. Position 8192 in a 4K-trained model becomes `8192 * (4096/8192) = 4096`. This keeps positions within the trained range but loses some precision. Works surprisingly well.
  - YaRN: More sophisticated frequency-specific scaling. Achieves lower perplexity than linear interpolation at 16K-100K contexts.
- ALiBi: Extrapolates naturally. The linear bias extends - a key 5000 tokens away gets bias `-m * 5000`, which the model hasn't seen but is in the same linear family.
Q5: What is the "lost in the middle" problem, and is it related to positional encoding?
Answer: The "lost in the middle" problem (Liu et al., 2023) refers to the empirical observation that LLMs perform worse at retrieving information from the middle of long contexts compared to the beginning or end.
It is partially related to positional encoding but also has other causes:
Positional contribution: With attention, nearby tokens (high relative-position bias in RoPE/ALiBi) receive higher baseline attention scores. Tokens at the very beginning also receive high attention due to the "attention sink" phenomenon. Middle-sequence tokens compete with both effects.
Training data bias: Most training documents have important information at the beginning or end (abstracts, conclusions). Models learn to attend to these positions preferentially.
Solution approaches:
- Prompt engineering: put critical information at beginning or end
- "Lost in the middle" fine-tuning on tasks where middle retrieval matters
- RAG: instead of very long contexts, retrieve only the relevant chunk
This is an active research area - it shows that positional encoding alone doesn't fully solve long-context reasoning, and training data distribution matters as much as architecture.
