Rotary Position Embeddings (RoPE)
Instead of adding positional encoding vectors to the input to encode token position, RoPE encodes position by rotating the query (Q) and key (K) vectors. Each position $i$ has a rotation matrix $R_i$ that is applied before the attention dot product:

$$q_i = R_i q, \qquad k_j = R_j k$$
When computing attention between positions $i$ and $j$, the score becomes:

$$q_i^\top k_j = (R_i q)^\top (R_j k) = q^\top R_i^\top R_j\, k$$
The key property: $R_i^\top R_j = R_{j-i}$, so the score depends only on the relative offset $(i - j)$, not on the absolute positions. This follows from trig identities: terms like $\cos(i\theta)\cos(j\theta) + \sin(i\theta)\sin(j\theta)$ simplify to $\cos((i-j)\theta)$.
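A quick numeric check of this property, sketched in 2D with NumPy (the vectors and positions below are arbitrary illustrative values): two (query, key) position pairs with the same offset $i - j$ produce the same attention score.

```python
import numpy as np

def rot(pos, theta=1.0):
    # 2D rotation matrix R_pos for angle pos * theta
    a = pos * theta
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

q = np.array([0.3, -1.2])
k = np.array([0.7,  0.5])

# Scores for (i, j) = (5, 2) and (i, j) = (9, 6): same offset i - j = 3
s1 = (rot(5) @ q) @ (rot(2) @ k)
s2 = (rot(9) @ q) @ (rot(6) @ k)
assert np.isclose(s1, s2)  # score depends only on the relative offset
```

Shifting both positions by the same amount rotates both vectors by the same extra angle, which cancels in the dot product.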
What's the rotation?
RoPE rotates pairs of dimensions. In the 2D case, position $i$ uses the standard rotation matrix:

$$R_i = \begin{pmatrix} \cos(i\theta) & -\sin(i\theta) \\ \sin(i\theta) & \cos(i\theta) \end{pmatrix}$$
For higher dimensions, this rotation is applied independently to consecutive pairs of dimensions, each pair with its own frequency $\theta_k = 10000^{-2k/d}$, allowing the model to capture both fine-grained (high-frequency) and long-range (low-frequency) positional information.
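The multi-frequency scheme can be sketched as follows; this is a minimal NumPy implementation assuming $d = 8$ and the $\theta_k = 10000^{-2k/d}$ frequency schedule, with each (even, odd) dimension pair rotated by its own angle:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # x: (d,) vector with d even; rotate each (even, odd) dimension pair
    d = x.shape[-1]
    k = np.arange(d // 2)
    theta = base ** (-2.0 * k / d)     # per-pair frequencies, high to low
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin    # 2D rotation applied pairwise
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
# Scores match whenever the relative offset is the same (10-7 == 4-1)
assert np.isclose(rope(q, 10) @ rope(k, 7), rope(q, 4) @ rope(k, 1))
```

Because each pair is rotated by an orthogonal matrix, the vector norm is preserved; only the relative phase between query and key positions affects the score.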
RoPE applies an absolute rotation at each position, yet the attention dot product depends only on relative distance, unifying the absolute and relative approaches without explicitly computing or storing relative-position biases.