Rotary Position Embeddings (RoPE)
Instead of adding positional encoding vectors to the input to encode token position, RoPE encodes position by rotating the query (Q) and key (K) vectors. Each position $i$ has a rotation matrix $R_i$ that is applied before the attention dot product:

$$q_i = R_i q, \qquad k_j = R_j k$$
When computing attention between positions $i$ and $j$, the score becomes:

$$q_i^\top k_j = (R_i q)^\top (R_j k) = q^\top R_i^\top R_j\, k$$
The key property: $R_i^\top R_j = R_{j-i}$, so the score depends only on the relative offset $(i - j)$, not on the absolute positions. This follows from trig identities: terms like $\cos(i\theta)\cos(j\theta) + \sin(i\theta)\sin(j\theta)$ simplify to $\cos((i-j)\theta)$.
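A quick numeric check of this property, sketched in 2D with NumPy (the vectors and positions below are arbitrary illustrative values): two (query, key) position pairs with the same offset $i - j$ produce the same attention score.

```python
import numpy as np

def rot(pos, theta=1.0):
    # 2D rotation matrix R_pos for angle pos * theta
    a = pos * theta
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

q = np.array([0.3, -1.2])
k = np.array([0.7,  0.5])

# Scores for (i, j) = (5, 2) and (i, j) = (9, 6): same offset i - j = 3
s1 = (rot(5) @ q) @ (rot(2) @ k)
s2 = (rot(9) @ q) @ (rot(6) @ k)
assert np.isclose(s1, s2)  # score depends only on the relative offset
```

Shifting both positions by the same amount rotates both vectors by the same extra angle, which cancels in the dot product.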
What's the rotation?
RoPE rotates pairs of dimensions. In the 2D case, position $i$ uses the standard rotation matrix:

$$R_i = \begin{pmatrix} \cos(i\theta) & -\sin(i\theta) \\ \sin(i\theta) & \cos(i\theta) \end{pmatrix}$$
For higher dimensions, this rotation is applied independently to consecutive pairs of dimensions, each pair with its own frequency $\theta_k = 10000^{-2k/d}$, allowing the model to capture both fine-grained (high-frequency) and long-range (low-frequency) positional information.
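The multi-frequency scheme can be sketched as follows; this is a minimal NumPy implementation assuming $d = 8$ and the $\theta_k = 10000^{-2k/d}$ frequency schedule, with each (even, odd) dimension pair rotated by its own angle:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # x: (d,) vector with d even; rotate each (even, odd) dimension pair
    d = x.shape[-1]
    k = np.arange(d // 2)
    theta = base ** (-2.0 * k / d)     # per-pair frequencies, high to low
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin    # 2D rotation applied pairwise
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
# Scores match whenever the relative offset is the same (10-7 == 4-1)
assert np.isclose(rope(q, 10) @ rope(k, 7), rope(q, 4) @ rope(k, 1))
```

Because each pair is rotated by an orthogonal matrix, the vector norm is preserved; only the relative phase between query and key positions affects the score.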
RoPE applies an absolute rotation at each position, yet the attention dot product depends only on relative distance, unifying the absolute and relative approaches without explicitly computing or storing relative-position biases.