High-Dimensional Dot Product Normalization
In high-dimensional spaces, the typical magnitude (standard deviation) of the dot product between random vectors grows proportionally to $\sqrt{d_k}$, where $d_k$ is the dimension. Left unchecked, these large values can destabilize training in attention mechanisms and other neural networks.
Mathematical Explanation
Here's the step-by-step breakdown of the variance calculation:
The dot product: $\mathbf{q} \cdot \mathbf{k} = \sum_{i=1}^{d_k} q_i \times k_i$
Expected value:

$$E[\mathbf{q} \cdot \mathbf{k}] = \sum_{i=1}^{d_k} E[q_i k_i] = \sum_{i=1}^{d_k} E[q_i]\,E[k_i] = 0$$

(since $E[q_i] = E[k_i] = 0$ and components are independent)

Variance:

$$\text{Var}(\mathbf{q} \cdot \mathbf{k}) = \text{Var}\left(\sum_{i=1}^{d_k} q_i k_i\right)$$

Since components are independent:

$$\text{Var}(\mathbf{q} \cdot \mathbf{k}) = \sum_{i=1}^{d_k} \text{Var}(q_i k_i)$$

For independent random variables with zero mean (similar to reasoning in Weight Initialization in Deep Neural Networks > Initializing weights by preserving variance):

$$\text{Var}(q_i k_i) = E[q_i^2 k_i^2] - \left(E[q_i k_i]\right)^2 = E[q_i^2]\,E[k_i^2]$$

Since $\text{Var}(q_i) = E[q_i^2] - (E[q_i])^2 = E[q_i^2] = \sigma^2$ (and likewise for $k_i$):

$$\text{Var}(q_i k_i) = \sigma^2 \cdot \sigma^2 = \sigma^4$$

Therefore:

$$\text{Var}(\mathbf{q} \cdot \mathbf{k}) = d_k \, \sigma^4$$

Standard deviation:

$$\text{std}(\mathbf{q} \cdot \mathbf{k}) = \sigma^2 \sqrt{d_k}$$
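The derivation can be checked numerically. The sketch below (using NumPy; the sample count and dimensions are arbitrary choices) estimates $\text{Var}(\mathbf{q} \cdot \mathbf{k})$ from samples and compares it with the predicted $d_k \sigma^4$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0  # component standard deviation

# Empirical variance of q·k should be close to d_k * sigma**4
for d_k in (16, 64, 256):
    q = rng.normal(0.0, sigma, size=(100_000, d_k))
    k = rng.normal(0.0, sigma, size=(100_000, d_k))
    dots = np.einsum("nd,nd->n", q, k)  # row-wise dot products
    print(d_k, dots.var())  # empirical variance, predicted d_k * sigma**4
```

The printed variance grows linearly with $d_k$, matching the formula above.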
The Solution: Scale by $\frac{1}{\sqrt{d_k}}$
Dividing dot products by $\sqrt{d_k}$ normalizes the variance:

$$\text{Var}\left(\frac{\mathbf{q} \cdot \mathbf{k}}{\sqrt{d_k}}\right) = \frac{d_k \, \sigma^4}{d_k} = \sigma^4$$

This keeps the variance constant regardless of dimension: $\text{Var} = \sigma^4$ (equal to $1$ for unit-variance components).
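The normalization can also be verified empirically (a sketch assuming unit-variance Gaussian components, so the scaled variance should stay near $1$ for every $d_k$):

```python
import numpy as np

rng = np.random.default_rng(1)
for d_k in (16, 64, 256):
    q = rng.standard_normal((50_000, d_k))
    k = rng.standard_normal((50_000, d_k))
    # Dividing by sqrt(d_k) removes the dimension dependence of the variance
    scaled = np.einsum("nd,nd->n", q, k) / np.sqrt(d_k)
    print(d_k, scaled.var())  # stays near 1.0 for every d_k
```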
Why This Specific Scaling?
- Scaling too aggressively (e.g., by $\frac{1}{d_k}$): logits shrink toward zero and the softmax output becomes nearly uniform, washing out the attention pattern
- Not scaling at all: logits grow like $\sqrt{d_k}$, the softmax saturates on its largest input, and gradients vanish
- Just right ($\frac{1}{\sqrt{d_k}}$): logit variance stays independent of dimension, keeping the softmax in its sensitive regime
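The three regimes above can be illustrated by applying softmax to the same random scores at each scale (a sketch; the dimension and sequence length are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
d_k, seq_len = 512, 32

q = rng.standard_normal(d_k)
keys = rng.standard_normal((seq_len, d_k))
logits = keys @ q  # raw attention scores, std ~ sqrt(d_k)

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

for name, scale in [("none", 1.0),
                    ("1/sqrt(d_k)", 1.0 / np.sqrt(d_k)),
                    ("1/d_k", 1.0 / d_k)]:
    p = softmax(logits * scale)
    print(f"{name:12s} max prob = {p.max():.3f}")
```

With no scaling the distribution concentrates almost all mass on one position (saturation); with $1/d_k$ it is close to uniform ($\approx 1/32$ per position); $1/\sqrt{d_k}$ sits in between.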
Application in Attention
In scaled dot-product attention, the score matrix is computed as $\frac{QK^\top}{\sqrt{d_k}}$ before the softmax. This scaling ensures stable training across different attention head dimensions.
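Putting the pieces together, a minimal scaled dot-product attention can be sketched in NumPy (function and variable names here are illustrative, not from a specific library):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> output of shape (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # scale keeps score variance O(1)
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the exponentials
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

# Example usage with arbitrary shapes
rng = np.random.default_rng(3)
Q = rng.standard_normal((4, 64))
K = rng.standard_normal((10, 64))
V = rng.standard_normal((10, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```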