High-Dimensional Dot Product Normalization

In high-dimensional spaces, the dot product of two random vectors grows in magnitude proportionally to $\sqrt{d_k}$, where $d_k$ is the vector dimension. In attention mechanisms, these large values push the softmax toward saturation and can destabilize training.

Mathematical Explanation

Here's the step-by-step breakdown of the variance calculation:

The dot product: $\mathbf{q} \cdot \mathbf{k} = \sum_{i=1}^{d_k} q_i \times k_i$

Expected value:

$$ E[\mathbf{q} \cdot \mathbf{k}] = E\left[\sum_{i=1}^{d_k} q_i k_i\right] = \sum_{i=1}^{d_k} E[q_i k_i] = \sum_{i=1}^{d_k} E[q_i]E[k_i] = 0 $$

(since $E[q_i] = E[k_i] = 0$ and components are independent)
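This zero-mean claim is easy to check empirically. A quick Monte Carlo sketch using NumPy (the dimension, trial count, and $\sigma$ below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, trials, sigma = 64, 100_000, 1.0

# Many independent (q, k) pairs with zero-mean components of variance sigma^2.
q = rng.normal(0.0, sigma, size=(trials, d_k))
k = rng.normal(0.0, sigma, size=(trials, d_k))
dots = np.einsum("ij,ij->i", q, k)  # one dot product per trial

print(dots.mean())  # close to 0, as the derivation predicts
```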

Variance:

$$ \text{Var}(\mathbf{q} \cdot \mathbf{k}) = \text{Var}\left(\sum_{i=1}^{d_k} q_i k_i\right) $$

Since components are independent:

$$ = \sum_{i=1}^{d_k} \text{Var}(q_i k_i) $$

For independent random variables with zero mean (similar to reasoning in Weight Initialization in Deep Neural Networks > Initializing weights by preserving variance):

$$ \text{Var}(q_i k_i) = E[q_i^2 k_i^2] - (E[q_i k_i])^2 = E[q_i^2]E[k_i^2] - 0 $$

Since $\text{Var}(q_i) = E[q_i^2] - (E[q_i])^2 = E[q_i^2] = \sigma^2$:

$$ \text{Var}(q_i k_i) = \sigma^2 \times \sigma^2 = \sigma^4 $$

Therefore:

$$ \text{Var}(\mathbf{q} \cdot \mathbf{k}) = \sum_{i=1}^{d_k} \sigma^4 = d_k \sigma^4 $$

Standard deviation:

$$ \text{SD}(\mathbf{q} \cdot \mathbf{k}) = \sqrt{d_k \sigma^4} = \sigma^2\sqrt{d_k} $$
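The $d_k \sigma^4$ variance (and hence the $\sigma^2\sqrt{d_k}$ standard deviation) can also be verified numerically. This sketch compares the empirical variance of raw dot products against the predicted $d_k \sigma^4$ for a few dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
trials, sigma = 200_000, 1.0

for d_k in (16, 64, 256):
    q = rng.normal(0.0, sigma, size=(trials, d_k))
    k = rng.normal(0.0, sigma, size=(trials, d_k))
    dots = np.einsum("ij,ij->i", q, k)
    # Empirical variance should be close to d_k * sigma**4.
    print(d_k, dots.var(), d_k * sigma**4)
```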

The Solution: Scale by $\frac{1}{\sqrt{d_k}}$

Dividing dot products by $\sqrt{d_k}$ normalizes the variance:

$$ \text{Scaled dot product} = \frac{\mathbf{q} \cdot \mathbf{k}}{\sqrt{d_k}} $$

This keeps the variance constant regardless of dimension:

$$ \text{Var}\left(\frac{\mathbf{q} \cdot \mathbf{k}}{\sqrt{d_k}}\right) = \frac{d_k \sigma^4}{d_k} = \sigma^4 $$

which equals $1$ in the common case $\sigma^2 = 1$.
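Repeating the earlier Monte Carlo check with the $\frac{1}{\sqrt{d_k}}$ factor applied shows the variance no longer depends on $d_k$:

```python
import numpy as np

rng = np.random.default_rng(2)
trials, sigma = 200_000, 1.0

for d_k in (16, 64, 256):
    q = rng.normal(0.0, sigma, size=(trials, d_k))
    k = rng.normal(0.0, sigma, size=(trials, d_k))
    scaled = np.einsum("ij,ij->i", q, k) / np.sqrt(d_k)
    print(d_k, scaled.var())  # ~ sigma**4 for every d_k
```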

Why This Specific Scaling?

  • Scaling too aggressively (e.g., by $\frac{1}{d_k}$): logits shrink toward zero and attention becomes too uniform
  • Not scaling at all (factor of 1): logits grow with $d_k$, the softmax saturates, and gradients vanish
  • Just right ($\frac{1}{\sqrt{d_k}}$): logit variance stays independent of $d_k$, keeping the softmax in its sensitive regime
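The saturation effect in the bullets above can be demonstrated directly. In this sketch (the dimension and number of logits are arbitrary illustrative choices), the unscaled softmax concentrates nearly all its mass on one entry, while the scaled version stays less peaked:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(3)
d_k, n = 512, 8  # n = number of attention logits per query
q = rng.normal(size=d_k)
K = rng.normal(size=(n, d_k))
logits = K @ q  # raw dot products, SD ~ sqrt(d_k)

for name, s in [("unscaled", logits), ("scaled", logits / np.sqrt(d_k))]:
    p = softmax(s)
    print(name, p.max())  # unscaled is typically near one-hot
```

Dividing by $\sqrt{d_k}$ acts like raising the softmax temperature, which always reduces the maximum probability and increases entropy.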

Application in Attention

$$ \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V} $$
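The formula above maps directly to a few lines of NumPy. A minimal single-head sketch (shapes and sizes are illustrative assumptions, and batching/masking are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k) scaled logits
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # (n_q, d_v)

rng = np.random.default_rng(4)
Q = rng.normal(size=(4, 64))   # 4 queries,  d_k = 64
K = rng.normal(size=(10, 64))  # 10 keys,    d_k = 64
V = rng.normal(size=(10, 32))  # 10 values,  d_v = 32
out = attention(Q, K, V)
print(out.shape)  # (4, 32)
```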

This scaling ensures stable training across different attention head dimensions.