High-Dimensional Dot Product Normalization

In high-dimensional spaces, the dot product of two random vectors grows in magnitude proportionally to $\sqrt{d_k}$, where $d_k$ is the vector dimension. In attention mechanisms, these large values push the softmax toward saturation and can destabilize training.

Mathematical Explanation

Here's the step-by-step breakdown of the variance calculation:

The dot product: $\mathbf{q} \cdot \mathbf{k} = \sum_{i=1}^{d_k} q_i \times k_i$

Expected value:

$$ E[\mathbf{q} \cdot \mathbf{k}] = E\left[\sum_{i=1}^{d_k} q_i k_i\right] = \sum_{i=1}^{d_k} E[q_i k_i] = \sum_{i=1}^{d_k} E[q_i]E[k_i] = 0 $$

(since $E[q_i] = E[k_i] = 0$ and components are independent)
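This zero-mean claim is easy to check empirically. A quick Monte Carlo sketch using NumPy (the dimension, trial count, and $\sigma$ below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, trials, sigma = 64, 100_000, 1.0

# Many independent (q, k) pairs with zero-mean components of variance sigma^2.
q = rng.normal(0.0, sigma, size=(trials, d_k))
k = rng.normal(0.0, sigma, size=(trials, d_k))
dots = np.einsum("ij,ij->i", q, k)  # one dot product per trial

print(dots.mean())  # close to 0, as the derivation predicts
```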

Variance:

$$ \text{Var}(\mathbf{q} \cdot \mathbf{k}) = \text{Var}\left(\sum_{i=1}^{d_k} q_i k_i\right) $$

Since components are independent:

$$ = \sum_{i=1}^{d_k} \text{Var}(q_i k_i) $$

For independent random variables with zero mean (similar to reasoning in Weight Initialization in Deep Neural Networks > Initializing weights by preserving variance):

$$ \text{Var}(q_i k_i) = E[q_i^2 k_i^2] - (E[q_i k_i])^2 = E[q_i^2]E[k_i^2] - 0 $$

Since $\text{Var}(q_i) = E[q_i^2] - (E[q_i])^2 = E[q_i^2] = \sigma^2$:

$$ \text{Var}(q_i k_i) = \sigma^2 \times \sigma^2 = \sigma^4 $$

Therefore:

$$ \text{Var}(\mathbf{q} \cdot \mathbf{k}) = \sum_{i=1}^{d_k} \sigma^4 = d_k \sigma^4 $$

Standard deviation:

$$ \text{SD}(\mathbf{q} \cdot \mathbf{k}) = \sqrt{d_k \sigma^4} = \sigma^2\sqrt{d_k} $$
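The $d_k \sigma^4$ variance (and hence the $\sigma^2\sqrt{d_k}$ standard deviation) can also be verified numerically. This sketch compares the empirical variance of raw dot products against the predicted $d_k \sigma^4$ for a few dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
trials, sigma = 200_000, 1.0

for d_k in (16, 64, 256):
    q = rng.normal(0.0, sigma, size=(trials, d_k))
    k = rng.normal(0.0, sigma, size=(trials, d_k))
    dots = np.einsum("ij,ij->i", q, k)
    # Empirical variance should be close to d_k * sigma**4.
    print(d_k, dots.var(), d_k * sigma**4)
```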

The Solution: Scale by $\frac{1}{\sqrt{d_k}}$

Dividing dot products by $\sqrt{d_k}$ normalizes the variance:

$$ \text{Scaled dot product} = \frac{\mathbf{q} \cdot \mathbf{k}}{\sqrt{d_k}} $$

This keeps the variance constant regardless of dimension:

$$ \text{Var}\left(\frac{\mathbf{q} \cdot \mathbf{k}}{\sqrt{d_k}}\right) = \frac{d_k \sigma^4}{d_k} = \sigma^4 $$

which equals $1$ in the common case $\sigma^2 = 1$.
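Repeating the earlier Monte Carlo check with the $\frac{1}{\sqrt{d_k}}$ factor applied shows the variance no longer depends on $d_k$:

```python
import numpy as np

rng = np.random.default_rng(2)
trials, sigma = 200_000, 1.0

for d_k in (16, 64, 256):
    q = rng.normal(0.0, sigma, size=(trials, d_k))
    k = rng.normal(0.0, sigma, size=(trials, d_k))
    scaled = np.einsum("ij,ij->i", q, k) / np.sqrt(d_k)
    print(d_k, scaled.var())  # ~ sigma**4 for every d_k
```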

Why This Specific Scaling?

  • Scaling too aggressively (e.g., by $\frac{1}{d_k}$): logits shrink toward zero and attention becomes too uniform
  • Not scaling at all (factor of 1): logits grow with $d_k$, the softmax saturates, and gradients vanish
  • Just right ($\frac{1}{\sqrt{d_k}}$): logit variance stays independent of $d_k$, keeping the softmax in its sensitive regime
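The saturation effect in the bullets above can be demonstrated directly. In this sketch (the dimension and number of logits are arbitrary illustrative choices), the unscaled softmax concentrates nearly all its mass on one entry, while the scaled version stays less peaked:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(3)
d_k, n = 512, 8  # n = number of attention logits per query
q = rng.normal(size=d_k)
K = rng.normal(size=(n, d_k))
logits = K @ q  # raw dot products, SD ~ sqrt(d_k)

for name, s in [("unscaled", logits), ("scaled", logits / np.sqrt(d_k))]:
    p = softmax(s)
    print(name, p.max())  # unscaled is typically near one-hot
```

Dividing by $\sqrt{d_k}$ acts like raising the softmax temperature, which always reduces the maximum probability and increases entropy.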

Application in Attention

$$ \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V} $$
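The formula above maps directly to a few lines of NumPy. A minimal single-head sketch (shapes and sizes are illustrative assumptions, and batching/masking are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k) scaled logits
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # (n_q, d_v)

rng = np.random.default_rng(4)
Q = rng.normal(size=(4, 64))   # 4 queries,  d_k = 64
K = rng.normal(size=(10, 64))  # 10 keys,    d_k = 64
V = rng.normal(size=(10, 32))  # 10 values,  d_v = 32
out = attention(Q, K, V)
print(out.shape)  # (4, 32)
```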

This scaling ensures stable training across different attention head dimensions.