
Understanding the division by √d
We use an example with embedding dimension , sequence length , and input vectors sampled from two Gaussian distributions.- Each row ,
- Mean:
- Variance:
Applying this to the dot product
Let: Then:- Each term has mean 0 and variance 1
- The terms are i.i.d. (since and are independent)
- So: ,
Why divide by √d?
If we define the scaled score as: Then: The variance of the attention logits is constant regardless of dimension , keeping the softmax numerically stable across different embedding sizes. Without scaling, as grows the dot product variance grows linearly, causing the softmax to become extremely sharp — one large value dominates and others vanish, leading to poor gradient flow. With scaling, the dot product distribution is normalized and the softmax stays smooth and expressive.
Summary
- Without scaling, attention scores can be overly large, leading to softmax outputs that are near one-hot.
- This results in vanishing gradients and unstable training.
- Scaling by normalizes the variance of the dot product, improving gradient flow and model stability.

