Grouped Query Attention (GQA)
GQA trades a small amount of model quality for a much smaller KV cache. In Self-Attention Mechanism > Multi Head Attention (MHA), each attention head has its own K and V. In GQA, multiple query heads share the same K and V.
MHA (8 heads):
Q1→K1,V1 Q2→K2,V2 Q3→K3,V3 Q4→K4,V4 Q5→K5,V5 Q6→K6,V6 Q7→K7,V7 Q8→K8,V8
GQA (8 heads, 2 KV groups):
Q1→K1,V1 Q2→K1,V1 Q3→K1,V1 Q4→K1,V1 Q5→K2,V2 Q6→K2,V2 Q7→K2,V2 Q8→K2,V2
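A minimal PyTorch sketch of the sharing pattern above (shapes, names, and sizes are illustrative, not any particular model's implementation). K and V are produced with only `num_kv_heads` heads and repeated so each group of query heads attends over the same keys and values:

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v, num_q_heads, num_kv_heads):
    """Toy grouped-query attention.

    q:    (batch, num_q_heads,  seq, d_head)
    k, v: (batch, num_kv_heads, seq, d_head)
    Each group of num_q_heads // num_kv_heads query heads shares one KV head.
    """
    group_size = num_q_heads // num_kv_heads      # e.g. 8 Q heads / 2 KV heads = 4
    # Expand K and V so each query head lines up with its shared KV head.
    k = k.repeat_interleave(group_size, dim=1)    # (batch, num_q_heads, seq, d_head)
    v = v.repeat_interleave(group_size, dim=1)
    d_head = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_head**0.5  # (batch, num_q_heads, seq, seq)
    attn = F.softmax(scores, dim=-1)
    return attn @ v                                 # (batch, num_q_heads, seq, d_head)

# 8 query heads sharing 2 KV heads, matching the diagram above.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
out = gqa_attention(q, k, v, num_q_heads=8, num_kv_heads=2)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```

Note that only the 2-head K and V ever need to be cached; the expansion happens at attention time.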
Why does it work? We don't fully know; the justification is mostly empirical, with ablations showing no significant quality loss.
A common hypothesis goes like this: K and V have more redundancy across heads than Q. Different query heads can ask different questions ("is this a noun?", "is this related to the subject?"), but they're all querying the same underlying content, so you don't need as many ways to describe that content.
KV Cache Compression
| Method | KV heads | Cache per token per layer (elements) |
|---|---|---|
| MHA | $h$ | $h \times d_{head} \times 2$ |
| GQA | $g$ | $g \times d_{head} \times 2$ |
Compression ratio: $h / g$. The factor of 2 in the cache formulas accounts for storing both K and V.
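A quick sanity check of the formulas (the head counts and $d_{head}$ below are just assumed example values, not tied to a specific model):

```python
def kv_cache_per_token_per_layer(num_kv_heads, d_head):
    # Elements stored per token per layer: one K and one V vector per KV head.
    return num_kv_heads * d_head * 2

h, g, d_head = 32, 8, 128                        # assumed example values
mha = kv_cache_per_token_per_layer(h, d_head)    # 8192 elements
gqa = kv_cache_per_token_per_layer(g, d_head)    # 2048 elements
print(mha / gqa)                                 # 4.0 == h / g
```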
Popular model configurations:
| Model | Q heads | KV heads | Ratio | Cache reduction |
|---|---|---|---|---|
| Llama 3 8B | 32 | 8 | 4:1 | 75% |
| Llama 3 70B | 64 | 8 | 8:1 | 87.5% |
| Llama 3 405B | 128 | 8 | 16:1 | 93.75% |
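For concreteness, take Llama 3 8B and assume $d_{head} = 128$, 32 layers, and an fp16 cache: per token, GQA stores $8 \times 128 \times 2 \times 2\,\text{bytes} \times 32 \approx 128$ KB, versus roughly 512 KB if all 32 query heads kept their own K and V.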
GQA provides a large cache reduction while ablation studies show quality close to full MHA, which is why it has become the standard choice for modern LLMs.