Grouped Query Attention (GQA)

GQA trades a small amount of model quality for a much smaller KV cache. In Self-Attention Mechanism > Multi Head Attention (MHA), each attention head has its own K and V projections. In GQA, a group of query heads shares a single K and V pair.

MHA (8 heads):
Q1→K1,V1  Q2→K2,V2  Q3→K3,V3  Q4→K4,V4  Q5→K5,V5  Q6→K6,V6  Q7→K7,V7  Q8→K8,V8

GQA (8 heads, 2 KV groups):
Q1→K1,V1  Q2→K1,V1  Q3→K1,V1  Q4→K1,V1  Q5→K2,V2  Q6→K2,V2  Q7→K2,V2  Q8→K2,V2
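The grouping above can be sketched in a few lines of NumPy: each KV head is repeated so that four consecutive query heads see the same K and V. This is a minimal illustration with random weights, not any specific model's implementation; the shapes and head counts are taken from the diagram.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gqa(q, k, v, n_q_heads=8, n_kv_heads=2):
    """q: (seq, n_q_heads, d_head); k, v: (seq, n_kv_heads, d_head)."""
    group = n_q_heads // n_kv_heads      # query heads per KV group (4 here)
    # Repeat each KV head so Q1..Q4 attend with K1,V1 and Q5..Q8 with K2,V2.
    k = np.repeat(k, group, axis=1)      # (seq, n_q_heads, d_head)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(q.shape[-1])
    return np.einsum('hqk,khd->qhd', softmax(scores), v)

rng = np.random.default_rng(0)
seq, d_head = 5, 16
q = rng.normal(size=(seq, 8, d_head))
k = rng.normal(size=(seq, 2, d_head))   # only 2 KV heads are stored/cached
v = rng.normal(size=(seq, 2, d_head))
out = gqa(q, k, v)
print(out.shape)  # → (5, 8, 16)
```

Note that only the 2 KV heads ever need to be cached; the repeat happens at attention time (and in practice is done implicitly via broadcasting rather than materialized).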

Why does it work? We don't fully know; the justification is mostly empirical, with ablations showing it works without significant quality loss.

A common hypothesis goes like this: K and V carry more redundancy across heads than Q. Different query heads can ask different questions ("is this a noun?", "is this related to the subject?"), but they are all querying the same underlying content, so you don't need as many ways to describe that content.

KV Cache Compression

| Method | KV heads | Cache per token per layer |
|--------|----------|---------------------------|
| MHA    | $h$      | $h \times d_{head} \times 2$ |
| GQA    | $g$      | $g \times d_{head} \times 2$ |

Compression ratio: $h / g$

Popular model configurations:

| Model         | Q heads | KV heads | Ratio | Cache reduction |
|---------------|---------|----------|-------|-----------------|
| Llama 3 8B    | 32      | 8        | 4:1   | 75%             |
| Llama 3 70B   | 64      | 8        | 8:1   | 87.5%           |
| Llama 3 405B  | 128     | 8        | 16:1  | 93.75%          |
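The table's ratio and reduction columns follow directly from the cache formula: with $g$ KV heads instead of $h$, the cache shrinks by $1 - g/h$. A quick check of the arithmetic:

```python
# Per token per layer, cache = heads * d_head * 2 (one K and one V vector
# per head), so moving from h to g KV heads reduces the cache by 1 - g/h.
models = {
    "Llama 3 8B":   (32, 8),    # (Q heads, KV heads)
    "Llama 3 70B":  (64, 8),
    "Llama 3 405B": (128, 8),
}
for name, (h, g) in models.items():
    ratio = h // g
    reduction = 1 - g / h
    print(f"{name}: {ratio}:1, cache reduced by {reduction:.2%}")
# → Llama 3 8B: 4:1, cache reduced by 75.00%
# → Llama 3 70B: 8:1, cache reduced by 87.50%
# → Llama 3 405B: 16:1, cache reduced by 93.75%
```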

GQA provides a significant cache reduction while ablation studies show quality close to full MHA, which is why it has become the standard for modern LLMs.