Grouped Query Attention (GQA)
GQA trades a small amount of model quality for a much smaller KV cache. In Self-Attention Mechanism > Multi Head Attention (MHA), each attention head has its own K and V. In GQA, multiple query heads share the same K and V.
MHA (8 heads):
Q1→K1,V1 Q2→K2,V2 Q3→K3,V3 Q4→K4,V4 Q5→K5,V5 Q6→K6,V6 Q7→K7,V7 Q8→K8,V8
GQA (8 heads, 2 KV groups):
Q1→K1,V1 Q2→K1,V1 Q3→K1,V1 Q4→K1,V1 Q5→K2,V2 Q6→K2,V2 Q7→K2,V2 Q8→K2,V2
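A minimal PyTorch sketch of the sharing pattern above (shapes, names, and sizes are illustrative, not any particular model's implementation). K and V are produced with only `num_kv_heads` heads and repeated so each group of query heads attends over the same keys and values:

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v, num_q_heads, num_kv_heads):
    """Toy grouped-query attention.

    q:    (batch, num_q_heads,  seq, d_head)
    k, v: (batch, num_kv_heads, seq, d_head)
    Each group of num_q_heads // num_kv_heads query heads shares one KV head.
    """
    group_size = num_q_heads // num_kv_heads      # e.g. 8 Q heads / 2 KV heads = 4
    # Expand K and V so each query head lines up with its shared KV head.
    k = k.repeat_interleave(group_size, dim=1)    # (batch, num_q_heads, seq, d_head)
    v = v.repeat_interleave(group_size, dim=1)
    d_head = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_head**0.5  # (batch, num_q_heads, seq, seq)
    attn = F.softmax(scores, dim=-1)
    return attn @ v                                 # (batch, num_q_heads, seq, d_head)

# 8 query heads sharing 2 KV heads, matching the diagram above.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
out = gqa_attention(q, k, v, num_q_heads=8, num_kv_heads=2)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```

Note that only the 2-head K and V ever need to be cached; the expansion happens at attention time.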
Why does it work? We don't fully know; the justification is mostly empirical, with ablations showing no significant quality loss.
A common hypothesis goes like this: K and V have more redundancy across heads than Q. Different query heads can ask different questions ("is this a noun?", "is this related to the subject?"), but they're all querying the same underlying content, so you don't need as many ways to describe that content.
KV Cache Compression
| Method | KV heads | Cache per token per layer (elements) |
|---|---|---|
| MHA | $h$ | $h \times d_{head} \times 2$ |
| GQA | $g$ | $g \times d_{head} \times 2$ |
Compression ratio: $h / g$. The factor of 2 in the cache formulas accounts for storing both K and V.
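A quick sanity check of the formulas (the head counts and $d_{head}$ below are just assumed example values, not tied to a specific model):

```python
def kv_cache_per_token_per_layer(num_kv_heads, d_head):
    # Elements stored per token per layer: one K and one V vector per KV head.
    return num_kv_heads * d_head * 2

h, g, d_head = 32, 8, 128                        # assumed example values
mha = kv_cache_per_token_per_layer(h, d_head)    # 8192 elements
gqa = kv_cache_per_token_per_layer(g, d_head)    # 2048 elements
print(mha / gqa)                                 # 4.0 == h / g
```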
Popular model configurations:
| Model | Q heads | KV heads | Ratio | Cache reduction |
|---|---|---|---|---|
| Llama 3 8B | 32 | 8 | 4:1 | 75% |
| Llama 3 70B | 64 | 8 | 8:1 | 87.5% |
| Llama 3 405B | 128 | 8 | 16:1 | 93.75% |
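For concreteness, take Llama 3 8B and assume $d_{head} = 128$, 32 layers, and an fp16 cache: per token, GQA stores $8 \times 128 \times 2 \times 2\,\text{bytes} \times 32 \approx 128$ KB, versus roughly 512 KB if all 32 query heads kept their own K and V.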
GQA provides a large cache reduction while ablation studies show quality close to full MHA, which is why it has become the standard choice for modern LLMs.