Architectures

Multi-Query Attention

Attention with many query heads but a single shared key and value set.

Definition

Multi-Query Attention keeps a separate projection for each query head but shares one key projection and one value projection across all heads. Because only a single key/value pair is stored per token instead of one per head, the KV cache (the model's stored memory of the conversation so far) shrinks by a factor equal to the number of heads. This speeds up generation when memory capacity or bandwidth is the bottleneck, at the cost of a small drop in quality compared with full multi-head attention. Grouped-query attention is a middle ground that shares each key/value projection among a group of query heads and recovers much of that quality.
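The sharing described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names (multi_query_attention, kv_cache_entries), the weight shapes, and the causal mask are assumptions chosen for clarity, and a real model would also include an output projection, batching, and an incremental cache.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(x, w_q, w_k, w_v, n_heads):
    """Causal multi-query attention over one sequence (illustrative shapes).

    x:   (seq, d_model) input activations
    w_q: (d_model, n_heads * d_head), one query projection per head
    w_k: (d_model, d_head), a single key projection shared by all heads
    w_v: (d_model, d_head), a single value projection shared by all heads
    """
    seq = x.shape[0]
    d_head = w_k.shape[1]

    # Per-head queries: (n_heads, seq, d_head).
    q = (x @ w_q).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    # Shared keys and values: (seq, d_head). Only these would be cached.
    k = x @ w_k
    v = x @ w_v

    # Every head scores against the same keys; broadcasting handles the head axis.
    scores = q @ k.T / np.sqrt(d_head)            # (n_heads, seq, seq)
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)    # hide later positions

    out = softmax(scores, axis=-1) @ v            # (n_heads, seq, d_head)
    # Concatenate heads back into (seq, n_heads * d_head).
    return out.transpose(1, 0, 2).reshape(seq, n_heads * d_head)

def kv_cache_entries(seq_len, n_kv_heads, d_head):
    # Keys plus values stored for one layer, counted in scalar entries.
    return 2 * seq_len * n_kv_heads * d_head

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq, d_model, n_heads, d_head = 4, 16, 4, 4
    x = rng.normal(size=(seq, d_model))
    w_q = rng.normal(size=(d_model, n_heads * d_head))
    w_k = rng.normal(size=(d_model, d_head))
    w_v = rng.normal(size=(d_model, d_head))
    print(multi_query_attention(x, w_q, w_k, w_v, n_heads).shape)
    # Multi-head (8 KV heads) versus multi-query (1 KV head) cache size.
    print(kv_cache_entries(1024, 8, 64), kv_cache_entries(1024, 1, 64))
```

With these assumed sizes, the cache comparison shows the reduction scaling with the number of heads: the multi-query cache holds one eighth as many entries as an eight-head multi-head cache of the same head dimension.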
