Inference & Serving

PagedAttention

A memory-management technique that stores the KV cache in small, reusable blocks, much as an operating system pages memory.

Definition

PagedAttention is a method for managing KV cache memory (the keys and values the model saves for tokens it has already processed), introduced with vLLM (SOSP 2023). Instead of reserving one large contiguous region of GPU memory per request, it splits the cache into small fixed-size blocks (16 tokens by default) and uses a per-request lookup table, the block table, to find them wherever they sit in memory. This mirrors the way an operating system pages virtual memory: blocks are allocated only as a request grows, which nearly eliminates fragmentation and over-reservation and lets many more requests share the GPU.
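The block-and-lookup-table idea can be illustrated with a toy allocator. This is a minimal sketch, not vLLM's implementation: the names `BlockAllocator`, `Sequence`, `append_token`, and `locate` are hypothetical, and it tracks only block IDs rather than real GPU tensors.

```python
BLOCK_SIZE = 16  # tokens per block (the default mentioned above)


class BlockAllocator:
    """Hypothetical pool of physical KV-cache blocks, identified by integer ID."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))

    def allocate(self):
        if not self.free_blocks:
            raise MemoryError("no free KV-cache blocks")
        return self.free_blocks.pop()

    def release(self, block_id):
        self.free_blocks.append(block_id)


class Sequence:
    """One request; its block table maps logical block index -> physical block ID."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is claimed only when the last one is full,
        # so at most BLOCK_SIZE - 1 slots per request sit unused.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def locate(self, token_index):
        # Translate a logical token position into (physical block, offset).
        if not 0 <= token_index < self.num_tokens:
            raise IndexError("token index out of range")
        block_id = self.block_table[token_index // BLOCK_SIZE]
        return block_id, token_index % BLOCK_SIZE

    def free(self):
        # Return every block to the pool so other requests can reuse it.
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table = []
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=8)
seq_a, seq_b = Sequence(allocator), Sequence(allocator)
for _ in range(20):
    seq_a.append_token()  # 20 tokens need 2 blocks
for _ in range(5):
    seq_b.append_token()  # 5 tokens need 1 block
```

In this sketch the two requests share a single pool, and neither needs its blocks to be adjacent in memory; the block table is what makes the scattered blocks look like one continuous cache to the attention computation.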
