All terms
Inference & Serving
KV Cache
Stored attention keys/values that let a model avoid recomputing past tokens.
Definition
The KV cache (key-value cache) stores the intermediate values the model worked out for every word already processed in a sequence. As the model writes one word at a time, each new word must look back at all earlier words; keeping these saved values means it never has to redo that work. The cache grows with the length of the text and the number of requests handled at once, and is often the biggest user of GPU memory while the model runs — which is why PagedAttention and prefix caching matter so much.