Inference & Serving

Prefix Caching

Reusing already-computed KV cache for prompt text shared across requests.

Definition

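Prefix caching is the general technique of storing and reusing the KV cache for a portion of a prompt that many requests share, typically a long system prompt, few-shot examples, or a retrieved document. Because the model has already processed that prefix, a new request can skip recomputing its keys and values and run prefill only on the tokens that follow, which lowers time to first token and compute cost. Reuse requires the shared text to sit at the very start of the prompt and to match token for token, since each token's KV entries depend on every token before it. RadixAttention (SGLang) and automatic prefix caching (vLLM) are concrete implementations.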
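The lookup can be illustrated with a toy sketch. It is not how vLLM or SGLang is implemented: the class name `ToyPrefixCache`, the `BLOCK_SIZE` of 4, and the placeholder strings standing in for KV tensors are all assumptions made for illustration. It only shows the bookkeeping, namely finding the longest cached prefix in whole blocks and counting how many tokens still need prefill.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # hypothetical block size, chosen only for illustration


class ToyPrefixCache:
    """Toy prefix cache: maps full-block token prefixes to placeholder KV entries."""

    def __init__(self) -> None:
        # Key: the whole token prefix up to a block boundary.
        # Value: a string standing in for that block's KV tensors.
        self.blocks: Dict[Tuple[int, ...], str] = {}

    def prefill(self, tokens: List[int]) -> Tuple[int, int]:
        """Return (reused_tokens, computed_tokens) for one request."""
        # Only whole blocks are cached; a trailing partial block is always computed.
        full = len(tokens) - len(tokens) % BLOCK_SIZE

        # Walk block by block until the first prefix that is not cached.
        reused = 0
        for end in range(BLOCK_SIZE, full + 1, BLOCK_SIZE):
            if tuple(tokens[:end]) in self.blocks:
                reused = end
            else:
                break

        # "Compute" the rest and store every new full block for later requests.
        for end in range(reused + BLOCK_SIZE, full + 1, BLOCK_SIZE):
            self.blocks[tuple(tokens[:end])] = f"kv[{end - BLOCK_SIZE}:{end}]"

        return reused, len(tokens) - reused


# Two requests that share a 10-token "system prompt".
system_prompt = list(range(10))
cache = ToyPrefixCache()
first = cache.prefill(system_prompt + [100, 101])        # cold cache
second = cache.prefill(system_prompt + [200, 201, 202])  # shares the prefix
print(first, second)
```

In this sketch the second request reuses only 8 of the 10 shared tokens, because reuse happens in whole blocks and the third block mixes shared and request-specific tokens. Keying each block on the entire prefix before it is what enforces the exact-match requirement: the same text appearing later in a different prompt would not hit the cache.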

Related terms

RadixAttention, KV Cache, vLLM, SGLang