Inference & Serving

RadixAttention

SGLang's technique for automatically reusing the KV cache of shared prompt prefixes.

Definition

RadixAttention is SGLang's method for automatically reusing the work done on repeated opening text. It files away the model's saved values (its KV cache) in a radix tree keyed by the exact sequence of tokens seen so far; when a new request begins with the same tokens as an earlier one (a system prompt, a retrieved document, prior conversation turns), the model picks up from the shared branch instead of recomputing it. Workloads with lots of repeated openings see a much shorter wait for the first token of the answer, because only the part of the prompt that is not already cached has to be processed.
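The lookup described above can be illustrated with a small sketch. This is not SGLang's implementation: it uses a plain one-token-per-node trie rather than a compressed radix tree, stores no actual KV tensors, and omits eviction. The names `Node`, `PrefixCache`, `match_prefix`, and `insert` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical stand-in for a tree node. A real cache would also hold
    # a reference to the KV-cache memory for the token at this node.
    children: dict[int, "Node"] = field(default_factory=dict)


class PrefixCache:
    """Toy token-level prefix tree (a simplification of a radix tree)."""

    def __init__(self) -> None:
        self.root = Node()

    def match_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens are already cached."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched

    def insert(self, tokens: list[int]) -> None:
        """Record a processed token sequence so later requests can reuse it."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())


cache = PrefixCache()
system_prompt = [101, 7, 7, 42]          # made-up token IDs
first = system_prompt + [5, 6]
second = system_prompt + [9, 9, 9]

reused_first = cache.match_prefix(first)    # nothing cached yet
cache.insert(first)
reused_second = cache.match_prefix(second)  # shares the system prompt

print(reused_first, reused_second)
```

In this sketch the second request finds its four system-prompt tokens already in the tree, so only its three remaining tokens would need to be processed before generation starts.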
