Inference & Serving

Disaggregated Prefill

Running the prefill phase on dedicated hardware separate from token generation.

Definition

Disaggregated prefill dedicates separate hardware to prefill, the stage where the prompt is read in and the KV cache (the model's running notes on the text) is built. Once the cache is computed, it is handed off to other machines tuned for generating the output one token (roughly a word) at a time. Routing the heavy work of reading long prompts to specialized machines improves throughput (total work done per second) and lowers cost, while keeping generation fast on the rest of the fleet.
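The handoff described above can be illustrated with a toy sketch. This is not a real serving stack: the names KVCache, PrefillWorker, DecodeWorker, and serve are hypothetical, the "model" is a made-up arithmetic rule, and the cache transfer between machines is reduced to passing an object from one function to another.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Toy stand-in for the state built while reading the prompt."""
    entries: list[int] = field(default_factory=list)


class PrefillWorker:
    """Hypothetical worker in the prefill pool: reads the whole prompt once."""

    def prefill(self, prompt_tokens: list[int]) -> KVCache:
        # A real model would compute attention keys and values here;
        # this toy records one cache entry per prompt token.
        return KVCache(entries=list(prompt_tokens))


class DecodeWorker:
    """Hypothetical worker in the decode pool: emits one token per step."""

    def decode(self, cache: KVCache, max_new_tokens: int) -> list[int]:
        output = []
        for _ in range(max_new_tokens):
            # Made-up "next token" rule: sum of cached entries modulo 100.
            next_token = sum(cache.entries) % 100
            output.append(next_token)
            # The cache grows by one entry for every generated token.
            cache.entries.append(next_token)
        return output


def serve(prompt_tokens: list[int], max_new_tokens: int,
          prefill_worker: PrefillWorker, decode_worker: DecodeWorker) -> list[int]:
    # Stage 1: the prefill pool builds the cache from the full prompt.
    cache = prefill_worker.prefill(prompt_tokens)
    # In a real deployment the cache would cross the network at this point.
    # Stage 2: the decode pool continues from the transferred cache.
    return decode_worker.decode(cache, max_new_tokens)


print(serve([1, 2, 3], 2, PrefillWorker(), DecodeWorker()))  # prints [6, 12]
```

The point of the split is that the two worker classes never share a machine, so each pool can be sized and tuned for its own stage. The only thing that passes between them is the cache.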
