Inference & Serving

Chunked Prefill

Splitting a long prompt's prefill into chunks so it doesn't block other requests.

Definition

Chunked prefill splits prefill (the step where the model processes the whole prompt before generating any output) into smaller pieces that are interleaved with the token-by-token generation of requests already running. Without it, a single very long prompt can occupy the GPU (the chip that runs the model) for an entire scheduling step, stalling requests that are mid-generation and delaying the first token for requests queued behind it. It is a key control for keeping response times low when short and long requests share the same server.
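The interleaving described above can be illustrated with a toy scheduler. This is a minimal sketch, not the scheduler of any particular serving engine: the names `PrefillRequest`, `plan_step`, `run`, and `token_budget` are hypothetical, and it assumes a fixed per-step token budget in which every running request decodes one token first and whatever budget is left goes to prefill chunks.

```python
from dataclasses import dataclass


@dataclass
class PrefillRequest:
    """Hypothetical record for a request whose prompt is still being prefilled."""
    name: str
    prompt_len: int     # total prompt tokens to prefill
    prefilled: int = 0  # prompt tokens prefilled so far


def plan_step(decoding, prefilling, token_budget):
    """Plan one step: return {request name: tokens processed this step}."""
    if token_budget <= len(decoding):
        raise ValueError("token_budget must exceed the number of decoding requests")
    plan = {}
    budget = token_budget
    # Decode first: each running request generates exactly one token per step.
    for name in decoding:
        plan[name] = 1
        budget -= 1
    # Spend the leftover budget on prefill chunks, in arrival order.
    for req in prefilling:
        if budget == 0:
            break
        chunk = min(budget, req.prompt_len - req.prefilled)
        if chunk > 0:
            plan[req.name] = chunk
            req.prefilled += chunk
            budget -= chunk
    return plan


def run(decoding, prefilling, token_budget):
    """Plan steps until every pending prompt has been fully prefilled."""
    steps = []
    while any(r.prefilled < r.prompt_len for r in prefilling):
        steps.append(plan_step(decoding, prefilling, token_budget))
    return steps


# Two requests are mid-generation when a 10-token prompt arrives (toy sizes).
steps = run(["chat-1", "chat-2"], [PrefillRequest("long-doc", 10)], token_budget=6)
for i, step in enumerate(steps, start=1):
    print(i, step)
```

With these toy numbers the 10-token prompt is prefilled over three steps in chunks of 4, 4, and 2 tokens, and both running requests still produce one token in every step. Under the same assumptions, a scheduler without chunking would have to fit all 10 prompt tokens into a single step, which is the stall the definition describes.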

Related terms

Continuous Batching, TTFT, vLLM