Inference & Serving

Continuous Batching

Adding and removing requests from a GPU batch at every step instead of waiting for the whole batch to finish.

Definition

Continuous batching (also called iteration-level batching) is a scheduling strategy in which the server can admit new requests and drop finished ones at every processing step, rather than running one fixed group of requests until the slowest one completes. Because a slot freed by a short request is refilled right away, the GPU (the chip that runs the model) stays busy even when prompts and outputs have wildly different lengths, which raises throughput (total work done per second). It is on by default in vLLM and SGLang.
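To make the difference concrete, here is a minimal toy simulation that counts how many steps a server needs under each strategy. It is a sketch under simplifying assumptions, not how vLLM or SGLang implement scheduling: every request is reduced to a positive number of output tokens, each running request produces one token per step, and prompt processing and memory limits are ignored. The function names and the sample lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Steps needed when each fixed batch runs until its longest request ends."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Short requests sit idle in their slot until the longest one finishes.
        steps += max(batch)
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Steps needed when free slots are refilled from the queue at every step."""
    waiting = deque(lengths)   # requests not yet admitted
    running = []               # tokens still to generate per in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One step: every running request generates one token.
        running = [remaining - 1 for remaining in running]
        # Drop requests that have finished, freeing their slots.
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


# Hypothetical output lengths (in tokens) for eight requests.
lengths = [2, 8, 3, 1, 8, 2, 1, 4]
print(static_batching_steps(lengths, batch_size=4))      # 16
print(continuous_batching_steps(lengths, batch_size=4))  # 9
```

In this toy run the same eight requests finish in 9 steps instead of 16, because slots released by the short requests are reused immediately rather than idling until the longest request in the batch completes.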

Related terms

vLLM, PagedAttention, Chunked Prefill