Inference & Serving

Scheduler

The component that decides which requests run and in what order over time.

Definition

A scheduler is the component of a serving system that decides how work is assigned over time: which requests to admit, which to batch together, and which to defer. In language model inference it manages request queues, groups requests so the accelerator is used efficiently, and applies policies such as iteration-level scheduling (often called continuous batching), in which the batch is re-formed at every decoding step so new requests can join an in-flight batch instead of waiting for it to drain. Good scheduling balances overall throughput against per-request latency as load changes.
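
To make iteration-level scheduling concrete, here is a minimal sketch in Python. It is a toy simulation, not the design of any particular serving system: the names `Request`, `IterationLevelScheduler`, `max_batch_size`, and `tokens_left` are all hypothetical, admission is plain first-in-first-out, and a "decode step" is modeled as decrementing a counter rather than running a model.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    tokens_left: int  # decode steps still needed (stand-in for real generation)


class IterationLevelScheduler:
    """Toy scheduler that re-forms the batch before every decode step."""

    def __init__(self, max_batch_size: int):
        self.max_batch_size = max_batch_size
        self.waiting = deque()  # admitted later, in FIFO order
        self.running = []       # the in-flight batch

    def submit(self, request: Request) -> None:
        self.waiting.append(request)

    def step(self) -> list:
        """Run one decode iteration and return the ids that finished in it."""
        # Fill any free batch slots from the queue before this iteration,
        # so new requests join the in-flight batch.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())

        # Simulate one decode step: each running request emits one token.
        for request in self.running:
            request.tokens_left -= 1

        # Finished requests leave immediately, freeing slots for the next step.
        finished = [r.req_id for r in self.running if r.tokens_left <= 0]
        self.running = [r for r in self.running if r.tokens_left > 0]
        return finished


scheduler = IterationLevelScheduler(max_batch_size=2)
scheduler.submit(Request("a", tokens_left=3))
scheduler.submit(Request("b", tokens_left=1))
scheduler.submit(Request("c", tokens_left=2))

while scheduler.waiting or scheduler.running:
    print(scheduler.step())
```

In this run, request "c" takes the slot freed by "b" on the second iteration while "a" is still generating. A scheduler that only admitted work between whole batches would have held "c" back until "a" finished, which is the latency cost that iteration-level scheduling is meant to avoid.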
