Inference & Serving

Throughput

The total work a serving system completes per unit time, such as tokens per second.

Definition

Throughput measures the total work a serving system completes per unit time, usually expressed as tokens or requests per second across all users. It reflects how efficiently the hardware is used and is raised through batching, high GPU utilization, and optimized memory access. Throughput and latency are often in tension: aggressive batching serves more users at once but can lengthen the wait for any single request.
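The tension between throughput and latency can be made concrete with a toy calculation. The sketch below assumes a hypothetical cost model in which each decode step has a fixed overhead plus a small per-sequence cost. The function names and the constants `fixed_s` and `per_seq_s` are illustrative assumptions, not measurements from any real system.

```python
def step_time_s(batch_size, fixed_s=0.020, per_seq_s=0.001):
    """Time for one decode step under an assumed (hypothetical) linear cost model."""
    return fixed_s + per_seq_s * batch_size


def throughput_tokens_per_s(batch_size, fixed_s=0.020, per_seq_s=0.001):
    """Aggregate tokens per second: each step emits one token per sequence in the batch."""
    return batch_size / step_time_s(batch_size, fixed_s, per_seq_s)


def per_token_latency_s(batch_size, fixed_s=0.020, per_seq_s=0.001):
    """Time a single request waits for each of its tokens: one full step."""
    return step_time_s(batch_size, fixed_s, per_seq_s)


if __name__ == "__main__":
    for batch_size in (1, 8, 32):
        tput = throughput_tokens_per_s(batch_size)
        lat_ms = per_token_latency_s(batch_size) * 1000
        print(f"batch={batch_size:>2}  throughput={tput:7.1f} tok/s  "
              f"per-token latency={lat_ms:5.1f} ms")
```

Under these assumed constants, larger batches raise aggregate tokens per second because the fixed overhead is shared across more sequences, while each individual request waits longer per token. Real systems show the same direction of trade-off, but the shape of the curve depends on the model, hardware, and scheduler.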
