Inference & Serving
Throughput
The total work a serving system completes per unit time, such as tokens per second.
Definition
Throughput measures the total work a serving system completes per unit time, usually expressed as tokens or requests per second across all users. It reflects how efficiently the hardware is used and is raised through batching, high GPU utilization, and optimized memory access. Throughput and latency are often in tension: aggressive batching serves more users at once but can lengthen the wait for any single request.
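The tension between throughput and latency can be illustrated with a toy cost model. The sketch below is not a measurement of any real system: it assumes each decode step costs a fixed overhead plus a small per-sequence cost, and that every sequence in the batch emits one token per step. The function names and the constants `fixed_s` and `per_seq_s` are hypothetical and chosen only for illustration.

```python
def decode_step_seconds(batch_size, fixed_s=0.020, per_seq_s=0.001):
    """Time for one decode step under an assumed linear cost model.

    fixed_s and per_seq_s are made-up constants, not measured values.
    """
    return fixed_s + per_seq_s * batch_size


def throughput_tokens_per_s(batch_size, fixed_s=0.020, per_seq_s=0.001):
    """Tokens produced per second across all sequences in the batch."""
    # One token per sequence per step, so a step yields batch_size tokens.
    return batch_size / decode_step_seconds(batch_size, fixed_s, per_seq_s)


if __name__ == "__main__":
    for batch_size in (1, 8, 32, 128):
        step = decode_step_seconds(batch_size)
        tput = throughput_tokens_per_s(batch_size)
        # step is also the time each individual request waits per token.
        print(f"batch={batch_size:4d}  per-token wait={step * 1000:6.1f} ms  "
              f"throughput={tput:7.1f} tok/s")
```

Under these assumptions, larger batches raise total tokens per second because the fixed overhead is shared, while the per-token wait seen by each individual request grows with batch size. Real systems deviate from this linear model, so the numbers should be read as a shape, not a prediction.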