Tokens per Second

A common measure of how fast a model generates or processes text.

Definition

Tokens per second is a common measure of language model speed, counting how many tokens a system generates or processes each second. Reported per request, it tracks how quickly streamed text appears to one user; reported across all requests, it tracks aggregate throughput. The figure depends on model size, hardware memory bandwidth, batch size, and serving optimizations such as continuous batching and speculative decoding.
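To make the per-request versus aggregate distinction concrete, here is a minimal sketch in Python. The `RequestTrace` record and both function names are hypothetical and not taken from any serving framework, and the timings are invented values chosen for easy arithmetic. It also assumes the request duration runs from submission to the last token, so prompt processing time is included.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Hypothetical record of one request's timing (illustrative only)."""
    start_s: float      # wall-clock time the request was submitted, in seconds
    end_s: float        # wall-clock time the last token was emitted, in seconds
    output_tokens: int  # number of tokens generated for this request


def per_request_tps(trace: RequestTrace) -> float:
    """Tokens per second seen by a single user for a single request."""
    duration = trace.end_s - trace.start_s
    if duration <= 0:
        raise ValueError("request duration must be positive")
    return trace.output_tokens / duration


def aggregate_tps(traces: list[RequestTrace]) -> float:
    """Tokens per second across all requests over the shared time window."""
    if not traces:
        raise ValueError("need at least one request")
    window = max(t.end_s for t in traces) - min(t.start_s for t in traces)
    if window <= 0:
        raise ValueError("measurement window must be positive")
    return sum(t.output_tokens for t in traces) / window


# Three requests served concurrently in one batch, each producing
# 200 tokens over the same 10-second window (made-up numbers).
traces = [RequestTrace(start_s=0.0, end_s=10.0, output_tokens=200) for _ in range(3)]

print(per_request_tps(traces[0]))  # 20.0 tokens/s for each user
print(aggregate_tps(traces))       # 60.0 tokens/s for the system as a whole
```

In this toy case batching triples aggregate throughput while each user still sees 20 tokens per second. On real hardware, larger batches typically raise the aggregate figure while lowering the per-request one, so a reported number is only meaningful if it says which of the two it is.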