Tokens per Second
A common speed measure for how fast a model generates or processes text.
Definition
Tokens per second is a common measure of language model speed, counting how many tokens a system generates or processes each second. Reported per request, it tracks how quickly streamed text appears to one user; reported across all requests, it tracks aggregate throughput. The figure depends on model size, hardware memory bandwidth, batch size, and serving optimizations such as continuous batching and speculative decoding.
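The difference between the per-request and aggregate figures can be made concrete with a small calculation. The sketch below is illustrative only: the `RequestTiming` record, both function names, and the sample timings are hypothetical, and it assumes that only generated (output) tokens are counted and that each request is timed from start to finish. Real serving stacks often separate prompt processing from generation, so their reported numbers may be defined differently.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Hypothetical record of one request's wall-clock timing."""
    start_s: float      # time the request started, in seconds
    end_s: float        # time the last token was produced, in seconds
    output_tokens: int  # number of tokens generated for this request


def per_request_tps(req: RequestTiming) -> float:
    """Tokens per second as experienced by a single request."""
    duration = req.end_s - req.start_s
    if duration <= 0:
        raise ValueError("request duration must be positive")
    return req.output_tokens / duration


def aggregate_tps(requests: list[RequestTiming]) -> float:
    """Total tokens across all requests divided by the wall-clock window."""
    if not requests:
        raise ValueError("at least one request is required")
    window = max(r.end_s for r in requests) - min(r.start_s for r in requests)
    if window <= 0:
        raise ValueError("measurement window must be positive")
    return sum(r.output_tokens for r in requests) / window


# Made-up timings for three requests served concurrently in one batch.
batch = [
    RequestTiming(start_s=0.0, end_s=4.0, output_tokens=200),
    RequestTiming(start_s=1.0, end_s=5.0, output_tokens=160),
    RequestTiming(start_s=0.0, end_s=5.0, output_tokens=300),
]

for req in batch:
    print(f"per-request: {per_request_tps(req):.1f} tokens/s")
print(f"aggregate:   {aggregate_tps(batch):.1f} tokens/s")
```

With these sample values, each request sees between 40 and 60 tokens per second, while the system as a whole delivers 132 tokens per second over the five-second window. This is why batching can raise aggregate throughput even when no individual user sees faster output.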