Inference & Serving

Concurrency

How many requests a serving system handles at once without performance collapsing.

Definition

Concurrency is the number of active requests a serving system can process simultaneously while keeping latency acceptable. Serving engines raise concurrency by carefully sharing the memory of the GPU (the chip that runs the model) among requests, queuing incoming requests, and using continuous batching, which generates the next token (roughly, the next word) for many users in a single processing pass and admits queued requests as soon as earlier ones finish. High concurrency is what lets public APIs and chatbots serve many users on shared hardware.
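The interaction between queuing, a concurrency limit, and continuous batching can be illustrated with a toy simulation. This is a minimal sketch under simplifying assumptions, not how any particular serving engine is implemented: the function name `simulate_continuous_batching` and its parameters are hypothetical, every processing pass is assumed to cost the same regardless of batch size, and GPU memory is modeled only as a fixed number of slots (`max_concurrency`).

```python
from collections import deque


def simulate_continuous_batching(output_lengths, max_concurrency):
    """Toy model: each pass generates one token for every active request.

    output_lengths: tokens each request needs to generate (hypothetical input).
    max_concurrency: slots available at once (a stand-in for GPU memory limits).
    Returns (total_passes, peak_active_requests, finish_pass_by_request_id).
    """
    waiting = deque(enumerate(output_lengths))  # queued (request_id, tokens_needed)
    active = {}        # request_id -> tokens still to generate
    finished_at = {}   # request_id -> pass number at which it completed
    passes = 0
    peak = 0

    while waiting or active:
        # Admit queued requests into free slots before the next pass.
        while waiting and len(active) < max_concurrency:
            req_id, tokens_needed = waiting.popleft()
            active[req_id] = tokens_needed
        peak = max(peak, len(active))

        # One processing pass advances every active request by one token.
        passes += 1
        for req_id in list(active):
            active[req_id] -= 1
            if active[req_id] <= 0:
                # A finished request frees its slot immediately.
                del active[req_id]
                finished_at[req_id] = passes

    return passes, peak, finished_at


if __name__ == "__main__":
    lengths = [3, 1, 2, 4]
    for limit in (1, 2, 4):
        passes, peak, _ = simulate_continuous_batching(lengths, limit)
        print(f"max_concurrency={limit}: passes={passes}, peak_active={peak}")
```

In this toy model, raising the slot limit from 1 to 4 cuts the total number of passes from 10 to 4 for the same four requests. Real systems see smaller gains, because each pass gets slower as the batch grows and memory per request is not uniform.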