Inference & Serving

Time Per Output Token

How long a model takes to produce each token during the decode phase.

Definition

Time per output token (TPOT) measures how long the model takes to produce each token during the decode phase, usually in milliseconds. It sets the pace at which streamed text appears once prefill is complete, so lower values make streaming feel smoother. TPOT depends on batch size, model size, how fast the hardware can move data, and how well the model's core math runs on the chip. Together with time to first token, it defines a generation's responsiveness.
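As a rough illustration of how this number might be measured, the sketch below averages the gaps between consecutive output tokens. It assumes you have recorded an arrival time for each streamed token; the function name and the sample timestamps are hypothetical, and real serving stacks may define or aggregate the metric slightly differently.

```python
from typing import Sequence


def time_per_output_token_ms(token_times_s: Sequence[float]) -> float:
    """Mean gap between consecutive output tokens, in milliseconds.

    token_times_s holds the arrival time of each output token in seconds,
    in order. The first token marks the end of prefill, so the time before
    it (time to first token) is excluded from the decode-phase average.
    """
    if len(token_times_s) < 2:
        raise ValueError("need at least two tokens to measure a decode interval")
    decode_span_s = token_times_s[-1] - token_times_s[0]
    # n tokens are separated by n - 1 decode intervals.
    return 1000.0 * decode_span_s / (len(token_times_s) - 1)


# Hypothetical arrival times: first token at 0.25 s, then one every 0.02 s.
arrivals = [0.25 + 0.02 * i for i in range(6)]
tpot_ms = time_per_output_token_ms(arrivals)
print(f"TPOT: {tpot_ms:.1f} ms")
```

Dividing by the number of intervals rather than the number of tokens keeps the time to first token out of the result, so the two metrics can be reported separately.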
