Inference & Serving

Token Limit

The maximum number of tokens a model can take in and produce, combined, in one request.

Definition

A token limit is the maximum number of tokens (the word-pieces a model reads and writes) allowed in a single request. It covers the prompt plus the response and is set by the model's context window. A request that would exceed the limit must be shortened or split into smaller parts.
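
As a rough illustration of that budget arithmetic, here is a minimal Python sketch. It assumes a stand-in tokenizer that counts one token per whitespace-separated word. Real tokenizers split text differently, so actual counts will vary by model. The names count_tokens, fits, and split_to_fit are hypothetical helpers invented for this example, not part of any library.

```python
def count_tokens(text: str) -> int:
    # Stand-in tokenizer (assumption): one token per whitespace-separated word.
    # A real model's tokenizer would usually produce a different count.
    return len(text.split())


def fits(prompt: str, max_response_tokens: int, token_limit: int) -> bool:
    # The prompt and the space reserved for the response share one limit.
    return count_tokens(prompt) + max_response_tokens <= token_limit


def split_to_fit(prompt: str, max_response_tokens: int, token_limit: int) -> list[str]:
    # Tokens left for the prompt once the response space is reserved.
    budget = token_limit - max_response_tokens
    if budget <= 0:
        raise ValueError("response reservation leaves no room for the prompt")
    words = prompt.split()
    # Cut the prompt into consecutive chunks of at most `budget` words each.
    return [" ".join(words[i:i + budget]) for i in range(0, len(words), budget)]
```

For example, with a hypothetical limit of 8 tokens and 4 reserved for the response, a 10-word prompt does not fit in one request, and split_to_fit breaks it into chunks of at most 4 words that can be sent separately.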

Related terms

Token, Context Window, Context Length, LLM