Inference & Serving
Token Limit
The maximum number of tokens a model can take in or produce in one request.
Definition
A token limit is the maximum number of tokens (the word-pieces a model reads and writes) allowed in a single request. It covers the prompt plus the response and is set by the model's context window; some models also cap the response length separately. A request that goes over is typically rejected or cut short, so you have to shorten the input or split the work into parts.
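As a rough illustration of that budget, the sketch below checks whether a prompt plus a reserved response length fits under a limit, and splits the input into parts when it does not. Everything here is an assumption made for illustration: `count_tokens` is a crude whitespace-based stand-in for a real tokenizer (actual token counts depend on the model's tokenizer and are usually higher), and the names `fits`, `split_to_fit`, `token_limit`, and `max_output_tokens` are hypothetical rather than part of any particular API.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def fits(prompt: str, max_output_tokens: int, token_limit: int) -> bool:
    """True if the prompt plus the space reserved for the response stays within the limit."""
    return count_tokens(prompt) + max_output_tokens <= token_limit


def split_to_fit(text: str, max_output_tokens: int, token_limit: int) -> List[str]:
    """Split text into parts that each leave room for max_output_tokens of response."""
    budget = token_limit - max_output_tokens  # tokens left over for input
    if budget <= 0:
        raise ValueError("token_limit leaves no room for input")
    words = text.split()
    return [" ".join(words[i:i + budget]) for i in range(0, len(words), budget)]


if __name__ == "__main__":
    text = "a b c d e f g"
    if not fits(text, max_output_tokens=2, token_limit=5):
        for part in split_to_fit(text, max_output_tokens=2, token_limit=5):
            print(part)
```

In practice you would replace `count_tokens` with the tokenizer that matches your model, since the same text can produce different token counts under different tokenizers.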