All terms
Foundations
Tokenization
Breaking input text into tokens, often sub-word pieces, before a model processes it.
Definition
Tokenization is the step that breaks input text into tokens, often sub-word pieces, before a model processes it. The choice of tokenizer affects how efficiently a model handles different languages and code, since some are split into many more pieces than others. It also determines how length and cost are measured, because models bill and limit context by token count.