Data
Tokenizer
A system that splits raw text into tokens a model can read and converts tokens back into text.
Definition
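A tokenizer turns raw text into the sequence of tokens a model reads, mapping each piece to an integer ID, and converts model output back into text. Most modern models use subword methods such as byte-pair encoding or WordPiece, which balance vocabulary size against coverage of rare words. A tokenizer's design affects how efficiently text in different languages is encoded: the same sentence can take more tokens in one language than in another. Because usage cost and context length are typically counted in tokens, the tokenizer also determines how both are measured.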
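As a rough illustration of the text-to-ID round trip, here is a minimal character-level byte-pair encoding sketch. It is a toy under stated assumptions, not any production tokenizer: the function names (`train_bpe`, `merge_pair`, `build_vocab`, `encode`, `decode`) and the sample corpus are invented for this example, and it only handles characters seen during training, whereas real tokenizers usually fall back to bytes or an unknown token.

```python
from collections import Counter


def merge_pair(tokens, a, b):
    """Replace every adjacent (a, b) pair in tokens with the merged symbol a + b."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to num_merges merge rules, starting from single characters."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # stop once no pair repeats
            break
        merges.append((a, b))
        tokens = merge_pair(tokens, a, b)
    return merges


def build_vocab(text, merges):
    """Assign an integer ID to each base character and each merged symbol."""
    vocab = {}
    for symbol in sorted(set(text)) + [a + b for a, b in merges]:
        vocab.setdefault(symbol, len(vocab))
    return vocab


def encode(text, merges, vocab):
    """Split text into characters, apply merges in learned order, map to IDs."""
    tokens = list(text)
    for a, b in merges:
        tokens = merge_pair(tokens, a, b)
    return [vocab[t] for t in tokens]  # KeyError on characters unseen in training


def decode(ids, vocab):
    """Map IDs back to their symbols and join them into text."""
    id_to_token = {i: t for t, i in vocab.items()}
    return "".join(id_to_token[i] for i in ids)


# Hypothetical sample corpus, chosen only to make some pairs repeat.
corpus = "low lower lowest low low"
merges = train_bpe(corpus, num_merges=10)
vocab = build_vocab(corpus, merges)

ids = encode("lowest low", merges, vocab)
restored = decode(ids, vocab)
```

The length of `ids` is the token count for the input, which is the quantity that length limits and per-token pricing are usually based on. Frequent strings such as "low" collapse into a single ID, while rarer ones stay split into smaller pieces.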