
Tokenizer Training

Learning a token vocabulary and splitting rules from a text corpus.

Definition

Tokenizer training learns, from a representative corpus, the vocabulary and the merge or splitting rules used to break text into tokens. Common algorithms are byte-pair encoding (BPE), WordPiece, and unigram. A good tokenizer balances vocabulary size against compression rate (how few tokens are needed to encode a given text) and coverage across languages and domains. Because the tokenizer is normally fixed once the model is trained on its output, choices about corpus composition, vocabulary size, and whitespace handling have lasting effects on model quality and multilingual capability.
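
As a rough illustration of what "learning merge rules" means, here is a minimal sketch of BPE-style training in plain Python. It is a toy under stated assumptions, not how any particular library implements it: it works on characters rather than bytes, pre-tokenizes on whitespace, and breaks ties between equally frequent pairs lexicographically. The names `train_bpe`, `merge_pair`, and `encode` are hypothetical helpers introduced for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of strings."""
    # Assumption: whitespace pre-tokenization; each word starts as a tuple of characters.
    word_freqs = Counter()
    for line in corpus:
        for word in line.split():
            word_freqs[tuple(word)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in word_freqs.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol

        # Most frequent pair wins; ties are broken lexicographically (an arbitrary choice).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)

        # Rewrite every word with the new merge applied.
        new_freqs = Counter()
        for symbols, freq in word_freqs.items():
            new_freqs[merge_pair(symbols, best)] += freq
        word_freqs = new_freqs
    return merges


def encode(word, merges):
    """Tokenize one word by replaying the learned merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low low low lower lowest", "new newer newest"]
merges = train_bpe(corpus, num_merges=2)
tokens = encode("lower", merges)
```

In this sketch the final vocabulary is the set of initial characters plus one new symbol per merge, so `num_merges` controls vocabulary size directly: more merges give shorter token sequences on text resembling the corpus, at the cost of a larger vocabulary.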