Unigram Tokenizer
A tokenization method that selects subword pieces from a learned inventory.
Definition
A unigram tokenizer is a subword tokenization method that starts from a large candidate inventory of pieces and prunes it to a target vocabulary size using a unigram language model, which assigns each piece its own probability and treats the pieces in a segmentation as independent. At encoding time it chooses the segmentation of a word whose pieces have the highest joint probability under that model, rather than applying fixed merge rules. It is commonly used through SentencePiece and offers an alternative to byte-pair encoding (BPE) and WordPiece.
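To make the encoding step concrete, here is a minimal Python sketch of picking the most likely segmentation with dynamic programming (a Viterbi-style search). It is an illustration under stated assumptions, not SentencePiece's implementation: the function name `viterbi_segment` and the vocabulary `TOY_LOG_PROBS` are hypothetical, and the probabilities are invented for the example rather than learned from data.

```python
import math

# Hypothetical toy vocabulary: piece -> log probability.
# The values are made up for illustration, not learned from a corpus.
TOY_LOG_PROBS = {
    "uni": math.log(0.08),
    "gram": math.log(0.07),
    "un": math.log(0.10),
    "igram": math.log(0.05),
    **{ch: math.log(0.01) for ch in "unigram"},  # single-character fallbacks
}


def viterbi_segment(word, log_probs):
    """Return (pieces, score) for the highest-scoring segmentation of `word`.

    The score of a segmentation is the sum of its pieces' log probabilities,
    which is the log of the product of the piece probabilities.
    """
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best score for word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in log_probs and best[start] > -math.inf:
                score = best[start] + log_probs[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start

    if best[n] == -math.inf:
        raise ValueError(f"no segmentation of {word!r} with this vocabulary")

    # Walk the back-pointers from the end of the word to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1], best[n]


pieces, score = viterbi_segment("unigram", TOY_LOG_PROBS)
print(pieces, score)
```

With these invented probabilities, "uni" + "gram" (0.08 × 0.07) beats "un" + "igram" (0.10 × 0.05), so the search returns the former. Changing the probabilities changes the chosen segmentation, which is the practical difference from replaying a fixed list of merges.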