Data
SentencePiece
A tokenizer toolkit that learns subword units directly from raw text.
Definition
SentencePiece is a tokenizer toolkit that learns subword units directly from raw text, without requiring pre-tokenized words or language-specific rules. It treats text as a stream of characters in which whitespace is an ordinary symbol (escaped as the meta symbol "▁", U+2581), which makes it well suited to languages that do not separate words with spaces, such as Japanese and Chinese, and lets tokenized output be decoded without ambiguity about where the spaces were. It supports the byte-pair encoding (BPE) and unigram language model algorithms, and is widely used to build tokenizers for language models.
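The idea of learning subwords from raw text with whitespace kept as a symbol can be illustrated with a toy byte-pair encoding loop. The sketch below is not the SentencePiece library or its API; the function names (`to_symbols`, `learn_merges`, `encode`, `decode`) are hypothetical, and it omits normalization, the unigram algorithm, and every efficiency concern. It only assumes that spaces are escaped with the "▁" meta symbol before merges are learned.

```python
from collections import Counter

SPACE = "\u2581"  # the "▁" meta symbol used to escape whitespace


def to_symbols(text):
    """Escape spaces with the meta symbol and split into single characters."""
    return list(SPACE + text.replace(" ", SPACE))


def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(corpus, num_merges):
    """Toy BPE training: repeatedly merge the most frequent adjacent pair."""
    sequences = [to_symbols(line) for line in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        # Break frequency ties on the pair itself so training is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        sequences = [apply_merge(seq, best) for seq in sequences]
    return merges


def encode(text, merges):
    """Apply the learned merges, in training order, to new raw text."""
    seq = to_symbols(text)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


def decode(pieces):
    """Concatenate pieces, restore spaces, and drop the leading escaped space."""
    return "".join(pieces).replace(SPACE, " ")[1:]


# Raw, untokenized lines are the only training input.
merges = learn_merges(["low lower lowest", "new newer newest"], num_merges=10)
pieces = encode("lower newest", merges)
print(pieces)
print(decode(pieces))
```

Because the space is part of the symbol stream, `decode` needs no language-specific rules: joining the pieces and swapping "▁" back to a space restores the input, and the same code path handles text that contains no spaces at all.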