Skip to main content
All terms
Data

Byte-Pair Encoding

A tokenization method that builds subwords by repeatedly merging the most frequent symbol pairs.

Definition

Byte-Pair Encoding is a tokenization algorithm that starts from a character or byte vocabulary and iteratively merges the most frequent adjacent pair of symbols into a new token, repeating until it reaches a target vocabulary size. The resulting subword vocabulary handles rare and unknown words gracefully. A byte-level variant underpins the tokenizers used by GPT, LLaMA, and many other language models.