Data
Multilingual Data
Training data that spans more than one language.
Definition
Multilingual data is training data that spans more than one language, used so that a model can understand and generate text in languages other than English. How that data is balanced across languages shapes how well the model serves speakers of each one, and languages with little data often see weaker quality. The mix also interacts with tokenizer design, since the same content can take more tokens (the small chunks of text a model reads) in some languages than in others.
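To give a rough sense of why token counts differ by language, the sketch below compares UTF-8 bytes per character for a short greeting in three languages. It is only a proxy, not a real tokenizer: it assumes a byte-level view of text, where scripts that need more bytes per character tend to start at a disadvantage. The sample strings and the helper name `utf8_bytes_per_char` are illustrative choices, not part of any library.

```python
# Rough proxy for tokenizer cost: UTF-8 bytes needed per character.
# Assumption: a byte-level view of text; real tokenizers merge bytes
# into larger tokens, so actual token counts will differ.

def utf8_bytes_per_char(text: str) -> float:
    """Return the average number of UTF-8 bytes per Unicode code point."""
    return len(text.encode("utf-8")) / len(text)

# Illustrative sample strings (roughly "hello, world" in each language).
samples = {
    "English": "Hello, world",
    "Russian": "Привет, мир",
    "Hindi": "नमस्ते दुनिया",
}

# Compute the ratio for each language.
ratios = {lang: utf8_bytes_per_char(text) for lang, text in samples.items()}

for lang, ratio in ratios.items():
    print(f"{lang}: {ratio:.2f} bytes per character")
```

ASCII text uses one byte per character, Cyrillic letters use two, and Devanagari code points use three, so the ratio rises across the three samples. A tokenizer trained mostly on English text may show a similar gap in tokens, which is one reason the language mix and the tokenizer are usually considered together.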