
Multilingual Data

Training data that spans more than one language.

Definition

Multilingual data is training data that spans more than one language, included so that a model can understand and generate text in languages other than English. The balance across languages shapes how well the model serves speakers of each one, and the model is often weaker in underrepresented languages. The mix also interacts with tokenizer design, since some languages need more tokens (the small chunks of text a model reads) to express the same content.
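
To make the tokenizer point concrete, here is a minimal Python sketch. It assumes a simplified byte-level tokenizer with no learned merges, so that one UTF-8 byte equals one token. Real tokenizers merge frequent byte sequences, so actual counts will differ, but the direction of the gap is the same. The helper name `byte_tokens` and the sample greetings are hypothetical, chosen only for illustration.

```python
# Illustrative sketch, not any specific model's tokenizer.
# Assumption: a byte-level tokenizer with no learned merges emits one token
# per UTF-8 byte. Real tokenizers merge frequent sequences, so real counts differ.


def byte_tokens(text: str) -> int:
    """Return the number of UTF-8 bytes in text (tokens under the assumption above)."""
    return len(text.encode("utf-8"))


def tokens_per_character(text: str) -> float:
    """Average number of byte-level tokens needed per Unicode code point."""
    return byte_tokens(text) / len(text)


# Hypothetical sample greetings in three scripts.
samples = {
    "English": "Hello",    # Latin letters take 1 byte each in UTF-8
    "Russian": "Привет",   # Cyrillic letters take 2 bytes each
    "Hindi": "नमस्ते",       # Devanagari code points take 3 bytes each
}

for language, text in samples.items():
    print(
        f"{language}: {len(text)} code points, "
        f"{byte_tokens(text)} byte-level tokens, "
        f"{tokens_per_character(text):.1f} tokens per code point"
    )
```

Under this assumption, a greeting of similar length costs two to three times as many tokens in Russian or Hindi as in English. This is one reason the language mix and the tokenizer are usually considered together.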