Data Mixing
Blending data from different sources in chosen proportions to shape a model's abilities.
Definition
Data mixing is the deliberate blending of data from different sources, such as web text, books, code, math, multilingual text, and synthetic data, in carefully chosen proportions during pretraining. The mixture shapes how balanced a model's capabilities are across domains, so adjusting proportions can strengthen or weaken specific skills. The exact recipes used by frontier labs are often treated as proprietary.
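As a rough illustration of how proportions translate into training data, the sketch below draws the source of each training example according to a set of mixture weights. The weights, the source names, and the helper functions (`normalize`, `sample_sources`, `empirical_mix`) are all hypothetical and chosen for illustration only; they do not reflect any lab's actual recipe, and real pipelines typically mix at the token or batch level with more machinery.

```python
import random
from collections import Counter

# Hypothetical mixture weights (illustrative only, not a real recipe).
MIXTURE = {
    "web": 0.50,
    "books": 0.15,
    "code": 0.15,
    "math": 0.05,
    "multilingual": 0.10,
    "synthetic": 0.05,
}


def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: w / total for name, w in weights.items()}


def sample_sources(weights, n, seed=0):
    """Draw n source names, each with probability equal to its weight."""
    probs = normalize(weights)
    names = list(probs)
    rng = random.Random(seed)  # seeded for reproducibility
    return rng.choices(names, weights=[probs[k] for k in names], k=n)


def empirical_mix(draws):
    """Return the observed fraction of draws that came from each source."""
    counts = Counter(draws)
    return {name: count / len(draws) for name, count in counts.items()}


if __name__ == "__main__":
    draws = sample_sources(MIXTURE, n=100_000)
    for name, share in sorted(empirical_mix(draws).items()):
        print(f"{name:>12}: {share:.3f} (target {MIXTURE[name]:.2f})")
```

Raising one weight, for example `code`, and re-running shifts the observed share toward that source at the expense of the others, which is the lever the definition describes.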