All terms
Data
Dataset
An organized collection of examples used to train, fine-tune, or evaluate models.
Definition
A dataset is an organized collection of data examples, such as text, images, audio, labels, or combinations of these, used at various stages of the machine learning lifecycle: pretraining, fine-tuning, evaluation, and alignment. Its size, quality, diversity, and licensing all directly affect a model's capabilities and safety. Large pretraining corpora sit at one extreme, while small, carefully curated instruction sets sit at the other.