Skip to main content
All terms
Data

Dataset

An organized collection of examples used to train, fine-tune, or evaluate models.

Definition

A dataset is an organized collection of data examples, such as text, images, audio, labels, or combinations of these, used at various stages of the machine learning lifecycle: pretraining, fine-tuning, evaluation, and alignment. Its size, quality, diversity, and licensing all directly affect a model's capabilities and safety. Large pretraining corpora sit at one extreme, while small, carefully curated instruction sets sit at the other.