Data

Pretraining Corpus

The massive body of text and other data used to train a model from scratch.

Definition

A pretraining corpus is the massive body of text — and sometimes code, images, or audio — used in the initial training of a model, typically trillions of tokens (the small chunks of text a model reads) drawn from web crawls, books, code repositories, and scientific papers. Its composition, quality, and cleaning have a large effect on the resulting model's abilities and biases, so building one demands heavy filtering and deduplication.
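
To make the filtering and deduplication step concrete, here is a minimal sketch in Python. It is an illustration, not how any particular corpus was built: the function names (`normalize`, `clean_corpus`) and the `min_words` threshold are assumptions chosen for the example. Real pipelines typically add near-duplicate detection, language identification, and quality classifiers on top of this.

```python
import hashlib


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivially different copies
    # of the same document produce the same hash.
    return " ".join(text.lower().split())


def clean_corpus(docs, min_words=5):
    """Drop very short documents and exact duplicates (after normalization).

    `min_words` is a hypothetical quality threshold used for illustration.
    """
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        # Simple quality filter: skip documents with too few words.
        if len(norm.split()) < min_words:
            continue
        # Exact deduplication: hash the normalized text and skip repeats.
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "too short",                                      # fails the length filter
    "A second document with enough words in it.",
]
cleaned = clean_corpus(docs)
```

Here `cleaned` keeps the first and last documents. Hashing the normalized text keeps memory use low compared with storing every document for comparison, but it only catches exact matches; paraphrased or lightly edited copies would pass through.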
