Common Crawl
A nonprofit that publishes a large public archive of web pages used to pretrain models.
Definition
Common Crawl is a nonprofit that maintains a freely available, petabyte-scale archive of web pages, metadata, and extracted text gathered through regular internet crawls. It is the raw source behind many large-scale language model pretraining datasets, including C4 and the Pile-CC component of The Pile. Because the raw dumps contain boilerplate, duplicates, and text in many languages, practitioners apply heavy filtering, deduplication, and language identification to turn them into usable corpora.
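The cleaning steps named above can be illustrated with a minimal Python sketch. It assumes the documents are already extracted plain-text strings, and every name in it (`clean_corpus`, `looks_english`, `passes_quality`, the stopword list, the thresholds) is hypothetical. The stopword-ratio check is only a toy stand-in for a trained language identifier, and exact hashing is the simplest form of deduplication; production pipelines typically add near-duplicate detection as well.

```python
import hashlib
import re

# Hypothetical toy stopword list; a real pipeline would use a trained
# language-identification model instead of this heuristic.
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "in", "is", "a", "that", "it", "for"}


def looks_english(text, min_ratio=0.1):
    """Crude language check: share of tokens that are common English stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return False
    hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
    return hits / len(words) >= min_ratio


def passes_quality(text, min_words=5):
    """Crude quality filter: drop very short fragments such as menu labels."""
    return len(text.split()) >= min_words


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def clean_corpus(docs):
    """Apply filtering, language check, and exact deduplication, in that order."""
    seen = set()
    kept = []
    for doc in docs:
        if not passes_quality(doc) or not looks_english(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(doc)
    return kept


if __name__ == "__main__":
    sample = [
        "The cat sat in the garden and it is happy.",
        "the cat  sat in the garden and it is happy.",  # duplicate after normalization
        "Click here",                                   # too short
        "Der Hund läuft schnell durch den großen Wald heute.",  # not English
    ]
    for doc in clean_corpus(sample):
        print(doc)
```

Running the cheap filters before hashing keeps the set of seen digests small, which matters when the input is a full crawl rather than a four-item list.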