Data

Common Crawl

A nonprofit that publishes a large public archive of web pages used to pretrain models.

Definition

Common Crawl is a nonprofit that maintains a freely available, petabyte-scale archive of raw web pages (WARC files), metadata (WAT files), and extracted plain text (WET files) gathered through regular internet crawls. It is the raw source behind many large-scale language model pretraining datasets: C4 is built from a Common Crawl snapshot, and The Pile includes a Common Crawl-derived subset (Pile-CC). Because the raw dumps contain boilerplate, duplicates, and text in many languages, practitioners apply heavy filtering, deduplication, and language identification to produce usable corpora.
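
The cleaning steps mentioned above can be illustrated with a minimal Python sketch. It is not the pipeline used by C4, The Pile, or any other dataset. The function names (`looks_english`, `clean_corpus`), the thresholds, and the stopword list are all hypothetical choices for illustration, and the stopword check is only a toy stand-in for a trained language identifier.

```python
import hashlib

# Toy stand-in for a real language identifier: a tiny English stopword set.
# Production pipelines typically use a trained classifier instead.
ENGLISH_STOPWORDS = {"the", "and", "of", "to", "in", "is", "that", "it"}


def looks_english(text, min_ratio=0.1):
    """Return True if enough of the words are common English stopwords."""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
    return hits / len(words) >= min_ratio


def clean_corpus(docs, min_words=5):
    """Apply a length filter, a language check, and exact deduplication."""
    seen = set()
    kept = []
    for doc in docs:
        # Collapse whitespace and lowercase so trivial variants hash alike.
        normalized = " ".join(doc.split()).lower()
        # Quality filter: drop very short fragments (menus, buttons, etc.).
        if len(normalized.split()) < min_words:
            continue
        # Language identification (toy heuristic, see note above).
        if not looks_english(normalized):
            continue
        # Exact deduplication on a hash of the normalized text.
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept
```

Real pipelines usually go further, for example with near-duplicate detection rather than exact hashing, but the overall shape (filter, identify language, deduplicate) is the same.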