
C4

A large, cleaned English web-text dataset that Google built from a Common Crawl snapshot.

Definition

C4, short for Colossal Clean Crawled Corpus, is an English text dataset of roughly 750 GB created by applying heuristic quality filters and deduplication to an April 2019 Common Crawl snapshot. Google introduced it alongside the T5 model, and it became a widely used pretraining corpus. Its straightforward filtering pipeline made it an influential reference point for later work on cleaning and curating web data.
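
To make "heuristic quality filters and deduplication" concrete, here is a minimal Python sketch of that style of pipeline. It is not the actual C4 code: the function names (`clean_page`, `dedupe_pages`, `build_corpus`), the thresholds, and the exact-match deduplication are all illustrative assumptions chosen to keep the example small.

```python
# Illustrative only: names and thresholds are assumptions, not C4's real values.
TERMINAL_PUNCT = (".", "!", "?", '"')


def clean_page(text, min_words_per_line=3, min_lines=3):
    """Keep lines that look like sentences; drop the page if too little survives."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Heuristic 1: menus and buttons rarely end in terminal punctuation.
        if not line.endswith(TERMINAL_PUNCT):
            continue
        # Heuristic 2: very short lines are usually not running prose.
        if len(line.split()) < min_words_per_line:
            continue
        kept.append(line)
    # Heuristic 3: discard pages with too few surviving lines.
    if len(kept) < min_lines:
        return None
    return "\n".join(kept)


def dedupe_pages(pages):
    """Exact-match deduplication on case- and whitespace-normalized text."""
    seen = set()
    unique = []
    for page in pages:
        key = " ".join(page.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique


def build_corpus(raw_pages):
    """Filter each page, drop rejected pages, then deduplicate what remains."""
    cleaned = (clean_page(p) for p in raw_pages)
    return dedupe_pages([p for p in cleaned if p is not None])
```

Calling `build_corpus` on a list of raw page strings returns only the pages that pass the filters, with repeats removed. A production pipeline would typically add further steps, such as language identification and finer-grained deduplication, which this sketch omits.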