Data
C4
A large, cleaned English web-text dataset that Google built from a Common Crawl snapshot.
Definition
C4, short for Colossal Clean Crawled Corpus, is a large English text dataset created by applying heuristic quality filters and deduplication to a Common Crawl snapshot. It was introduced by Google alongside the T5 model and became a widely used pretraining corpus. Its straightforward filtering pipeline made it an influential reference point for later work on cleaning and curating web data.
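To make "heuristic quality filters and deduplication" concrete, here is a minimal Python sketch of that style of pipeline. It is an illustration only, not the actual C4 code: the function names (`clean_page`, `deduplicate`, `build_corpus`), the specific rules, and the thresholds are all assumptions chosen for the example.

```python
# Illustrative sketch of heuristic filtering plus deduplication.
# The rules and thresholds below are assumptions for this example,
# not the exact C4 pipeline.

TERMINAL_PUNCT = (".", "!", "?", '"')
MIN_WORDS_PER_LINE = 5   # assumed threshold
MIN_LINES_PER_PAGE = 3   # assumed threshold


def clean_page(text):
    """Keep lines that look like full sentences; return None for thin pages."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Menus, buttons, and titles rarely end in terminal punctuation.
        if not line.endswith(TERMINAL_PUNCT):
            continue
        # Very short lines are unlikely to be natural-language prose.
        if len(line.split()) < MIN_WORDS_PER_LINE:
            continue
        kept.append(line)
    if len(kept) < MIN_LINES_PER_PAGE:
        return None
    return "\n".join(kept)


def deduplicate(pages):
    """Drop any line already seen earlier in the corpus (exact match after
    lowercasing and whitespace normalization)."""
    seen = set()
    result = []
    for page in pages:
        unique = []
        for line in page.splitlines():
            key = " ".join(line.lower().split())
            if key in seen:
                continue
            seen.add(key)
            unique.append(line)
        if unique:
            result.append("\n".join(unique))
    return result


def build_corpus(raw_pages):
    """Apply the per-page filters, then deduplicate across pages."""
    cleaned = [c for c in (clean_page(p) for p in raw_pages) if c is not None]
    return deduplicate(cleaned)
```

Called on a list of raw page strings, `build_corpus` returns the surviving pages with navigation-like lines, thin pages, and repeated lines removed. A production pipeline would typically also need language identification and a scalable, hash-based deduplication step, both of which this sketch omits.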