
Deduplication

Removing identical or near-identical examples from a dataset before training.

Definition

Deduplication removes identical or near-identical documents from a training corpus. Exact duplicates are found by hashing each document, often after light normalization, and comparing the resulting fingerprints. Near-duplicates are caught with similarity-preserving fingerprints, such as MinHash combined with locality-sensitive hashing, which scale to very large collections without comparing every pair of documents. Removing duplicates reduces the chance that a model memorizes repeated text verbatim, can improve generalization, and makes better use of the compute budget by spending it on more diverse content. It is a standard step in preparing large pretraining corpora.
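As a rough illustration of the two kinds of fingerprinting described above, here is a minimal Python sketch. It is not a production pipeline. The function names, the normalization rules, the shingle size `k`, the number of hash functions, and the 0.8 similarity threshold are all assumptions chosen for the example. MinHash is used for the near-duplicate step as one common choice, not as the method the entry prescribes.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Exact-match fingerprint: SHA-256 of lowercased, whitespace-collapsed text.

    The normalization here is an assumption; real pipelines choose their own.
    """
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_exact(docs):
    """Keep the first document seen for each fingerprint."""
    seen = set()
    kept = []
    for doc in docs:
        fp = fingerprint(doc)
        if fp not in seen:
            seen.add(fp)
            kept.append(doc)
    return kept


def shingles(text: str, k: int = 3):
    """Return the set of overlapping k-word shingles of a document."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text: str, num_hashes: int = 64, k: int = 3):
    """MinHash signature: for each salted hash function, the minimum hash
    value over the document's shingles."""
    doc_shingles = shingles(text, k)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # one salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in doc_shingles
        ))
    return signature


def estimated_similarity(sig_a, sig_b) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity
    between the two documents' shingle sets."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def dedup_near(docs, threshold: float = 0.8, num_hashes: int = 64, k: int = 3):
    """Drop any document whose estimated similarity to an already-kept
    document reaches the threshold.

    This compares against every kept signature, which is quadratic in the
    worst case and only suitable for small collections.
    """
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(doc, num_hashes, k)
        if all(estimated_similarity(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```

Calling `dedup_exact(docs)` followed by `dedup_near(...)` on the result mirrors the usual order: cheap exact matching first, then the more expensive near-duplicate pass. At corpus scale, the all-pairs comparison in `dedup_near` would typically be replaced by locality-sensitive hashing, which groups signatures into buckets so that only likely matches are compared.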