Red-List

A blocklist of terms, domains, or content used to filter training data.

Definition

A red-list, also called a blocklist, is a collection of words, phrases, domains, or content hashes flagged as harmful, toxic, or policy-violating, used to filter training data. Documents whose count or share of red-listed items exceeds a chosen threshold are removed from the corpus. Red-list filtering is a common step in cleaning web-crawled data and helps reduce toxic, illegal, or low-quality content in pretraining corpora.
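
The threshold rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RED_LIST` entries, the simple tokenizer, the function names `red_list_ratio` and `filter_corpus`, and the 0.1 default threshold are all hypothetical, and production pipelines typically use much larger curated lists and more careful matching.

```python
import re

# Hypothetical red-list for illustration; real lists are much larger.
RED_LIST = {"badword", "slur", "scamoffer"}

# Simple lowercase word tokenizer (an assumption, not a standard).
TOKEN_RE = re.compile(r"[a-z0-9']+")


def red_list_ratio(text, red_list=RED_LIST):
    """Return the fraction of tokens in `text` that appear in `red_list`."""
    tokens = TOKEN_RE.findall(text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in red_list)
    return hits / len(tokens)


def filter_corpus(docs, threshold=0.1, red_list=RED_LIST):
    """Keep documents whose red-listed token ratio does not exceed `threshold`."""
    return [doc for doc in docs if red_list_ratio(doc, red_list) <= threshold]


if __name__ == "__main__":
    docs = [
        "a clean sentence about cooking",
        "badword badword slur and more",
    ]
    for doc in filter_corpus(docs):
        print(doc)
```

A ratio is used here rather than a raw count so that long documents are not penalized for a single incidental match. Red-lists of domains or content hashes would follow the same pattern, with a set-membership check on the document's source host or hash in place of token matching.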