Safety & Alignment

Safety Filter / Classifier

A lightweight model or rule set that screens inputs and outputs for unsafe content.

Definition

A safety filter is a lightweight model or set of rules deployed in production pipelines to detect and handle toxic, hateful, sexual, violent, or otherwise policy-violating content in user inputs or model outputs. It can block, rewrite, or route content for human review. Safety filters form a layer of defense alongside model-level alignment, since neither approach catches every case on its own.
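As a rough illustration of the block, rewrite, and route-for-review behaviors described above, here is a minimal rule-based sketch in Python. Everything in it is an assumption made for illustration: the `Action` and `Verdict` types, the `screen` and `guarded_reply` functions, the `generate` callable standing in for a model, and the toy regex patterns. A production filter would more likely use a trained classifier and a policy-specific rule set.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Action(Enum):
    ALLOW = "allow"
    REWRITE = "rewrite"   # pass a redacted version downstream
    REVIEW = "review"     # route to a human reviewer
    BLOCK = "block"


@dataclass
class Verdict:
    action: Action
    text: str     # text to pass downstream (empty when blocked)
    reason: str


# Hypothetical placeholder rules, not a real policy lexicon.
BLOCK_PATTERNS = [re.compile(r"\bbuild a bomb\b", re.IGNORECASE)]
REVIEW_PATTERNS = [re.compile(r"\bhurt (myself|someone)\b", re.IGNORECASE)]
REDACT_PATTERNS = [re.compile(r"\b(damn|hell)\b", re.IGNORECASE)]


def screen(text: str) -> Verdict:
    """Apply the rules in order of severity: block, then review, then redact."""
    for pattern in BLOCK_PATTERNS:
        if pattern.search(text):
            return Verdict(Action.BLOCK, "", "matched block rule")
    for pattern in REVIEW_PATTERNS:
        if pattern.search(text):
            return Verdict(Action.REVIEW, text, "matched review rule")
    redacted = text
    for pattern in REDACT_PATTERNS:
        redacted = pattern.sub("[redacted]", redacted)
    if redacted != text:
        return Verdict(Action.REWRITE, redacted, "redacted flagged terms")
    return Verdict(Action.ALLOW, text, "no rule matched")


def guarded_reply(user_input: str, generate: Callable[[str], str]) -> Verdict:
    """Screen the user input, then screen the model output.

    `generate` is a hypothetical stand-in for a call to the model.
    """
    inbound = screen(user_input)
    if inbound.action in (Action.BLOCK, Action.REVIEW):
        return inbound  # never reaches the model
    return screen(generate(inbound.text))
```

Calling `guarded_reply(user_text, model_fn)` screens both directions of the exchange with the same rules. Checking severity in a fixed order keeps the behavior predictable, though real systems often apply different policies to inputs and outputs.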
