Safety & Alignment

Safety Classifier

A model trained to detect harmful or disallowed content in prompts or responses.

Definition

A safety classifier is a model trained to detect categories of harmful or disallowed content in prompts or responses, acting as a filter alongside the main model. It scores inputs and outputs so a system can block, rewrite, or escalate risky content before it reaches a user. Safety classifiers are a common component of guardrail systems and layer on top of model-level alignment.
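The scoring-and-routing idea above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `toy_classifier` is a keyword stand-in for a trained model, and the category names, thresholds, `REFUSAL` message, and the `guarded_reply` helper are all hypothetical. Rewriting risky content is omitted for brevity; only allow, escalate, and block are shown.

```python
from typing import Callable, Dict, Tuple

# Hypothetical values chosen for illustration only.
BLOCK_THRESHOLD = 0.9
ESCALATE_THRESHOLD = 0.5
REFUSAL = "Sorry, I can't help with that."
SEVERITY = {"allow": 0, "escalate": 1, "block": 2}

# Toy keyword lists standing in for what a trained classifier would learn.
KEYWORDS = {
    "violence": ("attack", "weapon"),
    "fraud": ("phishing", "steal"),
}


def toy_classifier(text: str) -> Dict[str, float]:
    """Return a score in [0, 1] per category (stand-in for a real model)."""
    words = text.lower().split()
    scores = {}
    for category, keywords in KEYWORDS.items():
        hits = sum(1 for word in words if word in keywords)
        scores[category] = min(1.0, 0.5 * hits)
    return scores


def decide(scores: Dict[str, float]) -> str:
    """Map the highest category score to an action."""
    top = max(scores.values(), default=0.0)
    if top >= BLOCK_THRESHOLD:
        return "block"
    if top >= ESCALATE_THRESHOLD:
        return "escalate"
    return "allow"


def guarded_reply(
    prompt: str,
    generate: Callable[[str], str],
    classify: Callable[[str], Dict[str, float]] = toy_classifier,
) -> Tuple[str, str]:
    """Check the prompt, call the main model, then check its response."""
    input_action = decide(classify(prompt))
    if input_action == "block":
        return "block", REFUSAL  # the main model is never called
    response = generate(prompt)
    output_action = decide(classify(response))
    if output_action == "block":
        return "block", REFUSAL
    # Report the more severe of the two decisions (allow or escalate).
    action = max(input_action, output_action, key=SEVERITY.get)
    return action, response
```

Here the classifier runs on both sides of the main model, so a risky prompt can be stopped before generation and a risky response can be stopped before it reaches the user. An "escalate" result would typically be handed to a human reviewer or a stricter policy; how that happens is left to the surrounding system.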
