Safety & Alignment

Moderation Model

A model that classifies text or images for content policy violations.

Definition

A moderation model is a classifier that scores prompts or responses against content policy categories such as hate, harassment, violence, or sexual content. It runs alongside a main model to flag, block, or route risky inputs and outputs for further handling. Moderation models are a common building block of guardrail systems, though they make errors in both directions: they can miss real violations (false negatives) and over-flag benign content (false positives).
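The flag, block, or allow flow described above can be sketched roughly as follows. This is a minimal illustration, not a real moderation system: `toy_scorer` is a hypothetical keyword counter standing in for a trained classifier, and the category names, the thresholds `FLAG_AT` and `BLOCK_AT`, and the `Decision` type are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Scores = Dict[str, float]

# Hypothetical thresholds; real systems tune these per category.
FLAG_AT = 0.5
BLOCK_AT = 0.9


@dataclass
class Decision:
    action: str               # "allow", "flag", or "block"
    category: Optional[str]   # highest-scoring category, if any action was taken
    score: float


def toy_scorer(text: str) -> Scores:
    """Stand-in for a trained classifier: scores categories by keyword matches."""
    keywords = {
        "violence": ("kill", "attack"),
        "harassment": ("idiot", "loser"),
    }
    lowered = text.lower()
    # Each matched keyword adds 0.6, capped at 1.0.
    return {
        category: min(1.0, 0.6 * sum(word in lowered for word in words))
        for category, words in keywords.items()
    }


def moderate(text: str, scorer: Callable[[str], Scores] = toy_scorer) -> Decision:
    """Map the highest category score to an allow, flag, or block decision."""
    scores = scorer(text)
    category, score = max(scores.items(), key=lambda item: item[1])
    if score >= BLOCK_AT:
        return Decision("block", category, score)
    if score >= FLAG_AT:
        return Decision("flag", category, score)
    return Decision("allow", None, score)


if __name__ == "__main__":
    for sample in ("Have a nice day", "You idiot", "You idiot and loser", "A useful skill"):
        print(sample, "->", moderate(sample))
```

In this sketch the benign phrase "A useful skill" is flagged because "skill" contains the substring "kill", which is a small instance of the over-flagging problem. A real deployment would replace `toy_scorer` with a trained model and typically apply the same check to both the prompt and the main model's response.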

Related terms