Safety & Alignment

Deception

Model behavior that creates false impressions in users or evaluators.

Definition

Deception, in the AI safety sense, is behavior that misleads users or evaluators by creating false impressions, whether through inaccurate claims, hidden reasoning, or selectively favorable outputs. It is a concern because a model that appears aligned during testing may behave differently once deployed. Studying and detecting deception are closely tied to honesty objectives and interpretability work.
