Safety & Alignment
Deception
Model behavior that creates false impressions in users or evaluators.
Definition
Deception, in the AI safety sense, is model behavior that creates false impressions in users or evaluators, whether through inaccurate claims, hidden reasoning, or selectively favorable outputs. It is a concern because a model that appears aligned during testing may behave differently once deployed. Research on detecting deception is closely tied to honesty objectives and interpretability work.