Safety & Alignment

Adversarial Example

An input crafted to make a model produce a wrong or unintended output.

Definition

An adversarial example is an input deliberately constructed to cause a model to make a mistake, often through small perturbations that are imperceptible to humans but change the model's prediction. Such examples expose brittleness in how models generalize, and they are studied to measure and improve robustness. The concept extends from image classifiers to language models, where crafted prompts can trigger unwanted responses.
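As a rough illustration of the "small perturbation" idea, the sketch below attacks a toy linear classifier. The weights, input, and epsilon are hypothetical values chosen for the example, and the helper names (predict, perturb) are not from any library. The perturbation follows the sign of each weight, in the spirit of the fast gradient sign method; for a linear score the gradient with respect to the input is simply the weight vector.

```python
def predict(weights, bias, x):
    """Return class 1 if the linear score is positive, otherwise class 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def perturb(weights, x, label, epsilon):
    """Move every feature by at most epsilon, in the direction that
    pushes the linear score away from the true label."""
    # For class 1 we want to lower the score; for class 0, raise it.
    direction = -1 if label == 1 else 1
    return [xi + direction * epsilon * sign(w) for w, xi in zip(weights, x)]


# Hypothetical toy model and input (assumed values for illustration only).
weights = [1.0, -2.0, 0.5]
bias = 0.0
x = [0.3, 0.1, 0.2]

original_prediction = predict(weights, bias, x)
x_adv = perturb(weights, x, label=original_prediction, epsilon=0.1)
adversarial_prediction = predict(weights, bias, x_adv)

# Largest change applied to any single feature.
max_change = max(abs(a - b) for a, b in zip(x_adv, x))

print("original prediction:   ", original_prediction)
print("adversarial prediction:", adversarial_prediction)
print("largest feature change:", max_change)
```

With these assumed values, no feature moves by more than epsilon, yet the predicted class changes. Real attacks on deep networks follow the same logic but compute the gradient through the whole model rather than reading it off the weights.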
