Prompting

Adversarial Prompt

Input crafted to exploit a model's weaknesses and trigger unintended behavior.

Definition

An adversarial prompt is an input deliberately crafted to exploit weaknesses in a model's training or deployment. The category spans jailbreaks that bypass refusals, prompt injections that hijack agent actions, and inputs designed to make the model state confident falsehoods, leak its training data, or burn costly computing resources. Studying these prompts is central to red-teaming, where researchers hunt for failure modes before a system ships.

Adversarial Prompt

Definition

Related terms