Safety & Alignment

Red-Teaming

Adversarially probing a model to surface harmful outputs and failures before deployment.

Definition

Red-teaming is the practice of adversarially testing a model to surface harmful outputs, jailbreaks, and other failures so they can be fixed before deployment. Borrowed from security practice, it has people or automated systems craft adversarial prompts and edge cases to expose weaknesses. Findings feed back into safety mitigations and fine-tuning.

Red-Teaming

Definition

Related terms