Skip to main content
All terms
Evaluation

Red Team Eval

An evaluation that adversarially stress-tests a model to surface harmful behavior.

Definition

A red team eval is a structured evaluation that deliberately probes a model with adversarial inputs to find where it fails or produces harmful output. Testers craft jailbreaks, misleading prompts, and edge cases that ordinary benchmarks miss, then record which attacks succeed. It is used before and after release to measure safety weaknesses and to guide guardrails and further training.