
Evaluation

The methods used to judge a model's quality, safety, and reliability.

Definition

Evaluation, often shortened to evals, is the set of methods used to judge a model's quality, safety, and reliability. These methods range from standardized benchmarks and automatic metrics to human review and task-specific tests. Well-designed evals are central to building, comparing, and trusting AI systems because they reveal both capabilities and failure modes.
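As an illustration of the "automatic metric" end of that range, here is a minimal sketch of an exact-match eval in Python. Everything named here (`normalize`, `exact_match_accuracy`, `toy_model`, and the three test cases) is hypothetical and chosen for the example; a real eval would call an actual model and use a much larger, carefully built test set.

```python
from typing import Callable, Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def exact_match_accuracy(
    model: Callable[[str], str],
    cases: Iterable[Tuple[str, str]],
) -> float:
    """Return the fraction of cases where the model's answer matches the reference."""
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one test case")
    correct = sum(
        normalize(model(prompt)) == normalize(expected)
        for prompt, expected in cases
    )
    return correct / len(cases)


# Hypothetical stand-in for a real model: a lookup table with one wrong answer.
def toy_model(prompt: str) -> str:
    answers = {
        "2 + 2 =": "4",
        "Capital of France?": "paris",
        "Opposite of hot?": "warm",
    }
    return answers.get(prompt, "")


# Illustrative (prompt, reference answer) pairs, not a real benchmark.
CASES = [
    ("2 + 2 =", "4"),
    ("Capital of France?", "Paris"),
    ("Opposite of hot?", "cold"),
]

score = exact_match_accuracy(toy_model, CASES)
print(f"exact-match accuracy: {score:.2f}")
```

Exact match is only one possible metric, and a strict one: it suits short answers with a single correct form, while open-ended outputs usually need softer metrics or human review.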