Evaluation
The methods used to judge a model's quality, safety, and reliability.
Definition
Evaluation, often shortened to evals, is the set of methods used to assess a model's quality, safety, and reliability. These methods range from standardized benchmarks and automatic metrics to human review and task-specific tests. Well-designed evals are central to building, comparing, and trusting AI systems because they reveal both capabilities and failure modes.
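As an illustration of the automatic-metric end of that range, the sketch below scores a model by exact-match accuracy on a small test set. Everything in it is hypothetical and chosen only for illustration: the `toy_model` stand-in, the `EXAMPLES` data, and the `normalize` rule are assumptions, not part of any particular benchmark or library.

```python
from typing import Callable, Sequence, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def exact_match_accuracy(
    model: Callable[[str], str],
    examples: Sequence[Tuple[str, str]],
) -> float:
    """Return the fraction of prompts whose normalized output equals the expected answer."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(
        normalize(model(prompt)) == normalize(expected)
        for prompt, expected in examples
    )
    return correct / len(examples)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model: a fixed lookup table."""
    answers = {"2 + 2 = ?": "4", "Capital of France?": "paris"}
    return answers.get(prompt, "unknown")


# Hypothetical test set of (prompt, expected answer) pairs.
EXAMPLES = [
    ("2 + 2 = ?", "4"),
    ("Capital of France?", "Paris"),
    ("Largest planet?", "Jupiter"),
]

score = exact_match_accuracy(toy_model, EXAMPLES)
print(f"exact-match accuracy: {score:.2f}")
```

A single metric like this captures only one narrow aspect of quality. The third example, which the toy model gets wrong, is the kind of failure mode an eval is meant to surface, and safety or reliability would typically need separate tests or human review.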