Evaluation

Agent Benchmarks

Evaluations built to measure what AI agents can actually accomplish, not just what the underlying model knows.

Definition

Agent benchmarks are evaluations that target agents rather than standalone models. Instead of testing knowledge in a single prompt and response, they measure tool use, planning, memory, multi-step execution, and how well an agent holds up in changing environments, which brings them closer to real tasks carried out over time.
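To make the contrast with one-shot testing concrete, here is a minimal Python sketch of what an agent benchmark harness might look like. It is a toy under stated assumptions, not the design of any real benchmark: the names `Task`, `TOOLS`, `run_episode`, and `success_rate`, the counter environment, and both example agents are all hypothetical and exist only for illustration. The point it shows is that the score comes from the final state of an environment after a sequence of tool calls, not from a single answer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, int]

# Hypothetical task definition: a starting environment state, a goal check
# on the final state, and a step budget.
@dataclass
class Task:
    name: str
    initial_state: State
    goal: Callable[[State], bool]
    max_steps: int = 6

# Hypothetical tools the agent may call; each one mutates the environment.
def increment(state: State) -> None:
    state["counter"] += 1

def double(state: State) -> None:
    state["counter"] *= 2

TOOLS: Dict[str, Callable[[State], None]] = {
    "increment": increment,
    "double": double,
}

# An agent maps the current state to the name of the tool it wants to call.
Agent = Callable[[State], str]

def run_episode(agent: Agent, task: Task) -> dict:
    """Run one multi-step episode and report success and steps used."""
    state = dict(task.initial_state)  # copy so tasks can be reused
    steps = 0
    while steps < task.max_steps and not task.goal(state):
        action = agent(state)
        if action not in TOOLS:
            break  # an invalid tool call ends the episode
        TOOLS[action](state)
        steps += 1
    return {"task": task.name, "success": task.goal(state), "steps": steps}

def success_rate(agent: Agent, tasks: List[Task]) -> float:
    """Fraction of tasks whose goal holds when the episode ends."""
    results = [run_episode(agent, t) for t in tasks]
    return sum(r["success"] for r in results) / len(results)

def reached_target(state: State) -> bool:
    return state["counter"] == state["target"]

TASKS = [
    Task("reach-10", {"counter": 1, "target": 10}, reached_target),
    Task("reach-7", {"counter": 3, "target": 7}, reached_target),
]

# Toy agent that plans: double while that does not overshoot the target.
def greedy_agent(state: State) -> str:
    if state["counter"] * 2 <= state["target"]:
        return "double"
    return "increment"

# Toy agent with no planning: always uses the same tool.
def increment_only_agent(state: State) -> str:
    return "increment"

if __name__ == "__main__":
    print("greedy:", success_rate(greedy_agent, TASKS))
    print("increment-only:", success_rate(increment_only_agent, TASKS))
```

In this sketch the step budget is what separates the two agents: the one that picks tools according to the current state finishes both tasks, while the one that repeats a single tool runs out of steps on the longer task. Real agent benchmarks use far richer environments and scoring, but the loop of observe, call a tool, and check the end state is the same basic shape.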

Related terms