Long-Horizon Evaluation
Testing AI agents on long tasks that need many steps and sustained coherence.
Definition
Long-horizon evaluation measures how well an agent handles tasks that stretch across many steps — staying on goal, recovering from errors, managing memory, and making sensible decisions over dozens or hundreds of actions. It matters most for real deployments, where work rarely finishes in a single step.
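As a rough illustration of the definition above, the sketch below scores one recorded episode on a few of the qualities it names. It is a minimal sketch, not a standard benchmark or library API: the `Step` record, the `score_episode` function, the metric names, and the rule that an error counts as "recovered" when the very next step succeeds are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One agent action in a recorded episode (hypothetical schema)."""
    ok: bool       # the action executed without error
    on_goal: bool  # the action was relevant to the task goal


def score_episode(steps, goal_reached):
    """Summarize a multi-step episode with simple long-horizon metrics."""
    n = len(steps)
    errors = [i for i, s in enumerate(steps) if not s.ok]
    # Assumption: an error is "recovered" if the step right after it succeeds.
    recovered = sum(1 for i in errors if i + 1 < n and steps[i + 1].ok)
    return {
        "steps": n,
        "success": goal_reached,
        # Fraction of actions that stayed on goal.
        "on_goal_rate": sum(s.on_goal for s in steps) / n if n else 0.0,
        # Fraction of errors the agent recovered from (1.0 if no errors).
        "error_recovery_rate": recovered / len(errors) if errors else 1.0,
    }


# A toy four-step episode: one error that is recovered, one off-goal action.
episode = [
    Step(ok=True, on_goal=True),
    Step(ok=False, on_goal=True),
    Step(ok=True, on_goal=True),
    Step(ok=True, on_goal=False),
]
report = score_episode(episode, goal_reached=True)
print(report)
```

In practice an evaluation would average such per-episode reports over many tasks and compare them across task lengths, since the point of a long-horizon test is to see whether these rates hold up as the number of steps grows.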