Evaluation

Long-Horizon Evaluation

Testing AI agents on long tasks that need many steps and sustained coherence.

Definition

Long-horizon evaluation measures how well an agent handles tasks that stretch across many steps — staying on goal, recovering from errors, managing memory, and making sensible decisions over dozens or hundreds of actions. It matters most for real deployments, where work rarely finishes in a single step.
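
As a rough illustration of the definition above, the sketch below runs an agent for up to a fixed number of steps and records whether it reached the goal, how many steps it took, and how often it continued successfully after an error. Everything here is hypothetical and chosen for illustration: the `EpisodeResult` fields, the `run_episode` helper, the toy `CounterEnv` task, and the `(observation, error, done)` step signature are assumptions, not part of any particular benchmark or framework.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class EpisodeResult:
    success: bool     # did the agent reach the goal within the step budget?
    steps: int        # number of actions taken
    errors: int       # steps where the environment reported an error
    recoveries: int   # error steps that were followed by a non-error step


class CounterEnv:
    """Toy task (hypothetical): reach `target` by repeated "inc" actions.

    Every `fail_every`-th call fails deterministically, standing in for
    the tool errors an agent meets during a long task.
    """

    def __init__(self, target: int = 50, fail_every: int = 7) -> None:
        self.target = target
        self.fail_every = fail_every
        self.count = 0
        self.calls = 0

    def step(self, action: str) -> Tuple[str, bool, bool]:
        """Return (observation, error, done)."""
        self.calls += 1
        if self.calls % self.fail_every == 0:
            return f"error at {self.count}", True, False
        if action == "inc":
            self.count += 1
        return str(self.count), False, self.count >= self.target


def run_episode(
    agent: Callable[[str], str],
    env: CounterEnv,
    max_steps: int,
) -> EpisodeResult:
    """Run one episode and collect simple long-horizon metrics."""
    obs = "0"
    errors = recoveries = 0
    prev_error = False
    for step in range(1, max_steps + 1):
        action = agent(obs)
        obs, error, done = env.step(action)
        if error:
            errors += 1
        elif prev_error:
            recoveries += 1  # the agent kept going after a failed step
        prev_error = error
        if done:
            return EpisodeResult(True, step, errors, recoveries)
    return EpisodeResult(False, max_steps, errors, recoveries)


def always_inc(obs: str) -> str:
    """Trivial stand-in agent that ignores the observation."""
    return "inc"


full = run_episode(always_inc, CounterEnv(), max_steps=100)
short = run_episode(always_inc, CounterEnv(), max_steps=40)
print(full)
print(short)
```

With a budget of 100 steps the toy agent finishes the task; with a budget of 40 it runs out of steps first. A real evaluation would aggregate results like these over many tasks and would also need measures for goal drift and memory use, which this sketch does not attempt.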