Evaluation
OSWorld
A benchmark testing AI agents on real desktop computer tasks.
Definition
OSWorld is a benchmark that tests AI agents on real computer tasks across operating systems and applications. It checks whether a system can use an ordinary graphical interface (windows, menus, files) to complete open-ended desktop work. This is much harder than answering questions, because the agent has to carry out a sequence of actions in the interface rather than produce a single response.
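To make the contrast with question answering concrete, here is a minimal toy sketch of an agent loop on a desktop-style task. It is not OSWorld's actual API: `ToyDesktop`, `scripted_agent`, `task_succeeded`, and `run_episode` are hypothetical names invented for illustration, and the sketch assumes that success is checked against the resulting state of the environment rather than against text the agent produces.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyDesktop:
    """Hypothetical stand-in for a desktop: just a set of file names."""
    files: set[str] = field(default_factory=lambda: {"report.txt"})

    def observe(self) -> list[str]:
        # A real agent would see something like a screenshot; here it is a file list.
        return sorted(self.files)

    def step(self, action: tuple[str, ...]) -> None:
        # Only one toy action is supported: ("rename", old_name, new_name).
        if action[0] == "rename" and action[1] in self.files:
            self.files.remove(action[1])
            self.files.add(action[2])


def scripted_agent(observation: list[str]) -> tuple[str, ...] | None:
    """Hypothetical agent: renames report.txt if it is visible, otherwise stops."""
    if "report.txt" in observation:
        return ("rename", "report.txt", "final.txt")
    return None


def task_succeeded(env: ToyDesktop) -> bool:
    # Assumption of this sketch: success is judged from the final state.
    return env.files == {"final.txt"}


def run_episode(env: ToyDesktop, max_steps: int = 5) -> bool:
    # Observe, act, repeat until the agent stops or the step budget runs out.
    for _ in range(max_steps):
        action = scripted_agent(env.observe())
        if action is None:
            break
        env.step(action)
    return task_succeeded(env)
```

Calling `run_episode(ToyDesktop())` runs the loop on the default toy desktop. The point of the design is that the score depends on what the agent did to the environment, not on what it said.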