Evaluation
SWE-bench
A benchmark that tests coding agents on real GitHub issues from open-source repositories.
Definition
SWE-bench evaluates coding agents on real software tasks drawn from GitHub issues in open-source repositories. For each task, the model is given the codebase and the issue text and must produce a patch that resolves the issue. The patch is then checked against the project's tests: tests tied to the fix must pass, and tests that already passed must not break. Because a task can require reading and editing across many files, SWE-bench has become a standard test of agentic coding, though scores depend heavily on the scaffolding around the model, such as the tools and prompts the agent runs with.
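Example
The pass/fail check described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual harness: the `TaskResult` class and both function names are hypothetical. The two test groups follow the convention the SWE-bench dataset uses, where `FAIL_TO_PASS` lists tests the fix should make pass and `PASS_TO_PASS` lists tests that must keep passing. Treating a task with no target tests as unresolved is an assumption made here for safety.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TaskResult:
    """Test outcomes for one task after the model's patch is applied.

    Hypothetical structure for illustration; each dict maps a test name
    to True if that test passed.
    """
    fail_to_pass: dict[str, bool]  # tests the fix is expected to make pass
    pass_to_pass: dict[str, bool]  # tests that passed before and must not regress


def is_resolved(result: TaskResult) -> bool:
    """A task counts as resolved only if every test in both groups passes."""
    if not result.fail_to_pass:
        # Assumption: with no target tests there is nothing to confirm the fix.
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolved_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


if __name__ == "__main__":
    fixed = TaskResult({"test_bug": True}, {"test_old": True})
    regressed = TaskResult({"test_bug": True}, {"test_old": False})
    print(resolved_rate([fixed, regressed]))  # 0.5
```

The second task fixes the reported bug but breaks a test that used to pass, so it does not count as resolved. A headline score is the share of tasks that clear both checks.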