Evaluation
HellaSwag
A commonsense benchmark where models pick the most plausible ending to a short scenario.
Definition
HellaSwag is a commonsense reasoning benchmark in which a model reads a short situation and chooses the most plausible continuation from four options. The wrong choices are deliberately crafted to look plausible at a glance but break down on reflection, which made the task hard for the models available when it was introduced in 2019. Modern language models now score close to human accuracy on it, so it serves as a baseline rather than a frontier evaluation.
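The selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation harness: the field names (`context`, `endings`, `label`), the sample item, and the `toy_score` function are all assumptions made for the example. In practice the scorer would typically be a language model's log-likelihood of each ending given the context, often normalized by length.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical scorer signature: higher means "more plausible continuation".
Scorer = Callable[[str, str], float]


def pick_ending(context: str, endings: Sequence[str], score: Scorer) -> int:
    """Return the index of the highest-scoring ending."""
    scores = [score(context, ending) for ending in endings]
    return max(range(len(endings)), key=scores.__getitem__)


def accuracy(examples: List[Dict], score: Scorer) -> float:
    """Fraction of examples where the chosen ending matches the gold label."""
    correct = sum(
        pick_ending(ex["context"], ex["endings"], score) == ex["label"]
        for ex in examples
    )
    return correct / len(examples)


def toy_score(context: str, ending: str) -> float:
    """Stand-in for a model score (an assumption for this sketch):
    the fraction of ending words that also occur in the context."""
    context_words = set(context.lower().split())
    ending_words = ending.lower().split()
    return sum(w in context_words for w in ending_words) / len(ending_words)


# Invented item in the four-option format; it is not taken from the dataset.
examples = [
    {
        "context": "A woman fills a bucket with water and soap. She",
        "endings": [
            "dips a sponge in the bucket and washes the car.",
            "throws the car into the sky.",
            "reads a book about taxes.",
            "paints the ocean green.",
        ],
        "label": 0,
    }
]

if __name__ == "__main__":
    print(accuracy(examples, toy_score))
```

Passing the scorer in as a function keeps the selection and accuracy logic independent of any particular model, so the word-overlap stand-in can be swapped for a real likelihood-based scorer without changing the rest.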