Evaluation

HELM

A framework and benchmark suite for broad, transparent evaluation of language models.

Definition

HELM is a framework and benchmark suite for evaluating language models across many tasks and metrics at once. Rather than a single score, it reports a structured view that covers dimensions such as accuracy, calibration, robustness, fairness, and efficiency. The goal is broad, transparent comparison, so the same models are measured under consistent conditions across scenarios.

HELM

Definition

Related terms