Skip to main content
All terms
Evaluation

HELM

A framework and benchmark suite for broad, transparent evaluation of language models.

Definition

HELM is a framework and benchmark suite for evaluating language models across many tasks and metrics at once. Rather than a single score, it reports a structured view that covers dimensions such as accuracy, calibration, robustness, fairness, and efficiency. The goal is broad, transparent comparison, so the same models are measured under consistent conditions across scenarios.