LM Evaluation Harness

EleutherAI's open framework for running language models across many standard benchmarks.

Definition

The LM Evaluation Harness is an open-source framework from EleutherAI that provides a single, reproducible interface for evaluating language models on hundreds of benchmarks, including MMLU, ARC, HellaSwag, GSM8K, and TruthfulQA. It supports Hugging Face models, hosted API providers, and local inference backends, and it serves as the evaluation backend for Hugging Face's Open LLM Leaderboard. Because every model is run through the same task definitions and scoring code, results are directly comparable, which has made the harness a community standard for comparing open models.
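To make the "single interface" concrete, here is a minimal sketch of how one might assemble a command line for the harness's `lm_eval` entry point. The helper `build_lm_eval_command` is hypothetical and written only for this illustration; the model name, task names, and defaults are assumptions, and the flags shown should be checked against the installed version of the harness. The sketch only builds and prints the command; it does not run an evaluation.

```python
import shlex


def build_lm_eval_command(pretrained, tasks, num_fewshot=None,
                          batch_size=8, device="cuda:0", output_path=None):
    """Build an argv list for the `lm_eval` CLI (hypothetical helper).

    Assumes the Hugging Face backend (`--model hf`). Nothing is executed
    here; the caller decides whether to pass the list to subprocess.run.
    """
    if not tasks:
        raise ValueError("at least one task name is required")

    argv = [
        "lm_eval",
        "--model", "hf",                              # Hugging Face backend
        "--model_args", f"pretrained={pretrained}",   # which checkpoint to load
        "--tasks", ",".join(tasks),                   # comma-separated task names
        "--device", device,
        "--batch_size", str(batch_size),
    ]
    # Optional flags are appended only when the caller sets them.
    if num_fewshot is not None:
        argv += ["--num_fewshot", str(num_fewshot)]
    if output_path is not None:
        argv += ["--output_path", output_path]
    return argv


# Example: the same command shape works for any model or task list.
cmd = build_lm_eval_command(
    "EleutherAI/pythia-160m",          # assumed example checkpoint
    ["hellaswag", "arc_challenge"],    # assumed example task names
    num_fewshot=0,
)
print(shlex.join(cmd))
```

Swapping the checkpoint or the task list changes only the arguments, not the workflow, which is the property that makes scores from different models comparable.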
