Evaluation

LLM-as-a-Judge

Using a capable model to score or compare the outputs of other models.

Definition

LLM-as-a-Judge uses a capable language model, often guided by a rubric or a reference answer, to score a single response or to pick the better of two candidates. It scales evaluation beyond what human annotators can label, and it underpins preference learning, benchmark construction, and production monitoring. The judge can carry biases of its own, such as favoring the first answer shown, the longer answer, or its own outputs, so its judgments should be validated against human ratings.
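
The sketch below illustrates two of the points above: one common way to reduce the preference for the first answer in pairwise comparison (ask twice with the order swapped and keep only consistent verdicts), and a simple agreement rate against human labels. It is a minimal illustration, not a reference implementation. The `Judge` signature, the function names, and the two toy judges are assumptions made for this example; in practice the judge would wrap a call to a real model.

```python
from typing import Callable, Literal, Sequence

Verdict = Literal["A", "B", "tie"]

# Hypothetical judge signature (an assumption for this sketch):
# (prompt, first_shown_answer, second_shown_answer) -> "A" | "B" | "tie",
# where "A" means the first-shown answer is judged better.
Judge = Callable[[str, str, str], Verdict]


def pairwise_judgment(judge: Judge, prompt: str, ans_1: str, ans_2: str) -> Verdict:
    """Query the judge in both orders; return a winner only if both runs agree."""
    first = judge(prompt, ans_1, ans_2)
    second = judge(prompt, ans_2, ans_1)
    # Map the verdict from the swapped run back to the original ordering.
    second_unswapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    return first if first == second_unswapped else "tie"


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of items on which the judge and the human rater give the same label."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Toy stand-ins for a model-backed judge, used only to exercise the logic.
def first_position_judge(prompt: str, a: str, b: str) -> Verdict:
    """Always prefers whichever answer is shown first (pure position bias)."""
    return "A"


def longer_answer_judge(prompt: str, a: str, b: str) -> Verdict:
    """Prefers the longer answer regardless of position (pure length bias)."""
    if len(a) == len(b):
        return "tie"
    return "A" if len(a) > len(b) else "B"


if __name__ == "__main__":
    # The position-biased judge contradicts itself when the order is swapped.
    print(pairwise_judgment(first_position_judge, "q", "x", "y"))
    # The length-biased judge is consistent across orders, so swapping does not catch it.
    print(pairwise_judgment(longer_answer_judge, "q", "short", "a much longer answer"))
```

Swapping the order exposes only position-dependent inconsistency. A judge that consistently prefers longer answers or its own outputs passes this check, which is why comparison against human ratings is still needed.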
