Evaluation
Pairwise Comparison
Judging outputs by directly comparing two candidates and picking the better one.
Definition
Pairwise comparison evaluates outputs by showing a judge, human or model, two candidate responses to the same prompt and asking which is better. Judging relative quality is often easier and more reliable than assigning an absolute score to each response in isolation. Aggregating many such comparisons yields metrics such as win rate and Elo-style ratings, which are used in preference data and leaderboards.
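Example
As a rough illustration of the aggregation step, the sketch below turns a list of judged pairs into win rates and Elo-style ratings. Everything in it is an assumption made for illustration: the function names, the `(candidate_a, candidate_b, outcome)` tuple format, the model names, and the conventional Elo constants (K-factor 32, starting rating 1000, scale 400). It is not the method of any particular leaderboard.

```python
from collections import defaultdict

# Each comparison is (candidate_a, candidate_b, outcome), where outcome is
# "a", "b", or "tie". This record format is hypothetical.
comparisons = [
    ("model_x", "model_y", "a"),
    ("model_x", "model_y", "a"),
    ("model_y", "model_x", "a"),
    ("model_x", "model_y", "tie"),
]


def win_rates(comparisons):
    """Fraction of comparisons each candidate won; a tie counts as half a win."""
    wins = defaultdict(float)
    games = defaultdict(int)
    for a, b, outcome in comparisons:
        games[a] += 1
        games[b] += 1
        if outcome == "a":
            wins[a] += 1.0
        elif outcome == "b":
            wins[b] += 1.0
        else:  # tie
            wins[a] += 0.5
            wins[b] += 0.5
    return {name: wins[name] / games[name] for name in games}


def elo_ratings(comparisons, k=32.0, initial=1000.0):
    """Sequential Elo-style updates; k and initial are assumed defaults."""
    ratings = defaultdict(lambda: initial)
    for a, b, outcome in comparisons:
        # Expected score of `a` under the standard logistic Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        # Actual score of `a`: 1 for a win, 0 for a loss, 0.5 for a tie.
        score_a = {"a": 1.0, "b": 0.0}.get(outcome, 0.5)
        delta = k * (score_a - expected_a)
        ratings[a] += delta
        ratings[b] -= delta  # zero-sum: b loses what a gains
    return dict(ratings)


if __name__ == "__main__":
    print(win_rates(comparisons))
    print(elo_ratings(comparisons))
```

With the sample data, model_x has a win rate of 0.625 and model_y has 0.375. Sequential Elo updates depend on the order in which comparisons are processed, so ratings computed this way are usually averaged over shuffles or replaced by a batch fit when order should not matter.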