Correct looks better: Pairwise comparisons reveal accuracy rankings
Abstract
Pairwise comparisons by humans or judge models, combined with aggregation methods such as Elo or Bradley-Terry, have become a central part of evaluating generative models. However, there has been significant debate about whether they measure what they intend to measure. Some argue that pairwise comparisons from judges may reward superficial stylistic cues or reflect judge biases. Taking a more positive view, we show that model rankings from pairwise comparisons strongly agree with ground-truth accuracy rankings when such ground truth is available for comparison. To make this observation, we convert well-known benchmarks into free-form generative evaluations scored with Elo rankings derived from pairwise comparisons. We find that Elo rankings show Spearman correlation above 0.9 with accuracy rankings across five established benchmarks. In addition, Elo rankings agree with accuracy significantly more than direct evaluation does when the judge is weak. Finally, we show that style and judge bias have only minor effects on model rankings. Although style and bias impede absolute measurement, our work demonstrates that model rankings from pairwise comparisons nevertheless reflect accuracy.
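The sketch below illustrates, under assumptions, the pipeline the abstract describes: aggregating pairwise judge verdicts into Elo ratings and comparing the resulting model ranking against accuracy with Spearman correlation. It is a minimal illustration rather than the paper's implementation; the function name `elo_ratings`, the K-factor, and the toy data are hypothetical.

```python
# Minimal sketch (not the paper's implementation): Elo aggregation of pairwise
# judge verdicts, then Spearman correlation against ground-truth accuracy.
from collections import defaultdict
from scipy.stats import spearmanr

def elo_ratings(comparisons, k=32, base=1500.0):
    """comparisons: list of (winner_model, loser_model) judge verdicts (hypothetical format)."""
    ratings = defaultdict(lambda: base)
    for winner, loser in comparisons:
        # Expected score of the winner under the standard Elo logistic model.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        ratings[winner] += k * (1.0 - expected_win)
        ratings[loser] -= k * (1.0 - expected_win)
    return dict(ratings)

# Toy example: three models with hypothetical judge verdicts and accuracies.
verdicts = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"), ("B", "C")]
accuracy = {"A": 0.82, "B": 0.71, "C": 0.55}

elo = elo_ratings(verdicts)
models = sorted(accuracy)
rho, _ = spearmanr([elo[m] for m in models], [accuracy[m] for m in models])
print(f"Spearman correlation between Elo and accuracy rankings: {rho:.2f}")
```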