Robust AI Evaluation through Maximal Lotteries
Abstract
The standard way to evaluate language models on subjective tasks is through pairwise comparisons: an annotator chooses the "better" of two model responses for a given prompt. These comparisons are then aggregated into a single ranking via the Bradley–Terry (BT) framework, forcing heterogeneous preferences into a total order and violating basic social-choice desiderata. In contrast, social choice theory provides an alternative approach called maximal lotteries, which aggregates pairwise preferences without imposing any assumptions on their structure. However, we show that maximal lotteries can be highly sensitive to heterogeneity among annotators and across prompts. We introduce robust lotteries, which optimize worst-case performance under plausible shifts in the preference data. On large-scale preference datasets, robust lotteries achieve more reliable win rate guarantees across the annotator distribution and recover a stable set of top performing models.