Anytime-Valid Inference for Online Ranking of Large Language Models
Abstract
Online evaluation of large language models increasingly relies on sequentially collected pairwise preferences, enabling human-aligned assessment and continuous data collection until closely matched models can be reliably distinguished. However, adaptive sampling and continuous monitoring invalidate classical fixed-sample inference, rendering existing ranking procedures largely heuristic. We propose SERPANT (Sequential E-value Ranking and Pruning via Adaptive Null Testing), a principled framework for online LLM ranking with anytime-valid guarantees. SERPANT formulates model comparison as a collection of pairwise hypothesis tests and constructs e-processes that ensure family-wise error rate control at any monitoring time. Anytime validity provides a theoretical justification for early stopping, yielding substantial savings in expensive human annotation. To improve efficiency, we introduce a novel tournament-based sampling strategy that adaptively selects comparisons based on past outcomes. The proposed framework further provides anytime-valid confidence sets for top-k model identification. Theoretical analysis and empirical results on benchmark datasets validate both the efficiency and the statistical guarantees of the proposed framework.
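As a minimal sketch of the anytime-valid guarantee (following the standard e-process argument; the exact construction and multiplicity correction used by SERPANT may differ): for each pair of models $(i,j)$ among $K$ models, one builds a nonnegative process $(E_t^{(i,j)})_{t \ge 0}$ with $E_0^{(i,j)} = 1$ that is an e-process for the pairwise null $H_0^{(i,j)}$, i.e., $\mathbb{E}[E_\tau^{(i,j)}] \le 1$ under $H_0^{(i,j)}$ for every stopping time $\tau$. Ville's inequality then gives a time-uniform type-I error bound,
\[
\mathbb{P}_{H_0^{(i,j)}}\!\Bigl(\exists\, t \ge 1 : E_t^{(i,j)} \ge \tfrac{m}{\alpha}\Bigr) \le \frac{\alpha}{m},
\qquad m = \binom{K}{2},
\]
so rejecting each null the first time its e-process crosses $m/\alpha$ and taking a union bound over the $m$ pairwise tests controls the family-wise error rate at level $\alpha$ uniformly over all monitoring times. This time-uniform control is precisely the property that justifies stopping data collection early.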