RouterJudge: Preference-Based Evaluation of LLM Routers under Pluralistic User Preferences
Abstract
Large language model routing aims to select an appropriate model for each user query under constraints such as quality, cost, latency, and task context. Existing routing methods are commonly evaluated with static benchmarks, golden answers, or automated quality scores. Such offline evaluation assumes a relatively fixed notion of response quality, but this assumption breaks down in open-ended real-world settings, where users may disagree on what constitutes a better answer due to differences in style preference, desired level of detail, cost sensitivity, and task-specific expectations. We propose RouterJudge, an online pairwise preference evaluation framework for LLM routing systems. Inspired by the anonymous A/B comparison protocol of Chatbot Arena, RouterJudge shifts the unit of evaluation from model-level response quality to router-level decision quality. For each query, multiple routing strategies independently recommend candidate models; selected model outputs are then presented to users in a blinded pairwise comparison, and user preferences are attributed back to the routing strategies that produced the corresponding routing decisions. Each evaluation record integrates the user query, routing decisions, paired model responses, preference labels, and cost information. Based on this protocol, RouterJudge supports routing-oriented analyses including preference win rate, cost-quality Pareto frontier, task-conditioned performance, pairwise router comparison, and routing behavior diagnostics. By grounding routing evaluation in pluralistic user preferences rather than fixed golden answers, RouterJudge provides a practical foundation for studying preference-aware, cost-aware, and context-adaptive LLM routing.
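As a rough illustration of the protocol described above, the following minimal sketch shows one way an evaluation record could be stored and how blinded pairwise preferences could be attributed back to routing strategies. All names here (`EvaluationRecord`, its fields, and `summarize`) are hypothetical and not taken from RouterJudge itself. Counting a tie as half a win for each side is likewise an assumption made for this sketch.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluationRecord:
    """One blinded pairwise comparison (hypothetical schema)."""
    query: str
    router_a: str      # routing strategy behind side A
    router_b: str      # routing strategy behind side B
    model_a: str       # model selected by router_a
    model_b: str       # model selected by router_b
    response_a: str
    response_b: str
    preference: str    # user label: "a", "b", or "tie"
    cost_a: float      # cost of producing response_a
    cost_b: float      # cost of producing response_b


def summarize(records):
    """Attribute each preference to the routers and aggregate per router.

    Returns {router: {"win_rate", "mean_cost", "comparisons"}}.
    Assumption: a tie contributes 0.5 to each side.
    """
    score_for_a = {"a": 1.0, "b": 0.0, "tie": 0.5}
    wins = defaultdict(float)
    costs = defaultdict(float)
    counts = defaultdict(int)

    for rec in records:
        if rec.preference not in score_for_a:
            raise ValueError(f"unknown preference label: {rec.preference!r}")
        score_a = score_for_a[rec.preference]
        sides = (
            (rec.router_a, score_a, rec.cost_a),
            (rec.router_b, 1.0 - score_a, rec.cost_b),
        )
        for router, score, cost in sides:
            wins[router] += score
            costs[router] += cost
            counts[router] += 1

    return {
        router: {
            "win_rate": wins[router] / counts[router],
            "mean_cost": costs[router] / counts[router],
            "comparisons": counts[router],
        }
        for router in counts
    }
```

Plotting each router's `mean_cost` against its `win_rate` gives the points from which a cost-quality Pareto frontier could be read off. Grouping records by a task label before calling `summarize` would give a task-conditioned view.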