Routing and Reasoned Evaluation with Large Language Models
Guiyao Tie ⋅ Tianyao Luo ⋅ Xueyang Zhou ⋅ Chaoran Hu ⋅ Yunhong He ⋅ Junran Wu ⋅ Yuanfan Yao ⋅ Pan Zhou ⋅ Lichao Sun
Abstract
Large language models (LLMs) are increasingly used to provide automated assessment signals for evaluating model-generated outputs. However, practical deployment faces three persistent challenges: heterogeneous reliability across models, substantial latency and token costs, and the absence of principled strategies for allocating evaluation resources. We introduce R$^2$Eval, a routing-aware automated assessment framework that formulates evaluation as a resource allocation and aggregation problem rather than relying on a single monolithic evaluator. R$^2$Eval combines difficulty-aware routing with reasoned evaluation signals to dynamically select evaluator models on a per-instance basis under explicit accuracy, latency, and cost constraints. Our study makes three contributions. First, we construct six difficulty-aware datasets spanning both reasoning-intensive (mathematics, logic, code) and non-reasoning (knowledge, roleplay, writing) tasks, with human-annotated reference assessments. Second, we provide a systematic empirical analysis of how reasoning traces produced by different evaluator models correlate with assessment outcomes, revealing substantial variance and systematic mismatches across difficulty regimes. Third, we develop and evaluate both offline and online routing strategies that adaptively allocate evaluation queries, achieving substantially improved accuracy–efficiency trade-offs compared to static baselines. Experiments across 19 language models demonstrate that R$^2$Eval significantly reduces evaluation cost and latency while maintaining close alignment with human assessments. These results highlight the importance of routing-aware automated assessment and establish R$^2$Eval as a scalable and reliable framework for large-scale model evaluation.
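The core allocation idea described in the abstract, routing each evaluation query to the cheapest evaluator expected to meet an accuracy target for that instance's difficulty, can be sketched as follows. This is an illustrative sketch only: the evaluator names, costs, and accuracy estimates are hypothetical placeholders, not values from the paper, and R$^2$Eval's actual routing policies (offline and online) are more sophisticated.

```python
# Illustrative sketch of difficulty-aware evaluator routing.
# All names, costs, and accuracy figures are hypothetical placeholders.

from dataclasses import dataclass

@dataclass
class Evaluator:
    name: str
    cost: float        # e.g., tokens or dollars per evaluation query
    accuracy: dict     # estimated agreement with human assessments, by difficulty

EVALUATORS = [
    Evaluator("small-judge",  cost=1.0,  accuracy={"easy": 0.92, "hard": 0.70}),
    Evaluator("medium-judge", cost=4.0,  accuracy={"easy": 0.95, "hard": 0.84}),
    Evaluator("large-judge",  cost=15.0, accuracy={"easy": 0.97, "hard": 0.93}),
]

def route(difficulty: str, min_accuracy: float) -> Evaluator:
    """Pick the cheapest evaluator expected to meet the accuracy target for
    this difficulty level; fall back to the strongest evaluator otherwise."""
    feasible = [e for e in EVALUATORS if e.accuracy[difficulty] >= min_accuracy]
    if feasible:
        return min(feasible, key=lambda e: e.cost)
    return max(EVALUATORS, key=lambda e: e.accuracy[difficulty])

# Easy instances stay on the cheap judge; hard instances escalate.
print(route("easy", 0.90).name)  # small-judge
print(route("hard", 0.90).name)  # large-judge
```

Under this toy policy, aggregate cost drops whenever the difficulty distribution is skewed toward easy instances, which is the accuracy–efficiency trade-off the abstract refers to.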