AlphaRouter: Token-level Routing Between SLM and LLM with Reinforcement Learning and Tree Search
Abstract
Routing between a small language model (SLM) and a large language model (LLM) accelerates generation by invoking the LLM only for critical tokens. However, existing methods typically train routers to imitate the LLM, capping performance at the limit of the reference trajectory. In this work, we demonstrate that the SLM-LLM collaborative inference space offers a richer solution set, yielding correct answers even when the LLM alone fails. To exploit this, we propose AlphaRouter, a routing framework that learns optimal collaborative inference paths through a search-and-iterative-update paradigm. Formulating routing as a Markov Decision Process, we introduce Collaborative Inference Tree Search (CITS) to explore the solution space. To address the severe credit-assignment challenge posed by sparse rewards, we propose Tree-Advantage Policy Optimization (TAPO) to optimize the routing policy. By leveraging counterfactual advantages within the tree structure, TAPO attributes the final reward to individual routing decisions, stabilizing training without dense supervision. Extensive experiments show that AlphaRouter advances the Pareto frontier of the accuracy-efficiency trade-off by discovering better inference trajectories in the collaborative space. Code is available at https://anonymous.4open.science/r/AlphaRouter.