Selective Deferred Routing: Enabling Cost-Efficient Collaboration between Local SLMs and Remote LLMs
Abstract
The rapid advancement of large language models (LLMs) has delivered remarkable performance across diverse domains, making them indispensable assistants in daily life and work. LLM services are currently accessed in two main ways: (i) paid access to cloud-hosted LLMs, which are powerful but incur nontrivial monetary cost; and (ii) deployment of small language models (SLMs) on personal devices or small clusters, which are less powerful but sufficient for relatively simple tasks. To strike a balanced trade-off between monetary cost and task performance, we propose Selective Deferred Routing, a paradigm that enables cost-efficient collaboration between local SLMs and remote LLMs. In this framework, a user request is first processed by the local SLM, which generates both a preliminary response and rich semantic representations of the request. A lightweight decision module then uses this information to either accept the preliminary response or route the request to the most suitable remote LLM for a higher-quality answer. Extensive experiments across diverse model architectures and families, covering both SLMs and LLMs, and across datasets spanning multiple task scenarios, show that our approach consistently outperforms existing multi-LLM collaboration methods over a wide range of cost–performance trade-offs.
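The control flow described in the abstract can be sketched minimally as follows. This is an illustrative assumption, not the paper's actual implementation: `local_slm`, `remote_llm`, and the linear scorer inside `DeferralRouter` are hypothetical stand-ins (the real decision module is trained on the SLM's semantic representations, and the real system selects among multiple remote LLMs rather than a single one).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the local SLM and a remote LLM.
def local_slm(request: str):
    """Return a draft answer plus a semantic embedding of the request
    (a stand-in for the SLM's hidden-state features)."""
    embedding = rng.normal(size=16)
    return f"[SLM draft for: {request}]", embedding

def remote_llm(request: str) -> str:
    """Stand-in for a paid, cloud-hosted LLM call."""
    return f"[LLM answer for: {request}]"

class DeferralRouter:
    """Lightweight decision module: scores the SLM's semantic
    representation and defers to the remote LLM when confidence
    falls below a cost-aware threshold."""

    def __init__(self, dim: int, threshold: float = 0.5):
        self.w = rng.normal(size=dim)  # would be trained in practice
        self.b = 0.0
        self.threshold = threshold     # higher => route more to the LLM

    def confidence(self, embedding) -> float:
        # Sigmoid of a linear score over the request embedding.
        return 1.0 / (1.0 + np.exp(-(self.w @ embedding + self.b)))

    def answer(self, request: str):
        draft, emb = local_slm(request)
        if self.confidence(emb) >= self.threshold:
            return draft, "local"       # accept the SLM's draft
        return remote_llm(request), "remote"  # defer to the LLM

router = DeferralRouter(dim=16, threshold=0.5)
response, source = router.answer("What is 2 + 2?")
```

The threshold is the knob that traces out the cost-performance trade-off: lowering it keeps more traffic on the free local SLM, while raising it spends more on remote LLM calls for harder requests.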