AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving
Ying Wang ⋅ Zhen Jin ⋅ Zhenqian Chen ⋅ Jiexiong Xu ⋅ Wenhai Lin ⋅ Yiquan Chen ⋅ Wenzhi Chen
Abstract
Augmented large language models (LLMs) that invoke external calls are increasingly prevalent in inference serving. However, such augmentations pose significant challenges to inference efficiency under strict Service-Level Objectives (SLOs). Existing inference systems are agnostic to the dynamic execution behaviors induced by external calls and rely on a fixed batch-level token budget, which leads to severe Head-of-Line (HoL) blocking and substantially reduced effective throughput. We present AugServe, an efficient augmented LLM inference serving framework that mitigates request queuing latency and improves effective throughput under external-call-augmented workloads. AugServe integrates state-aware request scheduling with dynamic batch-level token budgets to adapt to heterogeneous requests and their dynamically changing execution states. Experimental results show that AugServe achieves 6.5$\times$ and 4.7$\times$ higher effective throughput than vLLM and INFERCEPT, respectively.