MAPS: A Memory-Aware Predictive Scheduling Framework for Large Language Model Serving
Abstract
The surge of large language model (LLM) applications on personal devices imposes massive, bursty workloads on cloud serving infrastructure. While prefill-decode disaggregation improves throughput and scalability, memory-bound decode instances often suffer from persistent load imbalance, as output lengths are unknown when requests arrive at the cloud. To address this, we propose MAPS, a memory-aware predictive scheduling framework tailored for disaggregated LLM serving. MAPS performs device-assisted speculative output-length prediction overlapped with cloud-side prefilling, incurring negligible latency overhead. To handle generation uncertainty, MAPS applies uncertainty-aware calibration to derive output length upper bounds with target coverage, enabling safe scheduling decisions. Building on these bounds, MAPS employs a hierarchical global-local scheduling strategy to mitigate inter-decoder queue buildup and intra-decoder head-of-line blocking. Extensive experiments on two real-world workloads and two LLMs show that MAPS significantly outperforms three state-of-the-art systems, reducing average end-to-end latency by 42.6\% and tail latency by up to 84.8\%.
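For concreteness, the sketch below illustrates one way the uncertainty-aware calibration step could be realized: a split-conformal-style procedure that inflates predicted output lengths by a slack factor chosen on held-out requests, so that the resulting upper bounds cover a target fraction of true lengths. This is a minimal sketch under stated assumptions, not the paper's actual method; the function names, the ratio-based nonconformity score, and the synthetic data are all illustrative.

\begin{verbatim}
import numpy as np

def calibrate_length_slack(predicted, actual, coverage=0.9):
    """Split-conformal-style calibration (illustrative, not the
    paper's procedure): choose a multiplicative slack on predicted
    output lengths so the resulting upper bounds cover at least
    `coverage` of a held-out calibration set."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Nonconformity score: how far the true length exceeds the
    # prediction, as a ratio so the slack scales with the prediction.
    scores = actual / np.maximum(predicted, 1.0)
    n = len(scores)
    # Finite-sample-adjusted quantile level, as in split conformal
    # prediction.
    q_level = min(1.0, np.ceil((n + 1) * coverage) / n)
    return np.quantile(scores, q_level)

def length_upper_bound(pred_len, slack):
    """Conservative output-length bound a scheduler could use as a
    per-request memory reservation on a decode instance."""
    return int(np.ceil(pred_len * slack))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pred = rng.integers(50, 500, size=1000)          # predicted lengths
    true = pred * rng.uniform(0.6, 1.4, size=1000)   # synthetic errors
    slack = calibrate_length_slack(pred, true, coverage=0.9)
    print(f"slack={slack:.2f}, "
          f"bound for pred=200: {length_upper_bound(200, slack)} tokens")
\end{verbatim}

Under this kind of calibration, a scheduler can treat each request's calibrated bound as a conservative KV-cache reservation, which is one way to make load-balancing decisions "safe" in the sense the abstract describes.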