ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving
Yifan Qiao ⋅ Shan Yu ⋅ Shu Anzai ⋅ Haoran Ma ⋅ Shuo Yang ⋅ Yang Wang ⋅ Miryung Kim ⋅ Yongji Wu ⋅ Yang Zhou ⋅ Jiarong Xing ⋅ Joseph E. Gonzalez ⋅ Ion Stoica ⋅ Harry Xu
Abstract
Large language model (LLM) serving demands low latency and high throughput, but high load variability leads to significant GPU underutilization. In this paper, we identify a synergistic but overlooked opportunity to co-serve latency-critical online requests alongside *latency-tolerant offline* tasks, which existing systems fail to exploit because their coarse-grained resource management cannot prevent interference between the two workloads. We present ConServe, a co-serving system that enables fine-grained resource sharing through latency-aware token-level scheduling, sub-iteration layer-wise preemption, and incremental KV-cache management. These mechanisms allow offline execution to fill *millisecond-scale* GPU idle time while preserving strict online latency guarantees. Across real-world workloads with Llama-3.1 and Qwen-2.5 models, ConServe improves throughput by 2.2$\times$ on average and reduces online tail latency by 2.9$\times$ over state-of-the-art systems.
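To make the co-serving idea concrete, here is a minimal Python sketch, not the authors' implementation, of a scheduler that always prefers online requests and preempts an offline batch at layer boundaries. `NUM_LAYERS`, `run_layer`, `serve_step`, and the queue layout are all illustrative assumptions, not ConServe's API.

```python
import time
from collections import deque

# All names below are illustrative assumptions, not ConServe's API.
NUM_LAYERS = 32                 # assumed transformer depth
online_queue: deque = deque()   # latency-critical requests
offline_queue: deque = deque()  # (batch, resume_layer) pairs

def run_layer(batch, layer_idx):
    """Stand-in for one transformer layer's forward pass (~1 ms)."""
    time.sleep(0.001)

def serve_step():
    """One scheduling step: always prefer online work; otherwise run
    offline work with a preemption check at every layer boundary."""
    if online_queue:
        batch = online_queue.popleft()
        for layer in range(NUM_LAYERS):
            run_layer(batch, layer)
        return "online"
    if offline_queue:
        batch, start = offline_queue.popleft()
        for layer in range(start, NUM_LAYERS):
            run_layer(batch, layer)
            # Sub-iteration preemption point: if an online request
            # arrived mid-iteration, park the offline batch at the
            # next layer boundary; the KV cache built so far is kept
            # (the incremental KV-cache management the paper names).
            if online_queue and layer + 1 < NUM_LAYERS:
                offline_queue.appendleft((batch, layer + 1))
                return "preempted"
        return "offline"
    return "idle"

if __name__ == "__main__":
    offline_queue.append(("offline-batch-0", 0))
    print(serve_step())  # offline work runs to completion
    online_queue.append("req-0")
    offline_queue.append(("offline-batch-1", 0))
    print(serve_step())  # online request is served first
    print(serve_step())  # parked offline batch resumes
```

The key design point this sketch illustrates is that preemption happens *inside* an iteration, at layer granularity, so an arriving online request waits at most one layer's compute time rather than a full forward pass.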