HexGen-3: A Fully Disaggregated LLM Serving Framework with Fine-Grained Heterogeneous Resource Autoscaling
Abstract
The operational cost of serving large language models remains prohibitively high, largely due to extreme workload heterogeneity in production traffic. We observe that combining disaggregated inference with resource autoscaling enables fine-grained resource adjustment, allowing inference phases and operations to scale independently based on their specific bottlenecks. Building on this insight, we propose HexGen-3, a cost-effective LLM serving framework that leverages a fully disaggregated inference architecture and heterogeneous resource autoscaling. HexGen-3 introduces two key components: (i) a hierarchical scheduling framework that jointly optimizes resource allocation and parallelism configuration for any given resource provisioning, and (ii) an autoscaling framework that dynamically adjusts resources and triggers deployment rescheduling in response to workload fluctuations. Experiments comparing HexGen-3 against state-of-the-art LLM serving systems demonstrate up to 60% (46.5% on average) improvement in per-cost throughput under static resource provisioning, and up to 78.3% (55.1% on average) improvement with autoscaling enabled under dynamic workloads.