THETA: Threshold-Based Exclusive Batching for Memory-Bandwidth-Constrained LLM Inference
Abstract
Chunked prefill has become the dominant scheduling strategy for large language model (LLM) inference, interleaving prefill and decode operations to improve GPU utilization. However, this approach does not universally outperform exclusive batching: on bandwidth-constrained GPUs, mixed batches can intensify memory-bandwidth contention and degrade decode throughput. Through controlled experiments on the H200 (4.8~TB/s) and the RTX PRO 6000 (1.8~TB/s), we find that mixed batching begins to underperform exclusive decoding at an 80\% decode ratio on the H200, but at merely 20\% on the bandwidth-constrained RTX PRO 6000. We present \sys, a scheduling framework for exclusive batching that derives closed-form, asymptotically optimal phase-switching thresholds under stochastic output-length models, together with memory-safe batch sizing. Experiments on real-world workloads demonstrate that \sys achieves up to 15.3\% higher throughput than chunked prefill on bandwidth-constrained hardware while maintaining competitive latency.