When RL Meets Adaptive Speculative Training: A Unified Training-Serving System
Abstract
Speculative decoding can significantly accelerate LLM serving, but its real-world benefits often erode due to training–serving mismatch and non-stationary traffic. Unlike previous systems that decouple speculator training from inference, we present a unified training–serving system, Aurora, that closes this loop by continuously learning a speculator model directly from live inference traces. Our design integrates an SGLang-based inference server with an asynchronous training server connected via efficient GPU-to-GPU RPC, enabling hot-swapped speculator updates without service interruption. Crucially, our system supports day-0 deployment: a speculator can be served immediately and quickly adapted on live traffic, improving overall system throughput. This paradigm shift enables us to frame the training–serving loop as an asynchronous reinforcement learning process and to leverage tokens rejected by the target model to improve sampling efficiency. Our experiments show that this unified system achieves a 1.33× speedup in the mixed-data scenario when training a speculator from scratch, and a 1.48× speedup over a static speculator. We also find that the system adapts more effectively to distribution shifts in user traffic, delivering a 1.25× speedup over a well-trained but static speculator.