AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding
Shuqing Luo ⋅ Yilin Guan ⋅ Pingzhi Li ⋅ Hanrui Wang ⋅ Tianlong Chen
Abstract
Test-time scaling (TTS) can boost LLM reasoning through long chain-of-thought (CoT), but the linear KV-cache growth amplifies the memory-bound bottleneck of LLM decoding. Query-aware sparse decoding methods can achieve state-of-the-art performance under a constrained FLOP budget, but they remain limited by sequentially dependent page filtering and coarse-grained token selection, which hamper serving efficiency and model performance on TTS tasks under high concurrency and long CoT, where token selection can take even longer than the forward pipeline itself. In this paper, we first find that the query state of the current decoding token can be approximated in a unified manner from a short sliding window of recent queries, enabling training-free query-aware sparsity without sequential dependency in the decoding loop. Based on this finding, we propose $\texttt{\textbf{AsyncSpade}}$, an asynchronous framework for efficient TTS, built on two core components: $\textbf{(1) a novel light-weight temporal-regressive module}$ that predicts the next-token query state, and $\textbf{(2) an asynchronous disaggregated framework}$ that decouples KV-cache selection from the auto-regressive decoding loop, overlapping token-level KV selection with the forward inference computation, thereby eliminating the sequential dependency without sacrificing model performance. We validate the effectiveness of $\texttt{AsyncSpade}$ on common LLM serving setups with an A100 node, where $\texttt{AsyncSpade}$ fully overlaps KV-cache operations with the inference pipeline within a certain workload range, $\textbf{achieving the theoretical optimal time-per-output-token~(TPOT)}$.
Specifically, $\texttt{AsyncSpade}$ delivers a more than 20% reduction in TPOT compared to the SoTA baseline ($\textit{i.e.}$, Quest) and at least a 50% TPOT reduction compared to full attention on Qwen3-8B and Qwen3-32B models, while matching or surpassing their accuracy on various TTS benchmarks (AIME-24/25, GPQA-Diamond, MATH-500). Our code is available at https://anonymous.4open.science/r/AsyncSpade-063C.
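To make the core idea concrete, the sketch below illustrates the two components described above under simplifying assumptions: the "temporal-regressive module" is stood in for by a per-dimension linear extrapolation over the time index of a sliding window of recent query states (the paper's actual module is learned and light-weight; this is only an illustration), and KV selection is a simple top-k over dot-product scores with the predicted query. All function names, shapes, and the regression form are hypothetical, not the paper's implementation.

```python
import numpy as np

def predict_next_query(recent_queries: np.ndarray) -> np.ndarray:
    """Extrapolate the next-token query state from a sliding window of
    recent query states via per-dimension least-squares over the time
    index -- an illustrative stand-in for a temporal-regressive module."""
    w, d = recent_queries.shape
    t = np.arange(w, dtype=np.float64)
    # Fit q[t] ~ a * t + b independently for each hidden dimension.
    A = np.stack([t, np.ones(w)], axis=1)                      # (w, 2)
    coef, *_ = np.linalg.lstsq(A, recent_queries, rcond=None)  # (2, d)
    return coef[0] * w + coef[1]                               # predict at t = w

def select_topk_tokens(pred_query: np.ndarray, keys: np.ndarray, k: int):
    """Score cached keys with the predicted query and keep the top-k
    token indices. Because pred_query does not depend on the current
    forward pass, this selection can run asynchronously, off the
    critical decoding path."""
    scores = keys @ pred_query
    return np.argsort(scores)[-k:][::-1]

rng = np.random.default_rng(0)
window = rng.normal(size=(8, 16))   # 8 recent query states, head dim 16
keys = rng.normal(size=(1024, 16))  # cached key states for 1024 tokens
q_hat = predict_next_query(window)
topk = select_topk_tokens(q_hat, keys, k=32)
```

In a real decoding loop, `topk` would be computed for step $t{+}1$ while the forward pass for step $t$ is still running, which is what removes the sequential dependency between selection and inference.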