ARC-Decode: Accelerated Decoding with Risk-Bounded Acceptance
Ying Li ⋅ Zhaode Wang ⋅ Zhiwen Chen ⋅ Chengfei Lv ⋅ Huan Wang
Abstract
As larger language models deliver stronger capabilities, their autoregressive inference becomes increasingly expensive. *Speculative decoding* accelerates generation by letting a fast draft model propose tokens that the target model verifies in parallel. Yet under sampling ($T>0$), observed speedups consistently lag behind those under greedy decoding: the classical lossless verification rule tends to over-reject low-risk drafts, lowering acceptance rates and limiting acceleration. To close this gap, we propose **ARC-Decode** (**A**cceptance with **R**isk **C**ontrol), a training-free method that augments speculative decoding without extra forward passes. ARC-Decode enables **relaxed** acceptance by identifying drafts whose acceptance preserves the output distribution of the target model, under a risk-controlled criterion based on Jensen–Shannon divergence. It combines confidence-based pre-verification filtering with a risk-bounded acceptance criterion derived from an analytic upper bound on the potential distributional deviation. Integrated into the state-of-the-art EAGLE-3 pipeline, ARC-Decode increases acceptance length per cycle and reduces verification compute, achieving up to **1.6**$\times$ end-to-end speedup over EAGLE-3 under sampling with negligible quality change across benchmarks.
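To make the acceptance criterion concrete, here is a minimal sketch of a Jensen–Shannon-divergence-based risk check of the kind the abstract describes. All names (`js_divergence`, `risk_bounded_accept`, the `epsilon` budget) are illustrative assumptions, not ARC-Decode's actual API or its analytic bound; the sketch only shows the shape of accepting a draft distribution when its JSD from the target stays under a risk budget.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions
    (lists of probabilities over the same vocabulary)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        # KL(a || b), skipping zero-probability terms in a
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def risk_bounded_accept(draft_probs, target_probs, epsilon=0.05):
    """Hypothetical relaxed-acceptance rule: accept the draft token's
    distribution if its JSD from the target is within the risk budget."""
    return js_divergence(draft_probs, target_probs) <= epsilon

# Identical distributions have zero divergence and are always accepted.
print(risk_bounded_accept([0.5, 0.5], [0.5, 0.5]))  # True
# Strongly disagreeing distributions are rejected under a tight budget.
print(risk_bounded_accept([0.9, 0.1], [0.1, 0.9]))  # False
```

In the actual method the threshold is derived from an analytic upper bound on distributional deviation rather than a fixed constant, and the check is combined with confidence-based filtering before verification.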