Fast and Accurate Causal Parallel Decoding using Jacobi Forcing
Lanxiang Hu ⋅ Siqi Kou ⋅ Yichao Fu ⋅ Samyam Rajbhandari ⋅ Tajana Rosing ⋅ Yuxiong He ⋅ Zhijie Deng ⋅ Hao Zhang
Abstract
Multi-token generation has emerged as a promising paradigm for accelerating language model inference, with diffusion Large Language Models (dLLMs) as the most notable recent approach. Popular dLLMs like SDAR and Fast-dLLM v2 are post-trained from pre-trained AR models to minimize training cost while maintaining high performance. However, there is a fundamental pretrain-to-posttrain mismatch -- the masked data distribution and bidirectional attention used in post-training deviate significantly from the real data distribution and causal attention used in pretraining. As a result, post-trained dLLMs usually suffer from limited speedup or substantially degraded performance. To address this, we introduce Jacobi Forcing, which bypasses the dLLM formulation and directly post-trains a causal multi-token predictor from an AR LLM. In particular, we force the model to learn to leap along its own parallel token generation trajectories produced by Jacobi Decoding, through a carefully designed progressive distillation paradigm. The trained models achieve a $3.8\times$ wall-clock speedup on coding and math benchmarks with minimal loss in performance. Exploiting the trajectory characteristics of the model, we further introduce multi-block decoding with rejection recycling, which yields up to $4.6\times$ more accepted tokens per iteration and a $4.0\times$ wall-clock speedup, effectively trading additional compute for lower inference latency.
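For context, Jacobi Decoding (the parallel decoding scheme Jacobi Forcing builds on) treats a block of future tokens as unknowns and repeatedly applies the model's greedy prediction to the whole block in parallel until it reaches a fixed point, which coincides with the greedy autoregressive output. The following is a minimal illustrative sketch using Hugging Face Transformers; the model choice, block size, and function names are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch of greedy Jacobi decoding for one block of tokens.
# Assumptions (not from the paper): gpt2 as the model, block_size=16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def jacobi_decode(model, input_ids, block_size=16, max_iters=64):
    """Decode one block of `block_size` tokens via Jacobi iteration.

    Start from an arbitrary draft block, then repeatedly run one
    forward pass over prompt + draft and replace every draft token
    with the greedy prediction for its position. A fixed point of
    this update equals the greedy autoregressive output, and several
    tokens can become correct in a single iteration.
    """
    device = input_ids.device
    # Arbitrary initialization; any token ids converge within block_size steps.
    draft = torch.zeros((1, block_size), dtype=torch.long, device=device)
    for _ in range(max_iters):
        seq = torch.cat([input_ids, draft], dim=1)
        logits = model(seq).logits
        # The token at absolute position p is predicted by logits at p - 1,
        # so draft position i is read from logit index prompt_len + i - 1.
        start = input_ids.shape[1] - 1
        new_draft = logits[:, start:start + block_size].argmax(dim=-1)
        if torch.equal(new_draft, draft):  # fixed point reached
            break
        draft = new_draft
    return torch.cat([input_ids, draft], dim=1)

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative small model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
ids = tok("def fib(n):", return_tensors="pt").input_ids
print(tok.decode(jacobi_decode(model, ids)[0]))
```

In the worst case each iteration fixes only the leftmost unconverged token, matching autoregressive decoding; Jacobi Forcing trains the model so that many tokens converge per iteration, which is where the reported speedup comes from.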