Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed
Yonggan Fu ⋅ Lexington Whalen ⋅ Zhifan Ye ⋅ Xin Dong ⋅ Shizhe Diao ⋅ Jingyu Liu ⋅ Chengyue Wu ⋅ Hao Zhang ⋅ Enze Xie ⋅ Song Han ⋅ Maksim Khadkevich ⋅ Jan Kautz ⋅ Yingyan (Celine) Lin ⋅ Pavlo Molchanov
Abstract
Diffusion language models (dLMs) have emerged as a promising paradigm enabling parallel generation, but their learning efficiency lags behind that of autoregressive (AR) language models when trained from scratch. To this end, we study AR-to-dLM conversion, which transforms pretrained AR models into efficient dLMs that excel in speed while preserving AR models’ task accuracy. We achieve this by identifying limitations in the attention patterns and objectives of existing AR-to-dLM methods and then proposing methodologies and actionable insights for scalable AR-to-dLM conversion. Specifically, we first systematically compare different attention patterns and find that maintaining pretrained AR weight distributions is key to effective AR-to-dLM conversion. Accordingly, we introduce a continuous pretraining scheme with a block-wise attention pattern. We find that, in addition to block-wise attention’s known benefit of enabling KV caching, its block-wise causality better preserves pretrained AR models’ weight distributions, leading to a win–win in accuracy and efficiency. Second, to mitigate the training–test gap in mask token distributions (uniform vs. highly left-to-right), we propose a position-dependent token masking strategy that assigns higher masking probabilities to later tokens during training to better mimic test-time behavior. These studies lead to the Efficient-DLM model family, which outperforms state-of-the-art AR models and dLMs in accuracy–throughput trade-offs; for example, our Efficient-DLM 8B achieves +5.4%/+2.7% higher accuracy with 4.5×/2.7× higher throughput compared to Dream 7B and Qwen3 4B, respectively.
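To make the two ingredients above concrete, the sketch below shows one plausible way to (a) build a block-wise causal attention mask, where tokens attend bidirectionally within their block and causally to earlier blocks, and (b) assign position-dependent masking probabilities that grow toward later positions. This is an illustrative interpretation of the abstract only; the helper names, the `base`/`slope` schedule, and all default values are assumptions, not the paper's actual implementation.

```python
import torch


def blockwise_attention_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Block-wise causal attention mask (True = may attend).

    Position i attends to position j iff j's block index is <= i's block index,
    i.e., full bidirectional attention inside a block and causal attention
    across blocks, which keeps block-level KV caching possible.
    """
    block_ids = torch.arange(seq_len) // block_size
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)


def position_dependent_mask_probs(seq_len: int, base: float = 0.15, slope: float = 0.7) -> torch.Tensor:
    """Per-position masking probabilities that increase left to right.

    Later tokens get higher masking probability, mimicking test-time decoding
    where later tokens are more likely to still be masked. `base` and `slope`
    are illustrative knobs, not values reported in the paper.
    """
    positions = torch.arange(seq_len, dtype=torch.float32) / max(seq_len - 1, 1)
    return (base + slope * positions).clamp(max=1.0)


def sample_masked_input(token_ids: torch.Tensor, mask_token_id: int, probs: torch.Tensor) -> torch.Tensor:
    """Replace tokens with the mask token according to per-position probabilities."""
    mask = torch.bernoulli(probs.expand_as(token_ids.float())).bool()
    return torch.where(mask, torch.full_like(token_ids, mask_token_id), token_ids)
```

As a usage sketch, `blockwise_attention_mask(seq_len, block_size)` would be passed as the attention mask during continuous pretraining, while `sample_masked_input(batch, mask_token_id, position_dependent_mask_probs(seq_len))` would produce the corrupted inputs for the masked-denoising objective.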