Learnability-Informed Fine-Tuning of Diffusion Language Models
Abstract
We aim to improve the reasoning capabilities of diffusion language models (DLMs). While supervised fine-tuning (SFT) performs well for autoregressive models, its application to DLMs faces challenges. Our observations and analysis reveal that vanilla SFT ignores learnability, i.e., which tokens can be learned and when. Specifically, we observe that rare tokens are difficult to learn when most of the input is masked; conversely, common tokens are trivial to learn, and thus of little training value, when most of the input is unmasked. To account for learnability, we propose LIFT, a learnability-informed fine-tuning strategy for DLMs. LIFT learns easy tokens when most of the input is noisy and hard tokens when more context is available, thereby aligning training with the information available at each diffusion time step. Our results show that LIFT outperforms existing SFT baselines across six reasoning benchmarks, achieving up to a 3x relative gain on AIME'24 and AIME'25.
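As a rough illustration of the scheduling idea described above, the following is a minimal sketch, not the paper's actual method: it assumes a per-token difficulty score in [0, 1] (e.g., derived from token rarity) and a diffusion time t in [0, 1] with t = 1 meaning fully masked input, and interpolates per-token loss weights so that easy tokens dominate at high noise and hard tokens dominate once more context is visible. The function name `lift_weights` and the linear interpolation are hypothetical.

```python
import numpy as np

def lift_weights(difficulty, t):
    """Hypothetical learnability-informed per-token loss weights.

    difficulty: array-like in [0, 1]; 1 = hard (e.g., rare) token.
    t: diffusion time in [0, 1]; 1 = fully masked input.

    At high noise (t near 1), weight shifts to easy tokens; as more
    context becomes available (t near 0), weight shifts to hard tokens.
    """
    difficulty = np.asarray(difficulty, dtype=float)
    # Linear interpolation between "favor easy" and "favor hard".
    w = t * (1.0 - difficulty) + (1.0 - t) * difficulty
    return w / w.sum()  # normalize so weights sum to 1

# Example: one common token (difficulty 0.1) and one rare token (0.9).
print(lift_weights([0.1, 0.9], t=0.9))  # high noise: common token dominates
print(lift_weights([0.1, 0.9], t=0.1))  # low noise: rare token dominates
```

These weights could then scale the per-token cross-entropy terms in the masked-prediction loss at each sampled time step.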