InfoDLM: An Information-Adaptive Framework for Discrete Diffusion Language Model Pretraining
Abstract
Diffusion language models (DLMs) can match or surpass similarly sized autoregressive language models on language understanding and reasoning. However, their mask-and-denoise pretraining relies on heuristic random masking, which fails to target the most informative tokens; as a result, a significant share of the pretraining compute is spent on redundant or trivial tokens. To address this, we propose InfoDLM, an adaptive DLM pretraining framework that reformulates mask selection as an active, feedback-driven process targeting the tokens with the highest measurable information gain. Specifically, we: (1) introduce a Trainable Information-Gain (TIG) signal that quantifies the information gain of each masking configuration; (2) develop a feedback mechanism that, via a maturity indicator, adapts the masking policy to the model’s evolving state; and (3) jointly optimize the DLM and the masking policy through an interleaved training flow with minimal computational overhead. Across reasoning-oriented benchmarks, InfoDLM achieves up to a 13\% improvement in reasoning accuracy over a small variant of LLaDA under comparable pretraining budgets.
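To make the interleaved training flow concrete, the following is a minimal PyTorch sketch under stated assumptions: the abstract does not specify the TIG architecture, the maturity schedule, or the TIG training objective, so the names and choices below (TIGScorer as a per-token MLP head, a linear warmup maturity indicator, and regressing TIG scores onto observed per-token denoising loss) are illustrative assumptions, not the paper's actual method.

```python
# Hypothetical sketch of InfoDLM-style interleaved training.
# All identifiers (TIGScorer, ToyDLM, maturity, ...) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID, DIM = 1000, 0, 64

class TIGScorer(nn.Module):
    """Trainable Information-Gain signal: scores each position's expected
    information gain if masked (assumed here to be a small per-token MLP)."""
    def __init__(self, dim=DIM):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, hidden):                  # hidden: (B, T, DIM)
        return self.head(hidden).squeeze(-1)    # (B, T) information-gain scores

class ToyDLM(nn.Module):
    """Stand-in denoiser: embeds tokens and predicts originals at masked slots."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, DIM)
        self.out = nn.Linear(DIM, VOCAB)

    def forward(self, ids):
        h = self.emb(ids)
        return self.out(h), h

dlm, tig = ToyDLM(), TIGScorer()
opt_dlm = torch.optim.AdamW(dlm.parameters(), lr=3e-4)
opt_tig = torch.optim.AdamW(tig.parameters(), lr=3e-4)

def maturity(step, warmup=1000):
    # Assumed maturity indicator: anneals from uniform-random masking
    # toward fully TIG-driven masking as the model matures.
    return min(1.0, step / warmup)

for step in range(5):                                    # toy training loop
    ids = torch.randint(1, VOCAB, (8, 32))               # fake batch (B=8, T=32)

    # Select the masking configuration with the TIG signal (no gradients needed).
    with torch.no_grad():
        _, h = dlm(ids)
        gain = tig(h)                                    # (B, T)
    m = maturity(step)
    scores = m * gain + (1 - m) * torch.rand_like(gain)  # blend with random policy
    k = ids.size(1) // 4                                 # mask 25% of positions
    idx = scores.topk(k, dim=1).indices
    mask = torch.zeros_like(ids, dtype=torch.bool).scatter_(1, idx, True)

    # (a) DLM step: denoise the TIG-selected masks.
    corrupted = ids.masked_fill(mask, MASK_ID)
    logits, _ = dlm(corrupted)
    loss_dlm = F.cross_entropy(logits[mask], ids[mask])
    opt_dlm.zero_grad(); loss_dlm.backward(); opt_dlm.step()

    # (b) TIG step: regress scores onto the observed per-token loss, so high
    # predicted gain tracks tokens the model still finds hard (an assumption).
    with torch.no_grad():
        logits, h = dlm(corrupted)
        per_tok = F.cross_entropy(logits[mask], ids[mask], reduction="none")
    pred = tig(h)[mask]
    loss_tig = F.mse_loss(pred, per_tok)
    opt_tig.zero_grad(); loss_tig.backward(); opt_tig.step()
```

In this sketch the two optimizers alternate within one pass over each batch, which keeps the masking-policy update on the DLM's just-computed losses at negligible extra cost; whether the paper interleaves at batch or epoch granularity is not specified in the abstract.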