Improving Sampling for Masked Diffusion Models via Information Gain
Abstract
Masked Diffusion Models (MDMs) offer greater flexibility in decoding order than autoregressive models but require careful path planning to achieve high-quality generation. Existing samplers typically greedily select the positions with the lowest uncertainty; through failure-case analysis, we identify their fundamental limitations, showing that they overlook the impact of current actions on subsequent steps and fail to optimize cumulative uncertainty. To bridge this gap, we propose the Info-Gain Sampler, a principled decoding framework that balances immediate costs against information gain. Our method features a simple but effective objective and an efficient implementation that keeps practical overhead minimal. Extensive evaluations across diverse architectures and tasks (reasoning, coding, creative writing, and image generation) demonstrate that the Info-Gain Sampler consistently outperforms existing samplers, significantly raising the performance ceiling of MDMs. For instance, it achieves a 5.5% improvement in average accuracy on reasoning tasks and a 63.1% win rate in creative writing; notably, on reasoning tasks it reduces cumulative uncertainty from 78.4 to 48.6, outperforming the best baseline by a large margin.
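The contrast between a greedy minimum-uncertainty sampler and one that also accounts for information gain can be sketched as below. This is an illustrative sketch only, not the paper's implementation: `gain_estimate` (a hypothetical per-position estimate of how much decoding that position would reduce uncertainty at the remaining masked positions) and the trade-off weight `lam` are assumptions introduced here for illustration.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a categorical distribution."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def greedy_pick(probs, masked):
    """Baseline sampler: unmask the masked position whose predictive
    distribution has the lowest entropy (lowest immediate uncertainty)."""
    ents = [entropy(probs[i]) if m else np.inf for i, m in enumerate(masked)]
    return int(np.argmin(ents))

def info_gain_pick(probs, masked, gain_estimate, lam=1.0):
    """Hypothetical info-gain rule: score each masked position by its
    estimated information gain (uncertainty reduction elsewhere) minus
    its immediate entropy cost, then pick the highest-scoring one."""
    scores = []
    for i, m in enumerate(masked):
        if not m:
            scores.append(-np.inf)  # already decoded, not a candidate
        else:
            scores.append(lam * gain_estimate[i] - entropy(probs[i]))
    return int(np.argmax(scores))
```

With three positions where position 0 is the most confident but position 1 carries a large estimated gain, `greedy_pick` chooses position 0 while `info_gain_pick` chooses position 1, illustrating how the two objectives can diverge on the same model predictions.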