Edit-Based Refinement for Parallel Masked Diffusion Language Models
Abstract
Masked diffusion language models enable parallel token generation and offer improved decoding efficiency over autoregressive models. However, their performance degrades significantly when multiple tokens are generated simultaneously, owing to a mismatch between the token-level training objective and the need for joint sequence consistency. In this paper, we propose ME-DLM, an edit-based refinement framework that augments diffusion generation with a lightweight post-generation editing step. After producing an initial complete response, the model refines it through minimal edit operations (replacement, deletion, and insertion) conditioned on the full sequence. Training supervision is derived from edit distance under a fixed canonicalization scheme, yielding a deterministic signal for learning minimal corrections. This design encourages sequence-level consistency through minimal, globally conditioned edits while preserving the efficiency benefits of parallel diffusion decoding. Extensive experiments demonstrate that the proposed approach substantially improves the quality and robustness of multi-token parallel generation: built upon LLaDA, our method achieves consistent gains of 11.6% on HumanEval and 33.6% on GSM8K while using one-eighth of the total diffusion steps.
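The edit-distance supervision described above can be illustrated with a minimal sketch. The function below (a standard Levenshtein dynamic program with backtracing, not the paper's actual implementation) derives a minimal edit script of replace/delete/insert operations between a draft sequence and a reference; the fixed tie-breaking order in the backtrace (keep, then replace, then delete, then insert) stands in for a canonicalization scheme that makes the supervision deterministic. The function name and operation encoding are illustrative assumptions.

```python
# Illustrative sketch: derive a minimal, deterministic edit script
# (replace / delete / insert) between a draft and a reference sequence.
# This is a generic Levenshtein backtrace, not ME-DLM's actual code.

def edit_script(src, tgt):
    m, n = len(src), len(tgt)
    # dp[i][j] = minimal edit distance between src[:i] and tgt[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of src[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of tgt[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if src[i - 1] == tgt[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # keep or replace
                           dp[i - 1][j] + 1,         # delete
                           dp[i][j - 1] + 1)         # insert
    # Backtrace with a fixed preference order (keep > replace > delete >
    # insert); ties are always broken the same way, so the resulting
    # edit script is a deterministic supervision target.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and src[i - 1] == tgt[j - 1]
                and dp[i][j] == dp[i - 1][j - 1]):
            i, j = i - 1, j - 1                      # tokens match: keep
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            ops.append(("replace", i - 1, tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("delete", i - 1, None))
            i -= 1
        else:
            ops.append(("insert", i, tgt[j - 1]))
            j -= 1
    return list(reversed(ops))
```

For example, comparing the draft `["a", "cat", "sat"]` against the reference `["a", "dog", "sat"]` yields the single operation `("replace", 1, "dog")`, i.e. the minimal correction the editor is trained to produce.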