Single-Step Initialization for Exploratory Parallel Rollouts in Diffusion LLMs
Abstract
We propose training-free parallel decoding for rollout generation in diffusion large language model (dLLM) policy optimization, reducing rollout cost without auxiliary models or policy modification. We find, however, that confidence-based decoding suffers from delayed branching, and that parallel decoding largely inherits this behavior: rollouts agree on both the unmasked tokens and the positions at which they are unmasked for much of generation, and the resulting lack of exploration weakens the group-relative learning signal. We address this with a minimal initialization step in which each rollout independently unmasks one position chosen uniformly at random, after which the original sampler resumes unchanged. The intervention is drop-in compatible with any sampling strategy. Combined with Fast-dLLM on LLaDA-8B-Instruct, it improves rollout diversity and yields stronger downstream RL performance on GSM8K and MATH-500.
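The initialization step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The names `single_step_init` and `MASK_ID`, the array shapes, and the use of a precomputed `logits` array in place of a model forward pass are all assumptions made for illustration. The abstract does not say how the token at the chosen position is selected, so sampling it from the softmax of that position's logits is likewise an assumption.

```python
import numpy as np

MASK_ID = -1  # hypothetical mask token id, for illustration only


def single_step_init(tokens, logits, rng, mask_id=MASK_ID):
    """Unmask one uniformly random masked position in each rollout.

    tokens: (G, L) int array; masked positions hold mask_id.
    logits: (G, L, V) float array of per-position token scores, standing in
            for one forward pass of the model (an assumption of this sketch).
    Returns a new (G, L) array with one more unmasked position per row.
    """
    out = tokens.copy()
    for g in range(tokens.shape[0]):
        masked = np.flatnonzero(tokens[g] == mask_id)
        if masked.size == 0:
            continue  # nothing left to unmask in this rollout
        pos = rng.choice(masked)  # uniform over the masked positions
        # Sample the token from the softmax of the logits at that position
        # (assumed; the source does not state how the token is chosen).
        z = logits[g, pos] - logits[g, pos].max()
        p = np.exp(z) / np.exp(z).sum()
        out[g, pos] = rng.choice(p.size, p=p)
    return out


# Toy usage: G rollouts share a prompt; the response is fully masked.
rng = np.random.default_rng(0)
G, L, V = 4, 8, 16
prompt_len = 3
tokens = np.full((G, L), MASK_ID)
tokens[:, :prompt_len] = np.arange(prompt_len)
logits = rng.normal(size=(G, L, V))  # placeholder for model output
init = single_step_init(tokens, logits, rng)
```

After this step, each row of `init` would be handed to the unchanged sampler (for example a confidence-based or parallel decoder), which continues from a state that already differs across rollouts.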