Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
Abstract
Large diffusion vision-language models (LDVLMs) have recently demonstrated competitive performance on multimodal tasks, emerging as a promising alternative to autoregressive models. They enable parallel decoding for efficient inference and leverage bidirectional attention to capture global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding. Through analysis, we identify two underlying causes of these failures. First, repetitive generation originates from a mask-token prior: because generation tokens are initialized as mask tokens, their hidden representations progressively drift toward a shared prior direction over generation steps. Second, a fundamental misalignment exists between the positional attention bias and the iterative unmasking process. This discrepancy suppresses the model's attention to informative visual tokens, degrading visual grounding. Based on these insights, we propose a training-free approach that mitigates both issues: Mask Prior Suppression and Monotonic RoPE Scaling, which counteract mask prior drift and positional attention collapse, respectively, during decoding. Experiments on general multimodal benchmarks and visual grounding tasks demonstrate improvements over baseline LDVLMs, with robust gains on long-form description benchmarks. Overall, our results show that these failures can be effectively addressed with a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.
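The abstract does not specify how Mask Prior Suppression or Monotonic RoPE Scaling are realized. As a rough, non-authoritative illustration of the kind of decoding-time intervention described, the sketch below assumes the mask prior is estimated as the mean hidden state of still-masked tokens and projected out at each step, and that the RoPE base is rescaled monotonically with decoding progress; all function names, shapes, and the `strength`/`max_scale` parameters are hypothetical.

```python
import torch

def suppress_mask_prior(hidden, mask_positions, strength=1.0):
    """Illustrative sketch: remove a shared 'mask prior' component from
    masked-token hidden states at one decoding step.

    hidden:         (batch, seq_len, dim) hidden states.
    mask_positions: (batch, seq_len) bool, True where the token is still masked.
    strength:       hypothetical scalar controlling how much of the prior is removed.
    """
    # Estimate the prior direction as the mean hidden state over masked positions.
    masked = hidden * mask_positions.unsqueeze(-1)
    count = mask_positions.sum(dim=1, keepdim=True).clamp(min=1).unsqueeze(-1)
    prior = masked.sum(dim=1, keepdim=True) / count                  # (batch, 1, dim)
    prior = prior / prior.norm(dim=-1, keepdim=True).clamp(min=1e-6)

    # Project each hidden state onto the prior direction and subtract that component,
    # but only at positions that are still masked.
    coeff = (hidden * prior).sum(dim=-1, keepdim=True)               # (batch, seq_len, 1)
    corrected = hidden - strength * coeff * prior
    return torch.where(mask_positions.unsqueeze(-1), corrected, hidden)

def monotonic_rope_scale(step, num_steps, base_theta=10000.0, max_scale=2.0):
    """Illustrative sketch: grow the RoPE base monotonically with decoding
    progress, flattening the positional bias in later unmasking steps."""
    progress = step / max(num_steps - 1, 1)
    return base_theta * (1.0 + (max_scale - 1.0) * progress)
```

In this sketch, both operations act only on the current decoding step and require no gradient updates, which is consistent with the training-free, plug-and-play framing above; the actual mechanisms used in the paper may differ.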