Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
Abstract
Vision–Language–Action (VLA) models adapt large vision–language backbones to map images and instructions to robot actions. However, prevailing VLAs either generate actions autoregressively in a fixed left-to-right order or attach separate diffusion heads outside the backbone, fragmenting information pathways and hindering unified, scalable architectures. We present Discrete Diffusion VLA, a unified-transformer policy that models discretized action chunks with discrete diffusion, retaining progressive refinement inside the VLM backbone. Our method realizes an adaptive decoding order that resolves high-confidence (easy) action elements before harder ones, and employs secondary re-masking to revisit uncertain predictions, enabling robust error correction. This design preserves pretrained vision–language priors, supports parallel decoding, and improves decoding efficiency. Discrete Diffusion VLA achieves a 96.5% average success rate on LIBERO, 71.2% visual matching on SimplerEnv-Fractal, and 54.2% overall on SimplerEnv-Bridge. On out-of-distribution benchmarks, our method exhibits only 1.4% language degradation versus 8.0% for parallel decoding, and 21.0% vision degradation versus 29.0% for continuous diffusion, demonstrating strong retention of pretrained vision–language capabilities. Visualization analysis confirms that the learned decoding order adaptively prioritizes high-confidence tokens, validating our refinement strategy.
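To make the decoding procedure concrete, the sketch below illustrates confidence-guided parallel refinement with secondary re-masking in the style described above. It is a minimal illustration, not the paper's released implementation: the backbone stub predict_logits, the mask token id, the 256-bin action vocabulary, the chunk length, and the cosine masking schedule are all assumptions made for the example.

```python
# Minimal sketch of confidence-guided discrete diffusion decoding with
# secondary re-masking, assuming a MaskGIT-style cosine schedule.
import math

import torch

MASK_ID = 0      # hypothetical [MASK] token id
VOCAB = 256      # hypothetical number of discretized action bins
CHUNK_LEN = 16   # hypothetical number of action tokens per chunk


def predict_logits(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for the VLM backbone's action-token head.

    A real model would condition on image and instruction tokens;
    here we return random logits so the sketch runs end to end.
    """
    return torch.randn(tokens.shape[0], tokens.shape[1], VOCAB)


@torch.no_grad()
def decode_chunk(steps: int = 4) -> torch.Tensor:
    """Decode one action chunk in `steps` parallel refinement passes."""
    tokens = torch.full((1, CHUNK_LEN), MASK_ID, dtype=torch.long)
    for s in range(steps):
        probs = predict_logits(tokens).softmax(dim=-1)
        conf, pred = probs.max(dim=-1)  # per-position confidence and argmax

        # Tentatively fill every position, including previously committed
        # ones, so low-confidence earlier choices can still be revised.
        tokens = pred

        # Cosine schedule: the number of positions left masked shrinks
        # to zero by the final step, committing easy tokens first.
        n_mask = int(CHUNK_LEN * math.cos(math.pi / 2 * (s + 1) / steps))
        if n_mask > 0:
            # Secondary re-masking: hide the n_mask least-confident
            # positions so later passes can revisit them with more context.
            idx = conf[0].topk(n_mask, largest=False).indices
            tokens[0, idx] = MASK_ID
    return tokens


action_tokens = decode_chunk()  # shape (1, CHUNK_LEN); detokenize to continuous actions
```

Because every pass predicts all positions in parallel and only the least-confident ones stay masked, the decoding order emerges adaptively from the model's confidence rather than from a fixed left-to-right schedule.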