ViewMask-1-to-3: Multi-View Consistent Image Generation via Multimodal Diffusion Models
Abstract
Motivated by discrete diffusion's success in language-vision modeling, we explore its potential for multi-view generation, a task dominated by continuous approaches. We introduce \textbf{ViewMask-1-to-3}, which formulates multi-view synthesis as a discrete sequence modeling problem in which each viewpoint is represented as visual tokens from MAGVIT-v2. Through \textbf{masked token prediction}, our approach generates views progressively via \textbf{iterative token unmasking}, unifying language and vision in a shared token space. Importantly, simple random masking combined with self-attention naturally encourages cross-view consistency without specialized architectures or 3D geometric priors. Our method outperforms the baseline on the GSO and 3D-FUTURE benchmarks, ranking first on average across standard image metrics and improving IoU by 10.6\% on 3D-FUTURE. These results establish discrete diffusion as a promising candidate for multi-view generation.
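To make the progressive unmasking concrete, below is a minimal MaskGIT-style sketch in PyTorch. It is an illustration under stated assumptions, not the paper's implementation: the function name `unmask_views`, the cosine masking schedule, greedy decoding, the token counts, and the `transformer` interface (a callable mapping a flat token sequence to per-position logits) are all hypothetical, and the MAGVIT-v2 tokenizer that would produce `cond_tokens` from the input view is omitted.

```python
# Hypothetical sketch of iterative token unmasking across views.
# Names, shapes, and the schedule are illustrative assumptions.
import math
import torch


def unmask_views(transformer, cond_tokens, num_views=3, tokens_per_view=256,
                 vocab_size=8192, mask_id=8192, steps=12):
    """Jointly unmask the visual tokens of all target views.

    All views share one flat sequence, so the transformer's self-attention
    attends across views at every step -- the mechanism the abstract
    credits with encouraging cross-view consistency.
    """
    n = num_views * tokens_per_view
    tokens = torch.full((1, n), mask_id, dtype=torch.long)

    for step in range(steps):
        # Cosine schedule (as in MaskGIT): how many tokens remain masked
        # after this step; reaches zero at the final step.
        num_masked_next = int(n * math.cos(math.pi / 2 * (step + 1) / steps))

        # Predict all target positions conditioned on the input view's tokens.
        logits = transformer(torch.cat([cond_tokens, tokens], dim=1))
        logits = logits[:, cond_tokens.shape[1]:, :vocab_size]
        probs = logits.softmax(dim=-1)
        pred = probs.argmax(dim=-1)        # greedy decoding, for brevity
        conf = probs.max(dim=-1).values    # confidence of each prediction

        # Already-committed tokens never compete for re-masking.
        still_masked = tokens == mask_id
        conf = conf.masked_fill(~still_masked, float("inf"))

        if num_masked_next > 0:
            # Keep the lowest-confidence predictions masked for later steps.
            remask_idx = conf.topk(num_masked_next, largest=False).indices
            pred = pred.scatter(1, remask_idx, mask_id)

        # Commit predictions only at positions that were still masked.
        tokens = torch.where(still_masked, pred, tokens)

    return tokens.view(1, num_views, tokens_per_view)


# Usage with a stand-in model (the real model would be a multimodal
# transformer over MAGVIT-v2 tokens, with mask_id = vocab_size):
dummy = lambda seq: torch.randn(1, seq.shape[1], 8193)
views = unmask_views(dummy, cond_tokens=torch.zeros(1, 256, dtype=torch.long))
```

Because every view's tokens sit in the same sequence, each unmasking step conditions on the partially revealed tokens of all other views, which is how plain random masking plus self-attention can yield consistency without explicit 3D priors.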