Any-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
Abstract
While recent multimodal large language models (MLLMs) have made impressive strides, most adopt a conventional autoregressive architecture as their backbone, leaving significant room to explore more effective and efficient architectural alternatives. Meanwhile, recent studies have successfully applied discrete diffusion models to natural language processing, demonstrating their considerable potential in this domain. Drawing inspiration from this pioneering work, we introduce Any-Diffusion, the first any-to-any multimodal language model built purely on mask-based discrete diffusion, unifying understanding and generation across text, speech, and images. Any-Diffusion employs a single mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens, enabling not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method matches or outperforms existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models as the backbone of the next generation of multimodal foundation models.
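For context, absorbing-state masked discrete diffusion models of the kind referenced here are typically trained with a time-weighted cross-entropy over masked positions. A standard form of this objective from the masked diffusion language modeling literature (not necessarily Any-Diffusion's exact loss) is:

\[
\mathcal{L}(\theta) = - \, \mathbb{E}_{t \sim \mathcal{U}(0,1),\; x_0,\; x_t} \left[ \frac{1}{t} \sum_{i \,:\, x_t^i = \texttt{[MASK]}} \log p_\theta\!\left(x_0^i \mid x_t\right) \right],
\]

where \(x_0\) is a sequence of discrete tokens (here, drawn from any mixture of text, speech, and image tokens), \(x_t\) is produced by independently replacing each token with \(\texttt{[MASK]}\) with probability \(t\), and \(p_\theta\) predicts the original token at each masked position. Under this view, a single model learns the joint distribution over multimodal token sequences, and generation proceeds by iteratively unmasking.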