MODUS: Decoder-only Any-to-Any Modeling of Diverse Modalities
Abstract
Any-to-any modeling aims to flexibly relate arbitrary modalities within a single system, a requirement that arises across multimodal learning and scientific domains such as ecology and astronomy. However, existing any-to-any approaches are typically trained from scratch with encoder–decoder or diffusion architectures, which limits empirical performance and precludes the reuse of pretrained models. We investigate decoder-only any-to-any multimodal modeling, which treats all modalities symmetrically and supports arbitrary modalities as inputs and outputs without modality-specific heads, losses, or task pipelines. As a consequence of this unified design, the resulting model, MODUS, naturally enables chained generation through intermediate modalities, cross-modal consistency verification, and analysis of visual representations that combines semantic and reconstruction features. Across a range of benchmarks, MODUS demonstrates strong out-of-the-box performance and flexible multimodal composition within a single model.