Towards Unified Multimodal Pretraining
Abstract
Unified multimodal models aim to consume and produce both visual and language data within a single system. In this work, we explore the design space of unified multimodal pretraining through a controlled, from-scratch study. We find that a single high-dimensional semantic encoder (e.g., SigLIP 2) achieves the best combined performance across visual understanding and generation. Furthermore, we observe that integrating diverse visual data, including raw video and image-text pairs, has minimal impact on language capabilities, suggesting that vision and text are compatible within a single unified model. We also identify positive synergy: joint pretraining enhances downstream capabilities such as visual question answering (VQA) and world modeling. Turning to architecture, we investigate Mixture-of-Experts (MoE) design choices, such as granularity and sparsity, to identify an effective training recipe. Finally, we quantify scaling dynamics via IsoFLOP analysis and uncover a scaling asymmetry: language scaling is parameter-hungry, whereas vision scaling is significantly more data-hungry. We demonstrate that MoE architectures help address this imbalance by decoupling total parameter capacity from active compute, providing the high capacity language requires while accommodating the data-intensive nature of vision.
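As a rough, self-contained illustration of the decoupling between total capacity and active compute described above, the sketch below compares dense and MoE parameter counts for a hypothetical transformer feed-forward layer. All widths, expert counts, and the top-k routing value are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch (hypothetical numbers, not the paper's configuration):
# an MoE layer stores num_experts expert FFNs but routes each token to
# only top_k of them, so total parameters grow with num_experts while
# the parameters *active* per token grow only with top_k.

def ffn_params(d_model: int, d_ff: int) -> int:
    """Parameters of one feed-forward expert (two linear maps, no bias)."""
    return 2 * d_model * d_ff

def moe_layer_params(d_model: int, d_ff: int, num_experts: int, top_k: int):
    total = num_experts * ffn_params(d_model, d_ff)   # capacity stored in the model
    active = top_k * ffn_params(d_model, d_ff)        # compute spent per token
    return total, active

if __name__ == "__main__":
    d_model, d_ff = 2048, 8192  # illustrative widths
    for num_experts, top_k in [(1, 1), (8, 2), (64, 2)]:
        total, active = moe_layer_params(d_model, d_ff, num_experts, top_k)
        print(f"experts={num_experts:3d} top_k={top_k}: "
              f"total={total / 1e6:7.1f}M params, active={active / 1e6:6.1f}M per token")
```

Under these assumed numbers, growing the expert count raises total capacity (serving language's parameter hunger) while per-token compute stays fixed, leaving the FLOP budget free to be spent on more data for vision.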