Poster in Workshop: Trustworthy Multi-modal Foundation Models and AI Agents (TiFA)
Wasserstein Modality Alignment Makes Your Multimodal Transformer More Robust
Zhuo ZHI · Ziquan Liu · Qiangqiang Wu · Miguel Rodrigues
Early fusion in a one-tower model such as a multimodal transformer is an effective multimodal learning paradigm. However, in a multimodal transformer, modality fusion is performed solely through the self-attention function, which was originally designed for unimodal token sequences. To improve the self-attention mechanism for handling multimodal input, a parametric adapter model, such as the Q-Former in BLIP-2, is often used to align tokens from different modalities. Unlike existing methods that use an adapter model for modality alignment, our paper proposes an implicit approach based on the Wasserstein distance that aligns tokens from different modalities in a multimodal transformer without using any additional parameters. Our empirical study shows that this implicit modality alignment improves the effectiveness of the multimodal transformer in discriminative tasks, as well as its robustness to input noise and missing modalities. We conduct experiments on four different types of downstream task datasets, covering both two-modality and three-modality tasks. In standard testing, testing with modality noise, and testing with missing modalities, the average improvements of our method over the baseline across all datasets are 0.9%, 2.5%, and 2.1%, respectively.
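The abstract does not give the exact formulation, but the core idea of a parameter-free Wasserstein alignment term added to the training objective can be sketched roughly as below. This is a minimal illustration only, assuming a sliced-Wasserstein approximation of the distance between the token sets of two modalities; `model`, its outputs, and `lambda_align` are hypothetical placeholders, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def sliced_wasserstein(x, y, n_proj=64):
    """Approximate Wasserstein distance between two token sets x: (Nx, D), y: (Ny, D).

    Tokens are projected onto random unit directions; the sorted 1D projections
    are then compared as empirical quantiles (no learnable parameters involved).
    """
    d = x.size(-1)
    proj = torch.randn(d, n_proj, device=x.device)
    proj = proj / proj.norm(dim=0, keepdim=True)      # random unit directions
    xp = (x @ proj).sort(dim=0).values                # (Nx, n_proj), sorted per direction
    yp = (y @ proj).sort(dim=0).values                # (Ny, n_proj)
    # Crude quantile matching so token sets of different sizes can be compared.
    n = min(xp.size(0), yp.size(0))
    xq = xp[torch.linspace(0, xp.size(0) - 1, n).long()]
    yq = yp[torch.linspace(0, yp.size(0) - 1, n).long()]
    return (xq - yq).abs().mean()


def training_step(model, image_tokens, text_tokens, labels, lambda_align=0.1):
    """Task loss plus an implicit alignment term; no adapter parameters are added."""
    # Hypothetical model interface: returns task logits and per-modality token embeddings.
    logits, img_emb, txt_emb = model(image_tokens, text_tokens)
    task_loss = F.cross_entropy(logits, labels)
    # Align the two modalities' token distributions inside the transformer.
    align_loss = sliced_wasserstein(img_emb.flatten(0, 1), txt_emb.flatten(0, 1))
    return task_loss + lambda_align * align_loss
```

The design point this sketch tries to convey is that, unlike a Q-Former-style adapter, the alignment here enters only as a regularization term on existing token representations, so the transformer's parameter count is unchanged.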