Understanding Multimodal Learning: A Loss Landscape Smoothness Perspective
Abstract
A surge of recent advancements has consistently highlighted the superiority of multimodal learning over unimodal approaches across a variety of tasks. However, the theoretical foundations of this advantage remain underexplored: existing analyses often rest on restrictive assumptions and lack empirical validation. In this paper, we bridge this gap by proposing a novel theoretical framework grounded in \textit{convolutional smoothing}, offering a new perspective on how multimodal learning yields a smoother loss landscape than unimodal learning. Building on this theoretical foundation, we introduce a simple yet effective distributional training approach that replaces fixed modality pairing with stochastic modality pairing, further promoting a flatter landscape via convolutional smoothing. Our empirical results across various multimodal datasets demonstrate that multimodal models not only achieve better performance but also exhibit smoother loss landscapes, indicating better robustness and generalization.
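To make the pairing idea concrete, the following is a minimal sketch of stochastic modality pairing, not the paper's actual implementation: all names, shapes, and the toy loss are illustrative assumptions. Instead of always fusing the index-aligned partner, each modality-A sample is fused with a randomly drawn same-class modality-B sample, and averaging the loss over many such pairings acts like convolving the loss with the pairing distribution.

```python
import numpy as np

# Hypothetical sketch: stochastic modality pairing vs. fixed pairing.
# Names, shapes, and the toy loss below are illustrative assumptions.
rng = np.random.default_rng(0)

def stochastic_pairs(labels_a, labels_b, rng):
    """For each modality-A sample, draw a random same-class partner
    index from modality B, rather than the fixed index-aligned one."""
    partners = np.empty(len(labels_a), dtype=int)
    for i, y in enumerate(labels_a):
        candidates = np.flatnonzero(labels_b == y)
        partners[i] = rng.choice(candidates)
    return partners

# Toy data: two modalities over three classes, index-aligned by default.
labels = rng.integers(0, 3, size=32)
feats_a = rng.normal(size=(32, 4)) + labels[:, None]
feats_b = rng.normal(size=(32, 4)) + labels[:, None]

def pair_loss(w, xa, xb):
    # Toy quadratic loss on additively fused features.
    fused = xa + xb
    return np.mean((fused @ w - 1.0) ** 2)

w = rng.normal(size=4)
# Averaging over many random pairings approximates a convolution of the
# loss with the pairing distribution, i.e. a smoothed loss landscape.
smoothed = np.mean([
    pair_loss(w, feats_a, feats_b[stochastic_pairs(labels, labels, rng)])
    for _ in range(50)
])
```

In this sketch, the fixed-pairing loss is `pair_loss(w, feats_a, feats_b)`, while `smoothed` is its convolutionally smoothed counterpart obtained by sampling pairings from a distribution rather than using one fixed assignment.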