The Convergent Representation of Vision-Language Contrastive Learning: Geometry, Modality Gap and Shared Space Alignment
Abstract
Multimodal contrastive learning (MCL) aims to embed data from two modalities in a shared embedding space. In practice, however, image and text representations occupy completely separate regions of the embedding space, a phenomenon called the modality gap. Moreover, experimental findings on how the size of the modality gap affects downstream performance are inconsistent. These observations raise two key questions: (1) What causes the modality gap? (2) What determines downstream performance? To address these questions, we introduce the first theoretical framework for analyzing the convergent optimal representations (COR) of MCL, i.e., the representations obtained when the training objective is fully optimized. We prove that the modality gap arises when image and text representations collapse into different subspaces, a phenomenon called \emph{dimension collapse}. Our theorem also reveals that while the modality gap prevents image and text representations from aligning directly, their projections onto the shared space can be aligned, and this shared-space alignment plays the dominant role in determining downstream performance. Inspired by these findings, we propose Shared Space Alignment (SSA), which improves MCL pretraining by enhancing alignment within the shared space. Extensive experiments validate our theoretical analysis and the proposed method.
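As a concrete illustration of the shared-space alignment idea, the sketch below estimates a shared subspace from the pooled image and text embeddings and penalizes misalignment of paired projections within it. The abstract does not specify the SSA objective at this level of detail; the SVD-based subspace estimate, the subspace dimension k, and the cosine penalty are assumptions made purely for illustration.

```python
import torch
import torch.nn.functional as F

def shared_space_alignment_loss(img_emb: torch.Tensor,
                                txt_emb: torch.Tensor,
                                k: int = 32) -> torch.Tensor:
    """A minimal sketch of a shared-space alignment term (hypothetical,
    not the paper's SSA).

    img_emb, txt_emb: (N, d) L2-normalized embeddings of paired
        images and texts from the two encoders.
    k: assumed dimensionality of the shared subspace.
    """
    # Estimate a shared subspace from the centered union of both
    # modalities' embeddings; the top-k right singular vectors span
    # the directions occupied by both modalities.
    joint = torch.cat([img_emb, txt_emb], dim=0)
    joint = joint - joint.mean(dim=0, keepdim=True)
    _, _, vh = torch.linalg.svd(joint, full_matrices=False)
    # Detach the basis so gradients do not flow through the SVD,
    # which is numerically fragile near repeated singular values.
    basis = vh[:k].T.detach()  # (d, k) orthonormal basis

    # Project each modality onto the shared subspace and re-normalize.
    img_shared = F.normalize(img_emb @ basis, dim=-1)
    txt_shared = F.normalize(txt_emb @ basis, dim=-1)

    # Encourage paired image/text projections to align (cosine -> 1)
    # inside the shared subspace, even if a modality gap persists in
    # the full embedding space.
    return (1.0 - (img_shared * txt_shared).sum(dim=-1)).mean()
```

One plausible use is to add this term, with a small weight, to the standard CLIP-style contrastive loss during pretraining, so that the model is rewarded for shared-space alignment without being forced to close the modality gap itself.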