

Poster

GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer

Ding Jia · Jianyuan Guo · Kai Han · Han Wu · Chao Zhang · Chang Xu · Xinghao Chen


Abstract:

Recent advancements in cross-modal transformers have solidified their superiority in various vision tasks, with their success credited to the strategic integration of disparate modalities. This study first critiques prior token exchange methods, which selectively replace less informative tokens with aggregated inter-modal features, and demonstrates that exchange-based methods underperform cross-attention mechanisms, while the computational demand of the latter inevitably restricts its use with longer sequences. To surmount these computational challenges, we propose GeminiFusion, a pixel-wise fusion approach that capitalizes on aligned cross-modal representations. GeminiFusion elegantly combines intra-modal and inter-modal attentions, dynamically integrating complementary information across modalities. We employ a layer-adaptive noise to adaptively control their interplay on a per-layer basis, thereby achieving a harmonized fusion process. Notably, GeminiFusion maintains linear complexity with respect to the number of input tokens, ensuring this multimodal framework operates with efficiency comparable to unimodal networks. Comprehensive evaluations across various multimodal image-to-image translation and semantic segmentation benchmarks, including RGB, depth, LiDAR and event data, demonstrate the superior performance of our GeminiFusion against leading-edge techniques. The code will be made available.
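To make the pixel-wise fusion idea concrete, the following is a minimal, hypothetical PyTorch sketch of how tokens from two spatially aligned modalities could be fused with linear complexity: each token attends only to itself (intra-modal) and to the token at the same location in the other modality (inter-modal), with a learnable noise term perturbing the attention logits. The module name, projection layout, and the exact role of the noise are assumptions for illustration, not the paper's implementation.

# Hypothetical sketch of pixel-wise cross-modal fusion (not the official GeminiFusion code).
import torch
import torch.nn as nn

class PixelWiseFusion(nn.Module):
    """Fuse two aligned token sequences per pixel: each query attends only to its
    own token (intra-modal) and the counterpart token in the other modality
    (inter-modal), so the cost grows linearly with the number of tokens."""
    def __init__(self, dim: int, noise_std: float = 0.1):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        # Learnable per-layer noise scale; how the "layer-adaptive noise" enters
        # is an assumption here (it perturbs the per-pixel attention logits).
        self.noise_scale = nn.Parameter(torch.full((1,), noise_std))

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # x_a, x_b: (B, N, C) tokens from two spatially aligned modalities.
        q = self.q(x_a)                                      # queries from modality A
        k = torch.stack([self.k(x_a), self.k(x_b)], dim=2)   # (B, N, 2, C): self + counterpart keys
        v = torch.stack([self.v(x_a), self.v(x_b)], dim=2)   # (B, N, 2, C)
        logits = (q.unsqueeze(2) * k).sum(-1) * self.scale   # (B, N, 2) per-pixel logits
        if self.training:
            logits = logits + self.noise_scale * torch.randn_like(logits)
        attn = logits.softmax(dim=-1)                        # weight intra vs. inter contribution
        return (attn.unsqueeze(-1) * v).sum(dim=2)           # (B, N, C) fused tokens

# Usage: fuse RGB and depth tokens of matching shape.
# rgb, depth = torch.randn(2, 1024, 256), torch.randn(2, 1024, 256)
# fused = PixelWiseFusion(256)(rgb, depth)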
