Poster

GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer

Ding Jia · Jianyuan Guo · Kai Han · Han Wu · Chao Zhang · Chang Xu · Xinghao Chen

Hall C 4-9 #203
Tue 23 Jul 2:30 a.m. PDT — 4 a.m. PDT

Abstract: Cross-modal transformers have demonstrated superiority in various vision tasks by effectively integrating different modalities. This paper first critiques prior token-exchange methods, which replace less informative tokens with inter-modal features, and demonstrates that exchange-based methods underperform cross-attention mechanisms, while the computational demand of the latter inevitably restricts its use with longer sequences. To surmount the computational challenges, we propose *GeminiFusion*, a pixel-wise fusion approach that capitalizes on aligned cross-modal representations. *GeminiFusion* elegantly combines intra-modal and inter-modal attentions, dynamically integrating complementary information across modalities. We employ a layer-adaptive noise to adaptively control their interplay on a per-layer basis, thereby achieving a harmonized fusion process. Notably, *GeminiFusion* maintains linear complexity with respect to the number of input tokens, ensuring the multimodal framework operates with efficiency comparable to unimodal networks. Comprehensive evaluations across multimodal image-to-image translation, 3D object detection, and arbitrary-modal semantic segmentation tasks, covering RGB, depth, LiDAR, event data, and more, demonstrate the superior performance of *GeminiFusion* against leading-edge techniques. The PyTorch code is available [here](https://github.com/JiaDingCN/GeminiFusion).
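To illustrate how pixel-wise fusion can stay linear in the number of tokens, below is a minimal PyTorch sketch, not the authors' implementation, of the idea described in the abstract: for two spatially aligned modalities, each pixel attends only over its own token and the corresponding token from the other modality, and a layer-specific noise scale perturbs the keys to modulate the intra- vs. inter-modal balance. The module name `PixelWiseFusion`, the shapes, and the single-scalar noise parameter are illustrative assumptions; consult the linked repository for the actual GeminiFusion design.

```python
import torch
import torch.nn as nn


class PixelWiseFusion(nn.Module):
    """Sketch of pixel-wise cross-modal fusion for two aligned modalities.

    Inputs are token sequences of shape (B, N, C) with matching pixel
    positions. Each query attends over exactly two candidates (its own
    token and the aligned token from the other modality), so the cost is
    O(N) in the number of tokens rather than O(N^2).
    """

    def __init__(self, dim: int, noise_std: float = 0.1):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # hypothetical layer-adaptive noise scale (one scalar per layer)
        self.noise_scale = nn.Parameter(torch.tensor(noise_std))
        self.scale = dim ** -0.5

    def fuse(self, x_q: torch.Tensor, x_kv: torch.Tensor) -> torch.Tensor:
        q = self.q(x_q)                                    # (B, N, C)
        k_self, v_self = self.k(x_q), self.v(x_q)          # intra-modal key/value
        k_cross, v_cross = self.k(x_kv), self.v(x_kv)      # inter-modal key/value
        if self.training:
            # noise on the keys rebalances intra- vs. inter-modal attention
            k_self = k_self + self.noise_scale * torch.randn_like(k_self)
            k_cross = k_cross + self.noise_scale * torch.randn_like(k_cross)
        # per-pixel attention over two candidates -> linear complexity in N
        k = torch.stack([k_self, k_cross], dim=2)          # (B, N, 2, C)
        v = torch.stack([v_self, v_cross], dim=2)          # (B, N, 2, C)
        attn = (q.unsqueeze(2) * k).sum(-1) * self.scale   # (B, N, 2)
        attn = attn.softmax(dim=-1)
        return (attn.unsqueeze(-1) * v).sum(dim=2)         # (B, N, C)

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor):
        # fuse each modality with the other, symmetrically
        return self.fuse(x_a, x_b), self.fuse(x_b, x_a)


# usage example with two aligned modalities (e.g., RGB and depth tokens)
if __name__ == "__main__":
    fusion = PixelWiseFusion(dim=64)
    rgb_tokens = torch.randn(2, 196, 64)
    depth_tokens = torch.randn(2, 196, 64)
    fused_rgb, fused_depth = fusion(rgb_tokens, depth_tokens)
    print(fused_rgb.shape, fused_depth.shape)  # torch.Size([2, 196, 64]) x2
```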
