

Poster

FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion

Zehan Wang · Ziang Zhang · xize cheng · Rongjie Huang · Luping Liu · Zhenhui Ye · Haifeng Huang · Yang Zhao · Tao Jin · Peng Gao · Zhou Zhao

Hall C 4-9 #2602
Tue 23 Jul 4:30 a.m. PDT — 6 a.m. PDT

Abstract:

Unified multimodal representation spaces are the foundation of multimodal understanding and generation. However, the billions of model parameters and the catastrophic forgetting problem make it challenging to further enhance pre-trained unified spaces. In this work, we propose FreeBind, an idea that treats multimodal representation spaces as basic units and freely augments a pre-trained unified space by integrating knowledge from extra expert spaces via "space bonds". Specifically, we introduce two kinds of basic space bonds: 1) Space Displacement Bond and 2) Space Combination Bond. Based on these basic bonds, we design Complex Sequential & Parallel Bonds to effectively integrate multiple spaces simultaneously. Benefiting from this modularization concept, we further propose a coarse-to-fine customized inference strategy to flexibly adjust the enhanced unified space for different purposes. Experimentally, we bind ImageBind with extra image-text and audio-text expert spaces, resulting in three main variants: ImageBind++, InternVLIB, and InternVLIB++. The resulting spaces outperform ImageBind on 5 audio-image-text downstream tasks across 9 datasets. Moreover, via customized inference, they even surpass the advanced audio-text and image-text expert spaces. Our code and checkpoints are released at https://github.com/zehanwang01/FreeBind
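The abstract describes "space bonds" only at a high level. As a rough illustration of what a space combination bond might look like, the sketch below fuses an embedding from a pre-trained unified space with one from an expert space by normalized interpolation. All names (`combine_spaces`, `l2_normalize`) and the fixed fusion weight `alpha` are hypothetical assumptions for illustration, not the method or released code of the paper.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Normalize embeddings to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def combine_spaces(z_unified: np.ndarray,
                   z_expert: np.ndarray,
                   alpha: float = 0.5) -> np.ndarray:
    """Hypothetical 'space combination bond': fuse an embedding from the
    pre-trained unified space with one from an expert space.

    Assumes both embeddings have already been aligned to the same
    dimensionality; the scalar weight `alpha` stands in for whatever
    fusion rule FreeBind actually uses.
    """
    z = alpha * l2_normalize(z_unified) + (1.0 - alpha) * l2_normalize(z_expert)
    return l2_normalize(z)

# Toy usage: fuse two 4-dimensional embeddings of the same input.
z_ib = np.array([0.2, 0.5, -0.1, 0.8])   # e.g. a unified-space (ImageBind) embedding
z_exp = np.array([0.3, 0.4, 0.0, 0.7])   # e.g. an expert image-text embedding
fused = combine_spaces(z_ib, z_exp, alpha=0.7)
print(fused)
```

Under this reading, the coarse-to-fine customized inference mentioned in the abstract would correspond to choosing different fusion weights (or dropping some expert spaces entirely) depending on the downstream task; this is an interpretation, not a detail stated in the abstract.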
