Poster in Workshop: 2nd Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ICML 2024)
Enhancing Fine-grained Multi-modal Alignment via Adapters: A Parameter-Efficient Training Framework for Referring Image Segmentation
Zunnan Xu · Jiaqi Huang · Ting Liu · Yong Liu · Haonan Han · Kehong Yuan · Xiu Li
In the domain of computer vision, Parameter-Efficient Training (PET) is increasingly replacing the traditional paradigm of pre-training followed by full fine-tuning. PET is particularly favored for large-scale models, as it reduces transfer learning costs and improves hardware utilization. However, prevailing PET methods are designed primarily for single-modal optimization and lack mechanisms for fine-grained feature extraction. When applied to multi-modal dense prediction tasks, they typically fall short of full fine-tuning methods that consume far more resources. In this paper, we investigate parameter-efficient training for referring image segmentation. We introduce DenseCrossAdapter, a parameter-efficient module that enhances low-rank visual feature propagation by establishing dense interconnections between each layer and all preceding layers, facilitating robust cross-modal feature interaction. We further employ text adapters to strengthen textual features. Evaluated on challenging benchmarks, our approach substantially surpasses state-of-the-art methods while updating only 0.9% to 1.8% of backbone parameters.
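To make the dense-interconnection idea concrete, below is a minimal PyTorch sketch of a stack of low-rank adapters in which the adapter at each layer also consumes the refined outputs of all preceding adapters. This is an illustration under stated assumptions, not the authors' implementation: the class names (`LowRankAdapter`, `DenselyConnectedAdapterStack`), the bottleneck rank, and the learned softmax fusion of predecessor outputs are all hypothetical choices standing in for the paper's DenseCrossAdapter design.

```python
import torch
import torch.nn as nn


class LowRankAdapter(nn.Module):
    """Standard bottleneck adapter: down-project, nonlinearity, up-project."""

    def __init__(self, dim: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, rank)
        self.act = nn.GELU()
        self.up = nn.Linear(rank, dim)
        # Zero-init the up-projection so the adapter starts as an identity residual.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(x)))


class DenselyConnectedAdapterStack(nn.Module):
    """Hypothetical sketch: the adapter at layer i fuses the outputs of all
    preceding adapters via a learned weighted sum, so low-rank features
    propagate densely across layers. Illustrative only."""

    def __init__(self, num_layers: int, dim: int, rank: int = 16):
        super().__init__()
        self.adapters = nn.ModuleList(
            LowRankAdapter(dim, rank) for _ in range(num_layers)
        )
        # One learnable fusion weight per (layer, predecessor) pair.
        self.fuse = nn.ParameterList(
            nn.Parameter(torch.zeros(i)) for i in range(num_layers)
        )

    def forward(self, layer_features: list[torch.Tensor]) -> list[torch.Tensor]:
        refined: list[torch.Tensor] = []   # adapter outputs from earlier layers
        outputs: list[torch.Tensor] = []
        for i, (feat, adapter) in enumerate(zip(layer_features, self.adapters)):
            agg = feat
            if refined:
                # Dense interconnection: softmax-weighted sum of all
                # preceding adapters' outputs, added to this layer's features.
                w = torch.softmax(self.fuse[i], dim=0)
                agg = agg + sum(wj * rj for wj, rj in zip(w, refined))
            out = adapter(agg)
            refined.append(out)
            outputs.append(feat + out)     # residual connection to backbone
        return outputs
```

In such a setup, only the adapter and fusion parameters would be trainable while the backbone stays frozen, which is consistent in spirit with the paper's reported 0.9% to 1.8% of backbone parameter updates; the exact parameter budget depends on the rank and the number of adapted layers.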