

Poster

Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning

Shibo Jie · Yehui Tang · Ning Ding · Zhi-Hong Deng · Kai Han · Yunhe Wang


Abstract:

Current solutions for efficiently constructing large vision-language (VL) models follow a two-step paradigm: projecting the output of pre-trained vision encoders to the input space of pre-trained language models as visual prompts, and then transferring such joint models to downstream VL tasks via end-to-end parameter-efficient fine-tuning (PEFT). However, this paradigm is still inefficient because it increases the input length of the language models. In this paper, rather than integrating visual prompts into the inputs, we regard visual prompts as additional knowledge that helps language models address tasks involving visual inputs. Motivated by the finding that the Feed-Forward Network (FFN) of language models acts as a "key-value memory", we introduce a novel approach termed memory-space visual prompting (MemVP), wherein visual prompts are concatenated with the weights of the FFN for knowledge injection. Experimental results across various VL tasks and language models show that MemVP significantly reduces the training time and inference latency of the fine-tuned VL models while surpassing the performance of previous PEFT methods.
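
Since the abstract describes injecting visual prompts into the FFN's key-value memory rather than the input sequence, a minimal PyTorch sketch of that idea may be helpful. The module name MemVPFFN, the dimensions, the GELU activation, and the scaling factor are illustrative assumptions, not the paper's exact configuration; the point is only to show visual features acting as extra key/value slots alongside the frozen FFN weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemVPFFN(nn.Module):
    """Sketch of a memory-space visual-prompt FFN.

    A standard FFN computes W_out * act(W_in * x): rows of W_in act as "keys"
    and columns of W_out as "values". Here, projected visual features are
    appended as extra key/value slots, so visual knowledge enters the FFN
    memory instead of lengthening the token sequence.
    """

    def __init__(self, d_model: int, d_ffn: int, d_visual: int, scale: float = 1.0):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ffn)             # frozen pre-trained "keys"
        self.w_out = nn.Linear(d_ffn, d_model)            # frozen pre-trained "values"
        self.visual_proj = nn.Linear(d_visual, d_model)   # small trainable projector (assumed)
        self.scale = scale                                # illustrative scaling of visual keys

    def forward(self, x: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); visual_feats: (batch, n_patches, d_visual)
        vp = self.visual_proj(visual_feats)               # (batch, n_patches, d_model)

        # Standard FFN path over the pre-trained keys.
        h_text = F.gelu(self.w_in(x))                     # (batch, seq_len, d_ffn)

        # Extra "key" slots: token-to-visual-prompt similarities.
        h_vis = F.gelu(self.scale * torch.einsum("bsd,bpd->bsp", x, vp))

        # Pre-trained values plus visual prompts acting as extra values.
        return self.w_out(h_text) + torch.einsum("bsp,bpd->bsd", h_vis, vp)


if __name__ == "__main__":
    ffn = MemVPFFN(d_model=512, d_ffn=2048, d_visual=768)
    tokens = torch.randn(2, 16, 512)
    patches = torch.randn(2, 49, 768)
    print(ffn(tokens, patches).shape)  # torch.Size([2, 16, 512])
```

Because the extra slots live in the FFN's hidden dimension, the sequence length seen by attention is unchanged, which is where the claimed training and inference savings over input-space visual prompting would come from.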
