Poster
Double-Filter: Efficient Fine-tuning of Pre-trained Vision-Language Models via Patch&Layer Filtering
Yaoqin He · Junchen Fu · Kaiwen Zheng · Songpei Xu · Fuhai Chen · Jie Li · Joemon Jose · Xuri Ge
East Exhibition Hall A-B #E-2311
In this paper, we present a novel approach, termed Double-Filter, to “slim down” the fine-tuning of vision-language pre-trained (VLP) models by filtering redundancy in both feature inputs and architectural components. We enhance fine-tuning in two ways. First, we develop a new patch selection method that filters image patches via background and foreground separation, followed by a refined patch selection process. Second, we design a genetic algorithm that eliminates redundant fine-grained architecture layers, improving the efficiency and effectiveness of the model. The former makes patch selection semantically more comprehensive, improving inference efficiency while preserving semantic representation; the latter’s fine-grained layer filter removes as much architectural redundancy as possible while mitigating the impact on performance. Experimental results demonstrate that the proposed Double-Filter achieves superior fine-tuning efficiency while maintaining competitive performance compared with advanced efficient fine-tuning methods on three downstream tasks: VQA, NLVR and Retrieval. In addition, it proves effective under both the METER and ViLT VLP models.
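The abstract describes patch filtering via background and foreground separation but not its exact scoring rule. Below is a minimal PyTorch sketch of one plausible reading, assuming a per-patch saliency score (e.g., attention from the [CLS] token) is already available; the function name and the keep_ratio and bg_quota parameters are illustrative choices, not the paper's.

```python
import torch

def select_patches(patch_embeds, saliency, keep_ratio=0.5, bg_quota=0.2):
    """Keep the most salient foreground patches plus a small quota of
    background patches for context (hypothetical reading of the method).

    patch_embeds: (N, D) patch embeddings
    saliency:     (N,)  per-patch foreground scores (assumed given)
    """
    n = patch_embeds.size(0)
    k = int(n * keep_ratio)            # total patch budget after filtering
    k_bg = int(k * bg_quota)           # reserve part of the budget for background
    k_fg = k - k_bg
    order = torch.argsort(saliency, descending=True)
    fg_idx = order[:k_fg]              # strongest foreground patches
    bg_idx = order[-k_bg:] if k_bg > 0 else order[:0]  # representative background
    keep = torch.cat([fg_idx, bg_idx]).sort().values   # restore spatial order
    return patch_embeds[keep], keep
```

Keeping a background quota, rather than foreground patches alone, is what the abstract calls making patch selection "semantically more comprehensive".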
Modern AI models pre-trained on both images and text (vision-language models) are powerful but often inefficient to adapt for specific tasks. Our new method, Double-Filter, streamlines this process by cutting unnecessary computation in two ways:

Adaptive Patch Selection: Unlike methods that focus only on foreground objects, our approach dynamically evaluates both foreground and background regions, retaining only the most semantically meaningful patches. This ensures comprehensive image understanding while avoiding redundant processing of unimportant details.

Optimized Architecture: Using an automated search inspired by evolution (a genetic algorithm), we identify and remove redundant layers in the model, making it leaner while preserving accuracy (a sketch of this search follows below).

Tests on three key tasks, visual question answering (VQA), visual reasoning (NLVR), and image-text retrieval, show that Double-Filter makes fine-tuning faster and more efficient, with little to no drop in performance. It works across different popular models, offering a practical way to reduce cost and energy use in AI applications.

Why it matters: By intelligently filtering data and architecture, our method helps deploy adaptable AI systems more sustainably, balancing speed, cost, and accuracy.
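The layer filter is described only as a genetic algorithm over the model's layers. Here is a minimal, generic GA sketch over binary keep/drop masks, assuming a user-supplied fitness function eval_fn that trades task accuracy against the number of retained layers; the operators (truncation selection, one-point crossover, bit-flip mutation) and all hyperparameters are standard defaults, not the paper's settings.

```python
import random

def genetic_layer_search(eval_fn, num_layers, pop_size=20,
                         generations=30, mutate_p=0.1):
    """Search for a binary layer mask (1 = keep layer, 0 = drop layer).
    eval_fn(mask) -> float is assumed to score a candidate mask, e.g. by
    fine-tuning briefly and penalizing the number of kept layers."""
    # Random initial population of layer masks.
    pop = [[random.randint(0, 1) for _ in range(num_layers)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=eval_fn, reverse=True)
        parents = scored[: pop_size // 2]          # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, num_layers)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation with probability mutate_p per layer.
            child = [1 - g if random.random() < mutate_p else g
                     for g in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=eval_fn)                   # best mask found
```

In practice the expensive part is eval_fn; the search itself is cheap, which is why an evolutionary search is a reasonable fit for pruning a fixed, moderate number of transformer layers.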