SEPS: Semantic-Enhanced Patch Slimming Framework for Fine-Grained Cross-Modal Alignment
Abstract
Fine-grained cross-modal alignment is pivotal for multimodal reasoning yet remains limited by Semantic Sparsity Bias, a fundamental asymmetry in which dense visual signals are under-represented by sparse textual captions. This disparity leads to the inadvertent suppression of contextually vital visual regions (patch redundancy) and hinders precise concept grounding (patch ambiguity). While Multimodal Large Language Models (MLLMs) offer rich descriptive capabilities, their naive integration often induces semantic drift due to inconsistencies with sparse ground-truth captions. To address these challenges systematically, we present the Semantic-Enhanced Patch Slimming (SEPS) framework. Central to SEPS is a novel Dual-Granularity Semantic Calibration mechanism, which synthesizes a Holistic Visual-Linguistic Anchor from MLLMs to complement the original sparse queries. This mechanism reformulates patch selection as a semantic consensus process, ensuring that retained patches satisfy both local discriminability and global contextual integrity. Furthermore, we propose a Salience-Guided Metric Aggregation strategy that mitigates the similarity dilution effect inherent in global mean pooling, thereby amplifying highly relevant patch-word correspondences. Extensive experiments on the Flickr30K and MS-COCO datasets demonstrate that SEPS surpasses existing state-of-the-art approaches across diverse backbones, delivering significant gains in text-to-image retrieval. The complete implementation is available at https://anonymous.4open.science/r/SEPS/.
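To make the similarity dilution effect mentioned above concrete, the following minimal sketch contrasts global mean pooling over a patch-word similarity matrix with a salience-guided aggregation that keeps only each word's most similar patches. This is an illustrative approximation, not the authors' implementation: the tensor names, the top-k heuristic, and the toy embeddings are assumptions for exposition only.

```python
import torch
import torch.nn.functional as F

def mean_pooled_score(patch_emb: torch.Tensor, word_emb: torch.Tensor) -> torch.Tensor:
    """Global mean pooling: every patch-word pair contributes equally,
    so a few highly relevant pairs are diluted by many irrelevant ones."""
    sim = torch.einsum('pd,wd->pw', patch_emb, word_emb)  # (num_patches, num_words)
    return sim.mean()

def salience_guided_score(patch_emb: torch.Tensor, word_emb: torch.Tensor, k: int = 4) -> torch.Tensor:
    """Illustrative salience-guided aggregation: for each word, average only its
    top-k most similar patches, amplifying strong patch-word correspondences."""
    sim = torch.einsum('pd,wd->pw', patch_emb, word_emb)
    topk = sim.topk(k=min(k, sim.size(0)), dim=0).values  # (k, num_words)
    return topk.mean()

# Toy usage with L2-normalized random embeddings (196 patches, 12 words, dim 256).
patches = F.normalize(torch.randn(196, 256), dim=-1)
words = F.normalize(torch.randn(12, 256), dim=-1)
print(mean_pooled_score(patches, words).item(), salience_guided_score(patches, words).item())
```

Because the top-k average discards the many near-zero patch-word similarities that dominate a full mean, the salience-guided score tracks the strongest correspondences rather than being pulled toward zero by irrelevant patches.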