SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
Abstract
Multimodal Large Language Models (MLLMs) possess intrinsic reasoning and world-knowledge capabilities, yet adapting them for dense retrieval remains challenging. Existing approaches typically rely on invasive parameter updates, such as full fine-tuning and LoRA, which risk disrupting the pre-trained semantic manifold and degrading the complex knowledge structures crucial for logical inference. To address this, we propose SLQ, a parameter-efficient tuning framework that adapts MLLMs for retrieval while keeping the backbone entirely frozen. SLQ introduces a small set of Shared Latent Queries that are appended to both text and image token sequences, leveraging the model’s native causal attention to aggregate multimodal context into a unified embedding space. Furthermore, to rigorously evaluate retrieval beyond superficial pattern matching, we construct KARR-Bench, a benchmark designed for knowledge-aware reasoning retrieval. Extensive experiments demonstrate that SLQ outperforms full fine-tuning and LoRA baselines on COCO and Flickr30K, and surpasses them by a significant margin on KARR-Bench, validating that preserving the frozen semantic manifold via non-invasive adaptation is an effective strategy for MLLM-based retrieval.
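To make the mechanism concrete, the sketch below illustrates the core idea in PyTorch: a small set of shared learnable latent queries is appended after the content tokens of a frozen decoder-only backbone, so causal attention lets the queries read the full multimodal context, and their final hidden states are pooled into a single retrieval embedding. The names (`SLQPooler`, `num_queries`), the Hugging Face-style `inputs_embeds`/`last_hidden_state` interface, and the mean-pooling choice are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch of the Shared Latent Query idea, assuming a decoder-only
# MLLM backbone with a Hugging Face-style interface (accepts `inputs_embeds`
# and `attention_mask`, returns `last_hidden_state`). Hypothetical names,
# not the authors' code.
import torch
import torch.nn as nn


class SLQPooler(nn.Module):
    """Appends shared learnable latent queries to a frozen token sequence
    and pools their final hidden states into one retrieval embedding."""

    def __init__(self, backbone: nn.Module, hidden_dim: int, num_queries: int = 8):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():  # keep the MLLM entirely frozen
            p.requires_grad_(False)
        # The only trainable parameters: a small set of shared latent queries,
        # reused identically for both text and image inputs.
        self.latent_queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor, attn_mask: torch.Tensor) -> torch.Tensor:
        # token_embeds: (B, L, D) text or image token embeddings from the
        # backbone's input embedding layer; attn_mask: (B, L), 1 = real token.
        B = token_embeds.size(0)
        queries = self.latent_queries.unsqueeze(0).expand(B, -1, -1)  # (B, Q, D)
        # Append the queries AFTER the content tokens so native causal
        # attention lets them aggregate the full multimodal context.
        inputs = torch.cat([token_embeds, queries], dim=1)  # (B, L+Q, D)
        mask = torch.cat([attn_mask, attn_mask.new_ones(B, queries.size(1))], dim=1)
        hidden = self.backbone(inputs_embeds=inputs, attention_mask=mask).last_hidden_state
        # Mean-pool the query positions into one embedding (assumed pooling),
        # then L2-normalize for cosine-similarity retrieval.
        emb = hidden[:, -queries.size(1):, :].mean(dim=1)  # (B, D)
        return nn.functional.normalize(emb, dim=-1)
```

Because the queries sit at the end of the causal sequence, the backbone's weights and its pre-trained semantic manifold are untouched; only the query vectors receive gradients, which is what allows the same module to embed text and images into a shared space.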