MoDA: Modulation Adapter for Fine-Grained Visual Understanding in Instructional MLLMs
Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable success in instruction-following tasks by integrating pretrained visual encoders with large language models (LLMs). However, existing approaches often struggle with fine-grained visual grounding due to semantic entanglement in visual patch representations, where individual patches blend multiple distinct visual elements, making it difficult for models to focus on instruction-relevant details. To address this challenge, we propose MoDA (Modulation Adapter), a lightweight module that enhances visual grounding through instruction-guided channel-wise modulation. Unlike token-level methods such as Q-Former that perform additive feature selection, MoDA operates at the channel level through multiplicative modulation of already-aligned features, enabling fine-grained control over which embedding dimensions are relevant for each instruction. MoDA applies cross-attention between language instructions and pre-aligned visual features to generate dynamic modulation masks, without architectural modifications or additional supervision. We evaluate MoDA on top of recent baselines, including LLaVA-MoRE (2025), across 12 diverse benchmarks spanning visual question answering, vision-centric reasoning, and hallucination detection, including recent 2024 benchmarks (MMVP, CV-Bench, MMStar, RealWorldQA). MoDA achieves substantial improvements of +12.0 points on MMVP and +4.8 points on ScienceQA, outperforming the baselines on all 12 benchmarks with minimal computational overhead (<1% additional FLOPs).
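To make the mechanism concrete, the sketch below illustrates instruction-guided channel-wise modulation as described in the abstract: visual tokens attend to instruction tokens via cross-attention, and the attended context is projected into per-channel gates that multiplicatively rescale the pre-aligned visual features. This is a minimal PyTorch sketch under assumed names and shapes (ModulationAdapter, d_model, n_heads, patch counts), not the authors' released implementation.

```python
# Minimal sketch of an instruction-guided channel-wise modulation adapter.
# Module and tensor names are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class ModulationAdapter(nn.Module):
    """Produces per-channel gates for visual tokens from instruction tokens."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        # Cross-attention: visual tokens (queries) attend to instruction tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Project the attended context into a channel-wise modulation mask.
        self.to_mask = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.Sigmoid(),  # gates in (0, 1), one value per embedding channel
        )

    def forward(self, visual: torch.Tensor, instruction: torch.Tensor) -> torch.Tensor:
        # visual:      (batch, n_patches, d_model)  pre-aligned visual features
        # instruction: (batch, n_text,    d_model)  instruction token embeddings
        context, _ = self.cross_attn(query=visual, key=instruction, value=instruction)
        mask = self.to_mask(context)  # (batch, n_patches, d_model)
        return visual * mask          # multiplicative, channel-wise modulation


# Usage: modulate pre-aligned visual tokens before passing them to the LLM.
if __name__ == "__main__":
    adapter = ModulationAdapter(d_model=4096)
    visual = torch.randn(2, 576, 4096)       # e.g. projected patch features
    instruction = torch.randn(2, 32, 4096)   # instruction token embeddings
    out = adapter(visual, instruction)
    print(out.shape)                          # torch.Size([2, 576, 4096])
```

Because the gate is a sigmoid applied per embedding dimension, this sketch performs multiplicative, channel-level feature selection rather than the additive, token-level selection of Q-Former-style resamplers, which is the distinction the abstract draws.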