Uncovering Grounding IDs: How External Cues Shape Multi-Modal Binding
Abstract
Large vision–language models (LVLMs) perform well on multimodal tasks, yet they still struggle to reason over and precisely bind visual components, such as objects, to their corresponding textual descriptions. In this study, we show that external visual cues, such as symbols or grid lines, help LVLMs form more accurate associations between visual and textual information, improving their grounding and reasoning abilities. We introduce the concept of Grounding IDs: latent identifiers that emerge within the model when external cues impose a shared structure on both modalities. Our analysis reveals that partition-inducing external cues give rise to Grounding IDs that tighten the alignment between corresponding visual and textual representations, helping the model attend to relevant information. We find that Grounding IDs strengthen attention between related components, improving cross-modal grounding and reducing hallucinations. Overall, our results identify Grounding IDs as a key mechanism by which external cues improve cross-modal alignment, reduce errors, and enhance the performance of LVLMs across a range of multimodal tasks.