Seeing to Generalize: How Visual Data Corrects Binding Shortcuts
Abstract
Vision Language Models (VLMs) extend Large Language Models (LLMs) with visual capabilities, yet we observe that visual training can also improve performance on text-only tasks. Using mechanistic interpretability, we identify the cause: visual training induces a qualitative shift in the model's internal binding strategy, from positional to symbolic binding. In a controlled synthetic retrieval task, a transformer trained only on text achieves perfect in-distribution accuracy but generalizes poorly out-of-distribution (OOD), reaching only 37.2\%, while subsequent training on an image-tokenized version of the same task nearly doubles text-only OOD performance to 69.5\%. We trace this improvement to the underlying mechanism: text-only training encourages positional shortcuts, whereas image-based training forces the model to adopt a more robust symbolic binding mechanism that persists when text-only examples are reintroduced. We validate that analogous shifts occur across four pretrained LLM-to-VLM families (Qwen 2, Qwen 2.5, Qwen 3, and InternLM 3). Our findings suggest that cross-modal training can enhance reasoning even for tasks grounded in a single modality.