Awakening Visual Reasoning: Mitigating Post-Training Failure in Vision-Text Compression
Abstract
Vision-Text Compression (VTC) offers a scalable path for long-context multimodal modeling by rendering textual data into dense visual tokens. While recent Vision-Language Models (VLMs) demonstrate high decoding fidelity (OCR) on such inputs, they exhibit a severe reasoning gap: models that reason robustly on native text often fail on visually compressed equivalents, particularly in long-range retrieval and multi-step deduction. We identify a phenomenon of post-training transfer failure, where standard supervised fine-tuning (SFT) and reinforcement learning (RL) on visual prompts yield only marginal gains compared to their textual counterparts. To address this, we propose CoRe (Coordinated Reasoning), a training framework that enforces lockstep consistency between the reasoning processes of the textual and visual modalities. By treating the text-conditioned policy as a dynamic anchor, CoRe aligns the visual-conditioned policy via step-wise distribution matching, and it integrates seamlessly into both SFT and RL pipelines. Extensive evaluations across mathematical reasoning, long-context memory, and tabular retrieval benchmarks show that CoRe significantly outperforms standard visual post-training, recovering up to 70% of the performance gap relative to the textual upper bound and effectively activating latent reasoning capabilities in the compressed visual modality.
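A minimal sketch of the alignment objective this abstract describes, under the assumption that "step-wise distribution matching" is a per-token KL divergence between the two conditioned policies; the notation (x^T, x^V, the shared prefix y_{<t}) is illustrative and not taken from the paper:

\[
\mathcal{L}_{\mathrm{CoRe}}
  \;=\; \mathbb{E}_{(x^{T},\, x^{V},\, y)}
  \left[
    \sum_{t=1}^{|y|}
    D_{\mathrm{KL}}\!\Big(
      \pi_{\theta}\big(\cdot \mid x^{T},\, y_{<t}\big)
      \;\Big\|\;
      \pi_{\theta}\big(\cdot \mid x^{V},\, y_{<t}\big)
    \Big)
  \right]
\]

Here x^T is the native textual prompt, x^V its rendered visual compression, and y_{<t} the shared reasoning prefix at step t. Because the text-conditioned distribution serves as a dynamic rather than frozen anchor, it continues to be updated by the base SFT or RL loss, so the KL term continually pulls the visual-conditioned policy toward whatever reasoning distribution the textual modality currently induces.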