Thinking in Latent Space: Progressive Multimodal Simplification for Visual Reasoning
Abstract
Recent Multimodal Large Language Models (MLLMs) have advanced cross-modal reasoning by extending Chain-of-Thought (CoT) prompting to visual tasks. However, existing methods still rely heavily on explicit textual reasoning steps, leading to information loss, unstable perception–reasoning interaction, and high computational cost. Inspired by human cognition, we argue that effective visual reasoning emerges from a dynamic interplay between perception and latent thought rather than from a purely linear verbalization process. Building on this insight, we propose Latent-Driven Progressive Visual Reasoning (LDPVR), a framework that formulates multimodal reasoning as a Markov Chain of Recursive State Simplification, in which explicit textual states are progressively refined under the guidance of latent transitions. Central to LDPVR is Interleaved Latent Grounding, which leverages latent semantic intent to actively retrieve fine-grained visual evidence and drive robust state evolution, allowing the model to iteratively reduce uncertainty before committing to a simplified textual state. To optimize this process, we introduce a three-stage training curriculum that combines supervised fine-tuning, latent-text distillation, and reinforcement learning via Group Relative Policy Optimization (GRPO). Experiments on six multimodal reasoning benchmarks demonstrate that LDPVR improves reasoning accuracy while keeping inference latency low. Code will be made publicly available.
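To make the reasoning loop sketched in the abstract concrete, the following is a minimal, illustrative Python sketch of how latent-guided progressive state simplification might be organized. All names here (MockMLLM, retrieve_evidence, simplify_state, the threshold tau, etc.) are hypothetical placeholders introduced for illustration, not the paper's actual interfaces or implementation.

```python
import random
from typing import List

# A deliberately tiny mock "model" so the loop below runs end-to-end;
# every method is a stand-in for the corresponding MLLM component.
class MockMLLM:
    def encode_image(self, image) -> List[str]:
        return ["region_0", "region_1", "region_2"]        # stand-in visual tokens

    def init_latent(self, question: str) -> List[float]:
        return [0.0] * 4                                    # stand-in latent thought

    def retrieve_evidence(self, visual_tokens: List[str], latent: List[float]) -> str:
        return random.choice(visual_tokens)                 # latent-guided retrieval

    def transition(self, latent: List[float], evidence: str) -> List[float]:
        return [v + 0.3 for v in latent]                    # latent state update

    def simplify_state(self, text: str, latent: List[float]) -> str:
        return text + " -> simplified"                      # commit to a simpler text state

    def uncertainty(self, latent: List[float]) -> float:
        return max(0.0, 1.0 - sum(latent) / len(latent))    # shrinks as latent evolves

    def answer(self, text: str, latent: List[float]) -> str:
        return f"answer derived from: {text}"


def ldpvr_reason(image, question: str, model, max_steps: int = 6, tau: float = 0.1) -> str:
    """Progressive simplification loop: refine the explicit textual state under
    latent guidance until estimated uncertainty drops below a threshold."""
    visual_tokens = model.encode_image(image)               # perception
    text, latent = question, model.init_latent(question)

    for _ in range(max_steps):
        # Interleaved Latent Grounding: latent intent fetches fine-grained evidence.
        evidence = model.retrieve_evidence(visual_tokens, latent)
        # Latent transition, then commit to a simpler explicit textual state
        # (Markov-style: the next state depends only on the current state and evidence).
        latent = model.transition(latent, evidence)
        text = model.simplify_state(text, latent)
        if model.uncertainty(latent) < tau:                 # stop once confident enough
            break

    return model.answer(text, latent)


if __name__ == "__main__":
    print(ldpvr_reason(image=None, question="How many red cubes remain?", model=MockMLLM()))
```

This sketch only mirrors the control flow implied by the abstract, namely interleaving latent-driven evidence retrieval with textual state simplification and stopping early once uncertainty is low; the actual retrieval, transition, and training components (including the three-stage curriculum and GRPO) are not represented.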