Spectral-Progressive Thought Flow for Lightweight Multimodal Reasoning
Yixian Shen ⋅ Zhiheng Yang ⋅ Qi Bi ⋅ Changshuo Wang ⋅ Jia-Hong Huang ⋅ Shuai Wang ⋅ Prayag Tiwari ⋅ George Floros ⋅ Anuj Pathania
Abstract
Multimodal reasoning often relies on long chains of intermediate textual and visual thoughts, where accumulating visual tokens and dense cross-modal attention incur substantial computation and memory overhead. To address this challenge, we propose Spectral-Progressive Thought Flow (*SpecFlow*), a novel lightweight multimodal reasoning framework that represents intermediate visual thoughts in a fixed-size discrete cosine space. By exploiting the strong energy compaction of the discrete cosine transform, *SpecFlow* preserves global layout and relational structure while introducing high-frequency details only when increased spatial precision is required. To align visual state evolution with linguistic intent, classifier-free guidance enables autoregressive textual thoughts to steer flow-based updates of the visual workspace without expanding the context. As a result, *SpecFlow* maintains a bounded visual workspace whose updates depend only on the current visual state and the accumulated textual trace, enabling long-horizon inference with stable latency and memory usage independent of reasoning depth. Empirical results show that *SpecFlow* achieves competitive or superior reasoning performance while reducing computation and memory costs by up to *$2.1\times$*.
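The energy-compaction property the abstract relies on can be illustrated with a minimal NumPy sketch (not the paper's implementation; all names and sizes here are illustrative): a smooth 2D "visual state" is transformed with an orthonormal 2D DCT-II, only a small low-frequency block of coefficients is kept as the fixed-size workspace, and the state is reconstructed with little loss.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix (rows are frequencies)."""
    idx = np.arange(n)
    M = np.cos(np.pi * (2 * idx[None, :] + 1) * idx[:, None] / (2 * n))
    M[0] *= np.sqrt(1.0 / n)   # DC row
    M[1:] *= np.sqrt(2.0 / n)  # AC rows
    return M

def dct2(x: np.ndarray) -> np.ndarray:
    return dct_matrix(x.shape[0]) @ x @ dct_matrix(x.shape[1]).T

def idct2(c: np.ndarray) -> np.ndarray:
    return dct_matrix(c.shape[0]).T @ c @ dct_matrix(c.shape[1])

# Smooth synthetic "visual state" (illustrative stand-in for a feature map).
h = w = 32
yy, xx = np.mgrid[0:h, 0:w]
img = np.sin(2 * np.pi * xx / w) + np.cos(2 * np.pi * yy / h)

coef = dct2(img)
k = 8                          # keep an 8x8 low-frequency block: the bounded workspace
trunc = np.zeros_like(coef)
trunc[:k, :k] = coef[:k, :k]
recon = idct2(trunc)

kept_energy = (trunc ** 2).sum() / (coef ** 2).sum()
err = np.abs(recon - img).max()
print(f"fraction of energy kept: {kept_energy:.4f}, max reconstruction error: {err:.3f}")
```

For smooth, low-frequency content, almost all spectral energy falls in the retained block, so the truncated representation stays a fixed 8x8 size regardless of how often it is updated; higher-frequency coefficients would only be reintroduced when finer spatial precision is needed.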