Dual-Latent Memory Routing for Vision-Language Reasoning
Abstract
Multimodal large language models (MLLMs) have recently made strong progress in vision-language reasoning, yet their performance often degrades as generations grow longer. A key factor is that they frequently lose track of earlier visual evidence and intermediate constraints under a monolithic, ever-growing context. Inspired by how humans separately recall what they see and what they infer when solving complex tasks, we propose DLMR (Dual-Latent Memory Routing), a parameter-efficient mechanism that equips MLLMs with dual latent memories: a visual memory that compresses image evidence and a reasoning memory that tracks intermediate conclusions and constraints. A router then dynamically decides which memory to consult, and how much of it to reuse, during inference, preserving visual grounding while maintaining coherent long-horizon reasoning. DLMR is trained in three stages, from latent memory construction to selective router learning, while the base MLLM remains frozen, yielding substantial gains on both general and reasoning benchmarks with only a small number of additional trainable parameters. Further analyses reveal interpretable, state-dependent routing in which the visual and reasoning memories specialize as intended, and demonstrate that this design reduces redundant decoding and improves token efficiency over long generations.
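To make the routing idea concrete, the following is a minimal PyTorch sketch of how a router could mix reads from a visual memory and a reasoning memory into a frozen MLLM's hidden states. All names (DualLatentMemoryRouter), slot counts, dimensions, and the softmax-gated mixing scheme are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch of dual-latent memory routing; names and shapes are assumptions.
import torch
import torch.nn as nn


class DualLatentMemoryRouter(nn.Module):
    """Reads from a visual memory and a reasoning memory, then lets a
    learned router decide how much of each to inject into the hidden state."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        # Cross-attention readers, one per latent memory.
        self.read_visual = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.read_reasoning = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Router: maps the current hidden state to 3 mixing weights
        # (visual memory, reasoning memory, keep current state as-is).
        self.router = nn.Linear(d_model, 3)

    def forward(self, hidden, visual_mem, reasoning_mem):
        # hidden:        (B, T, d)  decoder hidden states of the frozen MLLM
        # visual_mem:    (B, Mv, d) compressed latent slots for image evidence
        # reasoning_mem: (B, Mr, d) latent slots for intermediate conclusions
        vis_read, _ = self.read_visual(hidden, visual_mem, visual_mem)
        rea_read, _ = self.read_reasoning(hidden, reasoning_mem, reasoning_mem)

        # Per-token routing weights decide which memory, and how much, to reuse.
        gate = torch.softmax(self.router(hidden), dim=-1)  # (B, T, 3)
        mixed = (
            gate[..., 0:1] * vis_read
            + gate[..., 1:2] * rea_read
            + gate[..., 2:3] * hidden
        )
        return mixed, gate  # gate can be inspected for interpretable routing


if __name__ == "__main__":
    B, T, d = 2, 16, 512
    router = DualLatentMemoryRouter(d_model=d)
    out, gate = router(
        hidden=torch.randn(B, T, d),
        visual_mem=torch.randn(B, 32, d),    # e.g. 32 visual latent slots
        reasoning_mem=torch.randn(B, 8, d),  # e.g. 8 reasoning latent slots
    )
    print(out.shape, gate.shape)  # (2, 16, 512) and (2, 16, 3)
```

In a parameter-efficient setup of this kind, only the two cross-attention readers and the router would be trained, while the base model stays frozen; the per-token gate also provides a natural handle for the kind of state-dependent routing analysis described above.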