Instruction Decomposition and Action Alignment for Vision-Language Navigation
Abstract
Vision-and-Language Navigation (VLN) empowered by Multimodal Large Language Models (MLLMs) is promising, yet remains challenged by long-horizon tasks with complex user instructions. Existing approaches that continuously condition on the full instruction incur high latency due to abundant visual tokens and exacerbate instruction interference, where irrelevant text noise induces hallucinations. To address these limitations, we propose IDEAL-VLN (\textbf{I}nstruction \textbf{DE}composition and \textbf{A}ction a\textbf{L}ignment), a novel paradigm that reformulates navigation as a causal inference chain. We decompose the task into two sequential steps: Semantic Anchoring and Action Alignment. We adopt a \textit{Think-Before-Act} mechanism that first infers the immediate semantic anchor from the global context and then generates actions conditioned solely on this anchor. This design constructs an explicit information bottleneck that suppresses spurious correlations with irrelevant parts of the instruction. Moreover, to alleviate cognitive collapse and limited exploration during training, we introduce a hierarchical correction framework that combines semantic-level thought correction with a spatially aware adaptive intervention strategy, which adjusts the expert intervention probability based on geodesic distance and thereby defines a semantic safety boundary. To support this paradigm, we contribute the Instruction-Aligned Navigation Dataset, containing 160K image-text pairs. Extensive experiments demonstrate that IDEAL-VLN achieves state-of-the-art performance and robustness across major benchmarks while significantly reducing inference cost.
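For intuition only, the sketch below illustrates one plausible way a geodesic-distance-conditioned intervention probability could be scheduled; it is not the paper's released implementation, and the function names, sigmoid form, and boundary/temperature parameters are hypothetical assumptions introduced for illustration.

```python
import math
import random

def intervention_probability(geodesic_dist, boundary=3.0, temperature=1.0, p_max=1.0):
    """Illustrative schedule: expert intervention becomes more likely the farther
    the agent drifts (geodesic distance, in meters) from the reference path.
    The boundary parameter plays the role of a soft semantic safety boundary."""
    # Sigmoid ramp centered on the boundary: near zero inside, saturating toward p_max outside.
    return p_max / (1.0 + math.exp(-(geodesic_dist - boundary) / temperature))

def maybe_intervene(geodesic_dist, expert_action, policy_action, rng=random):
    """Sample whether the expert overrides the policy action at the current step."""
    p = intervention_probability(geodesic_dist)
    return expert_action if rng.random() < p else policy_action

# Example: small drift mostly keeps the policy action; large drift mostly triggers the expert.
print(intervention_probability(0.5))  # low probability near the reference path
print(intervention_probability(6.0))  # high probability beyond the safety boundary
```

Under this reading, expert corrections are rare while the agent stays near the demonstrated path (preserving exploration) and become near-certain once it crosses the boundary, which matches the abstract's description of a spatially aware, adaptive intervention schedule.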