Latent Reasoning VLA: Latent Thinking and Prediction for Vision-Language-Action Models
Abstract
Vision-Language-Action (VLA) models benefit from Chain-of-Thought (CoT) reasoning, but existing approaches incur high inference overhead and rely on discrete reasoning representations that are poorly matched to continuous perception and control. We propose Latent Reasoning VLA (LaRA-VLA), a unified VLA framework that internalizes multi-modal CoT reasoning into continuous latent representations for embodied action. LaRA-VLA performs unified reasoning and prediction in latent space, eliminating explicit CoT generation at inference time and enabling efficient, action-oriented control. To realize latent embodied reasoning, we introduce a curriculum-based training paradigm that progressively transitions from explicit textual and visual CoT supervision to latent reasoning, and finally adapts the latent reasoning dynamics to condition action generation. We construct two structured CoT datasets, LIBERO-LaRA and Bridge-LaRA, and evaluate LaRA-VLA on simulation benchmarks and long-horizon real-robot manipulation tasks. Experimental results show that LaRA-VLA outperforms state-of-the-art VLA methods while reducing inference latency by up to 90\% compared to explicit CoT-based VLA approaches, highlighting latent reasoning as an effective and efficient paradigm for real-time embodied control.