From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges
Abstract
Bridging high-level semantic understanding and low-level physical control remains a persistent challenge in embodied intelligence, stemming from the fundamental spatiotemporal scale mismatch between cognition and action. Existing generative policies typically adopt a "Generation-from-Noise" paradigm that disregards this disparity, leading to inefficient representations and "Loss Collapse" during optimization. In this work, we propose ResVLA, an architecture that shifts the paradigm to "Refinement-from-Intent". Recognizing that robotic motion naturally decomposes into global intent and local dynamics, ResVLA uses spectral analysis to decouple control into a deterministic low-frequency anchor and a stochastic high-frequency residual. By anchoring the generative process on the predicted intent, our model focuses strictly on refining local dynamics via a residual diffusion bridge. Extensive evaluations on the LIBERO and the more challenging LIBERO-Plus benchmarks demonstrate that ResVLA achieves state-of-the-art performance. Notably, our approach exhibits strong robustness against semantic drift and kinematic perturbations while converging significantly faster than standard generative baselines.
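To make the anchor/residual decomposition concrete, the following is a minimal sketch of one way an action chunk could be split spectrally. It is an illustration under stated assumptions, not the ResVLA implementation: the abstract does not specify the transform or the cutoff, so the FFT low-pass filter, the function name `split_anchor_residual`, the parameter `keep_bins`, and the synthetic 2-DoF trajectory are all hypothetical choices made for this example.

```python
import numpy as np


def split_anchor_residual(actions, keep_bins=3):
    """Split an action chunk of shape (T, D) into a low-frequency anchor
    and a high-frequency residual.

    Assumption: a hard FFT low-pass along the time axis stands in for the
    "spectral analysis" step; keep_bins is a hypothetical cutoff.
    """
    actions = np.asarray(actions, dtype=float)
    num_steps = actions.shape[0]

    # Real FFT along time: bin 0 is the mean, higher bins are faster motion.
    spectrum = np.fft.rfft(actions, axis=0)

    # Keep only the lowest keep_bins frequency bins for the anchor.
    low = np.zeros_like(spectrum)
    low[:keep_bins] = spectrum[:keep_bins]
    anchor = np.fft.irfft(low, n=num_steps, axis=0)

    # Everything the anchor does not explain is the residual.
    residual = actions - anchor
    return anchor, residual


# Hypothetical 2-DoF trajectory: slow global motion plus small jitter.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 32, endpoint=False)
slow = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)
jitter = 0.05 * rng.standard_normal(slow.shape)
chunk = slow + jitter

anchor, residual = split_anchor_residual(chunk, keep_bins=3)
```

In this sketch the anchor plays the role of the deterministic intent and the residual is the part a generative model would refine. The decomposition is exact by construction (`anchor + residual` reproduces `chunk`), and the residual carries no energy in the retained low-frequency bins.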