RA-VLA: Retrieval-Augmented VLA for Test-Time Adaptation
Abstract
Vision-Language-Action (VLA) models provide a versatile foundation for general robotic manipulation, yet they remain brittle when confronted with novel task distributions. In-Context Imitation Learning (ICIL) offers a training-free alternative to fine-tuning, but existing frameworks suffer from an adaptation bottleneck that prevents expert context from being translated effectively into actions. This failure stems from superficial retrieval mechanisms and an entrenched behavioral inertia that anchors the policy to its pre-trained priors. To address these limitations, we present RA-VLA, a retrieval-augmented VLA framework that unifies behavior-aligned context retrieval with a grounded execution pipeline. By enforcing strict adherence to functional cues within a scalable architecture, our framework enables task adaptation without sacrificing inference efficiency. Evaluations on the LIBERO benchmark and a real-world UR5e environment show that RA-VLA achieves superior success rates and computational efficiency, providing a robust basis for training-free robotic adaptation.