FOCA: Future-Oriented Conditioning for Data-Efficient Vision-Language-Action Adaptation
Abstract
Vision–Language–Action (VLA) models enable general-purpose robotic control via large-scale multimodal pretraining, yet their effectiveness under few-shot imitation learning remains limited. We conduct a systematic stress test of state-of-the-art VLA models and show that performance degrades sharply as the number of demonstrations is reduced, revealing a key weakness of existing adaptation strategies. To address this, we introduce FOCA, a future-oriented conditioning framework for data-efficient VLA adaptation. FOCA combines explicit prediction of task-grounded future interaction embeddings with implicit alignment to future goal observations, enabling long-horizon reasoning in latent space without pixel-level prediction. This formulation naturally supports action-free co-training with synthetic videos from video world models and can be interpreted as learning a future-conditioned value-like representation. Extensive experiments demonstrate that FOCA achieves 95.7\% success with 20 demonstrations on LIBERO, improves performance by 7–12\% on RoboCasa, and delivers up to 26\% absolute gains on real robots, establishing a new state of the art in few-shot VLA adaptation.
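To make the abstract's two conditioning signals concrete, the following is a minimal PyTorch sketch, not the paper's implementation: it assumes a hypothetical `FutureConditionedAdapter` head on top of a VLA latent, with an explicit regression term toward a future interaction embedding and an implicit cosine-alignment term toward a goal-observation embedding, combined with a behavior-cloning loss. All module names, dimensions, and loss weights are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FutureConditionedAdapter(nn.Module):
    """Hypothetical sketch of a future-conditioned head on a VLA latent:
    (a) explicitly predicts a future interaction embedding, and
    (b) outputs actions for standard behavior cloning."""

    def __init__(self, latent_dim: int = 512, action_dim: int = 7):
        super().__init__()
        # Explicit head: regresses the embedding of a future interaction state.
        self.future_head = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )
        # Action head: standard behavior-cloning output.
        self.action_head = nn.Linear(latent_dim, action_dim)

    def forward(self, latent):
        return self.future_head(latent), self.action_head(latent)


def future_conditioned_loss(model, latent, future_emb, goal_emb, expert_action,
                            w_explicit=1.0, w_implicit=0.5):
    """Combined objective: BC loss + explicit future prediction + implicit goal alignment.
    The explicit/implicit terms require no action labels, so they could also be
    supervised by action-free synthetic videos (loss weights are assumptions)."""
    pred_future, pred_action = model(latent)
    bc_loss = F.mse_loss(pred_action, expert_action)
    # Explicit: match the target future interaction embedding in latent space.
    explicit_loss = F.mse_loss(pred_future, future_emb)
    # Implicit: pull the current latent toward the goal-observation embedding.
    implicit_loss = 1.0 - F.cosine_similarity(latent, goal_emb, dim=-1).mean()
    return bc_loss + w_explicit * explicit_loss + w_implicit * implicit_loss
```

Because neither auxiliary term depends on action labels, this structure is consistent with the action-free co-training described in the abstract: synthetic rollouts from a video world model can supply future and goal embeddings even when no expert actions are available.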