FOCA: Future-Oriented Conditioning for Data-Efficient Vision-Language-Action Adaptation
Abstract
Vision–Language–Action (VLA) models enable general-purpose robotic control via large-scale multimodal pretraining, yet their effectiveness under few-shot imitation learning remains limited. We conduct a systematic stress test of state-of-the-art VLA models and show that performance degrades sharply as the number of demonstrations is reduced, revealing a key weakness of existing adaptation strategies. To address this, we introduce FOCA, a future-oriented conditioning framework for data-efficient VLA adaptation. FOCA combines explicit prediction of task-grounded future interaction embeddings with implicit alignment to future goal observations, enabling long-horizon reasoning in latent space without pixel-level prediction. This formulation naturally supports action-free co-training with synthetic videos from video world models and can be interpreted as learning a future-conditioned value-like representation. Extensive experiments demonstrate that FOCA achieves 95.7\% success with 20 demonstrations on LIBERO, improves performance by 7–12\% on RoboCasa, and delivers up to 26\% absolute gains on real robots, establishing a new state of the art in few-shot VLA adaptation.
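To make the abstract's two conditioning signals concrete, the following is a minimal PyTorch sketch, not the paper's implementation: it assumes a hypothetical `FutureConditionedAdapter` head on top of a VLA latent, with an explicit regression term toward a future interaction embedding and an implicit cosine-alignment term toward a goal-observation embedding, combined with a behavior-cloning loss. All module names, dimensions, and loss weights are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FutureConditionedAdapter(nn.Module):
    """Hypothetical sketch of a future-conditioned head on a VLA latent:
    (a) explicitly predicts a future interaction embedding, and
    (b) outputs actions for standard behavior cloning."""

    def __init__(self, latent_dim: int = 512, action_dim: int = 7):
        super().__init__()
        # Explicit head: regresses the embedding of a future interaction state.
        self.future_head = nn.Sequential(
            nn.Linear(latent_dim, latent_dim), nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )
        # Action head: standard behavior-cloning output.
        self.action_head = nn.Linear(latent_dim, action_dim)

    def forward(self, latent):
        return self.future_head(latent), self.action_head(latent)


def future_conditioned_loss(model, latent, future_emb, goal_emb, expert_action,
                            w_explicit=1.0, w_implicit=0.5):
    """Combined objective: BC loss + explicit future prediction + implicit goal alignment.
    The explicit/implicit terms require no action labels, so they could also be
    supervised by action-free synthetic videos (loss weights are assumptions)."""
    pred_future, pred_action = model(latent)
    bc_loss = F.mse_loss(pred_action, expert_action)
    # Explicit: match the target future interaction embedding in latent space.
    explicit_loss = F.mse_loss(pred_future, future_emb)
    # Implicit: pull the current latent toward the goal-observation embedding.
    implicit_loss = 1.0 - F.cosine_similarity(latent, goal_emb, dim=-1).mean()
    return bc_loss + w_explicit * explicit_loss + w_implicit * implicit_loss
```

Because neither auxiliary term depends on action labels, this structure is consistent with the action-free co-training described in the abstract: synthetic rollouts from a video world model can supply future and goal embeddings even when no expert actions are available.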