Poster

OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction

Huang Huang · Fangchen Liu · Letian Fu · Tingfan Wu · Mustafa Mukadam · Jitendra Malik · Ken Goldberg · Pieter Abbeel

West Exhibition Hall B2-B3 #W-409
Tue 15 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions. Existing approaches require fine-tuning pre-trained vision-language models (VLMs) because visual and language features are fed independently into downstream policies, degrading the pre-trained semantic alignments. We propose OTTER, a novel VLA architecture that leverages these existing alignments through explicit, text-aware visual feature extraction. Instead of processing all visual features, OTTER selectively extracts and passes only task-relevant visual features that are semantically aligned with the language instruction to the policy transformer. This allows OTTER to keep the pre-trained vision-language encoders frozen, thereby preserving and utilizing the rich semantic understanding learned from large-scale pre-training and enabling strong zero-shot generalization. In simulation and real-world experiments, OTTER significantly outperforms existing VLA models, demonstrating strong zero-shot generalization to novel objects and environments. Video, code, checkpoints, and dataset: https://ottervla.github.io/.
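To make the idea of text-aware visual feature extraction concrete, here is a minimal PyTorch sketch under assumed simplifications: a frozen CLIP-style encoder pair is stood in by random tensors, a single cross-attention layer plays the role of the feature selector, and a small transformer acts as the policy head. The module names, dimensions, and proprioception input are illustrative placeholders, not OTTER's actual implementation (see the linked code release for that).

```python
# Sketch of text-aware visual feature extraction feeding a policy transformer.
# Assumptions (not from the paper): frozen encoder outputs are faked with random
# tensors, one cross-attention layer selects instruction-relevant visual features,
# and a tiny transformer predicts a 7-DoF action. Shapes are illustrative only.
import torch
import torch.nn as nn

class TextAwareVisualExtractor(nn.Module):
    """Cross-attend language tokens over frozen visual patch features so that
    only instruction-relevant visual information reaches the policy."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, visual_patches):
        # text_tokens:    (B, T_text, D) from a frozen language encoder
        # visual_patches: (B, T_vis,  D) from a frozen vision encoder
        # Queries come from language, keys/values from vision, so the output
        # keeps only visual content aligned with the instruction.
        selected, _ = self.attn(query=text_tokens, key=visual_patches, value=visual_patches)
        return self.norm(selected)  # (B, T_text, D) task-relevant visual features

class PolicyTransformer(nn.Module):
    """Toy policy head mapping extracted features plus proprioception to an action."""
    def __init__(self, dim=512, action_dim=7, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proprio_proj = nn.Linear(8, dim)   # hypothetical 8-dim robot state
        self.action_head = nn.Linear(dim, action_dim)

    def forward(self, task_features, proprio):
        tokens = torch.cat([task_features, self.proprio_proj(proprio).unsqueeze(1)], dim=1)
        hidden = self.encoder(tokens)
        return self.action_head(hidden[:, -1])  # predict one action from the last token

# Usage with random stand-ins for frozen encoder outputs (batch of 2).
B, T_text, T_vis, D = 2, 16, 196, 512
text_tokens = torch.randn(B, T_text, D)     # frozen language features
visual_patches = torch.randn(B, T_vis, D)   # frozen visual patch features
proprio = torch.randn(B, 8)

extractor = TextAwareVisualExtractor(dim=D)
policy = PolicyTransformer(dim=D)
actions = policy(extractor(text_tokens, visual_patches), proprio)
print(actions.shape)  # torch.Size([2, 7])
```

Because only the extractor and policy head carry trainable parameters in this sketch, the pre-trained encoders can stay frozen during policy training, which is the property the abstract attributes to OTTER.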

Lay Summary:

Teaching robots to follow instructions like “pick up the red cup” is hard, especially with new objects or settings. Most methods retrain the robot’s vision and language models, which can weaken their vision and language understanding.

We introduce OTTER, a new approach that helps robots follow instructions by working with perception models that already understand how images relate to language, without retraining them. OTTER selectively focuses only on the parts of an image that are relevant to the instruction—like just the “red cup”—and passes that focused information to the robot’s decision-making system.

Experiments show OTTER outperforms current methods, bringing us closer to robots that understand and act on instructions in many situations.
