Poster
VIMA: Robot Manipulation with Multimodal Prompts
Yunfan Jiang · Agrim Gupta · Zichen Zhang · Guanzhi Wang · Yongqiang Dou · Yanjun Chen · Li Fei-Fei · Anima Anandkumar · Yuke Zhu · Jim Fan

Thu Jul 27 04:30 PM -- 06:00 PM (PDT) @ Exhibit Hall 1 #200
Event URL: https://vimalabs.github.io/
Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. They are often considered different tasks and tackled by specialized models. We show that a wide spectrum of robot manipulation tasks can be expressed with multimodal prompts, interleaving textual and visual tokens. Accordingly, we develop a new simulation benchmark that consists of thousands of procedurally-generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and a four-level evaluation protocol for systematic generalization. We design a transformer-based robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. VIMA features a recipe that achieves strong model scalability and data efficiency. It outperforms alternative designs in the hardest zero-shot generalization setting by up to $2.9\times$ task success rate given the same training data. With $10\times$ less training data, VIMA still performs $2.7\times$ better than the best competing variant. Code and video demos are available at https://vimalabs.github.io
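To make the interface described in the abstract concrete, the sketch below shows in plain PyTorch how a multimodal prompt of interleaved text and image tokens can be embedded into a single sequence and fed to a causal transformer that predicts one motor action per decision step. This is a minimal illustration of the idea only, not VIMA's actual implementation: the class name MultimodalPromptAgent, the 512-dimensional image/observation features, the 7-dimensional action vector, and all hyperparameters are assumptions made here for brevity.

# Minimal sketch (not the authors' code) of a multimodal-prompt policy:
# interleave embedded text and image tokens, append per-step observation
# embeddings, and decode motor actions autoregressively with a causal mask.
import torch
import torch.nn as nn


class MultimodalPromptAgent(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, n_actions=7, n_layers=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)   # word tokens
        self.image_embed = nn.Linear(512, d_model)             # assumed pooled image features
        self.obs_embed = nn.Linear(512, d_model)                # assumed per-step observation features
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.action_head = nn.Linear(d_model, n_actions)        # one action vector per step

    def forward(self, prompt_chunks, obs_feats):
        """prompt_chunks: list of ('text', LongTensor) / ('image', FloatTensor) pieces;
        obs_feats: (B, T, 512) observation features for T decision steps."""
        chunks = []
        for kind, value in prompt_chunks:
            if kind == "text":
                chunks.append(self.text_embed(value))            # (B, L_text, d)
            else:
                chunks.append(self.image_embed(value))           # (B, L_img, d)
        prompt = torch.cat(chunks, dim=1)                        # interleaved multimodal prompt
        seq = torch.cat([prompt, self.obs_embed(obs_feats)], dim=1)
        # Causal mask: each action step attends only to the prompt and earlier steps.
        L = seq.size(1)
        mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
        h = self.backbone(seq, mask=mask)
        return self.action_head(h[:, prompt.size(1):])           # actions for the T steps


# Toy usage: a "<text> <image> <text>" prompt followed by 3 observation steps.
agent = MultimodalPromptAgent()
prompt = [("text", torch.randint(0, 1000, (1, 5))),
          ("image", torch.randn(1, 4, 512)),
          ("text", torch.randint(0, 1000, (1, 3)))]
actions = agent(prompt, torch.randn(1, 3, 512))
print(actions.shape)  # torch.Size([1, 3, 7])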

Author Information

Yunfan Jiang (Stanford University)
Agrim Gupta (Stanford University)
Zichen Zhang (Allen Institute for Artificial Intelligence)
Guanzhi Wang (California Institute of Technology)
Yongqiang Dou (Tsinghua University)

Turning ideas into action.

Yanjun Chen (Meta)
Li Fei-Fei (Stanford University)
Anima Anandkumar (Caltech and NVIDIA)

Anima Anandkumar is a Bren Professor at Caltech and Director of ML Research at NVIDIA. She was previously a Principal Scientist at Amazon Web Services. She is passionate about designing principled AI algorithms and applying them to interdisciplinary domains. She has received several honors, including the IEEE Fellowship, the Alfred P. Sloan Fellowship, the NSF CAREER Award, Young Investigator Awards from the DoD, VentureBeat’s “Women in AI” award, the NYTimes GoodTech award, and faculty fellowships from Microsoft, Google, Facebook, and Adobe. She is part of the World Economic Forum's Expert Network. She has appeared in the PBS Frontline documentary “Amazon Empire” and has given keynotes at many forums, including TEDx, KDD, ICLR, and ACM. Anima received her BTech from the Indian Institute of Technology Madras and her PhD from Cornell University, did postdoctoral research at MIT, and was an assistant professor at the University of California, Irvine.

Yuke Zhu (The University of Texas at Austin)
Jim Fan (NVIDIA)
