Workshop
Multi-modal Foundation Model meets Embodied AI (MFM-EAI)
Zhenfei (Jeremy) Yin · Mahi Shafiullah · Zhenhua Xu · Quan Vuong · Jing Shao · Lu Sheng · Takayuki Osa · Hengshuang Zhao · Mohamed Elhoseiny · Xihui Liu · Tatsuya Harada · Cewu Lu · Wanli Ouyang · Pete Florence · Yu Qiao · Dacheng Tao · Phil Torr
Lehar 4
Fri 26 Jul, midnight PDT
In recent years, Multi-modal Foundation Models (MFM) such as CLIP, ImageBind, DALL·E 3, GPT-4V, and Gemini have emerged as one of the most captivating and rapidly advancing areas in AI. The open-source MFM community has also grown vigorously, with models and algorithms such as LLaVA, LAMM, Stable Diffusion, and OpenFlamingo. These MFMs are now actively being applied beyond traditional computer vision tasks, and recent studies have revealed their immense potential for empowering embodied AI agents, an intersection rife with open questions and unexplored territory. This workshop, MFM-EAI, is dedicated to exploring these critical challenges:

- How can we train and evaluate MFM in open-ended environments?
- What constitutes an effective system architecture for MFM-based Embodied AI agents?
- How can MFM augment the perceptual and decision-making capabilities of these agents, balancing their high-level decision-making prowess with the nuanced requirements of low-level control in embodied systems?

Topics include but are not limited to:

- Training and evaluation of MFM in open-ended scenarios
- Data collection for training Embodied AI Agents and corresponding MFM
- Framework design for MFM-powered embodied agents
- Decision-making in Embodied Agents empowered by MFM
- Low-level control in Embodied Agents empowered by MFM
- Evaluation and simulation of Embodied Agents
- Limitations of MFM in empowering Embodied AI
Schedule
Fri 12:00 a.m. - 12:10 a.m. | Opening remarks
Fri 12:10 a.m. - 12:40 a.m. | General-Purpose Embodied AI (Keynote Talk) | Sergey Levine
Fri 12:40 a.m. - 1:10 a.m. | On Building General-Purpose Robots (Keynote Talk) | Lerrel Pinto
Fri 1:10 a.m. - 1:50 a.m. | Poster session #1 and coffee break
Fri 1:50 a.m. - 2:20 a.m. | Foundation Models for Robotics (Keynote Talk) | Chelsea Finn
Fri 2:20 a.m. - 3:15 a.m. | Early-Career Researchers in Embodied AI: Challenges and Opportunities in Multimodal Foundation Models (Panel Discussion) | Zhenfei (Jeremy) Yin · Mahi Shafiullah · Yilun Du · Boyuan Chen · Haoshu Fang
Fri 3:15 a.m. - 4:00 a.m. | Lunch
Fri 4:00 a.m. - 5:00 a.m. | Poster session #2
Fri 5:00 a.m. - 5:30 a.m. | Compositional Foundation Models (Keynote Talk) | Yilun Du
Fri 5:30 a.m. - 5:40 a.m. | DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning (Outstanding Paper Talk)
Fri 5:40 a.m. - 5:50 a.m. | Instruction-Guided Visual Masking (Outstanding Paper Talk)
Fri 5:50 a.m. - 6:00 a.m. | BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks (Outstanding Paper Talk)
Fri 6:00 a.m. - 6:10 a.m. | Behavior Generation with Latent Actions (Outstanding Paper Talk)
Fri 6:10 a.m. - 6:20 a.m. | Multimodal Foundation World Models for Generalist Embodied Agents (Outstanding Paper Talk)
Fri 6:20 a.m. - 6:50 a.m. | MFM-EAI Challenges 1, 2, and 3
Fri 6:50 a.m. - 7:20 a.m. | LEO: An Embodied Generalist Agent in 3D World and Beyond (Keynote Talk) | Xiaojian Ma
Fri 7:20 a.m. - 7:50 a.m. | Generative Interactive Environments (Keynote Talk) | Jake Bruce
Fri 7:50 a.m. - 8:00 a.m. | End of program
Posters
- GROOT-1.5: Learning to Follow Multi-Modal Instructions from Weak Supervision | Shaofei Cai · Bowei Zhang · Zihao Wang · Xiaojian Ma · Anji Liu · Yitao Liang
- Instruction-Guided Visual Masking | Jinliang Zheng · Jianxiong Li · Sijie Cheng · Yinan Zheng · Jiaming Li · Jihao Liu · Yu Liu · Jingjing Liu · Xianyuan Zhan
- STREAM: Embodied Reasoning through Code Generation | Daniil Cherniavskii · Phillip Lippe · Andrii Zadaianchuk · Efstratios Gavves
- DPO-Finetuned Large Multi-Modal Planner with Retrieval-Augmented Generation @ EgoPlan Challenge ICML 2024 | Kwanghyeon Lee · Mina Kang · Hyungho Na · HeeSun Bae · Byeonghu Na · Doyun Kwon · Seungjae Shin · Yeongmin Kim · Kim taewoo · Seungmin Yun · IL CHUL MOON
- RoboGolf: Mastering Real-World Minigolf with a Reflective Multi-Modality Vision-Language Model | Hantao Zhou · Tianying Ji · Lukas Sommerhalder · Michael Görner · Norman Hendrich · Fuchun Sun · Jianwei Zhang · Huazhe Xu
- Jina CLIP: Your CLIP Model Is Also Your Text Retriever | Han Xiao · Georgios Mastrapas · Bo Wang
- What can VLMs Do for Zero-shot Embodied Task Planning? | Xian Fu · Min Zhang · Jianye Hao · Peilong Han · Hao Zhang · Lei Shi · Hongyao Tang
- An Embodied Generalist Agent in 3D World | Jiangyong Huang · Silong Yong · Xiaojian Ma · Xiongkun Linghu · Puhao Li · Yan Wang · Qing Li · Song-Chun Zhu · Baoxiong Jia · Siyuan Huang
- DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning | Hao Bai · Yifei Zhou · Mert Cemri · Jiayi Pan · Alane Suhr · Sergey Levine · Aviral Kumar
- MAP-THOR: Benchmarking Long-Horizon Multi-Agent Planning Frameworks in Partially Observable Environments | Siddharth Nagar Nayak · Adelmo Orozco · Marina Have · Vittal Thirumalai · Jackson Zhang · Darren Chen · Aditya Kapoor · Eric Robinson · Karthik Gopalakrishnan · brian ichter · James Harrison · Anuj Mahajan · Hamsa Balakrishnan
- OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents | Zihao Wang · Shaofei Cai · Zhancun Mu · Haowei Lin · Ceyao Zhang · Xuejie Liu · Qing Li · Anji Liu · Xiaojian Ma · Yitao Liang
- EPD: Long-term Memory Extraction, Context-aware Planning and Multi-iteration Decision @ EgoPlan Challenge ICML 2024 | Letian Shi · Qi Lv · Xiang Deng · Liqiang Nie
- Multimodal foundation world models for generalist embodied agents | Pietro Mazzaglia · Tim Verbelen · Bart Dhoedt · Aaron Courville · Sai Rajeswar
- RISE: 3D Perception Makes Real-World Robot Imitation Simple and Effective | Chenxi Wang · Hongjie Fang · Hao-Shu Fang · Cewu Lu
- BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks | Stephanie Milani · Anssi Kanervisto · Karolis Jucys · Sander Schulhoff · Brandon Houghton · Rohin Shah
- LEGENT: Open Platform for Embodied Agents | Zhili Cheng · Jinyi Hu · Zhitong Wang · Yuge Tu · Shengding Hu · an liu · Pengkai Li · Lei Shi · Zhiyuan Liu · Maosong Sun
- The Embodied World Model Based on LLM with Visual Information and Prediction-Oriented Prompts | Wakana Haijima · KOU NAKAKUBO · Masahiro Suzuki · Yutaka Matsuo
- Hierarchical State Space Models for Continuous Sequence-to-Sequence Modeling | Raunaq Bhirangi · Chenyu Wang · Venkatesh Pattabiraman · Carmel Majidi · Abhinav Gupta · Tess Hellebrekers · Lerrel Pinto
- Vision-Language Models Provide Promptable Representations for Reinforcement Learning | William Chen · Oier Mees · Aviral Kumar · Sergey Levine
- Behavior Generation with Latent Actions | Seungjae Lee · Yibin Wang · Haritheja Etukuru · H. Jin Kim · Mahi Shafiullah · Lerrel Pinto
- DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning | Jianxiong Li · Jinliang Zheng · Yinan Zheng · Liyuan Mao · Xiao Hu · Sijie Cheng · Haoyi Niu · Jihao Liu · Yu Liu · Jingjing Liu · Ya-Qin Zhang · Xianyuan Zhan
- Long-Horizon Planning for Multi-Agent Robots in Partially Observable Environments | Siddharth Nagar Nayak · Adelmo Orozco · Marina Have · Jackson Zhang · Vittal Thirumalai · Darren Chen · Aditya Kapoor · Eric Robinson · Karthik Gopalakrishnan · James Harrison · Anuj Mahajan · brian ichter · Hamsa Balakrishnan
- LLM3: Large Language Model-based Task and Motion Planning with Motion Failure Reasoning | Shu Wang · Muzhi Han · Ziyuan Jiao · Zeyu Zhang · Ying Nian Wu · Song-Chun Zhu · Hangxin Liu