Workshop
Multi-modal Foundation Model meets Embodied AI (MFM-EAI)
Zhenfei (Jeremy) Yin · Mahi Shafiullah · Zhenhua Xu · Quan Vuong · Jing Shao · Lu Sheng · Takayuki Osa · Hengshuang Zhao · Mohamed Elhoseiny · Xihui Liu · Tatsuya Harada · Cewu Lu · Wanli Ouyang · Pete Florence · Yu Qiao · Dacheng Tao · Phil Torr
Lehar 4
Fri 26 Jul, midnight PDT
In recent years, Multi-modal Foundation Models (MFM) such as CLIP, ImageBind, DALL·E 3, GPT-4V, and Gemini have emerged as one of the most captivating and rapidly advancing areas in AI. The open-source MFM community has also grown vigorously, with models and algorithms such as LLaVA, LAMM, Stable Diffusion, and OpenFlamingo. These MFMs are now actively being applied beyond traditional computer vision tasks, and recent studies have revealed their immense potential for empowering embodied AI agents, an intersection rife with open questions and unexplored territory. This workshop, MFM-EAI, is dedicated to exploring these critical challenges:

- How can we train and evaluate MFM in open-ended environments?
- What constitutes an effective system architecture for MFM-based Embodied AI agents?
- How can MFM augment the perceptual and decision-making capabilities of these agents, balancing their high-level decision-making prowess with the nuanced requirements of low-level control in embodied systems?

Topics include but are not limited to:

- Training and evaluation of MFM in open-ended scenarios
- Data collection for training Embodied AI Agents and corresponding MFM
- Framework design for MFM-powered embodied agents
- Decision-making in Embodied Agents empowered by MFM
- Low-level control in Embodied Agents empowered by MFM
- Evaluation and simulation of Embodied Agents
- Limitations of MFM in empowering Embodied AI
Schedule
Fri 12:00 a.m. - 12:10 a.m. | Opening remarks
Fri 12:10 a.m. - 12:40 a.m. | General-Purpose Embodied AI (Keynote Talk) | Sergey Levine
Fri 12:40 a.m. - 1:10 a.m. | On Building General-Purpose Robots (Keynote Talk) | Lerrel Pinto
Fri 1:10 a.m. - 1:50 a.m. | Poster session #1 and coffee break
Fri 1:50 a.m. - 2:20 a.m. | Foundation Models for Robotics (Keynote Talk) | Chelsea Finn
Fri 2:20 a.m. - 3:15 a.m. | Early-Career Researchers in Embodied AI: Challenges and Opportunities in Multimodal Foundation Models (Panel Discussion) | Zhenfei (Jeremy) Yin · Mahi Shafiullah · Yilun Du · Boyuan Chen · Haoshu Fang
Fri 3:15 a.m. - 4:00 a.m. | Lunch
Fri 4:00 a.m. - 5:00 a.m. | Poster session #2
Fri 5:00 a.m. - 5:30 a.m. | Compositional Foundation Models (Keynote Talk) | Yilun Du
Fri 5:30 a.m. - 5:40 a.m. | DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning (Outstanding Paper Talk)
Fri 5:40 a.m. - 5:50 a.m. | Instruction-Guided Visual Masking (Outstanding Paper Talk)
Fri 5:50 a.m. - 6:00 a.m. | BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks (Outstanding Paper Talk)
Fri 6:00 a.m. - 6:10 a.m. | Behavior Generation with Latent Actions (Outstanding Paper Talk)
Fri 6:10 a.m. - 6:20 a.m. | Multimodal Foundation World Models for Generalist Embodied Agents (Outstanding Paper Talk)
Fri 6:20 a.m. - 6:50 a.m. | MFM-EAI Challenges 1, 2, and 3
Fri 6:50 a.m. - 7:20 a.m. | LEO: An Embodied Generalist Agent in 3D World and Beyond (Keynote Talk) | Xiaojian Ma
Fri 7:20 a.m. - 7:50 a.m. | Generative Interactive Environments (Keynote Talk) | Jake Bruce
Fri 7:50 a.m. - 8:00 a.m. | End of program
Posters
- GROOT-1.5: Learning to Follow Multi-Modal Instructions from Weak Supervision | Shaofei Cai · Bowei Zhang · Zihao Wang · Xiaojian Ma · Anji Liu · Yitao Liang
- Instruction-Guided Visual Masking | Jinliang Zheng · Jianxiong Li · Sijie Cheng · Yinan Zheng · Jiaming Li · Jihao Liu · Yu Liu · Jingjing Liu · Xianyuan Zhan
- STREAM: Embodied Reasoning through Code Generation | Daniil Cherniavskii · Phillip Lippe · Andrii Zadaianchuk · Efstratios Gavves
- DPO-Finetuned Large Multi-Modal Planner with Retrieval-Augmented Generation @ EgoPlan Challenge ICML 2024 | Kwanghyeon Lee · Mina Kang · Hyungho Na · HeeSun Bae · Byeonghu Na · Doyun Kwon · Seungjae Shin · Yeongmin Kim · Kim taewoo · Seungmin Yun · IL CHUL MOON
- RoboGolf: Mastering Real-World Minigolf with a Reflective Multi-Modality Vision-Language Model | Hantao Zhou · Tianying Ji · Lukas Sommerhalder · Michael Görner · Norman Hendrich · Fuchun Sun · Jianwei Zhang · Huazhe Xu
- Jina CLIP: Your CLIP Model Is Also Your Text Retriever | Han Xiao · Georgios Mastrapas · Bo Wang
- What can VLMs Do for Zero-shot Embodied Task Planning? | Xian Fu · Min Zhang · Jianye Hao · Peilong Han · Hao Zhang · Lei Shi · Hongyao Tang
- An Embodied Generalist Agent in 3D World | Jiangyong Huang · Silong Yong · Xiaojian Ma · Xiongkun Linghu · Puhao Li · Yan Wang · Qing Li · Song-Chun Zhu · Baoxiong Jia · Siyuan Huang
- DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning | Hao Bai · Yifei Zhou · Mert Cemri · Jiayi Pan · Alane Suhr · Sergey Levine · Aviral Kumar
- MAP-THOR: Benchmarking Long-Horizon Multi-Agent Planning Frameworks in Partially Observable Environments | Siddharth Nagar Nayak · Adelmo Orozco · Marina Have · Vittal Thirumalai · Jackson Zhang · Darren Chen · Aditya Kapoor · Eric Robinson · Karthik Gopalakrishnan · brian ichter · James Harrison · Anuj Mahajan · Hamsa Balakrishnan
- OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents | Zihao Wang · Shaofei Cai · Zhancun Mu · Haowei Lin · Ceyao Zhang · Xuejie Liu · Qing Li · Anji Liu · Xiaojian Ma · Yitao Liang
- EPD: Long-term Memory Extraction, Context-aware Planning and Multi-iteration Decision @ EgoPlan Challenge ICML 2024 | Letian Shi · Qi Lv · Xiang Deng · Liqiang Nie
- Multimodal foundation world models for generalist embodied agents | Pietro Mazzaglia · Tim Verbelen · Bart Dhoedt · Aaron Courville · Sai Rajeswar
- RISE: 3D Perception Makes Real-World Robot Imitation Simple and Effective | Chenxi Wang · Hongjie Fang · Hao-Shu Fang · Cewu Lu
- BEDD: The MineRL BASALT Evaluation and Demonstrations Dataset for Training and Benchmarking Agents that Solve Fuzzy Tasks | Stephanie Milani · Anssi Kanervisto · Karolis Jucys · Sander Schulhoff · Brandon Houghton · Rohin Shah
- LEGENT: Open Platform for Embodied Agents | Zhili Cheng · Jinyi Hu · Zhitong Wang · Yuge Tu · Shengding Hu · an liu · Pengkai Li · Lei Shi · Zhiyuan Liu · Maosong Sun
- The Embodied World Model Based on LLM with Visual Information and Prediction-Oriented Prompts | Wakana Haijima · KOU NAKAKUBO · Masahiro Suzuki · Yutaka Matsuo
- Hierarchical State Space Models for Continuous Sequence-to-Sequence Modeling | Raunaq Bhirangi · Chenyu Wang · Venkatesh Pattabiraman · Carmel Majidi · Abhinav Gupta · Tess Hellebrekers · Lerrel Pinto
- Vision-Language Models Provide Promptable Representations for Reinforcement Learning | William Chen · Oier Mees · Aviral Kumar · Sergey Levine
- Behavior Generation with Latent Actions | Seungjae Lee · Yibin Wang · Haritheja Etukuru · H. Jin Kim · Mahi Shafiullah · Lerrel Pinto
- DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning | Jianxiong Li · Jinliang Zheng · Yinan Zheng · Liyuan Mao · Xiao Hu · Sijie Cheng · Haoyi Niu · Jihao Liu · Yu Liu · Jingjing Liu · Ya-Qin Zhang · Xianyuan Zhan
- Long-Horizon Planning for Multi-Agent Robots in Partially Observable Environments | Siddharth Nagar Nayak · Adelmo Orozco · Marina Have · Jackson Zhang · Vittal Thirumalai · Darren Chen · Aditya Kapoor · Eric Robinson · Karthik Gopalakrishnan · James Harrison · Anuj Mahajan · brian ichter · Hamsa Balakrishnan
- LLM3: Large Language Model-based Task and Motion Planning with Motion Failure Reasoning | Shu Wang · Muzhi Han · Ziyuan Jiao · Zeyu Zhang · Ying Nian Wu · Song-Chun Zhu · Hangxin Liu