Poster in Workshop: Multi-modal Foundation Model meets Embodied AI (MFM-EAI)
STREAM: Embodied Reasoning through Code Generation
Daniil Cherniavskii · Phillip Lippe · Andrii Zadaianchuk · Efstratios Gavves
Recent advancements in the reasoning and code generation abilities of Large Language Models (LLMs) have provided new perspectives on Embodied AI tasks, enhancing planning for both high-level control problems and low-level manipulation. However, conveying information about the environment to the embodied agent in a concise, task-specific manner remains a challenge. Inspired by modular visual reasoning, we propose a novel approach that uses code generation to ground the planner in the environmental context and to enable reasoning about the agent's past experiences. Our modular framework allows the code-generating LLM to extract and aggregate information from relevant observations via API calls to image understanding models, including flexible vision-language models (VLMs). To evaluate our approach, we choose Embodied Question Answering (EQA) as the target task and develop a procedure for synthetic data collection that exploits the ground-truth states of a simulator. Our framework demonstrates notable improvements over baseline methods.
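To make the modular idea concrete, the sketch below shows the kind of tool API a code-generating LLM could be prompted with, and an example program it might emit for an EQA query. This is a minimal illustration under assumed names (`Observation`, `vlm_query`, `detect_objects`, `answer_question`), not the paper's actual interface.

```python
# Hypothetical sketch: a tool API for a code-generating LLM and an example
# generated program for an EQA question. All names are illustrative placeholders.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Observation:
    """One past agent observation: an image plus the agent's pose."""
    image_path: str
    position: Tuple[float, float, float]  # (x, y, z) in the simulator's world frame


# --- Tool API exposed to the code-generating LLM (stubbed for illustration) ---

def vlm_query(image_path: str, prompt: str) -> str:
    """Ask a vision-language model a free-form question about one image."""
    # In a real system this would call a VLM; here it is a stub.
    return "unknown"


def detect_objects(image_path: str, object_name: str) -> bool:
    """Check whether an open-vocabulary detector finds `object_name` in the image."""
    return False


# --- Example of a program the LLM might generate for the question
#     "What color is the couch in the living room?" ---

def answer_question(observations: List[Observation]) -> str:
    # 1. Filter past observations down to frames relevant to the question.
    relevant = [o for o in observations if detect_objects(o.image_path, "couch")]

    # 2. Query the VLM only on the relevant frames.
    answers = [vlm_query(o.image_path, "What color is the couch?") for o in relevant]

    # 3. Aggregate the per-frame answers by majority vote.
    answers = [a for a in answers if a != "unknown"]
    if not answers:
        return "unknown"
    return max(set(answers), key=answers.count)
```

In this pattern, the generated program decides which observations to inspect and how to aggregate the model outputs, so the planner receives only concise, task-specific information rather than the full observation history.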