Poster in Workshop: Multi-modal Foundation Model meets Embodied AI (MFM-EAI)
DPO-Finetuned Large Multi-Modal Planner with Retrieval-Augmented Generation @ EgoPlan Challenge ICML 2024
Kwanghyeon Lee · Mina Kang · Hyungho Na · HeeSun Bae · Byeonghu Na · Doyun Kwon · Seungjae Shin · Yeongmin Kim · Taewoo Kim · Seungmin Yun · Il-Chul Moon
This paper presents the technical details of our approach to a multi-modal task, EgoPlan-Bench. Our model adopts Direct Preference Optimization (DPO), originally developed for single-modal tasks, and adapts it to a multi-modal setting. This DPO adaptation improves prediction accuracy by favoring positive answers over negative choices. Additionally, we apply Retrieval-Augmented Generation (RAG) to further enhance the generation performance of Multi-modal Large Language Models (MLLMs). In our setting, however, RAG does not yield a performance improvement because few sufficiently similar tasks can be retrieved. Our DPO-based model achieves 53.98% test accuracy, compared to 41.35% for the baseline. Our code is available at https://github.com/aailabkaist/EgoPlanChallengeTeam_AAILab.
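As a rough illustration of the DPO adaptation described above, the sketch below implements the standard DPO objective over a (positive answer, negative choice) pair, assuming per-sequence log-probabilities from the fine-tuned MLLM and a frozen reference model are already computed; all function and variable names (dpo_loss, policy_logp_pos, beta, etc.) are illustrative assumptions, not taken from the released code.

```python
# Minimal sketch of the DPO objective, assuming per-sequence
# log-probabilities are precomputed. Names are hypothetical,
# not from the authors' repository.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_pos: torch.Tensor,  # log pi_theta(positive answer | video, question)
             policy_logp_neg: torch.Tensor,  # log pi_theta(negative choice | video, question)
             ref_logp_pos: torch.Tensor,     # log pi_ref(positive answer | ...), frozen reference
             ref_logp_neg: torch.Tensor,     # log pi_ref(negative choice | ...)
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss: widen the margin of the positive answer over the
    negative choice, measured relative to the frozen reference model."""
    pos_reward = beta * (policy_logp_pos - ref_logp_pos)
    neg_reward = beta * (policy_logp_neg - ref_logp_neg)
    # Maximize sigma(pos_reward - neg_reward), i.e. prefer the positive answer.
    return -F.logsigmoid(pos_reward - neg_reward).mean()
```

In a multi-modal setting the conditioning input simply includes the visual observation (e.g., egocentric video frames) alongside the text prompt; the preference objective itself is unchanged from the single-modal formulation.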