Poster
Diving into Self-Evolving Training for Multimodal Reasoning
Wei Liu · Junlong Li · Xiwen Zhang · Fan Zhou · Yu Cheng · Junxian He
East Exhibition Hall A-B #E-2606
When training AI models to solve complex problems, one promising method is to have them learn from their own previous answers, refining their reasoning over time without constant human supervision. This self-teaching approach has succeeded on language-only tasks, but it has not been widely explored for tasks that combine multiple types of data, such as images paired with text (multimodal reasoning). In this work, we investigate how to apply this method effectively to multimodal tasks, highlighting crucial factors such as how the model learns, how it evaluates its own success, and how variations in the instructions affect learning. We also identify why these methods can hit performance ceilings that prevent further improvement, and to overcome this we introduce a new, automated way of balancing the learning process. Our proposed approach, called M-STaR, consistently improves multimodal reasoning performance across a range of tasks and model sizes. All of our methods and tools will be released publicly.
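To make the idea of learning from one's own answers concrete, here is a minimal Python sketch of a generic reward-filtered self-training loop of the kind the abstract describes: the model samples several candidate answers per prompt, a scorer rates them, and only the highly rated answers are used to fine-tune the model for the next round. This is an illustrative assumption, not the authors' implementation; the function names (`generate`, `reward`, `finetune`, `self_evolve`) and the threshold-based filter are hypothetical placeholders.

```python
# Hypothetical sketch of a reward-filtered self-training loop (not the authors' code).
from typing import Callable, List, Tuple


def self_evolve(
    generate: Callable[[str, int], List[str]],           # model: prompt -> k candidate answers
    reward: Callable[[str, str], float],                  # scorer: (prompt, answer) -> quality score
    finetune: Callable[[List[Tuple[str, str]]], None],    # update the model on (prompt, answer) pairs
    prompts: List[str],
    rounds: int = 3,
    k: int = 8,
    threshold: float = 0.5,
) -> None:
    """Each round: sample k answers per prompt, keep those the scorer
    rates at or above `threshold`, and fine-tune on the kept pairs."""
    for _ in range(rounds):
        kept: List[Tuple[str, str]] = []
        for p in prompts:
            candidates = generate(p, k)
            # Score every candidate and keep only those above the quality bar.
            kept.extend((p, a) for a in candidates if reward(p, a) >= threshold)
        if not kept:
            # Nothing clears the bar: training would stall here, which is the kind
            # of performance ceiling the abstract refers to.
            break
        finetune(kept)
```

In practice the balance between how many candidates are explored and how strictly they are filtered determines whether such a loop keeps improving or saturates, which is the trade-off the automated balancing mechanism in this work targets.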