Poster
Diving into Self-Evolving Training for Multimodal Reasoning
Wei Liu · Junlong Li · Xiwen Zhang · Fan Zhou · Yu Cheng · Junxian He
East Exhibition Hall A-B #E-2606
When training AI models to solve complex problems, one promising method is to have them learn from their own previous answers, refining their reasoning over time without constant human supervision. This self-teaching approach has succeeded on language-only tasks, but it has not been widely explored for tasks that combine multiple types of data, such as images paired with text (multimodal reasoning). In this work, we investigate how to apply this method effectively to multimodal tasks, highlighting crucial factors such as how the model learns, how it evaluates its own success, and how variations in the instructions affect learning. We also identify why these methods can hit performance ceilings that prevent further improvement, and to overcome this we introduce a new, automated way of balancing the learning process. Our proposed approach, called M-STaR, consistently improves multimodal reasoning performance across a range of tasks and model sizes. All of our methods and tools will be released publicly.
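To make the idea of learning from one's own answers concrete, here is a minimal Python sketch of a generic reward-filtered self-training loop of the kind the abstract describes: the model samples several candidate answers per prompt, a scorer rates them, and only the highly rated answers are used to fine-tune the model for the next round. This is an illustrative assumption, not the authors' implementation; the function names (`generate`, `reward`, `finetune`, `self_evolve`) and the threshold-based filter are hypothetical placeholders.

```python
# Hypothetical sketch of a reward-filtered self-training loop (not the authors' code).
from typing import Callable, List, Tuple


def self_evolve(
    generate: Callable[[str, int], List[str]],           # model: prompt -> k candidate answers
    reward: Callable[[str, str], float],                  # scorer: (prompt, answer) -> quality score
    finetune: Callable[[List[Tuple[str, str]]], None],    # update the model on (prompt, answer) pairs
    prompts: List[str],
    rounds: int = 3,
    k: int = 8,
    threshold: float = 0.5,
) -> None:
    """Each round: sample k answers per prompt, keep those the scorer
    rates at or above `threshold`, and fine-tune on the kept pairs."""
    for _ in range(rounds):
        kept: List[Tuple[str, str]] = []
        for p in prompts:
            candidates = generate(p, k)
            # Score every candidate and keep only those above the quality bar.
            kept.extend((p, a) for a in candidates if reward(p, a) >= threshold)
        if not kept:
            # Nothing clears the bar: training would stall here, which is the kind
            # of performance ceiling the abstract refers to.
            break
        finetune(kept)
```

In practice the balance between how many candidates are explored and how strictly they are filtered determines whether such a loop keeps improving or saturates, which is the trade-off the automated balancing mechanism in this work targets.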