MMBench-Live: A Continuously Evolving Benchmark for Multimodal Models
Abstract
Evaluation benchmarks play a central role in assessing vision–language models (VLMs). However, most existing multimodal benchmarks are static, leaving them increasingly susceptible to data contamination and temporal staleness while remaining costly to construct. In this work, we introduce MMBench-Live, a multi-agent-driven dynamic multimodal benchmark that supports continuous updates without a human in the loop. MMBench-Live is maintained through an end-to-end automated pipeline that integrates structured benchmark description, real-time data acquisition, and verifiable question–answer (QA) generation, enabling scalable, live, and low-cost benchmark updates. To ensure reliable evaluation across versions, we further propose a distribution-consistent updating strategy based on semantic task interpretation and feedback-driven data collection and filtering. Using MMBench-Live, we conduct systematic evaluations of multiple open-source VLMs and analyze their performance, cross-version consistency, and susceptibility to data contamination, providing empirical evidence for the effectiveness of the proposed dynamic benchmark updating framework.