Dynamic Multimodal Evaluation via Knowledge-Enhanced Benchmark Evolution
Abstract
The rapid development of multimodal large language models (MLLMs) has created an urgent demand for more reliable and robust evaluation protocols. However, existing static benchmarks are prone to data contamination and performance saturation, which can result in inflated or misleading evaluation results. To address these limitations, we first introduce a graph formulation to represent both static and dynamic visual question answering (VQA) samples. Building upon this formulation, we propose Knowledge-Enhanced Benchmark Evolution (KBE), a dynamic multimodal evaluation framework that first analyzes the original static benchmark and then expands it by integrating multimodal knowledge, transforming the static benchmark into a controllable, dynamically evolving version. Crucially, KBE can both reconstruct questions by re-selecting visual information in the original image and extend existing questions with external textual knowledge. By explicitly controlling the degree of question exploration, KBE enables difficulty-controllable evaluation across a wide range of model capabilities. Extensive experimental results demonstrate that KBE effectively mitigates data contamination and benchmark saturation while providing a more comprehensive and flexible assessment of MLLM performance.