MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
Kaining Ying ⋅ Fanqing Meng ⋅ Jin Wang ⋅ Zhiqian Li ⋅ Han Lin ⋅ Yue Yang ⋅ Hao Zhang ⋅ Wenbo Zhang ⋅ Yuqi Lin ⋅ Shuo Liu ⋅ Jiayi Lei ⋅ Quanfeng Lu ⋅ Runjian Chen ⋅ Peng Xu ⋅ Renrui Zhang ⋅ Haozhe Zhang ⋅ Peng Gao ⋅ Yali Wang ⋅ Yu Qiao ⋅ Ping Luo ⋅ Kaipeng Zhang ⋅ Wenqi Shao
2024 Poster
Abstract
Large Vision-Language Models (LVLMs) have made significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover only a limited number of multimodal tasks that test rudimentary capabilities, and thus fall short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, and reasoning. MMT-Bench comprises 31,325 meticulously curated multi-choice visual questions drawn from diverse multimodal scenarios such as vehicle driving and embodied navigation, covering 32 core meta-tasks and 162 subtasks in multimodal understanding. Owing to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results involving 20 publicly available LVLMs, including the proprietary GeminiProVision model, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.
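For readers building an evaluation harness around a multi-choice benchmark such as this, the sketch below shows one way to score predictions per subtask and macro-average the results. The record fields ("subtask", "answer", "prediction") and the macro-averaging choice are illustrative assumptions, not the official MMT-Bench schema or protocol.

```python
# Minimal sketch: per-subtask accuracy for a multi-choice benchmark.
# Field names and the macro-average aggregation are assumptions for illustration.
from collections import defaultdict

def score_by_subtask(records):
    """records: iterable of dicts with 'subtask', 'answer', and 'prediction' keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["subtask"]] += 1
        # Compare normalized option letters (e.g., "A" vs. "a ").
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[r["subtask"]] += 1
    per_task = {t: correct[t] / total[t] for t in total}
    overall = sum(per_task.values()) / len(per_task) if per_task else 0.0
    return per_task, overall

# Toy usage with two hypothetical subtasks:
demo = [
    {"subtask": "vehicle_driving", "answer": "A", "prediction": "A"},
    {"subtask": "embodied_navigation", "answer": "C", "prediction": "B"},
]
print(score_by_subtask(demo))
```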