

Poster

VisionGraph: Leveraging Large Multimodal Models for Graph Theory Problems in Visual Context

Yunxin Li · Baotian Hu · Haoyuan Shi · Wei Wang · Longyue Wang · Min Zhang


Abstract:

Large Multimodal Models (LMMs, e.g., GPT-4V and Gemini) have achieved impressive success in visual understanding and reasoning, markedly improving performance on mathematical reasoning in visual contexts. Yet a particularly challenging class of visual math is the multimodal graph theory problem, which is crucial in fields such as biology, transportation, and robotics planning. These problems require LMMs to understand graphical structures accurately and to perform multi-step reasoning over the visual graph. To make progress in this direction, we first design a benchmark named VisionGraph for exploring the capabilities of advanced LMMs in solving multimodal graph theory problems. It encompasses eight graph-problem tasks of varying complexity, ranging from connectivity to shortest-path problems. We then present a Description-Program-Reasoning (DPR) chain, which improves the logical accuracy of reasoning processes through graphical structure description generation and algorithm-aware multi-step reasoning. Our overall study shows that 1) GPT-4V outperforms Gemini in multi-step graph reasoning; 2) all LMMs exhibit poor perception accuracy for graphical structures, whether in zero-/few-shot settings or with supervised fine-tuning (SFT), which in turn degrades problem-solving performance; and 3) DPR significantly improves the multi-step graph reasoning capabilities of LMMs, with the GPT-4V (DPR) agent achieving state-of-the-art (SOTA) performance.
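The abstract does not give implementation details of DPR, but the description-then-program idea can be made concrete with a minimal Python sketch. Everything below is an illustrative assumption, not the authors' code: we suppose the LMM has already emitted a textual edge-list description of the visual graph (the hypothetical format "U-V:weight"), which a parser turns into an adjacency list so that a classical algorithm (here Dijkstra's shortest path) can carry out the multi-step reasoning exactly.

    import heapq

    def parse_description(desc: str) -> dict:
        """Parse hypothetical LMM output lines like 'A-B:3' into a weighted adjacency list."""
        graph = {}
        for line in desc.strip().splitlines():
            edge, weight = line.split(":")
            u, v = edge.split("-")
            # Undirected graph: record the edge in both directions.
            graph.setdefault(u, []).append((v, int(weight)))
            graph.setdefault(v, []).append((u, int(weight)))
        return graph

    def dijkstra(graph: dict, src: str, dst: str):
        """Standard Dijkstra's algorithm; returns (cost, path) or (inf, []) if unreachable."""
        pq = [(0, src, [src])]  # priority queue of (cost so far, node, path so far)
        seen = set()
        while pq:
            cost, node, path = heapq.heappop(pq)
            if node == dst:
                return cost, path
            if node in seen:
                continue
            seen.add(node)
            for nxt, w in graph.get(node, []):
                if nxt not in seen:
                    heapq.heappush(pq, (cost + w, nxt, path + [nxt]))
        return float("inf"), []

    # A hypothetical description an LMM might generate for a three-node visual graph.
    desc = "A-B:1\nB-C:2\nA-C:5"
    cost, path = dijkstra(parse_description(desc), "A", "C")
    print(cost, path)  # 3 ['A', 'B', 'C']

The design point this sketch captures is that once perception is externalized into a symbolic description, the shortest-path computation itself is exact, so any remaining errors come from the description step rather than from the multi-step reasoning.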
