ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models
Abstract
Multimodal large language models (MLLMs) have shown strong potential for building embodied agents, yet embodied geo-localization remains underexplored due to the lack of fine-grained evaluation benchmarks. We introduce ERGeoBench, a large-scale benchmark for vision-driven embodied geo-localization. ERGeoBench evaluates models under three progressively interactive settings (single-view, multi-view, and embodied-view), the last of which requires agents to actively acquire observations through sequential viewpoint changes. The benchmark comprises 2,207 globally distributed street-view panoramas and assesses four core capability dimensions: foundational perception, spatial awareness, commonsense reasoning, and geo-localization. Extensive evaluations of leading proprietary and open-source MLLMs reveal that while current models perform well at high-level semantic geo-localization, they struggle with low-level perceptual operations and with maintaining spatial consistency across views. Notably, geo-localization performance correlates strongly and positively with the other three capability dimensions, indicating that accurate localization emerges from robust perception, coherent spatial reasoning, and sound commonsense understanding. Overall, ERGeoBench provides a unified, diagnostic framework for advancing human-like embodied geo-localization.