ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models
Abstract
Multimodal large language models (MLLMs) have shown strong potential for building embodied agents, yet embodied geo-localization remains underexplored due to the lack of fine-grained evaluation benchmarks. We introduce ERGeoBench, a large-scale benchmark for vision-driven embodied geo-localization. ERGeoBench evaluates models under three progressively interactive settings (single-view, multi-view, and embodied-view), the last of which requires agents to actively acquire observations through sequential viewpoint changes. The benchmark comprises 2,207 globally distributed street-view panoramas and assesses four core capability dimensions: foundational perception, spatial awareness, commonsense reasoning, and geo-localization. Extensive evaluations of leading proprietary and open-source MLLMs reveal that while current models perform well at high-level semantic geo-localization, they struggle with low-level perceptual operations and with maintaining spatial consistency across views. Notably, geo-localization performance correlates strongly and positively with the other three capability dimensions, indicating that accurate localization emerges from robust perception, coherent spatial reasoning, and sound commonsense understanding. Overall, ERGeoBench provides a unified, diagnostic framework for advancing human-like embodied geo-localization.