

Poster

GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model

Ling Li · Yu Ye · Bingchuan Jiang · Wei Zeng


Abstract:

The ability to accurately predict geo-locations with proper reasoning can benefit many applications, such as navigation. This work tackles the problem with a new paradigm using a large vision-language model (LVLM) augmented with human inference knowledge. However, there is a scarcity of data for training the LVLM: existing street-view datasets often contain numerous low-quality images lacking visual clues, and they provide no reasoning annotations. To address the data-quality issue, we devise a CLIP-based network to quantify the degree of locatability in street-view images, leading to the creation of a new dataset comprising high-locatability street views. To enhance reasoning, we integrate external knowledge obtained from real geo-localization games, tapping into valuable human inference capabilities. The data are utilized to train GeoReasoner, which undergoes fine-tuning through dedicated reasoning and location-tuning stages. Qualitative and quantitative evaluations illustrate that GeoReasoner outperforms counterpart LVLMs by more than 25% on country-level and 38% on city-level geo-localization tasks, and surpasses StreetCLIP performance while requiring fewer training resources.
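As a rough illustration of the locatability-filtering idea, the sketch below scores street-view images with an off-the-shelf CLIP model in zero-shot fashion. The prompts, checkpoint, and 0.6 keep-threshold are assumptions for illustration only; the paper trains a dedicated CLIP-based network rather than relying on raw zero-shot scores.

```python
# Hypothetical sketch: zero-shot CLIP scoring of street-view "locatability".
# Prompt texts and threshold are illustrative assumptions, not the authors' network.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Contrastive prompts: cues that make a street view geo-locatable vs. a generic scene.
prompts = [
    "a street view with readable signs, storefronts, or license plates",  # locatable
    "a street view with distinctive architecture or landmarks",           # locatable
    "a featureless road with no signs, buildings, or landmarks",          # not locatable
]

def locatability_score(image_path: str) -> float:
    """Return a rough probability that the image contains geo-locating cues."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1).squeeze(0)
    # Sum probability mass over the "locatable" prompts (indices 0 and 1).
    return float(probs[:2].sum())

# Example filtering step: keep only high-locatability images for training.
# kept_paths = [p for p in image_paths if locatability_score(p) > 0.6]
```

In this spirit, low-information frames (empty highways, walls, vegetation) are dropped before LVLM fine-tuning, so the reasoning and location-tuning stages see only images with usable visual clues.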
