UrbanMLLM: Joint Learning of Cross-view Imagery for Urban Understanding
Abstract
Comprehensive urban understanding requires integrating macroscopic spatial structure with fine-grained street-level semantics. However, existing urban Multimodal Large Language Models (MLLMs) rely primarily on satellite imagery, limiting their ability to capture detailed urban appearance and cross-view relationships. We propose \textbf{UrbanMLLM}, a unified MLLM that jointly learns from satellite and street-view imagery for cross-view urban perception and reasoning. To support this, we construct a large-scale dataset of geospatially aligned satellite and street-view image pairs with textual annotations. UrbanMLLM introduces a cross-view perceiver that explicitly models interactions between satellite and street-view representations, and adopts a structured interleaved pre-training paradigm that organizes cross-view image–text content into coherent urban documents to enhance cross-view knowledge fusion. We evaluate UrbanMLLM on 13 diverse urban understanding tasks spanning satellite-only, street-view-only, and cross-view settings. Experimental results demonstrate consistent improvements over strong open-source and proprietary MLLMs, highlighting the effectiveness and scalability of UrbanMLLM for urban environment understanding.
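To make the cross-view perceiver concrete, below is a minimal PyTorch sketch of one plausible realization: a set of learnable queries cross-attends jointly to satellite and street-view visual tokens, producing fused tokens for the language model. The abstract does not specify the module's internals, so the design here (query count, dimensions, single attention block, and all names such as `CrossViewPerceiver`) is an illustrative assumption, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class CrossViewPerceiver(nn.Module):
    """Hypothetical cross-view fusion module (illustrative sketch only).

    Learnable queries attend over the concatenation of satellite and
    street-view token sequences, modeling interactions between the views.
    """

    def __init__(self, dim: int = 1024, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        # Learnable query tokens that will read from both views.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, sat_tokens: torch.Tensor, street_tokens: torch.Tensor) -> torch.Tensor:
        # sat_tokens: (B, N_sat, dim); street_tokens: (B, N_street, dim)
        kv = torch.cat([sat_tokens, street_tokens], dim=1)  # joint cross-view context
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        fused, _ = self.cross_attn(q, kv, kv)  # queries attend to both views at once
        fused = self.norm(fused + q)
        return fused + self.ffn(fused)  # (B, num_queries, dim), fed to the LLM


# Example: batch of 2, 256 satellite tokens and 784 street-view tokens.
perceiver = CrossViewPerceiver()
sat = torch.randn(2, 256, 1024)
street = torch.randn(2, 784, 1024)
print(perceiver(sat, street).shape)  # torch.Size([2, 64, 1024])
```

Attending over the concatenated token sequence is one simple way to let satellite and street-view features interact in a single attention operation; the actual module may differ in depth, query scheme, or fusion order.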