MetaStreet: Semi-Supervised Multimodal Learning for Street-Level Socioeconomic Prediction
Abstract
Predicting street-level socioeconomic indicators from street view imagery is fundamental to urban planning. Existing methods typically extract visual features with pretrained encoders and propagate information through graph-based learning, but they fail to fully exploit the structured, task-relevant, and label-efficient learning signals inherent in urban scenes. We propose MetaStreet, a semi-supervised multimodal framework with three components: (1) a semantic-spatial visual encoder that jointly models object co-occurrence and spatial adjacency at the semantic-category level; (2) a task-aware textual encoder that steers large language models (LLMs) toward prediction-relevant features via task-specific prompts; and (3) a geography-aware graph contrastive learning module that leverages spatial autocorrelation to extend contrastive supervision to unlabeled streets, allowing them to participate directly in representation learning. Experiments on three socioeconomic prediction tasks across two cities demonstrate that MetaStreet consistently outperforms state-of-the-art methods.
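To make the third component more concrete, the sketch below shows one plausible way a geography-aware contrastive objective can be formed: spatial autocorrelation is operationalized by treating nearby streets, labeled or unlabeled, as positive pairs in an InfoNCE-style loss. The function name, distance threshold, temperature, and overall loss form are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of geography-aware graph contrastive learning:
# spatially close streets (per spatial autocorrelation) are positives,
# so unlabeled streets also contribute contrastive supervision.
# All names and thresholds are illustrative assumptions.
import torch
import torch.nn.functional as F


def geo_contrastive_loss(embeddings: torch.Tensor,
                         coords: torch.Tensor,
                         radius: float = 0.5,
                         temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss whose positives are spatial neighbours.

    embeddings: (N, D) street-level representations (labeled + unlabeled).
    coords:     (N, 2) street coordinates (e.g. projected x/y in km).
    radius:     distance threshold defining a spatial neighbourhood.
    """
    n = embeddings.size(0)
    eye = torch.eye(n, dtype=torch.bool)

    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                 # (N, N) cosine similarities
    dist = torch.cdist(coords, coords)            # pairwise spatial distances
    pos_mask = (dist < radius) & ~eye             # neighbours, excluding self

    # Standard contrastive denominator over all other streets.
    logits = sim.masked_fill(eye, float('-inf'))
    log_prob = F.log_softmax(logits, dim=-1)

    # Average log-probability of each anchor's spatial positives.
    pos_counts = pos_mask.sum(dim=-1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=-1) / pos_counts
    return loss[pos_mask.any(dim=-1)].mean()      # skip isolated streets


# Toy usage: 8 streets, 16-d embeddings, random coordinates.
emb = torch.randn(8, 16, requires_grad=True)
xy = torch.rand(8, 2)
print(geo_contrastive_loss(emb, xy).item())
```

Because the positive mask depends only on coordinates, not labels, every street in the graph can serve as an anchor, which is the sense in which unlabeled streets participate directly in representation learning.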