Poster in Workshop: Combining Theory and Benchmarks: Towards A Virtuous Cycle to Understand and Guarantee Foundation Model Performance
Fri, Jul 10, 2026 • 12:00 AM – 1:00 AM PDT

m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning

Yosub Shin ⋅ Michael Buriek ⋅ Igor Molybog

Abstract

Vision-language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egocentric views. We introduce m2sv, a scalable benchmark for map-to-street-view spatial reasoning that asks models to infer camera viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. We release m2sv-20k, a geographically diverse benchmark with controlled ambiguity, along with m2sv-sft-11k, a curated set of structured reasoning traces for supervised fine-tuning.
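
To make the task concrete, the following is a minimal sketch of how a predicted viewing direction could be compared against a reference heading on a north-up map. It is an illustration only, not the benchmark's actual answer format or scoring procedure: the abstract does not specify either. The sketch assumes headings are given in degrees clockwise from north, and the function names `angular_error` and `heading_to_cardinal` as well as the four-way N/E/S/W discretization are hypothetical.

```python
def angular_error(pred_deg: float, true_deg: float) -> float:
    """Smallest absolute difference between two compass headings, in degrees.

    Assumes both headings are measured clockwise from north (0 = north,
    90 = east), which matches a north-up map convention.
    """
    diff = (pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def heading_to_cardinal(heading_deg: float) -> str:
    """Map a heading to the nearest of four cardinal labels.

    Hypothetical discretization: each label covers a 90-degree sector
    centered on its direction, e.g. N covers [315, 45).
    """
    labels = ["N", "E", "S", "W"]
    sector = int(((heading_deg % 360.0) + 45.0) // 90.0) % 4
    return labels[sector]


if __name__ == "__main__":
    # Headings on either side of north are close despite the wrap-around.
    print(angular_error(350.0, 10.0))
    print(heading_to_cardinal(350.0))
```

The modulo arithmetic handles the wrap-around at north, so headings of 350 and 10 degrees are treated as 20 degrees apart rather than 340.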