Zero-shot Active Mapping via Fused 360-BEV Representations and Vision–Language Models
Abstract
Active mapping enables embodied agents to understand and interact with previously unseen environments. However, most existing methods struggle to generalize zero-shot to large-scale scenes and lack support for language instructions. We propose a VLM-based active mapping method that achieves zero-shot mapping while facilitating language-driven human–agent interaction. First, we introduce a 360-BEV representation that integrates omnidirectional semantics with BEV-aligned geometric structure to enhance scene understanding. Second, we develop a candidate waypoint generation strategy that allows the VLM-driven agent to select informative 2D waypoints in image space and back-project them into executable metric actions in 3D space, enabling the VLM to plan in its strongest modality. Third, we design a VLM-based depth-first exploration agent that decomposes the scene into explorable regions, selects informative waypoints within each region, and organizes them into a topological tree. By following a depth-first exploration policy, the agent achieves thorough coverage of large-scale scenes. Without task-specific training, our method outperforms the strongest baseline, improving coverage and AUC by approximately 13.25\% and 14.00\%, respectively, while enabling language-conditioned interaction.
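To give intuition for the 2D-to-3D back-projection step mentioned above, the following minimal sketch shows how a waypoint selected in image space could be lifted to a metric world-frame target under a standard pinhole camera model. The function name `pixel_to_world` and its parameters are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

def pixel_to_world(u, v, depth, K, T_world_cam):
    """Back-project a 2D image waypoint (u, v) with metric depth into a
    3D world-frame point (illustrative sketch, not the paper's code).

    Args:
        u, v: pixel coordinates of the waypoint selected in image space.
        depth: metric depth (meters) observed at that pixel.
        K: 3x3 pinhole camera intrinsics matrix.
        T_world_cam: 4x4 camera-to-world pose matrix.
    Returns:
        (3,) array: the waypoint in world coordinates, usable as a
        metric navigation target.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Unproject the pixel into the camera frame via the pinhole model.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    p_cam = np.array([x, y, depth, 1.0])
    # Transform the homogeneous camera-frame point into the world frame.
    return (T_world_cam @ p_cam)[:3]
```

In this sketch, a waypoint the VLM picks in the image becomes an executable metric action simply by pairing its pixel location with the depth reading and the agent's current pose.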