AmbiRefer3D: 3D Visual Grounding with Referential Ambiguity
Abstract
Traditional 3D visual grounding typically assumes that a natural language expression unambiguously refers to a target object in a 3D scene. In practice, however, human instructions are often ambiguous or underspecified, which can lead existing models to associate a query with multiple candidate objects and produce incorrect predictions. In this paper, we propose a new task, 3D visual grounding with referential ambiguity, which admits ambiguous language descriptions and is therefore more broadly applicable to real-world scenarios. To tackle this task, we propose an interactive grounding framework that performs multi-round question-answer interactions: the model actively generates clarifying questions and receives human-provided answers, acquiring additional object attributes, spatial relationships, and other contextual information to resolve referential ambiguity and ground the target accurately. To support the learning of interactive grounding, we construct a large-scale dataset, AmbiRefer3D, containing 47,085 samples with 141,255 annotated question-answer dialogues that capture interactive disambiguation processes across 7,316 indoor 3D scenes. Furthermore, we establish multi-round evaluation metrics that measure both disambiguation efficiency and grounding accuracy.
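To make the multi-round interaction protocol concrete, the following is a minimal Python sketch of one possible disambiguation loop. All names here (Candidate, interactive_ground, the oracle callback) are illustrative assumptions rather than the paper's actual API, and the rule-based question selection merely stands in for the learned question generator described above.

```python
# Hypothetical sketch of a multi-round interactive grounding loop.
# The oracle plays the role of the human answering clarifying questions.

from dataclasses import dataclass


@dataclass
class Candidate:
    object_id: int
    label: str
    attributes: dict  # e.g. {"color": "red", "near": "window"}


def filter_by_answer(candidates, question_key, answer):
    """Keep only candidates whose attribute matches the human's answer."""
    return [c for c in candidates if c.attributes.get(question_key) == answer]


def interactive_ground(query, candidates, oracle, max_rounds=3):
    """Resolve an ambiguous query via question-answer rounds.

    `oracle` stands in for the human: it maps a question key
    (e.g. "color") to the answer for the intended target object.
    Returns (final candidate or None, number of rounds used).
    """
    rounds = 0
    while len(candidates) > 1 and rounds < max_rounds:
        # Ask about the attribute that best splits the remaining
        # candidates (a stand-in for learned question generation).
        keys = {k for c in candidates for k in c.attributes}
        question_key = max(
            keys, key=lambda k: len({c.attributes.get(k) for c in candidates})
        )
        answer = oracle(question_key)  # human-provided answer
        candidates = filter_by_answer(candidates, question_key, answer)
        rounds += 1
    return (candidates[0] if len(candidates) == 1 else None), rounds


if __name__ == "__main__":
    # Ambiguous query: two chairs match "the chair".
    cands = [
        Candidate(0, "chair", {"color": "red", "near": "window"}),
        Candidate(1, "chair", {"color": "blue", "near": "door"}),
    ]
    target = {"color": "red", "near": "window"}  # intended object
    result, used = interactive_ground("the chair", cands, target.get)
    print(f"grounded object {result.object_id} after {used} round(s)")
```

Under this framing, disambiguation efficiency corresponds to the number of rounds consumed before a unique candidate remains, and grounding accuracy to whether the surviving candidate is the intended target.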