Poster
in
Workshop: Knowledge and Logical Reasoning in the Era of Data-driven Learning

SQA3D: Situated Question Answering in 3D Scenes

Xiaojian Ma · Silong Yong · Zilong Zheng · Qing Li · Yitao Liang · Song-Chun Zhu · Siyuan Huang


Abstract:

We propose a new task to benchmark scene understanding and knowledge-intensive reasoning in embodied agents: SQA3D, Situated Question Answering in 3D Scenes. Given a scene context (e.g., a 3D scan), SQA3D requires the tested agent to first understand its situation (position, orientation, etc.) in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation. Based on 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations. These questions examine a wide spectrum of reasoning capabilities for an intelligent agent, ranging from spatial relation comprehension to commonsense understanding, navigation, and multi-hop reasoning. SQA3D poses a significant challenge to current multi-modal models, especially 3D reasoning models. We evaluate various state-of-the-art approaches and find that the best one achieves an overall score of only 47.20%, while amateur human participants reach 90.06%. We believe SQA3D can facilitate future embodied AI research toward stronger situation understanding and reasoning capabilities. Code and data will be released.