3D Scene Assertion Verification
Abstract
Existing 3D Visual Question Answering (3D-VQA) methods rely on generative paradigms whose free-form answers are often ambiguous, which hinders deterministic decision-making. We introduce 3D Scene Assertion Verification, a task that requires models to verify natural language assertions about 3D scenes with strict binary judgments. To enable rigorous evaluation, we present 3DSAV, the first large-scale diagnostic benchmark for this task, comprising 22.5k samples. We observe that current 3D-VQA models struggle in this setting because they lack specialized reasoning mechanisms. To address this, we propose DualLPSS, a framework that uses a dual-stage routing mechanism to enable type-aware cross-modal fusion and scene-guided assertion focusing. Extensive experiments show that DualLPSS achieves state-of-the-art performance on 3DSAV and correctly handles complex logical assertions on which baselines fail. The code and dataset will be made publicly available.