SpaCeFormer: Space-Curve Transformer for Open-Vocabulary 3D Instance Segmentation without Proposals
Christopher Choy ⋅ Junha Lee ⋅ Chunghyun Park ⋅ Minsu Cho ⋅ Jan Kautz
Abstract
Open-vocabulary 3D segmentation is crucial for real-world applications, yet existing methods are constrained by fragmented masks and inconsistent captions in dataset generation, and by multi-stage pipelines prone to error propagation. We present SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset with 846K instances from 15K scenes, and SpaCeFormer (Space-Curve Transformer), a proposal-free segmentation architecture. Our data pipeline leverages multi-view mask clustering to produce geometry-consistent 3D instances and employs multi-view VLM prompting for view-consistent captions. On the modeling side, SpaCeFormer combines spatial window attention with Morton curve serialization for spatially coherent features, and a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200, our approach achieves 11.1 zero-shot mAP, a 2.8$\times$ improvement over prior proposal-free methods while requiring only 0.21 seconds per scene.
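As a rough illustration of what Morton curve serialization means, the sketch below quantizes 3D points to voxel indices, interleaves the bits of each index triple into a single Z-order code, and sorts points by that code so that neighbors in the sequence tend to be neighbors in space. The abstract does not give implementation details, so everything here is an assumption rather than the authors' code: the function names `part1by2`, `morton3d`, and `serialize`, the `voxel_size` parameter, and the 10-bit-per-axis limit are all hypothetical.

```python
import math

def part1by2(x: int) -> int:
    """Spread the low 10 bits of x so that two zero bits follow each original bit."""
    x &= 0x000003FF
    x = (x | (x << 16)) & 0xFF0000FF
    x = (x | (x << 8)) & 0x0300F00F
    x = (x | (x << 4)) & 0x030C30C3
    x = (x | (x << 2)) & 0x09249249
    return x

def morton3d(ix: int, iy: int, iz: int) -> int:
    """Interleave three 10-bit voxel indices into one 30-bit Morton (Z-order) code."""
    return part1by2(ix) | (part1by2(iy) << 1) | (part1by2(iz) << 2)

def serialize(points, voxel_size=0.05):
    """Return point indices ordered along the Morton curve.

    Assumption: the scene spans fewer than 1024 voxels per axis, so each
    index fits in 10 bits; larger scenes would need wider codes.
    """
    mins = [min(p[d] for p in points) for d in range(3)]
    codes = []
    for p in points:
        # Shift to non-negative coordinates, then quantize to voxel indices.
        ix, iy, iz = (math.floor((p[d] - mins[d]) / voxel_size) for d in range(3))
        codes.append(morton3d(ix, iy, iz))
    return sorted(range(len(points)), key=codes.__getitem__)

if __name__ == "__main__":
    pts = [(2.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    print(serialize(pts, voxel_size=1.0))
```

In a serialized point transformer, an ordering like this is typically what lets a 3D point set be split into contiguous chunks for windowed attention; how SpaCeFormer combines it with spatial window attention is not specified in the abstract.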