Multi-label learning with contrastive cluster self-supervision for 3D hierarchical semantic segmentation
Abstract
3D hierarchical semantic segmentation (3DHS) is crucial for embodied intelligence applications that demand coarse-to-fine, multi-hierarchy understanding of 3D scenes. 3DHS can be cast as a multi-label learning problem, but this faces two issues: I) learning multiple labels for each point with a shared model can lead to multi-hierarchy conflicts during cross-hierarchy optimization, and II) class imbalance is inevitable across the multiple hierarchies of 3D scenes, so the model is easily dominated by majority classes. To address these issues, we propose a novel multi-label learning framework with contrastive cluster self-supervision for 3DHS. Specifically, we propose a late-decoupled multi-label 3DHS network that employs decoupled decoders guided by coarse-to-fine hierarchical consistency. This late-decoupled architecture mitigates the underfitting and overfitting conflicts among multiple hierarchies and also confines the class imbalance problem to each individual hierarchy. Moreover, we introduce a 3DHS-oriented contrastive cluster self-supervised learning method, which learns cluster-wise point cloud features with a contrastive loss and produces self-supervised signals that enhance segmentation under class imbalance. Extensive experiments on multiple datasets and backbones demonstrate that our approach promotes balance across hierarchies and mitigates the class imbalance issue in 3DHS tasks.
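To make the two mechanisms named above concrete, the following is a minimal NumPy sketch, not the authors' implementation: a coarse-to-fine consistency penalty that aggregates fine-class probabilities into their coarse parents and compares them with the coarse decoder's prediction, and an InfoNCE-style cluster-wise contrastive loss over point features. All function names, the `fine_to_coarse` mapping, and the temperature value are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_consistency_loss(coarse_logits, fine_logits, fine_to_coarse):
    """Cross-entropy penalty encouraging fine predictions, aggregated to
    their coarse parent classes, to agree with the coarse decoder.
    fine_to_coarse[f] = index of the coarse parent of fine class f (assumed mapping)."""
    coarse_prob = softmax(coarse_logits)            # (N, C_coarse)
    fine_prob = softmax(fine_logits)                # (N, C_fine)
    agg = np.zeros_like(coarse_prob)
    for f, c in enumerate(fine_to_coarse):
        agg[:, c] += fine_prob[:, f]                # sum children into parent
    return -np.mean(np.sum(coarse_prob * np.log(agg + 1e-8), axis=1))

def cluster_contrastive_loss(features, cluster_ids, temperature=0.1):
    """InfoNCE-style loss: pull features in the same cluster together,
    push features from different clusters apart."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T / temperature
    np.fill_diagonal(sim, -np.inf)                  # exclude self-similarity
    losses = []
    for i in range(len(f)):
        pos = cluster_ids == cluster_ids[i]
        pos[i] = False                              # anchor is not its own positive
        if not pos.any():
            continue
        log_prob = sim[i] - np.log(np.exp(sim[i]).sum())
        losses.append(-log_prob[pos].mean())
    return float(np.mean(losses))
```

In this sketch, the consistency loss decreases when the fine decoder's mass lands under the correct coarse parent, which is the sense in which the decoupled decoders are still coupled by hierarchy; the contrastive term shapes the feature space cluster-wise, independently of class frequency.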