DSGCR: Decomposed Spectral Geometry-Aware Cross-Modal Semantic Representation for 3D Visual Grounding
Abstract
3D visual grounding, which encompasses 3D referring expression comprehension (3DREC) and 3D referring expression segmentation (3DRES), requires robust cross-modal representations to achieve fine-grained semantic alignment and precise geometric reasoning. However, most existing methods employ unimodal pre-trained encoders that transfer visual and linguistic knowledge independently, inducing domain shift and poor cross-modal alignment. Meanwhile, spatial modeling built on handcrafted priors limits cross-modal geometric representation and, owing to spectral bias, struggles to capture complex object relations. To address these challenges, we propose Text-aware Feature Tuning (TFT) and Decomposed Spectral Geometry (DSG) to enhance cross-modal semantic representation. Specifically, TFT injects linguistic context into the visual hierarchy to mitigate domain shift and facilitate early cross-modal alignment. DSG employs a learnable Fourier basis and explicitly decomposes pairwise relations into symmetric and antisymmetric spectral components, enabling the model to capture high-frequency geometric details and direction-aware relations for precise spatial reasoning. Extensive experiments on ScanRefer, Nr3D, and Sr3D validate the effectiveness of our method, which achieves state-of-the-art performance with gains of 2.05\% Acc@0.25 for 3DREC and 1.09\% mIoU for 3DRES on ScanRefer.
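A minimal sketch of the decomposition idea behind DSG, assuming a PyTorch-style formulation: pairwise center offsets pass through a learnable Fourier basis, whose cosine (even) responses yield a symmetric component and whose sine (odd) responses yield a direction-aware antisymmetric component. The `DecomposedSpectralGeometry` module, its names, and its shapes are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

class DecomposedSpectralGeometry(nn.Module):
    """Hypothetical sketch: pairwise offsets encoded with a learnable
    Fourier basis and split into symmetric/antisymmetric spectral parts."""

    def __init__(self, num_freqs: int = 32, out_dim: int = 128):
        super().__init__()
        # Learnable frequency matrix: maps a 3D offset to num_freqs phases.
        self.freqs = nn.Parameter(torch.randn(3, num_freqs))
        self.proj = nn.Linear(2 * num_freqs, out_dim)

    def forward(self, centers: torch.Tensor) -> torch.Tensor:
        # centers: (N, 3) object centers; delta[j, i] = -delta[i, j].
        delta = centers[:, None, :] - centers[None, :, :]  # (N, N, 3)
        phase = delta @ self.freqs                          # (N, N, num_freqs)
        sym = torch.cos(phase)       # even in delta -> symmetric under i<->j
        antisym = torch.sin(phase)   # odd in delta -> direction-aware
        return self.proj(torch.cat([sym, antisym], dim=-1))  # (N, N, out_dim)

# Usage: pairwise relation features for 8 objects -> (8, 8, 128).
rel = DecomposedSpectralGeometry()(torch.rand(8, 3))
```

Because cosine is even and sine is odd in the offset, swapping objects i and j leaves the symmetric component unchanged and flips the sign of the antisymmetric one, which is the direction-awareness the abstract attributes to DSG.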