Rh-3DGS: Robust Open-Vocabulary Scene Understanding via Riemannian Huber Distillation and Manifold-Aware Sampling
Abstract
Open-vocabulary 3D scene understanding answers free-form text queries over reconstructed scenes, but lifting dense 2D foundation-model embeddings into 3D Gaussian Splatting (3DGS) remains challenging. Existing 3DGS-based methods often average normalized embeddings in Euclidean space, which ignores their hyperspherical geometry and can cause feature collapse; they also distill supervision from all views equally, which amplifies occlusion noise and mixed-depth artifacts. We propose Rh-3DGS, a robust semantic 3DGS framework built on reliability-aware distillation and manifold-consistent aggregation. Visibility-Calibrated Distillation (VCD) computes per-pixel reliability weights from rasterization statistics and down-weights ambiguous pixels. Visibility-Weighted Fréchet Mean (VFM) aggregates embeddings on the unit hypersphere under a Riemannian Huber objective for robust distillation. Lightweight Consistency Contrast (LIC) regularizes the 3D semantic field with neighborhood-based multi-positive contrast, improving local consistency and sharpening boundaries. Experiments on three benchmarks show that Rh-3DGS achieves state-of-the-art results in open-vocabulary segmentation, boundary quality, and view-consistent rendering.
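The visibility-weighted Fréchet mean with a Riemannian Huber objective can be sketched as a fixed-point iteration on the unit hypersphere. The function name, initialization, and weighting details below are illustrative assumptions, not the paper's implementation: each step maps embeddings to the tangent space at the current estimate via the spherical log map, down-weights far-away (likely noisy) views with a Huber factor on geodesic distance, averages with the visibility weights, and maps back via the exp map.

```python
import numpy as np

def huber_frechet_mean(X, vis_w, delta=0.5, iters=20, eps=1e-8):
    """Illustrative sketch (not the paper's code): visibility-weighted
    Frechet mean on the unit hypersphere with a Riemannian Huber objective.

    X:      (N, D) unit-norm per-view embeddings
    vis_w:  (N,) per-view visibility/reliability weights
    delta:  Huber threshold on geodesic distance (radians)
    """
    mu = X[np.argmax(vis_w)].copy()              # init at most reliable view
    for _ in range(iters):
        cos = np.clip(X @ mu, -1.0, 1.0)
        theta = np.arccos(cos)                   # geodesic distances to mu
        # Spherical log map of each x_i at mu (tangent vectors of norm theta)
        scale = np.where(theta > eps, theta / np.maximum(np.sin(theta), eps), 1.0)
        logs = scale[:, None] * (X - cos[:, None] * mu)
        # Huber factor: quadratic influence near mu, linear (capped) beyond delta
        hub = np.where(theta <= delta, 1.0, delta / np.maximum(theta, eps))
        w = vis_w * hub
        t = (w[:, None] * logs).sum(0) / max(w.sum(), eps)   # tangent-space mean
        nt = np.linalg.norm(t)
        if nt < 1e-10:                           # converged
            break
        mu = np.cos(nt) * mu + np.sin(nt) / nt * t           # spherical exp map
        mu /= np.linalg.norm(mu)                 # renormalize for stability
    return mu
```

Compared with a Euclidean average of normalized embeddings, the Huber factor caps the influence of views whose embeddings lie far from the current mean on the sphere, so a single occluded or mixed-depth view cannot drag the aggregate off its semantic cluster.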