TVDRNet: Text-driven Viewpoint Optimization via Differentiable Rendering for 3D Reasoning Segmentation
Abstract
Three-dimensional (3D) reasoning segmentation aims to segment target objects based on text instructions and 3D spatial cues. Recent efforts in 3D reasoning leverage Multimodal Large Language Models (MLLMs) to bridge the gap between text and 3D data. However, since MLLMs are primarily trained on text-image pairs, directly adapting them to unstructured 3D point clouds often fails to capture implicit semantic intent and to reliably localize objects. This paper introduces TVDRNet to address these challenges. Inspired by Active Vision theory, in which humans selectively choose optimal viewpoints to better observe targets, TVDRNet employs a differentiable renderer to simulate this active process in 3D perception. By using text instructions as supervision to optimize intrinsic and extrinsic rendering parameters, TVDRNet identifies optimal viewpoints for observing the 3D scene, thereby learning 'where to look' from what the text instruction 'asks to find'. This process generates informative, task-relevant 2D images that are compatible with MLLMs. TVDRNet comprises two modules: (1) the AVPL module, which establishes a learnable mapping from semantics to optimal rendering parameters; and (2) the MGL module, which fuses multiple modalities via semantic grouping to guide mask generation. Experiments show that TVDRNet achieves state-of-the-art performance on 3D reasoning segmentation benchmarks (Reason3D, Instruct3D) and on 3D visual grounding (ScanRefer).
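The core mechanism the abstract describes, treating camera intrinsics and extrinsics as learnable parameters and optimizing them through a differentiable renderer under a task-driven loss, can be illustrated with a minimal, self-contained PyTorch sketch. Everything below is a hypothetical illustration, not the paper's implementation: camera_from_angles, the Gaussian splat renderer soft_render, and the per-point relevance loss (standing in for the text-instruction supervision that TVDRNet actually uses) are all assumptions introduced for this sketch.

import torch

def camera_from_angles(azim, elev, dist):
    # Build a world-to-camera rotation R and translation t from spherical
    # pose parameters (all differentiable 0-dim torch tensors).
    ca, sa = torch.cos(azim), torch.sin(azim)
    ce, se = torch.cos(elev), torch.sin(elev)
    eye = dist * torch.stack([ce * sa, se, ce * ca])   # camera center
    forward = -eye / eye.norm()                        # looks at the origin
    up = torch.tensor([0.0, 1.0, 0.0])
    right = torch.linalg.cross(forward, up)
    right = right / right.norm()
    true_up = torch.linalg.cross(right, forward)
    R = torch.stack([right, true_up, forward])         # rows: camera axes
    t = -R @ eye
    return R, t

def soft_render(points, weights, R, t, focal, size=64, sigma=1.5):
    # Differentiable splat renderer: project points with a pinhole model,
    # then accumulate each point's weight under a Gaussian footprint so
    # gradients flow back to the camera parameters.
    cam = points @ R.T + t                             # (N, 3) camera coords
    z = cam[:, 2].clamp(min=1e-3)
    pix = (focal * cam[:, :2] / z[:, None] + 1.0) * (size / 2)
    ax = torch.arange(size, dtype=torch.float32)
    ys, xs = torch.meshgrid(ax, ax, indexing="ij")
    grid = torch.stack([xs, ys], -1).reshape(-1, 2)    # (size*size, 2)
    d2 = ((grid[:, None, :] - pix[None, :, :]) ** 2).sum(-1)
    splat = torch.exp(-d2 / (2 * sigma ** 2))          # (size*size, N)
    return (splat * weights[None, :]).sum(-1).reshape(size, size)

# Toy scene: a random point cloud plus a per-point "matches the instruction"
# score; in TVDRNet this signal comes from the text, here it is synthetic.
torch.manual_seed(0)
points = torch.randn(256, 3)
relevance = (points[:, 0] > 0.5).float()

# Learnable intrinsic (focal) and extrinsic (azim/elev/dist) parameters.
azim = torch.tensor(0.3, requires_grad=True)
elev = torch.tensor(0.3, requires_grad=True)
dist = torch.tensor(4.0, requires_grad=True)
focal = torch.tensor(1.0, requires_grad=True)
opt = torch.optim.Adam([azim, elev, dist, focal], lr=0.05)

for step in range(100):
    opt.zero_grad()
    R, t = camera_from_angles(azim, elev, dist)
    num = soft_render(points, relevance, R, t, focal).sum()
    den = soft_render(points, torch.ones_like(relevance), R, t, focal).sum()
    # Placeholder objective: maximize the fraction of rendered mass coming
    # from instruction-relevant points. TVDRNet instead derives this
    # supervision from the text instruction itself.
    loss = -num / den
    loss.backward()
    opt.step()

Because the renderer is differentiable end to end, the loss gradient reaches the viewpoint parameters directly, which is the property that lets a text-driven objective steer 'where to look'.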