3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key paradigm for unlocking complex reasoning in Large Language Models (LLMs), yet its potential in 3D scene understanding remains largely untapped. To bridge this gap, we present Reinforcement Fine-Tuning for Video-based 3D Scene Understanding (3D-RFT), the first framework to extend RLVR to 3D perception and reasoning. Our pipeline operates in two stages: activating 3D-aware Multi-modal Large Language Models (MLLMs) via Supervised Fine-Tuning (SFT), followed by reinforcement fine-tuning using Group Relative Policy Optimization (GRPO) with strictly verifiable reward functions. We design task-specific rewards, such as 3D IoU and F1-score, to provide deterministic signals for spatial alignment. Extensive experiments demonstrate that 3D-RFT achieves state-of-the-art performance on video-based 3D scene understanding benchmarks, significantly outperforming VG LLM-8B on detection and grounding tasks. Moreover, our model surpasses larger mainstream models on VSI-Bench, demonstrating the efficiency of verifiable reinforcement learning. We conclude with practical insights into effective training strategies for RLVR in 3D scene understanding.
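The abstract does not specify how the verifiable 3D IoU reward is computed. As a minimal sketch, assuming axis-aligned boxes parameterized as (cx, cy, cz, w, h, d), a deterministic reward of this kind could look like:

```python
def box3d_iou(a, b):
    """Axis-aligned 3D IoU between boxes given as (cx, cy, cz, w, h, d).

    This is an illustrative sketch, not the paper's exact reward: the
    parameterization and handling of rotated boxes are assumptions.
    """
    def overlap(c1, s1, c2, s2):
        # 1D overlap length between two intervals centered at c with size s.
        lo = max(c1 - s1 / 2, c2 - s2 / 2)
        hi = min(c1 + s1 / 2, c2 + s2 / 2)
        return max(0.0, hi - lo)

    inter = (overlap(a[0], a[3], b[0], b[3])
             * overlap(a[1], a[4], b[1], b[4])
             * overlap(a[2], a[5], b[2], b[5]))
    vol_a = a[3] * a[4] * a[5]
    vol_b = b[3] * b[4] * b[5]
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0
```

Because the reward is a pure function of the predicted and ground-truth boxes, it yields the same value for the same prediction every time, which is what makes it verifiable and suitable as a GRPO reward signal.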