Joint Navigation and Manipulation Planning with 3D Interaction Chains
Abstract
Open-vocabulary mobile manipulation (OVMM) requires both long-horizon navigation in unseen environments and object-centric manipulation. Most existing methods treat navigation and manipulation as separate stages, which can yield either navigation endpoints that are poorly suited to manipulation or manipulation-friendly poses that are globally inefficient. We address this mismatch with 3D Interaction Chains (3D-IC), a unified framework that couples multi-stage navigation and manipulation planning. 3D-IC maintains a shared 3D feature map for both skills, generates stage-aligned interaction waypoints, and links them into candidate multi-stage chains. A hierarchical policy then scores these chains by jointly considering feasibility (via VLM reasoning over waypoint-centric 3D features) and transition cost, and selects the chain that offers the best trade-off between success and path efficiency. The robot executes the next waypoint of the selected chain and replans as new observations arrive. Experiments in simulation and on a real Stretch 3 robot demonstrate consistent gains in both task success and trajectory efficiency.
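To make the chain-selection step concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. The `Waypoint` class, the product-of-feasibilities aggregation, the linear cost penalty, and the `cost_weight` parameter are all hypothetical. Each waypoint's `feasibility` value stands in for the VLM's judgment, and transition cost is approximated by straight-line distance rather than a planned path.

```python
from dataclasses import dataclass
from itertools import product
from math import dist


@dataclass(frozen=True)
class Waypoint:
    """Hypothetical interaction waypoint for one task stage."""
    name: str
    position: tuple[float, float, float]
    feasibility: float  # assumed in [0, 1]; placeholder for a VLM-derived score


def chain_cost(start, chain):
    """Transition cost: summed straight-line distance from start through the chain."""
    points = [start] + [w.position for w in chain]
    return sum(dist(a, b) for a, b in zip(points, points[1:]))


def chain_feasibility(chain):
    """Assumed aggregation: product of the per-waypoint feasibility values."""
    result = 1.0
    for w in chain:
        result *= w.feasibility
    return result


def score_chain(start, chain, cost_weight=0.1):
    """Assumed trade-off: feasibility minus a weighted transition cost."""
    return chain_feasibility(chain) - cost_weight * chain_cost(start, chain)


def select_chain(start, stages, cost_weight=0.1):
    """Enumerate one waypoint per stage and return the highest-scoring chain."""
    return max(product(*stages),
               key=lambda chain: score_chain(start, chain, cost_weight))


# Toy two-stage task (pick, then place) with made-up candidates.
start = (0.0, 0.0, 0.0)
stages = [
    [Waypoint("pick_near", (1.0, 0.0, 0.0), 0.90),
     Waypoint("pick_far", (4.0, 0.0, 0.0), 0.95)],
    [Waypoint("place_near", (2.0, 0.0, 0.0), 0.80),
     Waypoint("place_far", (5.0, 0.0, 0.0), 0.85)],
]

best = select_chain(start, stages)
next_waypoint = best[0]  # execute this waypoint, then replan
print([w.name for w in best])
```

In this toy setup, ignoring cost (`cost_weight=0`) favors the slightly more feasible but distant waypoints, while the default weight favors the nearer chain. In a replanning loop, the robot would move to `next_waypoint`, update the candidate waypoints from new observations, and call `select_chain` again for the remaining stages.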