RVAS: Referring Video Active Exploration and Segmentation
Abstract
Existing referring video object segmentation (RVOS) methods are largely built on passive perception and assume the target is already visible in the observed video, which limits real-world use when queries refer to objects beyond the current view. To address this gap, we introduce Referring Video Active Exploration and Segmentation (RVAS), a new task that requires reasoning about an exploration policy before locating and segmenting the object described by an input referring expression. To support RVAS, we build a large-scale dataset with manually annotated exploration actions and reference reasoning traces, enabling supervised training and evaluation. We benchmark representative RVOS and related video understanding baselines and find that they struggle to perform active target search and incur substantial overhead when coupled with online decision making. Motivated by these challenges, we propose LESA, a baseline framework that introduces a state controller and hierarchical memory for efficient streaming processing and sparse multimodal large language model (MLLM) reasoning. LESA substantially reduces inference cost while maintaining competitive planning quality, and consistently improves segmentation accuracy on the RVAS dataset.