Learning Visually-Grounded Active View Selection for Embodied Question Answering
Abstract
Embodied Question Answering (EQA) pipelines typically combine scene exploration, scene memory, and a final question-answering step on the resulting view. While exploration and memory have received substantial attention, one component remains underdeveloped: refining the agent's viewpoint near the target so that the final observation actually supports answering. We isolate this as a learnable task, Visually-Grounded Active View Selection (VG-AVS), in which the agent iteratively adjusts its viewpoint based on partial visual cues and terminates once the evidence is sufficient. We construct a synthetic dataset with automatically generated paired query–target views, ranging from single-step adjustments to multi-step refinements through complex spatial layouts, and fine-tune pretrained vision-language models (VLMs) with supervised fine-tuning (SFT) followed by RL-based policy optimization. Our approach generalizes to unseen synthetic and real scenes, and plugging the learned module into existing EQA pipelines improves downstream question-answering accuracy.
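To make the iterate-then-terminate structure of the task concrete, the following is a minimal sketch of a view-refinement loop, not the implementation used in this work. All names (`Pose`, `Action`, `refine_view`, `toy_policy`) are hypothetical, the action space (planar forward motion plus a yaw turn) is an assumption, and the hand-written policy reads the target position directly, standing in for a fine-tuned VLM that would instead infer its action from the current image and the question.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Pose:
    """Hypothetical planar agent pose: position plus heading in degrees."""
    x: float
    y: float
    yaw_deg: float


@dataclass(frozen=True)
class Action:
    """Hypothetical action: turn first, then move forward, or stop."""
    forward: float = 0.0
    turn_deg: float = 0.0
    stop: bool = False


def apply_action(pose: Pose, action: Action) -> Pose:
    """Apply the turn, then move forward along the new heading."""
    yaw = (pose.yaw_deg + action.turn_deg) % 360.0
    rad = math.radians(yaw)
    return Pose(pose.x + action.forward * math.cos(rad),
                pose.y + action.forward * math.sin(rad),
                yaw)


def refine_view(pose: Pose,
                policy: Callable[[Pose], Action],
                max_steps: int = 8) -> Tuple[Pose, List[Pose]]:
    """Query the policy repeatedly until it stops or the step budget runs out."""
    trajectory = [pose]
    for _ in range(max_steps):
        action = policy(pose)
        if action.stop:  # the policy judges the current view sufficient
            break
        pose = apply_action(pose, action)
        trajectory.append(pose)
    return pose, trajectory


def toy_policy(pose: Pose, target=(3.0, 0.0), view_dist=1.0) -> Action:
    """Stand-in policy (assumption): face the target, approach it, then stop."""
    dx, dy = target[0] - pose.x, target[1] - pose.y
    bearing = math.degrees(math.atan2(dy, dx))
    # Signed heading error wrapped into [-180, 180).
    error = ((bearing - pose.yaw_deg + 180.0) % 360.0) - 180.0
    if abs(error) > 1.0:
        return Action(turn_deg=error)
    dist = math.hypot(dx, dy)
    if dist > view_dist:
        return Action(forward=min(0.5, dist - view_dist))
    return Action(stop=True)


final_pose, trajectory = refine_view(Pose(0.0, 0.0, 90.0), toy_policy)
```

In this toy run the agent turns once to face the target, takes four forward steps of 0.5, and then stops at the assumed viewing distance. The step budget `max_steps` guards against a policy that never emits a stop action.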