Beyond Single-View Indexing: Structure-Aware Multi-View Retrieval for Knowledge-Based VQA
Abstract
Knowledge-Based Visual Question Answering (KB-VQA) relies on retrieval from large-scale knowledge bases, yet this retrieval stage is often treated simplistically. Existing methods typically adopt single-view indexing or naive multi-view fusion, leading to systematic coverage gaps. In this work, we show that different indexing views exhibit strong complementarity in retrieval. Motivated by this observation, we propose SCAR, a Structure-aware Cross-view Retrieval framework that exploits cross-view structural complementarity at inference time without additional training. SCAR enhances retrieval through structure-aware similarity propagation within each view and explicit cross-view redundancy regulation. Experiments on multiple KB-VQA benchmarks demonstrate that SCAR substantially improves retrieval recall, approaches the retrieval coverage upper bound, and consistently boosts end-to-end KB-VQA performance with negligible inference overhead.