EntRAG: Entity-Centric Retrieval-Augmented Generation for Knowledge-based Visual Question Answering
Abstract
Knowledge-based Visual Question Answering (KB-VQA) remains a challenging task, particularly when queries require precise identification and grounding of fine-grained entities within a large-scale knowledge base. Existing methods often treat visual and textual signals in isolation and rely heavily on image-centric retrieval, which makes them sensitive to visual ambiguities. To address these limitations, we propose EntRAG, an entity-centric retrieval-augmented generation framework. Our approach first introduces EntBind, which aligns query representations with multimodal entity embeddings by explicitly binding entity tokens to latent visual features and retrieves a set of relevant candidate entities. A reranking mechanism then selects the most informative context from these candidates by combining entity-level alignment with overall contextual relevance. The selected evidence is incorporated into a context-aware generation module to produce the final answer. By operating explicitly at the entity level, EntRAG achieves more consistent and reliable results. Extensive experiments demonstrate that EntRAG consistently outperforms prior methods, achieving scores of 45.2 on E-VQA and 43.8 on InfoSeek.