Towards Whole-corpus Reconstruction of Heterogeneous RAG Knowledge Bases
Abstract
Retrieval-Augmented Generation (RAG) systems are increasingly deployed to provide query-based access to large knowledge bases, which introduces a concrete privacy risk: the underlying corpus may be partially or fully extracted through the deployed service. Existing extraction attacks typically rely on locally driven search strategies, in which newly extracted content is inferred or expanded from previously recovered fragments. However, real-world knowledge bases are often multi-source and heterogeneous, with pronounced semantic discontinuities across domains. Extraction methods that rely on local semantic continuity can become trapped in local optima at these gaps, which severely limits large-scale corpus reconstruction. In this paper, we introduce GeoEx, an extraction framework designed to navigate and reconstruct heterogeneous RAG knowledge bases without any prior knowledge. The framework plans extraction directly in the embedding space of a proxy retrieval model to improve global coverage, and employs an embedding inversion module to convert latent vectors into executable queries. We further propose a composite geometric strategy that combines orthogonal query synthesis for cross-domain exploration with local embedding perturbations for dense extraction within discovered clusters. Experiments on mixed corpora spanning eight diverse domains, with multiple retrievers and LLMs, show that GeoEx significantly outperforms baselines in both extraction coverage and query efficiency.
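To make the composite geometric strategy described above more concrete, the following is a minimal sketch of its two geometric ingredients and not the authors' implementation. It assumes unit-normalized embedding vectors, uses Gram-Schmidt projection as one plausible way to obtain a direction orthogonal to those already explored, and uses Gaussian noise as one plausible form of local perturbation. The function names (`orthogonal_target`, `local_perturbations`) and the parameters `dim`, `n`, and `sigma` are hypothetical. The embedding inversion step that would turn each vector into a text query is omitted.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [a / norm for a in v]


def project_out(w, basis):
    # Remove from w its components along each orthonormal basis vector.
    for b in basis:
        c = dot(w, b)
        w = [wi - c * bi for wi, bi in zip(w, b)]
    return w


def orthogonal_target(explored, dim, rng):
    """Sample a unit vector orthogonal to every explored direction (assumed approach)."""
    basis = []
    for v in explored:
        w = project_out(list(v), basis)
        if math.sqrt(dot(w, w)) > 1e-9:  # skip near-dependent directions
            basis.append(normalize(w))
    if len(basis) >= dim:
        raise ValueError("explored directions already span the space")
    while True:
        w = project_out([rng.gauss(0.0, 1.0) for _ in range(dim)], basis)
        if math.sqrt(dot(w, w)) > 1e-9:
            return normalize(w)


def local_perturbations(center, n, sigma, rng):
    """Return n unit vectors near center: Gaussian noise, then renormalization."""
    return [normalize([c + rng.gauss(0.0, sigma) for c in center]) for _ in range(n)]


# Hypothetical usage: three explored directions in an 8-dimensional toy space.
rng = random.Random(0)
explored = [normalize([rng.gauss(0.0, 1.0) for _ in range(8)]) for _ in range(3)]
target = orthogonal_target(explored, 8, rng)            # cross-domain exploration
neighbors = local_perturbations(target, 5, 0.05, rng)   # dense local extraction
```

In this sketch, `target` plays the role of a new exploration direction far from previously covered regions, while `neighbors` are nearby vectors that would each be inverted into a query to extract densely around a discovered cluster.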