Efficient, Property-Aligned Fan-Out Retrieval via RL-Amortized Diffusion
Abstract
Many modern retrieval problems are \emph{set-valued}: given a broad intent, the system must return a \emph{collection} of results that optimizes higher-order properties (e.g., diversity, coverage, complementarity, coherence) while staying grounded in a fixed database. Such set-level objectives are inherently non-decomposable and are not captured by existing supervised (query, content) datasets, which prioritize only top-1 retrieval. While reinforcement learning (RL) can optimize set-level objectives through interaction, deploying an RL-tuned LLM for fan-out retrieval is prohibitively expensive at query time. Conversely, diffusion-based generative retrieval enables efficient single-pass fan-out in embedding space, but requires objective-aligned training targets. To address both issues, we propose R4T (Retrieve-for-Train), which uses RL \emph{once} as an objective transducer in a three-step process: (i) train a fan-out LLM with composite set-level rewards, (ii) synthesize objective-consistent (query, result-set) training pairs, and (iii) train a lightweight diffusion retriever to model the conditional distribution of set-valued outputs. Across Polyvore and a music playlist dataset, R4T improves retrieval quality over strong baselines while reducing query-time fan-out latency by an order of magnitude.
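To make step (iii) concrete, the sketch below trains a toy epsilon-prediction diffusion model to denoise result-set embeddings conditioned on a query embedding. This is a minimal sketch under stated assumptions, not the paper's implementation: the module name (SetDenoiser), embedding dimensions, DDPM-style linear noise schedule, and the random tensors standing in for the pairs synthesized in step (ii) are all illustrative.

\begin{verbatim}
# Hypothetical sketch of R4T step (iii): a lightweight conditional
# diffusion retriever trained on (query, result-set) embedding pairs.
# All names, dims, and the data source are illustrative assumptions.
import torch
import torch.nn as nn

D_Q, D_SET, T = 128, 256, 100  # assumed embedding dims / diffusion steps

class SetDenoiser(nn.Module):
    """Predicts the noise added to a set embedding, given the query."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(D_SET + D_Q + 1, 512), nn.SiLU(),
            nn.Linear(512, 512), nn.SiLU(),
            nn.Linear(512, D_SET),
        )
    def forward(self, x_t, q, t):
        # Timestep is normalized to [0, 1] and appended as a scalar.
        return self.net(torch.cat([x_t, q, t[:, None].float() / T], dim=-1))

# DDPM-style linear noise schedule.
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(model, q, x0):
    """Epsilon-prediction objective on objective-consistent pairs."""
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    ab = alpha_bar[t][:, None]
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # forward noising
    return nn.functional.mse_loss(model(x_t, q, t), eps)

model = SetDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# Stand-in for pairs synthesized by the RL-tuned fan-out LLM (step ii);
# random tensors are used here purely so the loop runs end to end.
for step in range(200):
    q = torch.randn(64, D_Q)      # query embeddings
    x0 = torch.randn(64, D_SET)   # pooled result-set embeddings
    loss = diffusion_loss(model, q, x0)
    opt.zero_grad(); loss.backward(); opt.step()
\end{verbatim}

At query time, only the reverse denoising pass would run, which is what enables the single-pass fan-out described above; a nearest-neighbor lookup against the fixed database would then ground the sampled set embedding in actual catalog items.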