Grounded in Reality: Learning and Deploying Proactive LLMs from Offline Logs
Abstract
Large language models (LLMs) are strong passive responders, but learning to proactively elicit information, i.e., asking the right questions and stopping at the right time, remains difficult. Existing approaches, such as optimizing turn-level attributes or relying on user simulators to generate training trajectories, often suffer from a persistent reality gap. We propose \texttt{Learn-to-Ask}, a simulator-free framework that learns proactive questioning policies directly from offline expert conversations. Our key insight is to leverage the \textbf{observed future} of each expert trajectory to derive dense, turn-level rewards that reflect the expert's long-horizon strategy, reducing policy learning to a sequence of supervised learning tasks that jointly teach the LLM \textbf{what to ask} and \textbf{when to stop}. To ensure that the LLM-generated components of this pipeline meet quality expectations, in particular reward fidelity and sampling quality, we further introduce an automated procedure that calibrates the underlying prompts with minimal human supervision. Across multiple datasets and model scales, \texttt{Learn-to-Ask} consistently improves proactive information-seeking behavior. We also report a large-scale real-world deployment in which the trained agent surpasses an internal expert baseline under professional audit, demonstrating both the effectiveness of our framework and the validity of our rewards as a reality-validated proxy metric for LLM proactivity.