Learning to Ask May Be Better Than Learning to Answer
Abstract
Prompt optimization is increasingly important for compound LLM systems, where instructions shape not only final answers but also multi-step tool use, retrieval decisions, and stopping behavior. We study prompt-side reinforcement learning: instead of training the downstream solver, we use GRPO to train a policy that generates task-conditioned prompts for a frozen ReAct agent. Across HoVer, HotpotQA, and GSM8K, prompt-side GRPO is most effective on retrieval-intensive tasks. It improves top-5 supporting-title recall on HoVer by 7.33 points and answer F1 on HotpotQA by 10.98 points, but provides little benefit for direct mathematical reasoning. Prompt-side GRPO also scales more favorably with rollout budget than reflective search-based optimization, and it can generate structurally distinct retrieval strategies for different question types without explicit supervision. These results suggest that learning to ask is a viable and sometimes preferable alternative to learning to answer in retrieval-heavy compound LLM systems, and that rollout budget allocation should be treated as an explicit design choice.
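To make the prompt-side setup concrete, the sketch below shows one way a reward could be turned into a GRPO learning signal. Several prompts are sampled for the same question, the frozen agent is run under each prompt, each run is scored, and rewards are standardized within that group to give the group-relative advantages GRPO uses. This is a minimal illustration under stated assumptions, not the authors' implementation. The use of top-5 title recall as the reward, the helper names `top_k_title_recall` and `group_relative_advantages`, and the stubbed agent outputs are all hypothetical. The policy-gradient update itself (clipped ratio, KL term) is omitted.

```python
import statistics
from typing import List, Sequence


def top_k_title_recall(retrieved: Sequence[str], gold: Sequence[str], k: int = 5) -> float:
    """Hypothetical reward: fraction of gold supporting titles among the first k retrieved."""
    gold_set = set(gold)
    if not gold_set:
        return 0.0
    hits = gold_set.intersection(retrieved[:k])
    return len(hits) / len(gold_set)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of prompts sampled for the same question.

    Uses the population standard deviation; eps avoids division by zero when
    every prompt in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group: four candidate prompts sampled for one claim. The titles
# the frozen agent retrieved under each prompt are stubbed, not real outputs.
gold = ["Title A", "Title B"]
runs = [
    ["Title A", "X", "Y", "Z", "W"],
    ["Title A", "Title B", "X", "Y", "Z"],
    ["X", "Y", "Z", "W", "V"],
    ["X", "Title B", "Y", "Z", "W"],
]
rewards = [top_k_title_recall(r, gold) for r in runs]  # → [0.5, 1.0, 0.0, 0.5]
advantages = group_relative_advantages(rewards)  # ≈ [0.0, 1.41, -1.41, 0.0]
print(rewards, advantages)
```

Because advantages are computed relative to the group rather than from a learned value function, the prompt policy is pushed toward prompts that outperform their siblings on the same question. The frozen agent is never updated, which matches the "learning to ask" framing.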