Learning Prompt Chains for Frozen LLMs: Inter-Call Orchestration Beyond Single-Call Reasoning
Abstract
Reinforcement learning has made Chain-of-Thought prompting an effective form of test-time computation, but it remains unclear whether multi-call prompting strategies can be learned in the same way. We study prompt chaining as a trainable orchestration policy for frozen large language models (LLMs). We introduce Chainer, a lightweight model that maps each user request to an open-loop sequence of self-contained sub-prompts executed by a frozen LLM. Chainer is first initialized by supervised fine-tuning and then optimized with GRPO using end-task rewards, chain-quality signals, validity bonuses, and a dynamic step cost. Across six benchmarks, learned chaining improves a non-reasoning executor on four tasks. With reasoning enabled, it remains complementary on five, improving Arena-Hard (+17.3), AIME (+10.0), LiveCodeBench (+8.2), GPQA (+2.0), and AA-Omniscience (-2.9; lower is better), while degrading IFBench. Experiments show that these gains are not redundant with longer single-call reasoning: they persist when token budgets are matched. Because executor weights are fixed, the improvements arise solely from optimizing the prompt trajectory. These results position learned prompt chaining as a distinct axis for scaling frozen LLM systems.
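To make the pipeline described above concrete, the following is a minimal sketch, not the authors' implementation. It assumes that an open-loop chain is a list of sub-prompts fixed before any executor call, that each step sees earlier outputs as plain-text context, and that the reward is a weighted sum of the four signals named in the abstract. The function names (`run_chain`, `chain_reward`, `group_relative_advantages`) and all weights are hypothetical; the advantage normalization follows the usual group-relative form associated with GRPO.

```python
from statistics import mean, pstdev
from typing import Callable, List


def run_chain(sub_prompts: List[str], executor: Callable[[str], str]) -> str:
    """Execute a fixed (open-loop) list of sub-prompts with a frozen executor.

    The plan is never revised after it is emitted. Passing earlier outputs
    forward as context is an assumption of this sketch.
    """
    outputs: List[str] = []
    for prompt in sub_prompts:
        context = "\n\n".join(outputs)
        full_prompt = f"{context}\n\n{prompt}" if context else prompt
        outputs.append(executor(full_prompt))
    # The last step's output is treated as the final answer.
    return outputs[-1] if outputs else ""


def chain_reward(
    task_score: float,
    quality: float,
    is_valid: bool,
    num_steps: int,
    step_cost: float = 0.05,       # hypothetical; could vary during training
    quality_weight: float = 0.2,   # hypothetical weight
    validity_bonus: float = 0.1,   # hypothetical bonus
) -> float:
    """Combine end-task reward, chain quality, validity bonus, and step cost."""
    bonus = validity_bonus if is_valid else 0.0
    return task_score + quality_weight * quality + bonus - step_cost * num_steps


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of chains sampled for the same request."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In this sketch, only the model that produces `sub_prompts` would receive gradient updates; `executor` stands in for the frozen LLM and is treated as a black box.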