On-Policy Self-Distillation via Prompt Optimization
Abstract
Reinforcement learning (RL) has become the dominant paradigm for post-training language models (LMs), yet its scalar reward collapses rich environment feedback into a single number. On-policy self-distillation (OPSD) recovers some of this signal by training a student to match a stronger self-teacher that is given privileged information. Separately, parameter-free prompt optimization methods such as GEPA excel in domains that offer rich feedback and use compound LM programs. In this paper, we bridge these two lines of work by introducing Self-Distillation via Prompt Optimization (SDPRO), a framework that jointly optimizes prompts and parameters. SDPRO leverages environment feedback through an iterative two-phase cycle: GEPA first discovers generalizable prompts, and OPSD then internalizes those gains into the model weights. SDPRO not only outperforms GEPA alone but also extends OPSD to compound LM programs, which prior OPSD formulations could not readily handle. On HotpotQA with Qwen3-8B, three SDPRO cycles outperform GEPA by +4.7 points and GRPO by +22.5 points under the same compute budget, with no degradation, and even gains, on the held-out HoVer and IFBench benchmarks, which use distinct compound programs. We present these results as a proof of concept that integrating prompt and parameter optimization is a promising route to learning from rich natural-language environment feedback.