MOPD: Multi-Teacher On-Policy Distillation for Capability Integration in LLM Post-Training
Abstract
Modern large language models (LLMs) rely on reinforcement learning (RL) during post-training to strengthen specific capabilities, yet integrating multiple capabilities into a single model remains hard. Existing methods, such as RL-then-Distill and Mix-RL, are either inefficient or lose performance. In this work, we propose Multi-teacher On-Policy Distillation (MOPD), a post-training paradigm for combining the capabilities of multiple domain RL teachers. We first run specialised RL in each domain to obtain a set of domain teachers; we then use these teachers to provide dense feedback on the student's own rollouts, and this feedback drives the student's updates. Because the student is trained on its own rollouts, this eliminates exposure bias, and the teachers supply a dense, model-derived feedback channel for capability integration. On Qwen3-30B-A3B (Math / IF / SWE), MOPD outperforms the Mix-RL, Cascade RL, RL-then-Distill, and Param-Merge baselines, inheriting nearly all of each teacher's capability. We further validate MOPD on the industrial-scale MiMo-V2-Flash, where it matches or exceeds the corresponding teacher on most of the twelve benchmarks. MOPD also allows domain teachers to be developed in parallel and independently, removing the cross-domain coupling typical of multi-domain post-training. This makes it of high practical value for capability integration in frontier-scale industrial LLMs.
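
The two-stage recipe described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not state the loss, so the per-token reverse KL used here is an assumption borrowed from common on-policy distillation setups, and every name (`on_policy_distill_losses`, `toy_student`, `toy_teachers`, the three-token vocabulary, routing by a domain label) is hypothetical. The sketch only shows the data flow: the student samples its own rollout, the teacher for the prompt's domain scores every position of that rollout, and the result is one feedback value per token.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) at one token position (assumed loss, not from the paper)."""
    return sum(p * math.log(p / q)
               for p, q in zip(student_probs, teacher_probs) if p > 0)


def sample_rollout(student_logits_fn, length, rng):
    """Sample tokens from the student itself, so training data is on-policy."""
    tokens = []
    for _ in range(length):
        probs = softmax(student_logits_fn(tokens))
        tokens.append(rng.choices(range(len(probs)), weights=probs)[0])
    return tokens


def on_policy_distill_losses(domain, student_logits_fn, teachers, length=4, seed=0):
    """Return the student's rollout and one teacher-derived loss per token."""
    rng = random.Random(seed)
    teacher_logits_fn = teachers[domain]  # hypothetical routing by domain label
    tokens = sample_rollout(student_logits_fn, length, rng)
    losses = []
    for t in range(length):
        prefix = tokens[:t]
        s = softmax(student_logits_fn(prefix))
        q = softmax(teacher_logits_fn(prefix))
        losses.append(reverse_kl(s, q))  # dense signal: one value per position
    return tokens, losses


# Toy stand-ins for real models: each maps a token prefix to next-token logits.
def toy_student(prefix):
    return [0.0, 0.0, 0.0]


toy_teachers = {
    "math": lambda prefix: [2.0, 0.0, 0.0],  # disagrees with the student
    "code": lambda prefix: [0.0, 0.0, 0.0],  # agrees with the student
}

if __name__ == "__main__":
    rollout, per_token = on_policy_distill_losses("math", toy_student, toy_teachers)
    print(rollout, per_token)
```

In a real system the per-token values would be backpropagated through the student's logits, and the teachers would be the RL-trained domain models rather than fixed toy functions. Because each teacher is only queried on prompts from its own domain, the teachers can be trained separately, which is the decoupling the abstract highlights.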