OnePO: Direct One-stage Policy Optimization for SFT-free Domain Adaptation
Abstract
Domain adaptation transforms general-purpose LLMs into specialized experts for specific domains or tasks. This process typically follows a two-stage recipe: first, Supervised Fine-Tuning (SFT) to inject domain knowledge or induce specific behaviors (e.g., reasoning patterns), and then Reinforcement Learning (RL) for self-improvement. However, does RL truly require pre-SFT as a cold-start phase? We argue that pre-SFT is inherently problematic: (1) it indiscriminately reinforces knowledge and behaviors from references regardless of whether the LLM has already acquired them, leading to distribution contraction that constrains subsequent exploration; (2) it introduces substantial overhead in multi-stage training and data curation. Although our pilot studies reveal that, without pre-SFT, RL struggles to acquire off-policy knowledge from scratch, we bridge this gap with One-stage Policy Optimization (OnePO). OnePO is an SFT-free paradigm that enables LLMs to selectively internalize off-policy knowledge and behaviors directly during RL evolution. Crucially, we design an Adaptive Objective Evolution mechanism for rapid knowledge injection and a Teacher Retirement mechanism that prevents off-policy anchoring. Experiments demonstrate that OnePO successfully transforms the Qwen3-8B-Base model into a high-performance medical LLM in a single RL stage, achieving competitive performance on HealthBench (67.2) and other benchmarks using only 20K samples. This highlights that SFT-free RL can efficiently cultivate domain experts without traditional multi-stage pipelines.