Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning
Dylan Zhang ⋅ Yufeng Xu ⋅ Haojin Wang ⋅ Qingzhi Chen ⋅ Hao Peng
Abstract
Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform those initialized from weaker ones. We propose PEAR ($\textbf{P}$olicy $\textbf{E}$valuation–inspired $\textbf{A}$lgorithm for Offline Learning Loss $\textbf{R}$eweighting), an SFT-stage method that corrects this mismatch and better prepares the model for RL. PEAR uses importance sampling to reweight the SFT loss, with three variants operating at the token, block, and sequence levels. It can be used to augment standard SFT objectives and incurs little additional training overhead once probabilities for the offline data are collected. We conduct controlled experiments on verifiable reasoning games and mathematical reasoning tasks with Qwen2.5/3 and DeepSeek-distilled models. PEAR consistently improves post-RL performance over canonical SFT, with pass@8 gains of up to 14.6% on AIME-2025. Our results suggest that PEAR is an effective step toward more holistic LLM post-training by designing and evaluating SFT with downstream RL in mind rather than in isolation.
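To make the core idea concrete, below is a minimal, hypothetical sketch of the token-level variant: an SFT negative log-likelihood reweighted by importance ratios between the current policy and the (pre-collected) probabilities of the offline data. The function name, the ratio clipping, and how the behavior log-probabilities are obtained are assumptions for illustration, not the paper's exact implementation.

```python
import torch


def token_level_reweighted_sft_loss(policy_logits, target_ids, behavior_logprobs,
                                    clip_range=(0.1, 10.0)):
    """Illustrative token-level importance-reweighted SFT loss (hypothetical sketch).

    policy_logits:     (batch, seq_len, vocab) logits from the model being trained.
    target_ids:        (batch, seq_len) token ids of the offline SFT data.
    behavior_logprobs: (batch, seq_len) log-probabilities of target_ids under the
                       distribution assumed to have produced the offline data,
                       collected once before training.
    clip_range:        bounds on the importance ratio (an assumed stabilizer).
    """
    # Per-token log-probability of the target tokens under the current policy.
    logprobs = torch.log_softmax(policy_logits, dim=-1)
    token_logprobs = logprobs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)

    # Importance ratio pi_theta(y_t | context) / pi_behavior(y_t | context),
    # detached so it only reweights the loss rather than adding a gradient path.
    ratio = torch.exp(token_logprobs.detach() - behavior_logprobs)
    ratio = ratio.clamp(*clip_range)

    # Standard SFT negative log-likelihood, reweighted token by token.
    nll = -token_logprobs
    return (ratio * nll).mean()
```

Block- and sequence-level variants would aggregate the log-probability gap over spans or whole responses before forming the ratio, but otherwise follow the same pattern.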