Clipping Low-Probability Tokens in SFT Yields a Generalizable Initialization for RL
Abstract
Supervised Fine-Tuning (SFT) is a critical step for adapting Large Language Models (LLMs) to specialized domains, often serving as the initialization for subsequent reinforcement learning (RL). However, SFT can overfit a small set of expert data, harming generalization and eroding prior knowledge. This in turn limits downstream RL, which benefits from a strong, generalizable initialization for exploration. Here, we demonstrate that the degradation of prior knowledge primarily results from tokens in the expert data to which the base model assigns low probability. These low-probability tokens represent a sharp deviation from the model’s prior knowledge, and because the per-token gradient of the log-likelihood objective scales inversely with the predicted probability, they produce the largest gradient magnitudes, which speed up adaptation to the new data but degrade generalization. In this paper, we study token-wise clipping, a commonly used trust-region strategy for bounding per-token updates. We find that it reshapes token-level learning priorities, promoting a more gradual adaptation that fits the new data while preserving general abilities. Compared with standard SFT, clipping low-probability tokens reduces out-of-distribution forgetting by 11.54% and improves final RL performance by 7.09% across agentic benchmarks. Moreover, latent-space analysis shows smaller representational drift under clipping, indicating that it provides a generalizable initialization.
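To make the mechanism concrete: since the gradient of the per-token loss -log p scales with 1/p, one natural way to bound per-token updates is to cap that implicit weight at 1/delta by scaling the loss with a detached factor min(1, p/delta). The sketch below illustrates this idea in PyTorch; the abstract does not specify the paper’s exact formulation, so the function `clipped_sft_loss` and the threshold `delta` are illustrative assumptions, not the authors’ recipe.

```python
import torch
import torch.nn.functional as F

def clipped_sft_loss(logits: torch.Tensor,
                     targets: torch.Tensor,
                     delta: float = 0.1) -> torch.Tensor:
    """Token-wise clipped SFT loss (illustrative sketch, not the paper's exact method).

    The gradient of -log p with respect to p is -1/p, so expert tokens the model
    assigns low probability receive the largest per-token updates. Scaling the loss
    by the detached factor min(1, p / delta) caps that implicit weight at 1/delta,
    leaving high-probability tokens untouched.

    logits:  (B, T, V) model logits
    targets: (B, T)    expert token ids
    """
    # Log-probability of each expert token under the current model.
    logp = F.log_softmax(logits, dim=-1) \
            .gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Detach the probability so the clip factor acts as a constant weight.
    p = logp.detach().exp()

    # scale < 1 only for low-probability tokens (p < delta); for p < delta the
    # effective gradient weight becomes 1/delta instead of 1/p.
    scale = torch.clamp(p / delta, max=1.0)

    return -(scale * logp).mean()
```

As delta approaches 0 the clip never activates and standard SFT is recovered; larger delta values bound the updates for a wider band of low-probability tokens, trading adaptation speed for preservation of prior knowledge.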