Demystifying Entropy Control in LLM RL Training: Theoretical Analysis and Dynamic Scheduling
Abstract
This paper investigates a pivotal yet debated component of reinforcement learning (RL) for training large language models (LLMs): whether policy entropy should be increased or decreased during RL fine-tuning. The existing literature presents a dichotomy: some studies posit that increasing entropy facilitates exploration, whereas others argue that decreasing entropy enhances performance. To reconcile these conflicting observations, we provide a theoretical framework showing that the effect of entropy control is governed by \emph{Entropy Discrepancy}, the distributional divergence between positive and negative samples. Guided by this insight, we derive a principled dynamic scheduling method that adaptively modulates the entropy coefficient, effectively switching between entropy maximization and minimization as training evolves. Extensive experiments confirm the correlation between Entropy Discrepancy and the efficacy of entropy control. Furthermore, our adaptive method yields substantial improvements, boosting Pass@K by 6.7\% on AIME24 and 17.52\% on puzzle tasks compared to vanilla RL, while consistently outperforming recent state-of-the-art reasoning methods.
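To make the scheduling idea concrete, the sketch below shows one plausible way such an adaptive coefficient could be wired into an RL fine-tuning loop: estimate a per-batch entropy gap between positively and negatively rewarded rollouts, then map it to a signed entropy coefficient. This is a minimal illustration under our own assumptions, not the paper's exact formulation; the function names (`entropy_discrepancy`, `entropy_coefficient`), the mean-token-entropy proxy for the distributional divergence, the tanh squashing, and the sign convention are all illustrative choices.

```python
import torch

def mean_token_entropy(logprobs_list):
    """Average token-level entropy over a batch of sequences.
    Each element is a [seq_len, vocab_size] tensor of log-probabilities."""
    ents = [-(lp.exp() * lp).sum(dim=-1).mean() for lp in logprobs_list]
    return torch.stack(ents).mean()

def entropy_discrepancy(pos_logprobs, neg_logprobs):
    """Proxy for Entropy Discrepancy (assumed form): the gap between the
    mean entropies of positively and negatively rewarded samples."""
    return mean_token_entropy(pos_logprobs) - mean_token_entropy(neg_logprobs)

def entropy_coefficient(discrepancy, scale=1e-2):
    """Map the discrepancy to a signed, bounded entropy coefficient.
    A positive coefficient acts as an entropy bonus (maximization);
    a negative one acts as a penalty (minimization). The tanh squashing
    and sign convention are illustrative assumptions, not the paper's."""
    return scale * torch.tanh(discrepancy)

# Schematic per-step usage inside the RL loop (policy_loss computed elsewhere):
#   coef = entropy_coefficient(entropy_discrepancy(pos_lp, neg_lp))
#   loss = policy_loss - coef * mean_token_entropy(all_lp)
```

Under this convention, the schedule automatically flips between entropy maximization and minimization as the sign of the measured discrepancy changes over training, which is the behavior the abstract describes.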