One-shot Entropy Minimization for Language Model Reasoning
Abstract
In this work, we propose One-shot Entropy Minimization (EM), a simple and fully unsupervised post-training approach that significantly improves reasoning and generation performance using only a single unlabeled example and approximately ten gradient steps. To avoid data contamination, we pretrain a 7-billion-parameter language model from scratch on strictly decontaminated data. Despite its extreme simplicity, one-shot EM yields substantial performance gains across a broad range of domains, including mathematical reasoning, logical reasoning, and coding. We further show that entropy minimization induces a characteristic right-skewed logit shift, amplifying high-probability tokens while suppressing low-probability tails, in contrast to the shift induced by reinforcement learning. Our findings suggest that entropy minimization acts primarily as a distribution-shaping mechanism rather than a conventional learning process, offering an efficient and practical algorithm for post-training large language models.
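To make the core objective concrete, the following is a minimal sketch of entropy minimization on a single softmax distribution, not the paper's actual training code: it repeatedly takes gradient steps that reduce the entropy of `softmax(logits)`, illustrating how the highest-probability token is amplified while the low-probability tail is suppressed. The learning rate, step count, and initial logits are illustrative assumptions; the closed-form gradient `dH/dz_i = -p_i (log p_i + H)` is the standard derivative of entropy with respect to softmax inputs.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    # Shannon entropy in nats.
    return -sum(p * math.log(p) for p in probs if p > 0)

def em_step(logits, lr=0.5):
    """One gradient-descent step on the entropy of softmax(logits).

    Uses the standard result dH/dz_i = -p_i * (log p_i + H).
    Illustrative only; lr=0.5 is an assumed hyperparameter.
    """
    p = softmax(logits)
    H = entropy(p)
    grads = [-pi * (math.log(pi) + H) for pi in p]
    return [x - lr * g for x, g in zip(logits, grads)]

logits = [2.0, 1.0, 0.5]  # toy per-token logits (assumed values)
for _ in range(10):       # roughly mirrors the ~10 gradient steps in the text
    logits = em_step(logits)
# After these steps, entropy has dropped and the argmax probability has grown,
# matching the "right-skewed" sharpening described above.
```

In a language model, the same objective would be averaged over the per-token output distributions of a sampled response; no labels are needed, which is what makes the procedure fully unsupervised.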