Towards Efficient LLM Annealing with Principled Sample Selection
Abstract
The annealing stage of Large Language Model (LLM) training is a critical phase in which model loss drops sharply and downstream capabilities solidify. Despite its importance, current practice relies on empirical heuristics such as quality filtering or context extension, without a principled understanding of the underlying optimization dynamics. We address this gap by providing a theoretical characterization of the spectral properties targeted during annealing. We demonstrate that effective annealing requires balancing the global Hessian geometry against sample-wise gradient noise while navigating a landscape of highly anisotropic curvature. Based on these insights, we formulate sample selection as a constrained optimization problem that suppresses noise along sharp directions while preserving descent signals in flat subspaces. Our method, solved via Successive Convex Programming (SCP), achieves state-of-the-art results across multiple model scales. Code is available at \url{https://anonymous.4open.science/r/LLM-Annealing-Phase}.